The proposal primarily measures the novelty of the 3,985 pilot pathway images based on unique genes, meaning genes not currently in any WikiPathways pathway. While this is the simplest measure and therefore a good starting point, I think unique edges are a better assessment of novelty than unique genes. For example, a pathway could be composed of genes that are all already present in existing pathways but are not connected to each other in any of them. This distinction is mentioned in the proposal:
Thus, even the 28% of symbols that overlap with current human pathways (Fig. 2, blue) may provide new interaction content.
While the initial OCR implementation does not appear to extract edges, each pathway could be represented as a gene set. Then novelty could be measured as:
whether any other pathways are a superset or subset of the query pathway
the max Jaccard index of the query pathway with all other pathways
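A minimal sketch of the two gene-set measures above, assuming each pathway is available as a set of gene symbols. The function names, pathway IDs, and example gene groupings are hypothetical illustrations, not taken from the proposal or from WikiPathways.

```python
from typing import Dict, Optional, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def gene_set_novelty(query: Set[str], reference: Dict[str, Set[str]]) -> dict:
    """Compare one query gene set against a collection of reference gene sets.

    `reference` maps a pathway ID to its gene symbols (hypothetical layout).
    """
    # Does any existing pathway fully contain the query?
    has_superset = any(query <= genes for genes in reference.values())
    # Does the query fully contain any existing (non-empty) pathway?
    # Empty reference sets are skipped because they are trivially subsets.
    has_subset = any(genes and genes <= query for genes in reference.values())

    # Track the most similar existing pathway by Jaccard index.
    nearest: Optional[str] = None
    max_jaccard = 0.0
    for pathway_id, genes in reference.items():
        score = jaccard(query, genes)
        if score > max_jaccard:
            nearest, max_jaccard = pathway_id, score

    return {
        "has_superset": has_superset,
        "has_subset": has_subset,
        "max_jaccard": max_jaccard,
        "nearest": nearest,
    }


# Made-up pathway IDs and gene groupings, for illustration only.
reference_sets = {
    "PATHWAY_A": {"TP53", "MDM2", "CDKN1A"},
    "PATHWAY_B": {"EGFR", "KRAS"},
}
example = gene_set_novelty({"TP53", "MDM2", "ATM"}, reference_sets)
```

A low maximum Jaccard index with no superset would flag the query as a candidate novel gene set, even when every individual gene already appears in some existing pathway. The measures are only as reliable as the completeness of the extracted gene sets.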
An edge-based conception of novelty will provide greater recall of novel pathways. At the current stage, where plentiful novel information exists, node-based novelty will definitely work. I agree that expanding the number of genes in at least one pathway should be a primary goal but think that your current node-based metrics may undersell the extent of novel pathways.
I'm most excited about the novel edges as well, but this is simply impossible to estimate from the current OCR results on our sample set. Keep in mind that, on average, only one gene is recognized per image. The distribution of recognized genes is heavily skewed, however (Fig. 2). But even if we consider the 1104 pathways where 5 or more genes were recognized, how could we honestly estimate edges? We can't assume that the detected nodes are connected to each other.
Likewise, the other novelty measures you suggest would be great to apply to complete gene sets from a sample set of pathways. Unfortunately, we aren't close to getting complete gene sets from the current OCR results.
Of course, after we actually model a few hundred of these images, we will be able to start estimating novel nodes, novel sets and novel edges more reliably. Between now and the NIH reviews, perhaps we'll try to brute force a set of these to get at these numbers.