How do you plan to handle cases where the same pathway is represented in multiple images? For example, one publication may add a gene to a pathway first put forward by a previous publication. Is merging duplicated pathways outside the scope of this proposal?
I don't think that consolidating duplicates needs to be a focus of this project, but am intrigued by the difficulty of this problem and curious as to your insights. Does WikiPathways currently rely on users to merge duplicates?
Contrary to the strategy of most pathway archives, which strive for a single set of canonical pathways, WikiPathways loves redundancy... because, well, that's how biology actually works :)
For example, instead of 1 Apoptosis pathway, we'd like to see 50+ Apoptosis pathways, depending on cell type, tissue type, conditions, developmental stages, disease states, etc. These might have a lot of redundancy, but the typical pathway analysis methods can already handle that. They typically provide a rank order of overrepresented pathways, for example, and being able to see exactly which Apoptosis version is most like your dataset would be very useful.
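To make the rank-order idea concrete, here is a minimal sketch of an overrepresentation ranking across several variants of one pathway. It assumes a one-sided hypergeometric test, which is a common choice but not necessarily what any particular analysis tool uses. The function names, pathway names, gene memberships and background size are all made up for illustration.

```python
from math import comb

def hypergeom_pvalue(overlap, n_hits, pathway_size, background_size):
    """P(X >= overlap) when n_hits genes are drawn from the background
    and pathway_size of the background genes belong to the pathway."""
    upper = min(n_hits, pathway_size)
    total = comb(background_size, n_hits)
    favourable = sum(
        comb(pathway_size, i) * comb(background_size - pathway_size, n_hits - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / total

def rank_pathways(hits, pathways, background_size):
    """Return (name, overlap, p_value) tuples, most overrepresented first.
    Assumes every hit and every pathway gene is part of the background."""
    hits = set(hits)
    rows = []
    for name, genes in pathways.items():
        overlap = len(hits & genes)
        p = hypergeom_pvalue(overlap, len(hits), len(genes), background_size)
        rows.append((name, overlap, p))
    return sorted(rows, key=lambda row: row[2])

# Toy data: two hypothetical Apoptosis variants that share some genes.
pathways = {
    "Apoptosis (hepatocyte)": {"CASP3", "CASP8", "BAX", "BCL2", "TP53"},
    "Apoptosis (neuron)": {"CASP3", "CASP9", "BAX", "APAF1", "CYCS"},
}
hits = {"CASP3", "CASP9", "APAF1", "CYCS"}
ranked = rank_pathways(hits, pathways, background_size=20000)
for name, overlap, p in ranked:
    print(f"{name}: overlap={overlap}, p={p:.3g}")
```

Both variants contain CASP3, but the ranking still separates them, which is the point: the redundancy costs nothing and the top-ranked variant tells you which context looks most like your data.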
So, this is predicated on having ontology-based tags for the sources of variation. We've started with 3 standard ontologies, but this can expand on demand. And, if two pathways are 100% identical in all ways, well then, yes, we should merge them. This will be rare (hasn't happened yet) and the bulk of the work will be on providing good search, browse and grouping tools in our UI.
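As a rough illustration of what "100% identical in all ways" could mean in practice, the sketch below treats two pathways as merge candidates only when their nodes, edges and context tags all match. The `Pathway` class, its fields and the tag strings are hypothetical placeholders, not the WikiPathways data model or real ontology term IDs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pathway:
    nodes: frozenset         # gene identifiers
    edges: frozenset         # (source, target) pairs
    context_tags: frozenset  # ontology-based tags (placeholder strings here)

def is_exact_duplicate(a, b):
    """True only if nodes, edges and context tags are all identical."""
    return (
        a.nodes == b.nodes
        and a.edges == b.edges
        and a.context_tags == b.context_tags
    )

nodes = frozenset({"BAX", "CASP3"})
edges = frozenset({("BAX", "CASP3")})
liver = Pathway(nodes, edges, frozenset({"tissue:liver"}))
brain = Pathway(nodes, edges, frozenset({"tissue:brain"}))
liver_copy = Pathway(nodes, edges, frozenset({"tissue:liver"}))

print(is_exact_duplicate(liver, brain))       # False: same graph, different context
print(is_exact_duplicate(liver, liver_copy))  # True: candidate for merging
```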
For this project, I left out these details because I'm just focusing on nodes and edges (for simplicity), but you can easily imagine taking the same products from this first round and doing another cycle where the focus is capturing biological context. The issue of redundancy is also addressed by our simple approach of prioritizing (and providing point bonuses for) novel genes. This alone will drive focus to less redundant pathways first... though we still want all variants in the end.
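One way the novel-gene prioritization might look, as a sketch only: score each candidate pathway figure by how many of its genes are not yet covered, pick the highest scorer, update the covered set and repeat. The scoring formula, the point values and the figure and gene names are assumptions for illustration, not the project's actual scheme.

```python
def novelty_score(genes, covered, base_points=1, bonus_per_novel_gene=1):
    """Base points for any pathway plus a bonus for each gene not yet covered.
    The point values are hypothetical."""
    return base_points + bonus_per_novel_gene * len(set(genes) - covered)

def prioritize(candidates, covered):
    """Greedy ordering: repeatedly take the candidate with the highest
    novelty score given what has been covered so far. Ties break by name."""
    covered = set(covered)
    remaining = dict(candidates)
    order = []
    while remaining:
        best = min(
            remaining,
            key=lambda name: (-novelty_score(remaining[name], covered), name),
        )
        order.append((best, novelty_score(remaining[best], covered)))
        covered |= set(remaining.pop(best))
    return order

candidates = {
    "fig_A": {"G1", "G2", "G3"},
    "fig_B": {"G1", "G2"},
    "fig_C": {"G4", "G5", "G6", "G7"},
}
order = prioritize(candidates, covered={"G1"})
print(order)  # [('fig_C', 5), ('fig_A', 3), ('fig_B', 1)]
```

The fully redundant figure (`fig_B`) drops to the end of the queue but is never discarded, which matches the goal of eventually capturing every variant.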