
# Abstract

A wealth of novel pathway information is trapped in published figures. This information, if properly modeled, would be immensely useful for analyzing and interpreting large-scale omics datasets. Aligned with the broader BD2K initiative, this proposal sets out to transform the wealth of information currently embedded in countless figure images (big data) into pathway models amenable to analysis and research (knowledge). Computational approaches alone have failed to fully automate the extraction of this knowledge, which is no surprise given the wide diversity among images. Likewise, human efforts have fallen short, both at the level of internal curation teams (it is a massively distributed problem, after all) and at the level of individual researchers who choose PowerPoint and Illustrator over the freely available pathway modeling standards and tools. This challenge is particularly well suited to a computer-assisted, human-crowdsourced solution. We propose to develop the Pathways4Life platform, which combines image processing, text recognition, and cutting-edge pathway modeling software with scalable infrastructure for content management and tunable game mechanics to facilitate the rapid modeling of pathway images through crowdsourced human tasks.
# RESEARCH STRATEGY
## Significance
Pathway information is immensely useful for analyzing and interpreting large-scale omics data [@10.1038/ng1109 @10.1371/journal.pbio.1000472]. Pathway analysis software for general use was developed to meet the challenge of analyzing and interpreting high-throughput transcriptomics data. After the commercialization of microarrays, pathway analysis tools such as GenMAPP [@10.1038/ng0502-19], Pathway Tools [@10.1093/bioinformatics/18.suppl_1.s225], and Ingenuity (www.qiagen.com/ingenuity) proliferated. These tools ultimately rely on knowledge bases of pathway models. In this grant, we stress the distinction between pathway figures, which are drawn purely for illustration purposes in a graphical file format (e.g., jpg, gif, png), and pathway models, which contain standard identifiers and semantics that can be mapped to external resources in a structured file format (e.g., xml, owl, json). Aligned with the broader BD2K initiative, this proposal sets out to transform the wealth of information currently trapped in countless figure images (big data) into properly modeled pathways amenable to analysis and research (knowledge).
Eight years ago, we conceived of WikiPathways to bring a new approach to the task of collecting, curating, and distributing pathway models [@10.1371/journal.pbio.0060184 @10.1093/nar/gkr1074]. Like other pathway knowledge bases, such as KEGG [@10.1093/nar/28.1.27] and Reactome [@10.1093/nar/gkq1018], WikiPathways manages a wide range of canonical metabolic, regulatory, and signaling pathways ([Figure {n}](#VennHuman)). The pathway creation and editing tools we use to centrally curate content, however, are the same ones we embedded into the MediaWiki platform and make available to anyone at wikipathways.org. Crowdsourcing the tasks involved in curation enables WikiPathways to manage more updates, tap into more diverse domain experts, and serve specialized research communities ([Table {n}](#CurationActivity)). WikiPathways hosts any pathway model that is of interest to any individual researcher.

[:figure](VennHuman)

[:table](CurationActivity)

We are collaborating with Reactome to convert and host their human content for crowdsourced curation and distribution. Thus, despite the yellow portion in [Figure {n}](#VennHuman) (the KEGG-only content that is not programmatically accessible without a license), WikiPathways is an ideal resource from which to launch a new, ambitious crowdsourcing initiative. Despite our efforts and the tremendous work of all pathway knowledge bases over the past decade, most pathway information is still published solely as static, arbitrarily drawn images—isolated, inert representations of knowledge that cannot readily be reused or remixed in future studies.
A survey of ~4000 published pathway figures highlights the challenges we propose to address. A PubMed Central (PMC) image search using the keyword “signaling pathway” generates over 40,000 results. Visual inspection of the first 5000 results, from publications spanning 2000–2015, revealed that 3985 (79.7%) contain a pathway image; the remainder contained only the word “pathway” in their captions. We then performed optical character recognition (OCR) using two parallel approaches: Adobe Acrobat Text Recognition (www.adobe.com) and Google’s Tesseract (code.google.com/p/tesseract-ocr). We cross-referenced the extracted text results against all known HGNC human gene symbols [@10.1093/nar/gku1071], including aliases and prior symbols, to assess the potential of these images to inform the curation of human and orthologous pathways. Acrobat and Tesseract each extracted over 2300 HGNC symbols, of which ~730 (~32%) represented new information: human genes and orthologs not captured in any pathway for any species at WikiPathways. Further, the two approaches, each with its own uniquely trained OCR method, found significantly different sets of symbols, such that the combined results provide greater extraction counts across all categories: 3187 HGNC symbols in total and 1087 (34%) new to WikiPathways ([Figure {n}, green](#HGNC)).

[:figure](HGNC)

Several caveats to this survey suggest that even more new pathway information can be extracted from published images. (1) Another 38% of extracted symbols ([Figure {n}, red](#HGNC)) are found only on nonhuman pathways and thus may still represent novel human pathway content. (2) The OCR methods were used “out-of-the-box,” neither trained on pathway images nor given pre-processed input, so there is a significant opportunity to increase total extraction counts. (3) Since we considered only 5000 of 40,000 results from only a single pathway-related search term, the search result space is much greater. (4) This survey ignored the interactions shown in the images; thus, even the 28% of symbols that overlap with current human pathways ([Figure {n}, blue](#HGNC)) may provide new interaction content. These caveats far outweigh the potential for false positives in this survey. (1) About 5.5% of extracted symbols are dictionary words that might not represent human genes in a pathway (e.g., BIG, CELL, HOOK, MASS). (2) Some occurrences of extracted symbols might be peripheral to the image and not part of the pathway. (3) Each OCR method has an inherent false-positive rate of <5% [@10.1109/icdar.2007.4376991], which is only partially mitigated by the narrow focus on HGNC cross-referenced hits.

It is difficult to estimate the total number of new human gene symbols in the entire corpus of published pathway images. The proportion of new genes must plateau—new results giving diminishing returns—as the total number of known genes in pathways is approached. But for perspective, even if we consider only the first 3985 images, conservatively estimate the average number of genes per pathway image to be 7, and generously estimate the false positive rate to be as high as 20%, then at a proportion of 34% we would expect ~6100 new genes. **This would almost double the count of unique human genes at WikiPathways today—the equivalent of 6 years of work at current crowdsourcing rates.** The effort would also greatly expand the interactions and biological context for practically all the genes in WikiPathways. As detailed in our data sharing plan, all of this new knowledge will be available in multiple formats, including BioPAX and RDF, and distributed to a wide range of independent resources through channels already established by the WikiPathways project, including Pathway Commons [@10.1093/nar/gkq1039], [NCBI](www.ncbi.nlm.nih.gov/biosystems), and Network2Canvas [@10.1093/bioinformatics/btt319], which is a LINCS-BD2K project.

### Aim 1: Collect, Process, and Classify Pathway Images
We implemented an efficient process to maximize collection of pathway images from PMC publication figures as part of the sample analysis described above. The process starts with a query into the PMC image search feature. Results are returned as HTML, which is parsed to download the full-size figure image files and generate an annotation file for each image. The annotation file, containing figure caption, author names, article title, and article hyperlink, will allow us to present critical contextual information during the crowdsourcing stage. It will also be used to index images by author and to support focused keyword searches (e.g., for images relating to particular diseases). This process will scale to encompass a broad range of queries to collect diverse and high-value pathway image file sets. We plan to collect a set of 16,000 pathway images in the first iteration of Aim 1, approximately 4x the size of the sample set described above.
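
To make the collection step concrete, here is a minimal Python sketch under stated assumptions: the search URL, the result-page markup (the `div.rslt` selector), and the use of the image `alt` text as a caption are placeholders to be replaced after inspecting the actual PMC HTML, not a description of the live PMC interface.

```python
"""Sketch of the Aim 1 collection loop: query the PMC image search, download
each figure image, and write one annotation file per image. All selectors
and URL parameters here are illustrative assumptions."""
import json
import pathlib

import requests
from bs4 import BeautifulSoup

SEARCH_URL = "https://www.ncbi.nlm.nih.gov/pmc/?term=%22signaling+pathway%22"  # assumed query form
OUT_DIR = pathlib.Path("pathway_images")
OUT_DIR.mkdir(exist_ok=True)

html = requests.get(SEARCH_URL, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

for i, result in enumerate(soup.select("div.rslt")):  # hypothetical result-block selector
    img, link = result.find("img"), result.find("a")
    if img is None or link is None:
        continue
    # Download the figure image referenced by the result block.
    img_url = requests.compat.urljoin(SEARCH_URL, img["src"])
    (OUT_DIR / f"fig_{i:05d}.jpg").write_bytes(requests.get(img_url, timeout=30).content)
    # Annotation file: caption, article title, and hyperlink for later
    # indexing and for contextual display during crowdsourcing.
    annotation = {
        "caption": img.get("alt", ""),  # assumed caption source
        "article_title": link.get_text(strip=True),
        "article_url": requests.compat.urljoin(SEARCH_URL, link.get("href", "")),
    }
    (OUT_DIR / f"fig_{i:05d}.json").write_text(json.dumps(annotation, indent=2))
```
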
Also as part of the sample analysis, we began to explore text-extraction software. We have already scripted the application of Adobe Acrobat and Tesseract to generate extracted text file sets. These are useful for assessing the results in terms of recognizable gene symbol counts. But these programs also provide positional information for each block of extracted text. We will use this information to generate JSON models of nodes, preserving the annotation and position from the original image. This will allow us to overlay an interactive layer of modeled nodes onto the original pathway figures (see Aim 2). We will also improve the out-of-the-box performance of these programs. Tesseract, in particular, has considerable potential for improvement as an actively developed, open-source effort. To both programs, we will add a common image pre-processing step that uses ImageMagick (www.imagemagick.org) to increase contrast, adjust orientation, and remove noise (e.g., other graphics) from a copy of the image. We will also use OpenCV [@10.1145/2184319.2184337] to identify and optimize regions of text in the image prior to OCR. Other methods are available to isolate, orient, and filter “objects” that may contain recognizable text [@10.1093/bioinformatics/bts018 @10.1093/bioinformatics/btp318 @10.1109/bsec.2011.5872319 @10.1186/2041-1480-5-10]. This is a worthwhile area to explore, considering the Difficulty Matrix on our sample set of 3985 images ([Figure {n}](#DifficultyMatrix)). Fortunately, the Easy-Easy corner of the matrix (top left) contains a large proportion of pathway figure images, which can be targeted in early rounds of our crowdsourcing effort. However, half of the images are in the Easy-Hard quadrant (easy for humans, hard for computers; top right), meaning that any improvement in automatic text extraction will yield more pathways ready for human processing. The lower half of the matrix includes such small percentages that we can simply ignore it for the purposes of this proposal. Of course, difficulty is not really binary, so we can leverage the gradient of difficulty on the human scale, for example, to rank pathways for display to a gradient of participants from novice to expert.

[:figure](DifficultyMatrix)
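
To illustrate the OCR-to-model step described above, here is a minimal sketch using Tesseract via `pytesseract`. The PIL pre-processing stands in for the ImageMagick step, `hgnc_symbols` is assumed to be loaded from the HGNC download, and the confidence cutoff is an arbitrary starting point.

```python
"""Sketch of the OCR-to-model step: run Tesseract over a pre-processed
figure, keep word boxes whose text matches an HGNC symbol, and emit the
JSON node overlay described above."""
import pytesseract
from PIL import Image, ImageOps

def figure_to_nodes(image_path: str, hgnc_symbols: set[str]) -> list[dict]:
    # Simple stand-in for the ImageMagick pre-processing step.
    img = ImageOps.autocontrast(ImageOps.grayscale(Image.open(image_path)))
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    nodes = []
    for i, word in enumerate(data["text"]):
        token = word.strip().upper()
        if token in hgnc_symbols and int(data["conf"][i]) > 60:  # illustrative cutoff
            nodes.append({
                "textContent": token,     # label preserved from the image
                "x": data["left"][i],     # position for the interactive overlay
                "y": data["top"][i],
                "width": data["width"][i],
                "height": data["height"][i],
            })
    return nodes
```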

The independent difficulty levels for humans and computers are just one example of how images will be classified. We will also assess the potential gene content per image based on the automatically extracted text. As we did with the sample set, we will contrast these gene sets with those already captured in properly modeled pathways. Given the set of novel genes, we will classify images based on the proportion and absolute number each pathway represents ([Figure {n}B, green](#HGNC)). This will allow us to define high-value targets for the crowdsourcing effort—those images that will add the most new unique genes at the highest rate to pathway knowledge bases. We will also classify pathway images by the overrepresented GO terms and disease associations in their gene sets. Classification by disease, for example, will allow us to prioritize not only generally but also specifically for crowdsourcing efforts that target a single disease or research area. Even in cases where most of the genes are not novel, the interactions and contextual information that will be modeled are just as likely to impact subsequent research.
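
As a sketch of how the high-value ranking might work, the function below sorts images primarily by the absolute number of novel genes and breaks ties by the novel proportion; this ordering is one reasonable tuning choice, not a fixed design decision.

```python
"""Rank candidate images for crowdsourcing by novel-gene yield.
`extracted` maps image IDs to gene symbols recovered by OCR;
`known_genes` is the union of genes already modeled at WikiPathways."""
def rank_images(extracted: dict[str, set[str]],
                known_genes: set[str]) -> list[tuple[str, int, float]]:
    ranked = []
    for image_id, symbols in extracted.items():
        if not symbols:
            continue
        novel = symbols - known_genes
        ranked.append((image_id, len(novel), len(novel) / len(symbols)))
    # Sort by absolute novel count, then by novel proportion.
    ranked.sort(key=lambda r: (r[1], r[2]), reverse=True)
    return ranked
```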

The product of this aim is an ever-growing database of pathway images, annotated not only with source information but also with extracted gene symbols, multiple dimensions of functional classifications, and a JSON data overlay ready to be rendered and made interactive in Aim 2.

We have sufficient infrastructure in place to develop and test the platform. We will deploy its components as a modular set of virtualized services on Amazon Web Services (AWS) to host large-scale beta and production crowdsourcing events toward the end of the funding period. By then we will have demonstrated the viability, scalability, and initial popularity of our approach, which will inform the strategy for future iterations.

**Customized Pvjs**—Pvjs will require customization to work as a component of the Pathways4Life platform; its modular architecture will readily accommodate this. The new modules will add support for attribute-value accessory data and an SVG visual feedback layer ([Figure {n}](#Interface)).

[:figure](Interface)
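
As an illustration of the accessory-data module, a node's overlay record might carry attribute-value pairs alongside the state that drives the SVG feedback layer; every field name below is hypothetical, pending the actual customization work.

```python
# Hypothetical accessory-data record attached to a pvjs node. The real
# attribute names would be fixed during the pvjs customization work.
overlay_record = {
    "node_id": "node_0042",
    "attributes": {              # arbitrary attribute-value accessory data
        "confirmations": 2,
        "contested": False,
        "source": "ocr",         # vs. "participant"
    },
    "feedback": {                # state consumed by the SVG feedback layer
        "highlight": "#ffd700",  # e.g., a halo while pending confirmation
        "opacity": 0.6,
    },
}
```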

### Aim 3: Crowdsource Tasks and Engage Participation
Participants will define nodes and interactions. To define an interaction, the participant clicks on an existing node to anchor the source (an active “rubber band” line will now track with the mouse position) and then clicks on a second node or another interaction to indicate the target (an interaction arrow will now be drawn). A list of interaction types will appear from which the participant must select to complete the task and move on. To define a node, the participant right-clicks on the image where the node should be added (e.g., on the name or symbol for a gene, protein, or metabolite that OCR failed to recognize), types the name or symbol (which triggers an autocomplete pvjs database lookup), and then selects the correct identity (a new node will then be drawn). Subsequent interactions to and from the newly added nodes can then be drawn. In this way, complete pathway images can be traced and effectively modeled by a series of these two easy-to-learn tasks.
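
The two tasks reduce to small structured records. A minimal sketch of the payloads a client might submit is shown below; the field names are hypothetical, and the deployed schema would follow pvjs/GPML conventions.

```python
"""Illustrative payloads for the two crowdsourced task types."""
from dataclasses import dataclass

@dataclass
class DefineNode:
    image_id: str          # which pathway figure the task belongs to
    x: int                 # click position on the image
    y: int
    label: str             # text typed by the participant
    xref_id: str           # identifier chosen from the autocomplete lookup
    participant_id: str

@dataclass
class DefineInteraction:
    image_id: str
    source_id: str         # node anchoring the "rubber band" line
    target_id: str         # second node (or another interaction) clicked
    interaction_type: str  # selected from the presented list
    participant_id: str
```
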
Each task will be associated with an adjustable point value. Tunable point values support the basic game mechanic of balancing the economy of players’ attention and time investment. For example, rare nodes and interactions will be worth more points than common ones already captured in the current archives of pathway models. This will allow us to tune the prioritization of novel information. The overall difficulty of a given pathway (i.e., per the human difficulty scale in [Figure {n}](#DifficultyMatrix)) can also be a variable in calculating a task’s value, both to balance challenge and reward and to encourage skill building and return participation. Even the ordinality (1st...Nth) could be used to value nodes and interactions to encourage completion of a given pathway.
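
A minimal sketch of a tunable value function combining these levers (rarity, pathway difficulty, and ordinality); all weights here are illustrative starting points to be tuned during beta events, not committed values.

```python
"""Illustrative task-value function. `rarity` and `difficulty` are assumed
to be normalized to [0, 1]; `ordinality` is the task's position (1st..Nth)
within its pathway. The base value and weights are tunable parameters."""
def task_value(rarity: float, difficulty: float, ordinality: int,
               base: float = 10.0) -> float:
    completion_bonus = 1.0 + 0.02 * ordinality  # later tasks worth slightly more
    return base * (1.0 + rarity) * (1.0 + difficulty) * completion_bonus
```
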
To assess quality and confidence (see Aim 4), we will need to collect redundant information from multiple participants on any given task. We will do this in two ways: (1) by showing the same version of a pathway to multiple participants, excluding newly added nodes and interactions that have yet to be confirmed, and (2) by allowing participants to right-click on existing nodes and interactions to contest the information, which would contribute to a confirmed rejection and removal of that information in future rounds. This strategy will also help address false-positive OCR results that generate inaccurate nodes. Again, tunable point values will be used to balance confirmation versus pioneering activity (e.g., by increasing the values for successive confirmations). Participants will also gain or lose points post hoc based on the long-term confirmation/rejection status of their tasks; this tunable value will balance accuracy against speed. [Table {n}](#TunableVariables) summarizes the tunable economy of the platform via task point values, as well as when each calculation occurs (a sketch of the confirmation logic follows the table).

[:table](TunableVariables)
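
To make the confirmation mechanic concrete, here is a minimal sketch under simple assumptions: an element is accepted after a fixed number of confirmations or removed after a fixed number of contests, and its contributor gains or loses points post hoc at that moment; all thresholds and point values are tunable.

```python
"""Minimal confirmation/contest tally. `contributors` maps each element
(node or interaction) to the participant who created it; thresholds and
point values are illustrative and would be tuned like the other variables."""
from collections import defaultdict

CONFIRM_THRESHOLD = 3
REJECT_THRESHOLD = 3
POSTHOC_POINTS = 5

confirmations: defaultdict[str, int] = defaultdict(int)
contests: defaultdict[str, int] = defaultdict(int)
contributors: dict[str, str] = {}   # element_id -> creating participant
scores: defaultdict[str, int] = defaultdict(int)

def record_vote(element_id: str, confirm: bool) -> str:
    """Apply one participant's confirm/contest vote; return the status."""
    if confirm:
        confirmations[element_id] += 1
        if confirmations[element_id] == CONFIRM_THRESHOLD:
            scores[contributors[element_id]] += POSTHOC_POINTS  # post hoc reward
            return "accepted"
    else:
        contests[element_id] += 1
        if contests[element_id] == REJECT_THRESHOLD:
            scores[contributors[element_id]] -= POSTHOC_POINTS  # post hoc penalty
            return "removed"
    return "pending"
```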
