
# Abstract

A wealth of novel pathway information is trapped in published figures. This information, if properly modeled, would be immensely useful for analyzing and interpreting large-scale omics datasets. Aligned with the broader BD2K initiative, this proposal sets out to transform the wealth of information currently embedded in countless figure images (big data) into pathway models amenable to analysis and research (knowledge). Computational approaches alone have failed to fully automate the extraction of this knowledge, which is no surprise given the wide diversity among images. Likewise, human efforts have fallen short, both at the level of internal curation teams (it’s a massively distributed problem, after all) and at the level of individual researchers who choose PowerPoint and Illustrator over the freely available pathway modeling standards and tools. This challenge is particularly well suited for a computer-assisted, human-crowdsourced solution. We propose to develop the Pathways4Life platform, which combines image processing, text recognition, and cutting-edge pathway modeling software together with scalable infrastructure for content management and tunable game mechanics to facilitate the rapid modeling of pathway images through human crowdsourced tasks.

# RESEARCH STRATEGY

## Significance

Pathway information is immensely useful for analyzing and interpreting large-scale omics data [1, 2]. Pathway analysis software for general use was developed to meet the challenge of analyzing and interpreting high-throughput transcriptomics data. After the commercialization of microarrays, pathway analysis tools such as GenMAPP [3], Pathway Tools [4], and Ingenuity (www.qiagen.com/ingenuity) proliferated. These tools ultimately rely on knowledge bases of pathway models. In this grant, we stress the distinction between pathway figures, which are drawn purely for illustration purposes in a graphical file format (e.g., jpg, gif, png) and pathway models, which contain standard identifiers and semantics that can be mapped to external resources in a structured file format (e.g., xml, owl, json). Aligned with the broader BD2K initiative, this proposal sets out to transform the wealth of information currently trapped in countless figure images (big data) into properly modeled pathways amenable to analysis and research (knowledge).

Eight years ago, we conceived of WikiPathways to bring a new approach to the task of collecting, curating and distributing pathway models [5, 6]. Like other pathway knowledgebases, such as KEGG [7] and Reactome [8], WikiPathways manages a wide range of canonical metabolic, regulatory, and signaling pathways (Figure 1). The pathway creation and editing tools we use to centrally curate content, however, are the same ones we embedded into the MediaWiki platform and make available to anyone at wikipathways.org. Crowdsourcing the tasks involved in curation enables WikiPathways to manage more updates, tap into more diverse domain experts, and service specialized research communities (Table 1). WikiPathways hosts any pathway model that is of interest to any individual researcher.

Table 1. Curation Activity of Pathway Resources

Estimates of curation activity for WikiPathways, Reactome and KEGG, including examples of specialized domains captured by crowdsourcing at WikiPathways. Reactome statistics are based on their archive of editorial calendars. KEGG statistics are based on their update history and Kanehisa lab membership.

Curation Activity | WikiPathways | Reactome | KEGG
Updated pathways in past year1048 (>3,200 edits)6216 (peak year was 2009 at 71)
Unique contributors in past year20877est. ~11 (i.e., half of Kanehisa lab)
Canonical metabolic, regulatory, signaling, and disease pathways
Stem cell and tissue differentiation pathways
Extracellular RNA pathways
Micronutrient pathways
Curated (not inferred) pathways for fly, chicken, and Arabidopsis✔ third-party sites
Curated (not inferred) pathways for mouse, cow, zebrafish, yeast, plants, E. coli, tuberculosis, etc.

We are collaborating with Reactome to convert and host their human content for crowdsourced curation and distribution. Thus, despite the yellow portion in Figure 1 (the KEGG-only content that is not programmatically accessible without a license), WikiPathways is an ideal resource from which to launch a new, ambitious crowdsourcing initiative. Despite our efforts and the tremendous effort by all pathway knowledge bases over the past decade, most pathway information is still published solely as static, arbitrarily drawn images—isolated, inert representations of knowledge that cannot readily be reused or remixed in future studies.

A survey of ~4000 published pathway figures highlights the challenges we propose to address. A PubMed Central (PMC) image search using the keyword “signaling pathway” generates over 40,000 results. Visual inspection of the first 5000 results, from publications spanning 2000–2015, revealed that 3985 (79.7%) contained a pathway image; the remainder contained only the word “pathway” in their captions. We then performed optical character recognition (OCR) using two parallel approaches: Adobe Acrobat Text Recognition (www.adobe.com) and Google’s Tesseract (code.google.com/p/tesseract-ocr). We cross-referenced the extracted text results against all known HGNC human gene symbols [9], including aliases and prior symbols, to assess the potential of these images to inform the curation of human and orthologous pathways. Acrobat and Tesseract each extracted ~2300 HGNC symbols, of which ~730 (~32%) represented new information: human genes and orthologs not captured in any pathway for any species at WikiPathways. Further, because each approach relies on its own uniquely trained OCR method, the two found significantly different sets of symbols, such that the combined results provide greater extraction counts across all categories: 3187 HGNC symbols in total and 1087 (34%) new to WikiPathways (Figure 2, green).
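The cross-referencing step in this survey can be illustrated with a minimal sketch. It is not the script used for the survey: the record layout, the toy gene list, and the choice of case-insensitive token matching are all assumptions made for illustration (case-insensitive matching in particular would inflate dictionary-word false positives on real data).

```python
import re

def build_symbol_index(hgnc_records):
    """Map every approved symbol, alias, and prior symbol to its approved symbol."""
    index = {}
    for record in hgnc_records:
        approved = record["symbol"]
        names = [approved, *record.get("aliases", []), *record.get("previous", [])]
        for name in names:
            # First mapping wins if two genes share an alias (assumed policy).
            index.setdefault(name.upper(), approved)
    return index

def extract_symbols(ocr_text, index):
    """Return approved symbols whose names occur as whole tokens in OCR text."""
    tokens = re.findall(r"[A-Za-z0-9][A-Za-z0-9\-]*", ocr_text)
    return {index[t.upper()] for t in tokens if t.upper() in index}

# Hypothetical toy data standing in for the HGNC download and OCR output.
hgnc_records = [
    {"symbol": "TP53", "aliases": ["P53"], "previous": []},
    {"symbol": "MAPK1", "aliases": ["ERK2"], "previous": ["PRKM1"]},
    {"symbol": "AKT1", "aliases": ["PKB"], "previous": []},
]
index = build_symbol_index(hgnc_records)

acrobat_hits = extract_symbols("p53 activates ERK2", index)
tesseract_hits = extract_symbols("AKT1 -> p53", index)

combined = acrobat_hits | tesseract_hits   # union of both OCR methods
already_modeled = {"TP53"}                 # genes already in pathway models
novel = combined - already_modeled         # candidates for new content
```

Taking the union of both OCR outputs before subtracting the already-modeled genes mirrors the combined counts reported above.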

Several caveats to this survey suggest that even more new pathway information can be extracted from published images. (1) Another 38% of extracted symbols (Figure 2, red) are found only on nonhuman pathways and thus may still represent novel human pathway content. (2) The OCR methods were used “out-of-the-box” and not trained on pathway images, and the images were not pre-processed. Thus there is a significant opportunity to increase total extraction counts. (3) Since we considered only 5000 of 40,000 results from only a single pathway-related search term, the search result space is much greater. (4) This survey ignored the interactions shown in the images. Thus, even the 28% of symbols that overlap with current human pathways (Figure 2, blue) may provide new interaction content. These caveats far outweigh the potential for false positives in this survey. (1) About 5.5% of extracted symbols are dictionary words that might not represent human genes in a pathway (e.g., BIG, CELL, HOOK, MASS). (2) Some occurrences of extracted symbols might be peripheral to the image and not part of the pathway. (3) Each OCR method has an inherent false-positive rate of <5% [@ 10.1109/icdar.2007.4376991], which is only partially mitigated by the narrow focus on HGNC cross-referenced hits.

It is difficult to estimate the total number of new human gene symbols in the entire corpus of published pathway images. The proportion of new genes must plateau—new results giving diminishing returns—as the total number of known genes in pathways is approached. But for perspective, even if we were to only consider the first 3985 images and conservatively estimate the average number of genes per pathway image to be 7, and generously estimate the false positive rate to be as high as 20%, then at a proportion of 34% we would expect ~6100 new genes. This would almost double the count of unique human genes at WikiPathways today—the equivalent of 6 years of work at current crowdsourcing rates. The effort would also greatly expand upon the interactions and biological context for practically all the genes in WikiPathways. As detailed in our data sharing plan, all of this new knowledge will be available in multiple formats, including BioPAX and RDF, and distributed to a wide range of independent resources through channels already established by the WikiPathways project, including Pathway Commons [10], NCBI and Network2Canvas [11], which is a LINCS-BD2K project.
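As a rough check on this back-of-the-envelope estimate, the four stated factors can be multiplied directly, as in the sketch below. The variable names are ours. Note that the straightforward product comes out somewhat higher than the ~6100 quoted above, which suggests the quoted figure includes additional discounting that is not itemized in the text.

```python
# Factors as stated in the paragraph above.
images = 3985                 # pathway images in the sample set
genes_per_image = 7           # conservative average genes per image
false_positive_rate = 0.20    # generous upper bound on false positives
novel_proportion = 0.34       # share of symbols new to WikiPathways

true_symbols = images * genes_per_image * (1 - false_positive_rate)
expected_novel = true_symbols * novel_proportion

print(round(expected_novel))  # straightforward product of the four factors
```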

Can the state of current pathway knowledge bases really be so poor that this minor fraction of published pathway images could have such a large effect? Here, again, the survey set of PMC pathway figures is illustrative. Despite the combined efforts of KEGG, Reactome, SBML, SBGN, WikiPathways, and even Ingenuity to provide pathway models as publication-quality images over the past decade, we counted only 230 images from any of these sources in our sample set of 3985 pathway figures. No other modeled formats were seen in appreciable number. Evidently, the vast majority of pathway images (over 94%) are arbitrarily drawn with illustration software, with little to no consistency in visual lexicon or layout and without reference to standard identifiers, interaction types, or contextual semantics.

## Innovation

Clearly, a wealth of novel pathway information is trapped in published figures. The next challenge: how to efficiently extract this information and model it as biological knowledge. Computational approaches alone have failed to fully automate this process, which is no surprise given the wide diversity among the images. Likewise, human efforts have fallen short, both at the level of central curation teams (it’s a massively distributed problem, after all) and at the level of individual researchers who choose PowerPoint and Illustrator over freely available pathway modeling standards and tools. This challenge is particularly well suited for a computer-assisted, human-crowdsourced solution. We propose to develop the Pathways4Life platform as such an innovative solution. The Pathways4Life platform combines image processing, text recognition, and cutting-edge pathway modeling software together with scalable infrastructure for content management and tunable game mechanics to facilitate the rapid modeling of pathway images through human crowdsourced tasks. Each component derives from existing research projects and technologies—the innovation lies in bringing them together to address an ongoing biomedical challenge at an unprecedented rate.

In Aim 1, we are following in the footsteps of other groups who tackled the problem of parsing and indexing figures from the scientific literature. Yale Image Finder [12] indexed over 1.5 million open-access images, but they parsed only the captions and not the text embedded in the images. Michael Baitaluk et al. fine-tuned an OCR method to parse entire pathways to generate BiologicalNetworks.org [13]. This work will certainly inform our optimization work in Aim 1. Unfortunately, the models from this project were never released in a community standard format, and are no longer publicly available. Regardless, even with optimization, the results were limited to 1012 pathways from ~25,000 images, of which 87% were considered to be high quality. The only human input in this process was a “like/don’t like” button and comment form. So, the real innovation we propose is to not rely solely on computational OCR, but rather to couple it from the start with a human crowdsourcing component. The OCR results are intended to lower the barrier to entry for participants, giving them a handful of recognized nodes to build upon and in the process learn how to add the remaining nodes.

The technical innovation in Aims 2, 3, and 4 has less to do with individual technologies (relational databases, Python/Django, JavaScript, etc.) and more to do with the modular, extensible architecture, which will support iterative, agile development. This approach allows us to roll out early iterations of the platform and enables others to leverage the same framework for a wide range of crowdsourcing challenges. For example, swap in an indexed set of articles and a text highlighting tool and you would have a platform for crowdsourcing text annotation or semantic knowledge extraction, with tunable game mechanics built-in. The goal of the technology is to streamline the display, aggregation, and valuation of tasks. Embedded in this platform will be a customized version of our pathway modeling software (described in Aim 2) that will simplify the human tasks associated with pathway curation down to the barest minimum. Combined with the assistance of pre-parsed nodes from Aim 1, the curation experience will be made effortless—even enjoyable—and will far exceed the capabilities of the current WikiPathways toolset.

The WikiPathways project is already contributing to a paradigm shift in modeling biomedical knowledge. Rather than relying solely on a centralized group of curators, we have engaged a relatively broad spectrum of researchers to contribute pathway information in their areas of expertise. This proposal aims to shift the paradigm even further. By analogy to the limitations of a centralized group that the WikiPathways project overcame by enabling all active researchers, we propose to overcome the limitations of active researchers (e.g., their number and available time) by enabling the general public. The scope and timescale of this approach will have a dramatic impact on the rate and limits of new knowledge in the form of pathway models. By scaling up the pathway image collection to 16,000 and extrapolating the sample set results (4 x 6100 = 24,400), we predict that we will be inside the region of diminishing returns (w.r.t. unique genes because there are only so many), allowing us to prioritize our selection of pathways to crowdsource based on organism, density of novel genes and biological contexts (Aim 1). This collection will contain more unique human genes than all current pathway archives combined. The modeling of this collection would thus approach the goal of having at least one representation of every human gene that is in a known pathway context. This will lead to entirely new metrics for pathway resources, as we go on to target the full diversity of interactions and contexts for genes, as well as splice variants, miRNA, metabolites, drugs, etc. The impact on the number of interactions modeled from this collection will be even greater, as novel interactions are captured even for non-novel nodes in pathway images. 
More broadly, the platform has the potential to make a lasting impact on how pathway modeling as well as other knowledge extraction is performed, shifting the focus closer and closer to the source, to capture this information in sync with the act of publication.

## Approach

We propose four aims. First, we will collect, process, and classify pathway images from the open-access literature. The classification step will allow us to prioritize images based on curation goals (e.g., novel genes and disease contexts) and difficulty (both human and computational). The second aim will be to develop an engaging interactive digital media platform for presenting these images to human participants with pre-annotated nodes (from OCR results) and simple tasks, such as connecting existing nodes and adding new nodes. The third aim will focus on the crowdsourcing effort, including recruiting participants and “tuning” their experience by adjusting for difficulty level, regulating point systems and visual feedback (e.g., animations), as well as by communicating scientific accomplishments. The fourth aim will entail the transformation of data from the crowdsourced tasks into confirmed pathway models. Statistical analyses will automatically feed back into the prioritization and experience tuning of the earlier aims to drive activity toward completion and accuracy. Community review and curation of the results will lead to their dissemination via multiple open-standard formats and communication channels, including WikiPathways, Pathway Commons (BioPAX), and linked data (RDF).

### Aim 1: Collect, Process, and Classify Pathway Images

We implemented an efficient process to maximize collection of pathway images from PMC publication figures as part of the sample analysis described above. The process starts with a query into the PMC image search feature. Results are returned as HTML, which is parsed to download the full-size figure image files and generate an annotation file for each image. The annotation file, containing figure caption, author names, article title, and article hyperlink, will allow us to present critical contextual information during the crowdsourcing stage. It will also be used to index images by author and to support focused keyword searches (e.g., for images relating to particular diseases). This process will scale to encompass a broad range of queries to collect diverse and high-value pathway image file sets. We plan to collect a set of 16,000 pathway images in the first iteration of Aim 1, approximately 4x the size of the sample set described above.
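To make the role of the annotation file concrete, the sketch below shows one possible record layout and the two lookups described above (indexing by author and keyword search over captions). The field names, file names, and URLs are hypothetical placeholders, not the format produced by our existing scripts.

```python
import json
from collections import defaultdict

def make_annotation(image_file, caption, authors, title, url):
    """Bundle the contextual metadata collected for one figure image."""
    return {
        "image_file": image_file,
        "caption": caption,
        "authors": list(authors),
        "article_title": title,
        "article_url": url,
    }

def index_by_author(annotations):
    """Map each author name to the image files from their articles."""
    index = defaultdict(list)
    for annotation in annotations:
        for author in annotation["authors"]:
            index[author].append(annotation["image_file"])
    return dict(index)

def search_captions(annotations, keyword):
    """Return image files whose caption mentions the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [a["image_file"] for a in annotations if needle in a["caption"].lower()]

# Placeholder records for illustration only.
annotations = [
    make_annotation("fig_0001.jpg", "Insulin signaling pathway in type 2 diabetes.",
                    ["A. Author", "B. Author"], "Example article one",
                    "https://example.org/article-1"),
    make_annotation("fig_0002.jpg", "Wnt signaling pathway in colorectal cancer.",
                    ["B. Author"], "Example article two",
                    "https://example.org/article-2"),
]

annotation_json = json.dumps(annotations[0], indent=2)  # one file per image
author_index = index_by_author(annotations)
diabetes_images = search_captions(annotations, "diabetes")
```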

Also as part of the sample analysis, we began to explore text-extraction software. We have already scripted the application of Adobe Acrobat and Tesseract to generate extracted text file sets. These are useful for assessing the results in terms of recognizable gene symbol counts. But these programs also provide positional information for each block of extracted text. We will use this information to generate JSON models of nodes, preserving the annotation and position from the original image. This will allow us to overlay an interactive layer of modeled nodes onto the original pathway figures (see Aim 2). We will also improve the out-of-the-box performance of these programs. As an actively developed, open-source effort, Tesseract in particular has considerable potential for improvement. To both programs, we will add a common image pre-processing step that uses ImageMagick (www.imagemagick.org) to increase contrast, adjust orientation, and remove noise (e.g., other graphics) from a copy of the image. We will also use OpenCV [14] to identify and optimize regions of text in the image prior to OCR. Other methods are available to isolate, orient, and filter “objects” that may contain recognizable text [13, 15, 16, 17]. This is a worthwhile area to explore, considering the Difficulty Matrix on our sample set of 3985 images (Figure 3). Fortunately, the Easy-Easy corner of the matrix (top left) contains a large proportion of pathway figure images, which can be targeted in early rounds of our crowdsourcing effort. However, half of the images are in the Easy-Hard quadrant (easy for human, hard for computer; top right), meaning that any improvements in automatic text extraction will result in having more pathways that are ready and amenable for human processing. The lower half of the matrix includes such small percentages that we can simply ignore it for the purposes of this proposal.
Of course, the difficulty level is not really binary, so we can leverage the gradient of difficulty in the human scale, for example, to rank the pathways for display to a gradient of participants from novice to expert.
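The conversion of positional OCR output into JSON node models, described earlier in this section, might look like the following sketch. It assumes Tesseract's tab-separated (TSV) output, whose columns are listed in `TSV_HEADER`; the node field names and the confidence cutoff are illustrative assumptions rather than the pvjs JSON schema.

```python
import csv
import io
import json

# Column layout of Tesseract's TSV output.
TSV_HEADER = ["level", "page_num", "block_num", "par_num", "line_num",
              "word_num", "left", "top", "width", "height", "conf", "text"]

def tsv_to_nodes(tsv_text, min_conf=60.0):
    """Turn word-level OCR rows into node records that keep image position."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    nodes = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows are individual words; skip low-confidence or empty ones.
        if row["level"] != "5" or not text or float(row["conf"]) < min_conf:
            continue
        nodes.append({
            "id": f"node{len(nodes) + 1}",   # hypothetical identifier scheme
            "label": text,
            "x": int(row["left"]),
            "y": int(row["top"]),
            "width": int(row["width"]),
            "height": int(row["height"]),
            "ocr_confidence": float(row["conf"]),
        })
    return nodes

# Small hand-written sample standing in for real OCR output.
rows = [
    TSV_HEADER,
    ["1", "1", "0", "0", "0", "0", "0", "0", "640", "480", "-1", ""],
    ["5", "1", "1", "1", "1", "1", "120", "45", "52", "18", "93.5", "MAPK1"],
    ["5", "1", "1", "1", "2", "1", "300", "210", "40", "16", "31.0", "T?53"],
]
sample_tsv = "\n".join("\t".join(r) for r in rows)

nodes = tsv_to_nodes(sample_tsv)
overlay_json = json.dumps({"nodes": nodes}, indent=2)
```

Low-confidence words are dropped here for simplicity; in practice they could instead be kept and flagged so that participants can correct them.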

The independent difficulty levels for humans and computers are just one example of how images will be classified. We will also assess the potential gene content per image based on the automatically extracted text. As we did with the sample set, we will contrast these gene sets with those already captured in properly modeled pathways. Given the set of novel genes, we will classify images based on the proportion and absolute number each pathway represents (Figure 2B, green). This will allow us to define high-value targets for the crowdsourcing effort—those images that will add the most new unique genes at the highest rate to pathway knowledge bases. We will also classify pathway images by the overrepresented GO terms and disease associations in their gene sets. Classification by disease, for example, will allow us to prioritize not only generally but also specifically for crowdsourcing efforts that target a single disease or research area. Even in cases where most of the genes are not novel, the interactions and contextual information that will be modeled are just as likely to impact subsequent research.
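One simple way to rank images as high-value targets is sketched below: each image is scored by the absolute number and the proportion of its extracted genes that are not yet modeled. The image identifiers, gene sets, and the tie-breaking order are hypothetical; the actual prioritization may weigh these factors differently.

```python
def novelty_profile(image_genes, modeled_genes):
    """Count the genes in one image that are absent from modeled pathways."""
    novel = image_genes - modeled_genes
    proportion = len(novel) / len(image_genes) if image_genes else 0.0
    return {"novel_count": len(novel), "novel_proportion": proportion}

def rank_images(images, modeled_genes):
    """Order images by novel gene count, then by novel proportion."""
    scored = [(image_id, novelty_profile(genes, modeled_genes))
              for image_id, genes in images.items()]
    return sorted(
        scored,
        key=lambda item: (item[1]["novel_count"], item[1]["novel_proportion"]),
        reverse=True,
    )

# Toy example: gene sets extracted per image versus genes already modeled.
modeled_genes = {"TP53", "AKT1", "MTOR"}
images = {
    "fig_a": {"TP53", "AKT1", "GENE1", "GENE2"},   # 2 novel of 4
    "fig_b": {"GENE3", "GENE4"},                   # 2 novel of 2
    "fig_c": {"TP53", "MTOR"},                     # 0 novel of 2
}
ranking = [image_id for image_id, _ in rank_images(images, modeled_genes)]
```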

The product of this aim is an ever-growing database of pathway images, annotated not only with source information but also with extracted gene symbols, multiple dimensions of functional classifications, and a JSON data overlay ready to be rendered and made interactive in Aim 2.

#### Strategic Vision

The FOA points out that this work is expected to be iterative. The work we propose covers the first iteration from conceptual design to established demonstration. Each section of the plan will include a Strategic Vision subsection like this to describe a plan for subsequent iterations on this work.

Beyond scouring HTML results for published pathway images, the longer-term strategy should be to get closer and closer to the source. The Pathways4Life platform could be extended to allow journal editors to manage inputs. For example, they could directly populate the figures to be processed by the crowd in sync with the publishing process, even as a way of promoting a new article. In this way, they would learn to appreciate the need for providing editorial feedback to authors concerning the legibility of pathway figures, making them more amenable to computational processing. In the final iteration, the authors themselves should find these modeling tools so easy and useful that they will model pathways from the start.

### Aim 2: Develop an Interactive Digital Media Platform

We began to explore this approach in 2007 with the WikiPathways project [5, 18, 19]. Today, my group is in Year 3 of 5 of the first R01 for WikiPathways development (GM100039). We are currently rolling out a JavaScript replacement of the Java Applet, called pvjs (pathvisiojs), which converts our xml pathway models into a JSON model and renders it as SVG. The user can interact with the rendered view by clicking on nodes (e.g., genes, proteins, metabolites) and interactions to pull up information panels driven by the use of standard identifiers. In edit mode, users can add new identifiers, reposition nodes, and draw new interactions. Beyond these immediate functions, we designed pvjs with a view toward specialization for educational, publishing, and even game possibilities. The first step in our development process for pvjs was to synthesize a set of best practices by reviewing the architectures of 40 relevant libraries and frameworks, including d3.js, Cytoscape.js, AngularJS, SVG-edit, VISIBIOweb, and biographer. The insights we gained led us to a modularized, model-view-controller (MVC) architectural pattern that integrates virtual DOM capabilities for fast performance and easy extensibility. We use an agile development process in which new features are broken into their smallest independently useful components and released frequently to ensure a tight coupling of user and developer goals and expectations. This proposal will be the first realization of this potential for extensibility.

A specialized version of pvjs will be a critical component of the Pathways4Life platform. For this proposal, we will describe the unique requirements, refactoring, and new development that will make up the overall implementation plan for the platform.

Backend database and control logic—We will design and host a database to contain image, annotation, and JSON file references, with indexed classification values and various progress-tracking metrics. These entries will map to participant, node, and interaction tables in the database. The schema will support queries to cache specific subsets of content for targeted events (e.g., with a disease focus). Indexed classification values will also be used in dynamic queries to determine the next pathway image to show a given user. The skill level of participants and the point value of nodes and interactions will be updated in their respective tables in rounds of activity (e.g., per pathway, set of pathways, or even per day), depending on performance profiling.
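A minimal sketch of such a schema is shown below, using Python's built-in `sqlite3` module purely for illustration. All table and column names are assumptions, the production database engine has not been chosen here, and for brevity interactions only link nodes (the platform would also allow an interaction to target another interaction).

```python
import sqlite3

SCHEMA = """
CREATE TABLE image (
    id INTEGER PRIMARY KEY,
    image_file TEXT NOT NULL,
    annotation_file TEXT NOT NULL,
    json_file TEXT NOT NULL,
    human_difficulty REAL,
    disease_class TEXT,
    completed INTEGER DEFAULT 0
);
CREATE INDEX idx_image_class ON image (disease_class, human_difficulty);
CREATE TABLE participant (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    skill_level REAL DEFAULT 0
);
CREATE TABLE node (
    id INTEGER PRIMARY KEY,
    image_id INTEGER NOT NULL REFERENCES image(id),
    participant_id INTEGER REFERENCES participant(id),
    label TEXT NOT NULL,
    point_value REAL DEFAULT 1,
    confidence REAL DEFAULT 0
);
CREATE TABLE interaction (
    id INTEGER PRIMARY KEY,
    image_id INTEGER NOT NULL REFERENCES image(id),
    participant_id INTEGER REFERENCES participant(id),
    source_node_id INTEGER NOT NULL REFERENCES node(id),
    target_node_id INTEGER NOT NULL REFERENCES node(id),
    interaction_type TEXT NOT NULL,
    point_value REAL DEFAULT 1,
    confidence REAL DEFAULT 0
);
"""

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO image (image_file, annotation_file, json_file, "
    "human_difficulty, disease_class) VALUES (?, ?, ?, ?, ?)",
    [
        ("fig1.png", "fig1.json", "fig1.nodes.json", 0.2, "cancer"),
        ("fig2.png", "fig2.json", "fig2.nodes.json", 0.8, "cancer"),
        ("fig3.png", "fig3.json", "fig3.nodes.json", 0.1, "diabetes"),
    ],
)

def images_for_event(conn, disease_class, max_difficulty):
    """Select unfinished images for a targeted event, easiest first."""
    rows = conn.execute(
        "SELECT image_file FROM image "
        "WHERE disease_class = ? AND human_difficulty <= ? AND completed = 0 "
        "ORDER BY human_difficulty",
        (disease_class, max_difficulty),
    ).fetchall()
    return [row["image_file"] for row in rows]

event_images = images_for_event(conn, "cancer", 0.5)
```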

A basic Python/Django web framework will be implemented to form template-based queries and views to serve content to the customized pvjs tool and to update the database with new contributions and calculated activity metrics. For example, as a participant adds nodes and interactions, they will accumulate corresponding points and attain a higher skill level. The next pathway (or set of pathways) shown to this participant will be based on their skill level and the difficulty class of the pathway. Simple calculations based on the participant’s actions with a given pathway (e.g., accumulation of points) will be done in the client browser and returned to the server after each session. On the server side, we will process aggregated data across all sessions to assign confidence scores for each node and interaction, updating their database records. This strategy will allow us to distribute low-CPU, frequent computation at scale with the number of participants while also restricting server-side, moderate-CPU computation to fixed periods that we can adjust according to demand and resources.
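The control logic for matching participants to pathways could be as simple as the following sketch, written as plain Python functions rather than Django views. The scales for skill and difficulty, the `points_per_level` constant, and the nearest-difficulty rule are all assumptions chosen for illustration.

```python
def update_skill(skill_level, points_earned, points_per_level=100.0):
    """Raise a participant's skill in proportion to points earned in a session."""
    return skill_level + points_earned / points_per_level

def next_pathway(skill_level, candidates, seen_ids=frozenset()):
    """Pick the unseen pathway whose difficulty is closest to the skill level."""
    unseen = [c for c in candidates if c["id"] not in seen_ids]
    if not unseen:
        return None
    # Ties are broken by the lower id so the choice is deterministic.
    return min(unseen, key=lambda c: (abs(c["difficulty"] - skill_level), c["id"]))

# Hypothetical pathway records with difficulty on the same scale as skill.
candidates = [
    {"id": 1, "difficulty": 0.5},
    {"id": 2, "difficulty": 1.5},
    {"id": 3, "difficulty": 3.0},
]

skill = update_skill(1.0, points_earned=50)          # 1.0 -> 1.5
choice = next_pathway(skill, candidates, seen_ids={1})
```

Because skill and difficulty share one scale here, tuning either value shifts which pathways a participant sees next.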

We have sufficient infrastructure in place to develop and test the platform. Because the platform is a modular set of virtualized services, we will deploy it using Amazon Web Services (AWS) to host large-scale beta and production crowdsourcing events toward the end of the funding period. By then we will have demonstrated the viability, scalability, and initial popularity of our approach, which will inform the strategy plan in future iterations.

Customized Pvjs—Pvjs will require customization to work as a component of the Pathways4Life platform. The modular architecture of pvjs will readily accommodate customization. The new modules will add support for attribute-value accessory data and an SVG visual feedback layer (Figure 4).

The first module will leverage the extensibility of our existing JSON format for pathway information by defining attributes to handle pre-calculated point values and confidence scores for each pathway, node, and interaction. These values will be retrieved from the backend database described above and combined with standard pathway information from the XML model. As part of the JSON model, this information will be available to the SVG layer and the browser. The module will be designed to work with any third-party database and any set of arbitrary attribute-value pairs. Thus, the same module could also be used to represent any accessory data, such as public or user-provided omics datasets, linkouts to custom resources, or metadata from other modeling standards, such as SBML.
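The attribute-value idea can be sketched independently of pvjs itself. The example below is written in Python for consistency with the backend rather than in JavaScript, and the model layout (`elements`, `id`, an `accessory` namespace) is a hypothetical stand-in for the actual pvjs JSON format.

```python
import copy

def attach_accessory_data(pathway_model, accessory, namespace="accessory"):
    """Return a copy of the model with attribute-value pairs added per element."""
    merged = copy.deepcopy(pathway_model)      # leave the source model untouched
    for element in merged.get("elements", []):
        extra = accessory.get(element["id"])
        if extra:
            element.setdefault(namespace, {}).update(extra)
    return merged

# Hypothetical pathway model and backend values keyed by element id.
pathway_model = {
    "name": "Example pathway",
    "elements": [
        {"id": "n1", "type": "node", "label": "MAPK1"},
        {"id": "e1", "type": "interaction", "source": "n1", "target": "n2"},
    ],
}
scores = {
    "n1": {"point_value": 12, "confidence": 0.8},
    "e1": {"point_value": 30, "confidence": 0.4},
}

enriched = attach_accessory_data(pathway_model, scores)
```

Keeping the extra values under their own namespace means the same function could carry omics values or linkouts without colliding with standard pathway fields.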

The next module will build upon the current SVG rendering and interaction capabilities of pvjs. Currently, it supports only basic representations of nodes and interactions. To make pvjs more engaging and interactive, we will design and implement more visually engaging objects and activity feedback. For example, color and animation effects can be used to indicate the point value of a particular node or interaction, and the act of forming a new or confirmed interaction could be accompanied by a visually rewarding glow, pulse, or burst effect. The module will define the mapping between available JSON attributes and SVG elements. The mapping pattern can be reused to provide custom graphics and animations for any defined set of accessory data, such as gradient-fill colors for omics datasets, hover effects for custom linkouts, and support for other graphical standards, such as SBGN.

A potential challenge that could arise is the performance of SVG for highly complex diagrams. SVG supports dynamic diagrams with up to about 5000 to 10,000 elements, depending on the browser. This limitation is unlikely to present a problem, because the vast majority of pathways are more focused than the average raw network visualization. But if the goals of the project shift such that support is required for additional elements, pvjs is designed so that some or all of the SVG rendering can be replaced with technologies suited for rendering extremely large numbers of elements, such as webGL or canvas. For example, it would be possible to render completed portions of a pathway using SVG, on top of which additional elements, such as a large number of candidate pathway elements generated by automated techniques, could be rendered as a webGL layer. This could be done using an open-source library such as Pixi.js, a performance-focused HTML5 rendering engine that defaults to webGL but falls back to canvas to support older browsers.

#### Strategic Vision

The backend database and control logic elements will be designed to scale with demand and hardware resources. Thus, future iterations will require minimal refactoring as compute, storage, and bandwidth resources are increased. During the initial funding period, we will coordinate with other science crowdsourcing efforts to leverage any common platforms we might contribute to in order to reach these goals faster and more sustainably. For example, during an NIH-hosted informational webinar on this FOA, the potential grantees formed a Google Group for Crowdsourced Science Games that will also be a source of collaborative idea sharing, development, and outreach strategies. In particular, for many years, we have collaborated on open science and crowdsourcing strategies with Drs. Su and Good at Scripps. They are exploring a crowdsourcing platform built around their Mark2Cure effort (mark2cure.org). We will share our requirements and feature ideas with them and other groups to work together wherever possible on a platform that could support our independently developed tools. In the iterations that follow this funding period, we will be in a position to consider longer-term strategies for hosting Pathways4Life. There are already enthusiastic hosting services for science-related games and crowdsourcing efforts, such as Purpose Games, Games for Change, and Zooniverse. Such hosting opportunities will continue to diversify and grow in number.

In terms of pvjs customization, we have outlined a strategy that will meet the immediate goals of this proposal, to produce an engaging interactive digital media experience, while also enabling a wide range of future project ideas. We routinely accept patches and extensions to our open-source projects, especially in areas where we have established a framework for extensions and clear programming patterns. Thus, in addition to our own further customization of JSON attributes and mapped SVG graphics, we anticipate that Pathways4Life will be a popular framework for other groups to extend for their own custom use cases. In particular, we would work with colleagues in the SBGN community to support their standard visual lexicon for pathways through an iteration of this module strategy.

### Aim 3: Crowdsource Tasks and Engage Participation

Participants will define nodes and interactions. To define an interaction, the participant clicks on an existing node to anchor the source (an active “rubber band” line will now track with the mouse position) and then clicks on a second node or another interaction to indicate the target (an interaction arrow will now be drawn). A list of interaction types will appear from which the participant must select to complete the task and move on. To define a node, the participant right-clicks on the image where the node should be added (e.g., on the name or symbol for a gene, protein, or metabolite that OCR failed to recognize), types the name or symbol (which triggers an autocomplete pvjs database lookup), and then selects the correct identity (a new node will then be drawn). Subsequent interactions to and from the newly added nodes can then be drawn. In this way, complete pathway images can be traced and effectively modeled by a series of these two easy-to-learn tasks.

Each task will be associated with an adjustable point value. Tunable point values support the basic game mechanic of balancing the economy of players’ attention and time investment. For example, rare nodes and interactions will be worth more points than common ones already captured in the current archives of pathway models. This will allow us to tune the prioritization of novel information. The overall difficulty of a given pathway (i.e., per human difficulty scale in Figure 3) can also be a variable in calculating a task’s value, both to balance challenge and reward and to encourage skill building and return participation. Even the ordinality (1st...Nth) could be used to value nodes and interactions to encourage completion of a given pathway.
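One way such a point value could be composed is sketched below. The functional form, the 0 to 1 scaling of rarity and difficulty, and all default weights are assumptions made for illustration; the proposal leaves the actual values to play testing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PointWeights:
    """Tunable multipliers. The defaults are placeholders, not proposed values."""
    base: float = 10.0
    rarity: float = 1.0
    difficulty: float = 0.5
    ordinality: float = 0.1


def task_value(rarity: float, difficulty: float, ordinal: int,
               weights: PointWeights = PointWeights()) -> int:
    """Return the point value of one task.

    rarity:     0 (common in existing models) to 1 (absent from them)
    difficulty: 0 (easy pathway) to 1 (hard pathway)
    ordinal:    1 for the first item traced on a pathway, 2 for the second, ...
    """
    if not (0.0 <= rarity <= 1.0 and 0.0 <= difficulty <= 1.0):
        raise ValueError("rarity and difficulty must lie in [0, 1]")
    if ordinal < 1:
        raise ValueError("ordinal starts at 1")
    # Each variable adds to a multiplier on the base value, so raising one
    # weight shifts participant attention toward that outcome.
    multiplier = (1.0
                  + weights.rarity * rarity
                  + weights.difficulty * difficulty
                  + weights.ordinality * (ordinal - 1))
    return round(weights.base * multiplier)
```

With the placeholder weights, a rare node on an easy pathway is worth twice a common one, and later items on the same pathway are worth slightly more than earlier ones.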

To assess quality and confidence (see Aim 4), we will need to collect redundant information from multiple participants on any given task. We will do this in two ways: (1) by showing the same version of a pathway to multiple participants, excluding newly added nodes and interactions that have yet to be confirmed and (2) by allowing participants to right-click on existing nodes and interactions to contest the information, which would contribute to a confirmed rejection and removal of that information in future rounds. This strategy will also help address false-positive OCR results that generate inaccurate nodes. Again, tunable point values will be used to balance confirmation versus pioneering activity (e.g., by increasing the values for successive confirmations). And participants will gain/lose points post hoc based on the long-term confirmation/rejection status of their tasks. This tunable value will balance accuracy against speed. Table 2 summarizes the tunable economy of the platform via task point values, as well as when the calculation occurs.

Table 2. Tunable Variables

Five examples of variables that can be tuned to shift activity with respect to various outcomes. The last column specifies when and where each variable would be evaluated with respect to tasks performed by participants.

| Variable | Outcome | Computed |
| --- | --- | --- |
| Pathway difficulty | Skill building and return play | Server-side, before task |
| Rarity vis-à-vis current models | Novel information capture | Server-side, before task |
| Ordinality per pathway | Completed pathways | Client-side, during task |

The precise values assigned to each variable will be determined during initial “play testing” and periodically adjusted to match our evolving goals and crowd of participants. The tuning and balancing of game economies is standard practice in simulation and massively multiplayer online games, which share these same evolving properties. By building in these mechanisms from the start, we can seamlessly redirect attention to tasks we deem a priority, even as our priorities change. Changes to values will be determined by three mechanisms, each optimized to bring about a specific outcome: statistical analysis, direct feedback, and manual override. The first two are described in Aim 4. The third mechanism can simply be described as us intervening from time to time based on our observations of game play and outcome. Regardless of mechanism, the final key property of our task management strategy will be transparency. At any given time, participants will know the value of each task they perform by means of immediate visual feedback (e.g., a brief animation of the point value). They will also see their current total score and progress toward successive skill levels. At the end of each round (e.g., 10 tasks), we can apply bonus points (or deduct points) according to the ongoing server-side calculation of confirmed/rejected status of prior tasks and then display their current level. This highlighting of bonus points, and anticipation of leveling, will thus be directly associated with the importance of accuracy. Leveling will unlock more difficult pathways with greater point potential, etc. Each participant will progress through training levels as well, where the point systems are highlighted and pre-selected pathway images are used to demonstrate the tasks to be performed on the elements they will encounter. 
These simple game mechanics will provide sufficient incentive for a large, yet to be fully engaged population of people interested in purposeful games and science and disease-related crowdsourced tasks.

If we cannot engage a sufficiently large volunteer base in this fashion, we will use Amazon Mechanical Turk (AMT) to finish assessing the viability of our platform by the end of the funding period. The basic idea of AMT is to facilitate the distribution of tasks to a ready “army” of workers who receive micropayments per completed task. Since the Pathways4Life platform is designed to be deployed on AWS already (as described in Aim 2), we would only need to add a few calls to AMT’s application programming interface to send and retrieve task data as sets of name-value pairs [20]. The budget for AWS time and bandwidth would thus be shifted to AMT workers, giving us fewer months of hosted time, but guaranteed returns if outreach efforts fall short in this compressed time period.

#### Strategic Vision

We outline a minimal set of game mechanics for this initial iteration of the project, but we envision the potential for rapid iterations in this direction without any changes to the infrastructure or architecture. The visual nature of the source material—the pathway images that authors, graphic designers, and editors have already taken care in producing—can be woven together with creative storytelling to make a more engaging and broadly appealing experience. For example, these pathway images can be framed as navigable maps discovered from ancient alien civilizations. The act of tracing and interpreting thus becomes one of exploration and risk/reward adventure. In addition to valuing individual nodes and interactions, we can also calculate points and generate animations based on the extent of connectivity across an entire pathway, thus encouraging activity along extended paths. A progression of simple animations depicting flowing water, marching ants, migrating animals and colonizing humans, for example, could play out over the graph as it grows and gains confirmation status. We can assign landmark names to sets of gene symbols to carry along the story, reserving special categories of landmarks for the rare genes we value most. The confirmation/rejection then translates directly into reward/risk in the adventure of accurately navigating these maps. And the discovery of more advanced landmarks leads to the progression to more difficult pathway maps. Social features, such as teams competing in tournaments or in conquering of new territory, could also be layered onto the tasks, together with compelling storytelling.

### Aim 4: Assemble Results: Transforming Big Data into Knowledge

The output of the components described so far will be a stream of JSON snippets that represent the individual changes (or “diffs”) made per task. Each snippet will be associated with a particular pathway image and participant. As structured data in a predefined JSON format, they can readily and reliably be compared. With the collection periodically indexed by pathway image identifier, for example, we can quickly confirm that a particular snippet is novel or an Nth confirmation of a prior result. The comparison of snippets representing new nodes will require a tolerance factor to account for minor deviations in positioning. But numerical interval comparison is still a trivially fast calculation. These data then feed into a calculated confidence score for the snippet as well as a potential bonus score for the participant. The equation for calculating confidence scores will be the sum of observations (+1 for confirmation, –1 for rejection, o), weighted by the relative skill level of the participant (0–1, w). We can include a multiplier for negative observations to convey the extra effort in making a negative call (e.g., 2, m), and assess this sum against a threshold (e.g., 10, T) to ultimately mark a snippet as confirmed (e.g., S≥1).

$S = \frac{1}{T}\sum_i o_i\, w_i\, m_i$
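A minimal sketch of the comparison and scoring steps follows, using the symbols defined above (o, w, m, T). The function names, the 5-pixel default tolerance, and the choice to apply m only to rejections (with a factor of 1 for confirmations) are assumptions for illustration.

```python
from typing import Iterable, Tuple


def nodes_match(a: Tuple[float, float], b: Tuple[float, float],
                tolerance: float = 5.0) -> bool:
    """Treat two node positions as the same node if both coordinates
    differ by no more than the tolerance (a simple interval comparison)."""
    return abs(a[0] - b[0]) <= tolerance and abs(a[1] - b[1]) <= tolerance


def confidence_score(observations: Iterable[Tuple[int, float]],
                     negative_multiplier: float = 2.0,
                     threshold: float = 10.0) -> float:
    """Compute S = (1/T) * sum(o_i * w_i * m_i).

    observations: pairs (o, w), where o is +1 for a confirmation or -1 for a
                  rejection and w is the participant's skill weight in [0, 1].
    negative_multiplier: m, applied to rejections only (assumed 1 otherwise).
    threshold: T; a snippet counts as confirmed once S >= 1.
    """
    total = 0.0
    for o, w in observations:
        if o not in (1, -1):
            raise ValueError("observation must be +1 or -1")
        if not 0.0 <= w <= 1.0:
            raise ValueError("skill weight must lie in [0, 1]")
        m = negative_multiplier if o < 0 else 1.0
        total += o * w * m
    return total / threshold


def is_confirmed(score: float) -> bool:
    """A snippet is marked as confirmed when its score reaches 1."""
    return score >= 1.0
```

Under these example values, ten confirmations from top-skill participants just reach the threshold, while a single rejection from a half-skill participant offsets one full-weight confirmation.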

We will periodically reassess the modifier and threshold values based on manual assessment of confirmed results. When a snippet is confirmed, it will be excluded from the pool of confirmable entities on that particular pathway, unless a rejection observation produces a subthreshold score or the content is reset (e.g., due to a change in thresholding). When all the snippets on a particular pathway are confirmed, the model will be queued for manual review before being added to the WikiPathways archive for distribution. A model might be rejected, for example, if a portion of the image has not been modeled (i.e., missed by both OCR and crowdsourcing). In these cases, we can simply add it back to the pool pending additional confirmed snippets or even initiate a few new nodes ourselves before re-releasing it. We expect each resulting pathway will require some level of editing and final touching-up. The progression from computational OCR, to crowdsourcing of simple tasks, to final assembly will ultimately pass through a WikiPathways review stage. However, this type of curation activity is routinely performed by WikiPathways staff and volunteers, and does not pose a significant burden on this grant. Community review and curation of the results will lead to their dissemination via multiple open-standard formats and communication channels, including but not limited to WikiPathways, Pathway Commons (BioPAX), and linked data (RDF).

Aggregate statistics on snippet confidence scores per pathway will also be used to statistically assess the tuning variables described in Aim 3. Ideally, we want to see an average score near 0.5 during the bulk of the crowdsourcing activity for a given pathway, indicating a balance of initiating and confirming activity. While it is less than 0.5, we can increase the point values associated with Confirmation level to encourage confirmation activity; and while it is greater than 0.5, including when it is near completion, we can decrease the value to encourage finding new content in the image. If average alone is not sufficiently sensitive, we can also determine the slope of a sigmoidal fit to a plot of sorted scores per pathway and similarly use it in a function to adjust Confirmation level. This tuning can be completely automated by using confidence scores to calculate point values for each node and interaction in the model before being served to pvjs, where the points will be displayed to the next participant. In the same way, the values associated with Ordinality per pathway can be adjusted by direct feedback to the model and display to the participant based on a simple count of snippets detected thus far. And if the rejection rates are deemed to be too high (another number we can readily count per pathway or across the entire collection), we can increase the penalty assessed per round and use these intermissions to point out mistakes, make suggestions, and even direct participants to repeat training levels.
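The average-based feedback rule described above could be automated along the following lines. The proportional step size and the lower and upper bounds are assumptions added for illustration; the text specifies only the direction of each adjustment around the 0.5 target.

```python
from statistics import mean
from typing import Sequence


def adjust_confirmation_value(current_value: float,
                              scores: Sequence[float],
                              target: float = 0.5,
                              step: float = 0.1,
                              lower: float = 1.0,
                              upper: float = 100.0) -> float:
    """Nudge the Confirmation point value toward the target average score.

    scores: confidence scores of the snippets on one pathway.
    Below target, confirmation is under-rewarded, so the value is raised;
    above target, the value is lowered to favor finding new content.
    """
    if not scores:
        return current_value  # nothing observed yet, leave the value alone
    average = mean(scores)
    if average < target:
        new_value = current_value * (1.0 + step)
    elif average > target:
        new_value = current_value * (1.0 - step)
    else:
        new_value = current_value
    # Clamp so repeated adjustments cannot drive the value to extremes.
    return min(upper, max(lower, new_value))
```

Running this per pathway before the model is served would let the displayed point values track the current balance of initiating and confirming activity.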

During this funding period, we plan to complete at least one disease-focused crowdsourcing event using the Pathways4Life platform. Following the precedent set by science competitions and other crowdsourcing events, we will spearhead a publication together with all participants as co-authors, focusing on the characterization of the extracted data and the resource of new knowledge that has been generated. Where possible, we will coordinate with journal editors before these events to incentivize involvement and stress both the attribution and responsibility that comes with Pathways4Life participation.

#### Strategic Vision

Following this first iteration, we will continue to organize events around specific diseases and research areas. We will also continue to feed in new pathway images from more extensive searches and new publications. We will work with publishers to submit pathway images themselves or provide clear author instructions. In this manner, we envision a two-fold solution to the pathway modeling problem: (1) we will get closer and closer to the source of published pathway images while simultaneously capturing prior published work and (2) we will be putting easy-to-use pathway modeling tools in the hands of more and more people. The post hoc modeling tool proposed here is based on the same technology we provide for de novo modeling of original pathways. Thus, our larger strategic goal is for researchers to draw their pathways in modeling tools in the first place and deliver the stylized versions as a byproduct for publication figures.

We also envision an evolution of the digital media platform and tools as the community of participants evolves, both technically (e.g., tablet and mobile support) and interactively (e.g., more layers of gamification and story-based abstraction). These iterations will continue to lower the barrier for broader participation in the curation of biomedically relevant pathway knowledge.

## Milestones, Metrics, and Benchmarks

The following timeline outlines a set of milestones, including a few key metrics and benchmarks along the way to measure progress on our aims and longer-term goals.

• Feb-Apr 2016:
  ◦ Refine image preprocessing and optimize OCR results
  ◦ Amass collection of 16,000 pathway images
• May-Jul 2016:
  ◦ Process, OCR and classify 16,000 pathway images
  ◦ Identify novel genes
  ◦ Identify at least 3 disease-related subsets
• May-Oct 2016:
  ◦ Initial development of database, control logic and web framework to host pathway images and their metadata
  ◦ Initial customization of pvjs to work with expanded JSON model, tasks and API development
• Nov-Jan 2017:
  ◦ Completed participant registration and account system; added to database schema
  ◦ Prototype of Pathways4Life platform hosted on local server for early alpha testing
• Feb-Mar 2017:
  ◦ Completed SVG and style designs for tasks, points, animations and round summaries
  ◦ Completed client- and server-side calculations for assessing snippet diffs and points; added to database schema
• Apr-June 2017:
  ◦ Feature complete beta version hosted on local server for live testing
  ◦ Launch official campaign to engage participants for beta testing and upcoming First Event
• July-Aug 2017:
  ◦ Testing, debugging, user feedback, initial round of variable tuning
  ◦ Deploy to Amazon Web Services
• Sept-Nov 2017:
  ◦ Host First Event on carefully selected disease-relevant set of pathways
  ◦ Tune variables and collect feedback
  ◦ Disseminate new pathway knowledge in multiple formats via WikiPathways
  ◦ Publish results with participants as co-authors
• Dec-Jan 2018:
  ◦ Host Second Event; or run continuously; or employ Amazon Mechanical Turk
  ◦ Continue to tune, collect feedback and disseminate new pathway knowledge
  ◦ Assess platform; publish on technology, initial impact, future events and future developments

# References

1. Minoru Kanehisa, Peer Bork (2003) Bioinformatics in the post-sequence era. Nat Genet. doi:10.1038/ng1109
2. Thomas Kelder, Bruce R. Conklin, Chris T. Evelo, Alexander R. Pico (2010) Finding the Right Questions: Exploratory Pathway Analysis to Enhance Biological Discovery in Large Datasets. PLoS Biol. doi:10.1371/journal.pbio.1000472
3. Kam D. Dahlquist, Nathan Salomonis, Karen Vranizan, Steven C. Lawlor, Bruce R. Conklin (2002) GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nat Genet. doi:10.1038/ng0502-19
4. P. D. Karp, S. Paley, P. Romero (2002) The Pathway Tools software. Bioinformatics. doi:10.1093/bioinformatics/18.suppl_1.s225
5. Alexander R. Pico, Thomas Kelder, Martijn P. van Iersel, Kristina Hanspers, Bruce R. Conklin, Chris Evelo (2008) WikiPathways: Pathway Editing for the People. PLoS Biol. doi:10.1371/journal.pbio.0060184
6. T. Kelder, M. P. van Iersel, K. Hanspers, M. Kutmon, B. R. Conklin, C. T. Evelo, A. R. Pico (2011) WikiPathways: building research communities on biological pathways. Nucleic Acids Research. doi:10.1093/nar/gkr1074
7. M. Kanehisa (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research. doi:10.1093/nar/28.1.27
8. D. Croft, G. O'Kelly, G. Wu, R. Haw, M. Gillespie, L. Matthews, M. Caudy, P. Garapati, G. Gopinath, B. Jassal, S. Jupe, I. Kalatskaya, S. Mahajan, B. May, N. Ndegwa, E. Schmidt, V. Shamovsky, C. Yung, E. Birney, H. Hermjakob, P. D'Eustachio, L. Stein (2010) Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Research. doi:10.1093/nar/gkq1018
9. K. A. Gray, B. Yates, R. L. Seal, M. W. Wright, E. A. Bruford (2014) Genenames.org: the HGNC resources in 2015. Nucleic Acids Research. doi:10.1093/nar/gku1071
10. E. G. Cerami, B. E. Gross, E. Demir, I. Rodchenkov, O. Babur, N. Anwar, N. Schultz, G. D. Bader, C. Sander (2010) Pathway Commons, a web resource for biological pathway data. Nucleic Acids Research. doi:10.1093/nar/gkq1039
11. C. M. Tan, E. Y. Chen, R. Dannenfelser, N. R. Clark, A. Ma'ayan (2013) Network2Canvas: network visualization on a canvas with enrichment analysis. Bioinformatics. doi:10.1093/bioinformatics/btt319
12. S. Xu, J. McCusker, M. Krauthammer (2008) Yale Image Finder (YIF): a new search engine for retrieving biomedical images. Bioinformatics. doi:10.1093/bioinformatics/btn340
13. S. Kozhenkov, M. Baitaluk (2012) Mining and integration of pathway diagrams from imaging data. Bioinformatics. doi:10.1093/bioinformatics/bts018
14. Kari Pulli, Anatoly Baksheev, Kirill Kornyakov, Victor Eruhimov (2012) Real-time computer vision with OpenCV. Communications of the ACM. doi:10.1145/2184319.2184337
15. R. Rodriguez-Esteban, I. Iossifov (2009) Figure mining for biomedical research. Bioinformatics. doi:10.1093/bioinformatics/btp318
16. Songhua Xu, Michael Krauthammer (2011) Boosting text extraction from biomedical images using text region detection. Proceedings of the 2011 Biomedical Sciences and Engineering Conference: Image Informatics and Analytics in Biomedicine. doi:10.1109/bsec.2011.5872319
17. Tobias Kuhn, Mate Nagy, ThaiBinh Luong, Michael Krauthammer (2014) Mining images in biomedical publications: Detection and analysis of gel diagrams. J Biomed Sem. doi:10.1186/2041-1480-5-10
18. J. C. Hu, R. Aramayo, D. Bolser, T. Conway, C. G. Elsik, M. Gribskov, T. Kelder, D. Kihara, T. F. Knight Jr., A. R. Pico, D. A. Siegele, B. L. Wanner, R. D. Welch (2008) The Emerging World of Wikis. Science. doi:10.1126/science.320.5881.1289b
19. Mitch Waldrop (2008) Big data: Wikiomics. Nature. doi:10.1038/455022a
20. Amazon Mechanical Turk, External Questions documentation.