|
Predicting Bcr-Abl targets using a hybrid information extraction system

Chronic myeloid leukemia (CML) is a disorder of uncontrolled leukocyte proliferation caused by a chromosomal translocation that fuses the BCR and ABL genes, producing Bcr-Abl, a constitutively active tyrosine kinase [1, 2]. Over a thousand Americans die from CML each year due to resistance to existing therapies like imatinib [3, 4, 5]. Difficulties in selectively inhibiting Bcr-Abl have shifted drug development towards targeting downstream Bcr-Abl interactors, many of which are still unknown [6]. Identifying new Bcr-Abl targets would spur drug development and lead to novel CML treatments.

Informatics techniques utilizing knowledge networks built from the biomedical literature have successfully identified novel p53 kinases and prostate cancer drugs [7, 8]. By exhaustively processing millions of documents, informatics techniques discover insights previously missed by individual scientists. However, existing computer programs cannot extract a significant portion of important information because of limitations such as processing only one sentence at a time [9]. Improvements to processing programs would increase the breadth and depth of extractable information [10, 11].

Crowdsourcing, an approach using groups of untrained non-experts to perform complex tasks, has been applied to predict protein and mRNA structure and to mine biomedical text [12, 13, 14]. Machine learning (ML) is an analytical technique that allows programs, or classifiers, to infer patterns from provided data [15].

I propose to identify cytoplasmic Bcr-Abl targets using an informatics technique combining crowdsourcing and automated text mining (Figure 1). I will develop an information extraction workflow using crowdsourcing and integrate machine learning methods for automated pattern recognition.
This will allow more documents to be processed than is possible with crowdsourcing alone, at a higher quality than is currently possible with automated programs. Applying the hybrid method to the literature surrounding CML and Bcr-Abl will answer the following two questions: 1) What are the limitations of existing automated information extraction programs? 2) What are previously unidentified targets of Bcr-Abl? Researchers in the natural language processing and drug discovery domains have long sought answers to these questions.

Specific Aims

Aim 1: Validate non-expert information extraction ability

The first step to developing a hybrid information extraction technique is to demonstrate that non-experts can read and interpret scientific publications. This step is necessary to ensure that the assertions extracted by the system can be trusted, a common problem in interpreting the output of purely automated methods. In preliminary experiments we have determined that the pooled judgments of six non-experts are almost as accurate as those of domain experts at identifying biological concepts in text (Figure 2a). On a more complex task of determining chemical-induced diseases, we found that using as few as four non-experts already gave an accuracy of 65% (Figure 2b), demonstrating that non-experts can interpret publications with little training. I will expand the difficulty and variety of tasks, and continue to evaluate worker performance by comparing it to gold standards produced by expert biocurators.

Figure 2. Non-experts can interpret biomedical text.
a) Non-expert performance on a disease annotation task steadily improves as the number of workers increases, eventually reaching that of a domain expert. b) On a more complex chemical-disease relation identification task, non-experts found 56% of the relations picked by experts, and 75% of their predictions were correct, leading to an overall performance of 65%.

Aim 2: Integrate machine learning with a crowdsourcing workflow

Although preliminary studies have shown that ML techniques can resolve judgment disagreements better than majority voting (Figure 3a), how the two approaches can complement one another is still mostly unknown, as they have historically been used independently [16]. To address this question, I will perform experiments that fuse the two techniques in novel ways in order to utilize the strengths of each. For this aim, I will identify areas where predictive classifiers would provide an advantage over currently used methods. Our preliminary experiments show that task allotment and judgment aggregation would benefit from enhanced modelling. I will apply standard classifiers trained on crowd-gathered data and evaluate whether performance or scalability increases. As a control, I will perform the same tasks with no classifiers. I expect that classifier application will enhance performance and increase the number of documents that can be processed, but that these improvements will be partially dependent upon the consistency of the human judgments.

Aim 3: Identify downstream protein targets of Bcr-Abl

I will answer the following question: which cytoplasmic proteins in the literature are likely downstream targets of Bcr-Abl? To address this question, I will apply the hybrid information extraction system to documents related to Bcr-Abl to produce a knowledge network centered around the protein.
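As a rough illustration of what such a knowledge network could look like, the following Python sketch stores extracted assertions as (subject, relation, object) triples and collects the concepts within a fixed number of hops of Bcr-Abl. The triples, the placeholder names (`protein_A`, `drug_X`, `pathway_Y`), and the helper functions `build_network` and `concepts_within` are assumptions made for this example; they are not output of the proposed system.

```python
from collections import defaultdict

# Hypothetical assertions of the form (subject, relation, object), standing in
# for what an extraction system might produce. All names are placeholders.
assertions = [
    ("Bcr-Abl", "phosphorylates", "protein_A"),
    ("Bcr-Abl", "binds", "protein_B"),
    ("protein_A", "activates", "protein_C"),
    ("drug_X", "inhibits", "Bcr-Abl"),
    ("protein_D", "regulates", "pathway_Y"),
]


def build_network(triples):
    """Return an undirected adjacency map: concept -> {(relation, neighbor)}."""
    network = defaultdict(set)
    for subject, relation, obj in triples:
        network[subject].add((relation, obj))
        network[obj].add((relation, subject))
    return network


def concepts_within(network, center, max_hops):
    """Breadth-first search for concepts at most `max_hops` edges from `center`."""
    seen = {center}
    frontier = {center}
    for _ in range(max_hops):
        # Expand one hop and keep only concepts not visited before.
        frontier = {nbr for node in frontier for _rel, nbr in network[node]} - seen
        seen |= frontier
    return seen - {center}


network = build_network(assertions)
direct = concepts_within(network, "Bcr-Abl", 1)   # one hop away
nearby = concepts_within(network, "Bcr-Abl", 2)   # up to two hops away
```

In this toy example, `protein_C` only appears at two hops, and `protein_D` is never reached because no assertion connects it to Bcr-Abl. A real network would contain many more concept and relation types.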
I will then overlay known Bcr-Abl targets and known non-targets onto the extracted information graph, and use concept similarities and term connectivities to rank candidate proteins by their resemblance to known Bcr-Abl targets. For validation, I will ask a group of blinded clinicians to review the predictions against the original supporting literature and judge their validity. Real and fake targets will be introduced as control items to verify expert consistency. Since a similar network analysis was used to identify prostate cancer drugs (Figure 3b), I expect that the method will find at least one novel Bcr-Abl target, and that other proteins like extracellular proteins will rank near the bottom of the predictions [7].

Pitfalls and Alternatives

One potential complication is that processing the large set of CML documents using paid online crowdsourcing platforms will be prohibitively expensive. If this were to occur, we would adapt our crowdsourcing model to harness volunteers instead, large communities of which already exist [17]. In addition, the concept similarity method for finding new Bcr-Abl targets may not work for sparsely connected networks, in which case we would use an alternative concept category knowledge-based method [7].

Significance and Innovation

This proposal represents one of the first studies aiming to solve biological problems using a representation of the literature generated through crowdsourcing. The technique is novel in that it can be adapted to other domain-specific problems, and can therefore be reused on different parts of the literature. Exposure of the general public to scientific literature through crowdsourcing also improves scientific literacy and outreach.

Personal Statement

Everyone's path to bioinformatics is different, and mine is no exception.
During my early schooling I spent much of my time exploring the physical sciences broadly, trying to find the area where I felt the greatest interest. Although I found physics to be powerful and chemistry to be precise, I discovered that biology fascinated me the most because it has a direct and lasting influence on the people around me. Thanks to two excellent high school teachers who instilled in me an appreciation for both computer science and biology, I proceeded down the path which would eventually lead me to computational biology.

My journey towards bioinformatics began in earnest during high school. I was a member of our school's Programming Enrichment Group, an extracurricular activity focused on problem solving and critical thinking through algorithmic design. We participated in many programming competitions, and along with other members of our club I competed at the Canadian Computing Competition for a chance to represent Canada at the International Olympiad in Informatics. I relished the logical rigor of informatics, and sought to apply it to other domains.

As an undergraduate I worked in Virginia Cornish's lab at Columbia University on a directed evolution project seeking to evolve protein binders. The project was motivated by a desire to develop protein therapeutics for proteins that are difficult to target specifically with small molecules, such as Bcr-Abl. Under the guidance of my graduate mentor I helped to characterize variants generated by the in vivo protein recombination system. This was my first experience working in a laboratory, and I found science to be a demanding but ultimately fulfilling endeavour. I returned to the Cornish lab the following summer to continue my work, but unfortunately had to leave prior to the project's completion due to a year spent studying abroad at Oxford.
The most valuable experience from my time in the lab, other than learning to balance research with my other academic commitments, was the exposure to experimental techniques requiring large-scale data interpretation and analysis.

Upon starting my Ph.D. at Scripps after graduation, I quickly sought out Andrew Su's lab so that I could fulfill my interest in computation and biology. In particular, the application of computational approaches to generate hypotheses from the existing literature fascinated me because I saw new opportunities to solve old problems like finding Bcr-Abl targets. In order to generate novel hypotheses, it is first necessary to create a computational representation of the literature. I realized that existing methods for generating these representations suffered from a lack of either quality or scalability, and sought to solve both problems by merging human judgments with automated machine learning. While working on this project I have had the opportunity to hone my communication skills through conference presentations, and managed to win a presentation prize.

Following completion of my Ph.D., I hope to broaden my research background by working as a postdoctoral researcher in a laboratory which is able both to make computationally guided hypotheses and to verify them biologically. My eventual goal is to establish a research group of my own which is able to utilize and verify the predictions of rapidly advancing computational tools in order to produce treatments and results for patients directly.

Support from HHMI at this early stage in my career would provide me with the freedom to guide and develop my own research path without having to worry about financial burdens. Thank you for considering my application.

References
|
Benjamin Good
CML causes Bcr-Abl? Don't you mean that a chromosomal translocation results in the fusion of Bcr and Abl, which results in the new gene Bcr-Abl, whose constitutive expression produces CML? A small figure might be nice to explain the biology.
Daniel Himmelstein
I had similar thoughts (written before reading Dr. Good's comment): I am not a biologist. I have no idea what Bcr-Abl is. I am guessing it is a protein because of "constitutively active tyrosine kinase" but I don't fully grasp what is going on with the chromosome translocation. Without the translocation, are humans devoid of Bcr-Abl? Also searching for "Bcr-Abl" hasn't been helpful because "BCR-ABL" is a gene name. Is all CML caused by Bcr-Abl? Is all Bcr-Abl bad?
Andrew Su
This is a key motivating principle for your proposal, so would be best to find a stronger citation and/or more citations.
Daniel Himmelstein
More Bcr-Abl confusion: are you considering a downstream protein a "Bcr-Abl target"?
Obi Griffith
I found this figure a little too general for its ambitious goals. It feels a bit naive. I don't think glucose metabolism is a good example of a part of the knowledge network that is likely to contribute to knowledge of Bcr-Abl targets.
Daniel Himmelstein
I have a hard time understanding the "Knowledge Network" from Figure 1. Drawing node borders in addition to edge arrows may help. My current understanding of your network at this point in reading is that you are using text mining + crowdsourcing to construct a metabolite network.
Benjamin Good
The figure should not come before its mention in the text. You should not state your hypothesis in a figure caption.
Benjamin Good
Would be better to introduce the concept first before you use it. A knowledge network is ... Don't depend on the figure to do that for you.
Andrew Su
You might consider briefly relating the story of Swanson / Reynaud / fish oil here instead of these refs. It has the advantage of saying how information extraction can lead to testable hypotheses in an intuitively understandable way.
Benjamin Good
either drop this or expand and be specific. The sentence limitation does not apply to all techniques.
Daniel Himmelstein
The paragraph up till the end of this sentence has primed me: you've led me to believe that NLP programs need to span multiple sentences. The rest of the paragraph is a complete non-sequitur. I finish the paragraph confused, having just read a list of definitions with no cohesive connection.
Andrew Su
If crowdsourcing is a central theme, then devote a paragraph to it. Here, it's buried and easily missed.
Tim Putman
I was under the impression that the untrained folk would be doing simple tasks.
Obi Griffith
you haven't explained/introduced why cytoplasmic targets are of particular interest
Daniel Himmelstein
Don't be vague. I would get more out of:
Andrew Su
will you mention feedback between crowdsourcing and text mining later? I think this is an attractive point...
Andrew Su
I'm assuming the glucose, g6p, f6p example is merely an example of a larger knowledge network. If so, I think this is too specific — better generalize into a schematic network.
Benjamin Good
Not seeing how this follows from the sentence before it. I would take this out.
Obi Griffith
This is a bit bold. It remains to be seen whether the proposed approach will identify ANY previously unidentified targets
Benjamin Good
Seems like the first step would be to benchmark a fully automated approach to the specific task you had in mind. (And before that, to define the specific task and why it is going to solve the biological problem)
Benjamin Good
If this were required as written, I would not believe that this would work. You need to show the ability to divide and conquer the problem and not depend on individuals really deeply understanding scientific literature on their own.
Daniel Himmelstein
My comment on this phrase: I feel that this is a precondition for funding you. If in general it is unknown whether non-experts can read science publications, I don't see the point of exploring such a specific application of crowdsourcing. Now I agree that you should evaluate whether non-experts succeed on your specific problem, but you should have examples of where non-experts read science successfully. Here is one example that non-experts understood drug labels [1]. Update: your next sentences answer the point, so you should assert that non-experts can from the start.
Benjamin Good
Accuracy of the system is fundamental. I don't believe the previous sentence is enough to 'ensure' it.
Andrew Su
Is this from Ben's PSB paper? How do you define "performance"? Surprised that experts achieve 100%...
Obi Griffith
I think you should be specific here in the sentence about what type of biological concept. Say that you have shown that groups of non-experts can identify biological concepts such as diseases, genes, etc with high accuracy. If you just say "biological concepts" a reader might imagine something more complex and get riled up by this statement.
Daniel Himmelstein
y-axis labels are unreadable. Either make them larger or remove them
Daniel Himmelstein
Move the circles closer to increase the overlap size. The visual elicits the opposite response as the numbers.
Benjamin Good
Up until now I was assuming the ML had something to do with the target identification step.
Tim Putman
I know you mention it a few paragraphs back, but maybe a little refresher or more description on what a classifier is.
Daniel Himmelstein
Details please! What are your positives and negatives? What are your predictors? All this talk of machine learning and I can't figure out what your models will predict and what datasets they will be trained and tested on.
Obi Griffith
This is a good opportunity for a specific hypothesis (always needed/expected in NIH-style grant applications). Rephrase as a hypothesis
Benjamin Good
This should remain the focus. You should be composing the other aims much more specifically to get to here.
Tim Putman
I think that this should be expanded on. The knowledge network is key. By making the connections through unbiased observations, the answer will come. Very important to make it clear what this means.
Daniel Himmelstein
A little more background is needed in the caption to appreciate Figure 3A — performance of what?
Daniel Himmelstein
What is a "known non-targets" and where does that knowledge come from?
Benjamin Good
Clinicians are the wrong judges of these predictions. (Should be scientists) What does blinded mean here? Do they get both predicted and known? You are missing the computational validation step before this.
Andrew Su
Clinicians I don't think are the target audience (nor the best people to evaluate). You need either experts in Bcr-Abl, or ideally an experimental system to evaluate your predictions.
Benjamin Good
You need stronger evidence than the one reference regarding prostate cancer to get here. As it stands, I would not be convinced. Further, since that work did not use crowdsourcing, why do you need to? If you just did exactly what they did, but enhanced their workflow through crowdsourcing, could you identify better drugs than they did? Why do you think so? How similar is that to what you are planning to do here?
Daniel Himmelstein
The expense issue should be worked out beforehand. Given that you haven't explained the "concept similarity method", what's the point of introducing an alternative? You could just say we will use A if the network is sparse and B if the network is dense.
Obi Griffith
This is the first mention of paid online crowdsourcing platforms. It should be introduced as an approach before it can become a potential pitfall
Benjamin Good
Seems easily estimated. If you can predict it clearly, then it's not a potential pitfall, it's part of the plan.
Benjamin Good
What about using a crowd-trained machine learning model? This is what I thought you were going for. (Personally I would be much more excited about this as a direction for integration)
Andrew Su
... though also agree with Toby that the generalizability of the significant too...
Obi Griffith
This is an awkward tautology. If everyone's path is different then by saying yours is no exception you mean that yours is also different? What if your path wasn't the exception then it would be the same? Impossible. I suggest you drop this first sentence and just begin the story of your unique journey to comp biol.
Obi Griffith
This feels redundant with above. Maybe just start with "In high school, I was ..."
Obi Griffith
Sounds better as either "I relished the logical rigor of" or "I thrived on the logical rigor" or "I flourished in the logical rigor"
Obi Griffith
Why raise this as a negative? Introduce your experience studying abroad as a separate positive growth experience.
Tim Putman
This could be expanded on. That doesn't necessarily sound like an unfortunate thing. |