Predicting Bcr-Abl targets using a hybrid information extraction system

Chronic myeloid leukemia (CML) is a disorder of uncontrolled leukocyte proliferation caused by a chromosomal translocation that fuses the BCR and ABL genes, producing Bcr-Abl, a constitutively active tyrosine kinase [1, 2]. Over a thousand Americans die from CML each year due to resistance to existing therapies such as imatinib [3, 4, 5]. Difficulties in selectively inhibiting Bcr-Abl have shifted drug development towards targeting downstream Bcr-Abl interactors, many of which are still unknown [6]. Identifying new Bcr-Abl targets could spur drug development and lead to novel CML treatments.

Figure 1. Informatics workflow for predicting Bcr-Abl targets.

Biomedical publications are converted into a structured knowledge network using a hybrid information extraction system. I hypothesize that such a network can be used to predict novel Bcr-Abl targets.

Informatics techniques that use knowledge networks built from the biomedical literature have successfully identified novel p53 kinases and prostate cancer drugs [7, 8]. By exhaustively processing millions of documents, these techniques uncover insights that individual scientists have missed. However, existing computer programs fail to extract a significant portion of important information because of limitations such as processing only one sentence at a time [9]. Improvements to these programs would increase the breadth and depth of extractable information [10, 11]. Crowdsourcing, an approach that uses groups of untrained non-experts to perform complex tasks, has been applied to predict protein and mRNA structure and to mine biomedical text [12, 13, 14]. Machine learning (ML) is an analytical technique that allows programs, or classifiers, to infer patterns from provided data [15].

I propose to identify cytoplasmic Bcr-Abl targets by combining crowdsourcing with automated text mining (Figure 1). I will develop a crowdsourcing-based information extraction workflow and integrate machine learning methods for automated pattern recognition. This will allow more documents to be processed than is possible with crowdsourcing alone, at a higher quality than current automated programs achieve. Applying the hybrid method to the literature surrounding CML and Bcr-Abl will address two questions: 1) What are the limitations of existing automated information extraction programs? 2) What are previously unidentified targets of Bcr-Abl? Researchers in the natural language processing and drug discovery domains have long sought answers to these questions.

Specific Aims

Aim 1: Validate non-expert information extraction ability

The first step in developing a hybrid information extraction technique is to demonstrate that non-experts can read and interpret scientific publications. This step is necessary to establish that the assertions extracted by the system can be trusted, since trustworthiness is a common problem when interpreting the output of purely automated methods. In preliminary experiments, we determined that the pooled judgments of six non-experts are almost as accurate as those of domain experts at identifying biological concepts in text (Figure 2a). On a more complex task, determining chemical-induced diseases, as few as four non-experts achieved an overall performance of 65% (Figure 2b), demonstrating that non-experts can interpret publications with little training. I will expand the difficulty and variety of tasks and continue to evaluate worker performance against gold standards produced by expert biocurators.

Figure 2. Non-experts can interpret biomedical text.

a) Non-expert performance on a disease annotation task steadily improves as the number of workers increases, eventually reaching that of a domain expert. b) On a more complex chemical-disease relation identification task, non-experts found 56% of the relations picked by experts, and 75% of their predictions were correct, for an overall performance of 65%.
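
The caption does not state how the overall figure in panel (b) is derived from the two component numbers. One common way to combine them is the F1 score, the harmonic mean of precision and recall. The sketch below assumes that reading; the function name `f1_score` is illustrative and not taken from the proposal.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall.

    Assumption: this is the definition behind the 'overall performance'
    in Figure 2b; the proposal itself does not name the metric.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported for the chemical-disease relation task (Figure 2b).
precision = 0.75  # fraction of non-expert predictions that were correct
recall = 0.56     # fraction of expert-picked relations the non-experts found

overall = f1_score(precision, recall)
print(f"F1 = {overall:.3f}")
```

With the rounded inputs above, this gives roughly 0.64, close to the reported 65%. The small gap may simply reflect rounding of the inputs, but that is a guess rather than something the caption confirms.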

Aim 2: Integrate machine learning with a crowdsourcing workflow

Although preliminary studies have shown that ML techniques can resolve judgment disagreements better than majority voting (Figure 3a), how the two approaches can complement one another is still mostly unknown, as they have historically been used independently [16]. To address this question, I will perform experiments that combine the two techniques in novel ways to exploit the strengths of each.

For this aim, I will identify areas where predictive classifiers would provide an advantage over currently used methods. Our preliminary experiments show that task allotment and judgment aggregation would benefit from enhanced modelling. I will apply standard classifiers trained on crowd-gathered data and assess whether performance or scalability increases. As a control, I will perform the same tasks without classifiers. I expect that applying classifiers will enhance performance and increase the number of documents that can be processed, but that these improvements will depend partially on the consistency of the human judgments.
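
To make the comparison in this aim concrete, the sketch below contrasts simple majority voting (the control) with one possible learned aggregation: each worker's vote is weighted by the log-odds of that worker's accuracy, estimated on gold-standard items. This weighting scheme is a deliberately simple stand-in chosen for illustration. It is not the Bayesian method of [16], and the worker IDs, labels, and function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def majority_vote(judgments):
    """judgments: list of (worker_id, label). Returns the most common label."""
    counts = Counter(label for _, label in judgments)
    return counts.most_common(1)[0][0]


def worker_accuracy(gold_judgments, gold_answers, smoothing=1.0):
    """Estimate per-worker accuracy on gold items, with additive smoothing."""
    correct = defaultdict(float)
    total = defaultdict(float)
    for item, worker, label in gold_judgments:
        total[worker] += 1
        correct[worker] += label == gold_answers[item]
    return {w: (correct[w] + smoothing) / (total[w] + 2 * smoothing) for w in total}


def weighted_vote(judgments, accuracy, default=0.5):
    """Sum log-odds of worker accuracy per label; return the top-scoring label."""
    scores = defaultdict(float)
    for worker, label in judgments:
        p = accuracy.get(worker, default)  # unseen workers get zero weight
        scores[label] += math.log(p / (1 - p))
    return max(scores, key=scores.get)


# Hypothetical gold-standard round: w1 is always right, w2 and w3 rarely are.
gold_answers = {"g1": True, "g2": False, "g3": True, "g4": False}
gold_judgments = []
for item, answer in gold_answers.items():
    gold_judgments.append((item, "w1", answer))
    gold_judgments.append((item, "w2", answer if item == "g1" else not answer))
    gold_judgments.append((item, "w3", answer if item == "g2" else not answer))

accuracy = worker_accuracy(gold_judgments, gold_answers)

# A new, unlabeled item: does the sentence assert a chemical-disease relation?
new_item = [("w1", True), ("w2", False), ("w3", False)]
print("majority:", majority_vote(new_item))
print("weighted:", weighted_vote(new_item, accuracy))
```

In this toy case the two unreliable workers outvote the reliable one under majority voting, while the weighted scheme sides with the reliable worker. Whether such gains hold on real crowd data is exactly what the experiments in this aim would have to show.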

Aim 3: Identify downstream protein targets of Bcr-Abl

I will answer the following question: which cytoplasmic proteins in the literature are likely downstream targets of Bcr-Abl? To address this question, I will apply the hybrid information extraction system to documents related to Bcr-Abl to produce a knowledge network centered around the protein.

Figure 3. Advantages of machine learning.

a) Using machine learning to aggregate votes increases performance by 26% compared to simple majority voting (adapted from [16]). b) An example of a prostate cancer drug identified using a network analysis [7].

I will then overlay known Bcr-Abl targets and known non-targets on the extracted information graph, and use concept similarities and term connectivities to rank candidate proteins by their resemblance to known Bcr-Abl targets. For validation, I will ask a group of blinded clinicians to review the predictions against the original supporting literature and judge their validity. Real and fake targets will be introduced as control items to verify expert consistency. Since a similar network analysis was used to identify prostate cancer drugs [7] (Figure 3b), I expect that the method will find at least one novel Bcr-Abl target, and that other proteins, such as extracellular proteins, will rank near the bottom of the predictions.
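
The ranking step described above can be sketched as follows. The proposal does not specify the similarity measure, so this example assumes one simple option: each protein is represented by the set of concepts it is linked to in the network, and a candidate is scored by its mean Jaccard similarity to known targets minus its mean similarity to known non-targets. The protein names, concept labels, and scoring formula are all hypothetical placeholders, not extracted data.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of linked concepts."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_similarity(concepts, reference_sets):
    """Average Jaccard similarity of one concept set to a list of reference sets."""
    if not reference_sets:
        return 0.0
    return sum(jaccard(concepts, ref) for ref in reference_sets) / len(reference_sets)


def rank_candidates(network, known_targets, known_non_targets):
    """Rank unlabeled proteins by resemblance to targets minus resemblance to non-targets."""
    target_sets = [network[p] for p in known_targets]
    non_target_sets = [network[p] for p in known_non_targets]
    scores = {}
    for protein, concepts in network.items():
        if protein in known_targets or protein in known_non_targets:
            continue  # only score proteins whose status is unknown
        scores[protein] = (mean_similarity(concepts, target_sets)
                           - mean_similarity(concepts, non_target_sets))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy network: protein -> concepts it is connected to (placeholder values).
network = {
    "known_target_1": {"kinase signaling", "cytoplasm", "cell proliferation"},
    "known_target_2": {"kinase signaling", "cytoplasm", "apoptosis"},
    "known_non_target_1": {"extracellular matrix", "cell adhesion"},
    "candidate_A": {"kinase signaling", "cytoplasm", "cell proliferation", "apoptosis"},
    "candidate_B": {"extracellular matrix", "cell adhesion", "cytoplasm"},
}

ranking = rank_candidates(
    network,
    known_targets={"known_target_1", "known_target_2"},
    known_non_targets={"known_non_target_1"},
)
for protein, score in ranking:
    print(f"{protein}: {score:+.3f}")
```

Here the candidate sharing most of its concepts with the known targets ranks first, and the one resembling the non-target ranks last. Holding out some known targets and checking where they land in such a ranking would be one way to validate the method computationally before any expert review.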

Pitfalls and Alternatives

One potential complication is that processing the large set of CML documents using paid online crowdsourcing platforms will be prohibitively expensive. If this were to occur, we would adapt our crowdsourcing model to harness volunteers instead, large communities of which already exist [17]. In addition, the concept similarity method for finding new Bcr-Abl targets may not work for sparsely connected networks, in which case we would use an alternative method based on concept category knowledge [7].

Significance and Innovation

This proposal represents one of the first studies aiming to solve biological problems using a representation of the literature generated through crowdsourcing. The technique is novel in that it can be adapted to other domain specific problems, and can therefore be reused on different parts of the literature. Exposure of the general public to scientific literature through crowdsourcing also improves scientific literacy and outreach.

Contributions

I conceived and devised the specific aims described. My advisor Dr. Su provided high-level feedback and advice regarding the proposal's research direction. Dr. Benjamin Good provided helpful discussion and insight.


Personal Statement

Everyone's path to bioinformatics is different, and mine is no exception. During my early schooling I spent much of my time exploring the physical sciences broadly, trying to find the area where I felt the greatest interest. Although I found physics to be powerful and chemistry to be precise, I discovered that biology fascinated me the most because it has a direct and lasting influence on the people around me. Thanks to two excellent high school teachers who instilled in me an appreciation for both computer science and biology, I proceeded down the path which would eventually lead me to computational biology.

My journey towards bioinformatics began in earnest during high school. I was a member of our school's Programming Enrichment Group, an extracurricular activity focused on problem solving and critical thinking through algorithmic design. We participated in many programming competitions, and I, along with other members of our club, competed at the Canadian Computing Competition for a chance to represent Canada at the International Olympiad in Informatics. I relished the logical rigor of informatics, and sought to apply it to other domains.

As an undergraduate I worked in Virginia Cornish's lab at Columbia University on a directed evolution project seeking to evolve protein binders. The project was motivated by a desire to develop protein therapeutics for proteins that are difficult to target specifically with small molecules, such as Bcr-Abl. Under the guidance of my graduate mentor I helped to characterize variants generated by the in vivo protein recombination system. This was my first experience working in a laboratory, and I found science to be a demanding but ultimately fulfilling endeavour. I returned to the Cornish lab the following summer to continue my work, but left before the project's completion to spend a year studying abroad at Oxford. The most valuable experience from my time in the lab, other than learning to balance research with my other academic commitments, was the exposure to experimental techniques requiring large-scale data interpretation and analysis.

Upon starting my Ph.D. at Scripps after graduation, I quickly sought out Andrew Su's lab so that I could pursue my interest in computation and biology. In particular, the application of computational approaches to generate hypotheses from the existing literature fascinated me because I saw new opportunities to solve old problems such as finding Bcr-Abl targets. In order to generate novel hypotheses, it is first necessary to create a computational representation of the literature. I realized that existing methods for generating these representations suffered from a lack of either quality or scalability, and sought to solve both problems by merging human judgments with automated machine learning. While working on this project I have had the opportunity to hone my communication skills through conference presentations, and won a presentation prize.

Following completion of my Ph.D. I hope to broaden my research background by working as a postdoctoral researcher in a laboratory which is both able to make computationally guided hypotheses and able to verify them biologically. My eventual goal is to establish a research group of my own which is able to utilize and verify the predictions of rapidly advancing computational tools in order to produce treatments and results for patients directly. Support from HHMI at this early stage in my career would provide me with the freedom to guide and develop my own research path without having to worry about financial burdens. Thank you for considering my application.

References

1.
2.
3.
4.
5.
New insights into the molecular resistance mechanisms of chronic myeloid leukemia
Rui Huang, Qian Kang, Huimin Liu, Yuhua Li (2015) Current Cancer Drug Targets. doi:10.2174/1568009615666150921141004
6.
7.
Exploiting Literature-derived Knowledge and Semantics to Identify Potential Prostate Cancer Drugs
Rui Zhang, Michael Cairelli, Marcelo Fiszman, Halil Kilicoglu, Thomas Rindflesch, Serguei Pakhomov, Genevieve Melton (2014) Cancer Informatics. doi:10.4137/CIN.S13889
8.
Automated hypothesis generation based on mining scientific literature
Scott Spangler, Jeffrey N. Myers, Ioana Stanoi, Linda Kato, Ana Lelescu, Jacques J. Labrie, Neha Parikh, Andreas Martin Lisewski, Lawrence Donehower, Ying Chen, Olivier Lichtarge, Angela D. Wilkins, Benjamin J. Bachman, Meena Nagarajan, Tajhal Dayaram, Peter Haas, Sam Regenbogen, Curtis R. Pickering, Austin Comer (2014) Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '14. doi:10.1145/2623330.2623667
9.
10.
New Challenges for Biological Text-Mining in the Next Decade
Hong-Jie Dai, Yen-Ching Chang, Richard Tzong-Han Tsai, Wen-Lian Hsu (2010) J. Comput. Sci. Technol. doi:10.1007/s11390-010-9313-5
11.
Text mining for biology - the way forward: opinions from leading scientists
Russ B Altman, Casey M Bergman, Judith Blake, Christian Blaschke, Aaron Cohen, Frank Gannon, Les Grivell, Udo Hahn, William Hersh, Lynette Hirschman, Lars Jensen, Martin Krallinger, Barend Mons, Seán I O'Donoghue, Manuel C Peitsch, Dietrich Rebholz-Schuhmann, Hagit Shatkay, Alfonso Valencia (2008) Genome Biol. doi:10.1186/gb-2008-9-s2-s7
12.
RNA design rules from a massive open laboratory
Jeehyung Lee, Wipapat Kladwang, Minjae Lee, Daniel Cantu, Martin Azizyan, Hanjoo Kim, Alex Limpaecher, Snehal Gaikwad, Sungroh Yoon, Adrien Treuille, Rhiju Das (2014) Proceedings of the National Academy of Sciences. doi:10.1073/pnas.1313039111
13.
Predicting protein structures with a multiplayer online game
Seth Cooper, Firas Khatib, Adrien Treuille, Janos Barbero, Jeehyung Lee, Michael Beenen, Andrew Leaver-Fay, David Baker, Zoran Popović, Foldit players (2010) Nature. doi:10.1038/nature09304
14.
Crowdsourcing for bioinformatics
B. M. Good, A. I. Su (2013) Bioinformatics. doi:10.1093/bioinformatics/btt333
15.
16.
Dynamic Bayesian Combination of Multiple Imperfect Classifiers
Edwin Simpson, Stephen Roberts, Ioannis Psorakis, Arfon Smith (2013) Decision Making and Imperfection. doi:10.1007/978-3-642-36406-8_1
17.

Comments

Benjamin Good

CML causes Bcr-Abl ? Don't you mean that a chromosomal translocation results in the fusion of Bcr and Abl which results in the new gene Bcr-Abl, whose constitutive expression produces CML? A small figure might be nice to explain the biology.

Daniel Himmelstein

I had similar thoughts (written before reading Dr. Good's comment):

I am not a biologist. I have no idea what Bcr-Abl is. I am guessing it is a protein because of "constitutively active tyrosine kinase" but I don't fully grasp what is going on with the chromosome translocation. Without the translocation, are humans devoid of Bcr-Abl? Also searching for "Bcr-Abl" hasn't been helpful because "BCR-ABL" is a gene name. Is all CML caused by Bcr-Abl? Is all Bcr-Abl bad?

Andrew Su

This is a key motivating principle for your proposal, so would be best to find a stronger citation and/or more citations.

Daniel Himmelstein

More Bcr-Abl confusion: are you considering a downstream protein a "Bcr-Abl target"?

Obi Griffith

I found this figure a little too general for its ambitious goals. It feels a bit naive. I don't think glucose metabolism is a good example of a part of the knowledge network that is likely to contribute to knowledge of Bcr-Abl targets.

Daniel Himmelstein

I have a hard time understanding the "Knowledge Network" from Figure 1. Drawing node borders in addition to edge arrows may help. My current understanding of your network at this point in reading is that you are using text mining + crowdsourcing to construct a metabolite network.

Benjamin Good

The figure should not come before its mention in the text. You should not state your hypothesis in a figure caption.

Benjamin Good

Would be better to introduce the concept first before you use it. A knowledge network is ... Don't depend on the figure to do that for you.

Andrew Su

You might consider briefly relating the story of Swanson / Reynaud / fish oil here instead of these refs. It has the advantage of saying how information extraction can lead to testable hypotheses in an intuitively understandable way.

Benjamin Good

either drop this or expand and be specific. The sentence limitation does not apply to all techniques.

Obi Griffith

awkward wording

Daniel Himmelstein

The paragraph up till the end of this sentence has primed me: you've led me to believe that NLP programs need to span multiple sentences. The rest of the paragraph is a complete non-sequitur. I finish the paragraph confused, having just read a list of definitions with no cohesive connection.

Andrew Su

If crowdsourcing is a central theme, then devote a paragraph to it. Here, it's buried and easily missed.

Tim Putman

I was under the impression that the untrained folk would be doing simple tasks.

Obi Griffith

you haven't explained/introduced why cytoplasmic targets are of particular interest

Daniel Himmelstein

Don't be vague. I would get more out of:

I propose combining crowdsourcing and automated text mining to identify cytoplasmic Bcr-Abl targets

Andrew Su

will you mention feedback between crowdsourcing and text mining later? I think this is an attractive point...

Andrew Su

I'm assuming the glucose, g6p, f6p example is merely an example of a larger knowledge network. If so, I think this is too specific — better generalize into a schematic network.

Tim Putman

A little redundant to use "than what is..." twice here

Obi Griffith

The novelty of the hybrid method is a strength of this proposal

Benjamin Good

Not seeing how this follows from the sentence before it.

I would take this out.

Obi Griffith

This is a bit bold. It remains to be seen whether the proposed approach will identify ANY previously unidentified targets

Benjamin Good

Seems like the first step would be to benchmark a fully automated approach to the specific task you had in mind. (And before that, to define the specific task and why it is going to solve the biological problem)

Benjamin Good

If this were required as written, I would not believe that this would work. You need to show the ability to divide and conquer the problem and not depend on individuals really deeply understanding scientific literature on their own.

Daniel Himmelstein

My comment on this phrase:

I feel that this is a precondition for funding you. If in general it is unknown whether non-experts can read science publications, I don't see the point of exploring such a specific application of crowdsourcing.

Now I agree that you should evaluate whether non-experts succeed on your specific problem, but you should have examples of where non-experts read science successfully. Here is one example that non-experts understood drug labels [1].

Update: your next sentences answer the point, so you should assert that non-experts can from the start.

1.
Scaling drug indication curation through crowdsourcing
R. Khare, J. D. Burger, J. S. Aberdeen, D. W. Tresner-Kirsch, T. J. Corrales, L. Hirschman, Z. Lu (2015) Database. doi:10.1093/database/bav016

Benjamin Good

Accuracy of the system is fundamental. I don't believe the previous sentence is enough to 'ensure' it.

Andrew Su

Is this from Ben's PSB paper? How do you define "performance"? Surprised that experts achieve 100%...

Obi Griffith

I think you should be specific here in the sentence about what type of biological concept. Say that you have shown that groups of non-experts can identify biological concepts such as diseases, genes, etc with high accuracy. If you just say "biological concepts" a reader might imagine something more complex and get riled up by this statement.

Andrew Su

I don't think the figure illustrates well what the text says.

Obi Griffith

Provide an example of what you mean here

Obi Griffith

It would be helpful to explain the source of these non-experts

Benjamin Good

You need to have a specific plan

Tim Putman

I agree that specific steps need to be added

Obi Griffith

word not necessary and repeats recent use

Daniel Himmelstein

y-axis labels are unreadable. Either make them larger or remove them

Daniel Himmelstein

Move the circles closer to increase the overlap size. The visual elicits the opposite response as the numbers.

Benjamin Good

This section is really not very clear.

Benjamin Good

Up until now I was assuming the ML had something to do with the target identification step.

Tim Putman

I know you mention it a few paragraphs back, but maybe a little refresher or more description on what a classifier is.

Daniel Himmelstein

Details please! What are your positives and negatives? What are your predictors? All this talk of machine learning and I can't figure out what your models will predict and what datasets they will be trained and tested on.

Obi Griffith

This is a good opportunity for a specific hypothesis (always needed/expected in NIH-style grant applications). Rephrase as a hypothesis

Benjamin Good

This should remain the focus. You should be composing the other aims much more specifically to get to here.

Tim Putman

I think that this should be expanded on. The knowledge network is key. By making the connections through unbiased observations, the answer will come. Very important to make it clear what this means.

Andrew Su

based on what corpus?

Daniel Himmelstein

A little more background is needed in the caption to appreciate Figure 3A — performance of what?

Benjamin Good

these are your controls. Can your method predict them?

Daniel Himmelstein

What are "known non-targets" and where does that knowledge come from?

Benjamin Good

Clinicians are the wrong judges of these predictions. (Should be scientists) What does blinded mean here? Do they get both predicted and known? You are missing the computational validation step before this.

Andrew Su

Clinicians I don't think are the target audience (nor the best people to evaluate). You need either experts in Bcr-Abl, or ideally an experimental system to evaluate your predictions.

Obi Griffith

Again, rephrase this as specific hypothesis

Benjamin Good

You need stronger evidence than the one reference regarding prostate cancer to get here. As it stands, I would not be convinced. Further, since that work did not use crowdsourcing, why do you need to? If you just did exactly what they did, but enhanced their workflow through crowdsourcing, could you identify better drugs than they did? Why do you think so? How similar is that to what you are planning to do here?

Daniel Himmelstein

Move this citation to the first part of the sentence?

Daniel Himmelstein

The expense issue should be worked out beforehand. Given that you haven't explained the "concept similarity method", what's the point of introducing an alternative? You could just say we will use A if the network is sparse and B if the network is dense.

Obi Griffith

This is the first mention of paid online crowdsourcing platforms. It should be introduced as an approach before it can become a potential pitfall

Benjamin Good

Seems easily estimated. If you can predict it clearly, then it's not a potential pitfall, it's part of the plan.

Benjamin Good

What about using a crowd-trained machine learning model ?

This is what I thought you were going for. (Personally I would be much more excited about this as a direction for integration)

Tim Putman

I completely agree. The free option is always more attractive

Obi Griffith

delete for redundancy

Benjamin Good

the significance is the medical impact

Andrew Su

... though also agree with Toby that the generalizability is significant too...

Obi Griffith

emphasize the hybrid approach here

Obi Griffith

This is an awkward tautology. If everyone's path is different, then by saying yours is no exception you mean that yours is also different? What if your path wasn't the exception, then it would be the same? Impossible. I suggest you drop this first sentence and just begin the story of your unique journey to comp biol.

Obi Griffith

that most inspired me

Obi Griffith

it was biology that

Obi Griffith

Make this its own sentence. it will have greater impact that way.

Obi Griffith

I think you need a comma after this

Obi Griffith

This feels redundant with above. Maybe just start with "In high school, I was ..."

Obi Griffith

Sounds better as either "I relished the logical rigor of" or "I thrived on the logical rigor" or "I flourished in the logical rigor"

Obi Griffith

Explain just a little more what this means

Daniel Himmelstein

Is Bcr-Abl a small molecule, target, or protein therapeutic?

Obi Griffith

Why raise this as a negative? Introduce your experience studying abroad as a separate positive growth experience.

Tim Putman

This could be expanded on. That doesn't necessarily sound like an unfortunate thing.

Andrew Su

agree, turn it into a positive...

Obi Griffith

further pursue my passion for?

Obi Griffith

focused on utilizing and verifying

Obi Griffith

improve outcomes