Predicting Bcr-Abl targets using a hybrid information extraction system

I am a second-year graduate student in Dr. Andrew Su's lab at The Scripps Research Institute. My current research focus is on resolving biological problems using literature-based discovery improved by crowdsourcing and machine learning.

I am currently applying for the 2016 Howard Hughes Medical Institute (HHMI) International Student Research Fellowship (official announcement). The application consists of a one-page personal statement and a research proposal, which must:

  1. Be aligned with the research goals of HHMI.
  2. Be structured similarly to an NIH grant application (e.g., address the significance and innovation, potential pitfalls, alternative strategies, and include a testable hypothesis).
  3. Be no longer than three (3) pages in length (excluding references).

Thank you for reviewing my proposal. PDFs of the research proposal and personal statement are also available. My email is


Chronic myeloid leukemia (CML) is a disorder of uncontrolled leukocyte proliferation caused by a chromosome translocation resulting in Bcr-Abl, a constitutively active tyrosine kinase [1, 2]. Over a thousand Americans die from CML each year due to resistance to existing therapies like imatinib [3, 4, 5]. Difficulties in selectively inhibiting Bcr-Abl have shifted drug development towards targeting downstream Bcr-Abl interactors, many of which are still unknown [6]. Identifying new Bcr-Abl targets would spur drug development and lead to novel CML treatments.

Figure 1. Informatics workflow for predicting Bcr-Abl targets.

Biomedical publications are converted into a structured knowledge network using a hybrid information extraction system. I hypothesize that such a network can be used to predict novel Bcr-Abl targets.

Informatics techniques utilizing knowledge networks built from the biomedical literature have successfully identified novel p53 kinases and prostate cancer drugs [7, 8]. By exhaustively processing millions of documents, informatics techniques discover insights previously missed by individual scientists. However, existing computer programs cannot extract a significant portion of important information because of limitations such as being able to process only individual sentences [9]. Improvements to processing programs would increase the breadth and depth of extractable information [10, 11]. Crowdsourcing, an approach that uses groups of non-experts to perform complex tasks, has been applied to predict protein and mRNA structure and to mine biomedical text [12, 13, 14]. Machine learning (ML) is an analytical technique that allows programs, or classifiers, to infer patterns from provided data [15].

I propose to identify cytoplasmic Bcr-Abl targets using an informatics technique combining crowdsourcing and automated text mining (Figure 1). I will develop an information extraction workflow using crowdsourcing and integrate machine learning methods for automated pattern recognition. This will allow more documents to be processed than is possible with crowdsourcing alone, at a higher quality than is currently possible with automated programs. Applying the hybrid method to the literature surrounding CML and Bcr-Abl will answer the following two questions: 1) What are the limitations of existing automated information extraction programs? 2) What are previously unidentified targets of Bcr-Abl? Researchers in the natural language processing and drug discovery domains have long sought answers to these questions.

Specific Aims

Aim 1: Validate non-expert information extraction ability

The first step in developing a hybrid information extraction technique is to demonstrate that non-experts can read and interpret scientific publications. This step is necessary to ensure that the assertions extracted by the system can be trusted, a common problem in interpreting the output of purely automated methods. In preliminary experiments we have determined that the pooled judgments of six non-experts are almost as accurate as those of domain experts at identifying biological concepts in text (Figure 2a). On a more complex task of determining chemical-induced diseases, we found that using as few as four non-experts already gave an overall performance of 65% (Figure 2b), demonstrating that non-experts can interpret publications with little training. I will expand the difficulty and variety of tasks, and continue to evaluate worker performance by comparing it to gold standards produced by expert biocurators.
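As a rough illustration of what pooling non-expert judgments and scoring them against an expert gold standard could look like, the sketch below uses simple majority voting. The function names, item identifiers, and labels are all hypothetical toy values, not data from the preliminary experiments, and the proposal does not state that majority voting was the pooling rule used.

```python
from collections import Counter


def pool_judgments(judgments):
    """Return the label chosen by the most workers and the fraction who chose it."""
    counts = Counter(judgments)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(judgments)


def accuracy(pooled, gold):
    """Fraction of gold-standard items whose pooled label matches the expert label."""
    matches = sum(1 for item, label in gold.items() if pooled.get(item) == label)
    return matches / len(gold)


# Hypothetical toy data: six workers decide whether each mention is a disease.
worker_labels = {
    "mention_1": ["disease", "disease", "disease", "other", "disease", "disease"],
    "mention_2": ["other", "other", "disease", "other", "other", "other"],
    "mention_3": ["disease", "other", "disease", "disease", "other", "disease"],
}
gold_labels = {"mention_1": "disease", "mention_2": "other", "mention_3": "disease"}

pooled_labels = {item: pool_judgments(votes)[0] for item, votes in worker_labels.items()}
pooled_accuracy = accuracy(pooled_labels, gold_labels)
print(pooled_labels, pooled_accuracy)
```

Repeating the same comparison with different numbers of workers per item would give a curve of the kind described for Figure 2a.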

Figure 2. Non-experts can interpret biomedical text.

a) Non-expert performance on a disease annotation task steadily improves as the number of workers increases, eventually reaching that of a domain expert. b) On a more complex chemical-disease relation identification task, non-experts found 56% of the relations picked by experts, and 75% of their predictions were correct, leading to an overall performance of 65%.
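The caption does not name the metric behind the overall performance figure. One common way to combine the fraction of expert relations found (recall) with the fraction of correct predictions (precision) is the balanced F-score, their harmonic mean. The sketch below assumes that metric. With the rounded percentages from the caption it gives roughly 0.64, close to the reported 65%, so the reported value may have been computed from unrounded inputs or with a different measure.

```python
def f_score(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values taken from the Figure 2b caption.
precision = 0.75  # fraction of non-expert predictions that were correct
recall = 0.56     # fraction of expert-identified relations that were found

overall = f_score(precision, recall)
print(f"F-score: {overall:.3f}")
```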

Aim 2: Integrate machine learning with a crowdsourcing workflow

Although preliminary studies have shown that ML techniques can resolve judgment disagreements better than majority voting (Figure 3a), how the two approaches can complement one another is still mostly unknown, as they have historically been used independently [16]. To address this question, I will perform experiments that combine the two techniques in novel ways in order to utilize the strengths of each.

For this aim, I will identify areas where predictive classifiers would provide an advantage over currently used methods. Our preliminary experiments show that task allotment and judgment aggregation would benefit from enhanced modelling. I will apply standard classifiers trained on crowd-gathered data and assess whether performance or scalability increases. As a control, I will perform the same tasks with no classifiers. I expect that classifier application will enhance performance and increase the number of documents that can be processed, but that these improvements will be partially dependent upon the consistency of the human judgments.
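To make the judgment aggregation idea concrete, the sketch below compares plain majority voting with a vote weighted by each worker's estimated reliability. It is a deliberately simple stand-in, not the Bayesian classifier combination method of [16]. Worker accuracy is estimated on a handful of gold-standard items and converted to a log-odds weight. All worker names, item names, and votes are hypothetical.

```python
import math
from collections import defaultdict


def estimate_accuracy(worker_votes, gold, smoothing=1.0):
    """Estimate each worker's accuracy on gold items, smoothed away from 0 and 1."""
    accuracies = {}
    for worker, votes in worker_votes.items():
        scored = [item for item in votes if item in gold]
        correct = sum(1 for item in scored if votes[item] == gold[item])
        accuracies[worker] = (correct + smoothing) / (len(scored) + 2 * smoothing)
    return accuracies


def majority_vote(worker_votes, item):
    """Return the label given by the largest number of workers."""
    counts = defaultdict(int)
    for votes in worker_votes.values():
        if item in votes:
            counts[votes[item]] += 1
    return max(counts, key=counts.get)


def weighted_vote(worker_votes, accuracies, item):
    """Sum log-odds reliability weights per label and return the top label."""
    scores = defaultdict(float)
    for worker, votes in worker_votes.items():
        if item in votes:
            p = accuracies[worker]
            scores[votes[item]] += math.log(p / (1 - p))
    return max(scores, key=scores.get)


# Hypothetical toy data: True means "the relation is present".
gold = {"g1": True, "g2": False, "g3": True, "g4": False}
worker_votes = {
    "worker_a": {"g1": True, "g2": False, "g3": True, "g4": False, "new_item": True},
    "worker_b": {"g1": True, "g2": False, "g3": True, "g4": True, "new_item": False},
    "worker_c": {"g1": False, "g2": False, "g3": True, "g4": True, "new_item": False},
}

accuracies = estimate_accuracy(worker_votes, gold)
print(majority_vote(worker_votes, "new_item"))
print(weighted_vote(worker_votes, accuracies, "new_item"))
```

In this toy case the two less reliable workers outvote the most reliable one under majority voting, while the weighted vote sides with the reliable worker. As the paragraph above notes, any such gain depends on how consistent the human judgments are.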

Aim 3: Identify downstream protein targets of Bcr-Abl

I will answer the following question: which cytoplasmic proteins in the literature are likely downstream targets of Bcr-Abl? To address this question, I will apply the hybrid information extraction system to documents related to Bcr-Abl to produce a knowledge network centered around the protein.

Figure 3. Advantages of machine learning.

a) Using machine learning to aggregate votes increases performance by 26% compared to simple majority voting (adapted from [16]). b) An example of a prostate cancer drug identified using a network analysis [7].

I will then overlay known Bcr-Abl targets and known non-targets onto the extracted information graph, and use concept similarities and term connectivities to rank candidate proteins based on their resemblance to known Bcr-Abl targets. For validation, I will ask a group of blinded clinicians to review the predictions against the original supporting literature and judge their validity. Real and fake targets will be introduced as control items to verify expert consistency. Since a similar network analysis was used to identify prostate cancer drugs (Figure 3b), I expect that the method will find at least one novel Bcr-Abl target, and that non-cytoplasmic proteins such as extracellular proteins will rank near the bottom of the predictions [7].
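One simple way to rank candidates by resemblance to known targets is sketched below. It is only an illustration under stated assumptions. Each protein is represented by the set of concepts it is linked to in the network, and candidates are scored by their mean Jaccard similarity to the known targets. The Jaccard measure, the function names, and every protein and concept identifier are placeholders chosen for this example. The proposal itself does not specify the similarity measure, and the use of known non-targets is omitted here for brevity.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of linked concepts."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(neighbors, known_targets, candidates):
    """Rank candidates by mean Jaccard similarity to the known targets."""
    scores = {}
    for candidate in candidates:
        sims = [jaccard(neighbors[candidate], neighbors[t]) for t in known_targets]
        scores[candidate] = sum(sims) / len(sims)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical toy network: protein -> concepts it co-occurs with in the literature.
neighbors = {
    "known_target_1": {"concept_a", "concept_b", "concept_c"},
    "known_target_2": {"concept_b", "concept_c", "concept_d"},
    "candidate_x": {"concept_b", "concept_c"},
    "candidate_y": {"concept_d", "concept_e"},
    "candidate_z": {"concept_f"},
}

ranking = rank_candidates(
    neighbors,
    known_targets=["known_target_1", "known_target_2"],
    candidates=["candidate_x", "candidate_y", "candidate_z"],
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A candidate that shares no concepts with any known target, like `candidate_z` here, scores zero and falls to the bottom of the list. That is the behaviour expected for proteins unrelated to Bcr-Abl signalling.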

Pitfalls and Alternatives

One potential complication is that processing the large set of CML documents using paid online crowdsourcing platforms will be prohibitively expensive. If this were to occur, we would adapt our crowdsourcing model to harness volunteers instead, large communities of which already exist [17]. In addition, the concept similarity method for finding new Bcr-Abl targets may not work for sparsely connected networks, in which case we would use an alternative knowledge-based method that relies on concept categories [7].

Significance and Innovation

This proposal represents one of the first studies aiming to solve biological problems using a representation of the literature generated through crowdsourcing. The technique is novel in that it can be adapted to other domain-specific problems, and can therefore be reused on different parts of the literature. Exposure of the general public to scientific literature through crowdsourcing also improves scientific literacy and outreach.


I conceived and devised the specific aims described. My advisor Dr. Su provided high-level feedback and advice regarding the proposal's research direction. Dr. Benjamin Good provided helpful discussion and insight.

Personal Statement

Everyone's path to bioinformatics is different, and mine is no exception. During my early schooling I spent much of my time exploring the physical sciences broadly, trying to find the area where I felt the greatest interest. Although I found physics to be powerful and chemistry to be precise, I discovered that biology fascinated me the most because it has a direct and lasting influence on the people around me. Thanks to two excellent high school teachers who instilled in me an appreciation for both computer science and biology, I proceeded down the path which would eventually lead me to computational biology.

My journey towards bioinformatics began in earnest during high school. I was a member of our school's Programming Enrichment Group, an extracurricular activity focused on problem solving and critical thinking through algorithmic design. We participated in many programming competitions, and I, along with other members of our club, competed at the Canadian Computing Competition for a chance to represent Canada at the International Olympiad in Informatics. I relished the logical rigor of informatics, and sought to apply it to other domains.

As an undergraduate I worked in Virginia Cornish's lab at Columbia University on a directed evolution project seeking to evolve protein binders. The project was motivated by a desire to develop protein therapeutics for proteins that are difficult to target specifically with small molecules, such as Bcr-Abl. Under the guidance of my graduate mentor I helped to characterize variants generated by the in vivo protein recombination system. This was my first experience working in a laboratory, and I found science to be a demanding but ultimately fulfilling endeavour. I returned to the Cornish lab the following summer to continue my work, but unfortunately had to leave prior to the project's completion due to a year spent studying abroad at Oxford. The most valuable experience from my time in the lab, other than learning to balance research with my other academic commitments, was the exposure to experimental techniques requiring large-scale data interpretation and analysis.

Upon starting my Ph.D. at Scripps after graduation, I quickly sought out Andrew Su's lab so that I could pursue my interest in computation and biology. In particular, the application of computational approaches to generate hypotheses from the existing literature fascinated me because I saw new opportunities to solve old problems like finding Bcr-Abl targets. In order to generate novel hypotheses, it is first necessary to create a computational representation of the literature. I realized that existing methods for generating these representations suffered from a lack of either quality or scalability, and sought to solve both problems by merging human judgments with automated machine learning. While working on this project I have had the opportunity to hone my communication skills through conference presentations, and won a presentation prize.

Following completion of my Ph.D. I hope to broaden my research background by working as a postdoctoral researcher in a laboratory which is both able to make computationally guided hypotheses and able to verify them biologically. My eventual goal is to establish a research group of my own which is able to utilize and verify the predictions of rapidly advancing computational tools in order to produce treatments and results for patients directly. Support from HHMI at this early stage in my career would provide me with the freedom to guide and develop my own research path without having to worry about financial burdens. Thank you for considering my application.


New insights into the molecular resistance mechanisms of chronic myeloid leukemia
Rui Huang, Qian Kang, Huimin Liu, Yuhua Li (2015) Current Cancer Drug Targets. doi:10.2174/1568009615666150921141004
Exploiting Literature-derived Knowledge and Semantics to Identify Potential Prostate Cancer Drugs
Rui Zhang, Michael Cairelli, Marcelo Fiszman, Halil Kilicoglu, Thomas Rindflesch, Serguei Pakhomov, Genevieve Melton (2014) CIN. doi:10.4137/CIN.S13889
Automated hypothesis generation based on mining scientific literature
Scott Spangler, Jeffrey N. Myers, Ioana Stanoi, Linda Kato, Ana Lelescu, Jacques J. Labrie, Neha Parikh, Andreas Martin Lisewski, Lawrence Donehower, Ying Chen, Olivier Lichtarge, Angela D. Wilkins, Benjamin J. Bachman, Meena Nagarajan, Tajhal Dayaram, Peter Haas, Sam Regenbogen, Curtis R. Pickering, Austin Comer (2014) Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '14. doi:10.1145/2623330.2623667
New Challenges for Biological Text-Mining in the Next Decade
Hong-Jie Dai, Yen-Ching Chang, Richard Tzong-Han Tsai, Wen-Lian Hsu (2010) J. Comput. Sci. Technol. doi:10.1007/s11390-010-9313-5
Text mining for biology - the way forward: opinions from leading scientists
Russ B Altman, Casey M Bergman, Judith Blake, Christian Blaschke, Aaron Cohen, Frank Gannon, Les Grivell, Udo Hahn, William Hersh, Lynette Hirschman, Lars Jensen, Martin Krallinger, Barend Mons, Seán I O'Donoghue, Manuel C Peitsch, Dietrich Rebholz-Schuhmann, Hagit Shatkay, Alfonso Valencia (2008) Genome Biol. doi:10.1186/gb-2008-9-s2-s7
RNA design rules from a massive open laboratory
Jeehyung Lee, Wipapat Kladwang, Minjae Lee, Daniel Cantu, Martin Azizyan, Hanjoo Kim, Alex Limpaecher, Snehal Gaikwad, Sungroh Yoon, Adrien Treuille, Rhiju Das (2014) Proceedings of the National Academy of Sciences. doi:10.1073/pnas.1313039111
Predicting protein structures with a multiplayer online game
Seth Cooper, Firas Khatib, Adrien Treuille, Janos Barbero, Jeehyung Lee, Michael Beenen, Andrew Leaver-Fay, David Baker, Zoran Popović, Foldit players (2010) Nature. doi:10.1038/nature09304
Crowdsourcing for bioinformatics
B. M. Good, A. I. Su (2013) Bioinformatics. doi:10.1093/bioinformatics/btt333
Dynamic Bayesian Combination of Multiple Imperfect Classifiers
Edwin Simpson, Stephen Roberts, Ioannis Psorakis, Arfon Smith (2013) Decision Making and Imperfection. doi:10.1007/978-3-642-36406-8_1