This project aims to predict new therapeutic indications for small molecules. We will focus on repurposing drugs for well-studied complex human diseases, relying on recently-available high-throughput data sources. The approach is integrative, seeking to combine multiple information domains through heterogeneous networks and modern machine learning techniques.
Pharmaceutical companies seeking to bring a novel therapeutic compound to market face a single-digit success rate, a price tag in the billions, and a duration spanning decades. The trend in research efficiency is equally grim: the cost of developing a new drug has increased exponentially, doubling approximately every nine years since 1950.
Since the 1990s, the prevailing model of drug discovery has focused on identifying compounds that target a single protein with maximum specificity. Through a molecular, reductionist approach to understanding disease, a plausible target is selected. Drugs are then designed to modulate the target, or small molecules with strong target affinity are identified using high-throughput screens. However, overwhelming evidence suggests that the potential of the 'one drug, one target, one disease' approach is limited. Biological systems are characterized by phenotypic robustness: knockout experiments in model organisms reveal that less than one fifth of genes are essential for survival. Similarly, pathology may represent a resilient homeostatic state, resistant to disruptions of a single protein. Approved small molecules affect on average 2.7 known targets, and when accounting for speculative targets that number jumps to 6.3. This promiscuity can play an important role in drug efficacy, as exemplified by clozapine, which remains the preeminent antipsychotic drug over compounds engineered to bind a subset of its dozen-plus targets.
Uncovering disease therapies that rely on multiple mechanisms, known as polypharmacology, requires escaping the limitations of the 'magic bullet' paradigm in favor of a 'magic buckshot' understanding of drug efficacy. An approach called network pharmacology seeks to characterize the multitude of corruptions embodying a pathology. With that knowledge, drugs are selected to restore a normal state. Network pharmacology encompasses polypharmacology by evaluating drugs that intervene at multiple points to achieve healthy homeostasis.
Drug repurposing — identifying novel uses for existing therapeutics — avoids many pitfalls and challenges of designing drugs from scratch. FDA-approved drugs have undergone extensive toxicology profiling during development and safety evaluation in Phase III clinical trials. Given ample time on the market, post-marketing trials and adverse event reporting uncover potential flaws that could lead to withdrawal. The wealth of information surrounding approved drugs gives repurposed compounds a favorable outlook compared to new molecular entities: time to approval is cut in half, to as low as three years; the success rate of advancing from Phase II trials to approval increases from 10 to 25 percent; and the average development cost for successful drugs plummets from 1.3 billion to as low as 8.4 million dollars.
Between 1999 and 2008, more first-in-class small-molecule drugs were discovered with phenotypic screening than with target-centric approaches, despite preferential investment towards the latter. The advent of omics technologies has enabled the quantification of several intermediate phenotypes between a disease's (or drug's) molecular basis and clinical manifestation. Intermediate phenotypes include transcriptional profiles, biological pathways, and genetic susceptibility markers. Traditional phenotypic approaches have focused on in vivo screening to identify compounds that alter a primary clinical indicator. In silico screening that instead relies on intermediate phenotypes offers a less costly and time-consuming way forward. Such approaches readily lend themselves to repurposing, polypharmacology, and network pharmacology.
We propose an integrative method for repurposing approved small molecules to treat additional complex diseases. The approach relies on characterizing the effect of compounds and diseases using high-throughput resources — many of which provide intermediate-phenotypic profiles for compounds and diseases — and, from this information, calculating features that describe specific aspects of a compound-disease relationship. From these features, a machine learning approach identifies the influential mechanisms behind drug efficacy and predicts additional indications for existing drugs.
We chose to focus on complex diseases because they frequently exhibit:
We chose to focus on small molecules because they exhibit:
Part 1. Resource Construction
First, we will construct a resource that encodes a systems perspective of pathogenesis and pharmacology. We will structure the resource as a network where entities (nodes) are connected by their relationships (edges). Nodes and edges belong to predefined types — respectively called metanodes (Table 1) and metaedges (Table 2). The schematic view showing how types relate is called a metagraph (Figure 1).
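As a minimal sketch of this structure (in pure Python, with illustrative node and edge names not drawn from the actual resource), each node can be labeled with its metanode and each edge with its metaedge:

```python
# Illustrative hetnet sketch: every node belongs to a metanode and every
# edge to a metaedge. Names below are examples only, not the real resource.
nodes = {
    "clozapine": "Compound",
    "DRD2": "Gene",
    "schizophrenia": "Disease",
}

# Each edge is a (source, metaedge, target) triple; edges are undirected.
edges = [
    ("clozapine", "binds", "DRD2"),
    ("DRD2", "associates", "schizophrenia"),
]

def neighbors(node, metaedge):
    """Return nodes connected to `node` by an edge of type `metaedge`."""
    result = []
    for source, kind, target in edges:
        if kind != metaedge:
            continue
        if source == node:
            result.append(target)
        elif target == node:
            result.append(source)
    return result

print(neighbors("clozapine", "binds"))  # ['DRD2']
```

Typing nodes and edges this way is what lets later stages reason over path *types* (metapaths) rather than individual paths.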
Table 1. Metanodes
The network will consist of the following node types. Domain-specific vocabularies provide standardized terminologies for each node type.
Table 2. Metaedges
The network will consist of the following edge types. High-throughput bioinformatics resources provide the necessary information for connecting nodes.
Each node type will be populated using a domain-specific vocabulary (Table 1). Controlled vocabularies provide a backbone for data integration, ensure entities are conceptually unique, and enable easy annotation for future users. Edges will be extracted from high-throughput bioinformatics resources (Table 2). We aim to incorporate resources that are high-throughput, high-quality, and publicly-available. When possible, systematic resources that circumvent knowledge biases will be employed.
Figure 1. Metagraph of the heterogeneous network
A schematic view of the node and edge types composing the network.
We are currently exploring various resources to provide a high-throughput catalog of indications. Feedback here would be appreciated.
Part 2. Discovering Mechanisms of Drug Efficacy
Features describe the relationship between a compound and a disease. Each feature measures a certain aspect of a compound-disease relationship: for example, whether the compound targets a susceptibility gene of the disease or whether the compound downregulates genes that are overexpressed in the disease state. Features that distinguish therapeutic from non-therapeutic compound-disease pairs represent mechanisms of drug efficacy. We refer to the discriminatory power of each feature as its performance. The performance of each feature indicates its pharmacological importance, and by comparing performance across features, we can contrast the informativeness of orthogonal domains of information. Finally, features describing the same general relationship but based on different data sources can identify the most informative resource or technology out of many.
Features will be computed from the network. Each feature will measure the prevalence of a specific type of path between a compound and disease. This approach was initially developed for social network analysis, and later adapted by us for predicting disease-associated genes. Briefly, the method identifies all paths between a source and target node that follow a specified type of path (metapath). The contribution of each path is weighted by its specificity: paths through high-degree nodes, which are likely to be less informative, are downweighted. The sum of the weighted paths results in a value of 0 or greater, where 0 indicates no connectivity. We plan to use the degree-weighted path count metric for computing features. The interpretation of a specific feature depends on its corresponding metapath. Table 3 provides example metapaths and describes their pharmacological significance.
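The weighting scheme can be sketched as follows. This is a simplified illustration on a toy network (all compound, gene, and disease names are hypothetical examples): each two-hop path is weighted by the product of its node degrees raised to a negative damping exponent, so paths through hub nodes contribute less. The full metric uses metaedge-specific degrees along arbitrary metapaths, not the overall degrees used here.

```python
# Toy network: (compound, gene) and (gene, disease) pairs, examples only.
edges = {
    ("aspirin", "PTGS2"), ("celecoxib", "PTGS2"),
    ("PTGS2", "inflammation"), ("PTGS2", "pain"),
}

def degree(node):
    """Number of edges touching `node`."""
    return sum(node in edge for edge in edges)

def dwpc(source, target, w=0.4):
    """Simplified degree-weighted path count over two-hop paths.

    Each path source-intermediate-target is weighted by the product of its
    node degrees raised to -w, downweighting paths through high-degree
    (less specific) nodes. Returns 0.0 when no path exists.
    """
    total = 0.0
    all_nodes = {node for edge in edges for node in edge}
    for intermediate in all_nodes:
        if (source, intermediate) in edges and (intermediate, target) in edges:
            total += (degree(source) * degree(intermediate) * degree(target)) ** -w
    return total

print(round(dwpc("aspirin", "inflammation"), 3))  # one path, through PTGS2
```

Here PTGS2 touches four edges, so the single aspirin-PTGS2-inflammation path is damped to 4<sup>-0.4</sup> ≈ 0.574 rather than counting as 1.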
Table 3. The interpretation of features for select metapaths
Features measure the prevalence of a specific metapath between the source compound and target disease. Metapaths are abbreviated using the first letter of each metanode (uppercase) and metaedge (lowercase). Refer to Figure 1 for metanode and metaedge names.
Metapath-based approaches have several advantages for predictive data integration, including:
Harnessing these advantages, we hope to evaluate and compare a diverse and broad set of potential mechanisms of efficacy.
Part 3. Predicting Probabilities of Efficacy
We plan to predict probabilities of efficacy for compound-disease pairs using heterogeneous network edge prediction [40, 41]. This approach trains a model from the network-based features and can return a probability of efficacy for any compound-disease pair. Previously, our implementation relied on regularized logistic regression, but modern software packages will allow us to rigorously evaluate a broad range of machine learning algorithms.
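A minimal sketch of this step, using scikit-learn and randomly generated stand-in data (the real feature matrix would come from the network-based features of Part 2):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: rows are compound-disease pairs, columns are network-based
# features, y marks known treatments. Values here are synthetic examples.
rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.2 * rng.standard_normal(100) > 0.9).astype(int)

# Regularized logistic regression (L2 penalty by default; C sets strength)
model = LogisticRegression(C=1.0).fit(X, y)

# Probability of efficacy for three unseen compound-disease pairs
probabilities = model.predict_proba(rng.random((3, 5)))[:, 1]
print(probabilities)
```

The same interface (`fit`/`predict_proba`) applies to most scikit-learn classifiers, which is what makes evaluating a broad range of algorithms straightforward.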
We are committed to a transparent, freely available, reusable, and reproducible scientific process and believe open science can revolutionize medicine. To this end, we will release all project-related code on GitHub. Datasets that are too large for GitHub will be published on figshare. All original materials will be released under CC-BY (requires attribution) or CC0 (public domain) licenses. Derivatives of restrictively-licensed works will be released under the most permissive option available. Our analyses will be made available in real time using GitHub Pages to host R Markdown documents and IPython Notebooks. Finally, we plan to follow 10 proposed rules for reproducible research in computational biology.
Share your ideas by commenting on an existing discussion or by starting a new thread. Our goal in joining ThinkLab is to generate as much interaction as possible.
Team & Resources
Daniel Himmelstein is a PhD candidate in the Biological & Medical Informatics program at UCSF. Daniel works in the Sergio Baranzini Lab whose mission is to apply cutting-edge bioinformatic approaches to genomic data, with a focus on multiple sclerosis. Dr. Baranzini has extensive experience with data integration, genomic profiling, and disease bioinformatics.
UCSF and the surrounding Bay Area are hotspots for drug development and data analytics. The team has access to QB3 resources, which include a computing cluster and small molecule discovery center.
This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant Number 1144247. We would like to thank BrowserStack for providing cross browser testing to help us compatibly share our research.
Published Jan. 12, 2015 (Last updated Feb. 5, 2016)
Topics: Heterogeneous Networks, Small Molecules, Machine Learning, Drug Repurposing, Systems Pharmacology, Bioinformatics, Complex Disease, HNEP, Multipartite Graphs, Hetnets
Cite this as: Daniel Himmelstein, Antoine Lizee, Chrissy Hessler, Leo Brueggeman, Sabrina Chen, Dexter Hadley, Ari Green, Pouya Khankhanian, Sergio Baranzini (2015) Rephetio: Repurposing drugs on a hetnet [proposal]. Thinklab. doi:10.15363/thinklab.a5