The following modifications to Hetionet were brought up:
Adding biologic drugs, e.g. antibodies, since Hetionet v1.0 only includes approved small molecules. In previous private communications, @sergiobaranzini has also strongly supported adding biologics as well as a more comprehensive set of small molecules.
Adding metabolites since the efficacy of many small molecules results from metabolites.
Updating Hetionet continuously in an automated fashion. Versioning could enable updates, using a scheme similar to API versioning. For example, we could use versioned URLs for our Neo4j Browser, such as v1.het.io.
Adding microbiome information, such as microbial communities and drug-microbe interactions.
Creating the network by snorkeling the literature. Hetionet v1.0 could be used as the "ground truth" to derive labeling functions. We alluded to data programming techniques in the report, writing "Going forward, advances in automated mining of the scientific literature could enable extraction of precise relationship types at omics scale."
Adding SNPs, potentially using ExAC for allele frequencies.
Adding a discovery date property to relationships. Year of earliest publication could provide a proxy for some relationship types with source publication metadata. This could allow us to see if we can predict future knowledge from past knowledge and help address knowledge biases.
Incorporating universal edge weights representing confidence in a relationship. In the past, @larsjuhljensen has also strongly advocated against our binarization of uncertain relationships.
Supporting the addition of user-supplied relationships.
Adopting more ontologies. Perform propagation during query-time. See Hetontology, which aims to host all openly-licensed OBO Foundry ontologies in a public Neo4j instance.
The following modifications to Project Rephetio were brought up:
Quantifying the significance of a prediction. Showing enrichment over the null rather than the probability of treatment, to provide more context.
Algorithms that account for relationship weight/confidence when measuring path prevalence for a given metapath between two nodes.
Focusing on drugs and diseases as perturbations to biological systems/pathways.
Better capturing tissue-specific effects. Accounting for the tissues/anatomies in which drug efficacy occurs for a specific compound and disease. How do drugs work in a specific tissue?
Queries that leverage metapatterns, rather than just metapaths. Metapatterns encompass queries that search for more than just a simple path. Metapatterns could help create tissue-specific features and drill down on the mechanisms of drug efficacy. Currently, we identify when a drug will work, but not necessarily why it works (the mechanism).
The following new applications of Hetionet were discussed:
Predicting drug-drug interactions.
Allowing users to provide their own gene sets. Enabling some sort of gene set enrichment queries. For example, what drugs interact with a given gene set?
When multiple drugs are available for a disease, what are the differences and which drug should be used for a given patient?
Clustering nodes, such as extracting gene modules.
Thanks to everyone who participated: @caseygreene, Jaclyn Taroni, Jie Tan 💻, @gregway, Brett Beaulieu-Jones, René Zelaya, Matt Huyck 💻, Dongbo Hu, Amy Campbell, Kathy Chen, Timothy Chang, Linda Zhou, and Stephen Woloszynek (💻 denotes virtual participation). There are clearly more ideas than time, which is why these discussions are important — so we pursue only the best!
Building on "Adopting more ontologies. Perform propagation during query-time," here is text I had sitting in a draft:
Currently, we don't include ontological edges. For example, there is no myelination→is_a→axon ensheathment relationship despite its presence in the Gene Ontology. We omitted this information because we didn't have a query framework that accommodated its complexity. However, Neo4j supports variable-length pattern matching, which will make real-time transitive closure feasible.
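To make query-time propagation concrete, here is a minimal pure-Python sketch of transitive closure over is_a edges. The term names are illustrative; in Neo4j, a variable-length Cypher pattern such as `-[:IS_A*0..]->` would play the same role.

```python
from collections import deque

# Toy is_a hierarchy (child -> list of parents); term names are illustrative.
IS_A = {
    "myelination": ["axon ensheathment"],
    "axon ensheathment": ["ensheathment of neurons"],
    "ensheathment of neurons": [],
}

def ancestors(term, hierarchy):
    """Return a term plus all of its transitive is_a ancestors."""
    seen = {term}
    queue = deque([term])
    while queue:
        for parent in hierarchy.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

# A gene annotated to "myelination" is implicitly annotated to all ancestors.
print(sorted(ancestors("myelination", IS_A)))
# → ['axon ensheathment', 'ensheathment of neurons', 'myelination']
```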
@dhimmel I would really like to find a way to collaborate more directly. We have two PhD students and a postdoc all working on extensions to the Hetionet project - thanks entirely to your efforts to be open here on ThinkLab. They are interested in: DeepDive/Snorkel, extending to rare genetic diseases, expanding drug coverage (particularly in relation to the Calibr collection we now will have access to), and generally improving the framework itself.
Apart from their interests, one of the things I would personally like to see is better exposure of the network you have assembled to the community beyond bioinformaticians. The purposes are both to support hypothesis generation for them and to tap into their collective expertise to crowdsource the curation and extension of the network. One way that might be achieved would be to load it into an incarnation of an application I am building in partnership with Richard Bruskiewich and his team at Star Informatics called knowledge.bio. The first version of knowledge.bio is online now, described in a bioRxiv article and available in the kb1 bitbucket repo. The next version is due to be released at the end of the year; it is a complete rewrite (including a move to Neo4j), makes extensive use of Wikidata, supports limited social features, and is available in the kb2 repo.
Everything above is at the 'strong starting point' phase. If you or other members of your group are interested in joining forces to take them to fruition, let us know!
We are also working with the Monarch consortium on a large knowledge integration project known as TransMed. This will be leveraging and extending the SciGraph system they built for Monarch. SciGraph exposes OWL ontologies and associated knowledge bases for query in the form of graph databases like neo4j.
@dhimmel that's an excellent list of potential ideas. I'd fully agree with expanding to include (reliable) data types as much as possible. The idea of automated continuous updating is very cool, but seems like a ton of work-hours to maintain long term. Drug-drug interactions can easily be tested against a "gold standard" database. I'm guessing it's relatively cheap (in work-hours) for you to rewrite the code to create drug-drug matches?
The idea of automated continuous updating is very cool, but seems like a ton of work-hours to maintain long term.
@pouyakhankhanian, if we move to more automated techniques on more standardized inputs, I think it's possible to entirely automate updates. One area of interest in my current lab, the Greene Lab, is automation. As a simple example, see dhimmel/thinklytics, which automatically backs up and analyzes Thinklab content on a daily basis. Obviously, building Hetionet is much more complex, which means automation will be more difficult but also more valuable.
Drug-drug interactions can easily be tested against a "gold standard" database. I'm guessing it's relatively cheap (in work-hours) for you to rewrite the code to create drug-drug matches?
Yeah that wouldn't be too bad. I'd probably rework some of the current analysis pipeline to be more reproducible and automatable. Currently, it's difficult for others to re-execute and extend our computations — see for example, this issue by @tongli.
@dhimmel I would really like to find a way to collaborate more directly. We have two PhD students and a postdoc all working on extensions to the Hetionet project - thanks entirely to your efforts to be open here on ThinkLab.
@b_good definitely. First, let me say that it's been super encouraging to see your team extend Hetionet / Rephetio. I'm not entirely sure what extensions you're working on, but the GitHub Issues opened have helped me see the sticking points of my old workflows.
From your comment, I think we're all thinking along the same lines. All biomedical knowledge should be mined from the literature into a hetnet. The hetnet should be stored in Neo4j. Nodes should be from ontologies and standardized vocabularies that can be automatically maintained. Curation should be fully outsourced or, if necessary, crowdsourced. Layers on top of the Neo4j database will likely be needed to provide biologists with user-friendly interfaces for domain-specific functionality.
I think that we're at a point where we can achieve the above objectives. One outstanding issue that I'd love input on is how to deal with context-specific relationships. For example, protein interactions that only exist in a certain tissue. I don't think my conception of hetnets or Neo4j's graph model is well-suited for this task.
But to the point, let's not duplicate efforts. I'm hoping that everyone who shares our vision can join forces into a collaborative open source development community. There's lots of overlap with the NCATS Biomedical Translator Program, which funded investigators from 11 universities. @b_good are you affiliated with any of these projects, and are they something we could also leverage — i.e. will their development be open source, and are they pursuing related or complementary approaches to what we've outlined?
So any suggestions on how best to structure collaboration? I personally think open source GitHub repos that operate on a pull request model are ideal. The difficulty will be coordination, recruitment, and providing an incentive structure that rewards collaboration.
Lars Juhl Jensen: I'm generally happy to contribute - let me know if there are any holes that need to be plugged, and I'll let you know if I have anything or know who does. Also, whereas I'm not part of the NCATS Biomedical Translator, I am part of the NIH funded IDG Knowledge Management Central, which would be another public source of already integrated data.
Benjamin Good: Our lab is part of the NCATS Translator program. On a technical level, our team (led by OHSU) will mainly be extending the scigraph system mentioned above. We will be supporting/using/building this knowledge graph with the Wikidata and BioThings projects. More to follow on that as things take shape (just got started).
I've posed the question of how to organize a collaboration here to our group and will also circle back on that front; I hope that they will chime in individually here, as I invited them to do. I think the GitHub pull request structure makes sense in general. I would like to see what specific projects would make sense to take on as a first step, though. If we are serious here, a periodic shared virtual lab meeting would be a good step forward.
In terms of context, the Wikidata qualifier structure is a pretty lightweight approach that suits a graph db and can be pretty expressive. I wouldn't limit things to Neo4j; there are more and more solid competitors out there with different optimal uses. For example, I am personally quite impressed with BlazeGraph's performance, and its ability to host a SPARQL endpoint is compelling from the standpoint of encouraging interoperability on the Web.
Mining knowledge from the literature
In the early days of Project Rephetio, @b_good suggested mining text to populate our network. We didn't end up doing much text mining to construct Hetionet v1.0. We did indirectly use relationships extracted from the literature through MEDLINE term co-occurrence and resources that curated the literature.
In general, we were more excited about "edges from systematic technologies that are not subject to knowledge biases" and invested our time in integrating resources such as LINCS L1000 and STARGEO. Our hope was to leverage systematic high-throughput technologies to make novel and insightful drug repurposing predictions. However, as we discuss in our report:
Ideally, different data types would provide orthogonal information. However, our model for whether a compound treats a disease focused on 11 metapaths — a small portion of the hundreds of metapaths available. While parsimony aids interpretation, our model did not draw on the weakly-predictive high-throughput data types — which are intriguing for their novelty, scalability, and cost-effectiveness — as much as we had hypothesized. Instead our model selected types of information traditionally considered in pharmacology.
I think increasing our number of positives and fixing edge dropout contamination — for example, by training on novel clinical trial indications — could help incorporate weakly predictive metapaths into our models. However, the findings of Project Rephetio suggest that we should redirect focus to incorporating established knowledge embedded in the literature.
@b_good, I saw that knowledge.bio uses the Semantic MEDLINE Database (SemMedDB). How good is SemMedDB? Unfortunately, SemMedDB appears to have troubling terms & conditions despite being a US government work and primarily factual. I know @andrewsu is in communication with the National Library of Medicine to fix these legal impediments to reuse. In my next post, I'll discuss Snorkel — the question being whether implementing Snorkel will provide sufficient advantages over existing databases, such as SemMedDB.
Benjamin Good: SemMedDB (created using SemRep, which you can install locally) is a good general-purpose starting point. It certainly has a lot of good content in it, but it also has a pretty high false positive rate at both the concept tagging and relation prediction levels. I think for any given entity type, say genes, you could make or find a better tagger (e.g. BANNER), and the same could be said for most of the relations. Getting all those different systems running at PubMed scale would be hard but could be worthwhile. Though the license isn't perfect, the intent is very clearly along the lines of what we hope to see: "SKR resources are available to all requesters, both within and outside the United States, at no charge." And I doubt very much there would be any real legal ramifications of using the tools or their outputs.
Improving the quality of SemMedDB is something I've been hoping to work on with a crowdsourcing/machine learning angle for some time. One of the main ideas behind knowledge.bio, and the reason we seeded it at first with SemMed content, was to attract expert users to flag correct and incorrect assertions.
Snorkeling the literature
Snorkel describes itself as "a lightweight platform for developing information extraction systems using data programming" (HazyResearch/snorkel on GitHub). There's a short video describing data programming, although the terminology is a bit foreign to me. Therefore, @caseygreene and I video-chatted with Alex Ratner, the lead developer, from the Christopher Ré Lab at Stanford.
Here were the takeaways. We could use Snorkel to fill in missing relationships in Hetionet. Each of the 24 relationship types would be a distinct classification problem, where we instruct Snorkel to learn how to extract that relationship type from the literature. We would start with a single labeling function, which would return true for relationships in Hetionet and false otherwise (distant supervision). Snorkel would fit a generative model that outputs the existing relationships and more.
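As a rough illustration of that single distant-supervision labeling function, here is a plain-Python sketch (not the actual Snorkel API; the candidate structure and pair data are made up):

```python
# Hypothetical (compound, disease) pairs standing in for Hetionet treats edges.
HETIONET_TREATS = {
    ("acamprosate", "alcohol dependence"),
    ("metformin", "type 2 diabetes mellitus"),
}

def lf_hetionet_treats(candidate):
    """Distant supervision: label a candidate mention pair positive (1) if the
    (compound, disease) pair is a known Hetionet treats relationship, else
    negative (-1). A real labeling function could also abstain (0)."""
    pair = (candidate["compound"].lower(), candidate["disease"].lower())
    return 1 if pair in HETIONET_TREATS else -1

candidate = {
    "compound": "Acamprosate",
    "disease": "alcohol dependence",
    "sentence": "Acamprosate proved to be a safe and effective aid...",
}
print(lf_hetionet_treats(candidate))  # → 1
```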
This approach is intriguing for several reasons:
We could fill in missing relationships where Hetionet doesn't currently contain all of the knowledge from the literature.
Each extracted relationship could refer back to the specific phrases supporting it throughout the literature.
The approach can be largely automated and deployed as new literature enters the corpus.
The approach scales whereas existing human curation approaches do not.
The approach generalizes, meaning each relationship type doesn't require an entirely new processing pipeline.
Alex mentioned the following three steps that take the most time:
data processing and munging — preparing the input data. Hopefully, the burden here could be minimized since Hetionet v1.0 is already standardized and integrated. Furthermore, Alex mentioned that they've indexed certain portions of PubMed, so we wouldn't have to spend lots of time configuring a literature corpus in the beginning.
test set curation — expert curation of specific phrases from the literature corpus as negatives or positives to evaluate performance. Potentially, we could withhold relationships and use them for testing, but the relationships are downstream from the actual classification, which occurs at the level of actual text. It sounded like uneven prior probabilities of relationships based on node degree, which we've experienced, could be an issue with testing on withheld relationships.
error analysis — investigating where errors are being made and adding labeling functions to rectify the errors (direct supervision). This step could be amenable to crowdsourcing by community review.
We discussed doing a pilot project based on a single relationship type in Hetionet v1.0 — perhaps Compound–treats–Disease since we've already done substantial curation to create PharmacotherapyDB.
Benjamin Good: We have had several folks attempt to use Snorkel, and before that raw DeepDive, with very limited success so far. Conceptually it seems completely awesome and exactly the right way to go. The implementation(s) have been in a serious state of flux, and that has made it a challenge to use. I still like it, and I know Chris Ré is keen to get more use in the biomedical space. Perhaps this is an area where joining forces would help tip the balance.
Alex Ratner: @b_good I think this is a very fair take, and we're excited to tip that balance! Working with a bunch of our "local" collaborators at Stanford, we've been having a lot of success using Snorkel. However, it's certainly still at a stage where our active involvement (both for help/bugfixes and active collaboration on code development) seems to be a big help, in part since the code is evolving so rapidly in response to feedback. We're hoping that it will begin to stabilize soon; any feedback you have for us on how we could make it easier to use would be greatly appreciated, either via GitHub issues, a Skype call, or on this platform!
Regarding your point about not having to create an entirely new pipeline for each relation (edge) type: this is indeed an area we're very excited about! To clarify: in a standard ML approach to relation extraction, you would indeed have to set up a new pipeline for each relation type; most annoyingly, you'd have to hand-label training data for each one. In our approach, you have to write labeling functions (LFs), which are just Python functions in Snorkel, but then you can potentially share some of these across multiple relation types. And as I was getting excited about on our call, we're starting to work on automating this "sharing" of supervision signal within a multi-relation extraction project...
Re: Data processing & test set curation: To reiterate what we discussed, we'd love to get more of this data out in the open & shared! To that end, we're currently preprocessing a lot of the data (PubMed abstracts to start) and plan to post this. This preprocessing (e.g.: running through NLP pipelines to split sentences, tokenize words, parse linguistic signals & grammatical structure, and then aligning with external entity annotations) is often very messy and time consuming, even though it's not part of what we'd consider Snorkel's "core task". So we're hoping to get it out of the way for people. Any pooling of efforts here would be awesome!
Re: Error analysis: just to be clear, I would consider error analysis one component of the broader process of "developing the extractor", or more specifically for Snorkel, developing labeling functions. And this would still not be "direct supervision", because you're still not hand-labeling individual examples (although you always could).
Anyway, very excited to chat further with everyone here!
Daniel Himmelstein: Thanks @b_good and @alexratner for weighing in. The Greene Lab has a new rotation student, David Nicholson, who is interested in piloting Snorkel. We were thinking of starting with Compound–treats–Disease relationships. The project is on GitHub at greenelab/snorkeling. Your time permitting, @b_good we'd love your involvement, especially in the early stages where your feedback will lead us down the right path. @b_good, do you have a GitHub handle?
Benjamin Good: @dhimmel would love to be involved but am currently on family leave (back in February). Suggest engaging with @tongli from our lab. I believe he has the most relevant experience to offer and can connect you to other lab members (active and departed) that have worked with it. Obviously @alexratner is probably your best resource here :)
@tongli (veleritas on GitHub) — feel free to get involved on greenelab/snorkeling as little or as much as you want. If interested, I can mention you for pull request review or issues where I think your input may be helpful.
Benjamin Good: @dhimmel quick note that you might consider starting with a relation type (or two) that is less semantically complex and more plentiful in the research literature than 'treats'. As you and others have demonstrated, potential new 'treats' relations can be inferred based on specific constellations of other relations (e.g. genetic interaction, physical interaction, etc.). Even if DeepDive magic can eventually start figuring out those deeper signatures automatically, it's probably worth a look as a starting point.
Regarding the initial preprocessing (e.g. splitting, NER) of PubMed, I wonder if the PubAnnotation project might provide some useful structures and data?
Hi @dhimmel this is really exciting! Thoughts (some others also in the issues discussion of your repo):
Is compound functionally too different from chemical? We have tags for chemical and disease in Snorkel format already, for PubMed abstracts via PubTator. If compound is different, we also have (or are working on) tools for building custom entity taggers, but it's just another step that would have to be tackled.
I definitely agree with @b_good's general sentiment of starting with simpler relation types, although don't have a lot of intuition for which are more tractable. Easy things to check off when selecting extraction schema:
Make sure that the relation type is well defined / unambiguous enough that you can cleanly annotate some examples yourself.
Test some simple baseline methods (e.g. a regex or two) to make sure it's not too simple / general (if it is, maybe the regex is good enough, and no need for Snorkel; or maybe this is good cue to jump to the more specific, complex relation type you actually care about!)
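For Compound–treats–Disease, such a baseline might look like the following sketch; the patterns are hypothetical and deliberately naive:

```python
import re

# Hypothetical, deliberately naive patterns for a treats-like sentence.
TREATS_PATTERNS = [
    re.compile(r"\beffective\b.*\btreat(s|ed|ing|ment)?\b", re.IGNORECASE),
    re.compile(r"\btreat(s|ed|ing|ment)?\b.*\bpatients\b", re.IGNORECASE),
]

def baseline_treats(sentence):
    """Return True if any naive treats pattern fires on the sentence."""
    return any(pattern.search(sentence) for pattern in TREATS_PATTERNS)

print(baseline_treats(
    "Acamprosate proved to be a safe and effective aid in treating "
    "alcohol-dependent patients."))  # → True
print(baseline_treats("The gene is highly expressed in liver."))  # → False
```

If patterns this crude already recover most true mentions, the relation may be too simple to justify Snorkel; if they catch almost nothing, the relation's phrasing is diverse enough to need learned extraction.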
@b_good that resource looks really cool, thanks for sharing!
We're trying to wrap up a bunch of dev stuff that's been scattered about over the last month or so, hoping to push something to dev and master in the next 1–2 weeks. Stay tuned!
I think it would be a great experimental setup to begin your extraction work using the relationships in your current hetnet as your gold standard (standards, really: one per relation type). First, that means you already have a gold standard, which is a huge part of the work in building up an NLP system (and you can likely publish good results in NLP-related arenas). Second, if it works, you have an automated way to grow the hetnet on a daily basis and potentially to scrub out the licensed content. Third, you set yourself up for a great experiment: can a drug-treats-disease automatic relation finder outperform your results from the hetnet + R inference model? How can those methods be harmonized?
This will be a great project...
@b_good, your comment nails the promise of snorkeling hetnets. I was envisioning using Snorkel for network construction rather than relationship prediction. In other words, I'd want each literature extracted relationship to derive from specific phrases that directly attest to the relationships' existence. However you bring up an interesting point — will Snorkel begin to infer relationships and become a substitute for hetnet edge prediction?
The Compound–treats–Disease Pilot
you might consider starting with a relation type (or two) that is less semantically complex and more plentiful in the research literature than 'treats'.
Is whether a compound treats a disease really so semantically complex? For example, here is an abstract sentence which implies a CtD relationship:
Acamprosate proved to be a safe and effective aid in treating alcohol-dependent patients and in maintaining the abstinence of patients during 2 years.
I'm hoping we can steer Snorkel away from making inferences, say, based on the genes involved in alcohol dependence and instead focus on actual statements of clinical efficacy. PharmacotherapyDB has 755 disease-modifying treatments; @alexratner, will this be sufficient to support an effective labeling function? If not, we could expand to all clinical trial indications, which would give us ~30,000 gold standard relationships.
@b_good if we struggle at snorkeling treats relationships, we could always pilot another relationship type that is more semantically straightforward and plentiful. Gene–interacts–Gene relationships would be one possibility.
Entity tagging and mapping
Is compound functionally too different from chemical? We have tags for chemical and disease in Snorkel format already, for PubMed abstracts via PubTator.
@alexratner, compound and chemical are the same thing. Looks like PubTator tags genes, diseases, species, chemicals, and mutations [2, 3]. Of these, Hetionet contains genes, diseases, and chemicals.
PubTator uses Entrez Gene for genes, as do we. Genes are tagged using GNormPlus.
PubTator uses DNorm to tag diseases, which uses MEDIC. MEDIC is the disease vocabulary used by the Comparative Toxicogenomics Database (CTD), which consists of OMIM and MeSH terms.
Compounds in PubTator are tagged with tmChem, which uses MeSH and ChEBI, so we'll have to find a way to map MeSH to DrugBank.
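A first pass at that mapping could be a simple lookup table; below is a sketch in which the MeSH/DrugBank identifier pairs are purely hypothetical (a real table might be derived from DrugBank's external identifier links or a cross-reference service):

```python
import csv
import io

# Hypothetical MeSH -> DrugBank mapping rows; real identifiers would come
# from a curated cross-reference resource.
MAPPING_TSV = """mesh_id\tdrugbank_id
D000001\tDB00001
D000002\tDB00002
"""

MESH_TO_DRUGBANK = {
    row["mesh_id"]: row["drugbank_id"]
    for row in csv.DictReader(io.StringIO(MAPPING_TSV), delimiter="\t")
}

def map_compound(mesh_id):
    """Map a tmChem MeSH tag to a DrugBank ID, or None if unmapped."""
    return MESH_TO_DRUGBANK.get(mesh_id)

print(map_compound("D000001"))  # → DB00001
print(map_compound("D999999"))  # → None
```

Unmapped tags returning None would let us measure how much of PubTator's chemical coverage falls outside Hetionet's compound set.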
Future Snorkel discussion
I'd like to make sure this brainstorming discussion doesn't explode with Snorkel implementation details. So for any Snorkel comments that will generate lots of discussion, let's either create new discussions on Thinklab or issues on greenelab/snorkeling. We can also make a new Thinklab project for snorkeling.
@dhimmel Sorry for the delayed response here. I 100% agree with your statement "I was envisioning using Snorkel for network construction rather than relationship prediction. In other words, I'd want each literature extracted relationship to derive from specific phrases that directly attest to the relationships' existence."
In some sense this is an engineering decision: we could train any predictive model, but it seems best in our experience to train one to extract the concrete explicit mentions of relations, and then to do further reasoning / prediction / analysis over these.
The complexity of the treats relationship is probably higher than you'd expect... thank you, natural language... but it's definitely very doable (actually, we have a chemical-induced disease extraction application using CTD distant supervision which beats the state of the art on a BioCreative task; will clean it up as a tutorial soon...)
However as to which labeling functions / weak supervision will be good enough... no great easy answers, let's see what happens!
Great to hear that our entity types are aligned.
Also, we just released v0.5.0, so hopefully this helps!
Daniel S Himmelstein, Antoine Lizee, Christine Hessler, Leo Brueggeman, Sabrina L Chen, Dexter Hadley, Ari Green, Pouya Khankhanian, Sergio E Baranzini (2016) Cold Spring Harbor Laboratory Press. doi:10.1101/087619
Monkol Lek, Konrad J. Karczewski, Eric V. Minikel, Kaitlin E. Samocha, Eric Banks, Timothy Fennell, Anne H. O’Donnell-Luria, James S. Ware, Andrew J. Hill, Beryl B. Cummings, Taru Tukiainen, Daniel P. Birnbaum, Jack A. Kosmicki, Laramie E. Duncan, Karol Estrada, Fengmei Zhao, James Zou, Emma Pierce-Hoffman, Joanne Berghout, David N. Cooper, Nicole Deflaux, Mark DePristo, Ron Do, Jason Flannick, Menachem Fromer, Laura Gauthier, Jackie Goldstein, Namrata Gupta, Daniel Howrigan, Adam Kiezun, Mitja I. Kurki, Ami Levy Moonshine, Pradeep Natarajan, Lorena Orozco, Gina M. Peloso, Ryan Poplin, Manuel A. Rivas, Valentin Ruano-Rubio, Samuel A. Rose, Douglas M. Ruderfer, Khalid Shakir, Peter D. Stenson, Christine Stevens, Brett P. Thomas, Grace Tiao, Maria T. Tusie-Luna, Ben Weisburd, Hong-Hee Won, Dongmei Yu, David M. Altshuler, Diego Ardissino, Michael Boehnke, John Danesh, Stacey Donnelly, Roberto Elosua, Jose C. Florez, Stacey B. Gabriel, Gad Getz, Stephen J. Glatt, Christina M. Hultman, Sekar Kathiresan, Markku Laakso, Steven McCarroll, Mark I. McCarthy, Dermot McGovern, Ruth McPherson, Benjamin M. Neale, Aarno Palotie, Shaun M. Purcell, Danish Saleheen, Jeremiah M. Scharf, Pamela Sklar, Patrick F. Sullivan, Jaakko Tuomilehto, Ming T. Tsuang, Hugh C. Watkins, James G. Wilson, Mark J. Daly, Daniel G. MacArthur (2016) Nature. doi:10.1038/nature19057
Richard Bruskiewich, Kenneth Huellas-Bruskiewicz, Farzin Ahmed, Rajaram Kaliyaperumal, Mark Thompson, Erik Schultes, Kristina M Hettne, Andrew I Su, Benjamin M Good (2016) Cold Spring Harbor Laboratory Press. doi:10.1101/055525