You say elsewhere that you want to avoid bias - that you want to work basically from experimental measurements as much as possible. This has some merit, but it also seems like there must be quite a lot of value out there in the space of biased knowledge. Some of that bias will be real signal. Would be great to execute an experiment to empirically test the impact of bringing in prior knowledge (e.g. via text mining techniques for relation extraction). Does it make it better or worse at the task at hand?
there must be quite a lot of value out there in the space of biased knowledge
You are correct. In our current proposal, we do use text mining for the symptom-disease edges and tissue-disease edges. We also rely on literature curation for compound-target binding and protein interactions.
Would be great to execute an experiment to empirically test the impact of bringing in prior knowledge
I agree: we should fit a model that excludes all knowledge-biased domains. I reckon this model's performance on known indications will be drastically inferior. The worry with predicting known indications from known biology is that testing performance becomes nearly perfect, yet the novel predictions are not interesting: they would be readily apparent to a pharmacologist.
The more text mining data you include, the larger the gap between testing performance and generalization performance (see our discussion on evaluation). Therefore, I like the following workflow:
1. Start with high-throughput resources that are not affected by knowledge bias (a.k.a. study bias).
2. If the algorithm performs significantly better than random, explore the top predictions.
3. If performance is mediocre, add a biased resource that provides orthogonal information (information not already included from a systematic resource).
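The decision logic above can be sketched in a few lines of Python. This is a toy illustration only: the edge records, the `evaluate` stand-in, and the significance margin are hypothetical placeholders, not our actual pipeline.

```python
def filter_unbiased(edges):
    """Step 1: keep only edges from systematic, high-throughput resources."""
    return [e for e in edges if e["unbiased"]]

def run_workflow(edges, evaluate, random_auroc=0.5, margin=0.05):
    """Fit on the knowledge-unbiased network; decide the next step based
    on whether performance beats random by a (hypothetical) margin."""
    unbiased = filter_unbiased(edges)
    auroc = evaluate(unbiased)
    if auroc > random_auroc + margin:
        return "explore_top_predictions", auroc
    return "add_orthogonal_biased_resource", auroc

# Hypothetical edges: only the first comes from a systematic resource.
edges = [
    {"source": "gene_A", "target": "disease_B", "unbiased": True},
    {"source": "gene_C", "target": "disease_B", "unbiased": False},
]

# A placeholder evaluator standing in for real cross-validated AUROC.
decision, auroc = run_workflow(edges, evaluate=lambda net: 0.72)
print(decision)  # → explore_top_predictions
```

In a real run, `evaluate` would be a cross-validated performance estimate on known indications rather than a constant.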
One final note to help explain the insidiousness of knowledge bias. In CTD, curators may read that compound_X treats disease_X and also targets gene_X, and therefore add interactions between all three entities to their knowledgebase. In reality, the study hasn't proven that gene_X is associated with disease_X, but this relationship was still extracted. This happens on a macro scale across the entire compendium of published literature: specific network vicinities become well studied, and the resulting disease-gene-compound triangles are more a result of attention than of noteworthy biology.
Knowledge biased and unbiased edges
Thanks to @b_good's suggestions on text mining and curated databases, we have incorporated several edges that are subject to knowledge bias.
Thus, for each edge we will create an unbiased attribute which takes a True or False value. Using our network masking feature, we can easily switch between using the whole network or only the knowledge-unbiased portion.
Some metaedges will contain a mix of biased and unbiased edges. For example, protein interaction edges may be biased or unbiased depending on their source database. When both a biased and an unbiased source contribute an edge, we will give precedence to the unbiased designation.
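A minimal sketch of how the unbiased attribute and the precedence rule could work. The source names and the UNBIASED_SOURCES set are hypothetical placeholders; our actual network masking feature lives in the network software, not in this code.

```python
# Hypothetical set of systematic (knowledge-unbiased) resources.
UNBIASED_SOURCES = {"high_throughput_screen"}

def annotate(edge_sources):
    """An edge is unbiased if ANY contributing source is unbiased:
    the unbiased designation takes precedence."""
    return any(src in UNBIASED_SOURCES for src in edge_sources)

# Hypothetical protein-interaction edges with their contributing sources.
edges = {
    ("GENE_A", "GENE_B"): ["literature_curation"],
    ("GENE_A", "GENE_C"): ["literature_curation", "high_throughput_screen"],
}
annotated = {pair: annotate(sources) for pair, sources in edges.items()}

def mask(annotated_edges, unbiased_only=False):
    """Network masking: return the whole network, or only its
    knowledge-unbiased portion."""
    return {pair for pair, unbiased in annotated_edges.items()
            if unbiased or not unbiased_only}
```

With this scheme, switching between the full network and the unbiased subnetwork is a single boolean flag.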
Benjamin Good: I'm wondering about your relation ontology here. Can you really distinguish a causal relation from co-occurrence information? Do the names on these edges actually matter based on how you are using the network? If the meaning of the relations is really important, I think there are other text-mining approaches that you should look into. If not, the co-occurrence stuff should be fine. Really curious about this experiment. Would also like to see how the results of other text-mining approaches would influence the outcome, e.g. would it change things if you swapped in the relations from SemMedDB?
Do the names on these edges actually matter based on how you are using the network?
@b_good, for predicting the probability of efficacy of a compound–disease pair, the metaedge names do not matter. The algorithm only considers the structure of the network. The names are used to assist with interpretability. For example, the CtGad feature (capturing when a compound targets genes that are associated with the disease) may be predictive. In this case, we would conclude that disease-associated genes are informative for repurposing. If what we call an 'association' is actually some other type of relationship, then the interpretation that associations are influential will be unfounded.
When we have multiple metaedges between the same metanodes, we hope there is a difference in the type of information encoded. Otherwise, we would be better off having only a single metaedge. For example, we included binding and target edges between compounds and genes. It is unclear whether merging these edges would be beneficial, because it's difficult to know how they differ. Therefore, a good understanding of what information each metaedge captures will assist with metagraph design. Accurate metaedge names help with understanding edge content and therefore with network design decisions.
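To illustrate why metaedge names matter only for interpretation, here is a toy sketch of a feature like CtGad as a pure path count. The compound, genes, and disease below are hypothetical, and the real features involve degree-weighted metrics rather than raw counts.

```python
# Hypothetical edge sets for two metaedges.
targets = {"compound_X": {"gene_1", "gene_2"}}       # Compound-targets-Gene
associations = {"disease_Y": {"gene_2", "gene_3"}}   # Disease-association-Gene

def ctgad_path_count(compound, disease):
    """Count Compound-targets-Gene-associated-Disease paths.
    The algorithm sees only this number; the edge labels ('targets',
    'association') matter only when we interpret the feature."""
    return len(targets.get(compound, set()) & associations.get(disease, set()))

print(ctgad_path_count("compound_X", "disease_Y"))  # → 1 (via gene_2)
```

If the 'association' edges actually encoded some other relationship, this number would be computed identically, but our interpretation of its predictiveness would change.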
Would also like to see how the results of other text-mining approaches would influence the outcome, e.g. would it change things if you swapped in the relations from SemMedDB?
Currently, I am happy with the quality of our MEDLINE topic co-occurrence approach. I assume you highlight SemMedDB because of its ability to extract the type of relationship. I agree this could be a valuable addition. However, in the interest of time, it will most likely have to wait until a subsequent project.