Evolutionary rate covariation (ERC) assesses whether two genes have a similar evolutionary history. A recent study computed ERC values in humans and found that genes associated with the same disease were often tied together by similar evolutionary histories. The study based its gene sets on OMIM, which focuses on Mendelian genetics, so whether ERC prioritizes disease-associated genes for complex diseases is unclear. However, this resource is attractive as an orthogonal, systematic, and unbiased indicator of common gene functionality.
We began working with the human data from the website. First, we parsed the data, converted it from a matrix format to a tidy pairwise format, and mapped the UCSC gene IDs to Entrez Gene (notebook). Next, we collapsed the values by Entrez Gene pairs (code). Almost all UCSC–Entrez mappings were one-to-one; where several UCSC IDs mapped to the same Entrez Gene, we took the average correlation value.
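For readers who want to follow along, here is a minimal sketch of the reshaping and collapsing steps described above. It is not the code from the linked notebook. The function names, the column names (`source`, `target`, `erc`), and the decision to drop pairs where both UCSC IDs map to the same Entrez Gene are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def matrix_to_pairs(matrix: pd.DataFrame) -> pd.DataFrame:
    """Convert a square, symmetric ERC matrix into one row per unordered pair."""
    ids = matrix.index.to_numpy()
    # Upper triangle without the diagonal: each unordered pair appears once.
    rows, cols = np.triu_indices(len(ids), k=1)
    pairs = pd.DataFrame({
        "source": ids[rows],
        "target": ids[cols],
        "erc": matrix.to_numpy()[rows, cols],
    })
    # Pairs without sufficient data are assumed to be stored as NaN.
    return pairs.dropna(subset=["erc"]).reset_index(drop=True)


def collapse_to_entrez(pairs: pd.DataFrame, ucsc_to_entrez: dict) -> pd.DataFrame:
    """Map UCSC IDs to Entrez Gene IDs and average ERC over duplicate pairs."""
    mapped = pairs.assign(
        source=pairs["source"].map(ucsc_to_entrez),
        target=pairs["target"].map(ucsc_to_entrez),
    ).dropna(subset=["source", "target"])
    # Assumption: a gene paired with itself is not informative, so drop it.
    mapped = mapped[mapped["source"] != mapped["target"]]
    # Sort each pair so (a, b) and (b, a) fall into the same group.
    low = np.minimum(mapped["source"], mapped["target"]).astype(int)
    high = np.maximum(mapped["source"], mapped["target"]).astype(int)
    mapped = mapped.assign(source=low, target=high)
    return mapped.groupby(["source", "target"], as_index=False)["erc"].mean()
```

Sorting each pair before grouping matters because the many-to-one mapping can otherwise produce the same Entrez pair in both orders, which would escape the averaging.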
Our goal is to extract gene pairs that share an evolutionary history. ERC values are provided for all gene pairs with sufficient data, but we are only interested in the small subset of biologically meaningful correlations. Here we consider thresholding on the ERC value:
Figure 1. Distribution of ERC values
Figure 2. Probability of positive or negative sign given absolute ERC value
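If we assume that dissimilar evolutionary histories are not a real signal, so that negative ERC values reflect only noise, we can use Figure 2 to select an ERC threshold. If we further assume that the null distribution is symmetric around zero, the negative values of a given magnitude estimate how many positive values of that magnitude are false. Under these assumptions, a threshold of ERC > 0.75 would lead to a false discovery rate of approximately 10%. We can also take an empirical approach and optimize the threshold based on performance.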
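One way to make the symmetric-null reasoning concrete is sketched below. It uses a tail-based estimate: the number of values below -t stands in for the number of false positives above t. This is an illustrative reading of the argument and may not match how the 10% figure was read from Figure 2; the function name `estimate_fdr` is hypothetical.

```python
def estimate_fdr(erc_values, threshold):
    """Estimate the false discovery rate for the rule ERC > threshold.

    Assumptions: every negative ERC value is noise, and the noise
    distribution is symmetric around zero. The count of values below
    -threshold then estimates the false positives above +threshold.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    n_selected = sum(1 for value in erc_values if value > threshold)
    n_mirror = sum(1 for value in erc_values if value < -threshold)
    if n_selected == 0:
        return float("nan")  # nothing passes the threshold
    return min(1.0, n_mirror / n_selected)
```

Scanning a grid of thresholds with this function and picking the smallest one that meets a target rate would give a data-driven alternative to choosing the cutoff by eye.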
We would appreciate any community feedback on rational thresholding techniques. Additionally, we may want to consider a separate edge for dissimilar evolutionary history. Is that a meaningful concept?
We recently worked on a (mini-)study to investigate the relationship between the ERC values of gene pairs and the extent to which they share interacting partners. We used the Jaccard coefficient to quantify that extent (a higher Jaccard coefficient means a larger fraction of interacting partners is shared). We used the yeast ERC dataset and, interestingly, found a weak but significant positive correlation between the ERC values of gene pairs and the Jaccard coefficient (JC) of the interacting partners of the two genes. We additionally saw that using JC in conjunction with ERC has the potential to reduce the number of false discoveries in interaction prediction. I could attach the full report of our investigation if it interests you.
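For reference, the Jaccard coefficient of two partner sets A and B is |A ∩ B| / |A ∪ B|. A minimal sketch follows, assuming the network is given as a list of undirected edges. The helper names are hypothetical, and whether the two genes themselves should be removed from each other's partner sets is a choice the description above leaves open; this version leaves them in.

```python
from collections import defaultdict


def build_partner_sets(edges):
    """Build gene -> set of interacting partners from undirected edges."""
    partners = defaultdict(set)
    for gene_a, gene_b in edges:
        if gene_a == gene_b:
            continue  # ignore self-interactions
        partners[gene_a].add(gene_b)
        partners[gene_b].add(gene_a)
    return partners


def jaccard_coefficient(partners, gene_a, gene_b):
    """Fraction of the combined partner set that both genes share."""
    set_a = partners.get(gene_a, set())
    set_b = partners.get(gene_b, set())
    union = set_a | set_b
    if not union:
        return 0.0  # neither gene has any known partner
    return len(set_a & set_b) / len(union)
```

For example, if gene A interacts with C, D, and E while gene B interacts with D, E, and F, the two share 2 of 4 distinct partners, so the coefficient is 0.5.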
I think it would be interesting to refine the approach further and apply it in the human context as well. It may well turn out to be more powerful than thresholding based on ERC values.
The protein interaction project sounds interesting. Have you considered using a random walk with restart on the protein interaction network for PPI-similarity? I think you will find it preferable to the Jaccard coefficient, since it considers more than just first-degree neighbors. I have some Python code for the random walk that I can open source, if you can't find an implementation. What protein interaction network are you using? I think a systematic PPI network (one that isn't riddled with knowledge bias) would be especially interesting. I suggest the HI-II-14 network from here.
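To illustrate the idea, here is a generic sketch of a random walk with restart using power iteration on a dense adjacency matrix. It is not the code mentioned above; the function name, the default restart probability of 0.5, and the treatment of isolated nodes (given a self-loop) are assumptions for illustration.

```python
import numpy as np


def random_walk_with_restart(adjacency, seed, restart=0.5, tol=1e-10, max_iter=1000):
    """Return the steady-state visiting probabilities for a walk from `seed`.

    adjacency: square, symmetric, non-negative matrix (undirected network).
    seed: index of the node the walker restarts from.
    restart: probability of jumping back to the seed at each step.
    """
    adjacency = np.array(adjacency, dtype=float)  # copy so the input is untouched
    # Assumption: isolated nodes get a self-loop so every column can be normalized.
    isolated = np.flatnonzero(adjacency.sum(axis=0) == 0)
    adjacency[isolated, isolated] = 1.0
    # Column-normalize so each column holds transition probabilities.
    transition = adjacency / adjacency.sum(axis=0)

    start = np.zeros(adjacency.shape[0])
    start[seed] = 1.0
    current = start.copy()
    for _ in range(max_iter):
        updated = (1 - restart) * (transition @ current) + restart * start
        if np.abs(updated - current).sum() < tol:
            return updated
        current = updated
    return current
```

The probability assigned to gene B when the walk restarts at gene A can serve as a PPI-similarity score for the pair. Because the score is not symmetric in general, one option is to average the A-to-B and B-to-A values.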
In terms of this project, we would like to keep the ERC edges independent of the PPI edges. The ERC values are attractive to us as a completely orthogonal resource to the protein interactions.