Calculating molecular similarities between DrugBank compounds

Daniel Himmelstein, Sabrina Chen

doi:10.15363/thinklab.d70

Project:

Rephetio: Repurposing drugs on a hetnet [rephetio]

Daniel Himmelstein Researcher May 19, 2015

We have calculated molecular (aka chemical or structural) similarities between DrugBank compounds. First, we retrieved the compound structures as an SDF file from the download page. Then we calculated extended connectivity fingerprints for each compound using the Morgan/circular method [1]. We chose a radius of 2, since "Typically, two iterations is sufficient for fingerprints that will be used for similarity or clustering" [1]. Finally, we computed all pairwise similarities using the Dice coefficient [2].
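As a rough illustration of the last step, here is a minimal sketch of the Dice coefficient applied to every pair of fingerprints. It assumes each fingerprint is represented as a plain set of feature identifiers (count-based fingerprints would need a multiset variant). The names `dice_similarity`, `pairwise_similarities`, and the toy fingerprints are hypothetical and are not taken from the notebook.

```python
from itertools import combinations


def dice_similarity(features_a, features_b):
    """Dice coefficient between two sets of fingerprint features."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return 2 * len(a & b) / (len(a) + len(b))


def pairwise_similarities(fingerprints):
    """Return {(id_a, id_b): similarity} for every unordered pair of compounds."""
    return {
        (id_a, id_b): dice_similarity(fingerprints[id_a], fingerprints[id_b])
        for id_a, id_b in combinations(sorted(fingerprints), 2)
    }


# Hypothetical toy fingerprints: made-up feature identifiers, not real compounds
toy_fingerprints = {
    "compound_A": {1, 2, 3, 4},
    "compound_B": {3, 4, 5, 6},
    "compound_C": {7, 8},
}
similarities = pairwise_similarities(toy_fingerprints)
```

In this toy example, `compound_A` and `compound_B` share two of their four features each, giving a similarity of 0.5, while `compound_C` shares nothing with either and scores 0.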

The similarities for the subset of DrugBank compounds included in our network are available here. We posted the full set of similarities (for all DrugBank compounds with structures) on figshare [3].

See the notebook of the analysis for more details.

Sabrina Chen Researcher Aug. 3, 2015

Relationship between transcriptional and chemical similarity

Notebook

Objective:

Determine the correlation and visualize the relationship between L1000 transcriptional and chemical compound similarity.

The data:

We extracted two datasets: L1000 perturbation data, which lists drug types with perturbation IDs, and chemical similarity data, which gives a value between zero and one for each pair of drugs.

Jointplot comparing chemical and transcriptional similarities:

From the imported L1000 perturbation data, we calculated the Spearman correlation for each pair of drugs and labeled this value the "transcriptional similarity." We then plotted it against the imported chemical similarity for each drug pair. The jointplot (which shows the bivariate data as a hexbin plot and the univariate data as histograms) showed no strong correlation visually. The darkest parts of the hexbin plot were grouped at the bottom left of the grid, and the plot showed no clear positive pattern. Nevertheless, the correlation had an extremely small p-value, indicating statistical significance despite the small effect size.
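The transcriptional similarity described above can be sketched in plain Python as a Spearman correlation, computed as the Pearson correlation of average ranks. This is an illustrative sketch only: the function names and the toy expression profiles are hypothetical, and the notebook may well rely on a library routine instead.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical expression profiles (one value per gene), not real L1000 data
profiles = {
    "drug_X": [1.0, 2.0, 3.0, 4.0, 5.0],
    "drug_Y": [2.0, 4.0, 6.0, 8.0, 10.0],
    "drug_Z": [5.0, 4.0, 3.0, 2.0, 1.0],
}
transcriptional_similarity = {
    (a, b): spearman(profiles[a], profiles[b])
    for a, b in combinations(sorted(profiles), 2)
}
```

Because Spearman correlation depends only on ranks, `drug_X` and `drug_Y` are perfectly correlated here even though their values differ in scale.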

Figure 1: Transcriptional similarity versus chemical similarity for each drug pair

Rounded chemical similarity pointplot:

To visualize the correlation a different way, we rounded the chemical similarity to the nearest tenth and calculated the mean transcriptional similarity for each resulting bin. We graphed these means on a pointplot, which showed a clear positive correlation, especially once the chemical similarity reached 0.5. Beyond this point, the correlation was even stronger, with the mean increasing steadily as the chemical similarity approached 1.0. Note that as the chemical similarity increased, fewer and fewer drug pairs contributed to each bin. In fact, the last bin, where chemical similarity was 1.0, contained only a single drug pair (as seen in the table of rounded values shown in the cell before the pointplot).
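The binning step can be sketched as follows, assuming the data are available as (chemical similarity, transcriptional similarity) tuples. The function name `mean_by_rounded_similarity` and the toy values are hypothetical. Reporting the number of pairs alongside each mean makes sparsely populated bins easy to spot.

```python
from collections import defaultdict


def mean_by_rounded_similarity(pairs):
    """Group (chemical, transcriptional) pairs by chemical similarity rounded to 0.1.

    Returns {bin: (mean transcriptional similarity, number of pairs)}.
    """
    grouped = defaultdict(list)
    for chemical, transcriptional in pairs:
        grouped[round(chemical, 1)].append(transcriptional)
    return {
        bin_value: (sum(values) / len(values), len(values))
        for bin_value, values in sorted(grouped.items())
    }


# Hypothetical (chemical, transcriptional) similarity pairs, not real results
toy_pairs = [
    (0.12, 0.00),
    (0.14, 0.02),
    (0.58, 0.10),
    (0.61, 0.20),
    (0.97, 0.60),
]
summary = mean_by_rounded_similarity(toy_pairs)
```

In this toy data, the highest bin is backed by a single pair, so its mean should be read with the same caution as the top bin in the pointplot.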

Figure 2: Relationship between the chemical similarity of compounds and the mean transcriptional similarity

Daniel Himmelstein: Just came across this publication [1], which analyzed how chemical similarity correlated with transcriptional similarity in LINCS L1000.

Daniel Himmelstein Researcher Aug. 4, 2015

Nice analysis. Chemical similarity appears to be a weak predictor of transcriptional similarity (Figure 1, $$\rho = 0.02, p = 10 ^ {-66}$$). However, this correlation is highly influenced by the majority of compound pairs where chemical similarity is less than 0.5. As we have previously noticed, chemical similarity becomes predictive of other types of similarity above 0.5. As seen in Figure 2, the same trend applies to transcriptional similarity. Therefore, within the meaningful range of chemical similarity values, the association looks stronger.

Nice catch that the highest bin (chemical similarity ≥ 0.95) only has a single compound pair. This is due to our compound selection criteria, which aim to avoid redundancy. We have computed chemical similarities for the entire LINCS L1000 perturbation set, so we could rerun this analysis with all perturbagens.

Status: Completed

Topics

DrugBank Structural Similarity Fingerprint Dice coefficient Molecular Similarity Similarity Chemical Similarity

Referenced by

Prediction in epilepsy
Research report: Rephetio: Repurposing drugs on a hetnet

Cite this as

Daniel Himmelstein, Sabrina Chen (2015) Calculating molecular similarities between DrugBank compounds. Thinklab. doi:10.15363/thinklab.d70
