## Seeking an open source implementation of the Gerstein-Sonnhammer-Chothia Algorithm

I am looking for a method to weight features based on their uniqueness: I want to downweight redundant features and upweight distinct features.

The specific problem is weighting side effects when calculating side effect similarity between two drugs. The initial implementation of this analysis [1] used the Gerstein-Sonnhammer-Chothia Algorithm [2]. The initial implementation describes their method:

Not all side effects are independent of each other; for example, 90% of drugs that cause nausea also cause vomiting. We correct for this redundancy by weighting side effects in a manner analogous to the down-weighting of similar protein sequences within multiple alignments [2] (Fig. S1C). In order to determine the correlation weight, the correlation of side effects was determined by clustering all side effects according to their assigned drugs using a Tanimoto/Jacquard score to compute a distance matrix: The distance between two drugs was calculated by dividing the number of drugs that feature both side effects by the number of drugs that have either side effect associated. The Gerstein–Sonnhammer–Chothia algorithm [2] was used to compute weights based on a hierarchal clustering with the aforementioned distance matrix [3].

I would like to perform an identical analysis, while making the code and results public. The analysis follows the following steps:

1. Calculating pairwise side effect correlations.
2. Performing hierarchical clustering of side effects to produce a dendrogram
3. Inputting the denrogram into the GSC algorithm to calculate side effect weights.

I am looking for an implementation of step 3, that it easy to integrate with implementations of steps 1 and 2. My preferred languages are R, python, and julia (in that order). The author should be willing to post the code publicly under an open source license.

An in detail description of the GSC algorithm can be found in the paper appendix titled, A Method to Weight Protein Sequences to Correct for Unequal Representation.

Also, if you know of a simpler or superior method for accomplishing this weighting task, please suggest.

Hey Daniel,

This should do it:
https://github.com/antoine-lizee/R-GSC/blob/master/GSC.R

It uses dendrogram objects in R.

Once you get it, the implementation is quite straightforward and short, so it doesn't really justify creating a package. Everything in one file, you can just source it.
Have a go at it and tell me what you think.

Best,
Antoine

PS: the attached reference that explains the algorithm has its main example wrong because of the low (and surprisingly inconsistent) precision they use to do the maths.

Daniel Himmelstein Researcher

@alizee, thanks for the open source implementation — you just spared scientists all over the pain of unnecessary reimplementation.

Here is the code I wrote, which calls your GSC function

GitHubScript <- function(...) {
# Source a script from GitHub
library(RCurl)
github.url <- file.path('https://raw.githubusercontent.com', ...)
script <- RCurl::getURL(github.url)
eval(parse(text = script), envir = .GlobalEnv)
}

UnderrepresentationWeight <- function(mat) {
# Returns underrepresentation weight for each column
col.dist <- stats::dist(t(mat), method = 'binary')
col.clust <- hclust(col.dist, method = 'ward.D2')
col.dendro <- as.dendrogram(col.clust)
GitHubScript('antoine-lizee', 'R-GSC', 'master', 'GSC.R')
GSC(col.dendro)
}

I have come accross two errors for my use cases

The following code (which computes only the first 100 side effects for time) returns a vector of only NaN:

underrep.ind.vec <- UnderrepresentationWeight(se.mat[, 1:100])

On a slightly larger matrix, I also got a maximum recursion depth error.

se.mat is a matrix where rows are compounds and columns are side effects. An element is 0 for no relationship and 1 for side effect. You can load se.mat from this file.

Hi Daniel,

I solved your first "Nan" problem. Below is a short explanation, and on the repository I added a file in which I investigate the problem.

The problem came from the fact that some of your vectors of features in your original matrix were exactly the same, i.e. some of your compounds have the exact same side effects. This resulted in a computed distance of zero between these two objects, and the clustering is putting them at the bottom of the dendrogram, at height 0. The algorithm wasn't designed for such an edge case and a division by 0 was giving you the NaN. I slightly changed the algorithm to work even with this edge case.
I quickly tested that the coefficients at this edge case were roughly the limit of the coefficients when the two vectors are getting similar, which shows that the handling of this edge case is appropriate.

I am looking into some profiling and making it work for bigger matrices, so I'll update you regarding your other problem soon.

Best,
Antoine

• Daniel Himmelstein: Okay, I made some code organization and documentation changes. My first github fork and pull request (:

• Antoine Lizee: Sorry, I saw your request too late to make a merge efficient, but I implemented your small formal changes that improve readability. Thanks!

Hey,

I think the code is ready now.

In addition to the fix above that let the algorithm run on datasets with objects that have the same representation, I added some routine check to increase the maximum recursive depth of R when calling the function. I used a robust adaptive approach to increase this recursion limit based on the estimated number of elements in the dendrogram, see the code.

I also did some extensive testing and profiling in the dhimmel file mentioned earlier. As expected, the algorithm is very cheap: execution time is linear in the number of elements of the dendrogram, with roughly 10ms / kElements. I tried to half the stack depth by writing a more verbose version of the GSC algorithm, but it didn't change performance, so I kept the first more elegant version.

Enjoy!

Status: Completed
Views
184
Topics
Cite this as
Daniel Himmelstein, Antoine Lizee (2015) Seeking an open source implementation of the Gerstein-Sonnhammer-Chothia Algorithm. Thinklab. doi:10.15363/thinklab.d25