Project:
Rephetio: Repurposing drugs on a hetnet [rephetio]

Transforming DWPCs for hetnet edge prediction


Our approach for hetnet edge prediction models the relationship between two nodes by extracting degree-weighted path counts (DWPCs) [1]. We use DWPCs, each corresponding to a different metapath, as the main features for a logistic regression classifier. Here we will investigate whether DWPCs should be transformed prior to being used as predictors.
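
For readers unfamiliar with the setup, here's a toy sketch of what that model looks like; the data and column names below are made up for illustration and are not part of the actual pipeline.

# Toy sketch of the modeling setup: one DWPC column per metapath, one row per
# compound-disease pair, and a logistic regression on those features.
# All values here are simulated placeholders.
set.seed(0)
n <- 500
toy <- data.frame(
  dwpc_metapath_1 = rexp(n, rate = 50),  # stand-in DWPC feature
  dwpc_metapath_2 = rexp(n, rate = 20)   # stand-in DWPC feature
)
toy$treats <- rbinom(n, 1, plogis(-2 + 20 * toy$dwpc_metapath_1))

fit <- glm(treats ~ dwpc_metapath_1 + dwpc_metapath_2,
           data = toy, family = binomial)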

Below, we show the distribution of DWPCs for randomly selected metapaths, stratified by the percent of non-zero values (notebook). We look at three metapaths for each non-zero quintile. These distributions are calculated from all positives (Compound–treats–Disease pairs) but only a small subset of negatives (four times the number of positives). Since positives tend to be more connected than negatives, we expect the distribution for all compound–disease pairs to be even sparser.

Raw DWPC distributions

Note that the y-axis (histogram counts) is heavily transformed. The DWPC distribution is zero-inflated. The non-zero portion of the distribution has a long right tail and looks potentially lognormal.

My concern is that these long-tailed distributions are suboptimal for linear modeling. For example, they lead to very few extremely high predictions at the expense of all other predictions. We observed this trend when we used a DWPC approach for predicting gene–disease associations.

This discussion will look into whether transforming DWPCs makes sense.

Inverse hyperbolic sine transformation

One option is to use the inverse hyperbolic sine (IHS) transformation [1, 2]. The IHS transformation has nice properties. Foremost, it's zero preserving and easy to implement. It has a single parameter, θ, that controls to what extent values are pulled towards 0. Here's an R implementation:

ihs_transform <- function(x, theta = 1) {
  # Inverse Hyperbolic Sine transformation
  return(asinh(theta * x) / theta)
}

This implementation can be easily ported to Python by replacing asinh with numpy.arcsinh.

Many applications, where values extend well into the natural-number range, will do fine with the default θ = 1. However, since DWPCs tend to be very small numbers, the IHS transformation will have a negligible effect unless we increase θ.
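
As a quick illustration using the ihs_transform function above (the specific values are made up, chosen to be on the scale of typical DWPCs):

# Made-up values on the scale of typical DWPCs (illustration only)
dwpc <- c(0, 0.001, 0.01, 0.05, 0.2)

# With the default theta, asinh is nearly linear here, so the transform barely changes anything
ihs_transform(dwpc, theta = 1)

# A larger theta compresses the long tail and pulls values towards zero
ihs_transform(dwpc, theta = 100)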

Others have discussed choosing θ to achieve normality using maximum likelihood. There is also a recent R package ihs that could be useful.

So the question becomes: what exactly do we want our transformation to do? Should we base the fitting of θ only on the non-zero DWPCs or on all DWPCs? Should we use an efficient and simple heuristic to fit θ, or should we go with a more intensive likelihood method?
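
To make the normality-based option concrete, here is one possible sketch using the ihs_transform function above; this is just an illustration of the idea, not a method we've adopted, and the Shapiro–Wilk criterion is my own stand-in for a proper maximum likelihood fit:

# Possible heuristic (an assumption, not a settled choice): pick the theta that
# maximizes the normality of the transformed non-zero DWPCs, as measured by the
# Shapiro-Wilk statistic.
fit_theta <- function(x, interval = c(1e-3, 1e6)) {
  x_nonzero <- x[x > 0]
  # shapiro.test() accepts at most 5000 values, so subsample if necessary
  if (length(x_nonzero) > 5000) x_nonzero <- sample(x_nonzero, 5000)
  objective <- function(theta) shapiro.test(ihs_transform(x_nonzero, theta))$statistic
  optimize(objective, interval = interval, maximum = TRUE)$maximum
}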

Overall I'd say arcsinh is a fine function. But to achieve your stated goals, I wonder why you didn't just use a log transformation?

To achieve goals
1. zero preserving
2. easy to implement

x -> log(x + 1)

To achieve a third goal
3. has a "theta" value that could be used to "pull" values toward zero

x -> log(ax + 1) / a

The arsinh, as you know, is a linear derivative of the exponential function. And thus the inverse of this is a derivative of the log. So I guess it's unclear why you chose a derivative rather than the actual.

  • Antoine Lizee: @pouyakhankhanian I find your last remark interesting, but unfortunately not quite accurate. I'll try to show it below (notes are not good for math...)

Log versus IHS transformation

@pouyakhankhanian, originally I stayed away from log1p because it's practically linear across the range of our DWPCs. You bring up a good point regarding scaling DWPCs prior to transformation.

Yesterday, @alizee and I looked into scaling DWPCs before transformation. This would eliminate the need to fit θ: we could use θ = 1 for both the log and IHS transforms. For example, if x is the vector of DWPCs for a single feature, we could transform using:

# Standard deviation scaling
x_scale <- sd(x)

# Median absolute deviation scaling (deviations taken from the mean)
x_scale <- mad(x, center = mean(x))

# Mean scaling
x_scale <- mean(x)

# Scale the DWPC vector
x_scaled <- x / x_scale

# Inverse hyperbolic sine transform
asinh(x_scaled)

# Log transform
log1p(x_scaled)
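
Wrapped into a small helper, the idea looks like this (the function name and defaults are a sketch on my part, not what the notebook uses):

# Per-feature scale-then-transform helper (name and defaults are assumptions)
transform_dwpc <- function(x, scaler = mean, transformer = asinh) {
  transformer(x / scaler(x))
}

# Example: mean scaling followed by the IHS transform with theta = 1
# transformed <- transform_dwpc(dwpc_vector)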

I think we should choose between the log and IHS methods based on which gives better performance. Regarding choosing a derivative rather than the actual, I don't view one as inherently superior, especially since the IHS has better transformation properties than the log, such as handling negatives (although this isn't an issue here).

I think we should also use performance to choose the scaling method, but with a preference for standard deviation scaling since @alizee considers it the most versatile method.

A remark on asinh, log1p and derivatives

[The aim of this post is mainly to respond to @pouyakhankhanian regarding his post above]

@pouyakhankhanian: I find your last remark interesting, but unfortunately not quite accurate as currently stated.

(i) $$\sinh$$ is not a linear derivative of the exponential function, but a simple linear combination of exponentials: $$ \sinh(x) = \frac{1}{2}(e^x - e^{-x}) $$.

(ii) Further, your second sentence has no grounding: the inverse of a derivative of a function $$f$$ is definitely not the derivative of the inverse of the same $$f$$... So even if (i) were right, the result wouldn't hold.

To see more clearly that $$log1p$$ and $$asinh$$ are not derivatives of each other, we can look at their analytical formulae:

$$$ \begin{align} \text{log1p}(x) &= \log(1 + x)\\ \sinh^{-1}(x) &= \log(x + \sqrt{1 + x^2}) \end{align} $$$

On a related note, we see here that both functions are very similar. This similarity is even clearer when looking at their derivatives:

$$$ \begin{align} \text{log1p}'(x) &= \frac{1}{1 + x}\\ (\sinh^{-1})'(x) &= \frac{1}{\sqrt{1 + x^2}} \end{align} $$$

When plotted, the trend is clear. The derivative of $$\text{asinh}$$ takes some time to decay like $$ \frac{1}{x} $$, thus expanding the beginning of the range (for $$x \in [0, 2]$$) more than its counterpart.
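
For anyone who wants to reproduce the plot, a quick base R sketch (the plotting choices are arbitrary):

# Compare the derivatives of log1p and asinh over the start of the range
x <- seq(0, 2, by = 0.01)
plot(x, 1 / sqrt(1 + x^2), type = "l", ylim = c(0, 1),
     xlab = "x", ylab = "derivative")
lines(x, 1 / (1 + x), lty = 2)
legend("topright", legend = c("(asinh)'", "(log1p)'"), lty = c(1, 2))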

  • Pouya Khankhanian: To be clear, I didn't mean "derivative" in the strict mathematical sense (i.e. d/dx f(x)). I suppose a better word would be a "relative" or "derivation", so please excuse the language. And I do agree that both functions are very similar; that was the point I was trying to make. I think Daniel understood the spirit of my comment and has addressed it appropriately.

  • Antoine Lizee: I thought it was maybe the case, but also that clarification might be useful in general.

Transformation sweep

I ran a parameter sweep of transformation options on the all-features dataset (notebook). The sweep used 10-fold cross-validation to select the regularization strength and summarized results over 10 cross-validation random seeds. "All features" refers to the dataset that includes all metapaths as features but covers only a portion of compound–disease pairs. These findings are not guaranteed to hold for the all-observations dataset, which we use for predictions.
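
For orientation, the per-seed evaluation is conceptually along the following lines; cv.glmnet and the toy data are stand-ins of mine, and the actual notebook code differs:

library(glmnet)

# Toy stand-ins for the feature matrix and labels (not real data)
set.seed(0)
X <- matrix(rexp(1000 * 5, rate = 20), ncol = 5)
y <- rbinom(1000, 1, plogis(-2 + 30 * X[, 1]))

# One transformation option from the sweep: mean-scale each column, then asinh
X_transformed <- asinh(sweep(X, 2, colMeans(X), "/"))

# For each seed, select the regularization strength by 10-fold cross-validation
cv_auc <- sapply(1:10, function(seed) {
  set.seed(seed)
  fit <- cv.glmnet(X_transformed, y, family = "binomial",
                   type.measure = "auc", nfolds = 10)
  max(fit$cvm)  # cross-validated AUC at the best lambda
})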

The performance results (table) show that it's important to transform degrees and DWPCs. As @alizee hypothesized, transforming with log1p versus asinh didn't make a big difference. If, for simplicity, we choose the same transformation for degrees and DWPCs, then asinh ranked higher than log1p. Regarding DWPC scaling, the mean appears to perform better than the standard deviation, although the difference is small. I know @alizee thinks the mean is less standard for this purpose, but it seems intuitive for DWPCs, which start at zero.

Hence, I plan to proceed by asinh-transforming degrees, and by mean-scaling then asinh-transforming DWPCs.
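
In code, the chosen preprocessing boils down to something like the following (the function names are mine, not the notebook's):

# Chosen preprocessing, sketched (function names are assumptions)
transform_degree <- function(degree) {
  asinh(degree)
}

transform_dwpc_final <- function(dwpc) {
  # mean-scale each DWPC feature, then apply the IHS transform with theta = 1
  asinh(dwpc / mean(dwpc))
}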

 
Status: In Progress
Cite this as
Daniel Himmelstein, Pouya Khankhanian, Antoine Lizee (2016) Transforming DWPCs for hetnet edge prediction. Thinklab. doi:10.15363/thinklab.d193