experiments: cancer mutation data from COSMIC[4, 5] and GWAS data from DistiLD
We did a preliminary processing of the integrated dataset, which yielded 81,499 gene–disease relationships for DO Slim diseases (notebook, download). Filtering for scores ≥ 3 resulted in 2,441 relationships.
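For readers who want to see the thresholding step spelled out, here is a minimal sketch. It is not the code from the notebook: the `Association` record, the `filter_by_score` helper, and the example rows are hypothetical and only illustrate keeping relationships whose confidence score is at least 3.

```python
from typing import NamedTuple


class Association(NamedTuple):
    """Hypothetical record for one gene-disease relationship."""
    gene: str
    disease: str
    score: float  # confidence score for the relationship


def filter_by_score(associations, min_score=3.0):
    """Keep only associations whose score is at least min_score."""
    return [a for a in associations if a.score >= min_score]


# Made-up rows for illustration; these are not real DISEASES entries.
demo = [
    Association("GENE_A", "disease X", 4.2),
    Association("GENE_B", "disease X", 2.9),
    Association("GENE_C", "disease Y", 3.0),
]

kept = filter_by_score(demo)
print(len(kept))  # 2
```

The comparison is inclusive, so a relationship scoring exactly 3 is retained, matching the "≥ 3" cutoff.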
@larsjuhljensen, are scores in DISEASES comparable between datasets? In other words, are confidence scores standardized to a common gold standard?
Regarding the scores, they are designed to be as comparable as we could make them; however, it was not possible to do so purely through benchmarking, since a high-quality unbiased benchmark set does not exist.
If you already have GWAS data from another source, I would exclude DistiLD too. You already import mutation data from e.g. COSMIC, so I would exclude the experiments channel entirely. This also makes comparability of scores much less of an issue, since you're left with only automatically text-mined associations, which are scored the same way as tissue associations, and manually curated associations, which are inherently highly reliable.
We visualized the relationships between scores on the full dataset. The off-diagonal plots show a 2D histogram, using hexagonal bins. The diagonal of the grid contains 1D histograms for the x-variable. Bin counts for all panels are log-transformed.
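A grid like the one described can be sketched with matplotlib as follows. This is an illustration under assumptions, not the code that produced our figure: the `score_grid` function, the channel names, and the random demo scores are all hypothetical, and the sketch assumes every gene–disease pair has a score in every channel.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np


def score_grid(scores, gridsize=30):
    """Draw a pairwise grid for a dict of {channel name: 1D score array}.

    Off-diagonal panels are hexagonally binned 2D histograms and
    diagonal panels are 1D histograms of the x-variable. Counts are
    shown on a log scale in all panels.
    """
    names = list(scores)
    n = len(names)
    fig, axes = plt.subplots(n, n, figsize=(3 * n, 3 * n), squeeze=False)
    for i, y_name in enumerate(names):
        for j, x_name in enumerate(names):
            ax = axes[i, j]
            if i == j:
                # Diagonal: 1D histogram with a log-scaled count axis.
                ax.hist(scores[x_name], bins=gridsize, log=True)
            else:
                # Off-diagonal: hexagonal bins with log-scaled colors;
                # mincnt=1 leaves empty hexagons blank.
                ax.hexbin(scores[x_name], scores[y_name],
                          gridsize=gridsize, bins="log", mincnt=1)
            if i == n - 1:
                ax.set_xlabel(x_name)
            if j == 0:
                ax.set_ylabel(y_name)
    fig.tight_layout()
    return fig, axes


# Random demo scores; the channel names are placeholders.
rng = np.random.default_rng(0)
demo_scores = {
    "textmining": rng.uniform(0, 5, 500),
    "knowledge": rng.uniform(0, 5, 500),
    "experiments": rng.uniform(0, 5, 500),
}
fig, axes = score_grid(demo_scores)
```

Calling `fig.savefig("score_grid.png")` afterwards would write the grid to disk.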
S. A. Forbes, D. Beare, P. Gunasekaran, K. Leung, N. Bindal, H. Boutselakis, M. Ding, S. Bamford, C. Cole, S. Ward, C. Y. Kok, M. Jia, T. De, J. W. Teague, M. R. Stratton, U. McDermott, P. J. Campbell (2014) Nucleic Acids Research. doi:10.1093/nar/gku1075