Diseases in DisGeNET are identified with UMLS identifiers. We were able to map 125 out of 137 of our DO Slim diseases.
Data format suggestion
The datasets on the download page are gzipped tarballs but only contain a single text file. Using zless or pandas.read_table() on the tar.gz file led to strange behavior. I ended up extracting the file from the tarball and then gzipping it again to reduce filesize (new file).
@janispi, would it make sense to remove the tarball at your end and go with a plain .txt.gz or .tsv.gz extension?
You should not be having problems, but maybe you could just untar the file and then load it? Let me know how it goes; if there is any issue, we will take care of it.
I ended up doing the following steps to strip out the tarball:
tar -xzf all_gene_disease_associations.tar.gz
The procedure isn't difficult, but you could save users some time by doing away with the tarball, since it only contains a single file.
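For anyone still working from one of the older tarballs, the same repacking can be scripted. The following is only a sketch using the Python standard library: the function name `tarball_to_gzip` and the output path are my own choices, and it assumes the archive holds exactly one regular file.

```python
import gzip
import shutil
import tarfile


def tarball_to_gzip(tar_path, gz_path):
    """Re-pack the single file inside a .tar.gz as a plain .gz file.

    Returns the name of the archive member that was re-packed.
    """
    with tarfile.open(tar_path, mode="r:gz") as archive:
        # Keep only regular files; directories and links are ignored.
        members = [m for m in archive.getmembers() if m.isfile()]
        if len(members) != 1:
            raise ValueError(f"expected one file, found {len(members)}")
        source = archive.extractfile(members[0])
        # Stream the member straight into the new gzip file.
        with source, gzip.open(gz_path, "wb") as target:
            shutil.copyfileobj(source, target)
    return members[0].name
```

The resulting `.gz` file holds the text file directly, so it can be opened with zless or any reader that handles gzip, without the tar layer.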
janet piñero: You are absolutely right. I have removed the tar from the files, and now I just gzip them.
Choosing a score threshold
DisGeNET includes a score for reported gene–disease relationships, described as:
The score ranges from 0 to 1 and is computed according to the formula described in ‘Methods’ section. The DisGeNET score allows obtaining a ranking of GDAs and a straightforward classification of curated vs predicted vs literature-based associations since it stratifies the associations based on their level of evidence. For instance, associations only reported by UniProt or CTD, which have been curated by experts, have higher scores (i.e. associations with S ≥ 0.3) than those only supported by animal models or text-mining based sources.
We will need to choose a minimum threshold for edge inclusion in our network. @janispi, can you give us some more information regarding scores? Specifically,
how do scores correspond to precision (the probability of the relationship being real)?
what is a reasonable cutoff to eliminate junk? Does any relationship with score > 0 already have acceptable confidence?
We would like a permissive threshold, allowing up to a ~30% false discovery rate.
janet piñero: I was waiting for this question. All GDAs have score > 0. If you choose score >= 0.06, then you will be including associations reported by curated sources, or having animal models supporting them, or being reported by several papers (20–200). It will not be permissive, though (less than 10% of GDAs satisfy this criterion). Maybe you could start with this score, and see how it goes.
DisGeNET uses a different nomenclature. 'Association' refers to all disease–gene relationships while 'genetic variation' is more in line with what we call 'association'.
Should we continue to call our GWAS edge 'association' and put DisGeNET into our 'function' edge? Or we could rename 'function' to 'relationship' to be more general? Or we could switch our GWAS edge to 'variation'.
I would recommend using the same criteria as in DisGeNET. "Genetic Variation" would be equivalent to GWAS.
Preliminary processing complete
We processed DisGeNET by converting to DO Slim diseases (notebook, download). We used propagated mappings, so for example relationships with relapsing-remitting multiple sclerosis would be included for multiple sclerosis.
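To make "propagated mappings" concrete, here is a minimal sketch of the idea, not the code from the notebook. It assumes the ontology is available as a child-to-parents dictionary; the `parents` table below is a tiny illustrative fragment and `propagate_to_slim` is a hypothetical helper name.

```python
def propagate_to_slim(term, parents, slim_terms):
    """Return every slim term reachable from `term` by walking up parent links.

    `term` itself counts if it is already a slim term.
    """
    found, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim_terms:
            found.add(current)
        # Continue upward; a term may have more than one parent.
        stack.extend(parents.get(current, ()))
    return found


# Illustrative fragment only, not the real ontology.
parents = {
    "relapsing-remitting multiple sclerosis": ["multiple sclerosis"],
    "multiple sclerosis": ["demyelinating disease"],
}
slim_terms = {"multiple sclerosis"}

slim_hits = propagate_to_slim(
    "relapsing-remitting multiple sclerosis", parents, slim_terms
)
```

With this fragment, an association annotated to the relapsing-remitting subtype is credited to the slim term multiple sclerosis, which is the behavior described above.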
The result was 82,833 gene–disease associations. After filtering for scores ≥ 0.06, 7,779 associations remained with large variability in the number of associations per disease. Additionally, many of the associations appear to be 'genetic variation' edges, which may be captured by our GWAS edge. As a reminder, the 0.06 score threshold includes the following (thanks @janispi):
If you choose score ≥ 0.06, then you will be including associations reported by curated sources, or having animal models supporting them, or being reported by several papers (20–200).
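The thresholding and the per-disease tally can be sketched as follows. This is an illustration under assumptions rather than the notebook code: the column names `gene`, `disease` and `score` are placeholders for whatever the processed file actually uses, and the three sample rows are made up.

```python
import csv
import io
from collections import Counter


def filter_by_score(rows, threshold=0.06, score_field="score"):
    """Keep rows whose score is at or above the threshold."""
    return [row for row in rows if float(row[score_field]) >= threshold]


def associations_per_disease(rows, disease_field="disease"):
    """Count how many associations each disease has."""
    return Counter(row[disease_field] for row in rows)


# Made-up rows standing in for the processed tab-separated file.
sample = io.StringIO(
    "gene\tdisease\tscore\n"
    "G1\tmultiple sclerosis\t0.31\n"
    "G2\tmultiple sclerosis\t0.06\n"
    "G3\tasthma\t0.01\n"
)
rows = list(csv.DictReader(sample, delimiter="\t"))

kept = filter_by_score(rows)
counts = associations_per_disease(kept)
```

Keeping the threshold as a parameter makes it easy to rerun the tally at other cutoffs and see how the per-disease variability changes.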
We mapped DO Slim terms to DisGeNET using UMLS cross-references. The UMLS cross-references in the DO were often non-exact, so one DO term would reference many UMLS terms. Several UMLS terms referenced by the DO were not in DisGeNET.
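The cross-reference step amounts to intersecting each DO term's UMLS references with the set of UMLS identifiers present in DisGeNET. A rough sketch, assuming the cross-references have already been loaded into a dictionary; the identifiers below are made-up placeholders and `map_slim_to_disgenet` is a hypothetical name.

```python
def map_slim_to_disgenet(do_to_umls, disgenet_umls):
    """Split DO terms into those with at least one UMLS xref in DisGeNET
    and those with none.

    Returns (mapped, unmapped): `mapped` maps each DO term to its sorted
    list of matching UMLS identifiers; `unmapped` lists the remaining terms.
    """
    mapped, unmapped = {}, []
    for do_term, umls_ids in do_to_umls.items():
        hits = sorted(set(umls_ids) & disgenet_umls)
        if hits:
            mapped[do_term] = hits
        else:
            unmapped.append(do_term)
    return mapped, unmapped


# Placeholder identifiers, for illustration only.
do_to_umls = {
    "DOID:A": ["C000001", "C000002", "C000003"],  # one DO term, many UMLS xrefs
    "DOID:B": ["C000004"],                        # xref absent from DisGeNET
}
disgenet_umls = {"C000001", "C000003", "C000009"}

mapped, unmapped = map_slim_to_disgenet(do_to_umls, disgenet_umls)
```

Terms that end up in `unmapped` correspond to the DO Slim diseases we could not map.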