UniChem Mapping to LINCS Small Molecules

Leo Brueggeman, Daniel Himmelstein, Caty Chung

doi:10.15363/thinklab.d51

Leo Brueggeman Researcher April 9, 2015

While mapping DrugBank compounds to LINCS, we have noticed that UniChem's mapping is potentially outdated. UniChem maps to LINCS compound identifiers that begin with LSM- while the current LINCS small molecule identifiers (called pert_id) begin with BRD-.

UniChem search results that match LINCS hyperlink to the LIFE resource, which is not up to date with the released LINCS data (example search and example link).

LIFE contains ~9,000 small molecules versus >20,000 at LINCS, and it does not consistently supply the main pert_id used by LINCS. The LINCS ID used by LIFE doesn't appear in the compound information that is currently obtainable from the LINCS API, so it may be an obsolete ID system that pert_id has replaced.

We will contact UniChem to alert them and will solicit feedback from the LINCS team regarding the ID confusion.

Daniel Himmelstein Researcher April 9, 2015

Perhaps this explains why 39% of DrugBank approved small molecules did not map to a single LINCS compound: the LINCS resource was half its current size.

Until this issue is resolved, we have two workarounds:

1. Identify LINCS small molecules with their PubChem identifiers, which are provided for most compounds.
2. Match LINCS compounds to DrugBank using the provided InChIKeys and UniChem.
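
The InChIKey-based workaround can be illustrated locally with a minimal sketch. It assumes two hypothetical dictionaries that map compound IDs to InChIKeys and joins them on identical keys. This is an exact-match stand-in for what UniChem does as a service, not UniChem's own method, and every ID and InChIKey below is a made-up placeholder.

```python
def match_by_inchikey(lincs, drugbank):
    """Return (pert_id, drugbank_id) pairs that share an identical InChIKey.

    Both arguments are hypothetical {compound_id: inchikey} dictionaries.
    """
    # Index DrugBank compounds by InChIKey; several IDs may share one key.
    by_key = {}
    for drugbank_id, inchikey in drugbank.items():
        by_key.setdefault(inchikey, []).append(drugbank_id)

    pairs = []
    for pert_id, inchikey in lincs.items():
        for drugbank_id in by_key.get(inchikey, []):
            pairs.append((pert_id, drugbank_id))
    return pairs


# Placeholder data, not real compounds.
lincs_example = {"BRD-EXAMPLE-1": "KEY-ONE", "BRD-EXAMPLE-2": "KEY-TWO"}
drugbank_example = {"DB-EXAMPLE-1": "KEY-ONE", "DB-EXAMPLE-9": "KEY-NINE"}
example_pairs = match_by_inchikey(lincs_example, drugbank_example)
print(example_pairs)
```

Compounds without an InChIKey in either source simply produce no pair, which is why this route depends on how completely the keys are supplied.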

Caty Chung April 13, 2015

The BD2K-LINCS Data Coordination and Integration Center is part of the Big Data to Knowledge (BD2K) NIH initiative, and it is the data coordination center for the NIH Common Fund's Library of Integrated Network-based Cellular Signatures (LINCS) program, which aims to characterize how a variety of human cells, tissues and the entire organism respond to perturbations by drugs and other molecular factors.

The BD2K-LINCS DCIC works closely with each of the LINCS Data and Signature Generation Centers (DSGCs). The DCIC also collaborates with other organizations and projects such as EBI (UniChem, ChEMBL, ChEBI) and PubChem.
The LINCS Production Phase (LP2) DSGCs are:
- Drug Toxicity Signature Generation Center (Icahn School of Medicine at Mount Sinai)
- HMS LINCS Center (Harvard Medical School)
- LINCS Center for Transcriptomics (Broad Institute)
- LINCS Proteomic Characterization Center for Signaling and Epigenetics (Broad Institute)
- Microenvironment Perturbagen (MEP) LINCS Center (Oregon Health and Science University)
- NeuroLINCS Center (University of California, Irvine)

The UniChem chemical structure cross-reference is mapped against the standardized LINCS small molecule (LSM) identifiers. The DCIC uses a simple schema to combine LINCS and other data into a coherent and computable knowledge framework. The Center develops metadata standards that enable data integration and representation across the data and signature generation centers (DSGCs). Members of the DCIC are actively developing a next-generation integrated web-based platform for the LINCS project that will serve as the foundation for LINCS activities and federate LINCS data, signatures, analysis algorithms, pipelines, APIs and web tools.

UniChem cross-references the standardized LINCS Small Molecule ID (LSM ID). The LSM IDs are mapped to each center's compound and/or sample identifiers. One example of such Center-specific IDs is the set of identifiers with the prefix "BRD", which correspond to small molecule compound IDs from the LINCS Center for Transcriptomics (Broad Institute).

Caty Chung April 15, 2015

The following tables will help you map between LINCS Center identifiers and LINCS Small Molecule IDs (LSM IDs):
- mapping of LINCS Small Molecules to LINCS Facility ID
- LINCS compound table
- LINCS by SM_Center_Sample_ID

Leo Brueggeman Researcher April 15, 2015

@cchung, thank you for the useful links. In the file you linked, mapping of LINCS Small Molecules to LINCS Facility ID (LincsID2FacilityID_LINCS_StandardizedCmpds_LSMIDs.txt), we found 35,305 BRD- small molecule perturbagen IDs. Separately, we have a smaller set of 20,413 BRD- small molecule perturbagen IDs extracted from the LINCS L1000 API. Comparing the two sets, we found only 13,796 BRD- IDs in common.

This means that many of the BRD- small molecule perturbagen IDs mapped to LSM IDs do not have transcriptional profiles from Broad. Additionally, many compounds that were transcriptionally profiled are not mapped to the LSM ID set.
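
The size of each non-overlapping group follows from the three counts reported above. The sketch below shows the set comparison on placeholder IDs and then applies the same arithmetic to the reported counts. The function and variable names are illustrative and are not taken from our notebooks.

```python
def compare_id_sets(lsm_mapped_ids, l1000_ids):
    """Count IDs shared by, and exclusive to, two collections of BRD- IDs."""
    lsm_mapped_ids = set(lsm_mapped_ids)
    l1000_ids = set(l1000_ids)
    return {
        "common": len(lsm_mapped_ids & l1000_ids),
        "only_lsm_mapping": len(lsm_mapped_ids - l1000_ids),
        "only_l1000": len(l1000_ids - lsm_mapped_ids),
    }


# Placeholder IDs to show the comparison itself.
toy_counts = compare_id_sets(
    ["BRD-A", "BRD-B", "BRD-C"],
    ["BRD-B", "BRD-C", "BRD-D", "BRD-E"],
)
print(toy_counts)

# Counts reported in this thread.
n_lsm_mapping = 35305  # BRD- IDs in the LSM mapping file
n_l1000 = 20413        # BRD- IDs from the L1000 API
n_common = 13796       # BRD- IDs present in both

only_lsm_mapping = n_lsm_mapping - n_common  # mapped to LSM, absent from L1000
only_l1000 = n_l1000 - n_common              # in L1000, absent from the LSM mapping
print(only_lsm_mapping, only_l1000)  # 21509 6617
```

By this arithmetic, 21,509 mapped BRD- IDs are absent from the L1000 set, and 6,617 L1000 compounds are absent from the LSM mapping.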

Therefore, to integrate BRD compounds, we will rely on the supplied PubChem CIDs, which are available for almost all compounds.

Caty Chung April 16, 2015

The counts differ because the L1000 dataset contains 20,413 small molecules profiled as part of the LINCS program, the Broad Connectivity Map, NIH efforts such as CDRP, and other projects. We are standardizing the remaining compounds and will keep you posted on the release!

Daniel Himmelstein Researcher May 20, 2015

I came across the following paper that has useful information regarding the LINCS data integration standards:

Metadata Standard and Data Exchange Specifications to Describe, Model, and Integrate Complex and Diverse High-Throughput Screening Data from the Library of Integrated Network-based Cellular Signatures (LINCS). [1]

Daniel Himmelstein Researcher March 7, 2016

Method for mapping L1000 compounds to external vocabularies

We chose to map LINCS L1000 compounds to external vocabularies by querying UniChem with the InChIKey of each L1000 compound (the second workaround above). This approach enabled us to map L1000 compounds not only to DrugBank, but also to the other vocabularies covered by UniChem.

The InChIKey for each L1000 compound was retrieved from the L1000 API (notebook). Only perturbations with pert_type == 'trt_cp' and a non-null inchi_key were mapped. We queried the UniChem API for each L1000 InChIKey to retrieve matches (notebook).
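
The selection rule can be sketched as follows, assuming each perturbation is a dictionary with the pert_id, pert_type and inchi_key fields named above. The records are made-up placeholders, and this is an illustration of the filter rather than the code in our notebook.

```python
def select_mappable(perturbations):
    """Keep compound treatments (trt_cp) that have a non-null InChIKey."""
    return [
        p for p in perturbations
        if p.get("pert_type") == "trt_cp" and p.get("inchi_key") is not None
    ]


# Placeholder records; IDs and InChIKeys are not real.
example_perturbations = [
    {"pert_id": "BRD-EXAMPLE-1", "pert_type": "trt_cp", "inchi_key": "KEY-ONE"},
    {"pert_id": "BRD-EXAMPLE-2", "pert_type": "trt_cp", "inchi_key": None},
    {"pert_id": "EXAMPLE-3", "pert_type": "trt_sh", "inchi_key": None},
]
mappable = select_mappable(example_perturbations)
print([p["pert_id"] for p in mappable])
```

Only the first record passes: the second is a compound without an InChIKey, and the third is not a compound treatment.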

We used the same UniChem Connectivity Search parameters that we used for mapping DrugBank. Our search permissively matches compounds by atomic structure, ignoring small molecular details [1]. We store all the UniChem output in our SQLite database (l1000.db [2], unichem table), so users can later choose more restrictive parameters without having to requery UniChem.
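
Keeping every match makes it possible to tighten the criteria later with a plain SQL query. The sketch below builds a small in-memory table to show the idea. The column names and rows are hypothetical and do not reflect the actual schema of the unichem table in l1000.db. The stricter rule shown, an identical full InChIKey, is only one example of a more restrictive choice.

```python
import sqlite3

# Hypothetical schema: one row per (query compound, matched compound) pair.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE unichem (
           query_inchikey TEXT,
           source_name TEXT,
           src_compound_id TEXT,
           match_inchikey TEXT
       )"""
)

# Placeholder rows; the InChIKey-shaped strings and IDs are not real.
rows = [
    ("AAAAAAAAAAAAAA-AAAAAAAAAA-N", "drugbank", "DB-EXAMPLE-1",
     "AAAAAAAAAAAAAA-AAAAAAAAAA-N"),
    ("AAAAAAAAAAAAAA-AAAAAAAAAA-N", "drugbank", "DB-EXAMPLE-2",
     "AAAAAAAAAAAAAA-BBBBBBBBBB-N"),
    ("CCCCCCCCCCCCCC-CCCCCCCCCC-N", "chembl", "CHEMBL-EXAMPLE-3",
     "CCCCCCCCCCCCCC-CCCCCCCCCC-N"),
]
conn.executemany("INSERT INTO unichem VALUES (?, ?, ?, ?)", rows)

# Permissive view: every stored match counts.
permissive_count = conn.execute("SELECT COUNT(*) FROM unichem").fetchone()[0]

# Stricter view: keep only matches whose full InChIKey equals the query's.
strict_ids = [
    row[0]
    for row in conn.execute(
        "SELECT src_compound_id FROM unichem "
        "WHERE query_inchikey = match_inchikey "
        "ORDER BY src_compound_id"
    )
]
print(permissive_count, strict_ids)
conn.close()
```

Because the filtering happens on stored rows, switching between the permissive and stricter views requires no further calls to UniChem.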