Integrating drug target information from BindingDB

Daniel Himmelstein, Mike Gilson

doi:10.15363/thinklab.d53

Project:

Rephetio: Repurposing drugs on a hetnet [rephetio]

Publication:

BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities

Daniel Himmelstein Researcher April 13, 2015

We have chosen to rely on BindingDB [1, 2] for compound–protein binding (ligand–target affinity) information. Their website provides a nice background on this topic. BindingDB incorporates data from several other databases, including ChEMBL [3, 4] and PubChem BioAssay [5, 6, 7], and provides well-annotated data in convenient formats.

We have begun parsing the BindingDB_All_2015m3 release and have taken the following steps so far:

Some binding experiments refer to multichain protein complexes. However, 96% (1,073,428 / 1,115,639) of experiments assayed a single-chain protein. For simplicity, we decided to omit binding to complexes.
Of the remaining interactions, 20% (215,107 / 1,073,428) had protein targets that did not map to a SwissProt identifier and were excluded.
Of the remaining interactions, 78% (673,750 / 858,321) were curated with Homo sapiens as the organism. While we will most likely end up using only the human targets, this filtering should naturally occur at a later stage.
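
The percentages in the steps above can be rechecked with a few lines of arithmetic. This is only a sketch: the counts are copied from the text, and the variable names are ours.

```python
# Counts quoted in the steps above (BindingDB_All_2015m3)
total = 1_115_639         # all binding experiments
single_chain = 1_073_428  # experiments that assayed a single-chain protein
unmapped = 215_107        # single-chain targets without a SwissProt identifier
human = 673_750           # mapped interactions curated as Homo sapiens

# Interactions left after dropping complexes and unmapped proteins
mapped = single_chain - unmapped

print(f"single chain: {single_chain / total:.0%}")
print(f"unmapped:     {unmapped / single_chain:.0%}")
print(f"mapped:       {mapped:,}")
print(f"human:        {human / mapped:.0%}")
```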

The next step is translating BindingDB compounds to DrugBank compounds. Many BindingDB compounds may match a single DrugBank compound. Additionally, multiple experiments may report affinities for the same compound–target pair. There are several affinity metrics, all reported in nanomolar units: Ki (nM), IC50 (nM), Kd (nM), and EC50 (nM). We would like to combine all affinities for a DrugBank–EntrezGene pair into a single molarity value (which can then be used as a network inclusion threshold).

We would like input on how to combine affinities across experiments. Are some molarity measurements more precise? Are certain source databases less error-prone? We are looking for simple and rational rules and are willing to accept a healthy dose of reductionism.

Mike Gilson April 14, 2015

Hi Daniel,

Most of the data in BindingDB are 50% inhibitory concentrations (IC50s) and inhibition constants (Kis), with occasional dissociation constants (Kds), all in default units of nanomolar (nM). The IC50 values are regarded as less rigorous measures of binding affinity, as they depend to some degree on the association constant of the enzyme substrate used in measuring the IC50 value. The Ki and Kd values should be somewhat more rigorous. If a given compound and protein target have multiple measurements of different types, I'd probably use them in the following order of preference: Kd over Ki over IC50. That said, most of the data are IC50s, so this case won't arise all that often.
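
To make the assay dependence of IC50 concrete, here is a minimal sketch that assumes simple competitive inhibition, where the Cheng-Prusoff relation Ki = IC50 / (1 + [S] / Km) applies. The function name and the numbers are hypothetical, and nothing in this thread says that BindingDB or our pipeline applies this conversion.

```python
def ki_from_ic50(ic50_nm, substrate_conc, km):
    """Cheng-Prusoff estimate of Ki (nM) for a competitive inhibitor.

    substrate_conc and km must be given in the same units.
    """
    return ic50_nm / (1 + substrate_conc / km)

# Hypothetical example: one inhibitor assayed at two substrate concentrations.
# The IC50 differs between assays, but the implied Ki is the same.
print(ki_from_ic50(200.0, substrate_conc=10.0, km=10.0))  # 100.0
print(ki_from_ic50(400.0, substrate_conc=30.0, km=10.0))  # 100.0
```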

Hope this helps!
Regards,
Mike

Daniel Himmelstein Researcher April 14, 2015

The table below categorizes each binding by the affinity measures reported in BindingDB. Some reports include multiple measures. The count column gives the number of bindings per category after filtering out multichain and unmapped proteins:

IC50 (nM)	Ki (nM)	Kd (nM)	EC50 (nM)	count
-	-	-	-	881
IC50	-	-	-	480,070
-	Ki	-	-	245,475
-	-	Kd	-	55,044
-	-	-	EC50	74,549
IC50	Ki	-	-	985
IC50	-	Kd	-	76
IC50	-	-	EC50	631
-	Ki	Kd	-	4
-	Ki	-	EC50	574
-	-	Kd	EC50	8
IC50	Ki	Kd	-	1
IC50	Ki	-	EC50	23

Definitions for reference:

IC50 - half maximal inhibitory concentration
Kd - dissociation constant
Ki - inhibition constant
EC50 - half maximal effective concentration
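
As a quick check on the table, the sketch below sums the counts and works out how many bindings report more than one measure, which is the only case where a preference order matters. The counts are copied from the table; the tuple encoding of each row is our own.

```python
# (IC50, Ki, Kd, EC50) presence flags -> number of bindings, copied from the table
counts = {
    (0, 0, 0, 0): 881,
    (1, 0, 0, 0): 480_070,
    (0, 1, 0, 0): 245_475,
    (0, 0, 1, 0): 55_044,
    (0, 0, 0, 1): 74_549,
    (1, 1, 0, 0): 985,
    (1, 0, 1, 0): 76,
    (1, 0, 0, 1): 631,
    (0, 1, 1, 0): 4,
    (0, 1, 0, 1): 574,
    (0, 0, 1, 1): 8,
    (1, 1, 1, 0): 1,
    (1, 1, 0, 1): 23,
}

total = sum(counts.values())
# Bindings that report two or more different measures
multi = sum(n for flags, n in counts.items() if sum(flags) > 1)

print(total)                          # 858321
print(multi, f"{multi / total:.2%}")  # 2302 0.27%
```

The total matches the 858,321 interactions that remained after the multichain and SwissProt filters.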

@mkgilson, thanks! What about EC50s, of which there are 74,549 reports? Can we include this information, and if so where should EC50s fall in the preference order?

Also, is it possible to infer whether a ligand is an agonist or antagonist based on which measures are reported?

Mike Gilson April 14, 2015

Hi Dan,

I'd significantly downweight or ignore the EC50s, as these are at greatest risk of not corresponding to a confirmed binding reaction. For example, there might be a compound which binds, or is expected to bind, a particular protein target; but the measurement done is to expose cells to the compound and report the concentration at which it is "50% effective" (hence EC50) in producing a biological effect supposedly due to binding of the protein target.

Unfortunately, agonism/antagonism is not readily available.

Regards,
Mike

Daniel Himmelstein Researcher April 16, 2015

BindingDB Processing

For our network, we desire binding edges between Entrez genes and DrugBank compounds. We coerced BindingDB to conform to our desires using the following steps:

Dataset cleanup and tidying:

downloading and reading BindingDB
removing interactions with multichain complexes or without UniProt protein IDs
converting binding affinities to floats
retrieving Entrez genes corresponding to UniProt proteins (download mapping)
gathering data so rows contain only a single affinity measurement, UniProt protein, and Entrez gene (download tidied data)
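
The float conversion step could look roughly like the sketch below. It assumes that affinity fields are strings that may carry a leading ">" or "<" qualifier and surrounding whitespace; the helper name is hypothetical and this is not necessarily the rule our notebook applies.

```python
import re

# Optional '<' or '>' qualifier, then a plain or scientific-notation number
_NUMBER = re.compile(r"^\s*[<>]?\s*([0-9]*\.?[0-9]+(?:[eE][+-]?[0-9]+)?)\s*$")

def affinity_to_float(value):
    """Return the affinity in nM as a float, or None if no number can be extracted.

    Assumption: a leading '<' or '>' is discarded, so the bound is treated
    as if it were the measured value.
    """
    if value is None:
        return None
    match = _NUMBER.match(value)
    return float(match.group(1)) if match else None

print(affinity_to_float(" 12.5"))   # 12.5
print(affinity_to_float(">10000"))  # 10000.0
print(affinity_to_float("n/a"))     # None
```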

Collapsing BindingDB into compound–gene relationships:

restricting to human interactions
mapping BindingDB compounds to DrugBank (download fuzzy mapping)
resolving multiple affinities for the same BindingDB–UniProt pair by preferentially selecting Kd over Ki over IC50, and taking the geometric mean when the selected measure had multiple measurements (download)
collapsing into DrugBank–gene pairs by taking the minimum affinity value (in nM, i.e. the strongest reported binding) across the grouped BindingDB–UniProt pairs (download)

The resulting DrugBank–gene dataset contained 21,617 interactions. Setting an affinity threshold at 1 micromolar (1000 nanomolar), a threshold suggested by both @mkgilson and @alessandrodidonna in conversation, retained ~20% of interactions. After thresholding at 1 micromolar, 890 genes and 1,634 DrugBank compounds participated in 5,701 binding interactions.
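
The collapsing logic described in this post can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code we ran: the row layout, function names, and toy identifiers are hypothetical, EC50 values are ignored outright, and the threshold is treated as inclusive.

```python
from collections import defaultdict
from math import exp, log

PREFERENCE = ["Kd", "Ki", "IC50"]  # preference order from the thread; EC50 ignored

def geometric_mean(values):
    return exp(sum(log(v) for v in values) / len(values))

def resolve_pair(measurements):
    """Pick one affinity for a compound-protein pair.

    measurements: list of (measure, affinity_nM). Returns (measure, affinity_nM)
    for the most preferred measure present, or None if only other measures exist.
    """
    by_measure = defaultdict(list)
    for measure, affinity in measurements:
        by_measure[measure].append(affinity)
    for measure in PREFERENCE:
        if by_measure[measure]:
            return measure, geometric_mean(by_measure[measure])
    return None

def collapse(rows, threshold_nm=1000.0):
    """rows: (bindingdb_id, uniprot_id, drugbank_id, gene_id, measure, affinity_nM).

    Returns {(drugbank_id, gene_id): affinity_nM} for pairs at or below threshold.
    """
    pairs = defaultdict(list)
    group_of = {}
    for bdb, uniprot, drugbank, gene, measure, affinity in rows:
        pairs[(bdb, uniprot)].append((measure, affinity))
        group_of[(bdb, uniprot)] = (drugbank, gene)
    best = {}
    for key, measurements in pairs.items():
        resolved = resolve_pair(measurements)
        if resolved is None:
            continue
        target = group_of[key]
        # Minimum nM value across grouped pairs, i.e. the strongest binding
        best[target] = min(best.get(target, float("inf")), resolved[1])
    return {k: v for k, v in best.items() if v <= threshold_nm}

# Toy rows with made-up identifiers
rows = [
    ("B1", "P1", "DB1", "G1", "IC50", 10.0),   # outranked by the Ki values
    ("B1", "P1", "DB1", "G1", "Ki", 100.0),
    ("B1", "P1", "DB1", "G1", "Ki", 400.0),    # geometric mean with 100 is ~200
    ("B2", "P1", "DB1", "G1", "IC50", 50.0),   # second compound mapped to DB1
    ("B3", "P2", "DB2", "G2", "IC50", 5000.0), # weaker than 1 micromolar
    ("B4", "P3", "DB3", "G3", "EC50", 1.0),    # EC50 only, so dropped
]
result = collapse(rows)
print(result)
```

In the toy data only the DB1-G1 pair survives, with an affinity of about 50 nM, because the minimum is taken over the two BindingDB compounds that map to DB1.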

Daniel Himmelstein Researcher Nov. 19, 2015

Update to October 2015 release

We updated our analysis [1] to the latest BindingDB release (BindingDB_All_2015m10.tsv) and made several implementation enhancements. Our collapsed datasets now retain source and PubMed information to help with licensing and attribution.

For more information, see the notebook for processing the BindingDB export, the R Markdown output for collapsing to compound–gene relationships, and the data download directory.

Issue feedback

A few issues arose which were not present for BindingDB_All_2015m3.tsv. Paging @mkgilson:

Rows 192304–192473 (one-indexed) start off with SMILES rather than reactant set IDs.
Numeric binding affinities could not be extracted for 19 rows.

I recommend switching from the ragged TSV to a format that can handle nested structure, such as JSON or XML.
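
A check along the following lines could flag rows like those in the first issue. It is a sketch that assumes the first column holds an integer reactant set ID and that the file has a single header line; the column names in the sample are hypothetical.

```python
import csv
import io

def rows_without_reactant_id(handle):
    """Yield one-indexed data-row numbers whose first field is not an integer ID."""
    # QUOTE_NONE keeps quote characters inside fields from being interpreted
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header line
    for number, row in enumerate(reader, start=1):
        if not row or not row[0].strip().isdigit():
            yield number

# Tiny in-memory stand-in for the export
sample = io.StringIO(
    "ReactantSetID\tLigand SMILES\tKi (nM)\n"
    "1\tCCO\t12.5\n"
    "CC(=O)O\t\t\n"
    "3\tc1ccccc1\t>1000\n"
)
print(list(rows_without_reactant_id(sample)))  # [2]
```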

Status: Completed

Topics

Databases Compounds PubChem BioAssay Affinity Binding Affinity BindingDB Drug Targets ChEMBL

Cite this as

Daniel Himmelstein, Mike Gilson (2015) Integrating drug target information from BindingDB. Thinklab. doi:10.15363/thinklab.d53
