Rephetio: Repurposing drugs on a hetnet [rephetio]

Integrating drug target information from BindingDB

We have chosen to rely on BindingDB [1, 2] for compound–protein binding (ligand–target affinity) information. Their website provides good background on this topic. BindingDB incorporates data from several other databases, including ChEMBL [3, 4] and PubChem BioAssay [5, 6, 7], and provides well-annotated data in convenient formats.

We began parsing BindingDB with the BindingDB_All_2015m3 release and have taken the following steps:

  1. Some binding experiments refer to multichain protein complexes. However, 96% = 1073428 / 1115639 of experiments assayed a single-chain protein. For simplicity, we decided to omit binding to complexes.
  2. Of the remaining interactions, the protein targets for 20% = 215107 / 1073428 did not map to a SwissProt identifier and were excluded.
  3. Of the remaining interactions, 78% = 673750 / 858321 were curated with Homo sapiens as the organism. While we will most likely end up using only the human targets, this filtering should naturally occur at a later stage.
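
The percentages in the steps above follow directly from the quoted counts. A quick arithmetic check, using only the numbers in the list (the variable and function names are ours, chosen for illustration):

```python
# Counts quoted in the filtering steps above.
total_experiments = 1115639
single_chain = 1073428
unmapped = 215107   # single-chain targets without a SwissProt identifier
human = 673750

# Interactions left after dropping unmapped targets.
mapped = single_chain - unmapped


def percent(part, whole):
    """Return part / whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


single_chain_pct = percent(single_chain, total_experiments)
unmapped_pct = percent(unmapped, single_chain)
human_pct = percent(human, mapped)

print(mapped, single_chain_pct, unmapped_pct, human_pct)
```

The difference `single_chain - unmapped` reproduces the denominator used in step 3.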

The next step is translating BindingDB compounds to DrugBank compounds. Many BindingDB compounds may match a single DrugBank compound. Additionally, multiple experiments may report affinities for the same compound–target pair. Several affinity metrics are reported (Ki, IC50, Kd, and EC50), all in nanomolar (nM) units. We would like to combine all affinities for a DrugBank–EntrezGene pair into a single molarity value (which can then be used as a network inclusion threshold).

We would like input on how to combine affinities across experiments. Are some molarity measurements more precise? Are certain source databases less error prone? We are looking for simple and rational rules and are willing to accept a healthy dose of reductionism.

Hi Daniel,

Most of the data in BindingDB are 50% inhibitory concentrations (IC50s) and inhibition constants (Kis), with occasional dissociation constants (Kds), all in default units of nanomolar (nM). The IC50 values are regarded as less rigorous measures of binding affinity as they depend to some degree on the association constant of the enzyme substrate used in measuring the IC50 value. The Ki and Kd values should be somewhat more rigorous. If a given compound and protein target have multiple measurements of different types, I'd probably use them in the following order of preference: Kd over Ki over IC50. That said, most of the data are IC50s, so this case won't arise all that often.

Hope this helps!

The table below categorizes each binding by the affinity measures reported in BindingDB. Some reports include multiple measures. The number of bindings per category is reported after filtering out multichain and unmapped proteins:

IC50 (nM)    Ki (nM)    Kd (nM)    EC50 (nM)    count

Definitions for reference:

  • IC50 - half maximal inhibitory concentration
  • Kd - dissociation constant
  • Ki - inhibition constant
  • EC50 - half maximal effective concentration

@mkgilson, thanks! What about EC50s, of which there are 74,549 reports? Can we include this information, and if so where should EC50s fall in the preference order?

Also, is it possible to infer whether a ligand is an agonist or antagonist based on which measures are reported?

Hi Dan,

I'd significantly downweight or ignore the EC50s, as these are at greatest risk of not corresponding to a confirmed binding reaction. For example, there might be a compound which binds, or is expected to bind, a particular protein target; but the measurement done is to expose cells to the compound and report the concentration at which it is "50% effective" (hence EC50) in producing a biological effect supposedly due to binding of the protein target.

Unfortunately, agonism/antagonism is not readily available.


BindingDB Processing

For our network, we desire binding edges between Entrez genes and DrugBank compounds. We coerced BindingDB to conform to our desires using the following steps:

Dataset cleanup and tidying:

  1. downloading and reading BindingDB
  2. removing interactions with multichain complexes or without UniProt protein IDs
  3. converting binding affinities to floats
  4. retrieving Entrez genes corresponding to UniProt proteins (download mapping)
  5. gathering data so that each row contains only a single affinity measurement, UniProt protein, and Entrez gene (download tidied data)
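
Converting affinities to floats (step 3) can be sketched as below. This is a minimal illustration, not the notebook's actual code: it assumes affinity fields are strings holding a number with an optional leading `<` or `>` qualifier, and the name `parse_affinity` is hypothetical.

```python
import re

# Assumed field format: optional '<' or '>' qualifier, then a plain or
# scientific-notation number. Anything else is treated as unparseable.
_AFFINITY_RE = re.compile(
    r"^\s*([<>]?)\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)\s*$"
)


def parse_affinity(raw):
    """Return (qualifier, value_nM), or None when no number can be extracted."""
    if raw is None:
        return None
    match = _AFFINITY_RE.match(raw)
    if match is None:
        return None
    qualifier, number = match.groups()
    return qualifier, float(number)


examples = ["12.5", ">10000", " <0.1 ", "", "n/a"]
parsed = [parse_affinity(text) for text in examples]
```

Returning `None` rather than raising lets the caller count and report the rows whose affinities could not be extracted.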

Collapsing bindingDB into compound-gene relationships:

  1. restricting to human interactions
  2. mapping BindingDB compounds to DrugBank (download fuzzy mapping)
  3. resolving multiple affinities for the same BindingDB–UniProt pair by preferentially selecting Kd over Ki over IC50 and taking a geometric mean when there were multiple measurements of the same measure (download)
  4. collapsing into DrugBank–gene pairs, taking the minimum affinity value (that is, the strongest binding) reported across grouped BindingDB–UniProt pairs (download)
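
The collapsing steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual implementation (which is in the linked notebook and rmarkdown output): the record layout, the identifiers in the example, and the function names are all hypothetical.

```python
from collections import defaultdict
from math import exp, log

# Most to least preferred measure; EC50 is deliberately left out.
PREFERENCE = ["Kd", "Ki", "IC50"]


def geometric_mean(values):
    """Geometric mean of positive numbers."""
    return exp(sum(log(v) for v in values) / len(values))


def resolve_pair(measurements):
    """Pick the most preferred measure among (measure, value_nM) tuples
    and return (measure, geometric mean), or None if none qualifies."""
    by_measure = defaultdict(list)
    for measure, value in measurements:
        by_measure[measure].append(value)
    for measure in PREFERENCE:
        if by_measure.get(measure):
            return measure, geometric_mean(by_measure[measure])
    return None


def collapse(records, compound_map, gene_map):
    """records: iterable of (bindingdb_id, uniprot_id, measure, value_nM).
    Returns {(drugbank_id, gene_id): minimum resolved affinity in nM}."""
    pairs = defaultdict(list)
    for bdb_id, uniprot, measure, value in records:
        pairs[(bdb_id, uniprot)].append((measure, value))
    best = {}
    for (bdb_id, uniprot), measurements in pairs.items():
        resolved = resolve_pair(measurements)
        if resolved is None:
            continue
        if bdb_id not in compound_map or uniprot not in gene_map:
            continue
        key = (compound_map[bdb_id], gene_map[uniprot])
        best[key] = min(best.get(key, float("inf")), resolved[1])
    return best


# Hypothetical identifiers, for illustration only.
records = [
    ("bdb1", "P1", "IC50", 100.0),
    ("bdb1", "P1", "Ki", 10.0),
    ("bdb1", "P1", "Ki", 1000.0),
    ("bdb2", "P1", "IC50", 50.0),
    ("bdb3", "P2", "EC50", 5.0),
]
compound_map = {"bdb1": "DB_A", "bdb2": "DB_A", "bdb3": "DB_B"}
gene_map = {"P1": 1, "P2": 2}
collapsed = collapse(records, compound_map, gene_map)
```

In this toy input, the two Ki values for `bdb1` average geometrically to about 100 nM and take precedence over its IC50; the pair is then merged with `bdb2` under the same DrugBank compound, and the lower value wins. The EC50-only pair is dropped.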

The resulting DrugBank–gene dataset contained 21,617 interactions. Setting an affinity threshold at 1 micromolar (1000 nanomolar), a threshold suggested by both @mkgilson and @alessandrodidonna in conversation, retained ~26% of interactions. After thresholding at 1 micromolar, 890 genes and 1,634 DrugBank compounds participated in 5,701 binding interactions.
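
Because the affinities are stored in nanomolar, the 1 micromolar cutoff is a comparison against 1000. A minimal sketch, assuming the threshold is inclusive (the post does not say either way) and using hypothetical function names:

```python
THRESHOLD_NM = 1000.0  # 1 micromolar expressed in nanomolar


def passes_threshold(affinity_nm, threshold_nm=THRESHOLD_NM):
    """True when the value is at or below the threshold.
    Lower nM values correspond to tighter binding."""
    return affinity_nm <= threshold_nm


def retained_fraction(affinities_nm, threshold_nm=THRESHOLD_NM):
    """Fraction of affinity values that pass the threshold."""
    kept = [a for a in affinities_nm if passes_threshold(a, threshold_nm)]
    return len(kept) / len(affinities_nm)


# Made-up values, for illustration only.
example_fraction = retained_fraction([1.0, 1000.0, 1000.1, 50000.0])
```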

Update to October 2015 release

We updated our analysis [1] to the latest BindingDB release (BindingDB_All_2015m10.tsv) and made several implementation enhancements. Now, our collapsed datasets retain source and pubmed information to help with licensing and attribution.

For more information, see the notebook for processing the BindingDB export, the rmarkdown output for collapsing to compound–gene relationships, and the data download directory.

Issue feedback

A few issues arose that were not present for BindingDB_All_2015m3.tsv. Paging @mkgilson:

  • Rows 192304–192473 (one-indexed) start off with SMILES rather than reactant set IDs.
  • Numeric binding affinities could not be extracted for 19 rows.
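
A check along the following lines could flag rows like those in the first bullet. It is only a sketch: the column name `BindingDB Reactant_set_id`, the two-column sample, and the choice to number data rows from 1 (excluding the header) are assumptions made for illustration.

```python
import csv
import io


def find_malformed_rows(tsv_text, id_column="BindingDB Reactant_set_id"):
    """Return one-indexed data-row numbers whose ID field is not a plain integer."""
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    header = next(reader)
    id_index = header.index(id_column)
    bad = []
    for row_number, row in enumerate(reader, start=1):
        # A short row or a non-numeric ID suggests shifted or ragged fields.
        if len(row) <= id_index or not row[id_index].strip().isdigit():
            bad.append(row_number)
    return bad


# Tiny made-up sample: the second data row starts with a SMILES string.
sample = (
    "BindingDB Reactant_set_id\tLigand SMILES\n"
    "1\tCCO\n"
    "c1ccccc1\t\n"
    "3\tCC(=O)O\n"
)
malformed = find_malformed_rows(sample)
```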

I recommend switching from the ragged TSV to a format that can handle nested structure, such as JSON or XML.

Status: Completed
Cite this as
Daniel Himmelstein, Mike Gilson (2015) Integrating drug target information from BindingDB. Thinklab. doi:10.15363/thinklab.d53

Creative Commons License