
Python for the modern biodata scientist


Python is our first-line language because it is powerful, elegant, and widely adopted. In combination with Jupyter notebooks, Python is a data science jackhammer while remaining general purpose.

Python 3 was released in 2008 and introduced small incompatibilities with Python 2. Python 3 is superior to Python 2. Many training resources and codebases are still in Python 2, but new users should begin with Python 3.

Installation

We will use Anaconda for package management. Anaconda makes installing packages easier and includes most of the important ones by default. It also supports environments (distinct and independent installations), which allow specific setups for specific purposes. Anaconda creates a default environment (root) that becomes your primary Python distribution. Here, we create a root Python 3 environment and an elective Python 2 environment that can be activated when needed.

Download Anaconda for Python 3. Avoid the graphical installer, which installs unneeded GUI programs. Install according to the defaults. Once installed, run the following terminal command to update all packages:

conda update --all

When needed, install additional packages with commands like conda install seaborn, which installs the seaborn visualization package.

To install Python 2.7, we will create a new environment called py27 with the following command:

conda create -n py27 python=2.7 anaconda

Activate the Python 2 environment with source activate py27 on Linux or Mac and activate py27 on Windows. Once activated, run conda update --all to update packages and jupyter kernelspec install-self --user to make the Python 2 kernel available in Jupyter. Then run source deactivate (or deactivate on Windows) to return to the default Python 3 environment.

Python 2 should only be used when necessitated by legacy codebases. Otherwise, we use Python 3.

If you need rdkit for chemoinformatics, follow the installation instructions here, which create a dedicated rdkit environment with InChI support.

Usage

We recommend using Jupyter notebooks for most analyses. Launch Jupyter by running jupyter notebook in the terminal. Dedicated .py files can be edited using the Jupyter text editor or Atom.

Familiarize yourself with the PEP 8 style guide.

Local development

When developing a package locally, run pip install -e . from the package's root directory. The package will then be importable from within your conda environment. The -e flag specifies editable mode and makes the package automatically update when you modify the source.

Packages

For example code, check out my notebooks.

Thank you. This is very helpful.

This is very helpful and something I can use to improve my Python workflow.

SciPy 2015

Some incredible presentations and materials are being unleashed at the SciPy 2015 conference as I type. The full playlist of presentations is on YouTube.

One excellent presentation is State of the Stack (video, slides) by Jake Vanderplas, which details the latest developments in python tools for data science.

Some additional projects of interest are:

Jupyter 1.0.0 released

Last Wednesday marked a historic day for biodata science. The language-agnostic parts of IPython, including the notebook, have been repackaged as Jupyter. The big split was necessary because the project now supports many languages, not just Python.

I am updating the above guide by replacing ipython with jupyter in code snippets. The revised guidance will apply to new installations. If you have an existing Anaconda installation, you can install Jupyter with conda install jupyter.

Now who is excited for September 13th and the features this day will bring!

Concurrency for data science

Processors are now multicore, and our code should take advantage of this opportunity. If execution time is not an issue, don't waste time optimizing. But if you find yourself waiting for your program to finish and your computation can be parallelized, look no further than concurrent.futures.

concurrent.futures gives you easy, no-boilerplate access to the two methods of concurrency in Python. The methods are:

threading

Threads allow multiple paths of execution within a single program. Each thread has access to the global data space, which makes threads convenient for programming. However, proceed with caution: since a single object can be altered by multiple threads simultaneously, there is danger. Avoid the danger by using locks (via with for convenience) whenever a thread writes to a communal resource.
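
For example, here is a minimal sketch of the lock pattern (the worker function and shared squares list are illustrative):

import threading

lock = threading.Lock()
squares = []  # communal resource shared by all threads

def worker(number):
    square = number ** 2  # thread-local computation needs no lock
    with lock:  # acquire the lock before writing to the shared list
        squares.append(square)

threads = [threading.Thread(target=worker, args=(n,)) for n in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()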

The main drawback of threading is the global interpreter lock (GIL), meaning that "only one thread can execute Python code at once." Therefore, if you want to reap the benefits of multiple cores, you need to make sure your code isn't limited by the GIL. You can escape the GIL by moving time-intensive computations outside of Python by:

  1. Using code written in other languages such as C. Many scientific packages implement their core features outside of Python. Additionally, many operations rely on external resources such as database or web queries.
  2. Using numba to automatically compile your code with the @numba.jit(nogil=True) decorator (see the sketch after this list).
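
As a sketch of the numba approach, assuming numba is installed (available via conda install numba), with an illustrative summation function:

import numba
import numpy

@numba.jit(nogil=True, nopython=True)
def array_sum(values):
    # compiled to machine code, so the GIL is released while it runs
    total = 0.0
    for value in values:
        total += value
    return total

array_sum(numpy.arange(1e6))  # first call triggers compilation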

multiprocessing

When your code isn't amenable to releasing the GIL, try multiprocessing. Multiprocessing spawns new Python instances to run each task. Therefore, any needed data must be serialized via pickling and dispatched to the subprocess, which creates large overhead. To reduce this overhead, send each process the minimum amount of data your application requires.
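
For example, a minimal sketch of this principle, using the ProcessPoolExecutor described in the next section (the count_lines function and file paths are illustrative):

import concurrent.futures

def count_lines(path):
    # the subprocess opens the file itself, so only the short path
    # string is pickled and dispatched, not the file contents
    with open(path) as read_file:
        return sum(1 for line in read_file)

if __name__ == '__main__':  # guard required on Windows
    paths = ['part-0.txt', 'part-1.txt']  # illustrative file paths
    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
        line_counts = list(executor.map(count_lines, paths))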

concurrent.futures

concurrent.futures provides a queue-based system for executing functions in parallel. To use it, first instantiate an Executor using concurrent.futures.ThreadPoolExecutor() for threading or concurrent.futures.ProcessPoolExecutor() for multiprocessing. Both constructors accept a max_workers argument specifying the maximum number of threads/processes to devote to the task.

You can interact with the Executor in the following ways (a short sketch follows the list):

  • Executor.submit(), which submits a single job to the queue
  • Executor.map(), which submits many jobs to the queue
  • Executor.shutdown(), which shuts down the executor. The default parameter wait=True means this method will wait for all jobs to finish before returning.
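
Putting the pieces together, here is a minimal sketch using a thread pool (the square function is illustrative):

import concurrent.futures

def square(x):
    return x ** 2

executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)

future = executor.submit(square, 3)  # submit a single job
future.result()  # block until the job finishes and returns 9

squares = list(executor.map(square, range(10)))  # submit many jobs

executor.shutdown(wait=True)  # wait for all jobs to finish

Executors also work as context managers (with concurrent.futures.ThreadPoolExecutor() as executor:), which shuts the executor down automatically when the block exits.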

Cheers to a concurrent future!

 