More Inclusive Open Science with Language Workbench Technology

Executive Summary
While most scientists could benefit from using computational tools in their research, many are not sufficiently computationally savvy to use current advanced tools. An ongoing challenge of Open Science is to enable the broadest community of scientists to use the computational tools necessary to formally document analyses, manage data, and achieve routine reproducibility of analyses.

Language workbench technology (LWT) was pioneered by JetBrains (https://www.jetbrains.com) to facilitate the in-house development of integrated development environments for programmers. In 2004, JetBrains open-sourced the MPS language workbench (https://www.jetbrains.com/mps/) to encourage the adoption of LWT in other domains. Our team has recently shown that LWT is effective in helping biologists and clinicians who are not computer-savvy perform data analysis. For instance, tools developed with LWT make it possible to train most biologists and clinicians, in just a few hours, to analyze gene expression data. We have so far taught close to 200 biologists with languages designed with LWT.

Our experience strongly suggests that LWT is a methodology that can provide a much-needed bridge between non-computational scientists and computational experts. LWT can facilitate many activities necessary for a widespread adoption of Open Science, supporting routine and seamless reproducibility of data analysis, as well as allowing more technical freedom to test innovations. Importantly, we believe that with LWT, scientists will find it easier to engage in a more inclusive Open Science.

We propose to (1) apply LWT to test novel ideas to record and credit scientists for contributions of software for data analysis and (2) teach the scientific community about this innovative technology. Our prototype will evaluate alternatives to citing only a few software tools in articles. We anticipate that demonstrating these innovations will inspire the design of future scholarly systems and help promote Open Science.

Short introduction video (5 minutes)

Our Team. Our team is composed of an academic laboratory (Fabien Campagne’s lab, located at the Weill Cornell Medical College, in NYC, USA) and the JetBrains MPS team (Alexander Shatalin’s group, at JetBrains Inc., Prague, Czech Republic). Our team strongly believes in and supports open-source, data sharing, and facilitating the process of Open Science by developing novel tools that empower engineers and scientists. Members of our team have practiced openness and Open Science for many years: JetBrains Inc. has been a strong industry contributor to open-source (JetBrains repositories have received more than 10,000 stars on GitHub). The Campagne laboratory also has a history of developing open-source bioinformatics software and embraced open-access journals [1, 2] and preprints [3] several years ago (both as an author and as a peer-reviewer and editor: Fabien Campagne serves as an associate editor for PeerJ).

Open Science. Open Science is likely to mean different things to different teams. For some, it means practicing science in a completely open way, from early data generation, through data analysis, up to the writing of interpretations for publication in open-access journals, or the deposition of data and interpretations in knowledge bases. Some groups, perhaps because they focus on clinical research projects and are cautious about protecting patient populations who do not have scientific or clinical skills, place more emphasis on the second half of the scientific process: the publication and data deposition after scientists have built up confidence in their interpretation of the data and summarized the key aspects of the research. For us, Open Science is all these aspects, plus one. We think of Open also as a synonym of Inclusive, and hope for science that is more inclusive of the public, but also more inclusive of scientists across disciplines.

Our Long Term Goal. We strongly believe that Open Science is the future of scientific enquiry and aim to speed up its adoption by making it easier and more desirable to practice Open Science than it is to continue with the status quo.

The Challenge. Many scientists who would like to adopt Open Science processes in their labs face considerable difficulties because Open Science practices require a high degree of technical sophistication. While some individuals have mastered the current tools of Open Science, and championed its practice, we believe that many more scientists are finding it difficult to learn the technical skills that are necessary to open one’s science to the world.

The Innovation. We propose to apply language workbench technology (LWT), fully supported by the JetBrains open-source Meta Programming System (MPS), to help scientists with the daily practice of Open Science. The Campagne laboratory has recently demonstrated that LWT can be applied to help biologists with limited computational experience perform sophisticated data analyses that would have traditionally required programming or command line skills [4]. Tools developed with LWT make it possible for biologists to fully participate in data analysis, which promotes inclusive science across disciplines.

Preliminary Results. We have shown that LWT makes it possible to design languages for data analysis (Figure 1) and reproducible bioinformatics workflows/pipelines (Figure 2) that are easier to teach to biologists and clinicians than traditional approaches (see [4]). In the MetaR project, we have designed a simple data analysis language that can be taught to biologists with no programming experience in 2 hours and helps trainees call genes differentially expressed and create heatmap visualizations. We are obtaining similar results teaching NextflowWorkbench to biologists and clinicians (see Figure 2, and [5]). These short training sessions are enabled by LWT and attention to detail during language design.

Figure 1. Illustration of the MetaR data analysis language.

This screen capture shows a differential expression analysis and creation of a heatmap visualization, two common analysis tasks in bioinformatics. Participants in the MetaR training sessions are taught how to perform this analysis with MetaR in less than 2 hours. This language was constructed with Language Workbench Technology (LWT) and was found to substantially accelerate the teaching of data analysis to beginners (biologists and clinicians).

Figure 2. Illustration of the NextflowWorkbench pipeline language.

This screen capture shows an analysis workflow/pipeline that downloads short reads from the Short Read Archive at NCBI/NIH, performs quality control on the read files, and estimates counts with Kallisto before producing a combined matrix of counts. The language to express this workflow was developed with LWT. We have started teaching biologists and clinicians how to develop such pipelines and found that we can teach beginners with no programming or command line experience how to assemble and run the pipeline of the screenshot on their laptop in 2 hours. This short training time is possible because beginners can use the language as an interactive user interface, and because we designed the language to integrate seamlessly with docker container technology. Container technology, when coupled with LWT, allows beginners to run tools such as fastq-dump or Kallisto without having to know about software compilation and dependency management. We believe that combining these technologies can effectively solve what others have nicknamed ‘dependency hell’ (see survey). An added advantage of using container technology is that pipelines constructed with LWT and NextflowWorkbench are reproducible and portable [5].

The Key Advantages of LWT for Data Analysis. LWT has several advantages for Data Analysis:

It can be used to develop computational languages that are easier to teach to biologists and clinicians than programming languages (see Figures 1 and 2). LWT makes science more inclusive across disciplines.
LWT removes the need for monolithic standardization or common specification. Instead, LWT languages can evolve from small individual contributions that work well together by design.
Languages can include warning and error messages that warn the analyst about very specific conditions. In contrast to programming languages whose compilers report mostly about low-level syntax errors, LWT errors can be defined in a language to warn the user about high-level semantic problems. Such warnings and errors are extremely useful to guide beginners, or remind expert users about conditions known to cause errors or unreliable results.
Languages developed with LWT can include graphical and tabular notations, or interactive user interface elements. This is particularly useful in scientific applications, as illustrated in Figure 1 and Supplementary Files 1 and 2.
Languages developed with LWT can be tightly integrated with other technologies, as shown in Figure 2 with support for docker in the NextflowWorkbench. Such integrations can be presented in a natural way that makes it straightforward for biologists and clinicians to take advantage of otherwise very technical tools and technologies. Biologists who learn NextflowWorkbench need only a 5-minute introduction to docker before they can take advantage of a docker container to write workflows.
LWT supports source control systems, such as git, in a seamless manner that helps beginners use these tools (see [6, 7]). Source control is a key component of Open Science. By keeping a trace of source control commits, scientists can document what analysis steps they have taken. They can diagnose and trace back problems while working on a study, retrieve specific versions of an analysis, and release the commits openly at publication time to enable other scientists and the public to follow and understand how a study has been developed.
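
The point above about language-level warnings and errors can be illustrated with a minimal sketch. It is written in Python for readability and does not use the MPS API: in MPS, such checks are declared inside the language definition itself. The `DifferentialExpression` statement, the `Message` type, and the two-sample threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DifferentialExpression:
    """Hypothetical analysis statement: compare two groups of samples."""
    group_a: List[str]
    group_b: List[str]


@dataclass
class Message:
    level: str  # "warning" or "error"
    text: str


def check(stmt: DifferentialExpression) -> List[Message]:
    """Report high-level semantic problems rather than syntax errors."""
    messages = []
    # A sample listed in both groups makes the comparison meaningless.
    shared = sorted(set(stmt.group_a) & set(stmt.group_b))
    if shared:
        messages.append(
            Message("error", "samples in both groups: " + ", ".join(shared)))
    # Illustrative threshold (an assumption): warn when a group has < 2 samples.
    for name, group in (("A", stmt.group_a), ("B", stmt.group_b)):
        if len(group) < 2:
            messages.append(
                Message("warning",
                        f"group {name} has fewer than 2 samples; "
                        "results may be unreliable"))
    return messages


for message in check(DifferentialExpression(["s1"], ["s1", "s2"])):
    print(f"{message.level}: {message.text}")
```

In a language workbench, messages of this kind would typically appear next to the offending statement while the analyst is editing, rather than after the analysis has been run.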

LWT for Open Science. We think that several of the previous points can help scientists open their science because many scientists are struggling with the know-how and minutiae required to practice Open Science. As Holly Bik (2015 keynote speaker at the Bioinformatics Open Source Conference) has put it:

Figure 3. Holly Bik's Tweet

Here, the ability of LWT to integrate many technologies, yet offer a seamless and consistent user interface, may enable completely new ways to develop tools to support Open Science. In this proposal, for instance, we will prototype a new way to record credit and associate it with analysis software. We believe that developing better ways to credit scientists and track contributions automatically could drive an accelerated adoption of Open Science. LWT makes it orders of magnitude faster to prototype such ideas and quickly evaluate their impact than would be possible with other techniques.

Aims. In aim (1), we will develop a prototype that pioneers new ways to track contributions to scientific data analyses and to credit contributors. This aim will evaluate new ideas to help credit data analysts for their contributions to scientific studies. It will make it easier for scientists to share LWT analyses while at the same time helping contributors receive due credit for their work.
In aim (2), we will offer training workshops for LWT at other biomedical institutions. This aim will increase the visibility of this new technology and help accelerate its adoption at universities, medical colleges and biotech companies. One-day training workshops will combine the MetaR and NextflowWorkbench training sessions currently taught at the Weill Cornell Medical College and extend them to describe the sharing of data analyses.

Originality. First, this proposal is original because it will use a technology (LWT) that is not widely known, yet is showing promise in addressing some of the pressing challenges of Open Science (e.g., portability, reproducibility, openness and inclusiveness towards the public and scientists in other disciplines). Second, this proposal is original because it aims to advance Open Science by changing how the contributions of individual scientists are recorded, preserved and presented to other scientists and the public. Our team believes that Open Science will start to become mainstream when a majority of scientists have realized that openness gets them more credit than the alternatives.

Innovations for recording scientific contributions and credit. We will start working on this aim in Phase I, and will continue to develop and refine the prototype in Phase II. The goal is to demonstrate new ways to (i) record contributions to data analysis in scientific studies, (ii) facilitate crediting scientists and data analysts who have contributed data analysis programs to a study, and (iii) help scientists and data analysts showcase their contributions and their impact on other studies.

Need for the innovation. We believe that the current approach to acknowledging data analysis contributions in scientific publications is not ideal. A first problem is that crediting somebody currently involves judgment calls that PIs make according to their understanding and appreciation of the work of others. A second problem is that credit is often given in the form of authorship or acknowledgement in a publication, and there is no efficient way for a scientist to gather an exhaustive list of all the studies that have used an analysis script or program. Given that various journals limit the maximum number of citations in an article and considering that some analyses require tens of tools, it may not be surprising that references about programs or data analysis tools are frequently dropped from a citation list.
Approach: scientific contributions and credit. We will address these problems as follows:

First, we will attach the identity of a data analyst to the programs that they create (analysts will need to opt in by providing an ORCID identifier or email). Once an analyst has enabled this feature, any analysis or program developed by this analyst will be tagged with contributor meta-data. Contrary to the common practice of signing programs in a comment at the top of each file, LWT makes it possible to annotate every statement of an analysis. This is useful to preserve meta-data when statements are copied and pasted across analyses and to automatically merge meta-data for distinct contributions.
Second, we will implement the ability to publish an analysis to the web and to preserve contribution meta-data in the database. The web front-end will support publishing different versions of an analysis.
Third, to facilitate reuse, we will support downloading analysis programs directly from the web front-end into MPS.
Fourth, running an analysis on new data will optionally record usage in the web front-end (no details about the data being processed will be shared). This can be done automatically with LWT because we can control what happens when a user wants to execute an analysis. Records of analyses that were run and successfully completed will be made in the database and reflected on the web front-end immediately. This will make it possible for scientists to determine the popularity and usage of analysis programs developed with LWT and will help data analysts document the reuse of their contributions.
Fifth, the system will offer a unique DOI for each version of a published analysis. This DOI will make it possible for authors to include an electronic reference to a specific analysis in manuscripts, preprints and publications. We will encourage scientists to record this information in their manuscripts during the training workshops. Including the analysis DOI will provide a de facto means for reviewers and future readers to know exactly how the analysis was conducted. We will work with NCBI to add this type of DOI in Medline data under the Secondary Source ID field. Contributions will be electronically discoverable to enable integration with third parties (such as ImpactStory).
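
The first step of this approach, contributor meta-data attached to every statement and merged as statements are copied and edited, can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the planned MPS implementation: the `Statement` class, the `author` and `credit` helpers, and the example identifiers are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Statement:
    """One statement of an analysis, tagged with its contributors."""
    text: str
    contributors: FrozenSet[str] = frozenset()  # ORCID identifiers or emails

    def edited_by(self, contributor: str, new_text: str) -> "Statement":
        """Editing keeps earlier contributors and adds the editor."""
        return Statement(new_text, self.contributors | {contributor})


def author(contributor: str, text: str) -> Statement:
    """Create a new statement credited to a single contributor."""
    return Statement(text, frozenset({contributor}))


def credit(analysis: List[Statement]) -> Counter:
    """Count, per contributor, the statements they contributed to."""
    tally = Counter()
    for statement in analysis:
        tally.update(statement.contributors)
    return tally


# Placeholder identifiers, for illustration only.
alice, bob = "alice@example.org", "bob@example.org"

# Bob pastes Alice's statement into his analysis: because statements are
# values that carry their own meta-data, the credit travels with the copy.
heatmap = author(alice, "heatmap with counts")
analysis = [author(bob, "import table counts.tsv"), heatmap]

# Bob then edits the pasted statement; both contributors are retained.
analysis[1] = analysis[1].edited_by(bob, "heatmap with counts, scaled by row")

print(credit(analysis))
```

Storing the contributor set on each statement, rather than once per file, is what allows credit to survive copy and paste; a per-file signature would be lost as soon as a statement left its original file.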

Publishing analyses to the web. LWT programs will be converted to Scalable Vector Graphics (SVG) format. SVG is an appropriate format to display LWT programs on the web because it supports graphics as well as text, is a vector format that can be printed at arbitrary resolutions, and is well supported by most modern browsers. SVG documents will be written to a database, either maintained by the team (by JetBrains Inc. or by the Campagne laboratory at Weill Cornell’s Institute for Computational Biomedicine) or hosted by a cloud provider. A web front-end will be developed to serve the SVG at persistent URLs and provide search functionality over the documents. Deliverables will consist of (i) an MPS plugin that any user of the MPS LW can install; the plugin will support publishing one or more analyses to an SVG repository (in technical terms, we will support publishing arbitrary MPS solutions and languages); (ii) a set of docker containers and deployment instructions to configure a running instance of the public DB and web front-end (this will make it possible for others to host their own repository if they wish to do so); and (iii) a running SVG repository, which we will maintain as a demonstration system for a minimum of 5 years.
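
To give a sense of what the SVG conversion involves, here is a minimal sketch in Python that renders the lines of a textual program as an SVG document using only the standard library. It is an illustration under simplifying assumptions, not the proposed MPS plugin: it handles plain text only (no graphical or tabular notations), and the `render_svg` function, its layout parameters, and the example program text are hypothetical.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def render_svg(lines, line_height=18, char_width=8, margin=10):
    """Render program lines as an SVG document string (monospace text)."""
    longest = max((len(line) for line in lines), default=0)
    width = 2 * margin + char_width * longest
    height = 2 * margin + line_height * len(lines)
    # Serialize SVG elements without a namespace prefix.
    ET.register_namespace("", SVG_NS)
    svg = ET.Element(f"{{{SVG_NS}}}svg",
                     {"width": str(width), "height": str(height)})
    for index, line in enumerate(lines):
        # SVG collapses leading spaces, so indentation becomes an x offset.
        indent = len(line) - len(line.lstrip(" "))
        text = ET.SubElement(svg, f"{{{SVG_NS}}}text", {
            "x": str(margin + char_width * indent),
            "y": str(margin + line_height * (index + 1)),
            "font-family": "monospace",
        })
        text.text = line.strip()  # ElementTree escapes <, > and & on output
    return ET.tostring(svg, encoding="unicode")


program = [
    "analysis demo {",
    "  heatmap with counts where a < b",
    "}",
]
print(render_svg(program))
```

Because each statement becomes its own SVG element, a published document could also carry per-statement attributes (for example, contributor meta-data) that a web front-end indexes for search.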

Technical feasibility. Writing text programs is currently a mainstream approach for data analysis. When writing an analysis with LWT, users are effectively assembling a data structure (and seeing parts of it in the MPS user interface). We will use this capability to extend this data structure with meta-data about who created and edited the analysis program. Construction of the web front-end has high feasibility because JetBrains is well versed in working with databases and web technology and because MPS already supports an Application Programming Interface (API) for custom persistence. Team members from JetBrains will provide expertise in the internal software architecture of MPS that will speed up the prototype. For web publishing of analyses, SVG documents will be rendered from MPS using code similar to that developed for the Editor2PDF project (developed by Fabien Campagne to render LWT programs to PDF for inclusion in books and publications).

Training workshops. If we are recipients of a Phase I prize, we will advertise training workshops on the ISCB mailing list and seek institutions to host workshops. We will also directly contact past trainees who have taken the training at Weill Cornell and have now established their laboratory at other institutions (in the US and Europe). We will give a minimum of 5 one-day training workshops (approximately one workshop a month in Phase I). We will aim to offer workshops in the order in which requests are received, but may modify this order to reach a more diverse community in the US and Europe. Training sessions will continue throughout Phase II if we are recipients of the Prize. We will survey trainees post-training to gather feedback about whether the workshops fulfill expectations (see Figure 4 for survey questions).

Long-term sustainability. The MPS LW has been developed as an open-source project and maintained by JetBrains Inc. since 2004. The software is used internally by JetBrains to develop languages that make it easier to develop commercial products of JetBrains. For this reason, the MPS LW is a sustainable, industry funded, open-source project. All software developed in this project will be released under the Apache 2.0 software license. Any publications about the prototype will be submitted to open-access journals (e.g., PeerJ, PLOS or BMC journals).

Figure 4. Post-training anonymous survey

Supplementary File 1. MetaR and the NextflowWorkbench Video

Supplementary File 2. NextflowWorkbench Getting Started Video

Supplementary File 3. Full Prezi: Introduction to Proposal

References

1. Compression of Structured High-Throughput Sequencing Data. Fabien Campagne, Kevin C. Dorff, Nyasha Chambwe, James T. Robinson, Jill P. Mesirov (2013) PLoS ONE. doi:10.1371/journal.pone.0079871
2. Composable languages for bioinformatics: the NYoSh experiment. Manuele Simi, Fabien Campagne (2014) PeerJ. doi:10.7717/peerj.241
3. Composable languages for bioinformatics: the NYoSh experiment. Manuele Simi, Fabien Campagne (2013) PeerJ Inc. doi:10.7287/peerj.preprints.112v2
4. MetaR: simple, high-level languages for data analysis with the R ecosystem. Fabien Campagne, William ER Digan, Manuele Simi (2015) Cold Spring Harbor Laboratory Press. doi:10.1101/030254
5. NextflowWorkbench: Reproducible and Reusable Workflows for Beginners and Experts. Jason P Kurs, Manuele Simi, Fabien Campagne (2016) Cold Spring Harbor Laboratory Press. doi:10.1101/041236
6. Language workbench user interfaces for data analysis. Victoria M. Benson, Fabien Campagne (2015) PeerJ PrePrints. doi:10.7287/peerj.preprints.511v2
7. Language workbench user interfaces for data analysis. Victoria M. Benson, Fabien Campagne (2015) PeerJ. doi:10.7717/peerj.800