More Inclusive Open Science with Language Workbench Technology
Language workbench technology (LWT) was pioneered by JetBrains (https://www.jetbrains.com) to facilitate the in-house development of integrated development environments for programmers. In 2004, JetBrains open-sourced the MPS language workbench (https://www.jetbrains.com/mps/) to encourage the adoption of LWT in other domains. Our team has recently shown that LWT is effective in helping biologists and clinicians who are not computer savvy perform data analysis. For instance, tools developed with LWT make it possible to teach most biologists and clinicians, in just a few hours, how to analyze gene expression data. We have so far taught close to 200 biologists with languages designed with LWT.
Our experience strongly suggests that LWT is a methodology that can provide a much-needed bridge between non-computational scientists and computational experts. LWT can facilitate many activities necessary for the widespread adoption of Open Science, supporting routine and seamless reproducibility of data analysis, as well as allowing more technical freedom to test innovations. Importantly, we believe that with LWT, scientists will find it easier to engage in a more inclusive Open Science.
We propose to (1) apply LWT to test novel ideas for recording and crediting scientists' contributions of data analysis software and (2) teach the scientific community about this innovative technology. Our prototype will evaluate alternatives to citing only a few software tools in articles. We anticipate that demonstrating these innovations will inspire the design of future scholarly systems and help promote Open Science.
Our Team. Our team is composed of an academic laboratory (Fabien Campagne’s lab, located at the Weill Cornell Medical College, in NYC, USA) and of the JetBrains MPS team (Alexander Shatalin’s group, at JetBrains Inc., Prague, Czech Republic). Our team strongly believes in and supports open-source software, data sharing, and facilitating the process of Open Science by developing novel tools that empower engineers and scientists. Members of our team have practiced openness and Open Science for many years: JetBrains Inc. has been a strong industry contributor to open-source (JetBrains repositories have received more than 10,000 stars on GitHub). The Campagne laboratory also has a history of developing open-source bioinformatics software and has embraced open-access journals [1, 2] and preprints for several years (as an author, peer reviewer and editor: Fabien Campagne serves as an associate editor for PeerJ).
Open Science. Open Science is likely to mean different things to different teams. For some, it means practicing science in a completely open way, from early data generation, through data analysis, up to the writing of interpretations for publication in open-access journals, or the deposition of data and interpretations in knowledge bases. Some groups, perhaps because they focus on clinical research projects and are cautious about protecting patient populations who do not have scientific or clinical skills, place more emphasis on the second half of the scientific process: publication and data deposition after scientists have built up confidence in their interpretation of the data and summarized the key aspects of the research. For us, Open Science is all these aspects, plus one. We also think of Open as a synonym of Inclusive, and hope for science that is more inclusive of the public, but also more inclusive of scientists across disciplines.
The Challenge. Many scientists who would like to adopt Open Science processes in their labs face considerable difficulties because Open Science practices require a high degree of technical sophistication. While some individuals have mastered the current tools of Open Science and championed its practice, we believe that many more scientists find it difficult to learn the technical skills that are necessary to open one’s science to the world.
The Innovation. We propose to apply language workbench technology (LWT), fully supported by the JetBrains open-source Meta Programming System (MPS), to help scientists with the daily practice of Open Science. The Campagne laboratory has recently demonstrated that LWT can be applied to help biologists with limited computational experience perform sophisticated data analyses that would traditionally have required programming or command line skills. Tools developed with LWT make it possible for biologists to fully participate in data analysis, which promotes inclusive science across disciplines.
Preliminary Results. We have shown that LWT makes it possible to design languages for data analysis (Figure 1) and reproducible bioinformatics workflows/pipelines (Figure 2) that are easier to teach to biologists and clinicians than traditional approaches. In the MetaR project, we have designed a simple data analysis language that can be taught to biologists with no programming experience in 2 hours and helps trainees call differentially expressed genes and create heatmap visualizations. We are obtaining similar results teaching NextflowWorkbench to biologists and clinicians (see Figure 2). These short training sessions are enabled by LWT and attention to detail during language design.
Figure 1. Illustration of the MetaR data analysis language.
This screen capture shows a differential expression analysis and the creation of a heatmap visualization, two common analysis tasks in bioinformatics. Participants in the MetaR training sessions are taught how to perform this analysis with MetaR in less than 2 hours. This language was constructed with Language Workbench Technology (LWT) and was found to substantially accelerate the teaching of data analysis to beginners (biologists and clinicians).
Figure 2. Illustration of the NextflowWorkbench pipeline language.
This screen capture shows an analysis workflow/pipeline that downloads short reads from the Short Read Archive at NCBI/NIH, performs quality control on the read files, and estimates counts with Kallisto before producing a combined matrix of counts. The language to express this workflow was developed with LWT. We have started teaching biologists and clinicians how to develop such pipelines and found that we can teach beginners with no programming or command line experience how to assemble and run the pipeline of the screenshot on their laptop in 2 hours. This short training time is possible because beginners can use the language as an interactive user interface, and because we designed the language to integrate seamlessly with Docker container technology. Container technology, when coupled with LWT, allows beginners to run tools such as fastq-dump or Kallisto without having to know about software compilation and dependency management. We believe that combining these technologies can effectively solve what others have nicknamed ‘dependency hell’ (see survey). An added advantage of using container technology is that pipelines constructed with LWT and NextflowWorkbench are reproducible and portable.
LWT for Open Science. We think that several of the previous points can help scientists open their science because many scientists are struggling with the know-how and minutiae required to practice Open Science. As Holly Bik (2015 keynote speaker at the Bioinformatics Open Source Conference) has put it:
Here, the ability of LWT to integrate many technologies, yet offer a seamless and consistent user interface, may enable completely new ways to develop tools to support Open Science. In this proposal, for instance, we will prototype a new way to record credit and associate it with analysis software. We believe that developing better ways to credit scientists and track contributions automatically could drive an accelerated adoption of Open Science. LWT makes it orders of magnitude faster to prototype such ideas and quickly evaluate their impact than would be possible with other techniques.
Aims. In this proposal, we will develop a prototype that pioneers new ways to track contributions to scientific data analyses and to credit their contributors. This aim will evaluate new ideas to help credit data analysts for their contributions to scientific studies. It will make it easier for scientists to share LWT analyses while at the same time helping contributors receive due credit for their work.
Originality. First, this proposal is original because it will use a technology (LWT) that is not widely known, yet is showing promise in addressing some of the pressing challenges of Open Science (e.g., portability, reproducibility, openness and inclusiveness towards the public and scientists in other disciplines). Second, this proposal is original because it aims to advance Open Science by changing how the contributions of individual scientists are recorded, preserved and presented to other scientists and the public. Our team believes that Open Science will start to become mainstream when a majority of scientists have realized that openness gets them more credit than the alternatives.
Innovations for recording scientific contributions and credit. We will start working on this aim in Phase I, and will continue to develop and refine the prototype in Phase II. The goal is to demonstrate new ways to (i) record contributions to data analysis in scientific studies, (ii) facilitate crediting scientists and data analysts who have contributed data analysis programs to a study, and (iii) help scientists and data analysts showcase their contributions and their impact on other studies.
Need for the innovation. We believe that the current approach to acknowledging data analysis contributions in scientific publications is not ideal. A first problem is that crediting somebody currently involves judgment calls that PIs make according to their understanding and appreciation of the work of others. A second problem is that credit is often given in the form of authorship or acknowledgement in a publication, and there is no efficient way for a scientist to gather an exhaustive list of all the studies that have used an analysis script or program. Given that various journals limit the maximum number of citations in an article and considering that some analyses require tens of tools, it may not be surprising that references about programs or data analysis tools are frequently dropped from a citation list.
Publishing analyses to the web. LWT programs will be converted to Scalable Vector Graphics (SVG) format. SVG is an appropriate format to display LWT programs on the web because it supports graphics as well as text, is a vector format that can be printed at arbitrary resolutions, and is well supported by most modern browsers. SVG documents will be written to a database, maintained either by the team (JetBrains Inc. or the Campagne laboratory at Weill Cornell’s Institute for Computational Biomedicine) or hosted by a cloud provider. A web front-end will be developed to serve the SVG documents at persistent URLs and provide search functionality over them. Deliverables will consist of (i) an MPS plugin that any user of the MPS LW can install, which will support publishing one or more analyses to an SVG repository (in technical terms, we will support publishing arbitrary MPS solutions and languages); (ii) a set of Docker containers and deployment instructions to configure a running instance of the public database and web front-end (this will make it possible for others to host their own repository if they wish to do so); and (iii) a running SVG repository, which we will maintain as a demonstration system for a minimum of 5 years.
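The publishing flow described above (render to SVG, store, serve at a persistent URL, search) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `render_svg` function, the `SvgRepository` class, the `https://example.org/analyses` base URL, the content-hash identifier scheme, and the two program lines (which are made up and are not MetaR syntax). The actual plugin would render from the MPS editor rather than from plain text lines.

```python
import hashlib
from xml.sax.saxutils import escape


def render_svg(lines, line_height=18, char_width=8):
    """Render program text as a minimal SVG document, one <text> element per line."""
    width = max((len(line) for line in lines), default=0) * char_width + 20
    height = len(lines) * line_height + 20
    rows = []
    for i, line in enumerate(lines):
        y = 10 + (i + 1) * line_height
        # escape() protects &, < and > so the SVG stays well-formed XML.
        rows.append(
            f'  <text x="10" y="{y}" font-family="monospace" '
            f'font-size="14" xml:space="preserve">{escape(line)}</text>'
        )
    header = (f'<svg xmlns="http://www.w3.org/2000/svg" '
              f'width="{width}" height="{height}">')
    return "\n".join([header, *rows, "</svg>"])


class SvgRepository:
    """In-memory stand-in (hypothetical) for the database behind the web front-end."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")
        self.documents = {}  # doc_id -> (svg, searchable text)

    def publish(self, lines):
        """Store the rendered analysis and return its persistent URL."""
        svg = render_svg(lines)
        # Assumption: the identifier is derived from the content, so publishing
        # the same analysis twice yields the same URL.
        doc_id = hashlib.sha256(svg.encode("utf-8")).hexdigest()[:16]
        self.documents[doc_id] = (svg, "\n".join(lines))
        return f"{self.base_url}/{doc_id}"

    def search(self, term):
        """Return the URLs of all published analyses whose text contains term."""
        return [f"{self.base_url}/{doc_id}"
                for doc_id, (_, text) in sorted(self.documents.items())
                if term in text]


repo = SvgRepository("https://example.org/analyses")
program = ["import table counts.tsv", "heatmap with counts where p < 0.05"]
url = repo.publish(program)
```

Deriving the identifier from the content is only one possible design; a real repository might instead assign identifiers at publication time and keep a version history per analysis.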
Technical feasibility. Writing text programs is currently a mainstream approach for data analysis. In contrast, when writing an analysis with LWT, users are effectively assembling a data structure (and seeing parts of it in the MPS user interface). We will use this capability to extend this data structure with metadata about who created and edited the analysis program. Construction of the web front-end has high feasibility because JetBrains is well versed in working with databases and web technology and because MPS already supports an Application Programming Interface (API) for custom persistence. Team members at JetBrains will provide expertise in the internal software architecture of MPS that will speed up development of the prototype. For web publishing of analyses, SVG documents will be rendered from MPS using code similar to that developed for the Editor2PDF project (developed by Fabien Campagne to render LWT programs to PDF for inclusion in books and publications).
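To make the idea of attaching contribution metadata to the program's data structure concrete, here is a minimal Python sketch. It is not the MPS API: the `Node` class, its `created_by` and `edited_by` fields, the `credit_summary` function, and the contributor names are all hypothetical, chosen only to show how credit could be aggregated once each node of an analysis records who created and edited it.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of an analysis program's tree (hypothetical model, not the MPS API)."""
    concept: str                                          # kind of node, e.g. "Heatmap"
    created_by: str                                       # contributor who added the node
    edited_by: list[str] = field(default_factory=list)    # later editors, in order
    children: list[Node] = field(default_factory=list)

    def walk(self):
        """Yield this node and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def credit_summary(root: Node) -> Counter:
    """Count, for each contributor, the nodes they created or edited."""
    counts = Counter()
    for node in root.walk():
        # A set ensures a contributor is counted once per node.
        for person in {node.created_by, *node.edited_by}:
            counts[person] += 1
    return counts


analysis = Node("Analysis", created_by="alice", children=[
    Node("ImportTable", created_by="alice"),
    Node("DifferentialExpression", created_by="bob", edited_by=["alice"]),
    Node("Heatmap", created_by="bob"),
])
summary = credit_summary(analysis)
```

Counting touched nodes is a deliberately crude measure; the prototype could weight contributions differently, and evaluating such alternatives is part of what the aim proposes.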
Training workshops. If we are recipients of a Phase I prize, we will advertise training workshops on the ISCB mailing list and seek institutions to host workshops. We will also directly contact past trainees who took the training at Weill Cornell and have since established their laboratories at other institutions (in the US and Europe). We will give a minimum of five one-day training workshops (approximately one workshop a month in Phase I). We will aim to offer workshops in the order in which requests are received, but may modify this order to reach a more diverse community in the US and Europe. Training sessions will continue throughout Phase II if we are recipients of the Prize. We will survey trainees after training to gather feedback about whether the workshops fulfill expectations (see Figure 4 for survey questions).
Long-term sustainability. The MPS LW has been developed as an open-source project and maintained by JetBrains Inc. since 2004. The software is used internally by JetBrains to develop languages that make it easier to build the company's commercial products. For this reason, the MPS LW is a sustainable, industry-funded, open-source project. All software developed in this project will be released under the Apache 2.0 software license. Any publications about the prototype will be submitted to open-access journals (e.g., PeerJ, PLOS or BMC journals).