Sequencing DNA is not enough without the knowledge (or capacity) to process and analyze the data
Computing is enjoying a great moment in our society, becoming ubiquitous in every context of the modern world. It has been a driving force behind recent scientific advances, assisting in discoveries previously thought impossible, much as the automation of tasks did in the 20th century (autonomous vehicles, computer-aided diagnosis, among many others).
Computers have had a huge impact on medicine, and on genetics in particular.
New sequencers (which make intensive use of several computing techniques) and new biological methods have dramatically reduced the cost of collecting and processing a person’s DNA, producing an unprecedented volume of data to be analyzed.
This raised several questions, mainly about the scalability of these new approaches: even though more data is available, analysis and interpretation remain limited by the speed at which a human analyst can check each variant present in a sample.
Considering this scenario, these points can be synthesized into three main questions:
- How can we make all the processed files available to every bioinformatician and analyst in my organization?
- How can we process this huge volume of data efficiently?
- How can we analyze the thousands of variants identified in my samples in a structured way that guarantees the quality of the reports?
These are the questions that guided Varstation’s development and that, on a personal level, allowed me to put into practice many concepts I had been hearing about since graduating as a computer scientist specialized in Big Data, Cloud Computing and Deep Learning. Next, I will show how we managed to answer these questions in three steps.
First step – Upload and data sharing
Initially, the main development decision was choosing the platform on which our system would be available. Since our mission has always been to democratize genetics for as many people as possible, the choice to build a cloud platform was immediate, even though it entailed several technological challenges.
Furthermore, we did not know how much data our customers would demand, so our infrastructure had to be prepared both for low-demand labs (files of a few megabytes) and for high-demand ones (dozens of gigabytes daily). For these reasons, other companies choose to provide hybrid platforms, where part of the infrastructure must be built or configured locally to enable the transfer of this large volume of data and to simplify security management.
At Varstation we provide two main ways for users to upload their files:
1. Via browser
To make sending large files viable, we use multipart transfer which, in simplified terms, consists of “splitting” the original file into smaller parts sent in parallel, creating redundancy against failures (if one part fails, only that part is resent rather than the whole file) and increasing throughput. To make this process even faster, we also use transfer acceleration technology that routes traffic through dedicated internet access points (as if the data were “cutting in line” ahead of other internet traffic) to minimize upload time. To ensure the integrity of these transfers, each Varstation client has a unique credential that is validated on every upload attempt and changed periodically as an extra security measure.
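The splitting-and-parallel-sending idea can be sketched as follows. This is a minimal illustration, not Varstation’s actual code: `send_part` is a hypothetical stand-in for the real upload call of whichever cloud storage SDK is used, and the 8 MB part size is an arbitrary choice for the example.

```python
# Minimal sketch of multipart transfer: split a payload into fixed-size
# parts, send them in parallel, and retry only the parts that fail.
# `send_part(part_number, chunk)` is a hypothetical upload callable.
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MB parts (illustrative size)

def split_parts(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield (part_number, chunk) pairs covering the whole payload."""
    for offset in range(0, len(data), chunk_size):
        yield offset // chunk_size + 1, data[offset:offset + chunk_size]

def upload_multipart(data: bytes, send_part, max_workers: int = 8):
    """Send all parts in parallel; a failed part is retried on its own
    instead of restarting the entire file."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(send_part, n, c): (n, c)
                   for n, c in split_parts(data)}
        for fut in futures:
            if fut.exception() is not None:
                n, c = futures[fut]  # retry just this part once
                send_part(n, c)
```

Real cloud SDKs implement this same pattern with per-part checksums and more robust retry policies.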
2. Via BaseSpace
The second approach to file transfer targets clients that use BaseSpace (Illumina’s sequencing platform). In this scenario, the files are not on the user’s machine but in Illumina’s cloud, requiring our system to interact with its API and handle all the necessary authentication. Varstation provides an interface where the user logs in to BaseSpace and then browses all their samples directly from our platform. Dedicated servers, whose only job is transferring this data, then connect to Illumina’s servers, so the files are accessed directly from our cloud. This makes the transfer much faster: a 5-gigabyte file can be transferred in only 11 seconds, regardless of the user’s computer or internet connection speed.
Both solutions address file upload and sharing without requiring any adaptation from users, regardless of their infrastructure.
Second step – Data processing
After overcoming the first hurdle, we must enable the processing of these “raw” files to identify the patient’s genetic variants. This process follows a pipeline divided into three major steps:
1. Mapping: organizes the reads produced by the sequencer, identifying the position in the genome where each one belongs;
2. Variant calling: identifies which positions of the analyzed DNA differ from a reference genome;
3. Annotation: adds all pertinent data to each variant so that the analyst has more information when evaluating it.
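The three steps above can be sketched with one common open-source tool chain: BWA for mapping, samtools for sorting, GATK for variant calling and VEP for annotation. These specific tools and the file names are illustrative assumptions, not necessarily what Varstation runs in production; the sketch only builds the command lines, leaving execution to the caller.

```python
# Simplified sketch of a secondary-analysis pipeline for one sample.
# Tool choices (BWA, samtools, GATK, VEP) are a common combination,
# assumed here for illustration only.

def mapping_cmd(ref: str, fq1: str, fq2: str, out_sam: str) -> list[str]:
    # Step 1 - Mapping: place each read at its position in the genome.
    return ["bwa", "mem", "-o", out_sam, ref, fq1, fq2]

def sort_cmd(sam: str, out_bam: str) -> list[str]:
    # Alignments must be converted to sorted BAM before calling.
    return ["samtools", "sort", "-o", out_bam, sam]

def calling_cmd(ref: str, bam: str, out_vcf: str) -> list[str]:
    # Step 2 - Variant calling: positions that differ from the reference.
    return ["gatk", "HaplotypeCaller", "-R", ref, "-I", bam, "-O", out_vcf]

def annotation_cmd(vcf: str, out_vcf: str) -> list[str]:
    # Step 3 - Annotation: attach ClinVar/dbSNP/OMIM/etc. data per variant.
    return ["vep", "--cache", "-i", vcf, "-o", out_vcf]

def pipeline(ref: str, fq1: str, fq2: str, prefix: str) -> list[list[str]]:
    """Return the ordered list of commands for one sample."""
    return [
        mapping_cmd(ref, fq1, fq2, f"{prefix}.sam"),
        sort_cmd(f"{prefix}.sam", f"{prefix}.bam"),
        calling_cmd(ref, f"{prefix}.bam", f"{prefix}.vcf"),
        annotation_cmd(f"{prefix}.vcf", f"{prefix}.annotated.vcf"),
    ]
```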
Each of these steps can be done in different ways depending on the exam type, which results in very different computational costs (for example, an exome tends to use more memory than a panel), preventing our servers from having a single configuration that meets all these demands. Moreover, how well we optimize this processing step dictates how much it costs to run and, therefore, how much customers pay for each sample processed.
So instead of building a single “super configuration” that handles all exam types but is extremely expensive for simpler pipelines, we chose to customize our servers to minimize processing costs, allowing our customers to pay less and turn that into a competitive advantage by lowering their exam prices. Not to mention that this reduction helps democratize the population’s access to genetic testing (our big plan is coming true!).
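As an illustration of the idea, the mapping from exam type to server profile might look like the following sketch. The instance sizes here are invented for the example and are not Varstation’s real numbers.

```python
# Hypothetical per-exam server profiles: each exam type maps to the
# cheapest configuration that fits its workload, instead of one
# expensive "super configuration" for everything.
INSTANCE_PROFILES = {
    # exam type -> resources (exomes need more memory than panels)
    "panel":  {"vcpus": 4,  "memory_gb": 16},
    "exome":  {"vcpus": 16, "memory_gb": 64},
    "genome": {"vcpus": 32, "memory_gb": 128},
}

def profile_for(exam_type: str) -> dict:
    """Pick the smallest configuration that meets the exam's demand."""
    try:
        return INSTANCE_PROFILES[exam_type]
    except KeyError:
        raise ValueError(f"unknown exam type: {exam_type}")
```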
In addition to building custom configurations, we also do our best to minimize the required processing time, which not only reduces costs but also speeds up analysis and the release of reports. For this, we built our own versions of algorithms openly available from the scientific community, adding several optimizations such as parallelism via multithreading. With this, several pipelines had their processing time reduced by a few hours.
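The shape of that parallelism can be sketched as below: split the genome into regions and process them concurrently instead of sequentially. `call_variants_in_region` is a hypothetical stand-in for the real CPU-heavy step; in the actual tools this parallelism lives in optimized native code, not Python.

```python
# Minimal sketch of region-level parallelism for a pipeline step.
from concurrent.futures import ThreadPoolExecutor

def process_regions(regions, call_variants_in_region, max_workers=8):
    """Run one worker per genomic region and merge results in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_region = pool.map(call_variants_in_region, regions)
        # map() preserves input order, so merging is a simple flatten.
        return [variant for result in per_region for variant in result]
```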
Third and last step – Data analysis
With the system capable of receiving and sharing files quickly and intuitively, and of processing them efficiently, we reach the point where the user is most aware of the impact of our development decisions: analyzing an already processed sample.
This stage perhaps presents the biggest challenges arising from the choice to be a cloud platform.
Taking an exome as an example, we need to display over 300,000 variants to the user, each with data aggregated from different sources (ClinVar, dbSNP, OMIM, etc.), in a web browser (which has a memory limit), in real time, and without requiring an extremely fast internet connection. Challenging, right?
The key to making the analysis more agile and efficient is to ensure that the variants most likely to be pathogenic are displayed prominently and that those not relevant to the analysis are automatically excluded.
For this, we mainly use filtering (we suggest predefined filters and also allow users to create their own) and display variants on demand (data pagination). We also index these variants in different ways in our database to make the related queries more efficient. We are currently also building Machine Learning models so that the computer automatically “learns” the best ways to classify the pathogenicity of these variants.
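A minimal sketch of the filtering-plus-pagination idea, with invented field names (`gene`, `frequency`) standing in for the many annotated fields a real variant record carries:

```python
# Sketch: filter server-side, then send the browser only one page,
# so it never holds all 300,000+ variants in memory at once.
from itertools import islice

def filter_variants(variants, max_frequency=None, genes=None):
    """Apply the active filters (predefined or user-created)."""
    for v in variants:
        if max_frequency is not None and v["frequency"] > max_frequency:
            continue  # too common in the population to be relevant
        if genes is not None and v["gene"] not in genes:
            continue
        yield v

def page(variants, page_number, page_size=50):
    """Return a single page of the (lazily filtered) result stream."""
    start = page_number * page_size
    return list(islice(variants, start, start + page_size))
```

Because `filter_variants` is a generator, filtering and pagination compose lazily; in production the same effect comes from indexed database queries rather than in-process iteration.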
Computing – the driving force of the genetic revolution
Thus, we have demonstrated solutions to the three questions proposed at the beginning of this text. Through them, we give users the power to share files, process samples and perform analyses using only the computer they already own. Even in this simplified overview, this involves many obstacles that push the development team to keep looking for alternatives and improvements to further optimize today’s approaches.
What makes this whole process even more challenging is that very few genetics solutions work 100% in the cloud, preventing us from building on prior work and forcing us to create all of these approaches from scratch. However, it is extremely gratifying to be at this leading edge and to employ computing as the driving force for a new revolution in our society.
To reach a future where genetic analysis is fully integrated into our society, from identifying pathologies in a baby to using personalized drugs to treat different cancers, it takes people who broaden the technological horizons we have today and create the necessary environment for this. And that’s exactly what we are doing at Varstation.