In the previous post, we discussed what a pipeline is and some points developers should take into account when building one. Today, let’s go a step further and discuss one of the main aspects of a pipeline: reliability.
Let’s picture a situation where two variant calling pipelines are available, and we must decide which one is better. Neither shows any bugs at first: they take the sequencing reads and report the variants at the end, as planned. However, the author of one of them claims that pipeline can detect more variants than the other.
So what now? Should we simply use that pipeline because of the claim?
“A prudent question is one-half of wisdom.” – Francis Bacon
First of all, we must ask ourselves whether a report containing more variants is, in fact, a good thing. At first sight it might be, if not for another important question: are they true variants?
This question changes everything! What’s the point of reporting more variants if they are not variants at all? That’s definitely something we do not want. Along the same lines, what’s the point of failing to report lots of true variants? That’s something we do not want, either! These two things are intrinsically related and extremely important, since the correctness of a pipeline’s calls will not only guide users in choosing the best one, but also help developers improve their own pipelines.
To find out how trustworthy those pipelines are, let’s imagine the following scenario, where a hypothetical 1000 bp genome contains 100 known SNPs. To evaluate the pipelines, let’s build a matrix (a confusion matrix, to be more precise) for each one, so we can check the true and false calls each delivers.
First, notice that the elements in the first column of both matrices sum to 100, which corresponds to the number of true SNPs in the sample. Likewise, the second column sums to 900, which corresponds to the remaining sites of the genome. The rows hold the pipeline calls. Pipeline 1, for instance, claims there are 120 SNPs in the sample and that the remaining 880 sites are non-SNPs. The second pipeline, on the other hand, claims to have detected 86 SNPs and calls the remaining 914 sites negative.
These matrices give us a very nice intuition of how well our pipelines are performing. We know there are 100 SNPs in the sample, and yet pipeline 1 tells us there are 120! That sounds like too many, don’t you agree? In fact, this pipeline is reporting lots of false positives (25, to be more precise). On the other hand, the second pipeline reports 86 SNPs, of which only one is not a true SNP. This result is definitely a more reasonable approximation of the true number of SNPs in the sample. In this sense, pipeline 2 looks like it is doing a better job than pipeline 1.
However, we know there are 900 sites in the genome that aren’t SNPs at all. Pipeline 1 claims 880 sites are not SNPs, when in reality 5 of them are SNPs it failed to detect. Pipeline 2, instead, reports 914 non-SNPs in the sample, 15 of which are authentic SNPs that it missed. In this sense, it looks like pipeline 1 outperforms pipeline 2, as it misses fewer of the SNPs that are really there.
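The two confusion matrices can be rebuilt from the counts quoted above. The snippet below is a minimal sketch: the function names (`confusion_matrix`, `show`) are made up for illustration, and the only inputs are the figures from the text (1000 sites, 100 true SNPs, 120 calls with 25 false positives for pipeline 1, 86 calls with 1 false positive for pipeline 2).

```python
# Figures quoted in the text for the hypothetical genome.
GENOME_SITES = 1000
TRUE_SNPS = 100


def confusion_matrix(called_snps, false_positives):
    """Derive the four cells from the number of calls and of false positives."""
    tp = called_snps - false_positives          # calls that are real SNPs
    fp = false_positives                        # calls that are not SNPs
    fn = TRUE_SNPS - tp                         # real SNPs that were not called
    tn = (GENOME_SITES - TRUE_SNPS) - fp        # non-SNP sites left uncalled
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


def show(name, m):
    """Print the matrix with true classes as columns and calls as rows."""
    print(name)
    print(f"{'':<16}{'true SNP':>10}{'non-SNP':>10}")
    print(f"{'called SNP':<16}{m['TP']:>10}{m['FP']:>10}")
    print(f"{'called non-SNP':<16}{m['FN']:>10}{m['TN']:>10}")


pipeline_1 = confusion_matrix(called_snps=120, false_positives=25)
pipeline_2 = confusion_matrix(called_snps=86, false_positives=1)

show("Pipeline 1", pipeline_1)  # TP=95, FP=25, FN=5,  TN=875
show("Pipeline 2", pipeline_2)  # TP=85, FP=1,  FN=15, TN=899
```

In both cases the first column adds up to 100 and the second to 900, and the second row adds up to the 880 and 914 negative calls mentioned in the text.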
These confusion matrices allowed us to compare both pipelines, even though they left us somewhat (wait for it……) CONFUSED about picking the best one. Since these comparisons can be tricky, some metrics were developed so we can dig deeper and check, in a more normalized way, whether a pipeline is doing a good job or not. Let’s check them out:
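Accuracy: it tells what fraction of all instances (positives and negatives) were correctly classified. Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision: it tells what fraction of the positive calls are actually correct. Precision = TP / (TP + FP)
Recall: it tells what fraction of the true cases were called as positives. Recall = TP / (TP + FN)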
TP – True positives; TN – True negatives; FP – False positives; FN – False negatives
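As a quick check, the three metrics can be computed for both pipelines. This is a minimal sketch using the standard definitions; the counts are the ones implied by the text (for example, 120 calls with 25 false positives give 95 true positives for pipeline 1), and the function and variable names are only illustrative.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all sites that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp, fn, tn):
    """Fraction of positive calls that are true SNPs."""
    return tp / (tp + fp)


def recall(tp, fp, fn, tn):
    """Fraction of true SNPs that were called."""
    return tp / (tp + fn)


# Counts derived from the figures quoted in the text.
pipelines = {
    "pipeline 1": dict(tp=95, fp=25, fn=5, tn=875),
    "pipeline 2": dict(tp=85, fp=1, fn=15, tn=899),
}

for name, counts in pipelines.items():
    print(
        f"{name}: "
        f"accuracy={accuracy(**counts):.3f} "
        f"precision={precision(**counts):.3f} "
        f"recall={recall(**counts):.3f}"
    )
# pipeline 1: accuracy=0.970 precision=0.792 recall=0.950
# pipeline 2: accuracy=0.984 precision=0.988 recall=0.850
```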
In the table below, we can compare the performance of each pipeline. In terms of accuracy, both pipelines are similar (0.970 for pipeline 1 and 0.984 for pipeline 2), i.e., the vast majority of their calls, positive and negative taken together, are correct. The main difference lies in the false positives and false negatives. Precision tells us pipeline 2 is far better than pipeline 1 (85/86 ≈ 0.99 versus 95/120 ≈ 0.79). This could be translated as: when pipeline 2 makes a positive call, it is most likely a true one, which indicates this method is more careful in assigning a SNP. On the other hand, recall tells us pipeline 1 is the best (95/100 = 0.95 versus 85/100 = 0.85). We could translate this as follows: when a site really is a SNP, pipeline 1 is more likely to call it. In other words, its approach is more careful in avoiding false negatives.
In the end, users have to decide whether false negatives or false positives are more critical for whatever they plan to do with the variant information. The best of both worlds is not always achievable, and a decision must be made based on what is available. In our example, if balance is what you are after, pipeline 2 is the better choice: its precision is much higher, and although its recall is lower, that gap is smaller than the gap in precision.
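One way to make this trade-off explicit is a single score that weights precision against recall. The post does not name one; the sketch below uses the F-beta score, a common choice that is an addition here and not part of the original comparison. With `beta=1` precision and recall count equally, and with `beta=2` recall (avoiding false negatives) weighs more. The counts are again those implied by the text.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score: weighted harmonic mean of precision and recall.

    beta > 1 favours recall, beta < 1 favours precision.
    """
    b2 = beta * beta
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


# Counts derived from the figures quoted in the text.
pipeline_1 = dict(tp=95, fp=25, fn=5)
pipeline_2 = dict(tp=85, fp=1, fn=15)

for beta in (1.0, 2.0):
    s1 = f_beta(**pipeline_1, beta=beta)
    s2 = f_beta(**pipeline_2, beta=beta)
    print(f"beta={beta}: pipeline 1 = {s1:.3f}, pipeline 2 = {s2:.3f}")
# beta=1.0: pipeline 1 = 0.864, pipeline 2 = 0.914
# beta=2.0: pipeline 1 = 0.913, pipeline 2 = 0.874
```

With equal weights pipeline 2 comes out ahead, which matches the conclusion above. Once missing a SNP is treated as more costly than a false call, the ranking flips in favour of pipeline 1.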
So, as we can see, these metrics allow us to compare different approaches to solving the same problem. Do they tell how trustworthy the pipelines are? In part, they do. They tell how correct the calls are, which is a lot, but they do not reveal the big picture yet. For a complete answer, two other considerations must be made:
1. Is the pipeline thoroughly validated?
Take for instance our fake genome, used to calculate the metrics of pipelines 1 and 2. Would you trust these metrics to truly represent how the pipelines perform on real-world data? Definitely not! A pipeline must be extensively tested. Developers usually make use of simulated data, where they control exactly what to expect and can test whether the pipeline is capable of finding the features that were added. Moreover, tests must be performed with real-life data in which at least some features are known in advance. That way, one can account for the complexity intrinsic to real data and evaluate how well the pipeline deals with it. As a result, the metrics become more robust and inspire more confidence in the predictive power of the method.
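The simulated-data idea can be sketched in a few lines: plant SNPs at known positions, run a caller, and score its calls against the planted truth. Everything below is hypothetical and for illustration only. `simulate_sample`, `toy_caller` and `score_calls` are made-up names, and the toy caller simply compares two sequences, standing in for a real pipeline that would work from sequencing reads.

```python
import random


def simulate_sample(genome_length, n_snps, seed):
    """Build a random reference and a sample carrying SNPs at known positions."""
    rng = random.Random(seed)
    reference = [rng.choice("ACGT") for _ in range(genome_length)]
    truth = set(rng.sample(range(genome_length), n_snps))
    sample = list(reference)
    for pos in truth:
        # Substitute a base that differs from the reference at this site.
        sample[pos] = rng.choice([b for b in "ACGT" if b != reference[pos]])
    return reference, sample, truth


def toy_caller(reference, sample):
    """Stand-in for a real pipeline: call every site that differs."""
    return {i for i, (r, s) in enumerate(zip(reference, sample)) if r != s}


def score_calls(called, truth):
    """Return (precision, recall) of the called positions against the truth."""
    tp = len(called & truth)
    fp = len(called - truth)
    fn = len(truth - called)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


reference, sample, truth = simulate_sample(genome_length=1000, n_snps=100, seed=42)
calls = toy_caller(reference, sample)
precision, recall = score_calls(calls, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Because the planted positions are known, any missed or spurious call can be traced back to a specific site. Real data would replace the simulated truth set with whatever features are known for that sample.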
2. Does the pipeline always produce the same results for the same inputs?
Imagine you find a nice pipeline that performs a task you are interested in. In the morning you get your input files, execute the pipeline and retrieve your outputs. Everything looks nice. Then you tell a colleague about this tool you found, and he is eager to test it as well. He takes the same input files, executes the pipeline, and BOOM, the output is nowhere near what he expected. You talk it over, run the pipeline again and again, and realize that sometimes it reports the correct results and other times it doesn’t. That’s what an unstable pipeline looks like, and it is definitely something we do not want either. We expect a pipeline to be robust and always report the same results for the same inputs*.
* Some problems are too computationally intensive to be solved exactly. In those cases, algorithms are designed so that the computation becomes feasible at the cost of some accuracy. The results may then vary from run to run while still being significant. But more on that in a future post.
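A simple way to check reproducibility is to run the pipeline several times on the same input and compare a checksum of each output. The sketch below assumes a pipeline can be wrapped as a Python function that takes a list of strings and returns a list of output lines. `is_reproducible`, `unseeded_pipeline` and `seeded_pipeline` are hypothetical names, and the two toy pipelines only mimic a randomized algorithm without and with a fixed seed.

```python
import hashlib
import random


def digest(lines):
    """SHA-256 checksum of a pipeline's output lines."""
    h = hashlib.sha256()
    for line in lines:
        h.update(line.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


def is_reproducible(pipeline, inputs, runs=3):
    """True if every run on the same inputs yields an identical output."""
    digests = {digest(pipeline(inputs)) for _ in range(runs)}
    return len(digests) == 1


def unseeded_pipeline(reads):
    """Toy randomized step: reports a random half of the reads."""
    return sorted(random.sample(reads, k=len(reads) // 2))


def seeded_pipeline(reads, seed=0):
    """Same step, but with a fixed seed so repeated runs agree."""
    rng = random.Random(seed)
    return sorted(rng.sample(reads, k=len(reads) // 2))


reads = [f"read_{i}" for i in range(20)]
print("seeded:  ", is_reproducible(seeded_pipeline, reads))
print("unseeded:", is_reproducible(unseeded_pipeline, reads))
```

The seeded version reports the same output on every run, while the unseeded one will almost certainly differ between runs. Fixing and recording the seed is one common way to keep a randomized method reproducible.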
To summarize, to get a trustworthy pipeline, make sure it has been fully validated with real-world data, that it is reproducible, and that it offers the balance of precision and recall best suited to your needs.