What is a VCF file?

When studying the genetic variation of an organism, the identified variants are stored in VCF (Variant Call Format) files. A VCF file is plain text and can hold SNPs, indels, and even larger structural variants, together with additional information about each one [1].

VCF files may vary according to the format version used by the variant caller, so we will focus on the general aspects of the format. For more details, consult the format specification [2].

What does a VCF file look like?

A VCF file is a tabular text file made up of 3 main parts: (1) the metadata header, (2) the column header, and (3) the data lines with their annotations.

[Image: example VCF file with its three parts numbered (1) to (3)]

1. The metadata header lines start with a double hash mark (##). The first line, “fileformat”, is mandatory and indicates the VCF version in which the data was stored. The remaining lines are optional, but they describe the data found in section (3).

2. The column header starts with a single hash (#). The first 8 columns (CHROM to INFO) are required in every version of VCF. If the VCF reports genotypes, a FORMAT column appears, followed by one column holding the corresponding values for each sample (in this case, sample NA0001). If the VCF contains multiple samples, there is one such column per sample.

In short:

CHROM → Reference sequence name (may be a scaffold / fragment)

POS → Position in the reference where the variant was found

ID → If the variant has been annotated in dbSNP (for human data), its identifier appears here

REF / ALT → The reference base(s) at the given position / The alternate base(s) at the given position

QUAL → Variant quality score, which varies between programs. The higher the value, the more reliable the call

FILTER → Whether the called variant passed the program's quality requirements

INFO → Relevant information such as the number of reads at the position (programs often disregard low-quality aligned reads), the allele frequency, and allele counts, among others [2]
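The layout described so far can be illustrated with a short Python sketch. It assumes a tiny, made-up VCF held in a string (the sample name SAMPLE1 and all values are invented for illustration) and uses a hypothetical helper, split_vcf, to separate the metadata lines, the column header, and the data lines. It is not a full parser; real files are usually better handled by a dedicated library.

```python
# Illustrative VCF content; names and values are invented for this sketch.
VCF_TEXT = (
    "##fileformat=VCFv4.2\n"
    "##source=exampleCaller\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\n"
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB\tGT\t0|0\n"
)

# The 8 columns that every VCF must contain, in order.
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def split_vcf(text):
    """Hypothetical helper: return (metadata, columns, records)."""
    metadata, columns, records = [], [], []
    for line in text.splitlines():
        if line.startswith("##"):
            # Part 1: metadata header lines
            metadata.append(line[2:])
        elif line.startswith("#"):
            # Part 2: the single column header line
            columns = line[1:].split("\t")
        elif line:
            # Part 3: data lines, mapped to the column names
            records.append(dict(zip(columns, line.split("\t"))))
    return metadata, columns, records


metadata, columns, records = split_vcf(VCF_TEXT)

# Columns after FORMAT (index 8) are sample columns.
samples = columns[9:]

print(metadata[0])
print(columns[:8] == FIXED_COLUMNS)
print(samples)
```

Keeping each record as a dictionary keyed by column name makes lookups such as records[0]["REF"] readable, at the cost of leaving every value as a string.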

3. This section holds the information about each variant found in the sample(s). In most cases, each line describes a distinct variant. Based on the first data line, the variant in question would be chr20:14370G>A.
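As a minimal sketch of reading one data line, the Python below splits a made-up record (invented values, chosen to match the variant mentioned above) into its fixed columns, turns the INFO column into a dictionary, and builds the short variant description. The function names parse_info and describe_variant are hypothetical, and the "chr" prefix is added only to match the notation used in this post.

```python
# Made-up, tab-separated data line; values are for illustration only.
DATA_LINE = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB"


def parse_info(info):
    """Turn 'KEY=VALUE;FLAG' entries into a dict; flags map to True."""
    parsed = {}
    if info == ".":  # "." marks a missing value in VCF
        return parsed
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def describe_variant(line):
    """Hypothetical helper: build a 'chrN:POSREF>ALT' description."""
    chrom, pos, _id, ref, alt = line.split("\t")[:5]
    return f"chr{chrom}:{pos}{ref}>{alt}"


info = parse_info(DATA_LINE.split("\t")[7])

print(describe_variant(DATA_LINE))  # chr20:14370G>A
print(info["DP"], info["DB"])
```

INFO values stay as strings here; converting them to numbers would require the types declared in the metadata header, which this sketch does not read.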

References: 

[1] Danecek P, et al. The variant call format and VCFtools. Bioinformatics 2011;27(15): 2156–8. doi: 10.1093/bioinformatics/btr330

[2] The Variant Call Format (VCF) Version 4.2 Specification. https://samtools.github.io/hts-specs/VCFv4.2.pdf
