The information generated by mapping (aligning) reads against a reference sequence is stored in files in the Sequence Alignment/Map (SAM) format. SAM files are text based and provide information such as the alignment position of each read, its mapping quality (MAPQ), the name of the read and the name of the reference sequence to which it was aligned. They also make it possible to calculate the distance between the reads of a pair, among other things.
While the SAM file is a text format, the BAM (Binary Alignment Map) file is a binary representation, which compresses the alignment information into a smaller file.
Basic structure of SAM/BAM files
Here, we will introduce some basic units of a SAM/BAM file.
In general, the SAM file has two main parts: the header and the alignment section. Both parts may contain slight variations depending on the aligner and the version of the SAM format used.
For more detailed information about this structure, we recommend consulting the SAM/BAM format manual.
1. Header
The header lines start with “@”.
From the example header above:
@HD: First line of the header. It can contain the version of the SAM format used (VN), how the alignments are sorted (SO), etc.
@SQ: Describes a reference sequence, containing its name (SN), its length (LN), etc.
@RG: Identifies a read group, i.e. the set of reads generated from a single sequencing run. It can record the library (LB), the sample (SM) and the sequencing technology (PL), etc. A file can contain more than one “@RG” line if you have more than one sample, or reads from different libraries and/or sequencers, etc.
@PG: Describes the program used to generate the SAM/BAM file (PN), the command line that was run (CL), etc.
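As a rough illustration of how these header records are laid out, the sketch below splits tab-separated header lines into their two-letter tags using only the Python standard library. The function name `parse_sam_header` and every value in `example_header` (sample, library, reference and command line) are invented for this example and are not taken from the header shown in this tutorial.

```python
def parse_sam_header(lines):
    """Group SAM header lines by record type (HD, SQ, RG, PG, CO)."""
    header = {}
    for line in lines:
        if not line.startswith("@"):
            break  # the header ends at the first alignment line
        fields = line.rstrip("\n").split("\t")
        record_type = fields[0][1:]  # drop the leading "@"
        if record_type == "CO":
            # @CO lines hold free-text comments, not TAG:VALUE pairs
            header.setdefault("CO", []).append("\t".join(fields[1:]))
            continue
        # every other field has the form TAG:VALUE; split on the first colon only
        tags = dict(field.split(":", 1) for field in fields[1:])
        header.setdefault(record_type, []).append(tags)
    return header


# Hypothetical header, made up for illustration only.
example_header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
    "@RG\tID:run1\tSM:sample1\tLB:lib1\tPL:ILLUMINA",
    "@PG\tID:bwa\tPN:bwa\tCL:bwa mem ref.fa reads_1.fq reads_2.fq",
]

header = parse_sam_header(example_header)
print(header["HD"][0]["SO"])   # sort order
print(header["SQ"][0]["LN"])   # reference length, still a string here
print(header["RG"][0]["SM"])   # sample name of the first read group
```

Each record type maps to a list because @SQ, @RG and @PG may appear several times in one file. For real BAM files a dedicated library is usually a better choice than hand-written parsing.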
2. Alignment section
The first 11 columns of each alignment line are mandatory. Any further columns are optional fields (tags) that carry metadata, and they may vary depending on the aligner used.
In the example below, mandatory columns are indicated by shades of green, while metadata in orange.
(1) Name of the read
(2) Name of the reference where the read was aligned
(3) Specific information about the read alignment. The first number is the SAM FLAG, while the second is the CIGAR string, a compact representation of how the read aligns to the reference. In the example, the value 22S12M2D52M65S indicates that the first 22 bases of the read were ignored in the alignment (soft clipping), 12 bases were aligned to the reference (as matches or mismatches), 2 reference bases are deleted from the read, 52 bases were aligned and the final 65 bases were soft clipped. The sum of the CIGAR operations that consume read bases (here S and M: 22 + 12 + 52 + 65 = 151) equals the read length; the 2 deleted bases are not part of the read and are not counted.
(4) Information about the mate of the read, in order: (1) whether it is aligned to the same reference (=) or to another one (reference name), (2) the alignment position of the mate and (3) the observed template length (TLEN), i.e. the distance spanned on the reference by the read pair.
(5) In order, the sequence of read bases (SEQ) followed by the base qualities (QUAL).
(6) Metadata, where we can find information such as the number of mismatches, etc.
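The column layout described above can be sketched in a few lines of standard-library Python. The column names follow the SAM specification, but the helper functions (`parse_alignment`, `decode_flag`, `cigar_lengths`) and the record in `example_line` are assumptions made up for this illustration; only the CIGAR string 22S12M2D52M65S comes from the example in the text.

```python
import re

# Names of the 11 mandatory columns, in the order defined by the SAM specification.
MANDATORY = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
             "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")

# Meaning of each FLAG bit (short labels chosen for this sketch).
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse", 0x20: "mate_reverse", 0x40: "read1", 0x80: "read2",
    0x100: "secondary", 0x200: "qc_fail", 0x400: "duplicate",
    0x800: "supplementary",
}

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_OPS = set("MIS=X")  # operations that consume bases of the read
REF_OPS = set("MDN=X")    # operations that consume bases of the reference


def parse_alignment(line):
    """Split one alignment line into mandatory columns and optional tags."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(MANDATORY, fields[:11]))
    for key in INT_FIELDS:
        record[key] = int(record[key])
    record["tags"] = {}
    for optional in fields[11:]:
        tag, value_type, value = optional.split(":", 2)  # TAG:TYPE:VALUE
        record["tags"][tag] = int(value) if value_type == "i" else value
    return record


def decode_flag(flag):
    """Return the labels of all bits set in FLAG, lowest bit first."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def cigar_lengths(cigar):
    """Return (bases consumed on the read, bases consumed on the reference)."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    read_len = sum(n for n, op in ops if op in QUERY_OPS)
    ref_span = sum(n for n, op in ops if op in REF_OPS)
    return read_len, ref_span


# Invented record for illustration only; SEQ and QUAL are placeholder strings.
example_line = "\t".join([
    "read1", "99", "chr1", "1000", "60", "22S12M2D52M65S",
    "=", "1200", "350", "A" * 151, "I" * 151, "NM:i:3",
])

rec = parse_alignment(example_line)
read_len, ref_span = cigar_lengths(rec["CIGAR"])
print(decode_flag(rec["FLAG"]))
print(read_len, len(rec["SEQ"]), ref_span)  # 151 151 66
```

The last line shows why deletions are left out of the read length: the S and M operations add up to the 151 bases of SEQ, while the M and D operations give the 66 reference positions covered by the alignment.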