bio align: perform alignments

Note: alignments in bio are primarily designed for exploratory use, for aligning relatively short (up to ~30Kb long sequences), visually investigating the alignments, interacting with the sequences before and after alignment.

Use software that relies on heuristics when investigating large datasets. Specialized software will operate (many) orders of magnitude faster. Depending on your needs you may want to use: blast, blat, mummer, minimap2, lastz, lastal, exonerate, vsearch, diamond.

Install bio with:

pip install bio --upgrade

The full documentation for bio is maintained at https://www.bioinfo.help/.

Quick start

Input may be given from command line:

bio align GATTACA GATCA

the first sequence is the target, all following sequences are aligned considered queries. The command prints:

# seq1 (7) vs seq2 (5)
# pident=57.1% len=7 ident=4 mis=1 del=2 ins=0
# semiglobal: score=2.0 match=1 mismatch=2 gapopen=11 gapextend=1

GATTACA
|||.|--
GATCA--

Alignment as a table

When generating alignment providing output in different formats is important. Here is the alignment formatted as a table:

bio align GATTACA GATCA --table

prints:

target  query  pident  len  match  mism  ins  del
seq1    seq2   57.1    7    4      1     0    2

Alignment as VCF

bio align GATTACA GATCA --vcf

prints:

##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=TYPE,Number=1,Type=String,Description="Type of the variant">
##contig=<ID=seq1,length=7,assembly=seq1>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  seq2
seq1    4   4_SNP_T/C   T   C   .   PASS    TYPE=SNP    GT  1
seq1    5   5_DEL_2 ACA A   .   PASS    TYPE=DEL    GT  1

Alignment as differences

The output may be formatted in as differences (mutations) between the reference (second sequence) and the query (first sequence):

bio align GATTACA GATCA --diff

prints:

T4C    SNP  4  T    C  seq1  seq2
ACA5A  DEL  5  ACA  A  seq1  seq2

You can think of the mutations output can be thought of as a simplified VCF.

Alignment types

The default alignment is semi-global (global alignment with no end gap penalies). To perform a global alignment write:

bio align GATTACA GATCA --global

the output is:

# seq1 (7) vs seq2 (5)
# pident=71.4% len=7 ident=5 mis=0 del=2 ins=0
# global: score=-7.0 match=1 mismatch=2 gapopen=11 gapextend=1

GATTACA
|||--||
GAT--CA

Available ptions are : --global, --local, --semi-global

Align realistic data

Fetches two genomic files:

bio fetch MN996532 NC_045512 > genomes.gb

Align two genomes, the first is the target, all following sequences are considered queries and are aligned against the target. Here we are looking at what mutations would need to be made to the bat genome to turn it into the coronavirus genome:

cat genomes.gb | bio fasta --genome | bio align | head

the command aligns two 30KB sequences and takes about 15 seconds on my system, it will print:

# MN996532.2 (29855) vs NC_045512.2 (29903)
# pident=96.0% len=29908 ident=28725 mis=1125 del=5 ins=53
# semiglobal: score=139005.0 gap open=11 extend=1  matrix=NUC.4.4

ATTAAAGGTTTATACCTTTCCAGGTAACAAACCAACGAACTCTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAA
||||||||||||||||||.|||||||||||||||||.||||.|||||||||||||||||||||||||||||||||||||||
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAA

The bio align method takes the first file that it sees as target and aligns all other sequences to it as queries.

Align coding sequences

Align the DNA corresponding to protein S

cat genomes.gb | bio fasta --gene S | bio align | head

prints:

# QHR63300.2 (3810) vs YP_009724390.1 (3822)
# pident=92.9% len=3822 ident=3549 mis=261 del=0 ins=12
# semiglobal: score=3005.0 match=1 mismatch=2 gapopen=11 gapextend=1

ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTTTCTAGTCAGTGTGTTAATCTAACAACTAGAACTCAGTTACCTCCTGCA
||||||||||||||||||||||||||||||||.||||||||||||||||||||.|||||.||||||||.|||||.||||||
ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACCAGAACTCAATTACCCCCTGCA

Align protein sequences

Align the protein sequences that prints:

cat genomes.gb | bio fasta --gene S --protein |  head

that prints:

S32F    SNP 32  S   F   QHR63300.2  YP_009724390.1
L50S    SNP 50  L   S   QHR63300.2  YP_009724390.1
I76T    SNP 76  I   T   QHR63300.2  YP_009724390.1
P218Q   SNP 218 P   Q   QHR63300.2  YP_009724390.1
D324E   SNP 324 D   E   QHR63300.2  YP_009724390.1
T346R   SNP 346 T   R   QHR63300.2  YP_009724390.1
T372A   SNP 372 T   A   QHR63300.2  YP_009724390.1
T403R   SNP 403 T   R   QHR63300.2  YP_009724390.1

Alignment showing mutations

cat genomes.gb | bio fasta --gene S --protein | bio align --diff | tail -5

prints the variations:

N519H      SNP  519   N  H      QHR63300.2  YP_009724390.1
A604T      SNP  604   A  T      QHR63300.2  YP_009724390.1
S680SPRRA  INS  680   S  SPRRA  QHR63300.2  YP_009724390.1
S1121N     SNP  1121  S  N      QHR63300.2  YP_009724390.1
I1224V     SNP  1224  I  V      QHR63300.2  YP_009724390.1

Alignment with tabular output

You can produce a column based table output

cat genomes.gb | bio fasta --gene S --protein | bio align --table

prints:

target      query           pident  len   match  mism  ins  del
QHR63300.2  YP_009724390.1  97.4    1273  1240   29    4    0

Different scoring matrices

cat genomes.gb | bio fasta --gene S --translate | bio align --matrix PAM30 | head

prints:

# QHR63300.2 (1270) vs YP_009724390.1 (1274)
# pident=97.4% len=1274 ident=1241 mis=29 del=0 ins=4
# semiglobal: score=9339.0 matrix=PAM30 gapopen=11 gapextend=1

MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSSTRGVYYPDKVFRSSVLHLTQDLFLPFFSNVTWFHAIHVSGTNGIKRFDN
|||||||||||||||||||||||||||||||.|||||||||||||||||.|||||||||||||||||||||||||.|||||
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDN

Scoring matrices

The scoring matrix may be a builtin name or a file with a scoring matrix format. See the scoring with:

bio align --matrix PAM30

prints:

#
# This matrix was produced by "pam" Version 1.0.6 [28-Jul-93]
#
# PAM 30 substitution matrix, scale = ln(2)/2 = 0.346574
#
# Expected score = -5.06, Entropy = 2.57 bits
#
# Lowest score = -17, Highest score = 13
#
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *
A   6  -7  -4  -3  -6  -4  -2  -2  -7  -5  -6  -7  -5  -8  -2   0  -1 -13  -8  -2  -3  -3  -3 -17
R  -7   8  -6 -10  -8  -2  -9  -9  -2  -5  -8   0  -4  -9  -4  -3  -6  -2 -10  -8  -7  -4  -6 -17
N  -4  -6   8   2 -11  -3  -2  -3   0  -5  -7  -1  -9  -9  -6   0  -2  -8  -4  -8   6  -3  -3 -17
D  -3 -10   2   8 -14  -2   2  -3  -4  -7 -12  -4 -11 -15  -8  -4  -5 -15 -11  -8   6   1  -5 -17
...

Exercises

Align genomic DNA to CDNA

bio fetch ENST00000288602 | head > genomic.fa

bio fetch ENST00000288602 --type cdna | head > cdna.fa

Is this a good alignment?

bio align genomic.fa cdna.fa

Try local alignment

bio align genomic.fa cdna.fa --local