Sequence alignments

The alignments in bio are designed primarily for exploratory use: aligning relatively short sequences (up to ~30 Kb long), visually inspecting the alignments, and interacting with the sequences before and after alignment. In such cases the alignments are generated in a reasonable amount of time (about 5 seconds per 10 Kb). The implementations are mathematically optimal, but the libraries that we rely on do not scale well to longer sequences.
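
To illustrate why mathematically optimal alignment scales poorly, here is a minimal sketch of a dynamic programming global alignment score. It is not the implementation that bio uses; the function name `global_score` and the simple match/mismatch/gap scores are assumptions chosen for illustration. The point is that every cell of a table of size len(a) x len(b) must be filled in.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score with a linear gap cost (illustrative only)."""
    # prev holds the previous row of the (len(a)+1) x (len(b)+1) score table.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * gap]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # Best of: align x with y, gap in b, gap in a.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(global_score("ACGT", "AGT"))  # 2: three matches and one gap
print(10_000 * 10_000)              # table cells needed for two 10 Kb sequences
```

Doubling the length of both sequences quadruples the number of cells, which is why heuristic tools are preferred for long sequences.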

For studies that need high-throughput alignments, use specialized software that relies on heuristics. Specialized software will operate (many) orders of magnitude faster. Depending on your needs, blast, blat, mummer, minimap2, lastz, lastal, exonerate, vsearch, or diamond will be far better suited for genome-wide analyses.

DNA alignment

Align the DNA corresponding to protein S

bio align ncov:S ratg13:S --end 60 

# Ident=57(95.0%)  Mis=3(5.0%)  Gaps=0(0.0%)  Target=(1, 60)  Query=(1, 60)  Length=60  Score=273.0  NUC.4.4(11,1)

YP_009724390 ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACC
             ||||||||||||||||||||||||||||||||.||||||||||||||||||||.|||||. 60
QHR63300.2   ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTTTCTAGTCAGTGTGTTAATCTAACAACT
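
The Ident, Mis, and Gaps fields in the header line can be checked by counting columns in the two aligned rows. The sketch below assumes that is how the fields are defined; `alignment_stats` is a hypothetical helper, not part of bio.

```python
def alignment_stats(target, query):
    """Count identical, mismatched and gapped columns of two aligned rows."""
    if len(target) != len(query):
        raise ValueError("aligned rows must have the same length")
    ident = mis = gaps = 0
    for t, q in zip(target, query):
        if t == "-" or q == "-":
            gaps += 1
        elif t == q:
            ident += 1
        else:
            mis += 1
    return ident, mis, gaps, len(target)


# The two rows printed above (YP_009724390 and QHR63300.2).
target = "ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACC"
query = "ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTTTCTAGTCAGTGTGTTAATCTAACAACT"

ident, mis, gaps, length = alignment_stats(target, query)
print(ident, mis, gaps, length)            # 57 3 0 60
print(round(100 * ident / length, 1))      # 95.0
```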

DNA alignment with 1 letter amino acid codes

bio align ratg13:S ncov:S  --end 60  -1

# Ident=57(95.0%)  Mis=3(5.0%)  Gaps=0(0.0%)  Target=(1, 60)  Query=(1, 60)  Length=60  Score=273.0  NUC.4.4(11,1)

              M  F  V  F  L  V  L  L  P  L  V  S  S  Q  C  V  N  L  T  T 
QHR63300.2   ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTTTCTAGTCAGTGTGTTAATCTAACAACT
             ||||||||||||||||||||||||||||||||.||||||||||||||||||||.|||||. 60
YP_009724390 ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACC
              M  F  V  F  L  V  L  L  P  L  V  S  S  Q  C  V  N  L  T  T 

Reading frame will follow the slice!
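
One way to read this note is that translation starts at the first base of the sliced region, so a slice that does not begin on a codon boundary produces a different protein. The sketch below illustrates that effect under this assumption, using the standard genetic code; `translate` here is a hypothetical helper, not the function bio uses.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AMINO))


def translate(dna):
    """Translate from the first base; trailing bases that do not fill a codon are ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, usable, 3))


# First 60 bases of the S gene as printed above (YP_009724390).
ncov_s = "ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACC"

print(translate(ncov_s))       # MFVFLVLLPLVSSQCVNLTT
print(translate(ncov_s[1:]))   # slice shifted by one base: a different frame
```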

DNA alignment with 3 letter amino acid codes

bio align ratg13:S ncov:S  --end 60  -3

# Ident=57(95.0%)  Mis=3(5.0%)  Gaps=0(0.0%)  Target=(1, 60)  Query=(1, 60)  Length=60  Score=273.0  NUC.4.4(11,1)

             MetPheValPheLeuValLeuLeuProLeuValSerSerGlnCysValAsnLeuThrThr
QHR63300.2   ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTTTCTAGTCAGTGTGTTAATCTAACAACT
             ||||||||||||||||||||||||||||||||.||||||||||||||||||||.|||||. 60
YP_009724390 ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACC
             MetPheValPheLeuValLeuLeuProLeuValSerSerGlnCysValAsnLeuThrThr

Reading frame will follow the slice!

DNA alignment, tabular output

bio align ncov:S ratg13:S --end 90  --table
query       target          pident  ident  mism  gaps  score  alen  tlen  tstart  tend  qlen  qstart  qend
QHR63300.2  YP_009724390.1  92.2    83     7     0     387.0  90    90    1       90    90    1       90
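
The tabular output is convenient for downstream scripts. The sketch below parses the two lines shown above, assuming the fields are separated by whitespace and that identifiers contain no spaces; the variable names are illustrative.

```python
header = "query target pident ident mism gaps score alen tlen tstart tend qlen qstart qend"
row = "QHR63300.2 YP_009724390.1 92.2 83 7 0 387.0 90 90 1 90 90 1 90"

# Pair each column name with its value.
record = dict(zip(header.split(), row.split()))

pident = float(record["pident"])
# Percent identity recomputed from the identity count and the alignment length.
recomputed = round(100 * int(record["ident"]) / int(record["alen"]), 1)

print(record["query"], record["target"], pident, recomputed)
```

Here the recomputed value agrees with the reported pident column, which suggests that pident is ident divided by alen.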

Align the translated regions

bio align ncov:S ratg13:S --end 90 --translate  

# Ident=30(100.0%)  Mis=0(0.0%)  Gaps=0(0.0%)  Target=(1, 30)  Query=(1, 30)  Length=30  Score=153.0  BLOSUM62(11,1)

YP_009724390 MFVFLVLLPLVSSQCVNLTTRTQLPPAYTN
             |||||||||||||||||||||||||||||| 30
QHR63300.2   MFVFLVLLPLVSSQCVNLTTRTQLPPAYTN

Align the protein corresponding to gene S

The protein sequence is fetched from the data (if it exists) and is not translated from the DNA.

bio align ncov:S ratg13:S --end 30 --protein  

# Ident=30(100.0%)  Mis=0(0.0%)  Gaps=0(0.0%)  Target=(1, 30)  Query=(1, 30)  Length=30  Score=153.0  BLOSUM62(11,1)

YP_009724390 MFVFLVLLPLVSSQCVNLTTRTQLPPAYTN
             |||||||||||||||||||||||||||||| 30
QHR63300.2   MFVFLVLLPLVSSQCVNLTTRTQLPPAYTN

The slice now applies to the protein sequence.

Default alignment is global

With the default global alignment, end gaps have no penalty.

bio align THISLINE ISALIGNED  -i 

# Ident=4(36.4%)  Mis=2(18.2%)  Gaps=5(45.5%)  Target=(3, 8)  Query=(1, 8)  Length=11  Score=8.0  BLOSUM62(11,1)

TARGET       THISLI--NE-
             --||..--||- 11
QUERY        --ISALIGNED

There is a strict mode that applies end gap penalties.

Tabular output

Any alignment may be formatted as tabular output.

bio align THISLINE ISALIGNED  -i --table
query  target  pident  ident  mism  gaps  score  alen  tlen  tstart  tend  qlen  qstart  qend
QUERY  TARGET  36.4    4      2     5     8.0    11    8     3       8     9     1       8

Local alignment

The local mode will produce all local alignments.

bio align THISLINE ISALIGNED -i --local

# Ident=2(100.0%)  Mis=0(0.0%)  Gaps=0(0.0%)  Target=(7, 8)  Query=(7, 8)  Length=2  Score=11.0  BLOSUM62(11,1)

TARGET       NE
             || 2
QUERY        NE

Global alignment

bio align THISLINE ISALIGNED -i --global

# Ident=4(36.4%)  Mis=2(18.2%)  Gaps=5(45.5%)  Target=(3, 8)  Query=(1, 8)  Length=11  Score=8.0  BLOSUM62(11,1)

TARGET       THISLI--NE-
             --||..--||- 11
QUERY        --ISALIGNED

Semiglobal alignment

Same as the global alignment with zero end gap penalties, but reports only the aligned region:

bio align THISLINE ISALIGNED -i --semiglobal

# Ident=4(50.0%)  Mis=2(25.0%)  Gaps=2(25.0%)  Target=(3, 8)  Query=(1, 8)  Length=8  Score=8.0  BLOSUM62(11,1)

TARGET       ISLI--NE
             ||..--|| 8
QUERY        ISALIGNE
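
The relationship between the two outputs can be sketched by trimming the end gap columns from the default global alignment. This only illustrates the relationship; it is not how bio computes the semiglobal result, and `trim_end_gaps` is a hypothetical helper.

```python
def trim_end_gaps(target, query):
    """Drop leading and trailing columns in which either row has a gap."""
    start, end = 0, len(target)
    while start < end and "-" in (target[start], query[start]):
        start += 1
    while end > start and "-" in (target[end - 1], query[end - 1]):
        end -= 1
    return target[start:end], query[start:end]


# Rows of the default global alignment of THISLINE and ISALIGNED.
trimmed = trim_end_gaps("THISLI--NE-", "--ISALIGNED")
print(trimmed)  # ('ISLI--NE', 'ISALIGNE')
```

Internal gaps are kept; only the unaligned ends are removed, which is why Length drops from 11 to 8 while the score stays the same.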

Strict global alignment

Applies end gap penalties.

bio align THISLINE ISALIGNED -i --global --strict

# Ident=2(22.2%)  Mis=6(66.7%)  Gaps=1(11.1%)  Target=(1, 8)  Query=(1, 8)  Length=9  Score=-7.0  BLOSUM62(11,1)

TARGET       THISLINE-
             ......||- 9
QUERY        ISALIGNED
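
The scores reported for the default and the strict global alignments can be reproduced by hand. The sketch below assumes that (11,1) in the header means a gap open penalty of 11 and a gap extend penalty of 1, so that a gap of length k costs 11 + (k - 1), and it includes only the standard BLOSUM62 entries needed for these two alignments. `score_alignment` is a hypothetical helper, not part of bio.

```python
# Subset of BLOSUM62 (keys are alphabetically sorted residue pairs).
BLOSUM62_SUBSET = {
    ("I", "I"): 4, ("S", "S"): 4, ("N", "N"): 6, ("E", "E"): 5,
    ("A", "L"): -1, ("I", "L"): 2, ("I", "T"): -1, ("H", "S"): -1,
    ("A", "I"): -1, ("L", "S"): -2, ("G", "I"): -4,
}


def score_alignment(target, query, gap_open=11, gap_extend=1, free_end_gaps=True):
    """Score two aligned rows; end gaps cost nothing when free_end_gaps is True."""
    columns = list(zip(target, query))
    # Columns before the first and after the last residue pair hold end gaps.
    paired = [i for i, (t, q) in enumerate(columns) if "-" not in (t, q)]
    first, last = paired[0], paired[-1]
    total = 0
    prev_gap_row = None  # row that carried a gap in the previous column
    for i, (t, q) in enumerate(columns):
        if "-" in (t, q):
            gap_row = "target" if t == "-" else "query"
            is_end_gap = i < first or i > last
            if not (is_end_gap and free_end_gaps):
                total -= gap_extend if gap_row == prev_gap_row else gap_open
            prev_gap_row = gap_row
        else:
            total += BLOSUM62_SUBSET[tuple(sorted((t, q)))]
            prev_gap_row = None
    return total


print(score_alignment("THISLI--NE-", "--ISALIGNED"))                     # 8
print(score_alignment("THISLINE-", "ISALIGNED", free_end_gaps=False))    # -7
```

Under these assumptions the results match the Score=8.0 and Score=-7.0 values shown in the outputs above.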