Convert to FASTA

A GenBank file represents sequence information in multiple ways:

  1. Genomic sequences (the entire genomic sequence)
  2. Feature annotation (intervals relative to the genome)

In bio, use:

  • --fasta to access the genome
  • --fasta --features to access the features annotated on the genome

The --features flag is often unnecessary, as bio sets it automatically when the command clearly targets features. For example, --type CDS turns on feature rendering mode.

Shortcuts

A colon-delimited term:

bio convert foo:bar --fasta

is equivalent to:

bio convert foo --type CDS --gene bar --fasta
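
The expansion can be pictured as a small string rewrite. The Python sketch below only illustrates the equivalence stated above: expand_shortcut is a hypothetical helper, not part of bio, and it assumes the term contains at most one colon.

```python
def expand_shortcut(term):
    """Expand 'name:gene' into the longer flag form shown above.

    Hypothetical helper for illustration; bio's own parsing may differ.
    """
    name, sep, gene = term.partition(":")
    if not sep:
        # No colon present, so there is nothing to expand.
        return [name]
    return [name, "--type", "CDS", "--gene", gene]


# Rebuild the long form of the command from the shortcut.
print(" ".join(["bio", "convert"] + expand_shortcut("foo:bar") + ["--fasta"]))
```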

Get a dataset

Get SARS-CoV-2 data and rename it to ncov:

bio fetch NC_045512 --rename ncov

Get the sequence for the genome

bio convert ncov --fasta | head -3
>NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT
GTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACT

Manipulate a genomic subsequence

bio convert ncov --fasta --start 100 --end 130 --seqid foo 
>foo [100:130]
CGGCTGCATGCTTAGTGCACTCACGCAGTAT
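
The record above holds 31 bases, which suggests that --start and --end are 1-based and inclusive. Here is a minimal Python sketch of that convention, checked against the first 120 bases of the genome printed earlier. The name slice_inclusive is illustrative and is not a bio function.

```python
# First 120 bases of NC_045512.2, copied from the FASTA output earlier.
GENOME_PREFIX = (
    "ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT"
    "GTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACT"
)


def slice_inclusive(seq, start, end):
    """Return seq[start..end] using 1-based, inclusive coordinates."""
    return seq[start - 1:end]


# Bases 100 to 120 match the beginning of the [100:130] record.
print(slice_inclusive(GENOME_PREFIX, 100, 120))  # CGGCTGCATGCTTAGTGCACT
```

Under this convention a slice from 100 to 130 contains 130 - 100 + 1 = 31 bases, which matches the length of the sequence shown.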

Extract the sequences for annotations of a certain type

bio convert ncov --fasta --type CDS | head -3
>YP_009724389.1 ID=YP_009724389.1;Name=YP_009724389.1;gene=ORF1ab;locus_tag=GU280_gp01;ribosomal_slippage=;note=pp1ab; translated by -1 ribosomal frameshift;codon_start=1;product=ORF1ab polyprotein;protein_id=YP_009724389.1;db_xref=GeneID:43740578
ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTT
TTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCA

Extract CDS sequences by gene name

bio convert ncov --gene S --fasta --end 60 
>NC_045512.2 [1:60]
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT

The same gene can be selected with the shortcut notation:

bio convert ncov:S --fasta --start 100 --end 150 
>YP_009724390.1 CDS [100:150]
CGTGGTGTTTATTACCCTGACAAAGTTTTCAGATCCTCAGTTTTACATTCA

Extract sequence by feature accession number

bio convert ncov --id YP_009724390.1 --fasta --start 100 --end 150
>YP_009724390.1 CDS [100:150]
CGTGGTGTTTATTACCCTGACAAAGTTTTCAGATCCTCAGTTTTACATTCA

Translate the sequence

This command translates the DNA sequence to peptides:

bio convert ncov:S --fasta --end 180 --translate
>YP_009724390.1 CDS [1:180] translated
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS

The slice to 180 is applied to the DNA sequence before translation, so the 180 bases yield 60 amino acids.
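
To make the order of operations concrete, here is a minimal Python sketch that translates the 51-base slice [100:150] of the S CDS shown earlier. It assumes the standard genetic code and uses a hypothetical translate helper; it is not bio's implementation. Because position 100 is the first base of codon 34, the result lines up with residues 34 to 50 of the peptide above.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA codon by codon; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# S CDS, positions 100 to 150, copied from the output earlier.
fragment = "CGTGGTGTTTATTACCCTGACAAAGTTTTCAGATCCTCAGTTTTACATTCA"
# First 60 residues of the translated S CDS, copied from the output above.
peptide = "MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS"

print(translate(fragment))                    # RGVYYPDKVFRSSVLHS
print(translate(fragment) == peptide[33:50])  # True
```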

Extract the protein sequence

This flag extracts the protein sequence embedded in the original GenBank file:

bio convert ncov:S --fasta --end 60 --protein
>YP_009724390.1 [1:60]
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS

Note that in this case the slice to 60 is applied to the protein sequence.