bio fasta: convert to FASTA

Install bio with:

pip install bio --upgrade

The full documentation for bio is maintained at


GenBank/EMBL files represent sequence information in multiple sections:

  1. Genomic sequences (the entire genomic sequence)
  2. Feature annotation (intervals relative to the genome)

bio fasta operates on GenBank/EMBL files and can filter and extract various subsets of the data.

Get a GenBank file

bio fetch NC_045512 MN996532 >

Convert to FASTA

The default behavior is to convert the genome in the GenBank file to FASTA:

bio fasta > genomes.fa

To convert the features instead, pass the --features flag or use any of --type, --gene, or other feature-specific selectors:

bio fasta --features > features.fa


The input may be GENBANK, FASTA, EMBL or FASTQ.
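As a rough illustration of how these four formats can be told apart, the sketch below guesses the format from the first non-empty line of the input. The helper name guess_format is hypothetical, and this is not necessarily how bio detects its input; it relies only on the conventional opening characters of each format.

```python
def guess_format(text):
    """Guess a sequence file format from its first non-empty line.

    Hypothetical helper for illustration; bio may detect formats differently.
    """
    for line in text.splitlines():
        if not line.strip():
            continue  # skip leading blank lines
        if line.startswith(">"):
            return "fasta"    # FASTA records open with ">"
        if line.startswith("@"):
            return "fastq"    # FASTQ records open with "@"
        if line.startswith("LOCUS"):
            return "genbank"  # GenBank records open with a LOCUS line
        if line.startswith("ID "):
            return "embl"     # EMBL records open with an ID line
        return "unknown"
    return "unknown"


print(guess_format(">seq1\nACGT\n"))
```

Only the first non-empty line is inspected, so the check stays cheap even for large inputs.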

What gets converted?

GenBank and EMBL files contain both genomes and features. All features are extracted:

cat | bio fasta > genomes.fa

Pass any feature matcher to limit the output to certain types:

bio fasta --type CDS -e 10 | head


>YP_009724389.1 {"type": "CDS", "gene": "ORF1ab", "product": "ORF1ab polyprotein", "locus": "GU280_gp01"}
>YP_009725295.1 {"type": "CDS", "gene": "ORF1ab", "product": "ORF1a polyprotein", "locus": "GU280_gp01"}
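The headers above carry the feature metadata as a JSON object after the sequence id, which makes them easy to process downstream. The following is a minimal sketch of reading such a header in Python; parse_header is a hypothetical helper, and it assumes the description is either empty or valid JSON, as in the output shown here.

```python
import json


def parse_header(line):
    """Split a FASTA header line into (sequence id, metadata dict)."""
    # Drop the leading ">" and separate the id from the rest of the line.
    seq_id, _, description = line[1:].strip().partition(" ")
    # Assumption: the description, when present, is a JSON object.
    meta = json.loads(description) if description else {}
    return seq_id, meta


header = '>YP_009724389.1 {"type": "CDS", "gene": "ORF1ab", "product": "ORF1ab polyprotein", "locus": "GU280_gp01"}'
seq_id, meta = parse_header(header)
print(seq_id, meta["type"], meta["gene"])
```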

Select by name

-m or --match performs a regular expression match on sequence ids:

cat | bio fasta -m glyco --end 10


>YP_009724390.1 {"type": "CDS", "gene": "S", "product": "surface glycoprotein", "locus": "GU280_gp02"}
>YP_009724393.1 {"type": "CDS", "gene": "M", "product": "membrane glycoprotein", "locus": "GU280_gp05"}

-i or --id performs an exact match on sequence ids:

cat | bio fasta -i YP_009724390.1 --end 10


>YP_009724390.1 {"type": "CDS", "gene": "S", "product": "surface glycoprotein", "locus": "GU280_gp02"}

Pass multiple ids, separated by commas, to match multiple sequences:

cat | bio fasta -i YP_009724390.1,QHR63300.2 --end 10


>YP_009724390.1 {"type": "CDS", "gene": "S", "product": "surface glycoprotein", "locus": "GU280_gp02"}
>QHR63300.2 {"type": "CDS", "gene": "S", "product": "spike glycoprotein", "locus": ""}
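To make the difference between the two selectors concrete, here is a small Python sketch that applies both kinds of matching to header lines taken from this page. It is an illustration only, not bio's implementation. The function names are hypothetical, and because the glyco example above matches text in the product field, the regular expression is assumed to be searched against the whole header line.

```python
import re

headers = [
    '>YP_009724389.1 {"type": "CDS", "gene": "ORF1ab", "product": "ORF1ab polyprotein", "locus": "GU280_gp01"}',
    '>YP_009724390.1 {"type": "CDS", "gene": "S", "product": "surface glycoprotein", "locus": "GU280_gp02"}',
    '>YP_009724393.1 {"type": "CDS", "gene": "M", "product": "membrane glycoprotein", "locus": "GU280_gp05"}',
    '>QHR63300.2 {"type": "CDS", "gene": "S", "product": "spike glycoprotein", "locus": ""}',
]


def select_by_regex(lines, pattern):
    """Keep headers where the pattern is found anywhere in the line (assumption)."""
    rx = re.compile(pattern)
    return [h for h in lines if rx.search(h)]


def select_by_ids(lines, ids):
    """Keep headers whose sequence id exactly equals one of the comma-separated ids."""
    wanted = set(ids.split(","))
    return [h for h in lines if h[1:].split(" ", 1)[0] in wanted]


for h in select_by_regex(headers, "glyco"):
    print("regex:", h.split(" ", 1)[0])
for h in select_by_ids(headers, "YP_009724390.1,QHR63300.2"):
    print("exact:", h.split(" ", 1)[0])
```

A partial id such as YP_0097243 would be found by the regular expression search but not by the exact comparison.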

Selecting features

If any feature selector is passed, the FASTA conversion operates on the features in the GenBank file:

bio fasta --type CDS

This converts only the coding sequences to FASTA.

Manipulate a genomic subsequence

bio fasta --start 100 --end 10kb
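The 10kb value in the command above suggests that coordinates accept a unit suffix. The sketch below shows one way such a value could be turned into an integer and used to slice a sequence. The helpers parse_coord and slice_seq are hypothetical, 1kb is assumed to mean 1,000 bases, and start is assumed to be 1-based and inclusive; the conventions bio actually uses may differ.

```python
def parse_coord(value):
    """Convert a coordinate such as 100 or '10kb' to an integer.

    Assumption: the 'kb' suffix multiplies the number by 1,000.
    """
    text = str(value).strip().lower()
    if text.endswith("kb"):
        return int(float(text[:-2]) * 1000)
    return int(text)


def slice_seq(seq, start=1, end=None):
    """Return seq from start to end, treating both as 1-based and inclusive."""
    begin = parse_coord(start) - 1  # shift to Python's 0-based indexing
    stop = parse_coord(end) if end is not None else len(seq)
    return seq[begin:stop]


genome = "ACGT" * 5000  # toy 20,000-base sequence, not real data
region = slice_seq(genome, start=100, end="10kb")
print(len(region))
```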

Extract the sequences for annotations of a certain type

bio fasta --type CDS | head -3

Extract CDS sequences by gene name

bio fasta --gene S --end 60

Extract sequence by feature accession number

cat | bio fasta --end 10 --id QHR63308.1

Translate the sequence

This command translates the DNA sequence to peptides:

bio fasta --end 30 --translate --gene S

The slice to 30 is applied on the DNA sequence before the translation.

Extract the protein sequence

This flag extracts the protein sequence embedded in the original GenBank file:

bio fasta --end 10 --protein --gene S

Note how in this case the slice to 10 is applied on the protein sequence.
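The order of slicing and translation matters because one amino acid corresponds to three bases. The Python sketch below illustrates this with the standard genetic code and a made-up toy sequence (not the S gene). It is not bio's implementation, and note that --protein reads the translation stored in the GenBank record rather than computing one; the translate helper here is hypothetical, drops any incomplete trailing codon, and does not stop at stop codons.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA codon by codon; unknown codons become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore an incomplete trailing codon
    return "".join(CODONS.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


dna = "ATGGCTAAAGGTCTGTTTGAACGTTGGCATACCTAA"  # toy 36-base sequence

print(translate(dna[:30]))  # slice the DNA to 30 bases, then translate
print(translate(dna)[:10])  # translate everything, then slice the protein to 10
print(translate(dna[:10]))  # slicing the DNA to 10 bases leaves only 3 full codons
```

Slicing the DNA to 30 bases and slicing the protein to 10 residues both give 10 amino acids, whereas slicing the DNA to 10 bases gives only 3.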

Other tools

My first choice when needing functionality not present in bio fasta would be to look at:


  • seqtk: a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files, which can also be optionally compressed by gzip.

Other potentially useful software

The following software may be installed with conda/mamba: