Welcome to bio

If you’ve ever done bioinformatics, you know that even seemingly straightforward tasks can require multiple steps, reading documentation, and additional preparation, all of which slow down progress.

Time and again, I found myself not pursuing an idea because getting to the fun part was too tedious. The bio package was designed to remove that tedium and make bioinformatics explorations more enjoyable. The software lets users quickly answer questions such as:

  • How do I access a sequence for a viral genome?
  • How do I obtain the biological annotation of data?
  • How do I get the coding sequence for a specific gene?
  • What is the lineage of SARS-CoV-2?
  • What are minisatellites and microsatellites?

bio combines data from different sources (GenBank, Gene Ontology, Sequence Ontology, NCBI Taxonomy) and provides a unified, logical interface.

The software is also used to demonstrate and teach bioinformatics and is the companion software to the Biostar Handbook.

Quickstart

Install bio:

pip install bio --upgrade

Obtain data

First, we download the data so that bio can operate on it. This step needs to be done only once:

bio fetch NC_045512 MN996532 > genomes.gb

Bioinformatics workflows often require you to present data in different formats.

Convert GenBank to FASTA:

bio convert genomes.gb  --fasta | head

prints:

>NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT
GTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACT
CACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATC
TTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTT
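
If you want to post-process FASTA output like the above in a script, a minimal sketch in plain Python might look like the following. The read_fasta helper is hypothetical and is not part of bio; it only assumes that the FASTA text has been saved to a file or captured as lines of text. The sample is the header and the first two sequence lines from the output above.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect sequences keyed by the first word of each header line.

    Hypothetical helper for illustration; not part of the bio package.
    """
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the identifier is the first word after ">".
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            # Sequence line: append to the current record.
            parts[name].append(line)
    return {key: "".join(chunks) for key, chunks in parts.items()}


# Sample copied from the output above (truncated to two sequence lines).
fasta_sample = """>NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT
GTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACT
"""

fasta_records = read_fasta(fasta_sample.splitlines())
for name, seq in fasta_records.items():
    print(name, len(seq))
```

To run this on real data, redirect the converted output to a file first and pass the open file object to read_fasta instead of the sample string.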

Convert to GFF format:

bio convert genomes.gb  --gff | head

prints:

##gff-version 3
NC_045512.2 .   source  0   29903   .   +   .   ID=1;Name=NC_045512;Parent=NC_045512.2
NC_045512.2 .   five_prime_UTR  0   265 .   +   .   ID=2;Name=five_prime_UTR-1;Parent=five_prime_UTR-1;color=#cc0e74
NC_045512.2 .   gene    265 21555   .   +   .   ID=3;Name=ORF1ab;Parent=ORF1ab;color=#cb7a77
NC_045512.2 .   CDS 265 13468   .   +   .   ID=4;Name=YP_009724389.1;Parent=YP_009724389.1
NC_045512.2 .   CDS 13467   21555   .   +   .   ID=5;Name=YP_009724389.1;Parent=YP_009724389.1
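
GFF output like the above is convenient to filter by feature type. The sketch below is one possible way to do that in plain Python, assuming the columns are tab-separated as the GFF3 format specifies. The Feature type and parse_gff_line function are hypothetical names introduced for illustration and are not part of bio. The two sample rows reuse values from the output above, with coordinates kept exactly as printed.

```python
from typing import Dict, NamedTuple


class Feature(NamedTuple):
    """A small subset of the nine GFF columns (hypothetical type)."""
    seqid: str
    kind: str
    start: int
    end: int
    strand: str
    attrs: Dict[str, str]


def parse_gff_line(line: str) -> Feature:
    """Split one tab-separated GFF line into a Feature."""
    fields = line.rstrip("\n").split("\t")
    seqid, _source, kind, start, end, _score, strand, _phase, attrs = fields
    # The attribute column holds key=value pairs separated by semicolons.
    pairs = dict(item.split("=", 1) for item in attrs.split(";") if item)
    return Feature(seqid, kind, int(start), int(end), strand, pairs)


# Two rows from the output above, rebuilt with explicit tab separators.
gff_rows = [
    ["NC_045512.2", ".", "gene", "265", "21555", ".", "+", ".",
     "ID=3;Name=ORF1ab;Parent=ORF1ab;color=#cb7a77"],
    ["NC_045512.2", ".", "CDS", "265", "13468", ".", "+", ".",
     "ID=4;Name=YP_009724389.1;Parent=YP_009724389.1"],
]

features = [parse_gff_line("\t".join(row)) for row in gff_rows]

# Keep only the coding sequence features.
cds_features = [f for f in features if f.kind == "CDS"]
for f in cds_features:
    print(f.attrs["Name"], f.start, f.end)
```

When reading a real file, skip lines that start with "#" (such as the ##gff-version header) before calling parse_gff_line.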

View the resulting files in IGV

Among its many useful features, bio can also generate informative gene models from a GenBank file.

Getting sample metadata

Get sample metadata for the viral genomes (taxid 2697049):

bio meta 2697049  | head

prints:

accession   species host    date    location    isolate species_name
NC_045512.2 2697049 9606    2019-12 Asia; China Wuhan-Hu-1  Severe acute respiratory syndrome coronavirus 2
MT576563.1  2697049         North America;  SARS-CoV-2/human/USA/USA-WA1/2020   Severe acute respiratory syndrome coronavirus 2
MT324684.1  2697049     2020-03-25  North America; USA  SARS-CoV-2/ENV/USA/UF-3/2020    Severe acute respiratory syndrome coronavirus 2
MT476384.1  2697049     2020-02-21  North America; USA: FL  SARS-CoV-2/ENV/USA/UF-11/2020   Severe acute respiratory syndrome coronavirus 2
MT952602.1  2697049                 Severe acute respiratory syndrome coronavirus 2
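
A table like the one above can be loaded with Python's standard csv module. The sketch below assumes the output is tab-delimited with a header row, which is how the columns above appear to be laid out; that is an assumption, not something this page states. The two sample rows are copied from the output above, and the empty host field in the second row is my reading of the aligned columns.

```python
import csv
import io

# Column names as printed in the header row above.
meta_header = ["accession", "species", "host", "date",
               "location", "isolate", "species_name"]

# Two rows from the output above; the empty host value is an assumption.
meta_rows = [
    ["NC_045512.2", "2697049", "9606", "2019-12", "Asia; China",
     "Wuhan-Hu-1", "Severe acute respiratory syndrome coronavirus 2"],
    ["MT324684.1", "2697049", "", "2020-03-25", "North America; USA",
     "SARS-CoV-2/ENV/USA/UF-3/2020",
     "Severe acute respiratory syndrome coronavirus 2"],
]

# Rebuild tab-delimited text, standing in for output saved to a file.
meta_text = "\n".join("\t".join(row) for row in [meta_header] + meta_rows) + "\n"

# DictReader maps each row to a dict keyed by the header names.
meta_records = list(csv.DictReader(io.StringIO(meta_text), delimiter="\t"))

# Select the accessions that have a host taxid recorded.
with_host = [rec["accession"] for rec in meta_records if rec["host"]]
print(with_host)
```

For real data, replace the io.StringIO object with an open file containing the saved metadata.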

Where to go next

Look at the sidebar for detailed documentation on how bio operates.