The bio package
If you’ve ever done bioinformatics, you know that even seemingly straightforward tasks may require multiple steps, reading documentation, and additional preparation that slows down progress.
Time and again, I found myself not pursuing an idea because getting to the fun part was too tedious. The bio package was designed to solve that tedium by making bioinformatics explorations more enjoyable. The software lets users quickly answer questions such as:
- How do I access a sequence for a viral genome?
- How do I obtain the biological annotation of data?
- How do I get the coding sequence for a specific gene?
- What is the lineage of SARS-CoV-2?
- What are minisatellites and microsatellites?
The software is also used to demonstrate and teach bioinformatics and is the companion software to the Biostar Handbook.
pip install bio --upgrade
First we download the data so that bio can operate on it. This step needs to be done only once:
bio fetch NC_045512 MN996532 > genomes.gb
Bioinformatics workflows often require you to present data in different formats.
Convert GenBank to FASTA:
bio convert genomes.gb --fasta | head
>NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT
GTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACT
CACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATC
TTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTT
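If you want to work with the FASTA output in a script, a small reader is often enough. The following is a minimal sketch in Python, not part of bio itself: the function name `read_fasta` is hypothetical, and the sample input is the header plus the first two sequence lines of the output above.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect sequences keyed by the first word of each header line."""
    parts: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the text between '>' and the first space.
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {key: "".join(chunks) for key, chunks in parts.items()}


# Sample input copied from the output above (header and two sequence lines).
sample = [
    ">NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome",
    "ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT",
    "GTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACT",
]

sequences = read_fasta(sample)
for name, seq in sequences.items():
    print(name, len(seq))
```

To run it on the real data, you could pipe the converted output into a script and pass `sys.stdin` instead of `sample`.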
Convert to GFF format:
bio convert genomes.gb --gff | head
##gff-version 3
NC_045512.2 . source 0 29903 . + . ID=1;Name=NC_045512;Parent=NC_045512.2
NC_045512.2 . five_prime_UTR 0 265 . + . ID=2;Name=five_prime_UTR-1;Parent=five_prime_UTR-1;color=#cc0e74
NC_045512.2 . gene 265 21555 . + . ID=3;Name=ORF1ab;Parent=ORF1ab;color=#cb7a77
NC_045512.2 . CDS 265 13468 . + . ID=4;Name=YP_009724389.1;Parent=YP_009724389.1
NC_045512.2 . CDS 13467 21555 . + . ID=5;Name=YP_009724389.1;Parent=YP_009724389.1
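Each GFF feature line has nine columns, with the last one holding `key=value` attributes separated by semicolons. The sketch below shows one way to turn such lines into Python dictionaries. It is an illustration only: `parse_gff` is a hypothetical helper, not a bio function. GFF3 files are normally tab-separated; the code splits on any whitespace so that it also accepts the lines as they are displayed above.

```python
from typing import Dict, Iterable, Iterator


def parse_gff(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dictionary per feature line, skipping comments and blanks."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split at most eight times so the attribute column stays in one piece.
        seqid, source, ftype, start, end, score, strand, phase, attrs = line.split(None, 8)
        attributes = {}
        for pair in attrs.split(";"):
            key, _, value = pair.partition("=")
            if key:
                attributes[key] = value
        yield {
            "seqid": seqid,
            "type": ftype,
            "start": int(start),
            "end": int(end),
            "strand": strand,
            "attributes": attributes,
        }


# Sample input copied from the output above.
sample = [
    "##gff-version 3",
    "NC_045512.2 . source 0 29903 . + . ID=1;Name=NC_045512;Parent=NC_045512.2",
    "NC_045512.2 . gene 265 21555 . + . ID=3;Name=ORF1ab;Parent=ORF1ab;color=#cb7a77",
    "NC_045512.2 . CDS 265 13468 . + . ID=4;Name=YP_009724389.1;Parent=YP_009724389.1",
]

features = list(parse_gff(sample))
for feature in features:
    print(feature["type"], feature["start"], feature["end"], feature["attributes"]["Name"])
```

The coordinates are passed through unchanged; the sketch makes no assumption about whether they are zero-based or one-based.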
View the resulting files in IGV
Among the many useful features, bio is also able to generate informative gene models from a GenBank file.
Getting sample metadata
Get sample metadata for the viral genomes by taxid:
bio meta 2697049 | head
accession species host date location isolate species_name
NC_045512.2 2697049 9606 2019-12 Asia; China Wuhan-Hu-1 Severe acute respiratory syndrome coronavirus 2
MT576563.1 2697049 North America; SARS-CoV-2/human/USA/USA-WA1/2020 Severe acute respiratory syndrome coronavirus 2
MT324684.1 2697049 2020-03-25 North America; USA SARS-CoV-2/ENV/USA/UF-3/2020 Severe acute respiratory syndrome coronavirus 2
MT476384.1 2697049 2020-02-21 North America; USA: FL SARS-CoV-2/ENV/USA/UF-11/2020 Severe acute respiratory syndrome coronavirus 2
MT952602.1 2697049 Severe acute respiratory syndrome coronavirus 2
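Some records lack a host, date, or location, so it helps to load the metadata by column name rather than by position. The sketch below uses Python's standard `csv` module and assumes the output is tab-separated with empty fields where a value is missing; that delimiter is an assumption, so check it against your own output. The helper `read_meta` is hypothetical, and the two sample rows follow my reading of the columns shown above.

```python
import csv
import io
from typing import Dict, List, TextIO


def read_meta(handle: TextIO) -> List[Dict[str, str]]:
    """Read tab-separated metadata into a list of dictionaries, one per row."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Build a small tab-separated sample from two rows of the output above.
# Assumption: fields are tab-delimited and a missing host is an empty field.
header = ["accession", "species", "host", "date", "location", "isolate", "species_name"]
rows = [
    ["NC_045512.2", "2697049", "9606", "2019-12", "Asia; China", "Wuhan-Hu-1",
     "Severe acute respiratory syndrome coronavirus 2"],
    ["MT324684.1", "2697049", "", "2020-03-25", "North America; USA",
     "SARS-CoV-2/ENV/USA/UF-3/2020", "Severe acute respiratory syndrome coronavirus 2"],
]
sample = "\n".join("\t".join(row) for row in [header] + rows)

records = read_meta(io.StringIO(sample))

# Keep only the records that report a host taxid.
with_host = [rec["accession"] for rec in records if rec["host"]]
print(with_host)
```

With real data, you could pass `sys.stdin` to `read_meta` and pipe the command output into the script.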
Where to go next
Look at the sidebar for detailed documentation on how