The bio package
Welcome to bio
bio
- command-line utilities to make bioinformatics explorations more enjoyable.
bio
streamlines the tedious bioinformatics and lets users quickly answer questions such as:
- How do I access a sequence for a viral genome?
- How do I obtain the biological annotation of data?
- How do I get the coding sequence for a specific gene?
- What are the differences between two sequences?
- What is the lineage of SARS-COV-2?
- What are minisatellites and microsatellites?
bio
combines and represents data from different sources: GenBank, Gene Ontology, Sequence Ontology,
NCBI Taxonomy and Short Read Archive through a unified interface. Having access to all the utility described above makes the bio
package well suited for exploratory analysis of genomes.
The software was written to teach bioinformatics and is the companion software to the Biostar Handbook
Quick links
- Source code: https://github.com/ialbert/bio
- Documentation: https://www.bioinfo.help
Why does this software exist?
If you’ve ever done bioinformatics you know how even seemingly straightforward tasks require multiple steps, arcane incantations, reading documentation, and numerous other preparations that slow down your progress.
Time and again, I found myself not pursuing an idea because getting to the fun part was too tedious. The bio
package is meant to solve that tedium.
Diving right in
Here is how to align the sequences of SARS-COV-2 (NC_045512
) versus the same region of a bat coronavirus (MN996532
). First get the data:
bio fetch NC_045512 MN996532
Now align the sequences (showing 60bp for brevity).
bio align NC_045512 MN996532 --end 60
# Ident=57(95.0%) Mis=3(5.0%) Gaps=0(0.0%) Target=(1, 60) Query=(1, 60) Length=60 Score=273.0 NUC.4.4(11,1)
NC_045512.2 ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT
||||||||||||||||||.|||||||||||||||||.||||.|||||||||||||||||| 60
MN996532.2 ATTAAAGGTTTATACCTTTCCAGGTAACAAACCAACGAACTCTCGATCTCTTGTAGATCT
A more realistic workflow
Suppose you wanted to identify the mutations between the S
protein of the bat coronavirus deposited as MN996532
and the S
protein of the ancestral SARS-COV-2 virus designated by the NCBI via accession number NC_045512
.
If you are a trained bioinformatician, think about all the steps you would need to perform to accomplish this task, then think about the effort it would take you to teach someone else how to do the same.
With the bio
package, the process takes simple, concise steps.
Download and rename
First, we download and rename the data keep our sanity:
bio fetch NC_045512 --rename ncov
bio fetch MN996532 --rename ratg13
From now on, bio
can operate on NC_045512
using the name ncov
and on MN996532
using the name ratg13
no matter where you are on your computer!
Convert to different formats
bio
stores data in an internal storage system that it can find from any location. There is no clutter of files or paths to remember. For example, in any directory, you now can type:
bio convert ncov --fasta --end 100 | head -2
>NC_045512.2 [1:100]
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT
and it will show you the FASTA representation of the genome
You could also convert the data stored under ncov
name to other formats. Let’s convert features with type CDS
to GFF
:
bio convert ncov --gff --type CDS | head -5
##gff-version 3
NC_045512.2 . CDS 266 13468 . + . ID=CDS-1;Parent=YP_009724389.1;Name=YP_009724389.1;gene=ORF1ab;locus_tag=GU280_gp01;ribosomal_slippage=;note=pp1ab; translated by -1 ribosomal frameshift;codon_start=1;product=ORF1ab polyprotein;protein_id=YP_009724389.1;db_xref=GeneID:43740578;gene_id=GU280_gp01;transcript_id=YP_009724389.1
NC_045512.2 . CDS 13468 21555 . + . ID=CDS-2;Parent=YP_009724389.1;Name=YP_009724389.1;gene=ORF1ab;locus_tag=GU280_gp01;ribosomal_slippage=;note=pp1ab; translated by -1 ribosomal frameshift;codon_start=1;product=ORF1ab polyprotein;protein_id=YP_009724389.1;db_xref=GeneID:43740578;gene_id=GU280_gp01;transcript_id=YP_009724389.1
NC_045512.2 . CDS 266 13483 . + . ID=CDS-3;Parent=YP_009725295.1;Name=YP_009725295.1;gene=ORF1ab;locus_tag=GU280_gp01;note=pp1a;codon_start=1;product=ORF1a polyprotein;protein_id=YP_009725295.1;db_xref=GeneID:43740578;gene_id=GU280_gp01;transcript_id=YP_009725295.1
NC_045512.2 . CDS 21563 25384 . + . ID=CDS-4;Parent=YP_009724390.1;Name=YP_009724390.1;gene=S;locus_tag=GU280_gp02;gene_synonym=spike glycoprotein;note=structural protein; spike protein;codon_start=1;product=surface glycoprotein;protein_id=YP_009724390.1;db_xref=GeneID:43740568;gene_id=GU280_gp02;transcript_id=YP_009724390.1
Align nucleotides or peptides
Now, back to our problem of aligning proteins. Let’s align the first 90 base pairs of DNA sequences for the S
protein for each organism, bio
even gives you a shortcut; instead of typing --gene S --type CDS
you can write it as ncov:S
:
bio align ncov:S ratg13:S --end 60
# Ident=57(95.0%) Mis=3(5.0%) Gaps=0(0.0%) Target=(1, 60) Query=(1, 60) Length=60 Score=273.0 NUC.4.4(11,1)
YP_009724390 ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACC
||||||||||||||||||||||||||||||||.||||||||||||||||||||.|||||. 60
QHR63300.2 ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTTTCTAGTCAGTGTGTTAATCTAACAACT
We can visualize the translation of the DNA into aminoacids with one letter (-1
) or three-letter codes (-3
):
bio align ncov:S ratg13:S --end 60 -1
# Ident=57(95.0%) Mis=3(5.0%) Gaps=0(0.0%) Target=(1, 60) Query=(1, 60) Length=60 Score=273.0 NUC.4.4(11,1)
M F V F L V L L P L V S S Q C V N L T T
YP_009724390 ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACC
||||||||||||||||||||||||||||||||.||||||||||||||||||||.|||||. 60
QHR63300.2 ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTTTCTAGTCAGTGTGTTAATCTAACAACT
M F V F L V L L P L V S S Q C V N L T T
If, instead, we wanted to align the 60bp DNA subsequences for S
protein after their translation into proteins, we could do it like so:
bio align ncov:S ratg13:S --translate --end 60
# Ident=20(100.0%) Mis=0(0.0%) Gaps=0(0.0%) Target=(1, 20) Query=(1, 20) Length=20 Score=98.0 BLOSUM62(11,1)
YP_009724390 MFVFLVLLPLVSSQCVNLTT
|||||||||||||||||||| 20
QHR63300.2 MFVFLVLLPLVSSQCVNLTT
We can note right away that all differences in the first 60bp of DNA are synonymous substitutions, the protein translations are the same.
Look up the taxonomy
bio
understands taxonomies. Finding the lineage of the organism is as simple as:
bio taxon ncov --lineage
superkingdom, Viruses, 10239
clade, Riboviria, 2559587
kingdom, Orthornavirae, 2732396
phylum, Pisuviricota, 2732408
class, Pisoniviricetes, 2732506
order, Nidovirales, 76804
suborder, Cornidovirineae, 2499399
family, Coronaviridae, 11118
subfamily, Orthocoronavirinae, 2501931
genus, Betacoronavirus, 694002
subgenus, Sarbecovirus, 2509511
species, Severe acute respiratory syndrome-related coronavirus, 694009
no rank, Severe acute respiratory syndrome coronavirus 2, 2697049
See the bioproject
bio
knows about bio projects and sequencing data.
As it turns out the data for ncov
data is not adequately cross-referenced at NCBI … thus, we can’t quite get the SRR run numbers automatically, even at NCBI.
Let’s pick another data that has better cross-references, perhaps a virus from the 2014 Ebola outbreak:
bio fetch KM233118 --rename ebola
and now print:
bio runinfo ebola
ebola BioProject PRJNA257197
ebola BioSample SAMN02952049
if we wanted the SRR run numbers, we could run:
bio runinfo ebola --sample
to get:
[
{
"Run": "SRR1553609",
"ReleaseDate": "2014-08-19 11:41:53",
"LoadDate": "2014-08-19 11:18:49",
"spots": "464802",
"bases": "93890004",
"spots_with_mates": "464802",
"avgLength": "202",
"size_MB": "51",
"download_path": "https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1553609/SRR1553609.1",
"Experiment": "SRX674271",
"LibraryName": "NM042.3.FCH9",
"LibraryStrategy": "RNA-Seq",
"LibrarySelection": "cDNA",
"LibrarySource": "TRANSCRIPTOMIC",
"LibraryLayout": "PAIRED",
...
bio
is a data model
Beyond the functionality that we show, bio
is also an exploration into modeling biological data. The current standards and practices are woefully antiquated and painfully inadequate. Default formats such as GenBank or EMBL are inefficient and tedious to program with.
In contrast bio
represents data in simple, efficient, compressed in JSON format.
bio convert ncov --json | head -20
[
{
"id": "NC_045512.2",
"definition": "Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome",
"dblink": [
"BioProject:PRJNA485481"
],
"locus": "NC_045512",
"feature_count": 57,
"origin_len": 29903,
"molecule_type": "ss-RNA",
"topology": "linear",
"data_file_division": "VRL",
"date": "18-JUL-2020",
"accessions": [
"NC_045512"
],
"sequence_version": 2,
"keywords": [
"RefSeq"
The data layout allows bio
to read in the entire human chromosome 1, with its 253 million characters and 328 thousand genomic features, in just three(!) seconds. In another 3 seconds, bio
can convert that information FASTA or GFF; it can filter it by type, translate the sequence, extract proteins, slice by coordinate, etc.:
time bio convert chr1 --fasta | wc -c
253105766
real 0m6.238s
user 0m4.156s
sys 0m2.172s
For shorter genomes, bacterial or viral, the conversion times are under a fraction of a second.
Thanks to the representation, it is trivially easy to extend bio
. The data is already structured in an efficient layout that needs no additional parsing to load.