Welcome to bio

bio - command-line utilities to make bioinformatics explorations more enjoyable.

bio streamlines the tedious bioinformatics and lets users quickly answer questions such as:

  • How do I access a sequence for a viral genome?
  • How do I obtain the biological annotation of data?
  • How do I get the coding sequence for a specific gene?
  • What are the differences between two sequences?
  • What is the lineage of SARS-COV-2?
  • What are minisatellites and microsatellites?

bio combines and represents data from different sources: GenBank, Gene Ontology, Sequence Ontology, NCBI Taxonomy and Short Read Archive through a unified interface. Having access to all the utility described above makes the bio package well suited for exploratory analysis of genomes.

The software was written to teach bioinformatics and is the companion software to the Biostar Handbook

Why does this software exist?

If you’ve ever done bioinformatics you know how even seemingly straightforward tasks require multiple steps, arcane incantations, reading documentation, and numerous other preparations that slow down your progress.

Time and again, I found myself not pursuing an idea because getting to the fun part was too tedious. The bio package is meant to solve that tedium.

Diving right in

Here is how to align the sequences of SARS-COV-2 (NC_045512) versus the same region of a bat coronavirus (MN996532). First get the data:

bio fetch NC_045512 MN996532

Now align the sequences (showing 60bp for brevity).

bio align NC_045512 MN996532 --end 60  

# Ident=57(95.0%)  Mis=3(5.0%)  Gaps=0(0.0%)  Target=(1, 60)  Query=(1, 60)  Length=60  Score=273.0  NUC.4.4(11,1)

NC_045512.2  ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT
             ||||||||||||||||||.|||||||||||||||||.||||.|||||||||||||||||| 60
MN996532.2   ATTAAAGGTTTATACCTTTCCAGGTAACAAACCAACGAACTCTCGATCTCTTGTAGATCT

A more realistic workflow

Suppose you wanted to identify the mutations between the S protein of the bat coronavirus deposited as MN996532 and the S protein of the ancestral SARS-COV-2 virus designated by the NCBI via accession number NC_045512.

If you are a trained bioinformatician, think about all the steps you would need to perform to accomplish this task, then think about the effort it would take you to teach someone else how to do the same.

With the bio package, the process takes simple, concise steps.

Download and rename

First, we download and rename the data keep our sanity:

bio fetch NC_045512 --rename ncov
bio fetch MN996532  --rename ratg13

From now on, bio can operate on NC_045512 using the name ncov and on MN996532 using the name ratg13 no matter where you are on your computer!

Convert to different formats

bio stores data in an internal storage system that it can find from any location. There is no clutter of files or paths to remember. For example, in any directory, you now can type:

bio convert ncov --fasta --end 100 | head -2
>NC_045512.2 [1:100]
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT

and it will show you the FASTA representation of the genome

You could also convert the data stored under ncov name to other formats. Let’s convert features with type CDS to GFF:

bio convert ncov --gff --type CDS  | head -5
##gff-version 3
NC_045512.2 .   CDS 266 13468   .   +   .   ID=CDS-1;Parent=YP_009724389.1;Name=YP_009724389.1;gene=ORF1ab;locus_tag=GU280_gp01;ribosomal_slippage=;note=pp1ab; translated by -1 ribosomal frameshift;codon_start=1;product=ORF1ab polyprotein;protein_id=YP_009724389.1;db_xref=GeneID:43740578;gene_id=GU280_gp01;transcript_id=YP_009724389.1
NC_045512.2 .   CDS 13468   21555   .   +   .   ID=CDS-2;Parent=YP_009724389.1;Name=YP_009724389.1;gene=ORF1ab;locus_tag=GU280_gp01;ribosomal_slippage=;note=pp1ab; translated by -1 ribosomal frameshift;codon_start=1;product=ORF1ab polyprotein;protein_id=YP_009724389.1;db_xref=GeneID:43740578;gene_id=GU280_gp01;transcript_id=YP_009724389.1
NC_045512.2 .   CDS 266 13483   .   +   .   ID=CDS-3;Parent=YP_009725295.1;Name=YP_009725295.1;gene=ORF1ab;locus_tag=GU280_gp01;note=pp1a;codon_start=1;product=ORF1a polyprotein;protein_id=YP_009725295.1;db_xref=GeneID:43740578;gene_id=GU280_gp01;transcript_id=YP_009725295.1
NC_045512.2 .   CDS 21563   25384   .   +   .   ID=CDS-4;Parent=YP_009724390.1;Name=YP_009724390.1;gene=S;locus_tag=GU280_gp02;gene_synonym=spike glycoprotein;note=structural protein; spike protein;codon_start=1;product=surface glycoprotein;protein_id=YP_009724390.1;db_xref=GeneID:43740568;gene_id=GU280_gp02;transcript_id=YP_009724390.1

Align nucleotides or peptides

Now, back to our problem of aligning proteins. Let’s align the first 90 base pairs of DNA sequences for the S protein for each organism, bio even gives you a shortcut; instead of typing --gene S --type CDS you can write it as ncov:S :

bio align ncov:S ratg13:S --end 60

# Ident=57(95.0%)  Mis=3(5.0%)  Gaps=0(0.0%)  Target=(1, 60)  Query=(1, 60)  Length=60  Score=273.0  NUC.4.4(11,1)

YP_009724390 ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACC
             ||||||||||||||||||||||||||||||||.||||||||||||||||||||.|||||. 60
QHR63300.2   ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTTTCTAGTCAGTGTGTTAATCTAACAACT

We can visualize the translation of the DNA into aminoacids with one letter (-1) or three-letter codes (-3):

bio align ncov:S ratg13:S --end 60 -1

# Ident=57(95.0%)  Mis=3(5.0%)  Gaps=0(0.0%)  Target=(1, 60)  Query=(1, 60)  Length=60  Score=273.0  NUC.4.4(11,1)

              M  F  V  F  L  V  L  L  P  L  V  S  S  Q  C  V  N  L  T  T 
YP_009724390 ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACC
             ||||||||||||||||||||||||||||||||.||||||||||||||||||||.|||||. 60
QHR63300.2   ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTTTCTAGTCAGTGTGTTAATCTAACAACT
              M  F  V  F  L  V  L  L  P  L  V  S  S  Q  C  V  N  L  T  T 

If, instead, we wanted to align the 60bp DNA subsequences for S protein after their translation into proteins, we could do it like so:

bio align ncov:S ratg13:S --translate --end 60

# Ident=20(100.0%)  Mis=0(0.0%)  Gaps=0(0.0%)  Target=(1, 20)  Query=(1, 20)  Length=20  Score=98.0  BLOSUM62(11,1)

YP_009724390 MFVFLVLLPLVSSQCVNLTT
             |||||||||||||||||||| 20
QHR63300.2   MFVFLVLLPLVSSQCVNLTT

We can note right away that all differences in the first 60bp of DNA are synonymous substitutions, the protein translations are the same.

Look up the taxonomy

bio understands taxonomies. Finding the lineage of the organism is as simple as:

bio taxon ncov --lineage
superkingdom, Viruses, 10239
   clade, Riboviria, 2559587
      kingdom, Orthornavirae, 2732396
         phylum, Pisuviricota, 2732408
            class, Pisoniviricetes, 2732506
               order, Nidovirales, 76804
                  suborder, Cornidovirineae, 2499399
                     family, Coronaviridae, 11118
                        subfamily, Orthocoronavirinae, 2501931
                           genus, Betacoronavirus, 694002
                              subgenus, Sarbecovirus, 2509511
                                 species, Severe acute respiratory syndrome-related coronavirus, 694009
                                    no rank, Severe acute respiratory syndrome coronavirus 2, 2697049

See the bioproject

bio knows about bio projects and sequencing data. As it turns out the data for ncov data is not adequately cross-referenced at NCBI … thus, we can’t quite get the SRR run numbers automatically, even at NCBI.

Let’s pick another data that has better cross-references, perhaps a virus from the 2014 Ebola outbreak:

bio fetch KM233118 --rename ebola

and now print:

bio runinfo ebola 
ebola   BioProject  PRJNA257197
ebola   BioSample   SAMN02952049

if we wanted the SRR run numbers, we could run:

bio runinfo ebola --sample

to get:

[
    {
        "Run": "SRR1553609",
        "ReleaseDate": "2014-08-19 11:41:53",
        "LoadDate": "2014-08-19 11:18:49",
        "spots": "464802",
        "bases": "93890004",
        "spots_with_mates": "464802",
        "avgLength": "202",
        "size_MB": "51",
        "download_path": "https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR1553609/SRR1553609.1",
        "Experiment": "SRX674271",
        "LibraryName": "NM042.3.FCH9",
        "LibraryStrategy": "RNA-Seq",
        "LibrarySelection": "cDNA",
        "LibrarySource": "TRANSCRIPTOMIC",
        "LibraryLayout": "PAIRED",
...
            

bio is a data model

Beyond the functionality that we show, bio is also an exploration into modeling biological data. The current standards and practices are woefully antiquated and painfully inadequate. Default formats such as GenBank or EMBL are inefficient and tedious to program with.

In contrast bio represents data in simple, efficient, compressed in JSON format.

bio convert ncov --json | head -20
[
    {
        "id": "NC_045512.2",
        "definition": "Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome",
        "dblink": [
            "BioProject:PRJNA485481"
        ],
        "locus": "NC_045512",
        "feature_count": 57,
        "origin_len": 29903,
        "molecule_type": "ss-RNA",
        "topology": "linear",
        "data_file_division": "VRL",
        "date": "18-JUL-2020",
        "accessions": [
            "NC_045512"
        ],
        "sequence_version": 2,
        "keywords": [
            "RefSeq"

The data layout allows bio to read in the entire human chromosome 1, with its 253 million characters and 328 thousand genomic features, in just three(!) seconds. In another 3 seconds, bio can convert that information FASTA or GFF; it can filter it by type, translate the sequence, extract proteins, slice by coordinate, etc.:

time bio convert chr1 --fasta | wc -c
253105766

real    0m6.238s
user    0m4.156s
sys     0m2.172s

For shorter genomes, bacterial or viral, the conversion times are under a fraction of a second.

Thanks to the representation, it is trivially easy to extend bio. The data is already structured in an efficient layout that needs no additional parsing to load.