Taxonomy operations

The bio package provides utilities to visualize NCBI taxonomies.

First steps

Before use, the taxonomy-related database needs to be downloaded with:

bio taxon --download 

The above command takes about 6 minutes to obtain the remote databases and store them locally.

Check database

bio taxon

prints:

TaxDB: nodes=2,288,072 parents=198,666

There are a total of 2,288,072 nodes (taxonomic entries), of which 198,666 are non-terminal (non-leaf) nodes. From these numbers we see that the vast majority of the taxonomy entries (2,089,406, about 91%) are terminal (leaf) nodes.
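
The arithmetic behind that statement can be checked with a short shell sketch. The helper name `leaf_stats` is made up for this illustration and is not part of bio; the two counts are copied from the `TaxDB` line above.

```shell
# Hypothetical helper (not part of bio): derive leaf statistics
# from the two counts reported by `bio taxon`.
leaf_stats() {
    nodes=$1      # total number of nodes
    parents=$2    # number of non-leaf nodes
    leaves=$((nodes - parents))
    # Integer arithmetic: tenths of a percent, rounded down.
    permille=$((leaves * 1000 / nodes))
    printf '%d leaf nodes (%d.%d%% of all nodes)\n' \
        "$leaves" "$((permille / 10))" "$((permille % 10))"
}

leaf_stats 2288072 198666
# prints: 2089406 leaf nodes (91.3% of all nodes)
```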

Searching for taxids

Search the taxonomy for a word:

bio taxon jawed 

prints:

# searching taxonomy for: jawed
clade, Gnathostomata (jawed vertebrates), 7776
species, Gillichthys mirabilis (long-jawed mudsucker), 8222
species, Pseudamia amblyuroptera (white-jawed cardinalfish), 1431476
species, Myctophum brachygnathum (short-jawed lanternfish), 1519985
species, Oryzias orthognathus (sharp-jawed buntingi), 1645897
species, Longjawed orbweaver circular virus 1, 2293294
species, Longjawed orbweaver circular virus 2, 2293295
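
Each result line ends with the taxid, so the ids can be collected for use in later commands. A minimal sketch, assuming the output keeps the `rank, name, taxid` layout shown above; the function name `taxids` is hypothetical, and the sample lines are copied from the output above rather than produced by running bio:

```shell
# Hypothetical helper (not part of bio): print the last comma-separated
# field (the taxid) of every result line, skipping the "# searching ..."
# header line.
taxids() {
    awk -F', ' '!/^#/ && NF >= 3 { print $NF }'
}

# Sample lines copied from the search output above; in practice
# one would pipe the output of `bio taxon jawed` into the function.
taxids <<'EOF'
# searching taxonomy for: jawed
clade, Gnathostomata (jawed vertebrates), 7776
species, Gillichthys mirabilis (long-jawed mudsucker), 8222
species, Longjawed orbweaver circular virus 1, 2293294
EOF
```

Taking the last field rather than the third keeps the helper working if a name itself contains a comma.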

The search words may use regular expression control characters:

bio taxon '^jawed'

produces:

# searching taxonomy for: ^jawed
clade, Gnathostomata (jawed vertebrates), 7776

View taxonomy for data

Once you fetch the data

bio fetch NC_045512 --rename ncov

you can view the descendants:

bio taxon ncov
no rank, Severe acute respiratory syndrome coronavirus 2, 2697049

or view the lineage:

bio taxon ncov --lineage
superkingdom, Viruses, 10239
   clade, Riboviria, 2559587
      kingdom, Orthornavirae, 2732396
         phylum, Pisuviricota, 2732408
            class, Pisoniviricetes, 2732506
               order, Nidovirales, 76804
                  suborder, Cornidovirineae, 2499399
                     family, Coronaviridae, 11118
                        subfamily, Orthocoronavirinae, 2501931
                           genus, Betacoronavirus, 694002
                              subgenus, Sarbecovirus, 2509511
                                 species, Severe acute respiratory syndrome-related coronavirus, 694009
                                    no rank, Severe acute respiratory syndrome coronavirus 2, 2697049

View taxonomy by tax id

Pass an NCBI taxonomic id to see all of its descendants:

bio taxon 117565 | head
class, Myxini, 117565
   order, Myxiniformes, 7761
      family, Myxinidae (hagfishes), 7762
         subfamily, Eptatretinae, 30309
            genus, Eptatretus, 7763
               species, Eptatretus burgeri (inshore hagfish), 7764
               species, Eptatretus stoutii (Pacific hagfish), 7765
               species, Eptatretus okinoseanus, 7767
               species, Eptatretus atami, 50612
               species, Eptatretus cirrhatus (broadgilled hagfish), 78394
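
The indentation carries the tree structure: in the output above, each level is indented three spaces further than its parent. A minimal sketch that turns the indentation into an explicit depth column, assuming the three-space convention holds for all output; `tree_depth` is a hypothetical name, not a bio command:

```shell
# Hypothetical helper (not part of bio): prefix each line with its
# depth in the tree, computed from the number of leading spaces.
tree_depth() {
    awk '{
        n = 0
        if (match($0, /^ +/)) n = RLENGTH   # count leading spaces
        sub(/^ +/, "")                      # drop the indentation
        print n / 3 "\t" $0                 # assumes 3 spaces per level
    }'
}

# Sample lines copied from the output above.
tree_depth <<'EOF'
class, Myxini, 117565
   order, Myxiniformes, 7761
      family, Myxinidae (hagfishes), 7762
EOF
```

The tab-separated depth makes it easy to filter the tree with standard tools, for example keeping only entries at a given depth.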

To print the lineage of a term use:

bio taxon 564286 --lineage
no rank, cellular organisms, 131567
   superkingdom, Bacteria (eubacteria), 2
      clade, Terrabacteria group, 1783272
         phylum, Firmicutes, 1239
            class, Bacilli, 91061
               order, Bacillales, 1385
                  family, Bacillaceae, 186817
                     genus, Bacillus, 1386
                        species group, Bacillus subtilis group, 653685
                           species, Bacillus subtilis, 1423
                              strain, Bacillus subtilis str. 10, 564286

The lineage may be flattened:

bio taxon 564286 --lineage --flat
564286  cellular organisms;Bacteria;Terrabacteria group;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Bacillus subtilis group;Bacillus subtilis;Bacillus subtilis str. 10
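
The flat form is convenient for scripting. A minimal sketch that prints one lineage entry per line, assuming the taxid is separated from the lineage by whitespace (whether that is a tab or spaces is not confirmed here) and that entries are separated by semicolons as shown above; `split_lineage` is a hypothetical name, not a bio command:

```shell
# Hypothetical helper (not part of bio): split a flat lineage line
# into one entry per line.
split_lineage() {
    # Strip the leading taxid and the whitespace after it,
    # then put each semicolon-separated entry on its own line.
    sed 's/^[0-9][0-9]*[[:space:]]*//' | tr ';' '\n'
}

# Sample input shortened from the output above.
printf '564286\tcellular organisms;Bacteria;Terrabacteria group\n' | split_lineage
```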

Filter BLAST results

(TODO) Filter BLAST alignments for species that fall within a taxonomic clade.

List the content of the database:

bio taxon --list | head

prints:

no rank, root, 1
superkingdom, Bacteria (eubacteria), 2
genus, Azorhizobium, 6
species, Azorhizobium caulinodans, 7
species, Buchnera aphidicola, 9
genus, Cellvibrio, 10
species, Cellulomonas gilvus, 11
genus, Dictyoglomus, 13
species, Dictyoglomus thermophilum, 14
genus, Methylophilus, 16

Note: this command benefits greatly from using --preload.
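
Because the listing is one entry per line, it can be summarized with standard tools. A minimal sketch that counts entries per rank, assuming the `rank, name, taxid` layout shown above; `rank_counts` is a hypothetical name, and the sample lines are copied from the listing above rather than produced by running bio:

```shell
# Hypothetical helper (not part of bio): count how many entries
# each rank has, most frequent rank first.
rank_counts() {
    tab=$(printf '\t')
    awk -F', ' '{ count[$1]++ }
        END { for (r in count) printf "%d\t%s\n", count[r], r }' |
        sort -t "$tab" -k1,1nr -k2,2
}

# Sample lines copied from the listing above; in practice one would
# pipe the output of `bio taxon --list` into the function.
rank_counts <<'EOF'
no rank, root, 1
superkingdom, Bacteria (eubacteria), 2
genus, Azorhizobium, 6
species, Azorhizobium caulinodans, 7
species, Buchnera aphidicola, 9
EOF
```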

Update the taxonomy

You may build the newest version locally:

bio taxon --update --build

The command will download and build a new taxonomy using the latest NCBI taxonomy data. The duration of the process depends on the speed of the hard drive; it takes around 30 minutes.

Preloading data

For many use cases, the default behavior is plenty fast and can produce family, genus and species level information in a fraction of a second.

Internally, during operation, the software will query the database for each child node. When selecting a rank where the number of descendant nodes is large (over 10,000 nodes) the run time of the independent queries adds up to a substantial overhead.

For example, the command below attempts to render the complete NCBI taxonomic tree, with over 2.2 million descendant nodes. Run this way, it takes a very long time to produce the output (more than two hours):

bio taxon 1 

The software can operate in a different mode that speeds up the process massively by preloading all the data into memory, at the cost of a 6-second preloading penalty:

bio taxon 1 --preload

When run with the --preload flag, the command takes a total of just 11 seconds to generate the same tree of the entire NCBI taxonomy. We don’t apply this mode by default because all queries would then take at least 6 seconds, even those that currently finish very quickly.

For queries that take more than 10 seconds to complete (those with more than 10,000 descendant nodes), we recommend applying the --preload flag.