Ontology operations

The bio package provides utilities for searching the gene ontology and the sequence ontology.

Building the database

Before using the ontology-related functionality, the local database representation needs to be built:

bio define --build

The command above needs to be run only once to download the data; rerun it periodically (perhaps on a monthly basis) to pick up the latest release. How long the build takes depends on the speed of the hard drive; it typically finishes in around 30 seconds.
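
One way to follow the "run once, refresh perhaps monthly" advice is to guard the build with a timestamp file. The sketch below is a hypothetical helper and not part of bio: the stamp file and both function names are assumptions, and the build command is passed in as arguments rather than hard-coded.

```shell
#!/bin/sh
# Hypothetical helper, not part of bio: the stamp file is our own marker
# and is never read or written by bio itself.

# Succeed when the stamp file is missing or older than DAYS days.
needs_rebuild() {
    stamp="$1"
    days="$2"
    [ -e "$stamp" ] || return 0
    [ -n "$(find "$stamp" -mtime +"$days" 2>/dev/null)" ]
}

# Usage: refresh_if_stale STAMP DAYS COMMAND [ARGS...]
# Runs COMMAND only when a rebuild is due and updates the stamp on success.
refresh_if_stale() {
    stamp="$1"
    days="$2"
    shift 2
    if needs_rebuild "$stamp" "$days"; then
        "$@" && touch "$stamp"
    fi
}
```

Calling refresh_if_stale with a stamp path of your choice, 30 as the number of days, and the build command shown above as the trailing arguments would then rebuild the database at most about once a month.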

Check database

# Check the database
bio define
OntologyDB: total=49,747 gene=47,218 sequence=2,529

There are a total of 49,747 ontology terms, of which 47,218 are gene ontology terms and 2,529 are sequence ontology terms.
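
The three counts in the summary line should satisfy total = gene + sequence, which makes for a cheap sanity check after a rebuild. The following is a minimal sketch that assumes the summary keeps the key=value layout shown above; the function names are made up for illustration.

```shell
#!/bin/sh
# Print one count (total, gene or sequence) from the summary line, commas removed.
count_of() {
    key="$1"
    line="$2"
    printf '%s\n' "$line" | tr ' ' '\n' | sed -n "s/^$key=//p" | tr -d ','
}

# Succeed when gene + sequence equals total.
counts_consistent() {
    line="$1"
    total=$(count_of total "$line")
    gene=$(count_of gene "$line")
    sequence=$(count_of sequence "$line")
    [ $((gene + sequence)) -eq "$total" ]
}

# Sample line copied from the output above; in practice use "$(bio define)".
summary='OntologyDB: total=49,747 gene=47,218 sequence=2,529'
if counts_consistent "$summary"; then
    echo "counts add up: $(count_of total "$summary") terms"
fi
```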

Define a term

# Define the term
bio define exon

## exon (SO:0000147)

A region of the transcript sequence within a gene which is not removed from the
primary RNA transcript by RNA splicing.

Parents:
- transcript_region 

Children:
- coding_exon 
- noncoding_exon 
- interior_exon 
- decayed_exon (non_functional_homolog_of)
- pseudogenic_exon (non_functional_homolog_of)
- exon_region (part_of)
- exon_of_single_exon_gene 
 
# Define term by SO id
bio define SO:0000147

## exon (SO:0000147)

A region of the transcript sequence within a gene which is not removed from the
primary RNA transcript by RNA splicing.

Parents:
- transcript_region 

Children:
- coding_exon 
- noncoding_exon 
- interior_exon 
- decayed_exon (non_functional_homolog_of)
- pseudogenic_exon (non_functional_homolog_of)
- exon_region (part_of)
- exon_of_single_exon_gene 
 

The output starts with the matching term and its identifier, followed by the term's definition, its parents and its children.
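
Because the report is plain text, the child terms can be pulled out with standard tools. The sketch below assumes the "Children:" heading and the "- name" bullet layout shown above stay as they are; list_children is a hypothetical helper, not a bio command.

```shell
#!/bin/sh
# Print the names listed under "Children:" in the report read from stdin.
list_children() {
    awk '
        /^[ \t]*Children:/          { in_children = 1; next }
        in_children && /^[ \t]*- /  { print $2; next }
        in_children                 { in_children = 0 }
    '
}

# Sample abridged from the output above.
# In practice: bio define exon | list_children
list_children <<'EOF'
## exon (SO:0000147)

Parents:
- transcript_region

Children:
- coding_exon
- decayed_exon (non_functional_homolog_of)
- exon_region (part_of)
EOF
```

Only the term name is printed; the relationship note in parentheses is dropped, which keeps the result easy to feed back into further bio define calls.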

bio define positive regulation of cell motility

bio define cellular response to tumor cell

bio define intergenic mrna trans splicing

Showing the term lineage

# show term lineage 
bio define exon --lineage
SO:0000110  sequence_feature
  SO:0000001  region
    SO:0001411  biological_region
      SO:0000833  transcript_region

        ## exon (SO:0000147)

        A region of the transcript sequence within a gene which is not removed from the
        primary RNA transcript by RNA splicing.

        Children:
        - coding_exon 
        - noncoding_exon 
        - interior_exon 
        - decayed_exon (non_functional_homolog_of)
        - pseudogenic_exon (non_functional_homolog_of)
        - exon_region (part_of)
        - exon_of_single_exon_gene 
 
*** More than on path detected, use -P to view all relationships.

# Show term lineage by SO id
bio define SO:0000147 --lineage
SO:0000110  sequence_feature
  SO:0000001  region
    SO:0001411  biological_region
      SO:0000833  transcript_region

        ## exon (SO:0000147)

        A region of the transcript sequence within a gene which is not removed from the
        primary RNA transcript by RNA splicing.

        Children:
        - coding_exon 
        - noncoding_exon 
        - interior_exon 
        - decayed_exon (non_functional_homolog_of)
        - pseudogenic_exon (non_functional_homolog_of)
        - exon_region (part_of)
        - exon_of_single_exon_gene 
 
*** More than on path detected, use -P to view all relationships.
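
The indented lineage can be flattened into a single path when it needs to go into a table or a log. This is a rough sketch that assumes every ancestor line has the "ID  name" layout shown above and that only one path is printed; lineage_path is a hypothetical helper name.

```shell
#!/bin/sh
# Join the ancestor names of a lineage listing (read from stdin) into one line.
lineage_path() {
    awk '
        /^[ \t]*(SO|GO):[0-9]+[ \t]+/ {
            sub(/^[ \t]*(SO|GO):[0-9]+[ \t]+/, "")   # drop indent and identifier
            sub(/[ \t]+$/, "")                        # drop trailing blanks
            path = (path == "" ? $0 : path " > " $0)
        }
        END { if (path != "") print path }
    '
}

# Sample copied from the output above.
# In practice: bio define exon --lineage | lineage_path
lineage_path <<'EOF'
SO:0000110  sequence_feature
  SO:0000001  region
    SO:0001411  biological_region
      SO:0000833  transcript_region

        ## exon (SO:0000147)
EOF
```

The term's own heading line is skipped because it does not start with an identifier, so the result lists ancestors only.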

Searching the database

Any query that does not match a term exactly is treated as a search of the database. The --go flag restricts the search to gene ontology, while --so restricts it to sequence ontology.

Without the --so or --go flags, results from both ontologies are printed.

To search for both sequence and gene ontology:

# Search by keyword
bio define histone | head 
GO:0000118  histone deacetylase complex
GO:0000123  histone acetyltransferase complex
GO:0000412  histone peptidyl-prolyl isomerization
GO:0000414  regulation of histone h3-k36 methylation
GO:0000415  negative regulation of histone h3-k36 methylation
GO:0000416  positive regulation of histone h3-k36 methylation
GO:0001207  histone displacement
GO:0001208  histone h2a-h2b dimer displacement
GO:0003762  obsolete histone-specific chaperone activity
GO:0004402  histone acetyltransferase activity
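
Search hits are one per line, with the identifier in the first column, so they combine well with ordinary text filters. The sketch below assumes that layout holds for both ontologies and that retired terms are marked by a leading "obsolete" in the description, as in the listing above; both function names are hypothetical.

```shell
#!/bin/sh
# Drop search hits whose description starts with the word "obsolete".
drop_obsolete() {
    awk '$2 != "obsolete"'
}

# Count hits per ontology prefix (the text before the colon in column one).
count_by_ontology() {
    awk '{ split($1, id, ":"); n[id[1]]++ }
         END { for (k in n) print k, n[k] }' | sort
}

# Sample lines copied from the output above.
# In practice: bio define histone | drop_obsolete | count_by_ontology
drop_obsolete <<'EOF' | count_by_ontology
GO:0001208  histone h2a-h2b dimer displacement
GO:0003762  obsolete histone-specific chaperone activity
GO:0004402  histone acetyltransferase activity
EOF
```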

To search for gene ontology:

# Search by keyword
bio define histone --go | head
GO:0000118  histone deacetylase complex
GO:0000123  histone acetyltransferase complex
GO:0000412  histone peptidyl-prolyl isomerization
GO:0000414  regulation of histone h3-k36 methylation
GO:0000415  negative regulation of histone h3-k36 methylation
GO:0000416  positive regulation of histone h3-k36 methylation
GO:0001207  histone displacement
GO:0001208  histone h2a-h2b dimer displacement
GO:0003762  obsolete histone-specific chaperone activity
GO:0004402  histone acetyltransferase activity

To search for sequence ontology:

# Search by keyword
bio define histone --so | head

Preloading data

For many use cases, the default behavior is plenty fast and produces results in a fraction of a second.

Internally, during operation, the software will query the database for each child node. When a query matches a term with a large number of descendant nodes (over 10,000 nodes), the run time of the independent queries adds up to a substantial overhead.

When run like so, the command takes around 6 seconds:

bio define regulation

The software can operate in a different mode that speeds up the process massively by preloading all the data into memory, at the cost of a 1 second preloading penalty.

bio define regulation --preload

When run with the --preload flag, the command takes less than 2 seconds to generate the same result. We don’t apply this mode by default because all queries would then take at least 1 second, even those that currently finish very quickly.

For queries that take more than 1 second to complete we recommend applying the --preload flag.
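
To decide whether a given query crosses the 1 second mark, it can simply be timed. The sketch below is a hypothetical helper, not part of bio; it relies on date +%s, which is widely available but measures whole seconds only, so treat the verdict as a rough guide.

```shell
#!/bin/sh
# Print how many whole seconds COMMAND took; its output is discarded.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1 || true
    end=$(date +%s)
    echo $((end - start))
}

# Succeed when COMMAND ran for more than one second.
wants_preload() {
    [ "$(elapsed_seconds "$@")" -gt 1 ]
}
```

Passing the query as arguments, for example wants_preload followed by the regulation query above, succeeds when the query is slow enough that adding --preload is likely to pay off.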