Ontology operations
The bio package provides utilities to search the gene and sequence ontologies.
Building the database
Before using the ontology-related functionality, the local database needs to be built:
bio define --build
Run the command above once to download the data and build the database, then rerun it periodically (perhaps monthly) to pick up the latest data. The build takes around 30 seconds, depending on the speed of the hard drive.
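A periodic refresh could be automated with a small wrapper such as the sketch below. It is not part of bio: the helper names, the stamp file, and the 30-day threshold are all assumptions made for illustration, and the build command is passed in as arguments rather than hard-coded.

```shell
# Sketch: rerun a build command only when a stamp file is missing or old.
# The stamp file is a hypothetical marker kept by this script, not by bio.

# stamp_is_stale FILE DAYS
# Succeeds if FILE is missing or find reports it as older than DAYS days.
stamp_is_stale() {
    [ -e "$1" ] || return 0
    # find prints the path only when the -mtime test matches
    [ -n "$(find "$1" -mtime +"$2" -print 2>/dev/null)" ]
}

# refresh_if_stale FILE DAYS COMMAND [ARGS...]
# Runs COMMAND when the stamp is stale and touches the stamp on success.
refresh_if_stale() {
    _stamp=$1
    _days=$2
    shift 2
    if stamp_is_stale "$_stamp" "$_days"; then
        "$@" && touch "$_stamp"
    fi
}

# Example (left as a comment so the sketch has no side effects):
# refresh_if_stale "$HOME/.bio_define_stamp" 30 <build command shown above>
```

Passing the command as arguments keeps the wrapper independent of the exact build invocation, and the stamp is only updated when the build succeeds.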
Check database
# Check the database
bio define
OntologyDB: total=49,747 gene=47,218 sequence=2,529
There are a total of 49,747 ontology terms, of which 47,218 are gene and 2,529 are sequence.
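As a quick sanity check, the counts in the summary line can be parsed and added up. The sketch below works on a copy of the line shown above; the helper name get_count is an assumption for illustration, and in practice the line would be captured from the command's output.

```shell
# Sketch: check that the per-ontology counts add up to the reported total.
# The summary line is copied verbatim from the output shown above.
summary='OntologyDB: total=49,747 gene=47,218 sequence=2,529'

# get_count LINE KEY
# Prints the number stored as KEY=..., with thousands separators removed.
get_count() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n "s/^$2=//p" | tr -d ','
}

total=$(get_count "$summary" total)
gene=$(get_count "$summary" gene)
sequence=$(get_count "$summary" sequence)

if [ "$((gene + sequence))" -eq "$total" ]; then
    echo "counts are consistent: $gene + $sequence = $total"
else
    echo "count mismatch: $gene + $sequence != $total" >&2
fi
```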
Define a term
# Define the term
bio define exon
## exon (SO:0000147)
A region of the transcript sequence within a gene which is not removed from the
primary RNA transcript by RNA splicing.
Parents:
- transcript_region
Children:
- coding_exon
- noncoding_exon
- interior_exon
- decayed_exon (non_functional_homolog_of)
- pseudogenic_exon (non_functional_homolog_of)
- exon_region (part_of)
- exon_of_single_exon_gene
# Define the term by SO id
bio define SO:0000147
## exon (SO:0000147)
A region of the transcript sequence within a gene which is not removed from the
primary RNA transcript by RNA splicing.
Parents:
- transcript_region
Children:
- coding_exon
- noncoding_exon
- interior_exon
- decayed_exon (non_functional_homolog_of)
- pseudogenic_exon (non_functional_homolog_of)
- exon_region (part_of)
- exon_of_single_exon_gene
The header line names the matching ontology term. It is followed by the definition of the term, then by its parents and its children.
Terms that consist of several words can be passed as they are:
bio define positive regulation of cell motility
bio define cellular response to tumor cell
bio define intergenic mrna trans splicing
Showing the term lineage
# Show term lineage
bio define exon --lineage
SO:0000110 sequence_feature
SO:0000001 region
SO:0001411 biological_region
SO:0000833 transcript_region
## exon (SO:0000147)
A region of the transcript sequence within a gene which is not removed from the
primary RNA transcript by RNA splicing.
Children:
- coding_exon
- noncoding_exon
- interior_exon
- decayed_exon (non_functional_homolog_of)
- pseudogenic_exon (non_functional_homolog_of)
- exon_region (part_of)
- exon_of_single_exon_gene
*** More than on path detected, use -P to view all relationships.
# Show term lineage by SO id
bio define SO:0000147 --lineage
SO:0000110 sequence_feature
SO:0000001 region
SO:0001411 biological_region
SO:0000833 transcript_region
## exon (SO:0000147)
A region of the transcript sequence within a gene which is not removed from the
primary RNA transcript by RNA splicing.
Children:
- coding_exon
- noncoding_exon
- interior_exon
- decayed_exon (non_functional_homolog_of)
- pseudogenic_exon (non_functional_homolog_of)
- exon_region (part_of)
- exon_of_single_exon_gene
*** More than on path detected, use -P to view all relationships.
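For scripting, the lineage output can be condensed into a single path. The sketch below is a post-processing step, not a bio feature: the function name lineage_path is hypothetical, and it assumes the layout shown above, where each ancestor line starts with an ontology id and the matched term appears on a line starting with "## ". The sample input is copied from that output.

```shell
# Sketch: turn lineage output into "ancestor > ... > term".
lineage_path() {
    awk '
        # Ancestor lines look like: SO:0000110 sequence_feature
        /^ *(SO|GO):[0-9]+ / {
            sub(/^ *(SO|GO):[0-9]+ +/, "")
            path = path $0 " > "
            next
        }
        # The header line looks like: ## exon (SO:0000147)
        /^## / {
            sub(/^## /, "")
            sub(/ \((SO|GO):[0-9]+\)$/, "")
            print path $0
            exit
        }
    '
}

result=$(lineage_path <<'EOF'
SO:0000110 sequence_feature
SO:0000001 region
SO:0001411 biological_region
SO:0000833 transcript_region
## exon (SO:0000147)
A region of the transcript sequence within a gene which is not removed from the
primary RNA transcript by RNA splicing.
EOF
)
echo "$result"
```

With real output, the function would read from a pipe instead of the here-document. It follows only the single path that is printed by default.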
Searching the database
Any query that does not match a term exactly is treated as a search. The --go flag restricts the search to the gene ontology, while the --so flag restricts it to the sequence ontology. Without either flag, results from both ontologies are printed.
To search for both sequence and gene ontology:
# Search by a keyword
bio define histone | head
GO:0000118 histone deacetylase complex
GO:0000123 histone acetyltransferase complex
GO:0000412 histone peptidyl-prolyl isomerization
GO:0000414 regulation of histone h3-k36 methylation
GO:0000415 negative regulation of histone h3-k36 methylation
GO:0000416 positive regulation of histone h3-k36 methylation
GO:0001207 histone displacement
GO:0001208 histone h2a-h2b dimer displacement
GO:0003762 obsolete histone-specific chaperone activity
GO:0004402 histone acetyltransferase activity
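Because every result line starts with an id whose prefix names the ontology, mixed results can also be tallied after the fact. The sketch below assumes that line format; the function name count_by_ontology is hypothetical, and the sample input is the ten lines shown above, which all come from the gene ontology.

```shell
# Sketch: count search results per ontology prefix (GO, SO).
count_by_ontology() {
    # Split on ":" so that $1 is the prefix of lines such as "GO:0000118 ..."
    awk -F: '/^[A-Z]+:[0-9]+ / { n[$1]++ } END { for (k in n) print k, n[k] }' | sort
}

counts=$(count_by_ontology <<'EOF'
GO:0000118 histone deacetylase complex
GO:0000123 histone acetyltransferase complex
GO:0000412 histone peptidyl-prolyl isomerization
GO:0000414 regulation of histone h3-k36 methylation
GO:0000415 negative regulation of histone h3-k36 methylation
GO:0000416 positive regulation of histone h3-k36 methylation
GO:0001207 histone displacement
GO:0001208 histone h2a-h2b dimer displacement
GO:0003762 obsolete histone-specific chaperone activity
GO:0004402 histone acetyltransferase activity
EOF
)
echo "$counts"
```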
To search for gene ontology:
# Search by a keyword
bio define histone --go | head
GO:0000118 histone deacetylase complex
GO:0000123 histone acetyltransferase complex
GO:0000412 histone peptidyl-prolyl isomerization
GO:0000414 regulation of histone h3-k36 methylation
GO:0000415 negative regulation of histone h3-k36 methylation
GO:0000416 positive regulation of histone h3-k36 methylation
GO:0001207 histone displacement
GO:0001208 histone h2a-h2b dimer displacement
GO:0003762 obsolete histone-specific chaperone activity
GO:0004402 histone acetyltransferase activity
To search for sequence ontology:
# Search by a keyword
bio define histone --so | head
Preloading data
For many use cases, the default behavior is plenty fast and produces results in a fraction of a second.
Internally, the software queries the database separately for each child node. When a term has a large number of descendant nodes (over 10,000), the run time of these independent queries adds up to a substantial overhead.
When run like so, the command takes around 6 seconds:
bio define regulation
The software can also operate in a different mode that speeds up the process considerably by preloading all the data into memory, at the cost of a 1-second preloading penalty:
bio define regulation --preload
When run with the --preload flag, the command takes less than 2 seconds to generate the same result.
We don’t apply this mode by default because all queries would then take at least 1 second, even those that currently finish very quickly.
For queries that take more than 1 second to complete, we recommend applying the --preload flag.
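One way to find out which queries cross that threshold is to time them. The wrapper below is a rough sketch rather than a bio feature: the name run_timed is hypothetical, and the measurement uses date +%s, which has whole-second resolution, so runs close to the 1-second mark may or may not trigger the hint.

```shell
# run_timed COMMAND [ARGS...]
# Runs COMMAND, then prints a hint on stderr when the measured wall time
# is above 1 second. The exit status of COMMAND is passed through.
run_timed() {
    _start=$(date +%s)
    "$@"
    _status=$?
    _elapsed=$(( $(date +%s) - _start ))
    if [ "$_elapsed" -gt 1 ]; then
        echo "took about ${_elapsed}s; consider adding --preload" >&2
    fi
    return "$_status"
}

# Example (left as a comment; replace with the query to measure):
# run_timed <define command> > /dev/null
```

The hint goes to stderr so that it does not mix with the query results on stdout.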