bio mygene: get gene information
The bio mygene
command provides command-line based access to the https://mygene.info/ query interface.
The mygene
service was published in High-performance web services for querying gene and variant annotation, Genome Research, 2016 and is an amazingly potent query interface that focuses on returning data rather than web pages.
Note that bio mygene
is merely a command line convenience function that stands on the shoulders of the giants that created the mygene service in the first place.
Rationale
We have implemented the bio mygene
command to facilitate quick explorations of gene information. For more information on data sources and representations, consult The Biostar Handbook chapters on Biological Data Sources. To install bio
use:
pip install bio --upgrade
The full documentation for bio
is maintained at https://www.bioinfo.help/.
Usage example
bio mygene HBB --limit 1
will print:
[
{
"name": "hemoglobin subunit beta",
"symbol": "HBB",
"taxid": 9606,
"taxname": "Homo sapiens"
}
]
# showing 1 out of 148 results.
The same search for HBB
will return a lot more data as we add fields to it. Lets focus the search on symbol only symbol:HBB
human as species, and request Ensembl annotations:
bio mygene symbol:HBB --species human --fields ensembl
the query will now produce a much larger dataset:
[
{
"ensembl": {
"gene": "ENSG00000244734",
"protein": [
"ENSP00000333994",
"ENSP00000369671",
"ENSP00000488004",
"ENSP00000494175",
"ENSP00000496200"
],
"transcript": [
"ENST00000335295",
"ENST00000380315",
"ENST00000475226",
"ENST00000485743",
"ENST00000633227",
"ENST00000647020"
],
"translation": [
{
"protein": "ENSP00000333994",
"rna": "ENST00000335295"
},
{
"protein": "ENSP00000494175",
"rna": "ENST00000647020"
},
{
"protein": "ENSP00000488004",
"rna": "ENST00000633227"
},
{
"protein": "ENSP00000496200",
"rna": "ENST00000485743"
},
{
"protein": "ENSP00000369671",
"rna": "ENST00000380315"
}
],
"type_of_gene": "protein_coding"
},
"name": "hemoglobin subunit beta",
"symbol": "HBB",
"taxid": 9606,
"taxname": "Homo sapiens"
}
]
The output of the bio mygene
command is in the JSON (JavaScript Object Notation) format. JSON is not biology specific, instead, you can think of it as a generic, lightweight, human readable approach to represent diverse and hierarchical information.
How to change JSON into simpler formats
The tool called jq
may be used to extract information from JSON data.
mamba install jq
to use jq
first store the output in a file:
bio mygene symbol:HBB --species human --fields ensembl > data.json
Then practice the building extraction patterns with the following rules:
- key names are specified with a period
.
- lists with square brackets
[]
.
We build the extraction command left to right, matching various aspects of it. Since the output is always a list, our first pattern will usually be .[]
that visits all elements in the list.
For example, the pattern .[].symbol
instructs jq
to fetch the list for the root key .[]
, then visit each element and extract the value for the key named symbol
. Use the pattern it like so:
cat data.json | jq -r '.[].symbol'
prints:
HBB
Here is a more complicated pattern: .[].ensembl.protein[]
it will visit each element of the root list, extract the value for the ensemble
key, then the value for the protein
key and and prints all the resulting elements of that list:
cat data.json | jq -r '.[].ensembl.protein[]'
ENSP00000333994
ENSP00000369671
ENSP00000488004
ENSP00000494175
ENSP00000496200
Advanced usage
jq
is quite powerful. Below is a more complicated construct that pairs up fields into a tab-delimited table.
cat data.json | jq -r '.[].ensembl.translation[]|[.protein,.rna]| @tsv'
ENSP00000333994 ENST00000335295
ENSP00000494175 ENST00000647020
ENSP00000488004 ENST00000633227
ENSP00000496200 ENST00000485743
ENSP00000369671 ENST00000380315
Well, yeah … not everyone’s cup of tea, suffice to say you can do wondrous things with jq
if you put in some effort to learn it.
Stick with it! Once you get it, making patterns becomes that puzzle-solving kind of fun.
How to learn jq
There are numerous resources on learning jq
. Search and find one that suits you. I found the page below quite useful:
Additional examples
Often examples are more useful than words
# Limiting searches
bio mygene HBB --limit 10
# Adding summary
bio mygene HBB -f summary --limit 1 --species human
# Adding additional fields.
bio mygene symbol:HBB --species human --fields Ensembl,RefSeq
# Gene products with the function in human
bio mygene go:0000307 --species human
# Text search for a concept.
bio mygene insulin
# Genes with insulin in the summary then also print the summary field
bio mygene summary:insulin -f summary
# Find aliases, produces the same output
bio mygene symbol:XRCC2 -f alias --species human
bio mygene alias:SPGF50 -f alias --species human
Field documention
See the fields listed at API documentation site: