Convert to JSON

bio obtains data from NCBI and transforms it into a simpler internal JSON format. You only need to process this format directly when you want functionality that bio does not yet offer.

Get a dataset

Get the SARS-CoV-2 data and rename it to ncov:

bio fetch NC_045512 --rename ncov

The GenBank data

Explore the contents of the file downloaded from NCBI. The first 20 lines show the GenBank header:

bio convert ncov --genbank | head -20
LOCUS       NC_045512              29903 bp ss-RNA     linear   VRL 18-JUL-2020
DEFINITION  Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1,
            complete genome.
ACCESSION   NC_045512
VERSION     NC_045512.2
DBLINK      BioProject: PRJNA485481
KEYWORDS    RefSeq.
SOURCE      Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)
  ORGANISM  Severe acute respiratory syndrome coronavirus 2
            Viruses; Riboviria; Orthornavirae; Pisuviricota; Pisoniviricetes;
            Nidovirales; Cornidovirineae; Coronaviridae; Orthocoronavirinae;
            Betacoronavirus; Sarbecovirus.
REFERENCE   1  (bases 1 to 29903)
  AUTHORS   Wu,F., Zhao,S., Yu,B., Chen,Y.M., Wang,W., Song,Z.G., Hu,Y.,
            Tao,Z.W., Tian,J.H., Pei,Y.Y., Yuan,M.L., Zhang,Y.L., Dai,F.H.,
            Liu,Y., Wang,Q.M., Zheng,J.J., Xu,L., Holmes,E.C. and Zhang,Y.Z.
  TITLE     A new coronavirus associated with human respiratory disease in
            China
  JOURNAL   Nature 579 (7798), 265-269 (2020)
   PUBMED   32015508

JSON data representation

View the same GenBank record in its JSON representation:

bio convert ncov --json | head -36
[
    {
        "id": "NC_045512.2",
        "definition": "Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome",
        "dblink": [
            "BioProject:PRJNA485481"
        ],
        "locus": "NC_045512",
        "feature_count": 57,
        "origin_len": 29903,
        "molecule_type": "ss-RNA",
        "topology": "linear",
        "data_file_division": "VRL",
        "date": "18-JUL-2020",
        "accessions": [
            "NC_045512"
        ],
        "sequence_version": 2,
        "keywords": [
            "RefSeq"
        ],
        "source": "Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)",
        "organism": "Severe acute respiratory syndrome coronavirus 2",
        "taxonomy": [
            "Viruses",
            "Riboviria",
            "Orthornavirae",
            "Pisuviricota",
            "Pisoniviricetes",
            "Nidovirales",
            "Cornidovirineae",
            "Coronaviridae",
            "Orthocoronavirinae",
            "Betacoronavirus",
            "Sarbecovirus"
        ],
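
The output above is cut off by head, so it is not valid JSON on its own. As a minimal sketch of how this format could be processed, the Python example below embeds a trimmed copy of the record and reads a few of its fields. It assumes only the keys visible in the output above (id, organism, origin_len, feature_count, taxonomy); the summarize function and the SAMPLE constant are hypothetical names made up for illustration and are not part of bio.

```python
import json

# A trimmed copy of the record shown above. Only keys visible in that
# output are used; the real output contains further fields not shown here.
SAMPLE = """
[
    {
        "id": "NC_045512.2",
        "locus": "NC_045512",
        "feature_count": 57,
        "origin_len": 29903,
        "molecule_type": "ss-RNA",
        "organism": "Severe acute respiratory syndrome coronavirus 2",
        "taxonomy": ["Viruses", "Riboviria", "Orthornavirae", "Pisuviricota",
                     "Pisoniviricetes", "Nidovirales", "Cornidovirineae",
                     "Coronaviridae", "Orthocoronavirinae", "Betacoronavirus",
                     "Sarbecovirus"]
    }
]
"""


def summarize(text):
    """Return one small summary dict per record in the JSON text.

    Hypothetical helper: the top level is assumed to be a list of
    records, as in the output above.
    """
    records = json.loads(text)
    return [
        {
            "id": rec["id"],
            "organism": rec["organism"],
            "length": rec["origin_len"],
            "features": rec["feature_count"],
            # Keep only the three most specific taxonomy ranks.
            "lineage": " > ".join(rec["taxonomy"][-3:]),
        }
        for rec in records
    ]


summaries = summarize(SAMPLE)
for row in summaries:
    print(row["id"], row["length"], row["features"], row["lineage"])
```

To try this on real data, one option is to redirect the complete output of bio convert ncov --json to a file (without head) and pass the file contents to summarize in place of SAMPLE.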

References

The following references may be consulted to understand how data should be represented in GenBank and GFF formats:

INSDC feature descriptions:

NCBI GenBank format:

NCBI GFF format: