Convert to GFF

Building a nicer gene model

bio creates more meaningful and nicer GFF visualizations:

# Get chromosome 2L for Drosophila melanogaster (fruit-fly)
bio fetch NT_033779  --rename fly 

convert it to gff:

bio convert fly --gff > annotations.gff

GFF created with bio

Here is a region from the GFF file created with the code above as visualized in IGV:

  • Exons will have transcript_id and gene_id attributes set.
  • CDS features have protein_id and gene_id attributes set.

Get a dataset

Get SARS-COV-2 data and rename it to ncov:

bio fetch NC_045512 --rename ncov

Convert all features to GFF:

bio convert ncov --gff | head -5
##gff-version 3
NC_045512.2 .   region  1   29903   .   +   .   ID=region-1;Name=Severe acute respiratory syndrome coronavirus 2;organism=Severe acute respiratory syndrome coronavirus 2;mol_type=genomic RNA;isolate=Wuhan-Hu-1;host=Homo sapiens;db_xref=taxon:2697049;country=China;collection_date=Dec-2019;color=#CECECE
NC_045512.2 .   five_prime_UTR  1   265 .   +   .   ID=five_prime_UTR-2;Name=five_prime_UTR;color=#cc0e74
NC_045512.2 .   gene    266 21555   .   +   .   ID=ORF1ab;Name=ORF1ab;gene=ORF1ab;locus_tag=GU280_gp01;db_xref=GeneID:43740578;color=#cb7a77
NC_045512.2 .   mRNA_region 266 21555   .   +   .   ID=YP_009724389.1;Name=YP_009724389.1;gene=ORF1ab;locus_tag=GU280_gp01;ribosomal_slippage=;note=pp1ab; translated by -1 ribosomal frameshift;codon_start=1;product=ORF1ab polyprotein;protein_id=YP_009724389.1;db_xref=GeneID:43740578;color=#7a77cb

Convert to GFF only the features with type CDS

bio convert ncov --gff --type transcript,exon,mRNA,CDS | head -5
##gff-version 3
NC_045512.2 .   CDS 266 13468   .   +   .   ID=CDS-1;Parent=YP_009724389.1;Name=YP_009724389.1;gene=ORF1ab;locus_tag=GU280_gp01;ribosomal_slippage=;note=pp1ab; translated by -1 ribosomal frameshift;codon_start=1;product=ORF1ab polyprotein;protein_id=YP_009724389.1;db_xref=GeneID:43740578;gene_id=GU280_gp01;transcript_id=YP_009724389.1
NC_045512.2 .   CDS 13468   21555   .   +   .   ID=CDS-2;Parent=YP_009724389.1;Name=YP_009724389.1;gene=ORF1ab;locus_tag=GU280_gp01;ribosomal_slippage=;note=pp1ab; translated by -1 ribosomal frameshift;codon_start=1;product=ORF1ab polyprotein;protein_id=YP_009724389.1;db_xref=GeneID:43740578;gene_id=GU280_gp01;transcript_id=YP_009724389.1
NC_045512.2 .   CDS 266 13483   .   +   .   ID=CDS-3;Parent=YP_009725295.1;Name=YP_009725295.1;gene=ORF1ab;locus_tag=GU280_gp01;note=pp1a;codon_start=1;product=ORF1a polyprotein;protein_id=YP_009725295.1;db_xref=GeneID:43740578;gene_id=GU280_gp01;transcript_id=YP_009725295.1
NC_045512.2 .   CDS 21563   25384   .   +   .   ID=CDS-4;Parent=YP_009724390.1;Name=YP_009724390.1;gene=S;locus_tag=GU280_gp02;gene_synonym=spike glycoprotein;note=structural protein; spike protein;codon_start=1;product=surface glycoprotein;protein_id=YP_009724390.1;db_xref=GeneID:43740568;gene_id=GU280_gp02;transcript_id=YP_009724390.1

Convert to GFF only the features tagged with gene S

bio convert ncov --gff --gene S | head -5
##gff-version 3
NC_045512.2 .   gene    21563   25384   .   +   .   ID=S;Name=S;gene=S;locus_tag=GU280_gp02;gene_synonym=spike glycoprotein;db_xref=GeneID:43740568;color=#cb7a77
NC_045512.2 .   mRNA_region 21563   25384   .   +   .   ID=YP_009724390.1;Name=YP_009724390.1;gene=S;locus_tag=GU280_gp02;gene_synonym=spike glycoprotein;note=structural protein; spike protein;codon_start=1;product=surface glycoprotein;protein_id=YP_009724390.1;db_xref=GeneID:43740568;color=#7a77cb
NC_045512.2 .   CDS 21563   25384   .   +   .   ID=CDS-4;Parent=YP_009724390.1;Name=YP_009724390.1;gene=S;locus_tag=GU280_gp02;gene_synonym=spike glycoprotein;note=structural protein; spike protein;codon_start=1;product=surface glycoprotein;protein_id=YP_009724390.1;db_xref=GeneID:43740568;gene_id=GU280_gp02;transcript_id=YP_009724390.1