Parsing GTF file using command-line

Question

I am extracting exon details from a GTF file using Unix command-line tools such as cut, awk, grep, or sed.

input file.gtf:

chrI    ce11_ws245Genes CDS 8378308 8378427 0.000000    -   0   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2"; 
chrI    ce11_ws245Genes exon    8377602 8378427 0.000000    -   .   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2"; 
chrI    ce11_ws245Genes CDS 8379137 8379239 0.000000    -   1   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2"; 
chrI    ce11_ws245Genes exon    8379137 8379239 0.000000    -   .   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2"; 
chrI    ce11_ws245Genes CDS 8379706 8379815 0.000000    -   0   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2"; 
chrI    ce11_ws245Genes exon    8379706 8379815 0.000000    -   .   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2"; 
chrI    ce11_ws245Genes CDS 8380330 8380445 0.000000    -   2   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2"; 
chrI    ce11_ws245Genes exon    8380330 8380445 0.000000    -   .   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2"; 
chrI    ce11_ws245Genes CDS 8388028 8388092 0.000000    -   1   gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2";

Desired output:

chrI 8377602 8378427 -  T19A6.1a.2
chrI 8379137 8379239 -  T19A6.1a.2
chrI 8379706 8379815 -  T19A6.1a.2
chrI 8380330 8380445 -  T19A6.1a.2

My successful attempts to solve the problem:

awk '/exon/ {print $1 " " $4 " " $5 " " $7 " " $10;}' file.gtf | awk '{sub(/gene_id/,"",$5)};1' | awk -F'"' '{print $1, $2}'

grep 'exon' file.gtf | cut -f1,4,5,7,9 | cut -d ';' -f1 | awk '{sub(/gene_id/,"",$5)};1' | awk -F'"' '{print $1, $2}'

Steps (for the grep/cut pipeline):

search for lines which contain the word 'exon'
cut the fields of interest 1,4,5,7,9
in field 9: cut using the delimiter ';'
remove 'gene_id'
remove the double quotations around the genes' names

oliv · Accepted Answer · 2018-05-23 07:15:50Z


Your code can be simplified to a single awk command:

awk '/exon/ {gsub("[\";]","", $10);print $1,$4,$5,$7,$10}' file.gtf

gsub removes every occurrence of " or ; from the 10th field, which, with awk's default whitespace splitting, is the quoted gene ID that follows gene_id.
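Note that /exon/ matches the word anywhere on the line, so a record whose gene name happens to contain "exon" would also be printed. A stricter variant could test the feature column itself. The following is only a sketch: it assumes the file is tab-delimited, as the GTF format specifies, and that every exon line carries a gene_id attribute. The function name gtf_exons is hypothetical.

```shell
# Hypothetical helper: print chrom, start, end, strand and gene_id
# for every record whose feature column (field 3) is exactly "exon".
# Assumes a tab-delimited GTF file, so the attributes are all in field 9.
gtf_exons() {
    awk -F'\t' '$3 == "exon" && match($9, /gene_id "[^"]+"/) {
        # The match covers: gene_id "NAME"
        # Skip the 9 leading characters (gene_id + space + quote)
        # and drop the closing quote to keep only NAME.
        print $1, $4, $5, $7, substr($9, RSTART + 9, RLENGTH - 10)
    }' "$1"
}
```

Usage: gtf_exons file.gtf

Because the gene ID is located with match() instead of being taken from a fixed position, this version does not depend on gene_id being the first attribute.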
