I am extracting exon details from a GTF file using Unix command-line tools such as cut, awk, grep, or sed.
Input (file.gtf):
chrI ce11_ws245Genes CDS 8378308 8378427 0.000000 - 0 gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2";
chrI ce11_ws245Genes exon 8377602 8378427 0.000000 - . gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2";
chrI ce11_ws245Genes CDS 8379137 8379239 0.000000 - 1 gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2";
chrI ce11_ws245Genes exon 8379137 8379239 0.000000 - . gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2";
chrI ce11_ws245Genes CDS 8379706 8379815 0.000000 - 0 gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2";
chrI ce11_ws245Genes exon 8379706 8379815 0.000000 - . gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2";
chrI ce11_ws245Genes CDS 8380330 8380445 0.000000 - 2 gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2";
chrI ce11_ws245Genes exon 8380330 8380445 0.000000 - . gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2";
chrI ce11_ws245Genes CDS 8388028 8388092 0.000000 - 1 gene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2";
Desired output:
chrI 8377602 8378427 - T19A6.1a.2
chrI 8379137 8379239 - T19A6.1a.2
chrI 8379706 8379815 - T19A6.1a.2
chrI 8380330 8380445 - T19A6.1a.2
My successful attempts to solve the problem:
awk '/exon/ {print $1 " " $4 " " $5 " " $7 " " $10;}' file.gtf | awk '{sub(/gene_id/,"",$5)};1' | awk -F'"' '{print $1, $2}'
grep 'exon' file.gtf | cut -f1,4,5,7,9 | cut -d ';' -f1 | awk '{sub(/gene_id/,"",$5)};1' | awk -F'"' '{print $1, $2}'
Steps (second command):
- keep only the lines that contain the word 'exon'
- cut the fields of interest: 1, 4, 5, 7 and 9
- within field 9, keep only the text before the first ';'
- remove the 'gene_id' label
- remove the double quotes around the gene name
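The steps above can probably be collapsed into a single awk call. This is only a sketch and rests on two assumptions: the file is tab-separated (which the use of cut -f implies), and gene_id is the first attribute in column 9, as in the sample. The function name extract_exons and the file sample.gtf are made up for illustration. Testing column 3 directly, rather than matching 'exon' anywhere on the line, also avoids picking up lines where the word appears only in the attributes.

```shell
# Hypothetical helper: print chromosome, start, end, strand and gene_id for exon rows.
# Assumes a tab-separated GTF whose 9th column starts with: gene_id "NAME";
extract_exons() {
  awk -F'\t' '
    $3 == "exon" {            # feature column must be exactly "exon"
      split($9, a, "\"")      # a[2] is the text inside the first pair of double quotes
      print $1, $4, $5, $7, a[2]
    }' "$1"
}

# Small made-up sample (one CDS row, one exon row) to try it on.
printf 'chrI\tce11_ws245Genes\tCDS\t8378308\t8378427\t0.000000\t-\t0\tgene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2";\n' > sample.gtf
printf 'chrI\tce11_ws245Genes\texon\t8377602\t8378427\t0.000000\t-\t.\tgene_id "T19A6.1a.2"; transcript_id "T19A6.1a.2";\n' >> sample.gtf

extract_exons sample.gtf
# prints: chrI 8377602 8378427 - T19A6.1a.2
```

On the real data this would be called as extract_exons file.gtf. If gene_id is not the first attribute in some file, the split on double quotes would return the wrong value, so that case would need a match on the gene_id key instead.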