Parsing FASTA headers with awk


I am very new to the Linux shell.

I have a fasta file from which I would like to extract a specific part from the header.

The fasta headers look like this:

lcl|CP000046.1_cds_AAW37389.1_1 [gene=dnaA] [locus_tag=SACOL0001] [protein=chromosomal replication initiator protein DnaA] [protein_id=AAW37389.1] [location=544…1905] [gbkey=CDS]

lcl|CP000046.1_cds_AAW37391.1_3 [locus_tag=SACOL0003] [protein=conserved hypothetical protein] [protein_id=AAW37391.1] [location=3697…3942] [gbkey=CDS]

I would like the output file to look like this:

CP000046.1_cds_AAW37391.1_3 SACOL0003

I have been trying to use awk, with various printing options, but I could not solve it (as you may have noticed the SACOL is not always the 3rd term of the header, which does not make my life easier).

Is there a way to print only what’s after locus_tag= ?

Thank you very much for your help.

We can extract that whole field with grep (the -o option prints only the part of the line that matches):

echo 'lcl|CP000046.1_cds_AAW37391.1_3 [locus_tag=SACOL0003] [protein=conserved hypothetical protein] [protein_id=AAW37391.1] [location=3697…3942] [gbkey=CDS]' | grep -o '\[locus_tag=.........\]'
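Note that the nine dots match exactly nine characters, so the pattern above only works when every locus tag has that length. A minimal sketch of a length-independent variant, assuming the tag itself never contains a ']' (the variable names header and tag are just for illustration):

```shell
# Sample header copied from the question.
header='lcl|CP000046.1_cds_AAW37391.1_3 [locus_tag=SACOL0003] [protein=conserved hypothetical protein] [protein_id=AAW37391.1] [location=3697…3942] [gbkey=CDS]'

# [^]]* matches any run of characters other than ']',
# so the match ends at the first closing bracket whatever the tag length.
tag=$(printf '%s\n' "$header" | grep -o '\[locus_tag=[^]]*\]')
echo "$tag"
```

This should still print the whole bracketed field, brackets included.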

To keep only the value after the = sign, you could store the match in a shell variable and trim it:

var=$(echo 'lcl|CP000046.1_cds_AAW37391.1_3 [locus_tag=SACOL0003] [protein=conserved hypothetical protein] [protein_id=AAW37391.1] [location=3697…3942] [gbkey=CDS]' | grep -o '\[locus_tag=.........\]')
echo "${var##*=}" | sed 's/]//'
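The sed call is not strictly needed here. As a sketch, assuming var already holds the bracketed field that grep printed, two parameter expansions can do the trimming in the shell itself (tag is a name I made up):

```shell
# What the grep command above prints for the sample header.
var='[locus_tag=SACOL0003]'

tag=${var##*=}   # remove everything up to and including the last '='
tag=${tag%\]}    # remove the trailing ']'
echo "$tag"
```

Both expansions are standard POSIX shell, so this should behave the same in bash, dash and similar shells.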

Is this what you want?

Thank you!
I will try to implement this in the awk print command.
Have a good weekend.

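Another way: replace every " [" separator with a single | and let awk split on it. Field 2 is then the sequence ID. Note that the locus_tag field ends up as $3 or $4 depending on whether [gene=...] is present, so this alone does not give you the tag: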

echo 'lcl|CP000046.1_cds_AAW37389.1_1 [gene=dnaA] [locus_tag=SACOL0001] [protein=chromosomal replication initiator protein DnaA] [protein_id=AAW37389.1] [location=544…1905] [gbkey=CDS]' | sed 's/ \[/|/g' | awk -F'|' '{print $2}'
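Since the question asked for awk, here is a rough sketch that prints the ID and the locus tag together, wherever the tag sits in the header. It uses only the standard awk functions match(), substr() and sub(). The function name extract_locus_tags is made up, and I am assuming the real file has the usual ">" in front of each header (the code also copes with headers that lack it):

```shell
extract_locus_tags() {
  awk '
    # Act only on lines that contain a [locus_tag=...] field;
    # sequence lines are skipped because they never match.
    match($0, /\[locus_tag=[^]]*\]/) {
      # "[locus_tag=" is 11 characters long; also leave out the closing "]".
      tag = substr($0, RSTART + 11, RLENGTH - 12)
      id = $1                  # first whitespace-separated word
      sub(/^>/, "", id)        # drop the FASTA ">" marker if present
      sub(/^lcl[|]/, "", id)   # drop the "lcl|" prefix
      print id, tag
    }
  ' "$@"
}

# Quick check on the second header from the question (">" added, "ATGC" is a dummy sequence line).
printf '%s\n' \
  '>lcl|CP000046.1_cds_AAW37391.1_3 [locus_tag=SACOL0003] [protein=conserved hypothetical protein] [protein_id=AAW37391.1] [location=3697…3942] [gbkey=CDS]' \
  'ATGC' | extract_locus_tags
```

On a real file it would be called along the lines of extract_locus_tags input.fasta > output.txt, where both file names are placeholders.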
