awk38
April 6, 2020, 2:30pm
#1
Hello,
I am very new to the Linux shell.
I have a fasta file from which I would like to extract a specific part from the header.
The fasta headers look like this:
lcl|CP000046.1_cds_AAW37389.1_1 [gene=dnaA] [locus_tag=SACOL0001] [protein=chromosomal replication initiator protein DnaA] [protein_id=AAW37389.1] [location=544…1905] [gbkey=CDS]
lcl|CP000046.1_cds_AAW37391.1_3 [locus_tag=SACOL0003] [protein=conserved hypothetical protein] [protein_id=AAW37391.1] [location=3697…3942] [gbkey=CDS]
I would like the output file to look like this:
CP000046.1_cds_AAW37391.1_3 SACOL0003
I have been trying to use awk with various print options, but I could not solve it. As you may have noticed, the SACOL locus tag is not always the third field of the header, which does not make my life easier.
Is there a way to print only what’s after locus_tag= ?
Thank you very much for your help.
tomboi
April 8, 2020, 4:43pm
#2
We can extract the whole bracketed field with grep -o, which prints only the part of the line that matches:
echo 'lcl|CP000046.1_cds_AAW37391.1_3 [locus_tag=SACOL0003] [protein=conserved hypothetical protein] [protein_id=AAW37391.1] [location=3697…3942] [gbkey=CDS]' | grep -o '\[locus_tag=[^]]*\]'
The [^]]* part matches everything up to the closing bracket, so the tag can be any length.
You could then store the match in a shell variable and strip everything up to the = sign and the trailing bracket:
var=$(echo 'lcl|CP000046.1_cds_AAW37391.1_3 [locus_tag=SACOL0003] [protein=conserved hypothetical protein] [protein_id=AAW37391.1] [location=3697…3942] [gbkey=CDS]' | grep -o '\[locus_tag=[^]]*\]')
echo "${var##*=}" | sed 's/]//'
Is this what you want?
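If you want the sequence ID and the locus tag on one line for every header in the file, a single sed substitution may be enough. This is only a sketch: extract_locus_tags is a made-up helper name, the sample headers are shortened versions of the two in the question, the sequence line is a placeholder, and I am assuming the real headers start with > as in a normal fasta file (the > is optional in the pattern).

```shell
# extract_locus_tags is a hypothetical helper name, not part of any tool.
# It prints "<sequence id> <locus tag>" for each header that carries a
# [locus_tag=...] field and skips every other line (e.g. sequence lines).
extract_locus_tags() {
    # \1 = text between "lcl|" and the first space, \2 = the locus tag value
    sed -n 's/^>\{0,1\}lcl|\([^ ]*\) .*\[locus_tag=\([^]]*\)\].*/\1 \2/p' "$@"
}

# Shortened sample headers based on the question; the sequence line is a placeholder.
extract_locus_tags <<'EOF'
>lcl|CP000046.1_cds_AAW37389.1_1 [gene=dnaA] [locus_tag=SACOL0001] [protein_id=AAW37389.1] [gbkey=CDS]
ATGTCGGAAAAAGAAATTTGG
>lcl|CP000046.1_cds_AAW37391.1_3 [locus_tag=SACOL0003] [protein_id=AAW37391.1] [gbkey=CDS]
EOF
```

On a real file you would pass the file name as an argument and redirect the result, for example extract_locus_tags input.fasta > output.txt (both file names are placeholders).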
awk38
April 16, 2020, 1:25pm
#3
Thank you!
I will try to implement this in the awk print command.
Have a good weekend.
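Another way: turn every " [" separator into a single | and split on it with awk. Note that field 2 is the sequence ID; the locus tag ends up in field 4 for this header but in field 3 for the header without a gene entry.
echo "lcl|CP000046.1_cds_AAW37389.1_1 [gene=dnaA] [locus_tag=SACOL0001] [protein=chromosomal replication initiator protein DnaA] [protein_id=AAW37389.1] [location=544…1905] [gbkey=CDS]" | sed 's/ \[/|/g' | awk -F'|' '{print $2}'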
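Since the locus tag does not sit at a fixed field position, an awk-only version can search for it by pattern instead of by field number. The sketch below uses awk's match() function, which sets RSTART and RLENGTH to the position and length of the match. locus_tags_awk is a made-up name, the sample headers are shortened versions of the ones in the question, and the leading > is treated as optional.

```shell
# locus_tags_awk is a hypothetical helper name.
# For each line containing a [locus_tag=...] field it prints the
# sequence ID (first field without ">lcl|") followed by the tag value.
locus_tags_awk() {
    awk '
        match($0, /\[locus_tag=[^]]*\]/) {
            # "[locus_tag=" is 11 characters long; the closing "]" is 1 more
            tag = substr($0, RSTART + 11, RLENGTH - 12)
            id = $1
            sub(/^>?lcl[|]/, "", id)   # drop the optional ">" and the "lcl|" prefix
            print id, tag
        }
    ' "$@"
}

# Shortened sample headers based on the question.
locus_tags_awk <<'EOF'
>lcl|CP000046.1_cds_AAW37389.1_1 [gene=dnaA] [locus_tag=SACOL0001] [gbkey=CDS]
>lcl|CP000046.1_cds_AAW37391.1_3 [locus_tag=SACOL0003] [gbkey=CDS]
EOF
```

Lines without a locus_tag field produce no output, so sequence lines in the fasta file are skipped automatically.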