I am trying to compare columns of two CSV files and save all matched lines to a new CSV file with a header. Below are the example files
file1:
ID,type,gene,startpos,endpos,product
C20775336,marker,gene1,1895,2166,ID=gene1;Name=maker-C20
C20775337,marker,gene2,895,1166,ID=mRNA1;Parent=gene1;N
C20775339,marker,gene3,1895,1962,Parent=mRNA1
C20775335,marker,gene4,2795,2962,ID=CDS1;Parent=mRNA1
C20775338,marker,gene5,895,1166,ID=mRNA1;Parent=gene1;N
file2:
Id,start,End
C2002,895,1166
C2003,1895,2166
C2004,2795,2962
here I am trying to compare columns 4th and 5th of file1 with columns 2nd and 3rd of file2 and save it to a new CSV file if matched.
using this command awk -F',' 'NR==FNR{A[$2,$3]=$0;next} A[$4,$5]' file2 file1 I am getting this output:
C20775336,marker,gene1,1895,2166,ID=gene1;Name=maker-C20
C20775337,marker,gene2,895,1166,ID=mRNA1;Parent=gene1;N
C20775335,marker,gene4,2795,2962,ID=CDS1;Parent=mRNA1
C20775338,marker,gene5,895,1166,ID=mRNA1;Parent=gene1;N
But I want the header of file1 as well which is achieved if the header name is identical in both files for example if startpos and endpos of file1 changed to start and end or vice-versa.
is there any way without having an identical header name, it can be done. So my expected output file will be like:
output:
ID,type,gene,startpos,Endpos,product
C20775336,marker,gene1,1895,2166,ID=gene1;Name=maker-C20
C20775337,marker,gene2,895,1166,ID=mRNA1;Parent=gene1;N
C20775335,marker,gene4,2795,2962,ID=CDS1;Parent=mRNA1
C20775338,marker,gene5,895,1166,ID=mRNA1;Parent=gene1;N
CodePudding user response:
You may use another condition FNR == 1:
awk -F, 'NR==FNR {A[$2,$3]=$0; next} FNR == 1 || ($4,$5) in A' f2 f1
ID,type,gene,startpos,endpos,product
C20775336,marker,gene1,1895,2166,ID=gene1;Name=maker-C20
C20775337,marker,gene2,895,1166,ID=mRNA1;Parent=gene1;N
C20775335,marker,gene4,2795,2962,ID=CDS1;Parent=mRNA1
C20775338,marker,gene5,895,1166,ID=mRNA1;Parent=gene1;N
