I have a fasta file
my_file.fasta
>NODE_1_length_531158 [gcode=11] [organism=Genus species] [strain=strain]
GACCATGACACGGTTGAACCAGATCAGAGAGCAGTTACAAGCCTTCCCAGAACTGAAACA
>NODE_2_length_200_ [gcode=11] [organism=Genus species] [strain=strain]
GGTGCCGACTACCGGAATCGAACTGGTGACCTACTGATTACAAGTCAGTTGCTCTACCTA
>NODE_3_length_200_ [gcode=11] [organism=Genus species] [strain=strain]
GTTGCGGGGGCCGGATTTGAACCGACGACCTTCGGGTTATGAGCCCGACGAGCTACCAAG
I just want to add an ID number at the beginning of each sequence name (the lines that start with >), so I tried:
awk '/^>/ {$1= ">ID_" ++i $1}1' my_file.fasta > Outfile.fasta
but I get
>ID_1>NODE_1_length_531158 [gcode=11] [organism=Genus species] [strain=strain]
GACCATGACACGGTTGAACCAGATCAGAGAGCAGTTACAAGCCTTCCCAGAACTGAAACA
>ID_2>NODE_2_length_200_ [gcode=11] [organism=Genus species] [strain=strain]
GGTGCCGACTACCGGAATCGAACTGGTGACCTACTGATTACAAGTCAGTTGCTCTACCTA
>ID_3>NODE_3_length_200_ [gcode=11] [organism=Genus species] [strain=strain]
GTTGCGGGGGCCGGATTTGAACCGACGACCTTCGGGTTATGAGCCCGACGAGCTACCAAG
and I want something similar, but without the second >. That is, each header should look like this (>ID_Number, a space, then the old name):
>ID_1 NODE_1_length_531158 [gcode=11] [organism=Genus species] [strain=strain]
GACCATGACACGGTTGAACCAGATCAGAGAGCAGTTACAAGCCTTCCCAGAACTGAAACA
>ID_2 NODE_2_length_200_ [gcode=11] [organism=Genus species] [strain=strain]
GGTGCCGACTACCGGAATCGAACTGGTGACCTACTGATTACAAGTCAGTTGCTCTACCTA
>ID_3 NODE_3_length_200_ [gcode=11] [organism=Genus species] [strain=strain]
GTTGCGGGGGCCGGATTTGAACCGACGACCTTCGGGTTATGAGCCCGACGAGCTACCAAG
Thanks
CodePudding user response:
With awk:
awk -F '>' '/^>/{$0=$1 ">ID_" ++c " " $2} {print}' my_file.fasta
Output:
>ID_1 NODE_1_length_531158 [gcode=11] [organism=Genus species] [strain=strain]
GACCATGACACGGTTGAACCAGATCAGAGAGCAGTTACAAGCCTTCCCAGAACTGAAACA
>ID_2 NODE_2_length_200_ [gcode=11] [organism=Genus species] [strain=strain]
GGTGCCGACTACCGGAATCGAACTGGTGACCTACTGATTACAAGTCAGTTGCTCTACCTA
>ID_3 NODE_3_length_200_ [gcode=11] [organism=Genus species] [strain=strain]
GTTGCGGGGGCCGGATTTGAACCGACGACCTTCGGGTTATGAGCCCGACGAGCTACCAAG
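One thing to watch with -F '>': every > on the line acts as a separator, so a header that happens to contain a second > further along is cut off there, because only $2 is kept. A small sketch of that, assuming a made-up header (the [note=a>b] tag is invented for the demo, and add_ids_split is only a hypothetical wrapper name):

```shell
# Hypothetical wrapper around the -F '>' command (name is not from the thread)
add_ids_split() {
  awk -F '>' '/^>/{$0=$1 ">ID_" ++c " " $2} {print}' "$@"
}

# Invented header that contains a second ">" inside a tag
printf '>NODE_1 [note=a>b]\nACGT\n' | add_ids_split
# prints:
# >ID_1 NODE_1 [note=a
# ACGT
```

For headers like the ones in the question, which have only the leading >, this does not matter.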
CodePudding user response:
A couple small changes to OP's current awk code:
$ awk '/^>/{$1= ">ID_" ++i " " substr($1,2)}1' my_file.fasta
>ID_1 NODE_1_length_531158 [gcode=11] [organism=Genus species] [strain=strain]
GACCATGACACGGTTGAACCAGATCAGAGAGCAGTTACAAGCCTTCCCAGAACTGAAACA
>ID_2 NODE_2_length_200_ [gcode=11] [organism=Genus species] [strain=strain]
GGTGCCGACTACCGGAATCGAACTGGTGACCTACTGATTACAAGTCAGTTGCTCTACCTA
>ID_3 NODE_3_length_200_ [gcode=11] [organism=Genus species] [strain=strain]
GTTGCGGGGGCCGGATTTGAACCGACGACCTTCGGGTTATGAGCCCGACGAGCTACCAAG
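A side effect worth knowing about: assigning to $1 makes awk rebuild the whole line from its fields joined by OFS (a single space by default), so runs of spaces or tabs in a header are squeezed to one space. That is harmless for the sample data, but here is a rough sketch of the effect, assuming an invented header with three spaces in it (add_ids_field is just a hypothetical wrapper name):

```shell
# Hypothetical wrapper around the field-assignment command (name is made up)
add_ids_field() {
  awk '/^>/{$1 = ">ID_" ++i " " substr($1,2)}1' "$@"
}

# Invented header with three spaces between the name and the first tag
printf '>NODE_1   [gcode=11]\nACGT\n' | add_ids_field
# prints:
# >ID_1 NODE_1 [gcode=11]
# ACGT
```

Sequence lines are not touched, since the assignment only runs on lines matching /^>/.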
CodePudding user response:
$ awk 'sub(/^>/,"&ID_"(cnt+1)" "){cnt++} 1' my_file.fasta
>ID_1 NODE_1_length_531158 [gcode=11] [organism=Genus species] [strain=strain]
GACCATGACACGGTTGAACCAGATCAGAGAGCAGTTACAAGCCTTCCCAGAACTGAAACA
>ID_2 NODE_2_length_200_ [gcode=11] [organism=Genus species] [strain=strain]
GGTGCCGACTACCGGAATCGAACTGGTGACCTACTGATTACAAGTCAGTTGCTCTACCTA
>ID_3 NODE_3_length_200_ [gcode=11] [organism=Genus species] [strain=strain]
GTTGCGGGGGCCGGATTTGAACCGACGACCTTCGGGTTATGAGCCCGACGAGCTACCAAG
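If the one-liner is hard to read, the same idea can be spelled out a bit more. This is only a sketch of an equivalent form, not a different method: it matches header lines first and then lets sub() swap the leading > for >ID_<n> plus a space. The function name add_ids and the counter name n are made up for the example.

```shell
# Hypothetical helper (name is not from the thread); reads files or stdin
add_ids() {
  awk '
    # Header lines only: replace the leading ">" with ">ID_<n> "
    /^>/ { sub(/^>/, ">ID_" ++n " ") }
    # Print every line, changed or not
    1
  ' "$@"
}

printf '>NODE_1 [gcode=11]\nACGT\n>NODE_2 [gcode=11]\nTTGA\n' | add_ids
# prints:
# >ID_1 NODE_1 [gcode=11]
# ACGT
# >ID_2 NODE_2 [gcode=11]
# TTGA
```

Because sub() edits the line in place rather than assigning to a field, the rest of the header keeps its original spacing.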
CodePudding user response:
I would use GNU AWK for this task in the following way. Let file.txt content be
>NODE_1_length_531158 [gcode=11] [organism=Genus species] [strain=strain]
GACCATGACACGGTTGAACCAGATCAGAGAGCAGTTACAAGCCTTCCCAGAACTGAAACA
>NODE_2_length_200_ [gcode=11] [organism=Genus species] [strain=strain]
GGTGCCGACTACCGGAATCGAACTGGTGACCTACTGATTACAAGTCAGTTGCTCTACCTA
>NODE_3_length_200_ [gcode=11] [organism=Genus species] [strain=strain]
GTTGCGGGGGCCGGATTTGAACCGACGACCTTCGGGTTATGAGCCCGACGAGCTACCAAG
then
awk 'BEGIN{FS="^>";OFS=" "}NF==2{$1=">ID_" ++i}{print}' file.txt
output
>ID_1 NODE_1_length_531158 [gcode=11] [organism=Genus species] [strain=strain]
GACCATGACACGGTTGAACCAGATCAGAGAGCAGTTACAAGCCTTCCCAGAACTGAAACA
>ID_2 NODE_2_length_200_ [gcode=11] [organism=Genus species] [strain=strain]
GGTGCCGACTACCGGAATCGAACTGGTGACCTACTGATTACAAGTCAGTTGCTCTACCTA
>ID_3 NODE_3_length_200_ [gcode=11] [organism=Genus species] [strain=strain]
GTTGCGGGGGCCGGATTTGAACCGACGACCTTCGGGTTATGAGCCCGACGAGCTACCAAG
Explanation: I tell GNU AWK to treat > at the beginning of the line (^) as the field separator (FS) and a space as the output field separator (OFS). Then, for each line where the number of fields equals 2 (that is, the line starts with >), I set the first field (the empty string before >) to >ID_ concatenated with the next value of the counter i. Every line is then printed.
(tested in GNU Awk 5.0.1)
