Extract text from file in Linux: specific line; between 2 different patterns

I have a bunch of text files, all with the same structure, and I need to extract a specific piece in a specific line.

I can easily extract the line with awk:

awk 'NR==23' blast_out.txt

CP046310.1 Lactobacillus jensenii strain FDAARGOS_749 chromosome,...  787     0.0

But I don't want the whole line, rather just the part between the first space on the left (after CP046310.1) and the double space on the right (before 787). The final output should be:

Lactobacillus jensenii strain FDAARGOS_749 chromosome,...

I tried several combinations of awk and grep but cannot find the correct one to extract this specific pattern.

CodePudding user response：

1st solution: With your shown samples, please try the following awk code. In short, it empties the first field and the last two fields, strips the leading and trailing spaces that this leaves behind, and then prints the line.

awk '{$1=$NF=$(NF-1)="";gsub(/^ +| +$/,"")} 1' Input_file

OR, to run it on the 23rd line only, change it to:

awk 'FNR==23{$1=$NF=$(NF-1)="";gsub(/^ +| +$/,"");print;exit}' Input_file
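To see why the trimming step is needed, here is a small sketch that runs the sample line from the question through both stages. The variable names (line, blanked, trimmed) and the square brackets around the output are only there for illustration, so the stray spaces become visible.

```shell
# Sample line copied from the question.
line='CP046310.1 Lactobacillus jensenii strain FDAARGOS_749 chromosome,...  787     0.0'

# Stage 1: empty the first field and the last two fields.
# Assigning to a field makes awk rebuild $0 with single-space OFS,
# so one leading and two trailing spaces are left over.
blanked=$(printf '%s\n' "$line" | awk '{$1=$NF=$(NF-1)=""; print "[" $0 "]"}')

# Stage 2: same as above, then trim the leftover spaces at both ends.
trimmed=$(printf '%s\n' "$line" | awk '{$1=$NF=$(NF-1)=""; gsub(/^ +| +$/,""); print "[" $0 "]"}')

printf '%s\n%s\n' "$blanked" "$trimmed"
```

The first printed line still has spaces inside the brackets; the second one does not.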

2nd solution: Loop over the fields and print only the ones that are required.

awk '{for(i=2;i<(NF-1);i++){printf("%s%s",$i,i==(NF-2)?ORS:OFS)}}' Input_file

OR, on the 23rd line only, try the following:

awk 'FNR==23{for(i=2;i<(NF-1);i++){printf("%s%s",$i,i==(NF-2)?ORS:OFS)};exit}' Input_file

CodePudding user response：

Using sed you can use this solution:

sed -En '23s/^[^ ]+ |  .*$//gp' file

Lactobacillus jensenii strain FDAARGOS_749 chromosome,...

Or using awk:

awk 'NR == 23 {gsub(/^[^ ]+ |  .*$/, ""); print}' file
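Since the question mentions a bunch of files with the same structure, the same two substitutions could be wrapped in a small shell function and run in a loop. This is only a sketch: extract_desc is a made-up helper name, the ./*.txt glob is an assumption about how the files are named, and the line number defaults to 23 as in the question.

```shell
# extract_desc is a hypothetical helper, not part of any tool.
# Usage: extract_desc FILE [LINE]   (LINE defaults to 23)
extract_desc() {
    awk -v n="${2:-23}" 'FNR == n {
        sub(/^[^ ]+ /, "")   # drop the leading ID and the single space after it
        sub(/  .*$/, "")     # drop everything from the first double space onward
        print
        exit                 # stop reading once the wanted line is handled
    }' "$1"
}

# Print "file<TAB>description" for every .txt file in the current directory.
for f in ./*.txt; do
    [ -f "$f" ] || continue   # skip the literal pattern when nothing matches
    printf '%s\t%s\n' "$f" "$(extract_desc "$f")"
done
```

Like the one-liners above, this cuts at the first double space, so a description that itself contains two consecutive spaces would be truncated.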

CodePudding user response：

If I get what you ask, you want to extract the fields from the second (included) to the second-last (excluded). I would go with:

awk 'FNR==23 {for (i = 2; i < NF - 2; i++) { printf("%s ", $i) }; printf("%s\n", $i); exit }' file_path

An example with the line you posted:

$ echo "CP046310.1 Lactobacillus jensenii strain FDAARGOS_749 chromosome,...  787     0.0" | awk '{for (i = 2; i < NF - 2; i++) { printf("%s ", $i) }; printf("%s\n", $i); exit }'
Lactobacillus jensenii strain FDAARGOS_749 chromosome,...

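I assume that chromosome,... does not contain spaces and that the fields you want to extract are separated by single spaces only. If the second assumption does not hold, the extra spaces are lost: awk's default field splitting treats any run of blanks as one separator, and the loop joins the fields back together with a single space.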
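A quick way to check the effect on spacing is to feed the loop a variant of the sample line with extra spaces inside the description. The modified line below is made up for the test (three spaces before strain), and the variable names are only illustrative.

```shell
# Hypothetical variant of the sample line: three spaces before "strain".
line='CP046310.1 Lactobacillus jensenii   strain FDAARGOS_749 chromosome,...  787     0.0'

# Field-by-field printing: awk splits on runs of blanks, so the three
# spaces count as one separator and come out as a single space.
by_fields=$(printf '%s\n' "$line" |
    awk '{for (i = 2; i < NF - 2; i++) printf("%s ", $i); printf("%s\n", $i)}')

printf '%s\n' "$by_fields"
```

So if the original spacing inside the description matters, a field loop is not the right tool and a substitution on the whole line would be a better fit.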