Need to copy a series of characters after a pattern of characters in a string

Time:01-23

I have a file that looks like this:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/240/185/GCA_000240185.2_ASM24018v2
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/364/295/GCA_006364295.1_ASM636429v1

I need to copy everything from "/GCA_" onward in each line and append it to the end of that same line, so that the file looks like this:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/240/185/GCA_000240185.2_ASM24018v2/GCA_000240185.2_ASM24018v2
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/364/295/GCA_006364295.1_ASM636429v1/GCA_006364295.1_ASM636429v1

Then I need to append the string "_protein.faa.gz" to each line, so that the file looks like this:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/240/185/GCA_000240185.2_ASM24018v2/GCA_000240185.2_ASM24018v2_protein.faa.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/364/295/GCA_006364295.1_ASM636429v1/GCA_006364295.1_ASM636429v1_protein.faa.gz

CodePudding user response:

You can do this with awk inside a shell loop. Assume your file contains the following:

cat tempfile.txt 
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/240/185/GCA_000240185.2_ASM24018v2
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/364/295/GCA_006364295.1_ASM636429v1

Run the snippet below from the directory that contains the file. It takes the last path component of each line, appends it to the line together with the required string, and writes the result to a new file. Make sure newtempfie.txt does not already exist when you run it, because the snippet appends to that file on every run.

while IFS= read -r i; do append=$(echo "$i" | awk -F"/" '{print $NF}'); echo "$i/${append}_protein.faa.gz" >> newtempfie.txt; done < tempfile.txt

output

cat newtempfie.txt 
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/240/185/GCA_000240185.2_ASM24018v2/GCA_000240185.2_ASM24018v2_protein.faa.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/364/295/GCA_006364295.1_ASM636429v1/GCA_006364295.1_ASM636429v1_protein.faa.gz
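The same loop can also be sketched without awk, assuming a POSIX shell: the parameter expansion ${url##*/} strips everything up to and including the last slash. The file names urls.txt and urls_protein.txt are placeholders chosen for this illustration, not names from the question.

```shell
# Sample input (placeholder file name; same two lines as in the question)
cat > urls.txt <<'EOF'
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/240/185/GCA_000240185.2_ASM24018v2
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/364/295/GCA_006364295.1_ASM636429v1
EOF

# Read line by line; ${url##*/} removes the longest prefix ending in "/",
# leaving only the last path component.
while IFS= read -r url; do
    printf '%s/%s_protein.faa.gz\n' "$url" "${url##*/}"
done < urls.txt > urls_protein.txt   # ">" overwrites, so reruns do not duplicate lines

cat urls_protein.txt
```

Because the output is redirected once with ">", this variant can be rerun safely, at the cost of replacing any existing urls_protein.txt.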

There might be other good solutions using sed.

CodePudding user response:

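Using sed. In the replacement, & stands for the text matched by GCA_.* (the last path component). Because -i.bak edits the file in place and keeps a backup as input_file.bak, the lines shown below are the resulting content of input_file, not terminal output: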

$ sed -i.bak 's/GCA_.*/&\/&_protein.faa.gz/' input_file
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/240/185/GCA_000240185.2_ASM24018v2/GCA_000240185.2_ASM24018v2_protein.faa.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/364/295/GCA_006364295.1_ASM636429v1/GCA_006364295.1_ASM636429v1_protein.faa.gz
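To preview the substitution before touching any file, one option is to pipe a single line through the same expression without -i. This is only a sketch: the variable names line and result are made up for the example, and "|" is used as the delimiter so the slash needs no escaping.

```shell
# One sample line from the question
line='ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/240/185/GCA_000240185.2_ASM24018v2'

# "|" as the s-command delimiter avoids escaping "/";
# "&" expands to whatever GCA_.* matched (here, the last path component).
result=$(printf '%s\n' "$line" | sed 's|GCA_.*|&/&_protein.faa.gz|')

printf '%s\n' "$result"
```

The match starts at the first "GCA_" in the line. The directory named "GCA/" earlier in the path does not match because it is followed by a slash, not an underscore.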

CodePudding user response:

with awk

awk -F'/' '{print $0"/"$NF"_protein.faa.gz"}'

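Logic: use / as the field separator, then print the whole record, a slash, and the last field followed by the string "_protein.faa.gz". The command reads standard input as written, so pass the file name as the last argument or pipe the file in.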
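A self-contained run of that awk command might look like the following. The name input_file is borrowed from the sed answer and output_file is a hypothetical name for the result; neither comes from the awk answer itself.

```shell
# Sample input with the two lines from the question
cat > input_file <<'EOF'
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/240/185/GCA_000240185.2_ASM24018v2
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/364/295/GCA_006364295.1_ASM636429v1
EOF

# $0 is the whole line, $NF the last "/"-separated field.
# Placing strings side by side concatenates them in awk.
awk -F'/' '{print $0 "/" $NF "_protein.faa.gz"}' input_file > output_file

cat output_file
```

Note that a line ending in a trailing slash would give an empty last field, so this sketch assumes the URLs have no trailing slash, as in the question.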
