I have a file that looks like this:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/240/185/GCA_000240185.2_ASM24018v2
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/364/295/GCA_006364295.1_ASM636429v1
and I need to take everything from the last "/GCA_" onward on each line and append it to that same line, so the file looks like this:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/240/185/GCA_000240185.2_ASM24018v2/GCA_000240185.2_ASM24018v2
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/364/295/GCA_006364295.1_ASM636429v1/GCA_006364295.1_ASM636429v1
Then I need to append the string "_protein.faa.gz" to each line, so the file looks like this:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/240/185/GCA_000240185.2_ASM24018v2/GCA_000240185.2_ASM24018v2_protein.faa.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/364/295/GCA_006364295.1_ASM636429v1/GCA_006364295.1_ASM636429v1_protein.faa.gz
CodePudding user response:
You can do something like this using awk and a for loop, assuming your file contains the following:
cat tempfile.txt
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/240/185/GCA_000240185.2_ASM24018v2
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/364/295/GCA_006364295.1_ASM636429v1
Run this snippet from the directory where the file is stored. It takes the last path component of each line, appends it to the line together with the required suffix, and writes the result to a new file. Make sure newtempfie.txt does not already exist when you run it, because the snippet appends to that file on every run.
for i in $(cat tempfile.txt); do append=$(echo "$i" | awk -F"/" '{print $NF}'); echo "$i/${append}_protein.faa.gz" >> newtempfie.txt; done
Output:
cat newtempfie.txt
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/240/185/GCA_000240185.2_ASM24018v2/GCA_000240185.2_ASM24018v2_protein.faa.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/364/295/GCA_006364295.1_ASM636429v1/GCA_006364295.1_ASM636429v1_protein.faa.gz
There might be other good solutions using sed.
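As a variation on the loop above, here is a minimal sketch that does the same job with shell parameter expansion instead of calling awk once per line. It recreates the two sample lines so it runs on its own; the variable names url and name are only illustrative, and it assumes no line ends with a trailing slash.

```shell
# Recreate the sample input from the question so the sketch is self-contained.
printf '%s\n' \
  'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/240/185/GCA_000240185.2_ASM24018v2' \
  'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/364/295/GCA_006364295.1_ASM636429v1' \
  > tempfile.txt

# Read the file line by line; ${url##*/} strips everything up to the last "/".
while IFS= read -r url; do
  [ -n "$url" ] || continue          # skip empty lines
  name=${url##*/}                    # last path component
  printf '%s/%s_protein.faa.gz\n' "$url" "$name"
done < tempfile.txt > newtempfie.txt  # ">" overwrites, so reruns do not pile up

cat newtempfie.txt
```

Because the redirection uses ">" rather than ">>", the output file is rewritten on each run, so there is no need to delete it first.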
CodePudding user response:
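Using sed. With -i.bak the file is edited in place and the original is kept as input_file.bak, so nothing is printed; the result ends up in input_file:
$ sed -i.bak 's/GCA_.*/&\/&_protein.faa.gz/' input_file
$ cat input_file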
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/240/185/GCA_000240185.2_ASM24018v2/GCA_000240185.2_ASM24018v2_protein.faa.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/364/295/GCA_006364295.1_ASM636429v1/GCA_006364295.1_ASM636429v1_protein.faa.gz
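If you would rather not modify the input, a possible variant is to let sed write to a separate file. The sketch below also matches the last path component with [^/]*$ instead of the literal GCA_ prefix, which is an assumption about what you want rather than part of the answer above; output_file is a made-up name, and the sample input is recreated so the snippet runs on its own.

```shell
# Recreate the sample input from the question.
printf '%s\n' \
  'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/240/185/GCA_000240185.2_ASM24018v2' \
  'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/364/295/GCA_006364295.1_ASM636429v1' \
  > input_file

# [^/]*$ matches everything after the last "/"; "&" repeats the matched text.
# Using "|" as the delimiter avoids escaping the "/" in the replacement.
sed 's|[^/]*$|&/&_protein.faa.gz|' input_file > output_file

cat output_file
```

Note that an empty line would still be rewritten (to "/_protein.faa.gz"), so this assumes the input has no blank lines.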
CodePudding user response:
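With awk:
awk -F'/' '{print $0"/"$NF"_protein.faa.gz"}'
Logic: use / as the field separator, then print the whole record ($0), a slash, and the last field ($NF) followed by the "_protein.faa.gz" string. The command reads standard input as written, so pass the file name as the last argument or pipe the file into it.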
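To make the field logic concrete, this small sketch feeds one sample line to awk and prints the field count and the last field before running the actual command. The line variable is only there for illustration, and the NF guard in the second command (which skips empty lines) is an addition, not part of the one-liner above.

```shell
# One sample line from the question.
line='ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/240/185/GCA_000240185.2_ASM24018v2'

# Show how many fields the "/" separator produces and what the last one is.
printf '%s\n' "$line" | awk -F'/' '{print NF, $NF}'
# prints: 10 GCA_000240185.2_ASM24018v2

# Same print statement as above, with an NF guard so the empty second line is skipped.
printf '%s\n' "$line" '' | awk -F'/' 'NF {print $0 "/" $NF "_protein.faa.gz"}'
```

The empty second field comes from the "//" after "ftp:", which is why the count is 10 rather than 9.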
