I have two lists, each containing the absolute paths of all files under a working directory.
I generated each list using find "$(pwd)" -type f
List 1:
/home/ec2-user/eclipsebio_toolkit/scripts/get_gene_counts_for_miRNA_specific_chimeric.py
/home/ec2-user/eclipsebio_toolkit/scripts/get_gene_info_sample_comparisons.py
/home/ec2-user/eclipsebio_toolkit/scripts/get_miRNA_counts_for_gene_specific_chimeric.py
/home/ec2-user/eclipsebio_toolkit/scripts/get_mrna_lengths.py
/home/ec2-user/eclipsebio_toolkit/scripts/get_nonnegative_peaks.py
/home/ec2-user/eclipsebio_toolkit/scripts/get_peak_gene_ids_chimeric.py
/home/ec2-user/eclipsebio_toolkit/scripts/get_peak_gene_ids_w_output.py
/home/ec2-user/eclipsebio_toolkit/scripts/get_peaks_not_sig.py
/home/ec2-user/eclipsebio_toolkit/scripts/get_peaks_overlapping_chimeric_reads.py
/home/ec2-user/eclipsebio_toolkit/scripts/create_ribo_html_report.py
List 2:
/home/ec2-user/snakemake_eclip/scripts/count_reads_broadfeatures_frombamfi_SRmap.pl
/home/ec2-user/snakemake_eclip/scripts/create_html_report.py
/home/ec2-user/snakemake_eclip/scripts/create_idr_html_report.py
/home/ec2-user/snakemake_eclip/scripts/create_mapped_read_num.py
/home/ec2-user/snakemake_eclip/scripts/create_metagene_plot_from_saturations.py
/home/ec2-user/snakemake_eclip/scripts/create_metagene_plot_from_saturations_peaks.py
/home/ec2-user/snakemake_eclip/scripts/create_metagene_plot_from_saturations_reads.py
/home/ec2-user/snakemake_eclip/scripts/create_peak_norm_manifests.py
/home/ec2-user/snakemake_eclip/scripts/create_pureclip_html_report.py
/home/ec2-user/snakemake_eclip/scripts/create_ribo_html_report.py
I would like to find files that appear in both lists (same file name, different directory) and then delete only the list 1 copies of those duplicates (rm from disk).
I have tried using awk 'NR == FNR{ a[$0] = 1;next } !a[$0]' list1 list2 to pick out those items, but this compares whole lines, so the differing absolute paths keep any file names from matching.
CodePudding user response:
Your awk script is almost right. Split each line on / and compare only the last field, which is the file name:
awk -F / 'NR == FNR{ a[$NF] = 1;next } a[$NF]' file2 file1
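As a quick sanity check, the command can be run against small copies of the two lists. This is only a sketch: it uses two paths from each list in the question, and the temp directory and the dupes variable are made up for illustration.

```shell
# Build two small sample lists in a temp directory (paths taken from the question).
tmp=$(mktemp -d)
printf '%s\n' \
  /home/ec2-user/eclipsebio_toolkit/scripts/get_mrna_lengths.py \
  /home/ec2-user/eclipsebio_toolkit/scripts/create_ribo_html_report.py > "$tmp/list1"
printf '%s\n' \
  /home/ec2-user/snakemake_eclip/scripts/create_html_report.py \
  /home/ec2-user/snakemake_eclip/scripts/create_ribo_html_report.py > "$tmp/list2"

# First pass (list2): remember each file name.
# Second pass (list1): print the paths whose file name was seen in list2.
dupes=$(awk -F / 'NR == FNR { a[$NF] = 1; next } a[$NF]' "$tmp/list2" "$tmp/list1")
echo "$dupes"

rm -r "$tmp"
```

Only the list 1 path of create_ribo_html_report.py is printed, because list1 is the second file argument and so the only one whose lines are printed.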
CodePudding user response:
awk -F '/' '{print $NF}' list1 list2 \
| sort \
| uniq -d \
| xargs -I {} echo rm -v /home/ec2-user/eclipsebio_toolkit/scripts/{}
Output:
rm -v /home/ec2-user/eclipsebio_toolkit/scripts/create_ribo_html_report.py
If the output looks okay, remove echo.
CodePudding user response:
awk -F'/' 'NR==FNR{ a[$NF]; next } $NF in a' file2 file1 | xargs rm
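One caveat: by default xargs splits its input on whitespace, so a path containing a space would be broken into several arguments. A minimal sketch of a variant that reads one path per line instead, run in a throwaway temp directory (the dir1/dir2 layout and the file names are hypothetical, chosen only to include spaces):

```shell
# Throwaway directories standing in for the two script folders (hypothetical names).
tmp=$(mktemp -d)
mkdir "$tmp/dir1" "$tmp/dir2"
touch "$tmp/dir1/only in one.py" "$tmp/dir1/shared report.py" "$tmp/dir2/shared report.py"
find "$tmp/dir1" -type f > "$tmp/list1"
find "$tmp/dir2" -type f > "$tmp/list2"

# Same awk test as above; each matching list1 path is passed to rm as a single argument.
awk -F'/' 'NR==FNR{ a[$NF]; next } $NF in a' "$tmp/list2" "$tmp/list1" |
while IFS= read -r path; do
    rm -- "$path"
done

remaining=$(ls "$tmp/dir1")   # what is left on the list 1 side
kept=$(ls "$tmp/dir2")        # the list 2 side is untouched
echo "dir1: $remaining"
echo "dir2: $kept"

rm -r "$tmp"
```

This still assumes no path contains a newline, since both the lists and the loop are line-based.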
Always use the key in a idiom shown above for the second part of the script, instead of a[key] as you were trying to do:
awk 'NR == FNR{ a[$0] = 1;next } !a[$0]' list1 list2
That approach wastes cycles and memory by storing 1 for every key in your first file (a[$0] = 1), and then wastes more memory because the !a[$0] test also adds an entry for every key from the 2nd file that isn't already in the array.
This doesn't require you to assign any value when populating a[key]; it tests whether key exists in a with a hash lookup, without adding anything to a[]:
key in a
!(key in a)
whereas this requires you to first assign a non-zero value to a[key]. Each test then does a hash lookup of key in a[], adds an entry indexed by key if none exists, and finally tests whether that entry's value is non-zero:
a[key]
!a[key]
So just don't do that; it's a waste of time and memory.
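The difference is easy to observe by counting array entries after testing a key that was never stored. A small sketch (the variable names with_in and with_ref are just for this demonstration; entries are counted with a for loop rather than length(a) to stay portable across awk versions):

```shell
# Store one key, test a different key with "in", then count the entries.
with_in=$(awk 'BEGIN { a["x"]; if ("y" in a) print "found"; n = 0; for (k in a) n++; print n }')

# Store one key, test a different key by referencing a["y"], then count the entries.
with_ref=$(awk 'BEGIN { a["x"]; if (a["y"]) print "found"; n = 0; for (k in a) n++; print n }')

echo "after key in a: $with_in entry"
echo "after a[key]:   $with_ref entries"
```

The first count is 1 and the second is 2: merely referencing a["y"] created a new, empty entry.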
