Question:
I have 2 files, file 1 is a TSV (BED) file that has 23 base-pair sequences in column 7, for example:
1 779692 779715 Sample_3 1 ATGGTGCTTTGTTATGGCAGCTC
1 783462 783485 Sample_4 - 1 ATGAATAAGTCAAGTAAATGGAC
File 2 is a FASTA file (hg19.fasta) that looks like this:
>1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
AATTTGACCAGAAGTTATGGGCATCCCTCCCCTGGGAAGGAGGCAGGCAGAAAAGTTTGGAATCTATGTAGTAAAATATG
TTACTCTTTTATATATATGAATAAGTCAAGTAAATGGACATACATATATGTGTGTATATGTGTATATATATATACACACA
TATATACATACATACATACATACATATTATCTGAATTAGGCCATGGTGCTTTGTTATGGCAGCTCTCTGGGATACATGTG
CAGAATGTACAGGTTTGTTACACAGGTATACACCTGCCATGGTTGTTTGCTGCACCCATCAACTCACCATCTACATTAGG
TATTTCTCCTAACGTTATCCCTCATGAATAAGTCAAGTAAATGGAC
I want to 1) Find out how many times each 23bp sequence appears in the second file without overlapping any others, and 2) append this number to a new column next to the sequence so the new file looks like this:
Desired Output:
1 779692 779715 Sample_3 1 ATGGTGCTTTGTTATGGCAGCTC 1
1 783462 783485 Sample_4 - 1 ATGAATAAGTCAAGTAAATGGAC 2
My attempt:
I imagine solving the first part will be some variation on grep, so far I've managed:
grep -o ATGGTGCTTTGTTATGGCAGCTC "$file_2" | grep -c ""
which gets the count of a specific sequence, but not each sequence in the column. I think appending the grep results will require awk and paste but I haven't gotten that far!
Any help is appreciated as always! =)
CodePudding user response:
Since grep -c seems to give you the correct count (matching lines, not counting multiple occurances on the same line) you could read the 7 fields from the TSV (BED) file and just print them again with the grep output added to the end:
#!/bin/bash
# read the fields into the array `v`:
while read -ra v
do
# print the 7 first elements in the array the output from `grep -c`:
echo "${v[@]:0:7}" "$(grep -c "${v[6]}" hg19.fasta)"
done < tsv.bed > outfile
outfile will now contain
1 779692 779715 Sample_3 1 ATGGTGCTTTGTTATGGCAGCTC 1
1 783462 783485 Sample_4 - 1 ATGAATAAGTCAAGTAAATGGAC 2
CodePudding user response:
I'd reach for a programming language like Perl.
#!/usr/bin/perl
use warnings;
use strict;
my ($fasta_file, $bed_file) = @ARGV;
open my $fasta, '<', $fasta_file or die "$fasta_file: $!";
open my $bed, '<', $bed_file or die "$bed_file: $!";
<$fasta>; # Skip the header.
my $seq;
while (<$fasta>) {
chomp;
$seq .= $_;
}
while (<$bed>) {
chomp;
my $short_seq = (split /\t/, $_)[-1];
my $count = () = $seq =~ /\Q$short_seq\E/g;
print "$_\t$count\n";
}
To count overlapping sequences, change the regex to a lookahead.
my $count = () = $seq =~ /(?=\Q$short_seq\E)/g;
CodePudding user response:
If the second file is significantly bigger than the first one, I would try this awk script:
awk 'v==1 {a[$7];next} # Get the pattern from first file into the array a
v==2 { # For each line of the second file
for(i in a){ # Loop though all patterns
n=split($0,b,i)-1 # Get the number of pattern match in the line
a[i] =n
delete b
}
}
v==3 {print $0,a[$7]} # Re-read first file to add the number of pattern matches
' v=1 file1 v=2 file2 v=3 file1
