Count the number of times a substring appears in a file and place it in a new column-CodePudding

Question:

I have 2 files, file 1 is a TSV (BED) file that has 23 base-pair sequences in column 7, for example:

1 779692 779715 Sample_3   1  ATGGTGCTTTGTTATGGCAGCTC
1 783462 783485 Sample_4 - 1  ATGAATAAGTCAAGTAAATGGAC

File 2 is a FASTA file (hg19.fasta) that looks like this:

>1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
AATTTGACCAGAAGTTATGGGCATCCCTCCCCTGGGAAGGAGGCAGGCAGAAAAGTTTGGAATCTATGTAGTAAAATATG
TTACTCTTTTATATATATGAATAAGTCAAGTAAATGGACATACATATATGTGTGTATATGTGTATATATATATACACACA
TATATACATACATACATACATACATATTATCTGAATTAGGCCATGGTGCTTTGTTATGGCAGCTCTCTGGGATACATGTG
CAGAATGTACAGGTTTGTTACACAGGTATACACCTGCCATGGTTGTTTGCTGCACCCATCAACTCACCATCTACATTAGG
TATTTCTCCTAACGTTATCCCTCATGAATAAGTCAAGTAAATGGAC

I want to 1) Find out how many times each 23bp sequence appears in the second file without overlapping any others, and 2) append this number to a new column next to the sequence so the new file looks like this:

Desired Output:

1 779692 779715 Sample_3    1  ATGGTGCTTTGTTATGGCAGCTC 1 
1 783462 783485 Sample_4 - 1  ATGAATAAGTCAAGTAAATGGAC 2

My attempt:

I imagine solving the first part will be some variation on grep, so far I've managed:

grep -o ATGGTGCTTTGTTATGGCAGCTC "$file_2" | grep -c ""

which gets the count of a specific sequence, but not each sequence in the column. I think appending the grep results will require awk and paste but I haven't gotten that far!

Any help is appreciated as always! =)

CodePudding user response：

Since grep -c seems to give you the correct count (matching lines, not counting multiple occurances on the same line) you could read the 7 fields from the TSV (BED) file and just print them again with the grep output added to the end:

#!/bin/bash

# read the fields into the array `v`:
while read -ra v
do
    # print the 7 first elements in the array   the output from `grep -c`:
    echo "${v[@]:0:7}" "$(grep -c "${v[6]}" hg19.fasta)"
done < tsv.bed > outfile

outfile will now contain

1 779692 779715 Sample_3   1 ATGGTGCTTTGTTATGGCAGCTC 1
1 783462 783485 Sample_4 - 1 ATGAATAAGTCAAGTAAATGGAC 2

CodePudding user response：

I'd reach for a programming language like Perl.

#!/usr/bin/perl
use warnings;
use strict;

my ($fasta_file, $bed_file) = @ARGV;
open my $fasta, '<', $fasta_file or die "$fasta_file: $!";
open my $bed,   '<', $bed_file   or die "$bed_file: $!";


<$fasta>;  # Skip the header.
my $seq;
while (<$fasta>) {
    chomp;
    $seq .= $_;
}

while (<$bed>) {
    chomp;
    my $short_seq = (split /\t/, $_)[-1];
    my $count = () = $seq =~ /\Q$short_seq\E/g;
    print "$_\t$count\n";
}

To count overlapping sequences, change the regex to a lookahead.

my $count = () = $seq =~ /(?=\Q$short_seq\E)/g;

CodePudding user response：

If the second file is significantly bigger than the first one, I would try this awk script:

awk 'v==1 {a[$7];next}          # Get the pattern from first file into the array a
     v==2 {                     # For each line of the second file
       for(i in a){             # Loop though all patterns
         n=split($0,b,i)-1      # Get the number of pattern match in the line
         a[i] =n 
         delete b
       }
     }
     v==3 {print $0,a[$7]}       # Re-read first file to add the number of pattern matches
 ' v=1 file1 v=2 file2 v=3 file1