Perl script for counting occurrences of three different IDs in string where occurrence may be zero-CodePudding

I've attached below an example featuring 5 lines from my input file (tab-delimited):

G157    G157.2  535 3   344 344:m64019_201112_211057/51839190/ccs,m64019_201112_211057/167772263/ccs,m64019_201112_211057/152963146/ccs
G157    G157.6  535 42  276,344,365 276:m54312U_201103_152606/5964842/ccs,m54312U_201103_152606/78907467/ccs,m54312U_201103_152606/136382258/ccs,m54312U_201103_152606/124453202/ccs,m54312U_201103_152606/117441369/ccs,m54312U_201103_152606/134415958/ccs,m54312U_201103_152606/42665917/ccs,m54312U_201103_152606/20709542/ccs,m54312U_201103_152606/137956132/ccs,m54312U_201103_152606/26020309/ccs;344:m64019_201112_211057/80413080/ccs,m64019_201112_211057/20840619/ccs,m64019_201112_211057/32964769/ccs,m64019_201112_211057/119801176/ccs,m64019_201112_211057/62327216/ccs,m64019_201112_211057/155584613/ccs,m64019_201112_211057/78775365/ccs,m64019_201112_211057/85525815/ccs,m64019_201112_211057/40042591/ccs,m64019_201112_211057/129304850/ccs,m64019_201112_211057/16450019/ccs,m64019_201112_211057/127666695/ccs,m64019_201112_211057/29427856/ccs,m64019_201112_211057/171181539/ccs,m64019_201112_211057/175898871/ccs,m64019_201112_211057/28771811/ccs,m64019_201112_211057/167051372/ccs,m64019_201112_211057/25428057/ccs;365:m64019_201101_022708/103875458/ccs,m64019_201101_022708/24576259/ccs,m64019_201101_022708/67961035/ccs,m64019_201101_022708/149356854/ccs,m64019_201101_022708/5767478/ccs,m64019_201101_022708/155123744/ccs,m64019_201101_022708/125829415/ccs,m64019_201101_022708/137232674/ccs,m64019_201101_022708/83232122/ccs,m64019_201101_022708/126617353/ccs,m64019_201101_022708/64619288/ccs,m64019_201101_022708/64751219/ccs,m64019_201101_022708/132055970/ccs,m64019_201101_022708/34539631/ccs
G157    G157.9  535 4   344 344:m64019_201112_211057/80413080/ccs,m64019_201112_211057/78775365/ccs,m64019_201112_211057/85525815/ccs,m64019_201112_211057/27198805/ccs
G157    G157.11 535 6   276 276:m54312U_201103_152606/156304839/ccs,m54312U_201103_152606/15336676/ccs,m54312U_201103_152606/136382258/ccs,m54312U_201103_152606/134415958/ccs,m54312U_201103_152606/42665917/ccs,m54312U_201103_152606/20709542/ccs

The second column contains transcript IDs, the fifth column contains sample IDs which had reads mapping to that transcript, and the sixth column contains a list of all of those reads. Below is an explanation of the structure of the sixth column:

SAMPLEID:readinfo/separated/by/forwardslashes,readinfo/separated/by/forwardslashes;SAMPLEID:readinfo/separated/by/forwardslashes

There are three sample IDs (276,344,&365), but each transcript can have coverage from any one, two, or all three samples.

This is what I WANT the output to look like:

transcript_id   276 344 365
G157.1  0   0   2
G157.2  0   3   0
G157.6  9   18  15
G157.9  0   4   0
G157.11 6   0   0

I've been able to get this to work, but because I'm relatively new to Perl, I'm not able to figure out how to accomplish the task entirely in Perl. I piece together a matrix at the end using a second R script. This is my Perl script:

#!/usr/bin/perl -w
use strict;
my($U_ID, $ave, $PI, $Gene_ID, $Attribute, %test,$inclusion, $gene,$skipping, %hash, 
    $inclusion_intron, $line, $sum,$counts,%counter1, %counter2, $countIntron, 
    $countGene, %unique, );

open(INFILE, $ARGV[0]) or die"File1 is Dead\n";
while(<INFILE>) {
    $line=$_;
    chomp $line;
    if ($line=~m/\S /) {
        my ($transcript) = ($line=~m/\S \s (\S )/);
        my ($transcript_info) = ($line=~m/\S \s \S \s \S \s \S \s \S \s (\S )/);
        my ($unique_donor_counts)= ($line=~m/\S \s \S \s \S \s \S \s (\S )/);
        my ($unique_donor_ID) = ($transcript_info =~m/(\S )\:/);
    
       if ($unique_donor_counts !~ m/,/) {
           my $commas = $transcript_info =~ y/,//;
           my $counts = $commas 1;
           print "$transcript\t$unique_donor_ID\t$counts\n";
       } else { 
           my @spl = split(';', $unique_donor_ID);  
           foreach my $i (@spl) {
               my $commas = $i =~ y/,//;
               my $counts = $commas 1;
               my ($donor) = ($i=~m/(\d\d\d)/);
               print "$transcript\t$donor\t$counts\n";
           }
       }
    }
}

which outputs:

G157.1  2   365
G157.2  3   344
G157.6  9   276
G157.6  18  344
G157.6  15  365
G157.9  4   344
G157.11 6   276

Instead of printing, I save it to an output file and run this R script:

library(tidyr)

DF <- read.delim("output", sep = "\t", header = F)
df <- tidyr::pivot_wider(output, id_cols = V1, names_from = V2, values_from = V3)
write.table(df, file = "FLcount", sep = "\t", col.names = T, row.names = F, quote = F, dec = ".")

The problem I was encountering in Perl is that I can't figure out how to get a count of 0 when the sample ID doesn't occur in that line. I really want to get better at Perl, so I'd like to figure out how to do this all in one script instead of manipulating the matrix using R. Thank you for taking the time to help a Perl newbie learn!

edited post from my original with some progress

CodePudding user response：

I'd do it with a lot of splits.

#!/usr/bin/env perl
use strict;
use warnings; # Prefer over the -w option
use feature qw/say/;

# Reads from standard input/filenames given on command line, prints to
# standard output.

# All samples encountered in the entire input
my %samples;
# The data for each line of input
my @transcripts;

# For each line of input
while (my $line = <<>>) {
    # Split into fields on tabs.
    my @F = split /\t/, $line;

    # Add to the known samples
    my @s = split /,/, $F[4];
    @samples{@s} = (1) x @s;

    my %counts;
    # Split the 6th column on semicolon and iterate over each element
    for my $read (split /;/, $F[5]) {
        # Extract the sample id and increment its count
        my ($id, @reads) = split /[:,]/, $read;
        $counts{$id}  = @reads;
    }

    # Save this line's counts
    push @transcripts, [ $F[0], \%counts ];
}

# Sample ids in sorted order
my @samples = sort { $a <=> $b } keys %samples;

# Header line
say join("\t", "transcript_id", @samples);

foreach my $transcript (@transcripts) {
    # And print the counts of each transcript's sample, replacing undef values with 0.
    say join("\t", $$transcript[0], map { $_ // 0 } @{$$transcript[1]}{@samples});
}

Example:

$ perl count.pl input.tsv
transcript_id   276 344 365
G157.2  0   3   0
G157.6  10  18  14
G157.9  0   4   0
G157.11 6   0   0

One thing this uses in several places is a slice, to set/retrieve multiple elements of a hash in one expression. Also helps to be familiar with references, which are used in more complicated data structures than just plain arrays and hashes.

CodePudding user response：

I'll walk through your program file and comment where needed.

#!/usr/bin/perl -w
use strict;

The -w switch turns on warnings globally. You should use the lexical version of warnings: use warnings. It is generally considered a better practice, as it allows you to turn off warnings if you need to, with for example: no warnings 'experimental'.

my($U_ID, $ave, $PI, $Gene_ID, $Attribute, %test,$inclusion, $gene,$skipping, %hash, 
    $inclusion_intron, $line, $sum,$counts,%counter1, %counter2, $countIntron, 
    $countGene, %unique, );

All of these variables are unused. So remove all unused variables.

You should not declare all of your variables at the top of your program and make them all global. You should declare them as close as you can to the place you are going to use them, and in the smallest scope possible. The scope of a variable declared with my is the nearest enclosing block: { block here }. This will encapsulate your code, and also make it easier to read.

open(INFILE, $ARGV[0]) or die"File1 is Dead\n";

This is the old, two argument open, which is generally considered bad practice, as it allows code insertion. You want the three argument open, with explicit open mode. You also want to include error reporting $! in your die statement:

open my $fh, "<", $ARGV[0] or die "Cannot open '$ARGV[0]' for reading: $!";

You also do not want to put a newline at the end of the die statement, because that will suppress the line number in the error.

while(<INFILE>) {
    $line=$_;
    chomp $line;

You do not have to jump through all these hoops. You can do either

while (my $line = <$fh>) {    # declared in the smallest possible scope
    chomp $line;

while (<$fh>) {
    chomp;        # chomps $_

The latter has the benefit of not having to type out what variable you want to match your regexes against. E.g. instead of $line =~ m/..../ just type m/..../. It is a commonly used Perl idiom.

    if ($line=~m/\S /) {
        my ($transcript)         = ($line=~m/\S \s (\S )/);
        my ($transcript_info)    = ($line=~m/\S \s \S \s \S \s \S \s \S \s (\S )/);
        my ($unique_donor_counts)= ($line=~m/\S \s \S \s \S \s \S \s (\S )/);

You could just match once, and capture three values at the same time:

my ($trans,$udc,$trans_info) = $line =~ /^\S \s (\S )\s \S \s (\S )\s (\S )\s (\S )/;

Note that you have to have the variables in the same order as the captured values.

This might also be somewhat simpler if you use split

my @vals = split ' ', $line;
my $trans = $vals[1];        # and so on...

This part

       my ($unique_donor_ID) = ($transcript_info =~m/(\S )\:/);

is wrong. You think it just matches a number in front of the first colon, e.g. 344:. If you only have one colon, that is what it does. Otherwise it matches all the characters before the last colon in the line, because \S is greedy and will try to match as much as possible. When you later split $unique_donor_ID on ;, you will miss all the data after the last colon. You can see the result here. Notice line 2 of the output which begins with 276 and ends with 365. 365 is the start of the last record.

       if ($unique_donor_counts !~ m/,/) {            # checks for 311,322, ..
           my $commas = $transcript_info =~ y/,//;    # count by destroying data
           my $counts = $commas 1;                    # unnecessary extra variable
           print "$transcript\t$unique_donor_ID\t$counts\n";
       } else { 
           my @spl = split(';', $unique_donor_ID);    # splitting the wrong variable
           foreach my $i (@spl) {
               my $commas = $i =~ y/,//;              
               my $counts = $commas 1;
               my ($donor) = ($i=~m/(\d\d\d)/);       # risky way to get the first 3 numbers
               print "$transcript\t$donor\t$counts\n";
        .....

Your logic is to check for commas in the donor id short list, and handle the single values separately. That is most likely a flawed approach. Why not just split the last field on ; and handle the results from there?

use strict;
use warnings;
use feature 'say';

my