How to remove rows with similar data to keep only highest value in a specific column (tsv file) with-CodePudding

I have a very large .tsv file (80 GB) that I need to edit. It is made up of 5 columns. The last column represent a score. Some positions have multiple "Score" entries and I need to keep only the row for each position with the highest value.

For example, this position have multiple entries for each combination:

1   861265  C   A   0.071
1   861265  C   A   0.148
1   861265  C   G   0.001
1   861265  C   G   0.108
1   861265  C   T   0
1   861265  C   T   0.216
2   193456  G   A   0.006
2   193456  G   A   0.094
2   193456  G   C   0.011
2   193456  G   C   0.152
2   193456  G   T   0.003
2   193456  G   T   0.056

The desired output would look like this:

1   861265  C   A   0.148
1   861265  C   G   0.108
1   861265  C   T   0.216
2   193456  G   A   0.094
2   193456  G   C   0.152
2   193456  G   T   0.056

Doing it in python/pandas is not possible as the file is too large or takes too long. Therefore, I am looking for a solution using bash; in particular awk.

Thif input file has been sorted with the following command:

sort -t$'\t' -k1 -n -o sorted_file original_file

The command would basically need to:

compare the data from the first 4 columns in the sorted_file
if all of those are the same, then only the row with the highest value on column 5 should be printed onto the output file`. I am not very familiar with awk syntax. I have seen relatively similar questions in other forums, but I was unable to adapt it to my particular case. I have tried to adapt one of those solutions to my case like this:

awk -F, 'NR==1 {print; next} NR==2 {key=$2; next}$2 != key {print lastval; key = $2} {lastval = $0} END {print lastval}' sorted_files.tsv > filtered_file.tsv

However, the output file does not look like it should, at all. Any help would be very much appreciated.

CodePudding user response：

You can try this approach. This also works on a non-sorted last column, only the first 4 columns have to be sorted.

% awk 'NR>1&&str!=$1" "$2" "$3" "$4{print line; m=0}
       $5>=m{m=$5; line=$0}
       {str=$1" "$2" "$3" "$4} END{print line}' file
1   861265  C   A   0.148
1   861265  C   G   0.108
1   861265  C   T   0.216
2   193456  G   A   0.094
2   193456  G   C   0.152
2   193456  G   T   0.056

Data

% cat file
1   861265  C   A   0.071
1   861265  C   A   0.148
1   861265  C   G   0.001
1   861265  C   G   0.108
1   861265  C   T   0
1   861265  C   T   0.216
2   193456  G   A   0.006
2   193456  G   A   0.094
2   193456  G   C   0.011
2   193456  G   C   0.152
2   193456  G   T   0.003
2   193456  G   T   0.056

CodePudding user response：

With sort and awk:

sort -t$'\t' -k1,1n -k4,4 -k5,5rn file | awk 'BEGIN{FS=OFS="\t"}
!($1 $4 in seen){print; seen[$1 $4]}'

Prints:

1   861265  C   A   0.148
1   861265  C   G   0.108
1   861265  C   T   0.216
2   193456  G   A   0.094
2   193456  G   C   0.152
2   193456  G   T   0.056

This assumes the 'group' is defined as column 1.

Works by grouping first by column 1, then by column 4 (each letter) then reverse numeric sort on column 5.

The awk then prints the first group, letter seen which will be the max based on the sort.

CodePudding user response：

Assumptions/Understandings:

file is sorted by the first field
no guarantee on the ordering of fields #2, #3 and #4
must maintain the current row ordering (this would seem to rule out (re)sorting the file as we could lose the current row ordering)
the complete set of output rows for a given group will fit into memory (aka the awk arrays)

General plan:

we'll call field #1 the group field; all rows with the same value in field #1 are considered part of the same group
for a given group we keep track of all output rows via the awk array arr[] (index will be a combo of fields #2, #3, #4)
we also keep track of the incoming row order via the awk array order[]
update arr[] if we see a value in field #5 that's higher than the previous value
when group changes flush the current contents of the arr[] index to stdout

One awk idea:

awk '
function flush() {                     # function to flush current group to stdout
    for (i=1; i<=seq; i  )
        print group,order[i],arr[order[i]]

    delete arr                         # reset arrays
    delete order
    seq=0                              # reset index for order[] array
}

BEGIN      { FS=OFS="\t" }

$1!=group  { flush()
             group=$1
           }

           { key=$2 OFS $3 OFS $4

             if ( key in arr && $5 <= arr[key] )
                next
             if ( ! (key in arr) )
                order[  seq]=key
             arr[key]=$5
           }

END   { flush() }                      # flush last group to stdout
' input.dat

This generates:

1       861265  C       A       0.148
1       861265  C       G       0.108
1       861265  C       T       0.216
2       193456  G       A       0.094
2       193456  G       C       0.152
2       193456  G       T       0.056

CodePudding user response：

Update

sort
-k, --key=KEYDEF

KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a field number and C a character position in the field; both are origin 1, and the stop position defaults to the line's end.

So, by using sort -t$'\t' -k1 -n your file should be completely numerically sorted.

Here's an awk solution that should be fast:

awk '
    BEGIN {
        FS = "\t"
        if ((getline line) > 0) {
            split(line, arr)
            prev_key = arr[1] FS arr[2] FS arr[4]
            prev_line = $0
        }
    }
    {
        curr_key = $1 FS $2 FS $4
        if (curr_key != prev_key) print prev_line
        prev_key = curr_key
        prev_line = $0
    }
    END {
        if (prev_key) print prev_line
    }
' file.tsv

Note: As you're handling a file with a lot of lines I tried to keep the operations to a minimum. For example I'm saving around 4 billions comparisons just by processing the first line with getline in the BEGIN block.

CodePudding user response：

Assuming your real input is sorted by key then ascending value the same way as your example is:

$ cat tst.awk
{ key = $1 FS $2 FS $3 FS $4 }
key != prevKey {
    if ( NR > 1 ) {
        print prevRec
    }
    prevKey = key
}
{ prevRec = $0 }
END {
    print prevRec
}

$ awk -f tst.awk file
1       861265  C       A       0.148
1       861265  C       G       0.108
1       861265  C       T       0.216
2       193456  G       A       0.094
2       193456  G       C       0.152
2       193456  G       T       0.056

if your data isn't already sorted then just sort it with:

sort file | awk ..

That way only sort has to handle the whole file at once and it's designed to do so by using demand paging, etc. and so is far less likely to run out of memory than if you read the whole file into awk or python or any other tool