I have a very large .tsv file (80 GB) that I need to edit. It is made up of 5 columns. The last column represent a score. Some positions have multiple "Score" entries and I need to keep only the row for each position with the highest value.
For example, this position have multiple entries for each combination:
1 861265 C A 0.071
1 861265 C A 0.148
1 861265 C G 0.001
1 861265 C G 0.108
1 861265 C T 0
1 861265 C T 0.216
2 193456 G A 0.006
2 193456 G A 0.094
2 193456 G C 0.011
2 193456 G C 0.152
2 193456 G T 0.003
2 193456 G T 0.056
The desired output would look like this:
1 861265 C A 0.148
1 861265 C G 0.108
1 861265 C T 0.216
2 193456 G A 0.094
2 193456 G C 0.152
2 193456 G T 0.056
Doing it in python/pandas is not possible as the file is too large or takes too long. Therefore, I am looking for a solution using bash; in particular awk.
Thif input file has been sorted with the following command:
sort -t$'\t' -k1 -n -o sorted_file original_file
The command would basically need to:
- compare the data from the first 4 columns in the
sorted_file - if all of those are the same, then only the row with the highest value on column 5 should be printed onto the output file`.
I am not very familiar with
awksyntax. I have seen relatively similar questions in other forums, but I was unable to adapt it to my particular case. I have tried to adapt one of those solutions to my case like this:
awk -F, 'NR==1 {print; next} NR==2 {key=$2; next}$2 != key {print lastval; key = $2} {lastval = $0} END {print lastval}' sorted_files.tsv > filtered_file.tsv
However, the output file does not look like it should, at all. Any help would be very much appreciated.
CodePudding user response:
You can try this approach. This also works on a non-sorted last column, only the first 4 columns have to be sorted.
% awk 'NR>1&&str!=$1" "$2" "$3" "$4{print line; m=0}
$5>=m{m=$5; line=$0}
{str=$1" "$2" "$3" "$4} END{print line}' file
1 861265 C A 0.148
1 861265 C G 0.108
1 861265 C T 0.216
2 193456 G A 0.094
2 193456 G C 0.152
2 193456 G T 0.056
Data
% cat file
1 861265 C A 0.071
1 861265 C A 0.148
1 861265 C G 0.001
1 861265 C G 0.108
1 861265 C T 0
1 861265 C T 0.216
2 193456 G A 0.006
2 193456 G A 0.094
2 193456 G C 0.011
2 193456 G C 0.152
2 193456 G T 0.003
2 193456 G T 0.056
CodePudding user response:
With sort and awk:
sort -t$'\t' -k1,1n -k4,4 -k5,5rn file | awk 'BEGIN{FS=OFS="\t"}
!($1 $4 in seen){print; seen[$1 $4]}'
Prints:
1 861265 C A 0.148
1 861265 C G 0.108
1 861265 C T 0.216
2 193456 G A 0.094
2 193456 G C 0.152
2 193456 G T 0.056
This assumes the 'group' is defined as column 1.
Works by grouping first by column 1, then by column 4 (each letter) then reverse numeric sort on column 5.
The awk then prints the first group, letter seen which will be the max based on the sort.
CodePudding user response:
Assumptions/Understandings:
- file is sorted by the first field
- no guarantee on the ordering of fields #2, #3 and #4
- must maintain the current row ordering (this would seem to rule out (re)sorting the file as we could lose the current row ordering)
- the complete set of output rows for a given
groupwill fit into memory (aka theawkarrays)
General plan:
- we'll call field #1 the
groupfield; all rows with the same value in field #1 are considered part of the samegroup - for a given
groupwe keep track of all output rows via theawkarrayarr[](index will be a combo of fields #2, #3, #4) - we also keep track of the incoming row order via the
awkarrayorder[] - update
arr[]if we see a value in field #5 that's higher than the previous value - when
groupchanges flush the current contents of thearr[]index to stdout
One awk idea:
awk '
function flush() { # function to flush current group to stdout
for (i=1; i<=seq; i )
print group,order[i],arr[order[i]]
delete arr # reset arrays
delete order
seq=0 # reset index for order[] array
}
BEGIN { FS=OFS="\t" }
$1!=group { flush()
group=$1
}
{ key=$2 OFS $3 OFS $4
if ( key in arr && $5 <= arr[key] )
next
if ( ! (key in arr) )
order[ seq]=key
arr[key]=$5
}
END { flush() } # flush last group to stdout
' input.dat
This generates:
1 861265 C A 0.148
1 861265 C G 0.108
1 861265 C T 0.216
2 193456 G A 0.094
2 193456 G C 0.152
2 193456 G T 0.056
CodePudding user response:
Update
sort
-k, --key=KEYDEF
KEYDEF isF[.C][OPTS][,F[.C][OPTS]]for start and stop position, where F is a field number and C a character position in the field; both are origin 1, and the stop position defaults to the line's end.
So, by using sort -t$'\t' -k1 -n your file should be completely numerically sorted.
Here's an awk solution that should be fast:
awk '
BEGIN {
FS = "\t"
if ((getline line) > 0) {
split(line, arr)
prev_key = arr[1] FS arr[2] FS arr[4]
prev_line = $0
}
}
{
curr_key = $1 FS $2 FS $4
if (curr_key != prev_key) print prev_line
prev_key = curr_key
prev_line = $0
}
END {
if (prev_key) print prev_line
}
' file.tsv
Note: As you're handling a file with a lot of lines I tried to keep the operations to a minimum. For example I'm saving around 4 billions comparisons just by processing the first line with getline in the BEGIN block.
CodePudding user response:
Assuming your real input is sorted by key then ascending value the same way as your example is:
$ cat tst.awk
{ key = $1 FS $2 FS $3 FS $4 }
key != prevKey {
if ( NR > 1 ) {
print prevRec
}
prevKey = key
}
{ prevRec = $0 }
END {
print prevRec
}
$ awk -f tst.awk file
1 861265 C A 0.148
1 861265 C G 0.108
1 861265 C T 0.216
2 193456 G A 0.094
2 193456 G C 0.152
2 193456 G T 0.056
if your data isn't already sorted then just sort it with:
sort file | awk ..
That way only sort has to handle the whole file at once and it's designed to do so by using demand paging, etc. and so is far less likely to run out of memory than if you read the whole file into awk or python or any other tool
