I am trying to extract specific parts of the paths under /rec/flux_entrant/archive/le501/tble91_formation_eligible/. This directory is located in HDFS, so its contents can be listed with the command:
hdfs dfs -ls /rec/flux_entrant/archive/le501/tble91_formation_eligible/*
which returns
/rec/flux_entrant/archive/le501/tble91_formation_eligible/20220104-221755/00000.deflate
/rec/flux_entrant/archive/le501/tble91_formation_eligible/20220103-231754/00001.deflate
/rec/flux_entrant/archive/le501/tble91_formation_eligible/20220111-152145/00002.deflate
/rec/flux_entrant/archive/le501/tble91_formation_eligible/20220112-155012/00003.deflate
My objective is to extract only the timestamp directory component of these paths (not the xxx.deflate file names):
20220104-221755, 20220103-231754, 20220111-152145 and 20220112-155012
and then keep only those whose date is >= 20220110, so that the final result should be:
20220111-152145 and 20220112-155012, because 20220111 and 20220112 are >= 20220110
I tried awk with the following command:
hdfs dfs -ls /rec/flux_entrant/archive/le501/tble91_formation_eligible/* | awk -F'/' '{split($NF, a, "-"); if (a[1]>20220110) print $NF}'
But this returns 00003.deflate and 00002.deflate, not 20220111-152145 and 20220112-155012.
EDIT
As proposed by @Tom, I used print $(NF-1) instead of $NF, but the filter still did not work. I also tried to capture the results in a variable:
OUTPUT=$(hdfs dfs -ls /rec/flux_entrant/archive/le501/tble91_formation_eligible/* |
awk -F'/' '{split($NF, a, "-"); if (a[1]>=20220110) print $(NF-1)}')
echo ${OUTPUT}
gives
Found 5 items 20200916-170926 20200916-170926 20200916-170926 20200916-170926 20200916-170926 Found 5 items 20200916-182251
This is not good, because 20200916, 20200916 ... are not >= 20220110.
Also, I need to remove "Found 5 items" from the final result.
Any help, please? Thank you.
CodePudding user response:
Try this, using the FPAT variable of GNU awk (gawk):
hdfs dfs -ls /rec/flux_entrant/archive/le501/tble91_formation_eligible/* |
awk -v startdate="20220110" 'BEGIN{FPAT="[0-9]{8}-[0-9]{6}"}($1 >= startdate){print $1}'
I used the variable startdate to avoid hardcoding the string 20220110 into the AWK code.
Explanation: FPAT is a regex that describes what AWK should treat as a field: in our case, a sequence of 8 digits, followed by a hyphen and 6 digits. Each input line contains at most one such sequence, so it becomes $1, and AWK prints it with print $1 only when the condition ($1 >= startdate) holds. Lines without a match, such as "Found 5 items", have an empty $1 and are skipped.
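FPAT is a gawk extension, so here is a minimal sketch of the same idea for an awk without it. It splits on "/" and looks at the second-to-last field. The listing is simulated with a shell variable so it runs without a cluster: the paths come from the question, while the permissions, owner, sizes and the extra 00004.deflate file are made up. The seen array is an assumption on my side, added because the question's output shows each directory repeated once per file.

```shell
# Simulated `hdfs dfs -ls` output (metadata columns are illustrative only).
listing='Found 5 items
-rw-r--r--   3 etl hadoop 1024 2022-01-04 22:17 /rec/flux_entrant/archive/le501/tble91_formation_eligible/20220104-221755/00000.deflate
-rw-r--r--   3 etl hadoop 1024 2022-01-03 23:17 /rec/flux_entrant/archive/le501/tble91_formation_eligible/20220103-231754/00001.deflate
-rw-r--r--   3 etl hadoop 1024 2022-01-11 15:21 /rec/flux_entrant/archive/le501/tble91_formation_eligible/20220111-152145/00002.deflate
-rw-r--r--   3 etl hadoop 1024 2022-01-11 15:21 /rec/flux_entrant/archive/le501/tble91_formation_eligible/20220111-152145/00004.deflate
-rw-r--r--   3 etl hadoop 1024 2022-01-12 15:50 /rec/flux_entrant/archive/le501/tble91_formation_eligible/20220112-155012/00003.deflate'

result=$(printf '%s\n' "$listing" | awk -F/ -v startdate=20220110 '
  # Keep only lines whose parent directory looks like digits-digits;
  # this also drops the "Found N items" header lines.
  NF > 1 && $(NF-1) ~ /^[0-9]+-[0-9]+$/ {
      split($(NF-1), a, "-")            # a[1] is the date part
      # Numeric comparison on the date; print each directory only once.
      if (a[1] + 0 >= startdate + 0 && !seen[$(NF-1)]++) print $(NF-1)
  }')

printf '%s\n' "$result"
```

Against the real listing, replace the printf with the hdfs dfs -ls command from the question.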
CodePudding user response:
From what I understand, you actually want something like this to start from:
$ hdfs dfs -ls -d /path/to/dir/*/
This will select all subdirectories under /path/to/dir and not traverse them, due to the flag -d (see the Hadoop documentation). From that point forward it is straightforward to make the directory selection. The directory name is of the form YYYYMMDD-hhmmss and therefore lexicographically sortable. So you could just do something like this:
$ hdfs dfs -ls -d /path/to/dir/*/ | awk -F/ '($NF<"20220128"){print $NF}'
Notice that we do a string comparison and not a numerical comparison in $NF<"20220128". Because of how awk converts strings to numbers, you could do a numeric comparison as well: awk uses only the leading numeric prefix of the string and ignores the rest. So you could do:
$ hdfs dfs -ls -d /path/to/dir/*/ | awk -F/ '($NF+0<20220128){print $NF}'
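To see both comparisons side by side with the threshold from the question (>= 20220110), here is a small sketch that runs without HDFS. The listing is simulated: the directory names come from the question, and the metadata columns and the "Found 4 items" header are made up. The extra regex guard is my own addition. With >= a header line such as "Found 4 items" would pass the string comparison, since "F" sorts after "2".

```shell
# Simulated `hdfs dfs -ls -d` output (metadata columns are illustrative only).
listing_d='Found 4 items
drwxr-xr-x   - etl hadoop 0 2022-01-04 22:17 /path/to/dir/20220104-221755
drwxr-xr-x   - etl hadoop 0 2022-01-03 23:17 /path/to/dir/20220103-231754
drwxr-xr-x   - etl hadoop 0 2022-01-11 15:21 /path/to/dir/20220111-152145
drwxr-xr-x   - etl hadoop 0 2022-01-12 15:50 /path/to/dir/20220112-155012'

# String comparison: relies on YYYYMMDD-hhmmss sorting lexicographically.
by_string=$(printf '%s\n' "$listing_d" |
  awk -F/ '($NF ~ /^[0-9]+-[0-9]+$/) && ($NF >= "20220110") {print $NF}')

# Numeric comparison: $NF+0 converts the leading digits (the date) to a number.
by_number=$(printf '%s\n' "$listing_d" |
  awk -F/ '($NF ~ /^[0-9]+-[0-9]+$/) && ($NF+0 >= 20220110) {print $NF}')

printf 'string:\n%s\nnumeric:\n%s\n' "$by_string" "$by_number"
```

Both variants should select the same two directories here. The string form is the simpler one as long as every name keeps the fixed YYYYMMDD-hhmmss layout.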
