extract specific parts from hdfs directory using awk function

Time:01-26

I am trying to extract specific parts of the paths under the /rec/flux_entrant/archive/le501/tble91_formation_eligible/* directory. This directory is located in HDFS, so its contents can be listed with the command hdfs dfs -ls /rec/flux_entrant/archive/le501/tble91_formation_eligible/*, which returns:

/rec/flux_entrant/archive/le501/tble91_formation_eligible/20220104-221755/00000.deflate
/rec/flux_entrant/archive/le501/tble91_formation_eligible/20220103-231754/00001.deflate
/rec/flux_entrant/archive/le501/tble91_formation_eligible/20220111-152145/00002.deflate
/rec/flux_entrant/archive/le501/tble91_formation_eligible/20220112-155012/00003.deflate

My goal is to extract only the last directory component of these paths (not the xxx.deflate files): 20220104-221755, 20220103-231754, 20220111-152145 and 20220112-155012, and then keep only those whose date is >= 20220110. The final result should be 20220111-152145 and 20220112-155012, because 20220111 and 20220112 are >= 20220110.

I tried awk with the following command:

hdfs dfs -ls /rec/flux_entrant/archive/le501/tble91_formation_eligible/* | awk -F'/' '{split($NF, a, "-"); if (a[1]>20220110) print $NF}'

But this returns 00003.deflate and 00002.deflate, not 20220111-152145 and 20220112-155012.

EDIT

As proposed by @Tom, I used print $(NF-1) instead of $NF, but the filter did not work correctly. I also tried to store the results in a variable:

OUTPUT=$(hdfs dfs -ls /rec/flux_entrant/archive/le501/tble91_formation_eligible/* |
awk -F'/' '{split($NF, a, "-"); if (a[1]>=20220110) print $(NF-1)}')
echo ${OUTPUT}

gives

Found 5 items 20200916-170926 20200916-170926 20200916-170926 20200916-170926 20200916-170926 Found 5 items 20200916-182251

This is not good, because 20200916, 20200916, ... are not >= 20220110. I also need to remove Found 5 items from the final result.

Any help, please? Thank you.

CodePudding user response:

Try this, using the variable FPAT of AWK:

hdfs dfs -ls /rec/flux_entrant/archive/le501/tble91_formation_eligible/* | 
 awk -v startdate="20220110" 'BEGIN{FPAT="[0-9]{8}-[0-9]{6}"}($1 >= startdate){print $1}'

I used the variable startdate to avoid hardcoding the string 20220110 into the AWK code.

Explanation: FPAT (a GNU awk feature) is a regex that describes what AWK should treat as a field: in our case, a sequence of 8 digits, followed by a hyphen and 6 digits. Each line of the listing contains at most one such sequence, so it becomes $1, and AWK prints it with print $1 only when the condition ($1 >= startdate) holds.
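For awk implementations without FPAT, a similar result can be sketched by splitting on / and taking the second-to-last field. The listing below is a made-up stand-in for the real hdfs dfs -ls output (the permission, owner and size columns are assumptions), and sample_listing and filter_dirs are hypothetical helper names:

```shell
# Hypothetical stand-in for `hdfs dfs -ls .../*`; real columns will differ.
sample_listing() {
  cat <<'EOF'
Found 1 items
-rw-r--r--   3 user group  10 2022-01-04 22:17 /rec/flux_entrant/archive/le501/tble91_formation_eligible/20220104-221755/00000.deflate
-rw-r--r--   3 user group  10 2022-01-03 23:17 /rec/flux_entrant/archive/le501/tble91_formation_eligible/20220103-231754/00001.deflate
Found 1 items
-rw-r--r--   3 user group  10 2022-01-11 15:21 /rec/flux_entrant/archive/le501/tble91_formation_eligible/20220111-152145/00002.deflate
-rw-r--r--   3 user group  10 2022-01-12 15:50 /rec/flux_entrant/archive/le501/tble91_formation_eligible/20220112-155012/00003.deflate
EOF
}

# Split on "/": $NF is the file name, $(NF-1) is the timestamped directory.
# NF > 2 skips lines without a path, such as "Found 1 items".
# Appending "" to both sides forces a string comparison.
filter_dirs() {
  awk -F'/' -v startdate="$1" \
    'NF > 2 && ($(NF-1) "") >= (startdate "") { print $(NF-1) }'
}

result=$(sample_listing | filter_dirs 20220110)
printf '%s\n' "$result"
```

On this sample the pipeline prints 20220111-152145 and 20220112-155012. Appending | sort -u would collapse repeated names when one directory holds several files.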

CodePudding user response:

From what I understand, you actually want something like this to start from:

$ hdfs dfs -ls -d /path/to/dir/*/

This selects all subdirectories under /path/to/dir without descending into them, thanks to the -d flag (see the Hadoop documentation). From that point on, selecting directories is straightforward. The directory names have the form YYYYMMDD-hhmmss and are therefore lexicographically sortable. So you could just do something like this:

$ hdfs dfs -ls -d /path/to/dir/*/ | awk -F/ '($NF<"20220128"){print $NF}'

Notice that $NF<"20220128" is a string comparison, not a numeric one. You could do a numeric comparison as well: when awk converts a string to a number, it uses only the leading numeric prefix and ignores the rest. So you could do:

$ hdfs dfs -ls -d /path/to/dir/*/ | awk -F/ '($NF+0<20220128){print $NF}'
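The difference between the two comparisons can be checked without a cluster. This is a minimal sketch on hypothetical directory names fed through printf; the variable names are illustrative only:

```shell
# Adding 0 forces a numeric conversion; awk keeps only the leading digits.
prefix=$(echo "20220111-152145" | awk '{ print $1 + 0 }')
echo "$prefix"

# Hypothetical directory names, one per line.
names=$(printf '%s\n' 20220104-221755 20220111-152145 20220112-155012)

# String comparison: appending "" makes the left-hand side a string.
by_string=$(printf '%s\n' "$names" | awk '($1 "" >= "20220110") { print }')

# Numeric comparison: $1 + 0 compares only the YYYYMMDD prefix as a number.
by_number=$(printf '%s\n' "$names" | awk '($1 + 0 >= 20220110) { print }')

printf '%s\n' "$by_string"
printf '%s\n' "$by_number"
```

Both filters keep 20220111-152145 and 20220112-155012 here. One caveat with the numeric form: a line with no leading digits, such as a Found N items header, converts to 0, so a numeric < test would match it.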