I have a file containing the word Sweden in different variations.
I am trying to check whether the 34th column contains Sweden:
awk -F\" '$34 ~ /Sweden/ {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /sweden/ {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /SWEDEN/ {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /^se$/ {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /^Se$/ {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /^SE$/ {print $0}' $ipp >> sweden.csv &
As far as I know this is going to be very slow, as I have 650 million rows and each command reads the whole file again.
Is there any way I can match all variations in one awk command?
CodePudding user response:
You can use this awk:
awk -F\" 'tolower($34) ~ /sweden|^se$/' "$ipp" >> sweden.csv
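To convince yourself that the pattern covers all six variants, here is a small sketch you could run first. The `make_row` helper and the sample values are made up for the test (they are not from the question); the helper only builds a line whose 34th `"`-separated field holds the given value.

```shell
# Hypothetical helper: emit one line whose 34th "-separated field is $1.
make_row() {
  awk -v val="$1" 'BEGIN { for (i = 1; i < 34; i++) printf "f%d\"", i; print val "\"tail" }'
}

sample=$(mktemp)
# Six variants that should match, plus two values that should not.
for v in Sweden sweden SWEDEN se Se SE Norway sea; do
  make_row "$v"
done > "$sample"

# Same condition as above; count the matching lines instead of printing them.
matches=$(awk -F\" 'tolower($34) ~ /sweden|^se$/ { n++ } END { print n + 0 }' "$sample")
echo "$matches"   # prints 6
rm -f "$sample"
```

Because the whole file is read only once, this should also be much cheaper than six separate passes over 650 million rows.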
CodePudding user response:
With your shown samples and attempts, please try the following awk code. It sets the field separator to " and, in the main block, checks whether the 34th field either contains sweden (in any mix of upper and lower case) or is exactly se (again with either case for each letter). If either condition holds, the line is printed.
awk -F\" '$34 ~ /[Ss][Ww][Ee][Dd][Ee][Nn]|^[Ss][Ee]$/' "$ipp" >> sweden.csv
CodePudding user response:
If you're using GNU awk, you can set the IGNORECASE variable:
awk -F\" 'BEGIN{IGNORECASE=1} $34 ~ /sweden|^se$/' "$ipp" >> sweden.csv
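Keep in mind that IGNORECASE is a gawk extension; other awk implementations treat it as an ordinary variable, so the match silently stays case-sensitive. A rough sketch of how you might guard against that is below. The function name `awk_has_ignorecase` and the variable `prog` are my own, and the fallback is simply the tolower() approach from the other answer.

```shell
# Succeeds only if this awk actually honours IGNORECASE (probe by behaviour).
awk_has_ignorecase() {
  printf 'A\n' | awk 'BEGIN { IGNORECASE = 1 } /a/ { found = 1 } END { exit !found }'
}

if awk_has_ignorecase; then
  # gawk: case-insensitive regex matching via IGNORECASE
  prog='BEGIN { IGNORECASE = 1 } $34 ~ /sweden|^se$/'
else
  # any other awk: lowercase the field before matching
  prog='tolower($34) ~ /sweden|^se$/'
fi
```

You would then run it as `awk -F\" "$prog" "$ipp" >> sweden.csv`.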
CodePudding user response:
Your code might be ameliorated as already explained. More generally, you can put the 6 pattern-action pairs in a single awk call rather than 6 separate ones, that is
awk -F\" '$34 ~ /Sweden/ {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /sweden/ {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /SWEDEN/ {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /^se$/ {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /^Se$/ {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /^SE$/ {print $0}' $ipp >> sweden.csv &
might be written more concisely as
awk -F\" '$34 ~ /Sweden/ {print $0} $34 ~ /sweden/ {print $0} $34 ~ /SWEDEN/ {print $0} $34 ~ /^se$/ {print $0} $34 ~ /^Se$/ {print $0} $34 ~ /^SE$/ {print $0}' $ipp >> sweden.csv &
Note that if field 34 of a line contains both Sweden and SWEDEN, that line will appear twice in the output (in both the 6 × awk and the 1 × awk solution). The order of lines in the output might also differ between the two approaches.
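If the duplicates are not wanted, one option is to join the conditions with || in a single pattern, so each line is printed at most once. A small sketch of the difference follows; the `make_row` helper and the sample value are invented for the demonstration, and only two of the six patterns are shown to keep it short.

```shell
# Hypothetical helper: emit one line whose 34th "-separated field is $1.
make_row() {
  awk -v val="$1" 'BEGIN { for (i = 1; i < 34; i++) printf "f%d\"", i; print val "\"tail" }'
}

sample=$(mktemp)
make_row "Sweden SWEDEN" > "$sample"   # field 34 matches two of the patterns

# Separate pattern-action pairs: every matching pair prints the line.
twice=$(awk -F\" '$34 ~ /Sweden/ {print $0} $34 ~ /SWEDEN/ {print $0}' "$sample" | awk 'END { print NR }')

# One combined pattern: the line is printed once, whichever alternative matched.
once=$(awk -F\" '$34 ~ /Sweden/ || $34 ~ /SWEDEN/' "$sample" | awk 'END { print NR }')

echo "$twice $once"   # prints 2 1
rm -f "$sample"
```

Another way to get the same effect is to keep the separate pairs but write `{print $0; next}` in each action, so awk moves on to the next line after the first match.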
