I have two tasks to perform on a semicolon-delimited data record. I managed to write an awk command for each task separately, but what I need is to apply the second task to the 3rd field and embed its results inside the output of the first task.
Data in file data.csv
31;Area A;Language B1 G1-T1-(3343-1-25274) (3343-1-25278) (3345-1-25676);German;;;0;;Wolfgang Mozart
The first task is to generate an XML structure, which is done with the following awk command:
awk -F';' -v OFS=';' '{printf "<A>\n\t<T>%s</T>\n\t<S>%s</S>\n\t<AT>%s</AT>\n\t<D>%s</D>\n\t<Id>%s</Id>\n</A>\n", $9,$3,$2,$7,$1}' data.csv
Result is:
<A>
<T>Wolfgang Mozart</T>
<S>Language B1 G1-T1-(3343-1-25274) (3343-1-25278) (3345-1-25676)</S>
<AT>Area A</AT>
<D>0</D>
<Id>31</Id>
</A>
The second task is to convert every code matching the pattern "[0-9]{4}-[1-5]" into an <AT> tag, which I managed with the following command:
awk 'BEGIN{FPAT="[0-9]{4}-[1-5]"} {for (i=1; i <= NF; i++) print "<AT>"$i"</AT>"}' data.csv
Result is (by the way, I could not manage to print a repeated code only once):
<AT>3343-1</AT>
<AT>3343-1</AT>
<AT>3345-1</AT>
The desired output is:
<A>
<T>Wolfgang Mozart</T>
<S>Language B1 G1-T1-(3343-1-25274) (3343-1-25278) (3345-1-25676)</S>
<AT>Area A</AT>
<AT>3343-1</AT>
<AT>3345-1</AT>
<D>0</D>
<Id>31</Id>
</A>
The best command I could come up with is the following, which does not produce the desired output:
awk -F';' -v OFS=';' 'BEGIN{FPAT="[0-9]{4}-[1-5]"} {for (i=1; i <= NF; i++) printf "<A>\n\t<T>%s</T>\n\t<S>%s</S>\n\t<AT>%s</AT>\n\t<D>%s</D>\n\t<Id>%s</Id>\n</A>\n", $9,$3,$2,$7,$7,$1}' data.csv
Result is:
<A>
<T>Wolfgang Mozart</T>
<S>Language B1 G1-T1-(3343-1-25274) (3343-1-25278) (3345-1-25676)</S>
<AT>Area A</AT>
<AT>3343-1</AT>
<D>0</D>
<Id>31</Id>
</A>
<A>
<T>Wolfgang Mozart</T>
<S>Language B1 G1-T1-(3343-1-25274) (3343-1-25278) (3345-1-25676)</S>
<AT>Area A</AT>
<AT>3343-1</AT>
<D>0</D>
<Id>31</Id>
</A>
<A>
<T>Wolfgang Mozart</T>
<S>Language B1 G1-T1-(3343-1-25274) (3343-1-25278) (3345-1-25676)</S>
<AT>Area A</AT>
<AT>3345-1</AT>
<D>0</D>
<Id>31</Id>
</A>
CodePudding user response:
One idea using GNU awk:
awk -F';' '
{ patsplit($3,arr,"[0-9]{4}-[1-5]")        # split field #3 into NNNN-N strings
  delete seen                               # clear seen[] array
  for (i in arr)                            # for each NNNN-N string in arr[], store as index in
      seen[arr[i]]                          # seen[] array; duplicates are effectively eliminated
  printf "<A>\n\t<T>%s</T>\n\t<S>%s</S>\n\t<AT>%s</AT>\n", $9,$3,$2
  PROCINFO["sorted_in"]="@ind_str_asc"      # assuming we want NNNN-N strings displayed in sorted order
  for (at in seen)
      printf "\t<AT>%s</AT>\n", at
  printf "\t<D>%s</D>\n\t<Id>%s</Id>\n</A>\n", $7,$1
}
' data.csv
This generates:
<A>
<T>Wolfgang Mozart</T>
<S>Language B1 G1-T1-(3343-1-25274) (3343-1-25278) (3345-1-25676)</S>
<AT>Area A</AT>
<AT>3343-1</AT>
<AT>3345-1</AT>
<D>0</D>
<Id>31</Id>
</A>
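If GNU awk is not available, note that patsplit() and PROCINFO["sorted_in"] are gawk extensions. A rough portable sketch, assuming you are fine with the codes appearing in first-seen order rather than sorted, is to walk field 3 with match() and substr(). The file name sample.csv and the variable names rest, code and result are only placeholders for this illustration, and the pattern is spelled out as four [0-9] classes because some awk versions do not support the {4} interval syntax:

```shell
# Sample record from the question, written to a scratch file (hypothetical name).
printf '%s\n' '31;Area A;Language B1 G1-T1-(3343-1-25274) (3343-1-25278) (3345-1-25676);German;;;0;;Wolfgang Mozart' > sample.csv

result=$(awk -F';' '
{
    printf "<A>\n\t<T>%s</T>\n\t<S>%s</S>\n\t<AT>%s</AT>\n", $9, $3, $2

    rest = $3                  # unscanned remainder of field 3
    split("", seen)            # portable way to empty the seen[] array for each record

    # repeatedly find the leftmost NNNN-N code, then continue after it
    while (match(rest, /[0-9][0-9][0-9][0-9]-[1-5]/)) {
        code = substr(rest, RSTART, RLENGTH)
        if (!(code in seen)) { # print each distinct code only once
            seen[code] = 1
            printf "\t<AT>%s</AT>\n", code
        }
        rest = substr(rest, RSTART + RLENGTH)
    }

    printf "\t<D>%s</D>\n\t<Id>%s</Id>\n</A>\n", $7, $1
}' sample.csv)

printf '%s\n' "$result"
```

For the sample record this should print the same block as above, since the two distinct codes already occur in ascending order in the input. With other data the order may differ from the sorted gawk output.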
