I have a file like this and need to remove the duplicates in each cell without changing the order or the format:
Sl.no Name1 Name2  Dis  From  Type      item    Animal         Code
2     qw    wsa    12    23   car,car   Case    CAT1,CAT1,Dog  p.12>a,p.12>a
23    as    swe    34    2,2  Bus,Bus   Case1,, Dog1,Dog1,,    N.12>a,N.12>a
23    ks    awe    35    .    Bike,Bike Case1,, rat4,rat4,,    5.16>b,5.16>b
The missing data are noted as . (dot).
So far I have tried this with awk:
awk '{str="";c=0;split($0,arr,","); for (v in arr) c++; for (m=c;m >= 1;m--) for (n=1; n<m;n++) if (arr[m] == arr[n]) delete arr[m]; for (k=1;k<=c;k++) {if (k ==1 ) {s=arr[k] } else if (arr[k] != "") str=str" "arr[k] } print str}'
But it breaks the format. Is there another way to do this?
Expected output
Sl.no Name1 Name2  Dis  From  Type      item    Animal         Code
2     qw    wsa    12    23   car       Case    CAT1,Dog       p.12>a
23    as    swe    34    2    Bus       Case1   Dog1           N.12>a
23    ks    awe    35    .    Bike      Case1   rat4           5.16>b
CodePudding user response:
Since the input looks like it is fixed-width, you can use unpack to split it into columns. Then split each cell on comma and use uniq to remove the duplicates while preserving order. Then, output it with pack.
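use warnings;
use strict;
use List::Util qw(uniq);

# Column widths of the fixed-width input; the last column takes the rest of the line.
my $tmpl = 'A6A6A7A5A6A10A8A15A*';

while (<DATA>) {
    my @cols = unpack $tmpl, $_;
    for my $c (@cols) {
        $c =~ s/^\s+//;                 # strip leading whitespace inside the cell
        my @items = split /,/, $c;      # split drops trailing empty items
        $c = join ',', uniq(@items);    # uniq keeps the first occurrence, in order
    }
    print pack($tmpl, @cols), "\n";
}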
__DATA__
Sl.no Name1 Name2  Dis  From  Type      item    Animal         Code
2     qw    wsa    12    23   car,car   Case    CAT1,CAT1,Dog  p.12>a,p.12>a
23    as    swe    34    2,2  Bus,Bus   Case1,, Dog1,Dog1,,    N.12>a,N.12>a
23    ks    awe    35    .    Bike,Bike Case1,, rat4,rat4,,    5.16>b,5.16>b
Output:
Sl.no Name1 Name2  Dis  From  Type      item    Animal         Code
2     qw    wsa    12   23    car       Case    CAT1,Dog       p.12>a
23    as    swe    34   2     Bus       Case1   Dog1           N.12>a
23    ks    awe    35   .     Bike      Case1   rat4           5.16>b
CodePudding user response:
With sed, assuming the file is tab-delimited:
$ sed -E 's/\t(.*),\1/\t\1/g;s/,+\t/\t/g' file | column -ts$'\t'
Sl.no  Name1  Name2  Dis  From  Type  item   Animal    Code
2      qw     wsa    12   23    car   Case   CAT1,Dog  p.12>a
23     as     swe    34   2     Bus   Case1  Dog1      N.12>a
23     ks     awe    35   .     Bike  Case1  rat4      5.16>b
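The sed command above matches on tabs, so it presumably expects a tab-delimited file, and as far as I can tell it only collapses a repeat that sits at the start of a cell. Under the same tab-delimited assumption, here is a rough awk sketch that drops repeated and empty items anywhere in a cell. The function name dedupe_tsv is made up for illustration; it is not part of the answer above.

```shell
# Hypothetical helper: dedupe comma-separated items in every tab-separated cell.
dedupe_tsv() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
      for (i = 1; i <= NF; i++) {
          n = split($i, parts, ",")
          out = ""
          split("", seen)            # portable way to empty the array
          for (j = 1; j <= n; j++)
              if (parts[j] != "" && !seen[parts[j]]++)
                  out = (out == "" ? "" : out ",") parts[j]
          $i = out                   # assigning a field rebuilds the line with OFS
      }
      print
  }' "$@"
}
```

Run it as `dedupe_tsv file | column -ts$'\t'` to get aligned output, the same way the sed pipeline does.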
CodePudding user response:
Assuming your file is fixed width, and not tab delimited, you can dedupe the fields with a regex. Match any unbroken string of non-whitespace, split on comma, dedupe the result, then join it back with commas. Add spaces for every character removed to fix the formatting.
use strict;
use warnings;

my $hdr = <DATA>;
print $hdr;
while (<DATA>) {
    # Dedupe each run of non-whitespace, then pad it with spaces back to its original length.
    s/(\S+)/ my %s; my $n = join ',', grep { !$s{$_}++ } split ',', $1; $n .= ' ' x (length($1) - length($n)); $n; /eg;
    print;
}
__DATA__
Sl.no Name1 Name2  Dis  From  Type      item    Animal         Code
2     qw    wsa    12    23   car,car   Case    CAT1,CAT1,Dog  p.12>a,p.12>a
23    as    swe    34    2,2  Bus,Bus   Case1,, Dog1,Dog1,,    N.12>a,N.12>a
23    ks    awe    35    .    Bike,Bike Case1,, rat4,rat4,,    5.16>b,5.16>b
Output:
Sl.no Name1 Name2  Dis  From  Type      item    Animal         Code
2     qw    wsa    12    23   car       Case    CAT1,Dog       p.12>a
23    as    swe    34    2    Bus       Case1   Dog1           N.12>a
23    ks    awe    35    .    Bike      Case1   rat4           5.16>b
CodePudding user response:
Using any POSIX awk:
$ cat tst.awk
NR==1 {
    hdr = $0
    while ( match(hdr,/[^[:space:]]+[[:space:]]+/) ) {
        width[++i] = RLENGTH
        hdr = substr(hdr,RSTART+RLENGTH)
    }
}
{
    for ( i=1; i<=NF; i++ ) {
        fld = ""
        delete seen
        n = split($i,parts,/,/)
        for ( j=1; j<=n; j++ ) {
            part = parts[j]
            if ( (part != "") && !seen[part]++ ) {
                fld = (fld == "" ? "" : fld ",") part
            }
        }
        printf "%-*s", width[i], fld
    }
    print ""
}
$ awk -f tst.awk file
Sl.no Name1 Name2  Dis  From  Type      item    Animal         Code
2     qw    wsa    12   23    car       Case    CAT1,Dog       p.12>a
23    as    swe    34   2     Bus       Case1   Dog1           N.12>a
23    ks    awe    35   .     Bike      Case1   rat4           5.16>b
The above assumes you don't really want "From" in the header line to start 1 character sooner than the data values below it nor to have "Code" be right-aligned when everything else is left-aligned.
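If you do want every column to stay exactly where it was in the input, including any odd alignment, one option is to pad each cell back to its original length instead of reprinting it at the header's width. Below is a minimal awk sketch of that idea, assuming the cells themselves never contain blanks. The function name dedupe_cells is hypothetical, and the script is separate from tst.awk above.

```shell
# Hypothetical helper: dedupe each cell and pad it so later columns do not move.
dedupe_cells() {
  awk '
  # Return the comma-separated list "cell" with empty and repeated items dropped.
  function dedupe(cell,    n, i, parts, seen, out) {
      n = split(cell, parts, ",")
      out = ""
      for (i = 1; i <= n; i++) {
          if (parts[i] != "" && !seen[parts[i]]++)
              out = (out == "" ? "" : out ",") parts[i]
      }
      return out
  }
  {
      rest = $0
      line = ""
      # Walk the line one non-blank run at a time, keeping the blanks before it.
      while (match(rest, /[^ \t]+/)) {
          cell = substr(rest, RSTART, RLENGTH)
          # Left-justify the deduped cell in a field as wide as the original cell.
          line = line substr(rest, 1, RSTART - 1) sprintf("%-" RLENGTH "s", dedupe(cell))
          rest = substr(rest, RSTART + RLENGTH)
      }
      sub(/[ \t]+$/, "", line)   # drop padding left at the end of the line
      print line
  }' "$@"
}
```

Call it as `dedupe_cells file`. Because nothing is measured from the header, the header line passes through unchanged and each data value starts in the same character position it had in the input.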
