Remove duplicates from each cell

Time:01-22

I have a file like this and need to remove the duplicates within each cell without changing the order or the format.

Sl.no Name1 Name2  Dis  From  Type      item    Animal         Code
 2    qw     wsa   12    23   car,car   Case    CAT1,CAT1,Dog  p.12>a,p.12>a
23    as     swe   34    2,2  Bus,Bus   Case1,, Dog1,Dog1,,    N.12>a,N.12>a
23    ks     awe   35    .    Bike,Bike Case1,, rat4,rat4,,    5.16>b,5.16>b

The missing data are noted as . (dot).

So far I have tried this with awk:

 awk '{str="";c=0;split($0,arr,","); for (v in arr) c++; for (m=c;m >= 1;m--) for (n=1; n<m;n++) if (arr[m] == arr[n]) delete arr[m]; for (k=1;k<=c;k++) {if (k ==1 ) {s=arr[k] } else if (arr[k] != "") str=str" "arr[k] } print str}'

But it destroys the formatting. Is there another way to do this?

Expected output

Sl.no Name1 Name2  Dis  From  Type      item    Animal        Code
 2    qw     wsa   12    23   car       Case    CAT1,Dog    p.12>a
23    as     swe   34    2    Bus       Case1   Dog1        N.12>a
23    ks     awe   35    .    Bike      Case1   rat4        5.16>b

CodePudding user response:

Since the input looks like it is fixed-width, you can use unpack to split it into columns. Then split each cell on comma and use uniq to remove the duplicates while preserving order. Then, output it with pack.

use warnings;
use strict;
use List::Util qw(uniq);

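# One fixed-width field per column; A* takes the rest of the line.
my $tmpl = 'A6A6A7A5A6A10A8A15A*';
while (<DATA>) {
    my @cols = unpack $tmpl, $_;
    for my $c (@cols) {
        $c =~ s/^\s+//;                # drop leading whitespace
        my @items = split /,/, $c;     # trailing empty items are discarded
        $c = join ',', uniq(@items);   # uniq keeps first-seen order
    }
    print pack($tmpl, @cols), "\n";
}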

__DATA__
Sl.no Name1 Name2  Dis  From  Type      item    Animal         Code
 2    qw     wsa   12    23   car,car   Case    CAT1,CAT1,Dog  p.12>a,p.12>a
23    as     swe   34    2,2  Bus,Bus   Case1,, Dog1,Dog1,,    N.12>a,N.12>a
23    ks     awe   35    .    Bike,Bike Case1,, rat4,rat4,,    5.16>b,5.16>b

Output:

Sl.no Name1 Name2  Dis  From  Type      item    Animal         Code
2     qw    wsa    12   23    car       Case    CAT1,Dog       p.12>a
23    as    swe    34   2     Bus       Case1   Dog1           N.12>a
23    ks    awe    35   .     Bike      Case1   rat4           5.16>b

CodePudding user response:

With sed, assuming the file is tab-delimited:

$ sed -E 's/\t(.*),\1/\t\1/g;s/,+\t/\t/g' file | column -ts$'\t'

Sl.no  Name1  Name2  Dis  From  Type  item   Animal    Code
 2     qw     wsa    12   23    car   Case   CAT1,Dog  p.12>a
23     as     swe    34   2     Bus   Case1  Dog1      N.12>a
23     ks     awe    35   .     Bike  Case1  rat4      5.16>b

CodePudding user response:

Assuming your file is fixed width, and not tab delimited, you can dedupe the fields with a regex. Match any unbroken string of non-whitespace, split on comma, dedupe the result, then join it back with commas. Add spaces for every character removed to fix the formatting.

use strict;
use warnings;

my $hdr = <DATA>;
print $hdr;

while (<DATA>) {
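    # For each run of non-whitespace: dedupe its comma-separated items, then
    # pad with spaces to the original length so the columns stay aligned.
    s{(\S+)}{
        my %s;
        my $n = join ',', grep { !$s{$_}++ } split ',', $1;
        $n . ' ' x (length($1) - length($n));
    }eg;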
    print;
}

__DATA__
Sl.no Name1 Name2  Dis  From  Type      item    Animal         Code
 2    qw     wsa   12    23   car,car   Case    CAT1,CAT1,Dog  p.12>a,p.12>a
23    as     swe   34    2,2  Bus,Bus   Case1,, Dog1,Dog1,,    N.12>a,N.12>a
23    ks     awe   35    .    Bike,Bike Case1,, rat4,rat4,,    5.16>b,5.16>b

Output:

Sl.no Name1 Name2  Dis  From  Type      item    Animal         Code
 2    qw     wsa   12    23   car       Case    CAT1,Dog       p.12>a
23    as     swe   34    2    Bus       Case1   Dog1           N.12>a
23    ks     awe   35    .    Bike      Case1   rat4           5.16>b
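
The "dedupe, then pad back to the original length" step described in this answer can also be tried on a single cell in isolation. The following is a rough sketch, not the Perl code above: `dedupe_pad` is a hypothetical helper name, it wraps awk in a shell function, and it prints the result in square brackets only to make the padding visible. Unlike the Perl version, it drops every empty item, not just trailing ones.

```shell
# Hypothetical helper: dedupe the comma-separated items in one cell and
# pad the result with spaces to the width of the original cell.
dedupe_pad() {
    printf '%s\n' "$1" | awk '{
        n = split($0, parts, ",")
        out = ""
        for (j = 1; j <= n; j++)
            # keep non-empty items the first time they are seen
            if (parts[j] != "" && !seen[parts[j]]++)
                out = (out == "" ? "" : out ",") parts[j]
        fmt = "[%-" length($0) "s]\n"   # left-justify in the original width
        printf fmt, out
    }'
}

dedupe_pad 'car,car'         # prints [car    ]
dedupe_pad 'Case1,,'         # prints [Case1  ]
dedupe_pad 'CAT1,CAT1,Dog'   # prints [CAT1,Dog     ]
```

Because the replacement is exactly as long as the text it replaces, everything to the right of the cell stays in its original column.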

CodePudding user response:

Using any POSIX awk:

$ cat tst.awk
NR==1 {
    # Derive each column's width from the header: a run of non-space
    # characters plus the spaces that follow it.
    hdr = $0
    while ( match(hdr,/[^[:space:]]+[[:space:]]+/) ) {
        width[++i] = RLENGTH
        hdr = substr(hdr,RSTART+RLENGTH)
    }
}
{
    for ( i=1; i<=NF; i++ ) {
        fld = ""
        delete seen
        n = split($i,parts,/,/)
        for ( j=1; j<=n; j++ ) {
            part = parts[j]
            # Keep only non-empty parts not yet seen in this field.
            if ( (part != "") && !seen[part]++ ) {
                fld = (fld == "" ? "" : fld ",") part
            }
        }
        printf "%-*s", width[i], fld
    }
    print ""
}

$ awk -f tst.awk file
Sl.no Name1 Name2  Dis  From  Type      item    Animal         Code
2     qw    wsa    12   23    car       Case    CAT1,Dog       p.12>a
23    as    swe    34   2     Bus       Case1   Dog1           N.12>a
23    ks    awe    35   .     Bike      Case1   rat4           5.16>b

The above assumes you don't really want "From" in the header line to start 1 character sooner than the data values below it nor to have "Code" be right-aligned when everything else is left-aligned.
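
To see which widths the header yields, the header-measuring loop can be run on its own. This is a minimal sketch under the same assumption that the header defines the column layout; `header_widths` is a hypothetical name, and it reads the file named in its arguments or, with no arguments, standard input.

```shell
# Hypothetical helper: print the width of every header column except the
# last one (the last column has no trailing spaces to measure).
header_widths() {
    awk 'NR==1 {
        hdr = $0
        # a column is a run of non-space characters plus the spaces after it
        while (match(hdr, /[^[:space:]]+[[:space:]]+/)) {
            printf "%s%d", sep, RLENGTH
            sep = " "
            hdr = substr(hdr, RSTART + RLENGTH)
        }
        print ""
        exit
    }' "$@"
}

printf '%s\n' 'Sl.no Name1 Name2  Dis  From  Type      item    Animal         Code' | header_widths
# prints: 6 6 7 5 6 10 8 15
```

For the sample header these are the same widths as the hand-written `A6A6A7A5A6A10A8A15A*` template in the unpack answer, so both approaches left-align every value under the start of its header.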
