Joining two files that both have duplicate rows

Time:01-08

I am trying to join two files that have identical column 1 and different column 2:

File1

    aaa 1
    bbb 3
    bbb 3
    ccc 1
    ccc 1
    ccc 0

File2

    aaa 2
    bbb 2
    bbb 2
    ccc 1
    ccc 1
    ccc 0

When I try to join them with

    join File1 File2 > File3

I get

    aaa 1 2
    bbb 3 2
    bbb 3 2
    bbb 3 2
    bbb 3 2
    ccc 1 1
    ccc 1 1
    ccc 1 0
    ccc 1 1
    ccc 1 1
    ccc 1 0
    ccc 0 1
    ccc 0 1
    ccc 0 0

join expands the duplicates into every pairing of matching rows, when all I want it to do is go line by line, so the output should be

    aaa 1 2
    bbb 3 2
    bbb 3 2
    ccc 1 1
    ccc 1 1
    ccc 0 0

How do I tell join to ignore duplicates and just combine the files line-by-line?

EDIT: This is being done in a loop with multiple files that all have the same column 1 but different column 2. I am joining the first two files into a temporary file and then looping through the other files joining with that temporary file.

CodePudding user response:

Assumptions:

  • all files have the same number of rows
  • all files have the same values in the first column for the same numbered row
  • the final result set can fit into memory
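
The first two assumptions can be checked before combining anything, since an identical first column also implies an identical row count. A minimal sketch, assuming space-separated columns as in the samples and that `mktemp` is available; `same_keys` is a hypothetical helper name, not part of any tool:

```shell
# Hypothetical helper: succeeds only if every file has the same first column,
# row for row, as the first file given (assumes space-separated columns).
same_keys() {
    ref=$(mktemp) || return 1
    cut -d ' ' -f 1 "$1" > "$ref"
    status=0
    for f in "$@"; do
        # "-" makes cmp read the cut output from standard input
        if ! cut -d ' ' -f 1 "$f" | cmp -s - "$ref"; then
            echo "key column differs: $f" >&2
            status=1
        fi
    done
    rm -f "$ref"
    return "$status"
}
```

It could be called as `same_keys f1 f2 f3 f4 || exit 1` ahead of the combining step.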

Sample input:

$ for f in f{1..4}
do
echo "############ $f"
cat "$f"
done
############ f1
aaa 1
bbb 3
bbb 3
ccc 1
ccc 1
ccc 0
############ f2
aaa 2
bbb 2
bbb 2
ccc 1
ccc 1
ccc 0
############ f3
aaa 12
bbb 12
bbb 12
ccc 11
ccc 11
ccc 10
############ f4
aaa 202
bbb 202
bbb 202
ccc 201
ccc 201
ccc 200

One awk idea:

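awk '
# first file: store each whole line, indexed by row number
FNR==NR { a[FNR]=$0; next }
# later files: append column 2 to the stored line for the same row
        { a[FNR]=a[FNR] OFS $2 }
# after the last file: print the accumulated rows in order
END     { for (i=1;i<=FNR;i++)
              print a[i]
        }
' f1 f2 f3 f4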

This generates:

aaa 1 2 12 202
bbb 3 2 12 202
bbb 3 2 12 202
ccc 1 1 11 201
ccc 1 1 11 201
ccc 0 0 10 200

CodePudding user response:

Based on a suggestion from @Andre Wildberg, this worked best:

    paste File1 <(cut -d " " -f 2 File2)

This allowed me to loop through a list of files:

    cat File1 > tmp

    for file in $files
    do
        paste tmp <(cut -d " " -f 2 "$file") > tmpf
        mv tmpf tmp
    done

    mv tmp FinalFile
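
Note that `paste` separates its inputs with a tab by default, so the loop above appends each new column after a tab rather than a space. If single spaces are wanted throughout, a sketch along the same lines might look like this, assuming space-separated input and that `mktemp` is available; `combine_columns` is a hypothetical name:

```shell
# Hypothetical helper: print the first file followed by column 2 of each
# remaining file, combined row by row with single spaces.
combine_columns() {
    acc=$(mktemp) || return 1
    next=$(mktemp) || return 1
    cat "$1" > "$acc"
    shift
    for f in "$@"; do
        # "-" makes paste read the cut output from standard input
        cut -d ' ' -f 2 "$f" | paste -d ' ' "$acc" - > "$next"
        mv "$next" "$acc"
    done
    cat "$acc"
    rm -f "$acc" "$next"
}
```

For example, `combine_columns File1 File2 File3 > FinalFile` writes the combined rows in one call. Like the loop, it pairs rows purely by position and never compares the first columns.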