Home > database >  PowerShell: Find unique values from multiple CSV files
PowerShell: Find unique values from multiple CSV files

Time:01-26

let's say that I have several CSV files and I need to check a specific column and find values that exist in one file, but not in any of the others. I'm having a bit of trouble coming up with the best way to go about it as I wanted to use Compare-Object and possibly keep all columns and not just the one that contains the values I'm checking.

So I do indeed have several CSV files and they all have a Service Code column, and I'm trying to create a list for each Service Code that only appears in one file. So I would have "Service Codes only in CSV1", "Service Codes only in CSV2", etc.

Based on some testing and a semi-related question, I've come up with a workable solution, but with all of the nesting and For loops, I'm wondering if there is a more elegant method out there.

Here's what I do have:

$files = Get-ChildItem -LiteralPath "C:\temp\ItemCompare" -Include "*.csv"
$HashList = [System.Collections.Generic.List[System.Collections.Generic.HashSet[String]]]::New()
For ($i = 0; $i -lt $files.Count; $i  ){
    $TempHashSet = [System.Collections.Generic.HashSet[String]]::New([String[]](Import-Csv $files[$i])."Service Code")
    $HashList.Add($TempHashSet)
}

$FinalHashList = [System.Collections.Generic.List[System.Collections.Generic.HashSet[String]]]::New()
For ($i = 0; $i -lt $HashList.Count; $i  ){
    $UniqueHS = [System.Collections.Generic.HashSet[String]]::New($HashList[$i])
    For ($j = 0; $j -lt $HashList.Count; $j  ){
        #Skip the check when the HashSet would be compared to itself
        If ($j -eq $i){Continue}
        $UniqueHS.ExceptWith($HashList[$j])
    }
    $FinalHashList.Add($UniqueHS)
}

It seems a bit messy to me using so many different .NET references, and I know I could make it cleaner with a tag to say using namespace System.Collections.Generic, but I'm wondering if there is a way to make it work using Compare-Object which was my first attempt, or even just a simpler/more efficient method to filter each file.

CodePudding user response:

I believe I found an "elegant" solution based on Group-Object, using only a single pipeline:

# Import all CSV files. 
Get-ChildItem $PSScriptRoot\csv\*.csv -File -PipelineVariable file | Import-Csv | 

    # Add new column "FileName" to distinguish the files.
    Select-Object *, @{ label = 'FileName'; expression = { $file.Name } } |

    # Group by ServiceCode to get a list of files per distinct value. 
    Group-Object ServiceCode |

    # Filter by ServiceCode values that exist only in a single file.
    # Sort-Object -Unique takes care of possible duplicates within a single file.
    Where-Object { ( $_.Group.FileName | Sort-Object -Unique ).Count -eq 1 } |

    # Expand the groups so we get the original object structure back.
    ForEach-Object Group |

    # Format-Table requires sorting by FileName, for -GroupBy.
    Sort-Object FileName |

    # Finally pretty-print the result.
    Format-Table -Property ServiceCode, Foo -GroupBy FileName 

Test Input

a.csv:

ServiceCode,Foo
1,fop
2,fip
3,fap

b.csv:

ServiceCode,Foo
6,bar
6,baz
3,bam
2,bir
4,biz

c.csv:

ServiceCode,Foo
2,bla
5,blu
1,bli

Output

   FileName: b.csv    

ServiceCode Foo       
----------- ---       
4           biz       
6           bar       
6           baz       

   FileName: c.csv    

ServiceCode Foo       
----------- ---       
5           blu  

Looks correct to me. The values 1, 2 and 3 are duplicated between multiple files, so they are excluded. 4, 5 and 6 exist only in single files, while 6 is a duplicate value only within a single file.

Understanding the code

Maybe it is easier to understand how this code works, by looking at the intermediate output of the pipeline produced by the Group-Object line:

Count Name                      Group
----- ----                      -----
    2 1                         {@{ServiceCode=1; Foo=fop; FileName=a.csv}, @{ServiceCode=1; Foo=bli; FileName=c.csv}}
    3 2                         {@{ServiceCode=2; Foo=fip; FileName=a.csv}, @{ServiceCode=2; Foo=bir; FileName=b.csv}, @{ServiceCode=2; Foo=bla; FileName=c.csv}}
    2 3                         {@{ServiceCode=3; Foo=fap; FileName=a.csv}, @{ServiceCode=3; Foo=bam; FileName=b.csv}}
    1 4                         {@{ServiceCode=4; Foo=biz; FileName=b.csv}}
    1 5                         {@{ServiceCode=5; Foo=blu; FileName=c.csv}}
    2 6                         {@{ServiceCode=6; Foo=bar; FileName=b.csv}, @{ServiceCode=6; Foo=baz; FileName=b.csv}}

Here the Name contains the unique ServiceCode values, while Group "links" the data to the files.

From here it should already be clear how to find values that exist only in single files. If duplicate ServiceCode values within a single file wouldn't be allowed, we could even simplify the filter to Where-Object Count -eq 1. Since it was stated that dupes within single files may exist, we need the Sort-Object -Unique to count multiple equal file names within a group as only one.

CodePudding user response:

i was able to get unique items as follow

# Get all items of CSVs in a single variable with adding the file name at the last column
$CSVs = Get-ChildItem "C:\temp\ItemCompare\*.csv" | ForEach-Object {
    $CSV = Import-CSV -Path $_.FullName
    $FileName = $_.Name
    $CSV | Select-Object *,@{N='Filename';E={$FileName}}
}
Foreach($line in $CSVs){
$ServiceCode = $line.ServiceCode
$file = $line.Filename
if (!($CSVs | where {$_.ServiceCode -eq $ServiceCode -and $_.filename -ne $file})){
$line
}
}

CodePudding user response:

Here is my take on this fun exercise, I'm using a similar approach as yours with the HashSet but adding [System.StringComparer]::OrdinalIgnoreCase to leverage the .Contains(..) method:

using namespace System.Collections.Generic

# Generate Random CSVs:
$charset = 'abABcdCD0123xXyYzZ'
$ran = [random]::new()
$csvs = @{}
foreach($i in 1..50) # Create 50 CSVs for testing
{
    $csvs["csv$i"] = foreach($z in 1..50) # With 50 Rows
    {
        $index = (0..2).ForEach({ $ran.Next($charset.Length) })
        
        [pscustomobject]@{
            ServiceCode = [string]::new($charset[$index])
            Data = $ran.Next()
        }
    }
}

# Get Unique 'ServiceCode' per CSV:
foreach($key in $csvs.Keys)
{
    # Get all unique `ServiceCode` from the other CSVs
    $tempHash = [HashSet[string]]::new(
        [string[]]($csvs[$csvs.Keys -ne $key].ServiceCode),
        [System.StringComparer]::OrdinalIgnoreCase
    )
    # Filter the unique `ServiceCode`
    $result[$key] = foreach($line in $csvs[$key])
    {
        if(-not $tempHash.Contains($line.ServiceCode))
        {
            $line
        }
    }
}

# Test if the code worked,
# If something is returned from here means it didn't work
foreach($key in $result.Keys)
{
    $tmp = $result[$result.Keys -ne $key].ServiceCode
    foreach($val in $result[$key])
    {
        if($val.ServiceCode -in $tmp)
        {
            $val
        }
    }
}
  •  Tags:  
  • Related