Grouping data via LINQ in PowerShell

Time:01-07

I have an array of objects that I need to group by two attributes, and then calculate the sum of a third attribute for each group. I want to do this via Linq to be as fast as possible.

This is my demo-code so far:

class costs {
    [string] $first;
    [string] $last;
    [int]    $price;
    costs([string]$first, [string]$last, [int] $price){
        $this.first = $first
        $this.last  = $last
        $this.price = $price
    }
}

[costs[]]$costs = @(
    [costs]::new('peter', 'parker', 1),
    [costs]::new('peter', 'parker', 2),
    [costs]::new('paul',  'summer', 3),
    [costs]::new('paul',  'winter', 4),
    [costs]::new('mary',  'winter', 5)
)

# group by full name:
$groupBy = [Func[Object,string]] {$args[0].first + $args[0].last}
$groupResult = [Linq.Enumerable]::GroupBy($costs, $groupBy)

# sum the costs per group:
$selectFunc   = [Func[Object,int]] {$sum=0; foreach($p in $args[0].price){$sum += $p};$sum}
$selectResult = [Linq.Enumerable]::Select($groupResult, $selectFunc)

$selectResult

$selectResult shows me the correct sum of the costs for each user, but I am struggling to get the sum together with the two user attributes from the initial array. I am also not sure whether I could combine both Linq calls into a single one to make it faster. Any input is more than welcome here (except "why Linq?").

Update

Based on the answers I updated the code like this:

class costs {
    [string] $first;
    [string] $last;
    [int]    $price;
    costs([string]$first, [string]$last, [int] $price){
        $this.first = $first
        $this.last  = $last
        $this.price = $price
    }
}

[costs[]]$costs = @(
    [costs]::new('peter', 'parker', 1),
    [costs]::new('peter', 'parker', 2),
    [costs]::new('paul',  'summer', 3),
    [costs]::new('paul',  'winter', 4),
    [costs]::new('mary',  'winter', 5)
)
foreach($doubler in 0..15){$costs += $costs}

cls
write-host "processing $($costs.count) elements."

(measure-command {
    # group by full name:
    $groupBy = [Func[Object,string]] {$args[0].first + $args[0].last}
    $groupResult = [Linq.Enumerable]::GroupBy($costs, $groupBy)

    # sum the costs per group:
    $selectFunc = [Func[Object,Object]]{
        $sum=0
        foreach($p in $args[0].price){
            $sum += $p
        }
        foreach($a in $args[0]) {
            [costs]::new($a.first, $a.last, $sum)
            break
        }
    }
    $selectResult = [Linq.Enumerable]::Select($groupResult, $selectFunc)
    $result = [Linq.Enumerable]::ToArray($selectResult)
}).TotalSeconds

$result

The runtime for over 300000 elements is around 2.5 seconds, which is not that bad. So far I have not found a faster approach without switching to embedded C# code.

Answer:

Change the $selectFunc definition to return [psobject] or [object] instead, then create the resulting object from the existing grouped value:

$selectFunc   = [Func[Object,psobject]]{
    $sum=0
    foreach($p in $args[0].price){
      $sum += $p
    }

    # Output new object with first + last based on input object + sum
    $args[0] |Select first,last,@{Name='sum';Expression={$sum}} -First 1
}

"I want to do this via Linq to be as fast as possible."

I'd strongly suggest you actually test whether this is faster than using, say, Group-Object or a simple hashtable for the calculations. A lot of the overhead that makes PowerShell slow (notably parameter binding) will still apply to your code, so the difference might not be significant, but the readability of your script might suffer significantly.

My personal preference would be to just use the Group-Object cmdlet:

$costs |Group-Object first,last |ForEach-Object {
  $sum = ($_.Group |Measure price -Sum).Sum
  $_.Group |Select -Property first,last,@{N='Sum';E={$sum}} -First 1
}
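
The "simple hashtable" alternative mentioned above could be sketched roughly as follows. This is only an illustration, not a benchmarked solution: it assumes plain [pscustomobject] rows instead of the costs class, and the '|' separator in the key and the variable names $groups and $key are hypothetical choices made for this example.

```powershell
# Sample data mirroring the question (plain objects instead of the costs class).
$costs = @(
    [pscustomobject]@{ first = 'peter'; last = 'parker'; price = 1 }
    [pscustomobject]@{ first = 'peter'; last = 'parker'; price = 2 }
    [pscustomobject]@{ first = 'paul';  last = 'summer'; price = 3 }
    [pscustomobject]@{ first = 'paul';  last = 'winter'; price = 4 }
    [pscustomobject]@{ first = 'mary';  last = 'winter'; price = 5 }
)

# Single pass: keep one result object per (first, last) key and add to its sum.
# An ordered dictionary keeps the groups in first-seen order.
$groups = [ordered]@{}
foreach ($c in $costs) {
    # The separator (an assumption of this sketch) keeps 'ab' + 'c'
    # distinct from 'a' + 'bc'.
    $key = $c.first + '|' + $c.last
    if ($groups.Contains($key)) {
        $groups[$key].sum += $c.price
    }
    else {
        $groups[$key] = [pscustomobject]@{
            first = $c.first
            last  = $c.last
            sum   = $c.price
        }
    }
}

# Materialize the per-group objects as an array.
$result = @($groups.Values)
$result
```

For the five sample rows this produces four groups, with the two peter parker rows combined into a sum of 3. Whether it actually beats the Linq version on a few hundred thousand elements would have to be measured with Measure-Command on the real data.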