Apply the Herfindahl-Hirschman Index function to a group of rows for an individual in R

I have a data frame with multiple rows per individual, and each individual has an ID. Each row also has a percentage column, and the percentages for each individual add up to 100 across that individual's rows.

The data frame df is shown below:

ID	Percentage
1	50
1	50
2	25
2	20
2	45
2	10

I want to apply the Herfindahl-Hirschman Index (the hhi function) to compute the index for each person. The function is hhi(x, s) and takes two arguments: x, the object, and s, the value column (the percentage column in this case). So far I've tried the following, but it doesn't work: it still computes the index across the entire data frame.

setDT(df)[,hhi(df, "percentage"), ID]

CodePudding user response：

IRTFM's solution is excellent and elegant. Here is a dplyr solution as well. There is probably a simpler way to do this with an anonymous function or dplyr's group_by.

library(dplyr)
library(hhi)
library(purrr)

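# Compute the HHI for one individual's rows; returns a one-row data frame
compute_hhi <- function(df) {
  # hhi() takes a data frame and the name of the share column
  hhi_value <- hhi(as.data.frame(df), "Percentage")
  id <- df %>% pluck("ID") %>% head(1)
  data.frame(id, hhi = hhi_value)
}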

df_hhi<-df %>%
  group_split(ID, .keep=TRUE) %>%
  map(compute_hhi) %>%
  bind_rows()

df_hhi
#>   id  hhi
#> 1  1 5000
#> 2  2 3150

Created on 2022-01-14 by the reprex package (v2.0.1)
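
The simpler group_by route mentioned above might look something like the following. This is only a sketch: it assumes hhi() accepts any data frame plus a column name (as described in the question), so each group's Percentage vector is wrapped in a one-column data frame before the call. The data.frame() construction of df is just a stand-in for the asker's data.

```r
library(dplyr)
library(hhi)

# Stand-in for the data in the question
df <- data.frame(
  ID = c(1, 1, 2, 2, 2, 2),
  Percentage = c(50, 50, 25, 20, 45, 10)
)

# For each ID, rebuild a one-column data frame from that group's shares
# and hand it to hhi(), which takes a data frame and a column name.
df_hhi <- df %>%
  group_by(ID) %>%
  summarise(hhi = hhi(data.frame(Percentage = Percentage), "Percentage"))

df_hhi
```

This should give the same two values as the group_split version (5000 for ID 1 and 3150 for ID 2) without the helper function or purrr.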

CodePudding user response：

Summary: You spelled Percentage incorrectly, although that appears to come from not copying your code precisely. The real problem, as you pointed out, is that hhi() receives the entire column of Percentage values on each pass through the by groups. The correct way to refer to the subset of data constructed by "by" is the .SD (Subset of Data) construct.

Here's the MCVE

library(hhi)

df <- read.table(text="ID  Percentage
1  50
1  50
2  25
2  20
2  45
2  10", header=TRUE)

library(data.table)

setDT(df)
df[,hhi(df, "percentage"), ID]
#------------------
Error in `[.data.frame`(x, i, j) : undefined columns selected
Error in `[.data.frame`(x, i, j) : undefined columns selected
In addition: Warning message:
In hhi(df, "percentage") : shares, "s", do not sum to 100
#-----------------
df[,hhi(df, "Percentage"), ID]  # correct spelling
   ID   V1
1:  1 8150
2:  2 8150
Warning messages:
1: In hhi(df, "Percentage") : shares, "s", do not sum to 100
2: In hhi(df, "Percentage") : shares, "s", do not sum to 100

That is apparently what you are seeing. Inside j, df still refers to the whole table rather than the subset for the current ID, so hhi() is handed all six rows for every group (hence 8150 twice, along with the warning that the shares do not sum to 100). To pass only the current group's rows, use .SD, which [.data.table sets to the subset of data for each by group.

df[,hhi(.SD, "Percentage"), by=ID]

#-----------
   ID   V1
1:  1 5000
2:  2 3150    # no warnings, more sensible indices of concentration
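
To see what .SD actually holds in each group, a small sketch like this one may help. The table name dt and the result column names rows_in_SD and cols_in_SD are made up for illustration; the hhi(.SD, "Percentage") call is the same one used above.

```r
library(data.table)
library(hhi)

# Stand-in for the data in the question (dt is a hypothetical name)
dt <- data.table(
  ID = c(1, 1, 2, 2, 2, 2),
  Percentage = c(50, 50, 25, 20, 45, 10)
)

# Within each by group, .SD is a data.table holding only that group's rows;
# the grouping column (ID) is left out of .SD by default.
res <- dt[, .(
  rows_in_SD = nrow(.SD),
  cols_in_SD = paste(names(.SD), collapse = ", "),
  hhi        = hhi(.SD, "Percentage")
), by = ID]

res
```

The row counts should come out as 2 and 4, which shows that hhi() now sees one individual's rows at a time rather than all six.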

It's interesting to compare a base version of this operation with the data.table version and another poster's dplyr version. As far as elegance goes, I happen to think the winner is base R, although there is clearly a justification for learning data.table for its speed and its small memory footprint on large datasets.

lapply( split(df, df$ID), hhi, s="Percentage")
$`1`
[1] 5000

$`2`
[1] 3150
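
As a cross-check that needs no packages, the numbers in this thread are consistent with the index being the sum of squared shares. The sketch below assumes that reading (it is inferred from the outputs shown here, not taken from the hhi documentation) and reproduces both the per-ID values and the whole-table value that the original attempt returned.

```r
# Stand-in for the data in the question
df <- data.frame(
  ID = c(1, 1, 2, 2, 2, 2),
  Percentage = c(50, 50, 25, 20, 45, 10)
)

# Sum of squared shares within each ID (assumed equivalent to hhi() here)
by_id <- tapply(df$Percentage, df$ID, function(p) sum(p^2))

# Sum of squared shares over the whole table, ignoring ID
whole_table <- sum(df$Percentage^2)

by_id        # 5000 for ID 1, 3150 for ID 2
whole_table  # 8150
```

The 8150 figure is what you get when the grouping is ignored, which matches the value repeated for both IDs in the failing call.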