Split or Group DATA and then kick out unnecessary ones-CodePudding

My Problem is almost completely mentioned in the Title. I need to group/split data and afterwards kick out the ones that have not the necessary amount of data. Afterwards I want to regress the data in a multiple regression (most likely will use lm()).

My Data set has over 50.000 companies which provide data for one year and each line provides data for one firm and each column (have over 30) for one aspect for example sales, asset or the segment they work in (examples are in the table below).

I need to split them into the segment they are working in because I want to regress afterwards the consumption depending on the marketing spending and because not every segment is investing ( is necessary to invest) the same amount I need to split them.

Name	Segment	Sale	Year	Asset	Another header
A	3401	10000	2000	200000	x
A	3401	20000	2001	250000	x
B	2201	15000	2004	280000	x
B	2201	23000	2009	320000	x
B	2201	28000	2010	390000	x
C	2201	30000	2000	210000	x
C	2201	18000	2004	200000	x
D	1	28000	2000	400000	x
D	1	38000	2001	521000	x

I tried just simply grouping ( group_by())the data but because of the different amount of segments I can't filter out afterwards per hand the segments that have lower than 10 companies. And if I don't filter it or split it it wont be registered in the regression afterwards.

Hope my questions are clear :):

Need to split or group data for the column "segment"
Kick out the groups that have not the necessary amount of companies (for significance in the regression)
If I don't have split up data I need to have a function that I can use for the regression so that the regression only provides data for each segment (split up data most likely is way to huge to do a regression on each of them)

(4. How can I see how many different segments I have?)

CodePudding user response：

Update on OP request: You can perform your regression this way: You have to modify the regression here it is only a meanless example:

library(tidyverse)
library(broom)
df %>% 
  mutate(Segment = factor(Segment)) %>% 
  group_by(Segment) %>% 
  filter(n() > 2) %>% 
  group_split() %>% 
  map_dfr(.f = function(df){
    lm(Sale ~ Asset, data=df) %>% 
      glance() %>% 
      add_column(Segment = unique(df$Segment), .before = 1)
  })

# A tibble: 1 x 13
  Segment r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC  deviance df.residual  nobs
  <fct>       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>     <dbl>       <int> <int>
1 2201       0.0520        -0.264 7173.     0.165   0.712     1  -50.2  106.  105. 154336000           3     5

After filter I used for this example 4 we could use group_split to see the splitted groups:

library(dplyr)
df %>% 
  group_by(Segment) %>% 
  filter(n() > 4) %>% 
  group_split()

  Name  Segment  Sale  Year  Asset Another_header
  <chr>   <int> <int> <int>  <int> <chr>         
1 B        2201 15000  2004 280000 x             
2 B        2201 23000  2009 320000 x             
3 B        2201 28000  2010 390000 x             
4 C        2201 30000  2000 210000 x             
5 C        2201 18000  2004 200000 x

CodePudding user response：

dplyr

library(dplyr)
dat %>%
  group_by(Segment) %>%
  filter(n() >= 10) %>%   # '10' being some minimum number of rows for regression
  ...

base R

spldat <- split(dat, dat$Segment)
spldat_use <- spldat[ sapply(spldat, nrow) >= 10 ]