Home > Back-end >  Split or Group DATA and then kick out unnecessary ones
Split or Group DATA and then kick out unnecessary ones

Time:01-25

My Problem is almost completely mentioned in the Title. I need to group/split data and afterwards kick out the ones that have not the necessary amount of data. Afterwards I want to regress the data in a multiple regression (most likely will use lm()).

My Data set has over 50.000 companies which provide data for one year and each line provides data for one firm and each column (have over 30) for one aspect for example sales, asset or the segment they work in (examples are in the table below).

I need to split them into the segment they are working in because I want to regress afterwards the consumption depending on the marketing spending and because not every segment is investing ( is necessary to invest) the same amount I need to split them.

Name Segment Sale Year Asset Another header
A 3401 10000 2000 200000 x
A 3401 20000 2001 250000 x
B 2201 15000 2004 280000 x
B 2201 23000 2009 320000 x
B 2201 28000 2010 390000 x
C 2201 30000 2000 210000 x
C 2201 18000 2004 200000 x
D 1 28000 2000 400000 x
D 1 38000 2001 521000 x

I tried just simply grouping ( group_by())the data but because of the different amount of segments I can't filter out afterwards per hand the segments that have lower than 10 companies. And if I don't filter it or split it it wont be registered in the regression afterwards.

Hope my questions are clear :):

  1. Need to split or group data for the column "segment"
  2. Kick out the groups that have not the necessary amount of companies (for significance in the regression)
  3. If I don't have split up data I need to have a function that I can use for the regression so that the regression only provides data for each segment (split up data most likely is way to huge to do a regression on each of them)

(4. How can I see how many different segments I have?)

CodePudding user response:

Update on OP request: You can perform your regression this way: You have to modify the regression here it is only a meanless example:

library(tidyverse)
library(broom)
df %>% 
  mutate(Segment = factor(Segment)) %>% 
  group_by(Segment) %>% 
  filter(n() > 2) %>% 
  group_split() %>% 
  map_dfr(.f = function(df){
    lm(Sale ~ Asset, data=df) %>% 
      glance() %>% 
      add_column(Segment = unique(df$Segment), .before = 1)
  })

# A tibble: 1 x 13
  Segment r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC  deviance df.residual  nobs
  <fct>       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>     <dbl>       <int> <int>
1 2201       0.0520        -0.264 7173.     0.165   0.712     1  -50.2  106.  105. 154336000           3     5

After filter I used for this example 4 we could use group_split to see the splitted groups:

library(dplyr)
df %>% 
  group_by(Segment) %>% 
  filter(n() > 4) %>% 
  group_split()
  Name  Segment  Sale  Year  Asset Another_header
  <chr>   <int> <int> <int>  <int> <chr>         
1 B        2201 15000  2004 280000 x             
2 B        2201 23000  2009 320000 x             
3 B        2201 28000  2010 390000 x             
4 C        2201 30000  2000 210000 x             
5 C        2201 18000  2004 200000 x  

CodePudding user response:

dplyr

library(dplyr)
dat %>%
  group_by(Segment) %>%
  filter(n() >= 10) %>%   # '10' being some minimum number of rows for regression
  ...

base R

spldat <- split(dat, dat$Segment)
spldat_use <- spldat[ sapply(spldat, nrow) >= 10 ]
  •  Tags:  
  • Related