Want to group data set and filter it twice in r-CodePudding

I want to start directly with my question for the program r: I have a data set with 31159 obs. and 11 variables and I want to get the data set to show me every company in each segment it is established(SIC) and what year.

My data set looks like this:

Company name	Location	Year	Sales	Assets	SIC
Company A	USA	2000	50000	1500000	9997
Company A	USA	2002	50000	1500000	9997
Company A	USA	2003	80000	1800000	9997
Company B	USA	2000	100000	1000000	9997
Company B	USA	2002	110000	1100000	9997
Company B	USA	2005	90000	1200000	9997
Company B	USA	2006	100000	1200000	9997
Company C	USA	2002	90000	1900000	100
Company C	USA	2008	78000	2000000	100
Company D	USA	2008	69420	964220	100

I tried the following code to split the data set into peaces:

Data_Set1 <- Split(Data_Set, Data_Set$sic, drop=FALSE)

this divided the data set into one Set with little sets in it with the same SIC code (list of length 335). But afterwards I could not separate it (what is maybe not that useful because of the length of 335 of the set), so that only the same year and same SIC code would be in one table. I need to do that because I want to get the information of each segment marketing spending(MRKT) (which would be calculated by a regression as follows: MRKT= ß(0) ß(1)*1/assets ß(2) * sales/assets) and afterwards I want to calculate what every company should have spend on marketing. But first I need to delete every segment year out which have less than 10 obs. in the table.

What I want to get in the end is different groups with which I can make the mentioned regression MKT= ß(0) ß(1)*1/assets ß(2) * sales/assets (note that the groups should have at least 10 obs./10 companies so the regression works). Also the question how can I make the regression in r with the equation I mentioned?:

Company name	Location	Year	Sales	Assets	SIC
Company A	USA	2000	50000	1500000	9997
Company B	USA	2000	100000	1000000	9997

Company name	Location	Year	Sales	Assets	SIC
Company C	USA	2008	78000	2000000	100
Company D	USA	2008	69420	964220	100

Company name	Location	Year	Sales	Assets	SIC
Company C	USA	2002	90000	1900000	100

Company name	Location	Year	Sales	Assets	SIC
Company A	USA	2002	50000	1500000	9997
Company B	USA	2002	110000	1100000	9997

Company name	Location	Year	Sales	Assets	SIC
Company B	USA	2005	90000	1200000	9997

Company name	Location	Year	Sales	Assets	SIC
Company B	USA	2006	100000	1200000	9997

Company name	Location	Year	Sales	Assets	SIC
Company A	USA	2003	80000	1800000	9997

Thanks in advance for any help :)

CodePudding user response：

Update: with group_split from dplyr package you get a list of quasi new dataframes containing only the intersting information for this group. Once you have your groups you can then easily apply your regression analysis to each list element. To access a list element you could do df1[[1]] etc... see example:

Once you understand this kind of operations, then perform a new question of the second part of your actual question, :-)

library(dplyr)
df1 <- df %>% 
  group_by(SIC, Year) %>% 
  group_split()
  
  
df1[[1]]
  Company name  Location  Year Sales  Assets   SIC
  <chr>   <chr> <chr>    <int> <int>   <int> <int>
1 Company C     USA       2002 90000 1900000   100

df1[[2]]
 Company name  Location  Year Sales  Assets   SIC
  <chr>   <chr> <chr>    <int> <int>   <int> <int>
1 Company C     USA       2008 78000 2000000   100
2 Company D     USA       2008 69420  964220   100

etc...

[[1]]
# A tibble: 1 x 7
  Company name  Location  Year Sales  Assets   SIC
  <chr>   <chr> <chr>    <int> <int>   <int> <int>
1 Company C     USA       2002 90000 1900000   100

[[2]]
# A tibble: 2 x 7
  Company name  Location  Year Sales  Assets   SIC
  <chr>   <chr> <chr>    <int> <int>   <int> <int>
1 Company C     USA       2008 78000 2000000   100
2 Company D     USA       2008 69420  964220   100

[[3]]
# A tibble: 1 x 7
  Company name  Location  Year Sales  Assets   SIC
  <chr>   <chr> <chr>    <int> <int>   <int> <int>
1 Company A     USA       2000 50000 1500000  9997

[[4]]
# A tibble: 1 x 7
  Company name  Location  Year  Sales  Assets   SIC
  <chr>   <chr> <chr>    <int>  <int>   <int> <int>
1 Company B     USA       2001 100000 1000000  9997

[[5]]
# A tibble: 2 x 7
  Company name  Location  Year  Sales  Assets   SIC
  <chr>   <chr> <chr>    <int>  <int>   <int> <int>
1 Company A     USA       2002  50000 1500000  9997
2 Company B     USA       2002 110000 1100000  9997

[[6]]
# A tibble: 1 x 7
  Company name  Location  Year Sales  Assets   SIC
  <chr>   <chr> <chr>    <int> <int>   <int> <int>
1 Company A     USA       2003 80000 1800000  9997

[[7]]
# A tibble: 1 x 7
  Company name  Location  Year Sales  Assets   SIC
  <chr>   <chr> <chr>    <int> <int>   <int> <int>
1 Company B     USA       2005 90000 1200000  9997

[[8]]
# A tibble: 1 x 7
  Company name  Location  Year  Sales  Assets   SIC
  <chr>   <chr> <chr>    <int>  <int>   <int> <int>
1 Company B     USA       2006 100000 1200000  9997

We could first use group_by and then summarise()

library(dplyr)
df %>% 
  group_by(Company, name, Year, SIC) %>% 
  summarise()

   Company name   Year   SIC
   <chr>   <chr> <int> <int>
 1 Company A      2000  9997
 2 Company A      2002  9997
 3 Company A      2003  9997
 4 Company B      2001  9997
 5 Company B      2002  9997
 6 Company B      2005  9997
 7 Company B      2006  9997
 8 Company C      2002   100
 9 Company C      2008   100
10 Company D      2008   100