I want to start directly with my question for the program r: I have a data set with 31159 obs. and 11 variables and I want to get the data set to show me every company in each segment it is established(SIC) and what year.
My data set looks like this:
| Company name | Location | Year | Sales | Assets | SIC |
|---|---|---|---|---|---|
| Company A | USA | 2000 | 50000 | 1500000 | 9997 |
| Company A | USA | 2002 | 50000 | 1500000 | 9997 |
| Company A | USA | 2003 | 80000 | 1800000 | 9997 |
| Company B | USA | 2000 | 100000 | 1000000 | 9997 |
| Company B | USA | 2002 | 110000 | 1100000 | 9997 |
| Company B | USA | 2005 | 90000 | 1200000 | 9997 |
| Company B | USA | 2006 | 100000 | 1200000 | 9997 |
| Company C | USA | 2002 | 90000 | 1900000 | 100 |
| Company C | USA | 2008 | 78000 | 2000000 | 100 |
| Company D | USA | 2008 | 69420 | 964220 | 100 |
I tried the following code to split the data set into peaces:
Data_Set1 <- Split(Data_Set, Data_Set$sic, drop=FALSE)
this divided the data set into one Set with little sets in it with the same SIC code (list of length 335). But afterwards I could not separate it (what is maybe not that useful because of the length of 335 of the set), so that only the same year and same SIC code would be in one table. I need to do that because I want to get the information of each segment marketing spending(MRKT) (which would be calculated by a regression as follows: MRKT= ß(0) ß(1)*1/assets ß(2) * sales/assets) and afterwards I want to calculate what every company should have spend on marketing. But first I need to delete every segment year out which have less than 10 obs. in the table.
What I want to get in the end is different groups with which I can make the mentioned regression MKT= ß(0) ß(1)*1/assets ß(2) * sales/assets (note that the groups should have at least 10 obs./10 companies so the regression works). Also the question how can I make the regression in r with the equation I mentioned?:
| Company name | Location | Year | Sales | Assets | SIC |
|---|---|---|---|---|---|
| Company A | USA | 2000 | 50000 | 1500000 | 9997 |
| Company B | USA | 2000 | 100000 | 1000000 | 9997 |
| Company name | Location | Year | Sales | Assets | SIC |
|---|---|---|---|---|---|
| Company C | USA | 2008 | 78000 | 2000000 | 100 |
| Company D | USA | 2008 | 69420 | 964220 | 100 |
| Company name | Location | Year | Sales | Assets | SIC |
|---|---|---|---|---|---|
| Company C | USA | 2002 | 90000 | 1900000 | 100 |
| Company name | Location | Year | Sales | Assets | SIC |
|---|---|---|---|---|---|
| Company A | USA | 2002 | 50000 | 1500000 | 9997 |
| Company B | USA | 2002 | 110000 | 1100000 | 9997 |
| Company name | Location | Year | Sales | Assets | SIC |
|---|---|---|---|---|---|
| Company B | USA | 2005 | 90000 | 1200000 | 9997 |
| Company name | Location | Year | Sales | Assets | SIC |
|---|---|---|---|---|---|
| Company B | USA | 2006 | 100000 | 1200000 | 9997 |
| Company name | Location | Year | Sales | Assets | SIC |
|---|---|---|---|---|---|
| Company A | USA | 2003 | 80000 | 1800000 | 9997 |
Thanks in advance for any help :)
CodePudding user response:
Update:
with group_split from dplyr package you get a list of quasi new dataframes containing only the intersting information for this group. Once you have your groups you can then easily apply your regression analysis to each list element.
To access a list element you could do df1[[1]] etc... see example:
Once you understand this kind of operations, then perform a new question of the second part of your actual question, :-)
library(dplyr)
df1 <- df %>%
group_by(SIC, Year) %>%
group_split()
df1[[1]]
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company C USA 2002 90000 1900000 100
df1[[2]]
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company C USA 2008 78000 2000000 100
2 Company D USA 2008 69420 964220 100
etc...
[[1]]
# A tibble: 1 x 7
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company C USA 2002 90000 1900000 100
[[2]]
# A tibble: 2 x 7
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company C USA 2008 78000 2000000 100
2 Company D USA 2008 69420 964220 100
[[3]]
# A tibble: 1 x 7
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company A USA 2000 50000 1500000 9997
[[4]]
# A tibble: 1 x 7
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company B USA 2001 100000 1000000 9997
[[5]]
# A tibble: 2 x 7
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company A USA 2002 50000 1500000 9997
2 Company B USA 2002 110000 1100000 9997
[[6]]
# A tibble: 1 x 7
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company A USA 2003 80000 1800000 9997
[[7]]
# A tibble: 1 x 7
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company B USA 2005 90000 1200000 9997
[[8]]
# A tibble: 1 x 7
Company name Location Year Sales Assets SIC
<chr> <chr> <chr> <int> <int> <int> <int>
1 Company B USA 2006 100000 1200000 9997
We could first use group_by and then summarise()
library(dplyr)
df %>%
group_by(Company, name, Year, SIC) %>%
summarise()
Company name Year SIC
<chr> <chr> <int> <int>
1 Company A 2000 9997
2 Company A 2002 9997
3 Company A 2003 9997
4 Company B 2001 9997
5 Company B 2002 9997
6 Company B 2005 9997
7 Company B 2006 9997
8 Company C 2002 100
9 Company C 2008 100
10 Company D 2008 100
