Home > Software engineering >  How to sum over subsets of rows in R
How to sum over subsets of rows in R

Time:01-22

I'm using R to work with the US county-level voting data that the good folks at MIT steward. I'd like to know the total votes each candidate got in each county. For some states, such as Wisconsin, that's easy:

"state", "county_name", "county_fips", "candidate", "party", "candidatevotes", "totalvotes", "mode"<br>
"WISCONSIN", "WINNEBAGO", "55139", "JO JORGENSEN", "LIBERTARIAN", 1629, 94032, "TOTAL"

For other states, such as Utah, it's doable:

"UTAH", "WEBER", "49057", "DONALD J TRUMP", "REPUBLICAN", 0, 111403, "EARLY"<br>
"UTAH", "WEBER", "49057", "DONALD J TRUMP", "REPUBLICAN", 0, 111403, "ELECTION DAY"<br>
"UTAH", "WEBER", "49057", "DONALD J TRUMP", "REPUBLICAN", 0, 111403, "MAIL"<br>
"UTAH", "WEBER", "49057", "DONALD J TRUMP", "REPUBLICAN", 65949, 111403, "TOTAL"

South Carolina, however, is problematic:

"SOUTH CAROLINA", "YORK", "45091", "JOSEPH R BIDEN JR", "DEMOCRAT", 13656, 144050, "ABSENTEE BY MAIL"<br>
"SOUTH CAROLINA", "YORK", "45091", "JOSEPH R BIDEN JR", "DEMOCRAT", 22075, 144050, "ELECTION DAY"<br>
"SOUTH CAROLINA", "YORK", "45091", "JOSEPH R BIDEN JR", "DEMOCRAT", 18, 144050, "FAILSAFE"<br>
"SOUTH CAROLINA", "YORK", "45091", "JOSEPH R BIDEN JR", "DEMOCRAT", 176, 144050, "FAILSAFE PROVISIONAL"<br>
"SOUTH CAROLINA", "YORK", "45091", "JOSEPH R BIDEN JR", "DEMOCRAT", 22950, 144050, "IN-PERSON ABSENTEE"<br>
"SOUTH CAROLINA", "YORK", "45091", "JOSEPH R BIDEN JR", "DEMOCRAT", 133, 144050, "PROVISIONAL"

It seems to me that there should be some way to loop over the FIPS codes and the party name to generate the totals for each county, but I'm stumped.

CodePudding user response:

Does this solve your problem?

library(tidyverse)

df <- read_csv("~/Desktop/countypres_2000-2020.csv")
#> Rows: 72617 Columns: 12
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (8): state, state_po, county_name, county_fips, office, candidate, party...
#> dbl (4): year, candidatevotes, totalvotes, version
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

df %>%
  filter(year == 2020) %>%
  group_by(candidate, county_fips) %>%
  summarise(
    county_name,
    total_votes_per_candidate_per_county = sum(candidatevotes)
    ) %>%
  relocate(candidate, .before = 4) %>%
  distinct() %>%
  arrange(county_fips)
#> `summarise()` has grouped output by 'candidate', 'county_fips'. You can override using the `.groups` argument.
#> # A tibble: 11,902 × 4
#> # Groups:   candidate, county_fips [11,898]
#>    county_fips county_name candidate         total_votes_per_candidate_per_coun…
#>    <chr>       <chr>       <chr>                                           <dbl>
#>  1 01001       AUTAUGA     DONALD J TRUMP                                  19838
#>  2 01001       AUTAUGA     JOSEPH R BIDEN JR                                7503
#>  3 01001       AUTAUGA     OTHER                                             429
#>  4 01003       BALDWIN     DONALD J TRUMP                                  83544
#>  5 01003       BALDWIN     JOSEPH R BIDEN JR                               24578
#>  6 01003       BALDWIN     OTHER                                            1557
#>  7 01005       BARBOUR     DONALD J TRUMP                                   5622
#>  8 01005       BARBOUR     JOSEPH R BIDEN JR                                4816
#>  9 01005       BARBOUR     OTHER                                              80
#> 10 01007       BIBB        DONALD J TRUMP                                   7525
#> # … with 11,892 more rows

Created on 2022-01-20 by the reprex package (v2.0.1)

  •  Tags:  
  • Related