How to use sumtable in R to get the summary stats for only a subset of observations

Time:01-12

I have a data frame object that contains a subset of variables (model, mpg, year, etc.).

From it, I created a data frame called reducedset that contains only the first 200 observations.

I am trying to make a summary statistics table for only the model "cars", but I cannot figure it out. I consulted vtable.pdf but am still struggling.

st(reducedset, group='model', group.test=TRUE)

CodePudding user response:

I do not have your data, so I ran your analysis on the Auto dataset from the ISLR package (see An Introduction to Statistical Learning, James et al., 2013). I replaced the condition model == "cars" with year == 70, but the reasoning is the same.

library(ISLR)   # provides the Auto dataset
library(vtable) # provides st() / sumtable()
dta = Auto # Replace this with your data!
reducedset = dta[1:200, ]
st(reducedset[reducedset$year == 70, ], group='name', group.test=TRUE) # Change the condition within square brackets!
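
The same idea can be sketched with data that ships with R, so it runs without the asker's file. This is only an illustration under assumptions: mtcars stands in for reducedset, and cyl == 4 stands in for model == "cars". Once the rows are filtered down to one value of a variable, grouping by that same variable leaves a single group, so group and group.test are left out here. The call to st() is guarded in case vtable is not installed.

```r
# Stand-in data: mtcars ships with R; the asker's 'reducedset' is not available.
reducedset <- mtcars[1:20, ]

# Keep only the rows that satisfy the condition (stand-in for model == "cars").
# subset() drops rows where the condition evaluates to NA.
cars_only <- subset(reducedset, cyl == 4)

# Summarise just those rows. out = "return" asks st() for a data frame
# instead of opening the viewer. Skipped if vtable is not installed.
if (requireNamespace("vtable", quietly = TRUE)) {
  tab <- vtable::st(cars_only, out = "return")
  print(tab)
}
```

With the real data, the filter would be subset(reducedset, model == "cars"), assuming model is stored as character or factor.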

CodePudding user response:

I believe you are looking for something like this. The function my_stats() below splits sub, a random sample of rows from mtcars, by the levels of a grouping_factor (here: vs) and then computes the mean, sd, min, and max of every variable within each group.

# cars data
data(mtcars)

# random sample of 20 rows (drawn with replacement, so rows can repeat)
sub <- mtcars[sample(seq_len(nrow(mtcars)), 20, replace = TRUE), ]

# function to compute the mean, sd, min, and max of each variable in 'df',
# separately for each level of 'grouping_factor'
my_stats <- \(df, grouping_factor){
  sum_stats <- lapply(split(df, df[[grouping_factor]]), \(x) {
    data.frame(sapply(x, \(i) cbind(
      mean(i, na.rm = TRUE), sd(i, na.rm = TRUE),
      min(i, na.rm = TRUE), max(i, na.rm = TRUE))))
  })
  sum_stats <- lapply(sum_stats, \(x) {
    rownames(x) <- c('Mean', 'SD', 'Min', 'Max'); x
  })
  for(i in seq_along(sum_stats)) {
    names(sum_stats)[i] <-
      paste(grouping_factor, '=', levels(as.factor(df[[grouping_factor]]))[i])
  }
  return(sum_stats)
}

Output (for the first three columns in each group; the exact values depend on the random sample):

> lapply(my_stats(df = sub, grouping_factor = 'vs'), '[', 1:3)
$`vs = 0`
           mpg      cyl     disp
Mean 16.650000 7.500000 296.3833
SD    3.234333 0.904534  99.4829
Min  10.400000 6.000000 145.0000
Max  21.000000 8.000000 460.0000

$`vs = 1`
           mpg cyl      disp
Mean 24.350000   4 113.21250
SD    3.000952   0  22.39256
Min  21.500000   4  79.00000
Max  30.400000   4 146.70000

If you would like to see all the output, simply run my_stats(df = sub, grouping_factor = 'vs').
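
If only one group is of interest, as in the question, a shorter route is to filter the rows first and summarise what is left. The following is a minimal base-R sketch, not part of the function above. The seed value and the name one_group are assumptions made for illustration, and the sample is drawn without replacement here so that rows do not repeat.

```r
# Reproducible random sample of 20 rows from mtcars (seed chosen arbitrarily).
set.seed(1)
sub <- mtcars[sample(seq_len(nrow(mtcars)), 20), ]

# Keep only the group of interest (stand-in for model == "cars").
one_group <- sub[sub$vs == 1, ]

# One column per variable, one row per statistic.
stats <- sapply(one_group, function(x) {
  c(Mean = mean(x, na.rm = TRUE), SD = sd(x, na.rm = TRUE),
    Min = min(x, na.rm = TRUE), Max = max(x, na.rm = TRUE))
})

# Show the first three variables, rounded for readability.
print(round(stats[, 1:3], 2))
```

The result is a plain numeric matrix, so it can be passed to as.data.frame() or written out with write.csv() if a table file is needed.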

Note: use function(x) instead of \(x) if your version of R is older than 4.1.0.
