Objective: I would like a subset of my dataset containing all observations for women in the age groups 18-19, 20-24 and 25-34 (already coded as 7, 8 and 9). To get this I entered the following:
women <- AVQ19[AVQ19$SEX == 2, AVQ19$AGE >= 7 & AVQ19$AGE <= 9,]
However, when I run the code it shows:
23527 obs. of 0 variables (the full dataset has 45483 obs. of 26 variables)
Answer:
The comma in your subsetting should likely be a logical operator (& or |); for "women AND in this age range" that means &.
Options:
women <- AVQ19[AVQ19$SEX == 2 & AVQ19$AGE >= 7 & AVQ19$AGE <= 9,]
women <- subset(AVQ19, SEX == 2 & AGE >= 7 & AGE <= 9)
women <- dplyr::filter(AVQ19, SEX == 2, AGE >= 7, AGE <= 9)
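As a quick sanity check, here is a minimal sketch of the first two options on a made-up data frame. The name toy and its values are hypothetical stand-ins for AVQ19, and only base R is used so it runs without extra packages.

```r
# Hypothetical toy data standing in for AVQ19 (values are made up).
toy <- data.frame(
  SEX = c(1, 2, 2, 2, 1, 2),
  AGE = c(8, 6, 7, 9, 7, 10)
)

# Base R: a single logical row index, then a comma, then nothing,
# which means "keep all columns".
women_base <- toy[toy$SEX == 2 & toy$AGE >= 7 & toy$AGE <= 9, ]

# subset(): the same condition, with column names looked up in the data frame.
women_subset <- subset(toy, SEX == 2 & AGE >= 7 & AGE <= 9)

# Compare the two results and inspect the shape.
isTRUE(all.equal(women_base, women_subset))
dim(women_base)
```

One difference worth knowing: if SEX or AGE contains NA, the [ form returns rows filled with NA for those cases, while subset() drops them.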
I included the dplyr::filter above to demonstrate why it might seem that the comma would make sense elsewhere. Within dplyr::filter, all arguments (after the data argument) are &'ed together. One can always use explicit & here as well in place of the commas,
women <- dplyr::filter(AVQ19, SEX == 2 & AGE >= 7 & AGE <= 9)
for precisely identical results. I suspect the rationale for allowing commas is that it makes for more fluid (perhaps more readable) line-by-line wrapping, such as
... %>%
  filter(
    SEX == 2,
    AGE >= 7,
    AGE <= 9
  )
I'm not suggesting that you switch to dplyr solely for this benefit (though there may be other reasons to learn dplyr). I just thought that if you had seen the comma used this way, you should know that outside of dplyr::filter, comma-separated logicals do not necessarily mean the same thing.
FYI, the reason it returns 23527 obs. of 0 variables is that whatever follows the first comma is taken as the column index, not as a second row condition. AVQ19$AGE >= 7 & AVQ19$AGE <= 9 is a logical vector with one element per row (45483 of them), so it is far longer than the 26 columns it is being asked to select from.
The 23527 is likely the number of rows of AVQ19 where SEX == 2 is true. The 0 variables means that this misplaced column index ended up selecting none of your columns, so the result has rows but nothing in them.
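To see how the second position inside [ behaves, here is a minimal sketch on a made-up data frame. The name toy, its columns and its values are hypothetical, and the column selectors are written out by hand for illustration. It shows one way to end up with rows but zero columns; it is not a diagnosis of what happened in AVQ19.

```r
# Hypothetical toy data (not AVQ19): three columns, six rows.
toy <- data.frame(
  SEX    = c(1, 2, 2, 2, 1, 2),
  AGE    = c(8, 6, 7, 9, 7, 10),
  REGION = c("N", "S", "N", "C", "S", "N")
)

is_woman <- toy$SEX == 2

# First position selects rows, second position selects columns.
# Here the logical vector has one entry per COLUMN: keep 1st and 3rd.
two_cols <- toy[is_woman, c(TRUE, FALSE, TRUE)]
names(two_cols)

# A column selector that picks nothing keeps the rows but drops every column,
# which gives the "N obs. of 0 variables" shape.
no_cols <- toy[is_woman, FALSE]
dim(no_cols)
```

The takeaway is that conditions on rows all belong before the first comma, joined with & or |.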
