0 variables? How can I fix it?

Objective: I would like a subset of the data that contains all observations for women in the age groups 18-19, 20-24, and 25-34 (already coded as 7, 8, and 9). To get this, I entered the following:

women <- AVQ19[AVQ19$SEX == 2, AVQ19$AGE >= 7 & AVQ19$AGE <= 9,]

However, when I run the code, it shows:

23527 obs. of 0 variables (the full dataset has 45483 obs. of 26 variables)

CodePudding user response:

The first comma in your subsetting (the one between the SEX condition and the AGE condition) should likely be a logical operator (& or |).

Options:

women <- AVQ19[AVQ19$SEX == 2 & AVQ19$AGE >= 7 & AVQ19$AGE <= 9,]
women <- subset(AVQ19, SEX == 2 & AGE >= 7 & AGE <= 9)
women <- dplyr::filter(AVQ19, SEX == 2, AGE >= 7, AGE <= 9)

I included the dplyr::filter above to demonstrate why it might seem that the comma would make sense elsewhere. Within dplyr::filter, all arguments (after the data argument) are &'ed together. One can always use explicit & here as well in place of the commas,

women <- dplyr::filter(AVQ19, SEX == 2 & AGE >= 7 & AGE <= 9)

for precisely identical results. I suspect the rationale for allowing commas is that it allows for more fluid (perhaps more readable) wrapping with one condition per line, such as

... %>%
  filter(
    SEX == 2,
    AGE >= 7,
    AGE <= 9
  )

I'm not suggesting that you switch to dplyr solely for this benefit (though there may be benefits to learning dplyr for other reasons). I just thought that if you had seen the comma used this way, you should know that outside of dplyr::filter, comma-separated logicals do not necessarily mean the same thing.
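If you want to convince yourself that the two base R options pick the same rows, here is a minimal sketch on made-up data. The data frame toy and its ID column are hypothetical stand-ins (they are not part of AVQ19); only the SEX and AGE codes follow your description.

```r
# Hypothetical toy data standing in for AVQ19 (values are invented).
toy <- data.frame(
  SEX = c(1, 2, 2, 2, 1, 2),
  AGE = c(7, 6, 8, 9, 8, 10),
  ID  = 1:6
)

# Option 1: one logical vector in the row slot, nothing in the column slot.
a <- toy[toy$SEX == 2 & toy$AGE >= 7 & toy$AGE <= 9, ]

# Option 2: subset() evaluates the condition inside the data frame.
b <- subset(toy, SEX == 2 & AGE >= 7 & AGE <= 9)

# Both keep only the rows with SEX == 2 and AGE in 7..9,
# and both keep every column.
print(a)
print(b)
```

In this toy data only the rows with ID 3 and 4 satisfy all three conditions, so both a and b should have 2 rows and all 3 columns.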

FYI, the reason it returns 23527 obs. of 0 variables is that the expression after your first comma is taken as a column index. Your data frame has many more rows than columns (45483 vs. 26), so AVQ19$AGE >= 7 & AVQ19$AGE <= 9 is a logical vector with one element per row, far longer than the number of columns you have.

The 23527 is likely the number of rows of AVQ19 where SEX == 2 is true. The 0 variables is likely because none of the matches occur in the first few rows, i.e., the first occurrence of AGE between 7 and 9 is on a row number that is greater than ncol(AVQ19).
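To see the column-index effect in isolation, here is a small sketch on invented data. The data frame toy is hypothetical and deliberately has as many rows as columns (3 of each), so the row-wise logical vector lines up exactly with the columns; that is an assumption made for the demonstration and is not true of AVQ19. The trailing comma from your original call is left out to keep the example simple.

```r
# Hypothetical 3 x 3 toy data (values are invented).
toy <- data.frame(
  SEX = c(2, 2, 1),
  AGE = c(6, 8, 9),
  ID  = 1:3
)

# The AGE condition sits in the COLUMN slot, so its TRUE/FALSE values
# (FALSE, TRUE, TRUE) pick columns 2 and 3 instead of filtering rows.
wrong <- toy[toy$SEX == 2, toy$AGE >= 7 & toy$AGE <= 9]
print(names(wrong))  # column names that were kept
print(nrow(wrong))   # rows kept by the SEX condition alone

# If the column selector is FALSE for every column, the rows are still
# filtered by SEX, but no variables are left.
none <- toy[toy$SEX == 2, c(FALSE, FALSE, FALSE)]
print(dim(none))
```

The second call is the toy analogue of your result: the row condition still applies, but the column selector keeps nothing, which is how you end up with some observations of 0 variables.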