I'm working in R and I have a file where there is a string in the first row of the .csv, a blank row, and then my date.
I tried to remove this by doing:
df <- df[-c(1,2), ]
Instead of removing this string and the empty row, it removes the first two rows of the data. This indicates to me that R is disregarding this as part of my dataset.
I would like to remove it, but lack the vocabulary as to what this data is so that I can figure out how to remove it. I don't want to remove anything that might indicate important information like:
# A tibble: 0 x 5
# ... with 5 variables: row <int>, col <int>, expected <chr>,
# actual <chr>, file <chr>
So my assumption is to not delete the header.
Here is the output when I type head(df):
# A tibble: 6 x 2
`Category: All categories` ...2
<chr> <chr>
1 NA NA
2 Month st: (United States)
3 2004-01 25
4 2004-02 21
5 2004-03 20
6 2004-04 24
Here is the file, if that helps. It is output from Google Trends.
Thank you, I'm new at this, but really trying to learn it. :)
CodePudding user response:
Your use of
-c(0,1)suggests that you believe R's indexing to be 0-based (similar to python and C); it is not, R's indexing is 1-based.Most (all?) CSV-reading functions have a
skip=argument:head(read.csv("2004_present_st.csv", skip=2)) # Month st...United.States. # 1 2004-01 25 # 2 2004-02 21 # 3 2004-03 20 # 4 2004-04 24 # 5 2004-05 24 # 6 2004-06 24 ### all verified to work similarly readr::read_csv("2004_present_st.csv", skip=2) data.table::fread("2004_present_st.csv", skip=2) vroom::vroom("2004_present_st.csv",skip=2)In case you've already read in the data and need to repair it, though ...
df <- read.csv("2004_present_st.csv") nms <- unlist(df[2,]) df <- df[-(1:2),] names(df) <- nms df[] <- lapply(df, type.convert, as.is = TRUE) head(df) # Month st: (United States) # 3 2004-01 25 # 4 2004-02 21 # 5 2004-03 20 # 6 2004-04 24 # 7 2004-05 24 # 8 2004-06 24 str(df) # 'data.frame': 217 obs. of 2 variables: # $ Month : chr "2004-01" "2004-02" "2004-03" "2004-04" ... # $ st: (United States): int 25 21 20 24 24 24 23 21 20 23 ...Notes on this cleanup:
- I used
-(1:2)instead of-c(1,2), mostly the same. df[] <- lapply(df, ...)is a way to iterate a function over all columns and return them into place, preserving the class of"data.frame"for thedfobject. Had we donedf <- lapply(df, ...)instead,dfwould now be a list;as.data.frame(lapply(df, ...))would work, but I thinkdf[] <- lapply(df, ...)looks/works better.- When you read in the data the first time, it's likely that due to the data in the top couple of rows, some or all of the columns might be misconstrued as strings.
type.convertis used byread.csv(and other places) to try to convert strings into integers and numbers;as.is=TRUEprevents strings from being converted tofactors, feel free to omit this if you prefer.
- I used
