The data is from Citi Bike NYC, covering January 2019 to December 2019. The data can be viewed here:

I rounded ride_distance_meters to the nearest 100, hoping to get a smoother distribution, but that did not happen:

ridedata_clean$ride_distance_mtrs_rnd <- round(ridedata_clean$ride_distance_meters,-2)

ridedata_clean %>% count(ride_distance_mtrs_rnd) |> ggplot(aes(x = ride_distance_mtrs_rnd, y= n)) +
geom_histogram(stat='identity', freq = FALSE, color = "#F8766D") + xlim(0,3000) + ylim(0,27000)

Can anyone tell me what I am doing wrong?

Answer:

Here are the histograms. There is no need to count first, nor to use an identity stat: geom_histogram bins the data and counts how many data points fall in each bin automatically.
The first histogram uses the default number of bins.

library(ggplot2)

ggplot(ridedata_clean, aes(ride_distance_mtrs)) +
  geom_histogram(fill = "#F8766D") +
  ggtitle(label = "Default number of bins: 30") +
  theme_bw()

Now vary the number of bins: as the number of bars increases, the histograms become less smooth.
Base R has functions that return the number of bins according to different criteria; see the documentation for the functions below.
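
The functions meant here are presumably the `nclass.*` family from grDevices (`nclass.Sturges`, `nclass.scott`, `nclass.FD`); that is an assumption, since the thread does not name them. A minimal sketch on simulated data shows what they return. The names `toy_dist` and `bins_vec` and the simulated distances are made up for illustration and are not the Citi Bike data.

```r
# Simulated stand-in for the ride distances (hypothetical data, not Citi Bike's)
set.seed(2019)
toy_dist <- rexp(1000, rate = 1 / 1500)

# Number of bins suggested by three rules that ship with base R (grDevices)
bins_vec <- c(
  Sturges = grDevices::nclass.Sturges(toy_dist),  # ceiling(log2(n) + 1)
  Scott   = grDevices::nclass.scott(toy_dist),    # based on the standard deviation
  FD      = grDevices::nclass.FD(toy_dist)        # Freedman-Diaconis, based on the IQR
)
print(bins_vec)
```

Each value could then be passed to `geom_histogram(bins = ...)` in the same way the default of 30 is used above.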

Alternatively, the bins can be set with binwidth, which controls the bins' cut points rather than the total number of bins.

# bin widths
binwidths_vec <- c(10, 50, 100, 1000)
gg_plots2 <- vector("list", length(binwidths_vec))

for(i in seq_along(binwidths_vec)) {
  main_title <- sprintf("Histogram of distances, bin width: %d", binwidths_vec[i])
  gg_plots2[[i]] <- ggplot(ridedata_clean, aes(ride_distance_mtrs)) +
    geom_histogram(binwidth = binwidths_vec[i], fill = "#F8766D") +
    ggtitle(label = main_title) +
    theme_bw()
  if(i == 1)
    gg_plots2[[i]] <- gg_plots2[[i]] + ylim(0, 1.25e5)
}

gridExtra::grid.arrange(grobs = gg_plots2)

There is a visible data artifact: the first bar seems to be due to values near zero. After inspecting the data, I found that around 2.16% of the distances are equal to zero. I have not determined which file or files those values come from.

sum(ridedata_clean$ride_distance_mtrs == 0)
# [1] 443021

100*mean(ridedata_clean$ride_distance_mtrs == 0)
# [1] 2.155642
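
A haversine distance is exactly zero whenever the start and end coordinates coincide, for example a ride returned to the station it started from; whether that accounts for every zero in this data has not been checked. If the zeros should be left out of the histogram, one option is to flag and drop them first. This is only a sketch on a small made-up data frame (`toy`, `is_zero` and `toy_nonzero` are hypothetical names), and it assumes that dropping the zeros is acceptable for the analysis.

```r
# Made-up distances standing in for ridedata_clean (hypothetical values)
toy <- data.frame(ride_distance_mtrs = c(0, 0, 350, 1200, 0, 980, 2400, 150))

# Flag the zero-distance rides, then count them and compute their share
is_zero  <- toy$ride_distance_mtrs == 0
n_zero   <- sum(is_zero)
pct_zero <- 100 * mean(is_zero)

# Keep only the rides with a positive distance
toy_nonzero <- toy[!is_zero, , drop = FALSE]

cat("zeros:", n_zero, " share (%):", pct_zero, " rows kept:", nrow(toy_nonzero), "\n")
```

The filtered data frame can be passed to `ggplot()` in place of the full one.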

Data

Assuming that the files were downloaded to the current directory, the following code was used to transform lon/lat coordinates to distances in meters. This takes a long time with a mid-range year 2022 PC running R 4.2.1 on Windows 11. I have not parallelized the code.

Note that the number of rows here, 20551697, is 5 more than the 20551692 rows in the question's posted data.

library(readr)
library(geosphere)

read_citibike_file <- function(filename, cols, verbose = TRUE) {
  Y <- readr::read_csv(
    file = filename, 
    col_types = "dddd", 
    col_select = all_of(cols)
  )
  Y <- as.matrix(Y)
  if(verbose) {
    mat_size <- round(utils::object.size(Y)/1024/1024, digits = 1)
    cat("file:", filename, "\tdata size:", mat_size, "Mb\trows:", nrow(Y), "\n")
  }
  Y
}
convert_lonlat_dist <- function(data, start_cols, end_cols) {
  d <- numeric(nrow(data))
  for(i in seq_along(d)) {
    start <- data[i, start_cols, drop = TRUE]
    end <- data[i, end_cols, drop = TRUE]
    tryCatch(
      d[i] <- distm(start, end, fun = distHaversine),
      error = function(e) print(conditionMessage(e))
    )
  }
  d
}

zip_files <- list.files(pattern = "\\.zip$")
for(f in zip_files) unzip(f, exdir = ".")

fls <- list.files(pattern = "\\.csv$")

lon1 <- "start station longitude"
lat1 <- "start station latitude"
lon2 <- "end station longitude"
lat2 <- "end station latitude"
start_cols <- c(lon1, lat1)
end_cols <- c(lon2, lat2)
lonlat_cols <- c(start_cols, end_cols)

dist_mtrs <- sapply(fls, \(x) {
  y <- read_citibike_file(x, cols = lonlat_cols)
  convert_lonlat_dist(y, start_cols, end_cols)
})

dist_mtrs <- unlist(dist_mtrs)
length(dist_mtrs)
# [1] 20551697
ridedata_clean <- data.frame(ride_distance_mtrs = dist_mtrs)
nrow(ridedata_clean)
# [1] 20551697
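
Most of the running time probably comes from calling distm once per row. geosphere's distHaversine also accepts two-column matrices of points, so the loop may not be needed at all. The sketch below illustrates the idea with a hand-written, vectorized haversine formula in base R, so that it runs without geosphere. The function name `haversine_m` and the toy coordinates are hypothetical; the radius 6378137 m is geosphere's default, and I have not timed this against the code above.

```r
# Vectorized haversine distance in meters (hand-written sketch, not geosphere's code).
# All four arguments are numeric vectors of the same length, in decimal degrees.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * r * asin(pmin(1, sqrt(a)))
}

# Toy matrix with the same column names as the Citi Bike files (made-up coordinates)
toy <- matrix(
  c(-73.99, -73.98, 0,    # start station longitude
     40.73,  40.75, 0,    # start station latitude
    -73.99, -73.97, 0,    # end station longitude
     40.73,  40.76, 1),   # end station latitude
  ncol = 4,
  dimnames = list(NULL, c("start station longitude", "start station latitude",
                          "end station longitude", "end station latitude"))
)

# One call computes every row's distance at once
d <- haversine_m(
  toy[, "start station longitude"], toy[, "start station latitude"],
  toy[, "end station longitude"],   toy[, "end station latitude"]
)
print(d)
```

The first toy row has identical start and end coordinates, so its distance is zero, which is the same kind of value that produces the first bar in the histograms.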