Live location Cassandra partition key strategy

I was watching a talk about Uber's live location storage using Cassandra and was curious about the partition key. My original line of thinking would be to have the following fields:

ride_id
driver_id
timestamp
latitude
longitude

For the partition key I was between the following:

Composite primary key (ride_id, driver_id)
Primary key (ride_id)
Primary key (driver_id)

When querying I would want to query for the location data for a given trip and potentially the location data for a given driver. Would it make sense to create a composite key? I would want each node to have ~100k rows. Could I also have two separate tables of duplicate data but different indexing so I can query depending on the index?

In the Uber talk, they mentioned they used the uuid (I am assuming related to the driver or ride) and minute offset of the timestamp as the partition key. Is that a better approach?

Answer:

In Cassandra data modelling, the prime objective is to design a table for each app query. Another way of putting it is that tables and app queries have a one-to-one relationship: one app query maps to one table. If there are 10 app queries, you need to design 10 tables.

[EDIT] - Updated my answer after getting additional info in the comments.

For this app query:

I would want to query for the location data for a given trip

you want the table to be partitioned by the trip so it would look like:

CREATE TABLE location_by_trip (
    trip_id text,
    trip_timestamp timestamp,
    latitude float,
    longitude float,
    driver text,
    passenger text,
    ...
    PRIMARY KEY (trip_id, trip_timestamp)
);

And you would retrieve the location at a specific time with:

SELECT latitude, longitude
FROM location_by_trip
WHERE trip_id = ?
  AND trip_timestamp = ?;
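Since trip_timestamp is the clustering column, rows within a trip partition are stored sorted by time, so the same table should also be able to return a slice of a trip or just its latest point. A minimal sketch, assuming the location_by_trip schema above with its default ascending clustering order:

```cql
-- All points recorded for a trip within a time window, oldest first
SELECT trip_timestamp, latitude, longitude
FROM location_by_trip
WHERE trip_id = ?
  AND trip_timestamp >= ?
  AND trip_timestamp < ?;

-- Most recent point recorded for a trip
SELECT trip_timestamp, latitude, longitude
FROM location_by_trip
WHERE trip_id = ?
ORDER BY trip_timestamp DESC
LIMIT 1;
```

Dropping the trip_timestamp restriction entirely returns every point for the trip, which is likely what "the location data for a given trip" means in practice.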

Then for the second app query:

... the location data for a given driver

The table schema would look almost identical except the table is partitioned by driver:

CREATE TABLE location_by_driver (
    driver text,
    trip_timestamp timestamp,
    latitude float,
    longitude float,
    trip_id text,
    passenger text,
    ...
    PRIMARY KEY (driver, trip_timestamp)
);

and you would query the table with the driver as the filter in the WHERE clause:

SELECT latitude, longitude
FROM location_by_driver
WHERE driver = ?
  AND trip_timestamp = ?;
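Because the two tables hold duplicate data, every location update has to be written to both. One common way to keep them in sync is a logged batch, which ensures that either both inserts are eventually applied or neither is. A minimal sketch, assuming the two schemas above; the values are made up for illustration:

```cql
BEGIN BATCH
  -- Same location point, written once per table
  INSERT INTO location_by_trip (trip_id, trip_timestamp, latitude, longitude, driver, passenger)
  VALUES ('trip-123', '2022-01-01 10:15:30+0000', 37.7749, -122.4194, 'driver-42', 'passenger-7');

  INSERT INTO location_by_driver (driver, trip_timestamp, latitude, longitude, trip_id, passenger)
  VALUES ('driver-42', '2022-01-01 10:15:30+0000', 37.7749, -122.4194, 'trip-123', 'passenger-7');
APPLY BATCH;
```

A logged batch across partitions costs more than two plain inserts, so for a high-volume location feed it is worth measuring whether that overhead is acceptable.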

IDs can be UUIDs if you choose, but that is up to you. Just remember that you don't need to create artificial IDs to use as partition keys, because it's generally best to use "natural keys". Examples of natural keys are email addresses, URLs, and fully-qualified phone numbers (including the country and area codes).

You will only need a composite partition key if it takes multiple columns to make the partition key unique. For example, different movies can share the same title, so we generally recommend adding the release year to the partition key to make it unique. If you're interested, I've explained it in a bit more detail with examples in this post: https://community.datastax.com/questions/6171/.
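Another situation where a composite partition key comes up is time bucketing, which is what the question's description of a UUID plus the minute offset of the timestamp sounds like. The sketch below is an assumption about how that could look, not Uber's actual schema; the table name location_by_trip_minute and the minute_bucket column are hypothetical:

```cql
CREATE TABLE location_by_trip_minute (
    trip_id text,
    minute_bucket timestamp,   -- trip_timestamp truncated to the minute, computed by the app
    trip_timestamp timestamp,
    latitude float,
    longitude float,
    PRIMARY KEY ((trip_id, minute_bucket), trip_timestamp)
);

-- Every point recorded for one trip during one minute
SELECT trip_timestamp, latitude, longitude
FROM location_by_trip_minute
WHERE trip_id = ?
  AND minute_bucket = ?;
```

The bucket caps how many rows land in a single partition, at the cost of one query per bucket when reading a longer time range. Whether that trade-off is worth it depends on how many points a single trip or driver accumulates.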

If you're new to Cassandra, have a look at datastax.com/dev. It has lots of free hands-on tutorials that allow you to learn key concepts very quickly, since each tutorial only lasts a few minutes.

The Cassandra Fundamentals course is a good place to start. The Data Modeling tutorial is also a good one for you. The free tutorials are interactive and run inside your browser so there's nothing to install or configure. Cheers!