df.iat not fast enough for linear scan over large dataset when grouped-CodePudding

I have a traffic dataset of camera sightings of license plates. It has approx. 100M points that can be thought of as { camera ID, license plate, timestamp } 3-tuples. There are around 2000 unique cameras and a few million unique license plates. The cameras cover most, but not all of the city and most data points (around 70%) have a license plate (as opposed to a blank). I am trying to reconstruct a flow network of common driving routes, and want to this network to be as follows. There should be an edge between two cameras A and B if and only if

There are a lot of direct trips from A to B (ie. a lot of cars that are seen at camera A immediately before being seen at camera B)
Most of these trips took a reasonable amount of time/didn't take too long (ie. we compute the median time of all trips A->B and then only count a trip as a "direct trip" if it's <150% of the median). This eliminates the possibility that most trips A->B went through other cameras but those just happened not to be in the dataset because they are blanks.

The way I'm doing this is as follows. I first sort by time into car-days (trips) so each group represents one car's movement through cameras on one day, in chronological order. This is done by

grouped = full_data.sort_values('time_seconds').groupby(['placa_encoded', 'date'])

Then I create a matrix count_mat = np.zeros((num_cameras, num_cameras)) where count_mat[i, j] represents the number of times camera i immediately precedes camera j in the data (if the given duration between the sightings isn't 'too long'). This is done as follows.

for name, group in grouped: 
    for x in range(len(group)-1):
        c1, c2 = group.iat[x, 0], group.iat[x 1, 0]
        t1, t2 = group.iat[x, 3], group.iat[x 1, 3]
        # from, to camera codes, eg. 480502114, 480221121 
        i, j = code2index(c1), code2index(c2)
        if (abs(t2-t1) < 1.5 * median_mat[i][j]): 
            # this means A->B was a direct trip that someone took in a reasonable time
            count_mat[i][j]  = 1

Here, median_mat[i,j] is the pre-calculated median times over all trips to go from camera i to camera j. t1 and t2 are the timestamps for a data point, and c1 and c2 are camera codes.

The problem is that the above code is taking hours to run over all 100M points, even though it is theoretically O(n) if I assume that iat is constant time, since it goes through each data point once and all matrix accesses are random access and so constant time. I used to use iloc but that was much slower still. Having looked up similar questions on SO, I people say that vectorizing or using list comprehensions in similar situations goes much faster, but I can't see how either of those would apply/be used here, or how I could do this more efficiently. Currently, it takes a couple of hours to go through all the points on my MacBook Pro, but I'm hoping to find a way to have it done in a couple minutes, if not less.

For completeness, code2index is an O(1) helper that maps camera IDs into the range [0, num_cameras] so that we know where each camera goes in the various matrices. It is as follows.

def code2index(code): 
    # takes camera code like 480502114 and converts to index into count_mat 0 <= i < 1832
    if (code in c2i.keys()): return c2i[code]
    else: 
        n = len(c2i.keys())
        # eg. if there are 2 keys, then we want to put new code into dict[2]
        c2i[code] = n 
        return n

Any ideas/optimizations are much appreciated. Any general pointers on how to conceptually think about optimality a priori before sitting down and writing this code would also be great. Thank you!

CodePudding user response：

As an example, compare time profile of your python implementation of code2index(code) with the C implementation

int code2index( const std::string& code )
{
  static std::set<std::string> mySet;
  return mySet.insert(code).second - mySet.begin();
}

I anticipate the C code will be at least 50 times faster.

CodePudding user response：

I would sort all camera ID's, replace them with numbers between [0, len(camera IDs)]. This sorts the (seemingly) unnecessary code2index function. Also, reduce the lookup by taking two indexes at once. Lastly, remove the if-statement and make it a pure boolean computation.

for name, group in grouped: 
    for x in range(len(group)-1):
        c1, c2 = group.iloc[[x, x 1], 0]
        t1, t2 = group.iloc[[x, x 1], 3]

        count_mat[c1][c2]  = ( abs(t2-t1) < 1.5 * median_mat[c1][c2] ) * 1

Hope this speeds up the work!