Design pattern for filtering by attributes

I am a scientist trying to learn better software engineering practices so that my analysis scripts are more robust and easier for others to use. I am having trouble finding the best pattern for the following problem: is there an OOP framework for easy subsetting by instance attributes? For instance:

I have a large table of vehicle trajectories over time [[x, y]]. These are different drivers in different cars, time-trialing on a track.

import pandas as pd
import numpy as np
df = pd.DataFrame({'Form': {0:'SUV', 1:'Truck', 2:'SUV', 3:'Sedan', 4:'SUV', 5:'Truck'},
                   'Make': {0:'Ford', 1:'Toyota', 2:'Honda', 3:'Ford', 4:'Honda', 5:'Toyota'},
                   'Color': {0:'White', 1:'Black', 2:'Gray', 3:'White', 4:'White', 5:'Black'},
                   'Driver age': {0:25, 1:37, 2:21, 3:54, 4:50, 5:67},
                   'Trajectory': {0: np.array([[0, 0], [0.25, 1.7], [1.2, 1.8], [4.5, 4.0]]), 
                                  1: np.array([[0, 0], [0.15, 1.3], [1.6, 1.3], [4.2, 4.1]]), 
                                  2: np.array([[0, 0], [0.24, 1.2], [1.3, 1.6], [4.1, 3.9]]), 
                                  3: np.array([[0, 0], [0.45, 1.6], [1.8, 1.8], [4.2, 4.6]]), 
                                  4: np.array([[0, 0], [0.85, 1.9], [1.5, 1.7], [4.5, 4.3]]), 
                                  5: np.array([[0, 0], [0.35, 1.8], [1.5, 1.8], [4.6, 4.1]]), 
                                 }
                  })

A function takes a trajectory as input and analyzes it over time, e.g.:

def avg_velocity(trajectory):
    v = []
    for t in range(len(trajectory) - 1):
        v.append(trajectory[t + 1] - trajectory[t])
    return np.mean(v)

I'd like to write a program that can analyze the trajectories of a particular subset, e.g. the average velocities of all drives in SUVs, or the average velocities of all drives in white vehicles.

My current solution is:

1. Store the table in a pandas DataFrame.
2. Subset by different criteria (e.g. df.groupby(by=['Form'])).
3. Iterate over the resulting groups and pass each trajectory to avg_velocity(trajectory).
4. Store the results in a nested dictionary, e.g. results[vehicle_make][vehicle_form][vehicle_color][driver_age].
5. Retrieve information by accessing the nested dictionary by key.
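
A rough sketch of the workflow described above might look like this. It is an illustrative reconstruction, not the actual script: the three-row table is a cut-down copy of the data above, only one grouping key is used, and avg_velocity is written with np.diff, which should be equivalent to the loop version.

```python
import numpy as np
import pandas as pd

def avg_velocity(trajectory):
    # Mean over all step-to-step displacements (x and y components together)
    return np.mean(np.diff(trajectory, axis=0))

# Cut-down, illustrative copy of the table (dict-of-dicts form, as above)
df = pd.DataFrame({
    "Form": {0: "SUV", 1: "Truck", 2: "SUV"},
    "Make": {0: "Ford", 1: "Toyota", 2: "Honda"},
    "Trajectory": {
        0: np.array([[0, 0], [0.25, 1.7], [1.2, 1.8], [4.5, 4.0]]),
        1: np.array([[0, 0], [0.15, 1.3], [1.6, 1.3], [4.2, 4.1]]),
        2: np.array([[0, 0], [0.24, 1.2], [1.3, 1.6], [4.1, 3.9]]),
    },
})

# Group by one attribute, analyze each trajectory, and store results by key
results = {}
for form, group in df.groupby("Form"):
    results[form] = [avg_velocity(t) for t in group["Trajectory"]]

print(results["SUV"])
```

With more grouping keys, the dictionary gains one nesting level per key, which is where retrieval starts to become awkward.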

A natural OOP way might be to create a class Drive with many attributes (e.g. Drive.make, Drive.form, Drive.age, etc.).

class Drive:
    
    def __init__(self, form, make, color, age, trajectory):
        self.form = form
        self.make = make
        self.color = color
        self.age = age
        self.trajectory = trajectory
...

However, I am not sure how to quickly subset by a particular criterion once each drive has been separated into its own instance. Say I suddenly want to plot the average velocity of all drives by Toyotas. I'd have to iterate through a list of Drive instances and check whether drive.make == 'Toyota' for each one. Is this a common problem with OOP?
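
For concreteness, the iteration described above could be sketched as follows. The select function is a hypothetical helper written for this illustration (it is not part of any library), the three drives are copied from the table above, and avg_velocity is written with np.diff instead of the explicit loop.

```python
import numpy as np

class Drive:
    def __init__(self, form, make, color, age, trajectory):
        self.form = form
        self.make = make
        self.color = color
        self.age = age
        self.trajectory = trajectory

def avg_velocity(trajectory):
    # Mean over all step-to-step displacements (x and y components together)
    return np.mean(np.diff(trajectory, axis=0))

def select(drives, **criteria):
    # Hypothetical helper: keep drives whose attributes equal every given keyword
    return [
        d for d in drives
        if all(getattr(d, name) == value for name, value in criteria.items())
    ]

drives = [
    Drive("SUV", "Ford", "White", 25,
          np.array([[0, 0], [0.25, 1.7], [1.2, 1.8], [4.5, 4.0]])),
    Drive("Truck", "Toyota", "Black", 37,
          np.array([[0, 0], [0.15, 1.3], [1.6, 1.3], [4.2, 4.1]])),
    Drive("Truck", "Toyota", "Black", 67,
          np.array([[0, 0], [0.35, 1.8], [1.5, 1.8], [4.6, 4.1]])),
]

# Linear scan over all instances for every query
toyotas = select(drives, make="Toyota")
toyota_velocities = [avg_velocity(d.trajectory) for d in toyotas]
```

Every query is a full pass over the list, which is fine for small collections but is essentially a hand-written version of what a table library does with a boolean mask.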

CodePudding user response：

Instead of your current solution or its OOP alternative, I would suggest sticking with Pandas, like this:

# First, compute velocity

df["Average_velocity"] = df["Trajectory"].apply(avg_velocity)
df = df.drop(columns="Trajectory")

# Secondly, define a helper function to filter the dataframe as needed

def filter_by_attributes(df, form=None, make=None, color=None, age=None):
    pairs = {"Form": form, "Make": make, "Color": color, "Driver age": age}
    # Start from an all-True mask and narrow it for each attribute that was given.
    # Testing "is not None" (rather than truthiness) keeps falsy values such as 0 usable.
    mask = pd.Series(True, index=df.index)
    for col, value in pairs.items():
        if value is not None:
            mask &= df[col] == value
    return df[mask]

Then, you can, for instance, find the average velocity of all white SUVs, like this:

print(filter_by_attributes(df, form="SUV", make=None, color="White", age=None))
# Output
  Form   Make  Color  Driver age  Average_velocity
0  SUV   Ford  White          25          1.416667
4  SUV  Honda  White          50          1.466667
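
If a single number per subset is wanted rather than the matching rows, grouping on the computed column is one option. The following is a minimal sketch, assuming the per-drive averages are computed as above; the trajectories are kept in a plain list here only to keep the example short, and avg_velocity is written with np.diff instead of the explicit loop.

```python
import numpy as np
import pandas as pd

def avg_velocity(trajectory):
    # Mean over all step-to-step displacements (x and y components together)
    return np.mean(np.diff(trajectory, axis=0))

# Same six drives as in the question, trajectories held in a plain list
trajectories = [
    np.array([[0, 0], [0.25, 1.7], [1.2, 1.8], [4.5, 4.0]]),
    np.array([[0, 0], [0.15, 1.3], [1.6, 1.3], [4.2, 4.1]]),
    np.array([[0, 0], [0.24, 1.2], [1.3, 1.6], [4.1, 3.9]]),
    np.array([[0, 0], [0.45, 1.6], [1.8, 1.8], [4.2, 4.6]]),
    np.array([[0, 0], [0.85, 1.9], [1.5, 1.7], [4.5, 4.3]]),
    np.array([[0, 0], [0.35, 1.8], [1.5, 1.8], [4.6, 4.1]]),
]
df = pd.DataFrame({
    "Form": ["SUV", "Truck", "SUV", "Sedan", "SUV", "Truck"],
    "Color": ["White", "Black", "Gray", "White", "White", "Black"],
})
df["Average_velocity"] = [avg_velocity(t) for t in trajectories]

# Mean of the per-drive averages for each vehicle form
by_form = df.groupby("Form")["Average_velocity"].mean()

# The same idea with several attributes at once
by_form_color = df.groupby(["Form", "Color"])["Average_velocity"].mean()

print(by_form)
print(by_form_color)
```

The grouped result is indexed by the attribute values, so by_form["SUV"] or by_form_color[("SUV", "White")] plays the role of the nested-dictionary lookup.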

Following your comment:

"Is it possible to store objects (i.e. instances of a user-defined class) within a dataframe? This would be congruent with the above: columns of the dataframe are retained as metadata, and another column is devoted to storing individual instances."

Yes, you can do it like this:

car = Drive(
    "SUV", "Ford", "White", 25, np.array([[0, 0], [0.25, 1.7], [1.2, 1.8], [4.5, 4.0]])
)
df = pd.DataFrame({key: [value] for key, value in car.__dict__.items()})

df["average_velocity"] = df["trajectory"].apply(avg_velocity)
df = df.drop(columns="trajectory")

df.loc[0, "instance"] = car

And so:

print(df)
# Output
  form  make  color  age  average_velocity                                       instance
0  SUV  Ford  White   25          1.416667  <__main__.Drive object at 0x00000236EE1F80B8>

print(df.loc[0, "instance"].form)
# Output
SUV