I have a DataFrame with multiple columns. The last column is a timestamp, which I want pandas to ignore when dropping duplicates. I've used drop_duplicates(subset=...), but it does not work: it returns exactly the same DataFrame.
This is what the DataFrame looks like:
| | id | name | features | timestamp |
|---|---|---|---|---|
| 1 | 34233 | Bob | athletics | 04-06-2022 |
| 2 | 23423 | John | mathematics | 03-06-2022 |
| 3 | 34233 | Bob | english_literature | 06-06-2022 |
| 4 | 23423 | John | mathematics | 10-06-2022 |
| ... | ... | ... | ... | ... |
And these are the data types when doing df.dtypes:
| id | int64 |
| name | object |
| features | object |
| timestamp | object |
Lastly, this is the piece of code I used:
df.drop_duplicates(subset=df.columns.tolist().remove("timestamp"), keep="first").reset_index(drop=True)
The idea is to keep track of changes based on a timestamp IF there are changes to the other columns. For instance, I don't want to keep row 4 because nothing has changed with John. However, I want to keep both rows for Bob because his features changed from athletics to english_literature. Does that make sense?
CodePudding user response:
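The remove method of a list modifies the list in place and returns None, so your call passes subset=None. With subset=None, drop_duplicates compares all columns, including timestamp, so no rows are dropped and the returned DataFrame is identical to the original. You can do as follows: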
- Create the list of columns for the subset: col_subset = df.columns.tolist()
- Remove timestamp: col_subset.remove('timestamp')
- Use the col_subset list in the drop_duplicates() function: df.drop_duplicates(subset=col_subset, keep="first").reset_index(drop=True)
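As a minimal sketch of the steps above, using the four sample rows from the question (the names broken_subset and deduped are illustrative and not from the original post):

```python
import pandas as pd

# Sample rows copied from the question's table.
df = pd.DataFrame(
    {
        "id": [34233, 23423, 34233, 23423],
        "name": ["Bob", "John", "Bob", "John"],
        "features": ["athletics", "mathematics", "english_literature", "mathematics"],
        "timestamp": ["04-06-2022", "03-06-2022", "06-06-2022", "10-06-2022"],
    }
)

# list.remove() mutates the list in place and returns None,
# so the original one-liner effectively passed subset=None (all columns).
broken_subset = df.columns.tolist().remove("timestamp")
print(broken_subset)

# Build the list first, then remove the column in a separate statement.
col_subset = df.columns.tolist()
col_subset.remove("timestamp")

# De-duplicate on id, name and features only; the first occurrence is kept.
deduped = df.drop_duplicates(subset=col_subset, keep="first").reset_index(drop=True)
print(deduped)
```

With this sample, the second John row is dropped and both Bob rows are kept, because Bob's features value differs between them.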
CodePudding user response:
You can do that using the drop method.
Here is a working example: https://abstra.show/724UMzdRXx
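The linked example is not reproduced here. One possible way to use drop for this, offered as an assumption rather than as what the link shows, is to drop timestamp only for the comparison and build a boolean mask with duplicated() (the names is_repeat and result are illustrative):

```python
import pandas as pd

# Sample rows copied from the question's table.
df = pd.DataFrame(
    {
        "id": [34233, 23423, 34233, 23423],
        "name": ["Bob", "John", "Bob", "John"],
        "features": ["athletics", "mathematics", "english_literature", "mathematics"],
        "timestamp": ["04-06-2022", "03-06-2022", "06-06-2022", "10-06-2022"],
    }
)

# Drop timestamp only for the comparison; df itself is left unchanged.
# duplicated() marks every row that repeats an earlier row.
is_repeat = df.drop(columns="timestamp").duplicated(keep="first")

# Keep the rows that are not repeats; timestamp is still present in the result.
result = df[~is_repeat].reset_index(drop=True)
print(result)
```

Note that drop on its own only removes the column. It is the duplicated() mask that filters the rows.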
CodePudding user response:
Try this:
df.drop_duplicates(subset=[x for x in df.columns if x != "timestamp"]).reset_index(drop=True)
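If several columns need to be ignored, the same list comprehension can be wrapped in a small function. This is only a sketch: drop_duplicates_ignoring is a hypothetical helper name, not part of pandas.

```python
import pandas as pd


def drop_duplicates_ignoring(df, ignore):
    """Hypothetical helper: de-duplicate on every column not listed in `ignore`."""
    subset = [col for col in df.columns if col not in ignore]
    return df.drop_duplicates(subset=subset, keep="first").reset_index(drop=True)


# Sample rows copied from the question's table.
df = pd.DataFrame(
    {
        "id": [34233, 23423, 34233, 23423],
        "name": ["Bob", "John", "Bob", "John"],
        "features": ["athletics", "mathematics", "english_literature", "mathematics"],
        "timestamp": ["04-06-2022", "03-06-2022", "06-06-2022", "10-06-2022"],
    }
)

result = drop_duplicates_ignoring(df, ignore=["timestamp"])
print(result)
```

Because keep="first" is used, the row that survives for each combination is the one that appears first in the DataFrame, so sort by timestamp beforehand if the earliest record should be the one kept.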
