Home > Back-end >  Consolidate multiple columns in df to single column based on value / Nested np.where unknown depth
Consolidate multiple columns in df to single column based on value / Nested np.where unknown depth

Time:02-04

I have a dataframe that contains columns like this - ['labels', 'labels.1', 'labels.2'] this continues to some number labels.n

One of these columns contains a value I need. I want to grab that value and add it to df['final_label']. If the value is not in the labels columns, the new column should be blank.

The dataframe looks like this:

|  Age  |  City  | labels | labels.1 | labels.2 | labels.n |
| ----- | ------ | ------ | -------- | -------- | -------- |
|   27  | city A | value1 |  other   |  other   | other    |
|   45  | city B | other  |  other   |  value2  |          |
|   34  | city A | value2 |  other   |  other   | other    |
|   57  | city D | other  |  value2  |          |          |
|   30  | city E | other  |  other   |  other   |          |

I need the final dataframe to look like this:

|  Age  |  City  | labels |
| ----- | ------ | ------ |
|   27  | city A | value1 |
|   45  | city B | value2 |
|   34  | city A | value2 |
|   57  | city D | value2 |
|   30  | city E |        |

I realize a nested np.where() will have an unknown depth so not sure how to implement that in a simple way.

Here is what I have been trying to do:

labels_cols = [col for col in df.columns if "labels" in col]
        
values_list = ['value1', 'value2']
        
### I want to basically do this, but the depth is unknown
df['final_labels'] = np.where(df['labels'].isin(values_list), df['labels'],
                        np.where(df['labels.1'].isin(values_list), df['labels.1'],
                           np.where(df['labels.2'].isin(values_list), df['labels.2'],
                              np.where(df['labels.n'].isin(values_list), df['labels.n'],''))))

### Tried some things like this, but can't get the syntax right
df['final_label'] = df.loc[df.index(values_list), df.columns.str.contains('label')==True] 

CodePudding user response:

You can use fillna with another series, wich uses that series to fill the null values. Assuming all columns used to fill the labels column contain "labels.":

for c in df.columns:
    if "labels." in c:
        df["labels"] = df["labels"].fillna(df[c])
    del df[c]

Then df["labels"] contains the result.

CodePudding user response:

Use filter to keep labels columns then replace unwanted values by nan before forward fill values until the last column:

pattern = r'^(?!value1|value2)$'
df['final_label'] = df.filter(like='labels').replace(pattern, np.nan, regex=True) \
                      .ffill(axis=1).iloc[:, -1].fillna('')
print(df)

# Output
   Age    City  labels labels.1 labels.2 labels.n final_label
0   27  city A  value1    other    other    other       other
1   45  city B   other    other   value2               value2
2   34  city A  value2    other    other    other       other
3   57  city D   other   value2                        value2
4   30  city E   other    other    other                other
  •  Tags:  
  • Related