Home > Blockchain >  using pandas find out repeated string elements from multi-index series dataframe
using pandas find out repeated string elements from multi-index series dataframe

Time:01-13

In Pandas dataframe, one of the column am having is series datatype i.e food_column, from this column I have to extract the output column

   Input : food_column
 [ 'bread','bread','bread'] ,
 [ 'meat','butter','butter'] ,
 [ 'meat', 'butter','bread','meat']
        ['butter']
 ['bread','meat','bread','meat']

Output : main_column
       ['bread']
       ['butter']
        ['meat']
       ['butter']
       ['bread']

condition:

  1. if any string element is repeated more than once that should be picked as an output element,
  2. if any two or three elements count are same it should be picked as np.random.choice from that two or three elements
  3. if only one element is there in any of the rows, assign/map that element to the output column
  4. else mark it as "unknown" to output column

CodePudding user response:

import pandas as pd
import random
from collections import Counter
import numpy as np

food_list = [[ 'bread','bread','bread'] ,
 ['meat','butter','butter'] ,
 ['meat', 'butter','bread','meat'],
 ['butter'],
 ['bread','meat','bread','meat'],
 ['']]

food_series = pd.Series(food_list)

df = pd.DataFrame({'food_column': food_series})

# randomize list item order, since dict item order is constant in Python 3.6  
df['random_food_list'] = [random.sample(z, len(z)) for z in df['food_column'].to_list()]
# get counts
df['food_counts'] = df['random_food_list'].apply(lambda x: Counter(x))
# get key with max value
df['main_column'] = df['food_counts'].apply(lambda x:  max(x, key=x.get))
# replace empty strings with 'unknown'
df['main_column'] = np.where(df['main_column'] == '', 'unknown', df['main_column'])
  •  Tags:  
  • Related