I have an excel like this, in the 'Elements' sheet of the excel like below: https://i.stack.imgur.com/pT0PY.png
Item Category Value
TRANSPORT A 1
Bus A 2
Car A 3
Automobile A 4
Bike A 5
ACCOMODATION A 6
House A 7
Apartment A 8
DELIVERY B 9
Glovo B 10
Emag B 11
Transporter B 12
ACCOMODATION B 13
Apartment1 B 14
Apartment2 B 15
ACCOMODATION C 16
Rental C 17
Apartment C 18
I want to separate the items(TRANSPORT, ACCOMODATION, DELIVERY) from the elements(Bus, Car, Automobile, Bike...) into a dataframe like below:
Element Item Category Value
0 Bus TRANSPORT A 2
1 Car TRANSPORT A 3
2 Automobile TRANSPORT A 4
3 Bike TRANSPORT A 5
4 House ACCOMODATION A 7
5 Apartment ACCOMODATION A 8
I have managed to write the code to separate the elements from the category A, but somehow the code breaks when I use it for the category B or C or others. It triggers:IndexError: index 8 is out of bounds for axis 0 with size 7 because of the trimming of indexes in the code. I need to extract the values for the Value column only for the elements, not for the items and it breaks because of the length mismatch.
I would need the final dataframe to contain all the information for all the categories from the Excel, not just for one category.
What I tried so far (working for category A only):
import pandas as pd
import numpy as np
df = pd.read_excel('elements.xlsx',
['Elements'], engine='openpyxl')
category_names = df['Elements']['Category'].unique()
df['Elements'] = df['Elements'].groupby(['Category'])
categ_group = ['TRANSPORT', 'ACCOMODATION', 'DELIVERY']
def create_category_df(category_name='A'):
helper_df = df['Elements'].get_group(category_name)
# get index for items
item_index = helper_df[helper_df["Item"].isin(categ_group)].index.to_list()
# get elements and associated items
item_data = np.split(helper_df['Item'].to_numpy(), item_index)
helper_df = helper_df.drop(helper_df.index[item_index]) # drop rows for items
helper_df = helper_df.reset_index(drop=True)
resulted_df = pd.DataFrame(columns=['Element', 'Item', 'Category', 'Value'])
item_list = []
for index in range(len(item_data)):
if item_data[index].size != 0:
resulted_df = resulted_df.append(pd.DataFrame(item_data[index][1:], columns=['Element']))
item_list = len(item_data[index][1:]) * [
item_data[index][0]] # multiply items by number of times it is present and add it to df
resulted_df['Category'] = category_name # 'Hardware EA'
resulted_df['Item'] = item_list
resulted_df['Value'] = helper_df['Value'].values
resulted_df = resulted_df.reset_index(drop=True)
print(resulted_df.to_string())
return resulted_df
create_category_df()
CodePudding user response:
First replace columns names, then repalce non matched values of list by NaNs in Series.where, so possible forward filling missing values in DataFrame.insert for second new column and last remove rows if equal values in both columns in boolean indexing:
categ_group = ['TRANSPORT', 'ACCOMODATION', 'DELIVERY']
df = df.rename(columns={'Item':'Element'})
df.insert(1, 'Item', df['Element'].where(df['Element'].isin(categ_group)).ffill())
df =df[ df['Element'].ne(df['Item'])]
print (df)
Element Item Category Value
1 Bus TRANSPORT A 2
2 Car TRANSPORT A 3
3 Automobile TRANSPORT A 4
4 Bike TRANSPORT A 5
6 House ACCOMODATION A 7
7 Apartment ACCOMODATION A 8
9 Glovo DELIVERY B 10
10 Emag DELIVERY B 11
11 Transporter DELIVERY B 12
13 Apartment1 ACCOMODATION B 14
14 Apartment2 ACCOMODATION B 15
16 Rental ACCOMODATION C 17
17 Apartment ACCOMODATION C 18
