Similar to Remove header row in Excel using pandas
I wish to remove headers from an excel sheet.
Using the same example but edited:
Company name
Account Name 1
Account No.1
cabinet_name group pair amount ... result
value1 value1 value1 value1 ... value1
value2 value2 value2 value2 ... value2
totals sum values
Account Name 2
Account No.2
cabinet_name group pair amount ... result
value3 value3 value3 value3 ... value3
value4 value4 value4 value4 ... value4
totals sum values
Account Name 3
Account No.3
cabinet_name group pair amount ... result
value5 value5 value5 value5 ... value5
value6 value6 value6 value6 ... value6
totals sum values
I wish to concatenate it to become like
cabinet_name group pair amount ... result
value1 value1 value1 value1 ... value1
value2 value2 value2 value2 ... value2
value3 value3 value3 value3 ... value3
value4 value4 value4 value4 ... value4
value5 value5 value5 value5 ... value5
value6 value6 value6 value6 ... value6
I've managed to remove the top header by skipping rows e.g.(skiprows = 4). However, This still leaves the other headers and totals which append like so:
cabinet_name group pair amount ... result
value1 value1 value1 value1 ... value1
value2 value2 value2 value2 ... value2
totals sum values
Account Name 2
Account No.2
cabinet_name group pair amount ... result
value3 value3 value3 value3 ... value3
value4 value4 value4 value4 ... value4
totals sum values
Account Name 3
Account No.3
cabinet_name group pair amount ... result
value5 value5 value5 value5 ... value5
value6 value6 value6 value6 ... value6
totals sum values
Would be extremely grateful if someone could enlighten me on how to clean this sheet with pandas as all I've seen online are working on only 1 table in a single excel sheet.
Let me know if I've missed anything out, I would gladly edit this question.
I thought this might be useful,
My usual process for cleaning the files in excel, is to firstly delete the first 4 rows, only leaving
cabinet_name group pair amount ... result
value1 value1 value1 value1 ... value1
value2 value2 value2 value2 ... value2
totals sum values
Account Name 2
Account No.2
cabinet_name group pair amount ... result
value3 value3 value3 value3 ... value3
value4 value4 value4 value4 ... value4
totals sum values
Account Name 3
Account No.3
cabinet_name group pair amount ... result
value5 value5 value5 value5 ... value5
value6 value6 value6 value6 ... value6
totals sum values
Then I will filter group or pair to look for blank values in said column and deleting it.
CodePudding user response:
You could try to remove the unwanted rows like this.
df= df[~df['first_column_name'].str.startswith(('Company name','Account Name','Account No.','cabinet_name'))]
CodePudding user response:
Presumably, the first 3 rows and the rows with "totals" and "sum values" have empty cells, so dropna should eliminate those rows. Then drop_duplicates with keep=False parameter should drop the the duplicate column names:
out = df.replace('', np.nan).replace(' ', np.nan).dropna().drop_duplicates()
I just can't test it because your read_clipboard complains about the format of your data.
