Hi i have a dataset in csv format. the issue with that data set is that it combine different csv files. and from the other csv files it also copied the column names. now i want to remove all column name which is in the middle of dataset
current csv file is like
col1 col2 col3
1 2 3
1 2 3
1 2 3
1 2 3
col1 col2 col3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
col1 col2 col3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
col1 col2 col3
col1 col2 col3
1 2 3
1 2 3
1 2 3
1 2 3
want to change it to this column name only on the top
col1 col2 col3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
CodePudding user response:
you can use this line of codes:
df = pd.read_csv(df_path)
# removing repeating headers
df = df[df.ne(df.columns).any(1)]
This solution compares each row with actual columns and works regardless of non-column rows being number or not
CodePudding user response:
Assuming that your col1 suppose to have integers, you can use:
df = df[df.apply(lambda r: r['col1'].isdigit(), axis=1)]
Or use your own test in lambda for each row.
Also, this will leave the column types as object, so you might want to use .astype(int) (or something more specific for your case).
CodePudding user response:
First read the csv file into a dataframe using pd.read_csv(). Then drop duplicate rows by df = df[df.col1 != "col1"].
It will find all the rows which have col1 in them and drop these rows.
