I am working with a pandas DataFrame in Python.
My first DataFrame, df1, contains:
+-----+-------------------+--------------+-----------------+
| ID  | PUBLICATION TITLE | DATE         | JOURNAL         |
+-----+-------------------+--------------+-----------------+
| 1   | "a"               | "01/10/2000" | "book1"         |
| 2   | "b"               | "09/03/2005" | NaN             |
| NaN | "b"               | "09/03/2005" | "book2"         |
| 5   | "z"               | "21/08/1995" | "book4"         |
| 6   | "n"               | "15/04/1993" | "book9\xc3\x28" |
+-----+-------------------+--------------+-----------------+
I would like to clean this DataFrame, but two things are blocking me.
The first is that rows 2 and 3 seem to describe the same record: the publication title is the same, and I believe a publication title is unique to a journal.
The second is the stray \xc3\x28 at the end of the JOURNAL value in the last row.
How can I clean my DataFrame in a smart way, so that I can reuse the code on other DataFrames if possible?
CodePudding user response:
First you should remove the row with ID = NaN. This can be done by:
df1 = df1[df1['ID'].notna()]
Then update the journal of the 2nd row:
df1.iloc[1, df1.columns.get_loc('JOURNAL')] = 'book2'
Finally, for the 'book9\xc3\x28' entry, overwrite the value. Once the NaN row has been dropped, this row sits at position 3 (position 4 would raise an IndexError):
df1.iloc[3, df1.columns.get_loc('JOURNAL')] = 'book9'
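Since the question asks for something reusable, here is a rough sketch that avoids hard-coded row positions. It is not the only way to do it and rests on two assumptions that are mine, not the asker's: rows sharing the same PUBLICATION TITLE and DATE are the same record, and JOURNAL values should only contain ASCII letters, digits and spaces. The helper name clean_publications is made up for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the example DataFrame from the question.
df1 = pd.DataFrame({
    "ID": [1, 2, np.nan, 5, 6],
    "PUBLICATION TITLE": ["a", "b", "b", "z", "n"],
    "DATE": ["01/10/2000", "09/03/2005", "09/03/2005", "21/08/1995", "15/04/1993"],
    "JOURNAL": ["book1", np.nan, "book2", "book4", "book9\xc3\x28"],
})


def clean_publications(df, keys, text_col):
    """Hypothetical helper: strip stray characters, then merge duplicate rows."""
    out = df.copy()
    # Assumption: anything that is not an ASCII letter, digit or space is noise.
    # NaN values are left untouched by the .str accessor.
    out[text_col] = out[text_col].str.replace(r"[^A-Za-z0-9 ]+", "", regex=True)
    # Assumption: rows with identical key columns are the same record.
    # GroupBy.first() keeps the first non-null value of every other column,
    # so the two partial "b" rows collapse into one complete row.
    merged = out.groupby(keys, as_index=False, sort=False).first()
    # Restore the original column order (group keys are moved to the front).
    return merged[df.columns]


cleaned = clean_publications(df1, ["PUBLICATION TITLE", "DATE"], "JOURNAL")
print(cleaned)
```

Note that the regular expression would also strip legitimate accented characters, so loosen it if your journal names are not plain ASCII. Also, groupby drops rows whose key columns are NaN by default, which may or may not be what you want.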
CodePudding user response:
What encoding are you using when you load the data? I recommend reading it as "utf8".
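To see why the encoding matters here: \xc3\x28 is not valid UTF-8, because \xc3 announces a two-byte character and \x28 is not a valid continuation byte. The small sketch below assumes the value arrived as raw bytes (a guess on my part, the question does not say) and shows what Python's built-in error handlers do with it.

```python
raw = b"book9\xc3\x28"  # assumed raw bytes behind the JOURNAL value

# Strict decoding fails on the invalid sequence.
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

# "replace" swaps the bad byte for U+FFFD, "ignore" silently drops it.
# In both cases the trailing \x28 survives as "(".
replaced = raw.decode("utf-8", errors="replace")
ignored = raw.decode("utf-8", errors="ignore")

print(is_valid_utf8)
print(repr(replaced))
print(repr(ignored))
```

If the data comes from a file, pandas.read_csv accepts an encoding argument and, from pandas 1.3 on, an encoding_errors argument that takes the same handler names.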
