Cleaning my dataframe (similar lines and \xc3\x28 in the field)

I am working on a dataframe with Python.

In my first dataframe, df1, I have:

+-----+-------------------+--------------+-----------------+
| ID  | PUBLICATION TITLE | DATE         | JOURNAL         |
+-----+-------------------+--------------+-----------------+
| 1   | "a"               | "01/10/2000" | "book1"         |
| 2   | "b"               | "09/03/2005" | NaN             |
| NaN | "b"               | "09/03/2005" | "book2"         |
| 5   | "z"               | "21/08/1995" | "book4"         |
| 6   | "n"               | "15/04/1993" | "book9\xc3\x28" |
+-----+-------------------+--------------+-----------------+

I would like to clean this dataframe, but I don't know how to do it in this case. Two things block me.

The first is that rows 2 and 3 seem to be the same record: the publication title and date are identical, and I think a publication title is unique to a journal.

The second is the stray \xc3\x28 at the end of the JOURNAL value in the last row.

How can I clean my dataframe in a smart way, so that I can reuse the code for other dataframes if possible?

CodePudding user response：

First you should remove the row with ID = NaN. This can be done by:

df1 = df1[df1['ID'].notna()]

Then fill in the missing journal of the 2nd row (ID 2) with the value from the row you removed:

df1.iloc[1, df1.columns.get_loc('JOURNAL')] = 'book2'

Finally, for the entry 'book9\xc3\x28', which is at position 3 (the last row) once the NaN row has been removed, you can update it by:

df1.iloc[3, df1.columns.get_loc('JOURNAL')] = 'book9'
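The steps above hard-code row positions, so they will not carry over to another dataframe. A more reusable variant might look like the sketch below. It rests on assumptions not confirmed in the question: that rows sharing a PUBLICATION TITLE really are the same record, and that the JOURNAL value contains the literal characters `\xc3\x28` (a backslash, an `x`, and hex digits) rather than raw bytes. The helper name `clean_publications` and its parameters are hypothetical.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question's table.
df1 = pd.DataFrame({
    "ID": [1, 2, np.nan, 5, 6],
    "PUBLICATION TITLE": ["a", "b", "b", "z", "n"],
    "DATE": ["01/10/2000", "09/03/2005", "09/03/2005",
             "21/08/1995", "15/04/1993"],
    # Assumption: the escape sequence is stored as literal text.
    "JOURNAL": ["book1", np.nan, "book2", "book4", "book9\\xc3\\x28"],
})


def clean_publications(df, key="PUBLICATION TITLE", text_col="JOURNAL"):
    """Hypothetical helper: merge rows sharing `key`, then strip escape residue."""
    # GroupBy.first() takes the first non-null value of each column per group,
    # so partially filled duplicate rows collapse into one complete row.
    merged = df.groupby(key, sort=False, as_index=False).first()
    # Remove literal "\xNN" sequences (backslash, "x", two hex digits).
    merged[text_col] = merged[text_col].str.replace(
        r"\\x[0-9a-fA-F]{2}", "", regex=True
    )
    # Restore the original column order.
    return merged[df.columns.tolist()]


cleaned = clean_publications(df1)
print(cleaned)
```

Grouping on the title alone is only safe if titles are truly unique. If that is uncertain, passing a list such as `["PUBLICATION TITLE", "DATE"]` as `key` would be a more conservative choice.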

CodePudding user response：

What encoding are you using? I recommend using UTF-8 for this.
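To illustrate why the encoding matters: the byte pair \xc3\x28 is not valid UTF-8, which suggests the field was written or read with a mismatched encoding. The sketch below assumes the value started out as raw bytes (the question does not say) and shows how Python's standard decoder treats it.

```python
# Assumption: the JOURNAL field originally arrived as these raw bytes.
raw = b"book9\xc3\x28"

# 0xC3 opens a two-byte UTF-8 sequence, but 0x28 ("(") is not a valid
# continuation byte, so strict decoding raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors="replace" swaps the undecodable byte for U+FFFD.
replaced = raw.decode("utf-8", errors="replace")

# errors="ignore" silently drops the undecodable byte.
ignored = raw.decode("utf-8", errors="ignore")

print(strict_ok)        # False
print(repr(replaced))   # 'book9\ufffd('
print(repr(ignored))    # 'book9('
```

Only the 0xC3 byte is invalid here. The following 0x28 decodes as an ordinary "(", so it survives either error handler and would still need to be stripped separately. When the data comes from a file, passing the correct `encoding` to `pd.read_csv` is usually a better fix than cleaning the values afterwards.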