I have a DataFrame with latitude/longitude values, whose type should be float. However, some rows contain values like -74.128815°, with the '°' character at the end of the string.
| id | Lat | Long |
|---|---|---|
| 1 | 4.807 | -75.684 |
| 2 | 4.5405 | -75.6658 |
| 3 | | -74.128815° |
| 4 | 5.35002 | -72.4002 |
| 5 | 4.6774° | -75.693 |
I want to keep all float values as they are but replace the values containing '°' (and then convert them to float), so that in the end I have this:
| id | Lat | Long |
|---|---|---|
| 1 | 4.807 | -75.684 |
| 2 | 4.5405 | -75.6658 |
| 3 | | -74.128815 |
| 4 | 5.35002 | -72.4002 |
| 5 | 4.6774 | -75.693 |
The DataFrame is named df. I have tried
df[df['Lat'].str.contains('°')]
which raises ValueError: Cannot mask with non-boolean array containing NA / NaN values
I've also tried df['Lat'] = np.where(df['Lat'].str.contains('°'), df['Lat'][:-1], df['Lat']), which raises ValueError: operands could not be broadcast together
CodePudding user response:
You can replace the invalid character with an empty string, and then just use pd.to_numeric:
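degree_sign = '\N{DEGREE SIGN}'  # or degree_sign = "°"
# Column names must match the DataFrame ('Lat', 'Long'); assign back to keep the result
df[['Lat', 'Long']] = df[['Lat', 'Long']].replace(degree_sign, '', regex=True)\
    .apply(pd.to_numeric, errors='coerce')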
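For a version that can be run end to end, here is a minimal sketch on the question's sample data. The frame is rebuilt by hand, and how the values are stored (floats for clean rows, strings for rows carrying the degree sign, NaN for the missing cell) is an assumption, since the question does not show the dtypes. The result is assigned back to the same columns so that df itself is updated.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (assumed storage: floats for clean
# rows, strings for rows with the degree sign, NaN for the missing value).
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Lat": [4.807, 4.5405, np.nan, 5.35002, "4.6774°"],
    "Long": [-75.684, -75.6658, "-74.128815°", -72.4002, -75.693],
})

degree_sign = "\N{DEGREE SIGN}"
cols = ["Lat", "Long"]

# Strip the degree sign from string cells, then convert every cell to a number.
# errors="coerce" turns anything still unparseable into NaN instead of raising.
df[cols] = (
    df[cols]
    .replace(degree_sign, "", regex=True)
    .apply(pd.to_numeric, errors="coerce")
)

print(df.dtypes)
```

Note that errors='coerce' silently turns any other malformed value into NaN, so it may be worth checking df[cols].isna().sum() before and after the conversion.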
CodePudding user response:
Use a regex to remove the trailing non-digit character(s) ('\D+$') in the columns, then convert to float:
df[['Lat', 'Long']].replace('\\D+$', '', regex=True).astype(float)
| | Lat | Long |
|---|---|---|
| 0 | 4.80700 | -75.684000 |
| 1 | 4.54050 | -75.665800 |
| 2 | NaN | -74.128815 |
| 3 | 5.35002 | -72.400200 |
| 4 | 4.67740 | -75.693000 |
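The two attempts from the question can also be repaired directly, which may help explain the errors. This is a minimal sketch, assuming the column holds a mix of floats and strings (the Series below is made up for illustration). str.contains returns NaN for non-string cells, which is what the masking error complains about; passing na=False makes the mask strictly boolean. df['Lat'][:-1] drops the last row of the Series rather than the last character of each value, which is why the shapes could not be broadcast; .str[:-1] slices each string instead.

```python
import numpy as np
import pandas as pd

# Illustrative column: floats for clean rows, a string where the degree sign
# appears (an assumption about how the data is stored).
lat = pd.Series([4.807, 4.5405, np.nan, 5.35002, "4.6774°"], name="Lat")

# na=False maps non-string cells to False, so the mask is purely boolean.
has_degree = lat.str.contains("°", na=False)

# .str[:-1] drops the last character of each string; non-string cells become
# NaN there, but np.where only picks from that branch where has_degree is True.
cleaned = pd.Series(
    np.where(has_degree, lat.str[:-1], lat), index=lat.index, name="Lat"
).astype(float)

print(cleaned)
```

This keeps the shape of the original np.where idea, although the replace-based answers are shorter when several columns need the same cleanup.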
