If I have the following dataframe:
| ID | other |
|---|---|
| 219218 | 34 |
| 823#32 | 47 |
| unknown | 42 |
| 8#3#32 | 32 |
| 1#3#5# | 97 |
| 6#3### | 27 |
I want to obtain the following result:
| ID | other |
|---|---|
| 219218 | 34 |
| 823#32 | 47 |
| unknown | 42 |
| 8#3#32 | 32 |
| unknown | 97 |
| unknown | 27 |
I am using the following code which works.
for i in range(len(df)):
ident = testing.loc[i, 'ID']
if ident.count('#') > 2:
df.loc[i, 'ID'] = 'unknown'
Is there a way to make it more optimal, bearing in mind that I am going to apply the code to a dataframe of more than 60,000 rows?
Thank you for your help.
CodePudding user response:
For an efficient solution, use vectorial methods and assign with loc:
df.loc[df['ID'].str.count('#').gt(2), 'ID'] = 'unknown'
output:
ID other
0 219218 34
1 823#32 47
2 unknown 42
3 8#3#32 32
4 unknown 97
5 unknown 27
CodePudding user response:
Personally speaking, I prefer apply function on the dataframe:
def replaceRow(value):
if value.count("#") > 2:
return "unknown"
else:
return value
df["ID"] = df["ID"].apply(replaceRow)
df
Output
| ID | other |
|---|---|
| 219218 | 34 |
| 823#32 | 47 |
| unknown | 42 |
| 8#3#32 | 32 |
| unknown | 97 |
| unknown | 27 |
