I'm working on a huge file that has names in columns that contain extraneous values (like the "|" key) that I want to remove, but for some reason my str.replace function only seems to apply to some rows in the column.
My column in the dataframe summary looks something like this:
Labels
test|test 1
test 2
test 3
test|test 4
test|test 5
test 6
As you can see, some columns are already how i want them to be, only containing the name "test #", but some have "test|" in front, which I want removed.
My function to remove them is like this:
correction = summary["Labels"].str.replace('test\|', '')
It seems to work for most of the values, but when I check for pipes ("|") in the dataframe (once i merged correction with summary), it says it finds 9330 of them:
found = summary[summary['Labels'].str.contains('|',regex=False)]
print(len(found))
print(found['Labels'].value_counts())
Results
9330
test|test-667 59
test|test-765 40
test|test-1810 39
test|test-685 36
test|test-1077 33
..
Does anyone know why this is, and how i can fix it?
CodePudding user response:
Try str.extract:
df['Labels'] = df['Labels'].str.extract(r'\|(.*)', expand=False) \
.combine_first(df['Labels'])
print(df)
# Output
Labels
0 test 1
1 test 2
2 test 3
3 test 4
4 test 5
5 test 6
CodePudding user response:
You were on the right track. Replace raw string as follows
summary['Labels'] = summary['Labels'].str.replace(r'test\|','', regex=True)
Labels
0 test 1
1 test 2
2 test 4
