I am trying to remove all "\n" characters from a string column in PySpark. I tried the following, which does not seem to work:
import pyspark.sql.functions as f  # assumed import behind the "f" alias
df1 = df.withColumn("old_trial_text_clean", f.regexp_replace(f.col("old_trial_text"), "[\\n]", ""))
The resulting dataframe has exactly the same text in both columns:
old_trial_text_clean
'', 'Drug: \n\n. Other Names:\n\n ', '\n\n. Other Names:\n\n ', 'Drug: \n\n. Other Names:\n\n \n\n\n\n. Other Names:\n\n ' , 'Arms\n. Assigned Interventions\n\n\n \n\n\n Drug: \n\n. Other Names:\n\n \n\n\n\n \n\n\n \n\n. Other Names:\n\n \n\n\n\n ALN plus alendronate\n\n\n Drug: \n\n. Other Names:\n\n \n\n\n\n. Other Names:\n\n ', '', '', ' ALN plus alendronate'
old_trial_text
'', 'Drug: \n\n. Other Names:\n\n ', '\n\n. Other Names:\n\n ', 'Drug: \n\n. Other Names:\n\n \n\n\n\n. Other Names:\n\n ' , 'Arms\n. Assigned Interventions\n\n\n \n\n\n Drug: \n\n. Other Names:\n\n \n\n\n\n \n\n\n \n\n. Other Names:\n\n \n\n\n\n ALN plus alendronate\n\n\n Drug: \n\n. Other Names:\n\n \n\n\n\n. Other Names:\n\n ', '', '', ' ALN plus alendronate'
Can someone please let me know what I am missing? I want all the \n characters to be removed from the text.
CodePudding user response:
If the column contains real newline characters, you don't need to escape the backslash or use a character class. A plain "\n" pattern is enough:
df1 = df.withColumn("old_trial_text_clean", f.regexp_replace(f.col("old_trial_text"), "\n", ""))
CodePudding user response:
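To fix the issue I had to use the following regex, which matches a literal backslash followed by the letter n. This suggests the column held the two-character sequence \n rather than real newline characters: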
df1 = df.withColumn("old_trial_text_clean", f.regexp_replace(f.col("old_trial_text"),"\\\\n" ,""))
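
The difference between the patterns in this thread can be checked without a Spark session. The sketch below uses Python's re module as a stand-in for Spark's Java regex engine, on the assumption that both treat these particular escapes the same way. The sample strings and the helper strip_with are hypothetical, modeled on the data in the question.

```python
import re

# Hypothetical sample values modeled on the question's data.
real_newlines = "Drug: \n\n. Other Names:\n\n "        # contains actual newline characters
literal_escapes = "Drug: \\n\\n. Other Names:\\n\\n "  # contains backslash + 'n' pairs


def strip_with(pattern: str, text: str) -> str:
    """Remove every match of `pattern` from `text`."""
    return re.sub(pattern, "", text)


# "[\\n]" reaches the regex engine as [\n], which matches a real newline only.
newline_class = "[\\n]"
# "\\\\n" reaches the regex engine as \\n: a literal backslash followed by 'n'.
escaped_backslash_n = "\\\\n"
# Raw-string pattern that covers both cases at once.
either_form = r"\\n|\n"

cleaned_real = strip_with(newline_class, real_newlines)
untouched_literal = strip_with(newline_class, literal_escapes)
cleaned_literal = strip_with(escaped_backslash_n, literal_escapes)
cleaned_both = strip_with(either_form, real_newlines + literal_escapes)

print(repr(cleaned_real))
print(repr(untouched_literal))
print(repr(cleaned_literal))
print(repr(cleaned_both))
```

If it is unclear which form the column holds, passing a pattern such as r"\\n|\n" as the second argument of f.regexp_replace should remove both, under the same assumption about the regex engine.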
