Not able to replace \n using regexp_replace in PySpark

Time:02-01

I am trying to remove all "\n" characters from a string column in PySpark. I tried the following, which does not seem to work:

df1 = df.withColumn("old_trial_text_clean", f.regexp_replace(f.col("old_trial_text"), "[\\n]", ""))

The resulting dataframe has exactly the same text in both columns:

old_trial_text_clean
'', 'Drug: \n\n. Other Names:\n\n ', '\n\n. Other Names:\n\n ', 'Drug: \n\n. Other Names:\n\n \n\n\n\n. Other Names:\n\n ' , 'Arms\n. Assigned Interventions\n\n\n \n\n\n Drug: \n\n. Other Names:\n\n \n\n\n\n \n\n\n \n\n. Other Names:\n\n \n\n\n\n  ALN plus alendronate\n\n\n Drug: \n\n. Other Names:\n\n \n\n\n\n. Other Names:\n\n ', '', '', ' ALN plus alendronate'
old_trial_text
'', 'Drug: \n\n. Other Names:\n\n ', '\n\n. Other Names:\n\n ', 'Drug: \n\n. Other Names:\n\n \n\n\n\n. Other Names:\n\n ' , 'Arms\n. Assigned Interventions\n\n\n \n\n\n Drug: \n\n. Other Names:\n\n \n\n\n\n \n\n\n \n\n. Other Names:\n\n \n\n\n\n  ALN plus alendronate\n\n\n Drug: \n\n. Other Names:\n\n \n\n\n\n. Other Names:\n\n ', '', '', ' ALN plus alendronate'

Can someone please let me know what I am missing? I want every \n in the text to be removed.

CodePudding user response:

You don't need to escape the backslash to match a literal newline; pass "\n" directly. Use this version:

df1 = df.withColumn("old_trial_text_clean", f.regexp_replace(f.col("old_trial_text"), "\n", ""))
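
For reference, here is a small sketch in plain Python (no Spark needed) of what each string literal actually contains before the regex engine sees it. The variable names are illustrative only and do not come from the question.

```python
# What each Python literal contains before it reaches the regex engine.
newline_literal = "\n"     # one character: an actual line feed
char_class = "[\\n]"       # four characters: [ \ n ]
raw_equivalent = r"[\n]"   # the same four characters, written as a raw string

print(len(newline_literal))          # 1
print(len(char_class))               # 4
print(char_class == raw_equivalent)  # True
```

As far as I know, a regex engine reads both forms as "match a line feed", so swapping one for the other should only matter if the column really holds newline characters.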

CodePudding user response:

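To fix the above issue I had to use the following regex, which matches a literal backslash followed by the letter n (two characters) rather than a newline character:

df1 = df.withColumn("old_trial_text_clean", f.regexp_replace(f.col("old_trial_text"), "\\\\n", ""))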
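
The difference between the two answers can be reproduced outside Spark. The sketch below uses Python's re module as a stand-in for the Java regex engine behind regexp_replace (an assumption: the two should agree on patterns this simple). The sample strings are hypothetical, modelled on the data in the question, which appears to hold a backslash followed by n rather than real newlines.

```python
import re

# Hypothetical sample: a backslash followed by 'n' (two characters), not a line feed.
escaped_text = "Drug: \\n\\n. Other Names:\\n\\n "
# The same text, but with real line feed characters.
newline_text = "Drug: \n\n. Other Names:\n\n "

# Pattern from the first answer: matches a real line feed only.
newline_pattern = "\n"
# Pattern from the second answer: the regex \\n, a literal backslash then 'n'.
backslash_n_pattern = "\\\\n"

escaped_via_newline = re.sub(newline_pattern, "", escaped_text)
escaped_via_backslash_n = re.sub(backslash_n_pattern, "", escaped_text)
newline_via_newline = re.sub(newline_pattern, "", newline_text)

print(escaped_via_newline == escaped_text)  # True: nothing matched
print(repr(escaped_via_backslash_n))        # 'Drug: . Other Names: '
print(repr(newline_via_newline))            # 'Drug: . Other Names: '
```

If a column might contain both forms, an alternation such as "\\\\n|\n" should cover both, though I have not checked that against the data in the question.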