I'm struggling with replacing with regexp_replace in Pyspark. I have to following string column:
"1233455666, 'ThisIsMyAdress, 1234AB', 24234234234"
A better overview of the string:
| Id | Address | Code |
|---|---|---|
| 1233455666 | 'ThisIsMyAdress, 1234AB' | 24234234234 |
The total string that I receive and process is comma separated, like the example in the beginning. Unfortunately I can't change this format of delivered data. To handle the data well I want to replace the comma between the quotes with nothing.
The only requirement is using regexp_replace.
I've tried the code below, and many more. But with these code the comma separation will break as well. Then the string is one big string with removed comma's.
.withColumn("ColCommasRemoved" , regexp_replace( col("X"), "[,]", ""))
which gave me this output:
"1233455666 'ThisIsMyAdress 1234AB' 24234234234"
The output what I want to achieve:
"1233455666, 'ThisIsMyAdress 1234AB', 24234234234"
CodePudding user response:
Using regexp_replace:
from pyspark.sql import functions as F
df = spark.createDataFrame([("1233455666, 'ThisIsMyAdress, 1234AB', 24234234234",)], ["X"])
result = df.withColumn(
"ColCommasRemoved",
F.split(F.regexp_replace("X", ",(?=[^']*'[^']*(?:'[^']*'[^']*)*$)", ""), ",")
).select(
F.col("ColCommasRemoved")[0].alias("ID"),
F.col("ColCommasRemoved")[1].alias("Address"),
F.col("ColCommasRemoved")[2].alias("Code")
)
result.show()
# ---------- ------------------------ ------------
#|ID |Address |Code |
# ---------- ------------------------ ------------
#|1233455666| 'ThisIsMyAdress 1234AB'| 24234234234|
# ---------- ------------------------ ------------
Or if you want to split directly the original column by , and ignore those inside quotes:
result = df.withColumn(
"split",
F.split(F.col("X"), ",(?=(?:[^']*'[^']*')*[^']*$)")
)
result.show(truncate=False)
# ------------------------------------------------- -----------------------------------------------------
#|X |split |
# ------------------------------------------------- -----------------------------------------------------
#|1233455666, 'ThisIsMyAdress, 1234AB', 24234234234|[1233455666, 'ThisIsMyAdress, 1234AB', 24234234234]|
# ------------------------------------------------- -----------------------------------------------------
