Home > OS >  Pyspark: Regex_replace commas between quotes
Pyspark: Regex_replace commas between quotes

Time:01-29

I'm struggling with replacing with regexp_replace in Pyspark. I have to following string column:

"1233455666, 'ThisIsMyAdress, 1234AB', 24234234234"

A better overview of the string:

Id Address Code
1233455666 'ThisIsMyAdress, 1234AB' 24234234234

The total string that I receive and process is comma separated, like the example in the beginning. Unfortunately I can't change this format of delivered data. To handle the data well I want to replace the comma between the quotes with nothing.

The only requirement is using regexp_replace.

I've tried the code below, and many more. But with these code the comma separation will break as well. Then the string is one big string with removed comma's.

.withColumn("ColCommasRemoved" , regexp_replace( col("X"), "[,]", ""))

which gave me this output:

"1233455666 'ThisIsMyAdress 1234AB' 24234234234"

The output what I want to achieve:

"1233455666, 'ThisIsMyAdress 1234AB', 24234234234"

CodePudding user response:

Using regexp_replace:

from pyspark.sql import functions as F

df = spark.createDataFrame([("1233455666, 'ThisIsMyAdress, 1234AB', 24234234234",)], ["X"])

result = df.withColumn(
    "ColCommasRemoved",
    F.split(F.regexp_replace("X", ",(?=[^']*'[^']*(?:'[^']*'[^']*)*$)", ""), ",")
).select(
    F.col("ColCommasRemoved")[0].alias("ID"),
    F.col("ColCommasRemoved")[1].alias("Address"),
    F.col("ColCommasRemoved")[2].alias("Code")
)

result.show()
# ---------- ------------------------ ------------ 
#|ID        |Address                 |Code        |
# ---------- ------------------------ ------------ 
#|1233455666| 'ThisIsMyAdress 1234AB'| 24234234234|
# ---------- ------------------------ ------------ 

Or if you want to split directly the original column by , and ignore those inside quotes:

result = df.withColumn(
    "split",
    F.split(F.col("X"), ",(?=(?:[^']*'[^']*')*[^']*$)")
)

result.show(truncate=False)
# ------------------------------------------------- ----------------------------------------------------- 
#|X                                                |split                                                |
# ------------------------------------------------- ----------------------------------------------------- 
#|1233455666, 'ThisIsMyAdress, 1234AB', 24234234234|[1233455666,  'ThisIsMyAdress, 1234AB',  24234234234]|
# ------------------------------------------------- ----------------------------------------------------- 

  •  Tags:  
  • Related