s = ["abcd:{'name':'john'}","defasdf:{'num':123}"]
df = spark.createDataFrame(s, "string").toDF("request")
display(df)
--------------------
| request|
--------------------
|abcd:{'name':'john'}|
| defasdf:{'num':123}|
--------------------
I would like to get as
-------------------- ---------------
| request| sub|
-------------------- ---------------
|abcd:{'name':'john'}|{'name':'john'}|
| defasdf:{'num':123}| {'num':123}|
-------------------- ---------------
I did write as below, but it is throwing error :
TypeError: Column is not iterable
df = df.withColumn("sub",substring(col('request'),locate('{',col('request')),length(col('request'))-locate('{',col('request'))))
df.show()
Can someone please help me ?
CodePudding user response:
You need to use substring function in SQL expression in order to pass columns for position and length arguments. Note also that you need to add 1 to length to get correct result:
import pyspark.sql.functions as F
df = df.withColumn(
"json",
F.expr("substring(request, locate('{',request), length(request) - locate('{', request) 1)")
)
df.show()
# -------------------- ---------------
#| request| json|
# -------------------- ---------------
#|abcd:{'name':'john'}|{'name':'john'}|
#| defasdf:{'num':123}| {'num':123}|
# -------------------- ---------------
You could also consider using regexp_extract function instead of substring like this:
df = df.withColumn(
"json",
F.regexp_extract("request", "^.*:(\\{.*\\})$", 1)
)
