TypeError: Column is not iterable-CodePudding

s = ["abcd:{'name':'john'}","defasdf:{'num':123}"]
df = spark.createDataFrame(s, "string").toDF("request")
display(df)

 -------------------- 
|             request|
 -------------------- 
|abcd:{'name':'john'}|
| defasdf:{'num':123}|
 --------------------

I would like to get as

 -------------------- --------------- 
|             request|            sub|
 -------------------- --------------- 
|abcd:{'name':'john'}|{'name':'john'}|
| defasdf:{'num':123}|    {'num':123}|
 -------------------- ---------------

I did write as below, but it is throwing error :

TypeError: Column is not iterable

df = df.withColumn("sub",substring(col('request'),locate('{',col('request')),length(col('request'))-locate('{',col('request'))))
df.show()

Can someone please help me ?

CodePudding user response：

You need to use substring function in SQL expression in order to pass columns for position and length arguments. Note also that you need to add 1 to length to get correct result:

import pyspark.sql.functions as F

df = df.withColumn(
    "json",
    F.expr("substring(request, locate('{',request), length(request) - locate('{', request)   1)")
)

df.show()
# -------------------- --------------- 
#|             request|           json|
# -------------------- --------------- 
#|abcd:{'name':'john'}|{'name':'john'}|
#| defasdf:{'num':123}|    {'num':123}|
# -------------------- ---------------

You could also consider using regexp_extract function instead of substring like this:

df = df.withColumn(
    "json",
    F.regexp_extract("request", "^.*:(\\{.*\\})$", 1)
)