I'm trying to use PySpark to filter the rows that contain the word "home" in a column, but none of the methods I have tried work. For example:
"content" "number"
My home have two bathroom. 1
Your table is old. 3
I'm going to home. 4
I want to filter on both number and content. I tried:
df_spark.filter((col('number') >= 2) & (col('content').like('%home%')))
and
df_spark.filter((col('number') >= 2) & (col('content').contains('home')))
And I expect to obtain this result:
I'm going to home. 4
But I have a problem: the database is too large, and the job fails with one of these errors:
ConnectionRefusedError: [WinError 10061] Impossible to establish the connection. Persistent rejection of destination computer.
or
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
How can I solve this problem with another function? I'm thinking of using expressions.
CodePudding user response:
I have not solved my problem. I am using this:
df_pyspark.filter(df_pyspark.content.like("%home%")==True)
or this:
df_pyspark.filter(df_pyspark.content.contains("home")==True)
This works with a few rows but does not work with the whole database!
CodePudding user response:
I solved my problem like this:
df_pyspark.filter((col('number') > 2) & (df_pyspark.content.contains("home")==True))
