Use pyspark to verify if a column contains a word

Time:01-30

I'm trying to use PySpark to filter the rows that contain the word "home" in a column, but none of the methods I have tried work. For example:

"content"                        "number"
My home have two bathroom.       1
Your table is old.               3
I'm going to home.               4

I want to filter on both number and content. I tried:

df_spark.filter((col('number') >= 2) & (col('content').like('%home%')))

and

df_spark.filter((col('number') >= 2) & (col('content').contains('home')))

I expect to obtain this result:

I'm going to home.                   4

But I have a problem: the database is too large and the job runs out of memory. It fails with one of these errors:

ConnectionRefusedError: [WinError 10061] Impossible to establish the connection. Persistent rejection of destination computer.

or

Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded

How can I resolve this problem with another function? I'm thinking of using expressions.

CodePudding user response:

I have not solved my problem. I use this:

df_pyspark.filter(df_pyspark.content.like("%home%")==True)

or this:

df_pyspark.filter(df_pyspark.content.contains("home")==True)

This works with a few rows but does not work with the whole database!

CodePudding user response:

I solved my problem like this:

df_pyspark.filter((col('number') > 2) & (df_pyspark.content.contains("home")==True))