PySpark hangs on S3 read

Time:01-14

I want to read files from S3 with PySpark (local installation, not EMR). The problem is that it freezes on read, without any timeout or error.

Versions:

  • PySpark 3.2.2
  • Hadoop 3.3.1
  • Hadoop-AWS 3.3.1 .jar
  • AWS Java SDK bundle 1.12.136 .jar (also tried 1.11.901)

The JAR files are placed directly in the SPARK_HOME/jars directory, so I don't need to specify them separately here (this approach has worked for my other Spark jobs with other JAR dependencies).

My PySpark code:

from pyspark.sql import SparkSession

# filled in with real values in the actual code
aws_access_key_id = ""
aws_secret_access_key = ""

spark = (
    SparkSession
    .builder
    .appName("Test S3 app")

    .config("spark.hadoop.fs.s3a.access.key", aws_access_key_id)
    .config("spark.hadoop.fs.s3a.secret.key", aws_secret_access_key)
    .config("spark.hadoop.fs.s3a.endpoint", "eu-central-1.amazonaws.com")
    
    .getOrCreate()
)

# here the execution hangs
df = spark.read.parquet("s3a://bucket/file.parquet")

df.show()

What can I do about this? I've seen this question, but there is no working solution there.

Using the same credentials and S3 path with boto3 works and downloads the file in less than a second.

Answer:

After some experimentation I found a solution (if you wait long enough, the read does time out and print an error, but only after a long time). It suffices to set:

.config("spark.hadoop.fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")

Note that the endpoint host has to start with s3., not s3a. as suggested in some places that I've seen. With this modification, everything works.
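
For reference, here is a minimal sketch of the corrected setup as a small helper. The function names (s3a_endpoint, s3a_options, build_session) are hypothetical, and the s3.<region>.amazonaws.com pattern is assumed to hold for standard AWS regions only (other partitions, such as the China regions, use a different domain). The two timeout/retry settings are existing S3A options, but the values are illustrative and I have not verified that they shorten the hang in every case; they are only meant to make a wrong endpoint fail sooner.

```python
def s3a_endpoint(region: str) -> str:
    """Build the regional S3 endpoint host; note the "s3." prefix."""
    return f"s3.{region}.amazonaws.com"


def s3a_options(region: str, access_key: str, secret_key: str) -> dict:
    """Collect the Spark options for S3A access in one place."""
    return {
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        "spark.hadoop.fs.s3a.endpoint": s3a_endpoint(region),
        # Assumption: illustrative values (milliseconds / attempt count),
        # intended to surface connection problems sooner than the defaults.
        "spark.hadoop.fs.s3a.connection.establish.timeout": "5000",
        "spark.hadoop.fs.s3a.attempts.maximum": "3",
    }


def build_session(options: dict, app_name: str = "Test S3 app"):
    """Create a SparkSession with the given options applied."""
    # Imported lazily so the helpers above work without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Fill in real credentials before use.
options = s3a_options("eu-central-1", "", "")
```

With real credentials filled in, spark = build_session(options) followed by spark.read.parquet("s3a://bucket/file.parquet") should behave like the original code with the fixed endpoint. The path still uses the s3a:// scheme; only the endpoint host uses s3.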
