I want to read files from S3 with PySpark (local installation, not EMR). The problem is that it freezes on read, without any timeout or error.
Versions:
- PySpark 3.2.2
- Hadoop 3.3.1
- Hadoop-AWS 3.3.1 .jar
- AWS Java SDK bundle 1.12.136 .jar (also tried 1.11.901)
The JAR files are placed directly in the SPARK_HOME/jars directory, so I don't need to specify them separately here (this approach has worked for my other Spark jobs with other JAR dependencies).
My PySpark code:
from pyspark.sql import SparkSession

# filled in in the actual code
aws_access_key_id = ""
aws_secret_access_key = ""

spark = (
    SparkSession
    .builder
    .appName("Test S3 app")
    .config("spark.hadoop.fs.s3a.access.key", aws_access_key_id)
    .config("spark.hadoop.fs.s3a.secret.key", aws_secret_access_key)
    .config("spark.hadoop.fs.s3a.endpoint", "eu-central-1.amazonaws.com")
    .getOrCreate()
)

# here the execution hangs
df = spark.read.parquet("s3a://bucket/file.parquet")
df.show()
What can I do about this? I've seen this question, but there is no working solution there.
Using the same credentials and S3 path with boto3 works and downloads the file in less than a second.
CodePudding user response:
Through some experimentation I found the solution (if you wait long enough, the query does eventually time out and print an error, but only after a long time). It suffices to set:
.config("spark.hadoop.fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
Note that the prefix has to be s3., not s3a. as suggested in some places I've seen; the endpoint in the question was missing the prefix entirely. With this change, everything works.
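To make the fix harder to get wrong, the endpoint can be built from the region name instead of being typed by hand. The following is only a sketch: the helper names s3a_endpoint, s3a_options and build_session are made up for illustration, and the timeout and retry values are assumptions rather than recommendations. The fs.s3a.connection.establish.timeout, fs.s3a.connection.timeout and fs.s3a.attempts.maximum properties are regular Hadoop-AWS settings, but whether lowering them shortens the hang described in the question has not been verified here.

```python
def s3a_endpoint(region: str) -> str:
    """Build the regional S3 endpoint host. Note the "s3." prefix, not "s3a."."""
    return f"s3.{region}.amazonaws.com"


def s3a_options(region, access_key, secret_key, fail_fast=True):
    """Return Spark config entries for S3A access (hypothetical helper)."""
    options = {
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        "spark.hadoop.fs.s3a.endpoint": s3a_endpoint(region),
    }
    if fail_fast:
        # Assumed values (milliseconds and attempt count), chosen so that a
        # misconfigured endpoint might surface as an error sooner.
        options.update({
            "spark.hadoop.fs.s3a.connection.establish.timeout": "5000",
            "spark.hadoop.fs.s3a.connection.timeout": "10000",
            "spark.hadoop.fs.s3a.attempts.maximum": "3",
        })
    return options


def build_session(region, access_key, secret_key):
    """Create a SparkSession configured with the options above."""
    # Imported lazily so the helpers above work without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("Test S3 app")
    for key, value in s3a_options(region, access_key, secret_key).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With the variables from the question, this would be called as spark = build_session("eu-central-1", aws_access_key_id, aws_secret_access_key), followed by the same spark.read.parquet call as before.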
