I am reading a large file (>5 GB) from S3 into a Lambda function with the following code:
import json
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    response = s3.get_object(
        Bucket="my-bucket",
        Key="my-key"
    )
    text_bytes = response['Body'].read()
    ...
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }
However, I get the following error:
"errorMessage": "signed integer is greater than maximum"
"errorType": "OverflowError"
"stackTrace": [
" File \"/var/task/lambda_function.py\", line 13, in lambda_handler\n text_bytes = response['Body'].read()\n"
" File \"/var/runtime/botocore/response.py\", line 77, in read\n chunk = self._raw_stream.read(amt)\n"
" File \"/var/runtime/urllib3/response.py\", line 515, in read\n data = self._fp.read() if not fp_closed else b\"\"\n"
" File \"/var/lang/lib/python3.8/http/client.py\", line 472, in read\n s = self._safe_read(self.length)\n"
" File \"/var/lang/lib/python3.8/http/client.py\", line 613, in _safe_read\n data = self.fp.read(amt)\n"
" File \"/var/lang/lib/python3.8/socket.py\", line 669, in readinto\n return self._sock.recv_into(b)\n"
" File \"/var/lang/lib/python3.8/ssl.py\", line 1241, in recv_into\n return self.read(nbytes, buffer)\n"
" File \"/var/lang/lib/python3.8/ssl.py\", line 1099, in read\n return self._sslobj.read(len, buffer)\n"
]
I am using Python 3.8, and I found an issue affecting Python 3.8 and 3.9 that might be the cause: https://bugs.python.org/issue42853
Is there any way around this?
Answer:
As mentioned in the bug you linked to, the core issue in Python 3.8 is that a single SSL read of more than 2 GB overflows a signed 32-bit integer. You can use a variant of the workaround suggested in the bug report and read the file in chunks:
import json
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    response = s3.get_object(
        Bucket="-example-bucket-",
        Key="path/to/key.dat"
    )
    # Preallocate a buffer for the whole object and fill it chunk by chunk.
    buf = bytearray(response['ContentLength'])
    view = memoryview(buf)
    pos = 0
    while True:
        # Read 64 MiB per call so no single read comes near the limit.
        chunk = response['Body'].read(67108864)
        if len(chunk) == 0:
            break
        view[pos:pos + len(chunk)] = chunk
        pos += len(chunk)
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }
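The chunked-read loop can also be pulled out into a small helper so it can be tried locally without S3. The following is only a sketch: the name `read_in_chunks` and its parameters are hypothetical, and it assumes the stream exposes a `read(amt)` method the way `response['Body']` does.

```python
import io

def read_in_chunks(stream, total_size, chunk_size=64 * 1024 * 1024):
    """Read total_size bytes from a file-like stream into one bytearray.

    Hypothetical helper: each read() call asks for at most chunk_size
    bytes, so no single read approaches the 2 GB limit.
    """
    buf = bytearray(total_size)
    view = memoryview(buf)
    pos = 0
    while pos < total_size:
        chunk = stream.read(min(chunk_size, total_size - pos))
        if not chunk:
            # The stream ended before the expected number of bytes arrived.
            raise EOFError(f"stream ended after {pos} of {total_size} bytes")
        view[pos:pos + len(chunk)] = chunk
        pos += len(chunk)
    return buf

# Local demonstration with an in-memory stream standing in for the S3 body.
data = bytes(range(256)) * 4
result = read_in_chunks(io.BytesIO(data), len(data), chunk_size=100)
```

Inside the Lambda handler, the equivalent call would presumably be `read_in_chunks(response['Body'], response['ContentLength'])`. Raising on a short stream, rather than breaking out of the loop silently, makes a truncated download visible instead of leaving trailing zero bytes in the buffer.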
At best, however, you're going to spend a minute or more of each Lambda run just reading from S3. It would be much better if you could store the file in EFS and read it from there in the Lambda, or use another solution like ECS to avoid reading from a remote data source.
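To see where an estimate like "a minute or more" can come from, here is a rough back-of-the-envelope calculation. The throughput figure is an assumption chosen for illustration, not a measured value; actual S3-to-Lambda throughput varies with memory configuration, region, and object layout.

```python
def estimate_read_seconds(size_bytes, throughput_mib_per_s):
    """Return the seconds needed to transfer size_bytes at a constant rate."""
    return size_bytes / (throughput_mib_per_s * 1024 * 1024)

# Hypothetical numbers: a 5 GiB object at an assumed 85 MiB/s.
seconds = estimate_read_seconds(5 * 1024**3, 85)
print(f"{seconds:.0f} s")  # prints "60 s"
```

Under that assumption the download alone takes about a minute, and larger objects scale linearly from there.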
