I'm trying to read some files from S3 with pandas, using S3Hook to get the keys. I can list the keys, but I'm not sure how to get pandas to find the files. When I run the code below, I get:
No such file or directory:
Here is my code:
def transform_pages(company, **context):
    ds = context.get("execution_date").strftime('%Y-%m-%d')
    s3 = S3Hook('aws_default')
    s3_conn = s3.get_conn()
    keys = s3.list_keys(bucket_name=Variable.get('s3_bucket'),
                        prefix=f'S/{company}/pages/date={ds}/',
                        delimiter="/")
    prefix = f'S/{company}/pages/date={ds}/'
    logging.info(f'keys from function: {keys}')
    """ transforming pages and loading data back to S3 """
    for file in keys:
        df = pd.read_csv(file, sep='\t', skiprows=1, header=None)
Answer:
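list_keys returns bare object keys, and pandas treats a bare key as a local path, which is why you get "No such file or directory". Pass pandas a full S3 URI instead. The format you are looking for is the following: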
filepath = f"s3://{bucket_name}/{key}"
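So in your specific case, something like:
bucket_name = Variable.get('s3_bucket')  # the actual bucket name, not the literal string 's3_bucket'
for file in keys:
    filepath = f"s3://{bucket_name}/{file}"
    df = pd.read_csv(filepath, sep='\t', skiprows=1, header=None)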
pandas relies on s3fs to open s3:// paths, so make sure it is installed (pip install s3fs).
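If it helps, the URI-building step can be pulled into a small helper. This is only a sketch: build_s3_uris is a hypothetical name, not part of Airflow or pandas, and it assumes the listing may include "folder" placeholder keys ending in "/", which it skips because they are not readable files.

```python
def build_s3_uris(bucket_name, keys):
    """Turn bare S3 object keys into s3:// URIs that pandas can open.

    Hypothetical helper for illustration only. Keys ending in "/" are
    assumed to be folder placeholders and are skipped.
    """
    return [
        f"s3://{bucket_name}/{key}"
        for key in (keys or [])  # tolerate a None listing by treating it as empty
        if not key.endswith("/")
    ]
```

Inside the task you would then loop over build_s3_uris(Variable.get('s3_bucket'), keys) and pass each URI to pd.read_csv with the same sep, skiprows and header arguments as before.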
