We have multiple files on our S3 bucket with the same file extensions.
I would like to find a way to list all these file extensions with the amount of space they're taking up in our bucket in human readable format.
For example, instead of just listing out all the files with aws s3 ls s3://ebio-rddata --recursive --human-readable --summarize
I'd like to list only the file extensions with the total size they're taking:
.basedon.peaks.l2inputnormnew.bed.full | total size: 100 GB
.adapterTrim.round2.rmRep.sorted.rmDup.sorted.bam | total size: 200 GB
.logo.svg | total size: 400 MB
CodePudding user response:
You will have to use the SDK for that: write a script in your favorite language that walks the objects recursively and filters for the file formats you want,
then export the result as CSV or JSON, whichever you find more readable.
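As a rough sketch of the export step only, assuming the per-extension totals have already been collected into a dictionary (the function name export_totals and the sample figures are illustrative, not part of any SDK):

```python
import csv
import io
import json


def export_totals(totals):
    """Return (csv_text, json_text) for a mapping of extension -> total bytes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["extension", "total_bytes"])
    for extension, size in sorted(totals.items()):
        writer.writerow([extension, size])
    return buffer.getvalue(), json.dumps(totals, indent=2, sort_keys=True)


# Sample figures standing in for totals gathered through the SDK
csv_text, json_text = export_totals({"png": 151251, "csv": 1448, "json": 135})
print(csv_text)
print(json_text)
```

Writing to a string buffer keeps the sketch independent of the file system; in a real script the same writer could target an open file instead.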
CodePudding user response:
Here's an idea for how to solve this with the awscli and a couple of other command-line tools (grep and awk, freely available on Mac and Linux).
aws s3 ls s3://mybucket --recursive \
| grep -v -E '^.+/$' \
| awk '{na=split($NF, a, "."); tot[a[na]] += $3; num[a[na]]++;} END {for (e in tot) printf "%d %d %s\n", tot[e], num[e], e};'
Step by step, aws s3 ls s3://mybucket --recursive results in output like this:
2021-11-24 12:45:39 57600 cat.png
2021-09-29 13:15:48 93651 dog.png
2021-09-29 14:16:06 1448 names.csv
2021-02-15 15:09:56 0 pets/
2021-02-15 15:09:56 135 pets/pets.json
Piping that through grep -v -E '^.+/$' removes the folders (lines ending in a slash), and the result looks like this:
2021-11-24 12:45:39 57600 cat.png
2021-09-29 13:15:48 93651 dog.png
2021-09-29 14:16:06 1448 names.csv
2021-02-15 15:09:56 135 pets/pets.json
Finally, the AWK script is called for each line. It splits the last word of each line on the period character (split($NF, a, ".")) so it can work out what the file extension is (stored in a[na]). It then aggregates the file size by extension in tot[extension] and the file count by extension in num[extension]. It finally prints out the aggregated file size and file count by extension, which looks something like this:
151251 2 png
1448 1 csv
135 1 json
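The same aggregation can be sketched in Python, which also makes it easy to print sizes in human-readable form as the question asks. This is a minimal sketch, not a replacement for the pipeline above: the function names totals_by_extension and human_readable are made up for illustration, the sample lines are the listing shown earlier, and the binary units (KiB, MiB, ...) are an assumption.

```python
from collections import defaultdict


def totals_by_extension(lines):
    """Aggregate sizes and counts by extension from `aws s3 ls --recursive` lines."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for line in lines:
        # Columns are date, time, size, key; the key itself may contain spaces
        parts = line.rstrip("\n").split(None, 3)
        if len(parts) < 4 or parts[3].endswith("/"):
            continue  # skip blank lines and folder placeholders
        size, key = int(parts[2]), parts[3]
        extension = key.rsplit(".", 1)[-1]
        totals[extension] += size
        counts[extension] += 1
    return dict(totals), dict(counts)


def human_readable(num_bytes):
    """Format a byte count using binary units (an assumption of this sketch)."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if num_bytes < 1024 or unit == "TiB":
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1024


sample = [
    "2021-11-24 12:45:39 57600 cat.png",
    "2021-09-29 13:15:48 93651 dog.png",
    "2021-09-29 14:16:06 1448 names.csv",
    "2021-02-15 15:09:56 0 pets/",
    "2021-02-15 15:09:56 135 pets/pets.json",
]
totals, counts = totals_by_extension(sample)
for extension, size in totals.items():
    print(f".{extension} | total size: {human_readable(size)} | files: {counts[extension]}")
```

To run it against a real bucket, the listing could be piped in and sys.stdin passed instead of the sample list.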
You could also solve this fairly simply e.g. in Python using the boto3 SDK.
CodePudding user response:
Here's a Python script that will count objects by extension and compute the total size by extension:
import boto3

s3_resource = boto3.resource('s3')

sizes = {}     # total bytes per extension
quantity = {}  # object count per extension

for obj in s3_resource.Bucket('jstack-a').objects.all():
    # Skip zero-byte "folder" placeholder keys
    if not obj.key.endswith('/'):
        extension = obj.key.split('.')[-1]
        sizes[extension] = sizes.get(extension, 0) + obj.size
        quantity[extension] = quantity.get(extension, 0) + 1

for extension, size in sizes.items():
    print(extension, quantity[extension], size)
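It goes a bit funny if there is an object without an extension: with no period in the key, split('.') returns the whole key, so the entire key is counted as its own extension.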
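One possible way around that is to work out the extension from the last path component only and group extension-less objects under a placeholder. The sketch below is an assumption about how you might want to bucket things, not part of boto3: extension_of, the "(none)" label and the multi_part switch (which keeps everything after the first period, as in the .logo.svg example from the question) are all hypothetical names.

```python
import posixpath

NO_EXTENSION = "(none)"  # hypothetical label for keys without an extension


def extension_of(key, multi_part=False):
    """Return the extension of an S3 key, looking only at the last path component."""
    name = posixpath.basename(key)
    if multi_part:
        # Keep everything after the first period, e.g. ".logo.svg"
        head, dot, tail = name.partition(".")
    else:
        # Keep only what follows the last period, e.g. ".svg"
        head, dot, tail = name.rpartition(".")
    if not dot or not head or not tail:
        # No period, a leading-dot name such as ".gitignore", or a trailing period
        return NO_EXTENSION
    return "." + tail


for key in ("pets/pets.json", "data.v1/README", "img/sample.logo.svg"):
    print(key, extension_of(key), extension_of(key, multi_part=True))
```

In the script above, extension = extension_of(obj.key) could then stand in for the split('.') line.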
