What exactly happens in this example-CodePudding

I am writing TPC-H query 6 in dask on a slice of the tpc-h dataset:

start = time.time()
lineitem = dd.read_csv("s3://tpc-h-csv/lineitem/lineitem.tbl.1",sep="|", header = 0)
df = lineitem.rename(columns=dict(zip(lineitem.columns, lineitem_scheme)))
filtered_df = df.loc[(df.l_shipdate > "1994-01-01") & (df.l_discount >= 0.05) & (df.l_discount <= 0.07) & (df.l_quantity < 24)]
filtered_df['product'] = filtered_df.l_extendedprice * filtered_df.l_discount
print(filtered_df.product.sum().compute())
print(time.time() - start)

I have a couple questions:

Is this the fastest way to write said query in Dask?
The data that I am downloading from S3 is 48GB. The memory on my node is 16 GB. Does Dask do batched computation? Does it persist to disk and then read from disk? What happens?

CodePudding user response：

For your first question, that seems like an efficient way to write the query in Dask, though it is hard to test given that the s3 bucket is not public. None of your operations require shuffling, so they should all be fairly inexpensive. For the second question, the answer is essentially yes, since the dataset you're downloading is larger than the memory available, Dask will spill to disk. Fine tuning how Dask manages memory is a bit tricky, there's more on that here if you're interested.