pandas, dataframe: If you need to process data row by row, how to do it faster than itertuples

I know that .itertuples() and .iterrows() are slow, but how can I speed them up if I need to use and process data one row at a time, as shown below?

import pandas as pd

df = pd.read_csv('example.csv')

posts = []
for row in df.itertuples():
    post = Post(title=row.title, text=row.text, ...)
    posts.append(post)

CodePudding user response:

You can use a list comprehension with keyword-argument unpacking (**kwargs), provided your DataFrame column names match the parameter names of your class's constructor. For example:

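import pandas as pd
df = pd.DataFrame({"title": ["fizz", "buzz"], "text": ["aaaa", "bbbb"]})
# to_dict("records") yields one dict per row, e.g. {"title": "fizz", "text": "aaaa"}
posts = [Post(**kwargs) for kwargs in df.to_dict("records")]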
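
To try this end to end, here is a minimal sketch. The Post dataclass is a hypothetical stand-in for the class in the question (which is not shown), and the "headline" column is made up to illustrate what you might do when a column name does not match an attribute name: rename it before unpacking.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class Post:
    # Hypothetical stand-in for the asker's Post class.
    title: str
    text: str


# "headline" is an assumed column name that does not match Post.title.
df = pd.DataFrame({"headline": ["fizz", "buzz"], "text": ["aaaa", "bbbb"]})

# Rename columns to match the attribute names, then unpack each record.
records = df.rename(columns={"headline": "title"}).to_dict("records")
posts = [Post(**kwargs) for kwargs in records]
print(posts)
```

If the DataFrame has extra columns the class does not accept, selecting only the needed columns first (for example df[["title", "text"]]) should avoid an unexpected-keyword error.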

CodePudding user response:

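I usually use the apply function with axis=1, which calls a function once per row. Here each row is turned into a dict, but you could build a Post object the same way.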

import pandas as pd

df = pd.DataFrame(dict(title=["title1", "title2", "title3"], text=["text1", "text2", "text3"]))

# axis=1 passes each row to the lambda as a Series
df["Posts"] = df.apply(lambda row: dict(title=row["title"], text=row["text"]), axis=1)

posts = list(df["Posts"])
print(posts)

Output:

[{'title': 'title1', 'text': 'text1'}, {'title': 'title2', 'text': 'text2'}, {'title': 'title3', 'text': 'text3'}]

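It's generally better to avoid an explicit for loop when another method is available. Keep in mind, though, that apply with axis=1 still calls the function once per row in Python, so it is worth timing it against itertuples on your own data.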
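
To check which approach is actually faster for your case, a rough timing sketch like the one below may help. Everything here is an assumption for illustration: Post is a hypothetical stand-in for the class in the question, the DataFrame is synthetic, and the zip-over-columns variant is an extra option not mentioned above. Timings vary by machine and pandas version, so no numbers are shown.

```python
import timeit
from dataclasses import dataclass

import pandas as pd


@dataclass
class Post:
    # Hypothetical stand-in for the asker's Post class.
    title: str
    text: str


# Synthetic data; replace with your own DataFrame.
n = 5_000
df = pd.DataFrame({
    "title": [f"title{i}" for i in range(n)],
    "text": [f"text{i}" for i in range(n)],
})


def with_itertuples(df):
    return [Post(title=row.title, text=row.text) for row in df.itertuples(index=False)]


def with_records(df):
    return [Post(**kwargs) for kwargs in df.to_dict("records")]


def with_apply(df):
    return list(df.apply(lambda row: Post(title=row["title"], text=row["text"]), axis=1))


def with_zip(df):
    # Iterates the two columns in parallel without building row objects.
    return [Post(title=t, text=x) for t, x in zip(df["title"], df["text"])]


results = {}
for func in (with_itertuples, with_records, with_apply, with_zip):
    results[func.__name__] = func(df)
    seconds = timeit.timeit(lambda: func(df), number=3)
    print(f"{func.__name__}: {seconds:.3f} s for 3 runs")
```

All four functions should produce the same list of Post objects, so you can pick whichever measures fastest on your data.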