I am trying to load a very large (~25 million rows) csv file into pandas. I am chunking this file, 100,000 rows at a time, and appending a newly created dataframe that essentially counts the occurrences of certain words within a column of the chunked dataframe. When I save the first chunk, everything works properly, and the chunk is concatenated to the newly created dataframe side-by-side. The second chunk, however, is concatenated diagonally for some reason. By this I mean that the chunked dataframe is now 200,000 rows, with the first 100,000 rows being empty, and the newly created dataframe is concatenated side-by-side with the first 100,000 rows. How do I fix this problem and concatenate each chunk side by side with the newly created dataframe and save each chunk into separate csv files?
My code:
import pandas as pd
from pandas.core.frame import DataFrame
chunk = 1
for df in pd.read_csv('all_comments_data.csv', chunksize=100000):
dict_to_append = {}
with open('conflict_words.txt') as f:
for word in f.readlines():
dict_to_append[word.strip()] = []
index = 0
for comment in df['comment'].to_numpy():
word_list = str(comment).split(" ")
for conflict_word in dict_to_append.keys():
dict_to_append[conflict_word].append(word_list.count(conflict_word))
print(index)
index =1
df_to_append = pd.DataFrame(dict_to_append)
final_df = pd.concat([pd.DataFrame(df), df_to_append], axis=1)
final_df.to_csv(f"all_comments_data_with_conflict_scores_{chunk}.csv")
chunk = 1
What I need dataframes to look like:
---------------------------
| | |
| chunk | new dframe |
| | |
---------------------------
What the dataframes look like after the first chunk:
---------------------------
| | |
| | new dframe |
| | |
---------------------------
| | |
| chunk | |
| | |
---------------------------
CodePudding user response:
When running pd.concat column-wise, pandas will try and match rows by their index:
df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=[1, 3, 5])
df2 = pd.DataFrame({"A": [4, 5, 6], "B": [7, 8, 9]}, index=[2, 4, 6])
df3 = pd.concat([df1, df2], axis=1)
print(df3)
A B A B
1 1.0 4.0 NaN NaN
2 NaN NaN 4.0 7.0
3 2.0 5.0 NaN NaN
4 NaN NaN 5.0 8.0
5 3.0 6.0 NaN NaN
6 NaN NaN 6.0 9.0
If you want chunk and new dframe to sit side-by-side after a concat, you will need to ensure they both have the same row indices.
