Home > Enterprise >  Saving multiple csv files from chunked csv file in pandas
Saving multiple csv files from chunked csv file in pandas

Time:01-05

I am trying to load a very large (~25 million rows) csv file into pandas. I am chunking this file, 100,000 rows at a time, and appending a newly created dataframe that essentially counts the occurrences of certain words within a column of the chunked dataframe. When I save the first chunk, everything works properly, and the chunk is concatenated to the newly created dataframe side-by-side. The second chunk, however, is concatenated diagonally for some reason. By this I mean that the chunked dataframe is now 200,000 rows, with the first 100,000 rows being empty, and the newly created dataframe is concatenated side-by-side with the first 100,000 rows. How do I fix this problem and concatenate each chunk side by side with the newly created dataframe and save each chunk into separate csv files?

My code:

import pandas as pd
from pandas.core.frame import DataFrame

chunk = 1

for df in pd.read_csv('all_comments_data.csv', chunksize=100000):
    dict_to_append = {}

    with open('conflict_words.txt') as f:
        for word in f.readlines():
            dict_to_append[word.strip()] = []
    
    index = 0

    for comment in df['comment'].to_numpy():
        word_list = str(comment).split(" ")
        for conflict_word in dict_to_append.keys():
            dict_to_append[conflict_word].append(word_list.count(conflict_word))
        print(index)
        index  =1


    df_to_append = pd.DataFrame(dict_to_append)
    final_df = pd.concat([pd.DataFrame(df), df_to_append], axis=1)
    final_df.to_csv(f"all_comments_data_with_conflict_scores_{chunk}.csv")
    chunk  = 1 
What I need dataframes to look like:

---------------------------
|            |            |
|  chunk     | new dframe |
|            |            |
---------------------------

What the dataframes look like after the first chunk:

---------------------------
|            |            |
|            | new dframe |
|            |            |
---------------------------
|            |            |
| chunk      |            |
|            |            |
---------------------------

CodePudding user response:

When running pd.concat column-wise, pandas will try and match rows by their index:

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=[1, 3, 5])
df2 = pd.DataFrame({"A": [4, 5, 6], "B": [7, 8, 9]}, index=[2, 4, 6])
df3 = pd.concat([df1, df2], axis=1)
print(df3)

     A    B    A    B
1  1.0  4.0  NaN  NaN
2  NaN  NaN  4.0  7.0
3  2.0  5.0  NaN  NaN
4  NaN  NaN  5.0  8.0
5  3.0  6.0  NaN  NaN
6  NaN  NaN  6.0  9.0

If you want chunk and new dframe to sit side-by-side after a concat, you will need to ensure they both have the same row indices.

  •  Tags:  
  • Related