How to group data by count of columns in Pandas?

Time:05-24

I have a CSV file with a lot of rows, and the rows have different numbers of columns.

How can I group the rows by their column count and show each group in a separate DataFrame?

The CSV file has the following data:

1 OLEG US FRANCE BIG
1 OLEG FR 18
1 NATA 18

Because each row has a different number of columns, I have to group the rows by column count and show 3 frames, so that I can then set a header for each one:

   FR1:
        ID  NAME  STATE  COUNTRY  HOBBY
        1   OLEG  US     FRANCE   BIG

   FR2:
        ID  NAME  COUNTRY  AGE
        1   OLEG  FR       18

   FR3:
        ID  NAME  AGE
        1   NATA  18

In other words, I need to group rows by their column count and show each group in its own DataFrame.

Answer:

Since a pandas DataFrame cannot hold rows of different lengths, don't use pandas to import your data. Your goal is to create three separate df's, so first import the data as lists and then deal with the different lengths.

One way to solve this is to read the data with csv.reader and create the df's with list comprehensions that filter on the length of each row.

import csv
import pandas as pd

# Read every row as a list of strings; the rows may differ in length.
with open('input.csv', 'r', newline='') as f:
    reader = csv.reader(f, delimiter=' ')
    data = list(reader)

# Build one DataFrame per row length and name its columns.
df1 = pd.DataFrame([item for item in data if len(item) == 3], columns='ID NAME AGE'.split())
df2 = pd.DataFrame([item for item in data if len(item) == 4], columns='ID NAME COUNTRY AGE'.split())
df3 = pd.DataFrame([item for item in data if len(item) == 5], columns='ID NAME STATE COUNTRY HOBBY'.split())

print(df1, df2, df3, sep='\n\n')

  ID  NAME AGE
0  1  NATA  18

  ID  NAME COUNTRY AGE
0  1  OLEG      FR  18

  ID  NAME STATE COUNTRY HOBBY
0  1  OLEG    US  FRANCE   BIG

If you would have to hardcode too many lines for the same step (e.g. too many df's), consider using a loop to create them and storing each DataFrame as a value in a dictionary.

EDIT: Here is a slightly optimized way of creating those df's. I don't think you can get around creating a list of the columns you want to use for the separate df's, so you need to know which column counts occur in your data (unless you want to create those df's without naming the columns).

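import csv
import pandas as pd

# One list of column names per expected row length.
col_list = [['ID', 'NAME', 'AGE'],
            ['ID', 'NAME', 'COUNTRY', 'AGE'],
            ['ID', 'NAME', 'STATE', 'COUNTRY', 'HOBBY']]

with open('input.csv', 'r', newline='') as f:
    reader = csv.reader(f, delimiter=' ')
    data = list(reader)

# Key each DataFrame by its column count, e.g. 'df_3'.
dict_of_dfs = {}
for cols in col_list:
    dict_of_dfs[f'df_{len(cols)}'] = pd.DataFrame([item for item in data if len(item) == len(cols)], columns=cols)

# The f'{key=}' syntax needs Python 3.8 or later.
for key, val in dict_of_dfs.items():
    print(f'{key=}: \n {val} \n')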

key='df_3': 
   ID  NAME AGE
0  1  NATA  18 

key='df_4': 
   ID  NAME COUNTRY AGE
0  1  OLEG      FR  18 

key='df_5': 
   ID  NAME STATE COUNTRY HOBBY
0  1  OLEG    US  FRANCE   BIG 

Now you don't have separate variables for your df's; instead they are stored in a dictionary under their keys. (I named each df after the number of columns it has: df_3 is the df with three columns.)
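If you do not know in advance which column counts occur, the rows can also be grouped by length first, with the df's left unnamed. This is only a minimal sketch: the helper name group_rows_by_length is made up for illustration, and the sample string stands in for input.csv so the snippet runs on its own.

```python
import csv
import io
from collections import defaultdict

import pandas as pd

# Sample rows from the question; in practice pass an open file instead.
sample = "1 OLEG US FRANCE BIG\n1 OLEG FR 18\n1 NATA 18\n"


def group_rows_by_length(lines, delimiter=" "):
    """Hypothetical helper: return {column_count: DataFrame}."""
    rows_by_len = defaultdict(list)
    for row in csv.reader(lines, delimiter=delimiter):
        if row:  # skip blank lines
            rows_by_len[len(row)].append(row)
    # No column names are known here, so pandas numbers them 0, 1, 2, ...
    return {n: pd.DataFrame(rows) for n, rows in sorted(rows_by_len.items())}


frames = group_rows_by_length(io.StringIO(sample))
for n, frame in frames.items():
    print(f"{n} columns:\n{frame}\n")
```

The headers can still be set afterwards, for example frames[3].columns = ['ID', 'NAME', 'AGE'], once you have decided what each column count means.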

If you need to import the data with pandas, you could have a look at this post.
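For reference, one possible pandas-only approach (not necessarily the one from the linked post) is to pass read_csv enough column names for the widest row, so that shorter rows are padded with NaN, and then group by the number of filled cells. The sketch assumes the widest row has 5 fields (MAX_COLS is my own name) and that no real field is empty or an NA-like string such as "NA", since such a field would be counted as missing.

```python
import io

import pandas as pd

# Sample rows from the question; in practice pass the path to input.csv.
sample = "1 OLEG US FRANCE BIG\n1 OLEG FR 18\n1 NATA 18\n"
MAX_COLS = 5  # assumption: the widest row has 5 fields

# Shorter rows are padded with NaN up to MAX_COLS columns.
wide = pd.read_csv(io.StringIO(sample), sep=" ", header=None,
                   names=range(MAX_COLS), dtype=str)

# Number of filled cells per row, used as the grouping key.
counts = wide.notna().sum(axis=1)

# One DataFrame per column count, with the all-NaN padding columns dropped.
frames_pd = {
    int(n): group.dropna(axis=1, how="all").reset_index(drop=True)
    for n, group in wide.groupby(counts)
}

for n, frame in frames_pd.items():
    print(f"{n} columns:\n{frame}\n")
```

Everything is read as strings here (dtype=str), which matches what csv.reader gives you; convert columns such as AGE afterwards if you need numbers.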
