concatenate dataframes with variable row sizes-CodePudding

I have two csv files which have different row numbers.

test1.csv
num,sam
1,1.2
2,1.13
3,0.99

test2.csv
num,sam
1,1.2
2,1.1
3,0.99
4,1.02

I would like to read the sam columns and append them to an empty dataframe. Thing is that, when I read test1.csv, I extract the base file name, test1 and want to append the sam column based on the `column header in the empty dataframe.

big_df = pd.DataFrame(columns =['test1','test2'])
pwd = os.getcwd()
for file in os.listdir(pwd):
     filename = os.fsdecode(file)
     if filename.endswith(".csv"):
         prog = filename.split('.')[0] # test1 test2
         df = pd.read_csv(filename, usecols=['sam'])
         # The read dataframe has one column
         # Move/append that column to the big_df where column == prog
         big_df[prog] = df
print(big_df)

But big_df misses the fourth row of test2.csv.

   test1  test2
0   1.20   1.20
1   1.13   1.1
2   0.99   0.99

I expect to see

   test1  test2
0   1.20   1.20
1   1.13   1.1
2   0.99   0.99
3   NaN    1.02

How can I fix that?

CodePudding user response：

Using pandas.concat and a simple dictionary comprehension:

files = ['test1.csv', 'test2.csv']
df = pd.concat({f.rsplit('.', 1)[0]: pd.read_csv(f).set_index('num')['sam']
                for f in files}, axis=1)

output:

     test1  test2
num              
1     1.20   1.20
2     1.13   1.10
3     0.99   0.99
4      NaN   1.02

CodePudding user response：

You could approach it differently and use concat instead of creating an empty data frame in the first place. Might be also a bit more efficient. In code that reads like

def get_columns():
    for file in os.listdir(pwd):
        filename = os.fsdecode(file)
        if filename.endswith(".csv"):
            prog = filename.split('.')[0] # test1 test2
            yield pd.read_csv(filename, usecols=['sam'])['sam'].rename(prog)
big_df = pd.concat(get_columns(), axis=1)

Otherwise, you could use merge with outer as mentioned in the comment.

CodePudding user response：

read bothe csv's first using pd.read_csv("filename",sep=',')

df1
   num   sam
0    1  1.20
1    2  1.13
2    3  0.99

df2
   num   sam
0    1  1.20
1    2  1.10
2    3  0.99
3    4  1.02

the do the following

df2.drop('num',axis=1,inplace=True)
df2.columns=['test2']
df2['test1']=df1['sam']

output:

   test2  test1
0   1.20   1.20
1   1.10   1.13
2   0.99   0.99
3   1.02    NaN