Pandas - concating multi-indexed dataframes keeps duplicate indizes-CodePudding

Hello I'm trying to read in multiple dfs of the same structre and concating them into a single one however the combined df somehow keeps duplicates in the indizes....

df1 = processUSAdata(2223)
df2 = processUSAdata(2224)

print(df1)
print(df2)

test = pd.concat([df1,df2])
test.set_index(['state','year'],inplace=True)
print(test)

outputs

         state  population  violent_crime  ...    theft     gta  year
0      ALABAMA     4903185         510.81  ...  1886.06  256.51  2223
1       ALASKA      731545         867.07  ...  2066.04  357.74  2223
2      ARIZONA     7278717         455.31  ...  1796.86  249.37  2223
3     ARKANSAS     3017804         584.63  ...  2012.56  245.87  2223
4   CALIFORNIA    39512223         441.21  ...  1586.35  358.77  2223
5     COLORADO     5758736         380.95  ...  1858.26  383.99  2223
6  CONNECTICUT     3565287         183.60  ...  1078.65  167.28  2223

[7 rows x 12 columns]
         state  population  violent_crime  ...    theft     gta  year
0      ALABAMA     4903185         510.81  ...  1886.06  256.51  2224
1       ALASKA      731545         867.07  ...  2066.04  357.74  2224
2      ARIZONA     7278717         455.31  ...  1796.86  249.37  2224
3     ARKANSAS     3017804         584.63  ...  2012.56  245.87  2224
4   CALIFORNIA    39512223         441.21  ...  1586.35  358.77  2224
5     COLORADO     5758736         380.95  ...  1858.26  383.99  2224
6  CONNECTICUT     3565287         183.60  ...  1078.65  167.28  2224

[7 rows x 12 columns]
                  population  violent_crime  murder  ...  burglary    theft     gta
state       year                                     ...                           
ALABAMA     2223     4903185         510.81    7.30  ...    531.88  1886.06  256.51
ALASKA      2223      731545         867.07    9.43  ...    487.05  2066.04  357.74
ARIZONA     2223     7278717         455.31    5.01  ...    394.29  1796.86  249.37
ARKANSAS    2223     3017804         584.63    8.02  ...    599.61  2012.56  245.87
CALIFORNIA  2223    39512223         441.21    4.28  ...    386.10  1586.35  358.77
COLORADO    2223     5758736         380.95    3.79  ...    348.41  1858.26  383.99
CONNECTICUT 2223     3565287         183.60    2.92  ...    180.66  1078.65  167.28
ALABAMA     2224     4903185         510.81    7.30  ...    531.88  1886.06  256.51
ALASKA      2224      731545         867.07    9.43  ...    487.05  2066.04  357.74
ARIZONA     2224     7278717         455.31    5.01  ...    394.29  1796.86  249.37
ARKANSAS    2224     3017804         584.63    8.02  ...    599.61  2012.56  245.87
CALIFORNIA  2224    39512223         441.21    4.28  ...    386.10  1586.35  358.77
COLORADO    2224     5758736         380.95    3.79  ...    348.41  1858.26  383.99
CONNECTICUT 2224     3565287         183.60    2.92  ...    180.66  1078.65  167.28

Hoping someone can help me out here. Thanks in advance! :)

CodePudding user response：

It's probably because your index is not sorted:

test = pd.concat([df1, df2]).set_index(['state', 'year']).sort_index()
print(test)

# Output
                  population  violent_crime    theft     gta
state       year                                            
ALABAMA     2223     4903185         510.81  1886.06  256.51
            2224     4903185         510.81  1886.06  256.51
ALASKA      2223      731545         867.07  2066.04  357.74
            2224      731545         867.07  2066.04  357.74
ARIZONA     2223     7278717         455.31  1796.86  249.37
            2224     7278717         455.31  1796.86  249.37
ARKANSAS    2223     3017804         584.63  2012.56  245.87
            2224     3017804         584.63  2012.56  245.87
CALIFORNIA  2223    39512223         441.21  1586.35  358.77
            2224    39512223         441.21  1586.35  358.77
COLORADO    2223     5758736         380.95  1858.26  383.99
            2224     5758736         380.95  1858.26  383.99
CONNECTICUT 2223     3565287         183.60  1078.65  167.28
            2224     3565287         183.60  1078.65  167.28

CodePudding user response：

you can verify when you concat for duplicates by:

test = pd.concat([df1,df2],verify_integrity=True)

Or you can drop duplicates afterwards:

test.set_index(['state','year'],inplace=True)..drop_duplicates(inplace=True)