Hello I'm trying to read in multiple dfs of the same structre and concating them into a single one however the combined df somehow keeps duplicates in the indizes....
df1 = processUSAdata(2223)
df2 = processUSAdata(2224)
print(df1)
print(df2)
test = pd.concat([df1,df2])
test.set_index(['state','year'],inplace=True)
print(test)
outputs
state population violent_crime ... theft gta year
0 ALABAMA 4903185 510.81 ... 1886.06 256.51 2223
1 ALASKA 731545 867.07 ... 2066.04 357.74 2223
2 ARIZONA 7278717 455.31 ... 1796.86 249.37 2223
3 ARKANSAS 3017804 584.63 ... 2012.56 245.87 2223
4 CALIFORNIA 39512223 441.21 ... 1586.35 358.77 2223
5 COLORADO 5758736 380.95 ... 1858.26 383.99 2223
6 CONNECTICUT 3565287 183.60 ... 1078.65 167.28 2223
[7 rows x 12 columns]
state population violent_crime ... theft gta year
0 ALABAMA 4903185 510.81 ... 1886.06 256.51 2224
1 ALASKA 731545 867.07 ... 2066.04 357.74 2224
2 ARIZONA 7278717 455.31 ... 1796.86 249.37 2224
3 ARKANSAS 3017804 584.63 ... 2012.56 245.87 2224
4 CALIFORNIA 39512223 441.21 ... 1586.35 358.77 2224
5 COLORADO 5758736 380.95 ... 1858.26 383.99 2224
6 CONNECTICUT 3565287 183.60 ... 1078.65 167.28 2224
[7 rows x 12 columns]
population violent_crime murder ... burglary theft gta
state year ...
ALABAMA 2223 4903185 510.81 7.30 ... 531.88 1886.06 256.51
ALASKA 2223 731545 867.07 9.43 ... 487.05 2066.04 357.74
ARIZONA 2223 7278717 455.31 5.01 ... 394.29 1796.86 249.37
ARKANSAS 2223 3017804 584.63 8.02 ... 599.61 2012.56 245.87
CALIFORNIA 2223 39512223 441.21 4.28 ... 386.10 1586.35 358.77
COLORADO 2223 5758736 380.95 3.79 ... 348.41 1858.26 383.99
CONNECTICUT 2223 3565287 183.60 2.92 ... 180.66 1078.65 167.28
ALABAMA 2224 4903185 510.81 7.30 ... 531.88 1886.06 256.51
ALASKA 2224 731545 867.07 9.43 ... 487.05 2066.04 357.74
ARIZONA 2224 7278717 455.31 5.01 ... 394.29 1796.86 249.37
ARKANSAS 2224 3017804 584.63 8.02 ... 599.61 2012.56 245.87
CALIFORNIA 2224 39512223 441.21 4.28 ... 386.10 1586.35 358.77
COLORADO 2224 5758736 380.95 3.79 ... 348.41 1858.26 383.99
CONNECTICUT 2224 3565287 183.60 2.92 ... 180.66 1078.65 167.28
Hoping someone can help me out here. Thanks in advance! :)
CodePudding user response:
It's probably because your index is not sorted:
test = pd.concat([df1, df2]).set_index(['state', 'year']).sort_index()
print(test)
# Output
population violent_crime theft gta
state year
ALABAMA 2223 4903185 510.81 1886.06 256.51
2224 4903185 510.81 1886.06 256.51
ALASKA 2223 731545 867.07 2066.04 357.74
2224 731545 867.07 2066.04 357.74
ARIZONA 2223 7278717 455.31 1796.86 249.37
2224 7278717 455.31 1796.86 249.37
ARKANSAS 2223 3017804 584.63 2012.56 245.87
2224 3017804 584.63 2012.56 245.87
CALIFORNIA 2223 39512223 441.21 1586.35 358.77
2224 39512223 441.21 1586.35 358.77
COLORADO 2223 5758736 380.95 1858.26 383.99
2224 5758736 380.95 1858.26 383.99
CONNECTICUT 2223 3565287 183.60 1078.65 167.28
2224 3565287 183.60 1078.65 167.28
CodePudding user response:
you can verify when you concat for duplicates by:
test = pd.concat([df1,df2],verify_integrity=True)
Or you can drop duplicates afterwards:
test.set_index(['state','year'],inplace=True)..drop_duplicates(inplace=True)
