I am unable to properly explain my requirement, but I can show the expected result. I have a dataframe that looks like so:
| Series1 | Series2 |
|---|---|
| 1370307 | 1370306 |
| 927092 | 927091 |
| 925392 | 925391 |
| 925390 | 925389 |
| 2344089 | 2344088 |
| 1827855 | 1827854 |
| 1715793 | 1715792 |
| 2356467 | 2356466 |
| 1463264 | 1463263 |
| 1712684 | 1712683 |
actual dataframe size: 902811 rows × 2 columns
then another dataframe of unique values of Series2. This I've done using value counts.
df2 = df['Series2'].value_counts().rename_axis('Series2').to_frame('counts').reset_index()
Then I need a list of matching Series1 values for each Series2 value: The expected result is:
| Series2 | counts | Series1_List |
|---|---|---|
| 2543113 | 6 | [2543114, 2547568, 2559207, 2563778, 2564330, 2675803] |
| 2557212 | 6 | [2557213, 2557301, 2559192, 2576080, 2675693, 2712790] |
| 2432032 | 5 | [2432033, 2444169, 2490928, 2491392, 2528056] |
| 2559269 | 5 | [2559270, 2576222, 2588034, 2677710, 2713207] |
| 2439554 | 5 | [2439555, 2441882, 2442272, 2443590, 2443983] |
| 2335180 | 5 | [2335181, 2398282, 2527060, 2527321, 2565487] |
| 2494111 | 4 | [2494112, 2495321, 2526026, 2528492] |
| 2559195 | 4 | [2559196, 2570172, 2634537, 2675718] |
| 2408775 | 4 | [2408776, 2409117, 2563765, 2564320] |
| 2408773 | 4 | [2408774, 2409116, 2563764, 2564319] |
I achieve this (although only for a subset of 50 rows) using the following code:
df2.loc[:50,'Series1_List'] = df2.loc[:50,'Series2'].apply(lambda x: df[df['Series2']==x]['Series1'].tolist())
If I do this for the whole dataframe it wouldn't complete even in 20 minutes. So the question is whether there is a faster and efficient method of achieving the result?
CodePudding user response:
IIUC, use:
df2 = (df.groupby('Series2', as_index=False)
.agg(counts=('Series1', 'count'), Series1_List=('Series1', list))
)
