How to merge rows in pandas with similar data-CodePudding

I got data with columns: startpoint, endpoint, data.

I want to merge the startpoint, endpoint rows if they contain the same data (to both directions) and add another columns of the extra data.

for example starting with:

startpoint	endpoint	data
A	B	1
C	D	2
B	A	3
D	C	4

TO:

startpoint	endpoint	data_1	data_2
A	B	1	3
C	D	2	4

Is there quick way to do it on pandas?

Thanks.

CodePudding user response：

First we groupby on an index where we sort values in startpoint, endpoint to make sure we get match permutations

match_groups = ['_'.join(sorted(t)) for t in zip(df['startpoint'],df['endpoint'])]
df2 = df.groupby(match_groups, as_index = False).agg({'startpoint':'first', 'endpoint':'first', 'data':list})

df2 looks like this:

    startpoint    endpoint    data
--  ------------  ----------  ------
 0  A             B           [1, 3]
 1  C             D           [2, 4]

if we want data in separate columns then we apply pd.Series (and rename columns to desired labels)

(df2.set_index(['startpoint', 'endpoint'])['data']
    .apply(pd.Series).rename(columns = lambda n:f'data_{n 1}')
    .reset_index()
)

output:

    startpoint    endpoint      data_1    data_2
--  ------------  ----------  --------  --------
 0  A             B                  1         3
 1  C             D                  2         4

CodePudding user response：

If I've understood your question correctly, the following code should do what you want -

data.index = [hash(frozenset([x,y])) for x, y in zip(data["startpoint"], data["endpoint"])]
result = data.groupby(data.index)["data"].apply(list).to_frame()
result = result["data"].apply(pd.Series)
result.columns = ["data1", "data2"]
result = pd.merge(data[["startpoint", "endpoint"]], result, left_index=True, right_index=True)
result = result[~result.index.duplicated(keep='first')]

The variable data is the original DataFrame. Please let me know if anything is unclear.

CodePudding user response：

Your best bet is to use pd.merge(). Pandas official website shows how to use pd.merge() functions.

https://pandas.pydata.org/docs/dev/user_guide/merging.html

CodePudding user response：

Get matching values values between startpoint and endpoint and do the necessary conversions for your data types. You can then sort the values in your dataframe and use a groupby:

# Matching values between startpoint   endpoint
df['start_end_grouped'] = [sorted(''.join(val).replace(',','')) for val in zip(df['startpoint'], df['endpoint'])]

# Conversions
df['data'] = df['data'].str.replace(',','').astype(float)
df['start_end_grouped'] = df['start_end_grouped'].astype(str)

# Result

df[['data','start_end_grouped']].sort_values(by=['start_end_grouped','data'])\
    .groupby('start_end_grouped',as_index=False).agg(list)

  start_end_grouped        data
0        ['A', 'B']  [1.0, 3.0]
1        ['C', 'D']  [2.0, 4.0]