I need to gather information from an existing dataset. The dataset looks as follows:
Source Target Label_S Weight Prop_1 Prop_2 Mer_1 Mer_2
car airplane 0.5 0.2 1 0 0 0
car train 0.5 0.5 1 1 0 1
car bike 0.5 0.2 1 1 0 0
bike motorbike 1 0.7 1 1 0 1
bike car 1 0.2 1 1 0 0
airplane car -1 0.2 0 1 0 0
train car 1 0.5 1 1 1 0
motorbike car 1 0.7 1 1 1 0
motorbike toy 1 0.6 1 0 1 1
Label_S, Prop_1 and Mer_1 are Source's properties; Prop_2 and Mer_2 are Target's properties.
I am trying to create a list of unique nodes from both Source and Target, including their properties; something like this:
Node Label_S Property Merchandising
car 0.5 1 0
airplane -1 0 0
train 1 1 1
bike 1 1 0
motorbike 1 1 1
toy 0 1
I had not problem to create the list including all the nodes:
source = df['Source'].unique().tolist()
target = df['Target'].unique().tolist()
all_nodes=list(source target)
but I am not actually understanding how to get information from properties columns based on Source/Target information.
I think I should first split the dataframe in two dataframes: one with Source plus the properties of Source; the other one with Target elements plus the properties of Target.
Once got this information, maybe it could be good to append the two dataframes and remove duplicates under the column Node. But I feel that something is wrong: for example, I have Label_S which is a property of Source and not of Target...
CodePudding user response:
I'm not sure I understand correctly but you can first create two distinct dataframes for source and target and aggregate the properties in lists:
df_s = df[["Source", "Label_S", "Prop_1", "Mer_1"]].groupby("Source").agg(list)
df_t = df[["Target", "Weight", "Prop_2", "Mer_2"]].groupby("Target").agg(list)
print(df_s)
print(df_t)
Output:
Label_S Prop_1 Mer_1
Source
airplane [-1.0] [0] [0]
bike [1.0, 1.0] [1, 1] [0, 0]
car [0.5, 0.5, 0.5] [1, 1, 1] [0, 0, 0]
motorbike [1.0, 1.0] [1, 1] [1, 1]
train [1.0] [1] [1]
Weight Prop_2 Mer_2
Target
airplane [0.2] [0] [0]
bike [0.2] [1] [0]
car [0.2, 0.2, 0.5, 0.7] [1, 1, 1, 1] [0, 0, 0, 0]
motorbike [0.7] [1] [1]
toy [0.6] [0] [1]
train [0.5] [1] [1]
Edit
You can aggregate the properties differently to keep only one value (e.g. max), then merge you dataframes:
df_s = df[["Source", "Label_S", "Prop_1", "Mer_1"]].groupby("Source", as_index=False).agg(max)
df_t = df[["Target", "Weight", "Prop_2", "Mer_2"]].groupby("Target", as_index=False).agg(max)
df_s.columns = ["Node", "Label_S", "Property", "Merchandising"]
df_t.columns = ["Node", "Weight", "Property", "Merchandising"]
print(df_s.merge(df_t, how="outer").set_index("Node"))
Output:
Label_S Property Merchandising Weight
Node
airplane -1.0 0 0 0.2
bike 1.0 1 0 0.2
car 0.5 1 0 0.7
motorbike 1.0 1 1 0.7
train 1.0 1 1 0.5
toy NaN 0 1 0.6
If you want to exclude the Weight column:
print(df_s.merge(df_t[["Node", "Property", "Merchandising"]], how="outer").set_index("Node"))
