I read data from a CSV file and put it into a DataFrame. The data contains rows of attributes that are related to an element and need to be transformed to columns instead. For each element, there is also a sub-table that I need to account for.
Example DataFrame:
df = pd.DataFrame([["Drop-Down Field Name:", "Test List"],
["Drop-Down List Name:", "Test List"],
["Drop-Down List Type:", "Specific"],
["Drop-Down List Status:", "Active"],
["Drop-Down List Values:", "Text", "Active", "Default", "Weight", "Image"],
[None, "DDL V1", "Yes", "Yes", "1.00", None],
[None, "DDL V2", "Yes", None, "2.00", None],
[None, "DDL V3", "Yes", None, "3.00", None],
[None, "DDL V4", "Yes", None, "4.00", None],
[None, "DDL V5", "Yes", None, "5.00", None],
["Drop-Down Field Name:", "Test Empty List"],
["Drop-Down List Name:", "Test Empty List"],
["Drop-Down List Type:", "Specific"],
["Drop-Down List Status:", "Active"],
["Drop-Down List Values:", "Text", "Active", "Default", "Weight", "Image"],
["Drop-Down Field Name:", "Email verified?"],
["Drop-Down List Name:", "Yes/No"],
["Drop-Down List Type:", "Reusable"],
["Drop-Down List Status:", "Active"],
["Drop-Down List Values:", "Text", "Active", "Default", "Weight", "Image"],
[None, "Yes", "Yes", None, "1.00", "green_checkmark.jpg"],
[None, "No", "Yes", None, "0", "red_cross.jpg"],
["Drop-Down Field Name:", "Payment verified?"],
["Drop-Down List Name:", "Yes/No"],
["Drop-Down List Type:", "Reusable"],
["Drop-Down List Status:", "Active"],
["Drop-Down List Values:", "Text", "Active", "Default", "Weight", "Image"],
[None, "Yes", "Yes", None, "1.00", "green_checkmark.jpg"],
[None, "No", "Yes", None, "0", "red_cross.jpg"],
["Drop-Down Field Name:", "How many years expected from now?"],
["Drop-Down List Name:", "How many years expected from now?"],
["Drop-Down List Type:", "Specific"],
["Drop-Down List Status:", "Active"],
["Drop-Down List Values:", "Text", "Active", "Default", "Weight", "Image"],
[None, "One", None, None, None, None],
[None, "Two", None, None, None, None],
[None, "Three", None, None, None, None],
[None, "Four", None, None, None, None],
[None, "Five", None, None, None, None],
[None, "1", "Yes", None, None, None],
[None, "2", "Yes", None, None, None],
[None, "3", "Yes", None, None, None],
[None, "4", "Yes", None, None, None],
[None, "5", "Yes", None, None, None],
[None, "6-10", "Yes", None, None, None],
[None, "11-15", "Yes", None, None, None],
[None, "16-20", "Yes", None, None, None],
[None, "20 ", "Yes", None, None, None]])
The sub-tables in the table have their cells in bold. They are not equal in number of entries and can have no entries at all.
| 0 | 1 | 2 | 3 | 4 | 5 | |
|---|---|---|---|---|---|---|
| 0 | Drop-Down Field Name: | Test List | None | None | None | None |
| 1 | Drop-Down List Name: | Test List | None | None | None | None |
| 2 | Drop-Down List Type: | Specific | None | None | None | None |
| 3 | Drop-Down List Status: | Active | None | None | None | None |
| 4 | Drop-Down List Values: | Text | Active | Default | Weight | Image |
| 5 | None | DDL V1 | Yes | Yes | 1.00 | None |
| 6 | None | DDL V2 | Yes | None | 2.00 | None |
| 7 | None | DDL V3 | Yes | None | 3.00 | None |
| 8 | None | DDL V4 | Yes | None | 4.00 | None |
| 9 | None | DDL V5 | Yes | None | 5.00 | None |
| 10 | Drop-Down Field Name: | Test Empty List | None | None | None | None |
| 11 | Drop-Down List Name: | Test Empty List | None | None | None | None |
| 12 | Drop-Down List Type: | Specific | None | None | None | None |
| 13 | Drop-Down List Status: | Active | None | None | None | None |
| 14 | Drop-Down List Values: | Text | Active | Default | Weight | Image |
| 15 | Drop-Down Field Name: | Email verified? | None | None | None | None |
| 16 | Drop-Down List Name: | Yes/No | None | None | None | None |
| 17 | Drop-Down List Type: | Reusable | None | None | None | None |
| 18 | Drop-Down List Status: | Active | None | None | None | None |
| 19 | Drop-Down List Values: | Text | Active | Default | Weight | Image |
| 20 | None | Yes | Yes | None | 1.00 | green_checkmark.jpg |
| 21 | None | No | Yes | None | 0 | red_cross.jpg |
| 22 | Drop-Down Field Name: | Payment verified? | None | None | None | None |
| 23 | Drop-Down List Name: | Yes/No | None | None | None | None |
| 24 | Drop-Down List Type: | Reusable | None | None | None | None |
| 25 | Drop-Down List Status: | Active | None | None | None | None |
| 26 | Drop-Down List Values: | Text | Active | Default | Weight | Image |
| 27 | None | Yes | Yes | None | 1.00 | green_checkmark.jpg |
| 28 | None | No | Yes | None | 0 | red_cross.jpg |
| 29 | Drop-Down Field Name: | How many years expected from now? | None | None | None | None |
| 30 | Drop-Down List Name: | How many years expected from now? | None | None | None | None |
| 31 | Drop-Down List Type: | Specific | None | None | None | None |
| 32 | Drop-Down List Status: | Active | None | None | None | None |
| 33 | Drop-Down List Values: | Text | Active | Default | Weight | Image |
| 34 | None | One | None | None | None | None |
| 35 | None | Two | None | None | None | None |
| 36 | None | Three | None | None | None | None |
| 37 | None | Four | None | None | None | None |
| 38 | None | Five | None | None | None | None |
| 39 | None | 1 | Yes | None | None | None |
| 40 | None | 2 | Yes | None | None | None |
| 41 | None | 3 | Yes | None | None | None |
| 42 | None | 4 | Yes | None | None | None |
| 43 | None | 5 | Yes | None | None | None |
| 44 | None | 6-10 | Yes | None | None | None |
| 45 | None | 11-15 | Yes | None | None | None |
| 46 | None | 16-20 | Yes | None | None | None |
| 47 | None | 20 | Yes | None | None | None |
My Approach:
- For each sub-table:
- Find start and end locations
- Transpose
- Replace with the transpose in main table
- Insert a column that performs
groupby()andcumcount()on element attributes - Pivot the table
I am relatively new to pandas, so I'm more than open to hear better approaches :) But here is my code so far:
ddlv_locations = df.where(df.eq("Drop-Down List Values:")).stack().index.tolist()
result = df.copy()
# Get sub-table, transpose it, and concatenate with main table
for ddlv_location in ddlv_locations:
x_start = list(ddlv_location)[0]
x_end = list(ddlv_location)[0]
y_start = list(ddlv_location)[1]
# Capture the row index where the sub-table ends
while(x_end 1 < df.shape[0] and pd.isna(df.loc[x_end 1, y_start])):
x_end = x_end 1
ddlv = df.loc[x_start:x_end, y_start 1:df.shape[1]].T # Transpose the sub-table
ddlv.columns = range(ddlv.shape[1]) # Reset column index
# Concatenate sub-table into main table whilst removing the original rows for the sub-table
result = pd.concat([result.loc[:x_start - 1], ddlv]) if x_end == df.shape[0] - 1 else pd.concat([result.loc[:x_start - 1], ddlv, result.loc[x_end 1:]])
result.insert(0, "count", result.groupby(0).cumcount())
result = result.reset_index(drop=True)
At this point, I can't wrap my head around how I would pivot the table and ensure the columns for the sub-tables are captured by individual rows in the final result.
Desired Result:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Drop-Down Field Name: | Drop-Down List Name: | Drop-Down List Type: | Drop-Down List Status: | Text | Active | Default | Weight | Image |
| 1 | Test List | Test List | Specific | Active | DDL V1 | Yes | Yes | 1.00 | None |
| 2 | Test List | Test List | Specific | Active | DDL V2 | Yes | None | 2.00 | None |
| 3 | Test List | Test List | Specific | Active | DDL V3 | Yes | None | 3.00 | None |
| 4 | Test List | Test List | Specific | Active | DDL V4 | Yes | None | 4.00 | None |
| 5 | Test List | Test List | Specific | Active | DDL V5 | Yes | None | 5.00 | None |
| 6 | Test Empty List | Test Empty List | Specific | Active | None | None | None | None | None |
| 7 | Email verified? | Yes/No | Reusable | Active | Yes | Yes | None | 1.00 | green_checkmark.jpg |
| 8 | Email verified? | Yes/No | Reusable | Active | No | None | None | 0 | red_cross.jpg |
| 9 | Payment verified? | Yes/No | Reusable | Active | Yes | Yes | None | 1.00 | green_checkmark.jpg |
| 10 | Payment verified? | Yes/No | Reusable | Active | No | None | None | 0 | red_cross.jpg |
| 11 | How many years expected from now? | How many years expected from now? | Specific | Active | One | None | None | None | None |
| 12 | How many years expected from now? | How many years expected from now? | Specific | Active | Two | None | None | None | None |
| 13 | How many years expected from now? | How many years expected from now? | Specific | Active | Three | None | None | None | None |
| 14 | How many years expected from now? | How many years expected from now? | Specific | Active | Four | None | None | None | None |
| 15 | How many years expected from now? | How many years expected from now? | Specific | Active | Five | None | None | None | None |
| 16 | How many years expected from now? | How many years expected from now? | Specific | Active | 1 | Yes | None | None | None |
| 17 | How many years expected from now? | How many years expected from now? | Specific | Active | 2 | Yes | None | None | None |
| 18 | How many years expected from now? | How many years expected from now? | Specific | Active | 3 | Yes | None | None | None |
| 19 | How many years expected from now? | How many years expected from now? | Specific | Active | 4 | Yes | None | None | None |
| 20 | How many years expected from now? | How many years expected from now? | Specific | Active | 5 | Yes | None | None | None |
| 21 | How many years expected from now? | How many years expected from now? | Specific | Active | 6-10 | Yes | None | None | None |
| 22 | How many years expected from now? | How many years expected from now? | Specific | Active | 11-15 | Yes | None | None | None |
| 23 | How many years expected from now? | How many years expected from now? | Specific | Active | 15-20 | Yes | None | None | None |
| 24 | How many years expected from now? | How many years expected from now? | Specific | Active | 20 | Yes | None | None | None |
CodePudding user response:
def reorganize(gr):
left = gr.iloc[:4, :2].set_index(0).T
right = (
gr.iloc[4:, 1:]
.reset_index(drop=True)
.T.set_index(0).T
)
# note that indexing in both frames starts from 1
return left.join(right, how='outer')
start_marker = "Drop-Down Field Name:"
grouper = (df.iloc[:, 0] == start_marker).cumsum()
result = df.groupby(grouper).apply(reorganize).reset_index(drop=True)
result.iloc[:, :4] = result.iloc[:, :4].fillna(method='pad')
p.s. The code works on the assumption that columns of df have no specific names, i.e. they are indexed by default, starting from zero (sort of 0, 1, 2, 3, ...)
