How to create a multiIndex (hierarchical index) dataframe object from another df's column'-CodePudding

I'm trying to create a pandas multiIndexed dataframe that is a summary of the unique values in each column.

Is there an easier way to have this information summarized besides creating this dataframe?

Either way, it would be nice to know how to complete this code challenge. Thanks for your help! Here is the toy dataframe and the solution I attempted using a for loop with a dictionary and a value_counts dataframe. Not sure if it's possible to incorporate MultiIndex.from_frame or .from_product here somehow...

Original Dataframe:

data = pd.DataFrame({'A': ['case', 'case', 'case', 'case', 'case'], 
                     'B': [2001, 2002, 2003, 2004, 2005], 
                     'C': ['F', 'M', 'F', 'F', 'M'],
                     'D': [0, 0, 0, 1, 0],
                     'E': [1, 0, 1, 0, 1],
                     'F': [1, 1, 0, 0, 0]})


    A       B       C   D   E   F
0   case    2001    F   0   1   1
1   case    2002    M   0   0   1
2   case    2003    F   0   1   0
3   case    2004    F   1   0   0
4   case    2005    M   1   1   0

Desired outcome:

     unique  percent
A    case    100 
B    2001    20
     2002    20
     2003    20
     2004    20
     2005    20
C    F       60
     M       40
D    0       80
     1       20
E    0       40
     1       60
F    0       60
     1       40

My failed for loop attempt:

def unique_values(df):
    values = {}
    columns = []
    df = pd.DataFrame(values, columns=columns)
    for col in data:
        df2 = data[col].value_counts(normalize=True)*100
        values = values.update(df2.to_dict)
        columns = columns.append(col*len(df2))
    return df

unique_values(data)


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-84-a341284fb859> in <module>
     11 
     12 
---> 13 unique_values(data)

<ipython-input-84-a341284fb859> in unique_values(df)
      5     for col in data:
      6         df2 = data[col].value_counts(normalize=True)*100
----> 7         values = values.update(df2.to_dict)
      8         columns = columns.append(col*len(df2))
      9     return df

TypeError: 'method' object is not iterable

Let me know if there's something obvious I'm missing! Still relatively new to EDA and pandas, any pointers appreciated.

CodePudding user response：

This is a fairly straightforward application of .melt:

data.melt().reset_index().groupby(['variable', 'value']).count()/len(data)

output

                index
variable value  
A        case   1.0
B        2001   0.2
         2002   0.2
         2003   0.2
         2004   0.2
         2005   0.2
C        F      0.6
         M      0.4
D        0      0.8
         1      0.2
E        0      0.4
         1      0.6
F        0      0.6
         1      0.4

CodePudding user response：

I'm sorry! I've written an answer, but it's in javascript. I came here after I thought I've clicked on javascript and started coding, but on posting I saw that you're coding in python.

I will post it anyway, maybe it will help you. Python is not that much different from javascript ;-)

const data = {
    A: ["case", "case", "case", "case", "case"],
    B: [2001, 2002, 2003, 2004, 2005],
    C: ["F", "M", "F", "F", "M"],
    D: [0, 0, 0, 1, 0],
    E: [1, 0, 1, 0, 1],
    F: [1, 1, 0, 0, 0]
};

const getUniqueStats = (_data) => {
    const results = [];
    for (let row in _data) {
        // create list of unique values
        const s = [...new Set(_data[row])]; 
        // filter for unique values and count them for percentage, then push
        results.push({ index: row, values: s.map((x) => ({ unique: x, percentage: (_data[row].filter((y) => y === x).length / data[row].length) * 100 })) });
    }
    return results;
};

const results = getUniqueStats(data);

results.forEach((row) =>
    row.values.forEach((value) =>
        console.log(`${row.index}\t${value.unique}\t${value.percentage}%`)
    )
);