I have the following dataframe
Country Name Code Signed Index
0 CZ Paulo 3 x 1
1 AE Paulo Yes None 1
2 AE Paulo Yes None 2
3 AE Paulo 1 Yes 5
4 CZ Paulo None None 6
5 DK Paulo Yes None 9
6 DK Paulo None None 20
7 PT Paulo 2 Yes 20
8 PT Paulo 1 Yes 22
I need three new columns after grouping by country
- count the missing values in
CodeandSignedcolumn - total of rows that have both values filled
- total of rows that have the same Country value
- point the rows where we have any of those values blank per Country (list or non list format) using the column "Index" as reference
If any of the Countries have all their Code and Signed rows filled, remove it from the dataframe.
In this case, it would return this dataframe:
Country Total_Blanks_on_Code Total_Blanks_on_Signed Total_of_rows_with_both_values_filled Total_of_rows_of_the_Country Rows with any blank
0 CZ 1 1 None 2 6
1 AE 2 0 1 3 [1,2]
2 DK 2 1 None 2 [9,20]
Thank you for your help!
CodePudding user response:
Here's a way to do what your question asks:
df['both_filled'] = (df.Code.notna() & df.Signed.notna()).map({True:True, False:None})
df['Rows_with_any_blank'] = df.Index[df['both_filled'].isna()]
gb = df.groupby('Country', sort=False)
df2 = ( gb.count().assign(
Rows_with_any_blank=gb['Rows_with_any_blank']
.agg(lambda x: list(x.dropna().astype(int)))) )
df2 = ( df2.assign(
Total_Blanks_on_Code=df2.Name - df2.Code,
Total_Blanks_on_Signed=df2.Name - df2.Signed)
[df2.both_filled < df2.Name]
[['Total_Blanks_on_Code','Total_Blanks_on_Signed',
'both_filled','Name','Rows_with_any_blank']]
.reset_index()
.rename(columns={
'Name':'Total_of_rows_of_the_Country',
'both_filled':'Total_of_rows_with_both_values_filled'
}) )
Input:
Country Name Code Signed Index
0 CZ Paulo 3 x 1
1 AE Paulo Yes None 1
2 AE Paulo Yes None 2
3 AE Paulo 1 Yes 5
4 CZ Paulo None None 6
5 DK Paulo Yes None 9
6 DK Paulo None None 20
7 PT Paulo 2 Yes 20
8 PT Paulo 1 Yes 22
Output:
Country Total_Blanks_on_Code Total_Blanks_on_Signed Total_of_rows_with_both_values_filled Total_of_rows_of_the_Country Rows_with_any_blank
0 CZ 1 1 1 2 [6]
1 AE 0 2 1 3 [1, 2]
2 DK 1 2 0 2 [9, 20]
Explanation:
- Create
both_filledcolumn which isTrueif bothCodeandSignedare non-null and isNoneotherwise (this allows us to later usecount()to effectively sum the number of rows having both columns non-null) - Create
Rows_with_any_blankcolumn which contains the value inIndexfor rows where neither ofCodeandSignedis null - Create a
groubpyobjectgbbyCountry - Use
count()to get the number of non-null entries per group in each column ofgb - Use
assign()to overwrite theRows_with_any_blankcolumn to be a list of the non-nullIndexvalues for each group - Use
assign()to create and populate columnsTotal_Blanks_on_CodeandTotal_Blanks_on_Signed - Keep only rows where the count in
both_filled< the count inName(which is the total number of rows in the original df); this removes anyCountryfor which allCodeandSignedrows are filled - Select the 5 desired columns in the specified order using
[[]] - Use
reset_index()to switchCountryfrom the index to a column - Use
rename()to changeNameandboth_filledto have the specified labelsTotal_of_rows_of_the_CountryandTotal_of_rows_with_both_values_filled.
CodePudding user response:
Based on the definitions/conditions you gave, the country AE should have a total blanks on Code equal to 0 and not 2.
Anyway, you can use the code below to get the format of output you're looking for :
out = (df.assign(Total1 = df['Code'].isna(),
Total2 = df['Signed?'].isna(),
Total3 = ~df['Code'].isna() & ~df['Signed?'].isna())
.groupby('Country', as_index=False)
.agg(NumberOfCountries = ('Country','size'),
Total1 = ('Total1','sum'),
Total2 = ('Total2','sum'),
Total3 = ('Total3','sum'))
).rename(columns={'Total1': 'Total Blanks on Code', 'Total2': 'Total Blanks on Signed?',
'Total3': 'Total of rows with both values filled', 'NumberOfCountries': 'Total of rows of the Country'})

