How to iterate over columns and check condition by group-CodePudding

I have data for many countries over a period of time (2001-2003). It looks something like this:

index	year	country	inflation	GDP
1	2001	AFG	nan	48
2	2002	AFG	nan	49
3	2003	AFG	nan	50
4	2001	CHI	3.0	nan
5	2002	CHI	5.0	nan
6	2003	CHI	7.0	nan
7	2001	USA	nan	220
8	2002	USA	4.0	250
9	2003	USA	2.5	280

I want to drop countries in case there is no data (i.e. values are missing for all years) for any given variable.

In the example table above, I want to drop AFG (because it misses all values for inflation) and CHI (GDP missing). I don't want to drop observation #7 just because one year is missing.

What's the best way to do that?

CodePudding user response：

This should work by filtering all values that have nan in one of (inflation, GDP):

(
    df.groupby(['country'])
    .filter(lambda x: not x['inflation'].isnull().all() and not x['GDP'].isnull().all())
)

Note, if you have more than two columns you can work on a more general version of this:

df.groupby(['country']).filter(lambda x: not x.isnull().all().any())

If you want this to work with a specific range of year instead of all columns, you can set up a mask and change the code a bit:

mask = (df['year'] >= 2002) & (df['year'] <= 2003) # mask of years
grp = df.groupby(['country']).filter(lambda x: not x[mask].isnull().all().any())

CodePudding user response：

You can also try this:

# check where the sum is equal to 0 - means no values in the column for a specific country
group_by = df.groupby(['country']).agg({'inflation':sum, 'GDP':sum}).reset_index()

# extract only countries with information on both columns
indexes = group_by[ (group_by['GDP'] != 0) & ( group_by['inflation'] != 0) ].index
final_countries = list(group_by.loc[ group_by.index.isin(indexes), : ]['country'])

# keep the rows contains the countries

df = df.drop(df[~df.country.isin(final_countries)].index)

CodePudding user response：

You could reshape the data frame from long to wide, drop nulls, and then convert back to wide.

To convert from long to wide, you can use pivot functions. See this question too.

Here's code for dropping nulls, after its reshaped:

df.dropna(axis=0, how= 'any', thresh=None, subset=None, inplace=True) # Delete rows, where any value is null

To convert back to long, you can use pd.melt.