Home > Blockchain >  How to iterate over columns and check condition by group
How to iterate over columns and check condition by group

Time:01-15

I have data for many countries over a period of time (2001-2003). It looks something like this:

index year country inflation GDP
1 2001 AFG nan 48
2 2002 AFG nan 49
3 2003 AFG nan 50
4 2001 CHI 3.0 nan
5 2002 CHI 5.0 nan
6 2003 CHI 7.0 nan
7 2001 USA nan 220
8 2002 USA 4.0 250
9 2003 USA 2.5 280

I want to drop countries in case there is no data (i.e. values are missing for all years) for any given variable.

In the example table above, I want to drop AFG (because it misses all values for inflation) and CHI (GDP missing). I don't want to drop observation #7 just because one year is missing.

What's the best way to do that?

CodePudding user response:

This should work by filtering all values that have nan in one of (inflation, GDP):

(
    df.groupby(['country'])
    .filter(lambda x: not x['inflation'].isnull().all() and not x['GDP'].isnull().all())
)

Note, if you have more than two columns you can work on a more general version of this:

df.groupby(['country']).filter(lambda x: not x.isnull().all().any())

If you want this to work with a specific range of year instead of all columns, you can set up a mask and change the code a bit:

mask = (df['year'] >= 2002) & (df['year'] <= 2003) # mask of years
grp = df.groupby(['country']).filter(lambda x: not x[mask].isnull().all().any())

CodePudding user response:

You can also try this:

# check where the sum is equal to 0 - means no values in the column for a specific country
group_by = df.groupby(['country']).agg({'inflation':sum, 'GDP':sum}).reset_index()

# extract only countries with information on both columns
indexes = group_by[ (group_by['GDP'] != 0) & ( group_by['inflation'] != 0) ].index
final_countries = list(group_by.loc[ group_by.index.isin(indexes), : ]['country'])

# keep the rows contains the countries

df = df.drop(df[~df.country.isin(final_countries)].index)

CodePudding user response:

You could reshape the data frame from long to wide, drop nulls, and then convert back to wide.

To convert from long to wide, you can use pivot functions. See this question too.

Here's code for dropping nulls, after its reshaped:

df.dropna(axis=0, how= 'any', thresh=None, subset=None, inplace=True) # Delete rows, where any value is null

To convert back to long, you can use pd.melt.

  •  Tags:  
  • Related