In the following example, what would be the best to group so that there could be a new column that is formed by taking first year in each group and subtracting by current year. For example in in row with index 0 it would be NaN, row with index 1 , it would = 1, row with index 2 it would = 3, row with index 4 = 1 and so forth.
>>> import pandas as pd
>>> df = pd.DataFrame({'id': ['1', '1', '1', '2', '2', '3', '4', '4'],
... 'Year': [2000, 2001, 2003, 2004, 2005, 2002, 2001, 2003]})
>>> print(df)
id Year
0 1 2000
1 1 2001
2 1 2003
3 2 2004
4 2 2005
5 3 2002
6 4 2001
7 4 2003
CodePudding user response:
Transform Year with first to get the first year per id, then subtract this from Year column to get difference, finally mask the values where difference is 0:
s = df['Year'] - df.groupby('id')['Year'].transform('first')
df['col'] = s.mask(s == 0)
id Year col
0 1 2000 NaN
1 1 2001 1.0
2 1 2003 3.0
3 2 2004 NaN
4 2 2005 1.0
5 3 2002 NaN
6 4 2001 NaN
7 4 2003 2.0
CodePudding user response:
Although the answer from @Shubham is shorter than my answer, I want to put here another approach:
1. Looking for the firsts of each group
You need to know what are each first value of each group, for that reason, you are going to use Groupby.first method:
firsts = df.groupby('id').first()
print(first)
The first variable is a dataframe with this structure:
Year
id
1 2000
2 2004
3 2002
4 2001
2. Looking at how many values have each group
For this, you will use Groupby.count method
howmany = df.groupby('id').count().rename(columns={'Year': 'Howmany'})
print(howmany)
The howmany variable is a dataframe with this structure:
Howmany
id
1 3
2 2
3 1
4 2
3. Create an information dataframe
Information will be an auxiliary dataframe with information about what is the first Year of each group and how many values has that group. You need to use a simple Dataframe.join method
information = firsts.join(howmany).reset_index()
print(information)
The information variable is a dataframe with this structure:
id Year Howmany
0 1 2000 3
1 2 2004 2
2 3 2002 1
3 4 2001 2
4. Create a column with each first value of each group
You are going to create a new column called col which has the first Year of each group repeated according to the number of values.
lists_df = []
for value in information.itertuples():
lists_df.append(pd.DataFrame(data=[value.Year]*value.Howmany))
df['col'] = pd.concat(lists_df).reset_index(drop=True)
So far, our df looks like this:
id Year col
0 1 2000 2000 <--- First value of group 1, it is repeated 3 times because group 1 has 3 values
1 1 2001 2000
2 1 2003 2000
3 2 2004 2004 <--- First value of group 2, it is repeated 2 times because group 2 has 2 values
4 2 2005 2004
5 3 2002 2002
6 4 2001 2001
7 4 2003 2001
You can realize that col column has exactly the first value of each group
5. Use Groupby.shift to shift col column by each group
So, if you do
df.groupby('id')['col'].shift()
Then, you will get:
col
0 NaN
1 2000.0
2 2000.0
3 NaN
4 2004.0
5 NaN
6 NaN
7 2001.0
Finally, a simple vectorized substraction is enough, between Year and col column:
df['col'] = df['Year'] - df.groupby('id')['col'].shift()
All code together!
firsts = df.groupby('id').first()
howmany = df.groupby('id').count().rename(columns={'Year': 'Howmany'})
information = firsts.join(howmany).reset_index()
lists_df = []
for value in information.itertuples():
lists_df.append(pd.DataFrame(data=[value.Year]*value.Howmany))
df['col'] = pd.concat(lists_df).reset_index(drop=True)
df['col'] = df['Year'] - df.groupby('id')['col'].shift()
print(df)
Output:
id Year col
0 1 2000 NaN
1 1 2001 1.0
2 1 2003 3.0
3 2 2004 NaN
4 2 2005 1.0
5 3 2002 NaN
6 4 2001 NaN
7 4 2003 2.0
