I have a dataframe, which you can reproduce with the following code:
import numpy as np
import pandas as pd
from io import StringIO
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
df4s = """
contract RB BeginDate
0 A00118 46 20120705
1 A00118 47 20121005
2 A00253 48 0
3 A00253 48 0
"""
df4 = pd.read_csv(StringIO(df4s.strip()), sep=r'\s+',
                  dtype={"BeginDate": int}
                  )
The output is:
  contract  RB  BeginDate
0   A00118  46   20120705
1   A00118  47   20121005
2   A00253  48          0
3   A00253  48          0
Now I want to add a new column 'first_month' based on 'BeginDate'. The logic is simple: if BeginDate equals 0, first_month should be 0; otherwise it should equal the month part of BeginDate. My code is:
df4['first_month'] = np.where(df4['BeginDate'] != 0,
df4['BeginDate'].astype(str).str[4:6:1].astype(int), 0)
The error is:
ValueError: invalid literal for int() with base 10: ''
The traceback is:
:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy)
707 # work around NumPy brokenness, #1987
708 if np.issubdtype(dtype.type, np.integer):
--> 709 return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
710
711 # if we have a datetime/timedelta array of objects
pandas\_libs\lib.pyx in pandas._libs.lib.astype_intsafe()
pandas/_libs/src\util.pxd in util.set_value_at_unsafe()
ValueError: invalid literal for int() with base 10: ''
The output should be:
  contract  RB  BeginDate  first_month
0   A00118  46   20120705            7
1   A00118  47   20121005           10
2   A00253  48          0            0
3   A00253  48          0            0
Can anyone help?
CodePudding user response:
From the documentation of np.where:
x, y : array_like
Values from which to choose. x, y and condition need to be broadcastable to some shape.
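This means the expression you pass as x is not evaluated only where the condition is true. Both x and y are computed in full first, and np.where then merely selects between them element by element. Your string slicing and astype(int) therefore run on the whole column, not just on the first two rows. For the rows where BeginDate is 0, the string '0' sliced with [4:6] is empty, and int('') raises the error you see.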
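To make the eager evaluation visible, here is a minimal sketch that rebuilds only the BeginDate column from the question and runs the x expression on its own, without np.where. The names month_text and conversion_failed are introduced here purely for illustration.

```python
import pandas as pd

# Only the column that matters, with the same values as in the question.
df4 = pd.DataFrame({"BeginDate": [20120705, 20121005, 0, 0]})

# The slice is taken for every row, including the rows holding 0.
month_text = df4["BeginDate"].astype(str).str[4:6]
print(month_text.tolist())

# Converting the whole Series fails on the empty strings,
# before any condition could filter those rows out.
conversion_failed = False
try:
    month_text.astype(int)
except ValueError as exc:
    conversion_failed = True
    print("ValueError:", exc)
```

The failure happens while building the argument, so no choice of condition inside np.where can prevent it.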
Use zfill to transform a '0' to '00000000':
df4['first_month'] = df4['BeginDate'].astype(str).str.zfill(8).str[4:6:1].astype(int)
print(df4)
# Output
  contract  RB  BeginDate  first_month
0   A00118  46   20120705            7
1   A00118  47   20121005           10
2   A00253  48          0            0
3   A00253  48          0            0
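Since BeginDate is already an integer, the month could also be taken arithmetically, with no string conversion at all. This is an alternative sketch rather than part of the zfill approach above, and it assumes every non-zero value really is in YYYYMMDD form.

```python
import pandas as pd

df4 = pd.DataFrame({
    "contract": ["A00118", "A00118", "A00253", "A00253"],
    "RB": [46, 47, 48, 48],
    "BeginDate": [20120705, 20121005, 0, 0],
})

# Drop the two day digits, then keep the last two remaining digits (the month).
# A BeginDate of 0 naturally yields 0.
df4["first_month"] = df4["BeginDate"] // 100 % 100

# Cross-check against the zfill-based string slicing.
via_zfill = df4["BeginDate"].astype(str).str.zfill(8).str[4:6].astype(int)
print(df4)
```

Both routes give the same column for this data; the arithmetic one simply skips the round trip through strings.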
CodePudding user response:
Pandas dataframes can be filtered directly with a boolean condition:
df4[df4.BeginDate != 0]
or
df4.loc[df4.BeginDate != 0]
You can then apply your logic for extracting the month to the filtered rows only:
df4['first_month'] = df4.loc[df4.BeginDate != 0].BeginDate.astype(str).str[4:6:1].astype(int)
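Note that this leaves NaN in rows 2 and 3 rather than 0 as in your example, because the assignment aligns on the index and the filtered-out rows have no value on the right-hand side. The column also becomes float as a result.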
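If zeros are wanted instead, one option is to fill the gaps after the assignment and cast back to int. The sketch below rebuilds the frame from the question; the names mask and months are only illustrative.

```python
import pandas as pd

df4 = pd.DataFrame({
    "contract": ["A00118", "A00118", "A00253", "A00253"],
    "RB": [46, 47, 48, 48],
    "BeginDate": [20120705, 20121005, 0, 0],
})

# Compute the month only for rows with a real date.
mask = df4["BeginDate"] != 0
months = df4.loc[mask, "BeginDate"].astype(str).str[4:6].astype(int)

# Assignment aligns on the index, so rows 2 and 3 become NaN here.
df4["first_month"] = months
nan_count = int(df4["first_month"].isna().sum())

# Replace the NaN values with 0 and restore an integer dtype.
df4["first_month"] = df4["first_month"].fillna(0).astype(int)
print(df4)
```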
CodePudding user response:
If the BeginDate column contains strings, you can achieve the desired result with a row-wise apply (in the question the column is read as int, so it would need astype(str) first):
df['FirstMonth'] = df.apply(lambda row: 0 if row['BeginDate'] == '0' else int(row['BeginDate'][4:6]), axis = 1)
This yields:
  Contract  RB  BeginDate  FirstMonth
0   A00118  46   20120705           7
1   A00118  47   20121005          10
2   A00253  48          0           0
3   A00253  48          0           0
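For the integer column from the question, the same idea might look like the sketch below. It converts to strings first and uses Series.apply instead of a row-wise DataFrame.apply; begin_text is a name made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "contract": ["A00118", "A00118", "A00253", "A00253"],
    "RB": [46, 47, 48, 48],
    "BeginDate": [20120705, 20121005, 0, 0],  # int, as in the question
})

# Work on the string form so that slicing is possible.
begin_text = df["BeginDate"].astype(str)

# "0" maps to 0; anything else is assumed to be YYYYMMDD.
df["FirstMonth"] = begin_text.apply(lambda s: 0 if s == "0" else int(s[4:6]))
print(df)
```

Because the lambda checks for "0" before slicing, the empty-string conversion never happens, at the cost of a Python-level loop over the rows.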
