pandas/NumPy error: ValueError: invalid literal for int() with base 10: ''

I have a DataFrame that you can reproduce with the following code:

import numpy as np
import pandas as pd
from io import StringIO
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
df4s = """
    contract  RB  BeginDate  
0  A00118     46   20120705      
1  A00118     47   20121005      
2  A00253     48   0             
3  A00253     48   0                     
"""
df4 = pd.read_csv(StringIO(df4s.strip()), sep=r'\s+',
                  dtype={"BeginDate": int}
                  )

The output is:

  contract  RB  BeginDate
0   A00118  46   20120705
1   A00118  47   20121005
2   A00253  48          0
3   A00253  48          0

Now I want to generate a new column 'first_month' based on 'BeginDate'. The logic is simple: if BeginDate equals 0, first_month should be 0; otherwise it should equal the month of BeginDate. My code is:

df4['first_month'] = np.where(df4['BeginDate'] != 0,
                                    df4['BeginDate'].astype(str).str[4:6:1].astype(int), 0)

The error is:

ValueError: invalid literal for int() with base 10: ''

The traceback is:

:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy)
    707         # work around NumPy brokenness, #1987
    708         if np.issubdtype(dtype.type, np.integer):
--> 709             return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
    710 
    711         # if we have a datetime/timedelta array of objects

pandas\_libs\lib.pyx in pandas._libs.lib.astype_intsafe()

pandas/_libs/src\util.pxd in util.set_value_at_unsafe()

ValueError: invalid literal for int() with base 10: ''

The output should be:

  contract  RB  BeginDate  first_month
0   A00118  46   20120705            7
1   A00118  47   20121005           10
2   A00253  48          0            0
3   A00253  48          0            0

Can anyone help?

CodePudding user response:

From the documentation of np.where:

x, y : array_like
    Values from which to choose. x, y and condition need to be broadcastable to some shape.

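This means the expression you pass as x is not evaluated only where the condition is true. It is evaluated in full first, and np.where then merely selects from the result where the condition is true. Your operation is therefore performed on the whole column, not just on the subset (the first two rows). For the third row, str(0) is '0', the slice [4:6] of '0' is an empty string, and int('') raises this error.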
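The eager evaluation described above can be seen by computing the x argument on its own. This is only a small sketch; the names begin, month_text and raised are made up for illustration and do not come from the question.

```python
import pandas as pd

# Same BeginDate values as in the question (illustrative Series, not df4 itself).
begin = pd.Series([20120705, 20121005, 0, 0], name="BeginDate")

# This slice is computed for all four rows before np.where would ever be called.
month_text = begin.astype(str).str[4:6]
print(month_text.tolist())  # ['07', '10', '', '']

# Converting the empty strings is the step that raises.
try:
    month_text.astype(int)
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```

The condition never gets a chance to protect rows 2 and 3, because the conversion fails while the argument is still being built.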

Use zfill to pad '0' to '00000000', so that the slice [4:6] yields '00' instead of an empty string:

df4['first_month'] = df4['BeginDate'].astype(str).str.zfill(8).str[4:6:1].astype(int)
print(df4)

# Output
  contract  RB  BeginDate  first_month
0   A00118  46   20120705            7
1   A00118  47   20121005           10
2   A00253  48          0            0
3   A00253  48          0            0
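To see why the padding works, the intermediate values can be printed step by step. The sketch below also includes an alternative that is not part of this answer: plain integer arithmetic on a YYYYMMDD value, which skips the string round trip. The variable names are assumptions chosen for illustration.

```python
import pandas as pd

begin = pd.Series([20120705, 20121005, 0, 0], name="BeginDate")

# Step by step: pad to 8 characters, slice the month digits, convert to int.
padded = begin.astype(str).str.zfill(8)   # '0' becomes '00000000'
month_from_text = padded.str[4:6].astype(int)

# Assumed alternative (not from the answer): drop the day digits with // 100,
# then keep the last two digits with % 100. A BeginDate of 0 stays 0.
month_from_math = begin // 100 % 100

print(padded.tolist())
print(month_from_text.tolist())   # [7, 10, 0, 0]
print(month_from_math.tolist())   # [7, 10, 0, 0]
```

Both variants rely on BeginDate being either 0 or a full 8-digit date; other values would need extra validation.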

CodePudding user response:

pandas DataFrames can be filtered directly with a boolean mask, either by plain indexing or with .loc:

df4[df4.BeginDate != 0]

df4.loc[df4.BeginDate != 0]

You can then apply your month-extraction logic to the filtered rows only:

df4['first_month'] = df4.loc[df4.BeginDate != 0].BeginDate.astype(str).str[4:6:1].astype(int)

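Note that this puts NaN in rows 2 and 3, not 0 as in your example, because the assignment aligns on the index and the filtered-out rows have no value. The column becomes float as a result.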
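If the 0 values from the question are wanted, the missing entries can be filled afterwards. A minimal sketch, assuming the same data as in the question (the DataFrame is rebuilt by hand here so the snippet runs on its own; mask and months are illustrative names):

```python
import pandas as pd

df4 = pd.DataFrame({
    "contract": ["A00118", "A00118", "A00253", "A00253"],
    "RB": [46, 47, 48, 48],
    "BeginDate": [20120705, 20121005, 0, 0],
})

# Extract the month only for rows with a real date.
mask = df4["BeginDate"] != 0
months = df4.loc[mask, "BeginDate"].astype(str).str[4:6].astype(int)

# Assignment aligns on the index, so rows 2 and 3 receive NaN.
df4["first_month"] = months

# Replace the NaN values with 0 and go back to an integer column.
df4["first_month"] = df4["first_month"].fillna(0).astype(int)
print(df4)
```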

CodePudding user response:

Assuming the BeginDate column contains strings (in the question it is read as int, so it would first need converting with astype(str)), you can get the desired result with:

df['FirstMonth'] = df.apply(lambda row: 0 if row['BeginDate'] == '0' else int(row['BeginDate'][4:6]), axis = 1)

This yields:

  Contract  RB BeginDate  FirstMonth
0   A00118  46  20120705           7
1   A00118  47  20121005          10
2   A00253  48         0           0
3   A00253  48         0           0
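Since the question reads BeginDate as int, the string-based approach above needs a conversion first. The following is a sketch under that assumption; begin_text is a hypothetical helper name, and applying the function to the single column rather than row by row is a design choice made here, not something the answer prescribes.

```python
import pandas as pd

# Integer BeginDate values, as in the question.
df = pd.DataFrame({"BeginDate": [20120705, 20121005, 0, 0]})

# Convert to text so the '0' comparison and the [4:6] slice apply.
begin_text = df["BeginDate"].astype(str)

# Only slice and convert when the value is a real date.
df["FirstMonth"] = begin_text.apply(lambda s: 0 if s == "0" else int(s[4:6]))
print(df)
```

Unlike the np.where attempt, the conditional expression inside the lambda evaluates int(...) only for rows that are not '0', so the empty-string conversion never happens.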