I create a DataFrame, print info on it, append a row, and print info again. After the append, the dtype of every column changes to object. Why?
import numpy as np
import pandas as pd
myData = np.array([134.29, 136.97, 250.31, 312.28])
mySeries = pd.Series(myData,index=['IBM','P&G','Microsoft','Home Depot'], name="Stock Price")
myData1 = np.array(['120.573B', '336.72B', '1.885T' , '335.974B'])
mySeries1 = pd.Series(myData1, index=['IBM','P&G','Microsoft','Home Depot'], name="Market Cap")
myData2 = np.array([120_573_000_000, 336_720_000_000, 1_885_000_000_000 , 335_974_000_000])
mySeries2 = pd.Series(myData2, index=['IBM','P&G','Microsoft','Home Depot'], name="Market Cap Raw")
myDataFrame = pd.concat([mySeries, mySeries1, mySeries2], axis=1)
#print(myDataFrame)
print(myDataFrame.info())
# After adding the row below, the dtypes of the numeric columns change to object
myData = np.array([20.99, '100M', 100000000 ])
mySeries = pd.Series(myData, index = myDataFrame.columns, name = 'HML')
myDataFrame = myDataFrame.append(mySeries, ignore_index=False)
#print(myDataFrame)
print(myDataFrame.info())
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, IBM to Home Depot
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Stock Price     4 non-null      float64
 1   Market Cap      4 non-null      object
 2   Market Cap Raw  4 non-null      int64
dtypes: float64(1), int64(1), object(1)
memory usage: 128.0 bytes
None
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, IBM to HML
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Stock Price     5 non-null      object
 1   Market Cap      5 non-null      object
 2   Market Cap Raw  5 non-null      object
dtypes: object(3)
memory usage: 160.0 bytes
None
CodePudding user response:
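A NumPy array holds a single dtype, so when you build one from values of incompatible types (a float, a string and an int), NumPy converts every element to a string. A Series created from that array of strings gets dtype object.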
When you create myData and mySeries the second time, that's exactly what's happening:
>>> myData = np.array([20.99, '100M', 100000000 ])
>>> mySeries = pd.Series(myData, index = myDataFrame.columns, name = 'HML')
>>> mySeries.dtype
dtype('O')
Right after that, you append that Series (of dtype object) to the DataFrame. Because the new row holds strings, the float64 and int64 columns can no longer keep their numeric dtypes, so they are converted to the more general object dtype.
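The following is a minimal sketch of the explanation above, not the asker's exact code. It first shows the string coercion in NumPy, then shows one way to add the row without losing the numeric dtypes: build the new row as a one-row DataFrame, so each value keeps its own type, and join it with pd.concat. The names mixed, df, new_row and result are made up for illustration. pd.concat is used here because DataFrame.append was removed in pandas 2.0.

```python
import numpy as np
import pandas as pd

# A NumPy array holds one dtype, so mixed values are all turned into strings.
mixed = np.array([20.99, '100M', 100000000])
print(mixed.dtype.kind)   # 'U' marks a Unicode string array
print(mixed.tolist())

# Rebuild a frame like the one in the question.
names = ['IBM', 'P&G', 'Microsoft', 'Home Depot']
df = pd.DataFrame(
    {
        'Stock Price': [134.29, 136.97, 250.31, 312.28],
        'Market Cap': ['120.573B', '336.72B', '1.885T', '335.974B'],
        'Market Cap Raw': [120_573_000_000, 336_720_000_000,
                           1_885_000_000_000, 335_974_000_000],
    },
    index=names,
)

# Build the new row as a one-row DataFrame so each value keeps its own type
# (float, string, int) instead of passing through a single NumPy array.
new_row = pd.DataFrame(
    {'Stock Price': [20.99], 'Market Cap': ['100M'], 'Market Cap Raw': [100_000_000]},
    index=['HML'],
)

# Concatenate row-wise; the numeric columns stay numeric.
result = pd.concat([df, new_row])
print(result.dtypes)
```

With this approach the numeric columns are never forced to hold strings, so no conversion is needed afterwards.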
CodePudding user response:
I figured out how to fix it:
tmpSeries = pd.to_numeric(myDataFrame['Stock Price'])
myDataFrame['Stock Price'] = tmpSeries
This changes the column from object back to float64. pd.to_numeric can convert the other numeric columns the same way, and its downcast parameter can be used to request a smaller numeric type.
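As a rough sketch of applying the same fix to more than one column, the loop below runs pd.to_numeric over both columns that should be numeric. The small two-row frame is a stand-in for the DataFrame after the append, assuming its columns hold a mix of numbers and numeric strings. The names df and numeric_cols are illustrative only.

```python
import pandas as pd

# Stand-in for the frame after the append: object columns that mix
# real numbers with numeric strings from the appended row.
df = pd.DataFrame(
    {
        'Stock Price': [134.29, '20.99'],
        'Market Cap': ['120.573B', '100M'],
        'Market Cap Raw': [120_573_000_000, '100000000'],
    },
    index=['IBM', 'HML'],
    dtype=object,
)

# Convert only the columns that are meant to be numeric;
# 'Market Cap' holds labels such as '100M' and is left as it is.
numeric_cols = ['Stock Price', 'Market Cap Raw']
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col])

print(df.dtypes)
```

Text columns such as 'Market Cap' are skipped on purpose, because pd.to_numeric raises an error by default when a value like '100M' cannot be parsed.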
