Stuck with an error when trying to replace missing data

Time:02-06

I need help with my mini-project. I have to build a prediction model using a dataset from Kaggle, and I am stuck on an error when I try to replace the missing data in the 'value' column. The values seem to be treated as strings because they contain dots between the digits. Editing the column by hand is not an option, since it has more than 49,000 rows. How can I resolve this problem?
Here's the code and the error:

x['value'].replace(' ',np.NaN).astype(np.float)

ValueError: could not convert string to float: '154.619.063'

The dataset: Multinationals by industrial sector, from Kaggle. Thank you so much for your help.

CodePudding user response:

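A value like '154.619.063' has two dots, so they are most likely thousands separators rather than decimal points. Remove them before converting. Pass regex=False so that '.' is treated as a literal dot and not as the regex wildcard, and use np.nan and the built-in float, since np.NaN and np.float have been removed from recent NumPy releases:

x['value'].str.replace('.', '', regex=False).replace(' ', np.nan).astype(float)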
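If it helps, here is a small self-contained sketch of the same idea that you can run without the Kaggle file. The function name parse_dotted_numbers and the sample values are made up for illustration, and it assumes every dot in the column is a thousands separator (a value such as '1.250' is read as 1250, not 1.25), so check that against your data first.

```python
import pandas as pd


def parse_dotted_numbers(values: pd.Series) -> pd.Series:
    """Convert strings such as '154.619.063' to floats.

    Assumption: '.' is only ever a thousands separator in this column.
    Blank or otherwise unparseable entries become NaN.
    """
    without_dots = values.str.replace(".", "", regex=False)  # literal dot, not a regex
    trimmed = without_dots.str.strip()                       # ' ' becomes ''
    # errors="coerce" turns anything that still is not a number into NaN
    return pd.to_numeric(trimmed, errors="coerce")


# Hypothetical sample, including the value from the error message and a blank
sample = pd.Series(["154.619.063", " ", "1.250", "87"])
print(parse_dotted_numbers(sample))
```

Note that errors="coerce" silently hides any other malformed entries, so it is worth counting the NaN values afterwards. pandas.read_csv also has a thousands parameter that may let you handle the separator while loading the file, but I have not tried it on this dataset.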

CodePudding user response:

import numpy as np 
import pandas as pd
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import r2_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
%pylab inline
import seaborn as sns
import pandas_profiling as pp
import plotly.graph_objs as go
from plotly.offline import iplot
import plotly.express as px
import tensorflow as tf
df = pd.read_csv("C:\\Users\\Souf win\\Downloads\\multinationals.csv", delimiter = ';')
def preprocessing(df):
    df = df.copy()
    # Drop identifier and metadata columns that are not used as features
    df = df.drop(['partner country', 'ind', 'var', 'declaring country', 'unit code', 'part', 'cou', 'year', 'year.1', 'unit', 'power_code code', 'power_code', 'reference period code', 'reference period'], axis=1)
    # Remove the '.' thousands separators, turn blanks into NaN, then convert to float
    df['value'] = df['value'].str.replace('.', '', regex=False).replace(' ', np.nan).astype(float)
    # Drop rows with a missing target only after the conversion, so blank entries are caught too
    df = df.dropna(subset=['value']).reset_index(drop=True)

    # One-hot encode the categorical columns
    for column in ['economic variable', 'industry']:
        dummies = pd.get_dummies(df[column], prefix=column)
        df = pd.concat([df, dummies], axis=1)
        df = df.drop(column, axis=1)
    # Split df into x and y
    y = df['value']
    x = df.drop('value', axis=1)
    # Train/test split
    x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, shuffle=True, random_state=1)
    # Scale x, fitting the scaler on the training data only
    scaler = StandardScaler()
    scaler.fit(x_train)
    x_train = pd.DataFrame(scaler.transform(x_train), index=x_train.index, columns=x_train.columns)
    x_test = pd.DataFrame(scaler.transform(x_test), index=x_test.index, columns=x_test.columns)

    return x_train, x_test, y_train, y_test

x_train, x_test, y_train, y_test = preprocessing(df)
x_train
y_train
x_train.shape
inputs = tf.keras.Input(shape=(86,))
x = tf.keras.layers.Dense(128, activation='relu')(inputs)
x = tf.keras.layers.Dense(128, activation='relu')(x)
outputs=tf.keras.layers.Dense(1, activation='linear')(x)

model = tf.keras.Model(inputs=inputs, outputs=outputs)

model.compile(
    optimizer='adam',
    loss='mse'
)

history=model.fit(
    x_train,
    y_train,
    validation_split=0.2,
    batch_size=32,
    epochs=100,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(
            monitor='val_loss',
            patience=3,
            restore_best_weights=True
        )
    ]
)