Good evening everyone. I am new to Python and I'm trying to learn by reproducing a model I have in Excel.
I need to replicate the "TREND" function to fit a small linear model between two extreme points, let's say
A = (1, 0.15), B = (5, 0.2)
and then predict at a given x value (let's say 4.2).
For the purpose of this code I need to fit a model for each row of my dataset. All x values are x_1=1 and x_2=5, while the y values are different in each row.
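For reference, the single-pair case above can be sketched with NumPy alone. This is only an illustration, assuming that TREND with two known points reduces to the straight line through them; np.polyfit is swapped in for the Excel function here, and the variable names are illustrative.

```python
import numpy as np

# Two known points A = (1, 0.15) and B = (5, 0.2) from the example above
known_x = np.array([1.0, 5.0])
known_y = np.array([0.15, 0.2])

# A degree-1 least-squares fit through two points is the line joining them
slope, intercept = np.polyfit(known_x, known_y, deg=1)

# Evaluate the fitted line at the new x value
new_x = 4.2
trend_value = slope * new_x + intercept
print(trend_value)  # approximately 0.19
```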
I tried using LinearRegression() and model.predict from the sklearn.linear_model package like this:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
data = {'New_x':[5, 2.1, 4.5, 3.0],
'X1':[1, 1, 1, 1],
'X2':[5, 5, 5, 5],
'Y1':[0.15, 0.7, 1.35, 0.2],
'Y2':[0.2, 0.85, 1.55, 0.4]}
df=pd.DataFrame(data,index=["1","2","3","4"])
model=LinearRegression().fit(df[["X1","X2"]],df[["Y1","Y2"]])
prediction=model.predict(df["New_x"].values.reshape(-1,1))
But I'm getting this error:
ValueError Traceback (most recent call last)
<ipython-input-88-da83cb57bf4a> in <module>()
18
19 model=LinearRegression().fit(df[["X1","X2"]],df[["Y1","Y2"]])
---> 20 prediction=model.predict(df["New_x"].values.reshape(-1,1))
21
22 #model = LinearRegression().fit(SEC_ERBA_sample[["Vertex1","Vertex2"]], SEC_ERBA_sample[["SENIOR_1Y","SENIOR_5Y"]])
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\linear_model\base.py in predict(self, X)
254 Returns predicted values.
255 """
--> 256 return self._decision_function(X)
257
258 _preprocess_data = staticmethod(_preprocess_data)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\linear_model\base.py in _decision_function(self, X)
239 X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
240 return safe_sparse_dot(X, self.coef_.T,
--> 241 dense_output=True) + self.intercept_
242
243 def predict(self, X):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\extmath.py in safe_sparse_dot(a, b, dense_output)
138 return ret
139 else:
--> 140 return np.dot(a, b)
141
142
ValueError: shapes (4,1) and (2,2) not aligned: 1 (dim 1) != 2 (dim 0)
So I presume that LinearRegression().fit is fitting a single model across all rows, treating the columns as features. Is there a way to fit and predict a linear regression for each row?
CodePudding user response:
I think this is a simple coding mistake, but it may be founded on a deeper conceptual problem, so I'll try to give you a broader answer.
The sklearn.base.BaseEstimator#fit method trains an ML model by associating a set of features X with a set of ground-truth values y. In your example, you are training two multi-variable regression models (one per target) to estimate the Y1 and Y2 variables, taking X1 and X2 into consideration:
model = LinearRegression().fit(df[["X1","X2"]], df[["Y1","Y2"]])
So the model learns to estimate these two variables taking two other variables into consideration.
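A quick way to see this is to inspect the shapes of the fitted parameters. The sketch below rebuilds the DataFrame from the question and assumes a reasonably recent scikit-learn; coef_ has one row per target and one column per feature, which is where the (2,2) in the error message comes from.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same columns as in the question (New_x is not needed for this check)
df = pd.DataFrame({
    "X1": [1, 1, 1, 1],
    "X2": [5, 5, 5, 5],
    "Y1": [0.15, 0.7, 1.35, 0.2],
    "Y2": [0.2, 0.85, 1.55, 0.4],
})

# Two feature columns and two target columns, one sample per row
model = LinearRegression().fit(df[["X1", "X2"]], df[["Y1", "Y2"]])

coef_shape = model.coef_.shape            # (n_targets, n_features)
intercept_shape = model.intercept_.shape  # (n_targets,)
print(coef_shape, intercept_shape)
```

Because the model was fitted on two features, predict() needs an input with two columns; a (4,1) array cannot be multiplied with the (2,2) coefficient matrix.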
During prediction, the model requires exactly the same variables (X1 and X2) to be able to predict the values of interest:
predictions = model.predict(df[["New_x1", "New_x2"]])
If the New_x2 information is not available during test (predict) time, then you either have to estimate it as well or remove it from training altogether.
A simple abstract example: if a model was trained to estimate your preferred t-shirt size from your height and weight, you need to know both height and weight during test (predict) time to obtain the correct size estimation.
CodePudding user response:
I found a solution using iterrows(). It is still incomplete, as I can't save the output, but I think I will open a separate, more focused question for that.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
data = {'New_x':[5, 2.1, 4.5, 3.0],
'X1':[1., 1, 1, 1],
'X2':[5., 5, 5, 5],
'Y1':[0.15, 0.7, 1.35, 0.2],
'Y2':[0.2, 0.85, 1.55, 0.4]}
df=pd.DataFrame(data,index=["1","2","3","4"])
This final piece runs the linear regression row by row. Using iterrows() is generally discouraged, since many operations can be done in other ways (including vectorization), but in this case I could not find an alternative solution for this problem:
for index, row in df.iterrows():
    # Fit a two-point line for this row; X must be 2D (n_samples, n_features)
    model = LinearRegression().fit(np.array([row["X1"], row["X2"]]).reshape(-1, 1),
                                   np.array([row["Y1"], row["Y2"]]))
    # predict() also expects a 2D array, so wrap the scalar New_x
    print(model.predict(np.array([[row["New_x"]]])))
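Since every row has exactly two points, the fitted line can also be written in closed form, which avoids the loop and makes it easy to store the result. The following is a sketch of that alternative rather than the approach used above; the column name Prediction is a hypothetical choice, and it assumes X1 and X2 differ in every row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "New_x": [5, 2.1, 4.5, 3.0],
    "X1": [1., 1, 1, 1],
    "X2": [5., 5, 5, 5],
    "Y1": [0.15, 0.7, 1.35, 0.2],
    "Y2": [0.2, 0.85, 1.55, 0.4],
}, index=["1", "2", "3", "4"])

# Slope of the line through (X1, Y1) and (X2, Y2), computed for all rows at once
slope = (df["Y2"] - df["Y1"]) / (df["X2"] - df["X1"])

# Evaluate each row's line at that row's New_x and keep it as a new column
# ("Prediction" is a hypothetical column name)
df["Prediction"] = df["Y1"] + slope * (df["New_x"] - df["X1"])
print(df["Prediction"])
```

With only two points per row, this should agree with a per-row LinearRegression fit up to floating-point error.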
