I was trying to find the k smallest distances in a list while implementing my kNN model. I did it in the way that felt intuitive to me; the code was roughly as follows: `
first_k = X_train['distance'].sort_values().head(k)
prediction = first_k.value_counts().idxmax()
` Here, first_k contains the first k elements of the sorted values of the distance column, and prediction is the value the model finally returns.
Another approach I found on the internet was this `
prediction = y_train[X_train["distance"].nsmallest(n=k).index].mode()[0]
` The second approach yields the correct results, while my approach did not work as intended. Can someone explain why my approach did not work?
CodePudding user response:
The difference is in the usage of .index after the method nsmallest(n=k) in the alternative approach. What you are doing in your code is the following:
- Sort the `distance` column of X, then take the first k elements of the sorted values
- Count how often each distance value occurs and take the first occurrence of the most frequent distance
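To make those two steps concrete, here is a minimal sketch on made-up data (the distances and the value of `k` are assumptions for illustration, not the asker's data). It shows that the result is a distance value rather than a class label:

```python
import pandas as pd

# Hypothetical distances from a query point to five training rows.
X_train = pd.DataFrame({"distance": [0.5, 0.1, 0.9, 0.3, 0.7]})
k = 3

# The k smallest distances, still carrying their original row labels (1, 3, 0).
first_k = X_train["distance"].sort_values().head(k)

# value_counts() counts how often each *distance* occurs, so idxmax()
# returns one of the distance values, never a label from y_train.
prediction = first_k.value_counts().idxmax()
print(type(prediction), prediction)
```

Since every distance here is unique, each one has a count of 1, and the "prediction" is simply one of the three distances.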
The alternative approach instead does the following steps:
- Recover the k smallest elements in the `distance` column
- Get the corresponding index values of the rows recovered in the previous step (for example, with `k=5` it could be an object that, when printed, shows something similar to `Int64Index([3, 9, 10, 1, 8], dtype='int64')`)
- Recover from `y` the labels with the same index values as the ones recovered in the previous step
- Get the most frequent label among those recovered from `y` (the `mode`)
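The same steps can be written out one at a time. This is only a sketch: the data, the labels "a"/"b", and the intermediate variable names are assumptions for illustration, and it relies on `X_train` and `y_train` sharing the same row index:

```python
import pandas as pd

# Hypothetical training data: distances to the query point and class labels,
# aligned on the same default row index (0..4).
X_train = pd.DataFrame({"distance": [0.5, 0.1, 0.9, 0.3, 0.7]})
y_train = pd.Series(["a", "b", "a", "b", "a"])
k = 3

nearest = X_train["distance"].nsmallest(n=k)  # step 1: k smallest distances
neighbour_idx = nearest.index                 # step 2: row labels of those rows
neighbour_labels = y_train[neighbour_idx]     # step 3: class labels at those rows
prediction = neighbour_labels.mode()[0]       # step 4: most frequent class label

print(list(neighbour_idx), prediction)
```

The key point is step 2: the index is what links a row's distance back to its label in `y_train`.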
So, as you can see, the main difference is that your code returns the most frequent distance value, which is not a class label at all, whereas the alternative returns the most frequent class among the k neighbours that you have recovered.
Anyway, your code can be easily fixed:
first_k = X_train['distance'].sort_values().head(k).index
prediction = y_train[first_k].mode()[0]
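For reuse, the fix above could be wrapped in a small helper. The function name `predict_label`, its parameters, and the sample data are hypothetical; the sketch uses `.loc` to make the label-based lookup explicit and a non-default index to show that the lookup goes by row label, not by position:

```python
import pandas as pd

def predict_label(distances: pd.Series, labels: pd.Series, k: int):
    """Return the most frequent label among the k rows with the smallest distance.

    Assumes `distances` and `labels` share the same index.
    """
    first_k = distances.sort_values().head(k).index  # row labels of the k nearest
    return labels.loc[first_k].mode()[0]             # most frequent class among them

# Hypothetical data with a non-default index.
row_ids = [10, 11, 12, 13, 14]
distances = pd.Series([2.0, 0.4, 1.5, 0.2, 0.9], index=row_ids)
labels = pd.Series(["x", "x", "x", "y", "y"], index=row_ids)

prediction = predict_label(distances, labels, k=3)
print(prediction)
```

Note that `mode()` returns every tied value when several labels are equally frequent, so `[0]` just picks the first of them; a real kNN implementation may want an explicit tie-breaking rule.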
