I'm trying to brush up on my understanding of some basic ML method. Can someone explain what is going on under the hood with kneighbors_graph? I would like to replicate this output with only NumPy.
from sklearn.neighbors import kneighbors_graph
X = [[0, 1], [3, 4], [7, 8]]
A = kneighbors_graph(X, 2, mode='distance', include_self=True)
A.toarray()
Output:
array([[0. , 4.24264069, 0. ],
[4.24264069, 0. , 0. ],
[0. , 5.65685425, 0. ]])
CodePudding user response:
The resulting matrix represents the distance-weighted graph of n = 2 neighbours for each point in X, where you are including a point as its own neighbour (with a distance of zero). Note that distances to non-neighbours are also zero, so you might want to check the connectivity graph to know if you're looking at a zero-distance neighbour or a non-neighbour.
Let's start with the first row, representing the first point, [0, 1]. Here's what the numbers in that row mean:
- The first
0is the distance to the nearest point, which is itself (because you specifiedinclude_self=True). If you specifiedmode='connectivity'this would be a 1, because it's a neighbour. - The second element,
4.24, is the Euclidean distance (aka L2 norm) to the next point inX, which is[3, 4]. You get this distance becausemetric='minkowski', p=2are defaults; if you want a different distance metric, you can have it. Again, if you specifiedmode='connectivity'this would also be a 1, because it's a neighbour. - The third element, another
0, is not really a distance; it's telling you that the third point,[7, 8], is not a neighbour whenn_neighborsis 2. If you specifiedmode='connectivity'this would be a 0, because it's not a neighbour.
You can compute the distances between all pairs of points in an array with scipy.spatial.distance.cdist(X, X). There's also scipy.spatial.KDTree for neighbour lookup. If you really want to go pure NumPy, check out the linalg module.
