KNN algorithm implemented in Python

Question

This is the first time I have tried to write code in Python. I think it gives correct answers, but it probably needs some "vectorization".

import numpy as np
import math
import operator

data = np.genfromtxt("KNNdata.csv", delimiter = ',', skip_header = 1)
data = data[:,2:]

np.random.shuffle(data)
X = data[:, range(5)]
Y = data[:, 5]

def distance(instance1, instance2):
    dist = 0.0
    for i in range(len(instance1)):
        dist += pow((instance1[i] - instance2[i]), 2)
    return math.sqrt(dist)

# Calculating distances between all data, return sorted  k-elements list (whole element and output)
def getNeighbors(trainingSetX, trainingSetY, testInstance, k):
    distances = []
    for i in range(len(trainingSetX)):
        dist = distance(testInstance, trainingSetX[i])
        distances.append((trainingSetX[i], dist, trainingSetY[i]))
    distances.sort(key=operator.itemgetter(1))
    neighbour = []
    for elem in range(k):
        neighbour.append((distances[elem][0], distances[elem][2]))
    return neighbour

#return answer
def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = int(neighbors[x][-1])
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse = True)
    return sortedVotes[0][0]

#return accuracy, your predictions and actual values
def getAccuracy(testSetY, predictions):
    correct = 0
    for x in range(len(predictions)):
        if testSetY[x] == predictions[x]:
            correct += 1
    return (correct / (len(predictions))) * 100.0

def start():
    trainingSetX = X[:2000]
    trainingSetY = Y[:2000]
    testSetX = X[2000:]
    testSetY = Y[2000:]

    # generate predictions
    predictions = []
    k = 4
    for x in range(len(testSetX)):
        neighbors = getNeighbors(trainingSetX, trainingSetY, testSetX[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
    accuracy = getAccuracy(testSetY, predictions)
    print('Accuracy: ' + str(accuracy))

start()

Graipher · Accepted Answer · 2017-02-07 12:58:00Z

First, a style nitpick: Python has an official style guide, PEP 8, which recommends using lower_case_with_underscores for variable and function names instead of camelCase.

Second, the comments you have above your functions should become docstrings. These appear, for example, when calling help(your_function) in an interactive session. Just have a string as the first statement below the function header, like so:

def f(a, b):
    """Returns the sum of `a` and `b`"""
    return a + b

It is recommended to always use triple double-quotes (i.e. """).
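
To make this concrete, here is a sketch of how the comment above one of the question's functions could become a docstring. The snake_case name and the wording of the docstring are my own assumptions, not something taken from the question:

```python
def get_accuracy(test_y, predictions):
    """Return the percentage of predictions equal to the actual values."""
    correct = sum(y == p for y, p in zip(test_y, predictions))
    return correct * 100.0 / len(predictions)


# The docstring is stored on the function object itself;
# help(get_accuracy) displays the same text in an interactive session.
doc = get_accuracy.__doc__
```

Unlike a comment, the string travels with the function, so tools and interactive sessions can show it.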

Now I am going to focus on the distance calculation.

First, you can greatly simplify your getNeighbors function using a generator expression and a list comprehension:

def getNeighbors(trainingSetX, trainingSetY, testInstance, k):
    distances = sorted((distance(testInstance, x), x, y)
                       for x, y in zip(trainingSetX, trainingSetY))
    return [(d[1], d[2]) for d in distances[:k]]

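Here I used the fact that tuples already sort naturally, by first comparing the first element, then (if they are equal) the second, and so on. So I put the distance as the first element of the tuple and you don't need the key function any longer. One caveat: if two distances are exactly equal, the comparison moves on to the numpy arrays and raises a ValueError, in which case passing key=operator.itemgetter(0) to sorted avoids the problem. sorted can take a generator expression and sort it directly. We can also iterate over multiple iterables at the same time using zip.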
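
As a small illustration of the tuple ordering and zip behaviour, here is a sketch with made-up data. It uses plain tuples instead of numpy arrays, and the names points, labels, query and squared_distance are hypothetical:

```python
# Made-up sample data; none of these values come from the question.
points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
labels = [0, 1, 0]
query = (0.0, 0.5)


def squared_distance(a, b):
    """Return the squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


# zip pairs each point with its label; sorted consumes the generator
# expression and orders the tuples by their first element, the distance.
ranked = sorted((squared_distance(query, p), p, y)
                for p, y in zip(points, labels))

# Keep only the point and label of the two closest entries.
nearest_two = [(p, y) for _, p, y in ranked[:2]]
```

Because the distance comes first in each tuple, no key function is needed to get the closest points at the front of the list.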

Since your variables are all numpy arrays, you could also do it more vectorized. For this I would first re-define the distance function to use numpy functions:

def distance(x, y):
    return np.sqrt(((x - y)**2).sum())

And then put the distances into a numpy array as well. Picking the k nearest neighbors then becomes easier with np.argsort and array indexing:

def getNeighbors(trainingSetX, trainingSetY, testInstance, k):
    distances = np.array([distance(testInstance, x) for x in trainingSetX])
    nearest = np.argsort(distances)[:k]
    return list(zip(trainingSetX[nearest], trainingSetY[nearest]))

This can probably be modified even further by trying to make the distance call vectorized as well.
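
One possible way to vectorize the distance computation as well is to let numpy broadcasting subtract the test instance from every training row at once. This is only a sketch: the function name get_neighbors_vectorized and the sample data are assumptions, and it returns two arrays (features and labels) rather than the list of pairs used elsewhere in this answer.

```python
import numpy as np


def get_neighbors_vectorized(training_x, training_y, test_instance, k):
    """Return the features and labels of the k nearest training rows."""
    # Broadcasting subtracts test_instance from every row at once.
    diffs = training_x - test_instance
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    # argsort gives the row indices ordered by increasing distance.
    nearest = np.argsort(distances)[:k]
    return training_x[nearest], training_y[nearest]


# Made-up data for illustration only.
train_x = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0], [10.0, 10.0]])
train_y = np.array([0.0, 1.0, 0.0, 1.0])
neighbor_x, neighbor_y = get_neighbors_vectorized(
    train_x, train_y, np.array([0.0, 0.5]), k=2)
```

This replaces the Python-level loop over training rows with a handful of array operations, which is usually where most of the speed-up comes from.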

Your function getResponse can be simplified using the collections.Counter class, which implements exactly what you do here:

from collections import Counter

def getResponse(neighbors):
    classVotes = Counter(int(neighbor[-1]) for neighbor in neighbors)
    return classVotes.most_common(1)[0][0]
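
A quick way to see how Counter tallies the votes is to feed it a few hand-written neighbors. The data below is made up, and reading off the winner with Counter.most_common is just one option:

```python
from collections import Counter

# Hypothetical neighbours in (features, label) form.
neighbors = [((1.0, 2.0), 1.0), ((0.5, 2.5), 0.0), ((1.5, 1.5), 1.0)]

# Count how often each integer class label occurs among the neighbours.
class_votes = Counter(int(label) for _, label in neighbors)

# most_common(1) returns a one-element list of (label, count) pairs.
winner, votes = class_votes.most_common(1)[0]
```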

Your function getAccuracy can be slightly simplified using a generator expression and sum:

def getAccuracy(testSetY, predictions):
    correct = sum(y == p for y, p in zip(testSetY, predictions))
    return correct * 100.0 / len(predictions)
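
Since the question asks about vectorization, the same calculation could also be sketched with numpy. The name get_accuracy_vectorized is hypothetical, and the sketch assumes predictions is a list or array (not a generator) of the same length as the actual values:

```python
import numpy as np


def get_accuracy_vectorized(test_y, predictions):
    """Return the percentage of predictions that match test_y."""
    # Element-wise comparison yields booleans; their mean is the hit rate.
    matches = np.asarray(test_y) == np.asarray(predictions)
    return matches.mean() * 100.0


# Made-up values: three of the four predictions are correct.
accuracy = get_accuracy_vectorized([1.0, 0.0, 1.0, 1.0], [1, 0, 0, 1])
```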

And lastly, in your start function, you can directly iterate over the elements of testSetX, build the predictions with a list comprehension (a list rather than a generator, because getAccuracy calls len on it) and use the fact that print can take multiple arguments:

def start():
    trainingSetX = X[:2000]
    trainingSetY = Y[:2000]
    testSetX = X[2000:]
    testSetY = Y[2000:]

    # generate predictions
    k = 4
    predictions = [getResponse(getNeighbors(trainingSetX, trainingSetY, x, k))
                   for x in testSetX]
    accuracy = getAccuracy(testSetY, predictions)
    print('Accuracy:', accuracy)
