What is the KNN Algorithm?

The k-nearest neighbors (KNN) algorithm is a simple, supervised machine learning algorithm that can be used to solve both classification and regression problems.

The algorithm is based on the principle that similar data points (i.e., data points that are close together in feature space) tend to have similar labels. KNN therefore predicts the label of a new data point from the labels of its k nearest neighbors in the training data: the most common class among them for classification, or the average of their values for regression.
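To make the idea concrete, here is a minimal sketch of a single prediction. The toy data, the variable names, and the choice of k = 3 are assumptions made up for illustration.

```python
import numpy as np

# Hypothetical toy data: two features per point, class labels 0 or 1.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
new_point = np.array([2.0, 2.0])

# Euclidean distance from the new point to every training point.
distances = np.linalg.norm(X_train - new_point, axis=1)

# Indices of the k closest training points.
k = 3
nearest = np.argsort(distances)[:k]

# Majority vote over the labels of those neighbors.
predicted_label = int(np.bincount(y_train[nearest]).argmax())
print(predicted_label)
```

All three nearest neighbors of the new point belong to class 0, so the vote returns 0.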

Features of the KNN algorithm:

  • KNN is a simple, easy-to-understand algorithm.
  • KNN can be used for both classification and regression problems.
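  • KNN is a versatile algorithm and can be applied to data with any number of features, although distances become less informative as the number of dimensions grows.
  • KNN is a non-parametric algorithm, which means it does not make any assumptions about the underlying distribution of the data.
  • KNN is an instance-based learning algorithm, which means it stores the training examples and compares new points to them directly instead of building an explicit model.
  • KNN is a lazy algorithm, which means it has no separate training phase; all of the computation is deferred until a prediction is requested.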
  • KNN is a distance-based algorithm, which means it uses a distance metric to calculate the distance between two data points.
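As a rough illustration of the distance-based point above, the sketch below compares two common metrics on the same pair of points. The function names `euclidean` and `manhattan` are chosen here for illustration; the article itself does not prescribe a metric.

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences.
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a, b):
    # City-block distance: sum of the absolute differences.
    return float(np.sum(np.abs(a - b)))

p = np.array([0.0, 0.0])
q = np.array([3.0, 4.0])

d_euclid = euclidean(p, q)  # 5.0
d_manh = manhattan(p, q)    # 7.0
print(d_euclid, d_manh)
```

Different metrics can rank the same neighbors differently, so the choice of metric is part of the model.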

Examples of the KNN algorithm:

1. KNN can be used for classification, regression, and outlier detection.

import numpy as np

def knn(k, X_train, y_train, X_test):
    # X_test is a single query point; labels must be non-negative integers
    # calculate the distance between the test point and all the training points
    distances = []
    for i in range(len(X_train)):
        dist = np.linalg.norm(X_test - X_train[i])
        distances.append((dist, y_train[i]))
    # sort by distance, smallest first
    distances.sort(key=lambda x: x[0])
    # get the labels of the k nearest neighbors
    neighbors = [label for _, label in distances[:k]]
    # get the most common class among the neighbors
    counts = np.bincount(neighbors)
    return np.argmax(counts)

2. It can be used to predict the creditworthiness of customers from financial data.

import numpy as np

def euclidean_distance(a, b):
    # straight-line distance between two feature vectors
    return np.linalg.norm(np.asarray(a) - np.asarray(b))

def knn(k, X_train, y_train, X_test):
    # X_test is a single query point
    # calculate the distance between the test point and all the training points
    distances = []
    for i in range(len(X_train)):
        dist = euclidean_distance(X_test, X_train[i])
        distances.append((X_train[i], y_train[i], dist))
    # sort by distance, smallest first
    distances = sorted(distances, key=lambda x: x[2])
    # get the labels of the k nearest neighbors
    neighbors = [label for _, label, _ in distances[:k]]
    # get the most common class among the neighbors
    output = max(set(neighbors), key=neighbors.count)
    return output

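3. It can be used to assign customers to known segments from demographic data. The function below returns the labels of the k nearest neighbors rather than a single prediction.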

import numpy as np

def knn(k, X_train, y_train, X_test):
    # X_test is a single query point
    # create lists for distances and targets
    distances = []
    targets = []
    # loop over rows in X_train
    for ix in range(len(X_train)):
        # get row from X_train
        row = X_train[ix, :]
        # compute the Euclidean distance between the row and the query point
        distance = np.sqrt(np.sum((row - X_test)**2))
        # add the distance and target to the lists
        distances.append(distance)
        targets.append(y_train[ix])
    # get the indices that sort the distances in ascending order
    indices = np.argsort(distances)
    # collect the targets of the k nearest neighbors
    KNN_target = [targets[i] for i in indices[:k]]
    # return the labels of the k nearest neighbors
    return np.array(KNN_target)

4. It can be used to detect fraudulent activities in transaction data.

import numpy as np

def knn(k, X_train, y_train, X_test):
    # X_test is a single query point
    # calculate the Euclidean distance between the test point and each training point
    distances = []
    for i in range(len(X_train)):
        dist = np.linalg.norm(np.asarray(X_test) - np.asarray(X_train[i]))
        distances.append((X_train[i], y_train[i], dist))
    # sort by distance, smallest first
    distances = sorted(distances, key=lambda x: x[2])
    # select the labels of the k nearest neighbors
    neighbors = [label for _, label, _ in distances[:k]]
    # return the most common class among the neighbors
    return max(set(neighbors), key=neighbors.count)
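The functions above all deal with class labels. For the regression case mentioned earlier, one common approach is to average the target values of the k nearest neighbors. The following is a minimal sketch under that assumption; `knn_regress` and the toy data are hypothetical names and values used only for illustration.

```python
import numpy as np

def knn_regress(k, X_train, y_train, x_query):
    # Distance from the query point to every training point.
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Predict the mean target value of those neighbors.
    return float(np.mean(y_train[nearest]))

# Hypothetical one-feature data set with numeric targets.
X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([10.0, 20.0, 30.0, 100.0])

prediction = knn_regress(2, X_train, y_train, np.array([2.4]))
print(prediction)
```

With k = 2, the two closest points to 2.4 are 2.0 and 3.0, so the prediction is the mean of their targets, 25.0.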

Conclusion

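The KNN algorithm is relatively simple and can be applied to a wide variety of data sets. However, its predictions depend heavily on the choice of k and on how the data is preprocessed: a very small k tends to overfit to noise, and features measured on very different scales can dominate the distance calculation unless they are normalized first.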
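One preprocessing step that often matters for distance-based methods is feature scaling. The sketch below uses made-up income and age values to show how the nearest neighbor of a query point can change once each feature is standardized. The data and the helper `nearest_index` are assumptions for illustration only.

```python
import numpy as np

# Hypothetical customers: [annual income, age].
X_train = np.array([[50000.0, 25.0],
                    [52000.0, 60.0]])
query = np.array([51500.0, 27.0])

def nearest_index(X, q):
    # Index of the training row closest to q (Euclidean distance).
    return int(np.argmin(np.linalg.norm(X - q, axis=1)))

# On raw values, income dominates the distance because of its larger scale.
raw_nearest = nearest_index(X_train, query)

# Standardize each feature using the training mean and standard deviation.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_scaled = (X_train - mean) / std
query_scaled = (query - mean) / std
scaled_nearest = nearest_index(X_scaled, query_scaled)

print(raw_nearest, scaled_nearest)
```

On the raw values the query is closest to the second customer (similar income, very different age). After scaling it is closest to the first customer, whose age is nearly the same.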
