
What is K-Means Clustering?

k-means clustering is an unsupervised machine learning algorithm that groups data points into k clusters, assigning each point to the cluster whose mean (centroid) is nearest.

What are the uses of k-means clustering?

There are a few common uses for k-means clustering, including customer segmentation, image compression (color quantization), document clustering, and anomaly detection.
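A minimal sketch of what such a grouping looks like in practice, using scikit-learn's KMeans on a tiny made-up dataset (the library, the data, and the choice of k=2 are illustrative assumptions, not part of the original examples):

import numpy as np
from sklearn.cluster import KMeans

# Six 2-D points forming two loose groups
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.5, 7.5], [7.8, 8.3]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # cluster index assigned to each point
print(labels)                    # e.g. [0 0 0 1 1 1] (label order may vary)
print(kmeans.cluster_centers_)   # the two centroids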

Features of k-means clustering

k-means is simple to implement, converges quickly in practice, and scales to large datasets. Its main limitations are that the number of clusters k must be chosen in advance, the result depends on the initial centroids, and it works best when the clusters are roughly spherical and similarly sized.
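To illustrate the sensitivity to initialization, the sketch below (again assuming scikit-learn, with an illustrative dataset) runs KMeans with a single random initialization under different seeds and prints the resulting inertia (within-cluster sum of squared distances), which may differ from run to run:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three blobs of 50 points each
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ((0, 0), (5, 5), (0, 5))])

# A single random initialization (n_init=1) can land in different local optima
for seed in range(3):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    print(seed, round(km.inertia_, 2))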

Examples of k-means clustering

K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining.
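Because k-means is, at heart, vector quantization, the same idea can be sketched with SciPy's scipy.cluster.vq module, whose kmeans and vq functions build a codebook and map each sample to its nearest code word (SciPy availability and the toy signal are assumptions for illustration); the full k-means procedure can also be written out by hand, as in the example that follows.

import numpy as np
from scipy.cluster.vq import kmeans, vq

# A noisy 1-D "signal" whose samples hover around three levels
rng = np.random.default_rng(0)
signal = np.concatenate([rng.normal(level, 0.1, 100) for level in (0.0, 1.0, 3.0)])
samples = signal.reshape(-1, 1)           # vq expects 2-D observations

codebook, _ = kmeans(samples, 3)          # learn 3 code words (the centroids)
codes, _ = vq(samples, codebook)          # index of the nearest code word per sample
quantized = codebook[codes, 0]            # the quantized (compressed) signal
print(codebook.ravel())                   # roughly the three underlying levels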

import numpy as np


def k_mean_clustering(X, k, iterations=1000):
    """
    X: numpy.ndarray of shape (n, d) containing the dataset
       (n data points, each with d dimensions)
    k: number of clusters
    iterations: number of iterations to execute the algorithm
    Returns: numpy.ndarray of shape (k, d) containing the centroids
    """
    if not isinstance(X, np.ndarray) or X.ndim != 2:
        raise TypeError("X must be a 2D numpy.ndarray")
    if not isinstance(k, int) or k <= 0:
        raise ValueError("k must be a positive integer")
    if not isinstance(iterations, int) or iterations <= 0:
        raise ValueError("iterations must be a positive integer")

    n, d = X.shape
    # Initialize the centroids with k data points chosen at random (without replacement)
    centroids = X[np.random.choice(n, k, replace=False)].astype(float)

    for _ in range(iterations):
        # Distance from every centroid to every point: shape (k, n)
        distances = np.linalg.norm(X[None, :, :] - centroids[:, None, :], axis=2)
        # Assign each point to its nearest centroid
        labels = np.argmin(distances, axis=0)
        # Move each centroid to the mean of its assigned points;
        # keep the old centroid if a cluster received no points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)

    return centroids
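A quick, hypothetical call to the function above on random data (the dataset and k=3 are only illustrative):

import numpy as np

np.random.seed(0)
data = np.random.rand(200, 2)            # 200 random 2-D points
centroids = k_mean_clustering(data, 3)   # array of shape (3, 2)
print(centroids)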

Or, as an alternative implementation:

import math
import random


def euclidean_distance(a, b):
    # Straight-line distance between two points given as coordinate tuples
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def compute_centroids(clusters, old_centroids):
    # Mean of each cluster; an empty cluster keeps its previous centroid
    centroids = []
    for j, cluster in enumerate(clusters):
        if cluster:
            centroids.append(tuple(sum(coords) / len(cluster) for coords in zip(*cluster)))
        else:
            centroids.append(old_centroids[j])
    return centroids


def converged(old_centroids, new_centroids, tolerance):
    # The centroids have converged when none of them moved more than the tolerance
    if old_centroids is None:
        return False
    return all(euclidean_distance(o, n) <= tolerance
               for o, n in zip(old_centroids, new_centroids))


def k_means_clustering(X, k, tolerance=0.0001, max_iterations=500):
    # X: list of observations, each a tuple/list of coordinates
    # initialize the k cluster centroids with k distinct observations
    centroids = random.sample(X, k)
    old_centroids = None
    clusters = [[] for _ in range(k)]
    # loop until the centroids stop updating (or max_iterations is reached)
    for _ in range(max_iterations):
        # reset the clusters at the start of every iteration
        clusters = [[] for _ in range(k)]
        # assign each observation to the closest centroid
        for x in X:
            min_distance = float('inf')
            closest_cluster = None
            for j in range(k):
                # compute the distance between the observation and the centroid
                distance = euclidean_distance(x, centroids[j])
                if distance < min_distance:
                    min_distance = distance
                    closest_cluster = j
            clusters[closest_cluster].append(x)
        # recompute the centroids from the new assignment
        new_centroids = compute_centroids(clusters, centroids)
        # stop if the centroids have converged
        if converged(old_centroids, new_centroids, tolerance):
            centroids = new_centroids
            break
        # update the old centroids and use the new ones for the next pass
        old_centroids = new_centroids
        centroids = new_centroids
    # return the clusters and the final centroids
    return clusters, centroids
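A small, hypothetical call to this version, with the observations given as a plain list of coordinate tuples:

points = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2),
          (8.0, 8.0), (8.5, 7.5), (7.8, 8.3)]
clusters, centroids = k_means_clustering(points, k=2)
print(centroids)   # roughly the means of the two groups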

K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.

import math


def distance_between_points(a, b):
    # Euclidean distance between two points given as coordinate tuples
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def voronoi_cells(points, sites):
    """
    Assign each point to the Voronoi cell of its nearest site.
    points: list of coordinate tuples to classify
    sites: list of coordinate tuples (e.g. the k-means centroids)
    Returns: a list of cells, one per site, each containing the points
             that lie closer to that site than to any other site.
    """
    # create one (initially empty) cell per site
    cells = [[] for _ in sites]
    # for each point, find the closest site
    for point in points:
        closest = None
        closest_distance = float('inf')
        for j, site in enumerate(sites):
            # calculate the distance between the point and the current site
            distance = distance_between_points(point, site)
            # keep the site with the smallest distance so far
            if distance < closest_distance:
                closest = j
                closest_distance = distance
        # the point belongs to the Voronoi cell of its closest site
        cells[closest].append(point)
    # return the list of cells
    return cells
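A hypothetical way to combine the pieces above: run k-means to obtain centroids, then use those centroids as the sites whose Voronoi cells partition the data (both functions are the illustrative versions defined earlier on this page):

import numpy as np

data = np.random.rand(100, 2)
centroids = k_mean_clustering(data, 3)            # centroids from the first example
cells = voronoi_cells([tuple(p) for p in data],   # points as coordinate tuples
                      [tuple(c) for c in centroids])
print([len(cell) for cell in cells])              # number of points in each cell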