Livio / May 12, 2019 / Python

k-means clustering with Python

Today we will implement a simple class to perform k-means clustering with Python. Before continuing, it is worth stressing that the scikit-learn package already implements this algorithm, but in my opinion it is always worth implementing one on your own in order to grasp the concepts better.

What is k-means clustering?

k-means clustering aims to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean. Imagine you have the observations below and want to partition them, assigning each one to its appropriate cluster. How would you go about it? This is what we will try to put in place.

Implementing the class

The algorithm will work in the following way:

1. Start by assigning each point with x and y coordinates to a random cluster
2. Compute the centroids of each cluster
3. Re-assign each point to the closest cluster (based on the distance with the centroids)
4. Repeat steps 2 and 3 until no point is re-assigned to a different cluster

Our class will have the following methods:

__init__ method:

The argument passed to our __init__ method is k, which is the number of clusters we want to identify.

euclidean_distance:

In order to determine the distance between two points, we will use the Euclidean distance formula:

$$d(p_1, p_2) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

where p1 = (x1, y1) and p2 = (x2, y2) are points with x and y coordinates, for example p1 = [10, 15], p2 = [2, 3].
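
As a quick illustration, here is a minimal sketch of the distance computation as a standalone function, applied to the two example points. The standalone form and the variable name `distance` are assumptions made for this example; in the class it would be a method.

```python
import math


def euclidean_distance(p1, p2):
    """Return the straight-line distance between two points given as [x, y]."""
    return math.sqrt((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2)


# The example points from the text
p1 = [10, 15]
p2 = [2, 3]
distance = euclidean_distance(p1, p2)
print(round(distance, 2))  # sqrt(8**2 + 12**2) = sqrt(208), about 14.42
```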

_random_assign_points:

This method performs the first step of the algorithm, which is to assign each point p with x and y coordinates to a random cluster.
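
One way this step could look, written as a standalone function for illustration. The `seed` parameter is an assumption added here so the result can be reproduced, and clusters are simply numbered 0 to k - 1.

```python
import random


def random_assign_points(points, k, seed=None):
    """Return one random cluster label (0 .. k-1) per point."""
    rng = random.Random(seed)  # assumption: optional seed for reproducibility
    return [rng.randrange(k) for _ in points]


points = [[1, 2], [3, 4], [5, 6], [7, 8]]
labels = random_assign_points(points, k=3, seed=42)
print(labels)  # one label per point, each between 0 and 2
```

Note that a purely random assignment can leave a cluster without any points, especially when there are few points, so the later steps need to cope with empty clusters.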

_calculate_centroids:

This method performs step 2, which is to calculate the centroid of each cluster.
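
The centroid of a cluster is the mean of the x coordinates and the mean of the y coordinates of its points. A minimal sketch, assuming labels are stored as a list parallel to the points and that empty clusters are simply skipped (both are choices made for this example):

```python
def calculate_centroids(points, labels, k):
    """Return {cluster: [mean_x, mean_y]} for every non-empty cluster."""
    centroids = {}
    for cluster in range(k):
        members = [p for p, label in zip(points, labels) if label == cluster]
        if not members:
            continue  # assumption: an empty cluster gets no centroid
        centroids[cluster] = [
            sum(p[0] for p in members) / len(members),
            sum(p[1] for p in members) / len(members),
        ]
    return centroids


points = [[1, 1], [3, 1], [10, 10], [12, 14]]
labels = [0, 0, 1, 1]
centroids = calculate_centroids(points, labels, k=2)
print(centroids)  # {0: [2.0, 1.0], 1: [11.0, 12.0]}
```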

_assign_points:

This method performs step 3, which is to re-assign each point to the closest cluster, based on its distance to each centroid.
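
A sketch of the re-assignment step, again as a standalone function. It assumes the centroids are held in a dictionary mapping a cluster label to its [x, y] centroid, which is a representation chosen for this example.

```python
import math


def assign_points(points, centroids):
    """Return, for each point, the label of the nearest centroid."""
    labels = []
    for p in points:
        nearest = min(
            centroids,
            key=lambda c: math.hypot(p[0] - centroids[c][0], p[1] - centroids[c][1]),
        )
        labels.append(nearest)
    return labels


points = [[1, 1], [3, 1], [10, 10], [12, 14]]
centroids = {0: [2.0, 1.0], 1: [11.0, 12.0]}
labels = assign_points(points, centroids)
print(labels)  # [0, 0, 1, 1]
```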

train:

This last method makes use of all the previous methods: it repeats steps 2 and 3 until the assignments stop changing, leaving each point in its appropriate cluster.

Entire class code:
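
Putting the pieces together, the class could look roughly like the sketch below. It follows the method names described above, but it is a reconstruction under assumptions rather than the original listing: the class name `KMeans`, the `seed` and `max_iter` parameters, and the `centroids` and `labels` attributes are all choices made for this example. The last lines show how it might be used.

```python
import math
import random


class KMeans:
    """Minimal k-means for 2-D points given as [x, y] lists."""

    def __init__(self, k, seed=None, max_iter=100):
        self.k = k                       # number of clusters to identify
        self.max_iter = max_iter         # assumption: safety cap on iterations
        self._rng = random.Random(seed)  # assumption: seed for reproducibility
        self.centroids = {}
        self.labels = []

    @staticmethod
    def euclidean_distance(p1, p2):
        return math.sqrt((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2)

    def _random_assign_points(self, points):
        # Step 1: give every point a random cluster label in 0 .. k-1
        return [self._rng.randrange(self.k) for _ in points]

    def _calculate_centroids(self, points, labels):
        # Step 2: mean x and mean y of each non-empty cluster
        centroids = {}
        for cluster in range(self.k):
            members = [p for p, label in zip(points, labels) if label == cluster]
            if members:
                centroids[cluster] = [
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                ]
        return centroids

    def _assign_points(self, points, centroids):
        # Step 3: label each point with its nearest centroid
        return [
            min(centroids, key=lambda c: self.euclidean_distance(p, centroids[c]))
            for p in points
        ]

    def train(self, points):
        labels = self._random_assign_points(points)
        centroids = {}
        for _ in range(self.max_iter):
            centroids = self._calculate_centroids(points, labels)
            new_labels = self._assign_points(points, centroids)
            if new_labels == labels:  # Step 4: stop when nothing moves
                break
            labels = new_labels
        self.centroids, self.labels = centroids, labels
        return labels


# Usage: two well-separated groups of three points each
points = [[1, 1], [2, 1], [1, 2], [10, 10], [11, 10], [10, 11]]
model = KMeans(k=2, seed=0)
labels = model.train(points)
print(labels, model.centroids)
```

Because the starting assignment is random, the label numbers themselves are arbitrary, and an unlucky start can leave a cluster empty. A common refinement, not covered here, is to run the training several times and keep the best result.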

Using the class is easy, for example:

With a simple implementation in Jupyter Lab, we can see visually how the clusters are formed at each iteration: