Homework 2: K-means clustering via scikit-learn
First, please accept the assignment on GitHub Classroom by clicking the link on Piazza to create the assignment repository. Please see Homework 1 instructions for how to clone your repository and submit. Remember to tag your submission with a “submission” tag and update the tag as necessary.
For each part below, we have provided skeleton code in part1.py, etc. We have also included one or two example outputs in the files part1.txt, etc. so that you can check the first few lines of your program output and verify the output format.
NOTE 1: The first part of your program output should EXACTLY match the example output files (i.e. character by character), including the output format. This is important so that we can automatically grade your submissions.
NOTE 2: Please make sure to save each part's output to the corresponding text file. For example, the first lines of part1.txt are:

2
[53 97]
3
[62 50 38]

Note that this is accomplished by merely calling

$ python part1.py > part1.txt
You may receive 0 points for a part if you do not follow these instructions since we are using automatic grading.
Part 1: Different number of clusters
We have provided code to load the classic Iris flowers dataset in part1.py.
For this part, run k-means via scikit-learn’s sklearn.cluster.KMeans class. Use all the default parameter settings except set random_state=0 so that everyone’s code produces the same output.
Vary the number of clusters (the n_clusters parameter of KMeans) from 2 to 10 inclusive and fit the estimator to the dataset. For each number of clusters,
- Print the number of clusters.
- Print the number of points in each cluster.
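The Part 1 loop above can be sketched as follows. This is only an illustration, not the provided skeleton: it loads Iris directly via sklearn.datasets, whereas your skeleton already loads the data for you.

```python
# Sketch of the Part 1 loop (assumes the feature matrix is in X;
# here we load Iris directly for illustration).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

for n_clusters in range(2, 11):          # 2 to 10 inclusive
    kmeans = KMeans(n_clusters=n_clusters, random_state=0)
    kmeans.fit(X)
    print(n_clusters)                    # number of clusters
    print(np.bincount(kmeans.labels_))   # number of points in each cluster
```

np.bincount over labels_ gives the size of each cluster in one call, in cluster-index order.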
Part 2: Different random seeds
For this part, we have loaded a sample double circles dataset. This part illustrates that the algorithm does not always converge to the same solution and that the data distribution may not form nice round clusters.
Run KMeans with default parameters except with the number of clusters set to 3.
Vary the random state from 0 to 5 inclusive (i.e. [0, 1, 2, 3, 4, 5]) and fit the estimator to the dataset. For each random state,
- Print the random state.
- Print the cluster centers as a 3 x 2 matrix.
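A minimal sketch of the Part 2 loop is below. The skeleton provides the double-circles data; the make_circles call here is a stand-in assumption for illustration, so your provided dataset (and therefore your output) will differ.

```python
# Sketch of the Part 2 loop; make_circles is a stand-in for the
# provided double-circles dataset.
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles

X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)

for random_state in range(6):            # random states 0 to 5 inclusive
    kmeans = KMeans(n_clusters=3, random_state=random_state)
    kmeans.fit(X)
    print(random_state)                  # the random state
    print(kmeans.cluster_centers_)       # cluster centers as a 3 x 2 matrix
```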
Part 3: Different maximum iteration and confusion matrix
For this last part, we will use the digits dataset, which consists of small 8x8 grayscale images of the handwritten digits 0-9.
We will be using the scikit-learn function sklearn.metrics.confusion_matrix to compare our clustering labels to the true class labels (i.e. the true label of which digit the image represents). See scikit-learn’s confusion_matrix documentation for more information.
Essentially, we will treat our clustering algorithm as doing classification and then evaluate via standard classification evaluation metrics. Because clustering labels are just dummy indices and don’t correspond to class labels, we will need to permute the clustering labels so that they (approximately) align with the true class labels. Note that in general this is not necessarily possible and may not be obvious, but in this simple case, the results are reasonable. We have already provided the function for you to permute (or map) your clustering labels to predicted class labels. Thus, when computing the confusion matrix you will use the true class labels and the permuted labels.
(If a dataset has class labels, this is one way to evaluate new clustering methods, but it is not useful in real-world applications since class labels are, by the definition of clustering, not available.)
For this part, use default parameters for
KMeans except set the random state to 0 and the number of clusters to 10.
Vary the maximum number of iterations (the max_iter parameter) to be 1, 5, 10, and 50 (in that order), fit the estimator, permute the cluster labels, and compute the confusion matrix using the true labels and permuted labels. For each setting,
- Print the maximum iteration.
- Print the confusion matrix.
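The Part 3 loop can be sketched as below. Note that the assignment already supplies the label-permutation function; the permute_labels helper here is only a hypothetical stand-in that maps each cluster to its most common true digit (a majority vote), so use the provided function in your submission.

```python
# Sketch of the Part 3 loop; permute_labels is a hypothetical stand-in
# for the label-mapping function provided with the assignment.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix

def permute_labels(cluster_labels, true_labels):
    """Map each cluster index to the most frequent true label in it."""
    mapping = {c: np.bincount(true_labels[cluster_labels == c]).argmax()
               for c in np.unique(cluster_labels)}
    return np.array([mapping[c] for c in cluster_labels])

X, y = load_digits(return_X_y=True)

for max_iter in [1, 5, 10, 50]:
    kmeans = KMeans(n_clusters=10, random_state=0, max_iter=max_iter)
    kmeans.fit(X)
    predicted = permute_labels(kmeans.labels_, y)
    print(max_iter)                         # the maximum iteration
    print(confusion_matrix(y, predicted))   # 10 x 10 confusion matrix
```

With max_iter=1 the clustering is far from converged, so expect the confusion matrix to improve as max_iter grows (scikit-learn may emit a ConvergenceWarning for the small settings, which is expected here).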