# Homework 2: K-means clustering via scikit-learn

First, please accept the assignment on GitHub Classroom by clicking the link on Piazza to create the assignment repository. Please see Homework 1 instructions for how to clone your repository and submit. Remember to tag your submission with a “submission” tag and update the tag as necessary.

For each part below, we have provided skeleton code in part1.py, part2.py, and part3.py. We have also included one or two example outputs in the corresponding files (part1.txt, etc.) so that you can check the first few lines and the format of your program output.

NOTE 1: The first part of your program output should EXACTLY match the example output files (i.e. character by character). The format of the output should be exactly the same. This is important so that we can automatically grade your submissions.

NOTE 2: Please make sure to use print statements so that your program actually prints to stdout, rather than writing to the text files. When we run your program, it should print to stdout. For example, when run from the command line, the output should look like the following:

```
$ python part1.py
2
[53 97]
3
[62 50 38]
```

Note that this is accomplished by merely calling print statements. If you want to save the output to a file (optional), you can redirect stdout to the corresponding text file like so:

```
$ python part1.py > part1.txt
```


You may receive 0 points for a part if you do not follow these instructions since we are using automatic grading.

## Part 1: Different number of clusters

We have provided code to load the classic Iris flowers dataset in part1.py. For this part, run kmeans via scikit-learn’s sklearn.cluster.KMeans estimator. Use all the default parameter settings except set random_state=0 so that everyone’s code produces the same output.

Vary the number of clusters (n_clusters parameter of KMeans) from 2 to 10 and fit the estimator to the dataset. For each number of clusters,

1. Print the number of clusters.
2. Print the number of points in each cluster.
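The loop above can be sketched as follows. This is a minimal, self-contained illustration, not the provided skeleton: part1.py already loads the data for you, so only the loop body is the point here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# The skeleton loads the Iris data; we load it directly here so the
# sketch runs on its own.
X = load_iris().data

for k in range(2, 11):  # n_clusters from 2 to 10 inclusive
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    print(k)                         # 1. the number of clusters
    print(np.bincount(km.labels_))   # 2. the number of points in each cluster
```

`np.bincount` on `km.labels_` is one convenient way to count cluster sizes; any equivalent counting method is fine as long as the printed format matches the example output.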

## Part 2: Different random seeds

For this part, we have loaded a sample double circles dataset. This part illustrates that the algorithm does not always converge to the same solution and that the data may not be distributed in nice round clusters.

Run KMeans with default parameters except with the number of clusters set to 3.

Vary the random state from 0 to 5 inclusive (i.e. [0, 1, 2, 3, 4, 5]) and fit the estimator to the dataset. For each random state,

1. Print the random state.
2. Print the cluster centers as a 3 x 2 matrix.
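A sketch of this loop is below. The skeleton in part2.py loads the double circles dataset for you; here we generate a stand-in with `make_circles` (an assumption for self-containedness only, not the assignment's data).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles

# Stand-in double circles data (the skeleton provides the real dataset).
X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)

for seed in range(6):  # random_state from 0 to 5 inclusive
    km = KMeans(n_clusters=3, random_state=seed).fit(X)
    print(seed)                  # 1. the random state
    print(km.cluster_centers_)   # 2. cluster centers as a 3 x 2 matrix
```

`cluster_centers_` has shape (n_clusters, n_features), so for 2-D data with three clusters it prints as a 3 x 2 matrix; comparing the matrices across seeds shows the different local optima KMeans can land in.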

## Part 3: Different maximum iteration and confusion matrix

For this last part, we will use the digits dataset, which consists of small 8x8 grayscale images of the handwritten digits 0-9. We will use the scikit-learn function sklearn.metrics.confusion_matrix to compare our clustering labels to the true class labels (i.e. the true label of which digit each image represents). See scikit-learn's confusion matrix documentation for more information: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

Essentially, we will treat our clustering algorithm as doing classification and then evaluate via standard classification evaluation metrics. Because clustering labels are just dummy indices and don’t correspond to class labels, we will need to permute the clustering labels so that they (approximately) align with the true class labels. Note that in general this is not necessarily possible and may not be obvious, but in this simple case, the results are reasonable. We have already provided the function for you to permute (or map) your clustering labels to predicted class labels. Thus, when computing the confusion matrix you will use the true class labels and the permuted labels.

(If a dataset has class labels, this is one way to evaluate new clustering methods, but it is not useful in real-world applications since class labels are, by the definition of clustering, not available.)

For this part, use default parameters for KMeans except set the random state to 0 and the number of clusters to 10.

Vary the maximum iteration to be 1, 5, 10, and 50 (in that order), fit the estimator, permute the cluster labels, compute the confusion matrix using the true labels and permuted labels, and then print out the following:

1. Print out the maximum iteration.
2. Print out the confusion matrix.
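The steps above can be sketched as follows. Note that the assignment provides the label-permutation function for you; the `permute_labels` helper below is a hypothetical stand-in using Hungarian matching (`scipy.optimize.linear_sum_assignment`), included only so the sketch runs on its own.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix

X, y = load_digits(return_X_y=True)

def permute_labels(true_labels, cluster_labels, n_classes=10):
    # Stand-in for the provided helper: map each cluster index to the
    # class it overlaps most, maximizing total agreement.
    overlap = confusion_matrix(true_labels, cluster_labels,
                               labels=list(range(n_classes)))
    rows, cols = linear_sum_assignment(-overlap)  # maximize agreement
    mapping = np.empty(n_classes, dtype=int)
    mapping[cols] = rows
    return mapping[cluster_labels]

for max_iter in [1, 5, 10, 50]:
    km = KMeans(n_clusters=10, random_state=0, max_iter=max_iter).fit(X)
    permuted = permute_labels(y, km.labels_)
    print(max_iter)                        # 1. the maximum iteration
    print(confusion_matrix(y, permuted))   # 2. the confusion matrix
```

With the permuted labels, the diagonal of the confusion matrix counts correctly "classified" images, so you should see the diagonal strengthen as max_iter increases and KMeans is allowed to converge.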