# Homework 3: Gradient Descent in PyTorch for Gaussian Mixture Model

First, please accept the assignment on GitHub Classroom by clicking the link on Piazza to create the assignment repository. Please see Homework 1 instructions for how to clone your repository and submit. Remember to tag your submission with a “submission” tag and update the tag as necessary.

For each part below, we have provided skeleton code in `part1.py`, `part2.py`, etc.
We have also included example outputs in the files `initial_output_part1.txt`, etc., so that you
can check the first few lines of your program's output and verify the output format.

**NOTE 1: The first part of your program output should EXACTLY match the example output files
(i.e. character by character). The format of the output should be exactly the same.
This is important so that we can automatically grade your submissions.**

**NOTE 2: Please make sure to use print statements to actually print to stdout instead
of writing to the text files.**
Thus, when we run your program, it should actually print to stdout. As an example, when running
from the command line, the output should look like the following:

```
$ python part1.py
2.99090
3.45420
3.40466
```

Note that this is accomplished by merely calling `print` statements.

**You may receive 0 points for a part if you do not follow these instructions since
we are using automatic grading.**

## Setup PyTorch

You will need to install PyTorch if you have not already, similar to how you installed previous packages. See the PyTorch installation instructions.

## Part 1: Implement the negative log likelihood function for the Gaussian Mixture Model (GMM) in PyTorch

We have provided an implementation of the negative log likelihood function for GMMs using `numpy`.
The negative log likelihood function computes per-sample negative log likelihood.
Thus, if the input is a $n \times d$ matrix, the output will be a $n$-dimensional vector.

Your task is to convert these NumPy functions into equivalent PyTorch functions so that we can automatically compute gradients via PyTorch. We have provided skeleton code and denoted exactly where you should edit it. The code for printing is already included, so you shouldn't have to add any print statements. We only include the first 10 points in this part (so there are only 10 output lines), but in future parts you should use all points when computing the log likelihood.
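As a rough sketch of what the converted function can look like (the function name and argument layout here only loosely mirror the skeleton and are assumptions; `torch.distributions` is one way to get the Gaussian log density, and `torch.logsumexp` keeps the sum over components numerically stable):

```python
import torch

def gmm_neg_log_likelihood(X, pi, mu, Sigma):
    """Per-sample negative log likelihood of a GMM (illustrative signature).

    X: (n, d) data, pi: (k,) mixture weights, mu: (k, d) means,
    Sigma: (k, d, d) covariances.  Returns an (n,)-dimensional vector.
    """
    n, _ = X.shape
    k = pi.shape[0]
    log_probs = torch.empty(n, k)
    for j in range(k):
        # log N(x_i; mu_j, Sigma_j) for every sample at once
        dist = torch.distributions.MultivariateNormal(mu[j], Sigma[j])
        log_probs[:, j] = dist.log_prob(X)
    # -log sum_j pi_j N(x; mu_j, Sigma_j), computed stably via logsumexp
    return -torch.logsumexp(torch.log(pi) + log_probs, dim=1)
```

The key point is that every operation is a `torch` operation, so autograd can later differentiate through the result.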

## Part 2: Compute gradients of the *mean* negative log likelihood via the `backward` function

In this part, we merely compute the gradients for all parameters with respect
to the *mean* of the negative log likelihood function.
Note that the previous step gives the *per-sample* negative log likelihood
so you need to take the mean/average of the per-sample negative log likelihood to get
a single number.
Then you can call backward on this single number to get the gradients for all
the parameters.
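The reduce-then-`backward` pattern can be sketched on a toy example (the variable names and the quadratic stand-in loss below are illustrative, not from the skeleton):

```python
import torch

torch.manual_seed(0)
mu = torch.zeros(3, requires_grad=True)            # a parameter to differentiate
X = torch.randn(10, 3)
per_sample_nll = 0.5 * ((X - mu) ** 2).sum(dim=1)  # stand-in for the per-sample NLL
loss = per_sample_nll.mean()  # reduce the n-vector to a single scalar
loss.backward()               # fills in .grad for every requires_grad parameter
print(mu.grad)                # gradient of the mean NLL w.r.t. mu
```

Calling `backward()` is only valid on a scalar, which is why the mean is taken first.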

This is just to check your understanding of computing gradients via PyTorch.
Again, similar to part 1 we have provided skeleton code for you.
When computing the log likelihood, make sure to use all points in `X_test`.

## Part 3: Optimize x via vanilla gradient descent to find the modes of the GMM distribution

For this part, we will actually optimize over $x$ (i.e., a data point) instead of the model parameters $\theta$. In class, we did something similar for a univariate Gaussian. The main idea is to use the same mean negative log likelihood function but optimize over $x$ instead of $\theta$ (we will optimize over $\theta$ in part 5, but that is slightly trickier). Optimization will start at a random point and then climb the GMM density function until the optimized $x$ reaches one of the "peaks" (modes) of the density. Again, skeleton code has been provided; you merely need to implement the gradient update step.

*Hint 1: `x` will originally be a simple vector (i.e., a one-dimensional array), but
`gmm_neg_log_likelihood` expects a matrix, so you need to reshape `x` to be a 2D array
with only one row, i.e., `x.reshape(1, -1)`. (See the notebook on gradient descent.)*

*Hint 2: Make sure not to overwrite `x` when reshaping, because you will need the
gradient of the original `x`, which has `requires_grad=True`.*
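Putting the hints together, the update step can be sketched as follows (the density here is a stand-in standard Gaussian, not the GMM, and the names are illustrative; note the reshape happens inside the loss computation so `x` itself is never overwritten):

```python
import torch

torch.manual_seed(0)
x = torch.randn(2, requires_grad=True)  # random starting point
lr = 0.1

def neg_log_density(X):
    # Stand-in for gmm_neg_log_likelihood; for a standard Gaussian
    # the single mode is at the origin.
    return 0.5 * (X ** 2).sum(dim=1)

for _ in range(100):
    loss = neg_log_density(x.reshape(1, -1)).mean()  # reshape, don't overwrite x
    loss.backward()
    with torch.no_grad():   # update the parameter outside autograd tracking
        x -= lr * x.grad
    x.grad.zero_()          # clear the gradient before the next iteration
```

The `torch.no_grad()` block and the `grad.zero_()` call are the standard pattern for a manual vanilla gradient descent step.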

## Part 4: Convert unconstrained parameters to constrained parameters and vice versa

The original problem for optimizing the GMM parameters is formulated as follows:

$$\min_{\pi,\; \mu_{1 \dots k},\; \Sigma_{1 \dots k}} \; -\frac{1}{n} \sum_{i=1}^{n} \log \sum_{j=1}^{k} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j),$$

where there are constraints on $\pi$ and $\Sigma_{1\dots k}$. In particular, $\pi_j \geq 0, \forall j$, and $\sum_j \pi_j = 1$ (i.e., the weights $\pi$ must form a probability mass function), and the covariance matrices $\Sigma_{1 \dots k}$ must be positive definite (PD). As discussed in class, constraints are more difficult to handle. Thus, we want to convert to an equivalent problem that is unconstrained. Namely, let us denote our parameters as $\theta = (\theta^{(\pi)}, \theta^{(\mu)}, \theta^{(\Sigma)})$ and let us define the following unconstrained optimization problem:

$$\min_{\theta} \; -\frac{1}{n} \sum_{i=1}^{n} \log \sum_{j=1}^{k} \pi_j(\theta) \, \mathcal{N}\big(x_i \mid \mu_j(\theta), \Sigma_j(\theta)\big),$$

where $\pi(\theta)$, $\mu_{1 \dots k}(\theta)$, and $\Sigma_{1 \dots k}(\theta)$ map the unconstrained $\theta$ onto the constraint sets above.
Your task is to implement the conversion from $\theta$ to $\pi$, $\mu_{1 \dots k}$, and $\Sigma_{1 \dots k}$.
For converting $\theta^{(\pi)}$ to $\pi$, please use the `torch.softmax` function (you will likely want to look at its documentation if you haven't already).
This will be important for using unconstrained gradient descent in the next part.
We performed something similar in class when using gradient descent to optimize for the covariance matrix.
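A minimal sketch of the unconstrained-to-constrained direction is below. The function name is illustrative, and the covariance parameterization ($A A^\top$ plus a small jitter) is an assumption, one common choice for producing a PD matrix; follow whatever parameterization the skeleton code specifies:

```python
import torch

def constrained_params(theta_pi, theta_mu, theta_Sigma):
    """Map unconstrained theta to valid GMM parameters (illustrative names).

    theta_pi: (k,), theta_mu: (k, d), theta_Sigma: (k, d, d).
    """
    pi = torch.softmax(theta_pi, dim=0)  # nonnegative, sums to 1
    mu = theta_mu                        # means need no constraint
    d = theta_Sigma.shape[-1]
    # A A^T is positive semidefinite; the small jitter makes it positive definite
    Sigma = theta_Sigma @ theta_Sigma.transpose(-1, -2) + 1e-3 * torch.eye(d)
    return pi, mu, Sigma
```

Because every step is differentiable, gradients with respect to the unconstrained $\theta$ flow through these conversions automatically.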

## Part 5: Use vanilla gradient descent to find the parameters of the GMM distribution

Using the functions you implemented in part 4, you can now implement unconstrained gradient descent on the unconstrained parameters $\theta$. At the beginning of each gradient descent step, you will need to convert the current $\theta$ to the constrained GMM parameters. Then, compute the mean negative log likelihood of the GMM, compute gradients, and update $\theta$. This puts all the steps together to fit a GMM via gradient descent instead of EM. Again, skeleton code has been provided, so you only need to implement the gradient update step.
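The overall loop can be sketched end to end as below. Everything here is a hypothetical stand-in for the skeleton: the variable names, the data, and the $A A^\top$ covariance parameterization are all assumptions, but the three-step structure (convert, compute mean NLL, gradient step) is the point:

```python
import torch

torch.manual_seed(0)
k, d, lr = 2, 2, 0.05
X = torch.randn(100, d)  # toy data standing in for the assignment's dataset

# Unconstrained parameters theta = (theta_pi, theta_mu, theta_Sigma)
theta_pi = torch.zeros(k, requires_grad=True)
theta_mu = torch.randn(k, d, requires_grad=True)
theta_Sigma = torch.stack([torch.eye(d)] * k).requires_grad_(True)

losses = []
for step in range(200):
    # 1. Convert unconstrained theta to constrained GMM parameters.
    pi = torch.softmax(theta_pi, dim=0)
    Sigma = theta_Sigma @ theta_Sigma.transpose(-1, -2) + 1e-3 * torch.eye(d)
    # 2. Mean negative log likelihood over all points.
    log_probs = torch.stack(
        [torch.distributions.MultivariateNormal(theta_mu[j], Sigma[j]).log_prob(X)
         for j in range(k)], dim=1)
    loss = -torch.logsumexp(torch.log(pi) + log_probs, dim=1).mean()
    losses.append(loss.item())
    # 3. Compute gradients and take one vanilla gradient descent step.
    loss.backward()
    with torch.no_grad():
        for p in (theta_pi, theta_mu, theta_Sigma):
            p -= lr * p.grad
            p.grad.zero_()
```

In the actual assignment you only need to fill in the update step; the rest is already in the skeleton.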