## CSC411/2515 Study Guide, Winter 2018

Suppose our training set and test set are the same. Why would this be a problem?

Why is it necessary to have both a test set and a validation set?

Images are generally represented as \(n\times m\times 3\) arrays, but we treat them as vectors. How is that possible?

Write pseudocode to select the \(k\) in the \(k\)-nearest-neighbours algorithm.
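
Such pseudocode amounts to trying each candidate \(k\) and keeping the one that does best on a validation set. A runnable Python sketch of one possible answer (the candidate list and the use of plain Euclidean distance are illustrative assumptions):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Predict the label of x by majority vote among its k nearest neighbours."""
    dists = np.sum((X_train - x) ** 2, axis=1)     # squared Euclidean distances
    nearest = np.argsort(dists)[:k]                # indices of the k closest points
    return np.bincount(y_train[nearest]).argmax()  # most common label

def select_k(X_train, y_train, X_val, y_val, candidate_ks):
    """Return the candidate k with the highest validation accuracy."""
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        preds = np.array([knn_predict(X_train, y_train, x, k) for x in X_val])
        acc = np.mean(preds == y_val)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k
```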

Give an example of a training and a test set such that 3-nearest-neighbours and 5-nearest neighbours perform differently on the test set.

Consider the data plotted below. If all that you have to go by is the data below, how would you objectively determine whether 1-NN or 15-NN is the better classifier for the data? Use pseudocode.

Explain why the negative cosine distance may make more sense for comparing images than the Euclidean distance.

Explain why the Euclidean distance may make more sense for comparing images than the negative cosine distance.

Why do we have the minus sign in the formula for the negative cosine distance?

What is the performance of k-NN on the training set for \(k = 1\)? Is it the same for \(k = 3\)? (Give an example in your answer to the second question.)

What happens to the performance of k-NN if \(k\) is the same as the size of the training set?

Give an example of a (non-image) dataset where the Euclidean distance metric makes more sense than the negative cosine distance.

Explain how the quadratic cost function \[J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2 \] is constructed. Specifically, what is the intuitive meaning of \[(h_\theta(x^{(i)})-y^{(i)})^2?\]

How can you estimate the best \(h_{\theta}\) by looking at a surface plot of the cost function, in the context of linear regression?

How can you estimate the best \(h_{\theta}\) by looking at a contour plot of the cost function, in the context of linear regression?

When performing coordinate-wise gradient descent to minimize the function \(J\), the update rule is \[\text{for j = 1, 2, ..., n}: x_j \leftarrow x_j - \alpha \frac{\partial J}{\partial x_j}(x)\] When performing steepest-descent gradient descent, the update rule is: \[x \leftarrow x - \alpha \nabla J (x)\] Why are those rules not equivalent?

Characterize functions for which a larger learning rate \(\alpha\) makes sense and functions for which a smaller learning rate \(\alpha\) makes sense.

Write a Python function to find a local minimum of \[y = x^6 - 10 x^5- 5x^4 + 1\] using gradient descent.
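
A minimal sketch of one possible answer. The starting point and learning rate below are arbitrary choices; the learning rate must be very small here because the polynomial’s derivative takes on very large values:

```python
def f(x):
    return x**6 - 10*x**5 - 5*x**4 + 1

def df(x):
    # derivative of f
    return 6*x**5 - 50*x**4 - 20*x**3

def grad_descent(x0, alpha=1e-7, tol=1e-8, max_iter=100000):
    """Gradient descent on f starting from x0; returns an approximate local minimizer."""
    x = x0
    for _ in range(max_iter):
        x_new = x - alpha * df(x)
        if abs(x_new - x) < tol:   # stop when the steps become negligible
            break
        x = x_new
    return x
```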

On a plot of a function of one variable \(y= f(x)\) , at point \((x_0, f(x_0))\) , write down a unit vector that points “uphill.”

On a surface plot of a function of two variables, at point \((x_0, y_0, f(x_0, y_0))\) , write down a unit vector that points “downhill.”

Without computing the gradient symbolically, how can you estimate it? (Hint: you can estimate partial derivatives by looking at the definition of the derivative)
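
For instance, each partial derivative \(\partial J/\partial \theta_j\) can be approximated by the finite difference \(\big(J(\theta + h e_j) - J(\theta - h e_j)\big)/2h\) for a small \(h\). A sketch:

```python
import numpy as np

def estimate_grad(J, theta, h=1e-6):
    """Estimate the gradient of J at theta using central differences."""
    grad = np.zeros_like(theta, dtype=float)
    for j in range(len(theta)):
        e = np.zeros_like(theta, dtype=float)
        e[j] = h                                    # perturb only coordinate j
        grad[j] = (J(theta + e) - J(theta - e)) / (2 * h)
    return grad
```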

Suppose \(J(\theta)\) is the cost function for linear regression. Explain how to interpret \[\partial J/\partial \theta_0 = 1/m \sum_i (\theta^T x^{(i)}-y^{(i)}),\] where \(\theta_0\) is the constant coefficient.

Explain how to use regression in order to solve two-class classification problems. What’s the problem with using this approach for multi-class classification?

Describe a technique for using regression for multi-class classification problems.

Show that the code below computes the gradient with respect to the parameters of a linear hypothesis function.

```python
def df(x, y, theta):
    x = vstack((ones((1, x.shape[1])), x))
    return -2 * sum((y - dot(theta.T, x)) * x, 1)
```

Write non-vectorized code to perform linear regression.
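
A sketch of one possible answer, with explicit loops over both examples and iterations (a single input feature is assumed, and the learning rate and iteration count are arbitrary):

```python
def linreg_gd(xs, ys, alpha=0.01, n_iter=5000):
    """Fit y = theta0 + theta1*x by gradient descent on the quadratic cost,
    without any vectorization; xs and ys are plain Python lists."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iter):
        g0, g1 = 0.0, 0.0
        for i in range(m):
            err = theta0 + theta1 * xs[i] - ys[i]   # h_theta(x_i) - y_i
            g0 += err           # contributes to dJ/dtheta0
            g1 += err * xs[i]   # contributes to dJ/dtheta1
        theta0 -= alpha * g0 / m
        theta1 -= alpha * g1 / m
    return theta0, theta1
```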

Explain how points that are not separable by a linear decision boundary using just the original features can become separable by a linear decision boundary if computed features are used.

Recall that we can find the Maximum Likelihood solution to linear regression problems using \[\hat{\theta} = (X^T X)^{-1} (X^T Y).\] Suppose you are given a dictionary `paul_imgs` whose keys are `0..19`, and a dictionary `john_imgs` whose keys are `0..19`, such that `paul_imgs[i]` and `john_imgs[i]` are the i-th image of Paul and John, respectively. The images are \(20\times 20\) NumPy arrays representing grayscale images. Write NumPy code that would take in a new image, and return `True` if it’s likely a picture of Paul and `False` if it’s likely a picture of John. Use linear regression.

Explain why it makes sense to use Maximum Likelihood estimation.

Explain why it sometimes makes sense to use a prior on the parameter \(\theta\).

Show that, for priors that are never zero, the more training data there is, the less important the prior is for Maximum A-Posteriori inference. Show that by writing down the posterior probability that is being maximized.

Define the 0-1 classification error. What is the gradient of the 0-1 classification error?

One issue with the quadratic cost function when it is used in the context of classification is that the cost is large when \(\theta^T x^{(i)}\) is very large. Explain why that is a problem by considering a situation where there is just one point \(x^{(i)}\) for which this happens.

Show that the decision surface when using Logistic Regression is a hyperplane. (First, define what a hyperplane is.)

Show that \[\sigma'(t) = (1-\sigma(t))\sigma(t).\]
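
One way the computation can go, starting from the definition \(\sigma(t) = \frac{1}{1+e^{-t}}\): \[\sigma'(t) = \frac{e^{-t}}{(1+e^{-t})^2} = \frac{1}{1+e^{-t}}\cdot \frac{e^{-t}}{1+e^{-t}} = \sigma(t)\left(1 - \sigma(t)\right),\] using the fact that \(1-\sigma(t) = \frac{e^{-t}}{1+e^{-t}}\).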

Show that the gradient of the quadratic cost is very small for points where the classifier is "very sure" when using \(h_\theta (x) = \sigma(\theta^T x)\) as the hypothesis function. Why is this a problem?

Looking at slide 1 of the Linear Classification lecture, describe precisely how to obtain the decision boundary (i.e., an equation for the decision boundary) when the input is a list of coordinates of orange points and coordinates of blue points.

Draw a graph for the quadratic cost that is similar to the graph on Slide 15 of the linear classification lecture.

Explain why the Cross Entropy loss used in the Logistic Regression context makes sense.

Prove that the Maximum Likelihood estimate for \(\theta\) when \(Y_i\sim Bernoulli(\theta)\) is \(\frac{1}{n} \sum_i^{n} y_i.\)
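
One way the proof can go: the log-likelihood is \[\ell(\theta) = \sum_{i=1}^{n}\left[y_i \log\theta + (1-y_i)\log(1-\theta)\right],\] so setting \[\ell'(\theta) = \frac{\sum_i y_i}{\theta} - \frac{n - \sum_i y_i}{1-\theta} = 0\] and solving for \(\theta\) gives \(\hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} y_i\).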

Recall the following code. Explain why `theta_hat` will tend to have larger values (in terms of absolute value) than `theta`.

```python
def compare_theta_theta_hat(N, ndim, sigma_theta, sigma):
    theta = scipy.stats.norm.rvs(scale=sigma_theta, size=ndim+1)
    x_raw = 100*(random.random((N, ndim))-.5)
    x = hstack((ones((N, 1)), x_raw))
    y = dot(x, theta) + scipy.stats.norm.rvs(scale=sigma, size=N)
    theta_hat = dot(linalg.pinv(dot(x.T, x)), dot(x.T, y))
    return theta[1], theta_hat[1]
```

How would you use Bayesian inference in order to make it so that `theta_hat` is not systematically larger? Explain why your prior makes sense.

Show that minimizing the sum of squared differences between the prediction \(h_\theta (x)\) and the actual outputs \(y\) is equivalent to maximizing the likelihood when \(y\sim N(\theta^T x, \sigma^2)\).

Show how to compute AND using a neural network that uses logistic neurons.
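
A sketch of one possible construction; the particular weights \((-30, 20, 20)\) are one choice among many (any sufficiently large weights with these signs work):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def and_neuron(x1, x2):
    """Logistic neuron computing AND on binary inputs: the pre-activation is
    +10 when x1 = x2 = 1 and at most -10 otherwise, so the output exceeds 0.5
    only for (1, 1)."""
    return sigmoid(-30.0 + 20.0 * x1 + 20.0 * x2)
```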

Why can’t you use the log-loss when using linear neurons?

Assume that \(y\sim Poisson(\theta^T x)\). Derive the log-likelihood of the training set where the inputs are \(x\) and the outputs are \(y\).

Write Python code to find the Maximum Likelihood estimate for \(\theta\) where \(y\sim Poisson(\theta^T x)\).
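
A sketch using gradient ascent on the log-likelihood \(\ell(\theta) = \sum_i \left[y^{(i)}\log(\theta^T x^{(i)}) - \theta^T x^{(i)}\right]\) (the \(\log y^{(i)}!\) term does not depend on \(\theta\)). The learning rate and iteration count are arbitrary, and the sketch simply assumes \(\theta^T x\) stays positive throughout, as the Poisson rate must be:

```python
import numpy as np

def poisson_mle(X, y, theta0, alpha=0.05, n_iter=20000):
    """Gradient ascent on the Poisson log-likelihood.
    X is (m, n); y is (m,); theta0 must satisfy X @ theta0 > 0."""
    theta = theta0.astype(float).copy()
    for _ in range(n_iter):
        lam = X @ theta                # one Poisson rate per training example
        grad = X.T @ (y / lam - 1.0)   # gradient of sum(y*log(lam) - lam)
        theta += alpha * grad          # ascent step
    return theta
```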

After you find the Maximum Likelihood estimate for \(\theta\) where \(y\sim Poisson(\theta^T x)\), how would you use it to make predictions for new data?

Suppose that your model for the data is \[y\sim Poisson(\alpha x+10), \] with \(\alpha \in \mathbb{R}\). Assume you believe that \(\alpha\sim N(0, \beta)\). Write Python code to obtain the posterior distribution of \(\alpha\) given a dataset, and to obtain the Maximum A-Posteriori (MAP) estimate of \(\alpha\).

Write vectorized code (i.e., you cannot use for-loops) to compute the output of a feedforward neural network with the activation function \(g\).
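
A sketch, assuming a fully connected network whose parameters are given as a list of weight matrices `Ws` and bias vectors `bs` (these names are made up for illustration):

```python
import numpy as np

def feedforward(X, Ws, bs, g):
    """Compute the network's output for all m inputs at once.
    X has shape (m, n_in); Ws[l] has shape (n_l, n_{l+1}); bs[l] has shape
    (n_{l+1},). The activation g is applied elementwise at every layer."""
    h = X
    for W, b in zip(Ws, bs):
        h = g(h @ W + b)   # one matrix multiply handles all examples and units
    return h
```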

Explain all three of the contour plots below (i.e., state what each contour plot corresponds to). Why do they show that, in this case, L1 regularization (with what \(\lambda\)?) leads to a sparse solution? Point out the minima that you can obtain from the contour plots. Sketch the contour plot of the cost function obtained when summing up the error and regularization terms.

Suppose that both height and weight predict shoe size, but height predicts shoe size a little better. Explain why L1 regularization can make the coefficient for weight be 0 but L2 regularization may not.

Explain the intuition for why Early Stopping accomplishes a task similar to L2 regularization.