Probabilistic nearest neighbours

Wed 16 March 2022

Some notes on a probabilistic interpretation of the nearest neighbours algorithm as a mixture of Gaussians.

Theory

This article pretty much follows from Barber, 2012

kNN is simple to understand and implement, and often used as a baseline. This method comes with a few limitations:

In metric based methods, how do we measure distance? Euclidean distance, for instance, does not account for how data is distributed.
The whole dataset needs to be stored to make a classification since the novel point must be compared to all of the train points. This is computationally intensive in both the time and space domains.
Each distance calculation can be expensive if the datapoints are high dimensional.

We can reformulate the kNN as a class conditional mixture of Gaussians, and work around all of the above.

Probabilistic NN

Barber 2012 shows that nearest neighbour is the limiting case of a mixture model.

What follows is a solution for Exercise 14.2 from Barber 2012 in python.

Write a routine SoftNearNeigh(xtrain,xtest,trainlabels,sigma) to implement soft nearest neighbours, analogous to nearNeigh.m. Here sigma is the variance σ2 in equation (14.3.1). The file NNdata.mat contains training and test data for the handwritten digits 5 and 9. Using leave one out cross-validation, find the optimal
$$\sigma^2$$
and use this to compute the classification accuracy of the method on the test data. Hint: you may have numerical difficulty with this method. To avoid this, consider using the logarithm, and how to numerically compute
$$log (e^a + e^b)$$
for large (negative) a and b. See also logsumexp.m.

Data

For this exercise we'll be using a subsent of the MNIST dataset provided in BRMLtoolkit, available online at Barber 2012.

import scipy.io as sio
nndata = sio.loadmat('BRMLtoolkit/data/NNdata.mat')

We want to solve a binary classification problem:

class0_train = nndata['train5']
class0_test = nndata['test5']

class1_train = nndata['train9']
class1_test = nndata['test9']

Barber follows a generative approach and uses kernel density estimation (a Parzen estimator) to interpret kNN as the limiting case of a mixture of Gaussians. An isotropic Gaussian of width $\sigma^2$ is placed at each data point, and a mixture is used to model each class.

The Parzen estimator

With kernel density estimation we want to approximate a PDF with a mixture of continuous probability distributions. A Parzen estimator centers a probability distribution at each data point $\textbf{x}_n$ as

$$P(\textbf{x}) = \frac{1}{N} \sum_{n=1}^{N} P(\textbf{x}|\textbf{x}_{n})$$

For a D dimensional $\textbf{x}$ we choose an isotropic Gaussian

$$P(\textbf{x}|\textbf{x}_{n}) = \mathcal{N} (\textbf{x}|\textbf{x}_{n}, \sigma^2 \textbf{I}_D)$$

where $\sigma^2 \textbf{I}_D$ is a scalar variance multiplied by an identity matrix.

This gives the mixture:

$$ P(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2 \pi \sigma^2)^{D/2}}\mathcal{e}^{- (\textbf{x} - \textbf{x}\_n)^2 / 2\sigma^2} $$

Nearest Neighbour classification

Given classes $c = {0, 1}$, we consider the following mixture model

$$P(\textbf{x}|c=0) = \frac{1}{N_0} \sum_{n \in \textit{class 0}} \mathcal{N}(\textbf{x}| \textbf{x}_n, \sigma^2\textbf{I}) = \frac{1}{N_0} \frac{1}{(2 \pi \sigma^2)^{\frac{D}{2}}} \sum_{n \in \textit{class 0}} e^{-(\textbf{x} - \textbf{x}_n)^2 / (2 \sigma^2) }$$

$$P(\textbf{x}|c=1) = \frac{1}{N_1} \sum_{n \in \textit{class 1}} \mathcal{N}(\textbf{x}| \textbf{x}_n, \sigma^2\textbf{I}) = \frac{1}{N_1} \frac{1}{(2 \pi \sigma^2)^{\frac{D}{2}}} \sum_{n \in \textit{class 1}} e^{-(\textbf{x} - \textbf{x}_n)^2 / (2 \sigma^2) } $$

To classify a new instance $\textbf{x}_{*}$ we calculate the posterior for both classes and take the ratio

$\frac{P(c=0|\textbf{x}_{\star})}{P(c=1|\textbf{x}_{\star})} = \frac{P(\textbf{x}_{\star}|c=0) P(c=0)}{P(\textbf{x}_{\star}|c=1) P(c=1)}$. If this ratio is $\gt 1$ than we classify $\textbf{n}_{\star}$ as class 0, otherwise as class 1. The class probabilities can be determined by maximum likelihood with the following setting $P(c) = \sum_{i=0}^N \frac{N_{i}}{N}$.

To understand how this relates to the nearest neighbour method, we need to consider the case $\sigma^2 \rightarrow 0$.

Note that both nominator and denominator are sums of exponentials. Intuitively, if the veriance is small, the nominator will be dominated by the term for which point $x_{n_0} \in class 0$ is closest to $\textbf{x}_{\star}$. The same holds for the denominator and points in class 1.

$$\frac{1}{(2 \pi \sigma^2)^{\frac{D}{2}}}$$

cancels out, and for vanishingly small values of $\sigma$ we have

$$\frac{P(c=0|\textbf{x}_{\star})}{P(c=1|\textbf{x}_{\star})} \approx$$

$$\frac{ e^{-(\textbf{x}_{\star} - x_{n_0})^2 / (2 \sigma^2)} P(c=0)/N_{0}} {e^{-(\textbf{x}_{\star} - x_{n_1})^2 / (2 \sigma^2)}P(c=1)/N_{1}}$$

$$\frac{ e^{-(\textbf{x}_{\star} - x_{n_0})^2 / (2 \sigma^2)}} {e^{-(\textbf{x}_{\star} - x_{n_1})^2 / (2 \sigma^2)}} $$

In the limit $\sigma^2 \rightarrow 0$ we have

$$\frac{ e^{-(\textbf{x}_{\star} - x_{n_0})^2}}{e^{-(\textbf{x}_{\star} - x_{n_1})^2}}$$

so we classify $\textbf{x}_{\star}$ as class 0 if it is closer to class 0 than class 1.

Implementation

Summing a large number exponentials can result in overflow (underflow), and accuracy loss. A common trick to avoid this issue is to perform arithemtic computation on a logarithmic scale.

$$LogSumExp(x_1, \dots, x_n) = x_{\star} + \log\left( \exp(x_1-x_{\star})+ \cdots + \exp(x_n-x_{\star}) \right)$$

where

$$x_{\star} = \max{{x_1, \dots, x_n}}$$

We can implement LogSumExp in numpy as

import numpy as np
def log_sum_exp(x):
    """
    Log likelihood of a Parzen estimator
    """
    a = np.max(x)
    return np.max(x) + np.log(np.sum(np.exp(x-a)))

However, scipy provides a logsumexp function that computes the log of the sum of exponentials of input elements.

import numpy as np
from scipy.special import logsumexp

def parzen(x, mu, sigma=1.0):
    """
    Makes a function that allows the evalution of a Parzen 
    estimator where the Kernel is a normal distribution with 
    stddev sigma and with points at mu.

    Parameters
    -----------
    x : numpy array
        Classification input
    mu : numpy matrix
        Contains the data points over which this distribution is based.
    sigma : scalar
        The standard deviation of the normal distribution around each data \
        point.
    Returns
    -------
    lpdf : callable
        Estimator of the log of the probability density under a point.
    """
    z = mu.shape[0] * np.sqrt(sigma * np.pi * 2.0)
    e = (-(x - mu)**2.0) / (2.0 * sigma)
    log_p = logsumexp(e) 
    return log_p - z

And we put everything together with:

sigmas = [1e-13]

priorC0 = class0_train.T.shape[0] / (class0_train.T.shape[0] + class1_train.T.shape[0])
priorC1 = class1_train.T.shape[0] / (class1_train.T.shape[0] + class1_train.T.shape[0])


for sigma in sigmas:
    correct = 0
    for x in class0_test.T:
        c0p = priorC0 * parzen(x, class0_train.T, sigma=sigma) 
        c1p = priorC1 * parzen(x, class1_train.T, sigma=sigma) 

        if (c0p / c1p) > 1:
            correct += 1

    print(sigma, correct, class0_test.shape[1], correct / class0_test.shape[1])

    correct = 0
    for x in class1_test.T:
        c0p = priorC0 * parzen(x, class0_train.T, sigma=sigma) 
        c1p = priorC1 * parzen(x, class1_train.T, sigma=sigma)

        if (c0p / c1p) < 1:
            correct += 1

    print(sigma, correct, class1_test.shape[1], correct / (class1_test.shape[1]))