2.8. Density Estimation

Density estimation is a very simple concept, and most people are already familiar with one common density estimation technique: the histogram.

A histogram is a simple visualization of data where bins are defined, and the number of data points within each bin is tallied. An example of a histogram can be seen in the upper-left panel of the following figure:

A major problem with histograms, however, is that the choice of binning can have a disproportionate effect on the resulting visualization. Consider the upper-right panel of the above figure. It shows a histogram over the same data, with the bins shifted right. The results of the two visualizations look entirely different, and might lead to different interpretations of the data.
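The sensitivity to bin placement can be checked numerically. The following is a minimal sketch, assuming a hypothetical two-cluster sample and a bin width of 1.0 (neither is taken from the figure); it tallies the same data on two grids that differ only by a half-bin shift.

```python
import numpy as np

# Hypothetical 1D sample: two clusters, fixed seed for reproducibility.
rng = np.random.RandomState(0)
x = np.concatenate([rng.normal(-1.0, 0.5, 30), rng.normal(2.0, 0.7, 20)])

# Same bin width (1.0); the second grid is shifted right by half a bin.
bins_a = np.arange(-4.0, 6.0, 1.0)
bins_b = bins_a + 0.5

counts_a, _ = np.histogram(x, bins=bins_a)
counts_b, _ = np.histogram(x, bins=bins_b)

print("original bins:", counts_a)
print("shifted bins: ", counts_b)
```

Both histograms tally the same points, yet the per-bin counts (and hence the apparent shape) differ.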

Intuitively, one can also think of a histogram as a stack of blocks, one block per point. By stacking the blocks in the appropriate grid space, we recover the histogram. But what if, instead of stacking the blocks on a regular grid, we center each block on the point it represents, and sum the total height at each location? This idea leads to the lower-left visualization. It is perhaps not as clean as a histogram, but the fact that the data drive the block locations means that it is a much better representation of the underlying data.

This visualization is an example of kernel density estimation, in this case with a top-hat kernel (i.e., a square block at each point). We can recover a smoother distribution by using a smoother kernel. The bottom-right plot shows a Gaussian kernel density estimate, in which each point contributes a Gaussian curve to the total. The result is a smooth density estimate which is derived from the data, and functions as a powerful non-parametric model of the distribution of points.
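The "one block per point" idea can be sketched directly in NumPy. The helper `kde_1d`, the sample points, and the bandwidth of 0.75 below are all hypothetical choices made for illustration; the sketch places one normalized kernel on each point and sums the heights on a grid.

```python
import numpy as np

def kde_1d(points, grid, h, kernel="gaussian"):
    """Sum one normalized kernel per data point, evaluated on `grid`."""
    # Scaled offsets between grid locations and data points: shape (G, N).
    u = (grid[:, None] - points[None, :]) / h
    if kernel == "tophat":
        # A block of half-width h centered on each point.
        k = np.where(np.abs(u) < 1.0, 0.5, 0.0)
    elif kernel == "gaussian":
        k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    else:
        raise ValueError(f"unknown kernel: {kernel}")
    # Average over points and rescale by h so the estimate integrates to 1.
    return k.sum(axis=1) / (len(points) * h)

points = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])  # hypothetical data
grid = np.linspace(-10.0, 15.0, 2501)
dx = grid[1] - grid[0]

tophat = kde_1d(points, grid, h=0.75, kernel="tophat")
gauss = kde_1d(points, grid, h=0.75, kernel="gaussian")

# Both estimates are densities, so their area should be close to 1.
print("top-hat area: ", tophat.sum() * dx)
print("gaussian area:", gauss.sum() * dx)
```

The top-hat estimate is piecewise constant with jumps at the block edges, while the Gaussian estimate varies smoothly across the same data.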

2.8.2. Kernel Density Estimation

Kernel density estimation in scikit-learn is implemented in the sklearn.neighbors.KernelDensity estimator, which uses the Ball Tree or KD Tree for efficient queries (see the sklearn.neighbors documentation for a discussion of these). Although the above example uses a 1D data set for simplicity, kernel density estimation can be performed in any number of dimensions; in practice, however, the curse of dimensionality causes its performance to degrade in high dimensions.

In the following figure, 100 points are drawn from a bimodal distribution, and the kernel density estimates are shown for three choices of kernels:

It’s clear how the kernel shape affects the smoothness of the resulting distribution. The scikit-learn kernel density estimator can be used as follows:
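A minimal sketch of such usage, assuming a small hypothetical 2D array `X` and an illustrative bandwidth of 0.2:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Hypothetical 2D data set: two small clusters of three points each.
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# Fit a Gaussian KDE; the bandwidth of 0.2 is an illustrative choice.
kde = KernelDensity(kernel="gaussian", bandwidth=0.2).fit(X)

# score_samples returns the log of the estimated density at each query point.
log_density = kde.score_samples(X)
print(log_density)
```

Note that `score_samples` returns log-densities; apply `np.exp` to recover the density values themselves.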

Here we have used kernel='gaussian', as seen above. Mathematically, a kernel is a positive function $K(x; h)$ which is controlled by the bandwidth parameter $h$. Given this kernel form, the density estimate at a point $y$ within a group of points $x_i;\ i = 1, \cdots, N$ is given by:

$$\rho_K(y) = \sum_{i=1}^{N} K(y - x_i; h)$$

The bandwidth here acts as a smoothing parameter, controlling the tradeoff between bias and variance in the result. A large bandwidth leads to a very smooth (i.e. high-bias) density distribution. A small bandwidth leads to an unsmooth (i.e. high-variance) density distribution.
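The smoothing effect of the bandwidth can be made concrete with a small experiment. This is a sketch under stated assumptions: the bimodal sample, the three bandwidth values, and the `roughness` measure (the integrated squared second derivative of the estimate) are illustrative choices, not part of the estimator itself.

```python
import numpy as np

def gaussian_kde_1d(points, grid, h):
    """Normalized 1D Gaussian kernel density estimate on `grid`."""
    u = (grid[:, None] - points[None, :]) / h
    norm = len(points) * h * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * u**2).sum(axis=1) / norm

def roughness(density, dx):
    """Integrated squared second derivative (finite-difference estimate)."""
    second = np.diff(density, n=2) / dx**2
    return np.sum(second**2) * dx

# Hypothetical bimodal sample of 100 points.
rng = np.random.RandomState(0)
points = np.concatenate([rng.normal(0, 1, 30), rng.normal(5, 1, 70)])
grid = np.linspace(-5, 10, 1501)
dx = grid[1] - grid[0]

results = {}
for h in (0.1, 0.5, 2.0):
    results[h] = roughness(gaussian_kde_1d(points, grid, h), dx)
    print(f"bandwidth={h:>4}: roughness={results[h]:.4f}")
```

The roughness falls sharply as the bandwidth grows: the small-bandwidth estimate follows individual points (high variance), while the large-bandwidth estimate blurs the two modes together (high bias).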

sklearn.neighbors.KernelDensity implements several common kernel forms, which are shown in the following figure:

The form of these kernels is as follows:

Gaussian kernel (kernel = 'gaussian')

Epanechnikov kernel (kernel = 'epanechnikov')

Exponential kernel (kernel = 'exponential')

Linear kernel (kernel = 'linear')
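As a rough sketch, the four kernels listed above can be written as unnormalized functions of a non-negative distance `x` and bandwidth `h`. The function names are hypothetical, and the normalization constants (which depend on the dimension) are deliberately omitted, so these give only the shape of each kernel.

```python
import numpy as np

def gaussian(x, h):
    # Proportional to exp(-x^2 / (2 h^2)); never reaches zero.
    return np.exp(-x**2 / (2.0 * h**2))

def epanechnikov(x, h):
    # Proportional to 1 - x^2 / h^2 inside the bandwidth, zero outside.
    return np.where(x < h, 1.0 - x**2 / h**2, 0.0)

def exponential(x, h):
    # Proportional to exp(-x / h); heavier tails than the Gaussian.
    return np.exp(-x / h)

def linear(x, h):
    # Proportional to 1 - x / h inside the bandwidth, zero outside.
    return np.where(x < h, 1.0 - x / h, 0.0)

x = np.array([0.0, 0.5, 1.0, 2.0])  # distances from the kernel center
h = 1.0
for kernel in (gaussian, epanechnikov, exponential, linear):
    print(f"{kernel.__name__:>13}: {np.round(kernel(x, h), 4)}")
```

The Epanechnikov and linear kernels have compact support (they vanish at distances of `h` or more), whereas the Gaussian and exponential kernels give every point a nonzero contribution everywhere.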

The kernel density estimator can be used with any of the valid distance metrics (see the DistanceMetric documentation for a list of available metrics), though the results are properly normalized only for the Euclidean metric. One particularly useful metric is the Haversine distance, which measures the angular distance between points on a sphere. Here is an example of using a kernel density estimate for a visualization of geospatial data, in this case the distribution of observations of two different species on the South American continent:
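As a rough sketch of how the Haversine metric might be used with the estimator, the following fits a KDE to a handful of hypothetical (latitude, longitude) sites. The coordinates and the bandwidth of 0.04 are illustrative assumptions, not the species data referred to above.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Hypothetical observation sites as (latitude, longitude) in degrees.
latlon_deg = np.array([
    [-3.1, -60.0],
    [-3.4, -60.5],
    [-12.0, -77.0],
    [-12.3, -76.8],
    [-34.6, -58.4],
])

# The haversine metric expects [latitude, longitude] in radians.
latlon_rad = np.radians(latlon_deg)

kde = KernelDensity(
    bandwidth=0.04,         # angular bandwidth in radians; illustrative
    metric="haversine",
    kernel="gaussian",
    algorithm="ball_tree",  # the ball tree supports the haversine metric
)
kde.fit(latlon_rad)

# Compare a query near the first cluster with one far from all sites.
query = np.radians([[-3.2, -60.2], [0.0, 0.0]])
log_density = kde.score_samples(query)
print(log_density)
```

Because the results are properly normalized only for the Euclidean metric, values obtained this way are best read as relative densities, which is sufficient for a map-style visualization.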

One other useful application of kernel density estimation is to learn a non-parametric generative model of a dataset in order to efficiently draw new samples from this generative model. Here is an example of using this process to create a new set of hand-written digits, using a Gaussian kernel learned on a PCA projection of the data:

The “new” data consists of linear combinations of the input data, with weights probabilistically drawn given the KDE model.
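The generative procedure can be sketched as follows. The number of PCA components (15), the bandwidth (3.0), and the number of samples (44) are assumed values chosen for illustration; in practice the bandwidth would typically be selected by cross-validation.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity

digits = load_digits()

# Project the 64-dimensional digit images onto 15 principal components.
pca = PCA(n_components=15, whiten=False)
data = pca.fit_transform(digits.data)

# Fit a Gaussian KDE in the reduced space (bandwidth is an assumed value).
kde = KernelDensity(kernel="gaussian", bandwidth=3.0).fit(data)

# Draw new points from the KDE and map them back to pixel space.
new_data = kde.sample(44, random_state=0)
new_digits = pca.inverse_transform(new_data)

print(new_digits.shape)  # (44, 64): 44 new 8x8 images, flattened
```

Each row of `new_digits` can be reshaped to 8x8 and displayed alongside the original images for comparison.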

Examples:

Computation of simple kernel density estimates in one dimension.
Kernel Density Estimation: an example of using kernel density estimation to learn a generative model of the hand-written digits data, and drawing new samples from this model.