LECTURE 13: FROM INTERPOLATIONS TO REGRESSIONS TO GAUSSIAN PROCESSES
• So far we have mostly been doing linear or nonlinear regression of data points with a simple, small basis (for example, the linear function y = ax + b)
• The basis can be arbitrarily large and can be defined in many different ways: we do not care about the values of a, b, ... but about the predictions for y
• This is the task of machine learning
• Interpolation can be viewed as regression without noise (but still with sparse sampling of the data)
• There are many different regressions
Interpolation
• Goal: we have a function y = f(x) known at points y_i = f(x_i), and we want to know its value at any x. For now we assume the solution goes through the x_i's
• Why? Perhaps it is very expensive to evaluate the function everywhere. Or perhaps we really do not know it.
• Local interpolation: use a few nearby points
• Example: polynomial interpolation $f(x) = \sum_{i=1}^{N} a_i x^i$ (see the sketch below)
• How do we choose N? Higher N is better for smooth functions and worse near sharp kinks
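As a concrete illustration of the polynomial interpolation above, here is a minimal sketch (not from the lecture) using scipy's Lagrange interpolator; the toy target f = sin and the choice of N = 6 points are assumptions for the example.

```python
# Sketch: polynomial interpolation of a function known only at a few points.
# Higher N helps for smooth functions and hurts near sharp kinks.
import numpy as np
from scipy.interpolate import lagrange

f = np.sin                      # "expensive" function, sampled sparsely
xi = np.linspace(0, np.pi, 6)   # N = 6 sample points
yi = f(xi)

poly = lagrange(xi, yi)         # degree N-1 polynomial through all (xi, yi)

x = np.linspace(0, np.pi, 200)
err = np.max(np.abs(poly(x) - f(x)))
print(f"max interpolation error: {err:.2e}")
```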
Differentiability
• Lagrange formula for polynomial interpolation:
  $P(x) = \sum_{i=1}^{N} y_i \prod_{j \ne i} \frac{x - x_j}{x_i - x_j}$
• Piecewise (local) polynomial interpolation does not guarantee that the derivatives are continuous everywhere: they can jump where the set of points used changes.
• A stiffer solution is splines, where we enforce the derivatives to be continuous
• Most popular is the cubic spline, where the 1st and 2nd derivatives are continuous
Cubic spline
• Start with piecewise linear interpolation: for $x_j \le x \le x_{j+1}$,
  $y = A y_j + B y_{j+1}$, with $A = \frac{x_{j+1} - x}{x_{j+1} - x_j}$ and $B = 1 - A$
• This does not have a continuous 2nd derivative: it is zero inside the intervals and infinite at the $x_j$
• But we can add another interpolation for the 2nd derivatives $y''$. If we also arrange it to be 0 at the $x_j$ then we do not spoil the linear interpolation above.
• This has a unique solution:
  $y = A y_j + B y_{j+1} + C y''_j + D y''_{j+1}$, with $C = \frac{1}{6}(A^3 - A)(x_{j+1} - x_j)^2$ and $D = \frac{1}{6}(B^3 - B)(x_{j+1} - x_j)^2$
Cubic spline
• So far we assumed we know $y''$, but we do not
• We can determine it by requiring continuity of the 1st derivative on both sides of $x_j$:
  $\frac{x_j - x_{j-1}}{6} y''_{j-1} + \frac{x_{j+1} - x_{j-1}}{3} y''_j + \frac{x_{j+1} - x_j}{6} y''_{j+1} = \frac{y_{j+1} - y_j}{x_{j+1} - x_j} - \frac{y_j - y_{j-1}}{x_j - x_{j-1}}$
• We get N-2 equations for the N unknowns $y''_i$: a tridiagonal system, solvable in O(N) operations
• Natural cubic spline: set $y''_0 = 0$ and $y''_N = 0$ (see the sketch below)
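A minimal sketch of the natural cubic spline described above, using scipy's CubicSpline (which solves the tridiagonal system internally) rather than coding the solver by hand; the toy data are an assumption.

```python
# Sketch: natural cubic spline (y'' = 0 at both end points) through noisy-free samples.
import numpy as np
from scipy.interpolate import CubicSpline

xi = np.linspace(0, 2 * np.pi, 8)
yi = np.sin(xi)

cs = CubicSpline(xi, yi, bc_type='natural')   # enforces y''_0 = y''_N = 0

x = np.linspace(0, 2 * np.pi, 200)
y = cs(x)        # interpolated values
d2 = cs(x, 2)    # second derivative: continuous, piecewise linear
print(np.round(y[:3], 3), np.round(d2[:3], 3))
```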
Rational function expansion
• The spline is mostly used for interpolation
• Polynomials can be used for extrapolation outside the interval $(x_0, x_N)$, but the term with the largest power dominates and results in divergence
• Rational functions $P_\mu(x)/Q_n(x)$ can be better for extrapolation: if we know the function goes to 0 we choose the denominator degree larger than the numerator degree, $n > \mu$
• Rational functions are better for interpolation if the function has poles (in the real or complex plane)
• Rational functions can also be used for analytic work (Padé approximation, sketched below)
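A minimal sketch of a Padé approximation (mentioned above), built from the first few Taylor coefficients of exp(x); the choice of function, number of coefficients, and denominator degree are assumptions for illustration.

```python
# Sketch: Pade approximation. scipy.interpolate.pade turns Taylor coefficients
# into a rational function p(x)/q(x), which often behaves better away from the
# expansion point than the truncated polynomial itself.
import numpy as np
from math import factorial
from scipy.interpolate import pade

taylor = [1.0 / factorial(k) for k in range(6)]   # exp(x) ~ sum_k x^k / k!
p, q = pade(taylor, 2)                            # numerator deg 3, denominator deg 2

x = 3.0   # outside the region where the truncated series is accurate
print(np.exp(x), p(x) / q(x), np.polyval(taylor[::-1], x))
```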
Interpolation on a grid in higher dimensions
• Simplest 2d example: bilinear interpolation. With $t, u \in [0,1]$ the fractional coordinates inside a grid cell and $y_1, y_2, y_3, y_4$ the values at its corners (numbered counterclockwise from the lower left),
  $y(t,u) = (1-t)(1-u)\,y_1 + t(1-u)\,y_2 + tu\,y_3 + (1-t)u\,y_4$
• Higher order accuracy: polynomials (biquadratic, ...)
• Higher order smoothness: bicubic spline (see the sketch below)
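A rough sketch of 2D grid interpolation: bilinear via RegularGridInterpolator and a smoother bicubic spline via RectBivariateSpline; the tabulated function and grid resolution are toy assumptions.

```python
# Sketch: bilinear vs bicubic-spline interpolation on a regular 2D grid.
import numpy as np
from scipy.interpolate import RegularGridInterpolator, RectBivariateSpline

x = np.linspace(0, 1, 11)
y = np.linspace(0, 1, 11)
X, Y = np.meshgrid(x, y, indexing='ij')
Z = np.sin(2 * np.pi * X) * np.cos(2 * np.pi * Y)   # values tabulated on the grid

bilinear = RegularGridInterpolator((x, y), Z, method='linear')
bicubic = RectBivariateSpline(x, y, Z, kx=3, ky=3)

pts = np.array([[0.37, 0.52]])                      # arbitrary off-grid point
print(bilinear(pts)[0], bicubic(0.37, 0.52)[0, 0])
```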
From spline to B-spline to gaussian
• We can write basis functions for (cubic) splines (B for basis), so that the solution is a linear combination of them. For cubic B-splines on uniform points:
  $B(x) = \begin{cases} \frac{2}{3} - |x|^2 + \frac{|x|^3}{2}, & |x| \le 1 \\ \frac{1}{6}(2 - |x|)^3, & 1 < |x| \le 2 \\ 0, & |x| > 2 \end{cases}$
• This is very close to a gaussian
• The gaussian is infinitely differentiable (i.e. smoother) but not sparse: it never vanishes exactly (a comparison is sketched below)
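A small comparison of the cubic B-spline basis function on uniform knots with a gaussian: they look very similar, but the B-spline is exactly zero outside its four-interval support (sparse), while the gaussian never vanishes. The gaussian width 0.6 is chosen by eye for the comparison (assumption).

```python
# Sketch: cubic B-spline basis function vs a gaussian of similar width.
import numpy as np
from scipy.interpolate import BSpline

knots = np.arange(-2, 3)               # uniform knots -2, -1, 0, 1, 2
b = BSpline.basis_element(knots)       # cubic B-spline centered at 0

x = np.linspace(-2, 2, 9)
print(np.round(b(x), 3))               # peaks at 2/3, vanishes at the edges
print(np.round(np.exp(-x**2 / (2 * 0.6**2)), 3))
```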
Examples of basis functions (figure): polynomial, spline/Gaussian, sigmoid
• Sigmoid (used for classification): $\sigma(a) = \frac{1}{1 + e^{-a}}$
From interpolation to regression
• One can view function interpolation as regression in the limit of zero noise, but still with sparse sampling
• Sparse sampling will induce an error, as will noise
• Both can use the same basis expansion $\phi_j(x)$
• For regression with noise or sparse sampling we add a regularization term (Tikhonov/ridge/L2, Lasso/L1, ...); this can prevent overfitting:
  $L = \sum_{i=1}^{N}\left[\sum_{j=1}^{M} a_j \phi_j(x_i) - y_i\right]^2 + \lambda \sum_{j=1}^{M} a_j^2$
• The question is how to choose the regularization parameter $\lambda$
• High $\lambda$: low scatter, high bias; low $\lambda$: high scatter, low bias
• Example (figure): N = M = 25, gaussian basis (a minimal version is sketched below)
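A minimal sketch of the regularized (ridge/L2) regression above: N = M = 25 noisy points, gaussian basis functions, and two values of λ to show the bias/variance trade-off. The data, basis centers, and widths are toy assumptions.

```python
# Sketch: ridge regression with a gaussian basis, N = M = 25.
import numpy as np

rng = np.random.default_rng(0)
N = M = 25
x = np.linspace(0, 1, N)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(N)   # noisy targets

centers = np.linspace(0, 1, M)
width = 0.1
Phi = np.exp(-(x[:, None] - centers[None, :])**2 / (2 * width**2))  # N x M design matrix

def ridge_fit(lam):
    # minimize ||Phi a - y||^2 + lam ||a||^2  ->  a = (Phi^T Phi + lam I)^-1 Phi^T y
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

for lam in (1e-8, 1.0):
    a = ridge_fit(lam)
    rms = np.sqrt(np.mean((Phi @ a - y)**2))
    print(f"lambda={lam:g}: rms residual {rms:.3f}, |a| {np.linalg.norm(a):.1f}")
```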
Bayesian regression
• In the Bayesian context we perform regression of the coefficients $a_j$, assigning them some prior distribution, such as a gaussian with precision $\alpha$. If we also have noise precision $\beta$ then $\ln p$ is
  $\ln p(\mathbf{a}\mid\mathbf{y}) = -\frac{\beta}{2}\sum_{i=1}^{N}\left[\sum_{j=1}^{M} a_j \phi_j(x_i) - y_i\right]^2 - \frac{\alpha}{2}\sum_{j=1}^{M} a_j^2 + \text{const}$
• So the regularizing parameter is $\lambda = \alpha/\beta$
Linear algebra solution
• More general prior: $p(\mathbf{a}) = \mathcal{N}(\mathbf{a}\mid\mathbf{m}_0, \mathbf{S}_0)$
• Posterior: $p(\mathbf{a}\mid\mathbf{y}) = \mathcal{N}(\mathbf{a}\mid\mathbf{m}_N, \mathbf{S}_N)$, with $\mathbf{m}_N = \mathbf{S}_N\!\left(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\,\Phi^T\mathbf{y}\right)$ and $\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\,\Phi^T\Phi$, where $\Phi_{ij} = \phi_j(x_i)$
• We want to predict the target t at an arbitrary new input x:
  $p(t\mid\mathbf{y}) = \mathcal{N}\!\big(t \mid \mathbf{m}_N^T\boldsymbol\phi(x),\ \sigma_N^2(x)\big), \qquad \sigma_N^2(x) = \frac{1}{\beta} + \boldsymbol\phi(x)^T\mathbf{S}_N\boldsymbol\phi(x)$
Prediction for t
Examples
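In place of the example figures, here is a minimal numerical sketch of the Bayesian regression prediction above (zero-mean isotropic prior with precision α, noise precision β), reusing the gaussian design-matrix idea from the ridge example. The values of α, β and the toy data are assumptions.

```python
# Sketch: Bayesian linear regression with a gaussian basis, posterior and
# predictive mean/variance at a new point.
import numpy as np

rng = np.random.default_rng(1)
N, M = 25, 10
x = np.linspace(0, 1, N)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(N)

centers = np.linspace(0, 1, M)
phi = lambda t: np.exp(-(np.atleast_1d(t)[:, None] - centers)**2 / (2 * 0.1**2))
Phi = phi(x)

alpha, beta = 1.0, 25.0                                        # prior and noise precisions
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)    # posterior covariance
m_N = beta * S_N @ Phi.T @ y                                   # posterior mean

# Predictive mean and variance at a new point x*
x_star = np.array([0.5])
phi_star = phi(x_star)
mean = phi_star @ m_N
var = 1.0 / beta + phi_star @ S_N @ phi_star.T
print(mean, np.sqrt(var))
```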
Kernel picture
• Kernels work as a closeness measure, giving more weight to nearby points. Assuming $\mathbf{m}_0 = 0$, the predictive mean can be written as
  $y(x) = \sum_{i=1}^{N} k(x, x_i)\, y_i$, with the equivalent kernel $k(x, x') = \beta\, \boldsymbol\phi(x)^T \mathbf{S}_N \boldsymbol\phi(x')$
Kernel examples
Kernel regression
• So far the kernel was defined in terms of a finite number of basis functions $\phi_j(x)$
• We can eliminate the concept of basis functions and work directly with kernels
• Nadaraya-Watson regression: we want to estimate $y(x) = h(x)$ using points close to $x$, weighted by some function of the distance, $K_h(x - x_i)$; as the distance increases the weights drop off:
  $\hat{h}(x) = \frac{\sum_{i=1}^{N} K_h(x - x_i)\, y_i}{\sum_{i=1}^{N} K_h(x - x_i)}$
• The kernel $K_h$ does not have to be a covariance function, but if it is then this becomes a gaussian process (a sketch follows)
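A minimal sketch of Nadaraya-Watson kernel regression: a weighted average of the y_i with weights K_h(x - x_i), here a gaussian kernel whose bandwidth h is an assumed toy value.

```python
# Sketch: Nadaraya-Watson regression with a gaussian kernel.
import numpy as np

rng = np.random.default_rng(2)
xi = np.sort(rng.uniform(0, 1, 40))
yi = np.sin(2 * np.pi * xi) + 0.2 * rng.standard_normal(40)

def nadaraya_watson(x, xi, yi, h=0.05):
    # K_h(x - xi): gaussian weights that drop off with distance
    w = np.exp(-(x[:, None] - xi[None, :])**2 / (2 * h**2))
    return (w @ yi) / w.sum(axis=1)

x = np.linspace(0, 1, 5)
print(np.round(nadaraya_watson(x, xi, yi), 3))
```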
Gaussian process
• We interpret the kernel as $K_{ij} = \mathrm{Cov}(y(x_i), y(x_j))$ and define $y(x)$ as a random variable with a multivariate gaussian distribution $\mathcal{N}(\mu, K)$
• 2-variable example: conditioning one gaussian variable on the other gives $p(y_2\mid y_1) = \mathcal{N}\!\big(\mu_2 + K_{21}K_{11}^{-1}(y_1 - \mu_1),\ K_{22} - K_{21}K_{11}^{-1}K_{12}\big)$
• We then use the posterior prediction for $y$: we get both the mean $\mu$ and the uncertainty (standard deviation $K^{1/2}$)
• We can interpret the GP as regression on an infinite basis
Gaussian process
• Example covariance function K for the translationally and rotationally invariant (stationary, isotropic) case, e.g. the squared exponential:
  $K(x, x') = \sigma_f^2 \exp\!\left(-\frac{|x - x'|^2}{2\ell^2}\right)$
• Kernel parameters can be learned from the data using optimization (see Learning the kernel below)
GP for regression
• Model: $y_i = f(x_i) + \epsilon_i$, with $f \sim \mathcal{GP}(0, K)$ and gaussian noise $\epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$
• We can marginalize over $f$: the data are then jointly gaussian, $\mathbf{y} \sim \mathcal{N}(0,\ K + \sigma_n^2 I)$
GP regression
• Predictive mean and variance at a new point $x_*$:
  $\bar{f}_* = \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1}\mathbf{y}, \qquad \mathrm{Var}(f_*) = k(x_*, x_*) - \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1}\mathbf{k}_*$
• Note that we have to invert an $N \times N$ matrix: this can be expensive, $O(N^3)$ (a direct sketch follows)
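A minimal sketch of GP regression with a squared-exponential kernel: the predictive mean and variance require factorizing the N x N matrix K + σ_n²I, which is the O(N³) cost noted above. The kernel parameters and toy data are assumptions.

```python
# Sketch: GP regression predictive mean and variance, computed directly.
import numpy as np

def sqexp(a, b, ell=0.2, sf=1.0):
    # squared-exponential (stationary, isotropic) kernel
    return sf**2 * np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 20))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)
sigma_n = 0.1

K = sqexp(x, x) + sigma_n**2 * np.eye(len(x))
L = np.linalg.cholesky(K)                              # the O(N^3) step
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

xs = np.linspace(0, 1, 5)
Ks = sqexp(xs, x)
mean = Ks @ alpha                                      # posterior mean
v = np.linalg.solve(L, Ks.T)
var = np.diag(sqexp(xs, xs)) - np.sum(v**2, axis=0)    # posterior variance
print(np.round(mean, 3), np.round(np.sqrt(var), 3))
```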
Examples of different kernels
Predictions for different kernels
Learning the kernel
• The marginal likelihood of the data under the GP with kernel parameters $\theta$ is
  $\ln p(\mathbf{y}\mid\mathbf{x}, \theta) = -\frac{1}{2}\mathbf{y}^T K_\theta^{-1}\mathbf{y} - \frac{1}{2}\ln|K_\theta| - \frac{N}{2}\ln 2\pi$
• This can be viewed as a likelihood of $\theta$ given the data $(\mathbf{x}, \mathbf{y})$. We can use nonlinear optimization to determine the MLE of $\theta$ (see the sketch below).
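A minimal sketch of kernel learning: minimize the negative log marginal likelihood above over θ = (length scale, noise level) with a generic nonlinear optimizer. The data, starting values, and the choice of Nelder-Mead are assumptions.

```python
# Sketch: maximum-likelihood estimation of GP kernel hyperparameters.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)

def neg_log_marginal(theta):
    ell, sigma_n = np.exp(theta)          # optimize in log space to keep parameters positive
    K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * ell**2)) + sigma_n**2 * np.eye(len(x))
    L = np.linalg.cholesky(K)
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # -ln p(y|x,theta) = 1/2 y^T K^-1 y + 1/2 ln|K| + N/2 ln(2 pi)
    return 0.5 * y @ a + np.sum(np.log(np.diag(L))) + 0.5 * len(x) * np.log(2 * np.pi)

res = minimize(neg_log_marginal, x0=np.log([0.5, 0.5]), method='Nelder-Mead')
print("MLE length scale, noise:", np.round(np.exp(res.x), 3))
```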
From linear regression to GP
• We started with linear regression with inputs $x_i$ and noisy $y_i$: $y_i = a + b x_i + \epsilon_i$
• We generalized this to a general basis with M basis functions: $y_i = \sum_{j=1}^{M} a_j \phi_j(x_i) + \epsilon_i$
• Next we performed Bayesian regression by adding a gaussian prior on the coefficients, $a_j \sim \mathcal{N}(0, \lambda_j)$: this is equivalent to adding a (weighted) L2 norm and minimizing $L = \sum_{i=1}^{N}\left[\sum_{j=1}^{M} a_j \phi_j(x_i) - y_i\right]^2 + \sum_{j=1}^{M} a_j^2/\lambda_j$
• Next we marginalize over the coefficients $a_j$: we are left with $E(y_i) = 0$ and $K_{ij} = \mathrm{Cov}(y(x_i), y(x_j)) = \sum_{j=1}^{M} \lambda_j \phi_j(x_i)\phi_j(x_j)$
• This is a gaussian process with a finite number of basis functions. Many GP kernels correspond to an infinite number of basis functions (the equivalence is sketched below).
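A minimal sketch of the equivalence above: a GP whose kernel is built from a finite basis, K_ij = Σ_j λ_j φ_j(x_i)φ_j(x_j), gives the same predictive mean as Bayesian regression on those basis functions. The basis, the prior variances λ_j, and the noise level are assumptions.

```python
# Sketch: weight-space (Bayesian basis regression) vs function-space (GP) view.
import numpy as np

rng = np.random.default_rng(5)
N, M = 15, 8
x = np.sort(rng.uniform(0, 1, N))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

centers = np.linspace(0, 1, M)
phi = lambda t: np.exp(-(np.atleast_1d(t)[:, None] - centers)**2 / (2 * 0.15**2))
lam = np.full(M, 1.0)              # prior variances lambda_j
sigma_n = 0.1

# Function-space view: GP with the finite-basis kernel
Phi = phi(x)
K = Phi @ np.diag(lam) @ Phi.T
xs = np.array([0.3, 0.7])
Ks = phi(xs) @ np.diag(lam) @ Phi.T
gp_mean = Ks @ np.linalg.solve(K + sigma_n**2 * np.eye(N), y)

# Weight-space view: Bayesian regression posterior mean
S_N = np.linalg.inv(np.diag(1 / lam) + Phi.T @ Phi / sigma_n**2)
m_N = S_N @ Phi.T @ y / sigma_n**2
print(np.round(gp_mean, 4), np.round(phi(xs) @ m_N, 4))   # the two means agree
```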
Connections between regression/classification models
Summary
• Interpolations: polynomial, spline, rational, ...
• They are just a subset of regression models, and one can connect them to regularized regression, which can be connected to Bayesian regression, which can be connected to kernel regression
• This can also be connected to classification, next lecture
Literature
• Bishop, Ch. 3
• Gelman et al., Ch. 20, 21
• http://www.gaussianprocess.org
• https://arxiv.org/pdf/1505.02965.pdf
• http://mlss2011.comp.nus.edu.sg/uploads/Site/lect1gp.pdf