Block Relaxation Algorithms in Statistics -- Part III


Table of Contents

0 Project
1 Background
1.1 Introduction
1.2 Analysis
1.2.1 Semi-continuities
1.2.2 Directional Derivatives
1.2.3 Differentiability and Derivatives
1.2.4 Taylor's Theorem
1.2.5 Implicit Functions
1.2.6 Necessary and Sufficient Conditions for a Minimum
1.3 Point-to-Set Maps
1.3.1 Continuities
1.3.2 Marginal Functions
1.3.3 Solution Maps
1.4 Basic Inequalities
1.4.1 Jensen's Inequality
1.4.2 The AM-GM Inequality
1.4.3 Cauchy-Schwarz Inequality
1.4.4 Young's Inequality
1.5 Fixed Point Problems and Methods
1.5.1 Subsequential Limits
1.6 Convex Functions
1.6.1 Composition
1.7 Rates of Convergence
1.7.1 Over- and Under-Relaxation
1.7.2 Acceleration of Convergence of Fixed Point Methods
1.8 Matrix Algebra
1.8.1 Eigenvalues and Eigenvectors of Symmetric Matrices
1.8.2 Singular Values and Singular Vectors
1.8.3 Canonical Correlation
1.8.4 Eigenvalues and Eigenvectors of Asymmetric Matrices
1.8.5 Modified Eigenvalue Problems
1.8.6 Quadratics on a Sphere
1.8.7 Generalized Inverses
1.8.8 Partitioned Matrices
1.9 Matrix Differential Calculus
1.9.1 Matrix Derivatives
1.9.2 Derivatives of Eigenvalues and Eigenvectors
1.10 Miscellaneous
1.10.1 Multidimensional Scaling
1.10.2 Cobweb Plots
2 Notation
3 Bibliography
4 What's New
5 Workflow
6 Glossary

Project

This is Part III of Block Relaxation Algorithms in Statistics. It discusses various mathematical, computational, and notational background topics.

Background


Introduction


14.2: Analysis


14.2.1: Semi-continuities

The lower limit or limit inferior of a sequence $\{x_n\}$ is defined as
$$\liminf_{n\rightarrow\infty}x_n=\lim_{n\rightarrow\infty}\inf_{m\geq n}x_m.$$
Alternatively, the limit inferior is the smallest cluster point or subsequential limit,
$$\liminf_{n\rightarrow\infty}x_n=\min\{\beta\mid x_{n_k}\rightarrow\beta\text{ for some subsequence }x_{n_k}\}.$$
In the same way the upper limit or limit superior is
$$\limsup_{n\rightarrow\infty}x_n=\lim_{n\rightarrow\infty}\sup_{m\geq n}x_m.$$
We always have
$$\liminf_{n\rightarrow\infty}x_n\leq\limsup_{n\rightarrow\infty}x_n.$$
Also if $\liminf_{n\rightarrow\infty}x_n=\limsup_{n\rightarrow\infty}x_n$ then the sequence converges, and the limit is equal to their common value.

The lower limit or limit inferior of a function $f$ at a point $x$ is defined as
$$\liminf_{y\rightarrow x}f(y)=\lim_{\delta\downarrow 0}\;\inf_{y\in B(x,\delta)}f(y),$$
where $B(x,\delta)$ is the ball of radius $\delta$ centered at $x$. Alternatively
$$\liminf_{y\rightarrow x}f(y)=\min\{\beta\mid f(x_n)\rightarrow\beta\text{ for some sequence }x_n\rightarrow x\}.$$
In the same way the limit superior of $f$ at $x$ is defined with suprema replacing infima.

A function $f$ is lower semi-continuous at $x$ if
$$\liminf_{y\rightarrow x}f(y)\geq f(x).$$
Since we always have
$$\liminf_{y\rightarrow x}f(y)\leq f(x),$$
we can also define lower semi-continuity as
$$\liminf_{y\rightarrow x}f(y)=f(x).$$
A function $f$ is upper semi-continuous at $x$ if
$$\limsup_{y\rightarrow x}f(y)\leq f(x).$$
We have $\limsup_{y\rightarrow x}f(y)=-\liminf_{y\rightarrow x}\{-f(y)\}$. A function is continuous at $x$ if and only if it is both lower semi-continuous and upper semi-continuous, i.e. if
$$\liminf_{y\rightarrow x}f(y)=\limsup_{y\rightarrow x}f(y)=f(x).$$

14.2.2: Directional Derivatives

The notation and terminology are by no means standard. We generally follow Demyanov [2007, 2009]. The lower Dini directional derivative of $f$ at $x$ in the direction $d$ is
$$\underline{\delta}f(x;d)=\liminf_{t\downarrow 0}\frac{f(x+td)-f(x)}{t},$$
and the corresponding upper Dini directional derivative is
$$\overline{\delta}f(x;d)=\limsup_{t\downarrow 0}\frac{f(x+td)-f(x)}{t}.$$
If $\lim_{t\downarrow 0}\frac{f(x+td)-f(x)}{t}$ exists, i.e. if $\underline{\delta}f(x;d)=\overline{\delta}f(x;d)$, then we simply write $\delta f(x;d)$ for the Dini directional derivative of $f$ at $x$ in the direction $d$. Penot [2013] calls this the radial derivative and Schirotzek [2007] calls it the directional Gateaux derivative. If $\delta f(x;d)$ exists we say that $f$ is Dini directionally differentiable at $x$ in the direction $d$, and if it exists at $x$ for all $d$ we say that $f$ is Dini directionally differentiable at $x$. Delfour [2012] calls such an $f$ semidifferentiable at $x$.

In a similar way we can define the Hadamard lower and upper directional derivatives. They are
$$\underline{\delta}_Hf(x;d)=\liminf_{t\downarrow 0,\,e\rightarrow d}\frac{f(x+te)-f(x)}{t},$$
and
$$\overline{\delta}_Hf(x;d)=\limsup_{t\downarrow 0,\,e\rightarrow d}\frac{f(x+te)-f(x)}{t}.$$
The Hadamard directional derivative $\delta_Hf(x;d)$ exists if both $\underline{\delta}_Hf(x;d)$ and $\overline{\delta}_Hf(x;d)$ exist and are equal. In that case $f$ is Hadamard directionally differentiable at $x$ in the direction $d$, and if $\delta_Hf(x;d)$ exists at $x$ for all $d$ we say that $f$ is Hadamard directionally differentiable at $x$.

Generally we have
$$\underline{\delta}_Hf(x;d)\leq\underline{\delta}f(x;d)\leq\overline{\delta}f(x;d)\leq\overline{\delta}_Hf(x;d).$$

The classical directional derivative of $f$ at $x$ in the direction $d$ is
$$f'(x;d)=\lim_{t\rightarrow 0}\frac{f(x+td)-f(x)}{t},$$
where $t$ can approach zero from either side. Note that for the absolute value function at zero we have $\delta f(0;d)=|d|$, while $f'(0;d)$ does not exist for $d\neq 0$. The classical directional derivative is not particularly useful in the context of optimization problems.
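To make the definitions concrete, here is a small numerical sketch (ours, not part of the original text) that approximates the lower and upper Dini directional derivatives by difference quotients over a grid of step sizes; the function name `dini` and its arguments are illustrative choices, and the min/max over the grid is only a heuristic stand-in for liminf/limsup.

```r
# Approximate lower and upper Dini directional derivatives of f at x in direction d
# by difference quotients over a logarithmic grid of step sizes (illustration only).
dini <- function(f, x, d, tmin = 1e-8, tmax = 1e-2, n = 50) {
  t <- exp(seq(log(tmax), log(tmin), length.out = n))
  q <- sapply(t, function(s) (f(x + s * d) - f(x)) / s)
  c(lower = min(q), upper = max(q))
}

# Example: the absolute value function at zero. The Dini directional derivative
# in direction d is |d|, while the two-sided classical derivative does not exist.
dini(abs, 0,  1)   # approximately c(lower = 1, upper = 1)
dini(abs, 0, -1)   # approximately c(lower = 1, upper = 1)
```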


Differentiability and Derivatives

The function $f$ is Gateaux differentiable at $x$ if and only if the Dini directional derivative $\delta f(x;d)$ exists for all $d$ and is linear in $d$. The function $f$ is Hadamard differentiable at $x$ if the Hadamard directional derivative $\delta_Hf(x;d)$ exists for all $d$ and is linear in $d$. The function $f$ is locally Lipschitz at $x$ if there is a ball $B(x,\delta)$ and a constant $K>0$ such that $|f(y)-f(z)|\leq K\|y-z\|$ for all $y,z\in B(x,\delta)$. If $f$ is locally Lipschitz and Gateaux differentiable then it is Hadamard differentiable. If the Gateaux derivative of $f$ is continuous then $f$ is Frechet differentiable. The function $f$ is Hadamard differentiable if and only if it is Frechet differentiable.

To be added: the definition of Frechet differentiability, and the gradient and Jacobian.


14.2.3: Taylor's Theorem

Suppose $f$ is $p$ times continuously differentiable in the open set $\mathcal{X}\subseteq\mathbb{R}^n$. Define, for all $x,y\in\mathcal{X}$ and $0\leq s\leq p$,
$$h_s(x,y)=\langle \mathcal{D}^sf(x),(y-x)^{\otimes s}\rangle,$$
the inner product of the $s$-dimensional array of partial derivatives $\mathcal{D}^sf(x)$ and the $s$-dimensional outer power of $y-x$. By convention $h_0(x,y)=f(x)$. Both arrays are super-symmetric, and have dimension $n^s$.

Also define the Taylor polynomials
$$T_p(y;x)=\sum_{s=0}^p\frac{1}{s!}h_s(x,y),$$
and the remainder
$$R_p(y;x)=f(y)-T_p(y;x).$$
Assume $\mathcal{X}$ contains the line segment with endpoints $x$ and $y$. Then Lagrange's form of the remainder says there is a $0\leq\lambda\leq 1$ such that
$$R_{p-1}(y;x)=\frac{1}{p!}\langle\mathcal{D}^pf(x+\lambda(y-x)),(y-x)^{\otimes p}\rangle,$$
and the integral form of the remainder says
$$R_{p-1}(y;x)=\frac{1}{(p-1)!}\int_0^1(1-\lambda)^{p-1}\langle\mathcal{D}^pf(x+\lambda(y-x)),(y-x)^{\otimes p}\rangle\,d\lambda.$$


Implicit Functions

The classical implicit function theorem is discussed in all analysis books. We are particularly fond of Spivak [1970, p. 40]. The history of the theorem, and many of its variations, is discussed in Krantz and Parks [2013], and a comprehensive modern treatment, using the tools of convex and variational analysis, is in Dontchev and Rockafellar [2014].

Suppose $f:\mathbb{R}^n\times\mathbb{R}^m\rightarrow\mathbb{R}^m$, where $f(x_0,y_0)=0$, and suppose that $f$ is continuously differentiable in an open set containing $(x_0,y_0)$. Define the $m\times m$ matrix
$$M(x,y)=\mathcal{D}_yf(x,y),$$
and suppose $M(x_0,y_0)$ is non-singular. Then there is an open set $U$ containing $x_0$ and an open set $V$ containing $y_0$ such that for every $x\in U$ there is a unique $y(x)\in V$ with $f(x,y(x))=0$. The function $y$ is differentiable. If we differentiate $f(x,y(x))=0$ with respect to $x$ we find
$$\mathcal{D}_xf+\mathcal{D}_yf\,\mathcal{D}y=0,$$
and thus
$$\mathcal{D}y(x)=-[\mathcal{D}_yf(x,y(x))]^{-1}\mathcal{D}_xf(x,y(x)).$$
As an example consider the eigenvalue problem
$$A(\theta)x=\lambda x,\qquad x'x=1,$$
where $A$ is a symmetric matrix which is a function of a real parameter $\theta$. Then differentiating implicitly with respect to $\theta$, and premultiplying by $x'$, works out to
$$\mathcal{D}\lambda=x'\{\mathcal{D}A(\theta)\}x.$$


Necessary and Sufficient Conditions for a Minimum

Directional derivatives can be used to provide simple necessary or sufficient conditions for a minimum [Demyanov, 2009, propositions 8 and 10].

Result: If $x$ is a local minimizer of $f$ then $\underline{\delta}f(x;d)\geq 0$ for all directions $d$. If $\underline{\delta}_Hf(x;d)>0$ for all $d\neq 0$ then $f$ has a strict local minimum at $x$.

The special case of a quadratic deserves some separate study, because the quadratic model is so prevalent in optimization. So let us look at
$$f(x)=c+b'x+\tfrac12x'Ax,$$
with $A$ symmetric. Use the eigen-decomposition $A=K\Lambda K'$, with $K$ square orthonormal, to change variables to $y=K'x$, also $\beta=K'b$, which we can write as
$$f=c+\beta'y+\tfrac12\sum_{i=1}^n\lambda_iy_i^2.$$
Here the $\lambda_i$ are the diagonal elements of $\Lambda$. If the index set $\{i\mid\lambda_i<0\}$ is non-empty we have $\inf_xf(x)=-\infty$. If it is empty, then $f$ attains its minimum if and only if $\beta_i=0$ for all $i$ with $\lambda_i=0$. Otherwise again $\inf_xf(x)=-\infty$. If the minimum is attained, then it is attained at $x=-A^+b$, with $A^+$ the Moore-Penrose inverse. And the minimum is attained if and only if $A$ is positive semi-definite and $b$ is in the column space of $A$, i.e. $AA^+b=b$.
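As a small illustration of the attainment condition (our own sketch, not part of the original text), the code below uses `ginv` from the MASS package as the Moore-Penrose inverse, checks positive semi-definiteness and the range condition numerically, and returns the minimizer when it exists; the function name `quad_min` is ours.

```r
library(MASS)  # for ginv(), the Moore-Penrose inverse

# Check whether f(x) = c + b'x + x'Ax/2 attains its minimum, and if so return it.
quad_min <- function(A, b, c = 0, eps = 1e-10) {
  ev <- eigen((A + t(A)) / 2, symmetric = TRUE)$values
  psd <- all(ev > -eps)                                      # A positive semi-definite?
  in_range <- sqrt(sum((A %*% ginv(A) %*% b - b)^2)) < eps   # b in column space of A?
  if (!psd || !in_range) return(list(attained = FALSE))
  x <- -ginv(A) %*% b                                        # minimizer -A^+ b
  list(attained = TRUE, x = drop(x),
       value = c + sum(b * x) + drop(t(x) %*% A %*% x) / 2)
}

# Example: a singular but positive semi-definite A with a compatible b.
A <- matrix(c(2, 0, 0, 0), 2, 2)
quad_min(A, b = c(-2, 0))   # minimum attained at x = (1, 0)
```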


Point-to-set Maps


14.3.1: Continuities


Marginal Functions and Solution Maps

Suppose $g(x)=\min_yf(x,y)$ and $y(x)=\mathop{\rm argmin}_yf(x,y)$, where $f$ is twice continuously differentiable. Suppose the minimum is attained at a unique $y(x)$. Then obviously
$$g(x)=f(x,y(x)).$$
Differentiating gives
$$\mathcal{D}g(x)=\mathcal{D}_1f(x,y(x)),$$
because $\mathcal{D}_2f(x,y(x))=0$. To differentiate the solution map we need second derivatives of $f$. Differentiating the implicit definition
$$\mathcal{D}_2f(x,y(x))=0$$
gives
$$\mathcal{D}_{21}f(x,y(x))+\mathcal{D}_{22}f(x,y(x))\mathcal{D}y(x)=0,$$
or
$$\mathcal{D}y(x)=-[\mathcal{D}_{22}f(x,y(x))]^{-1}\mathcal{D}_{21}f(x,y(x)).$$
Now combine both results to obtain
$$\mathcal{D}^2g(x)=\mathcal{D}_{11}f(x,y(x))-\mathcal{D}_{12}f(x,y(x))[\mathcal{D}_{22}f(x,y(x))]^{-1}\mathcal{D}_{21}f(x,y(x)).$$
We see that if $\mathcal{D}_{22}f(x,y(x))$ is positive definite then $\mathcal{D}^2g(x)\preceq\mathcal{D}_{11}f(x,y(x))$.

Now consider the minimization problem with constraints. Suppose $f$ and $h_1,\ldots,h_m$ are twice continuously differentiable functions on $\mathbb{R}^n$, and suppose we minimize $f$ over the manifold $\{x\mid h_j(x)=0,\ j=1,\ldots,m\}$. Define the Lagrangian, and the corresponding marginal function and solution map, where again we assume the minimizer is unique and satisfies the first-order stationarity conditions. Differentiating again, and defining the appropriate matrices of second derivatives of the Lagrangian, then leads to an expression for the second derivative of the marginal function.

There is an alternative way of arriving at basically the same result. Suppose the manifold is parametrized locally as $x=\phi(\theta)$, i.e. $h_j(\phi(\theta))=0$ for all $\theta$. Then the constrained problem becomes the unconstrained problem of minimizing $f(\phi(\theta))$ over $\theta$, and the chain rule gives the derivatives of the marginal function in terms of $\mathcal{D}\phi$ and the derivatives of $f$.

14.3.3: Solution Maps


Basic Inequalities


Jensen's Inequality


The AM-GM Inequality

The Arithmetic-Geometric Mean Inequality is simple, but quite useful for majorization. For completeness, we give the statement and proof here.

Theorem: If $x\geq 0$ and $y\geq 0$ then
$$\sqrt{xy}\leq\tfrac12(x+y),$$
with equality if and only if $x=y$.

Proof: Expand $(\sqrt{x}-\sqrt{y})^2\geq 0$ and collect terms. QED

Corollary: For all $x$ and $y$ we have $xy\leq\tfrac12(x^2+y^2)$, with equality if and only if $x=y$.

Proof: Just a simple rewrite of the theorem. QED


Polar Norms and the Cauchy-Schwarz Inequality

Theorem: Suppose $x,y\in\mathbb{R}^n$. Then
$$|x'y|\leq\|x\|\,\|y\|,$$
with equality if and only if $x$ and $y$ are proportional.

Proof: The result is trivially true if either $x$ or $y$ is zero. Thus we suppose both are non-zero. We have
$$0\leq\|x-\lambda y\|^2=\|x\|^2-2\lambda x'y+\lambda^2\|y\|^2$$
for all $\lambda$. Thus, taking $\lambda=x'y/\|y\|^2$, we find
$$(x'y)^2\leq\|x\|^2\|y\|^2,$$
which is the required result. QED


Young's Inequality

The AM-GM inequality is a very special case of Young's inequality. We derive it in a general form, using the coupling functions introduced by Moreau. Suppose $f$ is a real-valued function on $X$ and $c$ is a real-valued function on $X\times Y$, called the coupling function. Here $X$ and $Y$ are arbitrary. Define the $c$-conjugate of $f$ by
$$f^c(y)=\sup_{x\in X}\{c(x,y)-f(x)\}.$$
Then $c(x,y)-f(x)\leq f^c(y)$ for all $x$ and $y$, and thus
$$c(x,y)\leq f(x)+f^c(y),$$
which is the generalized Young's inequality. We can also write this in the form that directly suggests minorization,
$$f(x)\geq c(x,y)-f^c(y).$$
The classical coupling function is $c(x,y)=xy$, with both $x$ and $y$ in the positive reals. If we take
$$f(x)=\frac{x^p}{p},$$
with $p>1$, then the supremum in $f^c(y)=\sup_x\{xy-\frac{x^p}{p}\}$ is attained for $x=y^{1/(p-1)}$, from which we find
$$f^c(y)=\frac{y^q}{q},$$
with $q$ such that $\frac1p+\frac1q=1$. Then for all positive $x$ and $y$ we have
$$xy\leq\frac{x^p}{p}+\frac{y^q}{q},$$
with equality if and only if $y=x^{p-1}$.

Fixed Point Problems and Methods

As we have emphasized before, the algorithms discussed in this book are all special cases of block relaxation methods. But block relaxation methods are often appropriately analyzed as fixed point methods, which define an even wider class of iterative methods. Thus we will not discuss actual fixed point algorithms that are not block relaxation methods, but we will use general results on fixed point methods to analyze block relaxation methods.

A (stationary, one-step) fixed point method on $X$ is defined as a map $F:X\rightarrow X$. Depending on the context we refer to $F$ as the update map or algorithmic map. Iterative sequences are generated by starting with some $x^{(0)}\in X$ and then setting $x^{(k+1)}=F(x^{(k)})$ for $k=0,1,2,\ldots$. Such a sequence is also called the Picard sequence generated by the map. If the sequence converges to, say, $x_\infty$, and if $F$ is continuous, then $F(x_\infty)=x_\infty$, and thus $x_\infty$ is a fixed point of $F$ on $X$. The set of all $x$ such that $F(x)=x$ is called the fixed point set of $F$ on $X$.

The literature on fixed point methods is truly gigantic. There are textbooks, conferences, and dedicated journals. A nice and compact treatment, mostly on existence theorems for fixed points, is Smart [1974]. An excellent modern overview, concentrating on metrical fixed point theory and iterative computation, is Berinde [2007]. The first key result in fixed point theory is the Brouwer Fixed Point Theorem, which says that for compact convex $X$ and continuous $F:X\rightarrow X$ there is at least one $x$ with $F(x)=x$. The second is the Banach Fixed Point Theorem, which says that if $X$ is a non-empty complete metric space and $F$ is a contraction, i.e.
$$d(F(x),F(y))\leq\kappa\,d(x,y)$$
for some $0\leq\kappa<1$, then the Picard sequence converges from any starting point $x^{(0)}$ to the unique fixed point of $F$ in $X$. Much of the fixed point literature is concerned with relaxing the contraction assumption and choosing more general spaces on which the various mappings are defined. I shall discuss some of the generalizations that we will use later in this book.

First, we can generalize to point-to-set maps $F:X\rightarrow 2^X$, where $2^X$ is the power set of $X$, i.e. the set of all subsets. Point-to-set maps are also called correspondences or multivalued maps. The Picard sequence is now defined by $x^{(k+1)}\in F(x^{(k)})$, and we have a fixed point if and only if $x\in F(x)$. The generalization of the Brouwer Fixed Point

Theorem is the Kakutani Fixed Point Theorem. It assumes that $X$ is non-empty, compact and convex and that $F(x)$ is non-empty and convex for each $x$. In addition, the map must be closed or upper semi-continuous on $X$, i.e. whenever $x_n\rightarrow x$ and $y_n\rightarrow y$ with $y_n\in F(x_n)$ we have $y\in F(x)$. Under these conditions Kakutani's Theorem asserts the existence of a fixed point. Our discussion of the global convergence of block relaxation algorithms, in a later chapter, will be framed using fixed points of point-to-set maps, assuming the closedness of maps.

In another generalization of iterative algorithms we get rid of the one-step and the stationary assumptions. The iterative sequence is
$$x^{(k+1)}=F_k(x^{(0)},\ldots,x^{(k)}).$$
Thus the iterations have perfect memory, and the update map can change in each iteration. In an $r$-step method, memory is less than perfect, because the update is a function of only the previous $r$ elements in the sequence. Formally,
$$x^{(k+1)}=F_k(x^{(k)},\ldots,x^{(k-r+1)})$$
for $k\geq r-1$, with some special provisions for the first $r-1$ iterations. Any $r$-step method on $X$ can be rewritten as a one-step method on $X^r$. This makes it possible to limit our discussion to one-step methods. In fact, we will mostly discuss block-relaxation methods, which are stationary one-step fixed point methods. For non-stationary methods it is somewhat more complicated to define fixed points. In that case it is natural to define a set of desirable points or targets, which for stationary algorithms will generally, but not necessarily, coincide with the fixed points of the update map. The questions we will then have to answer are if and under what conditions our algorithms converge to desirable points, and if they converge how fast the convergence will take place.
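As a small illustration (ours, not part of the original text), the sketch below runs a Picard sequence for a user-supplied update map and stops when successive iterates are close; the function name `picard` and its arguments are our own choices.

```r
# Generate the Picard sequence x_{k+1} = F(x_k) until successive iterates
# are within eps of each other, or itmax iterations are reached.
picard <- function(F, x0, eps = 1e-10, itmax = 1000) {
  x <- x0
  for (k in 1:itmax) {
    xnew <- F(x)
    if (max(abs(xnew - x)) < eps) return(list(x = xnew, iterations = k))
    x <- xnew
  }
  list(x = x, iterations = itmax)
}

# Example: F is a contraction on the real line with fixed point 2.
picard(function(x) 0.5 * x + 1, x0 = 10)
```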


Subsequential Limits


Differentiable Convex Functions

If a function $f$ attains its minimum on a convex set $X$ at $x$, and $f$ is differentiable at $x$, then
$$(y-x)'\nabla f(x)\geq 0$$
for all $y\in X$. If $f$ attains its minimum on all of $\mathbb{R}^n$ at $x$, and is differentiable at $x$, then $\nabla f(x)=0$. Or, more precisely, if $f$ is differentiable from the right at $x$, then the one-sided directional derivatives at $x$ are non-negative in all directions.

Suppose $X$ is the unit ball and a differentiable $f$ attains its minimum at $x$. Then
$$(y-x)'\nabla f(x)\geq 0$$
for all $y$ with $\|y\|\leq 1$. By Cauchy-Schwarz this means that $x'\nabla f(x)\leq-\|\nabla f(x)\|$. This is true if and only if $\nabla f(x)=-\lambda x$, with $\lambda=\|\nabla f(x)\|\geq 0$.

As an aside, if a differentiable function $f$ attains its minimum on the unit sphere at $x$, then the function $y\mapsto f(y/\|y\|)$ attains its minimum over all $y\neq 0$ at $x$. Setting the derivative equal to zero shows that we must have $(I-xx')\nabla f(x)=0$ at $x$, which again translates to $\nabla f(x)=\lambda x$, with $\lambda=x'\nabla f(x)$.


Composition


Rates of Convergence

The basic result we use is due to Perron and Ostrowski [Ostrowski, 1966].

Theorem: If the iterative algorithm $x^{(k+1)}=F(x^{(k)})$ converges to $x_\infty$, and $F$ is differentiable at $x_\infty$ with $\rho(\mathcal{D}F(x_\infty))<1$, then the algorithm is linearly convergent with rate $\rho(\mathcal{D}F(x_\infty))$. Proof: See Ostrowski [1966]. QED

Here $\rho$ is the spectral radius, i.e. the largest of the moduli of the eigenvalues. Let us call the derivative $\mathcal{D}F(x_\infty)$ the iteration matrix. In general block relaxation methods have linear convergence, and the linear convergence can be quite slow. In cases where the accumulation points are a continuum we have sublinear rates. The same thing can be true if the local minimum is not strict, or if we are converging to a saddle point.

To be discussed: generalization to non-differentiable maps, points of attraction and repulsion, superlinear convergence.
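To illustrate the theorem numerically (our own sketch, not from the original text), one can estimate the linear rate from the ratio of successive errors of a Picard sequence and compare it with the derivative of the update map at the fixed point; the name `estimate_rate` and the example map are our own.

```r
# Estimate the linear convergence rate of x_{k+1} = F(x_k) from successive errors,
# given the (known) fixed point x_inf.
estimate_rate <- function(F, x0, x_inf, n = 25) {
  x <- x0
  err <- numeric(n)
  for (k in 1:n) {
    x <- F(x)
    err[k] <- abs(x - x_inf)
  }
  err[n] / err[n - 1]      # ratio of successive errors approximates the rate
}

# Example: F(x) = x - 0.1 * (x^2 - a) has fixed point sqrt(a),
# with derivative 1 - 0.2 * sqrt(a) there.
a <- 2
F <- function(x) x - 0.1 * (x^2 - a)
estimate_rate(F, x0 = 1, x_inf = sqrt(a))   # close to 1 - 0.2 * sqrt(2) = 0.717...
```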


Over- and Under-Relaxation


Acceleration of Convergence of Fixed Point Methods


Matrix Algebra


Eigenvalues and Eigenvectors of Symmetric Matrices

In this section we give a fairly complete introduction to eigenvalue problems and generalized eigenvalue problems. We use a constructive variational approach, basically using the Rayleigh quotient and deflation. This works best for positive semi-definite matrices, but after dealing with those we discuss several generalizations.

Suppose $A$ is a positive semi-definite matrix of order $n$. Consider the problem of maximizing the quadratic form $x'Ax$ on the sphere $x'x=1$. At the maximum, which is always attained, we have $Ax=\lambda x$, with $\lambda$ a Lagrange multiplier, as well as $x'x=1$. It follows that $\lambda=x'Ax$. Note that the maximum is not necessarily attained at a unique value. Also the maximum is zero if and only if $A$ is zero. Any pair $(x,\lambda)$ such that $Ax=\lambda x$ and $x'x=1$ is called an eigen-pair of $A$. The members of the pair are the eigenvector $x$ and the corresponding eigenvalue $\lambda$.

Result 1: Suppose $(x,\lambda)$ and $(y,\mu)$ are two eigen-pairs, with $\lambda\neq\mu$. Then premultiplying both sides of $Ax=\lambda x$ by $y'$ gives $\mu y'x=\lambda y'x$, and thus $y'x=0$. This shows that $A$ cannot have more than $n$ distinct eigenvalues. If there were $n+1$ distinct eigenvalues, then the $n\times(n+1)$ matrix which has the corresponding eigenvectors as columns would have column-rank $n+1$ and row-rank $n$, which is impossible. In words: one cannot have more than $n$ orthonormal vectors in $n$-dimensional space. Suppose the distinct values are $\lambda_1>\lambda_2>\cdots>\lambda_p$, with $p\leq n$. Thus each of the eigenvalues is equal to one of the $\lambda_s$.

Result 2: If $(x,\lambda)$ and $(y,\lambda)$ are two eigen-pairs with the same eigenvalue, then any linear combination $\alpha x+\beta y$, suitably normalized, is also an eigenvector with eigenvalue $\lambda$. Thus the eigenvectors corresponding with an eigenvalue $\lambda_s$ form a linear subspace of $\mathbb{R}^n$ with dimension, say, $n_s$. This subspace can be given an orthonormal basis, collected in an $n\times n_s$ matrix $K_s$. The number $n_s$ is the multiplicity of the eigenvalue $\lambda_s$, and by implication of the corresponding eigenvectors.

Of course these results are only useful if eigen-pairs exist. We have shown that at least one eigen-pair exists, the one corresponding to the maximum of $x'Ax$ on the sphere. We now give a procedure to compute additional eigen-pairs. Consider the following algorithm for generating a sequence $A_1,A_2,\ldots$ of matrices. We start with $A_1=A$ and $s=1$.

1. Test: If $A_s=0$, stop.
2. Maximize: Compute the maximum of $x'A_sx$ over $x'x=1$. Suppose this is attained at an eigen-pair $(x_s,\lambda_s)$. If the maximizer is not unique, select an arbitrary one.
3. Orthogonalize: Replace $x_s$ by its component orthogonal to $x_1,\ldots,x_{s-1}$, normalized to unit length.
4. Deflate: Set $A_{s+1}=A_s-\lambda_sx_sx_s'$.
5. Update: Go back to step 1 with $A_s$ replaced by $A_{s+1}$ and $s$ by $s+1$.

If $s=1$ then in step (2) we compute the largest eigenvalue of $A$ and a corresponding eigenvector. In that case there is no step (3). Step (4) constructs $A_2$ by deflation, which basically removes the contribution of the largest eigenvalue and corresponding eigenvector. If $y$ is an eigenvector of $A_s$ with eigenvalue $\mu\neq\lambda_s$, then $y'x_s=0$ by result (1) above, so $A_{s+1}y=A_sy-\lambda_sx_sx_s'y=\mu y$, and $y$ is an eigenvector of $A_{s+1}$ with the same eigenvalue $\mu$. Also, of course, $A_{s+1}x_s=0$, so $x_s$ is an eigenvector of $A_{s+1}$ with eigenvalue zero. If $\mu=\lambda_s$, then by result (2) we can choose the basis of the corresponding subspace such that $x_s$ is one of its elements. We see that $A_{s+1}$ has the same eigenvectors as $A_s$, with the same multiplicities, except for $\lambda_s$, which now has its old multiplicity minus one, and zero, which now has its old multiplicity plus one. Now if $y$ is an eigenvector of $A_{s+1}$ with a non-zero eigenvalue, then by result (1) $y$ is automatically orthogonal to $x_s$, which is an eigenvector of $A_{s+1}$ with eigenvalue zero. Thus step (3) is not really necessary, although it will lead to more precise numerical computation.

Following the steps of the algorithm we see that it defines orthonormal vectors $x_1,\ldots,x_r$ and non-increasing positive eigenvalues $\lambda_1\geq\cdots\geq\lambda_r>0$, which moreover satisfy
$$A=\sum_{s=1}^r\lambda_sx_sx_s',$$
where $x_sx_s'$ is the projector on the one-dimensional subspace spanned by $x_s$, and $x_s'x_t=0$ for $s\neq t$. This is the eigen decomposition or the spectral decomposition of a positive semi-definite $A$. Our algorithm stops when $A_{r+1}=0$, which is the same as $\mathop{\rm rank}(A)=r$. If $r<n$ then the minimum eigenvalue is zero, and has multiplicity $n-r$. The matrix $I-\sum_{s=1}^rx_sx_s'$ is the orthogonal projector on the null space of $A$, with rank $n-r$. Using the square orthonormal $K=(x_1,\ldots,x_n)$, whose last $n-r$ columns are an orthonormal basis of the null space, we can write the eigen decomposition in the form
$$A=K\Lambda K',$$
where the last $n-r$ diagonal elements of $\Lambda$ are zero. This equation can also be written as

$$K'AK=\Lambda,$$
which says that the eigenvectors diagonalize $A$, and that $A$ is orthonormally similar to the diagonal matrix of eigenvalues.

We have shown that the largest eigenvalue and corresponding eigenvector exist, but we have not indicated, at least in this section, how to compute them. Conceptually the power method is the most obvious way. It is a tangential minorization method, using the inequality $x'Ax\geq 2x'Ay-y'Ay$, which means that the iteration function is
$$F(y)=\frac{Ay}{\|Ay\|}.$$
See the Rayleigh Quotient section for further details.

We now discuss a first easy generalization. If $A$ is real and symmetric but not necessarily positive semi-definite then we can apply our previous results to the matrix $A^2$, whose eigenvalues are the squares of those of $A$. Or we can apply it to $A+\gamma I$, with $\gamma$ large enough to make the matrix positive semi-definite. Or we can modify the algorithm if we run into an $A_s$ with maximum eigenvalue equal to zero. If this happens we switch to finding the smallest eigenvalues, which will be negative. No matter how we modify the constructive procedure, we will still find an eigen decomposition of the same form $A=K\Lambda K'$ and $K'AK=\Lambda$ as in the positive semi-definite case.

The second generalization, also easy, concerns the generalized eigenvalues of a pair of real symmetric matrices $(A,B)$. We now maximize $x'Ax$ over $x$ satisfying $x'Bx=1$. In data analysis, and the optimization problems associated with it, we almost invariably assume that $B$ is positive definite. In fact we might as well make the weaker assumption that $B$ is positive semi-definite, and that $Ax=0$ for all $x$ such that $Bx=0$. Suppose $B=K_B\Lambda_BK_B'$ is an eigen decomposition of $B$, with $\Lambda_B$ containing the positive eigenvalues. Change variables by writing $x$ as $x=K_B\Lambda_B^{-1/2}y+K_0z$, where $K_0$ is an orthonormal basis of the null space of $B$. We can find the generalized eigenvalues and eigenvectors from the ordinary eigen decomposition of $E=\Lambda_B^{-1/2}K_B'AK_B\Lambda_B^{-1/2}$. This defines the $y$ part of the solution, and the choice of $z$ is completely arbitrary. Now suppose $T$ is the square orthonormal matrix of eigenvectors diagonalizing $E$, with $\Phi$ the corresponding eigenvalues, and $X=K_B\Lambda_B^{-1/2}T$. Then $X'AX=\Phi$ and $X'BX=I$. Thus $X$ diagonalizes both $A$ and $B$. For the more general case, in which we do not assume that $Ax=0$ for all $x$ with $Bx=0$, we refer to De Leeuw [1982].
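The constructive procedure above can be turned into a small amount of R code. The sketch below is ours, not the author's; it uses the power method for step (2) and deflation for step (4), and the name `eigen_deflate` is illustrative.

```r
# Eigen decomposition of a positive semi-definite matrix by the power method
# plus deflation, following the constructive algorithm in the text (sketch only).
eigen_deflate <- function(A, eps = 1e-10, itmax = 1000) {
  n <- nrow(A)
  K <- matrix(0, n, 0)
  lambda <- numeric(0)
  As <- A
  while (sqrt(sum(As^2)) > eps) {            # step 1: stop when A_s is (numerically) zero
    x <- rnorm(n)
    for (i in 1:itmax) {                     # step 2: power iterations x <- A_s x / ||A_s x||
      xnew <- drop(As %*% x)
      xnew <- xnew / sqrt(sum(xnew^2))
      if (sqrt(sum((xnew - x)^2)) < eps) break
      x <- xnew
    }
    l <- drop(t(x) %*% As %*% x)             # Rayleigh quotient = largest eigenvalue of A_s
    K <- cbind(K, x)
    lambda <- c(lambda, l)
    As <- As - l * tcrossprod(x)             # step 4: deflate
  }
  list(values = lambda, vectors = K)
}

A <- crossprod(matrix(rnorm(12), 3, 4))      # a random 4 x 4 psd matrix of rank 3
eigen_deflate(A)$values                      # three positive eigenvalues
eigen(A)$values                              # compare with R's built-in eigen()
```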


Singular Values and Singular Vectors

Suppose $C$ is an $n\times m$ matrix, $A$ is an $n\times n$ symmetric matrix, and $B$ is an $m\times m$ symmetric matrix. Define
$$\lambda(x,y)=\frac{x'Cy}{\sqrt{x'Ax}\sqrt{y'By}}.$$
Consider the problem of finding the maximum, the minimum, and other stationary values of $\lambda$. In order to make the problem well-defined and interesting we suppose that the symmetric partitioned matrix
$$\begin{bmatrix}A&C\\C'&B\end{bmatrix}$$
is positive semi-definite. This has some desirable consequences.

Proposition: Suppose the symmetric partitioned matrix above is positive semi-definite. Then both $A$ and $B$ are positive semi-definite, for all $x$ with $Ax=0$ we have $C'x=0$, and for all $y$ with $By=0$ we have $Cy=0$.

Proof: The first assertion is trivial. To prove the last two, consider the convex quadratic form
$$q(x,y)=x'Ax+2x'Cy+y'By$$
as a function of $y$ for fixed $x$. It is bounded below by zero, and thus attains its minimum. At this minimum, which is attained at some $\hat y$, the derivative vanishes and we have $B\hat y=-C'x$, and thus $q(x,\hat y)=x'Ax-\hat y'B\hat y\geq 0$. If $Ax=0$ then $\hat y'B\hat y\leq 0$, and because $B$ is positive semi-definite we must have $B\hat y=0$, which is true if and only if $C'x=0$. The same argument with the roles of $x$ and $y$ interchanged proves the third assertion. QED

Now suppose
$$A=K_A\Lambda_AK_A'\quad\text{and}\quad B=K_B\Lambda_BK_B'$$
are the eigen-decompositions of $A$ and $B$, restricted to the positive eigenvalues. The $r_A\times r_A$ matrix $\Lambda_A$ and the $r_B\times r_B$ matrix $\Lambda_B$ have positive diagonal elements, and $r_A$ and $r_B$ are the ranks of $A$ and $B$.

Define new variables
$$u=\Lambda_A^{1/2}K_A'x\quad\text{and}\quad v=\Lambda_B^{1/2}K_B'y.$$
Then $\lambda$ becomes the same ratio in terms of $u$ and $v$, which does not depend on the components of $x$ and $y$ in the null spaces of $A$ and $B$ at all. Thus we can just consider $\lambda$ as a function of $(u,v)$, study its stationary values, and then translate back to $(x,y)$, choosing the null space components of $x$ and $y$ completely arbitrarily.

Define $E=\Lambda_A^{-1/2}K_A'CK_B\Lambda_B^{-1/2}$. The stationary equations we have to solve are
$$Ev=\lambda u\quad\text{and}\quad E'u=\lambda v,$$
where $\lambda$ is a Lagrange multiplier, and we identify $\lambda$ with the stationary value by $\lambda=u'Ev$ for unit-length $u$ and $v$. It follows that
$$EE'u=\lambda^2u,$$
and also
$$E'Ev=\lambda^2v.$$


Canonical Correlation

Suppose $X$ is an $n\times p$ matrix and $Y$ is an $n\times q$ matrix. The cosine of the angle between two linear combinations $Xa$ and $Yb$ is
$$\rho(a,b)=\frac{a'X'Yb}{\sqrt{a'X'Xa}\sqrt{b'Y'Yb}}.$$
Consider the problem of finding the maximum, the minimum, and possible other stationary values of $\rho$. Specifically, there exists a non-singular $A$ of order $p$ and a non-singular $B$ of order $q$ such that $A'X'XA$ and $B'Y'YB$ are diagonal, with the $\mathop{\rm rank}(X)$ and $\mathop{\rm rank}(Y)$ leading diagonal elements equal to one and all other elements zero. Here $A$ and $B$ are two matrices of dimensions $p\times p$ and $q\times q$, respectively. Moreover $A'X'YB$ is a matrix with the non-zero canonical correlations in non-increasing order along the diagonal and zeroes everywhere else.

http://en.wikipedia.org/wiki/Principal_angles
http://meyer.math.ncsu.edu/Meyer/PS_Files/AnglesBetweenCompSubspaces.pdf
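As an illustration (our own sketch, not from the text), the canonical correlations can be computed as the singular values of $Q_X'Q_Y$, where $Q_X$ and $Q_Y$ are orthonormal bases for the column spaces of $X$ and $Y$; this is the principal-angles formulation referenced above, and the function name `canonical_correlations` is ours.

```r
# Canonical correlations of X and Y as singular values of Qx' Qy,
# where Qx and Qy are orthonormal bases of the column spaces (sketch).
canonical_correlations <- function(X, Y) {
  Qx <- qr.Q(qr(X))
  Qy <- qr.Q(qr(Y))
  svd(crossprod(Qx, Qy))$d
}

set.seed(1)
X <- matrix(rnorm(100 * 3), 100, 3)
Y <- cbind(X[, 1] + rnorm(100, sd = 0.1), matrix(rnorm(100 * 2), 100, 2))
canonical_correlations(X, Y)                         # first value close to 1
cancor(X, Y, xcenter = FALSE, ycenter = FALSE)$cor   # compare with built-in cancor()
```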


Eigenvalues and Eigenvectors of Asymmetric Matrices

If $A$ is a square but asymmetric real matrix the eigenvector-eigenvalue situation becomes quite different from the symmetric case. We gave a variational treatment of the symmetric case, using the connection between eigenvalue problems and quadratic forms (or ellipses and other conic sections, if you have a geometric mind). That connection, however, is lost in the asymmetric case, and there is no obvious variational problem associated with eigenvalues and eigenvectors.

Let us first define eigenvalues and eigenvectors in the asymmetric case. As before, an eigen-pair $(x,\lambda)$ is a solution to the equation $Ax=\lambda x$ with $x\neq 0$. This can also be written as $(A-\lambda I)x=0$, which shows that the eigenvalues are the solutions of the equation $\det(A-\lambda I)=0$. Now the function $\pi_A(\lambda)=\det(A-\lambda I)$ is the characteristic polynomial of $A$. It is a polynomial of degree $n$, and by the fundamental theorem of algebra there are $n$ real and complex roots, counting multiplicities. Thus $A$ has $n$ eigenvalues, as before, although some of them can be complex.

A first indication that something may be wrong, or at least fundamentally different, is the matrix
$$A=\begin{bmatrix}0&1\\0&0\end{bmatrix}.$$
The characteristic equation $\lambda^2=0$ has the root $\lambda=0$, with multiplicity 2. Thus an eigenvector should satisfy $Ax=0$, which merely says $x_2=0$. Thus $A$ does not have two linearly independent, let alone orthogonal, eigenvectors.

A second problem is illustrated by the anti-symmetric matrix
$$A=\begin{bmatrix}0&1\\-1&0\end{bmatrix},$$
for which the characteristic polynomial is $\lambda^2+1$. The characteristic equation has the two complex roots $+i$ and $-i$. The corresponding eigenvectors are the columns of
$$\begin{bmatrix}1&1\\i&-i\end{bmatrix}.$$
Thus both eigenvalues and eigenvectors may be complex.

In fact if we take complex conjugates on both sides of $Ax=\lambda x$, and remember that $A$ is real, we see that $A\bar x=\bar\lambda\bar x$. Thus $(x,\lambda)$ is an eigen-pair if and only if $(\bar x,\bar\lambda)$ is. If $A$ is real and of odd order it always has at least one real eigenvalue. If an eigenvalue is real, and the null space of $A-\lambda I$ has dimension $m$, then there are $m$ corresponding real and linearly independent eigenvectors. They are simply a basis for the null space of $A-\lambda I$.

A third problem, which by definition did not come up in the symmetric case, is that we now have an eigen problem for both $A$ and its transpose $A'$. Since for all $\lambda$ we have $\det(A-\lambda I)=\det(A'-\lambda I)$, it follows that $A$ and $A'$ have the same eigenvalues. We say that $(x,\lambda)$ is a right eigen-pair of $A$ if $Ax=\lambda x$, and $(y,\lambda)$ is a left eigen-pair of $A$ if $y'A=\lambda y'$, which is of course the same as $A'y=\lambda y$.

A matrix $A$ is diagonalizable if there exists a non-singular $X$ such that $X^{-1}AX=\Lambda$, with $\Lambda$ diagonal. Instead of the spectral decomposition of symmetric matrices we have the decomposition $A=X\Lambda X^{-1}$, or $AX=X\Lambda$. A matrix that is not diagonalizable is called defective.

Result: A matrix is diagonalizable if and only if it has $n$ linearly independent right eigenvectors, if and only if it has $n$ linearly independent left eigenvectors. We show this for right eigenvectors. Collect them in the columns of a matrix $X$. Thus $AX=X\Lambda$, with $X$ non-singular. This implies $X^{-1}A=\Lambda X^{-1}$, and thus the rows of $X^{-1}$ are $n$ linearly independent left eigenvectors. Also $X^{-1}AX=\Lambda$ and $A=X\Lambda X^{-1}$. Conversely, if $X^{-1}AX=\Lambda$ with $X$ non-singular, then $AX=X\Lambda$ and $X^{-1}A=\Lambda X^{-1}$, so we have $n$ linearly independent left and right eigenvectors.

Result: If the eigenvalues $\lambda_1,\ldots,\lambda_n$ of $A$ are all different then the eigenvectors $x_1,\ldots,x_n$ are linearly independent. We show this by contradiction. Select a maximally linearly independent subset from the $x_i$. Suppose there are $r<n$ of them, so the eigenvectors are linearly dependent. Without loss of generality the maximally linearly independent subset can be taken as the first $r$. Then for all $j>r$ there exist $\alpha_{ij}$ such that
$$x_j=\sum_{i=1}^r\alpha_{ij}x_i.$$
Premultiply this with $A$ to get
$$\lambda_jx_j=\sum_{i=1}^r\alpha_{ij}\lambda_ix_i.$$
Premultiply the first equation by $\lambda_j$ to get
$$\lambda_jx_j=\sum_{i=1}^r\alpha_{ij}\lambda_jx_i.$$
Subtract the second equation from the third to get
$$0=\sum_{i=1}^r\alpha_{ij}(\lambda_j-\lambda_i)x_i,$$
which implies that $\alpha_{ij}(\lambda_j-\lambda_i)=0$ for all $i$, because the $x_1,\ldots,x_r$ are linearly independent. Since the eigenvalues are unequal, this implies that the $\alpha_{ij}$ are all zero, and thus $x_j=0$ for all $j>r$, contradicting that the $x_j$ are eigenvectors.

Note 030615: Add a small amount on defective matrices. Add stuff on characteristic and minimal polynomials. Talk about using the SVD instead.


Modified Eigenvalue Problems

Suppose we know an eigen decomposition $A=K\Lambda K'$ of a real symmetric matrix $A$ of order $n$, and we want to find an eigen decomposition of the rank-one modification $A+\sigma zz'$, where $\sigma\neq 0$. The problem was first discussed systematically by Golub [1973]. Also see Bunch, Nielsen, and Sorensen [1978] for a more detailed treatment and implementation.

Eigen-pairs $(y,\mu)$ of the modified matrix must satisfy
$$(A+\sigma zz')y=\mu y.$$
Change variables to $u=K'y$ and define $b=K'z$. For the time being suppose all elements of $b$ are non-zero and all elements of $\Lambda$ are different, with $\lambda_1>\lambda_2>\cdots>\lambda_n$. We must solve
$$(\Lambda+\sigma bb')u=\mu u,$$
which we can also write as
$$(\Lambda-\mu I)u=-\sigma(b'u)b.$$
Suppose $u$ is a solution with $b'u=0$. Then $(\Lambda-\mu I)u=0$, and because all $\lambda_i$ are different $u$ must be a vector with a single element, say $u_i$, which is non-zero. Thus $b'u=b_iu_i$, not equal to zero. But then $b'u$ is non-zero at a solution, and because eigenvectors are determined up to a scalar factor we may as well require $b'u=1$.

Now solve
$$(\Lambda-\mu I)u=-\sigma b.$$
At a solution we must have $\mu\neq\lambda_i$ for all $i$, because otherwise $b_i$ would be zero. Thus
$$u=-\sigma(\Lambda-\mu I)^{-1}b,$$
and we can find $\mu$ by solving $b'u=1$. If we define
$$f(\mu)=1+\sigma\sum_{i=1}^n\frac{b_i^2}{\lambda_i-\mu},$$

then we must solve $f(\mu)=0$. Let's first look at a particular example.

Figure 1: Linear Secular Equation

We have
$$f'(\mu)=\sigma\sum_{i=1}^n\frac{b_i^2}{(\lambda_i-\mu)^2}$$
for all $\mu$ different from the $\lambda_i$, and $f(\mu)\rightarrow 1$ as $\mu\rightarrow\pm\infty$. There are vertical asymptotes at all $\lambda_i$. For $\sigma>0$ the function increases on each of the open intervals between the asymptotes, going from $-\infty$ to $+\infty$, and for $\sigma<0$ it decreases from $+\infty$ to $-\infty$. Thus the equation $f(\mu)=0$ has one solution in each of the $n-1$ open intervals between the $\lambda_i$, and if $\sigma>0$ it has an additional solution larger than $\lambda_1$, while if $\sigma<0$ it has an additional solution smaller than $\lambda_n$. If $\sigma>0$ then the largest solution is at most $\lambda_1+\sigma b'b$, and if $\sigma<0$ then the smallest solution is at least $\lambda_n+\sigma b'b$.

Finding the actual eigenvalues in their intervals can be done with any root-finding method. Of course some will be better than others for solving this particular problem. See Melman [1995, 1997, 1998] for suggestions and comparisons.

We still have to deal with the assumptions that the elements of $b$ are non-zero and that all $\lambda_i$ are different. Suppose some elements of $b$ are zero; without loss of generality they can be the last ones. Partition $\Lambda$ and $b$ accordingly. Then we need to solve the modified eigen-problem for
$$\begin{bmatrix}\Lambda_1+\sigma b_1b_1'&0\\0&\Lambda_2\end{bmatrix}.$$
But this is a direct sum of smaller matrices, and the eigenvalue problems for $\Lambda_1+\sigma b_1b_1'$ and $\Lambda_2$ can be solved separately. If not all $\lambda_i$ are different we can partition the matrix into blocks corresponding with the, say, $p$ different eigenvalues. Now use matrices $T_s$, which are square orthonormal of order $n_s$ and have their first column equal to the normalized sub-vector $b_s/\|b_s\|$. Form the direct sum of the $T_s$ and compute the transformed modified matrix. This gives a matrix in which the vector $b$ is replaced by a vector whose sub-vectors are $\|b_s\|$ times unit vectors, i.e. vectors that are zero except for one element that is one. A row and column permutation makes the matrix a direct sum of diagonal matrices of order $n_s-1$ and a $p\times p$ modified eigenvalue matrix. This last matrix satisfies our assumptions of different diagonal elements and non-zero off-diagonal elements, and consequently can be analyzed by using our previous results. A very similar analysis is possible for the modified singular value decomposition, for which we refer to Bunch and Nielsen [1978].
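To make the secular equation concrete, here is a small R sketch (ours, not from the text) that finds each root of $f(\mu)=1+\sigma\sum_ib_i^2/(\lambda_i-\mu)$ in its bracketing interval with `uniroot`, and compares the result with a direct eigen decomposition of $\Lambda+\sigma bb'$; the function name `secular_roots` is ours.

```r
# Roots of the secular equation f(mu) = 1 + sigma * sum(b^2 / (lambda - mu)):
# one in each interval between consecutive lambda's, plus one exterior root.
secular_roots <- function(lambda, b, sigma, tol = 1e-10) {
  f <- function(mu) 1 + sigma * sum(b^2 / (lambda - mu))
  lambda <- sort(lambda, decreasing = TRUE)
  n <- length(lambda)
  lo <- lambda[-1]                 # interior intervals (lambda[i+1], lambda[i])
  hi <- lambda[-n]
  if (sigma > 0) {                 # exterior root above lambda[1] ...
    lo <- c(lambda[1], lo)
    hi <- c(lambda[1] + sigma * sum(b^2), hi)
  } else {                         # ... or below lambda[n] if sigma < 0
    lo <- c(lo, lambda[n] + sigma * sum(b^2))
    hi <- c(hi, lambda[n])
  }
  mapply(function(l, h) uniroot(f, lower = l + tol, upper = h - tol, tol = tol)$root,
         lo, hi)
}

lambda <- c(5, 3, 1)
b <- c(1, 1, 1) / sqrt(3)
sigma <- 2
secular_roots(lambda, b, sigma)
eigen(diag(lambda) + sigma * tcrossprod(b))$values   # should agree
```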


Quadratics on a Sphere

Another problem naturally leading to a different secular equation is finding the stationary values of a quadratic function defined by
$$f(x)=x'Ax+2b'x$$
on the unit sphere $x'x=1$. This was first studied by Forsythe and Golub [1965]. Their treatment was subsequently simplified and extended by Spjøtvoll [1972] and Gander [1981]. The problem has recently received some attention because of the development of trust region methods for optimization, and, indeed, because of Nesterov majorization.

The stationary equations are
$$(A-\mu I)x=-b,\qquad x'x=1.$$
Suppose $A=K\Lambda K'$, with the $\lambda_i$ in non-increasing order, change variables to $y=K'x$, and define $\beta=K'b$. Then we must solve
$$(\Lambda-\mu I)y=-\beta,\qquad y'y=1.$$
Assume for now that the elements of $\beta$ are non-zero. Then $\mu$ cannot be equal to one of the $\lambda_i$. Thus
$$y=-(\Lambda-\mu I)^{-1}\beta,$$
and we must have $h(\mu)=1$, where
$$h(\mu)=\sum_{i=1}^n\frac{\beta_i^2}{(\lambda_i-\mu)^2}.$$
Again, let's look at an example of a particular $h$. The plots in Figure 1 show $h$ and the horizontal line at one. We see that $h(\mu)=1$ has 12 solutions, so the remaining question is which one corresponds with the minimum of $f$.


Figure 1: Quadratic Secular Equation

Again $h$ has vertical asymptotes at the $\lambda_i$. Between two asymptotes $h$ decreases from $+\infty$ to a minimum, and then increases again to $+\infty$. Note that
$$h'(\mu)=2\sum_{i=1}^n\frac{\beta_i^2}{(\lambda_i-\mu)^3}$$
and
$$h''(\mu)=6\sum_{i=1}^n\frac{\beta_i^2}{(\lambda_i-\mu)^4}>0,$$
and thus $h$ is convex in each of the intervals between asymptotes. Also $h$ is convex and increasing from zero to $+\infty$ on $(-\infty,\lambda_n)$, and convex and decreasing from $+\infty$ to zero on $(\lambda_1,+\infty)$.
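A minimal numerical sketch (ours, not the author's): for the global minimum on the sphere the multiplier is the solution of $h(\mu)=1$ to the left of the smallest eigenvalue, which we can locate with `uniroot` and then back-transform; this assumes, as in the text, that the elements of $\beta$ are non-zero, and the function name `sphere_min` is ours.

```r
# Minimize f(x) = x'Ax + 2 b'x over the unit sphere x'x = 1 via the secular equation.
sphere_min <- function(A, b, tol = 1e-12) {
  e <- eigen(A, symmetric = TRUE)
  beta <- drop(crossprod(e$vectors, b))
  lmin <- min(e$values)
  h <- function(mu) sum(beta^2 / (e$values - mu)^2)
  # h increases from 0 to +Inf on (-Inf, lmin); bracket the unique root there
  lower <- lmin - sqrt(sum(beta^2)) - 1
  mu <- uniroot(function(m) h(m) - 1, lower = lower, upper = lmin - tol, tol = tol)$root
  y <- -beta / (e$values - mu)
  drop(e$vectors %*% y)
}

A <- matrix(c(4, 1, 1, 2), 2, 2)
b <- c(1, -1)
x <- sphere_min(A, b)
sum(x^2)                                   # equals 1
drop(t(x) %*% A %*% x + 2 * sum(b * x))    # minimum of f on the sphere
```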


Generalized Inverses


Partitioned Matrices


Matrix Differential Calculus


Matrix Derivatives

A matrix, of course, is just an element of a finite dimensional linear vector space. We write $\mathbb{R}^{n\times m}$, and we use the inner product
$$\langle A,B\rangle=\mathop{\rm tr}A'B=\sum_{i=1}^n\sum_{j=1}^ma_{ij}b_{ij},$$
and corresponding norm
$$\|A\|=\sqrt{\langle A,A\rangle}.$$
Thus derivatives of real-valued functions of matrices, or derivatives of matrix-valued functions of matrices, are covered by the usual definitions and formulas. Nevertheless there is a surprisingly huge literature on differential calculus for real-valued functions of matrices, and matrix-valued functions of matrices. One of the reasons for the proliferation of publications is that a matrix-valued function of matrices can be thought of as a function from matrix space $\mathbb{R}^{n\times m}$ to matrix space $\mathbb{R}^{p\times q}$, but also as a function from vector space $\mathbb{R}^{nm}$ to vector space $\mathbb{R}^{pq}$. There are obvious isomorphisms between the two representations, but they naturally lead to different notations. We will consistently choose the matrix-space formulation, and consequently minimize the role of the $\mathop{\rm vec}$ operator and special constructs such as the commutation and duplication matrix. Nevertheless having a compendium of the standard real-valued and matrix-valued functions available is of some interest. The main reference for the other choice is the book by Magnus and Neudecker [1999]. We will avoid using differentials and the $\mathop{\rm vec}$ operator.

Suppose $F$ is a matrix valued function of a single variable $x$. In other words $F$ is a matrix of functions, with elements $f_{ij}(x)$. Now the derivatives of any order of $F$, if they exist, are also matrix valued functions, with elements the derivatives of the $f_{ij}$. If $F$ is a function of a vector $x$ then partial derivatives are defined similarly, as matrices with elements $\partial f_{ij}(x)/\partial x_k$. The notation becomes slightly more complicated if $F$ is a function of a matrix $X$, i.e. an element of the space of functions from $\mathbb{R}^{n\times m}$ to $\mathbb{R}^{p\times q}$. It then makes sense to write the partials as $\partial f_{ij}(X)/\partial x_{kl}$.

Derivatives of Eigenvalues and Eigenvectors

This appendix summarizes some of the results in De Leeuw [2007], De Leeuw [2008], and De Leeuw and Sorenson [2012]. We refer to those reports for more extensive calculations and applications.

Suppose $A$ and $B$ are two real symmetric matrices depending smoothly on a real parameter $\theta$. The notation below suppresses the dependence on $\theta$ of the various quantities we talk about, but it is important to remember that all eigenvalues and eigenvectors we talk about are functions of $\theta$. The generalized eigenvalue $\lambda$ and the corresponding generalized eigenvector $x$ are defined implicitly by
$$Ax=\lambda Bx.$$
Moreover the eigenvector is identified by
$$x'Bx=1.$$
We suppose that in a neighborhood of $\theta$ the eigenvalue $\lambda$ is unique and $B$ is positive definite. A precise discussion of the required assumptions is, for example, in Wilkinson [1965] or Kato [1976].

Differentiating $Ax=\lambda Bx$ gives the equation
$$\{\mathcal{D}A\}x+A\mathcal{D}x=\{\mathcal{D}\lambda\}Bx+\lambda\{\mathcal{D}B\}x+\lambda B\mathcal{D}x,$$
while differentiating $x'Bx=1$ gives
$$2x'B\mathcal{D}x+x'\{\mathcal{D}B\}x=0.$$
Premultiplying the first equation by $x'$ gives
$$\mathcal{D}\lambda=x'\{\mathcal{D}A-\lambda\mathcal{D}B\}x.$$
Now suppose $(x_s,\lambda_s)$ are the eigen-pairs of the pencil, with $x_s'Bx_t=\delta^{st}$. Then from the first equation, for $\lambda_s\neq\lambda$, premultiplying by $x_s'$ gives
$$x_s'B\mathcal{D}x=\frac{x_s'\{\mathcal{D}A-\lambda\mathcal{D}B\}x}{\lambda-\lambda_s}.$$
If we define $g$ by $g_s=x_s'\{\mathcal{D}A-\lambda\mathcal{D}B\}x$, then
$$\mathcal{D}x=\sum_sc_sx_s,\qquad c_s=\frac{g_s}{\lambda-\lambda_s}\quad\text{for }\lambda_s\neq\lambda,$$
and thus, using the normalization condition, the coefficient of $x$ itself is $-\tfrac12x'\{\mathcal{D}B\}x$.

A first important special case is the ordinary eigenvalue problem, in which $B=I$, which obviously does not depend on $\theta$, and consequently has $\mathcal{D}B=0$. Then
$$\mathcal{D}\lambda=x'\{\mathcal{D}A\}x,$$
while $x'\mathcal{D}x=0$. If we use the Moore-Penrose inverse the derivative of the eigenvector can be written as
$$\mathcal{D}x=(\lambda I-A)^+\{\mathcal{D}A\}x.$$
Written in a different way this expression is
$$\mathcal{D}x=\sum_{\lambda_s\neq\lambda}\frac{x_sx_s'}{\lambda-\lambda_s}\{\mathcal{D}A\}x,$$
with $x'\mathcal{D}x=0$, so that the normalization $x'x=1$ is preserved to first order.

The next important special case is the singular value problem. The singular values and vectors of an $n\times m$ rectangular $X$, with $n\geq m$, solve the equations
$$Xv=\sigma u\quad\text{and}\quad X'u=\sigma v,$$
i.e. the right singular vectors are the eigenvectors of $X'X$ and the singular values are the square roots of the eigenvalues of $X'X$. Now we can apply our previous results on eigenvalues and eigenvectors. We have, at an isolated singular value $\sigma>0$,
$$\mathcal{D}\sigma^2=v'\{\mathcal{D}(X'X)\}v=2\sigma u'\{\mathcal{D}X\}v,$$
and thus
$$\mathcal{D}\sigma=u'\{\mathcal{D}X\}v.$$
For the singular vectors our previous results on eigenvectors give
$$\mathcal{D}v=(\sigma^2I-X'X)^+\{\mathcal{D}(X'X)\}v,$$
and in the same way
$$\mathcal{D}u=(\sigma^2I-XX')^+\{\mathcal{D}(XX')\}u.$$
Now let $X=U\Sigma V'$, with $U$ and $V$ square orthonormal, and with $\Sigma$ diagonal (with positive diagonal entries in non-increasing order along the diagonal).


Also define $M=U'\{\mathcal{D}X\}V$. Then $\mathcal{D}\sigma_s=m_{ss}$, and the derivatives of the singular vectors are linear combinations of the other singular vectors, with coefficients that are simple functions of the elements of $M$ and of the differences of the squared singular values. Note that if $X$ is symmetric we have $U=V$ and $M$ is symmetric, so we recover our previous result for eigenvectors. Also note that if the parameter is actually element $x_{kl}$ of $X$, i.e. if we are computing partial derivatives, then $\mathcal{D}X=e_ke_l'$, the matrix which is zero except for a one in row $k$ and column $l$.

The results on eigen and singular value decomposition can be applied in many different ways, mostly by simply using the product rule for derivatives. For a square symmetric $A$ of order $n$, for example, we have $A=K\Lambda K'$, and thus the derivatives of functions of $A$ follow from those of $K$ and $\Lambda$ by the product rule. The generalized inverse of a rectangular $X$ is
$$X^+=\sum_s\sigma_s^{-1}v_su_s',$$
where summation is over the positive singular values, and for differentiability we must assume that the rank of $X$ is constant in a neighborhood of $X$. The Procrustus transformation of a rectangular $X$, which is the projection of $X$ on the Stiefel manifold of orthonormal matrices, is
$$\mathop{\rm proc}(X)=UV',$$
where we assume for differentiability that $X$ is of full column rank. The projection of $X$ on the set of all matrices of rank less than or equal to $r$, which is of key importance in PCA and MDS, is
$$\Pi_r(X)=\sum_s\sigma_su_sv_s',$$
where summation is over the $r$ largest singular values.


Miscellaneous

Graphics and Code

Multidimensional Scaling

Many of the examples in the book are taken from the area of multidimensional scaling (MDS). In this appendix we describe the basic MDS notation and terminology. Our approach to MDS is based on Kruskal [1964ab], using terminology and notation of De Leeuw [1977] and De Leeuw and Heiser [1982]. For a more recent and more extensive discussion of MDS see Borg and Groenen [2005].

The data in an MDS problem consist of information about the dissimilarities between pairs of objects. Dissimilarities are like distances, in the sense that they give some information about physical or psychological closeness, but they need not satisfy any of the distance axioms. In metric MDS the dissimilarity between objects $i$ and $j$ is a given number $\delta_{ij}$, usually positive and symmetric, with possibly some of the dissimilarities missing. In non-metric MDS we only have a partial order on some or all of the dissimilarities. We want to represent the objects as points in a metric space in such a way that the distances between the points approximate the dissimilarities between the objects.

An MDS loss function is typically of the form $\sigma(X)=\|\Delta-D(X)\|$ for some norm, or pseudo-norm, on the space of $n\times n$ matrices. Here $X$ collects the points in the metric space, and $D(X)$ is the symmetric, non-negative, and hollow matrix of distances. The MDS problem is to minimize loss over all mappings $X$ and all feasible $\Delta$. In metric MDS $\Delta$ is fixed at the observed data, in non-metric MDS any monotone transformation of the data is feasible. The definition of MDS we have given leaves room for all kinds of metric spaces and all kinds of norms to measure loss. In almost all applications, both in this book and elsewhere, we are interested in Euclidean MDS, where the metric space is $\mathbb{R}^p$, and in loss functions that use the (weighted) sum of squares of residuals. Thus the loss function has the general form
$$\sigma(X)=\sum_{i<j}w_{ij}\,r_{ij}^2(X),$$
where $X$ is an $n\times p$ matrix called the configuration. The most popular choices for the residuals are
$$r_{ij}=\delta_{ij}-d_{ij}(X),\qquad r_{ij}=\delta_{ij}^2-d_{ij}^2(X),\qquad r_{ij}=\log\delta_{ij}-\log d_{ij}(X),$$
and the elements of the centered residual matrix $-\tfrac12J(\Delta^2-D^2(X))J$.


Here the various choices differ in the elementwise transformations applied to the dissimilarities, with corresponding transformations of the distances. In the last choice we use the centering operator $J=I-\tfrac1n ee'$, where $e$ is the vector of all ones. For Euclidean distances, and centered $X$,
$$-\tfrac12JD^2(X)J=XX'.$$
Metric Euclidean MDS, using the centered residuals with unit weights, means finding the best rank $p$ approximation to $-\tfrac12J\Delta^2J$, which can be done by finding the dominant $p$ eigenvalues and corresponding eigenvectors. This is also known as Classical MDS [Torgerson, 1958].
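As a small illustration of the classical MDS computation just described (our own sketch, using base R, with `cmdscale` shown only for comparison):

```r
# Classical (Torgerson) MDS: double-center the squared dissimilarities and take
# the dominant eigenvalues and eigenvectors (sketch).
classical_mds <- function(delta, p = 2) {
  n <- nrow(delta)
  J <- diag(n) - matrix(1, n, n) / n
  B <- -0.5 * J %*% (delta^2) %*% J           # -J Delta^2 J / 2
  e <- eigen(B, symmetric = TRUE)
  e$vectors[, 1:p] %*% diag(sqrt(pmax(e$values[1:p], 0)))
}

# Example: distances between a few points are recovered up to rotation/translation.
X <- matrix(rnorm(10 * 2), 10, 2)
delta <- as.matrix(dist(X))
Xhat <- classical_mds(delta, p = 2)
max(abs(as.matrix(dist(Xhat)) - delta))       # essentially zero
cmdscale(delta, k = 2)                        # R's built-in equivalent
```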

The loss function that uses $r_{ij}=\delta_{ij}-d_{ij}(X)$ is called stress [Kruskal, 1964ab], the function that uses $r_{ij}=\delta_{ij}^2-d_{ij}^2(X)$ is sstress [Takane et al., 1977], and the loss that uses the centered residuals is strain [De Leeuw and Heiser, 1982]. The loss that uses $r_{ij}=\log\delta_{ij}-\log d_{ij}(X)$ has been nameless so far, but it has been proposed by Ramsay [1977]. Because of its limiting properties (see below), we will call it strull. Both stress and sstress are obviously special cases of $r_{ij}=\delta_{ij}^r-d_{ij}^r(X)$, for which the corresponding loss function is called r-stress. Because the logarithm is the limiting case of the power transformation as $r\rightarrow 0$, we see that strull is a limiting case of r-stress.

There is some matrix notation that is useful in dealing with Euclidean MDS. Suppose $e_i$ and $e_j$ are unit vectors, with all elements equal to zero, except one element which is equal to one. Then
$$d_{ij}^2(X)=(e_i-e_j)'XX'(e_i-e_j)=\mathop{\rm tr}X'A_{ij}X,$$
where
$$A_{ij}=(e_i-e_j)(e_i-e_j)'.$$
If we define $x=\mathop{\rm vec}(X)$ and $\tilde A_{ij}=I_p\otimes A_{ij}$, then $d_{ij}^2(X)=x'\tilde A_{ij}x$, which allows us to work with vectors in $\mathbb{R}^{np}$ instead of matrices in $\mathbb{R}^{n\times p}$.

Cobweb Plots

Suppose we have a one-dimensional Picard sequence which starts at $x^{(0)}$, and then is defined by $x^{(k+1)}=F(x^{(k)})$. The cobweb plot draws the line $y=x$ and the function $y=F(x)$. A fixed point is a point where the line and the function intersect. We visualize the iteration by starting at $(x^{(0)},F(x^{(0)}))$, then drawing a horizontal line to the line $y=x$, then drawing a vertical line to the function, and so on. For a convergent sequence we will see zig-zagging parallel to the axes in smaller and smaller steps to a point where the function and the line intersect.

An illustration will make this clear. The Newton iteration for the square root of $a$ is
$$x^{(k+1)}=\frac12\left(x^{(k)}+\frac{a}{x^{(k)}}\right).$$
The iterations for one particular choice of $a$ and starting value are in the cobweb plot in figure 1.

We also give R code for a general cobweb plotter with a variable number of parameters.

[Insert cobwebPlotter.R Here](../code/cobwebPlotter.R)
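The linked file is not reproduced here; the following is a minimal sketch of such a cobweb plotter in base R (our own code, with a simplified argument list, not the author's cobwebPlotter.R), shown with the Newton square-root iteration.

```r
# Minimal cobweb plotter: draws F, the line y = x, and the zig-zag path of the
# Picard sequence starting at x0 (sketch only, simplified arguments).
cobweb <- function(F, x0, n = 20, xlim = NULL, ...) {
  x <- numeric(n + 1)
  x[1] <- x0
  for (k in 1:n) x[k + 1] <- F(x[k], ...)
  if (is.null(xlim)) xlim <- range(x) + c(-0.5, 0.5)
  xs <- seq(xlim[1], xlim[2], length.out = 200)
  plot(xs, sapply(xs, F, ...), type = "l", xlim = xlim, ylim = xlim,
       xlab = "x", ylab = "F(x)")
  abline(0, 1, lty = 2)                              # the line y = x
  for (k in 1:n) {
    segments(x[k], x[k], x[k], x[k + 1])             # vertical step to F(x_k)
    segments(x[k], x[k + 1], x[k + 1], x[k + 1])     # horizontal step to y = x
  }
  invisible(x)
}

# Newton square-root iteration for a = 2, started at x0 = 3.
cobweb(function(x, a) (x + a / x) / 2, x0 = 3, a = 2)
```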


Figure 1: Cobweb plot for Newton Square Root Iteration


Notation


Bibliography

Abatzoglou T, O'Donnell B [1982] Minimization by Coordinate Descent. Journal of Optimization Theory and Applications 36:163-174
Argyros IK, Szidarovszky F [1993] The Theory and Application of Iteration Methods. CRC Press, Boca Raton
Berge C [1965] Espaces Topologiques, Fonctions Multivoques. Deuxième édition, Dunod, Paris
Berge C [1997] Topological Spaces. Dover Publications, Mineola
Berinde V [2007] Iterative Approximation of Fixed Points. Second Edition, Springer, Berlin
Böhning D, Lindsay BG [1988] Monotonicity of Quadratic Approximation Algorithms. Annals of the Institute of Statistical Mathematics 40:641-663
Borg I, Groenen PJF [2005] Modern Multidimensional Scaling. Second Edition, Springer, New York
Browne MW [1987] The Young-Householder Algorithm and the Least Squares Multidimensional Scaling of Squared Distances. Journal of Classification 4:175-190
Bryer J [2014] Rgitbook: Gitbook Projects with R Markdown. Package version 0.9
Bunch JR, Nielsen CP [1978] Updating the Singular Value Decomposition. Numerische Mathematik 31:111-129
Bunch JR, Nielsen CP, Sorensen DC [1978] Rank-one Modification of the Symmetric Eigenproblem. Numerische Mathematik 31:31-48
Céa J [1968] Les Méthodes de "Descente" dans la Theorie de l'Optimisation. Revue Francaise d'Automatique, d'Informatique et de Recherche Opérationelle 2:79-102
Céa J [1970] Recherche Numérique d'un Optimum dans un Espace Produit. In: Colloquium on Methods of Optimization. Springer, New York
Céa J, Glowinski R [1973] Sur les Méthodes d'Optimisation par Rélaxation. Revue Francaise d'Automatique, d'Informatique et de Recherche Opérationelle 7:5-32
De Leeuw J [1968] Nonmetric Discriminant Analysis. Department of Data Theory, Leiden University, Research Note 06-68


De Leeuw J [1975] An Alternating Least Squares Approach to Squared Distance Scaling. Unpublished, probably lost forever
De Leeuw J [1977] Applications of Convex Analysis to Multidimensional Scaling. In: Barra JR, Brodeau F, Romier G, Van Cutsem B (eds) Recent Developments in Statistics. North Holland Publishing Company, Amsterdam
De Leeuw J [1982] Generalized Eigenvalue Problems with Positive Semidefinite Matrices. Psychometrika 47:87-94
De Leeuw J [1988] Multivariate Analysis with Linearizable Regressions. Psychometrika 53:437-454
De Leeuw J [1994] Block Relaxation Algorithms in Statistics. In: Bock HH, Lenski W, Richter MM (eds) Information Systems and Data Analysis. Springer, Berlin
De Leeuw J [2004] Least Squares Optimal Scaling for Partially Observed Linear Systems. In: Van Montfort K, Oud J, Satorra A (eds) Recent Developments on Structural Equation Models. Kluwer, Dordrecht
De Leeuw J [2006] Principal Component Analysis of Binary Data by Iterated Singular Value Decomposition. Computational Statistics and Data Analysis 50:21-39
De Leeuw J [2007] Derivatives of Generalized Eigen Systems with Applications. Department of Statistics UCLA, Preprint 528
De Leeuw J [2007b] Minimizing the Cartesian Folium. Department of Statistics UCLA, Unpublished
De Leeuw J [2008a] Derivatives of Fixed-Rank Approximations. Department of Statistics UCLA, Preprint 547
De Leeuw J [2008b] Rate of Convergence of the Arithmetic-Geometric Mean Process. Department of Statistics UCLA, Preprint 550
De Leeuw J, Heiser WJ [1982] Theory of Multidimensional Scaling. In: Krishnaiah PR, Kanal L (eds) Handbook of Statistics, Volume II. North Holland Publishing Company, Amsterdam
De Leeuw J, Lange K [2009] Sharp Quadratic Majorization in One Dimension. Computational Statistics and Data Analysis 53:2471-2484
De Leeuw J, Sorenson K [2012] Derivatives of the Procrustus Transformation with Applications. Department of Statistics UCLA, Unpublished
Delfour MC [2012] Introduction to Optimization and Semidifferential Calculus. SIAM, Philadelphia


Dempster AP, Laird NM, Rubin DB [1977] Maximum Likelihood for Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society B39:1-38
Demyanov VF [2007] Nonsmooth Optimization. In: Di Pillo G, Schoen F (eds) Nonlinear Optimization. Lectures given at the C.I.M.E. Summer School held in Cetraro, Italy, July 1-7, 2007. Springer, New York
Demyanov VF [2009] Dini and Hadamard Derivatives in Optimization. In: Floudas CA, Pardalos PM (eds) Encyclopedia of Optimization. Revised and expanded edition, Springer, New York
D'Esopo DA [1959] A Convex Programming Procedure. Naval Research Logistic Quarterly 6:33-42
Dinkelbach W [1967] On Nonlinear Fractional Programming. Management Science 13:492-498
Dontchev AL, Rockafellar RT [2014] Implicit Functions and Solution Mappings. Second Edition, Springer, New York
Elkin RM [1968] Convergence Theorems for Gauss-Seidel and other Minimization Algorithms. Technical Report 68-59, Computer Sciences Center, University of Maryland
Forsythe GE, Golub GH [1965] On the Stationary Values of a Second Degree Polynomial on the Unit Sphere. Journal of the Society for Industrial and Applied Mathematics 13:1050-1068
Gander W [1981] Least Squares with a Quadratic Constraint. Numerische Mathematik 36:291-307
Gifi A [1990] Nonlinear Multivariate Analysis. Wiley, Chichester
Golub GH [1973] Some Modified Matrix Eigenvalue Problems. SIAM Review 15:318-334
Groenen PJF, Giaquinto P, Kiers HAL [2003] Weighted Majorization Algorithms for Weighted Least Squares Decomposition Models. Econometric Institute Report EI 2003-09, Erasmus University, Rotterdam
Groenen PJF, Nalbantov G, Bioch JC [2007] Nonlinear Support Vector Machines Through Iterative Majorization and I-Splines. In: Lenz HJ, Decker R (eds) Studies in Classification, Data Analysis, and Knowledge Organization. Springer, New York
Groenen PJF, Nalbantov G, Bioch JC [2008] SVM-Maj: a Majorization Approach to Linear Support Vector Machines with Different Hinge Errors. Advances in Data Analysis and Classification 2:17-43
Harman HH, Jones WH [1966] Factor Analysis by Minimizing Residuals (MINRES). Psychometrika 31:351-368


Heiser WJ [1986] A Majorization Algorithm for the Reciprocal Location Problem. Department of Data Theory, Leiden University, Report RR-86-12
Heiser WJ [1987] Correspondence Analysis with Least Absolute Residuals. Computational Statistics and Data Analysis 5:337-356
Heiser WJ [1995] Convergent Computation by Iterative Majorization: Theory and Applications in Multidimensional Data Analysis. In: Krzanowski WJ (ed) Recent Advances in Descriptive Multivariate Analysis. Clarendon Press, Oxford
Hildreth C [1957] A Quadratic Programming Procedure. Naval Research Logistic Quarterly 14:79-84
Hunter DR, Lange K [2004] A Tutorial on MM Algorithms. American Statistician 58:30-37
Hunter DR, Li R [2005] Variable Selection Using MM Algorithms. Annals of Statistics 33:1617-1642
Jaakkola TSW, Jordan MIW [2000] Bayesian Parameter Estimation via Variational Methods. Statistical Computing 10:25-37
Kato T [1976] Perturbation Theory for Linear Operators. Second Edition, Springer, New York
Krantz SG, Parks HR [2013] The Implicit Function Theorem: History, Theory, and Applications. Springer, New York
Kruskal JB [1964a] Multidimensional Scaling by Optimizing Goodness of Fit to a Nonmetric Hypothesis. Psychometrika 29:1-27
Kruskal JB [1964b] Nonmetric Multidimensional Scaling: a Numerical Method. Psychometrika 29:115-129
Kruskal JB [1965] Analysis of Factorial Experiments by Estimating Monotone Transformations of the Data. Journal of the Royal Statistical Society B27:251-263
Lange K [2013] Optimization. Second Edition, Springer, New York
Lange K [20xx] MM Algorithms. Book in progress
Lange K, Chi EC, Zhou H [2014] A Brief Survey of Modern Optimization for Statisticians. International Statistical Review 82:46-70
Lange K, Hunter DR, Yang I [2000] Optimization Transfer Using Surrogate Objective Functions. Journal of Computational and Graphical Statistics 9:1-20
Lipp T, Boyd S [2014] Variations and Extensions of the Convex-Concave Procedure. (as yet) Unpublished paper, Stanford University. http://web.stanford.edu/~boyd/papers/pdf/cvx_ccv.pdf


Magnus JR, Neudecker H [1999] Matrix Differential Calculus with Applications in Statistics and Econometrics. Revised Edition, Wiley, New York
Mair P, De Leeuw J [2010] A General Framework for Multivariate Analysis with Optimal Scaling: The R Package aspect. Journal of Statistical Software 32(9):1-23
Melman A [1995] Numerical Solution of a Secular Equation. Numerische Mathematik 69:483-493
Melman A [1997] A Unifying Convergence Analysis of Second-Order Methods for Secular Equations. Mathematics of Computation 66:333-344
Melman A [1998] Analysis of Third-order Methods for Secular Equations. Mathematics of Computation 67:271-286
Mönnigmann M [2011] Fast Calculation of Spectral Bounds for Hessian Matrices on Hyperrectangles. SIAM Journal on Matrix Analysis and Applications 32:1351-1366
Nesterov Y, Polyak BT [2006] Cubic Regularization of Newton Method and its Global Performance. Mathematical Programming A108:177-205
Oberhofer W, Kmenta J [1974] A General Procedure for Obtaining Maximum Likelihood Estimates in Generalized Regression Models. Econometrica 42:579-590
Ortega JM, Rheinboldt WC [1970] Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, New York
Ortega JM, Rheinboldt WC [1970] Local and Global Convergence of Generalized Linear Iterations. In: Ortega JM, Rheinboldt WC (eds) Numerical Solutions of Nonlinear Problems. SIAM, Philadelphia
Ostrowski AM [1966] Solution of Equations and Systems of Equations. Second Edition, Academic Press, New York
Penot J-P [2013] Calculus without Derivatives. Springer, New York
Parring AM [1992] About the Concept of the Matrix Derivative. Linear Algebra and its Applications 176:223-235
Ramsay JO [1977] Maximum Likelihood Estimation in Multidimensional Scaling. Psychometrika 42:241-266
R Core Team [2015] R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria
Rockafellar RT [1970] Convex Analysis. Princeton University Press, Princeton
Rockafellar RT, Wets RJB [1998] Variational Analysis. Springer, New York


Roskam EEChI [1968] Metric Analysis of Ordinal Data in Psychology. VAM, Voorschoten, Netherlands
Schechter S [1962] Iteration Methods for Nonlinear Problems. Transactions of the American Mathematical Society 104:179-189
Schechter S [1968] Relaxation Methods for Convex Problems. SIAM Journal on Numerical Analysis 5:601-612
Schechter S [1970] Minimization of a Convex Function by Relaxation. In: Abadie J (ed) Integer and Nonlinear Programming. North Holland Publishing Company, Amsterdam
Shapiro A [1990] On Concepts of Directional Differentiability. Journal of Optimization Theory and Applications 66:477-487
Schirotzek W [2007] Nonsmooth Analysis. Springer, New York
Smart DR [1974] Fixed Point Theorems. Cambridge Tracts in Mathematics 66, Cambridge University Press, Cambridge
Spivak M [1965] Calculus on Manifolds. Westview Press, Boulder
Spjøtvoll E [1972] A Note on a Theorem by Forsythe and Golub. SIAM Journal on Applied Mathematics 23:307-311
Sriperumbudur BK, Lanckriet GRG [2012] A Proof of Convergence of the Concave-Convex Procedure Using Zangwill's Theory. Neural Computation 24:1391-1407
Takane Y [1977] On the Relations among Four Methods of Multidimensional Scaling. Behaviormetrika 4:29-42
Takane Y, Young FW, De Leeuw J [1977] Nonmetric Individual Differences in Multidimensional Scaling: An Alternating Least Squares Method with Optimal Scaling Features. Psychometrika 42:7-67
Theussl S, Borchers HW [2014] CRAN Task View: Optimization and Mathematical Programming.
Thomson GH [1934] Hotelling's Method Modified to Give Spearman's g. Journal of Educational Psychology 25:366-374
Torgerson WS [1958] Theory and Methods of Scaling. Wiley, New York
Van Den Burg GJJ, Groenen PJF [2014] GenSVM: A Generalized Multiclass Support Vector Machine. Econometric Institute Report EI 2014-33, Erasmus University, Rotterdam
Van der Heijden PGM, Sijtsma K [1996] Fifty Years of Measurement and Scaling in the Dutch Social Sciences. Statistica Neerlandica 50:111-135

Bibliography

73

Block Relaxation Algorithms in Statistics -- Part III

Van Ruitenburg J [2005] Algorithms for Parameter Estimation in the Rasch Model. CITO Measurement and Research Department Reports 2005-04, Arnhem, Netherlands
Varadhan R [2014] Numerical Optimization in R: Beyond optim. Journal of Statistical Software 60(1)
Verboon P, Heiser WJ Resistant Lower Rank Approximation of Matrices by Iterative Majorization. Computational Statistics and Data Analysis 18:457-467
Voß H, Eckhardt U [1980] Linear Convergence of Generalized Weiszfeld's Method. Computing 25:243-251
Wainer H, Morgan A, Gustafsson JE [1980] A Review of Estimation Procedures for the Rasch Model with an Eye toward Longish Tests. Journal of Educational Statistics 5:35-64
Weiszfeld E [1937] Sur le Point pour lequel la Somme des Distances de n Points Donnés est Minimum. Tôhoku Mathematical Journal 43:355-386
Weiszfeld E, Plastria F [2009] On the Point for which the Sum of the Distances to n Given Points is Minimum. Annals of Operations Research 167:7-41
Wilkinson GN [1958] Estimation of Missing Values for the Analysis of Incomplete Data. Biometrics 14:257-286
Wilkinson JH [1965] The Algebraic Eigenvalue Problem. Clarendon Press, Oxford
Wong CS [1985] On the Use of Differentials in Statistics. Linear Algebra and its Applications 70:285-299
Xie Y [2013] Dynamic Documents with R and knitr. Boca Raton, Chapman and Hall/CRC
Yates F [1933] The Analysis of Replicated Experiments when the Field Results are Incomplete. Empire Journal of Experimental Agriculture 1:129-142
Yen E-H, Peng N, Wang P-W, Lin S-D [2012] [On Convergence Rate of Concave-Convex Procedure](http://opt-ml.org/oldopt/papers/opt2012_paper_10.pdf). Paper presented at the 5th NIPS Workshop on Optimization for Machine Learning, Lake Tahoe, December 8, 2012
Young FW [1981] Quantitative Analysis of Qualitative Data. Psychometrika 46:357-388
Young FW, De Leeuw J, Takane Y [1980] Quantifying Qualitative Data. In: Lantermann ED, Feger H (eds) Similarity and Choice. Papers in Honor of Clyde Coombs. Hans Huber, Bern
Yuille AL, Rangarajan A [2003] The Concave-Convex Procedure. Neural Computation 15:915-936
Zangwill WI [1969] Nonlinear Programming: A Unified Approach. Prentice Hall, Englewood Cliffs

What's New

Version 021215
- Material in appendix on derivatives of eigenvalues
- Expanded bibliography
- Added "What's New" chapter
- Wrote intro to taylor majorization
- Some shaky stuff on Lipschitz

Version 021315
- Added folium in coordinate descent chapter

Version 021415
- Added univariate cubic in local majorization
- Revisiting the reciprocal in higher order majorization

Version 021515
- Various small editorial changes throughout
- Version numbering by date
- Appendix on cobweb plotter
- Moved the graphics from an external server to the book
- Added some section headers in ALS chapter
- Removed shaky Lipschitz stuff

Version 021615
- Cleanup first chapters

Version 021715
- Some stuff for section on product of derivatives of block mappings - representation as cyclic matrix

Version 021815
- Added some detail to the SVD example
- Added material on Dinkelbach majorization

Version 021915
- L-majorization and D-majorization

Version 022015
- Calculations for derivatives of general block methods
- Reorganized examples in block chapter

Version 022115
- Updated material on derivatives of algorithmic map for block methods
- Block Newton methods section
- Formulas for rate of block optimization
- Defined and used the LU-form and product form

Version 022215
- Removed Gauss-Jordan from multiple block LS
- Convergence rate of CCD for the Rayleigh quotient
- Added blockRate.R function
- Added mls.R function for multiple block LS
- Various small edits

Version 022315
- Expanded bibliography
- Put LaTeX macros in header.md to be included on every page
- Tentatively started HTML cross-referencing of sections
- Clarified block relaxation derivatives
- Started some material on block Newton

Version 022415
- Defined iteration map, iteration function, iteration matrix and iteration radius
- Added material on comparing majorizers (may have to be moved to composition section)
- Simple quartic example in neighborhood majorization

Version 022515
- No changes

Version 022615
- No changes

Version 022715
- Some changes in the definition section of majorization

Version 022815
- More majorization definition changes
- Majorization duality
- Graphics for majorization
- Moved code and graphics to better places

Version 030115
- More meat on D-majorization
- Additions to mean value section
- Separate section on majorizing value functions
- Logit example with higher derivatives
- Mean value and taylor majorization material

Version 030215
- Higher derivatives for the logit: plots and formulas
- Higher derivatives for the probit, empty section
- Changed the order of some sections and chapters
- Cleanup of various sections
- Consistent notation for block least squares
- Consistent notation in tomography section
- Split convexity into convexity and linear majorization
- Moved EM into convexity

Version 030315
- Notation in EM sums and integrals
- Notation is now a separate chapter -- I intend to make a big deal out of this
- Moved appendices to background chapter
- Plan: merge silly README and introduction files for each chapter
- Added matrix background
- Added canonical correlation background section
- Cover page!
- DC example
- Added probit calculations to sharp quadratic

Version 030415
- Moved last graphics from gifi into book
- Moved bibliography and notation files to top level
- Checked and corrected chapter and section headers
- Background on singular value decomposition - start
- Background on canonical correlation - start
- Background on eigenvalue decomposition - start

Version 030515
- Split eigen background into symmetric and asymmetric
- Background on symmetric eigen decomposition
- Started background on asymmetric eigen problem
- Changed chapter title from linear to tangential majorization
- Added necessary condition from tangential majorization of concave function
- Changed chapter title to higher order majorization

Version 030615
- More on asymmetric eigenproblems
- Changed names: Sub-level majorization, Dinkelbach majorization, Nesterov Majorization
- Added even higher order Nesterov majorization
- Added multivariate quadratic sub-level majorization -- start
- Nesterov majorization definition and implementation
- Rearranged background chapter, new empty sections in matrices and inequalities
- Introduced majorization scheme or majorization coupling
- Decomposition section with examples added

Version 030715
- Defined majorization coupling
- Redid coupling plot for log-logit in rgl
- I will start moving figures back into the text, instead of putting them at the end of the page. Markdown and HTML are not LaTeX, there are no floats.
- Background section on necessary and sufficient conditions for a minimum, added some material on quadratics
- Background section on generalized inverses. Still empty.
- Necessary conditions for minimum on sphere or ball
- Insert stuff on partial majorization in value functions

Version 030815
- I am experimenting with cross-referencing chapters, sections, theorems, figures, maybe even paragraphs and formulas, using the HTML name attribute
- I am experimenting with pretty printing and executing R code in the pages, by using Rgitbook

Version 030915
- Replaced some figures with code for knitr
- Added bibliography items
- Changed preface name in book.json
- Added link to my bibliography in preface
- Replaced all md files by Rmd files
- Pasted in sharp majorization of even functions
- Pasted in sharp majorization in two points
- Material on first secular equation in modified eigenvalues background
- Must remember to name my chunks

Version 031015
- Added material on modified eigen problem
- Added material on quadratics on a sphere

Version 031115
- Added more material on quadratics on a sphere
- Included LaTeX macros on pages
- Replaced more figures by code
- Expanded glossary
- ALS intro
- Add knitr caches for computations

Version 031215
- ALS intro edited
- ALS rate of convergence edited
- ALSOS section edited
- Continue cosmetics, cross referencing, glossary, caches
- Renamed news.md, bibliography.md, notation.md using caps
- Include LaTeX macros on each page
- Update README.md in all directories
- Update book.json to gitbook 2.0.0-beta.3

Version 031315
- Added some material to majorizing value functions
- Experimenting with how to link to R code files
- Thinking about merging subsections into a single section file

Version 031415 (day of pi)
- Added section on projection vs block relaxation

Version 031515
- Rewrote section on missing data in linear models (Yates)
- Redid plots in quadratic majorization of normal distribution and density

Version 031615
- Try out variables in book.json for chapter and section cross references

Version 031715
- Moved HEADER.md to the top of all Rmd files

Version 031815
- Some experimenting with variables and arrays in book.json

Version 031915
- No changes

Version 032015
- Sections rasch and morerasch edited
- Section on Block Newton edited

Version 032115
- Added derivatives of implicit functions

Version 032215
- Added material on marginal functions and solution maps (changed section title)
- Added classical implicit function theorem

Version 032315
- No changes

Version 032415
- Added S-majorization and D-majorization example, with figures and code, to Dinkelbach section
- Graphics for Chebyshev example in 2.8.2 replaced by code, graphics in 2 x 2 plot matrix
- Started adding section and chapter numbers
- Graphics for folium redone with code and cache

Version 032515
- Section on univariate and separable functions
- Updated book.json to gitbook 2.0.0

Version 032615
- Add material on Concave-Convex Procedure
- S-majorization by a quadratic: multivariate case
- Started section on majorization on a hyperrectangle

Version 032715
- Add material on location problem
- Add section on Generalized Weiszfeld Methods

Version 032815
- Background on liminf/limsup
- Rearranged some background sections

Version 032915
- Definition section for Majorization Algorithm
- Replaced more plots with knitr chunks of R code
- Started section on matrix derivatives

Version 033015
- Material on matrix derivatives
- Started section on directional derivatives

Version 033115
- More on directional derivatives
- Shift focus to functions with values in extended reals

Version 040115
- Section on derivatives between directional and Taylor

Version 040215
- More on derivatives
- book.json gitbook updated to version 2.0.1

Workflow

If I wanted you to understand it, I would have explained it better. -- Johan Cruijff

This may be of some interest to some. I use the Rgitbook package in R together with gitbook-cli, and the (old) Gitbook.app that uses gitbook-1.0.3.

All files are in the BRAS directory. I edit the Rmd files in Sublime Text, then run buildGitbook() from an R session running in BRAS, and then use openGitbook() to see the HTML in a browser. The Rmd files follow the conventions of knitr. Thus buildGitbook() first knits all updated Rmd files to md files, creating caches, executing R code, and generating figures and output where necessary. As a second step it converts the md files to HTML and puts them in the right place in the gitbook hierarchy by calling gitbook-cli. Then I upload the book to GitbookIO using Gitbook.app. Note that uploading means another md to HTML conversion on the server side. The toolchain is a bit shaky, because both Rgitbook and Gitbook.app are no longer actively maintained. It would be great if RStudio could handle the whole build sequence and have a book repository similar to its repository for compiled Rmd files. The major problem at the moment is that none of the tools can efficiently convert formulas to png files, and my book is full of formulas. MathJax does a great job generating a website for the book, but so far png conversion does not work for me.
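For concreteness, here is a minimal sketch of that build cycle, assuming the Rgitbook package is installed; the path to the book directory is a placeholder.

```r
# Minimal sketch of the build cycle described above.
# Assumes Rgitbook is installed; the path is a placeholder for the local BRAS directory.
library(Rgitbook)
setwd("~/BRAS")    # directory containing the Rmd sources and book.json
buildGitbook()     # knit updated Rmd files to md, then build the HTML via gitbook-cli
openGitbook()      # open the generated HTML book in a browser
```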

Glossary

ALS
Alternating Least Squares.
See: 4. What's New

ALSOS
Alternating Least Squares with Optimal Scaling.
See: 4. What's New

Alternating Least Squares
Block relaxation of a least squares loss function.
See: 3. Bibliography

Alternating Least Squares with Optimal Scaling
Least squares with data transformations.

Augmentation Algorithm
Minimize a function by introducing additional variables in block relaxation.

Augmentation Method
Minimize a function by introducing additional variables in block relaxation.

Block Relaxation Algorithm
Minimize a function over alternating blocks of variables.

Block Relaxation Method
Minimize a function over alternating blocks of variables.

Gauss-Newton Method
Minimizing the linear-quadratic Taylor approximation to a least squares loss function.
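As an illustration (not code from the book), here is a bare-bones Gauss-Newton iteration for a small nonlinear least squares problem; the model, data, and starting values are invented for the example.

```r
# Illustrative Gauss-Newton iteration for fitting y = b1 * exp(-b2 * x)
# by least squares: the residuals are linearized at the current estimate
# and the resulting linear least squares problem is solved.
x <- 1:10
y <- 5 * exp(-0.3 * x)                        # exact hypothetical data
r <- function(b) y - b[1] * exp(-b[2] * x)    # residual vector
J <- function(b) cbind(-exp(-b[2] * x),       # Jacobian of the residuals
                       b[1] * x * exp(-b[2] * x))
b <- c(4.5, 0.25)                             # hypothetical starting values
for (k in 1:20) {
  Jb <- J(b)
  b <- b - drop(solve(crossprod(Jb), crossprod(Jb, r(b))))  # Gauss-Newton update
}
b                                             # should converge to c(5, 0.3)
```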

Iteration Function
Function defining the update in a step of an iterative algorithm.
See: 4. What's New; 1.8.1. Eigenvalues and Eigenvectors of Symmetric Matrices

Iteration Jacobian
Derivative of the iteration function.

Iteration Map
Point-to-set map defining the possible updates in a step of an iterative algorithm.
See: 4. What's New

Iteration Rate
Largest eigenvalue, in modulus, of the iteration Jacobian. Same as Iteration Spectral Radius.

Iteration Spectral Radius
Largest eigenvalue, in modulus, of the iteration Jacobian.
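A small illustration (not from the book): for a linear iteration the iteration Jacobian is the iteration matrix itself, so the spectral radius can be read off from its eigenvalues. The matrix and vector below are hypothetical.

```r
# For the linear iteration x_{k+1} = A %*% x_k + b the iteration Jacobian is A
# itself, so the iteration spectral radius is the largest modulus of its eigenvalues.
A <- matrix(c(0.5, 0.2, 0.1, 0.4), 2, 2)   # hypothetical iteration matrix
b <- c(1, 1)
xstar <- solve(diag(2) - A, b)             # fixed point of the iteration
rho <- max(Mod(eigen(A)$values))           # iteration spectral radius, here 0.6
rho                                        # rho < 1, so the iteration converges linearly at rate 0.6
```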

Majorization Algorithm
Minimize a function by iteratively minimizing majorizations.
See: 3. Bibliography; 4. What's New
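As a small worked illustration (not an example taken from the book), one can compute a median by iteratively minimizing quadratic majorizations of the absolute value terms; the data are hypothetical.

```r
# Minimize f(x) = sum(abs(x - y)) by iterative majorization. Each term is
# majorized at the current iterate xk by the quadratic
#   abs(x - y_i) <= (x - y_i)^2 / (2 * abs(xk - y_i)) + abs(xk - y_i) / 2,
# which touches abs(x - y_i) at x = xk. Minimizing the sum of these
# majorizers is a weighted least squares problem.
y <- c(1, 2, 3, 5, 9)                 # hypothetical data
x <- mean(y)                          # starting value
for (k in 1:100) {
  w <- 1 / pmax(abs(x - y), 1e-10)    # weights, guarded against division by zero
  x <- sum(w * y) / sum(w)            # minimizer of the quadratic majorizer
}
x                                     # approaches the median of y, here 3
```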

Majorization Method
Minimize a function by iteratively minimizing majorizations.

MM Algorithm
Majorization/Minimization or Minorization/Maximization Algorithm.

MM Method
Majorization/Minimization or Minorization/Maximization Algorithm.

Nesterov Majorization
Majorize a function by bounding the cubic term in the Taylor series.
See: 4. What's New; 1.8.6. Quadratics on a Sphere

Newton's Method
Minimizing the quadratic Taylor approximation to a function.
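For concreteness (not an example from the book), here are one-dimensional Newton updates obtained by minimizing the quadratic Taylor approximation at the current iterate; the objective and starting value are invented.

```r
# Newton's method: minimize the quadratic Taylor approximation at x_k,
# which in one dimension gives the update x_{k+1} = x_k - f'(x_k) / f''(x_k).
f <- function(x) x^4 - 3 * x^2 + x      # hypothetical objective
g <- function(x) 4 * x^3 - 6 * x + 1    # first derivative
h <- function(x) 12 * x^2 - 6           # second derivative
x <- 2                                  # starting value, where f is locally convex
for (k in 1:20) x <- x - g(x) / h(x)    # Newton updates
c(x, g(x))                              # local minimizer with gradient (numerically) zero
```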
