Block Relaxation Algorithms in Statistics -- Part III
Table of Contents

0 Project
1 Background
1.1 Introduction
1.2 Analysis
1.2.1 Semi-continuities
1.2.2 Directional Derivatives
1.2.3 Differentiability and Derivatives
1.2.4 Taylor's Theorem
1.2.5 Implicit Functions
1.2.6 Necessary and Sufficient Conditions for a Minimum
1.3 Point-to-Set Maps
1.3.1 Continuities
1.3.2 Marginal Functions
1.3.3 Solution Maps
1.4 Basic Inequalities
1.4.1 Jensen's Inequality
1.4.2 The AM-GM Inequality
1.4.3 Cauchy-Schwartz Inequality
1.4.4 Young's Inequality
1.5 Fixed Point Problems and Methods
1.5.1 Subsequential Limits
1.6 Convex Functions
1.6.1 Composition
1.7 Rates of Convergence
1.7.1 Over- and Under-Relaxation
1.7.2 Acceleration of Convergence of Fixed Point Methods
1.8 Matrix Algebra
1.8.1 Eigenvalues and Eigenvectors of Symmetric Matrices
1.8.2 Singular Values and Singular Vectors
1.8.3 Canonical Correlation
1.8.4 Eigenvalues and Eigenvectors of Asymmetric Matrices
1.8.5 Modified Eigenvalue Problems
1.8.6 Quadratics on a Sphere
1.8.7 Generalized Inverses
1.8.8 Partitioned Matrices
1.9 Matrix Differential Calculus
1.9.1 Matrix Derivatives
1.9.2 Derivatives of Eigenvalues and Eigenvectors
1.10 Miscellaneous
1.10.1 Multidimensional Scaling
1.10.2 Cobweb Plots
2 Notation
3 Bibliography
4 What's New
5 Workflow
Glossary
This is Part III of Block Relaxation Algorithms in Statistics. It discusses various mathematical, computational, and notational background topics.
Background
Introduction
14.2: Analysis
14.2.1: Semi-continuities

The lower limit or limit inferior of a sequence $\{x_n\}$ is defined as
$$\liminf_{n\rightarrow\infty}x_n=\lim_{n\rightarrow\infty}\inf_{k\geq n}x_k=\sup_{n}\inf_{k\geq n}x_k.$$
Alternatively, the limit inferior is the smallest cluster point or subsequential limit of the sequence. In the same way the upper limit or limit superior is
$$\limsup_{n\rightarrow\infty}x_n=\lim_{n\rightarrow\infty}\sup_{k\geq n}x_k=\inf_{n}\sup_{k\geq n}x_k,$$
the largest cluster point. We always have
$$\liminf_{n\rightarrow\infty}x_n\leq\limsup_{n\rightarrow\infty}x_n.$$
Also, if $\liminf_{n\rightarrow\infty}x_n=\limsup_{n\rightarrow\infty}x_n$, then the common value is $\lim_{n\rightarrow\infty}x_n$.

The lower limit or limit inferior of a function $f$ at a point $x$ is defined as
$$\liminf_{y\rightarrow x}f(y)=\lim_{\delta\downarrow 0}\inf_{y\in\mathcal{B}(x,\delta)}f(y),$$
where $\mathcal{B}(x,\delta)=\{y:\|y-x\|\leq\delta\}$. Alternatively,
$$\liminf_{y\rightarrow x}f(y)=\min\{\alpha:f(x_n)\rightarrow\alpha\text{ for some sequence }x_n\rightarrow x\}.$$
In the same way the upper limit or limit superior of $f$ at $x$ is
$$\limsup_{y\rightarrow x}f(y)=\lim_{\delta\downarrow 0}\sup_{y\in\mathcal{B}(x,\delta)}f(y).$$

A function $f$ is lower semi-continuous at $x$ if
$$\liminf_{y\rightarrow x}f(y)\geq f(x).$$
Since we always have
$$\liminf_{y\rightarrow x}f(y)\leq f(x),$$
we can also define lower semi-continuity as
$$\liminf_{y\rightarrow x}f(y)=f(x).$$
A function $f$ is upper semi-continuous at $x$ if
$$\limsup_{y\rightarrow x}f(y)\leq f(x).$$
We have that $f$ is upper semi-continuous at $x$ if and only if $-f$ is lower semi-continuous at $x$. A function is continuous at $x$ if and only if it is both lower semi-continuous and upper semi-continuous, i.e. if
$$\lim_{y\rightarrow x}f(y)=f(x).$$
14.2.2: Directional Derivatives

The notation and terminology are by no means standard. We generally follow Demyanov [2007, 2009]. The lower Dini directional derivative of $f$ at $x$ in the direction $d$ is
$$\underline{d}f(x;d)=\liminf_{t\downarrow 0}\frac{f(x+td)-f(x)}{t},$$
and the corresponding upper Dini directional derivative is
$$\overline{d}f(x;d)=\limsup_{t\downarrow 0}\frac{f(x+td)-f(x)}{t}.$$
If
$$\lim_{t\downarrow 0}\frac{f(x+td)-f(x)}{t}$$
exists, i.e. if $\underline{d}f(x;d)=\overline{d}f(x;d)$, then we simply write $df(x;d)$ for the Dini directional derivative of $f$ at $x$ in the direction $d$. Penot [2013] calls this the radial derivative, and Schirotzek [2007] calls it the directional Gateaux derivative. If $df(x;d)$ exists then $f$ is Dini directionally differentiable at $x$ in the direction $d$, and if $df(x;d)$ exists at $x$ for all $d$ we say that $f$ is Dini directionally differentiable at $x$. Delfour [2012] calls such an $f$ semidifferentiable at $x$.

In a similar way we can define the Hadamard lower and upper directional derivatives. They are
$$\underline{d}_Hf(x;d)=\liminf_{\substack{t\downarrow 0\\ \delta\rightarrow d}}\frac{f(x+t\delta)-f(x)}{t}$$
and
$$\overline{d}_Hf(x;d)=\limsup_{\substack{t\downarrow 0\\ \delta\rightarrow d}}\frac{f(x+t\delta)-f(x)}{t}.$$
The Hadamard directional derivative $d_Hf(x;d)$ exists if both $\underline{d}_Hf(x;d)$ and $\overline{d}_Hf(x;d)$ exist and are equal. In that case $f$ is Hadamard directionally differentiable at $x$ in the direction $d$, and if $d_Hf(x;d)$ exists at $x$ for all $d$ we say that $f$ is Hadamard directionally differentiable at $x$. Generally we have
$$\underline{d}_Hf(x;d)\leq\underline{d}f(x;d)\leq\overline{d}f(x;d)\leq\overline{d}_Hf(x;d).$$

The classical directional derivative of $f$ at $x$ in the direction $d$ is
$$d_Cf(x;d)=\lim_{t\rightarrow 0}\frac{f(x+td)-f(x)}{t},$$
where $t$ is now allowed to approach zero from both sides. Note that for the absolute value function at zero we have $df(0;d)=|d|$, while $d_Cf(0;d)$ does not exist for $d\neq 0$. The classical directional derivative is not particularly useful in the context of optimization problems.
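As a small numerical illustration (ours, not part of the original text), the R sketch below approximates the lower and upper Dini directional derivatives by evaluating the one-sided difference quotient for a decreasing sequence of step sizes. The function `dini()` and the example function are assumptions made for the illustration only.

```r
# Crude approximation of the lower and upper Dini directional derivatives of f at x
# in direction d, using the one-sided quotient (f(x + t d) - f(x)) / t for small t > 0.
dini <- function(f, x, d, tseq = 10^(-(1:8))) {
  q <- sapply(tseq, function(t) (f(x + t * d) - f(x)) / t)
  c(lower = min(q), upper = max(q))
}

f <- function(x) abs(x[1]) + x[2]^2
dini(f, x = c(0, 1), d = c(1, 0))    # both close to 1: df(x; d) = |d_1| exists
dini(f, x = c(0, 1), d = c(-1, 0))   # again close to 1, illustrating df(0; d) = |d|
```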
Differentiability and Derivatives

The function $f$ is Gateaux differentiable at $x$ if and only if the Dini directional derivative $df(x;d)$ exists for all $d$ and is linear in $d$. Thus
$$df(x;d)=\langle\nabla f(x),d\rangle,$$
with $\nabla f(x)$ the gradient of $f$ at $x$. The function $f$ is Hadamard differentiable at $x$ if the Hadamard directional derivative $d_Hf(x;d)$ exists for all $d$ and is linear in $d$. The function $f$ is locally Lipschitz at $x$ if there is a ball $\mathcal{B}(x,\delta)$ and a $K>0$ such that $|f(u)-f(v)|\leq K\|u-v\|$ for all $u,v\in\mathcal{B}(x,\delta)$. If $f$ is locally Lipschitz and Gateaux differentiable, then it is Hadamard differentiable. If the Gateaux derivative of $f$ exists in a neighborhood of $x$ and is continuous at $x$, then $f$ is Frechet differentiable at $x$. [Define Frechet differentiable.] On finite-dimensional spaces the function $f$ is Hadamard differentiable if and only if it is Frechet differentiable. [Gradient, Jacobian.]
14.2.3: Taylor's Theorem

Suppose $f$ is $p$ times continuously differentiable in the open set $\mathcal{U}\subseteq\mathbb{R}^n$. Define, for all $x\in\mathcal{U}$ and $k\leq p$,
$$\mathcal{D}^kf(x)(h)^k=\sum_{i_1=1}^n\cdots\sum_{i_k=1}^n\frac{\partial^kf(x)}{\partial x_{i_1}\cdots\partial x_{i_k}}h_{i_1}\cdots h_{i_k},$$
the inner product of the $k$-dimensional array of partial derivatives and the $k$-dimensional outer power of $h$. By convention $\mathcal{D}^0f(x)(h)^0=f(x)$. Both arrays are super-symmetric, and have dimension $n^k$.

Also define the Taylor polynomials
$$\mathcal{T}_{p-1}(h)=\sum_{k=0}^{p-1}\frac{1}{k!}\mathcal{D}^kf(x)(h)^k,$$
and the remainder
$$\mathcal{R}_p(h)=f(x+h)-\mathcal{T}_{p-1}(h).$$
Assume $\mathcal{U}$ contains the line segment with endpoints $x$ and $x+h$. Then Lagrange's form of the remainder says there is a $0<\xi<1$ such that
$$\mathcal{R}_p(h)=\frac{1}{p!}\mathcal{D}^pf(x+\xi h)(h)^p,$$
and the integral form of the remainder says
$$\mathcal{R}_p(h)=\frac{1}{(p-1)!}\int_0^1(1-\tau)^{p-1}\mathcal{D}^pf(x+\tau h)(h)^p\,d\tau.$$
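As a numerical sanity check (ours, not part of the text), the R sketch below compares a smooth function with its second-order Taylor polynomial and verifies that the remainder is of third order in $\|h\|$. The example function, gradient, and Hessian are chosen for the illustration only.

```r
# Second-order Taylor polynomial of f(x) = exp(x1) * sin(x2), evaluated at x + h.
f    <- function(x) exp(x[1]) * sin(x[2])
grad <- function(x) c(exp(x[1]) * sin(x[2]), exp(x[1]) * cos(x[2]))
hess <- function(x) matrix(c(exp(x[1]) * sin(x[2]),  exp(x[1]) * cos(x[2]),
                             exp(x[1]) * cos(x[2]), -exp(x[1]) * sin(x[2])), 2, 2)
taylor2 <- function(x, h) f(x) + sum(grad(x) * h) + 0.5 * sum(h * (hess(x) %*% h))

x <- c(0.3, 0.7)
for (eps in 10^(-(1:4))) {
  h <- eps * c(1, -2)
  # the ratio |remainder| / ||h||^3 stays bounded as h shrinks
  cat(eps, abs(f(x + h) - taylor2(x, h)) / sum(h^2)^1.5, "\n")
}
```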
Implicit Functions

The classical implicit function theorem is discussed in all analysis books. We are particularly fond of Spivak [1970, p. 40]. The history of the theorem, and many of its variations, is discussed in Krantz and Parks [2013], and a comprehensive modern treatment, using the tools of convex and variational analysis, is in Dontchev and Rockafellar [2014].

Suppose $f:\mathbb{R}^n\times\mathbb{R}^m\rightarrow\mathbb{R}^m$, with $f(x_0,y_0)=0$, and suppose that $f$ is continuously differentiable in an open set containing $(x_0,y_0)$. Define the $m\times m$ matrix $\mathcal{D}_2f(x_0,y_0)$ of partial derivatives with respect to the last $m$ variables, and suppose it is non-singular. Then there is an open set $\mathcal{U}$ containing $x_0$ and an open set $\mathcal{V}$ containing $y_0$ such that for every $x\in\mathcal{U}$ there is a unique $y(x)\in\mathcal{V}$ with $f(x,y(x))=0$. The function $y$ is differentiable. If we differentiate $f(x,y(x))=0$ with respect to $x$ we find
$$\mathcal{D}_1f(x,y(x))+\mathcal{D}_2f(x,y(x))\,\mathcal{D}y(x)=0,$$
and thus
$$\mathcal{D}y(x)=-[\mathcal{D}_2f(x,y(x))]^{-1}\mathcal{D}_1f(x,y(x)).$$

As an example consider the eigenvalue problem
$$A(\theta)x=\lambda x,\qquad x'x=1,$$
where $A$ is a symmetric matrix function of a real parameter $\theta$. Differentiating and premultiplying the first equation by $x'$ gives
$$\frac{d\lambda}{d\theta}=x'\frac{dA(\theta)}{d\theta}x,$$
which works out to the same formula we derive in more detail in the appendix on derivatives of eigenvalues and eigenvectors.
Necessary and Sufficient Conditions for a Minimum

Directional derivatives can be used to provide simple necessary or sufficient conditions for a minimum [Demyanov, 2009, propositions 8 and 10].

Result: If $x$ is a local minimizer of $f$ then $\underline{d}f(x;d)\geq 0$ for all directions $d$. If $f$ is Hadamard directionally differentiable at $x$ and $\underline{d}_Hf(x;d)>0$ for all $d\neq 0$, then $f$ has a strict local minimum at $x$.

The special case of a quadratic deserves some separate study, because the quadratic model is so prevalent in optimization. So let us look at
$$f(x)=c+b'x+\frac12x'Ax,$$
with $A$ symmetric. Use the eigen-decomposition $A=K\Lambda K'$, with $K$ square orthonormal, to change variables to $y=K'x$, also defining $\beta=K'b$, which we can write as
$$f(y)=c+\beta'y+\frac12y'\Lambda y=c+\sum_{i=1}^n\left(\beta_iy_i+\frac12\lambda_iy_i^2\right).$$
Here the $\lambda_i$ are the diagonal elements of $\Lambda$. If the set $\{i:\lambda_i<0\}$ is non-empty we have $\inf_yf(y)=-\infty$. If $\{i:\lambda_i<0\}$ is empty, then $f$ attains its minimum if and only if $\beta_i=0$ for all $i$ with $\lambda_i=0$. Otherwise $\inf_yf(y)=-\infty$ again. If the minimum is attained, then it is attained at
$$x=-A^+b,$$
with $A^+$ the Moore-Penrose inverse. And the minimum is attained if and only if $A$ is positive semi-definite and $AA^+b=b$.
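The R sketch below (ours) illustrates the last statement: the quadratic attains its minimum if and only if $A$ is positive semi-definite and $AA^+b=b$, and the minimizer is then $-A^+b$. It assumes the MASS package for the Moore-Penrose inverse; the helper name `quad_min` is ours.

```r
library(MASS)  # for ginv(), the Moore-Penrose inverse

quad_min <- function(A, b, c = 0) {
  Ap <- ginv(A)
  attained <- all(eigen(A, symmetric = TRUE, only.values = TRUE)$values > -1e-10) &&
              max(abs(A %*% Ap %*% b - b)) < 1e-10
  if (!attained) return(list(attained = FALSE))
  x <- -drop(Ap %*% b)
  list(attained = TRUE, minimizer = x,
       minimum = c + sum(b * x) + 0.5 * sum(x * (A %*% x)))
}

A <- tcrossprod(matrix(rnorm(6), 3, 2))   # a rank-two psd matrix of order 3
quad_min(A, b = A %*% c(1, 2, 3))         # b in the column space: minimum attained
quad_min(A, b = c(1, 0, 0))               # generally not attained
```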
Point-to-set Maps
14.3.1: Continuities
Marginal Functions and Solution Maps

Suppose
$$g(x)=\min_yf(x,y)\qquad\text{and}\qquad y(x)=\mathop{\mathbf{argmin}}_yf(x,y),$$
where $f$ is twice continuously differentiable. Suppose the minimum is attained at a unique $y(x)$. Then obviously
$$g(x)=f(x,y(x)).$$
Differentiating gives
$$\mathcal{D}g(x)=\mathcal{D}_1f(x,y(x)).$$
To differentiate the solution map we need second derivatives of $f$. Differentiating the implicit definition
$$\mathcal{D}_2f(x,y(x))=0$$
gives
$$\mathcal{D}_{21}f(x,y(x))+\mathcal{D}_{22}f(x,y(x))\,\mathcal{D}y(x)=0,$$
or
$$\mathcal{D}y(x)=-[\mathcal{D}_{22}f(x,y(x))]^{-1}\mathcal{D}_{21}f(x,y(x)).$$
Now combine both results to obtain
$$\mathcal{D}^2g(x)=\mathcal{D}_{11}f(x,y(x))-\mathcal{D}_{12}f(x,y(x))[\mathcal{D}_{22}f(x,y(x))]^{-1}\mathcal{D}_{21}f(x,y(x)).$$
We see that if $\mathcal{D}_{22}f(x,y(x))$ is positive definite, then $\mathcal{D}^2g(x)\preceq\mathcal{D}_{11}f(x,y(x))$.

Now consider a minimization problem with constraints. Suppose $f$ and $g_1,\cdots,g_m$ are twice continuously differentiable functions on $\mathbb{R}^n$, and suppose we minimize $f$ over the manifold defined by $g_1(x)=\cdots=g_m(x)=0$, where again we assume the minimizer is unique and satisfies the first-order Lagrangian conditions. Differentiating the first-order conditions again, using the Hessian of the Lagrangian and the Jacobian of the constraints, leads to expressions for the derivatives of the constrained solution map and the corresponding marginal function in terms of the bordered Hessian. There is an alternative way of arriving at basically the same result: parametrize the manifold locally, i.e. write the feasible points as a smooth function of free parameters, and then apply the unconstrained results above to the parametrized problem.
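A small numerical check (ours) of the formula $\mathcal{D}g(x)=\mathcal{D}_1f(x,y(x))$, using the example $f(x,y)=(y-x)^2+y^4$, for which the inner minimum over $y$ has no closed form. The function names are ours.

```r
# f(x, y) = (y - x)^2 + y^4; g(x) = min_y f(x, y), y(x) = argmin_y f(x, y)
f    <- function(x, y) (y - x)^2 + y^4
g    <- function(x) optimize(function(y) f(x, y), c(-10, 10))$objective
ysol <- function(x) optimize(function(y) f(x, y), c(-10, 10))$minimum

x   <- 1.5
eps <- 1e-5
dg_numeric <- (g(x + eps) - g(x - eps)) / (2 * eps)   # central difference for Dg(x)
dg_formula <- -2 * (ysol(x) - x)                      # D_1 f(x, y(x)) = -2 (y(x) - x)
c(dg_numeric, dg_formula)                             # the two agree closely
```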
14.3.3: Solution Maps
Basic Inequalities
Jensen's Inequality
The AM-GM Inequality

The Arithmetic-Geometric Mean Inequality is simple, but quite useful for majorization. For completeness, we give the statement and proof here.

Theorem: If $x\geq 0$ and $y\geq 0$ then
$$\sqrt{xy}\leq\frac12(x+y),$$
with equality if and only if $x=y$.

Proof: Expand $(\sqrt{x}-\sqrt{y})^2\geq 0$ and collect terms. QED

Corollary: If $x\geq 0$ and $y\geq 0$ then
$$xy\leq\frac12(x^2+y^2),$$
with equality if and only if $x=y$.

Proof: Just a simple rewrite of the theorem. QED
Polar Norms and the Cauchy-Schwarz Inequality

Theorem: Suppose $x,y\in\mathbb{R}^n$. Then
$$|x'y|\leq\|x\|\,\|y\|,$$
with equality if and only if $x$ and $y$ are proportional.

Proof: The result is trivially true if either $x$ or $y$ is zero. Thus we suppose both are non-zero. We have
$$0\leq\|x-\lambda y\|^2=\|x\|^2-2\lambda x'y+\lambda^2\|y\|^2$$
for all $\lambda$. Thus the discriminant of this quadratic in $\lambda$ satisfies
$$(x'y)^2-\|x\|^2\|y\|^2\leq 0,$$
which is the required result. Equality occurs if and only if $x=\lambda y$ for some $\lambda$, i.e. if and only if $x$ and $y$ are proportional. QED
Young's Inequality

The AM-GM inequality is a very special case of Young's inequality. We derive it in a general form, using the coupling functions introduced by Moreau. Suppose $f$ is a real-valued function on $X$ and $g$ is a real-valued function on $X\otimes Y$, called the coupling function. Here $X$ and $Y$ are arbitrary sets. Define the $g$-conjugate of $f$ by
$$f^g(y)=\sup_{x\in X}\{g(x,y)-f(x)\}.$$
Then $f^g(y)\geq g(x,y)-f(x)$, and thus
$$f(x)+f^g(y)\geq g(x,y),$$
which is the generalized Young's inequality. We can also write this in the form that directly suggests minorization,
$$f(x)\geq g(x,y)-f^g(y).$$

The classical coupling function is $g(x,y)=xy$, with both $x$ and $y$ in the positive reals. If we take
$$f(x)=\frac{x^p}{p},$$
with $p>1$, then the supremum in the conjugate is attained for $x=y^{\frac{1}{p-1}}$, from which we find
$$f^g(y)=\frac{y^q}{q},$$
with $q$ such that $\frac1p+\frac1q=1$. Thus for all positive $x$ and $y$ we have
$$xy\leq\frac{x^p}{p}+\frac{y^q}{q},$$
with equality if and only if $y=x^{p-1}$. With $p=q=2$ this is the corollary of the AM-GM inequality in the previous section.
Fixed Point Problems and Methods

As we have emphasized before, the algorithms discussed in this book are all special cases of block relaxation methods. But block relaxation methods are often appropriately analyzed as fixed point methods, which define an even wider class of iterative methods. Thus we will not discuss actual fixed point algorithms that are not block relaxation methods, but we will use general results on fixed point methods to analyze block relaxation methods.

A (stationary, one-step) fixed point method on $X$ is defined as a map $A:X\rightarrow X$. Depending on the context we refer to $A$ as the update map or algorithmic map. Iterative sequences are generated by starting with $x^{(0)}$ and then setting $x^{(k+1)}=A(x^{(k)})$ for $k=0,1,\cdots$. Such a sequence is also called the Picard sequence generated by the map. If the sequence converges to, say, $x^{(\infty)}$, and if $A$ is continuous, then $x^{(\infty)}=A(x^{(\infty)})$, and thus $x^{(\infty)}$ is a fixed point of $A$ on $X$. The set of all $x$ such that $x=A(x)$ is called the fixed point set of $A$ on $X$.

The literature on fixed point methods is truly gigantic. There are textbooks, conferences, and dedicated journals. A nice and compact treatment, mostly on existence theorems for fixed points, is Smart [1974]. An excellent modern overview, concentrating on metrical fixed point theory and iterative computation, is Berinde [2007].

The first key result in fixed point theory is the Brouwer Fixed Point Theorem, which says that for compact convex $X$ and continuous $A$ there is at least one fixed point. The second is the Banach Fixed Point Theorem, which says that if $X$ is a non-empty complete metric space and $A$ is a contraction, i.e. $d(A(x),A(y))\leq\kappa\,d(x,y)$ for some $0\leq\kappa<1$, then the Picard sequence converges from any starting point $x^{(0)}$ to the unique fixed point of $A$ in $X$. Much of the fixed point literature is concerned with relaxing the contraction assumption and choosing more general spaces on which the various mappings are defined. I shall discuss some of the generalizations that we will use later in this book.

First, we can generalize to point-to-set maps $A:X\rightarrow 2^X$, where $2^X$ is the power set of $X$, i.e. the set of all subsets. Point-to-set maps are also called correspondences or multivalued maps. The Picard sequence is now defined by $x^{(k+1)}\in A(x^{(k)})$, and we have a fixed point if and only if $x\in A(x)$. The generalization of the Brouwer Fixed Point Theorem is the Kakutani Fixed Point Theorem. It assumes that $X$ is non-empty, compact and convex and that $A(x)$ is non-empty and convex for each $x$. In addition, the map must be closed or upper semi-continuous on $X$, i.e. whenever $x_n\rightarrow x$ and $y_n\rightarrow y$ with $y_n\in A(x_n)$ we have $y\in A(x)$. Under these conditions Kakutani's Theorem asserts the existence of a fixed point. Our discussion of the global convergence of block relaxation algorithms, in a later chapter, will be framed using fixed points of point-to-set maps, assuming the closedness of maps.

In another generalization of iterative algorithms we get rid of the one-step and the stationary assumptions. The iterative sequence is
$$x^{(k+1)}=A_k(x^{(0)},\cdots,x^{(k)}).$$
Thus the iterations have perfect memory, and the update map can change in each iteration. In an $r$-step method, memory is less than perfect, because the update is a function of only the previous $r$ elements in the sequence. Formally,
$$x^{(k+1)}=A_k(x^{(k-r+1)},\cdots,x^{(k)})$$
for $k\geq r-1$, with some special provisions for $k<r-1$. Any $r$-step method on $X$ can be rewritten as a one-step method on $X^r$. This makes it possible to limit our discussion to one-step methods. In fact, we will mostly discuss block relaxation methods, which are stationary one-step fixed point methods.

For non-stationary methods it is somewhat more complicated to define fixed points. In that case it is natural to define a set of desirable points or targets, which for stationary algorithms will generally, but not necessarily, coincide with the fixed points of $A$. The questions we will then have to answer are if and under what conditions our algorithms converge to desirable points, and if they converge, how fast the convergence will take place.
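A minimal R sketch (ours) of a stationary one-step fixed point method: it generates the Picard sequence of an update map until successive iterates are close. The function name `picard` and the stopping rule are our choices; the example map is the Newton square root iteration discussed in the Cobweb Plots section.

```r
# Generate the Picard sequence x_{k+1} = A(x_k) until convergence or maxit iterations.
picard <- function(A, x0, eps = 1e-10, maxit = 100) {
  x <- x0
  for (k in 1:maxit) {
    xnew <- A(x)
    if (max(abs(xnew - x)) < eps) return(list(fixed_point = xnew, iterations = k))
    x <- xnew
  }
  list(fixed_point = x, iterations = maxit)
}

# Example: Newton's iteration for the square root of 2.
picard(function(x) (x + 2 / x) / 2, x0 = 1)
```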
Subsequential Limits
Differentiable Convex Functions

If a function $f$ attains its minimum on a convex set $X$ at $x$, and $f$ is differentiable at $x$, then $\langle\nabla f(x),y-x\rangle\geq 0$ for all $y\in X$. If $f$ attains its minimum on all of $\mathbb{R}^n$ at $x$, and is differentiable at $x$, then $\nabla f(x)=0$. Or, more precisely, if $f$ is only differentiable from the right at $x$, then $df(x;y-x)\geq 0$ for all $y\in X$.

Suppose $X$ is the unit ball and a differentiable $f$ attains its minimum over $X$ at $x$. Then
$$\langle\nabla f(x),y-x\rangle\geq 0$$
for all $y$ with $\|y\|\leq 1$. By Cauchy-Schwarz this means that $\langle\nabla f(x),x\rangle=-\|\nabla f(x)\|$. This is true if and only if $\nabla f(x)=\lambda x$, with $\lambda\leq 0$.

As an aside, if a differentiable function $f$ attains its minimum on the unit sphere at $x$, then the function $y\mapsto f(y/\|y\|)$ attains its minimum over all $y\neq 0$ at $x$. Setting the derivative equal to zero shows that we must have $(I-xx')\nabla f(x)=0$, which again translates to $\nabla f(x)=\lambda x$, with $\lambda=x'\nabla f(x)$.
Composition
Rates of Convergence

The basic result we use is due to Perron and Ostrowski [Ostrowski, 1966].

Theorem: If the iterative algorithm $x^{(k+1)}=A(x^{(k)})$ converges to $x_\infty$, if $A$ is differentiable at $x_\infty$, and if $\rho=\rho(\mathcal{D}A(x_\infty))<1$, then the algorithm is linearly convergent with rate $\rho$.

Proof: Expanding $A$ around $x_\infty$ gives $x^{(k+1)}-x_\infty=\mathcal{D}A(x_\infty)(x^{(k)}-x_\infty)+o(\|x^{(k)}-x_\infty\|)$, from which the result follows. QED

The rate $\rho$ in the theorem is the spectral radius, i.e. the modulus of the largest eigenvalue of the derivative. Let us call the derivative of $A$ at $x_\infty$ the iteration matrix, and write it as $\mathcal{M}(x_\infty)=\mathcal{D}A(x_\infty)$. In general block relaxation methods have linear convergence, and the linear convergence can be quite slow. In cases where the accumulation points are a continuum we have sublinear rates. The same things can be true if the local minimum is not strict, or if we are converging to a saddle point.

To do: generalization to non-differentiable maps, points of attraction and repulsion, superlinear convergence, and so on.
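An empirical check (ours) of the theorem: the ratios of successive errors of a Picard sequence approach the modulus of the derivative of the update map at the fixed point. The contraction used here is an arbitrary illustration.

```r
# Estimate the linear convergence rate of A(x) = cos(x) from its Picard sequence.
A <- function(x) cos(x)
x <- 1
iter <- numeric(50)
for (k in 1:50) { x <- A(x); iter[k] <- x }
xinf <- iter[50]                      # (numerically) the fixed point
err  <- abs(iter[1:40] - xinf)
tail(err[-1] / err[-40], 5)           # error ratios approach the rate
abs(sin(xinf))                        # |A'(x_inf)|, the theoretical linear rate
```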
Over- and Under-Relaxation
Acceleration of Convergence of Fixed Point Methods
Matrix Algebra
Eigenvalues and Eigenvectors of Symmetric Matrices

In this section we give a fairly complete introduction to eigenvalue problems and generalized eigenvalue problems. We use a constructive variational approach, basically using the Rayleigh quotient and deflation. This works best for positive semi-definite matrices, but after dealing with those we discuss several generalizations.

Suppose $A$ is a positive semi-definite matrix of order $n$. Consider the problem of maximizing the quadratic form $x'Ax$ on the sphere $\{x:x'x=1\}$. At the maximum, which is always attained, we have $Ax=\lambda x$, with $\lambda$ a Lagrange multiplier, as well as $x'Ax=\lambda$. It follows that $\lambda\geq 0$. Note that the maximum is not necessarily attained at a unique $x$. Also, the maximum is zero if and only if $A$ is zero.

Any pair $(x,\lambda)$ such that $Ax=\lambda x$ and $x'x=1$ is called an eigen-pair of $A$. The members of the pair are the eigenvector $x$ and the corresponding eigenvalue $\lambda$.

Result 1: Suppose $(x,\lambda)$ and $(y,\mu)$ are two eigen-pairs, with $\lambda\neq\mu$. Then premultiplying both sides of $Ax=\lambda x$ by $y'$ gives $\mu y'x=\lambda y'x$, and thus $y'x=0$. This shows that $A$ cannot have more than $n$ distinct eigenvalues. If there were $m>n$ distinct eigenvalues, then the $n\times m$ matrix which has the corresponding eigenvectors as columns would have column-rank $m$ and row-rank at most $n$, which is impossible. In words: one cannot have more than $n$ orthonormal vectors in $n$-dimensional space. Suppose the distinct eigenvalues are $\lambda_1>\cdots>\lambda_p$, with $p\leq n$. Thus each of the eigenvalues is equal to one of the $\lambda_s$.

Result 2: If $(x,\lambda)$ and $(y,\lambda)$ are two eigen-pairs with the same eigenvalue, then any linear combination $\alpha x+\beta y$, suitably normalized, is also an eigenvector with eigenvalue $\lambda$. Thus the eigenvectors corresponding with an eigenvalue $\lambda_s$ form a linear subspace of $\mathbb{R}^n$ with dimension, say, $n_s$. This subspace can be given an orthonormal basis, collected in an $n\times n_s$ matrix $K_s$. The number $n_s$ is the multiplicity of $\lambda_s$.

Of course these results are only useful if eigen-pairs exist. We have shown that at least one eigen-pair exists, the one corresponding to the maximum of $x'Ax$ on the sphere. We now give a procedure to compute additional eigen-pairs. Consider the following algorithm for generating a sequence $A_1,A_2,\cdots$ of matrices. We start with $A_1=A$ and $s=1$.

1. Test: If $A_s=0$ stop.
2. Maximize: Compute the maximum of $x'A_sx$ over $x'x=1$. Suppose this is attained at an eigen-pair $(x_s,\lambda_s)$. If the maximizer is not unique, select an arbitrary one.
3. Orthogonalize: Replace $x_s$ by its projection on the orthogonal complement of $x_1,\cdots,x_{s-1}$, normalized to unit length.
4. Deflate: Set $A_{s+1}=A_s-\lambda_sx_sx_s'$.
5. Update: Go back to step 1 with $s$ replaced by $s+1$.

If $s=1$ then in step (2) we compute the largest eigenvalue of $A$ and a corresponding eigenvector. In that case there is no step (3). Step (4) constructs $A_2$ by deflation, which basically removes the contribution of the largest eigenvalue and corresponding eigenvector. If $y$ is an eigenvector of $A_1$ with eigenvalue $\mu\neq\lambda_1$, then $y'x_1=0$ by result (1) above, so $A_2y=A_1y-\lambda_1x_1(x_1'y)=\mu y$, and $y$ is an eigenvector of $A_2$ with the same eigenvalue $\mu$. Also, of course, $A_2x_1=\lambda_1x_1-\lambda_1x_1=0$, so $x_1$ is an eigenvector of $A_2$ with eigenvalue zero. If $\lambda_1$ has multiplicity larger than one, then by result (2) we can choose eigenvectors with eigenvalue $\lambda_1$ that are orthogonal to $x_1$, and these remain eigenvectors of $A_2$ with eigenvalue $\lambda_1$. We see that $A_2$ has the same eigenvectors as $A_1$, with the same multiplicities, except for $\lambda_1$, which now has its old multiplicity minus one, and zero, which now has its old multiplicity plus one.

Now if $(x_2,\lambda_2)$ is an eigen-pair of $A_2$, with $\lambda_2$ the largest eigenvalue of $A_2$, which is an eigenvalue of $A_1$ as well, then $x_2$ is automatically orthogonal to $x_1$, because $x_1$ is an eigenvector of $A_2$ with eigenvalue zero. Thus step (3) is not ever necessary, although it will lead to more precise numerical computation.

Following the steps of the algorithm we see that it defines orthonormal matrices $K_s=\begin{bmatrix}x_1&\cdots&x_s\end{bmatrix}$, which moreover satisfy
$$A_{s+1}=(I-K_sK_s')A(I-K_sK_s'),$$
where $I-K_sK_s'$ is the projector on the orthogonal complement of the space spanned by $x_1,\cdots,x_s$. Also
$$A=\sum_{t=1}^s\lambda_tx_tx_t'+A_{s+1}.$$
Our algorithm stops when $A_{r+1}=0$, which is the same as
$$A=\sum_{s=1}^r\lambda_sx_sx_s'.$$
This is the eigen decomposition or the spectral decomposition of a positive semi-definite $A$. If $r<n$ then the minimum eigenvalue is zero, and has multiplicity $n-r$, and $I-K_rK_r'$ is the orthogonal projector on the null-space of $A$. Using a square orthonormal $K$, obtained by completing $K_r$ with an orthonormal basis of the null-space of $A$, we can write the eigen decomposition in the form
$$A=K\Lambda K',$$
where the last $n-r$ diagonal elements of $\Lambda$ are zero. Equation $A=K\Lambda K'$ can also be written as
$$K'AK=\Lambda,$$
which says that the eigenvectors diagonalize $A$, and that $A$ is orthonormally similar to the diagonal matrix of eigenvalues $\Lambda$.

We have shown that the largest eigenvalue and corresponding eigenvector exist, but we have not indicated, at least in this section, how to compute them. Conceptually the power method is the most obvious way. It is a tangential minorization method, using the inequality $x'Ax\geq 2x'Ay-y'Ay$, which means that the iteration function is
$$x^{(k+1)}=\frac{Ax^{(k)}}{\|Ax^{(k)}\|}.$$
See the Rayleigh Quotient section for further details.

We now discuss a first easy generalization. If $A$ is real and symmetric, but not necessarily positive semi-definite, then we can apply our previous results to the matrix $A+\kappa I$, which is positive semi-definite for a large enough $\kappa$, and which has the same eigenvectors as $A$ with eigenvalues shifted by $\kappa$. Or we can modify the algorithm: if we run into an $A_s$ with maximum eigenvalue equal to zero, we switch to finding the smallest eigenvalues, which will be negative. No matter how we modify the constructive procedure, we will still find an eigen decomposition of the same form $A=K\Lambda K'$ and $K'AK=\Lambda$ as in the positive semi-definite case.

The second generalization, also easy, is the generalized eigenvalue problem for a pair of real symmetric matrices $(A,B)$. We now maximize $x'Ax$ over all $x$ satisfying $x'Bx=1$. In data analysis, and the optimization problems associated with it, we almost invariably assume that $B$ is positive definite. In fact we might as well make the weaker assumption that $B$ is positive semi-definite, and that $Ax=0$ for all $x$ such that $Bx=0$. Suppose $B=K\Lambda K'$ is an eigen decomposition of $B$, where $\Lambda$ has the positive eigenvalues of $B$ on the diagonal and $K$ has the corresponding eigenvectors as columns. Change variables by writing $x$ as
$$x=K\Lambda^{-\frac12}y+K_\perp z,$$
with $K_\perp$ an orthonormal basis for the null-space of $B$. Then $x'Bx=y'y$ and $x'Ax=y'\Lambda^{-\frac12}K'AK\Lambda^{-\frac12}y$. We can find the generalized eigenvalues and eigenvectors from the ordinary eigen decomposition of $\Lambda^{-\frac12}K'AK\Lambda^{-\frac12}$. This defines the $y$ part of $x$, and the choice of $z$ is completely arbitrary. Now suppose $V$ is the square orthonormal matrix of eigenvectors diagonalizing $\Lambda^{-\frac12}K'AK\Lambda^{-\frac12}$, with $\Phi$ the corresponding diagonal matrix of eigenvalues, and define $X=K\Lambda^{-\frac12}V$. Then $X'AX=\Phi$ and $X'BX=I$. Thus $X$ diagonalizes both $A$ and $B$. For the more general case, in which we do not assume that $Ax=0$ for all $x$ with $Bx=0$, we refer to De Leeuw [1982].
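A compact R sketch (ours) of the constructive approach in this section: the power method computes a dominant eigen-pair, and deflation removes it so that the next pair can be computed, until the deflated matrix vanishes. Function names and tolerances are our choices.

```r
# Power method for the dominant eigen-pair of a psd matrix A.
power_method <- function(A, eps = 1e-12, maxit = 10000) {
  x <- rnorm(nrow(A))
  for (k in 1:maxit) {
    y <- drop(A %*% x)
    xnew <- y / sqrt(sum(y^2))
    if (max(abs(xnew - x)) < eps) break
    x <- xnew
  }
  list(value = drop(crossprod(x, A %*% x)), vector = x)
}

# Eigen decomposition of a psd matrix by repeated power iterations and deflation.
eigen_deflate <- function(A, eps = 1e-8) {
  values <- numeric(0); vectors <- NULL
  while (max(abs(A)) > eps) {
    ep <- power_method(A)
    values  <- c(values, ep$value)
    vectors <- cbind(vectors, ep$vector)
    A <- A - ep$value * tcrossprod(ep$vector)   # deflation step
  }
  list(values = values, vectors = vectors)
}

A <- crossprod(matrix(rnorm(20), 5, 4))         # a psd matrix of order 4
eigen_deflate(A)$values
eigen(A)$values                                 # compare with the LAPACK result
```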
Singular Values and Singular Vectors

Suppose $A$ is an $n\times m$ matrix, $B$ is an $n\times n$ symmetric matrix, and $C$ is an $m\times m$ symmetric matrix. Define
$$\lambda(x,y)=\frac{x'Ay}{\sqrt{(x'Bx)(y'Cy)}}.$$
Consider the problem of finding the maximum, the minimum, and other stationary values of $\lambda$. In order to make the problem well-defined and interesting we suppose that the symmetric partitioned matrix
$$\begin{bmatrix}B&A\\A'&C\end{bmatrix}$$
is positive semi-definite. This has some desirable consequences.

Proposition: Suppose the symmetric partitioned matrix above is positive semi-definite. Then both $B$ and $C$ are positive semi-definite, for all $x$ with $Bx=0$ we have $A'x=0$, and for all $y$ with $Cy=0$ we have $Ay=0$.

Proof: The first assertion is trivial. To prove the last two, consider the convex quadratic form
$$q(y)=x'Bx+2x'Ay+y'Cy$$
as a function of $y$ for fixed $x$. It is bounded below by zero, and thus attains its minimum. At this minimum, which is attained at some $\hat y$, the derivative vanishes and we have $C\hat y=-A'x$, and thus $q(\hat y)=x'Bx+x'A\hat y\geq 0$. If $Bx=0$ then $x'A\hat y\geq 0$, but also $x'A\hat y=-\hat y'C\hat y\leq 0$, because $C$ is positive semi-definite. Thus if $Bx=0$ we have $\hat y'C\hat y=0$, which is true if and only if $C\hat y=0$, and consequently $A'x=-C\hat y=0$. The same argument, with the roles of $x$ and $y$ interchanged, proves the last assertion. QED

Now suppose
$$B=K_B\Lambda_B^2K_B'\qquad\text{and}\qquad C=K_C\Lambda_C^2K_C'$$
are the eigen-decompositions of $B$ and $C$, restricted to the positive eigenvalues. The matrices $\Lambda_B^2$ and $\Lambda_C^2$ have positive diagonal elements, and the numbers of columns of $K_B$ and $K_C$ are the ranks of $B$ and $C$.

Define new variables $u=\Lambda_BK_B'x$ and $v=\Lambda_CK_C'y$. Then
$$\lambda=\frac{u'Ev}{\sqrt{(u'u)(v'v)}},\qquad\text{with}\qquad E=\Lambda_B^{-1}K_B'AK_C\Lambda_C^{-1},$$
which does not depend on the components of $x$ and $y$ in the null spaces of $B$ and $C$ at all. Thus we can just consider $\lambda$ as a function of $u$ and $v$, study its stationary values, and then translate back to $x$ and $y$, choosing the null space components completely arbitrarily.

The stationary equations we have to solve are
$$Ev=\lambda u,\qquad E'u=\lambda v,$$
where $\lambda$ is a Lagrange multiplier, and we identify $u$ and $v$ by $u'u=v'v=1$. It follows that
$$E'Ev=\lambda^2v$$
and also
$$EE'u=\lambda^2u.$$
Thus the stationary values of $\lambda$ are the singular values of $E$, and the stationary $u$ and $v$ are the corresponding left and right singular vectors.
Canonical Correlation

Suppose $X$ is an $n\times p$ matrix and $Y$ is an $n\times q$ matrix. The cosine of the angle between two linear combinations $Xa$ and $Yb$ is
$$\rho(a,b)=\frac{a'X'Yb}{\sqrt{(a'X'Xa)(b'Y'Yb)}}.$$
Consider the problem of finding the maximum, the minimum, and possible other stationary values of $\rho$. The solution can be described in terms of two non-singular transformations. Specifically, there exists a non-singular $S$ of order $p$ and a non-singular $T$ of order $q$ such that $S'X'XS$ and $T'Y'YT$ are diagonal, with the leading diagonal elements equal to one and all other elements zero, and such that $S'X'YT$ is a matrix with the non-zero canonical correlations in non-increasing order along the diagonal and zeroes everywhere else.

http://en.wikipedia.org/wiki/Principal_angles
http://meyer.math.ncsu.edu/Meyer/PS_Files/AnglesBetweenCompSubspaces.pdf
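A short R illustration (ours): the canonical correlations of two data matrices computed with the base R function cancor(), and again as the singular values of $Q_X'Q_Y$, where $Q_X$ and $Q_Y$ are orthonormal bases for the centered column spaces (the cosines of the principal angles mentioned in the links above). The data are simulated for the example.

```r
set.seed(1)
X <- matrix(rnorm(100 * 3), 100, 3)
Y <- matrix(rnorm(100 * 4), 100, 4)

# Base R canonical correlation analysis.
cancor(X, Y)$cor

# The same canonical correlations as singular values of Qx' Qy.
Qx <- qr.Q(qr(scale(X, scale = FALSE)))
Qy <- qr.Q(qr(scale(Y, scale = FALSE)))
svd(crossprod(Qx, Qy))$d
```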
Eigenvalues and Eigenvectors of Asymmetric Matrices

If $A$ is a square but asymmetric real matrix the eigenvector-eigenvalue situation becomes quite different from the symmetric case. We gave a variational treatment of the symmetric case, using the connection between eigenvalue problems and quadratic forms (or ellipses and other conic sections, if you have a geometric mind). That connection, however, is lost in the asymmetric case, and there is no obvious variational problem associated with eigenvalues and eigenvectors.

Let us first define eigenvalues and eigenvectors in the asymmetric case. As before, an eigen-pair $(x,\lambda)$ is a solution to the equation $Ax=\lambda x$ with $x\neq 0$. This can also be written as $(A-\lambda I)x=0$, which shows that the eigenvalues are the solutions of the equation $\det(A-\lambda I)=0$. Now the function $\pi(\lambda)=\det(A-\lambda I)$ is the characteristic polynomial of $A$. It is a polynomial of degree $n$, and by the fundamental theorem of algebra there are $n$ real and complex roots, counting multiplicities. Thus $A$ has $n$ eigenvalues, as before, although some of them can be complex.

A first indication that something may be wrong, or at least fundamentally different, is the matrix
$$A=\begin{bmatrix}0&1\\0&0\end{bmatrix}.$$
The characteristic equation $\lambda^2=0$ has the root $\lambda=0$, with multiplicity 2. Thus an eigenvector should satisfy $Ax=0$, which merely says $x_2=0$. Thus $A$ does not have two linearly independent, let alone orthogonal, eigenvectors.

A second problem is illustrated by the anti-symmetric matrix
$$A=\begin{bmatrix}0&1\\-1&0\end{bmatrix},$$
for which the characteristic polynomial is $\lambda^2+1$. The characteristic equation has the two complex roots $+i$ and $-i$. The corresponding eigenvectors are the columns of
$$\begin{bmatrix}1&1\\i&-i\end{bmatrix}.$$
Thus both eigenvalues and eigenvectors may be complex. In fact if we take complex conjugates on both sides of $Ax=\lambda x$, and remember that $A$ is real, we see that $(x,\lambda)$ is an eigen-pair if and only if $(\overline{x},\overline{\lambda})$ is. If $A$ is real and of odd order it always has at least one real eigenvalue. If an eigenvalue $\lambda$ is real, then the corresponding eigenvectors can be chosen to be real and linearly independent: they are simply a basis for the null space of $A-\lambda I$.

A third problem, which by definition did not come up in the symmetric case, is that we now have an eigen problem for both $A$ and its transpose $A'$. Since for all $\lambda$ we have $\det(A-\lambda I)=\det(A'-\lambda I)$, it follows that $A$ and $A'$ have the same eigenvalues. We say that $(x,\lambda)$ is a right eigen-pair of $A$ if $Ax=\lambda x$, and $(y,\lambda)$ is a left eigen-pair of $A$ if $y'A=\lambda y'$, which is of course the same as $A'y=\lambda y$.

A matrix $A$ is diagonalizable if there exists a non-singular $X$ such that $X^{-1}AX=\Lambda$, with $\Lambda$ diagonal. Instead of the spectral decomposition of symmetric matrices we have the decomposition $A=X\Lambda X^{-1}$, or $AX=X\Lambda$. A matrix that is not diagonalizable is called defective.

Result: A matrix is diagonalizable if and only if it has $n$ linearly independent right eigenvectors, if and only if it has $n$ linearly independent left eigenvectors. We show this for right eigenvectors. Collect them in the columns of a matrix $X$. Thus $AX=X\Lambda$, with $X$ non-singular. This implies $X^{-1}A=\Lambda X^{-1}$, and thus the rows of $X^{-1}$ are $n$ linearly independent left eigenvectors. Also $A=X\Lambda X^{-1}$ and $X^{-1}AX=\Lambda$. Conversely, if $X^{-1}AX=\Lambda$ then $AX=X\Lambda$ and $X^{-1}A=\Lambda X^{-1}$, so we have $n$ linearly independent left and right eigenvectors.

Result: If the eigenvalues $\lambda_1,\cdots,\lambda_n$ of $A$ are all different then the eigenvectors $x_1,\cdots,x_n$ are linearly independent. We show this by contradiction. Suppose the eigenvectors are linearly dependent, and select a maximally linearly independent subset from the $x_i$. Suppose there are $r<n$ of them; without loss of generality the maximally linearly independent subset can be taken as the first $r$. Then for all $j>r$ there exist $\alpha_{ij}$ such that
$$x_j=\sum_{i=1}^r\alpha_{ij}x_i.$$
Premultiplying by $A$ gives
$$\lambda_jx_j=\sum_{i=1}^r\alpha_{ij}\lambda_ix_i.$$
Premultiplying the first equation by $\lambda_j$ gives
$$\lambda_jx_j=\sum_{i=1}^r\alpha_{ij}\lambda_jx_i.$$
Subtracting the two gives
$$\sum_{i=1}^r\alpha_{ij}(\lambda_i-\lambda_j)x_i=0,$$
which implies that $\alpha_{ij}(\lambda_i-\lambda_j)=0$ for all $i$, because the $x_1,\cdots,x_r$ are linearly independent. Since the eigenvalues are unequal, this implies that the $\alpha_{ij}$ are zero, and thus $x_j=0$, contradicting that $x_j$ is an eigenvector. Thus the $x_1,\cdots,x_n$ are linearly independent.

Note 030615: Add a small amount on defective matrices. Add stuff on characteristic and minimal polynomials. Talk about using the SVD instead.
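The two small matrices used above can be checked directly in R (our illustration): eigen() reports the repeated zero eigenvalue with numerically dependent eigenvector columns for the first matrix, and complex conjugate eigenvalues and eigenvectors for the anti-symmetric one.

```r
# The defective matrix: eigenvalue 0 with multiplicity 2; the eigenvector columns
# returned for the repeated eigenvalue are (numerically) linearly dependent.
A <- matrix(c(0, 0, 1, 0), 2, 2)     # rows: (0, 1) and (0, 0)
eigen(A)

# The anti-symmetric matrix: complex conjugate eigenvalues +i and -i.
B <- matrix(c(0, -1, 1, 0), 2, 2)    # rows: (0, 1) and (-1, 0)
eigen(B)
```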
Modified Eigenvalue Problems

Suppose we know an eigen decomposition $A=K\Lambda K'$ of a real symmetric matrix $A$ of order $n$, and we want to find an eigen decomposition of the rank-one modification
$$\overline A=A+\theta bb',$$
with $\theta\neq 0$. The problem was first discussed systematically by Golub [1973]. Also see Bunch, Nielsen, and Sorensen [1978] for a more detailed treatment and implementation.

Eigen-pairs $(y,\mu)$ of $\overline A$ must satisfy
$$(A+\theta bb')y=\mu y.$$
Change variables to $z=K'y$ and define $c=K'b$. For the time being suppose all elements of $c$ are non-zero and all diagonal elements of $\Lambda$ are different, with $\lambda_1>\lambda_2>\cdots>\lambda_n$. We must solve
$$(\Lambda+\theta cc')z=\mu z,$$
which we can also write as
$$(\Lambda-\mu I)z=-\theta(c'z)c.$$
Suppose $(z,\mu)$ is a solution with $c'z=0$. Then $\Lambda z=\mu z$, and because all $\lambda_i$ are different $z$ must be a vector with a single non-zero element, say element $i$. But then $c'z=c_iz_i$, not equal to zero. Thus $c'z$ is non-zero at a solution, and because eigenvectors are determined up to a scalar factor we may as well require $c'z=1$.

Now solve
$$(\Lambda-\mu I)z=-\theta c.$$
At a solution we must have $\mu\neq\lambda_i$ for all $i$, because otherwise $c_i$ would be zero. Thus
$$z=-\theta(\Lambda-\mu I)^{-1}c,$$
and we can find $\mu$ by solving $c'z=1$. If we define
$$f(\mu)=\sum_{i=1}^n\frac{c_i^2}{\lambda_i-\mu},$$
then we must solve $f(\mu)=-\frac{1}{\theta}$. Let's first look at a particular example.

Figure 1: Linear Secular Equation

We have
$$f'(\mu)=\sum_{i=1}^n\frac{c_i^2}{(\lambda_i-\mu)^2}>0$$
for all $\mu$, and there are vertical asymptotes at all $\lambda_i$. Between two consecutive asymptotes the function increases from $-\infty$ to $+\infty$. For $\mu>\lambda_1$ the function increases from $-\infty$ to 0, and for $\mu<\lambda_n$ it increases from 0 to $+\infty$. Thus the equation $f(\mu)=-\frac{1}{\theta}$ has one solution in each of the $n-1$ open intervals between the $\lambda_i$, and if $\theta>0$ it has a solution larger than $\lambda_1$. If $\theta<0$ it has an additional solution smaller than $\lambda_n$. If $\theta>0$ the new eigenvalues $\mu_i$ interlace the old ones as
$$\mu_1>\lambda_1>\mu_2>\lambda_2>\cdots>\mu_n>\lambda_n,$$
and if $\theta<0$ then
$$\lambda_1>\mu_1>\lambda_2>\mu_2>\cdots>\lambda_n>\mu_n.$$
Finding the actual eigenvalues in their intervals can be done with any root-finding method. Of course some will be better than others for solving this particular problem. See Melman [1995], [1997], [1998] for suggestions and comparisons.

We still have to deal with the assumptions that the elements of $c$ are non-zero and that all $\lambda_i$ are different. Suppose some elements of $c$ are zero; without loss of generality they can be the last ones. Partition $\Lambda$ and $c$ accordingly. Then we need to solve the modified eigen-problem for
$$\begin{bmatrix}\Lambda_1+\theta c_1c_1'&0\\0&\Lambda_2\end{bmatrix}.$$
But this is a direct sum of smaller matrices, and the eigenvalue problems for $\Lambda_1+\theta c_1c_1'$ and $\Lambda_2$ can be solved separately.

If not all $\lambda_i$ are different we can partition the matrix into blocks corresponding with the, say, $p$ different eigenvalues. Now use square orthonormal matrices $Q_s$, of order equal to the multiplicity $n_s$ of $\lambda_s$, which have their first column equal to $c_s/\|c_s\|$, where $c_s$ is the corresponding subvector of $c$. Form the direct sum $Q$ of the $Q_s$ and compute $Q'(\Lambda+\theta cc')Q$. This gives
$$\Lambda+\theta dd',$$
where $d$ is the direct sum of the vectors $\|c_s\|e_1$, with the $e_1$ unit vectors, i.e. vectors that are zero except for element one, which is one. A row and column permutation makes the matrix a direct sum of the diagonal matrices $\lambda_sI$ of order $n_s-1$ and the $p\times p$ matrix
$$\tilde\Lambda+\theta\tilde c\tilde c',$$
with $\tilde\Lambda$ the diagonal matrix of the $p$ distinct eigenvalues and $\tilde c$ the vector with elements $\|c_s\|$. This last matrix satisfies our assumptions of different diagonal elements and non-zero elements of $\tilde c$, and consequently can be analyzed by using our previous results. A very similar analysis is possible for the modified singular value decomposition, for which we refer to Bunch and Nielsen [1978].
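The R sketch below (ours) illustrates the secular equation numerically: for a rank-one modification $A+\theta bb'$ the roots of the secular equation in the intervals described above reproduce the eigenvalues computed by eigen(). The bracketing offsets and variable names are choices made for this illustration.

```r
set.seed(2)
A <- crossprod(matrix(rnorm(25), 5, 5))     # symmetric matrix with distinct eigenvalues
b <- rnorm(5)
theta <- 0.7

e <- eigen(A, symmetric = TRUE)
lambda <- e$values                          # lambda_1 > ... > lambda_n
cc <- drop(crossprod(e$vectors, b))         # c = K'b

# Roots of this function are the eigenvalues of A + theta * b b'.
secular <- function(mu) 1 + theta * sum(cc^2 / (lambda - mu))

# One root in each open interval between consecutive lambda's, plus one above lambda_1.
roots <- sapply(1:4, function(i)
  uniroot(secular, c(lambda[i + 1] + 1e-8, lambda[i] - 1e-8), tol = 1e-12)$root)
top <- uniroot(secular, c(lambda[1] + 1e-8, lambda[1] + theta * sum(cc^2) + 1),
               tol = 1e-12)$root

sort(c(roots, top), decreasing = TRUE)
eigen(A + theta * tcrossprod(b), symmetric = TRUE, only.values = TRUE)$values
```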
Quadratics on a Sphere

Another problem naturally leading to a different secular equation is finding the stationary values of a quadratic function defined by
$$f(x)=\frac12x'Ax-b'x$$
on the unit sphere $\{x:x'x=1\}$. This was first studied by Forsythe and Golub [1965]. Their treatment was subsequently simplified and extended by Spjøtvoll [1972] and Gander [1981]. The problem has recently received some attention because of the development of trust region methods for optimization, and, indeed, because of Nesterov majorization.

The stationary equations are
$$(A-\mu I)x=b,\qquad x'x=1,$$
with $\mu$ a Lagrange multiplier. Suppose $A=K\Lambda K'$, with the $\lambda_i$ the diagonal elements of $\Lambda$, change variables to $y=K'x$, and define $c=K'b$. Then we must solve
$$(\Lambda-\mu I)y=c,\qquad y'y=1.$$
Assume for now that the elements of $c$ are non-zero. Then $\mu$ cannot be equal to one of the $\lambda_i$. Thus
$$y=(\Lambda-\mu I)^{-1}c,$$
and we must have $h(\mu)=1$, where
$$h(\mu)=\sum_{i=1}^n\frac{c_i^2}{(\lambda_i-\mu)^2}.$$
Again, let's look at an example of a particular case. The plots in Figure 1 show both $h$ and $f$. We see that $h(\mu)=1$ has 12 solutions in this example, so the remaining question is which one corresponds with the minimum of $f$.

Figure 1: Quadratic Secular Equation

Again $h$ has vertical asymptotes at the $\lambda_i$. Between two asymptotes $h$ decreases from $+\infty$ to a minimum, and then increases again to $+\infty$. Note that
$$h'(\mu)=2\sum_{i=1}^n\frac{c_i^2}{(\lambda_i-\mu)^3}$$
and
$$h''(\mu)=6\sum_{i=1}^n\frac{c_i^2}{(\lambda_i-\mu)^4}>0,$$
and thus $h$ is convex in each of the intervals between asymptotes. Also $h$ is convex and increasing from zero to $+\infty$ on $(-\infty,\lambda_{\min})$ and convex and decreasing from $+\infty$ to zero on $(\lambda_{\max},+\infty)$.
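A brute-force R check (ours) for a small example: scan for the roots of the secular equation $h(\mu)=1$, reconstruct the corresponding stationary points on the sphere, and pick the one with the smallest function value. The grid scan is crude and only meant to illustrate the structure of the problem.

```r
set.seed(3)
A <- crossprod(matrix(rnorm(16), 4, 4))
b <- rnorm(4)
e <- eigen(A, symmetric = TRUE)
lambda <- e$values
cc <- drop(crossprod(e$vectors, b))

f <- function(x) 0.5 * sum(x * (A %*% x)) - sum(b * x)   # the quadratic to minimize
h <- function(mu) sum(cc^2 / (lambda - mu)^2) - 1         # secular function, roots at h = 0

# Scan a fine grid for sign changes of h and polish each root with uniroot().
grid <- seq(min(lambda) - 10, max(lambda) + 10, length.out = 50001)
hg   <- sapply(grid, h)
sc   <- which(hg[-1] * hg[-length(hg)] < 0)
mus  <- sapply(sc, function(i) uniroot(h, grid[c(i, i + 1)])$root)

# Stationary points x(mu) = (A - mu I)^{-1} b; the minimizer has the smallest f value.
xs   <- sapply(mus, function(mu) solve(A - mu * diag(4), b))
best <- which.min(apply(xs, 2, f))
list(mu = mus[best], x = xs[, best], value = f(xs[, best]))
```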
Generalized Inverses
Partitioned Matrices
Matrix Differential Calculus
Matrix Derivatives

A matrix, of course, is just an element of a finite-dimensional linear vector space. We write $A\in\mathbb{R}^{n\times m}$, and we use the inner product $\langle A,B\rangle=\mathbf{tr}\ A'B$ and the corresponding norm $\|A\|=\sqrt{\mathbf{tr}\ A'A}$. Thus derivatives of real-valued functions of matrices, or derivatives of matrix-valued functions of matrices, are covered by the usual definitions and formulas. Nevertheless there is a surprisingly huge literature on differential calculus for real-valued functions of matrices, and matrix-valued functions of matrices. One of the reasons for the proliferation of publications is that a matrix-valued function of matrices can be thought of as a function from matrix space $\mathbb{R}^{n\times m}$ to matrix space $\mathbb{R}^{p\times q}$, but also as a function from vector space $\mathbb{R}^{nm}$ to vector space $\mathbb{R}^{pq}$. There are obvious isomorphisms between the two representations, but they naturally lead to different notations. We will consistently choose the matrix-space formulation, and consequently minimize the role of the $\mathbf{vec}$ operator and of special constructs such as the commutation and duplication matrix. Nevertheless having a compendium of the standard real-valued and matrix-valued functions available is of some interest. The main reference is the book by Magnus and Neudecker [1999]. We will avoid using differentials and the $\mathbf{vec}$ operator.

Suppose $F$ is a matrix-valued function of a single variable $x$. In other words $F(x)$ is a matrix of functions, as in
$$F(x)=\begin{bmatrix}f_{11}(x)&\cdots&f_{1m}(x)\\\vdots&&\vdots\\f_{n1}(x)&\cdots&f_{nm}(x)\end{bmatrix}.$$
Now the derivatives of any order of $F$, if they exist, are also matrix-valued functions, with
$$\{\mathcal{D}^kF(x)\}_{ij}=\mathcal{D}^kf_{ij}(x).$$
If $F$ is a function of a vector $x$ then partial derivatives are defined similarly, as in
$$\{\mathcal{D}_sF(x)\}_{ij}=\frac{\partial f_{ij}(x)}{\partial x_s},$$
with higher-order partials defined in the same way. The notation becomes slightly more complicated if $F$ is a function of a matrix $X$, i.e. an element of $\mathbb{R}^{n\times m}$. It then makes sense to write the partials as $\mathcal{D}_{st}F(X)$, with
$$\{\mathcal{D}_{st}F(X)\}_{ij}=\frac{\partial f_{ij}(X)}{\partial x_{st}}.$$
Derivatives of Eigenvalues and Eigenvectors

This appendix summarizes some of the results in De Leeuw [2007], De Leeuw [2008], and De Leeuw and Sorenson [2012]. We refer to those reports for more extensive calculations and applications.

Suppose $A$ and $B$ are two real symmetric matrices depending smoothly on a real parameter $\theta$. The notation below suppresses the dependence on $\theta$ of the various quantities we talk about, but it is important to remember that all eigenvalues and eigenvectors we talk about are functions of $\theta$. The generalized eigenvalue $\lambda$ and the corresponding generalized eigenvector $x$ are defined implicitly by
$$Ax=\lambda Bx.$$
Moreover the eigenvector is identified by
$$x'Bx=1.$$
We suppose that in a neighborhood of $\theta$ the eigenvalue $\lambda$ is unique and $B$ is positive definite. A precise discussion of the required assumptions is, for example, in Wilkinson [1965] or Kato [1976].

Differentiating $Ax=\lambda Bx$ gives the equation
$$\dot Ax+A\dot x=\dot\lambda Bx+\lambda\dot Bx+\lambda B\dot x,$$
while differentiating $x'Bx=1$ gives
$$2x'B\dot x+x'\dot Bx=0.$$
Premultiplying the first equation by $x'$ gives
$$\dot\lambda=x'(\dot A-\lambda\dot B)x.$$
Now suppose $T$ is the matrix with all generalized eigenvectors as columns, with $T'BT=I$ and $AT=BT\Lambda$. Then from the first equation, premultiplying by $x_j'$ for $j\neq i$ gives
$$x_j'(\dot A-\lambda_i\dot B)x_i=(\lambda_i-\lambda_j)x_j'B\dot x_i.$$
If we define $z$ by $\dot x_i=Tz$, then $z_j=x_j'B\dot x_i$, and thus for $j\neq i$
$$z_j=\frac{x_j'(\dot A-\lambda_i\dot B)x_i}{\lambda_i-\lambda_j},$$
while the normalization condition gives $z_i=-\frac12x_i'\dot Bx_i$.

A first important special case is the ordinary eigenvalue problem, in which $B=I$, which obviously does not depend on $\theta$, and consequently has $\dot B=0$. Then
$$\dot\lambda=x'\dot Ax,$$
while $z_i=0$. If we use the Moore-Penrose inverse the derivative of the eigenvector can be written as
$$\dot x=-(A-\lambda I)^+\dot Ax.$$
Written in a different way this expression is
$$\dot x=-T(\Lambda-\lambda I)^+T'\dot Ax,$$
with $T$ the square orthonormal matrix of eigenvectors, so that $A-\lambda I=T(\Lambda-\lambda I)T'$.

The next important special case is the singular value problem. The singular values and vectors of an $n\times m$ rectangular $X$, with $n\geq m$, solve the equations
$$Xv=\sigma u,\qquad X'u=\sigma v.$$
It follows that $X'Xv=\sigma^2v$, i.e. the right singular vectors are the eigenvectors, and the singular values are the square roots of the eigenvalues, of $X'X$. Now we can apply our previous results on eigenvalues and eigenvectors. If $\lambda=\sigma^2$ then $\dot\lambda=2\sigma\dot\sigma$. We have, at an isolated singular value,
$$\dot\lambda=v'(\dot X'X+X'\dot X)v=2\sigma u'\dot Xv,$$
and thus
$$\dot\sigma=u'\dot Xv.$$
For the singular vectors our previous results on eigenvectors give
$$\dot v=-(X'X-\sigma^2I)^+(\dot X'X+X'\dot X)v,$$
and in the same way
$$\dot u=-(XX'-\sigma^2I)^+(\dot XX'+X\dot X')u.$$
Now let $X=U\Sigma V'$, with $U$ and $V$ square orthonormal, and with $\Sigma$ an $n\times m$ diagonal matrix (with positive diagonal entries in non-increasing order along the diagonal). Also define $M=U'\dot XV$. Then the derivatives of the singular values are the corresponding diagonal elements of $M$, and the derivatives of $U$ and $V$ can be written in terms of the off-diagonal elements of $M$ and the differences of the squared singular values. Note that if $X$ is symmetric we have $U=V$ and $M$ is symmetric, so we recover our previous result for eigenvectors. Also note that if the parameter is actually element $x_{ij}$ of $X$, i.e. if we are computing partial derivatives, then $\dot X=e_ie_j'$.

The results on eigen and singular value decompositions can be applied in many different ways, mostly by simply using the product rule for derivatives. For a square symmetric $A$ of order $n$, for example, we have
$$A=\sum_{s=1}^n\lambda_sx_sx_s',$$
and thus
$$\dot A=\sum_{s=1}^n\left(\dot\lambda_sx_sx_s'+\lambda_s\dot x_sx_s'+\lambda_sx_s\dot x_s'\right).$$
The generalized inverse of a rectangular $X$ is
$$X^+=\sum_s\sigma_s^{-1}v_su_s',$$
where summation is over the positive singular values, and for differentiability we must assume that the rank of $X$ is constant in a neighborhood of $\theta$. The Procrustus transformation of a rectangular $X$, which is the projection of $X$ on the Stiefel manifold of orthonormal matrices, is
$$\mathbf{proc}(X)=\sum_su_sv_s',$$
where summation is over all $m$ singular values and we assume for differentiability that $X$ is of full column rank. The projection of $X$ on the set of all matrices of rank less than or equal to $r$, which is of key importance in PCA and MDS, is
$$\Pi_r(X)=\sum_{s=1}^r\sigma_su_sv_s',$$
where summation is over the $r$ largest singular values.
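A numerical sanity check (ours) of the formula $\dot\lambda=x'\dot Ax$ for the ordinary eigenvalue problem, using a one-parameter family $A(\theta)=A_0+\theta A_1$ and a central difference of the computed eigenvalue. The matrices and parameter value are arbitrary.

```r
set.seed(4)
A0 <- crossprod(matrix(rnorm(25), 5, 5))
A1 <- crossprod(matrix(rnorm(25), 5, 5))
A  <- function(theta) A0 + theta * A1            # a smooth one-parameter family

largest <- function(theta) {                      # largest eigen-pair of A(theta)
  e <- eigen(A(theta), symmetric = TRUE)
  list(value = e$values[1], vector = e$vectors[, 1])
}

theta <- 0.3
ep <- largest(theta)
deriv_formula <- drop(crossprod(ep$vector, A1 %*% ep$vector))   # x' (dA/dtheta) x
eps <- 1e-6
deriv_numeric <- (largest(theta + eps)$value - largest(theta - eps)$value) / (2 * eps)
c(deriv_formula, deriv_numeric)                   # the two agree closely
```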
Graphics and Code
Multidimensional Scaling

Many of the examples in the book are taken from the area of multidimensional scaling (MDS). In this appendix we describe the basic MDS notation and terminology. Our approach to MDS is based on Kruskal [1964a, 1964b], using terminology and notation of De Leeuw [1977] and De Leeuw and Heiser [1982]. For a more recent and more extensive discussion of MDS see Borg and Groenen [2005].

The data in an MDS problem consist of information about the dissimilarities between pairs of objects. Dissimilarities are like distances, in the sense that they give some information about physical or psychological closeness, but they need not satisfy any of the distance axioms. In metric MDS the dissimilarity between objects $i$ and $j$ is a given number $\delta_{ij}$, usually positive and symmetric, with possibly some of the dissimilarities missing. In non-metric MDS we only have a partial order on some or all of the dissimilarities. We want to represent the objects as points in a metric space in such a way that the distances between the points approximate the dissimilarities between the objects.

An MDS loss function is typically of the form
$$\sigma(X,\Delta)=\|\Delta-D(X)\|,$$
for some norm, or pseudo-norm, on the space of $n\times n$ matrices. Here the rows of $X$ are the points in the metric space, $\Delta$ is the matrix of dissimilarities, and $D(X)$ is the symmetric, non-negative, and hollow matrix of distances. The MDS problem is to minimize loss over all configurations $X$ and all feasible $\Delta$. In metric MDS problems $\Delta$ is fixed at the observed data, in non-metric MDS any monotone transformation of the data is feasible.

The definition of MDS we have given leaves room for all kinds of metric spaces and all kinds of norms to measure loss. In almost all applications, both in this book and elsewhere, we are interested in Euclidean MDS, where the metric space is $\mathbb{R}^p$, and in loss functions that use the (weighted) sum of squares of residuals. Thus the loss function has the general form
$$\sigma(X,\Delta)=\sum_{i<j}w_{ij}\,r_{ij}^2(X,\Delta),$$
where $X$ is an $n\times p$ matrix called the configuration.

The most popular choices for the residuals are
$$r_{ij}=\delta_{ij}-d_{ij}(X),\qquad r_{ij}=\delta_{ij}^2-d_{ij}^2(X),\qquad r_{ij}=\log\delta_{ij}-\log d_{ij}(X),$$
and the residuals of the doubly centered squared dissimilarities,
$$R=-\frac12J\Delta^2J-XX'.$$
Here the dissimilarities are replaced by elementwise transformations, with corresponding transformations of the distances. In the last choice we use the centering operator $J=I-\frac1nee'$. For Euclidean distances, and column-centered $X$,
$$-\frac12JD^2(X)J=XX',$$
with $D^2(X)$ the matrix of squared distances. Metric Euclidean MDS, using the last choice of residuals with unit weights, means finding the best rank $p$ approximation to $-\frac12J\Delta^2J$, which can be done by finding the $p$ dominant eigenvalues and corresponding eigenvectors. This is also known as Classical MDS [Torgerson, 1958].

The loss function that uses $\delta_{ij}-d_{ij}(X)$ is called stress [Kruskal, 1964ab], the function that uses $\delta_{ij}^2-d_{ij}^2(X)$ is sstress [Takane et al., 1977], and the loss that uses $-\frac12J\Delta^2J-XX'$ is strain [De Leeuw and Heiser, 1982]. The loss that uses $\log\delta_{ij}-\log d_{ij}(X)$ has been nameless so far, but it has been proposed by Ramsay [1977]. Because of its limiting properties (see below), we will call it strull. Both stress and sstress are obviously special cases of the residuals $\delta_{ij}^r-d_{ij}^r(X)$, for which the corresponding loss function is called r-stress. Because
$$\lim_{r\rightarrow 0}\frac{x^r-1}{r}=\log x,$$
we see that strull is a limiting case of r-stress.

There is some matrix notation that is useful in dealing with Euclidean MDS. Suppose $e_i$ and $e_j$ are unit vectors, with all elements equal to zero, except one element which is equal to one. Then
$$d_{ij}^2(X)=(e_i-e_j)'XX'(e_i-e_j)=\mathbf{tr}\ X'A_{ij}X,$$
where $A_{ij}=(e_i-e_j)(e_i-e_j)'$. If we define $x=\mathbf{vec}(X)$ and $\tilde A_{ij}=I_p\otimes A_{ij}$, then $d_{ij}^2(X)=x'\tilde A_{ij}x$, which allows us to work with vectors in $\mathbb{R}^{np}$ instead of matrices in $\mathbb{R}^{n\times p}$.
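A short R illustration (ours) of Classical MDS as described above: double-center the squared dissimilarities and use the dominant eigenvalues and eigenvectors. The result agrees with the base R function cmdscale(); the helper name `classical_mds` is ours.

```r
# Classical (Torgerson) MDS from a matrix of dissimilarities delta.
classical_mds <- function(delta, p = 2) {
  n <- nrow(delta)
  J <- diag(n) - matrix(1 / n, n, n)              # centering operator
  B <- -0.5 * J %*% delta^2 %*% J                 # -1/2 J Delta^2 J
  e <- eigen(B, symmetric = TRUE)
  e$vectors[, 1:p] %*% diag(sqrt(pmax(e$values[1:p], 0)))
}

delta <- as.matrix(dist(matrix(rnorm(20 * 2), 20, 2)))   # Euclidean toy data
X <- classical_mds(delta, p = 2)
max(abs(as.matrix(dist(X)) - delta))              # essentially zero: distances recovered
cmdscale(delta, k = 2)[1:3, ]                     # same configuration up to rotation/sign
```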
Cobweb Plots

Suppose we have a one-dimensional Picard sequence which starts at $x^{(0)}$, and then is defined by
$$x^{(k+1)}=A(x^{(k)}).$$
The cobweb plot draws the line $y=x$ and the function $y=A(x)$. A fixed point is a point where the line and the function intersect. We visualize the iteration by starting at the point $(x^{(0)},A(x^{(0)}))$ on the function, then drawing a horizontal line to the point $(A(x^{(0)}),A(x^{(0)}))$ on the line $y=x$, then drawing a vertical line back to the function, and so on. For a convergent sequence we will see zig-zagging parallel to the axes, in smaller and smaller steps, to a point where the function and the line intersect.

An illustration will make this clear. The Newton iteration for the square root of $a$ is
$$x^{(k+1)}=\frac12\left(x^{(k)}+\frac{a}{x^{(k)}}\right).$$
The iterations, for a particular choice of $a$ and starting point, are in the cobweb plot in Figure 1.

We also give R code for a general cobweb plotter with a variable number of parameters.

[Insert cobwebPlotter.R Here](../code/cobwebPlotter.R)

Figure 1: Cobweb plot for Newton Square Root Iteration
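For readers who want something to run immediately, here is a minimal cobweb plotter in base R graphics (our sketch, not the author's cobwebPlotter.R linked above): it draws the line $y=x$ and the update function, then traces the vertical and horizontal segments of the Picard sequence.

```r
# Minimal cobweb plot: draw y = x and y = A(x), then trace the Picard sequence from x0.
cobweb <- function(A, x0, niter = 25, lower = 0, upper = 3, ...) {
  curve(A(x), from = lower, to = upper, ylim = c(lower, upper),
        xlab = "x", ylab = "A(x)", ...)
  abline(0, 1, lty = 2)                       # the line y = x
  x <- x0
  for (k in 1:niter) {
    xnew <- A(x)
    segments(x, x, x, xnew)                   # vertical step to the function
    segments(x, xnew, xnew, xnew)             # horizontal step to the line y = x
    x <- xnew
  }
}

# Newton iteration for the square root of 2, started at x0 = 0.5.
cobweb(function(x) (x + 2 / x) / 2, x0 = 0.5, lower = 0.4, upper = 3)
```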
Notation
Bibliography Abatzoglou T, O'Donnell B [1982] Minimization by Coordinate Descent. Journal of Optimization Theory and Applications 36: 163--174 Argyros IK, Szidarovszky F [1993] The Theory and Application of Iteration Methods. CRC Press, Boca Raton Berge C [1965] Espaces Topologiques, Fonctions Multivoques. Deuxième édition, Dunod, Paris Berge C [1997] Topological Spaces. Dover Publications, Mineola Berinde V [2007] Iterative Approximation of Fixed Points. (Second Edition) Berlin, Springer. Böhning D, Lindsay BG [1988] Monotonicity of Quadratic Approximation Algorithms. Annals of the Institute of Statiatical Mathematics 40:641-663 Borg I, Groenen PJF [2005] Modern Multidimensional Scaling. Second Edition, Springer, New York Browne MW [1987] The Young-Householder Algorithm and the Least Squares Multdimensional Scaling of Squared Distances Journal of Classification 4:175-190 Bryer J [2014] Rgitbook: Gitbook Projects with R Markdown. Package version 0.9 Bunch JR, Nielsen CP [1978] Updating the Singular Value Decomposition. Numerische Mathematik 31:111-129 Bunch JR, Nielsen CP, Sorensen DC [1978] Rank-one Modification of the Symmetric Eigenproblem. Numerische Matematik 31:31-48 Céa J [1968] Les Méthodes de ``Descente'' dans la Theorie de l'Optimisation. Revue Francaise d'Automatique, d'Informatique et de Recherche Opérationelle 2:79-102 Céa J [1970] Recherche Numérique d'un Optimum dans un Espace Produit. In Colloquium on Methods of Optimization. Springer, New York Céa J, Glowinski R [1973] Sur les Méthodes d'Optimisation par Rélaxation. Revue Francaise d'Automatique, d'Informatique et de Recherche Opérationelle 7:5-32 De Leeuw J [1968] Nonmetric Discriminant Analysis. Department of Data Theory, Leiden University, Research Note 06-68
De Leeuw J [1975] An Alternating Least Squares Approach to Squared Distance Scaling Unpublished, probably lost forever De Leeuw J [1977] Applications of Convex Analysis to Multidimensional Scaling. In: Barra JR, Brodeau F, Romier G, Van Cutsem B (eds) Recent Developments in Statistics Amsterdam, North Holland Publishing Company De Leeuw J [1982] Generalized Eigenvalue Problems with Positive Semidefinite Matrices. Psychometrika 47:87-94 De Leeuw J [1988] Multivariate Analysis with Linearizable Regressions. Psychometrika 53:437-454 De Leeuw J [1994] Block Relaxation Algorithms in Statistics. In: Bock HH, Lenski W, Richter MM (eds) Information Systems and Data Analysis. Springer, Berlin De Leeuw J [2004] Least Squares Optimal Scaling for Partially Observed Linear Systems. In Van Montfort K, Oud J, Satorra A (eds) Recent Developments on Structural Equation Models. Dordrecht, Kluwer De Leeuw J [2006] Principal Component Analysis of Binary Data by Iterated Singular Value Decomposition. Computational Statiatics and Data Analysis 50:21-39 De Leeuw J [2007] Derivatives of Generalized Eigen Systems with Applications. Department of Statistics UCLA, Preprint 528 De Leeuw J [2007b] Minimizing the Cartesian Folium. Department of Statistics UCLA, Unpublished De Leeuw J [2008a] Derivatives of Fixed-Rank Approximations. Department of Statistics UCLA, Preprint 547 De Leeuw J [2008b] Rate of Convergence of the Arithmetic-Geometric Mean Process. Department of Statistics UCLA, Preprint 550 De Leeuw J, Heiser WJ [1982] Theory of Multidimensional Scaling. In Krishnaiah PR, Kanal L (eds) Handbook of Statistics. Volume II, North Holland Publishing Co, Amsterdam De Leeuw J, Lange K [2009] Sharp Quadratic Majorization in One Dimension. Computational Statistics and Data Analysis 53:2471-2484 De Leeuw J , Sorenson K [2012] Derivatives of the Procrustus Transformation with Applications. Department of Statistics UCLA, Unpublished Delfour MC [2012] Introduction to Optimization and Semidifferential Calculus. Philadelphia, SIAM
Dempster AP, Laird NM, Rubin DB [1977] Maximum Likelihood for Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society B39:1-38. Demyanov VF [2007] Nonsmooth Optimization. In Di Pillo G, Schoen F (eds) Nonlinear Optimization. Lectures given at the C.I.M.E. Summer School held in Cetraro, Italy, July 1-7, 2007 Springer, New York Demyanov VF [2009] Dini and Hadamard Derivatives in Optimization. In Floudas CA , Pardalos PM (eds) Encyclopedia of Optimization. Revised and expanded edition, Springer, New York D'Esopo DA [1959] A Convex Programming Procedure. Naval Research Logistic Quarterly 6:33-42 Dinkelbach W [1967] On Nonlinear Fractional Programming. Management Science 13:492498 Dontchev AL, Rockafellar RT [2014] Implicit Functions and Solution Mappings. Second Edition, Springer, New York Elkin RM [1968] Convergence Theorems for Gauss-Seidel and other Minimization Algorithms. Technical Report 68-59, Computer Sciences Center, University of Maryland Forsythe GE, Golub GH [1965] On the Stationary Values of a Second Degree Polynomial on the Unit Sphere. Journal of the Society for Industrial and Applied Mathematics 13:1050-1068 Gander W [1981] Least Squares with a Quadratic Constraint. Numerische Mathematik 36:291-307 Gifi A [1990] Nonlinear Multivariate Analysis. Chichester, Wiley Golub GH [1973] Some Modified Matrix Eigenvalue Problems. SIAM Review 15:318-334 Groenen PJF, Giaquinto P, Kiers HAL [2003] Weighted Majorization Algorithms for Weighted Least Squares Decomposition Models. Econometric Institute Report EI 2003-09, Erasmus University, Rotterdam Groenen PJF, Nalbantov G, Bioch JC [2007] Nonlinear Support Vector Machines Through Iterative Majorization and I-Splines. In Lenz HJ, Decker R (eds) Studies in Classification, Data Analysis, and Knowledge Organization. Springer, New York Groenen PJF, Nalbantov G, Bioch JC [2008] SVM-Maj: a Majorization Approach to Linear Support Vector Machines with Different Hinge Errors. Advances in Data Analysis and Classification 2:17-43 Harman HH, Jones WH [1966] Factor Analysis by Minimizing Residuals (MINRES). Psychometrika 31:351-368
Heiser WJ [1986] A Majorization Algorithm for the Reciprocal Location Problem. Department of Data Theory, Leiden University, Report RR-86-12 Heiser WJ [1987] Correspondence Analysis with Least Absolute Residuals. Computational Statiatics and Data Analysis, 5:337-356 Heiser WJ [1995] Convergent Computation by Iterative Majorization: Theory and Applications in Multidimensional Data Analysis. In Krzanowski WJ (ed) Recent Advances in Discriptive Multivariate Analysis. Clarendon Press, Oxford Hildreth C [1957] A Quadratic Programming Procedure. Naval Research Logistic Quarterly 14:79-84 Hunter DR, Lange K [2004] A Tutorial on MM Algorithms. American Statistician 58:30-37 Hunter DR, Li R [2005] Variable Selection Using MM Algorithms. Annals of Statistics 33:1617-1642 Jaakkola TSW, Jordan MIW [2000] Bayesian Parameter Estimation via Variational Methods. Statistical Computing 10:25-37 Kato T [1976] Perturbation Theory for Linear Operators. Second Edition, Springer, New York Krantz SG, Parks HR [2013] The Implicit Function Theorem: History, Theory, and Applications. Springer, New York Kruskal JB [1964a] Multidimensional Scaling by Optimizing Goodness of Fit to a Nonmetric Hypothesis. Psychometrika 29:1-27 Kruskal JB [1964b] Nonmetric Multidimensional Scaling: a Numerical Method. Psychometrika 29:115-129 Kruskal JB [1965] Analysis of Factorial Experiments by Estimating Monotone Transformations of the Data. Journal of the Royal Statistical Society B27:251-263 Lange K [2013] Optimization. Second Edition, Springer, New York Lange K [20xx] MM Algorithms. Book in progress Lange K, Chi EC, Zhou, H [2014] A Brief Survey of Modern Optimization for Statisticians. International Statistical Review, 82:46-70 Lange K, Hunter DR, Yang I [2000] Optimization Transfer Using Surrogate Objective Functions. Journal of Computational and Graphical Statistics 9:1-20 Lipp T, Boyd S [2014] [Variations and Extensions of the Convex-Concave Procedure.] (http://web.stanford.edu/~boyd/papers/pdf/cvx_ccv.pdf) (as yet) Unpublished paper, Stanford University
Magnus JR, Neudecker H [1999] Matrix Differential Calculus with Applications in Statistics and Econometrics. (Revised Edition) New York, Wiley Mair P, De Leeuw J [2010] A General Framework for Multivariate Analysis with Optimal Scaling: The R Package aspect. Journal of Statistical Software, 32(9):1-23 Melman A [1995] Numerical Solution of a Secular Equation. Numerische Mathematik 69:483-493 Melman A [1997] A Unifying Convergence Analysis of Second-Order Methods for Secular Equations. Mathematics of Computation 66:333-344 Melman A [1998] Analysis of Third-order Methods for Secular Equations. Mathematics of Computation 67:271-286 Mönnigmann M [2011] Fast Calculation of Spectral Bounds for Hessian Matrices on Hyperrectangles. SIAM Journalof Matrix Analysis and Applications 32:1351-1366 Nesterov Y, Polyak BT [2006] Cubic Regularization of Newton Method and its Global Performance. Mathematical Programming, A108:177-205 Oberhofer W, Kmenta J [1974] A General Procedure for Obtaining Maximum Likelihood Estimates in Generalized Regression Models. Econometrica 42:579-590 Ortega JM, Rheinboldt WC [1970] Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, New York Ortega JM, Rheinboldt WC [1970] Local and Global Convergence of Generalized Linear Iterations. In Ortega JM, Rheinboldt WC (eds) Numerical Solutions of Nonlinear Problems. Philadelphia, SIAM Ostrowski AM [1966] Solution of Equations and Systems of Equations. (Second Edition) Academic Press, New York Penot J-P [2013] Calculus without Derivatives. New York, Springer Parring AM [1992] About the Concept of the Matrix Derivative. Linear Algebra and its Applications 176:223-235 Ramsay JO [1977] Maximum Likelihood Estimation in Multidimensional Scaling. Psychometrika 42:241-266 R Core Team [2015]. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Rockafellar RT [1970] Convex Analysis. Princeton University Press, Princeton Rockafellar RT, Wets RJB [1998] Variational Analysis. Springer, New York
Roskam EEChI [1968] Metric Analysis of Ordinal Data in Psychology. VAM, Voorschoten, Netherlands Schechter S [1962] Iteration Methods for Nonlinear Problems. Transactions of the American Mathematical Society 104:179-189 Schechter S [1968] Relaxation Methods for Convex Problems. SIAM Journal Numerical Analysis 5:601-612 Schechter S [1970] Minimization of a Convex Function by Relaxation. In Abadie J (ed) Integer and nonlinear programming. North Holland Publishing Company, Amsterdam Shapiro A [1990] On Concepts of Directional Differentiability. Journal of Optimization Theory and Applications 66:477-487 Schirotzek W [2007] Nonsmooth Analysis. Springer, New York Smart DR [1974] Fixed Point Theorems. Cambridge Tracts in Mathematics 66, Cambridge University Press, Cambridge Spivak M [1965] Calculus on Manifolds. Westview Press, Boulder Spjøtvoll E [1972] A Note on a Theorem by Forsythe and Golub. SIAM Joural of Applied Mathematics 23:307-311 Sriperumbudur BK, Lanckriet GRG [2012] A Proof of Convergence of the Concave-Convex Procedure Using Zangwill’s Theory. Neural Computation 24:1391–1407 Takane Y [1977] On the Relations among Four Methods of Multidimensional Scaling. Behaviormetrika, 4:29-42 Takane Y, Young FW, De Leeuw J [1977] Nonmetric Individual Differences in Multidimensional Scaling: An Alternating Least Squares Method with Optimal Scaling Features Psychometrika 42:7-67 Theussl S, Borchers, HW [2014] CRAN Task View: Optimization and Mathematical Programming. Thomson GH [1934] Hotelling's Method Modified to Give Spearman's g. Journal of Educational Psychology 25:366-374 Torgerson WS [1958] Theory and Methods of Scaling. Wiley, New York Van Den Burg GJJ, Groenen PJF [2014] GenSVM: A Generalized Multiclass Support Vector Machine. Econometric Institute Report EI 2014-33, Erasmus University, Rotterdam Van der Heijden PGM, Sijtsma K [1996] Fifty Years of Measurement and Scaling in the Dutch Social Sciences. Statistica Neerlandica 50:111-135.
Van Ruitenburg J [2005] Algorithms for Parameter Estimation in the Rasch Model. CITO Measurement and Research Department Reports 2005-04, Arnhem, Netherlands Varadhan R [2014] Numerical Optimization in R : Beyond optim . Journal of Statistical Software, 60: issue 1 Verboon P, Heiser WJ Resistant Lower Rank Approximation of Matrices by Iterative Majorization. Computational Statistics and Data Analysis 18:457-467 Voß H, Eckhardt U [1980] Linear Convergence of Generalized Weiszfeld's Method. Computing 25:243-251 Wainer H, Morgan A, Gustafsson JE [1980] A Review of Estimation Procedures for the Rasch Model with an Eye toward Longish Tests. Journal of Educational Statistics 5:35-64 Weiszfeld E [1937] _Sur le Point par lequel la Somme des Distances de n Points Donnès est Minimum.- Tôhoku Mathematics Journal 43:355--386 Weiszfeld E, Plastria F [2009] On the Point for which the Sum of the Distances to n Given Points is Minimum. Annals of Operations Research 167:7-41 Wilkinson GN [1958] Estimation of Missing Values for the Analysis of Incomplete Data. Biometrics 14:257-286 Wilkinson JH [1965] The Algebraic Eigenvalue Problem. Clarendon Press, Oxford Wong CS [1985] On the Use of Differentials in Statistics. Linear Algebra and its Applications 70:285-299 Xie Y [2013] Dynamic Documents with R and knitr. Boca Raton, Chapman and Hall/CRC. Yates, F [1933] The Analysis of Replicated Experiments when the Field Results are Incomplete. Empirical Journal of Experimental Agriculture, 1:129-142. Yen E-H, Peng N, Wang P-W, Lin S-D [2012] [On Convergence Rate of Concave-Convex Procedure.] (http://opt-ml.org/oldopt/papers/opt2012_paper_10.pdf) Paper presented at 5th NIPS Workshop on Optimization for Machine Learning, Lake Tahoe, December 8 2012 Young FW [1981] Quantitative analysis of qualitative data. Psychometrika 46:357-388 Young FW, De Leeuw J, Takane Y [1980] Quantifying Qualitative Data. In: Lantermann ED, Feger H (eds) Similarity and Choice. Papers in Honor of Clyde Coombs, Hans Huber, Bern Yuille AL, Rangarajan A [2003] The Concave-Convex Procedure. Neural Computation 15:915–936 Zangwill WI [1969] Nonlinear Programming: A Unified Approach. Prentice Hall, Englewood Cliffs
What's New

Version 021215
- Material in appendix on derivatives of eigenvalues
- Expanded bibliography
- Added "What's New" chapter
- Wrote intro to Taylor majorization
- Some shaky stuff on Lipschitz

Version 021315
- Added folium in coordinate descent chapter

Version 021415
- Added univariate cubic in local majorization
- Revisiting the reciprocal in higher order majorization

Version 021515
- Various small editorial changes throughout
- Version numbering by date
- Appendix on cobweb plotter
- Moved the graphics from an external server to the book
- Added some section headers in ALS chapter
- Removed shaky Lipschitz stuff

Version 021615
- Cleanup of first chapters

Version 021715
- Some stuff for section on product of derivatives of block mappings - representation as cyclic matrix

Version 021815
- Added some detail to the SVD example
- Added material on Dinkelbach majorization

Version 021915
- L-majorization and D-majorization
Version 022015
- Calculations for derivatives of general block methods
- Reorganized examples in block chapter

Version 022115
- Updated material on derivatives of algorithmic map for block methods
- Block Newton methods section
- Formulas for rate of block optimization
- Defined and used the LU-form and product form

Version 022215
- Removed Gauss-Jordan from multiple block LS
- Convergence rate of CCD for the Rayleigh quotient
- Added blockRate.R function
- Added mls.R function for multiple block LS
- Various small edits

Version 022315
- Expanded bibliography
- Put LaTeX macros in header.md to be included on every page
- Tentatively started HTML cross-referencing of sections
- Clarified block relaxation derivatives
- Started some material on block Newton

Version 022415
- Defined iteration map, iteration function, iteration matrix and iteration radius
- Added material on comparing majorizers (may have to be moved to composition section)
- Simple quartic example in neighborhood majorization

Version 022515
- No changes

Version 022615
- No changes

Version 022715
- Some changes in the definition section of majorization

Version 022815
- More majorization definition changes
- Majorization duality
- Graphics for majorization
- Moved code and graphics to better places

Version 030115
- More meat on D-majorization
- Additions to mean value section
- Separate section on majorizing value functions
- Logit example with higher derivatives
- Mean value and Taylor majorization material

Version 030215
- Higher derivatives for the logit, plots and formulas
- Higher derivatives for the probit, empty section
- Changed the order of some sections and chapters
- Cleanup of various sections
- Consistent notation for block least squares
- Consistent notation in tomography section
- Split convexity into convexity and linear majorization
- Moved EM into convexity

Version 030315
- Notation in EM sums and integrals
- Notation is now a separate chapter -- I intend to make a big deal out of this
- Moved appendices to background chapter
- Plan: merge silly README and introduction files for each chapter
- Added matrix background
- Added canonical correlation background section
- Cover page!
- DC example
- Added probit calculations to sharp quadratic

Version 030415
- Moved last graphics from gifi into book
- Moved bibliography and notation files to top level
- Checked and corrected chapter and section headers
- Background on singular value decomposition - start
- Background on canonical correlation - start
- Background on eigenvalue decomposition - start
Version 030515
- Split eigen background into symmetric and asymmetric
- Background on symmetric eigen decomposition
- Started background on asymmetric eigen problem
- Changed chapter title from linear to tangential majorization
- Added necessary condition from tangential majorization of concave function
- Changed chapter title to higher order majorization

Version 030615
- More on asymmetric eigenproblems
- Changed names: sub-level majorization, Dinkelbach majorization, Nesterov majorization
- Added even higher order Nesterov majorization
- Added multivariate quadratic sub-level majorization -- start
- Nesterov majorization definition and implementation
- Rearranged background chapter, new empty sections in matrices and inequalities
- Introduced majorization scheme or majorization coupling
- Decomposition section with examples added

Version 030715
- Defined majorization coupling
- Redid coupling plot for log-logit in rgl
- I will start moving figures back into the text, instead of putting them at the end of the page. Markdown and HTML are not LaTeX, there are no floats.
- Background section on necessary and sufficient conditions for a minimum, added some material on quadratics
- Background section on generalized inverses. Still empty.
- Necessary conditions for minimum on sphere or ball
- Insert stuff on partial majorization in value functions

Version 030815
- I am experimenting with cross-referencing chapters, sections, theorems, figures, maybe even paragraphs and formulas, using the HTML name attribute.
- I am experimenting with pretty printing and executing R code in the pages, by using Rgitbook.

Version 030915
- Replaced some figures with code for knitr
- Added bibliography items
- Changed preface name in book.json
- Added link to my bibliography in preface
- Replaced all md files by Rmd files
- Pasted in sharp majorization of even functions
- Pasted in sharp majorization in two points
- Material on first secular equation in modified eigenvalues background
- Must remember to name my chunks

Version 031015
- Added material on modified eigen problem
- Added material on quadratics on a sphere

Version 031115
- Added more material on quadratics on a sphere
- Included LaTeX macros on pages
- Replaced more figures by code
- Expanded glossary
- ALS intro
- Added knitr caches for computations

Version 031215
- ALS intro edited
- ALS rate of convergence edited
- ALSOS section edited
- Continue cosmetics, cross-referencing, glossary, caches
- Renamed news.md, bibliography.md, notation.md using caps
- Include LaTeX macros on each page
- Update README.md in all directories
- Update book.json to gitbook 2.0.0-beta.3

Version 031315
- Added some material to majorizing value functions
- Experimenting with how to link to R code files
- Thinking about merging subsections into a single section file

Version 031415 (day of pi)
- Added section on projection vs block relaxation

Version 031515
- Rewrote section on missing data in linear models (Yates)
- Redid plots in quadratic majorization of normal distribution and density
Version 031615
- Try out variables in book.json for chapter and section cross references

Version 031715
- Moved HEADER.md to the top of all Rmd files

Version 031815
- Some experimenting with variables and arrays in book.json

Version 031915
- No changes

Version 032015
- Sections rasch and morerasch edited
- Section on Block Newton edited

Version 032115
- Added derivatives of implicit functions

Version 032215
- Added material on marginal functions and solution maps (changed section title)
- Added classical implicit function theorem

Version 032315
- No changes

Version 032415
- Added S-majorization and D-majorization example, with figures and code, to Dinkelbach section
- Graphics for Chebyshev example in 2.8.2 replaced by code, graphics in 2 x 2 plot matrix
- Started adding section and chapter numbers
- Graphics for folium redone with code and cache

Version 032515
- Section on univariate and separable functions
- Updated book.json to gitbook 2.0.0

Version 032615
- Add material on Concave-Convex Procedure
- S-majorization by a quadratic: multivariate case
- Started section on majorization on a hyperrectangle

Version 032715
- Add material on location problem
- Add section on Generalized Weiszfeld Methods

Version 032815
- Background on liminf/limsup
- Rearranged some background sections

Version 032915
- Definition section Majorization Algorithm
- Replaced more plots with knitr chunks of R code
- Started section on matrix derivatives

Version 033015
- Material on matrix derivatives
- Started section on directional derivatives

Version 033115
- More on directional derivatives
- Shift focus to functions with values in extended reals

Version 040115
- Section on derivatives between directional and Taylor

Version 040215
- More on derivatives
- book.json gitbook updated to version 2.0.1
Workflow

If I wanted you to understand it, I would have explained it better. -- Johan Cruijff

This may be of some interest to some. I use the Rgitbook package in R together with gitbook-cli, and the (old) Gitbook.app, which uses gitbook-1.0.3.
All files are in the BRAS directory. I edit Rmd files in Sublime Text, then run buildGitbook() from an R session running in BRAS, and then use openGitbook() to see the HTML in a browser. The Rmd files follow the conventions of knitr. As a first step, buildGitbook() knits all updated Rmd files to md files, creating caches, executing R code, and generating figures and output where necessary. As a second step it converts the md files to HTML and puts them in the right place in the gitbook hierarchy by calling gitbook-cli. I then upload the book to GitbookIO using Gitbook.app. Note that uploading means another md to HTML conversion on the server side. The toolchain is a bit shaky, because both Rgitbook and Gitbook.app are no longer actively maintained. It would be great if RStudio could handle the whole build sequence and offer a book repository similar to its repository for compiled Rmd files. The major problem at the moment is that none of the tools can efficiently convert formulas to png files, and my book is full of formulas. MathJax does a great job generating a website for the book, but so far png conversion does not work for me.
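For concreteness, here is a minimal sketch of that build cycle in R. It assumes the (now archived) Rgitbook package and gitbook-cli are installed; the path to the BRAS directory is illustrative.

```r
# Minimal sketch of the build cycle described above (assumes Rgitbook and
# gitbook-cli are installed; the BRAS path is illustrative).
library(Rgitbook)

setwd("~/BRAS")   # the book's root directory
buildGitbook()    # knit changed Rmd files to md, then build the HTML gitbook
openGitbook()     # open the generated book in the default browser
```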
Glossary

ALS
Alternating Least Squares.
ALSOS
Alternating Least Squares with Optimal Scaling.
Alternating Least Squares
Block relaxation of a least squares loss function.
Alternating Least Squares with Optimal Scaling
Alternating least squares with optimal transformations of the data.
Augmentation Algorithm
Minimize a function by introducing additional variables and applying block relaxation.
Augmentation Method
Minimize a function by introducing additional variables and applying block relaxation.
Block Relaxation Algorithm
Minimize a function over alternating blocks of variables.
Block Relaxation Method
Minimize a function over alternating blocks of variables.
Gauss-Newton Method
Minimizing a linear-quadratic Taylor approximation to a least squares loss function.
Iteration Function
Function defining the update in a step of an iterative algorithm.
Iteration Jacobian
Derivative of the iteration function.
Iteration Map
Point-to-set map defining the possible updates in a step of an iterative algorithm.
Iteration Rate
Largest eigenvalue, in modulus, of the iteration Jacobian. Same as Iteration Spectral Radius.
Iteration Spectral Radius
Largest eigenvalue, in modulus, of the iteration Jacobian.
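As a small illustration (not from the book), the spectral radius can be computed directly from the eigenvalues of the iteration Jacobian; the helper name and the example matrix below are ours.

```r
# Hypothetical helper, for illustration only: the spectral radius is the
# largest modulus of the eigenvalues of the iteration Jacobian; a value
# below one indicates local linear convergence of the iteration.
iterationSpectralRadius <- function(M) {
  max(Mod(eigen(M, only.values = TRUE)$values))
}

M <- matrix(c(0.5, 0.1, 0.2, 0.3), 2, 2)  # an illustrative iteration matrix
iterationSpectralRadius(M)                # about 0.57, so the iteration converges
```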
Majorization Algorithm
Minimize a function by iteratively minimizing majorizations.
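A minimal sketch of one such algorithm, not taken from the book: minimizing f(x) = cos(x) with a uniform quadratic majorizer.

```r
# Illustrative majorization (MM) iteration, not from the book: since
# f''(x) = -cos(x) <= 1, the quadratic
#   g(x, y) = cos(y) - sin(y) * (x - y) + (x - y)^2 / 2
# majorizes f(x) = cos(x), with equality at x = y. Minimizing g over x gives
# the update x <- x + sin(x), which decreases cos(x) in every step.
x <- 1
for (k in 1:20) x <- x + sin(x)
c(x, cos(x))   # approximately pi and -1, a minimum of cos
```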
Majorization Method
Minimize a function by iteratively minimizing majorizations.
MM Algorithm
Majorization/Minimization or Minorization/Maximization Algorithm.
MM Method
Majorization/Minimization or Minorization/Maximization Algorithm.
Nesterov Majorization
Majorize a function by bounding the cubic term in the Taylor series.
Newton's Method
Minimizing a quadratic Taylor approximation to a function.
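For illustration only (the function names and the example function are ours): each Newton step minimizes the quadratic Taylor approximation at the current point.

```r
# Illustrative Newton iteration for a univariate function, not from the book:
# minimize f(x) = x^4 + x^2 by repeatedly minimizing the quadratic Taylor
# approximation, i.e. x <- x - f'(x) / f''(x).
fp  <- function(x) 4 * x^3 + 2 * x    # first derivative
fpp <- function(x) 12 * x^2 + 2       # second derivative
x <- 1
for (k in 1:10) x <- x - fp(x) / fpp(x)
x   # close to 0, the minimizer of f
```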