Maximum Likelihood Estimation of Parameters in Linkage. Comparative Numerical Iterative Methods

Rodica SOBOLU 1), Dana PUSTA 2), Ioana POP 1)

1) Faculty of Horticulture, University of Agricultural Sciences and Veterinary Medicine Cluj-Napoca, 3-5 Manastur Street, 400372, Romania, Tel: +0040264-593792
2) Faculty of Veterinary Medicine, University of Agricultural Sciences and Veterinary Medicine Cluj-Napoca, 3-5 Manastur Street, 400372, Romania, Tel: +0040264-593792

*) Corresponding author, e-mail: [email protected], [email protected]

Bulletin UASVM Horticulture 72(2) / 2015
Print ISSN 1843-5254, Electronic ISSN 1843-5394
DOI: 10.15835/buasvmcn-hort:11418
Abstract

Taking into account the advantages of the Maximum Likelihood Method (the most precise estimation) and the statistical properties of MLEs (unbiasedness, consistency, efficiency, invariance, asymptotic normality), this paper presents maximum likelihood estimation in the context of estimating the recombination fraction r in linkage analysis. The Maximum Likelihood Method follows several steps: specify the likelihood function; take derivatives of the likelihood with respect to the parameters; set the derivatives equal to zero; and finally obtain a likelihood equation whose maximization provides the most precise estimate of the recombination fraction. When no closed-form solution of the likelihood equation exists, it is generally solved by iterative procedures. In this work we compare two iterative optimization methods useful in computing the MLE of the recombination fraction: the Newton-Raphson method and Fisher's Method of Scoring. We implemented both methods in Maple and illustrated them with an example: the estimation of the recombination fraction in Morgan's (1909) experiment on fruit flies. The Maple code for the two methods applied to the Morgan example is given in the appendix. We cannot guarantee which of the two presented methods yields the global maximum.

Keywords: recombination fraction, MLE estimation, linkage analysis.
Introduction

Most traits of biological and agricultural importance are controlled by interacting genes, each with a small effect, and by environmental factors. Linkage analysis is a powerful tool for studying the structure of the genome and for detecting, within pedigrees, the location of genes responsible for some quantitative trait of interest. The genes are identified by mapping their positions relative to known marker loci. The degree of linkage between any two markers is expressed in terms of the recombination fraction r. Maximum likelihood estimation (MLE) is a statistical inference procedure that provides precise estimators of parameters. Fisher first published the method of maximum likelihood in 1912 in terms of "inverse probability" and introduced the term "likelihood" in 1921. The full development of the method appears in his articles of 1922 and 1925. The MLE is the value of the parameter(s) that maximizes the likelihood of the observed sample, i.e. makes the observed data most likely to occur. We present MLE in the context of optimization methods specialized for statistics and we discuss some numerical methods useful for solving statistical optimization problems. Devore et al. (2012) present some properties of maximum likelihood estimators. In this study we take certain known properties of these estimators and adapt them to the context of numerical
optimization methods for the likelihood function. We assume only one unknown parameter θ.

2. Preliminaries
2.1 Linkage and recombination fraction

Linkage is the tendency for genes to be inherited together because they are located near one another on the same chromosome. The degree of linkage is described in terms of the recombination fraction r between two loci. The value of r quantifies the amount of linkage between two genes,

r = R / N,

where R is the number of recombinant gametes and N is the total number of gametes in a sample of meioses. The recombination fraction is biologically meaningful when r ∈ [0, 0.5]: the value r = 0 indicates complete linkage and the value r = 0.5 indicates independence of the two loci. A.H. Sturtevant (1913) first used the recombination fraction to order, i.e. map, genes along fruit fly chromosomes.

2.2 Likelihood-Estimation

The likelihood function is a theoretical concept that underlies many statistical inferences made in genetics. It is the probability of the observed data, considered as a function of the parameter. Let y = (y_1, y_2, ..., y_n) be an i.i.d. sample and θ a scalar parameter or a parameter vector θ = (θ_1, θ_2, ..., θ_k), k ≤ n. The likelihood function is

L(θ | y) = f(y | θ) = ∏_{i=1}^{n} f(y_i | θ)   (1),

where f(y | θ) is the sample density function. Assume that L(θ | y) is twice differentiable with respect to θ. The maximum likelihood estimation (MLE) method chooses the estimated parameter θ as the value that maximizes the log-likelihood function

log L(θ | y) = ∑_{i=1}^{n} log f(y_i | θ)   (2).
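The paper's worked example is implemented in Maple (see the appendix); purely as an illustrative sketch, the log-likelihood (2) for a binomial model of recombination can be written and maximized in Python. The counts R = 18 and N = 100 below are hypothetical, not Morgan's actual data.

```python
import math

def log_likelihood(r, R, N):
    """Binomial log-likelihood (2) for recombination fraction r:
    R recombinant gametes out of N (the constant binomial term is dropped)."""
    return R * math.log(r) + (N - R) * math.log(1 - r)

# Hypothetical counts (not Morgan's data): 18 recombinants in 100 gametes.
R, N = 18, 100

# Restrict r to the biologically meaningful range (0, 0.5) and
# locate the maximizer on a fine grid.
grid = [i / 1000 for i in range(1, 500)]
r_hat = max(grid, key=lambda r: log_likelihood(r, R, N))
print(r_hat)   # prints 0.18, the sample proportion R/N
```

The grid search is only a check: for this one-parameter model the maximizer is the sample proportion R/N, which is exactly what the iterative methods of Section 2.3 recover when no closed form is available.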
The MLE estimator is the value θ̂ such that log L(θ̂ | y) ≥ log L(θ | y) for all θ. The MLE method generates an equation whose maximization provides the most precise estimate. In what follows we consider Fisher's score function

S(θ) = ∂ log L(θ | y) / ∂θ   (3).

If θ is a parameter vector, then S(θ) is the score vector of first partial derivatives, one for each component of θ. The first derivative of the score function multiplied by -1 is called the observed information,

I(θ) = − ∂S(θ) / ∂θ   (4).

The quantity I(θ) measures the sharpness of the peak of the likelihood function around its maximum. The expected information is defined as

J(θ) = Var(S(θ)) = E[I(θ)]   (5).

Note that for large i.i.d. samples it holds approximately that θ̂ ~ N(θ, J(θ)⁻¹). The score function has some interesting statistical properties: its expectation with respect to y is equal to zero,

E[S(θ)] = 0   (6),

and the inverse of the information evaluated at the MLE is its approximate variance,

Var(θ̂) ≈ J(θ̂)⁻¹   (7).
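The score and information quantities above can be checked numerically for the binomial recombination model. The sketch below (in Python rather than the paper's Maple, with a hypothetical r and N) averages the score and the observed information over the exact binomial distribution of R, confirming property (6) and the identity E[I(θ)] = J(θ) from (5).

```python
import math

def score(r, R, N):
    """Score (3) for the binomial model: d/dr of R*log(r) + (N-R)*log(1-r)."""
    return R / r - (N - R) / (1 - r)

def observed_info(r, R, N):
    """Observed information (4): minus the derivative of the score."""
    return R / r**2 + (N - R) / (1 - r)**2

# Hypothetical setup (not Morgan's data): true r = 0.2, N = 50 gametes.
r_true, N = 0.2, 50
pmf = lambda R: math.comb(N, R) * r_true**R * (1 - r_true)**(N - R)

# Expectations taken over the exact binomial distribution of R.
mean_score = sum(pmf(R) * score(r_true, R, N) for R in range(N + 1))
mean_info = sum(pmf(R) * observed_info(r_true, R, N) for R in range(N + 1))

print(abs(mean_score) < 1e-9)                               # True: E[S(r)] = 0, property (6)
print(abs(mean_info - N / (r_true * (1 - r_true))) < 1e-6)  # True: E[I(r)] = J(r) = N/(r(1-r))
```

For this model the expected information has the closed form J(r) = N / (r(1 − r)), which is the quantity Fisher's Method of Scoring substitutes for the observed information.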
2.3 Numerical Methods for solving equations

Numerical analysis combines computer science and mathematics in order to create and implement algorithms for solving equations or
systems of equations that are not linear. There are many iterative methods that approximate the solutions of such equations: Newton's method, the bisection method, the secant method. Newton's method is widely used for function optimization. It provides an iterative formula for finding a maximum of log L(θ), and thus for solving the equation S(θ) = 0 (10), when no direct solution exists. Newton's method (also known as the Newton-Raphson method) was proposed by one of the most influential scientists of all time, Isaac Newton (1642-1727). Generally, this method iteratively obtains better approximations to the roots of an equation of the type f(x) = 0 (11), where f is a real-valued function. Newton's method uses
Taylor's theorem to approximate equation (11). We present below Newton's method in one variable (following Ram 2010; Micula et al., 2008): given a function f defined over the real numbers and its first derivative f′, we start with an initial guess x_0 for a root of f. A better approximation x_1 is given by the recurrence

x_1 = x_0 − f(x_0) / f′(x_0)   (12),

where (x_1, 0) is the intersection with the x-axis of the line tangent to f at the point (x_0, f(x_0)). The process is repeated as

x_{k+1} = x_k − f(x_k) / f′(x_k)   (13)

until the increment becomes small, i.e.

|x_{k+1} − x_k| < ε   (14).

The iterative process is shown graphically in Figure 1. Let a be the actual root of equation (11), so f(a) = 0. A sequence of iterates x_k that converges to a has order of convergence p > 1 if

lim_{k→∞} |x_{k+1} − a| / |x_k − a|^p = c, where 0 < c < ∞.
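Since the paper's Maple implementation is confined to the appendix, the iteration (13) with stopping rule (14) can be sketched as follows, here in Python and applied to the score equation S(r) = 0 of the binomial recombination model. The counts R = 18 and N = 100 are hypothetical, not Morgan's actual data.

```python
def newton(f, fprime, x0, eps=1e-10, max_iter=50):
    """Newton-Raphson iteration (13), stopping when |x_{k+1} - x_k| < eps (14)."""
    x = x0
    for _ in range(max_iter):
        x_new = x - f(x) / fprime(x)
        if abs(x_new - x) < eps:
            return x_new
        x = x_new
    return x

# Hypothetical counts (not Morgan's data): 18 recombinants out of 100 gametes.
R, N = 18, 100
S  = lambda r: R / r - (N - R) / (1 - r)          # score function (3)
dS = lambda r: -R / r**2 - (N - R) / (1 - r)**2   # derivative of the score, = -I(r)

r_hat = newton(S, dS, x0=0.25)
print(round(r_hat, 6))   # prints 0.18, the closed-form MLE R/N
```

Replacing the observed information I(r) in the denominator with the expected information J(r) = N / (r(1 − r)) gives Fisher's Method of Scoring; for this particular one-parameter binomial model, a single scoring step from any interior starting value already lands exactly on R/N, although in general the two methods differ in their iteration paths.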