ASReml User Guide Release 3.0 2009
A R Gilmour NSW Department of Primary Industries, Orange, Australia B J Gogel University of Adelaide, Adelaide, Australia B R Cullis NSW Department of Primary Industries, Wagga Wagga, Australia R Thompson School of Mathematical Sciences, Queen Mary, University of London, Mile End Road, London E1 4NS, and Centre for Mathematical and Computational Biology, and Department of Biomathematics and Bioinformatics, Rothamsted Research, Harpenden AL5 2JQ, United Kingdom
ASReml User Guide Release 3.0 ASReml is a statistical package that fits linear mixed models using Residual Maximum Likelihood (REML). It is a joint venture between the Biometrics Program of NSW Department of Primary Industries and the Biomathematics Unit of Rothamsted Research. Statisticians in Britain and Australia have collaborated in its development.
Main authors: A. R. Gilmour, B. J. Gogel, B. R. Cullis and R. Thompson
Other contributors: D. Butler, M. Cherry, D. Collins, G. Dutkowski, S. A. Harding, K. Haskard, A. Kelly, S. G. Nielsen, A. Smith, A. P. Verbyla, S. J. Welham and I. M. S. White.
Author email addresses
[email protected] [email protected] [email protected] [email protected]
Copyright Notice c 2009, NSW Department of Industry and Investment. All rights Copyright ° reserved. Except as permitted under the Copyright Act 1968 (Commonwealth of Australia), no part of the publication may be reproduced by any process, electronic or otherwise, without specific written permission of the copyright owner. Neither may information be stored electronically in any form whatever without such permission.
Published by:
E-mail: Website:
VSN International Ltd, 5 The Waterhouse, Waterhouse Street, Hemel Hempstead, HP1 1ES, UK
[email protected] http://www.vsni.co.uk/
The correct bibliographical reference for this document is: Gilmour, A.R., Gogel, B.J., Cullis, B.R., and Thompson, R. 2009 ASReml User Guide Release 3.0 VSN International Ltd, Hemel Hempstead, HP1 1ES, UK www.vsni.co.uk
Preface
ASReml3 ASReml2 Revised 08
ASReml is a statistical package that fits linear mixed models using Residual Maximum Likelihood (REML). It has been under development since 1993 and is a joint venture between the Biometrics Program of NSW Department of Primary Industries and the Biomathematics and Bioinformatics Division (previously the Statistics Department) of Rothamsted Research. Release 2 of ASReml was distributed in 2006. This guide relates to Release 3 first distributed in 2008. Changes in this version are indicated by the word ASReml3 in the margin. Features added in Release 2 have ASReml2 in the margin. Other significant changes to the text are indicated by Revised in the margin. A separate document, ASReml 3 Update, is available to highlight the changes from Release 2.00. Linear mixed effects models provide a rich and flexible tool for the analysis of many data sets commonly arising in the agricultural, biological, medical and environmental sciences. Typical applications include the analysis of (un)balanced longitudinal data, repeated measures analysis, the analysis of (un)balanced designed experiments, the analysis of multi-environment trials, the analysis of both univariate and multivariate animal breeding and genetics data and the analysis of regular or irregular spatial data. ASReml provides a stable platform for delivering well established procedures while also delivering current research in the application of linear mixed models. The strength of ASReml is the use of the Average Information (AI) algorithm and sparse matrix methods for fitting the linear mixed model. This enables it to analyse large and complex data sets quite efficiently. One of the strengths of ASReml is the wide range of variance models for the random effects in the linear mixed model that are available. There is a potential cost for this wide choice. Users should be aware of the dangers of either overfitting or attempting to fit inappropriate variance models to small or highly unbalanced data sets. We stress the importance of using data-driven diagnostics and encourage the user to read the examples chapter, in which we have attempted to not only present the syntax of ASReml in the context of real analyses but also to i
Preface
ii
indicate some of the modelling approaches we have found useful. Revised 08
There are several interfaces to the core functionality of ASReml. The program name ASReml relates to the primary program. ASReml-W refers to the user interface program developed by VSN and distributed with ASReml. ASReml-R refers to the S language interface to a DLL of the core ASReml routines. Genstat uses the same core routines for its REML directive. Both of these have good data manipulation and graphical facilities. The focus in developing ASReml has been on the core engine and it is freely acknowledged that its user interface is not to the level of these other packages. Nevertheless, as the developers interface, it is functional, it gives access to everything that the core can do and is especially suited to batch processing and running of large models without the overheads of other systems. Feedback from users is welcome and attempts will be made to rectify identified problems in ASReml. The guide has 15 chapters. Chapter 1 introduces ASReml and describes the conventions used in this guide. Chapter 2 outlines some basic theory while Chapter 3 presents an overview of the syntax of ASReml through a simple example. Data file preparation is described in Chapter 4 and Chapter 5 describes how to input data into ASReml. Chapters 6 and 7 are key chapters which present the syntax for specifying the linear model and the variance models for the random effects in the linear mixed model. Chapters 8 and 9 describe special commands for multivariate and genetic analyses respectively. Chapter 10 deals with prediction of linear functions of fixed and random effects in the linear mixed model and Chapter 13 presents the syntax for forming functions of variance components. Chapter 11 demonstrates running an ASReml job features available and Chapter 14 gives a detailed explanation of the output files. Chapter 15 gives an overview of the error messages generated in ASReml and some guidance as to their probable cause. The guide concludes with the most extensive chapter which presents the examples. Briefly, the improvements in Release 2 include more robust variance parameter updating so that ’Convergence Failure’ is less likely, extensions to the syntax, inclusion of the Mat´ern correlation model, ability to plot predicted values, improvements for testing fixed effects, improvements to the handling of pedigrees and some increases in computational speed.
ASReml3
Release 3 contains some extensions to data handling (merging files), pedigree processing, model specification, theshold models, prediction and examining residuals. The data sets and ASReml input files used in this guide are available from
Preface
iii
http://www.vsni.co.uk/products/asreml as well as in the examples directory of the distribution CD-ROM. They remain the property of the authors or of the original source but may be freely distributed provided the source is acknowledged. The authors would appreciate feedback and suggestions for improvements to the program and this guide. Proceeds from the licensing of ASReml are used to support continued development to implement new developments in the application of linear mixed models. The developmental version is available to supported licensees via a website upon request to VSN. Most users will not need to access the developmental version unless they are actively involved in testing a new development.
Acknowledgements We gratefully acknowledge the Grains Research and Development Corporation of Australia for their financial support for our research since 1988. Brian Cullis and Arthur Gilmour wish to thank the NSW Department of Primary Industries, for providing a stimulating and exciting environment for applied biometrical research and consulting. Rothamsted Research receives grant-aided support from the Biotechnology and Biological Sciences Research Council of the United Kingdom. We sincerely thank Ari Verbyla, Sue Welham, Dave Butler and Alison Smith, the other members of the ASReml ‘team’. Ari contributed the cubic smoothing splines technology, information for the Marker map imputation, on-going testing of the software and numerous helpful discussions and insight. Sue Welham has overseen the incorporation of the core into Genstat and contributed to the predict functionality. Dave Butler has developed the ASReml-R class of functions. Alison contributed to the development of many of the approaches for the analysis of multi-section trials. We also thank Ian White for his contribution to the spline methodology, and Simon Harding for the licensing and installation software and for his development of the WinASReml environment for running ASReml. The Mat´ern function material was developed with Kathy Haskard, a PhD student with Brian Cullis, and the denominator degrees of freedom material was developed with Sharon Nielsen, a Masters student with Brian Cullis. Damian Collins contributed the PREDICT !PLOT material. Greg Dutkowski has contributed to the extended pedigree options. The asremload.dll functionality is provided under license to VSN. Alison Kelly has helped with the review of the XFA models. Finally, we especially thank our close associates who continually test the enhancements.
Preface
iv
I, Arthur Gilmour, thank Jesus Christ for His forgiveness and personal support over many years. As He has said Behold I stand at the door and knock. If any man hear my voice and open the door I will come in, and sup with him and he with me. (Revelation 3:20). I thank the Lord for the privilege of collaborating with several very gifted people including those involved in the ASReml project, acknowledging their acceptance, generosity, patience and perseverence toward a boy from Boree Creek. The heavens declare the glory of God and the firmament (earth) shows His handiwork. Psalm 19:1 Be exalted O God, above the heavens: and Thy glory above all the earth. Psalm 108:5.
Contents
Preface
i
List of Tables
xxi
List of Figures
xxiii
1 Introduction
1
1.1
What ASReml can do . . . . . . . . . . . . . . . . . . . . . . . . . .
2
1.2
Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2
1.3
User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3
ASReml-W . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3
ConTEXT . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3
1.4
How to use this guide . . . . . . . . . . . . . . . . . . . . . . . . . .
4
1.5
Getting assistance and the ASReml forum
. . . . . . . . . . . . . . .
4
1.6
Typographic conventions . . . . . . . . . . . . . . . . . . . . . . . . .
5
2 Some theory 2.1
6
The linear mixed model . . . . . . . . . . . . . . . . . . . . . . . . .
v
7
Contents
vi
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
7
Direct product structures . . . . . . . . . . . . . . . . . . . . . . .
7
Variance structures for the errors: R structures . . . . . . . . . . .
9
Variance structures for the random effects: G structures . . . . . . 10 2.2
Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Estimation of the variance parameters . . . . . . . . . . . . . . . . 11 Estimation/prediction of the fixed and random effects . . . . . . . 14
2.3
What are BLUPs? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4
Combining variance models . . . . . . . . . . . . . . . . . . . . . . . 16
2.5
Inference: Random effects . . . . . . . . . . . . . . . . . . . . . . . . 17 Tests of hypotheses: variance parameters . . . . . . . . . . . . . . 17 Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.6
Inference: Fixed effects . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Incremental and Conditional Wald F Statistics . . . . . . . . . . . 20 Kenward and Roger Adjustments
. . . . . . . . . . . . . . . . . . 24
Approximate stratum variances . . . . . . . . . . . . . . . . . . . . 25 3 A guided tour
26
3.1
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2
Nebraska Intrastate Nursery (NIN) field experiment . . . . . . . . . . 27
Contents
vii
3.3
The ASReml data file . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4
The ASReml command file . . . . . . . . . . . . . . . . . . . . . . . . 31 The title line . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Reading the data . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 The data file line . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Tabulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Specifying the terms in the mixed model . . . . . . . . . . . . . . 33 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Variance structures . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5
Running the job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 Forming a job template
3.6
. . . . . . . . . . . . . . . . . . . . . . . 35
Description of output files . . . . . . . . . . . . . . . . . . . . . . . . 36 The .asr file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 The .sln file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 The .yht file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.7
Tabulation, predicted values and functions of the variance components
4 Data file preparation
39 42
4.1
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2
The data file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 Free format data files . . . . . . . . . . . . . . . . . . . . . . . . . 43
Contents
viii
Fixed format data files . . . . . . . . . . . . . . . . . . . . . . . . 45 Preparing data files in Excel . . . . . . . . . . . . . . . . . . . . . 45 Binary format data files . . . . . . . . . . . . . . . . . . . . . . . 45 5 Command file: Reading the data
46
5.1
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2
Important rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3
Title line . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.4
Specifying and reading the data . . . . . . . . . . . . . . . . . . . . . 48 Data field definition syntax . . . . . . . . . . . . . . . . . . . . . . 49 Storage of alphabetic factor labels . . . . . . . . . . . . . . . . . . 51 Reordering the factor levels . . . . . . . . . . . . . . . . . . . . . . 51 Skipping input fields . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.5
Transforming the data . . . . . . . . . . . . . . . . . . . . . . . . . . 52 Transformation syntax . . . . . . . . . . . . . . . . . . . . . . . . 54 QTL marker transformations . . . . . . . . . . . . . . . . . . . . . 59 Other rules and examples . . . . . . . . . . . . . . . . . . . . . . . 61 Special note on covariates . . . . . . . . . . . . . . . . . . . . . . 62
5.6
Datafile line . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 Data line syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.7
Data file qualifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Contents
ix
Combining rows from separate files . . . . . . . . . . . . . . . . . 67 5.8
Job control qualifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6 Command file: Specifying the terms in the mixed model
93
6.1
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.2
Specifying model formulae in ASReml
. . . . . . . . . . . . . . . . . 94
General rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 6.3
Fixed terms in the model . . . . . . . . . . . . . . . . . . . . . . . . . 99 Primary fixed terms . . . . . . . . . . . . . . . . . . . . . . . . . . 99 Sparse fixed terms . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.4
Random terms in the model . . . . . . . . . . . . . . . . . . . . . . . 100
6.5
Interactions and conditional factors . . . . . . . . . . . . . . . . . . . 101 Interactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 Expansions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 Conditional factors . . . . . . . . . . . . . . . . . . . . . . . . . . 102 Associated Factors . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.6
Alphabetic list of model functions . . . . . . . . . . . . . . . . . . . . 103
6.7
Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.8
Generalized Linear (Mixed) Models . . . . . . . . . . . . . . . . . . . 108 Generalized Linear Mixed Models . . . . . . . . . . . . . . . . . . 112
Contents
x
6.9
Missing values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 Missing values in the response . . . . . . . . . . . . . . . . . . . . 112 Missing values in the explanatory variables . . . . . . . . . . . . . 113
6.10 Some technical details about model fitting in ASReml . . . . . . . . . 114 Sparse versus dense . . . . . . . . . . . . . . . . . . . . . . . . . . 114 Ordering of terms in ASReml
. . . . . . . . . . . . . . . . . . . . 114
Aliassing and singularities . . . . . . . . . . . . . . . . . . . . . . 114 Examples of aliassing . . . . . . . . . . . . . . . . . . . . . . . . . 115 6.11 Wald F Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 7 Command file: Specifying the variance structures 7.1
117
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 Non singular variance matrices . . . . . . . . . . . . . . . . . . . . 118
7.2
Variance model specification in ASReml
. . . . . . . . . . . . . . . . 119
7.3
A sequence of structures for the NIN data . . . . . . . . . . . . . . . 119
7.4
Variance structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 General syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 Variance header line . . . . . . . . . . . . . . . . . . . . . . . . . 128 R structure definition
. . . . . . . . . . . . . . . . . . . . . . . . 129
G structure header and definition lines 7.5
. . . . . . . . . . . . . . . 131
Variance model description . . . . . . . . . . . . . . . . . . . . . . . . 132
Contents
xi
Forming variance models from correlation models . . . . . . . . . . 137 Notes on the variance models . . . . . . . . . . . . . . . . . . . . 138 Notes on Mat´ern . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 Notes on power models . . . . . . . . . . . . . . . . . . . . . . . . 141 Notes on Factor Analytic models
. . . . . . . . . . . . . . . . . . 142
Notes on OWN models . . . . . . . . . . . . . . . . . . . . . . . . 144 7.6
Variance structure qualifiers . . . . . . . . . . . . . . . . . . . . . . . 146
7.7
Rules for combining variance models . . . . . . . . . . . . . . . . . . 147
7.8
G structures involving more than one random term
7.9
Constraining variance parameters . . . . . . . . . . . . . . . . . . . . 150
. . . . . . . . . . 148
Parameter constraints within a variance model . . . . . . . . . . . 150 Constraints between and within variance models . . . . . . . . . . 151 Equating variance structures . . . . . . . . . . . . . . . . . . . . . 152 7.10 Model building using the !CONTINUE qualifier 7.11 Convergence issues
. . . . . . . . . . . . . . . . . . . . . . . . . . . 155
8 Command file: Multivariate analysis 8.1
. . . . . . . . . . . . . 154
157
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 Repeated measures on rats . . . . . . . . . . . . . . . . . . . . . . 158 Wether trial data . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
8.2
Model specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Contents
xii
8.3
Variance structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 Specifying multivariate variance structures in ASReml . . . . . . . 160
8.4
The output for a multivariate analysis . . . . . . . . . . . . . . . . . . 161
9 Command file: Genetic analysis
164
9.1
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
9.2
The command file . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
9.3
The pedigree file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
9.4
Reading in the pedigree file . . . . . . . . . . . . . . . . . . . . . . . 167
9.5
Genetic groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
9.6
Reading a user defined inverse relationship matrix . . . . . . . . . . . 171 Genetic groups in GIV matrices . . . . . . . . . . . . . . . . . . . 173 The example continued . . . . . . . . . . . . . . . . . . . . . . . . 173
10 Tabulation of the data and prediction from the model
175
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 10.2 Tabulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 10.3 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177 Underlying principles . . . . . . . . . . . . . . . . . . . . . . . . . 177 Predict syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 Predict failure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182 Associated factors
. . . . . . . . . . . . . . . . . . . . . . . . . . 188
Contents
xiii
Complicated weighting with !PRESENT . . . . . . . . . . . . . . . 191 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193 11 Command file: Running the job
194
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195 11.2 The command line . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195 Normal run . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195 Processing a .pin file . . . . . . . . . . . . . . . . . . . . . . . . 196 Forming a job template from a data file . . . . . . . . . . . . . . . 196 11.3 Command line options . . . . . . . . . . . . . . . . . . . . . . . . . . 197 Prompt for arguments (A) . . . . . . . . . . . . . . . . . . . . . . 199 Output control (B, J)
. . . . . . . . . . . . . . . . . . . . . . . . 199
Debug command line options (D, E) . . . . . . . . . . . . . . . . . 199 Graphics command line options (G, H, I, N, Q) . . . . . . . . . . . 199 Job control command line options (C, F, O, R) . . . . . . . . . . . 201 Workspace command line options (S, W) . . . . . . . . . . . . . . 202 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 11.4 Advanced processing arguments . . . . . . . . . . . . . . . . . . . . . 203 Standard use of arguments . . . . . . . . . . . . . . . . . . . . . . 203 Prompting for input
. . . . . . . . . . . . . . . . . . . . . . . . . 204
Paths and Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
Contents
xiv
Order of Substitution
. . . . . . . . . . . . . . . . . . . . . . . . 208
11.5 Performance issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208 Multiple processors . . . . . . . . . . . . . . . . . . . . . . . . . . 208 Slow processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208 Timing processes . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 12 Command file: Merging data files
210
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211 12.2 Merge Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211 12.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 13 Functions of variance components
214
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 13.2 VPREDICT: PIN file processing . . . . . . . . . . . . . . . . . . . . . 215 13.3 Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 Linear combinations of components . . . . . . . . . . . . . . . . . 216 Heritability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217 A more detailed example . . . . . . . . . . . . . . . . . . . . . . . 218 14 Description of output files
220
14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
Contents
xv
14.2 An example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222 14.3 Key output files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 The .asr file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 The .sln file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226 The .yht file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228 14.4 Other ASReml output files . . . . . . . . . . . . . . . . . . . . . . . . 229 The .aov file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229 The .asl file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 The .dpr file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 The .pvc file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 The .pvs file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 The .res file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 The .rsv file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240 The .tab file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240 The .vrb file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241 The .vvp file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242 14.5 ASReml output objects and where to find them 15 Error messages
. . . . . . . . . . . . 243 246
15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247 15.2 Common problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
Contents
xvi
15.3 Things to check in the .asr file . . . . . . . . . . . . . . . . . . . . . 250 15.4 An example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253 15.5 Information, Warning and Error messages . . . . . . . . . . . . . . . . 263 16 Examples
278
16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279 16.2 Split plot design - Oats . . . . . . . . . . . . . . . . . . . . . . . . . 279 16.3 Unbalanced nested design - Rats . . . . . . . . . . . . . . . . . . . . 283 16.4 Source of variability in unbalanced data - Volts . . . . . . . . . . . . . 287 16.5 Balanced repeated measures - Height . . . . . . . . . . . . . . . . . . 290 16.6 Spatial analysis of a field experiment - Barley . . . . . . . . . . . . . . 298 16.7 Unreplicated early generation variety trial - Wheat . . . . . . . . . . . 305 16.8 Paired Case-Control study - Rice . . . . . . . . . . . . . . . . . . . . 311 Standard analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 312 A multivariate approach . . . . . . . . . . . . . . . . . . . . . . . 317 Interpretation of results . . . . . . . . . . . . . . . . . . . . . . . . 321 16.9 Balanced longitudinal data - Random coefficients and cubic smoothing splines - Oranges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323 16.10Generalized Linear (Mixed) Models . . . . . . . . . . . . . . . . . . . 331 Binomial analysis of Footrot score . . . . . . . . . . . . . . . . . . 331 Bivariate analysis of Foot score . . . . . . . . . . . . . . . . . . . 336 Multinomial Ordinal GLM analysis of Cheese taste . . . . . . . . . 338
Contents
xvii
Multinomial Ordinal GLMM analysis of Footrot score . . . . . . . . 340 16.11Multivariate animal genetics data - Sheep . . . . . . . . . . . . . . . . 341 Half-sib analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342 Animal model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351 Bibliography
355
Index
362
List of Tables
2.1
Combination of models for G and R structures . . . . . . . . . . . . 16
3.1
Trial layout and allocation of varieties to plots in the NIN field trial . 29
5.1
List of transformation qualifiers and their actions with examples . . . 55
5.2
Qualifiers relating to data input and output . . . . . . . . . . . . . . 64
5.3
List of commonly used job control qualifiers . . . . . . . . . . . . . 68
5.4
List of occasionally used job control qualifiers . . . . . . . . . . . . . 72
5.5
List of rarely used job control qualifiers . . . . . . . . . . . . . . . . 79
5.6
List of very rarely used job control qualifiers . . . . . . . . . . . . . 89
6.1
Summary of reserved words, operators and functions . . . . . . . . . 96
6.2
Alphabetic list of model functions and descriptions . . . . . . . . . . 103
6.3
Link qualifiers and functions . . . . . . . . . . . . . . . . . . . . . . 108
6.4
GLM distribution qualifiers The default link is listed first followed by permitted alternatives. . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.5
Examples of aliassing in ASReml . . . . . . . . . . . . . . . . . . . 115
7.1
Sequence of variance structures for the NIN field trial data . . . . . 125
xviii
List of Tables
xix
7.2
Schematic outline of variance model specification in ASReml . . . . 127
7.3
Details of the variance models available in ASReml
7.4
List of R and G structure qualifiers . . . . . . . . . . . . . . . . . . 146
7.5
Examples of constraining variance parameters in ASReml . . . . . . 150
9.1
List of pedigree file qualifiers
10.1
List of prediction qualifiers . . . . . . . . . . . . . . . . . . . . . . . 183
10.2
List of predict plot options . . . . . . . . . . . . . . . . . . . . . . . 186
10.3
Trials classified by region and location
10.4
Trial means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
10.5
Location means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
11.1
Command line options . . . . . . . . . . . . . . . . . . . . . . . . . 198
11.2
The use of arguments in ASReml . . . . . . . . . . . . . . . . . . . 204
11.3
High level qualifiers
12.1
List of MERGE qualifiers . . . . . . . . . . . . . . . . . . . . . . . . 212
14.1
Summary of ASReml output files . . . . . . . . . . . . . . . . . . . 221
14.2
ASReml output objects and where to find them . . . . . . . . . . . 243
15.1
Some information messages and comments . . . . . . . . . . . . . . 263
15.2
List of warning messages and likely meaning(s) . . . . . . . . . . . . 264
. . . . . . . . . 132
. . . . . . . . . . . . . . . . . . . . . 168
. . . . . . . . . . . . . . . . 188
. . . . . . . . . . . . . . . . . . . . . . . . . . 205
List of Tables
xx
15.3
Alphabetical list of error messages and probable cause(s)/remedies . 268
16.1
A split-plot field trial of oat varieties and nitrogen application . . . . 279
16.2
Rat data: AOV decomposition . . . . . . . . . . . . . . . . . . . . . 284
16.3
REML log-likelihood ratio for the variance components in the voltage data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
16.4
Summary of variance models fitted to the plant data . . . . . . . . . 292
16.5
Summary of Wald F statistics for fixed effects for variance models fitted to the plant data . . . . . . . . . . . . . . . . . . . . . . . . . 298
16.6
Field layout of Slate Hall Farm experiment . . . . . . . . . . . . . . 300
16.7
Summary of models for the Slate Hall data . . . . . . . . . . . . . . 305
16.8
Estimated variance components from univariate analyses of bloodworm data. (a) Model with homogeneous variance for all terms and (b) Model with heterogeneous variance for interactions involving tmt 315
16.9
Equivalence of random effects in bivariate and univariate analyses . . 317
16.10 Estimated variance parameters from bivariate analysis of bloodworm data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319 16.11 Orange data: AOV decomposition . . . . . . . . . . . . . . . . . . . 327 16.12 Sequence of models fitted to the Orange data . . . . . . . . . . . . 328 16.13 Response frequencies in a cheese tasting experiment . . . . . . . . . 338 16.14 REML estimates of a subset of the variance parameters for each trait for the genetic example, expressed as a ratio to their asymptotic s.e. 343 16.15 Wald F statistics of the fixed effects for each trait for the genetic example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
List of Tables
xxi
16.16 Variance models fitted for each part of the ASReml job in the analysis of the genetic example . . . . . . . . . . . . . . . . . . . . . . . . . 346
List of Figures
5.1
Variogram in 4 sectors for Cashmore data . . . . . . . . . . . . . . . 92
14.1
Residual versus Fitted values . . . . . . . . . . . . . . . . . . . . . . 228
14.2
Variogram of residuals . . . . . . . . . . . . . . . . . . . . . . . . . 237
14.3
Plot of residuals in field plan order . . . . . . . . . . . . . . . . . . 238
14.4
Plot of the marginal means of the residuals . . . . . . . . . . . . . . 239
14.5
Histogram of residuals . . . . . . . . . . . . . . . . . . . . . . . . . 239
16.1
Residual plot for the rat data . . . . . . . . . . . . . . . . . . . . . 286
16.2
Residual plot for the voltage data . . . . . . . . . . . . . . . . . . . 289
16.3
Trellis plot of the height for each of 14 plants . . . . . . . . . . . . 291
16.4
Residual plots for the EXP variance model for the plant data . . . . . 294
16.5
Sample variogram of the residuals from the AR1×AR1 model for the Slate Hall data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
16.6
Sample variogram of the residuals from the AR1×AR1 model for the Tullibigeal data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
16.7
Sample variogram of the residuals from the AR1×AR1 + pol(column,-1) model for the Tullibigeal data . . . . . . . . . . . . . . . . . . . . . 309
xxii
List of Figures
xxiii
16.8
Rice bloodworm data: Plot of square root of root weight for treated versus control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
16.9
BLUPs for treated for each variety plotted against BLUPs for control 320
16.10 Estimated deviations from regression of treated on control for each variety plotted against estimate for control . . . . . . . . . . . . . . 321 16.11 Estimated difference between control and treated for each variety plotted against estimate for control . . . . . . . . . . . . . . . . . . 322 16.12 Trellis plot of trunk circumference for each tree . . . . . . . . . . . . 324 16.13 Fitted cubic smoothing spline for tree 1 . . . . . . . . . . . . . . . . 326 16.14 Plot of fitted cubic smoothing spline for model 1 . . . . . . . . . . . 329 16.15 Trellis plot of trunk circumference for each tree at sample dates (adjusted for season effects), with fitted profiles across time and confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . 330 16.16 Plot of the residuals from the nonlinear model of Pinheiro and Bates 331
1
Introduction
What ASReml can do Installation User Interface How to use the guide Help and discussion list Typographic conventions
1
1
Introduction
1.1
2
What ASReml can do ASReml (pronounced A S Rem el) is used to fit linear mixed models to quite large data sets with complex variance models. It extends the range of variance models available for the analysis of experimental data. ASReml has application in the analysis of •
(un)balanced longitudinal data,
•
repeated measures data (multivariate analysis of variance and spline type models),
•
(un)balanced designed experiments,
•
multi-environment trials and meta analysis,
•
univariate and multivariate animal breeding and genetics data (involving a relationship matrix for correlated effects),
•
regular or irregular spatial data.
The engine of ASReml underpins the REML procedure in GENSTAT. An interface for R called ASReml-R is available and runs under the same license as the ASReml program. While these interfaces will be adequate for many analyses, some large problems will need to use ASReml. The ASReml user interface is terse. Most effort has been directed towards efficiency of the engine. It normally operates in a batch mode. Problem size depends on the sparsity of the mixed model equations and the size of your computer. However, models with 500,000 effects have been fitted successfully. The computational efficiency of ASReml arises from using the Average Information REML procedure (giving quadratic convergence) and sparse matrix operations. ASReml has been operational since March 1996 and is updated periodically.
1.2
Installation Installation instructions are distributed with the program. If you require help with installation or licensing, please email
[email protected].
1
Introduction
1.3 ASReml2
3
User Interface ASReml is essentially a batch program with some optional interactive features. The typical sequence of operations when using ASReml is •
Prepare the data (typically using a spreadsheet or data base program)
•
Export that data as an ASCII file (for example export it as a .csv (comma separated values) file from Excel)
•
Prepare a job file with filename extension .as
•
Run the job file with ASReml
•
Review the various output files
•
revise the job and re run it, or
•
extract pertinent results for your report.
So you need an ASCII editor to prepare input files and review and print output files. We directly provide two options.
ASReml-W The ASReml-W interface is a graphical tool allowing the user to edit programs, run and then view the output, before saving results. It is available on the following platforms: •
Windows (32-bit and 64-bit),
•
Linux (32-bit and 64-bit, various incantations),
•
Sun/Solaris 32-bit
ASReml-W has a built-in help system explaining its use.
ConTEXT ConTEXT is a third-party freeware text editor, with programming extensions which make it a suitable environment for running ASReml under Windows. The ConTEXT directory on the CD-ROM includes installation files and instructions for configuring it for use in ASReml. Full details of ConTEXT are available from http://www.context.cx/.
1
Introduction
1.4
How to use this guide The guide consists of 16 chapters. Chapter 1 introduces ASReml and describes the conventions used in the guide. Chapter 2 outlines some basic theory which you may need to come back to.
Theory
Getting started Examples
Data file
Linear model Variance model
Prediction
New ASReml users are advised to read Chapter 3 before attempting to code their first job. It presents an overview of basic ASReml coding demonstrated on a real data example. Chapter 16 presents a range of examples to assist users further. When coding you first job, look for an example to use as a template. Data file preparation is described in Chapter 4, and Chapter 5 describes how to input data into ASReml. Chapters 6 and 7 are key chapters which present the syntax for specifying the linear model and the variance models for the random effects in the linear mixed model. Variance modelling is a complex aspect of analysis. We introduce variance modelling in ASReml by example in Chapter 7. Chapters 8 and 9 describe special commands for multivariate and genetic analyses respectively. Chapter 10 deals with prediction of fixed and random effects from the linear mixed model and Chapter 13 presents the syntax for forming functions of variance components such as heritability. Chapter 11 discusses the operating system level command for running an ASReml job. Chapter 12 describes a new data merging facility. Chapter 14 gives a detailed explanation of the output files. Chapter 15 gives an overview of the error messages generated in ASReml and some guidance as to their probable cause.
Output
1.5
4
Getting assistance and the ASReml forum The ASReml help accessable through ASReml-W can also be linked to ConText or accessed directly (ASReml.chm).
Audio Tutorials
Support
There is a User Area on the website (http://www.VSNi.co.uk select ASReml and then User Area) which contains contributed material that may be of assistance. It includes an ASReml tutorial in the form of sixteen sets of slides with audio (.mp3) discussion. The sessions last about 20 minutes each. Users with a support contract with VSN should email
[email protected] for assistance with installation and running ASReml. When requesting help, please send the input command file, the data file and the corresponding primary output
1
Introduction
5
file along with a description of the problem.
ASReml forum
1.6
There is an ASReml forum which all ASReml users (including unsupported users) are encouraged to join. Register now at http://www.vsni.co.uk/forum.
Typographic conventions A hands on approach is the best way to develop a working understanding of a new computing package. We therefore begin by presenting a guided tour of ASReml using a sample data set for demonstration (see Chapter 3). Throughout the guide new concepts are demonstrated by example wherever possible. In this guide you will find framed sample boxes to the right of the page as shown here. These contain ASReml command file (sample) code. Note that –
–
the code under discussion is highlighted in bold type for easy identification, . the continuation symbol ( .. ) is used to indicate that some of the original code is omitted.
An example ASReml code box bold type highlights sections of code currently under discussion remaining code is not highlighted . . . indicates that some of the original code is omitted from the display
Data examples are displayed in larger boxes in the body of the text, see, for example, page 43. Other conventions are as follows: •
keyboard key names appear in smallcaps, for example, tab and esc,
•
example code within the body of the text is in this size and font and is highlighted in bold type, see pages 34 and 50,
•
in the presentation of general ASReml syntax, for example [path] asreml basename[.as] [arguments] –
–
–
typewriter font is used for text that must be typed verbatim, for example, asreml and .as after basename in the example, italic font is used to name information to be supplied by the user, for example, basename stands for the name of a file with an .as filename extension, square brackets indicate that the enclosed text and/or arguments are not always required. Do not enter these square brackets.
•
ASReml output is in this size and font, see page 36,
•
this font is used for all other code.
2
Some theory
The linear mixed model Introduction Direct product structures Variance structures for the errors: R structures Variance structures for the random effects: G structures
Estimation Estimation of the variance parameters Estimation/prediction of fixed and random effects
What are BLUPs? Combining variance models Inference: Random effects Tests of hypotheses: variance parameters Diagnostics
Inference: Fixed effects Introduction Incremental and Conditional Wald Statistics Kenward and Roger Adjustments Approximate stratum variances
6
2
Some theory
2.1
7
The linear mixed model Introduction If y denotes the n × 1 vector of observations, the linear mixed model can be written as y = Xτ + Zu + e (2.1) where τ is the p × 1 vector of fixed effects, X is an n × p design matrix of full column rank which associates observations with the appropriate combination of fixed effects, u is the q × 1 vector of random effects, Z is the n × q design matrix which associates observations with the appropriate combination of random effects, and e is the n × 1 vector of residual errors. The model (2.1) is called a linear mixed model or linear mixed effects model. It is assumed ¸¶ · µ· ¸ · ¸ G(γ) 0 0 u , θ ∼N (2.2) 0 R(φ) 0 e where the matrices G and R are functions of parameters γ and φ, respectively. The parameter θ is a variance parameter which we will refer to as the scale parameter. In mixed effects models with more than one residual variance, arising for example in the analysis of data with more than one section (see below) or variate, the parameter θ is fixed to one. In mixed effects models with a single residual variance then θ is equal to the residual variance (σ 2 ). In this case R must be a correlation matrix (see Table 2.1 for a discussion).
Direct product structures To undertake variance modelling in ASReml you need to understand the formation of variance structures via direct products (⊗). The direct product of two matrices A (m×p) and B (n×q) is
a11 B
.. .
am1 B
. . . a1p B .. ..
. . ..
. amp B
. Direct products in R structures Consider a vector of common errors associated with an experiment. The usual least squares assumption (and the default in ASReml) is that these are independently and identically distributed (IID). However, if e was from a field experiment
2
Some theory
8
laid out in a rectangular array of r rows by c columns, we could arrange the residuals as a matrix and might consider that they were autocorrelated within rows and columns. Writing the residuals as a vector in field order, that is, by sorting the residuals rows within columns (plots within blocks) the variance of the residuals might then be σe2 Σc (ρc ) ⊗ Σr (ρr ) where Σc (ρc ) and Σr (ρr ) are correlation matrices for the row model (order r, autocorrelation parameter ρr ) and column model (order c, autocorrelation parameter ρc ) respectively. More specifically, a two-dimensional separable autoregressive spatial structure (AR1 ⊗ AR1) is sometimes assumed for the common errors in a field trial analysis (see Gogel (1997) and Cullis et al. (1998) for examples). In this case Σr =
See Chapter 8 for further details
1 ρr ρ2r .. . ρr−1 r
1 ρr .. . ρr−2 r
1 .. .. . . r−3 ρr ... 1
and
Σc =
1 ρc ρ2c .. . ρc−1 c
1 ρc .. . ρc−2 c
1 .. .. . . c−3 ρc ... 1
.
Alternatively, the residuals might relate to a multivariate analysis with nt traits and n units and be ordered traits within units. In this case an appropriate variance structure might be In ⊗ Σ where Σ (nt ×nt ) is a general or unstructured variance matrix. Direct products in G structures Likewise, the random terms in u in the model may have a direct product variance structure. For example, for a field trial with s sites, g varieties and the effects ordered varieties within sites, the model term site.variety may have the variance structure Σ ⊗ Ig where Σ is the variance matrix for sites. This would imply that the varieties are independent random effects within each site, have different variances at each site, and are correlated across sites. Important Whenever a random term is formed as the interaction of two factors you should consider whether the IID assumption is sufficient or if a direct product structure might be more appropriate.
2
Some theory
9
Variance structures for the errors: R structures The vector e will in some situations be a series of vectors indexed by a factor or factors. The convention we adopt is to refer to these as sections. Thus e = [e01 , e02 , . . . , e0s ]0 and the ej represent the errors of sections of the data. For example, these sections may represent different experiments in a multi-environment trial (MET), or different trials in a meta analysis. It is assumed that R is the direct sum of s matrices Rj , j = 1 . . . s, that is,
R1 0 . R = ⊕sj=1 Rj = .. 0 0
0 R2 .. . 0 0
... ... .. .
0 0 .. .
. . . Rs−1 ... 0
0 0 .. .
, 0
Rs
so that each section has its own variance structure which is assumed to be independent of the structures in other sections. A structure for the residual variance for the spatial analysis of multi-environment trials (Cullis et al., 1998) is given by Rj
= Rj (φj ) = σj2 (Σj (ρj )).
Each section represents a trial and this model accounts for between trial error variance heterogeneity (σj2 ) and possibly a different spatial variance model for each trial. In the simplest case the matrix R could be known and proportional to an identity matrix. Each component matrix, Rj (or R itself for one section) is assumed to be the direct product (see Searle, 1982) of one, two or three component matrices. The component matrices are related to the underlying structure of the data. If the structure is defined by factors, for example, replicates, rows and columns, then the matrix R can be constructed as a direct product of three matrices describing the nature of the correlation across replicates, rows and columns. These factors must completely describe the structure of the data, which means that 1. the number of combined levels of the factors must equal the number of data points, 2. each factor combination must uniquely specify a single data point. These conditions are necessary to ensure the expression var (e) = θR is valid. The assumption that the overall variance structure can be constructed as a direct
2
Some theory
10
product of matrices corresponding to underlying factors is called the assumption of separability and assumes that any correlation process across levels of a factor is independent of any other factors in the term. Multivariate data and repeated measures data usually satisfy the assumption of separability. In particular, if the data are indexed by factors units and traits (for multivariate data) or times (for repeated measures data), then the R structure may be written as units ⊗ traits or units ⊗ times. This assumption is sometimes required to make the estimation process computationally feasible, though it can be relaxed, for certain applications, for example fitting isotropic covariance models to irregularly spaced spatial data.
Variance structures for the random effects: G structures The q × 1 vector of random effects is often composed of b subvectors u = [u01 u02 . . . u0b ]0 where the subvectors ui are of length qi and these subvectors are usually assumed independent normally distributed with variance matrices θGi . Thus just like R we have
G1 0 . G = ⊕bi=1 Gi = .. 0 0
0 G2 .. . 0 0
... ... .. .
0 0 .. .
. . . Gb−1 ... 0
0 0 .. .
. 0
Gb
There is a corresponding partition in Z, Z = [Z 1 Z 2 . . . Z b ]. As before each submatrix, Gi , is assumed to be the direct product of one, two or three component matrices. These matrices are indexed for each of the factors constituting the term in the linear model. For example, the term site.genotype has two factors and so the matrix Gi is comprised of two component matrices defining the variance structure for each factor in the term. Models for the component matrices Gi include the standard model for which Gi = γi I qi and direct product models for correlated random factors given by Gi = Gi1 ⊗ Gi2 ⊗ Gi3 for three component factors. The vector ui is therefore assumed to be the vector representation of a 3-way array. For two factors the vector ui is simply the vec of a matrix with rows and columns indexed by the component factors in the term, where vec of a matrix is a function which stacks the columns of its matrix argument below each other. A range of models are available for the components of both R and G. They include correlation (C) models (that is, where the diagonals are 1), or covariance
2
Some theory
11
(V ) models and are discussed in detail in Chapter 7. Some correlation models include •
autoregressive (order 1 or 2)
•
moving average (order 1 or 2)
•
ARMA(1,1)
•
uniform
•
banded
•
general correlation.
Some of the covariance models include •
diagonal (that is, independent with heterogeneous variances)
•
antedependence
•
unstructured
•
factor analytic.
There is the facility within ASReml to allow for a nonzero covariance between the subvectors of u, for example in random regression models. In this setting the intercept and say the slope for each unit are assumed to be correlated and it is more natural to consider the two component terms as a single term, which gives rise to a single G structure. This concept is discussed later.
2.2
Estimation Estimation involves two processes that are closely linked. They are performed within the ‘engine’ of ASReml. One process involves estimation of τ and prediction of u (although the latter may not always be of interest) for given θ, φ and γ. The other process involves estimation of these variance parameters. Note that in the following sections we have set θ = 1 to simplify the presentation of results.
Estimation of the variance parameters Estimation of the variance parameters is carried out using residual or restricted maximum likelihood (REML), developed by Patterson and Thompson (1971). An historical development of the theory can be found in Searle et al. (1992). Note firstly that y ∼ N (Xτ , H). (2.3)
2
Some theory
12
where H = R + ZGZ 0 . REML does not use (2.3) for estimation of variance parameters, but rather uses a distribution free of τ , essentially based on error contrasts or residuals. The derivation given below is presented in Verbyla (1990). We transform y using a non-singular matrix L = [L1 L2 ] such that L01 X = Ip ,
L02 X = 0.
If y j = L0j y, j = 1, 2, ·
¸
y1 ∼N y2
µ·
¸ ·
τ , 0
L01 HL1 L02 HL1
L01 HL2 L02 HL2
¸¶
.
The full distribution of L0 y can be partitioned into a conditional distribution, namely y 1 |y 2 , for estimation of τ , and a marginal distribution based on y 2 for estimation of γ and φ; the latter is the basis of the residual likelihood. The estimate of τ is found by equating y 1 to its conditional expectation, and after some algebra we find, τˆ = (X 0 H −1 X)−1 X 0 H −1 y Estimation of κ = [γ 0 φ0 ]0 is based on the log residual likelihood, 1 `R = − (log det L02 H −1 L2 + y 02 (L02 HL2 )−1 y 2 ) 2 1 = − (log det X 0 H −1 X + log det H + y 0 P y 2 ) 2
(2.4)
where P = H −1 − H −1 X(X 0 H −1 X)−1 X 0 H −1 . Note that y 0 P y = (y − X τˆ )0 H −1 (y − X τˆ ). The log-likelihood (2.4) depends on X and not on the particular non-unique transformation defined by L. The log residual likelihood (ignoring constants) can be written as 1 `R = − (log det C + log det R + log det G + y 0 P y). 2 We can also write P
= R−1 − R−1 W C −1 W 0 R−1
(2.5)
2
Some theory
13
with W = [X Z] . Letting κ = (γ, φ), the REML estimates of κi are found by calculating the score 1 U (κi ) = ∂`R /∂κi = − [tr (P H i ) − y 0 P H i P y] 2
(2.6)
and equating to zero. Note that H i = ∂H/∂κi . The elements of the observed information matrix are −
∂ 2 `R ∂κi ∂κj
=
1 1 tr (P H ij ) − tr (P H i P H j ) 2 2 1 + y 0 P H i P H j P y − y 0 P H ij P y 2
(2.7)
where H ij = ∂ 2 H/∂κi ∂κj . The elements of the expected information matrix are Ã
∂ 2 `R E − ∂κi ∂κj
!
=
1 tr (P H i P H j ) . 2
(2.8)
Given an initial estimate κ(0) , an update of κ, κ(1) using the Fisher-scoring (FS) algorithm is κ(1) = κ(0) + I(κ(0) , κ(0) )−1 U (κ(0) ) (2.9) where U (κ(0) ) is the score vector (2.6) and I(κ(0) , κ(0) ) is the expected information matrix (2.8) of κ evaluated at κ(0) . For large models or large data sets, the evaluation of the trace terms in either (2.7) or (2.8) is either not feasible or is very computer intensive. To overcome this problem ASReml uses the AI algorithm (Gilmour, Thompson and Cullis, 1995). The matrix denoted by IA is obtained by averaging (2.7) and (2.8) and approximating y 0 P H ij P y by its expectation, tr (P H ij ) in those cases when H ij 6= 0. For variance components models (that is those linear with respect to variances in H), the terms in IA are exact averages of those in (2.7) and (2.8). The basic idea is to use IA (κi , κj ) in place of the expected information matrix in (2.9) to update κ. The elements of IA are IA (κi , κj ) =
1 0 y P H i P H j P y. 2
(2.10)
2
Some theory
14
The IA matrix is the (scaled) residual sums of squares and products matrix of y = [y 1 , . . . , y k ] where y i is the ‘working’ variate for κi and is given by yi = H iP y ˜ = H i R−1 e ˜ , κi ∈ φ = Ri R−1 e ˜ , κi ∈ γ = ZGi G−1 u
˜ = y − X τˆ − Z u ˜ , τˆ and u ˜ are solutions to (2.11). In this form the AI where e matrix is relatively straightforward to calculate. The combination of the AI algorithm with sparse matrix methods, in which only non-zero values are stored, gives an efficient algorithm in terms of both computing time and workspace.
Estimation/prediction of the fixed and random effects To estimate τ and predict u the objective function log fY (y | u ; τ , R) + log fU (u ; G) is used. The is the log-joint distribution of (Y , u). Differentiating with respect to τ and u leads to the mixed model equations (Robinson, 1991) which are given by "
X 0 R−1 X X 0 R−1 Z Z 0 R−1 X Z 0 R−1 Z + G−1
These can be written as
#"
#
τˆ ˜ u
"
=
X 0 R−1 y Z 0 R−1 y
#
.
(2.11)
˜ = W R−1 y Cβ
where C = W 0 R−1 W + G∗ , β = [τ 0 u0 ]0 and " ∗
G =
0 0 0 G−1
#
.
The solution of (2.11) requires values for γ and φ. In practice we replace γ and ˆ ˆ and φ. φ by their REML estimates γ
2
Some theory
15
˜ is the best Note that τˆ is the best linear unbiased estimator (BLUE) of τ , while u linear unbiased predictor (BLUP) of u for known γ and φ. We also note that ·
¸
˜ − β = τˆ − τ ∼ N β ˜ −u u
2.3
µ· ¸
¶
0 , C −1 . 0
What are BLUPs? Consider a balanced one-way classification. For data records ordered r repeats within b treatments regarded as random effects, the linear mixed model is y = Xτ + Zu + e where X = 1b ⊗ 1r is the design matrix for τ (the overall mean), Z = I b ⊗ 1r is the design matrix for the b (random) treatment effects ui and e is the error vector. Assuming that the treatment effects are random implies that u ∼ N (Aψ, σb2 I b ), for some design matrix A and parameter vector ψ. It can be shown that rσ 2 σ2 ˜ = 2 b 2 (¯ u y − 1¯ y·· ) + 2 Aψ (2.12) rσb + σ rσb + σ 2 ¯ is the vector of treatment means, y¯·· is the grand mean. The differences where y of the treatment means and the grand mean are the estimates of treatment effects if treatment effects are fixed. The BLUP is therefore a weighted mean of the data based estimate and the ‘prior’ mean Aψ. If ψ = 0, the BLUP in (2.12) becomes ˜= u
rσb2 (¯ y − 1¯ y·· ) rσb2 + σ 2
(2.13)
and the BLUP is a so-called shrinkage estimate. As rσb2 becomes large relative to σ 2 , the BLUP tends to the fixed effect solution, while for small rσb2 relative to σ 2 the BLUP tends towards zero, the assumed initial mean. Thus (2.13) represents a weighted mean which involves the prior assumption that the ui have zero mean. Note also that the BLUPs in this simple case are constrained to sum to zero. This is essentially because the unit vector defining X can be found by summing the columns of the Z matrix. This linear dependence of the matrices translates to dependence of the BLUPs and hence constraints. This aspect occurs whenever the column space of X is contained in the column space of Z. The dependence is slightly more complex with correlated random effects.
2
Some theory
2.4
16
Combining variance models The combination of variance models within G structures and R structures and between G structures and R structures is a difficult and important concept. The underlying principle is that each Ri and Gi variance model can only have a single scaling variance parameter associated with it. If there is more than one scaling variance parameter for any Ri or Gi then the variance model is overspecified, or nonidentifiable. Some variance models are presented in Table 2.1 to illustrate this principle. While all 9 forms of model in Table 2.1 can be specified within ASReml only models of forms 1 and 2 are recommended. Models 4-6 have too few variance parameters and are likely to cause serious estimation problems. For model 3, where the scale parameter θ has been fitted (univariate single site analysis), it becomes the scale for G. This parameterisation is bizarre and is not recommended. Models 7-9 have too many variance parameters and ASReml will arbitrarily fix one of the variance parameters leading to possible confusion for the user. If you fix the variance parameter to a particular value then it does not count for the purposes of applying the principle that there be only one scaling variance parameter. That is, models 7-9 can be made identifiable by fixing all but one of the nonidentifiable scaling parameters in each of G and R to a particular value. Table 2.1 Combination of models for G and R structures model
G1
G2
R1
R2
θ
comment
1. 2. 3. 4. 5. 6. 7. 8. 9.
V V C * C C V V *
C C C * C C V C *
C V V C C V * V V
C C C C C C * C V
y n y n y n * y *
valid valid valid, but not recommended inappropriate as R is a correlation model inappropriate, same scale for R and G inappropriate, no scaling parameter for G nonidentifiable, 2 scaling parameters for G nonidentifiable, scale for R and overall scale nonidentifiable, 2 scaling parameters for R
* indicates the entry is not relevant in this case Note that G1 and G2 are interchangeable in this table, as are R1 and R2
2
Some theory
2.5
17
Inference: Random effects Tests of hypotheses: variance parameters Inference concerning variance parameters of a linear mixed effects model usually relies on approximate distributions for the (RE)ML estimates derived from asymptotic results. It can be shown that the approximate variance matrix for the REML estimates is given by the inverse of the expected information matrix (Cox and Hinkley, 1974, section 4.8). Since this matrix is not available in ASReml we replace the expected information matrix by the AI matrix. Furthermore the REML estimates are consistent and asymptotically normal, though in small samples this approximation appears to be unreliable (see later). A general method for comparing the fit of nested models fitted by REML is the REML likelihood ratio test, or REMLRT. The REMLRT is only valid if the fixed effects are the same for both models. In ASReml this requires not only the same fixed effects model, but also the same parameterisation. If `R2 is the REML log-likelihood of the more general model and `R1 is the REML log-likelihood of the restricted model (that is, the REML log-likelihood under the null hypothesis), then the REMLRT is given by D = 2 log(`R2 /`R1 ) = 2 [log(`R2 ) − log(`R1 )]
(2.14)
which is strictly positive. If ri is the number of parameters estimated in model i, then the asymptotic distribution of the REMLRT, under the restricted model is χ2r2 −r1 .
Revised 08
The REMLRT is implicitly two-sided, and must be adjusted when the test involves an hypothesis with the parameter on the boundary of the parameter space. It can be shown that for a single variance component, the theoretical asymptotic distribution of the REMLRT is a mixture of χ2 variates, where the mixing probabilities are 0.5, one with 0 degrees of freedom (spike at 0) and the other with 1 degree of freedom. The approximate P-value for the REMLRT statistic (D), is 0.5(1-Pr(χ21 ≤ d)) where d is the observed value of D. This has a 5% critical value of 2.71 in contrast to the 3.84 critical value for a χ2 variate with 1 degree of freedom. The distribution of the REMLRT for the test that k variance components are zero, or tests involved in random regressions, which involve both variance and covariance components, involves a mixture of χ2 variates from 0 to k degrees of freedom. See Self and Liang (1987) for details.
2
Some theory
18
Tests concerning variance components in generally balanced designs, such as the balanced one-way classification, can be derived from the usual analysis of variance. It can be shown that the REMLRT for a variance component being zero is a monotone function of the F statistic for the associated term. To compare two (or more) non-nested models we can evaluate the Akaike Information Criteria (AIC) or the Bayesian Information Criteria (BIC) for each model. These are given by AIC = −2`Ri + 2ti BIC = −2`Ri + ti log ν
(2.15)
where ti is the number of variance parameters in model i and ν = n − p is the residual degrees of freedom. AIC and BIC are calculated for each model and the model with the smallest value is chosen as the preferred model.
Diagnostics In this section we will briefly review some of the diagnostics that have been implemented in ASReml for examining the adequacy of the assumed variance matrix for either R or G structures, or for examining the distributional assumptions regarding e or u. Firstly we note that the BLUP of the residual vector is given by ˜ ˜ = y − Wβ e = RP y
(2.16)
It follows that E (˜ e) = 0 var (˜ e) = R − W C −1 W 0 The matrix θW C −1 W 0 is the so-called ‘extended hat’ matrix. It is the linear mixed effects model analogue of σ 2 X(X 0 X)−1 X 0 for ordinary linear models. The diagonal elements are returned in the fourth field of the .yht file. Outliers ASReml3
The !OUTLIER qualifier invokes a partial implementation of research by Alison Smith, Ari Verbyla and Brian Cullis. With this qualifier, ASReml writes p
•
G−1 u and G−1 u/diag G−1 − G−1 C ZZ G−1 to the .sln file,
•
R−1 e and R−1 e/diag R−1 − R−1 W C −1 W 0 R−1 to the .yht file,
p
2
Some theory
19
•
and copies lines where the last ratio exceeds 3 in magnitude to the .res file
•
and reports the number of such lines to the .asr file.
•
It is not debugged for multivariate models or XFA models with zero Ψs.
Variogram
The variogram has been suggested as a useful diagnostic for assisting with the identification of appropriate variance models for spatial data (Cressie, 1991). Gilmour et al. (1997) demonstrate its usefulness for the identification of the sources of variation in the analysis of field experiments. If the elements of the data vector (and hence the residual vector) are indexed by a vector of spatial coordinates, si , i = 1, . . . , n, then the ordinates of the sample variogram are given by 1 ei (si ) − e˜j (sj )]2 , i, j = 1, . . . , n; i 6= j vij = [˜ 2
ASReml2
The sample variogram reported by ASReml has two forms depending on whether the spatial coordinates represent a complete rectangular lattice (as typical of a field trial) or not. In the lattice case, the sample variogram is calculated from the triple (lij1 , lij2 , vij ) where lij1 = si1 −sj1 and lij2 = si2 −sj2 are the displacements. As there will be many vij with the same displacements, ASReml calculates the means for each displacement pair lij1 , lij2 either ignoring the signs (default) or separately for same sign and opposite sign (!TWOWAY), after grouping the larger displacements: 9-10, 11-14, 15-20, .... The result is displayed as a perspective plot (see page 238) of the one or two surfaces indexed by absolute displacement group. In this case, the two directions may be on different scales. Otherwise ASReml forms a variogram based on polar coordinates. It calculates q 2 + l2 and an angle θ (−180 < θ < 180) the distance between points dij = lij1 ij ij ij2 subtended by the line from (0, 0) to (lij1 , lij2 ) with the x-axis. The angle can be calculated as θij = tan−1 (lij1 /lij2 ) choosing (0 < θij < 180) if lij2 > 0 and (−180 < θij < 0) if lij2 < 0. Note that the variogram has angular symmetry in that vij = vji , dij = dji and |θij − θji | = 180. The variogram presented averages the vij within 12 distance classes and 4, 6 or 8 sectors (selected using a !VGSECTORS qualifier) centred on an angle of (i − 1) ∗ 180/s (i = 1, ...s). A figure is produced which reports the trends in v¯ij with increasing distance for each sector. ASReml also computes the variogram from predictors of random effects which appear to have a variance structures defined in terms of distance. The variogram details are reported in the .res file.
2
Some theory
2.6
20
Inference: Fixed effects Introduction
ASReml2
Inference for fixed effects in linear mixed models introduces some difficulties. In general, the methods used to construct F -tests in analysis of variance and regression cannot be used for the diversity of applications of the general linear mixed model available in ASReml. One approach would be to use likelihood ratio methods (see Welham and Thompson, 1997) although their approach is not easily implemented. Wald-type test procedures are generally favoured for conducting tests concerning τ . The traditional Wald statistic to test the hypothesis H0 : Lτ = l for given L, r × p, and l, r × 1, is given by W = (Lˆ τ − l)0 {L(X 0 H −1 X)−1 L0 }−1 (Lˆ τ − l)
(2.17)
and asymptotically, this statistic has a chi-square distribution on r degrees of freedom. These are marginal tests, so that there is an adjustment for all other terms in the fixed part of the model. It is also anti-conservative if p-values are constructed because it assumes the variance parameters are known. The small sample behaviour of such statistics has been considered by Kenward and Roger (1997) in some detail. They presented a scaled Wald statistic, together with an F -approximation to its sampling distribution which they showed performed well in a range (though limited in terms of the range of variance models available in ASReml) of settings. In the following we describe the facilities now available in ASReml for conducting inference concerning terms which are the in dense fixed effects model component of the general linear mixed model. These facilities are not available for any terms in the sparse model. These include facilities for computing two types of Wald F statistics and partial implementation of the Kenward and Roger adjustments.
Incremental and Conditional Wald F Statistics The basic tool for inference is the Wald statistic defined in equation 2.17. ASReml produces a test of fixed effects, that reduces to an F statistic in special cases, by dividing the Wald statistic, constructed with l = 0, by r, the numerator degrees of freedom. In this form it is possible to perform an approximate F test if we can deduce the denominator degrees of freedom. However, there are several ways L can be defined to construct a test for a particular model term, two of which are available in ASReml. These Wald F statistics are labelled F-inc
2
Some theory
21
(for incremental) and F-con (for conditional) respectively. For balanced designs, these Wald F statistics are numerically identical to the F statistics obtained from the standard analysis of variance. The first method for computing Wald statistics (for each term) is the so-called “incremental” form. For this method, Wald statistics are computed from an incremental sum of squares in the spirit of the approach used in classical regression analysis (see Searle, 1971). For example if we consider a very simple model with terms relating to the main effects of two qualitative factors A and B, given symbolically by y ∼1+A+B where the 1 represents the constant term (µ), then the incremental sums of squares for this model can be written as the sequence R(1) R(A|1) = R(1, A) − R(1) R(B|1, A) = R(1, A, B) − R(1, A) where the R(·) operator denotes the reduction in the total sums of squares due to a model containing its argument and R(·|·) denotes the difference between the reduction in the sums of squares for any pair of (nested) models. Thus R(B|1, A) represents the difference between the reduction in sums of squares between the so-called maximal “model” y ∼1+A+B and y ∼1+A Implicit in these calculations is that •
we only compute Wald statistics for estimable functions (Searle, 1971, page 408),
•
all variance parameters are held fixed at the current REML estimates from the maximal model
In this example, it is clear that the incremental Wald statistics may not produce the desired test for the main effect of A, as in many cases we would like to produce a Wald statistic for A based on R(A|1, B) = R(1, A, B) − R(1, B)
2
Some theory
22
The issue is further complicated when we invoke “marginality” considerations. The issue of marginality between terms in a linear (mixed) model has been discussed in much detail by Nelder (1977). In this paper Nelder defines marginality for terms in a factorial linear model with qualitative factors, but later Nelder (1994) extended this concept to functional marginality for terms involving quantitative covariates and for mixed terms which involve an interaction between quantitative covariates and qualitative factors. Referring to our simple illustrative example above, with a full factorial linear model given symbolically by y ∼ 1 + A + B + A.B then A and B are said to be marginal to A.B, and 1 is marginal to A and B. In a three way factorial model given by y ∼ 1 + A + B + C + A.B + A.C + B.C + A.B.C the terms A, B, C, A.B, A.C and B.C are marginal to A.B.C. Nelder (1977, 1994) argues that meaningful and interesting tests for terms in such models can only be conducted for those tests which respect marginality relations. This philosophy underpins the following description of the second Wald statistic available in ASReml, the so-called “conditional” Wald statistic. This method is invoked by placing !FCON on the datafile line. ASReml attempts to construct conditional Wald statistics for each term in the fixed dense linear model so that marginality relations are respected. As a simple example, for the three way factorial model the conditional Wald statistics would be computed as Term 1 A B C A.B A.C B.C A.B.C
Sums of Squares R(1) R(A | 1,B,C,B.C) R(B | 1,A,C,A.C) R(C | 1,A,B,A.B) R(A.B | 1,A,B,C,A.C,B.C) R(A.C | 1,A,B,C,A.B,B.C) R(B.C | 1,A,B,C,A.B,A.C) R(A.B.C | 1,A,B,C,A.B,A.C,B.C)
= = = = = = =
M code . R(1,A,B,C,B.C) - R(1,B,C,B.C) A R(1,A,B,C,A.C) - R(1,A,C,A.C) A R(1,A,B,C,A.B) - R(1,A,B,A.B) A R(1,A,B,C,A.B,A.C,B.C) - R(1,A,B,C,A.C,B.C) B R(1,A,B,C,A.B,A.C,B.C) - R(1,A,B,C,A.B,B.C) B R(1,A,B,C,A.B,A.C,B.C) - R(1,A,B,C,A.B,A.C) B R(1,A,B,C,A.B,A.C,B.C,A.B.C) R(1,A,B,C,A.B,A.C,B.C) C
Of these the conditional Wald statistic for the 1, B.C and A.B.C terms would be the same as the incremental Wald statistics produced using the linear model y ∼ 1 + A + B + C + A.B + A.C + B.C + A.B.C The preceeding table includes a so-called M (marginality) code reported by ASReml when conditional Wald statistics are presented. All terms with the highest M code letter are tested conditionally on all other terms in the model, i.e. by dropping the term from the maximum model. All terms with the preceeding M code letter,
2
Some theory
23
are marginal to at least one term in a higher group, and so forth. For example, in the table, model term A.B has M code B because it is marginal to model term A.B.C and model term A has M code A because it is marginal to A.B, A.C and A.B.C. Model term mu (M code .) is a special case in that its test is conditional on all covariates but no factors. Following is some ASReml output from the .aov table which reports the terms in the conditional statistics.
Model Term 1 2 3 4 5
mu water variety sow water.variety
Marginality pattern for F-con calculation -- Model terms -DF 1 2 3 4 5 6 7 8 1 1 7 2 7
* I I I I
. * I I I
. C * I I
. C C * I
. . . C *
. . c . C
. c . . C
. . . . .
6 water.sow 2 7 variety.sow 14 8 water.variety.sow 14
I I I
I I I
I I I
I I I
I I I
* I I
C * I
. . *
F-inc tests the additional variation explained when the term (*) is added to a model consisting of the I terms. F-con tests the additional variation explained when the term (*) is added to a model consisting of the I and C/c terms. Any c terms are ignored in calculating DenDF for F-con using numerical derivatives for computational reasons. The . terms are ignored for both F-inc and F-con tests. Consider now a nested model which might be represented symbolically by y ∼ 1 + REGION + REGION.SITE For this model, the incremental and conditional Wald F statistics will be the same. However, it is not uncommon for this model to be presented to ASReml as y ∼ 1 + REGION + SITE with SITE identified across REGION rather than within REGION. Then the nested structure is hidden but ASReml will still detect the structure and produce a valid conditional Wald F statistic. This situation will be flagged in the M code field by changing the letter to lower case. Thus, in the nested model, the three M codes would be ., A and B because REGION.SITE is obviously an interaction dependent
2
Some theory
24
on REGION. In the second model, REGION and SITE appear to be independent factors so the initial M codes are ., A and A. However they are not independent because REGION removes additional degrees of freedom from SITE, so the M codes are changed from ., A and A to ., a and A. When using the conditional Wald F statistic, it is important to know what the “maximal conditional” model (MCM) is for that particular statistic. It is given explicitly in the .aov file. The purpose of the conditional Wald F statistic is to facilitate inference for fixed effects. It is not meant to be prescriptive of the appropriate test nor is the algorithm for determining the MCM foolproof. The Wald statistics are collectively presented in a summary table in the .asr file. The basic table includes the numerator degrees of freedom (ν1i ) and the incremental Wald F statistic for each term. To this is added the conditional Wald F statistic and the M code if !FCON is specified. A conditional Wald F statistic is not reported for mu in the .asr but is in the .aov file (adjusted for covariates). ASReml3
The !FOWN qualifier (page 84) allows the user to replace any/all of the conditional Wald F statistics with tests of the same terms but adjusted for other model terms as specified by the user; the !FOWN test is not performed if it implies a change in degrees of freedom from that obtained by the incremental model.
Kenward and Roger Adjustments In moderately sized analyses, ASReml will also include the denominator degrees of freedom (DenDF, denoted by ν2i , Kenward and Roger, 1997) and a probablity value if these can be computed. They will be for the conditional Wald F statistic if it is reported. The !DDF i (see page 69) qualifier can be used to suppress the DenDF calculation (!DDF -1) or request a particular algorithmic method: !DDF 1 for numerical derivatives, !DDF 2 for algebraic derivatives. The value in the probability column (either P inc or P con) is computed from an Fν1i ,ν2i reference distribution. An approximation is used for computational convenience when calculating the DenDF for Conditional F statistics using numerical derivatives. The DenDF reported then relates to a maximal conditional incremental model (MCIM) which, depending on the model order, may not always coincide with the maximal conditional model (MCM) under which the conditional F statistic is calculated. The MCIM model omits terms fitted after any terms ignored for the conditional test (I after . in marginality pattern). In the example above, MCIM ignores variety.sow when calculating DenDF for the test of water and ignores water.sow when calculating DenDF for the test of variety. When DenDF is not
2
Some theory
25
available, it is often possible, though anti-conservative to use the residual degrees of freedom for the denominator. Kenward and Roger (1997) pursued the concept of construction of Wald-type test statistics through an adjusted variance matrix of τˆ. They argued that it is useful to consider an improved estimator of the variance matrix of τˆ which has less bias and accounts for the variability in estimation of the variance parameters. There are two reasons for this. Firstly, the small sample distribution of Wald F statistics is simplified when the adjusted variance matrix is used. Secondly, if measures of precision are required for τˆ or effects therein, those obtained from the adjusted variance matrix will generally be preferred. Unfortunately the Wald statistics are currently computed using an unadjusted variance matrix.
Approximate stratum variances ASReml reports approximate stratum variances and degrees of freedom for simple variance components models. For the linear mixed-effects model with vari2 = 1) where G = ⊕q γ I , it is often possible ance components (setting σH j=1 j bj to consider a natural ordering of the variance component parameters including σ 2 . Based on an idea due to Thompson (1980), ASReml computes approximate stratum degrees of freedom and stratum variances by a modified Cholesky diagonalisation of the average information matrix. That is, if F is the average information matrix for σ, let U be an upper triangular matrix such that F = U 0 U . We define U c = Dc U where D c is a diagonal matrix whose elements are given by the inverse elements of the last column of U ie dcii = 1/uir , i = 1, . . . , r. The matrix U c is therefore upper triangular with the elements in the last column equal to one. If the vector σ is ordered in the “natural” way, with σ 2 being the last element, then we can define the vector of so called “pseudo” stratum variance components by ξ = U cσ Thence
var (ξ) = D 2c
The diagonal elements can be manipulated to produce effective stratum degrees of freedom Thompson (1980) viz νi = 2ξi2 /d2cii In this way the closeness to an orthogonal block structure can be assessed.
3
A guided tour
Introduction Nebraska Intrastate Nursery (NIN) field experiment The ASReml data file The ASReml command file The title line Reading the data The data file line Specifying the terms in the mixed model Tabulation Prediction Variance structures
Running the job Description of output files The .asr file The .sln file The .yht file
Tabulation, predicted values and functions of the variance components
26
3
A guided tour
3.1
27
Introduction This chapter presents a guided tour of ASReml, from data file preparation and basic aspects of the ASReml command file, to running an ASReml job and interpreting the output files. You are encouraged to read this chapter before moving to the later chapters;
Revised 08
•
a real data example is used in this chapter for demonstration, see below,
•
the same data are also used in later chapters,
•
links to the formal discussion of topics are clearly signposted by margin notes.
This example is of a randomised block analysis of a field trial, and is only one of many forms of analysis that ASReml can perform. It is chosen because it allows an introduction to the main ideas involved in running ASReml. However some aspects of ASReml, in particular, pedigree files (see Chapter 9) and multivariate analysis (see Chapter 8) are only covered in later chapters. ASReml is essentially a batch program with some optional interactive features. The typical sequence of operations when using ASReml is
ASReml2
3.2
•
Prepare the data (typically using a spreadsheet or data base program)
•
Export that data as an ASCII file (for example export it as a .csv (comma separated values) file from Excel)
•
Prepare a job file with filename extension .as.
•
Run the job file with ASReml
•
Review the various output files
•
revise the job and re run it, or
•
extract pertinent results for your report.
You will need a file editor to create the command file and to view the various output files. On unix systems, vi and emacs are commonly used. Under Windows, there are several suitable program editors available such as ASReml-W and ConText described in section 1.3.
Nebraska Intrastate Nursery (NIN) field experiment The yield data from an advanced Nebraska Intrastate Nursery (NIN) breeding trial conducted at Alliance in 1988/89 will be used for demonstration, see Stroup
3
A guided tour
28
et al. (1994) for details. Four replicates of 19 released cultivars, 35 experimental wheat lines and 2 additional triticale lines were laid out in a 22 row by 11 column rectangular array of plots; the varieties were allocated to the plots using a randomised complete block (RCB) design. In field trials, complete replicates are typically allocated to consecutive groups of whole columns or rows. In this trial the replicates were not allocated to groups of whole columns, but rather, overlapped columns. Table 3.1 gives the allocation of varieties to plots in field plan order with replicates 1 and 3 in ITALICS and replicates 2 and 4 in BOLD.
3.3
The ASReml data file
See Chapter 4 for details
The standard format of an ASReml data file is to have the data arranged in space, TAB or comma separated columns/fields with a line for each sampling unit. The columns contain covariates, factors, response variates (traits) and weight variables in any convenient order. This is the first 30 lines of the file nin89.asd containing the data for the NIN variety trial. The data are in field order (rows within columns) and an optional heading (first line of the file) has been included to document the file. In this case there are 11 space separated data fields (variety. . . column) and the complete file has 224 data lines, one for each variety in each replicate. variety id pid raw repl nloc yield lat long row column LANCER 1 1101 585 1 4 29.25 4.3 19.2 16 1 BRULE 2 1102 631 1 4 31.55 4.3 20.4 17 1 REDLAND 3 1103 701 1 4 35.05 4.3 21.6 18 1 CODY 4 1104 602 1 4 30.1 4.3 22.8 19 1 ARAPAHOE 5 1105 661 1 4 33.05 4.3 24 20 1 NE83404 6 1106 605 1 4 30.25 4.3 25.2 21 1 NE83406 7 1107 704 1 4 35.2 4.3 26.4 22 1 NE83407 8 1108 388 1 4 19.4 8.6 1.2 1 2 CENTURA 9 1109 487 1 4 24.35 8.6 2.4 2 2 SCOUT66 10 1110 511 1 4 25.55 8.6 3.6 3 2 COLT 11 1111 502 1 4 25.1 8.6 4.8 4 2 NE83498 12 1112 492 1 4 24.6 8.6 6 5 2 NE84557 13 1113 509 1 4 25.45 8.6 7.2 6 2 NE83432 14 1114 268 1 4 13.4 8.6 8.4 7 2 NE85556 15 1115 633 1 4 31.65 8.6 9.6 8 2 NE85623 16 1116 513 1 4 25.65 8.6 10.8 9 2 CENTURAK78 17 1117 632 1 4 31.6 8.6 12 10 2 NORKAN 18 1118 446 1 4 22.3 8.6 13.2 11 2 KS831374 19 1119 684 1 4 34.2 8.6 14.4 12 2 . . .
optional field labels data for sampling unit 1 data for sampling unit 2 . . .
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
row
NORKAN
KS831374
TAM200
NE86482
HOMESTEAD
LANCOTA
NE86501
-
-
-
-
-
LANCER
BRULE
CHEYENNE
CENTURAK78
-
NE83406
NE85623
-
TAM107
NE85556
-
NE86509
NE83432
-
NE83404
NE84557
-
ARAPAHOE
NE83498
-
NE86503
COLT
-
NE86507
SCOUT66
-
CODY
CENTURA
REDLAND
NE83407
-
2
-
1
NE87522
NE87513
NE87512
NE87499
NE87463
NE87457
NE87451
NE87446
NE87409
NE87408
NE87403
NE86T666
NE83T12
GAGE
SIOUXLAND
VONA
ROUGHRIDER
NE86607
NE86606
NE86582
NE86527
BUCKSKIN
3
REDLAND
TAM107
LANCER
ARAPAHOE
NE87627
NE87513
NE87409
NE83T12
CENTURAK78
NE83432
NE87451
NE87408
NE86582
CODY
NE85623
CENTURA
-
NE87627
NE87619
NE87615
NE87613
NE87612
4
NE86501
HOMESTEAD
SIOUXLAND
NE87446
NE83404
NE83498
NE86607
CHEYENNE
NE87499
NE87619
GAGE
KS831374
NE84557
NE86606
NE86509
SCOUT66
NE86527
ROUGHRIDER
BUCKSKIN
NE86507
NE87463
VONA
5
NE87513
LANCOTA
NE86607
-
NE86T666
NE87613
NE87612
BRULE
NE86482
NE86503
LANCOTA
TAM200
NE85556
NE87615
NORKAN
NE87522
COLT
NE83406
NE87457
NE87403
NE83407
NE87512
6
column
Table 3.1: Trial layout and allocation of varieties to plots in the NIN field trial
NE86482
NE87446
LANCER
GAGE
NE87451
SIOUXLAND
CHEYENNE
NE87522
NE86501
NE87615
NE86T666
NE87627
CENTURAK78
TAM107
VONA
NE86527
COLT
KS831374
REDLAND
NORKAN
NE83407
NE87408
7
BRULE
NE86606
NE87463
NE87619
NE86582
NE86503
NE83404
HOMESTEAD
NE85556
NE86509
NE85623
NE87403
SCOUT66
ARAPAHOE
NE87613
TAM200
NE86507
NE83T12
NE84557
NE87457
NE87612
CODY
8
SIOUXLAND
NE86607
NE83406
LANCER
COLT
NE87408
NE86503
CENTURA
NE87446
NE87512
NE87403
NE86T666
-
NE83498
ROUGHRIDER
NE87512
NE83432
CENTURA
NE87499
NE87409
NE83406
BUCKSKIN
9
LANCOTA
NE86509
NE87457
NE86606
NE87627
CENTURAK78
NE83T12
NE87513
SCOUT66
NORKAN
NE87499
NE86582
NE87463
CODY
NE83404
VONA
ROUGHRIDER
NE86507
BRULE
NE85556
BUCKSKIN
NE87612
10
HOMESTEAD
TAM107
NE84557
NE87522
TAM200
NE86501
NE87613
NE83498
NE87619
NE83432
REDLAND
CHEYENNE
ARAPAHOE
NE87615
NE83407
GAGE
NE87409
NE87451
NE86527
NE85623
NE86482
KS831374
11
3
A guided tour
30
These data are analysed again in Chapter 7 using spatial methods of analysis, see model 3a in Section 7.3. For spatial analysis using a separable error structure (see Chapter 2) the data file must first be augmented to specify the complete 22 row × 11 column array of plots. These are the first 20 lines of the augmented data file nin89aug.asd with 242 data rows. variety id pid raw repl nloc yield lat long row column LANCER 1 NA NA 1 4 NA 4.3 1.2 1 1 LANCER 1 NA NA 1 4 NA 4.3 2.4 2 1 LANCER 1 NA NA 1 4 NA 4.3 3.6 3 1 LANCER 1 NA NA 1 4 NA 4.3 4.8 4 1 LANCER 1 NA NA 1 4 NA 4.3 6 5 1 LANCER 1 NA NA 1 4 NA 4.3 7.2 6 1 LANCER 1 NA NA 1 4 NA 4.3 8.4 7 1 LANCER 1 NA NA 1 4 NA 4.3 9.6 8 1 LANCER 1 NA NA 1 4 NA 4.3 10.8 9 1 LANCER 1 NA NA 1 4 NA 4.3 12 10 1 LANCER 1 NA NA 1 4 NA 4.3 13.2 11 1 LANCER 1 NA NA 1 4 NA 4.3 14.4 12 1 LANCER 1 NA NA 1 4 NA 4.3 15.6 13 1 LANCER 1 NA NA 1 4 NA 4.3 16.8 14 1 LANCER 1 NA NA 1 4 NA 4.3 18 15 1 LANCER 1 NA NA 2 4 NA 17.2 7.2 6 4 LANCER 1 NA NA 3 4 NA 25.8 22.8 19 6 LANCER 1 NA NA 4 4 NA 38.7 12.0 10 9 LANCER 1 1101 585 1 4 29.25 4.3 19.2 16 1 BRULE 2 1102 631 1 4 31.55 4.3 20.4 17 1 REDLAND 3 1103 701 1 4 35.05 4.3 21.6 18 1 CODY 4 1104 602 1 4 30.1 4.3 22.8 19 1 . . .
optional field labels file augmented by missing values for first 15 plots and 3 buffer plots and variety coded LANCER to complete 22×11 array . . .
buffer plots between reps original data . . .
Note that •
the pid, raw, repl and yield data for the missing plots have all been made NA (one of the three missing value indicators in ASReml, see Section 4.2),
•
variety is coded LANCER for all missing plots; one of the variety names must be used but the particular choice is arbitrary.
3
A guided tour
3.4
The ASReml command file
See Chapters 5,
By convention an ASReml command file has a .as extension. The file defines
6 and 7 for details
31
•
a title line to describe the job,
•
labels for the data fields in the data file and the name of the data file,
•
the linear mixed model and the variance model(s) if required,
•
output options including directives for tabulation and prediction.
Below is the ASReml command file for an RCB analysis of the NIN field trial data highlighting the main sections. Note the order of the main sections. title line −→ data field definition−→ . . .
data field definition −→ data file name and qualifiers−→ tabulate statement−→ linear mixed model definition−→ predict statement−→ variance model specification−→ . .
NIN Alliance trial 1989 variety !A id pid raw repl 4 nloc yield lat long row 22 column 11 nin89.asd !skip 1 tabulate yield ∼ variety yield ∼ mu variety !r repl predict variety 0 0 1 repl 1 repl 0 IDV 0.1
The title line The first text (non-blank, non control) line in an ASReml command file is taken as the title for the job and is purely descriptive for future reference.
NIN Alliance trial 1989 variety !A id . . .
3
A guided tour
32
Reading the data The data fields are defined before the data file name is specified. Field definitions must be given for all fields in the data file and in the order in which they appear in the data file. Data field definitions must be indented. In this case there are 11 data fields (variety . . . column) in nin89.asd, see Section 3.3.
NIN Alliance trial 1989 variety !A id pid raw repl 4 nloc yield lat long row 22 column 11 nin89.asd !skip 1 . . .
The !A after variety tells ASReml that the first field is an alphanumeric factor and the 4 after repl tells ASReml that the field called repl (the fifth field read) is a numeric factor with 4 levels coded 1:4. Similarly for row and column. The other fields include variates (yield) and various other variables.
The data file line
See Section 5.7
See Section 5.8
The data file name is specified immediately after the last data field definition. Data file qualifiers that relate to data input and output are also placed on this line if they are required. In this example, !skip 1 tells ASReml to ignore (skip) the first line of the data file nin89.asd, the line containing the field labels. The data file line can also contain qualifiers that control other aspects of the analysis. These qualifiers are presented in Section 5.8.
NIN Alliance trial 1989 variety !A id pid . . . row 22 column 11 nin89.asd !skip 1 tabulate yield ∼ variety yield ∼ mu variety !r repl predict variety 0 0 1 repl 1 repl 0 IDV 0.1
Tabulation See Chapter 10
Optional tabulate statements provide a simple way of exploring the structure of a data. They should appear immediately before the model line. In this case the 56 simple variety means for yield are formed and written to a .tab output file. See Chapter 10 for a discussion of tabulation.
. . . column 11 nin89.asd !skip 1 tabulate yield ∼ variety yield ∼ mu variety !r repl predict variety . . .
3
A guided tour
33
Specifying the terms in the mixed model See Chapter 6
The linear mixed model is specified as a list NIN Alliance trial 1989 variety !A of model terms and qualifiers. All elements . . must be space separated. ASReml accommo. column 11 dates a wide range of analyses. See Section nin89.asd !skip 1 2.1 for a brief discussion and general algebraic tabulate yield ∼ variety formulation of the linear mixed model. The yield ∼ mu variety !r repl model specified here for the NIN data is a simpredict variety 0 0 1 ple random effects RCB model having fixed varepl 1 riety effects and random replicate effects. The repl 0 IDV 0.1 reserved word mu fits a constant term (intercept), variety fits a fixed variety effect and repl fits a random replicate effect. The !r qualifier tells ASReml to fit the terms that follow as random effects.
Prediction See Chapter 10
Prediction statements appear after the model statement and before any variance structure lines. In this case the 56 variety means for yield as predicted from the fitted model would be formed and returned in the .pvs output file. See Chapter 10 for a detailed discussion of prediction in ASReml.
NIN Alliance trial 1989 variety !A . . . column 11 nin89.asd !skip 1 tabulate yield ∼ variety yield ∼ mu variety !r repl predict variety 0 0 1 repl 1 repl 0 IDV 0.1
Variance structures
See Chapter 7
The last three lines are included for expository purposes and are not actually needed for this particular analysis. An extensive range of variance structures can be fitted in ASReml. See Chapter 7 for a lengthy discussion of variance modelling in ASReml. In this case independent and identically distributed random replicate effects are specified using the identifier IDV in a G structure. G structures are described in Section 2.1 and the list of avail-
NIN Alliance trial 1989 variety !A . . . column 11 nin89.asd !skip 1 tabulate yield ∼ variety yield ∼ mu variety !r repl predict variety 0 0 1 repl 1 repl 0 IDV 0.1
3
A guided tour
34
able variance structures/models is presented in Table 7.3. Since IDV is the default variance structure for random effects, the same analysis would be performed if these lines were omitted.
3.5
Running the job
Revised 08 See Chapter 11
Assuming you have located the nin89.asd file (under Windows it will typically be located in ASRemlPath/Examples and created the ASCII command file nin89.as described in the previous section, in the same folder, you can run the job. ASRemlPath is typically C:\Program Files\ASReml3 under Windows. Installation details vary with the implementation and are distributed with the program. You could use ASReml-W or ConText to create nin89.as. These programs can then run ASReml directly after they have been configured for ASReml. An ASReml job is also run from a command line or by ’clicking’ the .as file in Windows Explorer. The basic command to run an ASReml job is ASRemlPath/bin/ASReml basename[.as] where basename[.as] is the name of the command file. Typically, a system Path is defined which includes ASRemlPath/bin/ so that just the program name ASReml is required at the command prompt. For example, the command to run nin89.as from the command prompt when attached to the appropriate folder is ASReml nin89.as However, if the path to ASReml is not specified in your system’s Path environment variable, the path must also be given, and the path is required when configuring ASReml-W or Context.
In this guide we assume the command file has a filename extension .as. ASReml .as also recognises the filename extension .asc as an ASReml command file. When these are used, the extension (.as or .asc) may be omitted from basename.as in the command line if there is no file in the working directory with the name basename. The options and arguments that can be supplied on the command line to modify a job at run time are described in Chapter 11.
Give
command
files
the
extension
3
A guided tour
35
Forming a job template ASReml2
Notice that the data files nin89.asd and nin89aug.asd commenced with a line of column headings. Since these headings do not contain embedded blanks, we can use ASReml to make a template for the .as file by running ASReml with the datafile as the command argument (see Chapter 11). For example, running the command asreml nin89aug.asd writes a file nin89aug.as (if it does not already exist) which looks like Title: nin89aug. #variety id pid raw rep nloc yield lat long row column #LANCER 1 NA NA 1 4 NA 4.3 1.2 1 1 #LANCER 1 NA NA 1 4 NA 4.3 2.4 2 1 #LANCER 1 NA NA 1 4 NA 4.3 3.6 3 1 #LANCER 1 NA NA 1 4 NA 4.3 4.8 4 1 variety !A id * pid raw rep * nloc * yield lat long row * column * # Check/Correct these field definitions. nin89aug.asd !SKIP 1 column ~ mu , # Specify fixed model !r # Specify random model # 1 2 0 # column column AR1 0.1 # row row AR1 0.1
This is a template in that it needs editing (it has nominated an inappropriate response variable) but it displays the first few lines of the data and infers whether fields are factors or variates as follows: Missing fields and those with decimal points in the data value are taken as covariates, integer fields are taken as simple factors (*) and alphanumeric fields are taken as !A factors.
3
A guided tour
3.6
36
Description of output files A series of output files are produced with each ASReml run. Nearly all files, all that contain user information, are ASCII files and can be viewed in any ASCII editor including ASReml-W , ConText and NotePad. The primary output from the nin89.as job is written to nin89.asr. This file contains a summary of the data, the iteration sequence, estimates of the variance parameters and an a table of Wald F statistics for testing fixed effects. The estimates of all the fixed and random effects are written to nin89.sln. The residuals, predicted values of the observations and the diagonal elements of the hat matrix (see Chapter 2) are returned in nin89.yht, see Section 14.3. Other files produced by this job include the .aov, .pvs, .res, .tab, .vvp and .veo files, see Section 14.4.
The .asr file Below is nin89.asr with pointers to the main sections. The first line gives the version of ASReml used (in square brackets) and the title of the job. The second line gives the build date for the program and indicates whether it is a 32bit or 64bit version. The third line gives the date and time that the job was run and reports the size of the workspace. The general announcements box (outlined in asterisks) at the top of the file notifies the user of current release features. The remaining lines report a data summary, the iteration sequence, the estimated variance parameters and a table of Wald F statistics. The final line gives the date and time that the job was completed and a statement about convergence. job heading version
ASReml 3.01d [01 Apr 2008]
NIN alliance trial 1989
Build: e [01 Apr 2008] 04 Apr 2008 17:00:47.453
32 bit 32 Mbyte Windows
Licensed to: NSW Primary Industries
nin89
permanent
*********************************************************** * Contact
[email protected] for licensing and support
*
***************************************************** ARG * Folder: C:\data\asr3\ug3\manex variety !A QUALIFIERS: !SKIP 1 QUALIFIER: !DOPART Reading nin89.asd
1 is active FREE FORMAT skipping
Univariate analysis of yield Data summary
Summary of 224 records retained of 224 read
1 lines
3
A guided tour
37
Model term
Size #miss #zero
1 variety
56
MinNon0
Mean
MaxNon0
StndDevn
0
0
1
28.5000
56
2 id
0
0
1.000
28.50
56.00
16.20
3 pid
0
0
1101.
2628.
4156.
1121.
4 raw
0
0
21.00
510.5
840.0
149.0
5 repl
4
6 nloc 7 yield
Variate
8 lat 9 long
0
0
1
2.5000
4
0
0
4.000
4.000
4.000
0
0
1.050
25.53
42.00
7.450
0
0
4.300
27.22
47.30
12.90 7.698
0
0
1.200
14.08
26.40
10 row
22
0
0
1
11.7321
22
11 column
11
0
0
1
6.3304
11
12 mu
0.000
1
4 identity
[
5:
5]
Structure for repl has Forming
0.1000 4 levels defined
61 equations:
57 dense.
Initial updates will be shrunk by factor Notice:
0.316
1 singularities detected in design matrix.
convergence
1 LogL=-454.807
S2=
50.329
168 df
1.000
0.1000
sequence
2 LogL=-454.663
S2=
50.120
168 df
1.000
0.1173
3 LogL=-454.532
S2=
49.868
168 df
1.000
0.1463
4 LogL=-454.472
S2=
49.637
168 df
1.000
0.1866
5 LogL=-454.469
S2=
49.585
168 df
1.000
0.1986
6 LogL=-454.469
S2=
49.582
168 df
1.000
0.1993
7 LogL=-454.469
S2=
49.582
168 df
1.000
0.1993
Final parameter values
1.0000
0.19932
- - - Results from analysis of yield - - Source parameter
Variance
estimates
repl
Model
terms
224 identity
testing fixed effects
Gamma
Component
Comp/SE
% C
168
1.00000
49.5824
9.08
0 P
4
0.199323
9.88291
1.12
0 U
Wald F statistics Source of Variation 12 mu 1 variety
NumDF
DenDF
F_inc
Prob
1
3.0
242.05
, !10
takes absolute values - no argument required.
yield ABSyield !=yield !ABS
!ABS
ASReml3
action
!ARCSIN
v
forms an ArcSin transformation using the sample size specified in the argument, a number or another field. In the side example, for two existing fields Germ and Total containing counts, we form the ArcSin for their ratio (ASG) by copying the Germ field and applying the ArcSin transformation using the Total field as sample size.
Germ Total ASG !=Germ !ARCSIN Total
!COS, !SIN
s
takes cosine and sine of the data variable with period s having default 2π; omit s if data is in radians, set s to 360 if data is in degrees.
Day CosDay !=Day !COS 365
!D, !D, !D=
v v v
!D[o] v discards records which have v or ’missing value’ in the field, subject to the logical operator o.
yield !D=
v v v
!Mv converts data values of v to missing; if !M is used after !A or !I, v should refer to the encoded factor level rather than the value in the data file (see also Section 4.2).
yield !M-9 yield !M100
!MAX, !MIN, !MOD
v
the maximum, minimum and modulus of the field values and the value v.
yield !MAX 9
!MM
s
assigns Haldane map positions (s) to marker variables and imputes missing values to the markers (see below).
ChrAadd !G 10 !MM 1 · · ·
!NA
v
replaces any missing values in the variate with the value v. If v is another field, its value is copied.
Rate !NA 0 WT !=Wt2 !NA Wt1
ASReml2
5
Command file: Reading the data
58
Table 5.1: List of transformation qualifiers and their actions with examples
qualifier
argument
action
examples
!NORMAL
v
replaces the variate with normal random variables having variance v.
Ndat !=0 !Normal 4.5 is equivalent to Ndat !=Normal 4.5
!REPLACE
on
replaces data values o with n in the current variable. I.e. IF(DataValue.EQ.o) DataValue=n
Rate !REPLACE -9 0
!RESCALE
os
rescales the column(s) in the current variable (!G group of variables) using Y = (Y + o) ∗ s
Rate !RESCALE -10 0.1
!SEED
v
sets the seed for the random number generator.
· · · !SEED 848586
!SET
vlist
for vlist, a list of n values, the data values 1 . . . n are replaced by the corresponding element from vlist; data values that are < 1 or > n are replaced by zero. vlist may run over several lines provided each incomplete line ends with a comma, i.e., a comma is used as a continuation symbol (see Other examples below).
treat !L C A B CvR !=treat !SET 1 -1 -1
ASReml2
ASReml2
ASReml2
ASReml2
group !=treat !SET 1, 2 2 3 3 4
!SETN
vn
!SETN v n replaces data values 1 : n with normal random variables having variance v. Data values outside the range 1 · · · n are set to 0.
Anorm !=A !SETN 2.5 10
!SETU
vn
replaces data values 1 : n with uniform random variables having range 0 : v. Data values outside the range 1 · · · n are set to 0.
Aeff !=A !SETU 5 10
!SUB
vlist
replaces data values = vi with their index i where vlist is a vector of n values. Data values not found in vlist are set to 0. vlist may run over several lines if necessary provided each incomplete line ends with a comma. ASReml allows for a small rounding error when matching. It may not distinguish properly if values in vlist only differ in the sixth decimal place (see Other examples below).
year 3 !SUB 66 67 68
ASReml2
ASReml2
5
Command file: Reading the data
59
Table 5.1: List of transformation qualifiers and their actions with examples
qualifier
argument
!SEQ
action
examples
replaces the data values with a sequential number starting at 1 which increments whenever the data value changes between successive records; the current field is presumed to define a factor and the number of levels in the new factor is set to the number of levels identified in this sequential process (see Other examples below). Missing values remain missing.
plot !=V3 !SEQ
!TARGET
v
changes the focus of subsequent transformations to variable (field) v.
sqrtA meanAB !+A !/2 , !TARGET sqrtA !^0.5
!UNIFORM
v
replaces the variate with uniform random variables having range 0 : v.
Udat !=0. !Uniform 4.5 is equivalent to Udat !=Uniform 4.5
!Vtarget=
value
assigns value to data field target overwriting previous contents; subsequent transformation qualifiers will operate on data field target.
· · · !V3=2.5
Vfield
assigns the contents of data field field to data field target overwriting previous contents; subsequent transformation qualifiers will operate on data field target. If field is 0 the number of the data record is inserted.
· · · !V10=V3 · · · !V11=block · · · !V12=V0
ASReml3
ASReml2
QTL marker transformations ASReml2
!MM s associates marker positions in the vector s (based on the Haldane mapping function) with marker variables and replaces missing values in a vector of marker states with expected values calculated using distances to non-missing flanking markers. This transformation will normally be used on a !G n factor where the n variables are the marker states for n markers in a linkage group in map order and coded [-1,1] (backcross) or [-1,0,1] (F2 design). s (length n+1) should be the n marker positions relative to a left telomere position of zero, and an extra value being the length of the linkage group (the position of the right telomere).
5
Command file: Reading the data
60
The length (right telomere) may be omitted in which case the last marker is taken as the end of the linkage group. The positions may be given in Morgans or centiMorgans (if the length is greater than 10, it will be divided by 100 to convert to Morgans). The recombination rate between markers at sL and sR (L is left and R is right of some putative QTL at Q) is θLR = (1 − e−2(sR −sL ) )/2. Consequently, for 3 markers (L,Q,R), θLR = θLQ + θQR − 2θLQ θQR . The expected value of a missing marker at Q (between L and R) depends on the marker states at L and R: E(q|1, 1) = (1 − θLQ − θQR )/(1 − θLR ), E(q|1, −1) = (θQR − θLQ )/θLR , E(q| − 1, 1) = (θLQ − θQR )/θLR and E(q| − 1, −1) = (−1 + θLQ + θQR )/(1 − θLR ). θ (1−θ )(1−2θLQ ) Let λL = (E(q|1, 1) + E(q|1, −1))/2 = QR θLRQR (1−θLR ) θ
(1−θ
)(1−2θ
)
QR and λR = (E(q| − 1, 1) + E(q| − 1, −1))/2 = LQ θLRLQ (1−θLR ) Then E(q|xL , xR ) = λL xL + λR xR . Where there is no marker on one side, E(q|xR ) = (1 − θQR )xR + θQR (−xR ) = xR (1 − 2θQR ) . This qualifier facilitates the QTL method discussed in Gilmour (2007).
ASReml2
!DOM A is used to form dominance covariables from a set of additive marker covariables previously declared with the!MM marker map qualifier. It assumes the argument A is an existing group of marker variables relating to a linkage group defined using !MM which represents additive marker variation coded [-1, 0, 1] (representing marker states aa, aA and AA) respectively. It is a group transformation which takes the [-1,1] interval values, and calculates (|X| − 0.5) ∗ 2 i.e. -1 and 1 become one, 0 becomes -1. The marker map is also copied and applied to this model term so it can be the argument in a qtl() term (page 106).
ASReml3
!DO ... !ENDDO provides a mechanism to repeat transformations on a set of variables. All tranformations except !DOM and !RESCALE operate once on a single field unless preceded by a !DO qualifier. The !DO qualifier has three arguments: n[[it ]iv ]. n is the number of times the following transformations are to be performed. it (default 1) is the increment applied to the target field. iv (default 0.0) is the increment applied to the transformation argument. The default for n is the number of variables in the current field definition. !ENDDO is formally equivalent to !DO 1 and is implicit when another !DO appears or the next field definition begins. Note that when several transformations are repeated, the processing order is that each is performed n times before the next is processed (contrary to the implication of the syntax). However, the target is reset for each transformation so that the transformations apply to the same set of variables.
5
Command file: Reading the data
61
Y1 Y2 Y3 Y4 Y5 # Repeat 5 times, incrementing just Ymean !=0. !DO 5 0 1 !+Y1 !ENDDO !/5 # the argument is equivalent to Y1 Y2 Y3 Y4 Y5 Ymean !=0. !+Y1 !+Y2 !+Y3 !+Y4 !+Y5 !/5 Y0 Y1 Y2 Y3 Y4 Y5 !TARGET Y1 !do 5 1 0 !-Y0 !ENDDO#Take Y0 from rest Markers !G 12 !do !D * !ENDDO # Delete records with missing marker values The default arguments ( 12, 1, 0.) are used. The initial target is the first marker.
Other rules and examples Other rules include the following •
variables that are created should be listed after all variables that are read in unless the intention is to overwrite an input field.
•
missing values are unaffected by arithmetic operations, that is, missing values in the current or target column remain missing after the transformation has been performed except in assignment
Revised 08
– –
!+3 will leave missing values (NA, * and .) as missing, !=3 will change missing values to 3,
•
multiple arithmetic operations cannot be expressed in a complex expression but must be given as separate operations that are performed in sequence as they appear, for example, yield !-120 !*0.0333 would calculate 0.0333 * (yield - 120),
•
Most transformations only operate on a single field and will not therefore be performed on all variables in a !G factor set. The only transformations that apply to the whole set are !DOM, !MM and !RESCALE.
ASReml code
action
yield !M0
changes the zero entries in yield to missing values
yield !^0
takes natural logarithms of the yield data
score !-5
subtracts 5 from all values in score
score !SET -0.5 1.5 2.5
replaces data values of 1, 2 and 3 with -0.5, 1.5 and 2.5 respectively
5
Command file: Reading the data
62
ASReml code
action
score !SUB -0.5 1.5 2.5
replaces data values of -0.5, 1.5 and 2.5 with 1, 2 and 3 respectively; a data value of 1.51 would be replaced by 0 since it is not in the list or very close to a number in the list
block 8 variety 20 yield plot * !=variety !SEQ
in the case where – there are multiple units per plot, – contiguous plots have different treatments, and – the records are sorted units within plots within blocks, this code generates a plot factor assuming a new plot whenever the code in V2 (variety) changes; whether this creates a variable or overwrites an input variable depends on whether any subsequent variables are input variables,
Var 3 Nit 4 VxN 12 !=Var !-1 !*4 !+Nit
assuming Var is coded 1:3 and Nit is coded 1:4, this syntax could be used to create a new factor VxN with the 12 levels of the composite Var by Nit factor.
YA !V98=YA !NA 0 YB !V99=YB !NA 0 !+V98 !D0
will discard records where both YA and YB have missing values (assuming neither have zero as valid data). The first line sets the focus to variable 98, copies YA into V98 and changes any missing values in V98 to zero. The second line sets the focus to variable 99, copies YB into V99 and changes any missing values in V99 to zero. It then adds V98 and discards the whole record if the result is zero, i.e. both YA and YB have missing values for that record. Variables 98 and 99 are not labelled and so are not retained for subsequent use in analysis.
Special note on covariates Covariates are variates that appear as independent variables in the model. It is recommended that covariates be centred and scaled to have a mean of zero and a variance of approximately one to avoid failure to detect singularities. This can be achieved either •
externally to ASReml in data file preparation,
•
using !RESCALE -mean scale where mean and scale are user supplied values, for example, age !rescale -140 .142857 # in weeks
5
Command file: Reading the data
5.6
63
Datafile line The purpose of the datafile line is to •
nominate the data file,
•
specify qualifiers to modify – – –
the reading of the data, the output produced, the operation of ASReml.
NIN Alliance Trial 1989 variety !A . . . row 22 column 11 nin89aug.asd !skip 1 yield ∼ mu variety . . .
Data line syntax The datafile line appears in the ASReml command file in the form datafile [qualifiers] •
datafile is the path name of the file that contains the variates, factors, covariates, traits (response variates) and weight variables represented as data fields, see Chapter 4; enclose the path name in quotes if it contains embedded blanks,
•
the qualifiers tell ASReml to modify either –
–
the reading of the data and/or the output produced, see Table 5.2 below for a list of data file related qualifiers, the operation of ASReml, see Tables 5.3 to 5.6 for a list of job control qualifiers
•
the data file related qualifiers must appear on the data file line,
•
the job control qualifiers may appear on the data file line or on following lines,
•
the arguments to qualifiers are represented by the following symbols f — a filename, n — an integer number, typically a count, p — a vector of real numbers, typically in increasing order, r — a real number, s — a character string, t — a model term label, v — the number or label of a data variable, vlist — a list of variable labels.
5
Command file: Reading the data
5.7
64
Data file qualifiers Table 5.2 lists the qualifiers relating to data input. Use the Index to check for examples or further discussion of these qualifiers. Table 5.2: Qualifiers relating to data input and output
qualifier
action
Frequently used data file qualifier !SKIP n
causes the first n records of the (non-binary) data file to be ignored. Typically these lines contain column headings for the data fields.
Other data file qualifiers
ASReml3
!CSV
used to make consecutive commas imply a missing value; this, is automatically set if the file name ends with .csv or .CSV (see Section 4.2) Warning This qualifier is ignored when reading binary data.
!DATAFILE f
specifies the datafile name replacing the one obtained from the datafile line. It is required when different !PATHS (see !DOPATH in Table 11.3) of a job must read different files. The !SKIP qualifier, if specified, will be applied when reading the file.
!FILTER v [ !SELECT n]
enables a subset of the data to be analysed; v is the number or name of a data field. When reading data, the value in field v is checked after any transformations are performed. If !select is omitted, records with zero in field v are omitted from the analysis. Otherwise, records with n in field v are retained and all other records are omitted. The argument n is typically an integer which is compared with the numeric value if a field after any conversion if the input field performed by the !A or !I data field qualifiers. However, n may be a quoted string in which case to is compared to the character value of the field as it is read and before any conversion to numeric value. Warning If the filter column contains a missing value, the value from the previous non-missing record is assumed in that position.
!FOLDER
specifies an alternative folder for ASReml to find input files. This qualifier is usually placed on a separate line BEFORE the data filename line (and any pedigree/.giv .grm filename lines. For example, !FOLDER ../Data data.asd !SKIP 1 is equivalent to ../Data/data.asd !SKIP 1
s
5
Command file: Reading the data
65
Table 5.2: Qualifiers relating to data input and output
qualifier
!FORMAT
action
s
supplies a Fortran like FORMAT statement for reading fixed format files. A simple example is !FORMAT(3I4,5F6.2) which reads 3 integer fields and 5 floating point fields from the first 42 characters of each data line. A format statement is enclosed in parentheses and may include 1 level of nested parentheses, for example, e.g. !FORMAT(4x,3(I4,f8.2)). Field descriptors are • rX to skip r character positions, • rAw to define r consecutive fields of w characters width, • rIw to define r consecutive fields of w characters width,
and • rFw.d to define r consecutive fields of w characters width;
d indicates where to insert the decimal point if it is not explicitly present in the field, where r is an optional repeat count. In ASReml, the A and I field descriptors are treated identically and simply set the field width. Whether the field is interpreted alphabetically or as a number is controlled by the !A qualifier. Other legal components of a format statement are • the , character; required to separate fields - blanks are
not permitted in the format. • the / character; indicates the next field is to be read from
•
• •
•
the next line. However a / on the end of a format to skip a line is not honoured. BZ; the default action is to read blank fields as missing values. * and NA are also honoured as missing values. If you wish to read blank fields as zeros, include the string BZ. the string BM; switches back to ’blank missing’ mode. the string Tc; moves the ’last character read’ pointer to line position c so that the next field starts at position c + 1. For example T0 goes back to the beginning of the line. the string D; invokes debug mode.
A format showing these components !FORMAT(D,3I4,8X,A6,3(2x,F5.2)/4x,BZ,20I1) and suitable for reading 27 fields from 2 data records such as 111122223333xxxxxxxxALPHAFxx 4.12xx 5.32xx 6.32 xxxx123 567 901 345 7890
is is
5
Command file: Reading the data
66
Table 5.2: Qualifiers relating to data input and output
qualifier
ASReml3
action
!MERGE c f [ !SKIP n ] [ !MATCH a b ] may be specified on a line following the datafile line. The purpose is to combine data fields from the (primary) data file with data fields from a secondary file (f ). This !MERGE qualifier has been replaced by the much more powerful MERGE statement (see Chapter 12). The effect is to open the named file (skip n lines) and then insert the columns from the new file into field positions starting at position c. If !MATCH a b is specified, ASReml checks that the field a (0 < a < c) has the same value as field b. If not, it is assumed that the merged file has some missing records and missing values are inserted into the data record and the line from the MERGE file is kept for comparison with the next record. It is assumed that the lines in the MERGE file are in the same order as the corresponding lines occur in the primary data file, and that there are no extraneous lines in the MERGE file. A much more powerful merging facility is provided by the MERGE directive described in chapter 12. For example, assuming the field definitions define 10 fields, PRIMARY.DAT !skip 1 !MERGE 6 SECOND.DAT !SKIP 1 !MATCH 1 6 would obtain the first five fields from PRIMARY.DAT and the next five from SECOND.DAT, checking that the first field in each file has the same value. Thus each input record is obtained by combining information from each file, before any transformations are performed. !READ n
formally instructs ASReml to read n data fields from the data file. It is needed when there are extra columns in the data file that must be read but are only required for combination into earlier fields in transformations, or when ASReml attempts to read more fields than it needs to.
!RECODE
is required when reading a binary data file with pedigree identifiers that have not been recoded according to the pedigree file. It is not needed when the file was formed using the !SAVE option but will be needed if formed in some other way (see Section 4.2).
5
Command file: Reading the data
67
Table 5.2: Qualifiers relating to data input and output
qualifier
action
!RREC [n]
causes ASReml to read n records or to read up to a data reading error if n is omitted, and then process the records it has. This allows data to be extracted from a file which contains trailing non-data records (for example extracting the predicted values from a .pvs file). The argument (n) specifies the number of data records to be read. If not supplied, ASReml reads until a data reading error occurs, and then processes the data it has. Without this qualifier, ASReml aborts the job when it encounters a data error. See !RSKIP.
ASReml2
!RSKIP n [s] ASReml2
allows ASReml to skip lines at the heading of a file down to (and including) the nth instance of string s. For example, to read back the third set predicted values in a .pvs file, you would specify !RREC !RSKIP 4 ’ Ecode’ since the line containing the 4th instance of ’ Ecode’ immediately precedes the predicted values. The !RREC qualifier means that ASReml will read until the end of the predict table. The keyword Ecode which occurs once at the beginning and then immediately before each block of data in the .pvs file is used to count the sections.
Combining rows from separate files ASReml2
ASReml can read data from multiple files provided the files have the same layout. The file specified as the ’primary data file’ in the command file can contain lines of the form !INCLUDE !SKIP n where is the (path)name of the data subfile and !SKIP n is an optional qualifier indicating that the first n lines of the subfile are to be skipped. After reading each subfile, input reverts to the primary data file. Typically, the primary data file will just contain !INCLUDE statements identifying the subfiles to include. For example, you may have data from a series of related experiments in separate data files for individual analysis. The primary data file for the subsequent combined analysis would then just contain a set of !INCLUDE statements to specify which experiments were being combined.
5
Command file: Reading the data
68
If the subfiles have CSV format, they should all have it and the !CSV file should be declared on the primary datafile line. This option is not available in combination with !MERGE.
5.8
Job control qualifiers The following tables list the job control qualifiers. These change or control various aspects of the analysis. Job control qualifiers may be placed on the datafile line and following lines. They may also be defined using an environment variable called ASREML QUAL. The environment variable is processed immediately after the datafile line is processed. All qualifier settings are reported in the .asr file. Use the Index to check for examples or further discussion of these qualifiers. Important Many of these are only required in very special circumstances and new users should not attempt to understand all of them. You do need to understand that all general qualifiers are specified here. Many of these qualifiers are referenced in other chapters where their purpose will be more evident. Table 5.3: List of commonly used job control qualifiers
qualifier
action
!CONTINUE
is used to restart/resume iterations from the point reached in a previous run. This qualifier can alternately be set from the command line using the option letters C (continue) or F (final) (see Section 11.3 on command line options). After each iteration, ASReml writes the current values of the variance parameters to a file with extension .rsv (re-start values) with information to identify individual variance parameters. The !CONTINUE qualifier causes ASReml to scan the .rsv file for parameter values related to the current model replacing the values obtained from the .as file before iteration resumes. If the model has changed, ASReml will pick up the values it recognises as being for the same terms. Furthermore, ASReml will use estimates in the .rsv file for certain models to provide starting values for certain more general models, inserting reasonable defaults where necessary. The transitions recognised are listed and discussed in Section 7.10.
5
Command file: Reading the data
69
Table 5.3: List of commonly used job control qualifiers
qualifier
action DIAG to FA1 DIAG to CORUH (uniform heterogeneous) CORUH to FA1 and to XFA1 FAi to FAi+1 XFAi to XFAi+1 FAi to CORGH (full heterogeneous) FAi to US (full heterogeneous) CORGH (heterogeneous) to US
!CONTRAST s t p
provides a convenient way to define contrasts among treatment levels. !CONTRAST lines occur as separate lines between the datafile line and the model line. s is the name of the model term being defined. t is the name of an existing factor. p is the list of contrast coefficients. For example !CONTRAST LinN Nitrogen 3 1 -1 -3 defines LinN as a contrast based on the 4 (implied by the length of the list) levels of factor Nitrogen. Missing values in the factor become missing values in the contrast. Zero values in the factor (no level assigned) become zeros in the contrast. The user should check that the levels of the factor are in the order assumed by contrast (check the .ass or .sln or .tab files). It may also be used on the implicit factor Trait in a multivariate analysis provided it implicitly identifies the number of levels of Trait; the number of traits is implied by the length of the list. Thus, if the analysis involves 5 traits, !CONTRAST Time Trait 1 3 5 10 20
!DDF [i]
requests computation of the approximate denominator degrees of freedom according to Kenward and Roger (1997) for the testing of fixed effects terms in the dense part of the linear mixed model. There are three options for i: i = −1 suppresses computation, i = 1 and i = 2 compute the denominator d.f. using numerical and algebraic methods respectively. If i is omitted then i = 2 is assumed. If !DDF i is omitted, i = −1 is assumed except for small jobs (< 10 parameters, < 500 fixed effects, < 10, 000 equations and < 100 Mbyte workspace) when i = 2.
ASReml2
ASReml2
5
Command file: Reading the data
70
Table 5.3: List of commonly used job control qualifiers
qualifier
action Calculation of the denominator degrees of freedom is computationally expensive. Numerical derivatives require an extra evaluation of the mixed model equations for every variance parameter. Algebraic derivatives require a large dense matrix, potentially of order number of equations plus number of records and is not available when MAXIT is 1 or for multivariate analysis.
!FCON
adds a ’conditional’ Wald F statistic column to the Wald F Statistics table. It enables inference for fixed effects in the dense part of the linear mixed model to be conducted so as to respect both structural and intrinsic marginality (see Section 2.6). The detail of exactly which terms are conditioned on is reported in the .aov file. The marginality principle used in determining this conditional test is that a term cannot be adjusted for another term which encompasses it explicitly (e.g. term A.C cannot be adjusted for A.B.C) or implicitly (e.g. term REGION cannot be adjusted for LOCATION when locations are actually nested in regions although they are coded independently). !FOWN on page 84 provides a way of replacing the conditional Wald F statistic by specifying what terms are to be adjusted for, provided its degrees of freedom are unchanged from the incremental test.
!MAXIT n
sets the maximum number of iterations; the default is 10. ASReml iterates for n iterations unless convergence is achieved first. Convergence is presumed when the REML log-likelihood changes less than 0.002* current iteration number and the individual variance parameter estimates change less than 1%.
ASReml2
If the job has not converged in n iterations, use the !CONTINUE qualifier to resume iterating from the current point. To abort the job at the end of the current iteration, create a file named ABORTASR.NOW in the directory in which the job is running. At the end of each iteration, ASRemlchecks for this file and if present, stops the job, producing the usual output but not producing predicted values since these are calculated in the last iteration. Creating FINALASR.NOW will stop ASReml after one more iteration (during which predictions will be formed).
5
Command file: Reading the data
71
Table 5.3: List of commonly used job control qualifiers
qualifier
action
On case sensitive operating systems (eg. Unix), the filename (ABORTASR.NOW or FINALASR.NOW) must be upper case. Note that the ABORTASR.NOW file is deleted so nothing of importance should be in it. If you perform a system level abort (CTRL C or close the program window) output files other than the .rsv file will be incomplete. The .rsv file should still be functional for resuming iteration at the most recent parameter estimates (see !CONTINUE). Use !MAXIT 1 where you want estimates of fixed effects and predictions of random effects for the particular set of variance parameters supplied as initial values. Otherwise the estimates and predictions will be for the updated variance parameters (see the !BLUP qualifier below). If !MAXIT 1 is used and an Unstructured Variance model is fitted, ASReml will perform a Score test of the US matrix. Thus, assume the variance structure is modelled with reduced parameters, if that modelled structure is then processed as the initial values of a US structure, ASReml tests the adequacy of the reduced parameterization. !SUM
causes ASReml to report a general description of the distribution of the data variables and factors and simple correlations among the variables for those records included in the analysis. This summary will ignore data records for which the variable being analysed is missing unless a multivariate analysis is requested or missing values are being estimated. The information is written to the .ass file.
!X v !Y v !G v !JOIN
is used to plot the (transformed) data. Use !X to specify the x variable, !Y to specify the y variable and !G to specify a grouping variable. !JOIN joins the points when the x value increases between consecutive records. The grouping variable may be omitted for a simple scatter plot. Omit !Y y produce a histogram of the x variable.
ASReml2
For example, !X age !Y height !G sex Note that the graphs are only produced in the graphics versions of ASReml (Section 11.3).
5
Command file: Reading the data
72
Table 5.3: List of commonly used job control qualifiers
qualifier
action
For multivariate repeated measures data, ASReml can plot the response profiles if the first response is nominated with the !Y qualifier and the following analysis is of the multivariate data. ASReml assumes the response variables are in contiguous fields and are equally spaced. For example Response profiles Treatment !A Y1 Y2 Y3 Y4 Y5 rat.asd !Y Y1 !G Treatment !JOIN Y1 Y2 Y3 Y4 Y5 ∼ Trait Treatment Trait.Treatment
Table 5.4: List of occasionally used job control qualifiers
qualifier
action
!ASMV n
indicates a multivariate analysis is required although the data is presented in a univariate form. ’Multivariate Analysis’ is used in the narrow sense where an unstructured error variance matrix is fitted across traits, records are independent, and observations may be missing for particular traits, see Chapter 8 for a complete discussion. The data is presumed arranged in lots of n records where n is the number of traits. It may be necessary to expand the data file to achieve this structure, inserting a missing value NA on the additional records. This option is sometimes relevant for some forms of repeated measures analysis. There will need to be a factor in the data to code for trait as the intrinsic Trait factor is undefined when the data is presented in a univariate manner.
5
Command file: Reading the data
73
Table 5.4: List of occasionally used job control qualifiers
qualifier
action
!ASUV
indicates that a univariate analysis is required although the data is presented in a multivariate form. Specifically, it allows you to have an error variance other than I ⊗ Σ where Σ is the unstructured (US, see Table 7.3) variance structure. If there are missing values in the data, include !f mv on the end of the linear model. It is often also necessary to specify the !S2==1 qualifier on the R-structure lines. The intrinsic factor Trait is defined and may be used in the model. See Chapter 8 for more information. This option is used for repeated measures analysis when the variance structure required is not the standard multivariate unstructured matrix.
!COLFAC v
is used with !SECTION v and !ROWFAC v to instruct ASReml to set up R structures for analysing a multi-environment trial with a separable first order autoregressive model for each site (environment). v is the name of a factor or variate containing column numbers (1 . . . nc where nc is the number of columns) on which the data is to be sorted. See !SECTION for more detail.
!DISPLAY n
is used to select particular graphic displays. In spatial analysis of field trials, four graphic displays are possible (see Section 14.4). Coding these 1=variogram 2=histogram 4=row and column trends 8=perspective plot of residuals, set n to the sum of the codes for the desired graphics. The default is 9=1+8. These graphics are only displayed in versions of ASReml linked with Winteracter (that is, Linux, Sun and PC) versions. Line printer versions of these graphics are written to the .res file. See the G command line option (Section 11.3 on graphics) for how to save the graphs in a file for printing. Use !NODISPLAY to suppress graphic displays.
!EPS
sets hardcopy graphics file type to .eps.
!G v
is used to set a grouping variable for plotting, see !X.
5
Command file: Reading the data
74
Table 5.4: List of occasionally used job control qualifiers
qualifier
action
!GKRIGE [p]
controls the expansion of !PVAL lists for fac(X,Y ) model terms. For kriging prediction in 2 dimensions (X,Y ), the user will typically want to predict at a grid of values, not necessarily just at data combinations. The values at which the prediction is required can be specified separately for X and Y using two !PVAL statements. Normally, predict points will be defined for all combinations of X and Y values. This qualifier is required (with optional argument 1) to specify the lists are to be taken in parallel. The lists must be the same length if to be taken in parallel. Be aware that adding two dimensional prediction points is likely to substantially slow iterations because the variance structure is dense and becomes larger. For this reason, ASReml will ignore the extra PVAL points unless either !FINAL or !GKRIGE are set, to save processing time.
ASReml2
!GROUPFACTOR t v p
The !GROUPFACTOR qualifier, like !SUBSET, must appear on a line by itself after the data line and before the model line. Its purpose is to define a factor t by merging levels of an existing factor v. The syntax is !GROUPFACTOR for example !GROUPFACTOR Year YearLoc 1 1 1 2 2 3 3 3 4 4 forms a new factor Year with 4 levels from the existing factor YearLoc with 10 levels. Alternatively, Year could be formed data transformation: Year * !=YearLoc !set 1 1 1 2 2 3 3 3 4 4 !L 2001 2002 2003 2004
ASReml3
!JOIN
is used to join lines in plots, see !X.
5
Command file: Reading the data
75
Table 5.4: List of occasionally used job control qualifiers
ASReml3
qualifier
action
!MBF mbf(v,n) f [!FACTOR ] [!FIELD s] [!KEY k] [!NOKEY ] [!RENAME t] [!RFIELD r] [!SKIP k] [!SPARSE ]
specified on a separate line after the datafile line predefines the model term mbf(v,n) as a set of n covariates indexed by the data values in variable v. MBF stands for My Basis Function and uses the same mechanism as the leg(), pol() and spl() model functions but with covariates supplied by the user. It is used for reading in specialized design matrices indexed by a factor in the data including genetic marker covariables. By default, the file f should contain 1+n fields where the first field, the key field, contains the values which are in the data variable or at which prediction is required, and the remaining n fields define the corresponding covariate values. If n is omitted, all fields after the key field, are taken unless !FACTOR is specified for which n is 1 and the covariate values are treated as coding for a multilevel factor. !RENAME t changes the name of the the term from mbf(...) to the new name t. This is necessary when several mbf(...) terms are being defined which would otherwise have the same name/label. For example !MBF mbf(entry) mlib/m35.csv !rename Marker35
If the key values are the ordered sequence 1 : N , the key field may be omitted if !NOKEY is specified. If the key is not in the first field, its location can be specified with !KEY k. If extracting a single covariate from a large set of covariates in the file, the specific field to extract can be given by !FIELD s in absolute terms, or relative to the key field by !RFIELD r. For example !MBF mbf(variety,1) markers.csv !key 1 !RFIELD 35 !rename Marker35 !SKIP k requests the first k lines of the file be ignored. !SPARSE can be used when the covariates are predominately zero. Each key value is followed by as many column,value pairs as required to specifiy the non zero elements of the design for that value of key. The pairs should be arranged in increasing order of column within rows. The rows may be continued on subsequent lines of the file provided incomplete lines end with a COMMA. Restrictions: The key field MUST be numeric. In particular, if the data field it relates to is either an !A or !I encoded factor, the original (uncoded) level labels may not specified in the MBF file. Rather the coded levels must be specified. The MBF file is processed before the data file is read in and so the mapping to coded levels has not been defined in ASReml when the MBF file is processed, although the user can/must anticipate what it will be.
5
Command file: Reading the data
76
Table 5.4: List of occasionally used job control qualifiers
qualifier
action
Comment: If this MBF process is to be used repeatedly, for example to process a large set of marker variables in conjunction with !CYCLE, processing will be much faster if the markers variables are in separate files. ASReml will read 10 files containing a single field much faster than reading a single file containing 400 fields, ten times to extract 10 different markers. !MVINCLUDE
When missing values occur in the design ASReml will report this fact and abort the job unless !MVINCLUDE is specified (see Section 6.9); then missing values are treated as zeros. Use the !DV transformation to drop the records with the missing values.
!MVREMOVE
instructs ASReml to discard records which have missing values in the design matrix (see Section 6.9).
!NODISPLAY
suppresses the graphic display of the variogram and residuals which is otherwise produced for spatial analyses in the PC and SUN versions. This option is usually set on the command line using the option letter N (see Section 11.3 on graphics). The text version of the graphics is still written to the .res file.
!PVAL v p
is a mechanism for specifying the particular points to be predicted for covariates modelled using fac(v), leg(v,k), spl(v,k) and pol(v,k). The points are specified here so that they can be included in the appropriate design matrices. v is the name of a data field. p is the list of values at which prediction is required. See !GKRIGE for special conditions pertaining to fac(x,y) prediction.
!PVAL f vlist
is used to read predict points for several variables from a file f. vlist is the names of the variables having values defined. If the file contains unwanted fields, put the pseudo variate label skip in the appropriate position in vlist to ignore them. The file should only have numeric values. predict points cannot be specified for design factors.
!ROWFAC v
is used with !SECTION v and !COLFAC v to instruct ASReml to setup the R structures for multi-environment spatial analysis. v is the name of a factor or variate containing row numbers (1 . . . nr where nr is the number of rows) on which the data is to be sorted. See !SECTION for more detail.
5
Command file: Reading the data
77
Table 5.4: List of occasionally used job control qualifiers
qualifier
action
!SECTION v
specifies the factor in the data that defines the data sections. This qualifier enables ASReml to check that sections have been correctly dimensioned but does not cause ASReml to sort the data unless !ROWFAC and !COLFAC are also specified. Data is assumed to be presorted by section but will be sorted on row and column within section. The following is a basic example assuming 5 sites (sections). When !ROWFAC v and !COLFAC v are both specified ASReml generates the R structures for a standard AR ⊗ AR spatial analysis. The R structure lines that a user would normally be required to work out and type into the .as file (see the example of Section 16.6) are written to the .res file. The user may then cut and paste them into the .as file for a later run if the structures need to be modified.
!SPLINE spl(v,n) p
Basic multi-environment trial analysis site 5 # sites coded 1 ... 5 column * # columns coded 1 ... row * # rows coded 1 ... variety !A # variety names yield met.dat !SECTION site !ROWFAC row !COLFAC col yield ∼ site !r variety site.variety !f mv site 2 0 # variance header line # ASReml inserts the 10 lines required to define # the R structure lines for the five sites here defines a spline model term with an explicit set of knot points. The basic form of the spline model term, spl(v), is defined in Table 6.1 where v is the underlying variate. The basic form uses the unique data values as the knot points. The extended form is spl(v,n) which uses n knot points. Use this !SPLINE qualifier to supply an explicit set of n knot points (p) for the model term t. Using the extended form without using this qualifier results in n equally spaced knot points being used. The !SPLINE qualifier may only be used on a line by itself after the datafile line and before the model line.
5
Command file: Reading the data
78
Table 5.4: List of occasionally used job control qualifiers
qualifier
action
When knot points are explicitly supplied they should be in increasing order and adequately cover the range of the data or ASReml will modify them before they are applied. If you choose to spread them over several lines use a comma at the end of incomplete lines so that ASReml will to continue reading values from the next line of input. If the explicit points do not adequately cover the range, a message is printed and the values are rescaled unless !NOCHECK is also specified. Inadequate coverage is when the explicit range does not cover the midpoint of the actual range. See !KNOTS, !PVAL and !SCALE. !STEP r
reduces the update step sizes of the variance parameters. The default value is the reciprocal of the square root of !MAXIT. It may be set between 0.01 and 1.0. The step size is increased towards 1 each iteration. Starting at 0.1, the sequence would be 0.1, 0.32, 0.56, 1. This option is useful when you do not have good starting values, especially in multivariate analyses.
!SUBGROUP t v p
forms a new group factor (t) derived from an existing group factor (v) by selecting a subset (p) of its variables. A subgroup factor may not be used in a PREDICT or TABULATE directive.
!SUBSET t v p
forms a new factor (t) derived from an existing factor (v) by selecting a subset (p) of its levels. Missing values are transmitted as missing and records whose level is zero are transmitted as zero. The qualifier occupies its own line after the datafile line but before the linear model. e.g. !SUBSET EnvC Env 3 5 8 9 :15 21 33 defines a reduced form of the factor Env just selecting the environments listed. It might then be used in the model in an interaction. A subset factor can be used in a TABULATE directive but not in a PREDICT directive.
ASReml3
ASReml2
The intention is to simplify the model specification in MET (Multi Environment Trials) analyses where say Column effects are to be fitted to a subset of environments. It may also be used on the intrinsic factor Trait in a multivariate analysis provided it correctly identifies the number of levels of Trait either by including the last trait number, or appending sufficient zeros. Thus, if the analysis involves 5 traits, !SUBSET Trewe Trait 1 3 4 0 0 !WMF
sets hardcopy graphics file type to .wmf.
5
Command file: Reading the data
79
Table 5.5: List of rarely used job control qualifiers
qualifier
action
!AILOADINGS i
controls modification to AI updates of loadings in eXtended Factor Analytic models. After ASReml calculates updates for variance parameters, it checks whether the updates are reasonable and sometimes reduces them over and above any !STEPSIZE shrinkage. The extra shrinkage has two levels. Loadings that change sign are restricted to doubling in magnitude, and if the average change in magnitude of loadings is greater than 10-fold, they are all shrunk back. When the user does not provide constraits, ASReml rotates the loadings each iteration. When !AILOADINGS i is specified, it also prevents AI updates of some loadings during the first i iterations. For f (> 1) factors, only the last factor is estimated (conditional on the earlier ones) in the first f − 1 iterations. Then pairs including the last are estimated until iteration i. If !AILOADINGS is not specified and !CONTINUE is used and initializes the XFA model from a lower order, the i parameter is set internally.
!AISINGULARITIES
can be specified to force a job to continue even though a singularity was detected in the Average Information (AI) matrix. The AI matrix is used to give updates to the variance parameter estimates. In release 1, if singularities were present in the AI matrix, a generalized inverse was used which effectively conditioned on whichever parameters were identified as singular. ASReml now aborts processing if such singularities appear unless the !AISINGULARITIES qualifier is set. Which particular parameter is singular is reported in the variance component table printed in the .asr file.
ASReml3
ASReml2
The most common reason for singularities is that the user has overspecified the model and is likely to misinterpret the results if not fully aware of the situation. Overspecification will occur in a direct product of two unconstrained variance matrices (see Section 2.4), when a random term is confounded with a fixed term and when there is no information in the data on a particular component. The best solution is to reform the variance model so that the ambiguity is removed, or to fix one of the parameters in the variance model so that the model can be fitted. For instance, if !ASUV is specified, you may also need !S2==1. Only rarely will it be reasonable to specify the !AISINGULARITIES qualifier. !BMP
sets hardcopy graphics file type to .bmp.
5
Command file: Reading the data
80
Table 5.5: List of rarely used job control qualifiers
qualifier
action
!BRIEF [n]
suppresses some of the information written to the .asr file. The data summary and regression coefficient estimates are suppressed. This qualifier should not be used for initial runs of a job until the user has confirmed from the data summary that the data is correctly interpreted by ASReml. Use !BRIEF 2 to cause the predicted values to be written to the .asr file instead of the .pvs file. Use !BRIEF -1 to get BLUE (fixed effect) estimates reported in .asr file. The !BRIEF qualifier may be set with the B command line option.
!BLUP n
is used to calculate the effects reported in the .sln file without calculating any derived quantities such as predicted values or updated variance parameters. For argument values 1:3, ASReml solves for the effects directly while for values 4:19 it solves the mixed model equations by iteration, allowing larger models to be fitted. With direct solution, the estimation REML iteration routine is aborted after n = 1: forming the estimates of the vector of fixed and random effects by matrix inversion, n = 2: forming the estimates of the vector of fixed and random effects, REML log-likelihood and residuals (this is the default), n = 3: forming the estimates of the vector of fixed and random effects, REML log-likelihood, residuals and inverse coefficient matrix. For arguments 4, 10:19, ASReml forms the mixed model equations and solves them iteratively to obtain solutions for the fixed and random effects. The options are: n = 4: forming the estimates of the vector of fixed and random effects using the Preconditioned Conjugate Gradient (PCG) Method (Mrode, 2005), n = 10:19 forming the estimates of the vector of fixed and random effects by Gauss-Seidel iteration of the mixed model equations, with relaxation factor n/10, The default maximum number of iterations is 12000. This can be reset by supplying a value greater than 100 with the !MAXIT qualifier in conjunction with the !BLUP qualifier. Iteration stops when the average squared update divided by the average squared effect is less than 1e−10 . Gauss-Seidel iteration is generally much slower than the PCG method.
ASReml2
ASReml3
5
Command file: Reading the data
81
Table 5.5: List of rarely used job control qualifiers
qualifier
action
ASReml prints its standard reports as if it had completed the iteration normally, but since it has not completed it, some of the information printed will be incorrect. In particular, variance information on the variance parameters will always be unavailable. Standard errors on the estimates will be wrong unless n=3. Residuals are not available if n=1. Use of n=3 or n=2 will halve the processing time when compared to the alternative of using !MAXIT 1 rather than a !tt !BLUP n qualifier. However, !MAXIT 1 does result in complete and correct output. !DENSE n
sets the number of equations solved densely up to a maximum of 5000. By default, sparse matrix methods are applied to the random effects and any fixed effects listed after random factors or whose equation numbers exceed 800. Use !DENSE n to apply sparse methods to effects listed before the !r (reducing the size of the DENSE block) or if you have large fixed model terms and want Wald F statistics calculated for them. Individual model terms will not be split so that only part is in the dense section. n should be kept small ( 2, data fields n. . . t are written to the file where t is the last defined column.
!PNG
sets hardcopy graphics file type to .png.
!PS
sets hardcopy graphics file type to .ps.
!PVSFORM n
modifies the format of the tables in the .pvs file and changes the file extension of the file to reflect the format. !PVSFORM 1 is TAB separated: .pvs → pvs.txt !PVSFORM 2 is COMMA separated: .pvs → pvs.csv !PVSFORM 3 is Ampersand separated: .pvs → pvs.tex See !TXTFORM for more detail.
!RESIDUALS [2]
instructs ASReml to write the transformed data and the residuals to a binary file. The residual is the last field. The file basename.srs is written in single precision unless the argument is 2 in which case basename.drs is written in double precision. Factor names are held in a .vll file: see !SAVE below.
ASReml3
ASReml2
5
Command file: Reading the data
87
Table 5.5: List of rarely used job control qualifiers
qualifier
action
The file will not be written from a spatial analysis (twodimensional error) when the data records have been sorted into field order because the residuals are not in the same order that the data is stored. The residual from a spatial analysis will have the units part added to it when units is also fitted. The .drs file could be renamed (with extension .dbl) and used for input in a subsequent run. !SAVE n
instructs ASReml to write the data to a binary file. The file asrdata.bin is written in single precision if the argument n is 1 or 3; asrdata.dbl is written in double precision if the argument n is 2 or 4; the data values are written before transformation if the argument is 1 or 2 and after transformation if the argument is 3 or 4. The default is single precision after transformation (see Section 4.2). When either !SAVE or !RESIDUALS is specified, ASReml saves the factor level labels to a basename.vll and attempts to read them back when data input is from a binary file. Note that if the job basename changes between runs, the .vll file will need to be copied to the new basename. If the .vll file does not match the factor structure (i.e. the same factors in the same order), reading the .vll file is aborted.
!SCREEN [n] [ !SMX m ] ASReml2
performs a ’Regression Screen’, a form of all subsets regression. For d model terms in the DENSE equations, there are 2d − 1 possible submodels. Since for d > 8, 2d − 1 is large, the submodels explored are reduced by the parameters n and m so that only models with at least n (default 1) terms but no more than m (default 6) terms are considered. The output (see page 225) is a report to the .asr file with a line for every submodel showing the sums of squares, degrees of freedom and terms in the model. There is a limit of d = 20 model terms in the screen. ASReml will not allow interactions to be included in the screened terms. For example, to identify which three of my set of 12 covariates best explain my dependent variable given the other terms in the model, specify !SCREEN 3 !SMX 3. The number of models evaluated quickly increases with d but ASReml has an arbitrary limit of 900 submodels evaluated. Use the !DENSE qualifier to control which terms are screened. The screen is conditional on all other terms (those in the SPARSE equations) being present.
5
Command file: Reading the data
88
Table 5.5: List of rarely used job control qualifiers
qualifier
action
!SLNFORM [n]
modifies the format of the .sln file. !SLNFORM -1 prevents the .sln file from being written. !SLNFORM 1 is TAB separated: .sln becomes sln.txt !SLNFORM 2 is COMMA separated: .sln becomes sln.csv !SLNFORM 3 is Ampersand separated: .sln becomes sln.tex See !TXTFORM for more detail.
!SPATIAL
increases the amount of information reported on the residuals obtained from the analysis of a two dimensional regular grid field trial. The information is written to the .res file.
!TABFORM [n]
controls form of the .tab file !TABFORM 1 is TAB separated: .tab becomes tab.txt !TABFORM 2 is COMMA separated: .tab becomes tab.csv !TABFORM 3 is Ampersand separated: .tab becomes tab.tex See !TXTFORM for more detail.
!TXTFORM [n]
sets the default argument for !PVSFORM, !SLNFORM, !TABFORM and !YHTFORM if these are not explicitly set. !TXTFORM (or !TXTFORM 1) replaces multiple spaces with TAB and changes the file extension to, say, sln.txt. This makes it easier to load the solutions into Excel. !TXTFORM 2 replaces multiple spaces with COMMA and changes the file extension to, say, sln.csv. However, since factor labels sometimes contain COMMAS, this form is not so convenient. !TXTFORM 3 replaces multiple spaces with Ampersand, appends a double backslash to each line and changes the file extension to say sln.tex (Latex style). Additional significant digits are reported with these formats. Omitting the qualifier means the standard fixed field format is used. For .yht and .sln files, setting n to -1 means the file is not formed.
!TWOWAY
modifies the appearance of the variogram calculated from the residuals obtained when the sampling coordinates of the spatial process are defined on a lattice. The default form is based on absolute ’distance’ in each direction. This form distinguishes same sign and different sign distances and plots the variances separately as two layers in the same figure.
!VCC n
specifies that n constraints are to be applied to the variance parameters. The constraint lines occur after the G structures are defined. The constraints are described in Section 7.9. The variance header line (Section 7.4) must be present, even if only 0 0 0 indicating there are no explicit R or G structures (see Section 7.9).
ASReml2
ASReml2
ASReml2
ASReml2
ASReml2
5
Command file: Reading the data
89
Table 5.5: List of rarely used job control qualifiers
qualifier
action
!VGSECTORS [s]
requests that the variogram formed with radial coordinates (see page 19) be based on s (4, 6 or 8) sectors of size 180/s degrees. The default is 4 sectors if !VGSECTORS is omitted and 6 sectors if it is specified without an argument. The first sector is centred on the X direction.
ASReml2
Figure 5.1 is the variogram using radial coordinates obtained using predictors of random effects fitted as fac(xsca,ysca). It shows low semivariance in xsca direction, high semivariance in the ysca direction with intermediate values in the 45 and 135 degrees directions. !YHTFORM [f ] ASReml2
!YSS [r] ASReml2
controls the form of the .yht file !YHTFORM -1 suppresses formation of the .yht file !YHTFORM 1 is TAB separated: .yht becomes yht.txt !YHTFORM 2 is COMMA separated: .yht becomes yht.csv !YHTFORM 3 is Ampersand separated: .yht becomes yht.tex adds r to the total Sum of Squares. This might be used with !DF to add some variance to the analysis when analysing summarised data.
Table 5.6: List of very rarely used job control qualifiers
qualifier
action
!CINV n
prints the portion of the inverse of the coefficient matrix pertaining to the nth term in the linear model. Because the model has not been defined when ASReml reads this line, it is up to the user to count the terms in the model to identify the portion of the inverse of the coefficient matrix to be printed. The option is ignored if the portion is not wholly in the SPARSE stored equations. The portion of the inverse is printed to a file with extension .cii The sparse form of the matrix only is printed in the form i j C ij , that is, elements of C ij that were not needed in the estimation process are not included in the file.
!FACPOINTS n
affects the number of distinct points recognised by the fac() model function (Table 6.1). The default value of n is 1000 so that points closer than 0.1% of the range are regarded as the same point.
5
Command file: Reading the data
90
Table 5.6: List of very rarely used job control qualifiers
qualifier
action
!KNOTS n
changes the default knot points used when fitting a spline to data with more than n different values of the spline variable. When there are more than n (default 50) points, ASReml will default to using n equally spaced knot points.
!NOCHECK
forces ASReml to use any explicitly set spline knot points (see !SPLINE) even if they do not appear to adequately cover the data values.
!NOREORDER
prevents the automatic reversal of the order of the fixed terms (in the dense equations) and possible reordering of terms in the sparse equations.
!NOSCRATCH
forces ASReml to hold the data in memory. ASReml will usually hold the data on a scratch file rather than in memory. In large jobs, the system area where scratch files are held may not be large enough. A Unix system may put this file in the /tmp directory which may not have enough space to hold it.
!POLPOINTS n
affects the number of distinct points recognised by the pol() model function (Table 6.1). The default value of n is 1000 so that points closer than 0.1% of the range are regarded as the same point.
!PPOINTS n
influences the number of points used when predicting splines and polynomials. The design matrix generated by the leg(), pol() and spl() functions are modified to include extra rows that are accessed by the PREDICT directive. The default value of n is 21 if there is no !PPOINTS qualifier. The range of the data is divided by n-1 to give a step size i. For each point p in the list, a predict point is inserted at p + i if there is no data value in the interval [p,p+1.1×i]. !PPOINTS is ignored if !PVAL is specified for the variable. This process also effects the number of levels identified by the fac() model term.
!REPORT
forces ASReml to attempt to produce the standard output report when there is a failure of the iteration algorithm. Usually no report is produced unless the algorithm has at least produced estimates for the fixed and random effects in the model. Note that residuals are not included in the output forced by this qualifier. This option is primarily intended to help debugging a job that is not converging properly.
!SCALE 1
When forming a design matrix for the spl() model term, ASReml uses a standardized scale (independent of the actual scale of the variable). The qualifier !SCALE 1 forces ASReml to use the scale of the variable. The default standardised scale is appropriate in most circumstances.
5
Command file: Reading the data
91
Table 5.6: List of very rarely used job control qualifiers
qualifier
action
!SCORE
requests ASReml write the SCORE vector and the Average Information matrix to files basename.SCO and basename.AIM. The values written are from the last iteration.
!SLOW n
reduces the update step sizes of the variance parameters more persistently than the !STEP r qualifier. If specified, ASReml looks at the potential size of the updates and if any are large, it reduces the size of r. If n is greater than 10 ASReml also modifies the Information matrix by multiplying the diagonal elements by n. This has the effect of further reducing the updates. In the iteration subroutine, if the calculated LogL is more than 1.0 less than the LogL for the previous iteration and !SLOW is set and NIT>1, ASReml immediately moves the variance parameters back towards the previous values and restarts the iteration.
!TOLERANCE [s1 [ s2 ]]
modifies the ability of ASReml to detect singularities in the mixed model equations. This is intended for use on the rare occasions when ASReml detects singularities after the first iteration; they are not expected.
ASReml2
ASReml2
Normally (when no !TOLERANCE qualifier is specified), a singularity is declared if the adjusted sum of squares of a covariable is less than a small constant (η) or less than the uncorrected sum of squares ×η, where η is 10−8 in the first iteration and 10−10 thereafter. The qualifier scales η by 10si for the the first or subsequent iterations respectively, so that it is more likely an equation will be declared singular. Once a singularity is detected, the corresponding equation is dropped (forced to be zero) in subsequent iterations. If neither argument is supplied, 2 is assumed. If the second argument is omitted, it is given the value of the first. If the problem of later singularities arises because of the low coefficient of variation of a covariable, it would be better to centre and rescale the covariable. If the degrees of freedom are correct in the first iteration, the problem will be with the variance parameters and a different variance model (or variance constraints) is required. !VRB ASReml2
requests writing of .vrb file. Previously, the default was to write the file.
this is a test of matern Variogram of fac(xsca,ysca) predictors 21.61
S e m i v a r i a n c e
135 0
45
90 Distance
Figure 5.1 Variogram in 4 sectors for Cashmore data
2.80
6
Command file: Specifying the terms in the mixed model Introduction Specifying model formulae in ASReml General rules Examples
Fixed terms in the model Primary fixed terms Sparse fixed terms
Random terms in the model Interactions, expansions and conditional factors Interactions Model Expansions Conditional factors
Alphabetic list of model functions Weights Missing values Missing values in the response Missing values in the explanatory variables
Some technical details about model fitting in ASReml Sparse versus dense Ordering of terms in ASReml Aliassing and singularities Examples of aliassing
93
Wald F Statistics
6
Command file: Specifying the terms in the mixed model
6.1
94
Introduction The linear mixed model is specified in ASReml as a series of model terms and qualifiers. In this chapter the model formula syntax is described.
6.2
Specifying model formulae in ASReml The linear mixed model is specified in ASReml as a series of model terms and qualifiers. Model terms include factor and variate labels (Section 5.4), functions of labels, special terms and interactions of these. The model is specified immediately after the datafile and any job control qualifier and/or tabulate lines. The syntax for specifying the model is
NIN Alliance Trial 1989 variety . . . column 11 nin89.asd !skip 1 yield ∼ mu variety !r repl, !f mv 1 2 11 column AR1 .3 22 row AR1 .3
response [!wt weight] ∼ fixed [!r random] [!f sparse fixed] •
response is the label for the response variable(s) to be analysed; multivariate analysis is discussed in Chapter 8,
•
weight is a label of a variable containing weights; weighted analysis is discussed in Section 6.7,
•
∼ separates response from the list of fixed and random terms,
•
fixed represents the list of primary fixed explanatory terms, that is, variates, factors, interactions and special terms for which Wald F statistics are required. See Table 6.1 for a brief definition of reserved model terms, operators and commonly used functions. The full definition is in Section 6.6,
•
random represents the list of explanatory terms to be fitted as random effects, see Table 6.1,
•
sparse fixed are additional fixed terms not included in the table of Wald F statistics.
General rules The following general rules apply in specifying the linear mixed model •
all elements in the model must be space separated,
•
elements in the model may also be separated by + which is ignored,
6
Command file: Specifying the terms in the mixed model
•
Choose
labels •
that will avoid
the character ∼ separates the response variables(s) from the explanatory variables in the model, data fields are identified in the model by their labels –
confusion
–
–
–
•
95
labels are case sensitive, labels may be abbreviated (truncated) when used in the model line but care must be taken that the truncated form is not ambiguous. If the truncated form matches more than one label, the term associated with the first match is assumed, For example, dens is an abbeviation for density but spl(dens,7) is a different model term to spl(density,7) because it does not represent a simple truncation. model terms may only appear once in the model line; repeated occurrences are ignored, model terms other than the original data fields are defined the first time they appear on the model line. They may be abbreviated (truncated) if they are referred to again provided no ambiguity is introduced. Important It is often clearer if labels are not abbreviated. If abbreviations are used then they need to be chosen to avoid confusion.
if the model is written over several lines, all but the final line must end with a comma to indicate that the list is continued.
In Tables 6.1 and 6.2, the arguments in model term functions are represented by the following symbols f — the label of a data variable defined as a model factor, k, n — an integer number, r — a real number, t — a model term label (includes data variables), v, y — the label of a data variable, Parsing of model terms in ASReml is not very sophisticated. Where a model term takes another model term as an argument, the argument must be predefined. If necessary, include the argument in the model line with a leading ’-’ which will cause the term to be defined but not fitted. For example Trait.male -Trait.female and(Trait.female)
6
Command file: Specifying the terms in the mixed model
96
Table 6.1: Summary of reserved words, operators and functions
model term
brief description
common usage fixed
reserved
mu
the constant term or intercept
terms
mv
a term to estimate missing values
Trait
multivariate counterpart to mu
units
forms a factor with a level for each experimental unit
. or :
placed between labels to specify an interaction
/
forms nested expansion (Section 6.5)
∗
forms factorial expansion (Section 6.5)
-
placed before model terms to exclude them from the model
,
placed at the end of a line to indicate that the model specification continues on the next line
+
treated as a space
operators
!{ ...
!}
√ √ √ √ √
√
√
√
√
√
√
√
√
√ √
placed around some model terms when it is important the terms not be reordered (Section 6.4)
commonly used
at(f,n)
condition on level n of factor f. n may be a list of values
functions
at(f)
forms conditioning covariables for all levels of factor f
fac(v )
forms a factor from v with a level for each unique value in v
fac(v,y )
forms a factor with a level for each combination of values in v and y
lin(f )
forms a variable from the factor f with values equal to 1. . . n corresponding to level(1). . . level(n) of the factor
spl(v [,k ])
forms the design matrix for the random component of a cubic spline for variable v
random
√
√
√
√ √ √
√
√
6
Command file: Specifying the terms in the mixed model
97
Table 6.1: Summary of reserved words, operators and functions
model term
other functions
brief description
t{n}
fits variable n from the !G set of variables t. This is a special case of the !SUBGROUP qualifier function applied to !G variables. Note that the square parentheses are permitted alternative syntax.
and(t[,r])
adds r times the design matrix for model term t to the previous design matrix; r has a default value of 1. If t is complex if may be necessary to predefine it by saying -t and(t,r)
c(f)
factor f is fitted with sum to zero constraints
cos(v,r)
forms cosine from v with period r
ge(f)
condition on factor/variable f >= r
giv(f,n)
associates the nth .giv G-inverse with the factor f
gt(f)
condition on factor/variable f > r
h(f)
factor f is fitted Helmert constraints
ide(f)
fits pedigree factor f without relationship matrix
inv(v[,r])
forms reciprocal of v + r
le(f)
condition on factor/variable f 100 levels) terms.
NIN Alliance Trial 1989 variety . . . row 22 column 11 nin89.asd !skip 1 yield ∼ mu variety !r repl, !f mv lin(row) 1 2 11 column AR1 .424 22 row AR1 .904
Random terms in the model The !r random terms in the model formula
NIN Alliance Trial 1989 variety . . . row 22 column 11 nin89.asd !skip 1 yield ∼ mu variety !r repl, !f mv 1 2 11 column AR1 .424 22 row AR1 .904
•
comprise random covariates, factors and interactions including special functions and reserved words, see Table 6.1,
•
involve an initial non-zero variance component or ratio (relative to the residual variance) default 0.1; the initial value can be specified after the model term or if the variance structure is not scaled identity, by syntax described in detail in Chapter 7,
•
an initial value of its variance (ratio) may be followed by a !GP (keep positive, the default), !GU (unrestricted) or !GF (fixed) qualifier, see Table 7.4,
•
use !{ and !} to group model terms that may not be reordered. Normally ASReml will reorder the model terms in the sparse equations - putting smaller terms first to speed up calculations. However, the order must be preserved if the user defines a structure for a term which also covers the following term(s) (a way of defining a covariance structure across model terms). Grouping is specifically required if the model terms are of differing sizes (number of effects). For example, for traits weaning weight and yearling weight, an animal model with maternal weaning weight should specify model terms !{ Trait.animal at(Trait,1).dam !} when fitting a genetic covariance between the direct and maternal effects.
•
The model can be split into submodels with !SM i qualifiers.
6
Command file: Specifying the terms in the mixed model
6.5
101
Interactions and conditional factors Interactions •
interactions are formed by joining two or more terms with a ‘.’ or a ‘:’, for example, a.b is the interaction of factors a and b,
•
interaction levels are arranged with the levels of the second factor nested within the levels of the first,
•
labels of factors including interactions are restricted to 31 characters of which only the first 20 are ever displayed. Thus for interaction terms it is often necessary to shorten the names of the component factors in a systematic way, for example, if Time and Treatment are defined in this order, the interaction between Time and Treatment could be specified in the model as Time.Treat; remember that the first match is taken so that if the label of each field begins with a different letter, the first letter is sufficient to identify the term,
•
interactions can involve model functions.
Expansions •
+ is ignored,
•
- makes sure the following term is defined but does notinclude it in the model,
•
* indicates factorial expansion (up to 5 way) a*b is expanded to a b a.b a*b*c*d is expanded to a b c d a.b a.c a.d b.c b.d c.d a.b.c a.b.d a.c.d b.c.d a.b.c.d
•
/ indicates nested expansion a/b is expanded to a a.b
•
a.(b c d) e is expanded to a.b a.c a.d e. This syntax is detected by the string ‘.(’ and the closing parenthesis must occur on the same line and before any comma indicating continuation. Any number of terms may be enclosed. Each may have ‘-’ prepended to suppress it from the model. Each enclosed term may have initial values and qualifiers following. For example,
ASReml2
yield∼site site.(lin(row) !r variety), at(site,1).(row .3 col .2) expands to yield∼site site.lin(row) !r site.variety, at(site,1).row .3 at(site,1).col .2
6
Command file: Specifying the terms in the mixed model
102
Conditional factors A conditional factor is a factor that is present only when another factor has a particular level.
ASReml2
•
individual components are specified using the at(f,n) function (see Table 6.2), for example, at(site,1).row will fit row as a factor only for site 1,
•
a complete set of conditional terms are specified by omitting the level specification in the at(f) function provided the correct number of levels of f is specified in the field definitions. Otherwise, a list of levels may be specified. –
–
Important
–
–
at(f).b creates a series of model terms representing b nested within a for any model term b. A model term is created for each level of a; each has the size of b. For example, if site and geno are factors with 3 and 10 levels respectively, then for at(site).geno ASReml constructs 3 model terms at(site,1).geno at(site,2).geno at(site,3).geno, each with 10 levels, this is similar to forming an interaction except that a separate model term is created for each level of the first factor; this is useful for random terms when each component can have a different variance. The same effect is achieved by using an interaction (e.g. site.geno) and associating a DIAG variance structure with the first component (see Section 7.5). any at() term to be expanded MUST be the FIRST component of the interaction. geno.at(site) will not work. at(site,1).at(year).geno will not work but at(year).at(site,1).geno is OK. the at() factor must be declared with the correct number of levels because the model line is expanded BEFORE the data is read. Thus if site is declared as site * or site !A in the data definitions, at(site).geno will expand to at(site,01).geno at(site,02).geno regardless of the actual number of sites.
Associated Factors
ASReml3
Sometimes there is a hierarchical structure to factors which should be recognised as it aids formulation of prediction tables (see !ASSOCIATE qualifier on page 188). Common examples are Genotypes grouped into Families and Locations grouped by Region. We call these associated factors. The key characteristic of associated factors is that they are coded such that the levels of one are uniquely nested in the levels of another. If one is unknown (coded as missing), all associated factors must
6
Command file: Specifying the terms in the mixed model
103
be unknown for that data record. It is typically unnecessary to interact associated factors except when required to adequately define the variance structure.
6.6
Alphabetic list of model functions Table 6.2 presents detailed descriptions of the model functions discussed above. Note that some three letter function names may be abbreviated to the first letter. Table 6.2: Alphabetic list of model functions and descriptions
model function
action
and(t,r) a(t,r)
overlays (adds) r times the design matrix for model term t to the existing design matrix. Specifically, if the model up to this point has p effects and t has a effects, the a columns of the design matrix for t are multiplied by the scalar r (default value 1.0) and added to the last a of the p columns already defined. The overlaid term must agree in size with the term it overlays. This can be used to force a correlation of 1 between two terms as in a diallel analysis male and(female) assuming the ith male is the same individual as the ith female. Note that if the overlaid term is complex, it must be predefined; e.g. Tr.male -Tr.female and(Tr.female).
at(f,n) @(f,n)
defines a binary variable which is 1 if the factor f has level n for the record. For example, to fit a row factor only for site 3, use the expression at(site,3).row.. The string @( is equivalent to at( for this function.
ASReml2 at(f) @(f)
at(f ) is expanded to a series of terms like at(f ,i) where i takes the values 01 to the number of levels of factor f . Since this command is interpreted before the data is read, it is necessary to declare the number of levels correctly in the field definition. This extended form may only be used as the first term in an interaction.
at(f,m,n) @(f,m,n)
at(f ,i,j,k) is expanded to a series of terms at(f ,i) at(f ,j) at(f ,k). Similarly, at(f ,i).X at(f ,j).X at(f ,k).X can be written as at(f ,i,j,k).X provided at(f ,i,j,k) is written as the first component of the interaction. Any number of levels may be listed.
cos(v,r)
forms cosine from v with period r. Omit r if v is radians. If v is degrees, r is 360.
con(f) c(f)
apply sum to zero constraints to factor f. It is not appropriate for random factors and fixed factors with missing cells. ASReml assumes you specify the correct number of levels for each factor. The formal effect of the con() function is to form a model term with the highest level formally equal to minus the sum of the preceding terms.
6
Command file: Specifying the terms in the mixed model
104
Table 6.2: Alphabetic list of model functions and descriptions
model function
action With sum to zero constraints, a missing treatment level will generate a singularity but in the first coefficient rather than in the coefficient corresponding to the missing treatment. In this case, the coefficients will not be readily interpretable. When interacting constrained factors, all cells in the cross-tabulation should have data.
fac(v) fac(v,y)
fac(v) forms a factor with a level for each value of x and any additional points inserted as discussed with the qualifiers !PPOINTS and !PVAL. fac(v,y) forms a factor with a level for each combination of values from v and y. The values are reported in the .res file.
giv(f,n) g(f,n)
associates the nth .giv G-inverse with the factor. This is used when there is a known (except for scale) G-structure other than the additive inverse genetic relationship matrix. The G-inverse is supplied in a file whose name has the file extension .giv described in Section 9.6
h(f)
h(f ) requests ASReml to fit the model term for factor f using Helmert constraints. Neither Sum-to-zero nor Helmert constraints generate interpretable effects if singularities occur. ASReml runs more efficiently if no constraints are applied. Following is an example of Helmert and sum-to-zero covariables for a factor with 5 levels. H1 H2 H3 H4 C1 C2 C3 C4 F1 -1 -1 -1 -1 1 0 0 0 F2 1 -1 -1 -1 0 1 0 0 F3 0 2 -1 -1 0 0 1 0 F4 0 0 3 -1 0 0 0 1 F5 0 0 0 4 -1 -1 -1 -1
ide(f) i(f)
is used to take a copy of a pedigree factor f and fit it without the genetic relationship covariance. This facilitates fitting a second animal effect. Thus, to form a direct, maternal genetic and maternal environment model, the maternal environment is defined as a second animal effect coded the same as dams. viz. !r !{ animal dam !} ide(dam)
inv(v[,r])
forms the reciprocal of v + r. This may also be used to transform the response variable.
leg(v,[-]n)
forms n+1 Legendre polynomials of order 0 (intercept), 1 (linear). . . n from the values in v; the intercept polynomial is omitted if n is preceded by the negative sign. The actual values of the coefficients are written to the .res file. This is similar to the pol() function described below.
lin(f) l(f)
takes the coding of factor f as a covariate. The function is defined for f being a simple factor, Trait and units. The lin(f) function does not centre or scale the variable. Motivation: Sometimes you may wish to fit a covariate as a random factor as well. If the coding is say 1. . .n, then you should define the field as a factor in the field definition and use the lin() function to include it as a covariate in the model. Do not centre the field in this case. If the covariate values are irregular, you would leave the field as a covariate and use the fac() function to derive a factor version.
ASReml2
6
Command file: Specifying the terms in the mixed model
105
Table 6.2: Alphabetic list of model functions and descriptions
model function
action
log(v[,r])
forms the natural log of v + r. This may also be used to transform the response variable.
ma1 ma1(f)
creates a first-differenced (by rows) design matrix which, when defining a random effect, is equivalent to fitting a moving average variance structure in one dimension. In the ma1 form, the first-difference operator is coded across all data points (assuming they are in time/space order). Otherwise the coding is based on the codes in the field indicated.
mbf(f,c) mbf(f)
is a term that is predefined by using the !MBF qualifier (see page 75)
mu
is used to fit the intercept/constant term. It is normally present and listed first in the model. It should be present in the model if there are no other fixed factors or if all fixed terms are covariates or contrasts except in the special case of regression through the origin.
mv
is used to estimate missing values in the response variable. Formally this creates a model term with a column for each missing value. Each column contains zeros except for a solitary -1 in the record containing the corresponding missing value. This is used in spatial analyses so that computing advantages arising from a balanced spatial layout can be exploited. The equations for mv and any terms that follow are always included in the sparse set of equations. Missing values are handled in three possible ways during analysis (see Section 6.9). In the simplest case, records containing missing values in the response variable are deleted. For multivariate (including some repeated measures) analysis, records with missing values are not deleted but ASReml drops the missing observation and uses the appropriate unstructured R-inverse matrix. For regular spatial analysis, we prefer to retain separability and therefore estimate the missing value(s) by including the special term mv in the model.
out(n) out(n,t) ASReml2
out(n), out(n,t) establishes a binary variable which is: out(i) 1 if data relates to observation i, (trait 1), else is 0 out(i,t) 1 if data relates to observation i, (trait t), else is 0 The intention is that this be used to test/remove single observations for example to remove the influence of an outlier or influential point. Possible outliers will be evident in the plot of residuals versus fitted values (see the .res file) and the appropriate record numbers for the out() term are reported in the .res file. Note that i relates to the data analysed and will not be the same as the record number as obtained by counting data lines in the data file if there were missing observations in the data and they have not been estimated. (To drop records based on the record number in the data file, use the !D transformation in association with the !=V0 transformation.)
6
Command file: Specifying the terms in the mixed model
106
Table 6.2: Alphabetic list of model functions and descriptions
model function
action
pol(v,n) p(v,n)
forms a set of orthogonal polynomials of order |n| based on the unique values in variate (or factor) v and any additional interpolated points, see !PPOINTS and !PVAL in Table 5.4. It includes the intercept if n is positive, omits it if n is negative. For example, pol(time,2) forms a design matrix with three columns of the orthogonal polynomial of degree 2 from the variable time. Alternatively, pol(time,-2) is a term with two columns having centred and scaled linear coefficients in the first column and centred and scaled quadratic coefficients in the second column. The actual values (Robson, 1959, Steep and Torrie, 1960) of the coefficients are written to the .res file. This factor could be interacted with a design factor to fit random regression models. The leg() function differs from the pol() function in the way the quadratic and higher polynomials are calculated.
pow(x, p[,o])
defines the covariable (x+o)p for use in the model where x is a variable in the data, p is a power and o is an offset. pow(x,0.5[,o]) is equivalent to sqr(x[,o]); pow(x,0[,o]) is equivalent to log(x[,o]); pow(x,-1[,o]) is equivalent to inv(x[,o]).
qtl(f,r)
calculates an expected marker state from flanking marker information at position r of the linkage group f (see !MM to define marker locations). r may be specified as $TPn where $TPn has been previously internally defined with a predict statement (see page 185). r should be given in Morgans.
sin(v,r)
forms sine from v with period r. Omit r if v is radians. If v is degrees, r is 360.
spl(v [,k]) s(v [,k ])
In order to fit spline models associated with a variate v and k knot points in ASReml, v is included as a covariate in the model and spl(v,k) as a random term. The knot points can be explicitly specified using the !SPLINE qualifier (Table 5.4). If k is specified but !SPLINE is not specified, equally spaced points are used. If k is not specified and there are less than 50 unique data values, they are used as knot points. If there are more than 50 unique points then 50 equally spaced points will be used. The spline design matrix formed is written to the .res file. An example of the use of spl() is price ∼ mu week !r spl(week)
sqrt(v[,r])
forms the square root of v + r. This may also be used to transform the response variable.
Trait
is used with multivariate data to fit the individual trait means. It is formally equivalent to mu but Trait is a more natural label for use with multivariate data. It is interacted with other factors to estimate their effects for all traits.
ASReml2
6
Command file: Specifying the terms in the mixed model
107
Table 6.2: Alphabetic list of model functions and descriptions
model function
action
units
creates a factor with a level for every record in the data file. This is used to fit the ’nugget’ variance when a correlation structure is applied to the residual.
uni(f[,0[,n]])
creates a factor with a new level whenever there is a level present for the factor f. Levels (effects) are not created if the level of factor f is 0, missing or negative. The size may be set in the third argument by setting the second argument to zero.
uni(f,k[,n])
creates a factor with a level for every record subject to the factor level of f equalling k, i.e. a new level is created for the factor whenever a new record is encountered whose integer truncated data value from data field f is k. Thus uni(site,2) would be used to create an independent error term for site 2 in a multi-environment trial and is equivalent to at(site,2).units. The default size of this model term is the number of data records. The user may specify a lower number as the third argument. There is little computational penalty from the default but the .sln file may be substantially larger than needed. However, if the units vector is full size, the effects are mapped by record number and added back to the fitted residual for creating ’residual’ plots.
vect(v)
is used in a multivariate analysis on a multivariate set of covariates (v) to pair them with the variates. The test example included signal !G 93 # 93 slides background !G 93 dart.asd !ASUV signal ∼ Trait Trait.vect(background) ... to fit a slide specific regression of signal on background. In this example, signal is a multivariate set of 93 variates and background is a set of 93 covariates. The signal values relate to either the Red or Green channels. So for each slide and channel, we need to fit a simple regression of signal ∼ mu background. But the data for the 93 slides is presented in parallel. If it were presented in series, with a factor slide indexing the slides, the equivalent model would be signal ∼ slide slide.background.
xfa(f,k)
Factor analytic models are discussed in Chapter 7. There are three forms, FAk, FACVk and XFAk where k is the number of factors. The XFAk form is a sparse formulation that requires an extra k levels to be inserted into the mixed model equations for the k factors. This is achieved by the xfa(f,k) model function which defines a design matrix based on the design matrix for f augmented with k columns of zeros for the k factors.
ASReml3
6
Command file: Specifying the terms in the mixed model
6.7
caution
6.8
108
Weights Weighted analyses are achieved by using !WT weight as a qualifier to the response variable. An example of this is y !WT wt ∼ mu A X where y is the name of the response variable and wt is the name of a variate in the data containing weights. If these are relative weights (to be scaled by the units variance) then this is all that is required. If they are absolute weights, that is, the reciprocal of known variances, use the !S2==1 qualifier described in Table 7.4 to fix the unit variance. When a structure is present in the residuals the weights are applied as a matrix product. If Σ is the structure and W is the diagonal matrix constructed from the square root of the values of the variate weight, then R−1 = W Σ−1 W . Negative weights are treated as zeros.
Generalized Linear (Mixed) Models
!IDENTITY
Table 6.3 Link qualifiers and functions Link Inverse Link Available with η=µ µ=η All
!SQRT
η=
Qualifier
√ µ
µ = η2
Poisson Normal, Poisson, Negative Binomial, Gamma
!LOGARITHM
η = ln(µ)
µ = exp(η)
!INVERSE
η = 1/µ
µ = 1/η
!LOGIT
η = µ/(1 − µ)
µ=
!PROBIT
η = Φ−1 (µ)
µ = Φ(η)
!COMPLOGLOG
η = ln(−ln(1 − µ))
µ = 1 − e−e
1 (1+exp(−η))
η
Normal, Gamma, Negative Binomial Binomial, Multinomial Threshold Binomial, Multinomial Threshold Binomial, Multinomial Threshold
where µ is the mean on the data scale and η = Xτ is the linear predictor on the underlying scale. ASReml includes facilities for fitting the family of Generalized Linear Models (GLMs, McCullagh and Nelder, 1994). A GLM is defined by a mean variance function and a link function. In this context y is the observation, n is the count for grouped data specified by the !TOTAL qualifier,
6
Command file: Specifying the terms in the mixed model
109
φ is a parameter set with the !PHI qualifier, µ is the mean on the data scale calculated using the inverse link function from the predicted value η on the underlying scale where η = Xτ , v is the variance under some distributional assumption calculated as a function of µ and n, and d is the deviance (-twice the log likelihood) for that distribution. GLMs are specified by qualifiers after the name of the dependent variable but before the ∼ character. Table 6.3 lists the link function qualifiers which relate the linear predictor (η) scale to the observation (µ =E[y]) scale. Table 6.4 lists the distribution and other qualifiers. Table 6.4: GLM distribution qualifiers The default link is listed first followed by permitted alternatives.
qualifiers
action
!NORMAL [ !IDENTITY | !LOGARITHM | !INVERSE ] allows the model to be fitted on the log/inverse scale but with the residuals on the natural scale. !NORMAL !IDENTITY is the default. !BINOMIAL [ !LOGIT | !IDENTITY | !PROBIT | !COMPLOGLOG ] [ !TOTAL n ] v = µ(1 − µ)/n Proportions or counts [r = ny] are indicated if !TOTAL specifies the d = 2n(yln(y/µ) variate containing the binomial totals. Proportions are assumed if 1−y +(1 − y)ln( 1−µ no response value exceeds 1. A binary variate [0, 1] is indicated if )) !TOTAL is unspecified. The expression for d on the left applies when y is proportions (or binary). The logit is the default link function. The variance on the underlying scale is π 2 /3 ∼ 3.3 (underlying logistic distribution) for the logit link. !MULTINOMIAL k !CUMULATIVE [ !LOGIT | !PROBIT | !COMPLOGLOG ] [ !TOTAL n ] fits a multiple threshold model with t = k − 1 thresholds to vij = µi (1 − µj )/n polytomous ordinal data with k categories assuming a multinomial for i ≤ j ≤ t distribution. Typically, the response variable is a single variable containing the d = 2nΣki=1 ordinal score (1 : k) or a set of k variables containing counts (ri ) (yi ln(yi /pi )+ in the k categories. The response may also be a series of t binary where Yi = Σij=1 yj variables or a series of t variables containing counts. If t counts µi =E(Yi ) are supplied, the total (including the kth category) must be given and pi = µi − µi−1 in another variable indicated by the !TOTAL qualifier. ASReml3
6
Command file: Specifying the terms in the mixed model
110
Table 6.4: GLM distribution qualifiers
qualifier
action The multinomial threshold model is fitted as a cumulative probability model. The proportions (yi = ri /n) in the ordered categories are summed to form the cumulative proportions (Yi ) which are modelled with logit (!LOGIT), probit (!PROBIT) or Complementary LogLog (!CLOG) link functions. The implicit residual variance on the underlying scale is π 2 /3 ∼ 3.3 (underlying logistic distribution) for the logit link, 1 for the probit link. The distribution underlying the Complementary LogLog link is the Gumbel distribution with implicit residual variance on the underlying svale of π 2 /6 ∼ 1.65 For example Lodging !MULTINOMIAL 4 !CUMULATIVE ∼ Trait Variety !r block predict Variety where Lodging is a factor with 4 ordered categories. Predicted values are reported for the cumulative proportions.
!POISSON [ !LOGARITHM | !IDENTITY | !SQRT ] Natural logarithms are the default link function. v=µ ASReml assumes the Poisson variable is not negative. d = 2(yln(y/µ) −(y − µ)) !GAMMA [ !INVERSE | !IDENTITY | !LOGARITHM ] [ !PHI φ ] [ !TOTAL n ] v = µ2 /(φn) The inverse is the default link function. n is defined with the d = 2n(−φln( φy ) !TOTAL qualifier and would be degrees of freedom in the typical µ application to mean-squares. The default value of φ is 1. + φy−µ ) µ !NEGBIN [ !LOGARITHM | !IDENTITY | !INVERSE ] [ !PHI φ ] v = µ + µ2 /φ fits the Negative Binomial distribution. Natural logarithms are d = 2((φ + y)ln( µ+φ the default link function. The default value of φ is 1. ) y+φ +yln( µy )) General qualifiers !AOD
requests an Analysis of Deviance table be generated. This is formed by fitting a series of sub models for terms in the DENSE part building up to the full model, and comparing the deviances. An example if its use is LS !BIN !TOT COUNT !AOD ∼ mu SEX GROUP !AOD may not be used in association with PREDICT.
!DISP [h]
includes an overdispersion scaling parameter (h) in the weights. If !DISP is specified with no argument, ASReml estimates it as the residual variance of the working variable. Traditionally it is estimated from the deviance residuals, reported by ASReml as Variance heterogeneity. An example if its use is count !POIS !DISP ∼ mu group
ASReml2 Caution
6
Command file: Specifying the terms in the mixed model
111
Table 6.4: GLM distribution qualifiers
qualifier
action
!OFFSET [o]
is used especially with binomial data to include an offset in the model where o is the number or name of a variable in the data. The offset is only included in binomial and Poisson models (for Normal models just subtract the offset variable from the response variable), for example count !POIS !OFFSET base !DISP ∼ mu group The offset is included in the model as η = Xτ + o. The offset will often be something like ln(n).
Revised 08
!TOTAL [n]
is used especially with binomial and ordinal data where n is the field containing the total counts for each sample. If omitted, count is taken as 1.
Residual qualifiers
control the form of the residuals returned in the .yht file. The predicted values returned in the .yht file will be on the linear predictor scale if the !WORK or !PVW qualifiers are used. They will be on the observation scale if the !DEVIANCE, !PEARSON, !RESPONSE or !PVR qualifiers are used.
!DEVIANCE
produces deviance residuals, the signed square root of d/h from Table 6.4 where h is the dispersion parameter controlled by the !DISP qualifier. This is the default.
!PEARSON
writes Pearson residuals,
!PVR
writes fitted values on the response scale in the .yht file. This is the default.
!PVW
writes fitted values on the linear predictor scale in the .yht file.
!RESPONSE
produces simple residuals, y − µ
!WORK
produces residuals on the linear predictor scale,
y−µ √ , v
in the .yht file
y−µ dµ/dη
A second dependent variable may be specified (except with a multinomial response (!MULTINOMIAL)) if a bivariate analysis is required but it will always be treated as a normal variate (no syntax is provided for specifying GLM attributes for it). The !ASUV qualifier is required in this situation for the GLM weights to be utilized.
6
Command file: Specifying the terms in the mixed model
112
Generalized Linear Mixed Models This section was written by Damian Collins A Generalized Linear Mixed Model (GLMM) is an extension of a GLM to include random terms in the linear predictor. Inference concerning GLMMs is impeded by the lack of a closed form expression for the likelihood. ASReml currently uses an approximate likelihood technique called penalized quasi-likelihood, or PQL (Breslow and Clayton, 1993), which is based on a first order Taylor series approximation. This technique is also known as Schalls technique (Schall, 1991), pseudo-likelihood (Wolfinger and OConnell, 1993) and joint maximisation (Harville and Mee, 1984, Gilmour et al., 1985). Implementations of PQL are found in many statistical packages, for instance, in the GLMM (Welham, 2005) and the IRREML procedures of Genstat (Keen, 1994), the MLwiN package (Goldstein et al., 1998), the GLMMIX macro in SAS (Wolfinger, 1994), and in the GLMMPQL function in R. The PQL technique is well-known to suffer from estimation biases for some types of GLMMs. For grouped binary data with small group sizes, estimation biases can be over 50% (e.g. Breslow and Lin, 1995, Goldstein and Rasbash, 1996, Rodriguez and Goldman, 2001, Waddington et al., 1994). For other GLMMs, PQL has been reported to perform adequately (e.g. Breslow, 2003). McCulloch and Searle (2001) also discuss the use of PQL for GLMMs. The performance of PQL in other respects, such as for hypothesis testing, has received much less attention, and most studies into PQL have examined only relatively simple GLMMs. Anecdotal evidence suggests that this technique may give misleading results in certain situations. Therefore we cannot recommend the use of this technique for general use, and it is included in the current version of ASReml for advanced users. If this technique is used, we recommend the use of cross-validatory assessment, such as applying PQL to simulated data from the same design (Millar and Willis, 1999).
Caution
6.9
The standard GLM Analysis of Deviance (!AOD) should not be used when there are random terms in the model as the variance components are reestimated for each submodel.
Missing values Missing values in the response
6
Command file: Specifying the terms in the mixed model
It is sometimes computationally convenient to estimate missing values, for example, in spatial analysis of regular arrays, see example 3a in Section 7.3. Missing values are estimated if the model term mv is included in the model. Formally, mv creates a factor with a covariate for each missing value. The covariates are coded 0 except in the record where the particular missing value occurs, where it is coded -1.
113
NIN Alliance Trial 1989 variety . . . row 22 column 11 nin89.asd !skip 1 yield ∼ mu variety !r repl, !f mv 1 2 11 column AR1 .424 22 row AR1 .904
The action when mv is omitted from the model depends on whether a univariate or multivariate analysis is being performed. For a univariate analysis, ASReml discards records which have a missing response. In multivariate analyses, all records are retained and the R matrix is modified to reflect the missing value pattern.
Missing values in the explanatory variables ASReml will abort the analysis if it finds missing values in the design matrix unless !MVINCLUDE or !MVREMOVE is specified, see Section 5.8. !MVINCLUDE causes the missing value to be treated as a zero. !MVREMOVE causes ASReml to discard the whole record. Records with missing values in particular fields can be explicitly dropped using the !DV * transformation, Table 5.1. Covariates: Treating missing values as zero in covariates is usually only sensible if the covariate is centred (has mean of zero).
Revised 08
Design factors: Where the factor level is zero (or missing and the !MVINCLUDE qualifier is specified), no level is assigned to the factor for that record. These effectively defines an extra level (class) in the factor which becomes a reference level.
6
Command file: Specifying the terms in the mixed model
6.10
114
Some technical details about model fitting in ASReml Sparse versus dense ASReml partitions the terms in the linear model into two parts: a dense set and a sparse set. The partition is at the !r point unless explicitly set with the !DENSE data line qualifier or mv is included before !r, see Table 5.5. The special term mv is always included in sparse. Thus random and sparse terms are estimated using sparse matrix methods which result in faster processing. The inverse coefficient matrix is fully formed for the terms in the dense set. The inverse coefficient matrix is only partially formed for terms in the sparse set. Typically, the sparse set is large and sparse storage results in savings in memory and computing. A consequence is that the variance matrix for estimates is only available for equations in the dense portion.
Ordering of terms in ASReml The order in which estimates for the fixed and random effects in linear mixed model are reported will usually differ from the order the model terms are specified. Solutions to the mixed model equations are obtained using the methods outlined Gilmour et al., 1995. ASReml orders the equations in the sparse part to maintain as much sparsity as it can during the solution. After absorbing them, it absorbs the model terms associated with the dense equations in the order specified.
Aliassing and singularities A singularity is reported in ASReml when the diagonal element of the mixed model equations is effectively zero (see the !TOLERANCE qualifier) during absorption. It indicates there is either •
no data for that fixed effect, or
•
a linear dependence in the design matrix means there is no information left to estimate the effect.
ASReml handles singularities by using a generalized inverse in which the singular row/column is zero and the associated fixed effect is zero. Which equations are singular depends on the order the equations are processed. This is controlled by ASReml for the sparse terms but by the user for the dense terms. They should be specified with main effects before interactions so that the table of Wald F statistics has correct marginalization. Since ASReml processes the dense terms from the bottom up, the first level (the last level processed) is often singular.
6
Command file: Specifying the terms in the mixed model
115
The number of singularities is reported in the .asr file immediately prior to the REML log-likelihood (LogL) line for that iteration (see Section 14.3). The effects (and associated standard or prediction error) which correspond to these singularities are zero in the .sln file. Warning
Singularities in the sparse fixed terms of the model may change with changes in the random terms included in the model. If this happens it will mean that changes in the REML log-likelihood are not valid for testing the changes made to the random model. This situation is not easily detected as the only evidence will be in the .sln file where different fixed effects are singular. A likelihood ratio test is not valid if the fixed model has changed.
Examples of aliassing The sequence of models in Table 6.5 are presented to facilitate an understanding of over-parameterised models. It is assumed that var is a factor with 4 levels, trt with 3 levels and rep with 3 levels and that all var.trt combinations are present in the data. Table 6.5: Examples of aliassing in ASReml
model
number of singularities
order of fitting
yield ∼ var !r rep
0
rep var
yield ∼ mu var !r rep
1
rep mu var first level of var is aliassed and set to zero
yield ∼ var trt !r rep
1
rep var trt var fully fitted, first level of trt is aliassed and set to zero
yield ∼ mu var trt, var.trt !r rep
8
rep mu var trt var.trt first levels of both var and trt are aliassed and set to zero, together with subsequent interactions
yield ∼ mu var trt !r rep, !f var.trt
8
[ var.trt rep ] mu var trt var.trt fitted before mu, var and trt, var.trt fully fitted; mu, var and trt are completely singular and set to zero. The order within [ var.trt rep ] is determined internally.
6
Command file: Specifying the terms in the mixed model
6.11
116
Wald F Statistics The so called ANOVA table of Wald F statistics has 4 forms: Source NumDF F-inc Source NumDF F-inc F-con M Source NumDF DDF_inc F-inc P-inc Source NumDF DDF_con F-inc F-con M P-con depending on whether conditional Wald F statistics are reported (requested by the !FCON qualifier) and whether the denominator degrees of freedom are reported. ASReml always reports incremental Wald F statistics (F-inc) for the fixed model terms (in the DENSE partition) conditional on the order the terms were nominated in the model. Note that probability values are only available when the denominator degrees of freedom is calculated, and this must be explicitly requested with the !DDF qualifier in larger jobs. Users should study Section 2.6 to understand the contents of this table. The ’conditional maximum’ model used as the basis for the conditional F statistic is spelt out in the .aov file described in section 14.4. The numerator degrees of freedom (NumDF) for each term is easily determined as the number of non-singular equations involved in the term. However, in general, calculation of the denominator degrees of freedom (DDF) is not trivial. ASReml will by default attempt the calculation for small analyses, by one of two methods. In larger analyses, users can request the calculation be attempted using the !DDF qualifier (page 69). Use !DDF -1 to prevent the calculation to save processing time when significance testing is not required.
7
Command file: Specifying the variance structures Introduction Non-singular variance matrices
Variance model specification in ASReml A sequence of structures for the NIN data Variance structures General syntax Variance header line R structure definition G structure header and definition lines
Variance model description Forming the variance models from correlation models Additional notes of variance models
Variance structure qualifiers Rules for combining variance models G structures involving more than one random term Constraining variance parameters Parameter constraint within a variance model Constraints between and within variance models
Model building using the !CONTINUE qualifier 117
7
Command file: Specifying the variance structures
7.1
118
Introduction The subject of this chapter is variance model specification in ASReml. ASReml allows a wide range of models to be fitted. The key concepts you need to understand are •
the mixed linear model y = Xτ + Zu + e has a residual term e ∼ N (0, R) and random effects u ∼ N (0, G),
•
we use the terms R structure and G structure to refer to the independent blocks of R and G respectively,
•
R and G structures are typically formed as a direct product of particular variance models,
•
the order of terms in a direct product must agree with the order of effects in the corresponding model term,
•
variance models may be correlation matrices or variance matrices with equal or unequal variances on the diagonal. A model for a correlation matrix (eg. AR1) can be converted to an equal variance form (eg. AR1V) and to a heterogeneous variance form (eg. AR1H),
•
variances are sometimes estimated as variance ratios (relative to the residual variance).
These issues are fully discussed in Chapter 2. In this chapter we begin by considering an ordered sequence of variance structures for the NIN variety trial (see Section 7.3). This is to introduce variance modelling in practice. We then present the topics in detail.
Non singular variance matrices When undertaking the REML estimation, ASReml needs to invert each variance matrix. For this it requires that the matrices be negative definite or positive definite. They must not be singular. Negative definite matrices will have negative elements on the diagonal of the matrix and/or its inverse. The exception is the XFA model which has been specifically designed to fit singular matrices (Thompson et al. 2003). Let x0 Ax represent an arbitrary quadratic form for x = (x1 , . . . , xn )0 . The quadratic form is said to be nonnegative definite if x0 Ax ≥ 0 for all x ∈ Rn . If x0 Ax is nonnegative definite and in addition the null vector 0 is the only value of x for which x0 Ax = 0, then the quadratic form is said to be positive definite. Hence the matrix A is said to be positive definite if x0 Ax is positive definite, see Harville (1997), pp 211.
7
Command file: Specifying the variance structures
7.2
119
Variance model specification in ASReml The variance models are specified in the ASReml command file after the model line, as shown in the code box. In this case just one variance model is specified (for replicates, see model 2b below for details). predict and tabulate lines may appear after the model line and before the first variance structure line. These are described in Chapter 10.
NIN Alliance Trial 1989 variety !A . . . column 11 nin89.asd !skip 1 yield ∼ mu variety !r repl 0 0 1 repl 1 repl 0 IDV 0.1
Table 7.3 presents the full range of variance models available in ASReml. The identifiers for specifying the individual variance models in the command file are described in Section 7.5 under Specifying the variance models in ASReml. Many of the models are correlation models. However, these are generalized to homogeneous variance models by appending V to the base identifier. They are generalized to heterogeneous variance models by appending H to the base identifier.
7.3
A sequence of structures for the NIN data Eight variance structures of increasing complexity are now considered for the NIN field trial data (see Chapter 3 for an introduction to these data). This is to give a feel for variance modelling in ASReml and some of the models that are possible.
See Section 2.1
Before proceeding, it is useful to link this section to the algebra of Chapter 2. In this case the mixed linear model is y = Xτ + Zu + e where y is the vector of yield data, τ is a vector of fixed variety effects but would also include fixed replicate effects in a simple RCB analysis and might also include fixed missing value effects when spatial models are considered, u ∼ N (0, G) is a vector of random effects (for example, random replicate effects) and the errors are in e ∼ N (0, R). The focus of this discussion is on •
changes to u and e and the assumptions about these terms,
•
the impact this has on the specification of the G structures for u and the R structures for e.
7
Command file: Specifying the variance structures
120
1 Traditional randomised complete block (RCB) analysis The only random term in a traditional RCB analysis of these data is the (residual) error term e ∼ N (0, σe2 I 224 ). The model therefore involves just one R structure and no G structures (u = 0). In ASReml •
the error term is implicit in the model and is not formally specified on the model line,
•
the IID variance structure (R = σe2 I 224 ) is the default for error.
NIN Alliance Trial 1989 variety !A id pid raw repl 4 . . . row 22 column 11 nin89.asd !skip 1 yield ∼ mu variety repl
Important The error term is always present in the model but its variance structure does not need to be formally declared when it has the default IID structure.
2a Random effects RCB analysis The random effects RCB model has 2 random terms to indicate that the total variation in the data is comprised of 2 components, a random replicate effect ur ∼ N (0, γr σe2 I 4 ) where γr = σr2 /σe2 , and error as in 1. This model involves both the original implicit IID R structure and an implicit IID G structure for the random replicates. In ASReml •
See Section 6.4
IID variance structure is the default for random terms in the model.
NIN Alliance Trial 1989 variety !A id pid raw repl 4 . . . row 22 column 11 nin89.asd !skip 1 yield ∼ mu variety !r repl
For this reason the only change to the former command file is the insertion of !r before repl. Important All random terms (other than error which is implicit) must be written after !r in the model specification line(s).
7
Command file: Specifying the variance structures
121
2b Random effects RCB analysis with a G structure specified
See Section 7.4
See page 131
This model is equivalent to 2a but we explicNIN Alliance Trial 1989 variety !A itly specify the G structure for repl, that is, id 2 ur ∼ N (0, γr σe I 4 ), to introduce the syntax. pid The 0 0 1 line is called the variance header raw repl 4 line. In general, the first two elements of this . . line refer to the R structures and the third el. ement is the number of G structures. In this row 22 column 11 case 0 0 tells ASReml that there are no exnin89.asd !skip 1 plicit R structures but there is one G structure yield ∼ mu variety !r repl (1). The next two lines define the G structure. 0 0 1 repl 1 The first line, a G structure header line, links 4 0 IDV 0.1 the structure that follows to a term in the linear model (repl) and indicates that it involves one variance model (1) (a 2 would mean that the structure was the direct product of two variance models). The second line tells ASReml that the variance model for replicates is IDV of order 4 (σr2 I4 ). The 0.1 is a starting value for γr = σr2 /σe2 ; a starting value must be specified. Finally, the second element (0) on the last line of the file indicates that the effects are in standard order. There is almost always a 0 (no sorting) in this position for G structures. The following points should be noted: •
the 4 on the final line could have been written as repl to give repl 0 IDV 0.1 This would tell ASReml that the order or dimension of the IDV variance model is equal to the number of levels in repl (4 in this case),
•
when specifying G structures, the user should ensure that one scale parameter is present. ASReml does not automatically include and estimate a scale parameter for a G structure when the explicit G structure does not include one. For this reason –
See Sections 2.1 and 7.5
–
•
the model supplied when the G structure involves just one variance model must not be a correlation model (all diagonal elements equal 1), all but one of the models supplied when the G structure involves more than one variance model must be correlation models; the other must be either an homogeneous or a heterogeneous variance model (see Section 7.5 for the distinction between these models; see also 5 for an example),
an initial value must be supplied for all parameters in G structure definitions. ASReml expects initial values immediately after the variance model identifier
7
Command file: Specifying the variance structures
122
or on the next line (0.1 directly after IDV in this case), – –
See Chapter 15
–
•
0 is ignored as an initial value on the model line, if there is no initial value after the identifier, ASReml will look on the next line, if ASReml does not find an initial value it will stop and give an error message in the .asr file,
in this case V = σr2 Z r Z 0r + σe2 I 224 which is fitted as σe2 (γr Z r Z 0r + I 224 ) where γr is a variance ratio (γr = σr2 /σe2 ) and σe2 is the scale parameter. Thus 0.1 is a reasonable initial value for γr regardless of the scale of the data.
3a Two-dimensional spatial model with spatial correlation in one direction
See page 129
This code specifies a two-dimensional spatial NIN Alliance Trial 1989 variety !A structure for error but with spatial correlaid tion in the row direction only, that is, e ∼ pid N (0, σe2 I 11 ⊗ Σr (ρr )). The variance header raw repl 4 line tells ASReml that there is one R struc. . ture (1) which is a direct product of two vari. ance models (2); there are no G structures (0). row 22 column 11 The next two lines define the components of nin89aug.asd !skip 1 the R structure. A structure definition line yield ∼ mu variety !f mv must be specified for each component. For 1 2 0 11 column ID V = σe2 I 11 ⊗Σ(ρr ), the first matrix is an iden22 row AR1 0.3 tity matrix of order 11 for columns (ID), the second matrix is a first order autoregressive correlation matrix of order 22 for rows (AR1) and the variance scale parameter σe2 is implicit. Note the following: •
placing column and row in the second position on lines 1 and 2 respectively tells ASReml to internally sort the data rows within columns before processing the job. This is to ensure that the data matches the direct product structure specified. If column and row were replaced with 0 in these two lines, ASReml would assume that the data were already sorted in this order (which is not true in this case),
•
the 0.3 on line 2 is a starting value for the autoregressive row correlation. Note that for spatial analysis in two dimensions using a separable model, a complete matrix or array of plots must be present. To achieve this we augmented the data with the 18 records for the missing yields as shown on page 30. In the
7
Command file: Specifying the variance structures
123
augmented data file the yield data for the missing plots have all been made NA (one of the missing value indicators in ASReml) and variety has been arbitrarily coded LANCER for all of the missing plots (any of the variety names could have been used), •
!f mv is now included in the model specification. This tells ASReml to estimate the missing values. The !f before mv indicates that the missing values are fixed effects in the sparse set of terms,
•
unlike the case with G structures, ASReml automatically includes and estimates a scale parameter for R structures (σe2 for V = σe2 (I 11 ⊗ Σ(ρr )) in this case). This is why the variance models specified for row (AR1) and column (ID) are correlation models. The user could specify a non-correlation model (diagonal elements 6= 1) in the R structure definition, for example, ID could be replaced by IDV to represent V = σe2 (σc2 I 11 ) ⊗ Σ(ρr ). However, IDV would then need to be followed by !S2==1 to fix σe2 at 1 and prevent ASReml trying (unsuccessfully) to estimate both parameters as they are confounded: the scale parameter associated with IDV and the implicit error variance parameter, see Section 2.1 under Combining variance models. Specifically, the code
See Chapter 14 See Sections 6.3 and 6.10
See Sections 2.1 and 7.5
See Section 7.7
11 column IDV 48 !S2==1 would be required in this case, where 48 is the starting value for the variances. This complexity allows for heterogeneous error variance.
3b Two-dimensional separable autoregressive spatial model This model extends 3a by specifying a first order autoregressive correlation model of order 11 for columns (AR1). The R structure in this case is therefore the direct product of two autoregressive correlation matrices that is, V = σe2 Σc (ρc ) ⊗ Σr (ρr ), giving a twodimensional first order separable autoregressive spatial structure for error. The starting column correlation in this case is also 0.3. Again note that σe2 is implicit.
NIN Alliance Trial 1989 variety !A id . . . row 22 column 11 nin89aug.asd !skip 1 yield ∼ mu variety !f mv 1 2 0 11 column AR1 0.3 22 row AR1 0.3
7
Command file: Specifying the variance structures
124
3c Two-dimensional separable autoregressive spatial model with measurement error This model extends 3b by adding a random units term. Thus V = σe2 (γη I 242 + Σc (ρc ) ⊗ Σr (ρr )) . The reserved word units tells ASReml to construct an additional random term with one level for each experimental unit so that a second (independent) error term can be fitted. A units term is fitted in the model in cases like this, where a variance structure is applied to the errors. Because a G structure is not explicitly specified here for units, the default IDV structure is assumed. The units term is often trial data to allow for a nugget effect.
NIN Alliance Trial 1989 variety !A id . . . row 22 column 11 nin89aug.asd !skip 1 yield ∼ mu variety !r units, !f mv 1 2 0 11 column AR1 0.3 22 row AR1 0.3
fitted in spatial models for field
4 Two-dimensional separable autoregressive spatial model with random replicate effects
See Section 7.4
This is essentially a combination of 2b and 3c to demonstrate specifying an R structure and a G structure in the same model. The variance header line 1 2 1 indicates that there is one R structure (1) that involves two variance models (2) and is therefore the direct product of two matrices, and there is one G structure (1). The R structures are defined first so the next two lines are the R structure definition lines for e, as in 3b. The last two lines are the G structure definition lines for repl, as in 2b. In this case V = σe2 (γr I242 + Σc (ρc ) ⊗ Σr (ρr )) .
NIN Alliance Trial 1989 variety !A id . . . row 22 column 11 nin89aug.asd !skip 1 yield ∼ mu variety !r repl, !f mv 1 2 1 11 column AR1 0.3 22 row AR1 0.3 repl 1 repl 0 IDV 0.1
7
Command file: Specifying the variance structures
125
Table 7.1: Sequence of variance structures for the NIN field trial data
ASReml syntax
extra random terms
residual error term
term
term
G structure models 1
R structure models
2
1
2
1
yield ∼ mu variety repl
-
-
error
ID
-
2a
yield ∼ mu variety, !r repl
repl
IDV
error
ID
-
2b
yield ∼ mu variety, !r repl 0 0 1 repl 1 4 0 IDV 0.1
repl
IDV
error
ID
-
3a
yield ∼ mu variety, !f mv 1 2 0 11 column ID 22 row AR1 0.3
-
-
column.row
ID
AR1
3b
yield ∼ mu variety, !f mv 1 2 0 11 column AR1 0.3 22 row AR1 0.3
-
-
column.row
AR1
AR1
3c
yield ∼ mu variety, !r units !f mv 1 2 0 11 column AR1 0.3 22 row AR1 0.3
units
IDV
column.row
AR1
AR1
4
yield ∼ mu variety, !r repl !f mv 1 2 1 11 column AR1 0.3 22 row AR1 0.3 repl 1 4 0 IDV 0.1
repl
IDV
column.row
AR1
AR1
5
yield ∼ mu variety, !r column.row 0 0 1 column.row 2 column 0 AR1 .5 row 0 AR1V 0.5 0.1
column.row
AR1
error
ID
-
AR1V
7
Command file: Specifying the variance structures
126
5 Two-dimensional separable autoregressive spatial model defined as a G structure This model is equivalent to 3c but with the spatial model defined as a G structure rather than an R structure. As discussed in 2b, one and only one of the component models must be a variance model and all others must be correlation models. See Section 7.7
The V in AR1V converts the correlation model AR1 to a variance model and the second initial value (0.1) is for the variance (ratio). That is, V = σe2 (γrc Σc (ρc ) ⊗ Σr (ρr ) + I224 ) .
NIN Alliance Trial 1989 variety !A id . . . row 22 column 11 nin89.asd !skip 1 yield ∼ mu variety, !r row.column 0 0 1 row.column 2 row 0 AR1V 0.5 0.1 column 0 AR1 0.5
Try starting this model with initial correlations of 0.3; it fails to converge! Use of row.column as a G structure is a useful approach for analysing incomplete spatial arrays; it will often run faster for large trials but requires more memory. Important
7.4
Note that we have used the original version of the data and !f mv is omitted from this analysis since row.column is fitted as a G structure. If we had used the augmented data nin89aug.asd we would still omit !f mv and ASReml would discard the records with missing yield.
Variance structures The previous sections have introduced variance modelling in ASReml using the NIN data for demonstration. In this and the remaining sections the syntax is described formally, still using the example where appropriate.
Revised 08
Recall from Equation 2.2 on page 7 that the variance for the random effects in the linear mixed model was defined including an overall scale parameter θ. When this parameter is 1.0, R and G are defined in terms of variances. Otherwise they are defined relative to this scale parameter. Typically, θ is 1 if there are several residual variances as in the case of multivariate analysis (a different residual variance for each trait) or multienvironment trials (a different residual variance for each trial). However, for simple analyses with a single residual variance, θ is modelled as the residual variance so that R becomes a correlation matrix.
7
Command file: Specifying the variance structures
127
General syntax Variance model specification in ASReml has the following general form [variance header line [R structure definition lines ] [G structure header and definition lines ] [variance parameter constraints ]] •
variance header line specifies the number of R and G structures,
•
R structure definition lines define the R structures (variance models for error) as specified in the variance header line,
•
G structure header and definition lines define the G structures (variance models for the additional random terms in the model) as specified in the variance header line; these lines are always placed after any R structure definition lines,
•
variance parameter constraints are included if parameter constraints are to be imposed, see the !VCC c qualifier in Table 5.5 and Section 7.9 on constraints between and within variance structures.
A schematic outline of the variance model specification lines (variance header line, and R and G structure definition lines) is presented in Table 7.2 using the variance model of 4 for demonstration. Table 7.2: Schematic outline of variance model specification in ASReml
general syntax variance header line
model 4 [s [c [g]]]
1 2 1
------------------------------------------------------------R structure definition lines S1 C1 11 column AR1 0.3 C2 22 row AR1 0.3 .. . . . . Cc S2
C1 .. .
. . .
Cc
-
.. .
. . .
. . .
Ss
C1 .. .
. . .
Cc
-
-------------------------------------------------------------
7
Command file: Specifying the variance structures
128
Table 7.2: Schematic outline of variance model specification in ASReml
general syntax G structure definition lines
model 4 G1
repl 1 4 0 IDV 0.1
G2
-
.. .
. . .
Gg
-
Variance header line The variance header line is of the form [s [c [g]]] •
s and c relate to the R structures, g is the number of G structures,
•
the variance header line may be omitted if the default IID R structure is required, no G structures are being explicitly defined and there are no parameter constraints (see !VCC and examples 1 and 2a),
•
s is used to code the number of independent sections in the error term –
–
–
•
NIN Alliance Trial 1989 variety !A id . . . row 22 column 11 nin89aug.asd !skip 1 yield ∼ mu variety !r repl, !f mv 1 2 1 22 row AR1 0.3 11 column AR1 0.3 repl 1 repl 0 IDV 0.1
if s = 0, the default IID R structure is assumed and no R structure definition lines are required (as in examples 2b and 5), if s > 0, s R structure definitions are required, one for each of the s sections (as in examples 3a, 3b, 3c and 4), for the analysis of multi-section data s can be replaced by the name of a factor with the appropriate number of levels, one for each section,
c is the number of component variance models involved in the variance structure for the error term for each section; for example, 3a, 3b and 3c have column.row as the error term and the variance structure for column.row involves 2 variance models, the first for column and the second for row, –
c has a default value of 2 when s is not specified as zero,
7
Command file: Specifying the variance structures
•
See Table 7.3 See Section 7.7
129
g is the number of variance structures (G structures) that will be explicitly specified for the random terms in the model.
R and G structures are now discussed with reference to s, c and g. As already noted, each variance structure may involve several variance models which relate to the individual terms involved in the random effect or error. For example, a two factor interaction may have a variance model for each of the two factors involved in the interaction. Variance models are listed in Table 7.3. As indicated in the discussion of 2b, care must be taken with respect to scale parameters when combining variance models (see also Section 7.7).
R structure definition For each of the s sections there must be c R structure definitions. Each definition may take several lines. Each R structure definition specifies a variance model and has the form order [field model [initial values] [qualifiers] [additional initial values]] •
order is either the number of levels in the corresponding term or the name of a factor that has the same number of levels as the term, for example, 11 column AR1 0.5 is equivalent to column column AR1 0.5
NIN Alliance Trial 1989 variety !A . . . row 22 column 11 nin89aug.asd !skip 1 yield ∼ mu variety !r repl, !f mv 1 2 1 11 column AR1 0.3 22 row AR1 0.3 repl 1 repl 0 IDV 0.1
when column is a factor with 11 levels, •
field is the name of the data field (variate or factor) that corresponds to the term and therefore indexes the levels of the term; – –
–
ASReml uses this field to sort the units so they match the R structure, in the example the data will be sorted internally rows within columns for the analysis but the residuals will be printed in the .yht file in the original order (which is actually rows within columns in this case). Important It is assumed that the joint indexing of the components uniquely defines the experimental units, if field is a variable, it can be plot coordinates provided the plots are in a regular grid. Thus in this example
7
Command file: Specifying the variance structures
130
11 lat AR1 0.3 22 long AR1 0.3
–
•
is valid because lat gives column position and long gives row position, and the positions are on a regular grid. The autoregressive correlation values will still be on an plot index basis (1, 2, 3, . . . ), not on a distance basis (10m, 20m, 30m, . . . ), if the data is sorted appropriately for the order the models are specified, set field to 0,
model specifies the variance model for the term, for example, 22 row AR1 0.3 chooses a first order autoregressive model for the row error process, – – –
– •
all the variance models available in ASReml are listed in Table 7.3, these models have associated variance parameters, a error variance component (σe2 for the example, see Section 7.3) is automatically estimated for each section, the default model is ID,
initial values are initial or starting values for the variance parameters and must be supplied, for example, 22 row AR1 0.3 chooses an autoregressive model for the row error process (see Table 7.1) with a starting value of 0.3 for the row correlation,
•
qualifiers tell ASReml to modify the variance model in some way; the qualifiers are described in Table 7.4,
•
additional initial values are read from the following lines if there are not enough initial values on the model line. Each variance model has a certain number of parameters. If insufficient non zero values are found on the model line ASReml expects to find them on the following line(s), –
–
–
initial values of 0.0 will be ignored if they are on the model line but are accepted on subsequent lines, the notation n*v (for example, 5 * 0.1) is permitted on subsequent lines (but not the model line) when there are n repeats of a particular initial value v, only in a few specified cases is 0 permitted as an initial value of a non-zero parameter.
7
Command file: Specifying the variance structures
131
G structure header and definition lines There are g sets of G structure definition lines and each set is of the form model term d order [key model [additional initial order [key model [additional initial .. .
[initial values] [qualifier] values]] [initial values] [qualifier] values]]
order [key model [initial values] [qualifier] [additional initial values]]
NIN Alliance Trial 1989 variety !A id . . . row 22 column 11 nin89aug.asd !skip 1 yield ∼ mu variety !r repl, !f mv 1 2 1 22 row AR1 0.3 11 column AR1 0.3 repl 1 repl 0 IDV 0.1
•
model term is the term from the linear model to which the variance structure applies; the variance structure may cover additional terms in the linear model, see Section 7.8
•
d is the number of variance models and hence direct product matrices involved in the G structure; the following lines define the d variance models,
•
order is either the number of levels in the term or the name of a factor that has the same number of levels as the component,
•
key is usually zero but for power models (EXP, GAU,. . . ) provides the distance data needed to construct the model,
•
model is the ASReml variance model identifier/acronym selected for the term, – –
variance models are listed in Table 7.3, these models have associated variance parameters,
•
initial values are initial or starting values for the variance parameters, the values for initial values are as described above for R structure definition lines,
•
qualifier tells ASReml to modify the variance model in some way; the qualifiers are described in Table 7.4.
7
7.5
Command file: Specifying the variance structures
132
Variance model description
Important Revised 08
Table 7.3 presents the full range of variance models, that is, correlation, homogeneous variance and heterogeneous variance models available in ASReml. The table contains the model identifier, a brief description, its algebraic form and the number of parameters. The first section defines (BASE) correlation models and in the next section we show how to extend them to form variance models. The second section defines some models parameterized as variance/covariance matrices rather than as correlation matrices. The third section covers some special cases where the covariance structure is known except for the scale. Note that in many cases, the ’variance’ or scaling parameter will actually be a variance ratio (see page 126. This depends on how the R structure is defined. It is important to recognise whether it is a variance or a variance ratio when setting initial values. Table 7.3: Details of the variance models available in ASReml
base identifier
description
algebraic form
number of parameters† corr
homo’s variance
hetero’s variance
Cii = 1, Cij = 0, i 6= j
0
1
ω
Cii = 1, Ci+1,i = φ1
1
2
1+ω
2
3
2+ω
3
4
3+ω
Correlation models One-dimensional, equally spaced ID AR[1]
identity st
1 order autoregressive
Cij = φ1 Ci−1,j , i > j + 1 |φ1 | < 1
AR2
nd
2 order autoregressive
Cii = 1, Ci+1,i = φ1 /(1 − φ2 ) Cij = φ1 Ci−1,j + φ2 Ci−2,j , i > j + 1 |φ1 | < (1 − φ2 ), |φ2 | < 1
AR3 ASReml2
rd
3 order autoregressive
Cii = 1, Ω = 1 − φ2 − φ3 (φ1 + φ3 ), Ci+1,i = (φ1 + φ2 φ3 )/Ω,
Ci+2,i = (φ1 (φ1 + φ3 ) + φ2 (1 − φ2 ))/Ω, Cij = φ1 Ci−1,j + φ2 Ci−2,j + φ3 Ci−3,j , i > j + 2 |φ1 | < (1 − φ2 ), |φ2 | < 1, |φ3 | < 1
7
Command file: Specifying the variance structures
133
Table 7.3: Details of the variance models available in ASReml
base identifier
description
SAR
symmetric autoregressive
number of parameters†
algebraic form
Cii = 1,
corr
homo’s variance
hetero’s variance
1
2
1+ω
2
3
2+ω
1
2
1+ω
2
3
2+ω
2
3
2+ω
2
Ci+1,i = φ1 /(1 + φ1 /4) Cij = φ1 Ci−1,j − φ21 /4 Ci−2,j , i>j+1 |φ1 | < 1
SAR2 ASReml2
MA[1]
constrained autoregressive 3 used for competition
st
1 order moving average
as for AR3 using φ1 = γ1 + 2γ2 , φ2 = −γ2 (2γ1 + γ2 ), φ3 = γ1 γ22 , Cii = 1, 2
Ci+1,i = −θ1 /(1 + θ1 ) Cji = 0, j > i + 2 |θ1 | < 1
MA2
nd
2 order moving average
Cii = 1, Ci+1,i = −θ1 (1 − θ2 )/(1 + θ12 + θ22 ) Ci+2,i = −θ2 /(1 + θ12 + θ22 ) Cji = 0, j > i + 2 θ2 ± θ1 < 1 |θ1 | < 1, |θ2 | < 1
ARMA
autoregressive moving average
Cii = 1, Ci+1,i = (θ − φ)(1 − θφ)/(1 + θ2 − 2θφ) Cji = φCj−1,i , j > i + 1 |θ| < 1, |φ| < 1
CORU
uniform correlation
Cii = 1, Cij = φ, i 6= j
1
2
1+ω
CORB
banded correlation
Cii = 1
ω−1
ω
2ω − 1
Ci+j,i = φj , 1 ≤ j ≤ ω − 1 |φj | < 1
7
Command file: Specifying the variance structures
134
Table 7.3: Details of the variance models available in ASReml
base identifier
description
CORG
general correlation CORGH = US
number of parameters†
algebraic form
Cii = 1
corr
homo’s variance
hetero’s variance
ω(ω−1) 2
ω(ω−1) +1 2
ω(ω−1) 2
1
2
1+ω
1
2
1+ω
1
2
1+ω
1
2
1+ω
1
2
1+ω
1
2
1+ω
Cij = φij , i 6= j
+ω
|φij | < 1
One-dimensional unequally spaced EXP
exponential
Cii = 1 |xi −xj |
Cij = φ
, i 6= j
xi are coordinates 0 screen.srt sorts a summary of the results to identify the best fit. The best fit can then be added to the model and the process repeated. Assuming Marker35 was best, the revised job could be Marker screen Genotype * yield
11
Command file: Running the job
208
PhenData.txt !CYCLE 1:1000 !MBF mbf(Genotype) MLIB\Marker$I.csv !rename Marker$I !MBF mbf(Genotype) MLIB\Marker35.csv !rename MKR035 yld ~ mu !r MKR035 Marker$I
We have given Marker35 a new name because it is still also generated by the !CYCLE unless it is modified to read !CYCLE 1:34 36:1000 .
Order of Substitution The substitution order is ASSIGN, CYCLE, TP, command line arguments and finally the interactive prompt.
11.5
Performance issues The following subsections raise several issues which affect the performance of ASReml.
Multiple processors ASReml has not been configured for parallel processing. Performance is downgraded if it tries to use two processors simultaneously as it wastes time swapping between processors.
Slow processes The processing time is related to the size of the model, the complexity of the variance mmodel (in particular the number of parameters), the sparsity of the mixed model equations, the amount of data being processed. Typically, the first iteration take longer than other iterations. The extra work in the first iteration is to determine an optimum equation order for processing the model (see !EQORDER). The extra processes in the last iteration are optional. They include •
calculation of predicted values (see PREDICT statement,
•
calculation of denominator degrees of freedom (see !DDF),
11
Command file: Running the job
•
209
calculation of outlier statistics (see !OUTLIER).
If a job is being run a large number of times, signicicant gains in processing time can sometimes be made by reorganising the data (so reading of irrelevant data is avoided), use of !CONTINUE to reduce the number of iterations, and avoiding uncessessary output (see !SLNFORM, YHTFORM and !NOGRAPHICS.
Timing processes The elapsed time for the whole job can be calculated approximately by comparing the start time with the finish time. Timings of particular processes can be obtained by using the !DEBUG !LOGFILE qualifiers on the first line of the job. This requests the .asl file be created and hold some intermediate results, especially from data setup and the first iteration. Included in that information is timing information on each phase of the job.
12
Command file: Merging data files
Introduction Merge Syntax Examples
210
12
Command file: Merging data files
12.1
211
Introduction The MERGE directive, described in this chapter, is designed to combine information from two files into a third file with a range of qualifiers to accomodate various scenarios. It was developed with assistance from Chandrapal Kailasanathan to replace the !MERGE qualifier (see page 66) which had very limited functionality. The MERGE directive is placed BEFORE the data filename lines. It is an independent part of the ASReml job in the sense that none of the files are necessarily involved in the subsequent analyses performed by the job, and there may be multiple MERGE directives. Indeed, the job may just consist of a title line and MERGE directives. The !MERGE qualifier, on the other hand, combines information from two files into the internal data set which ASReml uses for analysis and does not save it to file. It has very limited in functionality. The files to be merged must conform to the following basic structure:
12.2
•
the data fields must be TAB, COMMA or SPACE separated,
•
there will be one heading line that names the columns in the file,
•
the names may not have embedded spaces,
•
the number of fields is determined from the number of names,
•
missing values are implied by adjacent commas in comma delimited files. Otherwise, they are indicated by NA, * or . as in normal ASReml files.
•
the merged file will be TAB separated if a .txt file, COMMA separated if a .csv file and SPACE separated otherwise.
Merge Syntax The basic merge command is MERGE file1 !WITH file2 !to newfile. Typically files to be merged will have common key fields. In the basic merge, (!KEY not specified) any fields having the same names are taken as the key fields and if the files have no fields in common, they are assumed to match on row number. Fields are referenced by name (case sensitive).
12
Command file: Merging data files
212
The full command is: MERGE file1 [ !KEY keyfields ] [ !KEEP ] [ !SKIP fields ] !WITH file2 [ !KEY keyfields ] [ !KEEP ] [ !NODUP ] [ !SKIP fields] !TO newfile [!CHECK ] [ !SORT ]. Check
output
field order
Warning: Fields in the merged file will be arranged with key fields followed by other fields from the primary file and then fields from the secondary file. Table 12.1: List of MERGE qualifiers
qualifier
action
!CHECK
requests ASReml confirm that fields having a common name have the same contents. Discrepancies are reported to the .asr file. If there are fields with common names which are not key fields, and !CHECK is omitted, the fields will be assumed different and both versions will be copied.
!KEY keyfields
names the fields which are to be used for matching records in the files. If the fields have the same name in both file headers, they need only be named in association with the primary input file. If the key fields are the only fields with common names, the !KEY qualifier may be omitted altogether. If key fields are not nominated and there are no common field names, the files are interleaved.
!KEEP
instructs ASReml to include in the merged file records from the input file which are not matched in the other input file. Missing values are inserted as the values from the other file. Otherwise, unmatched records are discarded. !KEEP may be specified with either or both input files.
!NODUP fields
Typically when a match occurs, the field contents from the second file are combined with the field contents of the first file to produce the merged file. The !NODUP qualifier, which may only be associated with the second file, causes the field contents for the nominated fields from the second file only be inserted once into the merged file. For example, assume we want to merge two files containing data from sheep. The first file has several records per animal containing fleece data from various years. The second file has one record per animal containing birth and weaning weights. Merging with !NODUP bwt wwt will copy these traits only once into the merged file.
!SKIP fields
is used to exclude fields from the merged file. It may be specified with either or both input files.
!SORT
instructs ASReml to produce the merged file sorted on the key fields. Otherwise the records are return in the order they appear in the primary file.
12
Command file: Merging data files
213
The merging algorithm is briefly as follows: The secondary file is read in, skip fields being omitted, and the records are sorted on the key fields. If sorted output is required, the primary file is also read in and sorted. The primary file (or its sorted form) is then processed line by line and the merged file is produced. Matching of key fields is on a string basis, not a value basis. If there are no key fields, the files are merged by interleaving. If there are multiple records with the same key, these are severally matched. That is if 3 lines of file 1 match 4 lines of file 2, the merged file will contain all 12 combinations.
12.3
Examples Key fields have different names !MERGE file1 !Key key1a key1b !WITH file2 !KEY key2a key2b !to newfile Key fields have common name and other fields are also duplicated !MERGE file1 !Key keya keyb !WITH file2 !to newfile !CHECK !MERGE file1 !Key key !KEEP !WITH file2 !to newfile will discard records from file2 that do not match records in file1 but all records in file1 are retained. Omitting fields from the merged file !MERGE file1 !Key key !skip s1a s1b !WITH file2 !skip s2a s2b !to newfile
Single insertion merging !MERGE adult.txt !Key ewe !KEEP !WITH birth.txt !KEEP !TO newfile !NODUP bwt.
13
Functions of variance components
Introduction VPREDICT directive PIN file syntax Linear combinations of components Heritability Correlation A more detailed example
214
13
Functions of variance components
13.1
215
Introduction ASReml includes a post-analysis procedure to F phenvar 1 + 2 # pheno var calculate functions of variance components. F genvar 1 * 4 # geno var H herit 4 3 # heritability Its intended use is when the variance components are either simple variances or are variances and covariances in an unstructured matrix. The functions covered are linear combinations of the variance components (for example, phenotypic variance), a ratio of two components (for example, heritabilities) and the correlation based on three components (for example, genetic correlation). The user must prepare a .pin file. A simple sample .pin file is shown in the ASReml code box above. The .pin file specifies the functions to be calculated. The .pin file can be formally created as a separate file and processed by running ASReml with the -P command line option specifying the .pin file as the input file. ASReml reads the model information from the .asr and .vvp files and calculates the requested functions. These are reported in the .pvc file. Alternatively, the .pin file may be processed by ASReml as the final stage of an analysis run, if the VPREDICT directive is included in the .as file.
13.2 ASReml3
VPREDICT: PIN file processing Processing of a .pin file is activated from within the .as file by including a VPREDICT directive. The VPREDICT line may appear anywhere in the .as file but it is recommended it be placed after the model line. It is recognised by the characters VPR in character positions 1:3 of a line. It is processed after the job (part/cycle) has finished. There are four forms of the VPREDICT directive. •
If the .pin file exists and has the same name as the jobname (including any suffix appended by using !RENAME), just specify the VPREDICT directive.
•
If the .pin file exists but has a different name to the jobname, specify the VPREDICT directive with the .pin file name as its argument.
•
If the .pin file does not exist or must be reformed, a name argument for the file is optional but the !DEFINE qualifier should be set. Then the lines of the .pin file should follow on the next lines, terminated by a blank line.
13
Functions of variance components
13.3
216
Syntax Functions of the variance components are specified in the .pin file in lines of the form letter label coefficients •
letter ( either F, H or R ) must occur in column 1 – – –
F is for linear combinations of variance components, H is for forming the ratio of two components, R is for forming the correlation based on three components,
•
label names the result,
•
coefficients is the list of coefficients for the linear function.
Linear combinations of components First ASReml extracts the variance compoF phenvar 1 + 2 # pheno var nents from the .asr file and their variance F genvar 1 * 4 # geno var H herit 4 3 # heritability matrix from the .vvp file. Each linear function formed by an F line is added to the list of components. Thus, the number of coefficients increases by one each line. We seek to calculate k + c0 v, cov (c0 v, v) and var (c0 v) where v is the vector of existing variance components, c is the vector of coefficients for the linear function and k is an optional offset which is usually omitted but would be 1 to represent the residual variance in a probit analysis and 3.289 to represent the residual variance in a logit analysis. The general form of the directive is F label a + b ∗ cb + c + d + m ∗ k where a, b, c and d are subscripts to existing components va , vb , vc and vd and cb is a multiplier for vb . m is a number bigger than the current length of v to flag the special case of adding the offset k. Where matrices are to be combined the form F label a:b * k + c:d can be used, as in the Coopworth data example, see page 349. Assuming that the .pin file in the ASReml code box corresponds to a simple sire model and that variance component 1 is the sire variance and variance component 2 is the residual variance, then F phenvar
1 + 2
13
Functions of variance components
217
gives a third component which is the sum of the variance components, that is, the phenotypic variance, and F genvar
1 * 4
gives a fourth component which is the sire variance component multiplied by 4, that is, the genotypic variance.
Heritability Heritabilities are requested by lines in the .pin file beginning with an H. The specific form of the directive in this case is
F phenvar 1 + 2 # pheno var F genvar 1 * 4 # geno var H herit 4 3 # heritability
H label n d This calculates σn2 /σd2 and se[σn2 /σd2 ] where n and d are integers pointing to components vn and vd that are to be used as the numerator and denominator respectively in the heritability calculation. Var( σσn2 ) = ( σσn2 )2 ( Varσ4(σn ) + 2
d
2
d
2
n
Var(σd2 ) σd4
−
2 ,σ 2 ) 2Cov(σn d ) 2 2 σn σd
In the example H herit 4 3 calculates the heritability by calculating component 4 (from second line of .pin) / component 3 (from first line of .pin), that is, genetic variance / phenotypic variance.
Correlation Correlations are requested by lines in the .pin file beginning with an R. The specific form of the directive is R label a ab b
F phenvar 1:3 + 4:6 R phencorr 7 8 9 R gencorr 4:6
q
This calculates the correlation r = σab / σa2 σb2 and the associated standard error. a, b and ab are integers indicating the position of the components to be used. Alternatively, R label a:n
q
calculates the correlation r = σab / σa2 σb2 for all correlations in the lower triangular row-wise matrix represented by components a to n and the associated
13
Functions of variance components
218
standard errors. ¡
¢
¡
¢
var σa2 var σb2 var (σab ) var (r) = r [ + + 2 2 2 σab 4σa2 4σb2 2
¡
¢
¡
¢
¡
¢
2cov σa2 , σb2 2cov σa2 , σab 2cov σab , σb2 + − − ] 2 2σa2 σab 4σa2 σb 2σab σb2 In the example R phencorr 7 8 9 calculates the phenotypic covariance by calculating √ component 8 / component 7 × component 9 where components 7, 8 and 9 are created with the first line of the .pin file, and R gencorr 4:6 calculates the genotypic covariance by calculating √ component 5 / component 4 × component 6 where components 4, 5 and 6 are variance components from the analysis.
A more detailed example The following example is a little more complicated and has the .pin file coding inserted in the job file for a bivariate sire model in bsiremod.as shown in the code box to the right. Numbering the parameters reported in bsiremod.asr (and bsiremod.vvp) 1 2 3 4 5 6
error variance for ywt error covariance for ywt and fat error variance for fat sire variance component for ywt sire covariance for ywt and fat sire variance for fat
then F phenvar 1:3 + 4:6 creates new components 7 = 1+4, 8 = 2+5
Bivariate sire model sire !I ywt fat bsiremod.asd ywt fat ∼ Trait !r Trait.sire !PIN !define F phenvar 1:3 + 4:6 F addvar 4:6 * 4 H heritA 10 7 H heritB 12 9 R phencorr 7 8 9 R gencor 4:6 1 2 1 0 # ASReml will count units Trait 0 US 3*0 Trait.sire 2 Trait 0 US 3*0 sire
13
Functions of variance components
219
and 9 = 3+6, F addvar 4:6 * 4 creates new components 10 = 4 × 4, 11 = 5 × 4 and 12 = 6 × 4, H heritA 10 7 forms 10 / 7 to give the heritability for ywt, H heritB 12 9 forms 12 / 9 to give the heritability for fat, R phencorr 7 8 9 √ forms 8 / 7 × 9, that is, the phenotypic correlation between ywt and fat, R gencorr 4:6 √ forms 5 / 4×6, that is, the genetic correlation between ywt and fat. The resulting .pvc file contains: - - - ywt fat The first 6 lines
1
Residual
26.2191
are copied from
2
Residual
2.85058
the .asr file
3
Residual
1.71554
4
Tr.sire
16.5244
5
Tr.sire
1.14335
6
Tr.sire
0.132734
7 phenvar
1
42.75
6.297
8 phenvar
2
3.995
0.6761
9 phenvar
3
1.848
0.1183
66.10
24.58
10 addvar
4
11 addvar
5
4.577
2.354
12 addvar
6
0.5314
0.2831
h2ywt
= addvar
10/phenvar
7=
1.5465
0.3574
h2fat
= addvar
12/phenvar
9=
0.2875
0.1430
0.4495
0.0483
0.7722
0.1537
phencorr gencor
2
= phenvar /SQR[phenvar *phenvar ]= 1 = Tr.si
5/SQR[Tr.si
4*Tr.si
6]=
14
Description of output files
Introduction An example Key output files The .asr file The .sln file The .yht file
Other ASReml output files The The The The The The The The The
.aov .res .vrb .vvp .rsv .dpr .pvc .pvs .tab
file file file file file file file file file
ASReml output objects and where to find them
220
14
Description of output files
14.1
221
Introduction With each ASReml run a number of output files are produced. ASReml generates the output files by appending various filename extensions to basename. A brief description of the filename extensions is presented in Table 14.1. Table 14.1: Summary of ASReml output files
file
description
Key output files .asr
contains a summary of the data and analysis results.
.pvc
contains the report produced with the P option.
.pvs
contains predictions formed by the predict directive.
.res
contains information from using the pol(), spl() and fac() functions, the iteration sequence for the variance components and some statistics derived from the residuals. contains the final parameter values for reading back if the !CONTINUE qualifier is invoked, see Table 5.4.
.rsv .sln
contains the estimates of the fixed and random effects and their corresponding standard errors.
.tab
contains tables formed by the tabulate directive.
.yht
contains the predicted values, residuals and diagonal elements of the hat matrix for each data point.
Other output files .asl
contains a progress log and error messages if the L command line option is specified.
.aov
contains details of the ANOVA calculations.
.apj
is an ASReml project file created by ASReml-W .
.ask
holds the !RENAME !ARG argument from the most recent run so that ASReml can retrieve restart values from the most recent run when !CONTINUE is specified but there is no particular .rsv file for the current !ARG argument.
.asp
contains transformed data, see !PRINT in Table 5.2.
.ass
contains the data summary created by the !SUM qualifier (see page 71).
.dbr/.dpr/.spr
contains the data and residuals in a binary form for further analysis (see !RESIDUALS, Table 5.5).
.veo
holds the equation order to speed up re-running big jobs when the model is unchanged. This binary file is of no use to the user.
14
Description of output files
222
Table 14.1: Summary of ASReml output files
file
description
.vll
holds factor level names when data/residuals are saved in binary form. See !SAVE on page 87.
.vrb
contains the estimates of the fixed effects and their variance.
.vvp
contains the approximate variances of the variance parameters. It is designed to be read back with the P option for calculating functions of the variance parameters.
.was
basename.was is open while ASReml is running and deleted when it finishes. It will normally be invisible to the user unless the job crashes. It is used by ASReml-W to tell when the job finishes.
An ASReml run generates many files and the .sln and .yht files, in particular, are often quite large and could fill up your disk space. You should therefore regularly tidy your working directories, maybe just keeping the .as, .asr and .pvs files.
14.2
An example In this chapter the ASReml output files are discussed with reference to a two-dimensional separable autoregressive spatial analysis of the NIN field trial data, see model 3b on page 123 of Chapter 7 for details. The ASReml command file for this analysis is presented to the right. Recall that this model specifies a separable autoregressive correlation structure for residual or plot errors that is the direct product of an autoregressive correlation matrix of order 22 for rows and an autoregressive correlation matrix of order 11 for columns. In this case 0.5 is the starting correlation for both columns and rows.
NIN Alliance Trial 1989 variety !A id raw repl 4 nloc yield lat long row 22 column 11 nin89a.asd !skip 1 !DISPLAY 15 yield ∼ mu variety !f mv predict variety 1 2 0 row row AR1 0.5 column column AR1 0.5
14
Description of output files
14.3
223
Key output files The key ASReml output files are the .asr, .sln and .yht files.
The .asr file This file contains •
a general announcements box (outlined in asterisks) containing current messages,
•
a summary of the data to for the user to confirm the data file has been interpreted correctly and to review the basic structure of the data and validate the specification of the model,
•
the iteration sequence of REML loglikelihood values to check convergence,
•
a summary of the variance parameters: – –
–
– –
Revised 08
The Gamma column reports the actual parameter fitted, the Component column reports the gamma converted to a variance scale if appropriate, Comp/SE is the ratio of the component relative to the square root of the diagonal element of the inverse of the average information matrix Warning Comp/SE should not be used for formal testing, The % shows the percentage change in the parameter at the last iteration, use the .pin file described Chapter 13 to calculate meaningful functions of the variance components,
•
an table of Wald F statistics for testing fixed effects. (Section 6.11). The table contains the numerator degrees of freedom for the terms and ’incremental’ F-statistics for approximate testing of effects. It may also contain denominator degrees of freedon, a ’conditional’ Wald F statistic and a significance probability.
•
estimated effects, their standard errors and t values for equations in the DENSE portion of the SSP matrix are reported if !BRIEF -1 is invoked; the T-prev column tests difference between successive coefficients in the same factor.
The reported log-likelihood value may be positive or negative and typically excludes some constants from its calculation. It is sometimes reported relative to an offset (when its magnitude exceeds 10000); any offset is reported in the .asr file. Twice the difference in the likelihoods for two models is commonly used as
14
Description of output files
224
the basis for a likelihood ratio test (see page 17). This is not valid for generalised linear mixed models as the reported LogL does not include components relating to the reweighting. Furthermore, it is not appropriate if the fixed effects in the model have changed. In particular, if fixed effects are fitted in the sparse equations, the order of fitting may change with a change in the fitted variance structure resulting in non comparable likelihoods even though the fixed terms in the model have not changed. The iteration sequence terminates when the maximum iterations (see !MAXIT on page 70) has been reached or successive LogL values are less than 0.002i apart. The following is a copy of nin89a.asr. version & title date, workspace
data summary
ASReml 3.01d [01 Apr 2008] NIN alliance trial 1989 Build: e [01 Apr 2008] 32 bit 10 Apr 2008 16:47:40.140 32 Mbyte Windows nin89a Licensed to: NSW Primary Industries permanent *********************************************************** * Contact
[email protected] for licensing and support * *
[email protected] * ***************************************************** ARG * Folder: C:\data\asr3\ug3\manex variety !A QUALIFIERS: !SKIP 1 !DISPLAY 15 QUALIFIER: !DOPART 1 is active Reading nin89aug.asd FREE FORMAT skipping 1 lines Univariate analysis of yield Summary of 242 records retained of 242 read Model term Size #miss #zero MinNon0 Mean 1 variety 56 0 0 1 26.4545 2 id 0 0 1.000 26.45 3 pid 18 0 1101. 2628. 4 raw 18 0 21.00 510.5 5 repl 4 0 0 1 2.4132 6 nloc 0 0 4.000 4.000 7 yield Variate 18 0 1.050 25.53 8 lat 0 0 4.300 25.80 9 long 0 0 1.200 13.80 10 row 22 0 0 1 11.5000 11 column 11 0 0 1 6.0000 12 mu 1 13 mv_estimates 18 22 AR=AutoReg [ 5: 5] 0.5000 11 AR=AutoReg [ 6: 6] 0.5000
MaxNon0 56 56.00 4156. 840.0 4 4.000 42.00 47.30 26.40 22 11
StndDevn 17.18 1121. 149.0 0.000 7.450 13.63 7.629
14
Description of output files
iterations
225
Forming 75 equations: 57 dense. Initial updates will be shrunk by factor Notice: 1 singularities detected in 1 LogL=-401.827 S2= 42.467 2 LogL=-400.780 S2= 43.301 3 LogL=-399.807 S2= 45.066 4 LogL=-399.353 S2= 47.745 5 LogL=-399.326 S2= 48.466 6 LogL=-399.324 S2= 48.649 7 LogL=-399.324 S2= 48.696 8 LogL=-399.324 S2= 48.708 Final parameter values
0.316 design matrix. 168 df 1.000 0.5000 0.5000 168 df 1.000 0.5388 0.4876 168 df 1.000 0.5895 0.4698 168 df 1.000 0.6395 0.4489 168 df 1.000 0.6514 0.4409 168 df 1.000 0.6544 0.4384 168 df 1.000 0.6552 0.4377 168 df 1.000 0.6554 0.4375 1.0000 0.65550 0.43748
- - - Results from analysis of yield - - -
parameter estimates
testing fixed effects
Source Variance Residual Residual
Model terms 242 168 AR=AutoR 22 AR=AutoR 11
Gamma 1.00000 0.655505 0.437483
Component 48.7085 0.655505 0.437483
Comp/SE 6.81 11.63 5.43
% 0 0 0
C P U U
Wald F statistics Source of Variation NumDF DenDF F_inc Prob 12 mu 1 25.0 331.85 >>> >>>>. These lines also mark progress through the iteration.
The .dpr file The .dpr file contains the data and residuals from the analysis in double precision binary form. The file is produced when the !RES qualifier (Table 4.3) is invoked. The file could be renamed with filename extension .dbl and used for input to another run of ASReml. Alternatively, it could be used by another Fortran program or package. Factors will have level codes if they were coded using !A or !I. All the data from the run plus an extra column of residuals is in the file. Records omitted from the analysis are omitted from the file.
The .pvc file The .pvc file contains functions of the variance components produced by running a .pin file on the results of an ASReml run as described in Chapter 13. The .pin and .pvc files for a half-sib analysis of the Coopworth data are presented in Section 16.11.
14
Description of output files
233
The .pvs file The .pvs file contains the predicted values formed when a predict statement is included in the job. Below is an edited version of nin89a.pvs. See Section 3.6 for the .pvs file for the simple RCB analysis of the NIN data considered in that chapter. title line
nin alliance trial
14 Jul 2005 12:41:18 nin89a
Ecode is E for Estimable, * for Not Estimable Warning: mv_estimates
is ignored for prediction
---- ---- ---- ---- ---- ---- ---Predicted values of yield
predicted variety means
.. .
variety LANCER BRULE REDLAND CODY ARAPAHOE NE83404 NE83406 NE83407 CENTURA SCOUT66
1 ---- ---- ---- ---- ---- ---- ----
Predicted_Value Standard_Error Ecode 24.0894 2.4645 E 27.0728 2.4944 E 28.7954 2.5064 E 23.7728 2.4970 E 27.0431 2.4417 E 25.7197 2.4424 E 25.3797 2.5028 E 24.3982 2.6882 E 26.3532 2.4763 E 29.1743 2.4361 E
NE87615 25.1238 2.4434 E NE87619 30.0267 2.4666 E NE87627 19.7126 2.4833 E SED: Overall Standard Error of Difference 2.925
SED summary
The .res file The .res file contains miscellaneous supplementary information including •
a list of unique values of x formed by using the fac() model term,
•
a list of unique (x, y) combinations formed by using the fac(x,y) model term,
•
legandre polynomials produced by leg() model term,
•
orthogonal polynomials produced by pol() model term,
•
the design matrix formed for the spl() model term,
14
Description of output files
234
•
predicted values of the curvature component of cubic smoothing splines,
•
the empirical variance-covariance matrix based on the BLUPs when a Σ ⊗ I or I ⊗ Σ structure is used; this may be used to obtain starting values for another run of ASReml,
•
a table showing the variance components for each iteration,
•
a figure and table showing the variance partitioning for any XFA structures fitted,
•
some statistics derived from the residuals from two-dimensional data (multivariate, repeated measures or spatial) –
–
–
–
the residuals from a spatial analysis will have the units part added to them (defined as the combined residual) unless the data records were sorted (within ASReml ) in which case the units and the correlated residuals are in different orders (data file order and field order respectively), the residuals are printed in the .yht file but the statistics in the .res file are calculated from the combined residual, the Covariance/Variance/Correlation (C/V/C) matrix calculated directly from the residuals; it contains the covariance below the diagonals, the variances on the diagonal and the correlations above the diagonal: The ’FITTED’ matrix is the same as is reported in the .asr file and if the Logl has converged is the one you would report; the ’BLUPS’ matrix is clculated from the BLUPS and is provided so it can be used as starting values when a simple initial model has been used and you are wanting to attempt to fit a full unstructured matrix; the rescaled has the variance from the FITTED and the covariance from the BLUPS and might we more suitable as an initial matrix if the variances have been estimated. The FITTED and RESCALEd matrices should not be reported. relevant portions of the estimated variance matrix for each term for which an R structure or a G structure has been associated,
•
a variogram and spatial correlations for spatial analysis; the spatial correlations are based on distance between data points (see Gilmour et al., 1997),
•
the slope of the log(absolute residual) on log(predicted value) for assessing possible mean-variance relationships and the location of large residuals. For example, SLOPES FOR LOG(ABS(RES)) ON LOG(PV) for section 1 0.99 2.01 4.34 produced from a trivariate analysis reports the slopes. A slope of b suggests
14
Description of output files
235
that y 1−b might have less mean variance relationship. If there is no mean variance relation, a slope of zero is expected. A slope of 12 suggests a SQRT transformation might resolve the dependence; a slope of 1 means a LOG transformation might be appropriate. So, for the 3 traits, log(y1 ), y2−1 and y3−3 are indicated. This diagnostic strategy works better when based on grouped data regressing log(standard deviation) on log(mean). Also, STND RES 16 -2.35 6.58 5.64 indicates that for the 16th data record, the residuals are -2.35, 6.58 and 5.64 times the respective standard deviations. The standard deviation used in this test is calculated directly from the residuals rather than from the analysis. They are intended to flag the records with large residuals rather than to precisely quantify their relative size. They are not studentised residuals and are generally not relevant when the user has fitted heterogeneous variances. This is nin89a.res. Convergence sequence of variance parameters Iteration 1 2 3 4 5 6 LogL -401.827 -400.780 -399.807 -399.353 -399.326 -399.324 Change % 59 80 83 21 5 1 Adjusted 0 0 0 0 0 0 StepSz 0.316 0.562 1.000 1.000 1.000 1.000 5 0.500000 0.538787 0.589519 0.639457 0.651397 0.654445 0.5 6 0.500000 0.487564 0.469768 0.448895 0.440861 0.438406 -0.6 Plot of Residuals [-24.8730 15.9145] vs Fitted values [ 16.7724 -----------------------------1-----------------------------. 1 . . 1 . . 1 1 1 . 1 12 2 1211 1 21 1 1 1 . 1 112 15 1 311 121 1 . . 1 1 312 111 221 3 . . 1 1 1 4 1 4 1 22121 411 2 1 2 . 2 1 1 11 1112 23 11 1 2 1 . . 1 2 1 21 2 1213 1 13 2 11 . -----------1--1------1--2--1-61-212--3---------------------. 1 1 11 1 11 41 2 12 1 . . 1 1 1 11 . . 1 3 2 . 1 1 1 . . 1 1 1 . . 11 1 1 . 111 1 2 . . . . 1 1 . . 1 1 . 1 .
35.9355] RvE
14
Description of output files
236
. 1 . . 1 2 1 . ---------------------------1----1---1---1------------------SLOPES FOR LOG(ABS(RES)) on LOG(PV) for Section 1 0.15 SLOPES FOR LOG(SDi) on LOG(PVBari) for Section 1 1.37 * * * ** *** * ** *** * ** *** *** *** * ********** ***** ****************** * * * * ** ****************** * ** ** ** ** ***** ** ************************ ** * Min Mean Max -24.873
0.27954
15.915
Spatial diagnostic statistics of Residuals Residual Plot and Autocorrelations [se 0.077] | ++xxx+X| |--- - O+ + x +x+x>+X | |o - -- + +++xxx++X++| |+ + + +x- +xxx+++| | o -- ++ +- xx+xHxx| |-+xxx+xXx +++x xX ++x| |-++ o- +XxxXXx-xXX +++| |ooL