Published October 9, 2015
Technical note: Acceleration of sparse operations for average-information REML analyses with supernodal methods and sparse-storage refinements1,2

Y. Masuda,*†3 I. Aguilar,‡ S. Tsuruta,* and I. Misztal*

*Department of Animal and Dairy Science, University of Georgia, Athens 30602; †Department of Life Science and Agriculture, Obihiro University of Agriculture and Veterinary Medicine, Obihiro 080-8555, Japan; and ‡Instituto Nacional de Investigación Agropecuaria, Canelones 90200, Uruguay
ABSTRACT: The objective of this study was to remove bottlenecks generally found in a computer program for average-information REML. The refinements included improvements to setting up the mixed-model equations in a hash table with a faster hash function as sparse-matrix storage, changes to the sparse structures used in the calculation of traces, and replacement of a sparse matrix package based on traditional methods (FSPAK) with a new package based on supernodal methods (YAMS); the latter package quickly processes sparse matrices containing large, dense blocks. Comparisons included 23 models with data sets from broiler, swine, beef, and dairy cattle. Models included single-trait, multiple-trait, maternal, and random regression models with phenotypic data; selected models used genomic information in a single-step approach. Setting up the mixed-model equations was completed without abnormal termination in all analyses. Trace calculations were accelerated with the hash format, especially for models with a genomic relationship matrix, with a maximum speedup of 67 times. Computations with YAMS were, on average, more than 10 times faster than with FSPAK and had greater advantages for large data sets and more complicated models, including multiple traits, random regressions, and genomic effects. These refinements can be applied to general average-information REML programs.
Key words: average-information REML, hash format, sparse matrix, sparse packages

© 2015 American Society of Animal Science. All rights reserved.
J. Anim. Sci. 2015.93:4670–4674
doi:10.2527/jas2015-9395
Received June 8, 2015. Accepted August 7, 2015.

1We acknowledge the work by François Guillaume in programming a hash function. We greatly appreciate the work of the two anonymous reviewers.
2The AIREMLF90 program with the sparse package YAMS, along with a manual, is available at http://nce.ads.uga.edu. The YAMS package is available on request for academic or noncommercial purposes by contacting the corresponding author.
3Corresponding author: [email protected]

INTRODUCTION

Average-information REML (Gilmour et al., 1995) is the most popular method for variance component estimation in animal breeding. This method has been implemented as the AIREMLF90 program in the BLUPF90 package (Misztal et al., 2002), in ASREML (Gilmour et al., 2009), as DMUAI in the DMU package (Madsen et al., 2010), and in Wombat (Meyer, 2006). Compared with Bayesian algorithms, average-information REML offers several benefits, including fast operations with medium-sized models (Misztal, 2008).

Efficient computations in average-information REML rely on sparse matrix methods because the left-hand side (LHS) of the mixed-model equations (MME) is essentially sparse (Misztal and Pérez-Enciso, 1993). For example, in AIREMLF90, sparse matrix manipulations are performed with the SPARSEM package (http://nce.ads.uga.edu/wiki/doku.php), with numerical computations by FSPAK (Pérez-Enciso et al., 1994). Although current hardware is capable of creating the LHS from large data sets and complicated models, the current software does not process the matrix efficiently, especially when it includes a dense genomic relationship matrix (GRM; VanRaden, 2008). Supernodal methods are expected to process the denser LHS efficiently because these methods specifically process the dense blocks within a sparse matrix (Masuda et al., 2014). The YAMS package implements the supernodal methods using optimized libraries on multiple processing cores.
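To illustrate the dense-block idea, the following is a minimal NumPy sketch of one step of a blocked (supernodal-style) Cholesky factorization: the pivot block of a supernode is factorized densely, the panel below it is obtained with a triangular solve, and the trailing submatrix is updated with a single matrix-matrix product rather than many scalar updates. This is only a conceptual illustration; the matrix, block size, and variable names are arbitrary assumptions, and the sketch is not taken from the YAMS code, which is written in Fortran and relies on optimized BLAS and LAPACK.

```python
# Minimal sketch of a blocked (supernodal-style) Cholesky step.
import numpy as np

rng = np.random.default_rng(1)

# Build a small symmetric positive-definite matrix and treat its leading
# nb columns as one "supernode" (a set of columns sharing a dense block).
n, nb = 8, 3
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)

A11 = A[:nb, :nb]          # pivot block of the supernode
A21 = A[nb:, :nb]          # panel below the pivot block
A22 = A[nb:, nb:].copy()   # trailing submatrix to be updated

# Dense Cholesky of the pivot block, a triangular solve for the panel, and
# one matrix-matrix (BLAS level 3) Schur-complement update, instead of many
# scalar element-by-element updates as in traditional sparse codes.
L11 = np.linalg.cholesky(A11)
L21 = np.linalg.solve(L11, A21.T).T    # L21 = A21 * L11^{-T}
A22 -= L21 @ L21.T                     # single dense rank-nb update

# The remaining factorization proceeds on the updated trailing block.
L22 = np.linalg.cholesky(A22)
L = np.block([[L11, np.zeros((nb, n - nb))], [L21, L22]])

# Check against a full dense Cholesky of A.
print(np.allclose(L, np.linalg.cholesky(A)))   # True
```

Because the update is one dense operation per block, it can be delegated to optimized, multithreaded BLAS, which is why supernodal methods are efficient for an LHS containing large, dense blocks.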
In addition to the sparse package, the storage format for a sparse matrix is a serious bottleneck in setting up the MME and in calculating the traces in the REML equations, especially for genomic models.

The objective of this study was to remove bottlenecks generally found in a computer program for average-information REML. We demonstrated that supernodal methods are more advantageous than traditional methods when many covariance components or a GRM are included in the LHS. We also identified and removed the bottlenecks caused by the sparse storage format.

MATERIALS AND METHODS

Animal Care and Use Committee approval was not obtained for this study because no animals were used.

Modifications

In this study, as an example, AIREMLF90 was modified to incorporate the YAMS package and to refine the storage format. Its performance was examined with various models and data sets. In AIREMLF90, slow computing and abnormal terminations had been identified during the setup of the MME, the computation of the traces, and the sparse operations. The following modifications were made.

First, the LHS was stored in a hash format using a faster hash function based on 8-byte integer operations. The hash matrix was then converted to the IJA format, formally known as the compressed sparse row format (Barrett et al., 1994). The hash format is flexible, allows new nonzero elements to be added to a matrix, and is suitable for direct access to a specific element, whereas the IJA format is memory efficient and suitable for sequential access (Misztal, 2014). The conversion was performed just before the call of FSPAK or YAMS because the packages accept only the IJA format as an input matrix. The input matrix was finally overwritten with the selected elements of the inverse (sparse inverse) of the input matrix.
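The hash storage and the conversion to the IJA format can be sketched as follows. This is a minimal Python illustration under the assumption that a row-column pair is packed into a single 8-byte integer key; the actual AIREMLF90 and SPARSEM code is Fortran, and its hash function and data layout differ, so the function names and toy contributions below are purely illustrative.

```python
# Minimal sketch: hash-table accumulation of the LHS and conversion to IJA/CSR.

def key(i, j):
    # Pack (row, column) into one 64-bit integer so that hashing and
    # comparisons operate on a single 8-byte value.
    return (i << 32) | j

def add(hash_lhs, i, j, value):
    # Accumulate a contribution to element (i, j); the hash format makes it
    # cheap to add new nonzeros and to access a specific element directly.
    k = key(i, j)
    hash_lhs[k] = hash_lhs.get(k, 0.0) + value

def hash_to_ija(hash_lhs, n):
    # Convert the hash storage to IJA/CSR: row pointers ia, column indices ja,
    # and values a, sorted within each row for sequential access.
    entries = sorted(hash_lhs.items())      # sorted by (row, column)
    ia = [0] * (n + 1)
    ja, a = [], []
    for k, v in entries:
        i, j = k >> 32, k & 0xFFFFFFFF
        ia[i + 1] += 1
        ja.append(j)
        a.append(v)
    for i in range(n):                      # prefix sums give row pointers
        ia[i + 1] += ia[i]
    return ia, ja, a

# Toy usage: accumulate a few MME contributions and convert.
lhs = {}
for i, j, v in [(0, 0, 2.0), (1, 1, 2.0), (0, 1, -1.0), (0, 0, 1.0)]:
    add(lhs, i, j, v)
ia, ja, a = hash_to_ija(lhs, n=2)
print(ia, ja, a)    # [0, 2, 3] [0, 1, 1] [3.0, -1.0, 2.0]
```

In practice only one triangle of the symmetric LHS needs to be stored, which halves the memory used by both formats.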
Second, before the calculation of the traces, the sparse inverse in the IJA format was converted back to the hash format, which had already been allocated during the setup of the MME. For a regular animal model, the trace requires the product of the sparse inverse of the LHS and the inverse of the numerator relationship matrix (NRM); for a single-step genomic BLUP (ssGBLUP) model, the latter is replaced with the inverse of a combined matrix including the NRM and the GRM (Jensen et al., 1996; Aguilar et al., 2010). The hash format accelerates the search in a sparse matrix with many nonzero elements.
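A sketch of this trace calculation with hash-based lookups is given below. Only the elements of the sparse inverse that coincide with the nonzero pattern of the relationship-matrix inverse contribute to terms such as tr(C^uu A^-1) (Jensen et al., 1996), and the hash format gives direct access to exactly those elements. The key packing and the 2 x 2 toy matrices are illustrative assumptions, not values or code from the study.

```python
# Minimal sketch: trace of the product of the sparse inverse (hash format)
# and the inverse relationship matrix (triplets), both symmetric.

def key(i, j):
    return (i << 32) | j

def trace_product(cuu_hash, ainv_triplets):
    # tr(C^uu * A^-1) = sum over nonzeros (i, j) of A^-1 of
    # C^uu[i, j] * A^-1[i, j], because both matrices are symmetric.
    # ainv_triplets holds only the upper triangle of A^-1 as (i, j, value),
    # so off-diagonal terms are counted twice.
    tr = 0.0
    for i, j, v in ainv_triplets:
        w = 1.0 if i == j else 2.0
        tr += w * v * cuu_hash.get(key(i, j), 0.0)
    return tr

# Toy usage with 2 x 2 symmetric matrices (upper triangles only).
cuu = {key(0, 0): 0.5, key(0, 1): 0.1, key(1, 1): 0.4}   # sparse inverse block
ainv = [(0, 0, 2.0), (0, 1, -1.0), (1, 1, 2.0)]          # A^-1 triplets
print(trace_product(cuu, ainv))   # 0.5*2 + 2*0.1*(-1) + 0.4*2 = 1.6
```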
Third, the existing sparse matrix package, FSPAK, was replaced with YAMS. This last modification was programmed as an option so that either FSPAK or YAMS could be used for comparisons.

Data Sets and Models

Testing involved 23 animal models as described in Table 1 (see Table S1 in the supplementary material for details). The models originated from research on broilers, swine, beef cattle, and dairy cattle at the University of Georgia. Models 1 through 4 were from commercial broiler flocks. Models 5 through 8 were similar to models 1 through 4 but used ssGBLUP with a GRM (Aguilar et al., 2010) created from 39,102 SNP. Models 9 and 10 were for BW of piglets and sows, respectively. Models 11 and 12 were for growth data in beef cattle from the study by Iwaisaki et al. (2005). Models 13 through 15 were from research on genotype by region interaction in beef cattle (Williams et al., 2012). Models 16 through 18 arose from studies of test-day models with random regressions in dairy cattle; third-order Legendre polynomials were applied to both additive genetic and permanent environmental effects for each trait. Models 19 through 22 were single- and multiple-trait models applied to subsets of the national database of final score and linear-type traits in Holsteins (Tsuruta et al., 2002). Model 23 was an ssGBLUP version of model 19.

Computations

For all models, the wall-clock time was split into several parts: preparation (setting up the MME), ordering (to reduce working memory in subsequent operations), symbolic factorization (setting up the data structures), numerical factorization, sparse inversion, and the remaining operations, including the calculation of traces and quadratic forms (Misztal, 2014). Programs were compiled with the Intel Fortran Compiler 14.0 (Intel Corp., Santa Clara, CA) and used multithreaded versions of BLAS and LAPACK from the Math Kernel Library (Intel Corp.). All analyses were performed on a computer running Linux (x86_64) with an Intel Xeon E5-2689 (2.9 GHz) processor with 16 cores, all of which could potentially be used.

RESULTS AND DISCUSSION

The refined hash format solved the crash problems in setting up the MME. The new hash function reduced the wall-clock time for setting up the MME in all models.
Table 1. Descriptions of data sets tested in our study and elapsed wall-clock time and speedup in the second round using the FSPAK and YAMS packages in AIREMLF90

Model  Animal    Description1  Traits  FIXED2   PED2     GENO2   OLDM2   FSPAK4, s  YAMS, s  Speedup3
  1    Broiler   Animal           1       470  213,297        0     848          3        2      1.4
  2    Broiler   Animal           2       940  213,297        0   1,635         16        6      2.4
  3    Broiler   Animal           3     1,410  213,297        0   2,547         43       14      3.0
  4    Broiler   Animal           4     1,880  213,297        0   3,394         96       30      3.2
  5    Broiler   Genomic          1       470  213,297   15,723  15,740      2,808      199     14.1
  6    Broiler   Genomic          2       940  213,297   15,723  31,959         NA      595
  7    Broiler   Genomic          3     1,410  213,297   15,723  47,938         NA    1,746
  8    Broiler   Genomic          4     1,880  213,297   15,723  63,917         NA    3,381
  9    Pig       Maternal         1        84  109,113        0   1,276          5        3      1.6
 10    Pig       Spline-RRM       1        21  282,695        0   2,790         46       10      4.7
 11    Gelbvieh  Maternal         3     8,484   46,923        0   9,399        856       72     11.9
 12    Gelbvieh  Spline-RRM       1     8,478   46,923        0   9,156        856       69     12.4
 13    Angus     Animal           1    22,646  322,451        0   8,987        554       31     17.7
 14    Angus     Animal           2    45,292  322,451        0  26,628     17,560      829     21.2
 15    Angus     Animal           3    67,938  322,451        0  51,138         NA    4,174
 16    Holstein  RR-TDM           1     7,851   55,063        0   8,568        515       32     16.0
 17    Holstein  RR-TDM           2    15,702   55,063        0  17,159      4,262      209     20.4
 18    Holstein  RR-TDM           3    23,553   55,063        0  25,738     14,491      818     17.7
 19    Holstein  Animal           1     1,873  100,775        0   3,791         49        5      9.9
 20    Holstein  Animal           2     3,746  100,775        0   7,834        395       27     14.4
 21    Holstein  Animal           4     7,492  100,775        0  15,732      3,201      179     17.8
 22    Holstein  Animal           8    14,984  100,775        0  29,859     25,769    1,460     17.7
 23    Holstein  Genomic          1     1,837  100,775   34,506  34,518         NA    1,360

1RRM: random regression model; RR-TDM: random regression test-day model.
2FIXED: total levels of fixed effects; PED: animals in a pedigree file; GENO: genotyped animals; OLDM: the order of the largest dense matrix in a factor.
3Speedup: the ratio of running time with FSPAK to that with YAMS.
4NA: did not complete.
The maximum advantage was found in model 7, for which the running time was 13 min with the new function, compared with 32 min with the previous function. The hash format was 9 to 67 times faster than the IJA format in the trace calculations for the genomic models (models 5 through 8 and 23). For example, the trace calculations for model 8 required 8 min with the hash format and 8.8 h with the IJA format. For the models without genomic information, the trace calculations finished within 10 s in both formats.

Wall-clock times in AIREMLF90 using FSPAK and YAMS for the various models are shown in Table 1. Times are given separately for rounds 1 and 2 in Table S2 because the times for round 1 include the initial setup (setting up the MME, ordering, and symbolic factorization). In several cases, FSPAK stopped abnormally during the factorization of the LHS with an integer overflow in addressing because of the many nonzero elements in the factor. For the simpler models (models 1 and 9), both packages gave similar wall-clock times. As the number of traits increased (models 1 through 4), the wall-clock time increased, and runs with YAMS became faster than those with FSPAK. With genomic information (models 5 through 8), runs with FSPAK finished only for the single-trait model, whereas runs with YAMS finished successfully for the models with 1 to 4 traits. The total wall-clock time to finish the first 5 rounds of average-information REML for model 5 was 4 h with FSPAK and 20 min with YAMS. The wall-clock time with YAMS for the 4-trait model (model 8) was approximately 5 h for the first 5 rounds; assuming FSPAK could finish the analysis for model 8 but was 15 times slower, its wall-clock time would be approximately 3 d. FSPAK therefore forces compromises in the models and the amount of data that can be handled because of long computing times, whereas the computing time with YAMS is reasonable for academic research or commercial applications.

The advantage of YAMS over FSPAK becomes obvious for genomic models (models 5 through 8 and 23) and for complicated models (models 10 through 23) with more fixed effects, with correlated random effects (maternal effects or random regressions), or with additive genetic effects for sires with many progeny, especially in beef and dairy cattle.
These effects form a dense block in the factor (the sparse Cholesky factor after ordering), which can be handled efficiently with YAMS. For instance, the order of the largest dense matrix in the factor for the Angus data sets (models 13 through 15) varied from 8,987 for the single-trait model to 51,138 for the 3-trait model. The greatest speedup from using YAMS instead of FSPAK was more than 20 times, for the 2-trait analysis in Angus cattle (model 14).

Table S3 presents the wall-clock time for the various operations within an average-information REML iteration for each package. For the simpler models (models 1 through 4 and 9), most of the time was spent in ordering, for which YAMS was up to 3 times faster than FSPAK. For the other models, ordering with YAMS was much faster, and the remaining computations with YAMS were also faster. This advantage in ordering came from the more efficient algorithm used in YAMS (Masuda et al., 2014). The maximum speedup with YAMS was in sparse inversion for model 18, in which YAMS was 83 times faster than FSPAK.

The wall-clock time for the sparse operations with different numbers of cores in the first round of model 5 with YAMS is shown in Figure S1 in the supplementary material. Even when only 1 core was used, YAMS was 8 and 11 times faster than FSPAK in factorization and inversion, respectively. The wall-clock time decreased as the number of active cores increased in all models. FSPAK does not support parallel computing, and parallelizing the program is not a trivial task.

As YAMS performed better, a larger proportion of the wall-clock time tended to be spent in nonsparse operations. For example, the proportion of the wall-clock time spent in the "Other" operations relative to the total time for factorization and inversion in model 5 was only 2% with FSPAK but 30% with YAMS. For model 18, these operations took 5% and 75% of the time with FSPAK and YAMS, respectively.

Temporary memory requirements with YAMS were greater than with FSPAK, especially in inversion. For example, in model 21, the total memory usage for inversion, including storage of the Cholesky factor, was 1.31 GB with FSPAK and 1.77 GB with YAMS, and the difference (460 MB) was mostly temporary memory consumed by YAMS to accelerate sparse inversion (Masuda et al., 2014). The maximum amount of temporary memory was consumed in model 15, in which YAMS used 915 MB of temporary memory in addition to 12.1 GB of memory for the Cholesky factor. The largest Cholesky factor was found in model 8, for which the total required memory with YAMS was 15.9 GB. This amount of memory is not a limiting factor for modern computers.

For nearly all problems, most of the wall-clock time was spent inside YAMS, which is already efficient. For genomic models, further speedup could be achieved by storing the dense relationship matrices separately from the remaining part of the LHS. A limiting factor in genomic analyses is the number of genotyped animals, because of the large computational and storage requirements for creating and inverting a GRM. These costs can be reduced if the inverse matrix can be approximated and stored as a sparse matrix, e.g., with the algorithm for proven and young (APY; Misztal et al., 2014). Memory and computing requirements of the APY inverse grow approximately linearly with the number of genotyped animals, whereas those of the regular inverse grow quadratically and cubically, respectively.
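A minimal NumPy sketch of the APY idea is given below, assuming the block formulation of Misztal et al. (2014): non-core animals are expressed by recursion on a small set of core animals, so the approximate inverse consists of the inverse of the core block plus a diagonal correction, and storage grows roughly linearly with the number of genotyped animals. The simulated marker matrix, the choice of core animals, and the variable names are assumptions for illustration only, not data or code from this study.

```python
# Minimal sketch of the APY inverse of a GRM-like matrix.
import numpy as np

rng = np.random.default_rng(7)

# Simulate a small, well-conditioned GRM-like matrix from random markers.
n_animals, n_markers, n_core = 50, 200, 10
M = rng.standard_normal((n_animals, n_markers))
G = M @ M.T / n_markers + 0.05 * np.eye(n_animals)

core = np.arange(n_core)              # indices of core animals
non = np.arange(n_core, n_animals)    # non-core animals

Gcc = G[np.ix_(core, core)]
Gnc = G[np.ix_(non, core)]
Gcc_inv = np.linalg.inv(Gcc)

# Recursion: u_non = P u_core + e, with P = Gnc Gcc^{-1} and diagonal Var(e).
P = Gnc @ Gcc_inv
m = np.diag(G)[non] - np.einsum('ij,ij->i', P, Gnc)   # diag(Gnn - Gnc Gcc^{-1} Gcn)
Minv = np.diag(1.0 / m)

# APY inverse in block form: core-block inverse plus a diagonal correction.
G_apy_inv = np.zeros_like(G)
G_apy_inv[np.ix_(core, core)] = Gcc_inv + P.T @ Minv @ P
G_apy_inv[np.ix_(core, non)] = -P.T @ Minv
G_apy_inv[np.ix_(non, core)] = -Minv @ P
G_apy_inv[np.ix_(non, non)] = Minv

# Compare with the regular (dense) inverse; APY is an approximation.
err = np.abs(G_apy_inv - np.linalg.inv(G)).max()
print(f"max |APY inverse - regular inverse| = {err:.3f}")
```

With few core animals the approximation is rough; its quality depends on the number and choice of core animals (Misztal et al., 2014).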
The refinements suggested in this study can be applied to general average-information REML programs. Supernodal methods successfully reduce the computing time of the sparse operations for various animal models. A hash format accelerates the setting up of the MME and removes the bottlenecks in the trace calculations, especially for genomic models. The refinements need more memory than the traditional methods, but the additional memory requirement is not an issue on modern computers.

LITERATURE CITED

Aguilar, I., I. Misztal, D. L. Johnson, A. Legarra, S. Tsuruta, and T. J. Lawlor. 2010. Hot topic: A unified approach to utilize phenotypic, full pedigree, and genomic information for genetic evaluation of Holstein final score. J. Dairy Sci. 93:743–752. doi:10.3168/jds.2009-2730.

Barrett, R., M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. Van der Vorst. 1994. Templates for the solution of linear systems: Building blocks for iterative methods. 2nd ed. SIAM, Philadelphia, PA.

Gilmour, A. R., B. J. Gogel, B. R. Cullis, and R. Thompson. 2009. ASReml user guide release 3.0. VSN Int. Ltd., Hemel Hempstead, UK.

Gilmour, A. R., R. Thompson, and B. R. Cullis. 1995. Average information REML: An efficient algorithm for variance parameter estimation in linear mixed models. Biometrics 51:1440–1450. doi:10.2307/2533274.

Iwaisaki, H., S. Tsuruta, I. Misztal, and J. K. Bertrand. 2005. Genetic parameters estimated with multitrait and linear spline-random regression models using Gelbvieh early growth data. J. Anim. Sci. 83:757–763.

Jensen, J., E. A. Mäntysaari, P. Madsen, and R. Thompson. 1996. Residual maximum likelihood estimation of (co)variance components in multivariate mixed linear models using average information. J. Indian Soc. Agr. Stat. 49:215–236.

Madsen, P., P. Sørensen, G. Su, L. H. Damgaard, H. Thomsen, and R. Labouriau. 2010. DMU—A package for analyzing multivariate mixed models. In: Proc. 9th World Congr. Genet. Appl. Livest. Prod., Leipzig, Germany. Paper 732.

Masuda, Y., T. Baba, and M. Suzuki. 2014. Application of supernodal sparse factorization and inversion to the estimation of (co)variance components by residual maximum likelihood. J. Anim. Breed. Genet. 131:227–236. doi:10.1111/jbg.12058.
Meyer, K. 2006. WOMBAT—Digging deep for quantitative genetic analyses by restricted maximum likelihood. In: Proc. 8th World Congr. Genet. Appl. Livest. Prod., Belo Horizonte, Brazil. Communication No. 27-14.

Misztal, I. 2008. Reliable computing in estimation of variance components. J. Anim. Breed. Genet. 125:363–370. doi:10.1111/j.1439-0388.2008.00774.x.

Misztal, I. 2014. Computational techniques in animal breeding. http://nce.ads.uga.edu/wiki/doku.php?id=course_materials_-_from_uga_2014. (Accessed 3 February 2015.)

Misztal, I., A. Legarra, and I. Aguilar. 2014. Using recursion to compute the inverse of the genomic relationship matrix. J. Dairy Sci. 97:3943–3952. doi:10.3168/jds.2013-7752.

Misztal, I., and M. Pérez-Enciso. 1993. Sparse matrix inversion for restricted maximum likelihood estimation of variance components by expectation-maximization. J. Dairy Sci. 76:1479–1483. doi:10.3168/jds.S0022-0302(93)77478-0.

Misztal, I., S. Tsuruta, T. Strabel, B. Auvray, T. Druet, and D. H. Lee. 2002. BLUPF90 and related programs (BGF90). In: Proc. 7th World Congr. Genet. Appl. Livest. Prod., Montpellier, France. Communication No. 28-07.

Pérez-Enciso, M., I. Misztal, and M. A. Elzo. 1994. FSPAK: An interface for public domain sparse matrix subroutines. In: Proc. 5th World Congr. Genet. Appl. Livest. Prod., Guelph, ON, Canada. 22:87–88.

Tsuruta, S., I. Misztal, L. Klei, and T. J. Lawlor. 2002. Analysis of age-specific predicted transmitting abilities for final scores in Holsteins with a random regression model. J. Dairy Sci. 85:1324–1330. doi:10.3168/jds.S0022-0302(02)74197-0.

VanRaden, P. M. 2008. Efficient methods to compute genomic predictions. J. Dairy Sci. 91:4414–4423. doi:10.3168/jds.2007-0980.

Williams, J. L., J. K. Bertrand, I. Misztal, and M. Łukaszewicz. 2012. Genotype by environment interaction for growth due to altitude in United States Angus cattle. J. Anim. Sci. 90:2152–2158. doi:10.2527/jas.2011-4365.