Developing Chemometrics with the Tools of Information Sciences – MASIT23

Olli Simula, Helsinki University of Technology, Adaptive Informatics Research Centre, P.O. Box 5400, 02015 TKK, Tel. +358 9 4513271
Amaury Lendasse, Helsinki University of Technology, Adaptive Informatics Research Centre, P.O. Box 5400, 02015 TKK, Tel. +358 40 7700237
Francesco Corona, Helsinki University of Technology, Adaptive Informatics Research Centre, P.O. Box 5400, 02015 TKK, Tel. +358 9 4513922
Satu-Pia Reinikainen, Lappeenranta University of Technology, Laboratory of Chemistry, P.O. Box 20, 53851 Lappeenranta
Pentti Minkkinen, Lappeenranta University of Technology, Laboratory of Chemistry, P.O. Box 20, 53851 Lappeenranta
Jarno Kohonen, Lappeenranta University of Technology, Laboratory of Chemistry, P.O. Box 20, 53851 Lappeenranta
Marja-Liisa Riekkola, University of Helsinki, Laboratory of Analytical Chemistry, P.O. BOX 55, FI-00014 University of Helsinki
Kari Hartonen, University of Helsinki, Laboratory of Analytical Chemistry, P.O. BOX 55, FI-00014 University of Helsinki
Ilppo Vuorinen, University of Turku, Archipelago Research Institute, 20014 University of Turku
Jari Hänninen, University of Turku, Archipelago Research Institute, 20014 University of Turku
Jukka Silén, Tampere University of Technology, Plastics and Elastomer Technology, P.O. Box 589, FIN-33101 Tampere
Abstract

In the CHESS project, novel algorithms and variations of existing algorithms are developed for process data analysis, visualization, and monitoring. The algorithms are applied in a variety of industrial settings organized as five test cases: oil production, food production, process manufacturing, plastics production, and environmental analysis and forecasting. In the first phase of CHESS, data sets from the industrial partners were analyzed. Based on this study, the research partners have created a set of general-purpose information science tools. The emphasis is on real-time implementation of the methods in practical industrial environments. The final implementation of the methods and algorithms in products will be developed further by the small partner companies of CHESS.

Keywords: Data analysis, on-line monitoring, visualization, time series prediction, environmental analysis
1 Project background and goals

In this project, we have developed the tools of information sciences that best serve chemometrics, that is, the computational analysis of chemical data. Computational methods and chemometrics have developed rather separately. Most computer scientists are not familiar with the concepts of chemometrics, and the chemometrics community has in turn been slow to adopt the new approaches developed in computer and information sciences. The goal of this project is to bring the modern tools of information science to the application platform created by the chemometrics community. With partners from both fields, the resulting synergy allows new approaches to emerge.

Great practical application possibilities lie in the field of environmental monitoring, which ties together the producers of chemical environmental data (manufacturers of the analyzers) and the governmental institutes (the environmental officers) responsible for environmental monitoring. In this field especially, the possibilities of modern time series analysis are still largely unexploited or underdeveloped.
2 Project work programme

The CHESS project addresses five problems, called test cases, from rather different application domains: oil production, food production, process manufacturing, plastics production and environmental analysis. The test cases are described briefly below. Due to confidentiality, the meaning or the significance of the studied processes is not given or published.

TEST CASE 1: Oil Production. The aim is to obtain new empirical modelling tools based on information technology. The emphasis has been on tools suitable for fast data mining from large data sets, e.g., 1) tools capable of finding the most significant variables in large empirical databases, 2) tools for reliable dynamic modelling, 3) tools for solid multi-block modelling and 4) robust novel nonlinear approaches. Research partners responsible for the test problem: TKK/AIRC and LUT/CG together with Neste Oil Oyj.

TEST CASE 2: Food Production. Danisco is interested in creating monitors that display the current state of the process and warn the user of potential hazards, and predictors that allow future process conditions to be estimated. Research partners responsible for the test problem: TKK/AIRC and Danisco.

TEST CASE 3: Process manufacturing. UPM-Kymmene RC produces data on the technical properties of paper. The data is used for research, development and quality monitoring as well as for competitor surveillance. UPM-Kymmene Oyj is interested in developing statistical methods for evaluating and classifying different paper grades and their properties with a multidimensional approach. Chemometric tools are the most beneficial approaches of this type. Research partner responsible for the test problem: LUT/CG.

TEST CASE 4: Plastics Production. The aim of the research work is to increase the material knowledge and the competitiveness of the Finnish plastics and rubber industry. Research partner responsible for the test problem: TUT/LPET.

TEST CASE 5: Environmental analysis. Envidata will study problems of environmental monitoring, such as how a biological monitoring time series responds to another time series, e.g. a chemical environmental one. So far, the lack of modelling (analysis) tools in practical use has prevented this. The obstacle is not a lack of techniques as such, but rather the limited experience of environmental officers in applying the new approaches developed in information sciences. Research partners responsible for the test problem: UT/ARI, TKK/AIRC, and UH/LAC.
3 Project results

3.1 TEST CASE 1: Oil production

The industrial Test Case 1 has used process data from Neste Oil Oyj. The aim has been to obtain new empirical modelling tools based on information technology. The emphasis has been on tools suitable for fast data mining from large data sets. The test case has included:

• Analysis of instrumental data, on-line monitoring data and quality data
• Non-linear processes
• Identification of delays between stages in industrial processes
• Robust variable selection methods
Analysis of instrumental data, on-line monitoring data and quality data

The work has used a real process data set with 13000 on-line samples (time points) and over a thousand variables. The variables form different blocks: Z (NIR), X (process variables) and Y (quality of the end product). This data has been used in the development of several kinds of algorithms:

1) Multivariate control charts to diagnose the quality of spectral data
2) Dynamic PLS models to predict the quality of the end product. The models have been built using
   a. Direct regression
   b. Multi-block methods
   c. Priority regression methods
   d. Non-linear methods
3) Preprocessing methods for spectral data
   a. CovProc (Covariance Procedures)
   b. Orthogonal Signal Correction
   c. Savitzky-Golay smoothing
   d. Multiplicative Scatter Correction
   e. Wavelet transformation

A Matlab-based program with a graphical user interface (GUI) has been developed. The GUI enables rapid updating of models (data and methods) and can be used as a tool for model development during the project. All these methods are advanced variations of traditional chemometric methods, and the analyses have been done by LUT/CG. Algorithms such as advanced multi-block algorithms have been developed further. Within certain limits, the multi-block methods can also be used to integrate economic data into MSPC (multivariate statistical process control) models.

Non-linear processes

Sometimes the data exhibit curvature, which makes forecasts from linear models unreliable. Methods have been developed to identify situations where the data exhibit (multivariate) curvature. They also suggest revised non-linear methods that take the curvature into account and provide more reliable forecasts than linear methods. In chemometrics, the common types of solutions include:

1) Parameter estimation (hard or kinetic models)
2) Nonlinear regression (soft models). There is great interest in stable algorithms for non-linear regression. Several nonlinear variations of PLS regression exist; the most commonly used are based on polynomial solutions. A novel approach has been introduced.
3) Neural networks of different kinds (soft models)
4) Tools for identification of non-linear behaviour between or within explanatory phenomena and the response surface. This is an essential task for reliable and robust non-linear modelling.

This task has started with a study of the algorithms. The aim is to utilize polynomial PLS with and without Heisenbergs modelling procedure. In addition, a novel PLS algorithm utilizing a non-linear score surface has been developed and published by LUT/CG. The algorithms are partially still under development. The task will continue with the compilation of suitable data sets, and the results obtained with these methods will be compared with the results obtained by TKK/AIRC.

Identification of delays between stages in industrial processes

Identification of delays, or diagnosis of process dynamics, is an important topic in the process industry. The identification methods in this case study are:

1) Numerical alignment of process variables measured at different stages of the process (different units)
2) Utilization of spectral data in defining delays or process dynamics between different process units (a multivariate extension is being developed)
3) Diagnosis of the response time from one unit to another (the changes may be detected with response windows of different lengths)

The existing data set can be used in tasks 1) and 3). Task 2), the utilization of spectral data, requires a new data set, which Neste Oil Oyj has recently delivered to LUT/CG.
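The numerical alignment in task 1) can be illustrated with a minimal sketch that estimates the delay between two process variables as the lag that maximizes their cross-correlation. This is a generic technique assumed here for illustration, not necessarily the method used in the project; the function name `estimate_delay`, the synthetic signals and the 7-sample delay are all hypothetical.

```python
import random
from statistics import mean, pstdev


def standardize(x):
    """Scale a series to zero mean and unit variance."""
    m, s = mean(x), pstdev(x)
    return [(v - m) / s for v in x]


def estimate_delay(upstream, downstream, max_lag):
    """Return (lag, correlation) for the lag in 0..max_lag at which the
    downstream series best matches the upstream series."""
    u, d = standardize(upstream), standardize(downstream)
    n = len(u)
    best_lag, best_corr = 0, float("-inf")
    for lag in range(max_lag + 1):
        # Compare u[t] with d[t + lag] over the overlapping samples.
        corr = mean([a * b for a, b in zip(u[: n - lag], d[lag:])])
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag, best_corr


# Synthetic example (assumption): the downstream unit sees the upstream
# signal 7 samples later, with a little measurement noise added.
random.seed(0)
true_delay = 7
upstream = [random.gauss(0.0, 1.0) for _ in range(500)]
downstream = [
    (upstream[t - true_delay] if t >= true_delay else 0.0)
    + 0.1 * random.gauss(0.0, 1.0)
    for t in range(500)
]

lag, corr = estimate_delay(upstream, downstream, max_lag=30)
print("estimated delay:", lag, "samples, correlation:", round(corr, 3))
```

Real process signals are usually strongly autocorrelated, so the correlation peak is broader than in this white-noise example, and differencing or prewhitening the signals before the search may be needed.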
3.2 TEST CASE 2: Food Production

Danisco is one of the world's leading producers of ingredients for food and other consumer products. The Danisco Sweeteners Kotka plant is part of the sweeteners division, which develops, manufactures and markets speciality sweeteners and related ingredients worldwide. The main products are fructose and xylitol. Fructose is ideal for use in diabetic and light foods and beverages and has interesting metabolic advantages as well as flavour-enhancing properties. Xylitol is a naturally occurring sweetener with great taste and unique dental benefits.

Fructose and xylitol process data is stored in the Metso DNA database and quality data in the Labmaster database. The goal in the CHESS project is to develop an overall data analysis that uses both process and quality data, starting with data from the fructose production. Danisco's interest has been in gaining insight into the correlations and delays between the different process stages, and in creating monitors that display the current state of the process and give early warnings. Of most interest are those correlations that cannot be found by visual inspection. The target variable is fructose purity after crystallization and before centrifuging. Potential models can be implemented in the automation system and embedded in advanced process control strategies.

The project focused on the problem of missing data in the database and on the determination of the important variables for the prediction of the fructose value. The results of the research activities are described below:

• In the fructose database, 50% of the data is missing. The presence of missing values in the underlying time series is a recurrent problem when dealing with databases. A number of methods have been developed to solve the problem and fill in the missing values. The methods can be classified into two distinct categories: deterministic methods and stochastic methods. The Self-Organizing Map (SOM, cf. Kohonen 2001) aims to group homogeneous individuals, highlighting a neighbourhood structure between classes in a chosen lattice. The SOM algorithm is based on the unsupervised learning principle, where the training is entirely stochastic and data-driven. The SOM allows the projection of high-dimensional data onto a low-dimensional grid. Through this projection, and thanks to its topology preservation property, the SOM allows nonlinear interpolation of missing values. The Empirical Orthogonal Functions (EOF), on the other hand, are deterministic, enabling linear projection to a high-dimensional space. They have also been used to develop models for finding missing data. EOF models allow continuous interpolation of missing values, but they are sensitive to the initialization. We have developed a new methodology that combines the advantages of both the SOM and the EOF. The nonlinearity property of the SOM is used as a denoising tool, and the continuity property of the EOF method is then used to recover the missing data efficiently.

• We have created tools for the selection of the best variables for the 2-days-ahead prediction of the fructose value. The selection is made among process measurements, including previous values of the fructose. Feature selection and dimension reduction are two different ways of reducing the number of inputs of the regression model. In the first, inputs are selected among the original features; this is usually referred to as feature selection or input selection. In the second, inputs are built from the original features by combining them in a linear or nonlinear way; this leads to dimension reduction. In such a context, a new algorithm for variable selection and feature extraction has been developed. The selection strategy is based on Noise Variance Estimation.

The method for the determination of missing values was implemented in the Matlab environment and is available for download from www.cis.hut.fi/projects/tsp.
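As an illustration of the EOF part of the missing-value methodology, the following minimal sketch fills gaps by iterating a truncated singular value decomposition. It is a generic EOF-style imputation under stated assumptions, not the project's Matlab implementation: the SOM denoising stage is omitted, the number of retained EOFs is fixed by hand (it would normally be chosen by validation), and the function name `eof_impute` and the synthetic data are hypothetical.

```python
import numpy as np


def eof_impute(X, n_eof, n_iter=100):
    """Fill NaN entries of X by iterating a rank-n_eof SVD reconstruction."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: column means of the observed values.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :n_eof] * s[:n_eof]) @ Vt[:n_eof]
        # Only the missing entries are updated; observed data stay fixed.
        filled[missing] = approx[missing]
    return filled


# Synthetic example (assumption): six correlated measurements driven by
# two slow latent signals, with 20% of the values removed at random.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 4.0 * np.pi, 200)
scores = np.column_stack([np.sin(t), np.cos(0.5 * t)])
clean = scores @ rng.normal(size=(2, 6))
mask = rng.random(clean.shape) < 0.2
data = clean.copy()
data[mask] = np.nan

recovered = eof_impute(data, n_eof=2)
mean_filled = np.where(mask, np.nanmean(data, axis=0), data)
rmse_eof = np.sqrt(np.mean((recovered[mask] - clean[mask]) ** 2))
rmse_mean = np.sqrt(np.mean((mean_filled[mask] - clean[mask]) ** 2))
print("RMSE, EOF imputation: ", rmse_eof)
print("RMSE, mean imputation:", rmse_mean)
```

The dependence on the initial fill is the sensitivity to initialization mentioned above; in the combined methodology the SOM is the stage that precedes the EOF step.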
3.3 TEST CASE 3: Process manufacturing (Extraction of information from process data)

At UPM RC, a vast amount of data on the technical properties of paper is produced as a daily routine. These results are used for research, development and quality monitoring. In case study 3, the main focus has been on the effective utilization of the available data and on estimating the reliability of the data. A multivariate tool devoted solely to this purpose has been developed. The tool provides:

1. Calibration tools
2. Classification tools
3. Univariate and multivariate statistical process control tools

Different types of data sets have been involved:

• Lorentzen & Wettre Autoline data (automatic instrument for measuring the quality of paper)
• Comparison of different paper grades and different manufacturers based on technical paper properties

Lorentzen & Wettre Autoline data

The equipment automatically determines over 20 variables related to paper quality. The aim was to analyze different paper sheets quickly and easily and to identify deviations from "normal quality". The application automatically detects the paper grade and compares the sample to a calibration set of that paper grade. The solution is based on PCA and Multivariate Statistical Process Control charts combined with the diagnostics of Soft Independent Modelling of Class Analogy. The Matlab program thus contains calibration tools and a tool for diagnosing new incoming data, and it enables early warning of significant deviations from normal quality. It also enables the comparison of different paper grades in multivariate and univariate space.

Alternative solutions

The general aim of the activities of TKK/AIRC was the development of data analysis and estimation methods that, although general in their formulation, can be tailored to fulfil the ad hoc requirements of monitoring and visualizing production processes. When the complexity of the phenomena and the processes is a limiting factor, efficient monitoring devices can be designed and implemented only after a preliminary extraction of the significant information that exists between the properties of interest (Y) and the available field measurements (X and Z). With this primary objective in mind, our emphasis has been on methods that explicitly take into account the intrinsic characteristics of the two main types of process data: process variables (Z) and spectroscopic measurements (X).

With respect to the process variables (Z), classical and advanced methods for data exploration and visualization have been considered. The activities focused on methods for dimensionality reduction aimed at topology and distance preservation for data projection and visualization. Given the generality of the approach, diverse manifold learning techniques, ranging from traditional Principal Components Analysis to Laplacian Eigenmaps, have been considered (cf. Lee, 2008) and directly applied to our study cases. The results obtained by applying such methods to the available data sets were combined with, and validated against, the domain knowledge of our industrial partners, and they allowed the definition of meaningful displays for process visualization. These displays provided the qualitative framework for a macroscopic understanding of the studied production processes (e.g., the recognition of their most relevant operational stages) and facilitated the quantitative development of devices for monitoring the properties of interest Y from the spectral measurements X.

The activities on the spectral measurements (X) of the materials mostly concentrated on the development and application of methods for input variable selection.
During the project we investigated the possibility of selecting only a small subset of relevant spectral variables emerging from the metric structure of the data. The topology-preserving representation of the spectral measurements was obtained with the Self-Organizing Map (SOM, cf. Kohonen 2001), where the relevance of the inputs is measured as the similarity between distance matrices. As a result, we found that the spectral inputs whose topology is similar to that of the property of interest are also associated with the wavelengths that chemically explain the influence of the most important functional groups in the samples. The selected variables also have considerable predictive power: when used to develop a prediction model for the properties of interest, the obtained accuracies are always at least comparable to those achieved with standard methods.

The research on spectral measurements also considered techniques that exploit the functional structure of the observations. In particular, we developed an original approach to variable selection where the relevant inputs are identified at the functional features that characterise the shape of the spectral curves, that is, at wavelengths where not only the function's values but also its slope and curvature are significant for estimating the property of interest. Moreover, we formalized a method for compressing the spectra by representing them as a linear combination of Gaussian basis functions whose locations and widths are optimized. The methods have been analysed mostly on reference problems from the literature and compared to conventional techniques. In addition, innovative methods for obtaining an optimal projection of the spectral variables X were developed by minimizing the estimates of the variance of the noise on the properties of interest Y.
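Selection by noise variance estimation can be sketched with a nearest-neighbour estimator commonly called the Delta test, which approximates the noise variance as half the mean squared difference between each output and the output of its nearest neighbour in input space. The text does not state which estimator was used, so the Delta test is an assumption here, as are the function names `delta_test` and `select_inputs`, the exhaustive search and the synthetic data.

```python
import itertools
import math
import random


def delta_test(X, y, subset):
    """Nearest-neighbour noise variance estimate using only the input
    variables listed in `subset`."""
    n = len(X)
    total = 0.0
    for i in range(n):
        best_j, best_d = None, math.inf
        for j in range(n):
            if j == i:
                continue
            d = sum((X[i][k] - X[j][k]) ** 2 for k in subset)
            if d < best_d:
                best_j, best_d = j, d
        total += (y[i] - y[best_j]) ** 2
    return total / (2 * n)


def select_inputs(X, y):
    """Exhaustive search for the subset with the smallest estimate.
    Feasible only for a handful of candidate inputs."""
    n_vars = len(X[0])
    candidates = [
        s
        for r in range(1, n_vars + 1)
        for s in itertools.combinations(range(n_vars), r)
    ]
    return min(candidates, key=lambda s: delta_test(X, y, s))


# Synthetic example (assumption): only inputs 0 and 2 influence the output.
random.seed(0)
X = [[random.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(200)]
y = [math.sin(3.0 * r[0]) + r[2] ** 2 + random.gauss(0.0, 0.05) for r in X]

best = select_inputs(X, y)
print("selected inputs:", best)
```

With hundreds of spectral variables an exhaustive search is not practical, and a greedy forward or backward search over the same criterion would typically replace it.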
3.4 TEST CASE 4: Plastics Production

In the plastics industry, processes are almost always managed in very traditional ways, especially when plastics are converted into products, semi-finished products or components. Statistical tools and modern information technology are seldom used, because the tools are unfamiliar or considered difficult to use. The aim of the research work in TC4 (Plastics Production) is to increase the competitiveness of the Finnish plastics and rubber industry with tools from information technology that are not commonly exploited in this branch of industry. The work is carried out in close co-operation with enterprises producing plastic parts and components. The targeted enterprises are mainly SMEs.

During 2007, TC4 concentrated on defining a clear vision. In this vision phase, the state of the art of the research and the expectations of industry for these research activities were determined. The role of the customer and end-user turned out to be surprisingly prominent. The rapid change in working methods across the whole business chain has made knowledge of the process status the main topic of everyday process control. Quality data tools, such as SPC, capability factors and ordinary statistical variables, have emerged (or at least should have emerged) as common tools in the routines of plastics converters.

The first contacts with the plastics industry produced very practical goals, such as a tool or set of tools for managing the production processes (solving process problems, increasing efficiency, raising competitiveness). The tools must be easy to understand and well documented. Technical support in Finnish has also been on the wish list of the SMEs. TC4 was therefore deeply involved with Data Rangers Ltd and its software. The intention is that tools such as Webmailer and Dataminer become common everyday tools in the plastics industry.

Data quality has been the top priority. The processing of plastics materials (polymers) is normally non-linear and involves a huge number of process parameters. The processes are unpredictable and very difficult to manage. A "data handbook" for data production was therefore compiled. This brief but sufficient list tells plastics converters how to produce data of sufficient amount and quality: for which kind of process or production, which kind of data or measurement, how the measurement must be organized, at what frequency, and so on, i.e. the essentials for managing the industrial processes. Together with the data definition, a study of critical parameters was carried out. This work drew on previous Tekes programmes, such as ProMuovi. It is commonly known that there is no consensus among experts on the determination of critical parameters.
The research group concentrated on injection moulding, where the parameters considered include mould temperature, melt temperature, injection velocity (time) and back pressure. These parameters can, however, vary depending on the product, and will certainly vary between processes.

At the very end of 2007, field tests were started in SMEs to determine the suitability and functionality of the selected tools. The group of enterprises consists of customers of Tampere University of Technology with the following common attributes: small companies, Pirkanmaa area, plastics, mould shapes, short-run production, no previous data collection, and high know-how in plastics production technology. Above all, they face highly increased demands for quality management from their customers. The field tests will last until the end of CHESS (30.4.2008), and the results will be published in the final report of CHESS. The final goal of TC4 is to create not only single quality tools but also a comprehensive network-based quality management service, from measurements to a satisfied customer.
3.5 TEST CASE 5: Environmental analysis

(a) Analysis of Baltic Sea data

During the recent phase of the project, the UTU Test Case 5 has focused on defining the eutrophication processes in the Baltic Sea in more detail. Although it is well known that the majority of the nutrient loading to the Baltic Sea arrives with freshwater runoffs (70-90% of all loading comes with riverine waters), the actual loading processes are not yet well known. This concerns the nitrogen and phosphorus concentrations carried along with freshwater runoffs (the nutrient loading intensity in relation to the amount of incoming water), and also the accumulation processes in the stratified seawater column (surface, middle and bottom water layers). Moreover, the understanding of the accumulation processes along the spatial dimensions (local, regional and basin-wide processes) needs a similar update.

All this information is closely linked to the interests of the BACC, the BALTEX Assessment of Climate Change for the Baltic Sea basin, which will assemble, integrate and assess the available knowledge of past, current and expected future climate change and its impacts on ecosystems in the Baltic Sea basin. One of the main statements of the BACC assessment is that in the future the overall rainfall in the Baltic Sea catchment area will increase, which will enhance the eutrophication processes in the sea. Focusing on the present eutrophication processes, we have already been able to increase this scientific knowledge during the project. Examples of such cases are:
General runoff regulation - We demonstrated that the North Atlantic weather effect can generally be detected in the Baltic Sea runoffs with various climate indices, and that it can even be separated between Baltic subareas, as the effect showed considerable geographical variation. The NAO (North Atlantic Oscillation) proved, again, to be the best explanatory variable for climatic regulation, as has been demonstrated repeatedly before (Hänninen et al. 2000, 2003, Vuorinen et al. 2004).

Nutrient loading processes - Nutrient loading models indicated a very strong coupling between nutrient loading and freshwater runoff, and showed that the Baltic nutrient loading can be modelled solely on the basis of the incoming freshwater runoffs. Even linear models reveal the causal relationships involved, and we were able to point out that, simply, the more runoff there is, the more nutrients enter the sea. However, more accurate estimates of this coupling could be achieved with a non-linear modelling method: the loading intensity in relation to the amount of incoming water is hardly just a linear combination, since increased runoff often means still more leaching of nutrients. For this purpose, our intention during the final phase of the project is to use non-linear modelling to partition the runoff events of the study period in more detail, in order to find the years or periods when runoffs were at high or low levels, affecting the loading intensity in opposite ways.

Nutrient accumulation processes - In general, our models for the nutrient accumulation processes in seawater turned out to be very weak for both substances. We were able to find a model only for phosphorus in organic form, and it was evident only in the two uppermost water layers. One possible reason for the weak results is the high rate of the processes: the chemical and biological reactions and processes happening in the sea are short-lived, and with the time resolution of the series used in our models we were therefore not able to detect the real course of events. On the other hand, additional processes can act simultaneously and blur the general view. One example is the atmospheric nitrogen uptake by blue-green algae, which can be very intensive in summer and can strongly influence the amount of nitrogen stored in seawater.

(b) Analysis of environmental samples (atmospheric particles)

The Laboratory of Analytical Chemistry (University of Helsinki) has, together with Data Rangers, developed and applied data analysis software for the measurement data of environmental samples (atmospheric particles). The software includes basic mathematical tools for statistics and some new tools based on the latest research results in information technology (for example, the support vector machine). The software is easy to use without deep expertise in statistics.

The software has been tested on the particle (and gas phase) measurement data collected at the SMEAR II measurement station (Hyytiälä) during the QUEST campaign in 2003. Our goal was to find correlations between the many physical and chemical parameters measured, in order to elucidate the formation and growth of aerosol particles. So far, some correlations have been observed between oxidation products (pinonaldehyde) of terpene compounds in the particle and gas phases, as well as between oxidation products and the particle size distribution (for the data collected in 2003). In addition, some other terpene compounds seemed to behave similarly. The software has given useful information on the quality of the analytical data studied. The results have clearly shown that neither the quality nor the quantity of the data is sufficient to draw any wider or more accurate conclusions.
Based on these results, a new particle collection and measuring campaign was designed for April-August 2007 to better assess different correlations and to fully exploit the software. During spring and summer 2007, aerosol samples with a one-hour time resolution were collected using a new sampling technique (particle-into-liquid sampler, PILS). The collection system (PILS) was slightly modified and optimized to better suit the organic compounds in aerosol particles. The chemical analysis of the collected particle samples was completed at the end of 2007. For the data analysis with the developed software, the physical and chemical data were reorganized to give a close time fit of the data points, because of their different sampling rates. Additionally, some smoothing was applied to the physical data. Interesting correlations were obtained for some oxidation products of terpene compounds. Pinonaldehyde and verbenone, which are oxidation products of α-pinene, were found to behave differently. Pinonaldehyde was mostly found in association with very small (< 25 nm) and acidic particles, while the verbenone concentration correlated with large particles (200-800 nm).
In addition, α-pinene correlated negatively with the pinonaldehyde concentration and positively with the verbenone concentration. Other, more straightforward correlations were clearly seen with the software, such as the correlation of particle size with visibility and direct global radiation. This correlation was also seen with the concentrations of the two oxidation products mentioned above, because of their different particle size associations. The pinic acid concentration was also found to correlate with very large particles. The other analysed acids behaved in a similar way to pinic acid.

The results achieved so far need to be confirmed and other correlations sorted out. It would be interesting to study more carefully the possibly different mechanisms of particle association of these compounds, as well as of many others. The time and place at which the oxidation of the compounds occurs (gas or particle phase, amounts of oxidants such as ozone, etc.) and the dissolution conditions on the particle surface would also be valuable for further conclusions. The amount of chemical concentration data from the collected aerosol samples is still growing and will be added to the data analysis matrix, which will improve the number and quality of the obtained correlations as well as the conclusions drawn. Special emphasis will be put on adding chemical concentration data on the oxidation products of sesquiterpenes to the model. The applicability of the developed software to aerosol particles will be reported in an article in the near future.
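The reorganization of physical and chemical data with different sampling rates onto a common time base can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the developed software: the 10-minute physical series, the hourly chemical series, the variable names and all numerical values are hypothetical, and plain hourly averaging stands in for the smoothing and time fitting described above.

```python
import math
import random
from collections import defaultdict


def hourly_means(samples):
    """Average (minute, value) samples into {hour_index: mean value}."""
    bins = defaultdict(list)
    for minute, value in samples:
        bins[minute // 60].append(value)
    return {hour: sum(v) / len(v) for hour, v in bins.items()}


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Synthetic example (assumption): a physical quantity measured every
# 10 minutes for 48 hours, and a chemical concentration available once
# per hour that follows the hourly mean of the physical quantity.
random.seed(2)
physical = [
    (m, 100.0 + 50.0 * math.sin(m / 600.0) + random.gauss(0.0, 5.0))
    for m in range(0, 48 * 60, 10)
]
physical_hourly = hourly_means(physical)
chemical_hourly = {
    h: 0.02 * physical_hourly[h] + random.gauss(0.0, 0.2) for h in range(48)
}

# Correlate only over the hours present in both series.
hours = sorted(set(physical_hourly) & set(chemical_hourly))
r = pearson(
    [physical_hourly[h] for h in hours],
    [chemical_hourly[h] for h in hours],
)
print("hours compared:", len(hours), "correlation:", round(r, 3))
```

Restricting the comparison to hours present in both series also handles gaps in either record without any interpolation.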
4 Impact of the results

The research results have immediate applications in a number of industrial fields. We discuss the benefits starting from the immediate ones and proceeding towards the strategic ones. The studied concepts relate to Neste Oil, Danisco, UPM, Envidata, Data Rangers and analytical technology in general.

Data Rangers is the medium for transferring the findings to industry. This TKK spin-off specializes in the implementation and commercialization of research results. Part of the algorithms developed by the Helsinki and Lappeenranta Universities of Technology will be implemented already during the project. Neste Oil and Danisco have access to the main Matlab® algorithms during the project. Additional confidentiality agreements and agreements on the utilization rights of the algorithms have been made. The companies will also have access to the algorithms through the commercial software platform of Data Rangers and Envidata. The Archipelago Research Institute and Muovipoli disseminate the new analysis practices to their affiliated companies.

Neste Oil may implement the algorithms to improve their unit operations. The new methods have been applied in developing more reliable and robust models for research and for process control and optimization. One main application field will be the simultaneous utilization of process and spectral variables in the modelling of the quality variables that are used in controlling the oil refinery. LUT/CG has been in close international cooperation on the development of algorithms concerning methods such as multi-block modelling, priority regression, CovProc and non-linear modelling. As a result of this productive activity, several international scientific publications have been published. The algorithms are available to the partners who have participated in the development process.

The main focus at the UPM-Kymmene Research Center has been the interpretation and monitoring of historical databases.
Practical tools for monitoring data quality are required, e.g., before the data are accepted into the database and when a data set is collected from the database. For example, a general tool that provides information about the reliability of the results of an automated analyzer is needed. The equipment may repeat the measurement routine tens of times, and the data need to be checked for outliers. The inner correlation structure of the variables and the historical data can be utilized for this. Tens of variables may be measured from a sample, and the results are stored in a database. The reliability of the data should be evaluated whenever a data set is collected from the database for different purposes. These methods will be based on traditional Multivariate Statistical Process Control (MSPC) methods. The methods also provide a robust practical tool for visualizing the data, as well as for (semi)automated monitoring of outliers, extreme samples or clustering.
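One common way to use the inner correlation structure of historical data for outlier checking is PCA-based monitoring with the Hotelling T2 and squared prediction error (SPE, also called Q) statistics. The following is a minimal sketch of that general technique, not the tool developed in the project. The function names `fit_pca_monitor` and `monitor`, the choice of two components, the empirical 99th-percentile control limits and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def fit_pca_monitor(X_ref, n_components):
    """Fit a PCA monitoring model on reference (historical) data."""
    mean = X_ref.mean(axis=0)
    std = X_ref.std(axis=0, ddof=1)
    Z = (X_ref - mean) / std                     # autoscale each variable
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:n_components].T                      # loadings (variables x components)
    score_var = s[:n_components] ** 2 / (len(X_ref) - 1)
    return {"mean": mean, "std": std, "P": P, "score_var": score_var}

def monitor(model, X):
    """Return Hotelling T2 and SPE (Q) for each row of X."""
    Z = (X - model["mean"]) / model["std"]
    T = Z @ model["P"]                           # scores in the model plane
    t2 = np.sum(T ** 2 / model["score_var"], axis=1)
    residual = Z - T @ model["P"].T              # part not explained by the model
    spe = np.sum(residual ** 2, axis=1)
    return t2, spe

# Synthetic reference data (illustration only): six correlated variables
# driven by two latent factors plus a small amount of noise.
rng = np.random.default_rng(1)
loadings = np.array([[1.0, 0.8, 0.6, 0.0, 0.2, 0.5],
                     [0.0, 0.3, 0.7, 1.0, 0.9, 0.5]])
latent = rng.normal(size=(300, 2))
X_ref = latent @ loadings + 0.05 * rng.normal(size=(300, 6))

model = fit_pca_monitor(X_ref, n_components=2)
t2_ref, spe_ref = monitor(model, X_ref)

# Assumption: empirical percentile limits instead of distribution-based ones.
t2_limit = np.percentile(t2_ref, 99)
spe_limit = np.percentile(spe_ref, 99)

# A repeated measurement whose first variable breaks the correlation structure.
suspect = X_ref[:1].copy()
suspect[0, 0] += 5.0 * model["std"][0]
t2_new, spe_new = monitor(model, suspect)
is_outlier = (t2_new > t2_limit) | (spe_new > spe_limit)
print(is_outlier)
```

In this sketch T2 flags samples that are extreme within the model plane, while SPE flags samples whose variables no longer follow the historical correlation pattern, which is the case for the shifted measurement above.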
In cooperation with our industrial partners (Envidata and Data Rangers), we have applied our approach to produce a web-based service for visualizing environmental time series. The service is based on environmental monitoring information from the coastal sea areas of Finland and consists of an environmental reporting tool and a general discussion platform for comments on observations. A demo of the service covering the sea areas in the vicinity of the town of Porvoo is already available at demo.datarangers.fi. In practice, the service enables users, e.g., to plot GIS-based environmental data sets (chlorophyll, transparency, N and P concentrations, etc.) at any specific monitoring station together with Google Maps™. In its final form, the service will be based on the Finnish authorities’ HERTTA database, established for the network of environmental monitoring stations in coastal sea areas. These data will become public in 2008 under EU legislation, after which the software could serve as a valuable toolbox for consultants, authorities or decision makers. To summarize, the CHESS consortium has a well-defined structure for creating chemometric innovations, encoding these as software, testing the findings within the partnership, and disseminating the refined concepts to the partner companies and to industry in general.
5 International cooperation
The Time Series Prediction Group at TKK organized the first European Symposium on Time Series Prediction (ESTSP) in February 2007 in Otaniemi. The research group at LUT organized SSC10, the 10th Scandinavian Symposium on Chemometrics, in Lappeenranta in June 2007. Baltic Sea research is strongly international. One example of the results partially achieved during the CHESS project is the publication Assessment of Climate Change for the Baltic Sea Basin, The BACC Author Team, Springer 2008, hardcover, 474 pp., ISBN 978-3-540-72785-9.
6 Future plans
The original CHESS project schedule was 18 months, which was extended by four months until the end of April 2008. The results during the final stage of the project have been good and, from the industrial point of view, also successful. In spring 2007, it was not possible to continue the project because some companies withdrew. Cooperation with some partners is, however, still continuing after the project. In particular, environmental problems are interesting, and data analysis methods can be successfully applied in that research field.
7 List of publications and reports
Scientific papers
Satu-Pia Reinikainen, Agnar Höskuldsson, Multivariate statistical analysis of multi-step industrial processes, Analytica Chimica Acta, 595, Issues 1-2 (2007) 248-256.
Francesco Corona, Amaury Lendasse, Satu-Pia Reinikainen, Annikki Perkiö, Kari Aaljoki, Olli Simula, Wavelength selection using the Measure of Topological Relevance on the Self-Organizing Map, Journal of Chemometrics (submitted 2007).
Jarno Kohonen, Satu-Pia Reinikainen, Kari Aaljoki, Annikki Perkiö, Taito Väänänen, Agnar Höskuldsson, Multi-block methods in multivariate process control, Journal of Chemometrics (in press 2008).
Jarno Kohonen, Satu-Pia Reinikainen, Agnar Höskuldsson, Non-linear PLS approach in score surface, Proceedings of CAC 2008, 4 pp. (accepted).
Jarno Kohonen, Satu-Pia Reinikainen, Agnar Höskuldsson, Block-based approach to modelling of granulated fertilizers’ quality, Chemometrics and Intelligent Laboratory Systems (submitted 2008).
T. Kärnä, F. Corona, A. Lendasse. Compressing spectral data using optimized Gaussian basis. Journal of Chemometrics (submitted 2007).
A. Lendasse and F. Corona. Optimal linear projection based on noise variance estimation: Application to spectroscopic modeling. Journal of Chemometrics (submitted 2007).
A. Lendasse, F. Corona, S.-P. Reinikainen and P. Minkkinen. Functional variable selection using noise variance estimation. In Proc. Chimiometrie 2008, 39-42.
E. Liitiainen, A. Lendasse and F. Corona. On non-parametric noise variance estimation. Neural Processing Letters (submitted 2007).
Hänninen, J., Toivonen, R., Vahteri, P., Vuorinen, I. & Helminen, H. 2007. Environmental Factors Shaping the Littoral Biodiversity in the Finnish Archipelago, Northern Baltic, and the Value of Low Biodiversity. SEILI Archipelago Research Institute Publications 4. Turku 2007. ISBN 978-951-293490-4.
Hänninen, J. & Vuorinen, I. 2008. Transfer-function Modelling between Climate, Runoff and Nutrient Enrichment processes in the Baltic Sea. Manuscript.
Rönkä, M., Saari, L., Hario, M., Hänninen, J. & Lehikoinen, E. 2008. Breeding success and population trends of waterfowl in northern Baltic Sea – Implications for monitoring. Manuscript.
Conference papers
Satu-Pia Reinikainen, Kari Aaljoki, Annikki Perkiö, Agnar Höskuldsson, Software reliability engineering in chemometric methods, 10th International Conference on Chemometrics in Analytical Chemistry, CAC-2006, Brasilia, September 10–15, 2006.
Satu-Pia Reinikainen, Kari Aaljoki, Annikki Perkiö, Taito Väänänen, Agnar Höskuldsson, Analysis of instrumental data (X), on-line monitoring data (Z) and quality data (Y), 10th International Conference on Chemometrics in Analytical Chemistry, CAC-2006, Brasilia, September 10–15, 2006.
Jarno Kohonen, Satu-Pia Reinikainen, Kari Aaljoki, Annikki Perkiö, Taito Väänänen, Agnar Höskuldsson, Multi-block Methods in Multivariate Process Control, SSC10 - 10th Scandinavian Symposium on Chemometrics, Book of Abstracts, LUT, Faculty of Technology, Department of Chemical Technology. Report 170 (2007) 86.
Jarno Kohonen, Satu-Pia Reinikainen, Agnar Höskuldsson, Analysis of score vectors in multivariate process control, LUT, Faculty of Technology, Department of Chemical Technology. Report 170 (2007) 87.
Jarno Kohonen, Matti Ristolainen, Miia Asikainen, Satu-Pia Reinikainen, Automated Classification of Paper Grades with Lorentzen & Wettre Autoline Analyzer, LUT, Faculty of Technology, Department of Chemical Technology. Report 170 (2007) 88.
Jarno Kohonen, Satu-Pia Reinikainen, Agnar Höskuldsson, Block-based approach to modelling of granulated fertilizers’ quality, Proceedings of WSC6, Kazan, Russia, 18.2.2008-22.2.2008.
F. Corona, E. Liitiainen, A. Lendasse and R. Baratti. Using functional representations in spectrophotoscopic variable selection and regression. In Book of abstracts Scandinavian Symposium on Chemometrics SSC10 2007, 29.
E. Eirola, E. Liitiainen, A. Lendasse, F. Corona and M. Verleysen. Using the Delta test for variable selection. In Proc. European Symposium on Artificial Neural Networks ESANN 2008, to appear.
E. Liitiainen, F. Corona and A. Lendasse. Non-parametric noise variance estimation in supervised learning. Lecture Notes in Computer Science, 4572, 63-71, 2007.
A. Sorjamaa, P. Merlin, B. Maillet and A. Lendasse. SOM+EOF for finding missing values. In Proc. European Symposium on Artificial Neural Networks ESANN 2007, 115-120.
Working Reports / LUT:
• Jarno Kohonen, Satu-Pia Reinikainen, Neste Oil Case, 2007
• Jarno Kohonen, Satu-Pia Reinikainen, UPM-Kymmene Case, 2007
• Jarno Kohonen, Satu-Pia Reinikainen, Manual for the tool of UPM-Kymmene Case, 2008
• Satu-Pia Reinikainen, Jarno Kohonen, Agnar Höskuldsson, Integration of economics into MSPC models, 2008
• Satu-Pia Reinikainen, Jarno Kohonen, Agnar Höskuldsson, Introduction to non-linear models, 2008
• Jarno Kohonen, Satu-Pia Reinikainen, Agnar Höskuldsson, Analysis of score vectors in multivariate process control, 2007
8 Bibliography
Höskuldsson, A., The Heisenberg modelling procedure and application to nonlinear modelling, Chemometrics and Intelligent Laboratory Systems, 44, Issues 1-2, 1998, p. 15-30.
Massart, D.L.; Vandeginste, B.G.M.; Buydens, L.M.C.; De Jong, S.; Lewi, P.J.; Smeyers-Verbeke, J., Handbook of Chemometrics and Qualimetrics, Parts A and B, Elsevier, 1998.
Dippner, J.W. & Vuorinen, I. (Eds.) 2008. Chapter 5. Climate-related Marine Ecosystem change. In: Assessment of Climate Change for the Baltic Sea Basin. The BACC Author Team, Springer 2008, Hardcover. pp. 309-377.
Vuorinen, I. & Flinkman, J. 2008. 5.6. Zooplankton. Chapter 5. Climate-related Marine Ecosystem change. In: Assessment of Climate Change for the Baltic Sea Basin. The BACC Author Team, Springer 2008, Hardcover. pp. 325-329.
Vuorinen, I., Hänninen, J. & Kornilovs, G. 2004. Transfer function modelling between environmental variation and mesozooplankton in the Baltic Sea. Progress in Oceanography 59: 339-356.
Hänninen, J., Vuorinen, I. & Kornilovs, G. 2003. Atlantic climatic factors control decadal dynamics of a Baltic Sea copepod, Temora longicornis. Ecography 26: 672-678.
Hänninen, J., Vuorinen, I. & Hjelt, P. 2000. Climatic factors in the Atlantic control the oceanographic and ecological changes in the Baltic Sea. Limnology and Oceanography 45(3): 703-710.
Hänninen, J., Vuorinen, I., Helminen, H., Kirkkala, T. & Lehtilä, K. 2000. Trends and gradients in nutrient concentrations and loading in the Archipelago Sea, Northern Baltic, in 1970-1997. Estuarine, Coastal and Shelf Science 50: 153-171.
Kohonen, T. The self-organizing map, 3rd edition. Springer 2001.
A. Lendasse, F. Corona, J. Hao, N. Reyhani and M. Verleysen. Determination of the Mahalanobis Matrix using Nonparametric Noise Estimations, ESANN’2006, Bruges (Belgium), 26-28 April 2006, pp. 227–237.
J. A. Lee and M. Verleysen. Nonlinear dimensionality reduction. Springer, New York, 2007.