Supporting process and quality engineers by automatic diagnosis of cause-and-effect relationships between process variables and quality deficiencies using data mining technologies (AUTODIAG)

Research and Innovation

EUR 26179 EN

EUROPEAN COMMISSION Directorate-General for Research and Innovation Directorate G — Industrial Technologies Unit G.5 — Research Fund for Coal and Steel E-mail: [email protected] [email protected] Contact: RFCS Publications European Commission B-1049 Brussels

European Commission

Research Fund for Coal and Steel Supporting process and quality engineers by automatic diagnosis of cause-and-effect relationships between process variables and quality deficiencies using data mining technologies (AUTODIAG) Norbert Holzknecht VDEh-Betriebsforschungsinstitut (BFI) Sohnstraße 65, 40237 Düsseldorf, GERMANY

Cesar Fraga † ArcelorMittal Espana SA (ARCELOR) Residencia la Granda, 33418 Gozon, Asturias, SPAIN

Floriano Ferro ILVA S.p.A. (ILVA) Novi Ligure works, 15067 Novi Ligure, ITALY

Thomas Heckenthaler ThyssenKrupp Nirosta GmbH (THYSSEN) Oberschlesienstraße 16, 47794 Krefeld, GERMANY

Gianluca Nastasi Scuola Superiore Sant’Anna (SSSA) Piazza Martiri Della Liberta 33, 56127 Pisa, ITALY

Grant Agreement RFSR-CT-2008-00042 1 July 2008 to 30 June 2011

Final report Directorate-General for Research and Innovation

2013

EUR 26179 EN

LEGAL NOTICE Neither the European Commission nor any person acting on behalf of the Commission is responsible for the use which might be made of the following information. The views expressed in this publication are the sole responsibility of the authors and do not necessarily reflect the views of the European Commission.

Europe Direct is a service to help you find answers to your questions about the European Union Freephone number (*):

00 800 6 7 8 9 10 11 (*)  Certain mobile telephone operators do not allow access to 00 800 numbers or these calls may be billed.

More information on the European Union is available on the Internet (http://europa.eu). Cataloguing data can be found at the end of this publication. Luxembourg: Publications Office of the European Union, 2013 ISBN 978-92-79-33237-1 doi:10.2777/4329 © European Union, 2013 Reproduction is authorised provided the source is acknowledged. Printed in Luxembourg Printed on white chlorine-free paper

Table of contents

1. Final Summary
2. Scientific and technical description of results
   2.1 Objectives of the project
   2.2 Comparison of initially planned activities and work accomplished
   2.3 Description of activities and discussion
      2.3.1 Task 1.1 Summary of data analysis methods oriented to quality problems
      2.3.2 Task 1.2 Categorisation of quality problems regarding the data analysis modalities
      2.3.3 Task 1.3 Definition and selection of a common framework
      2.3.4 Task 1.4 Analysis of solution strategies of existing tools
      2.3.5 Task 1.5 Analysis of software & hardware requirements
      2.3.6 Task 2.1 Enlargement of databases & data acquisition systems
      2.3.7 Task 2.2 Start of data acquisition
      2.3.8 Task 3.1 Development of the framework scheduler
      2.3.9 Task 3.2 Development of methods using 'brute force' approach
      2.3.10 Task 3.3 Development of methods using 'individual adapted' approach
      2.3.11 Task 3.4 Development of 'smart' components
      2.3.12 Task 3.5 Integration of the developed modules into the common framework
      2.3.13 Task 3.6 First laboratory tests of the developed methods with 'real' data, assessment of the results
      2.3.14 Task 4.1 Implementation of the automatic tools
      2.3.15 Task 4.2 Integration into the industrial environment
      2.3.16 Task 4.3 Briefing of target users and launch of the developed system
      2.3.17 Task 5.1 Application to analysis of mechanical & technological properties
      2.3.18 Task 5.2 Application to analysis of strip geometry and/or strip flatness
      2.3.19 Task 5.3 Application to analysis of surface defects
      2.3.20 Task 5.4 Evaluation of usability and tuning of the system
      2.3.21 Task 5.5 Comparison of the different approaches
      2.3.22 Task 5.6 Determination of the transferability
   2.4 Conclusions
   2.5 Exploitation and impact of the research results
3. Appendices
   3.1 Analysis of previous projects
   3.2 Poll from KDnuggets
   3.3 Comparison of WEKA and R
   3.4 Additional figures
   3.5 List of figures
   3.6 List of tables
   3.7 List of references
   3.8 List of abbreviations


1. Final Summary

Task 1.1 Summary of data analysis methods oriented to quality problems

In many past ECSC/RFCS projects data mining was used to analyse relationships between product quality and data measured in the steel production chain. In this task a number of these projects are listed and summarised. The projects were examined regarding:

- the investigated quality problem,
- the data mining methods used,
- the results achieved.

The results of this task were used to define the quality problem categories; for the 'individual adapted' approach they are the starting point for the selection of methods to be further investigated.

Task 1.2 Categorisation of quality problems regarding the data analysis modalities

The project partners have demonstrated their experience in the analysis of data mining problems in many RFCS projects, as shown in the previous section. For the categorisation of the quality problems the following classes were defined:

- Development over time. "Are we getting better or not?"
- Comparison of two data collections. "Are there changes in the production process that affect the quality of our product?"
- Classification of good and bad products.
- Automatic search of influencing variables / features.

These categories were later used for the application examples analysed in work package 5.

Task 1.3 Definition and selection of a common framework

The common framework developed in this project shall be installed at the different plants of the industrial partners. Here, different given environments have to be taken into consideration. To avoid duplicate work it was decided to use an existing tool as the key component of the common framework. This generic framework has an analysis task (or kernel) which has to be integrated into the different environments by means of individual interfaces. The consortium decided to use RapidMiner as the main part of the common framework, in order to be able to easily exchange analysis solutions between the partners. Furthermore, an open source version of RapidMiner is available, which reduces the costs; for the later day-by-day usage a commercial licence is also available, which includes professional software maintenance. The last reason for this decision is the availability of RapidMiner both as a stand-alone application and as a library. This puts the consortium into the position to show the transferability of the developed solution by means of realisations with different software architectures.

Task 1.4 Analysis of solution strategies of existing tools

During this task several commercial and open source data mining tools available on the market were analysed. The comparison of the main features relates to:

- Data import,
- Data exploration and visualisation,
- Data preparation,
- Modelling,
- Evaluation and deployment.


The consortium focused on freely available tools because they are accessible to all partners. The final analysis led the consortium to select RapidMiner as the tool that best fulfils the requirements of the AutoDiag project.

Task 1.5 Analysis of software & hardware requirements

ArcelorMittal wanted to link RapidMiner with an existing database viewer that is used in ArcelorMittal Asturias to show the data from all the factories stored in a central database. This standalone application, called Mytica, can be installed on any computer; it was originally only a viewer and has been modified to integrate RapidMiner, in order to obtain data mining capabilities and to include the plug-ins that were developed during AUTODIAG. As far as the requirements at ILVA are concerned, the software developed by SSSA (ILVAMiner) requires a server that runs the MySQL Database Management System (DBMS) as well as the automatic data importer tool. ILVAMiner can run on any workstation running MS Windows and connected to the intranet. For ThyssenKrupp Nirosta an existing tool for statistical analysis is used to integrate the AutoDiag functionality. Because ThyssenKrupp Nirosta together with BFI had to investigate the 'brute force' approach, a powerful dedicated high-performance server system is necessary on which the common components of the AutoDiag solution run. Here also a database containing the piece-related data as well as the common framework were installed. This ensures maximum performance, which is a main prerequisite for the acceptance of the new system by the target users.

Task 2.1 Enlargement of databases & data acquisition systems

Although the Mytica database has been operating for years, the AutoDiag project allowed ArcelorMittal Asturias to improve the tracking and traceability functions of the software. It also allowed the databases to be enlarged and the IBA data acquisition to be integrated [1]. SSSA developed at ILVA a database in which different data sources are concentrated in a unique database (AUTODIAG_DB). For this aim, SSSA also developed an automatic data importer tool that is responsible for fetching and transforming data from the AS400 ILVA mainframe to the AUTODIAG_DB data mart. At ThyssenKrupp Nirosta a technical data warehouse (TDW) exists in which the process and product data from all production sites are stored. For fast and easy access on the one hand and the aggregation of length-related data to piece-related data on the other hand, a new database was designed (star schema) and realised (with HSQLDB as database management system) for the project. A data acquisition system based on a cyclically executed transfer task aggregates and stores the data in the new database. Furthermore, some additional features like failure rates or statistical values are calculated.
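As an illustration of such an aggregation step, the following sketch condenses length-related measurements into piece (coil) related features with plain JDBC and SQL. It is only a sketch: the table and column names (length_data, coil_features, thickness, target, tol) are invented for the example and do not represent the actual TDW or AutoDiag schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

/**
 * Illustrative only: aggregates length-related measurements into
 * piece (coil) related features and stores them in a fact table.
 * Table and column names are hypothetical.
 */
public class CoilAggregation {

    public static void main(String[] args) throws Exception {
        // HSQLDB is mentioned in the report as the DBMS used for the new star schema.
        try (Connection con = DriverManager.getConnection(
                "jdbc:hsqldb:hsql://localhost/autodiag", "SA", "");
             Statement st = con.createStatement()) {

            // One row per coil: mean, standard deviation and a simple failure rate
            // (share of length positions outside a tolerance band around the target).
            st.executeUpdate(
                "INSERT INTO coil_features (coil_id, thickness_avg, thickness_std, off_gauge_rate) " +
                "SELECT coil_id, " +
                "       AVG(thickness), " +
                "       STDDEV_POP(thickness), " +
                "       SUM(CASE WHEN ABS(thickness - target) > tol THEN 1 ELSE 0 END) * 1.0 / COUNT(*) " +
                "FROM length_data " +
                "GROUP BY coil_id");
        }
    }
}
```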

Task 2.2 Start of data acquisition

ArcelorMittal has selected some quality problems that shall be investigated using data mining technologies. Even though the Mytica database has been operating for years, the selected quality problems are recent. ArcelorMittal has three cases that are candidates to be studied under the categories of quality problems defined in Task 1.2. The data necessary for the investigation of these cases were collected and stored in the Mytica database. According to the work plan, at ThyssenKrupp Nirosta the data acquisition was started with the beginning of the business year 2007/2008 and is still running. At the time of the preparation of the presented report around 3,500 features are available in the AutoDiag database.

Task 3.1 Development of the framework scheduler

SSSA exploited the Java Native Interface (JNI) in the development of ILVAMiner in order to call RapidMiner Java routines (i.e. the common framework chosen by the consortium) from within the ILVAMiner software. This tool is written in the C++ programming language.


For ThyssenKrupp Nirosta a framework scheduler was realised which is responsible for the transfer of data, the calculation of some usage statistics, and for database maintenance and consistency checks. These jobs run once a day, once a week and once a month.

Task 3.2 Development of methods using 'brute force' approach

ThyssenKrupp Nirosta and BFI have investigated the 'brute force' approach. Here the user has no preliminary information regarding the problem to be investigated. The user selects the data for the investigation, e.g. only by selecting process steps or the corresponding plant; all available data resulting from this selection are used for the investigation. During the development it had to be noted that the application of more sophisticated data mining methods leads to strongly increasing calculation times for a higher number of variables. In the discussion with the potential target users it became clear that this is not acceptable to them, given the load of their day-by-day business. For the first prototype only methods for uni-variate and linear analysis are therefore used. Each selected method calculates a weight describing the influence of a variable on the target. The selected methods are:

- Weight by Information Gain Ratio
- Weight by Deviation
- Weight by Correlation
- Weight by Uncertainty

The methods are applied in parallel and their results are combined into an overall influence index, which is used to generate an ordered table of the variables influencing the target. This leads to acceptable response times. The 'brute force' approach was developed as a RapidMiner process and integrated into the AutoDiag solution finally implemented at ThyssenKrupp Nirosta.
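The following sketch illustrates the principle of this weighting scheme; it is not the RapidMiner process used in the project. It computes one of the four weights (the absolute Pearson correlation) for each input variable, which by construction lies in the range [0, 1], and averages the available weights into an overall influence index; the further weights are only indicated as placeholders.

```java
import java.util.*;

/** Simplified sketch of the 'brute force' influence ranking (not the project code). */
public class InfluenceIndex {

    /** Absolute Pearson correlation between one input variable and the target. */
    static double absCorrelation(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
            syy += (y[i] - my) * (y[i] - my);
        }
        return (sxx == 0 || syy == 0) ? 0 : Math.abs(sxy / Math.sqrt(sxx * syy));
    }

    /**
     * Combines the normalised weights per variable into one influence index and
     * returns the variables ordered by it. Gain ratio, deviation and uncertainty
     * weights would be added to the weights array in the same way.
     */
    static List<Map.Entry<String, Double>> rank(Map<String, double[]> inputs, double[] target) {
        Map<String, Double> index = new HashMap<>();
        for (Map.Entry<String, double[]> e : inputs.entrySet()) {
            double[] weights = { absCorrelation(e.getValue(), target) /* , further weights */ };
            index.put(e.getKey(), Arrays.stream(weights).average().orElse(0));
        }
        List<Map.Entry<String, Double>> ordered = new ArrayList<>(index.entrySet());
        ordered.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return ordered; // ordered table of variables influencing the target
    }
}
```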

Task 3.3 Development of methods using 'individual adapted' approach

For the categories defined in Task 1.2 a more sophisticated approach is used here. Based on the type of problem to be solved, a semi-automatic process starts: wizards guide the user through a few very easy questions needed to run the experiment. The algorithms are selected depending on the type of quality problem. A more detailed explanation of these wizards is given in Task 4.1 "Implementation of the automatic tools". During this task, algorithms previously developed in the first part of the project and coded in several languages were translated into RapidMiner operators. These operators were later used to build the wizards that define the 'individual adapted' approach. In this task other tools and languages were still used to define the experiments and algorithms that were later integrated into the common framework and into the industrial environment. Migration to RapidMiner, selected as the common framework by the consortium, was a continuous job that started in this task but remained active in task 3.5 (Integration of the developed modules into the common framework), 4.1 (Implementation of the automatic tools) and 4.2 (Integration into the industrial environment). A detailed explanation of the algorithms used is given in section 2.3.10 (Task 3.3 Development of methods using 'individual adapted' approach). The following overview shows a summary:

Categories of quality problems and the corresponding algorithms:
- Comparison of two data collections: residuals from Self-Organising Maps
- Classification of good and bad products: Support Vector Machine (SVM)
- Automatic search of influencing variables / features: Support Vector Machine (SVM) and Multivariate Adaptive Regression Splines (MARS)

Two main requirements were defined for building the final operators. The first is that the operators should be auto-tuned, i.e. the user needs to know neither anything about the algorithm behind them nor how to select any algorithm-related parameter; the parameters necessary for the operator are calculated automatically.


The second is that the operators should use the full computation power of the computer by using all cores of its main processor, i.e. the code of the operators must be able to run on a multi-processor architecture. Data mining is a computationally intensive task, and the current evolution in processor development is not to increase the clock frequency but to increase the number of cores.
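A minimal sketch of this second requirement is given below, assuming a placeholder CandidateModel type whose evaluate() method scores one parameter setting; it distributes the evaluations over all available processor cores with a standard Java ExecutorService.

```java
import java.util.*;
import java.util.concurrent.*;

/** Illustrative multi-core evaluation of auto-tuning candidates (placeholder types). */
public class ParallelTuning {

    interface CandidateModel {
        double evaluate();      // e.g. cross-validated accuracy for one parameter setting
    }

    /** Evaluates all candidates in parallel and returns the index of the best one. */
    static int findBest(List<CandidateModel> candidates) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Future<Double>> scores = new ArrayList<>();
            for (CandidateModel c : candidates) {
                Callable<Double> task = c::evaluate;   // one task per parameter setting
                scores.add(pool.submit(task));
            }
            int best = 0;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < scores.size(); i++) {
                double s = scores.get(i).get();        // waits for the task to finish
                if (s > bestScore) { bestScore = s; best = i; }
            }
            return best;
        } finally {
            pool.shutdown();
        }
    }
}
```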

Task 3.4 Development of 'smart' components

SSSA, in collaboration with ILVA, developed different ways to visualise data mining results in an easy and comprehensible way, also for people who are not experts in such methods. Moreover, SSSA developed a wizard that guides users through the database querying process, hiding all the complexity.

Task 3.5 Integration of the developed modules into the common framework

As RapidMiner had been selected by the consortium as the common framework, ArcelorMittal carried out during this period the migration of different algorithms developed in several languages to this common architecture. The operators developed by ArcelorMittal can be used like normal RapidMiner operators and therefore be easily shared with the rest of the consortium. They can be composed with other operators to create the experiment that solves the problem; these compositions are built using the design interface capabilities of RapidMiner. Once the data mining expert has created the solution for a defined problem, anyone should be able to use it with their own data set. SSSA performed several tests in order to assess the integration of the common framework data mining engine (RapidMiner) with the developed software (ILVAMiner). ThyssenKrupp Nirosta and BFI have realised the 'brute force' approach as a RapidMiner process. RapidMiner is used as a library integrated into a servlet. This servlet runs inside a Tomcat application server, together with the AutoDiag database, on a dedicated server. The user interface is integrated into the existing statistics tool of ThyssenKrupp Nirosta called NiCo.

Task 3.6 First laboratory tests of the developed methods with 'real' data, assessment of the results

ArcelorMittal, in order to confirm the usefulness and the universality of the developed operators, has selected, besides some internal data, two populations well known in the data mining community. These populations are in the public domain, are commonly used to check the performance of different data mining algorithms, and have the advantage that they were previously analysed by recognised data mining experts. The results obtained after applying the developed operators to the test populations show a performance very similar to the current state of the art of algorithms applied to these populations. This gives confidence in the ability of the developed operators to extract knowledge from data and in the usefulness of the results obtained when they are applied to internal data. One more test was done to verify the quality of the developed method. A real problem from ArcelorMittal production was chosen, a two-class classification problem of good and bad quality. On this task several data mining experts had tried several algorithms to achieve the best classification performance and to detect relevant features. The automatic operators score among the top of all the algorithms tested. SSSA, with the help of ILVA expertise, developed a RapidMiner module which addresses the issue of understanding how some key process parameters (e.g. furnace temperatures and elongations) are influenced by setting certain production targets in terms of mechanical characteristics (e.g. yield strength Rp02 and tensile strength Rm). ThyssenKrupp and BFI have compared the results from the AutoDiag 'brute force' approach with results gathered by experts by means of a data mining tool. For that, a data set was extracted from the AutoDiag database at ThyssenKrupp Nirosta and transferred to BFI. Here the RapidMiner process was used on the one hand and the BFI tool DataDiagnose on the other hand. The results are comparable but not the same, which was expected due to the different nature of the methods (uni-variate linear against multivariate non-linear). But it can be stated that the results reached by the automatic AutoDiag RapidMiner process are of usable quality.


Task 4.1 Implementation of the automatic tools

ArcelorMittal has developed several operators to address three of the four categories of quality problems defined in Task 1.2. These operators are integrated in a customised version of RapidMiner developed by ArcelorMittal. This version starts by showing a template, i.e. a wizard that guides the user to select the kind of problem he wants to solve (based mainly on the categories defined in Task 1.2) and asks for the parameters to configure the experiment. These parameters are very simple and are not related to the algorithms behind the solution, because the operators were created with the requirement that they must be auto-tuned. The resulting tool is a wizard that guides a non-expert data mining user to solve a problem with a few very easy questions, most of them related to configuring the loading of the data, such as the name of the file or the name of the objective variable. The design capabilities are removed from this version, because the profile of the final user of this system is a person who does not have the knowledge to edit and improve the performance of an experiment that was defined by an expert. Nevertheless, it is not a closed box: if a new design has to be created to cover some specific problem, or a new wizard is created, they are easily included in the minimalistic version of RapidMiner by means of an XML file in which the new experiment/wizard is defined. In this task, SSSA focused its activities on the development of ILVAMiner, software that allows users to perform queries and data mining elaborations in an easy way. The software hides details about the particular Database Management System (DBMS) employed, or about which data mining engine is being used. Furthermore, it shows data mining results by means of 'smart' modules, which are tailored to each specific data mining elaboration and which help to interpret the results in the right way. ThyssenKrupp and BFI have implemented automatic tools for the following two groups: automatic data transfer, preparation and database maintenance on the one hand, and an automatic servlet-based data mining process on the other hand. The tools for both groups were implemented and tested, and they are running successfully.
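The listing below is a minimal sketch of this template idea: each quality-problem category is mapped to a stored experiment definition, so the wizard only has to ask for the data file and the objective variable. The class and file names are invented for illustration; in the real system the experiments and wizards are defined in RapidMiner XML files.

```java
import java.util.EnumMap;
import java.util.Map;

/** Illustrative mapping of quality-problem categories to pre-defined experiment templates. */
public class WizardTemplates {

    enum Category {
        COMPARISON_OF_TWO_DATA_COLLECTIONS,
        CLASSIFICATION_GOOD_BAD,
        SEARCH_OF_INFLUENCING_VARIABLES
    }

    /** Placeholder for a stored, auto-tuned experiment definition. */
    static class ProcessTemplate {
        final String processFile;
        ProcessTemplate(String processFile) { this.processFile = processFile; }
    }

    static final Map<Category, ProcessTemplate> TEMPLATES = new EnumMap<>(Category.class);
    static {
        // Hypothetical file names; in the real system these are RapidMiner XML process files.
        TEMPLATES.put(Category.COMPARISON_OF_TWO_DATA_COLLECTIONS, new ProcessTemplate("som_residuals.xml"));
        TEMPLATES.put(Category.CLASSIFICATION_GOOD_BAD, new ProcessTemplate("svm_classification.xml"));
        TEMPLATES.put(Category.SEARCH_OF_INFLUENCING_VARIABLES, new ProcessTemplate("mars_influence.xml"));
    }

    /** The wizard only needs the category, the data file and the objective variable name. */
    static String buildExperiment(Category category, String dataFile, String targetVariable) {
        ProcessTemplate t = TEMPLATES.get(category);
        return "run " + t.processFile + " on " + dataFile + " with label " + targetVariable;
    }
}
```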

Task 4.2 Integration into the industrial environment

Mytica is an industrial data viewer and the official tool to analyse the databases of the facilities in ArcelorMittal Asturias. Mytica is the software that production people use to consult the data stored from the production line and quality systems. If they want to do some studies they can export data to Excel and import them into their favourite data mining tool. The successful use of the operators developed inside the AutoDiag consortium depends on a good integration of them inside the Mytica software that people are used to. The first design of this integration was a deep integration, i.e. the Mytica core calls the core of RapidMiner as a library and shows the result inside the Mytica interface. This first approach required a lot of work and caused difficulties for the IT department; as the project progressed and the internal structure of RapidMiner and its capabilities became better known, a new way of integration was chosen. This new integration is simpler for the IT department, because it only requires few changes from their side, and it keeps the result presentation power of the RapidMiner system. The IT department has created a new installation package of Mytica which also includes the special installation of the RapidMiner version created by ArcelorMittal. When the user of Mytica exports data, the system asks whether he wants to do some data mining studies with them. If the answer is yes, the modified minimalistic version of RapidMiner starts and shows the wizards. ILVA and SSSA defined the mandatory steps for the setup of the ILVAMiner infrastructure and then successfully performed them at the ILVA plant in Novi Ligure. ThyssenKrupp and BFI have integrated the AutoDiag functionality into the existing software tool called NiCo. The time-consuming methods of the 'brute force' approach are handled by a servlet inside an application server on a dedicated high-performance server. This approach was selected to avoid additional network load on the one hand and to reduce the demands on the client computers on which the NiCo tool is running on the other hand. The AutoDiag database is also located here. Due to the direct link of the database and the servlet (which contains the common framework realised by using RapidMiner as a library) the expected calculation speed could nearly be reached. Only in cases


in which the user selects too many variables (>1,000) and/or too many examples (more than one year of data) does the system run into longer calculation times (e.g. >90 seconds), which was judged to be just acceptable.

Task 4.3 Briefing of target users and launch of the developed system

ArcelorMittal selected users with and without experience in the usage of Mytica to be briefed and trained in the usage of the minimalistic RapidMiner version. The feedback from these target users was used to improve and to adjust the developed wizards. The people were selected from different positions at different facilities, ranging from plant technicians to R&D researchers to IT people. There were people with knowledge in the field of data analysis but also people with few technical skills in data mining. As the people involved in the training were heterogeneous, various types of feedback were received, ranging from very technical data mining comments to points related to the interface and the usability of the software. All of these comments were of course very useful to improve the developed tools. SSSA produced a PowerPoint presentation for training potential users of ILVAMiner. It illustrates the main features of the software and its usage. After the development of the AutoDiag functionality and its integration into the NiCo tool at ThyssenKrupp Nirosta, the new functions were presented to a smaller group of users, the so-called "power users". During the briefing the several functionalities of the AutoDiag system were presented and discussed, and by means of a showcase the users gained their first experience during the course. After the course the AutoDiag modules were released to these users by means of the NiCo user management.

Task 5.1 Application to analysis of mechanical & technological properties

SSSA and ILVA developed a data mining elaboration for investigating the causes that may lead to sub-optimal process potential and capability indexes (Cp and Cpk). It exploits the Pareto dominance concept and a decision-tree model.
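For reference, the usual definitions of these indexes are Cp = (USL - LSL) / (6 * sigma) and Cpk = min(USL - mean, mean - LSL) / (3 * sigma). The short sketch below only illustrates these formulas with invented specification limits; it is not part of the SSSA/ILVA elaboration.

```java
/** Illustrative computation of process potential (Cp) and capability (Cpk) indexes. */
public class CapabilityIndexes {

    /** Cp = (USL - LSL) / (6 * sigma): potential, assuming a perfectly centred process. */
    static double cp(double lsl, double usl, double sigma) {
        return (usl - lsl) / (6.0 * sigma);
    }

    /** Cpk = min(USL - mean, mean - LSL) / (3 * sigma): penalises off-centre processes. */
    static double cpk(double lsl, double usl, double mean, double sigma) {
        return Math.min(usl - mean, mean - lsl) / (3.0 * sigma);
    }

    public static void main(String[] args) {
        // Invented specification limits and process statistics for a mechanical property.
        double lsl = 340.0, usl = 420.0, mean = 385.0, sigma = 10.0;
        System.out.printf("Cp = %.2f, Cpk = %.2f%n", cp(lsl, usl, sigma), cpk(lsl, usl, mean, sigma));
    }
}
```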

Task 5.2 Application to analysis of strip geometry and/or strip flatness

While trying to solve the different quality problems categorised in the first tasks of the project, it was realised that at least one of the operators should have regression capabilities. Strip geometry is a good example of a problem that needs regression behaviour in the solution. ArcelorMittal chose a real regression problem from one of its facilities, a problem that had been solved prior to the start of the AutoDiag project. This also shows the advantage of using the developed software instead of other, more generalist software. The tools originally used to solve this problem were analysis of variance (ANOVA) to select the influencing variables and a linear regression to adjust the final output. The ANOVA was done with commercial software managed by a statistical expert and required a detailed technical study. Now, with the developed template (search of influencing variables based on MARS), any user can reach better results and the same information by only filling in the name of the Excel file where the data are stored and the name of the variable under study.
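As a minimal illustration of the regression step (not of the MARS template itself), the following sketch fits an ordinary least-squares line y = a + b*x to a single influencing variable; in the real application several variables selected by the ANOVA were combined in a multiple linear regression. The example data are invented.

```java
/** Illustrative ordinary least-squares fit of y = a + b*x for one influencing variable. */
public class SimpleRegression {

    /** Returns {intercept a, slope b} minimising the sum of squared residuals. */
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;
        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
        }
        double b = sxy / sxx;          // slope
        double a = my - b * mx;        // intercept
        return new double[] { a, b };
    }

    public static void main(String[] args) {
        // Invented example: a strip geometry value (y) against one process variable (x).
        double[] x = { 1.0, 2.0, 3.0, 4.0, 5.0 };
        double[] y = { 2.1, 2.9, 4.2, 4.8, 6.1 };
        double[] ab = fit(x, y);
        System.out.printf("y = %.3f + %.3f * x%n", ab[0], ab[1]);
    }
}
```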

Task 5.3 Application to analysis of surface defects

ThyssenKrupp and BFI have investigated a typical surface defect ('open and closed shells') that is detected by an automatic surface inspection system (ASIS). Based on a data sample consisting of 761 variables and initially 16,081 data sets, the investigation was done by means of the system developed during the presented project. The defect information coming from the ASIS was aggregated to several values per coil; for the investigation the number of defects per coil is used. The application of the 'brute force' approach leads to a list of influencing variables ordered by their importance. The topmost variables were investigated graphically in more detail and the results were discussed with the process experts. The influencing variables that were found correspond with the experience of the process experts. Due to the capabilities of the used data mining methods only uni-variate and linear dependencies can be detected. The use of more sophisticated data mining methods is not possible here due to system response times that would not be acceptable to the target users.


Task 5.4 Evaluation of usability and tuning of the system

Based on the feedback from the ArcelorMittal users who received the training courses described in section 2.3.16, some modifications of the developed wizards were made. A more detailed explanation is given in section 2.3.20; a summary follows:

- Enable the option to use only one population to run the experiment. This data set is split into two subsets: one to train the model and the other to test it.
- Include a pre-processor module to perform basic removal of spurious and null data from the population.
- To reduce the time to run the experiment, a non-exhaustive optimum search can be selected.
- Create a new wizard to load a previously built model and test new data on it.

SSSA took into account some observations and suggestions made by the ILVA personnel who tested the software in order to enhance and improve the software usability. ThyssenKrupp and BFI gave a training course to a first group of users, the so-called "power users". These users are very familiar with the NiCo tool and have more experience with data analysis. They are also the first users to whom the AutoDiag functionality inside NiCo was released. Their experience points on the one hand to the typical small problems that appear when new software is introduced (e.g. unhandled exceptions, curious combinations of parameters, problems with the user interface). On the other hand they gave important hints on how to improve the usage (e.g. a better variable selection) or which additional functionality to add to make the software more valuable. The feedback from these users was and still is used to improve the software.

Task 5.5 Comparison of the different approaches

During the presented project two general strategies were investigated, the 'brute force' and the 'individual adapted' approach. The first one tries to investigate the dependencies of the target quality information on all variables of the relevant production chain. Here the philosophy is that very often the unexpected influences are the most important ones. Because of the number of variables this approach needs a lot of computing power, and for this reason simpler and faster data mining techniques were selected for the realisation. For the 'individual adapted' approach the experiences of the proposers from the last years were investigated, documented and used as the basis for the selection of proper data mining methods. Here the user has the possibility to make a specific selection of the variables from the processes which shall be investigated; he also has the chance to influence the data mining process. During this task the consortium discussed both strategies and the results reached with them, and balanced the advantages and disadvantages. The result of the discussion is as follows.

Different data environment

The effort to provide the necessary product quality and process data is very different. For the 'brute force' approach a large amount of data / variables has to be accessible by the system. Here the provision of the data during the implementation of the system is time consuming, especially if there is no common data source (e.g. a technical data warehouse), so that the data have to be gathered from different sources and have to be connected to the product. In contrast, for the 'individual adapted' approach only problem-specific data have to be prepared. When focussing on a specific problem, usually data from only a few production stages are necessary, which reduces the effort distinctly.

Different user knowledge necessary

The demands on the target users regarding specific knowledge are also very different. For the 'brute force' approach no detailed knowledge is necessary. The user selects the target and the input variables and starts the system. The result presentation is as easy as possible. For the 'individual adapted' approach the user has at least to assign the problem to a given solution. Here a more skilled user is necessary.




Different methods can be applied

The 'brute force' approach tries to incorporate as many process and product quality variables as possible. This leads to a large amount of data that has to be processed. To calculate the results in a reasonable time only 'simpler' methods can be used. In contrast, for the 'individual adapted' approach more complex data mining methods specially tailored to the data mining problem were used, which lead to more detailed results (see next topic).

Different result quality

As described above, for the 'brute force' approach only 'simpler' methods can be used. This leads to the fact that the results are less exact than they could be when using highly sophisticated multi-variate and non-linear data mining methods; at this point only more general hints can be expected. Being focussed on a specific problem, the individual solution can reach more reliable results. Specialised data mining methods that reach the best results for a specific problem are used for the 'individual adapted' approach. This benefit has the disadvantage that for each group of data mining problems an individual approach has to be developed and implemented.

Task 5.6 Determination of the transferability

One important demand on RFCS projects is the focus on generating results that can be used in the whole European steel industry, where applicable. For the presented project this was successfully realised. For the determination of the transferability the following can be stated:

Open framework that can be realised in any kind of steel industry (flat products, long products etc.).

The framework developed during the project is completely independent of the type of the steel producer. There are no methods or solutions that depend on the type of the steel product. The only necessity is the availability of data describing the product, which should nowadays be fulfilled throughout the European steel industry.

Individual interfaces to data supply as well as result visualisation are always necessary.

As for every software solution, individual adaptations are necessary when transferring the software to another steel production facility. For the software developed in the presented project, individual interfaces to the data supply as well as to the user interface were necessary. For both of them every steel producer has its own environment, which is usually very inhomogeneous due to the different ages of the several plants. Also, no standard exists e.g. for the access to process or product quality data. So an installation 'out of the box' will never be realisable.

Easy exchange of methods / operators is possible due to a company-independent core of the framework built with RapidMiner.

One major aim of the project was to hide the underlying data mining methods from the target users. This puts them into the position to use these methods without deeper knowledge regarding e.g. the necessary prerequisites for their application. The result is that these powerful methods can be distributed to a wider range of target users. Nevertheless, the implemented methods are defined using a standard tool and are stored in a common file format, so an exchange of these methods is very easy, independent of the individual implementation of the interfaces.

Availability of the sources opens a wider range for individual solutions.

The data mining tool RapidMiner is freely available including the source code. This puts the consortium into the position to adapt the software to the demands of the steel industry. Own modules were developed, like the template-based wizard shown above or individual learners. These developments could be started from available modules so that duplicate work could be avoided.

The different software techniques used have shown a wide range of realisation approaches that can be found in the European steel industry.

As described above, the IT environment of each steel producer is very individual, so a common solution of the whole system for every steel producer is not possible. During the project, different implementations based on the different IT environments of the industrial partners have shown the transferability of the developed system. It was shown that a client / server architecture is possible as well as the integration into an existing tool or a standalone application.

The software is open source, which minimises the costs for implementation and testing, but commercial maintenance is also available.

The main component of the developed common framework is based on a well-known and widely distributed open source data mining software. This puts an interested steel producer into the position to test the developed solution, e.g. at a pilot plant, with low financial effort. If the test leads to the expected results, the system can be rolled out to the whole site, covering all plants. For that, a commercial licence is also available, which provides professional support in the case of software faults.


2. Scientific and technical description of results

2.1 Objectives of the project

The objectives of the proposed project can be summarised by the following points:

Objective 1: Solutions for robust and problem-adapted data mining methods were developed, which can be applied very quickly by the potential users and which do not require special data mining knowledge. As many data mining steps as possible have to run automatically.

Objective 2: Therefore all relevant quality problems of steel production were divided into different categories, and for each category adapted solutions were realised, into which the partners' experience from many data mining investigations was integrated.

Objective 3: The developed solutions were integrated into a general, open framework which can easily be expanded and in which a possibility to store experiences and results is integrated.

Objective 4: The developed system was implemented at the factories of the steel producers of the consortium.

Objective 5: Typical product quality problems of flat steel production were investigated exemplarily: surface defects, insufficient material properties and strip geometry deviations.

Objective 6: To guarantee the transferability, these application examples were taken from three different production lines: stainless steel, tin plate and automotive steel.

Objective 7: The automatically generated results were compared with results reached manually by data mining experts.

Objective 8: The feedback from the target users at the plants was used for system validation and system tuning.

The overall aim of the project is a robust, easy-to-use and automatic data mining solution which supports process and quality engineers in the task of diagnosing cause-and-effect relationships.


2.2 Comparison of initially planned activities and work accomplished

The main objectives of the project have been achieved. A detailed explanation is given in the following Table 1.

Objective 1: All the algorithms, operators, software, etc. developed by the members of the consortium were designed to hide the data mining techniques and parameters used as much as possible, so that they can be applied by users who do not have special data mining knowledge. The application of the software at the sites of the industrial partners and the response of the target users have successfully shown the usability and the acceptance of the developed solutions.

Objective 2: The quality problems have been categorised into four main classes, as described in section 2.3.2 (Task 1.2 Categorisation of quality problems regarding the data analysis modalities). Solutions for the different categories were implemented and installed at the industrial sites.

Objective 3: RapidMiner has been chosen as the core of the framework and has shown great flexibility to be integrated into different architectures: client / server, integration into an existing tool, and standalone application.

Objective 4: The developed system is implemented and running in the pilot plants / existing software environments defined by the members of the consortium. After a training lesson, the final users used the developed systems during the project.

Objective 5: The range of data sets selected and studied covers the scope of this objective. The different quality problems were investigated by different partners using different approaches.

Objective 6: The transferability of the project is guaranteed not only because the pilot plants are from different production lines, but also because of the success achieved in applying the common framework in different architectures. The RapidMiner processes can be easily exchanged, independent of the different IT environments.

Objective 7: The results generated by the developed tools have been compared with results from other data mining tools used by data mining experts, with the judgement of the process experts from whose plants the data sets come, and/or with previous studies done with the same data. This proves that the automatic results are comparable to a manual process carried out by an expert.

Objective 8: As shown in section 2.3.20, the feedback from the users has been used to improve and to validate the system.

Table 1: Comments on the achieved objectives

Due to the economic crisis during the project, some of the plants of the industrial partners, from which the data were used for the development of the methods and the test of the system, were not always available because of operational downtimes in 2009. This led to changes in the order of the work packages against the initial work plan and to some delay. By means of increased manpower of the involved partners the delay was caught up and the project was finished in time. For the test of the developed system several data mining application examples were carried out. Due to the availability of a surface inspection measuring device (SIS), the investigation target of BFI and TKLNR (application to analysis of mechanical and technological properties) was exchanged with that of SSSA and ILVA (application to analysis of surface defects).

2.3 Description of activities and discussion

2.3.1 Task 1.1 Summary of data analysis methods oriented to quality problems

In many past ECSC/RFCS projects data mining was used to analyse relationships between product quality and data measured in the steel production chain. The members of the consortium of the presented project were involved in several projects with data mining aspects, from which they gathered a lot of experience in the application of these methods in the steel industry. During this task, finished and running projects were analysed regarding the investigated quality problems and the applied data mining methods. The investigated projects are shown in the following Table 2.


QDB (ECSC 7210-PR/171): Implementation of an assessment and analysing system for the utilization of a factory-wide product quality database
FACTMON (RFCS RFS-CR-03041): Factory-wide and quality related production monitoring by data-warehouse exploitation
DAFME (ECSC 7210-PR/342): Improvement of quality management in cold rolling and finishing area by combination of failure mode and effect analysis with database approaches
SOFDETECT (RFCS RFS-CT-04017): Intelligent soft-sensor technology and automatic model-based diagnosis for improved quality, control and maintenance of mill production lines
OLPREM (ECSC 7210-PR/292): On-line prediction of the mechanical properties of hot rolled strips
IMGALVA (RFCS RFS-CR-04023): Investigation, modelling and control of the influence of the process route on steel strip technological parameters and coating appearance after hot dip galvanising
HIGHPICK (RFCS RFS-CT-2005-00021): Optimised Productivity and Quality by On-line Control of Pickled Surface
WACOOL (RFCS RFSR-CT-2005-00017): Width-adaptable optimized controlled-cooling systems (WACOOLs) for the production of innovative Advanced High Strength Steel grades and the study of strip shape changes while cooling
BORON (ECSC 7210-PR/355): Optimisation of the influence of Boron on the properties of steel

Table 2: Analysed previous projects with data mining aspects

A detailed summary of the analysis, including the involvement of the individual partners, is shown in annex 3.1 on page 103. The data mining methods used for one-dimensional and two-dimensional quality problems are summarised in the following Table 3.

SOFDETECT, DAFME, HIGHPICK

Decision trees IMGALVA, DAFME, FACTMON, HIGHPICK IMGALVA, QDB

Neural Networks IMGALVA, OLPREM, FACTMON, WACOOL

Genetic Algorithms

IMGALVA

QDB

Correlation

Lazy LBK

DAFME, FACTMON

IMGALVA

IMGALVA

Table 3: Data mining methods applied in previous projects

It can be ascertained that decision trees seem to be more suitable than neural networks. The high transparency of a decision tree is very advantageous: the evaluation of the tree result is a very comprehensible procedure, and the most significant attributes are easily identifiable, which is important especially with respect to poor data quality.


2.3.2 Task 1.2 Categorisation of quality problems regarding the data analysis modalities

The project partners have demonstrated their experience in the analysis of data mining problems in many RFCS projects, as shown in the previous section. For the categorisation of the quality problems the following classes were defined (Table 4):

Development over time: Here a typical question is: "Are we getting better or not?" The aim of a steel producer is to increase the quality of the product, so the several processes have to be adapted to increase the product quality. By means of the development of quality features over time the result of these adaptations can be verified. Another scenario is the early detection of trends in the product quality: when some relevant quality features are analysed over time and a trend is detected, the necessary actions can be taken before some of the products have to be discharged.

Comparison of two data collections: Here the typical problem is the following: this month there is a high occurrence of a quality problem which was not detected last month. So the question is: are there some changes in the production process? This can be investigated by means of the comparison of two data collections.

Classification of good and bad products: At different stages of the production the decision often has to be made whether the product is of sufficient quality to meet the customer's claims. Here a grading of the product has to be done by means of the analysis of several quality criteria. Later on these results can be used to support the plant personnel in making the decision to apply rework or to downgrade the product. A prerequisite for the classification is reliable quality information about the product. Here a single value, e.g. the result of a manual inspection, is not sufficient. Other information like the product type, the target quality or the existence of another failure has to be taken into account because it influences the detection of failures (e.g. if one critical failure was detected, the product will not be further inspected, so that the absence of a recorded failure is not equal to the fact that the product did not have this specific failure).

Automatic search of influencing variables / features: Due to the increased needs of the customers regarding the product quality and for the reduction of the production costs it is necessary to find out what the reasons for the appearance of quality variations are. With the detection of the influencing variables the steel producer is in the position to adapt the production process to avoid the appearance of quality deviations. This will lead to increased product quality and less scrap.

Table 4: Categories related to the quality problem

In work package 5 some typical quality problems of flat steel production were investigated by means of the tools developed during this project. The selection of the investigated quality problems was based on the categories shown above.

2.3.3 Task 1.3 Definition and selection of a common framework

The common framework to be developed in this project was installed at the different plants of the industrial partners of the project. Here, different environments have to be taken into consideration. The existing data sources are of different structure, as is the software environment used for user interaction, parameterisation and result presentation. So the approach is to define a generic framework with an analysis task (or kernel) which can be integrated into the different environments by means of individual interfaces. To avoid the costs arising when using commercial tools (in the case of SPSS Clementine in the range of EUR 100,000 for an industrial licence) and to be able to adapt and integrate the data mining functions to the special needs of flat steel production, it was decided to use open source software. These tools are maintained by a large community, and new algorithms and techniques are added every day. The main structure of the framework and its integration into the plants is summarised in Figure 1.

Figure 1: Integration of the common framework into the industrial environment

With that approach the generic common framework could be developed during this project using different software tools and existing libraries. The individual interfaces were developed by the industrial partners, so the confidential part of the data storage can be hidden from the common investigations. As discussed during several project meetings, the consortium decided to use RapidMiner as the main part of the common framework (see also chapter 2.3.3 on page 18). RapidMiner can be used by the process and data mining expert as an interactive tool which enables one to investigate several problems by means of specially implemented methods. The results of the experts' work are solutions, which are stored in RapidMiner project files (XML files). These files contain the special knowledge of the experts and the necessary pre-processing and data mining steps, and they can be easily exchanged between the partners of the consortium, independent of the different implementations. The project files were transferred to the part of the AutoDiag system that is used by the target users. They only have to select the type of problem and to define a data set. Then the RapidMiner library is called, which loads the data on the one hand and the project file on the other hand. After the calculation the results are presented to the user. The following Figure 2 shows the structure of the common framework.

Figure 2: Structure of the common framework

This concept has the following advantages:

- The interactive tool for the development of the different approaches ('brute force' and 'individual adapted') already exists and is well supported.
- The developed solutions are stored in a common format (XML) that can be easily exchanged.
- The resulting AutoDiag system can easily be extended to new types of investigations or to newly available data.
- The necessary expert knowledge regarding the data mining is completely hidden from the target user.

For the realisation of the common framework at ArcelorMittal Espana the integration into an existing data visualisation tool was chosen. This tool, previously only used for the visualisation of data, was extended to be able to call RapidMiner and to use all the plug-ins that were developed during the project. At ThyssenKrupp Nirosta the AutoDiag system shall be integrated into an existing data exploration tool. This tool is used by a lot of people in the company. Therefore it was decided to build a client-server structure that makes the new functionality available in the whole of ThyssenKrupp Nirosta via the intranet. The server uses the well-known Tomcat application server, and the RapidMiner library is called by a servlet. This server is located on the same machine on which the database is running, so maximum speed for the data access is granted. In agreement with the other partners, SSSA and ILVA adopted the RapidMiner project file format as the common framework for this project. RapidMiner is a complete and well documented data mining tool that provides both a comprehensive Java library and a graphical user interface. Its project files are formatted in XML: each file represents a chain of operators (pre-processing, data mining models, validation, etc.) that is applied in sequence to input data. These files can be edited "by hand" in a text editor or they can be built by means of the graphical user interface. Nevertheless, RapidMiner requires a certain knowledge of data mining principles and techniques and generally can be used only by experts. On the other side, the aim of this project is to automate the data mining process in steel making industries by making it transparent to the final user. SSSA developed software (ILVAMiner) whose aim is to hide the complexity of RapidMiner by implementing a very user-friendly interface and by automating as much as possible common and frequently used data analysis procedures. In such a way data mining experts can prepare RapidMiner processes by means of its GUI that can subsequently and transparently be used by means of ILVAMiner.
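The following sketch indicates how such a servlet-based embedding could look, assuming the RapidMiner 5 Java API (RapidMiner.init(), com.rapidminer.Process, IOContainer); it is a simplified illustration, not the NiCo/AutoDiag servlet itself, and the process file path is hypothetical.

```java
import java.io.File;
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import com.rapidminer.Process;
import com.rapidminer.RapidMiner;
import com.rapidminer.operator.IOContainer;

/** Simplified sketch of a servlet that runs a stored RapidMiner process (assumed RapidMiner 5 API). */
public class AutoDiagServlet extends HttpServlet {

    @Override
    public void init() throws ServletException {
        // Initialise the embedded RapidMiner library once per servlet instance.
        RapidMiner.setExecutionMode(RapidMiner.ExecutionMode.EMBEDDED_WITHOUT_UI);
        RapidMiner.init();
    }

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        try {
            // The XML process file contains the data mining steps prepared by the experts.
            Process process = new Process(new File("/opt/autodiag/processes/brute_force.xml"));
            IOContainer results = process.run();
            resp.setContentType("text/plain");
            resp.getWriter().println(results.toString());
        } catch (Exception e) {
            resp.sendError(HttpServletResponse.SC_INTERNAL_SERVER_ERROR, e.getMessage());
        }
    }
}
```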

Task 1.4 Analysis of solution strategies of existing tools

The following Figure 3 shows a poll from KDnuggets™ [8] in which respondents answered the question of which data mining tools they had used for a real project (not just an evaluation) in the past 6 months. As can be seen, the free/open-source tools are very popular and close to the commercial tools. RapidMiner is the leader in the group of open-source tools.


Figure 3: Poll on data mining tools used for real projects (May 2008, source: http://www.kdnuggets.com/). The chart shows the number of votes per tool, separated into free/open-source and commercial software and into 'tool selected as one among several' versus 'tool selected alone'.

The principal open-source tools used are RapidMiner, R and Weka. RapidMiner can also call Weka algorithms. Comparing Weka and R [9]: both are prominent open-source software systems for analytics and originate from academia, but they have different goals and focus. While R comes from the statistics community and is a general-purpose environment for statistical analysis, Weka's origin is in computer science and it was designed specifically for machine learning and data mining. In Table 39 (see annex 3.3 on page 117) a comparison between Weka and R is shown. The comparison is based on the main features related to:

• Data import,
• Data exploration and visualisation,
• Data preparation,
• Modelling,
• Evaluation and deployment.

The main conclusion is that Weka has better support for machine learning and data mining, but the Weka algorithms can be called from R: [10] suggested using the R package RWeka [11] in order to combine the different sets of tools from both environments into a single unified system. RWeka can also be used to get the full functionality of R in Weka [12], [13]. The commercial tools usually need a data mining expert as the user. These tools offer many methods of different categories (neural networks, decision trees, etc.), but the methods usually have to be parameterised, which can only be done if the user knows the methodology in detail. A wrong parameter (e.g. a small data set used to train a large neural network) can lead to results that are misinterpreted. This contradicts the approach used during the project, namely to hide as much of the methodology as possible from the user. So the decision was to use freely available libraries, which can be seamlessly integrated into the common framework and whose parameters can be pre-defined according to the category of quality problem to be investigated.


SSSA reviewed some data mining tools and software libraries in order to understand whether they could be used in the software to be developed. In order to narrow down the number of possible choices, SSSA focused mainly on libraries embeddable in a C++ project. A list of possible candidates is presented hereafter:

• Orange: a component-based data mining software library. It includes a range of pre-processing, modelling and data exploration techniques. It is based on C++ components that are accessed either directly, through Python scripts, or through GUI objects called Orange Widgets. It is freely available under the GPL licence. (http://orange.biolab.si/)
• Data Mining Template Library (DMTL): an open-source, high-performance, generic data mining toolkit written in C++. It provides a collection of generic algorithms and data structures for mining increasingly complex and informative pattern types. (http://dmtl.sourceforge.net/)
• Weka: a freely available (GPL licence) collection of machine learning algorithms for data mining tasks. It is a well-known tool that provides both a graphical user interface and a library written in Java. It is possible to call Java routines from C++ software by means of the Java Native Interface (JNI). (http://www.cs.waikato.ac.nz/ml/weka/)
• RapidMiner: a freely available and open-source tool for data mining. It provides both an intuitive graphical user interface and a library. It is written in Java and integrates the Weka routines as well as the R environment (The R Project for Statistical Computing - http://www.r-project.org/). These features make RapidMiner one of the most complete and comprehensive freely available data mining tools.

Discussion

During this task three main tools for data mining were analysed: R, Weka and RapidMiner. Other tools like Orange, Tanagra or KNIME are very similar to RapidMiner, but do not have the same distribution (see Figure 65 on page 116). As stated above, Weka and R both originate from academia but have a different focus: R is a general-purpose environment and language (like MATLAB) for statistical computing and graphics, so its data mining functionality has to be developed or added as packages, whereas Weka was designed specifically for machine learning and data mining. RapidMiner is also an open-source system and was designed for knowledge discovery and data mining. Weka is somewhat similar to RapidMiner: it has quite a large number of components and is relatively simple to use, but it is not able to perform all the functions that are available in RapidMiner. With RapidMiner the user is able to store the whole analysis process (chain of operators) in one XML file. Furthermore this tool can be used (like Weka) as a library for the development of own applications. Due to the fact that RapidMiner has incorporated most of the algorithms of Weka and Weka is able to call R, RapidMiner seems to be the best choice for the project, and a fast realisation of a running AutoDiag system can be expected.

2.3.5 Task 1.5 Analysis of software & hardware requirements

ArcelorMittal wanted to link RapidMiner with the existing database viewer that is used in AM Asturias to show the data from all the factories, which are stored in a central database. This standalone application, called Mytica, can be installed on any computer and is used to show the data. It was originally only a viewer, and it has been modified to include/integrate RapidMiner in order to obtain data mining capabilities and to include the plug-ins that were developed during AUTODIAG. Figure 4 shows the schema of Mytica.


Figure 4: Schema of the database viewer Mytica in ArcelorMittal Asturias

As can be seen in Figure 4, Mytica gives access to data from all facilities of the Asturias site, which are stored in a central database. The software can be installed on every computer; it only requires that the user has an account in the Mytica database. It has several modules. First the user uses the signal selection module to define the range of products and the list of signals he wants to see. This action generates an SQL query that retrieves those data from the database. The data are shown by means of different modules, depending on their nature; there are:

• a graph module,
• a grid data view module,
• a traceability module and
• a Surface Inspection System module.

There is also an export data module that can export data to different formats. During this task ArcelorMittal defined the requirements and the necessary research and work which had to be done to insert/integrate the RapidMiner plug-ins inside Mytica. Two different approaches have been studied. One is based on a deep integration of RapidMiner inside the C# code of the Mytica software. This requires the use of a Java wrapper (RapidMiner is coded in Java) to be called from C#; Figure 5 shows the schema of this integration using JNI [7]. Mytica directly calls the operators designed in RapidMiner to process the data and visualise the solution. The second approach uses the export module of Mytica to send the data to an external installation of RapidMiner that does the entire data mining processing task and shows the result. Both strategies were tested and evaluated during the project, and the second approach was selected to be integrated in the industrial environment; see chapter 2.3.15 on page 69 for more details.


Figure 5: Using JNI for Java-to-C# interface, taken from [7]

At ThyssenKrupp Nirosta a tool for the statistical analysis of data, developed in-house, already exists. This tool is called NiCo (Nirosta Cockpit). It shall be used to integrate the functionality for the automatic data mining by means of the 'brute force' approach. The advantage is that the potential target users are already familiar with this tool and no additional software application needs to be introduced. So for Task 4.3 (see page 73), the briefing can be shortened to the new data mining functionality. NiCo is developed in Java, so the requirement for the development of the user interface and for the data mining functionality is to use Java. One of the aims of this project is to hide the underlying data mining techniques from the target users. They shall focus on the problem to be investigated and not on the methods used and the parameters that have to be set to use these methods. The configuration of the data mining procedures is realised by only a few experts who are familiar with the available data as well as with the proper application of statistical computations. Nevertheless it is necessary to check some requirements or to set some parameters. Here the user is supported by means of a wizard [14]. This is a software technique that guides a user through several sequential dialogues to realise ergonomic data input. Usually the user is asked a naturally formulated question and several answers are predefined (like a multiple choice). Depending on the selected answer, a parameter is set or a selection is made. Furthermore, the following steps can depend on the given answers. For the AutoDiag project an open-source wizard is available which is completely realised in Java (see [15]). The analysis steps depend on the selections made by the user and, for the shown example, are as follows:

1. Selection of the analysis scope
2. Selection of the features (or variables)
3. Definition of a data filter
4. Visualisation

The analysis methods (data visualisation, development of a variable over time, comparison of two data samples and investigation of variables influencing the product quality) are described in more detail in chapter 2.3.9 on page 34. One important criterion for the acceptance of the new AutoDiag functionalities by the target user is the response time. If the system is not able to present the results in an acceptable time the user will reject the system. So it was decided to install a dedicated high-performance server system for the AutoDiag project. The technical details of the server are:

• 4 processors
• 8 gigabyte main memory
• 300 gigabyte hard disk


The server also offers the possibility to enhance the hardware (more processors and more memory). To make the server scalable on the software side as well, a virtualisation of the server was used. By means of a VMWare server [16], the operating system (OpenSuse, see [17]) runs as a virtual machine. This has the advantage that, in case of insufficient resources of the application or the database, another virtual machine can easily be installed on the same hardware by duplicating the existing one. This can be repeated until the hardware resources are exhausted. The existing software tool NiCo (see above) is a Java rich client application [18]. This means that the software is downloaded to the client's computer and executed there. For the AutoDiag project this would have the consequence that all data also have to be downloaded (high network load) and that the client computer is blocked until the data mining task is finished. Hence it was decided to use servlet technology for the implementation of the data mining functionality, by means of an Apache Tomcat application server [19]. Here the data are only transferred between the data mart (HSQLDB) and the servlet. Because both are on the same machine, the transfer speed is maximised. Following the decision of the project consortium, the open-source software RapidMiner [20] is used to realise the 'brute force' approach. RapidMiner itself is used as a library that is invoked by the servlet. The different data mining approaches are stored in RapidMiner project files (XML files). These files were generated by the data mining experts from ThyssenKrupp Nirosta together with BFI. This way, the deeper knowledge of data mining is distributed to the process experts at the plant. The resulting hardware and software structure is shown in the following Figure 6.
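Before the figure, a minimal sketch shows how such a servlet could delegate a request to the RapidMiner library. The request parameter names, the directory of stored project files and the helper class AutoDiagService are hypothetical placeholders; only the servlet API itself is standard.

```java
import java.io.File;
import java.io.IOException;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Minimal sketch of an AutoDiag servlet running inside Tomcat.
public class AutoDiagServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // The NiCo client only transmits the type of problem and a data selection;
        // the data themselves stay on the server, next to the HSQLDB data mart.
        String problemType = request.getParameter("problemType"); // hypothetical parameter
        String dataFilter  = request.getParameter("dataFilter");  // hypothetical parameter

        try {
            // Each problem type maps to one expert-defined RapidMiner project file (path is illustrative).
            File processFile = new File("/opt/autodiag/processes", problemType + ".xml");

            // Hypothetical helper: loads the data from the data mart, runs the process
            // (see the RapidMiner embedding sketch above) and renders the result.
            String resultAsHtml = AutoDiagService.runAndRender(processFile, dataFilter);

            response.setContentType("text/html");
            response.getWriter().write(resultAsHtml);
        } catch (Exception e) {
            response.sendError(HttpServletResponse.SC_INTERNAL_SERVER_ERROR, e.getMessage());
        }
    }
}
```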

Figure 6: Hardware and software structure realised at ThyssenKrupp Nirosta (the NiCo clients and the TDW connect via the intranet to the AutoDiag high-performance server, which runs a virtual machine server (VMWare) with an OpenSuse Linux virtual machine hosting the in-house ETL, the data mart (HSQLDB), the application server (Tomcat) and the common framework (RapidMiner))

In order to have a continuously updated database available, a dedicated acquisition system has been implemented at ILVA. The previous IT environment has been improved to fulfil the requirements for a daily data acquisition. The data acquisition is structured as follows (Figure 7): once a day, an automatic routine checks the system for the coils which have been set ready for expedition the day before; this procedure ensures that the considered coils have finished their process route, so that no further processing will be done.


This routine is performed from a dedicated workstation, and its operation is checked manually every day. The data are transferred from the workstation to the AUTODIAG_DB (one record per coil) located on the server, which is equipped with MySQL to ease the data management. Both ILVA and SSSA have access to these data from the net, via secure access to the ILVA server. The hardware requirements for the server are the following:

• good and stable network connection;
• adequate hard disk capacity to host process data for a minimum of one year of production;
• good processing power to perform complex queries.

Figure 7: Updated data acquisition scheme, totally based on ILVA Novi Ligure IT environment

2.3.6 Task 2.1 Enlargement of databases & data acquisition systems

Although the Mytica database has been working for years, during the AutoDiag project ArcelorMittal Asturias had to improve the tracking and traceability functions of the software. The databases also had to be enlarged, and the IBA data acquisition [1] was integrated. It has its own file storage system and needs a database module called ibaAnalyzer-DB-Extraktor to interact through SQL statements and to integrate these data into the Mytica database. At ThyssenKrupp Nirosta the work done together with BFI can be described separately for the database and the data acquisition system. This is done in the following chapters.

2.3.6.1 Database

At ThyssenKrupp Nirosta a Technical Data Warehouse (TDW) was established several years ago. This TDW was realised by means of a large ORACLE database which contains data collected from the production facilities located in Dillenburg, Krefeld, Bochum and Benrath. In the TDW, data from all production steps (melt shop, casting, hot and cold rolling, (bright) annealing, temper rolling and finishing line) are stored with a history of more than 5 years. Furthermore, data from a Production Planning System (PPS) are available. The material tracking is done via a history table in which all production steps of a particular piece are stored. The given data environment at ThyssenKrupp Nirosta is shown in the following Figure 8.


Figure 8: Scheme of the data environment at ThyssenKrupp Nirosta

The data sources shown in the figure above contain all data necessary for the project. However, the structure of these data is not ideal for realising data mining tasks, and the indices of the different tables are not set in an ideal way for the project purposes. First tests have shown that there are performance problems when accessing the TDW directly. Furthermore, there are limitations of the ORACLE database for the selection of a large number of variables; the limit of ORACLE is 512 variables in a select statement. In task 3.2 ThyssenKrupp Nirosta together with BFI investigated the 'brute force' approach for the data mining, where the idea is to select all available variables for the first step of the investigations. Due to the number of production steps and the number of available variables, this limitation of ORACLE made it necessary to build a dedicated data mart for the AutoDiag project. As the database management system for the AutoDiag system, the database software HSQLDB was selected [21]. This Relational Database Management System (RDBMS) has the following advantages which are important for AutoDiag:

• Large tables possible (only limited by the available server main memory)
• Unlimited number of variables in result sets
• Possibility of memory-resident tables (for faster access)
• Completely realised in Java

The resulting structure of the data acquisition system for the AutoDiag project is shown in the following Figure 9. The transfer module is described in the next chapter.

Figure 9: Scheme of the data acquisition system

For the data model of the data mart the star schema is used [22]. Here the priority is not the normalisation of the data but the efficiency of read operations.


The main application of this data model is in data warehouses or OLAP applications. The model consists of one (or more) facts table(s) and several dimension tables. The facts table contains foreign keys to the dimension tables, which allow a fast concatenation of the data. For the AutoDiag project the facts table (here: Master-Key-Table) contains, besides the foreign keys, all data which will be used for the data selection during the 'brute force' approach (e.g. dimension, steel grade, production date, customer, production path). Each dimension table contains the data from one of the production steps, as shown in the following Figure 10.

Figure 10: Data model used for the data mart (star schema)

The Master-Key-Table is stored completely in the main memory of the computer system used for AutoDiag. In this way a very good response time for the provision of the data for the 'brute force' approach was reached.

2.3.6.2 Data acquisition system

The data acquisition system for the data mart was realised as one transfer task, which is cyclically started by a so-called cron job (the scheduler of the underlying UNIX/Linux operating system), as shown in Figure 9. The work done by the transfer task is the following:

• Detection of new data
• Determination of the parent piece and assignment of the keys
• Update of the data of the Master-Key-Table
• Aggregation of the data (e.g. time- or length-based data to piece-related data)
• Calculation of additional features (e.g. error rate per metre for surface defects)
• Addition of the feature data to the dimension tables

The transfer task is completely parameterisable by several tables that are also stored in the HSQLDB data mart. This way, changes in the source TDW can easily be adjusted in the loading operations. These tables contain the following information:

• Full path to the source table
• The name of the target table (dimension table)
• Names of the key fields (to the Master-Key-Table and the dimension table, and also the key to the parent piece)
• Mapping of the particularly cryptic table field names to 'human readable' variable names
• The aggregation method (e.g. mean, max, min, standard deviation value)
• The calculation rules for the additional features


This information is provided for each production step as well as for each source table. At the end of the project, the table for the job definition consisted of 65 entries. This means that the transfer task of the data acquisition system of ThyssenKrupp Nirosta is divided into 65 sub-tasks which are executed sequentially along the production chain. The transfer task is started every two hours.

In the first stage of the project, while the AUTODIAG_DB of ILVA was being enlarged with the described acquisition procedure, an offline database which reflects the structure of the final database was created to allow the beginning of the software development. For this database, the production period from January to April 2009 was considered. To facilitate the data acquisition process, only a few variables from the hot side of the steelmaking process were considered (hot rolled coil ID and heat chemical analysis). In this manner, all the data flow could be managed from inside the ILVA Novi Ligure plant, facilitating the implementation of the data acquisition system itself, as well as its tuning and maintenance. With reference to the cold side of the process, many variables have been selected for the main possible process routes, whose flow is shown below (Table 5), where:

CAPL = Continuous Annealing Process Line
BA/SPL = Batch Annealing / Skin Pass Line
HDGL = Hot Dip Galvanizing Line
EGL = Electro Galvanizing Line

Pickling / cold rolling
• uncoated product: CAPL (followed by laboratory testing) or BA/SPL (followed by laboratory testing)
• coated product: HDGL (followed by laboratory testing) or EGL (followed by laboratory testing)
Finishing
Shipment

Table 5: Schematic material flow at ILVA Novi Ligure

Each record of a coil contains many entries which can be divided into 6 classes:

• ID data (coil ID to ensure the material traceability, steel grade, planned thermal cycle, surface aspect and quality, etc.)
• Common process variables (pickling time, cold rolling main parameters)
• Specific process route variables (e.g. for HDGL: cleaning section data, furnace and strip temperatures, line speed, air blade, skin pass and tension leveller operating parameters, measured coating, etc.)
• Laboratory testing (tensile test, roughness, hardness, etc.)
• Quality control data (verified defects, decisions about the coil, such as deviation, suspension or scrapping)
• Data from an Automatic Surface Inspection System (ASIS, from Parsytec), very simplified: only the number of certain defects is considered

Some remarks have to be made about the Parsytec data: since the Surface Inspection System network is not yet fully connected with the IT management system of the plant, a more sophisticated data treatment is not available as an immediate action. So far, advanced data treatment is available only by means of periodic batch activity. Nevertheless, although it is known that the simple variable "number of defects" is not enough by itself to describe the strip surface quality, it is already available, and it was decided to consider it in order to check if and how it could be exploited for data mining investigations.


In Figure 11 the Entity Relationship (E-R) diagram of the AUTODIAG_DB developed at ILVA is shown. It shows which entities are currently imported into the AUTODIAG_DB and their relationships. The main entities are:

• Cast: the information about casts is limited to the chemical analyses
• Hot rolled coil: contains information about traceability
• "Parent" coil: a semi-finished product which can have a different ID from the finished product; it has several attributes describing product features
• "Child" coil: represents the finished product
• Parsytec: contains information about the Parsytec analyses for each coil; there are two Parsytec systems, one at the end of the cold rolling process and one at the end of the zinc coating process

On the other hand, the involved relationships are:

• A certain number of hot rolled coils are produced from a single cast.
• The pickling and cold rolling processes produce a number of coils called "fathers"; this relationship associates a hot rolled coil, a Parsytec analysis and a "father" coil.
• The finishing process produces one or more "children" from a father coil.

In Figure 11 the detailed schema of AUTODIAG_DB keys is shown.

Figure 11: Entity-Relationship diagram of the ILVA AutodiagDB

Process data are stored mainly in an AS400 mainframe system: at the moment, queries can be performed on this system at ILVA and the results are returned in text format. On the other hand, the Hot Dip Galvanizing Line (HDGL) is equipped with a local server which stores the data regarding that process. SSSA developed a data importer tool which can easily be configured by means of simple INI files (text files composed of key=value statements). By means of these configuration files it is possible to specify where a certain piece of information is collected (position in a text file or cell in an Excel sheet) and which is the destination table and field in the MySQL database. This software can be executed periodically in order to


transfer data from the format returned by the AS400 and the local HDG line server to the AUTODIAG_DB, managed by MySQL.
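To make the key=value configuration style tangible, a hypothetical example of one import job is sketched below. The section name, the key names and all values are illustrative assumptions; the actual configuration files of the importer are not reproduced in this report.

```ini
; Hypothetical example of one import job for the SSSA data importer.
; Each key=value pair tells the tool where to read a value and where to store it.
[HDGL_COATING_WEIGHT]
source_type  = excel              ; text | excel
source_file  = hdgl_daily.xls
source_cell  = C12                ; cell for Excel sources, position/offset for text files
target_table = hdgl_process       ; destination table in the MySQL AUTODIAG_DB
target_field = coating_weight     ; destination field in that table
```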

2.3.7 Task 2.2 Start of data acquisition

ArcelorMittal selected some quality problems that shall be investigated using data mining technologies. Even though the Mytica database has been working for years, the quality problems selected are recent. ArcelorMittal has three cases that are candidates to be studied under the categories of quality problems defined in task 1.2; these are:

• Comparison of two data collections,
• Classification of good and bad products,
• Automatic search of influencing variables / features.

These cases are used as a model to test different algorithms and data mining techniques inside the 'individual adapted' approach, as can be seen in task 3.3. At ThyssenKrupp Nirosta the data acquisition was started according to the work plan. The start date and time used for the acquisition was set to the beginning of the business year 2007/2008, which is the 1st of October 2007. An additional month was used to ensure the completeness of the data for each product piece (e.g. heat, slab and coil). This ensures that the data of all pieces are complete, even if some of the process steps were performed before the starting date. Finally, data starting from September 2007 are available for the further development of the 'brute force' approach. The data are updated in a two-hour cycle. During the remaining semesters the data mart was continuously enlarged. Data from more production steps (e.g. continuous annealing, batch annealing, cold rolling, temper rolling) as well as from the plants of the different branches were integrated. At the end of the project nearly 3,500 features are available, which are not in every case filled with data (e.g. depending on the production path or the material).

The ILVA daily data acquisition activity has been planned so as not to interfere with the existing planned IT activities. Depending on the number of coils set ready for expedition the day before, it collects data from many of the machines and interfaces of the plant, causing an unusual load on the IT system. The scheduled time is set to prevent the IT system from overloads, as the acquisition runs in a usually very light time window. The utilisation of the acquired data requires the data to be reliable. To ensure this, all the considered variables have been checked for accuracy and reliability, to prevent the later implemented software from considering bad data due to malfunctioning. For each variable, an acceptability range has been set. Though bad data are discarded, significant outliers have to be considered and pointed out; therefore the tolerability range has been tuned to prevent outliers from being discarded as bad data.

2.3.8 Task 3.1 Development of the framework scheduler

For the system developed by BFI and ThyssenKrupp Nirosta the framework scheduler uses the job scheduler of the underlying Linux operating system. Here the different tasks can be scheduled based on the following definitions:

• day of week (0 - 7)
• month (1 - 12)
• day of month (1 - 31)
• hour (0 - 23)
• minute (0 - 59)

So the configuration of the different tasks is very flexible and can be adjusted to the different functions. For the AutoDiag system realised by ThyssenKrupp Nirosta the framework scheduler consists of the following functions (an illustrative crontab sketch is given after the list):




• data transfer from the different source databases and tables to the AutoDiag data mart (from every 2 hours to once a day), see chapter 2.3.6.2 on page 28
• calculation of some usage statistics, like the amount of transferred data and the time used for the transfer (once a week)
• database maintenance, like consistency checks and backup (once a month)
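The crontab entries below are a sketch of how these three functions could be scheduled with the standard cron field order (minute, hour, day of month, month, day of week). The script paths and exact times are illustrative assumptions, not the actual configuration used at the plant.

```text
# Hypothetical crontab entries for the AutoDiag framework scheduler
# field order: minute  hour  day-of-month  month  day-of-week  command

# data transfer from the source databases to the data mart, every 2 hours
0   */2  *  *  *   /opt/autodiag/bin/transfer_task.sh

# usage statistics, once a week (Monday, 06:30)
30  6    *  *  1   /opt/autodiag/bin/usage_statistics.sh

# database maintenance (consistency check and backup), once a month
0   3    1  *  *   /opt/autodiag/bin/db_maintenance.sh
```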

SSSA designed and developed a software tool (ILVAMiner) whose aims are to simplify and automate as much as possible the data selection from the AUTODIAG_DB, the data analysis and the interpretation of the results.

Figure 12: Java Native Interface (diagram: the native C++ code of ILVAMiner and its native libraries, running on the platform OS, call the Java routines of RapidMiner through JNI and the Java Virtual Machine)

This software is implemented in C++ in order to obtain the best performance and the greatest flexibility and to exploit the expertise of SSSA at best. In order to make the interaction of C++ (ILVAMiner) and the RapidMiner Java library possible, SSSA exploited the Java Native Interface (JNI), a programming framework that allows creating a Java Virtual Machine within a C++ environment, which in turn allows creating Java objects and executing their methods. By means of this technique, it is possible to silently launch RapidMiner from within ILVAMiner, to pass the selected data to it and to get the results back (Figure 12 and Figure 13).


Figure 13: ILVAMiner structure and JNI

This structure thus allows the launching of RapidMiner project files within ILVAMiner in a way that is completely transparent for the final user. RapidMiner then schedules the appropriate operators that have to be applied to the provided data in the sequence specified in the project file and provides a result set. In Figure 14 a diagram of the data flow within ILVAMiner is shown. Workstations can connect to the AUTODIAG_DB by means of ILVAMiner. The software allows data validation, filtering and quantisation before the data are passed to RapidMiner: thresholds and validity ranges can be set in order to discard faulty measurements that would alter the data mining results. Data can be passed to RapidMiner by means of JNI as previously described, and the results can be retrieved in the same way. The results are shown in ILVAMiner.


Figure 14: ILVAMiner data flow

2.3.9 Task 3.2 Development of methods using 'brute force' approach

ThyssenKrupp and BFI have investigated several methods identified in task 1.1 with regard to their usability for the 'brute force' approach. The aim of the application inside this approach is to find the variables with the most important influence on the selected target. So the outcome shall be a list of variables ordered in descending order of their influence on the target variable. Due to their successful usage in other projects, the following two methods were investigated in more detail:

• Self-Organising Map (SOM): The SOM transforms a high-dimensional input vector onto a 2-dimensional map. Hereby, similar input vectors are located in the same area of the map. This is done by weights, which connect each input to each node of the map. So for each point of the map the weights to each input can be visualised. This is called the component plane (see the following Figure 15). If the target is also used as an input, the weight map of the target variable can be compared numerically to those of the other input variables. The calculated value can be interpreted as a non-linear and multivariate correlation coefficient with the target. This is used as an indicator for the influence of the input variable on the target variable and as the ordering criterion for the resulting priority list (a small sketch of such a comparison is given after Figure 16).


Figure 15: Component plane of a SOM



• Decision tree: Here the method separates the data set by means of the best possible class partition. The most important variable is used first; the resulting data set is then separated by the next most important variable. The result is a tree-like presentation (see the following Figure 16). The order of the variables selected by the method is interpreted as the order of influence of the variables on the target.

Figure 16: Decision tree
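The report does not detail how the component plane of the target is compared numerically with those of the inputs; the Java sketch below shows one plausible variant, using the absolute linear correlation between the target plane and each input plane as the ranking value. The array layout and the use of correlation as the similarity measure are assumptions, not the project's documented implementation.

```java
/**
 * Ranks input variables by comparing their SOM component planes with the
 * component plane of the target variable. planes[v][n] is the weight of
 * variable v at map node n; targetIndex identifies the target's plane.
 */
public final class ComponentPlaneRanking {

    public static double[] rankAgainstTarget(double[][] planes, int targetIndex) {
        double[] score = new double[planes.length];
        for (int v = 0; v < planes.length; v++) {
            // The absolute correlation serves as a simple similarity measure
            // between the component plane of variable v and that of the target.
            score[v] = Math.abs(correlation(planes[v], planes[targetIndex]));
        }
        return score; // higher value = stronger map-based relation to the target
    }

    private static double correlation(double[] a, double[] b) {
        int n = a.length;
        double meanA = 0, meanB = 0;
        for (int i = 0; i < n; i++) { meanA += a[i]; meanB += b[i]; }
        meanA /= n; meanB /= n;
        double cov = 0, varA = 0, varB = 0;
        for (int i = 0; i < n; i++) {
            cov  += (a[i] - meanA) * (b[i] - meanB);
            varA += (a[i] - meanA) * (a[i] - meanA);
            varB += (b[i] - meanB) * (b[i] - meanB);
        }
        return cov / Math.sqrt(varA * varB);
    }
}
```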

During the investigations it was noticed that these methods need a longer calculation time. Especially for a larger number of input variables, which is the common case for the 'brute force' approach, the calculation time increases dramatically. In a discussion with the target users during the training course it was ascertained that this large calculation time, in the range of several tens of minutes, was not acceptable. So it was decided not to implement these methods for now. For the further development of methods for the 'brute force' approach it was decided to use a combination of methods which are faster to calculate. The fact that these are usually simpler methods, which lead to sub-optimal results (from the data mining point of view), was compensated by the parallel application of several methods and a combination of their results. The finally developed RapidMiner process used for the 'brute force' approach uses a combination of several methods which each calculate a weight for each input variable with respect to the target variable. The used methods are [23] [24]:

• Weight by Information Gain Ratio: This operator calculates the relevance of a feature by computing the information gain ratio for the class distribution (as if the example set had been split according to each of the given features).
• Weight by Deviation: Creates weights from the standard deviations of all attributes. The values are normalised by the maximum value of the attribute.
• Weight by Correlation: This method provides a weighting scheme based upon correlation. It calculates the correlation of each attribute with the target attribute and returns the absolute or squared value as its weight.
• Weight by Uncertainty: This method calculates the relevance of an attribute by measuring its symmetrical uncertainty with respect to the class. The standard formulation for this is

  relevance = 2 · [H(class) − H(class | attribute)] / [H(class) + H(attribute)],

  where H denotes the entropy.

All these methods use only univariate dependences, but they are very fast to calculate. The result is a priority list of variables, obtained by combining the individual weights by their mean value (a small sketch of this combination step is given after Figure 17). The scheme of the core 'brute force' approach is shown in the following Figure 17.

Figure 17: Scheme of the core 'brute force' approach
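The combination step itself is simple; the Java sketch below averages the per-method weights and sorts the variables accordingly. It assumes that each method's weights have already been normalised to a comparable scale (e.g. [0, 1]); this normalisation is an assumption about the process, not documented detail.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Combines the weights of several weighting methods into one priority list. */
public final class WeightCombiner {

    /**
     * @param weights weights[m][v] = normalised weight of variable v according to method m
     * @param names   variable names, same order as the second index of weights
     * @return variable names sorted by descending mean weight
     */
    public static List<String> priorityList(double[][] weights, String[] names) {
        int numVars = names.length;
        double[] mean = new double[numVars];
        for (double[] methodWeights : weights) {
            for (int v = 0; v < numVars; v++) {
                mean[v] += methodWeights[v] / weights.length; // running mean over the methods
            }
        }
        List<Integer> order = new ArrayList<>();
        for (int v = 0; v < numVars; v++) order.add(v);
        order.sort(Comparator.comparingDouble((Integer v) -> -mean[v])); // descending influence

        List<String> result = new ArrayList<>();
        for (int v : order) result.add(names[v]);
        return result;
    }
}
```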

For the practical application inside the AutoDiag solution for ThyssenKrupp Nirosta, the RapidMiner process was completed by several pre-processing steps (a small sketch of two of these filters is given after the list). These are:

• Generate a unified ident
• Define data-specific parameters: number of variables, number of data sets
• Remove unwanted variables by means of a regular expression (e.g. IDs, date and time variables)
• Remove constant variables
• Remove highly correlated variables (linear correlation coefficient > 0.9)
• Remove variables with more than 40% missing data
• Remove data sets with missing data
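As an illustration of two of the simpler filters (missing data and constant variables), the following Java sketch shows plausible implementations. The data layout (one column per variable, missing values encoded as NaN) is an assumption for the example, not the data mart's actual representation.

```java
/** Sketch of two of the pre-processing filters applied before the weighting methods. */
public final class VariableFilters {

    /** true if the column should be dropped because more than maxMissingFraction
     *  of its values are missing (missing values encoded as Double.NaN). */
    public static boolean tooManyMissing(double[] column, double maxMissingFraction) {
        int missing = 0;
        for (double v : column) if (Double.isNaN(v)) missing++;
        return missing > maxMissingFraction * column.length;
    }

    /** true if the column should be dropped because it is constant
     *  and therefore carries no information about the target. */
    public static boolean isConstant(double[] column) {
        Double first = null;
        for (double v : column) {
            if (Double.isNaN(v)) continue;
            if (first == null) first = v;
            else if (v != first) return false;
        }
        return true;
    }
}
```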

The overview of the 'brute force' RapidMiner process is shown in the following Figure 18. The detailed explanation is given in the annex, in Figures 67 to 71 on pages 120 to 122.

Figure 18: Overview of the 'brute force' RapidMiner process

2.3.10 Task 3.3 Development of methods using 'individual adapted' approach

According to chapter 2.3.2 ("Task 1.2 Categorisation of quality problems regarding the data analysis modalities" on page 18), ArcelorMittal worked on three of the four categories; the category 'development over time' was not included. During the first part of the project some own tools were written in Matlab, Python and RapidMiner. This code appeared to work for the different categories defined in Task 1.2. Using those tools as a model, they were translated into RapidMiner operators during this task. At the beginning of the project only version 4 of RapidMiner existed and the code was written for it, but a new version 5 appeared, with new functionalities and a friendlier interface. These new functionalities, and the fact that only RapidMiner version 5 is going to be supported and improved, led us to recode the operators for the newer version. The programming of plug-ins or operators differs considerably between the two versions, and the migration was not easy. Now all the code is written for the latest version of RapidMiner. For each of these categories, algorithms are used as follows (Table 6):

• Comparison of two data collections: residual from Self-Organizing Maps
• Classification of good and bad products: Support Vector Machine (SVM)
• Automatic search of influencing variables / features: Support Vector Machine (SVM) and Multivariate Adaptive Regression Splines

Table 6: Quality problem categories and selected algorithms

In the next subsections a more detailed exposition of the developed algorithms and the theory behind them is given.

2.3.10.1 Self-organizing maps (SOM)

Kohonen's self-organizing maps (SOM) are important neural network models for dimension reduction and data clustering. A SOM can learn from complex, multidimensional data and transform it into a topological map of much fewer dimensions, typically one or two. These low-dimensional plots provide much improved visualisation capabilities and help data miners visualise the clusters or similarities between patterns.


SOM networks represent another neural network type that is markedly different from the feedforward multilayer networks. Unlike the training of a feedforward MLP (Multi-Layer Perceptron), SOM training is often called unsupervised because there are no known targets associated with each input pattern; during the training process the SOM processes the input patterns and learns to cluster or segment the data through the adjustment of weights. A two-dimensional map is typically created in such a way that the order of the interrelationships among the inputs is preserved. The number and composition of the clusters can be determined visually based on the output distribution generated by the training process. With only input variables in the training sample, the SOM aims to learn or discover the underlying structure of the data.

Figure 19: Input feature space (D) and visualisation space (V) showing direct and inverse mappings, with the image M of V in D, the projection of a feature vector x and its residual using the surface M as a model

The Self-Organizing Map has some special characteristics when it is used as a dimension reduction technique for this purpose. The visualisation space is a bounded rectangular discrete grid, i.e. a discrete array of points (SOM units in the visualisation space). Each one of these points has a corresponding point in the input feature space, so there is no 2D surface in the input space like the one labelled M in Figure 19, but a distribution of points (SOM units in the input space). The direct/inverse mapping is just a one-to-one correspondence between the SOM units in both spaces. The projection of an input vector is the SOM unit that is closest to it, also called the best matching unit, generally using the Euclidean distance in the input space. Moreover, the distance of an input vector to its best matching unit is what is called the 'residual'. The residual (how far the test data are from the model) of a Self-Organizing Map has shown to be a promising quantity to compare two data collections and to find which of the signals are causing a problem. Figure 20 shows, colour-coded, the magnitude of the residual of a model created by a Self-Organizing Map. During the training period the residuals are zero, because these data were used to create the model; but when the model is applied to new data the residuals appear, and it can be seen that, at the time when the problem occurs, there is a clear increase of the residuals of the signals that are causing the problem. This can be used to detect the origin of the problem. When the problem is solved, the residuals return to a normal value.
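A minimal Java sketch of this residual computation is given below: the best matching unit of a trained SOM is determined for an input vector and the deviation from it is reported, both per attribute and as an overall distance. The codebook layout is an assumption for the example; the actual operator implementation is not reproduced here.

```java
/**
 * Sketch of the SOM residual: for an input vector the best matching unit (BMU)
 * is determined and the distance to it is reported.
 * codebook[u][j] is the weight of attribute j at SOM unit u (assumed layout).
 */
public final class SomResidual {

    /** Signed per-attribute deviation of x from its best matching unit. */
    public static double[] attributeResiduals(double[][] codebook, double[] x) {
        int bmu = bestMatchingUnit(codebook, x);
        double[] residual = new double[x.length];
        for (int j = 0; j < x.length; j++) {
            residual[j] = x[j] - codebook[bmu][j]; // sign indicates the direction of the deviation
        }
        return residual;
    }

    /** Overall residual: Euclidean distance of x to its best matching unit. */
    public static double overallResidual(double[][] codebook, double[] x) {
        return Math.sqrt(squaredDistance(codebook[bestMatchingUnit(codebook, x)], x));
    }

    private static int bestMatchingUnit(double[][] codebook, double[] x) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int u = 0; u < codebook.length; u++) {
            double d = squaredDistance(codebook[u], x);
            if (d < bestDist) { bestDist = d; best = u; }
        }
        return best;
    }

    private static double squaredDistance(double[] w, double[] x) {
        double sum = 0;
        for (int j = 0; j < x.length; j++) sum += (x[j] - w[j]) * (x[j] - w[j]);
        return sum;
    }
}
```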


Figure 20: Residual from Self-Organizing Maps used to compare two data collections and to detect differences between them

Figure 20 was produced with a Matlab script written during previous work packages of AutoDiag; this tool and the others previously cited have been migrated to RapidMiner version 5. RapidMiner has some support for Self-Organizing Maps, but it is not sufficient for the calculation of the residuals and there is no suitable graph to visualise them. Therefore a new version of the operator SOMDimensionalityReduction was written, called SOMDimensionalityReductionAndResidual. The original operator only performs a dimensionality reduction based on a SOM; the new one also provides the residuals of all attributes. To help in the visualisation, a new plotter was also written, similar to the one shown in Figure 20. To test the behaviour of the new operator, an experiment was created with a random example set. The layout is shown in Figure 21, where the execution order of the operators is also indicated. In Table 7 there is an explanation of each operator.


1. ExampleSetGenerator: Generates a random example set for testing purposes. The target function is Gaussian mixture clusters, with 1000 examples and 5 attributes. This gives a population of random data with 5 attributes grouped in clusters. Figure 22 shows the output of this operator when only 2 attributes are selected, which helps in the visualisation and understanding of this operator.
2. SplitData: Divides the data set into the defined partitions. In this case 70% of the generated data are used to train the model and the rest are used to test it.
3. AddNoise: Adds noise to one of the existing attributes, in this case attrib2. The noise is added to the test population, so the model has to handle a new population that was not used in training and in which, additionally, one of the attributes is disturbed. The operator and the newly designed plotter should help in detecting which of the attributes carries the disturbance. Figure 23 shows the added noise for the population generated by operator 1 with only 2 attributes, to help in the visualisation.
4. Multiply: An auxiliary operator; it copies the test population so that it can later be appended to the training population in order to visualise them as a single item.
5. SOMDimensionalityReductionAndResidual: The new operator created for AUTODIAG. It is based on the RapidMiner operator SOMDimensionalityReduction, but new functionalities are added to obtain information about the residuals of the model. This operator performs a dimensionality reduction based on a SOM (Self-Organizing Map), generates a model that can be applied to a new example set, and also outputs the residual of each attribute. The model is applied to the test data, which have noise in one attribute; this operator also applies the model to the training data used to create it. The parameters and their meaning are the same as those of the original SOMDimensionalityReduction operator of RapidMiner.
6. AppendTrainData: Merges the example set used for training with the one used for testing, which has passed through the AddNoise operator. This operator is used only as an auxiliary step to help in the visualisation of all data used in the experiment, i.e. the test and training data in one table. This table is connected to the output of the experiment.
7. Apply Model: Applies the model created by SOMDimensionalityReductionAndResidual to the test data, to which noise was added in attrib2. The output is a table whose columns are the coordinates of the dimensionality reduction and the residual of each attribute.
8. Append_Result: Merges the example sets that are the outputs of running the model on the training and the test data. This table is connected to the output of the experiment and is used to plot both populations in one graph.

Table 7: Description of the layout of the experiment with SOMDimensionalityReductionAndResidual


Figure 21: Layout of the experiment to test the performance of the new SOMDimensionalityReductionAndResidual operator. The number of each box is the order of execution and in Table 7 there is a description of each element.

Figure 22: Example of the random data generated by ExampleSetGenerator when 2 attributes are selected.

Figure 23: Example of the random data generated by ExampleSetGenerator when 2 attributes are selected and noise is added to attrib2.

After running the experiment described in Table 7 and Figure 21, the output must be analysed to find which attribute or attributes were perturbed and are causing the problem in the process. Here the problem of 'comparison of two data collections' is addressed, which is a common problem in industry: what has been changed that is causing a problem? There are data from the time when the process was under control and data from now, when there is a problem with it. There are hundreds or maybe thousands of signals that have to be analysed to find the faulty one that is causing the problem. There are about 33 different types of plotter in RapidMiner, but for this kind of problem essentially only those showing a time sequence can be used.


Figure 24: Plot, using the Series plotter of RapidMiner, of the residuals of attributes 1 and 2

Figure 24 shows a plot from the Series plotter of RapidMiner in which the response of the SOM model to the sequence of all the data used in the experiment can be seen, first the training data and then the test data. The focus is now laid on the residual/error of that response; both words, residual and error, are used to express how far the test data are from the model. The residual for the example set used for training (the first 700 items) is small because the model was built from these data. The next 300 samples are new for the model and therefore the error (residual) is bigger. Looking carefully, it can be seen that attribute 2 has a larger error than attribute 1, because the noise was added only to attribute 2. The Series plotter (Figure 24) can be used to plot a small number of attributes in order to find the relevant ones, but it is not useful when many attributes have to be inspected. In that case an image built from the array of signals, where each row is a signal/attribute, each column is a point in the time sequence and the colour is based on the value of the residual, can be useful to obtain a quick overview of the response and the state of the system. Figure 20 is an example of the type of graph that is meant. This plot is not native to RapidMiner, so it was decided to write an operator that supports this graph. This operator is not linked only to the SOMDimensionalityReductionAndResidual operator; it can be used like the other 33 different types of graphs that RapidMiner has and can be selected like the others to plot any data set. In the Plot View of RapidMiner a new entry is created in the Plotter combo box, where the type Residual Plot can be selected. Figure 25 shows a screen capture of the Residual Plot applied to the output of our experiment. One can select the signals to plot, select the colour map using the Style combo box or activate the interpolation only along the X axis to obtain another kind of visualisation.


Figure 25: Image created with the Residual Plot developed for RapidMiner to show the residuals of the experiment described in Figure 21

The green tones mean that the plotted signal is near zero, while red and blue tones mean high or low values of the signal. In this case it is easy to see that attribute 2 behaves quite differently from the rest, pointing out the signal that is causing the problem. Here 1000 points are perhaps too many to display and, while the zoom functionality of this graph is being finished, a resampling can help to improve the visualisation. Using the Stratified Sampling operator of RapidMiner, which performs a random sampling of a given fraction, we obtain Figure 26, and if we activate the 'Interpolate only X' option we obtain Figure 27.

Figure 26: To improve the visualisation, fewer points are shown, using a resampling operator.


Figure 27: The 'Interpolate only X' option gives another kind of visualisation that can be useful in some cases

When the user wants to use this operator, he only needs to provide the two populations to compare. These populations do not need to have known target outputs; both populations only have to have the same attributes. The operator was designed so that no parameters are needed in the beginner mode; if the user switches to the expert mode, the configuration of the SOM model becomes accessible. The next table shows the parameters available in the expert mode. In any case, the default parameters are good enough in most cases (Table 8):

• number of dimensions: Defines the number of dimensions to which the data shall be reduced. Default value: 2
• net size: Defines the size of the SOM net by setting the length of every edge of the net. Default value: 30
• training rounds: Defines the number of training rounds. Default value: 30
• learning rate start: Defines the strength of an adaptation in the first round. The strength decreases every round until it reaches learning_rate_end in the last round. Default value: 0.8
• learning rate end: Defines the strength of an adaptation in the last round. The strength decreases to this value, beginning with learning_rate_start in the first round. Default value: 0.01
• adaption radius start: Defines the radius of the sphere around a stimulus within which an adaptation occurs in the first round. This radius decreases every round. Default value: 10.0
• adaption radius end: Defines the radius of the sphere around a stimulus within which an adaptation occurs in the last round; the radius decreases from adaption_radius_start to adaption_radius_end. Default value: 1.0

Table 8: Description of the parameters of the SOMDimensionalityReductionAndResidual operator


2.3.10.2 Support Vector Machines

Support Vector Machines (SVMs; Boser et al., 1992, Cortes and Vapnik, 1995) are a set of related supervised learning methods used for classification and regression. They belong to the family of generalised linear classifiers. A special property of SVMs is that they simultaneously minimise the empirical classification error and maximise the geometric margin; hence they are also known as maximum margin classifiers. Two parallel hyperplanes are constructed, one on each side of the hyperplane that separates the data. The chosen separating hyperplane is the one that maximises the distance between the two parallel hyperplanes (see Figure 28). The assumption is that the larger the margin or distance between these parallel hyperplanes, the better the generalisation error of the classifier. The SVM builds a model from the training samples which is later used on the test data. This model is built using the training samples that are most difficult to classify (the support vectors). The SVM is capable of classifying both linearly separable and non-linearly separable data. Non-linearly separable data can be handled by mapping the input space to a high-dimensional feature space in which a linear classification can be performed. SVMs can exhibit good accuracy and speed even with relatively little training data.

Figure 28: An SVM maximises the margin around the separating hyperplane

The Support Vector Machine (SVM) is an effective classification method, but it does not directly provide the feature importance. It can, however, also be used for the automatic search of influencing variables / features, as shown in [2]. In this article Yi-Wei Chen and Chih-Jen Lin investigate the performance of combining support vector machines (SVM) with various feature selection strategies. One of these strategies is to use the F-score as a filter to select features.


The F-score is a simple technique which measures the discrimination of two sets of real numbers. Given training vectors $x_k$, $k = 1, \dots, m$, if the numbers of positive and negative instances are $n_+$ and $n_-$, respectively, then the F-score of the $i$-th feature is defined as:

$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\dfrac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \dfrac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$

where $\bar{x}_i$, $\bar{x}_i^{(+)}$ and $\bar{x}_i^{(-)}$ are the averages of the $i$-th feature over the whole, the positive and the negative data set, respectively; $x_{k,i}^{(+)}$ is the $i$-th feature of the $k$-th positive instance, and $x_{k,i}^{(-)}$ is the $i$-th feature of the $k$-th negative instance. The numerator indicates the discrimination between the positive and negative sets, and the denominator indicates the discrimination within each of the two sets. The larger the F-score is, the more likely it is that this feature is discriminative. Therefore, this score is used as a feature selection criterion. The RBF kernel is used in all the experiments:
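A direct Java translation of this definition is given below as an illustration; the two input arrays hold the values of one feature for the positive and the negative class, respectively.

```java
/** Computes the F-score of a single feature, following the definition above. */
public final class FScore {

    public static double fScore(double[] positive, double[] negative) {
        double meanPos = mean(positive);
        double meanNeg = mean(negative);
        double[] all = new double[positive.length + negative.length];
        System.arraycopy(positive, 0, all, 0, positive.length);
        System.arraycopy(negative, 0, all, positive.length, negative.length);
        double meanAll = mean(all);

        // Numerator: discrimination between the positive and the negative set.
        double between = square(meanPos - meanAll) + square(meanNeg - meanAll);

        // Denominator: scatter within each of the two sets.
        double within = variance(positive, meanPos) + variance(negative, meanNeg);

        return between / within;
    }

    private static double mean(double[] v) {
        double sum = 0;
        for (double x : v) sum += x;
        return sum / v.length;
    }

    /** Sample variance with the 1/(n-1) normalisation used in the formula above. */
    private static double variance(double[] v, double mean) {
        double sum = 0;
        for (double x : v) sum += square(x - mean);
        return sum / (v.length - 1);
    }

    private static double square(double x) { return x * x; }
}
```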

$$k(x, x') = \exp\left(-\gamma\,\lVert x - x' \rVert^2\right)$$

With the RBF kernel, there are two parameters to be determined in the SVM model: C and γ. To obtain good generalisation ability, a validation process is followed to decide the parameters. This procedure is summarised below (Table 9):


1. Calculate the F-score and order the features according to this criterion.
2. Select a collection of numbers of features.
3. For each member of the collection (n features):
   a. Select the first n features with the highest F-score.
   b. Randomly split the training data into Xtrain and Xvalid.
   c. Let Xtrain be the new training data.
      i. Consider a grid space of (C, γ) with log₂C ∈ {−5, −3, …, 15} and log₂γ ∈ {−15, −13, …, 3}.
      ii. For each hyperparameter pair (C, γ) in the search space, conduct a 5-fold cross-validation on the training set.
      iii. Choose the parameters (C, γ) that lead to the lowest cross-validated balanced error rate.
      iv. Use the best parameters to create a model as the predictor.
   d. Use the predictor to predict Xvalid.
   e. Repeat the steps above five times and then calculate the average validation error.
4. Choose the number of features with the lowest average validation error.
5. With this number of features, select that number of features with the highest F-score and follow steps i to iv to calculate the pair (C, γ) and create the final model.

Table 9: Algorithm to determine the optimum features to be used and the parameters of a Support Vector Machine learner that uses an RBF kernel
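The inner grid search of Table 9 (steps i to iv) can be sketched as follows. The routine that trains an RBF-kernel SVM and returns the 5-fold cross-validated balanced error rate is supplied by the caller (for example implemented with LibSVM); it is a placeholder assumption, not part of any existing library API.

```java
import java.util.function.ToDoubleBiFunction;

/** Sketch of the (C, gamma) grid search of Table 9. */
public final class RbfGridSearch {

    /**
     * @param cvError caller-supplied routine that trains an RBF-kernel SVM with the
     *                given (C, gamma) and returns the 5-fold cross-validated balanced
     *                error rate on the training data.
     * @return {bestC, bestGamma}
     */
    public static double[] findBestParameters(ToDoubleBiFunction<Double, Double> cvError) {
        double bestC = Double.NaN, bestGamma = Double.NaN;
        double bestError = Double.POSITIVE_INFINITY;

        for (int logC = -5; logC <= 15; logC += 2) {                 // log2(C) in {-5, -3, ..., 15}
            for (int logGamma = -15; logGamma <= 3; logGamma += 2) { // log2(gamma) in {-15, ..., 3}
                double c = Math.pow(2, logC);
                double gamma = Math.pow(2, logGamma);

                double error = cvError.applyAsDouble(c, gamma);
                if (error < bestError) {                             // keep the best pair found so far
                    bestError = error;
                    bestC = c;
                    bestGamma = gamma;
                }
            }
        }
        return new double[] { bestC, bestGamma };
    }
}
```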

RapidMiner supports SVMs, but only the core of the calculation; as can be seen in the previous procedure, there are some parameters that have to be tuned, and no automatic way to search for them is implemented. The aim is to obtain the best features (attributes) that describe the process. Therefore a new operator called F-Score_SVMLearner was developed that internally uses two further sub-processes (Multi-BestLibSVMLearner and F-Score). The main characteristics of this process are:

1. It calculates the optimum parameters of the RBF kernel of the Support Vector Machine.
2. It outputs a ranking of the influencing variables / features of the studied process.
3. It outputs a model that can be used to classify a new population.

Because the selection of the SVM parameters and the search for influencing features is a heavy task, the operator was written to use the multiple cores of the processor, dividing the main task into as many parts as the computer has cores. This reduces the computing time by a factor that is roughly proportional to the number of cores. To test the behaviour of the new operator, an experiment was created that uses a well-known population in data mining, the Wisconsin Breast Cancer Database. This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg [5]. The main features of the operator are provided by the two new sub-processes, one that searches for the best RBF kernel parameters and one that ranks the variables using the F-score metric. These sub-processes are used to calculate recursively, as in the previous algorithm (see Table 9), the optimum model, where a number of features with the highest F-score is selected and the pair (C, γ) is calculated to create the model. The layout is shown in Figure 29, where the execution order of the operators is also indicated. Table 10 gives an explanation of each operator.

Name

1

Read CSV

2

Ser Role -ID

3

Set Role Label

4

Numerical to Binominal

5

Multipliply

6

Split Data

7

FScore

8

MultiBestLibSVMLearner

9

Apply Model

10

Performance

Description This operator can read CSV files, where all values of an example are written into one line and separated by an constant separator. The data of Wisconsin Breast Cancer Database is loaded from a CSV file. These data are from what it is called Group 1: 367 instances (January 1989) This operator is used to change the attribute role since the CSV is a plain file and all the attributes are considerate regular. This defines the first column as id number. This operator is used to change the attribute role since the CSV is a plain file and all the attributes are considerate regular. This defines the second column as label. This operator converts the label columns that is 2 for benign and 4 for malignant into a binomial values. Converts all numerical attributes to binary ones. If the value of an attribute is between the specified minimal and maximal value, it becomes false, otherwise true. If the value is missing, the new value is missing. This operator is an auxiliary operator; it copies the population to later run the F-Score operator and Multi-BestLibSVMLearner on it. Divides a data set into the defined partitions. In this case 70% of the generated data is going to be used to train the model and the rest are used to test the model. This is the new operator created for AutoDiag. It ranks the features bases in the F-Score criterion This is the new operator created for AutoDiag. Calculate the best parameters of the RBF kernel of the Support Vector Machine. This operator applies the model create by MultiBestLibSVMLearner to the population. This operator delivers a list of performance values that calculates the error of the model developed.

Table 10: Description of layout of experiment of Multi-BestLibSVMLearner
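A rough Python equivalent of this layout can be sketched as follows. It assumes that the select_and_tune helper from the sketch after Table 9 is available in the same session and that the public Wisconsin data file (here called breast-cancer-wisconsin.data, a hypothetical local path) has the usual column layout (id, nine features, class with 2 = benign and 4 = malignant); it is an illustration, not the RapidMiner experiment itself:

import numpy as np
from sklearn.model_selection import train_test_split

# Read CSV -> set id / label roles -> binominal label -> 70/30 split
data = np.genfromtxt("breast-cancer-wisconsin.data", delimiter=",",
                     missing_values="?", filling_values=np.nan)
ids, X, label = data[:, 0], np.nan_to_num(data[:, 1:-1]), data[:, -1]
y = (label == 4).astype(int)                      # 2 = benign -> 0, 4 = malignant -> 1
Xtr, Xte, ytr, yte = train_test_split(X, y, train_size=0.7, stratify=y)

# F-Score ranking plus tuned RBF-SVM (cf. FScore and MultiBestLibSVMLearner)
feats, params, svm = select_and_tune(Xtr, ytr, candidate_counts=(3, 5, 7, 9))

# Apply Model plus Performance
accuracy = (svm.predict(Xte[:, feats]) == yte).mean()
print("selected features:", feats, "best (C, gamma):", params, "accuracy:", accuracy)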


Figure 29: Layout of the experiment to test the performance of the new Multi-BestLibSVMLearner operator. The number on each box is the order of execution; Table 10 describes each element.

The Multi-BestLibSVMLearner sub-process does not need any parameters to run. Its input is a data set with a binominal label. It outputs the model, the data set used to create it and a debug table with the C and gamma parameters tested to find the best ones. This information can be used to plot graphs like the one shown in Figure 30, where the performance obtained for each pair (C, γ) is drawn. The created model uses the best parameters found.

Figure 30: Search for the best parameters C and gamma of the RBF kernel of the Support Vector Machine.

For the Wisconsin Breast Cancer Database the best pair (C, γ) is (2², 2⁻⁷). The model is trained with 257 of the 367 samples, the rest are used for testing, and it gives an accuracy of 95.45 %.


In the literature [6], four instance-based learning algorithms were applied and the classification results, averaged over 10 trials, were collected; the best accuracy was obtained with 1-nearest neighbour: 93.7 %, trained on 200 instances and tested on the other 169. The F-Score operator provides the ranking of the importance of the features. Table 11 and Figure 31 show this ranking using the Wisconsin Breast Cancer Database as example input.

Table 11: Importance of the variables based on the F-Score metric.

Figure 31: Importance of the variables based on the F-Score metric.

The performance of SVM is most of the time better than that of other classification algorithms, and SVM has won several international competitions on the detection of influencing variables [2]. The tests done during this period (see chapter 2.3.13 on page 56 for more details) confirm this hypothesis: SVM can classify products into good or bad and detect the influence of variables with better performance than other general data mining techniques like neural networks, decision trees, etc.

2.3.11 Task 3.4 Development of 'smart' components

SSSA developed, within the ILVAMiner software, different ways to simplify data mining analyses and the interpretation of results by means of wizards and easy-to-read charts and plots. Figure 32 shows the layout of the main screen of ILVAMiner. The application toolbar (marked with "Main commands") contains the main commands that allow creating, opening or saving a task or accessing the database configuration dialog window. The left part (marked with "Data overview") contains the query results of the data selection in numerical form. These are the input values to the RapidMiner process.


The bottom panel (marked with "Numerical results") contains numerical results coming from RapidMiner, such as regression coefficients, cluster centroids, etc. The central panel (marked with "Chart area") contains visualisation tools such as charts, scatter plots, etc. The output of these last two panels depends on the type of RapidMiner model employed in the task (regression, classification, clustering) and is automatically selected by ILVAMiner. The graph shown in Figure 32 is the output of a dummy linear regression model which tries to model the yield strength RP02 with the tensile strength RM as input. The resulting coefficients (here only one for RM, plus the intercept) are shown below the graph. The input variable can be selected via the pull-down menu in the upper left corner of the graph; all variables which are visible in the data overview can be selected there. The 'smart components' are developed in a modular way, such that new visualisation tools can easily be added in order to meet future needs. The whole design of the user interface was kept as simple as possible, to make it usable for the foreseen target users, who do not have to be experienced in the underlying data mining methods.

Figure 32: The main screen of ILVAMiner

ILVAMiner is organized in tasks, where each task is composed of:
- an SQL query string which defines the data collection to load from AUTODIAG_DB;
- optional validity ranges;
- optional quantization rules;
- the path to a RapidMiner process that performs the analysis;
- an optional selection of "label" and "id" columns.

ILVAMiner tasks are defined by means of an XML file which stores the information listed above. Moreover, the SQL query string may contain custom keywords such as {LAST_YEAR}, {LAST_MONTH}, {LAST_WEEK}, etc., which are interpreted by ILVAMiner and can be used to automatically extract data with respect to the current date.


In this way, periodical monitoring of a process can be executed very easily and quickly, and it is not necessary to create a new task each time the user wants to analyse data from the last year, month or week. In order to simplify the access to the database and the creation of tasks, a graphical wizard was developed, shown in Figure 33. It allows:
- selecting a date range by means of radio buttons: the end-user may specify a fixed date range as well as one of the predefined periods mentioned before;
- selecting a certain steel grade;
- selecting a subset of the available fields within AUTODIAG_DB by means of a tree control;
- specifying different selection criteria (equality, greater than, less than, between);
- specifying quantization rules for different fields;
- specifying aggregation criteria (group by, mean, sum, count);
- selecting a RapidMiner process to be executed on the queried data.

ILVAMiner translates these instructions into an SQL statement and automatically executes the required JOIN operations on the database tables. The queried data are then previewed in the second dialog window of the wizard, where the user can specify, if required, some attributes of the data, such as which column is the "id" or which column is the "label", as this information may be required by certain RapidMiner processes.
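To illustrate the idea of the custom date keywords mentioned above, the following Python sketch shows how such placeholders could be expanded into concrete date literals before the query is sent to the DBMS. It is only an illustration of the mechanism; ILVAMiner itself is written in C++, and the table and column names used here are invented:

from datetime import date, timedelta

def expand_keywords(sql, today=None):
    # Replace ILVAMiner-style keywords by concrete ISO date literals (illustrative only)
    today = today or date.today()
    spans = {"{LAST_WEEK}": 7, "{LAST_MONTH}": 30, "{LAST_YEAR}": 365}
    for key, days in spans.items():
        start = today - timedelta(days=days)
        sql = sql.replace(key, "'" + start.isoformat() + "'")
    return sql

query = ("SELECT COIL_ID, RP02, RM FROM coil_data "      # invented table / columns
         "WHERE PROD_DATE >= {LAST_MONTH}")
print(expand_keywords(query))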

Figure 33: ILVAMiner task wizard

2.3.12 Task 3.5 Integration of the developed modules into the common framework

ArcelorMittal has been working on the translation and integration into the RapidMiner environment, as plug-ins, of the tools developed in the laboratory. These tools were developed in several programming languages and are listed and described in previous paragraphs (see chapter 2.3.10); they should solve the three main categories of quality problems (see chapter 2.3.2 on page 18) that ArcelorMittal is examining: comparison of two data collections, classification of good and bad products, and automatic search of influencing variables / features.


ArcelorMittal has developed several operators for RapidMiner, the software selected as common framework for the consortium. For each of these categories, the algorithms applied to solve the problem are listed in Table 12 below. In this table the operators developed by ArcelorMittal are briefly described; for a more detailed explanation see WP3 task 3.3 (section 2.3.10 on page 37).

Comparison of two data collections (algorithm: residuals from Self-Organizing Maps). Operators developed:
- SOMDimensionalityReductionAndResidual: performs a dimensionality reduction based on a SOM (Self-Organizing Map), generates a model that can be applied to a new example set and also outputs the residual of each attribute.
- Residual Plot: a new plotter built from an array of signals, where each row is a signal/attribute, each column is a time step in the sequence, and the colour is based on the value of the signal (the residual, in our use of it).

Classification of good and bad products (binary classification), multi-classification and regression problems (algorithm: Support Vector Machine, SVM). Operators developed:
- Multi-BestLibSVMLearner: calculates the best parameters (C, γ) of the RBF kernel of the Support Vector Machine for a binary classification problem.
- F-Score_SVMLearner: uses Multi-BestLibSVMLearner and F-Score to create the best SVM model with the most influencing variables.
- Multi-BestLibSVMLearnerRegression: calculates the best parameters (C, γ) of the RBF kernel of the Support Vector Machine for a regression problem.
- Multi-BestLibSVMLearnerMultiClasification: calculates the best parameters (C, γ) of the RBF kernel of the Support Vector Machine for a multi-classification problem.

Automatic search of influencing variables / features (algorithms: Support Vector Machine and Multivariate Adaptive Regression Splines). Operators developed:
- F-Score: ranks the features based on the F-Score criterion for binary classification.
- An operator based on the R package earth, which builds a regression model using the techniques of Friedman's papers "Multivariate Adaptive Regression Splines" and "Fast MARS" and uses its ranking of influencing variables.

Table 12: Quality problem categories and selected algorithms

The main operators developed by ArcelorMittal are shown in the table above, but there are some that are not listed and are used as auxiliary tools to create the experiments. The two most relevant are the one used to filter the variables used by a model (ModelFilter) and the one to convert a data set to a Residual Plot (ExampleSetToResidualPlot). The F-Score_SVMLearner outputs a model that also ranks the variables and uses only the most influencing variables to build the model. When this model is applied to another data set that contains all the variables, and not only those selected as influencing, the performance of the operator decreases dramatically. This behaviour is not caused by the developed operator; it is due to the way RapidMiner applies models in general: it is mandatory that the example set has the same variables with which the model was created. To support the creation of the automated tools it was therefore necessary to create the ModelFilter operator, which filters the data set down to the variables needed by a model.

RapidMiner offers an enormous number of different plots, but for our needs a new one had to be created: the Residual Plot.


To point the final user to the most suitable plot for the studied problem, a new operator called ExampleSetToResidualPlot was created. This operator forces a data set to be plotted with the Residual Plot.

For the consortium, RapidMiner is clearly the world-leading open source system for data mining. The only drawback of the system is that some algorithms are missing. However, as formally announced last September, RapidMiner has recently been integrated with R. This initiative brings the power of the R system and its huge number of algorithms to the intuitive graphical interface provided by RapidMiner; the RapidMiner/R combination promises to be a disruptive technology. ArcelorMittal has taken advantage of the new capabilities that the integration of R inside RapidMiner brings. Multivariate Adaptive Regression Splines (MARS) is one of the algorithms missing in RapidMiner but available in R, and it has already been applied in some ArcelorMittal projects with good performance. MARS is a form of regression analysis introduced by Jerome Friedman in 1991. It is a non-parametric regression technique and can be seen as an extension of linear models that automatically models non-linearities and interactions. The term "MARS" is trademarked and licensed to Salford Systems. MARS provides a model to solve regression problems but also delivers a ranking of influencing variables; this is used in the automatic search of influencing variables.

The operators developed by ArcelorMittal can be used as normal RapidMiner operators and can therefore easily be shared with the rest of the consortium. They can be combined with others to create the experiment that solves a given problem. These compositions are built using the design interface of RapidMiner. Once the data mining expert has created the solution for a defined problem, anyone should be able to use it with their own data set. This final user does not need to know anything about data mining technology or the details of the algorithms behind the experiment; he is only focused on getting a solution, or an approximation of the solution, of his problem.

SSSA successfully accomplished several tests in order to assess the integration of the common framework data mining engine (RapidMiner) with the developed software (ILVAMiner). The only constraint is that RapidMiner processes should take inputs directly from the process input port and provide outputs by means of the process output port, because data are automatically provided and retrieved by ILVAMiner.

For ThyssenKrupp Nirosta the integration into the industrial environment was realised by means of an existing tool called NiCo (Nirosta Cockpit). This tool can be accessed by the personnel of all branches of Nirosta. A large user group (>100) from different departments (production, quality department, material testing department, technical customer support) is using this tool for data visualisation and data exploration. The NiCo tool was enhanced by the AutoDiag functionality. The applied concept has the advantage that target users do not have to be trained in how to use this tool in general. The typical functionality, like opening a function, selecting data and starting an analysis, as well as the functions for graph manipulation like zoom and pan, is common to all functions and already available in NiCo. In summary, this approach has the following important advantages:

- No new tool has to be introduced to the personnel.
- The target users are familiar with the general use of the tool.
- The AutoDiag functionality can be realised step by step.
- Early feedback is obtained from some 'power users', who are allowed to use the AutoDiag functionality already in its beta state, regarding the usability and the reached results. Their first impressions regarding the usage of the AutoDiag functionality were used to improve the tool (see chapter "Task 5.4 Evaluation of usability and tuning of the system" on page 82).

For the necessary input of some parameters of the AutoDiag functionality the user is guided by a kind of wizard. Here the next step can only be selected if all parameters are given and correct. This avoids the possibility of entering wrong parameters which would lead to wrong results. In Figure 34 a hardcopy of the user interface is shown; here the selection of the variable to be investigated is presented.


Figure 34: AutoDiag GUI: Definition of the data sample

In the upper left corner (marked with "wizard") the user is informed about the current step he is working on. In the shown figure it is the step "definition of the data sample" (selected dot with red coloured text). The other steps ("Visualisation" and "Data mining") are greyed out, which means they are disabled until the current step is successfully finished. So the user has to provide the necessary parameters for the next step, and it is ensured that the underlying data mining methods are provided with correct parameters. In the lower left box the user can define filters for the data compilation. The data can be restricted, for example, to a certain time range, a material or a material group, strip thickness and width, but also to a specific process route. This is a very important step to ensure that the data sample used for the data mining does not contain data from different process conditions that cannot be directly compared and that would lead to less usable results. The foreseen target users are usually process experts who know very well under which conditions (from process and product as well) a certain product quality problem appears. The right box is used to select the process and product data to be used for the data mining process. They are grouped on a first level according to the process steps (here: steel works, hot rolling, hot strip annealing, cold rolling) and on a second level according to aggregates (e.g. for steel works: material data, chemical analysis, converter, casting, slab). On the third level all variables are listed, together with a brief description. In this tree view the categories can be folded and unfolded by clicking on them. To support the brute force approach the user can select a single variable, but also a whole category from the second or the first level; this selects or deselects all variables inside this category. Here the typical strategy of the process experts is taken into account: they usually know that a certain quality problem originates 'somewhere inside the annealing' or, for example, that internal defects are 'coming from casting'. So they can select the variables from the whole casting process if they do not know which of them are important. The following data mining process will produce an importance ranking of the selected variables. All parameters chosen by the user are documented in the various result presentations. This ensures that later on the results can be interpreted based on the user-given parameters.


On the server side of the AutoDiag implementation at ThyssenKrupp Nirosta, the servlet contains the main functionality. Here the data for each session (i.e. for each connected user) is prepared. The communication between the user application NiCo and the servlet is realised by means of a so-called 'transfer object', in which the information is encapsulated. The servlet is also in charge of the communication between the user, the data mart and the RapidMiner framework. The data selected by the user is prepared and converted into the RapidMiner data structure. After setting the parameters in the XML project files, RapidMiner is started and the results are converted back. Finally the results are transferred to the user application NiCo.

2.3.13 Task 3.6 First laboratory tests of the developed methods with 'real' data, assessment of the results

ArcelorMittal has been recording some test populations to be used in the development and investigation of the algorithms. Here we briefly summarise the tests done, which have been described in detail in previous technical reports. The first laboratory tests were selected to verify the viability and the quality of the algorithms that were later converted into RapidMiner operators. To test the performance of the algorithms some academic populations were selected; these populations have the advantage that they have been analysed by data mining experts of recognised experience. To test the robustness of the procedure based on F-Score and SVM, we chose a prepared population of data, Arcene, from NIPS2003 (Seventeenth Annual Conference on Neural Information Processing Systems - http://nips.cc/Conferences/2003/). The task of Arcene is to distinguish cancer versus normal patterns from mass-spectrometric data. This is a two-class classification problem with continuous input variables. The population was created by the organising committee by joining data from several databases; after some manipulations, the Arcene data had 10000 features, including 7000 real features and 3000 random probes. The algorithm selected 2480 of the 10000 features, and 95.93 % of the selected features belong to the 7000 real ones. To test the behaviour of the operator F-Score_SVMLearner, an experiment was created that used a well-known population in data mining, the Wisconsin Breast Cancer Database, obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg. On this population the operator gives even higher accuracy than the original algorithm; see section 2.3.10.2 "Support Vector Machines" for a detailed explanation of the tests done with this population. One more test was done to verify the quality of the F-Score + SVM method. A real problem from one of the ArcelorMittal installations was chosen: a two-class classification problem of good and bad quality (the "Good and bad quality" data set in Table 13). On this task several people tried several algorithms in order to achieve the best classification performance and to detect relevant features. The algorithms tested were SVM, some decision tree algorithms, a couple of neural networks, some Bayesian network algorithms, etc. Of all the tested algorithms, SVM was second in a cross-validation attempt and second in a trial with a validation population, but scored first when both experiments were taken into account. Not all the tested algorithms can rank features, but for those that provide this information directly, or from which it can at least easily be obtained, 73 % of the features selected by SVM were also chosen by the rest of the algorithms. Figure 35 shows the output of F-Score + SVM on a real two-class classification.


[Figure 35 shows a bar chart of the F-Score values (y-axis, 0 to 20) of the features f1 to f29 (x-axis "Features"), highlighting the subset retained by the F-Score + SVM feature selection.]

Figure 35: Feature classification and selection using F-Score and SVM on a real two-class classification.

The operators based on F-Score and SVM feature selection need a lot of calculation time, even though the code of the operators is multithreaded. If a deep optimum search is activated in the wizard, this time can reach days if the number of samples and features is large. To have a reference for the time needed, a data set of more than 35 000 samples and 60 features was created from the data warehouse of ArcelorMittal in Asturias. This data set comes from a quality problem and is a two-class classification problem. On this data set a deep search for the best SVM-based model was made using the wizard "Analysis of good and bad products with search of influencing variables" developed in the project. During the run of this wizard the system has to auto-tune the parameters of the SVM algorithm and has to check different combinations of groups of features to select the best parameters and the best features. This means a large number of iterations and results in two and a half days of computing time on an i7 CPU at 3.07 GHz with 8 processors and 8 GB RAM. On the same computer, running the same experiment with the "Good and bad quality" data set takes only 40 seconds. The next table summarises the characteristics of the main data sets used during the development of the operators. The "Geometrical problem" data set is used in chapter 2.3.18, Task 5.2 Application to analysis of strip geometry and/or strip flatness.


Name of dataset | Size of dataset | Origin of data | Problem tackled | Comments
Arcene | 100 samples, 10000 features (7000 real, 3000 fake) | Seventeenth Annual Conference on Neural Information Processing Systems (NIPS2003) | Distinguish cancer versus normal patterns | Used to check the performance of F-Score and SVM in ranking the influencing variables
Wisconsin Breast Cancer Database | 367 samples, 9 features | University of Wisconsin Hospitals | Cancer diagnosis | Used to check the performance of F-Score and SVM on a well-known reference population in data mining
Good and bad quality | 1015 samples, 121 features | ArcelorMittal | Two-class classification of good and bad quality | Used to compare automatic results with those reached manually by data mining experts
Geometrical problem | 14700 samples, 36 features | ArcelorMittal | Geometrical problem | Used to check the performance in a regression problem
Big dataset | >35k samples, 60 features | ArcelorMittal | Two-class classification of good and bad quality | Used to check the processing time in a heavy-task problem

Table 13: Summary of characteristics of datasets used to check the performance of the operators developed by ArcelorMittal.

Self-Organizing Maps are showing promise as an algorithm to compare two data collections and to find which of the signals are causing the problem. Figure 20 and section 2.3.10.1 show an example of this situation.

SSSA, with the help of ILVA expertise, developed a RapidMiner module which addresses the issue of understanding how some key process parameters (e.g. furnace temperatures and elongations) are influenced by setting certain production targets in terms of mechanical characteristics (e.g. yield strength Rp02 and tensile strength Rm). Despite the huge amount of available data, it is not rare that some of these key process parameters are set taking into account only the operators' experience; this can be enough until some changes are made to the incoming material. A deeper understanding of the effects of the main process parameters on the final material properties, usually non-linear and different from one steel grade to another, becomes fundamental to properly manage the cold processing whenever changes are applied to some part of the process. This method may help decision makers to choose, between two or more steel grades, the one which, according to the available production history, allows optimising process variables such as furnace temperatures or skin pass elongation. As mentioned before, the relationships between process variables, chemical composition, etc. are often non-linear, thus the analysis should go beyond the employment of linear models such as linear regression. In Figure 36 the RapidMiner process setup is shown: it is composed of some pre-processing steps whose aim is to select proper variables from the database, to filter out invalid values (outside predefined ranges) and to normalise the input data. This last step is particularly important when neural networks are used, because it helps their convergence in terms of both speed and precision. The last step is composed of a cross-validation block containing the neural network model and the performance evaluation. The system is then trained, and the model can be saved and used later, bypassing the training phase. As an example, the following scenario shall be investigated: how should the soaking furnace (SF) temperature in a hot dip galvanizing process be set in order to obtain certain mechanical characteristics for a certain steel grade?


From the AUTODIAG_DB, data concerning the chemical analysis, the involved process parameters, the coil physical parameters (width, thickness, etc.) and the mechanical characteristics of the analysed steel grade are extracted and aggregated by means of ILVAMiner. The SF temperature is then set as target value, while the other variables are set as inputs. The data are then passed to RapidMiner and the neural network is trained. The resulting model is thus able to predict the furnace temperature as a function of the steel chemical composition, the coil physical parameters and the desired mechanical characteristics, by means of knowledge extraction from the plant database.

Figure 36: RapidMiner process developed by SSSA
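The same kind of model can be sketched in Python with scikit-learn. This is not the RapidMiner process of Figure 36; the file name is a hypothetical export of the variables listed further below, with the soaking furnace temperature as last column:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical export: chemistry, coil parameters and target Rp02/Rm as inputs,
# soaking furnace temperature (SF1_FURN_TEMP) as last column.
data = np.loadtxt("grade_A_history.csv", delimiter=",", skiprows=1)
X, y = data[:, :-1], data[:, -1]

# Normalisation helps the neural network converge, as noted above.
model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(20,), max_iter=5000))
rmse = -cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
print("cross-validated RMSE per fold:", rmse)

model.fit(X, y)
# Predict the SF temperature needed for, e.g., the first coil's chemistry and targets:
print("predicted SF temperature:", model.predict(X[:1]))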

By exploiting the same model structure, it is also possible to compare two or more different steel grades: suppose that a steel strip characterised by certain mechanical features is needed and the issue is to know which steel grade, among a predefined set, allows the lowest soaking furnace temperature. To cope with this problem, different models, one for each steel grade, can be trained on the historical data. Once the models are trained, it is possible to predict the different soaking temperatures for the desired mechanical features: by simply comparing the predicted temperatures it is thus possible to choose the steel grade, i.e. the chemical composition, which allows optimising the furnace temperature. The developed module has been tested on 'real' data extracted from the AUTODIAG_DB. The selected variables are:

- steel chemical composition: Al, B, C, Mn, Nb, P, S, Si, Ti;
- coil physical parameters;
- mechanical characteristics: Rp02, Rm.

The model has been trained for two selected steel grades (later on called A and B). Figure 37 and Figure 38 show a plot of the projection of the input/output space on the SF1_FURN_TEMP - Rp02 plane for both steel grades.


Figure 37: Measured and predicted values of SF1_FURN_TEMP vs. Rp02 for steel grade B

Figure 38: Measured and predicted values of SF1_FURN_TEMP vs. Rp02 for steel grade A

Steel grade | RMSE | NRMSE
A | 32.51 | 0.092
B | 15.94 | 0.099

Table 14: SF1_FURN_TEMP prediction errors

Table 14 shows the SF1_FURN_TEMP prediction errors in terms of Root Mean Square Error (RMSE) and Normalised Root Mean Square Error (NRMSE): since the NRMSE of both steel grades is below the commonly accepted threshold of 10 %, both prediction results are acceptable. Table 15 shows an example of a comparison between two predictions.
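For reference, the error measures can be written as follows; the normalisation shown here (by the observed range of the measured temperature) is one common convention and an assumption, since the report does not state explicitly which normalisation was applied:

\mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{k=1}^{N}\left(y_k-\hat{y}_k\right)^{2}},
\qquad
\mathrm{NRMSE}=\frac{\mathrm{RMSE}}{y_{\max}-y_{\min}}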


Steel quality | Chemical composition: C, Si, Mn, P, S, Nb, Al, Ti | B | Physical pr.: WIDTH | Target: RP02 | Target: RM | Prediction: SF_FURN_T
B | obfuscated data | 0 | 1525 | 340 | 400 | 828.6
A | obfuscated data | 0 | 1018 | 340 | 400 | 736.2

Table 15: Example of a comparison between two predictions of SF1_FURN_TEMP

For both steel strips a target of Rp02 = 340 and Rm = 400 has been set. As shown, the A strip needs a far lower temperature to obtain the same results in terms of mechanical characteristics (almost 100 °C less), which may lead to lower energy consumption and better temperature control. Nevertheless, other problems could affect the steel grade B production route, so it is left to the metallurgists to decide which should be the preferred steel grade.

The data provided by ThyssenKrupp Nirosta were analysed by BFI. Here the flexible concept of the common framework became important: the RapidMiner project that is used in the servlet with the RapidMiner library can also be used with the RapidMiner GUI. So the data mining process can be optimised at BFI and the new project file can be stored on the AutoDiag server of ThyssenKrupp Nirosta. For the 'brute force' approach, the results gained with the RapidMiner process (see task 3.2) were compared against the results reached by experts. For that, a data sample from the AutoDiag database at ThyssenKrupp Nirosta was extracted. One quality failure was selected together with 88 process variables, of which 63 variables remain after the pre-processing. The data sample contains 7656 data sets. The data were analysed by means of the 'brute force' process and the resulting weights were recorded. For a common result, in a first approach the mean value of the several weights was calculated. The result is shown in Figure 39 (left table) for the first 20 positions, ordered by the mean value. For the evaluation of the results, the same data were analysed by experts using the BFI tool DataDiagnose. This tool was developed by BFI and is used for data analysis investigations. It is realised with a MATLAB kernel and a DELPHI GUI and contains several methods for data pre-processing, visualisation, data mining and modelling. In Figure 66 (on page 120 in the annex) a hardcopy of this tool is shown. The BFI tool DataDiagnose was used to pre-process the data and to create a priority list of influencing variables. For that, a decision tree (OC1), a neural network (SOM) and statistical methods (discriminatory analysis, categorised histograms) were used. The results of each method were also combined by a mean value and are shown in Figure 39 (right table).

Figure 39: Comparison of the results of the 'brute force' approach with the DataDiagnose results


As one can see, the results are very similar but not identical. The most important variables were found by both investigations, but in a slightly different order. This can happen for the following reasons:

- The neural network (SOM) from DataDiagnose uses a random initialisation of the weights, which leads to different results for different runs.
- The creation of a balanced sample (necessary e.g. for the application of the discriminatory analysis) uses only a sub-set of the test data, which is selected randomly.
- The mean value is not the best solution for combining the results from each applied method.

For the interpretation of this result, one solution can be to mention only the influencing variables, without an order of importance.
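The combination step itself can be sketched as follows. This minimal Python sketch normalises the weights delivered by each method before averaging them (an assumption, since the methods deliver weights on different scales) and is not the exact BFI implementation:

import numpy as np

def consensus_ranking(weight_table):
    # weight_table: array of shape (n_methods, n_variables); one row per data mining
    # method (e.g. decision tree, SOM, discriminatory analysis); higher weight = more important.
    w = np.asarray(weight_table, dtype=float)
    w = (w - w.min(axis=1, keepdims=True)) / (np.ptp(w, axis=1, keepdims=True) + 1e-12)
    mean_weight = w.mean(axis=0)                  # simple mean over the methods
    return np.argsort(mean_weight)[::-1]          # variable indices, most important first

# Averaging the per-method ranks instead of the weights would be one way to further
# reduce the influence of the different weight scales mentioned above.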

2.3.14 Task 4.1 Implementation of the automatic tools

According to Task 1.2 (see chapter 2.3.2), where a categorisation of quality problems was done, ArcelorMittal has been working on three of the four categories; the 'development over time' category is not included in the studies. In the previous sections the operators developed to solve the quality problems have been described (see chapters 2.3.10 and 2.3.12). The creation of the auxiliary tool operators like ModelFilter and ExampleSetToResidualPlot (see chapter 2.3.12), and of course the creation of all the main operators, provided considerable experience with the internal structure of the RapidMiner source code and the way to modify it. This allowed us to create new RapidMiner interfaces more suitable for our requirements. To bring the RapidMiner power to users that are not interested in the technology of data mining, ArcelorMittal created a minimalistic interface for RapidMiner, as shown in Figure 40. It shall help them to use these techniques to solve a problem. The idea is to eliminate everything that can confuse or complicate the use for this kind of user while, of course, maintaining all the calculation power and the result presentation capabilities of RapidMiner. As a consequence, the interface of RapidMiner is reduced to the minimum: it starts by showing a template, a wizard which guides the user to select the kind of problem he wants to solve and asks for the parameters to configure the experiment. These parameters are very simple and are not related to the algorithms behind the solution, because the operators were created with the requirement that they must be auto-tuned. Most of the parameters concern the loading of the data, e.g. the name of the file or the name of the target variable. The most evident elimination from the normal RapidMiner interface is that the design capabilities are removed; we think that the profile of the final user of this system is a person that does not have the necessary knowledge to edit and improve the performance of an experiment that was defined by an expert. Anyway, as described later (see chapter 2.3.15 on page 69), the experiment can still be edited if desired.


Figure 40: Minimalistic RapidMiner user interface developed by ArcelorMittal

In this interface configuration, called "Wizard Interface", the user can select one template that will guide him to solve a specific problem. Several templates were created that cover the categories of quality problems under study:

- Analysis of good and bad products with search of influencing variables.
- Multi-classification problem.
- Regression problem.
- Comparison of two data collections.
- Search of influencing variables based on MARS.

In the following paragraphs these templates are discussed in more detail.

2.3.14.1 Analysis of good and bad products with search of influencing variables: Here the Support Vector Machine (SVM) is applied to find the influencing variables and to create a model that is trained with the training data set and applied to the test data set. Figure 41 shows the last step in the configuration of this experiment: it asks the user for the names of the Excel files that contain the training and the test data set. It also asks for the names of two special attributes: the identification column and the variable under study. After this the user has to press the Finish button and RapidMiner starts the calculations.


Figure 41: First and last step of the wizard for the analysis of good and bad products.

The operators are programmed in a way that they can use the multithreading capabilities of today's multi-core processors. The performance evolution of computers is now based on including more processor cores rather than on increasing the clock frequency. This programming approach enables us to take advantage of this evolution and to reduce the calculation time of the experiment when more threads can be run on more cores. When the calculation ends, result overview tabs are shown to the user. Figure 42 shows this result overview. For this template there are four tabs:

- FLearner Debug (Multi-FLearner): this tab shows the list of the variables that the algorithm considers as influencing variables; they are used to create the model.
- ExampleSet (Multi-FLearner): this tab shows the prediction of the model when it is applied to the test data set. There is a column where the prediction of the model can be seen.
- Performance Vector (Performance Test): here the performance statistics of the model applied to the test population are shown.
- Performance Vector (Performance Train): here the performance statistics of the model applied to the training population are shown.


Figure 42: Presentation of the results in the wizard of the analysis of good-bad products.

2.3.14.2 Multi-classification problem: Based on the binary classification of the previous template, a new template was built that covers the case where more than the good and bad categories have to be distinguished. The difference from the binary classification is that in this case there is no search for influencing variables and all features are used in the obtained model. As in the binary problem, the algorithm is auto-tuned, the questions asked to configure the experiment are the same as in the binary case, and the result tabs are the same, except that there is no tab with influencing variables.

2.3.14.3 Regression problem: If the target is a regression problem, the Support Vector Machine algorithm can also be used to solve it. In this case there is no search for influencing variables either, the questions to the user are similar to the previous template, and the algorithm is auto-tuned.

2.3.14.4 Comparison of two data collections: This template uses the Self-Organizing Map algorithm to compare two data collections. It asks the user for the names of the Excel files containing the test and training data. As the Self-Organizing Map is an unsupervised algorithm, it is not necessary to define the label variable; the wizard only asks for the name of the identification column of the Excel files. After the execution of the algorithm there are only two overview tabs. One shows the dimensionality reduction model created by the SOM using the training data and the result of applying this model to the union of both populations, first the training and then the test data. The second tab shows the residual plot of the joined population; this plot is selected automatically thanks to the auxiliary operator ExampleSetToResidualPlot. Figure 43 shows this overview tab, where the residual plot makes it easy to see that attribute 2 ("Error_att2") behaves differently from the others and is the main cause of the difference between the two populations. The first homogeneous green zone is the training data, which obviously has a smaller error because the model has been trained on it; then the test data follow, where the error is larger, but in general still smaller than for the variable that is farthest from the training population, in this case attribute 2.


Figure 43: One of the overview result tabs of the template for the comparison of two data collections. Attribute 2 behaves differently from the other variables, which indicates to the user that this is the feature that differs between the two data collections.
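The underlying idea of the SOM residuals can be sketched in a few lines of Python. This is not the SOMDimensionalityReductionAndResidual operator itself, but an illustration using the third-party minisom package on already normalised data:

import numpy as np
from minisom import MiniSom   # third-party package "minisom"

def som_residuals(reference, candidate, grid=(8, 8), iterations=5000):
    # Train the map on the reference collection only.
    som = MiniSom(grid[0], grid[1], reference.shape[1], sigma=1.0, learning_rate=0.5)
    som.random_weights_init(reference)
    som.train_random(reference, iterations)
    weights = som.get_weights()
    # Per-attribute distance of every candidate sample to its best matching unit.
    return np.array([np.abs(x - weights[som.winner(x)]) for x in candidate])

# Attributes whose residuals stay large over many samples (like "Error_att2" above)
# are the ones responsible for the difference between the two collections.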

2.3.14.5 Search of influencing variables based on MARS: To use the new possibilities that the integration of R inside RapidMiner provides, a new template was created that uses the earth package of R, which implements the Multivariate Adaptive Regression Splines algorithm. The wizard asks the user for the names of the Excel files containing the test and training populations. The user also has to define the two special columns: the identification column and the variable under study. The overview result tabs are in this case three: one with the ranking of influencing variables, a second with the prediction of the model when it is applied to the test data set, and a third with the performance statistics of the model when it is applied to the training population. A more detailed presentation is given in section 2.3.18 "Task 5.2 Application to analysis of strip geometry and/or strip flatness" on page 76.

In this task, SSSA focused its activities on the development of ILVAMiner, a software that allows users to access the AUTODIAG_DB and to perform queries and data mining elaborations in an easy way. The software hides details about the particular Database Management System (DBMS) employed and about which data mining engine is being used. Furthermore, it shows data mining results by means of 'smart' modules, which are tailored to each specific data mining elaboration and which help to interpret the results in the right way. The software architecture is depicted in Figure 7. It is developed in C++ and, as shown, it consists of four main modules:

- MySQL DBMS, which stores the AUTODIAG_DB;
- a data importer tool developed by SSSA in the previous semesters, whose aim is to periodically update AUTODIAG_DB with new data;
- RapidMiner, the data mining engine chosen as common framework for this project;
- ILVAMiner, which acts as a bridge between the DBMS and RapidMiner.

ILVAMiner is in turn subdivided into three parts:
- a database driver;
- an interface towards RapidMiner, based on the Java Native Interface (JNI);
- the main Graphical User Interface of ILVAMiner.

ILVAMiner plays an important role in hiding the complexities of database querying and data mining elaborations from the final users. In order to fulfil this objective, ILVAMiner is based on a series of XML configuration files (task files) that describe the queries and the elaborations performed on the queried data. These files can be created and/or modified by means of dedicated wizards within ILVAMiner, depicted in chapter 2.3.11 on page 50. Each file represents an elaboration flow, called task, i.e. it describes which data should be collected, which RapidMiner project should be called and other necessary parameters, such as an optional aggregation or quantization to be performed on certain columns. Figure 44 shows an example of a configuration file.

Figure 44: ILVAMiner configuration file example

The RapidMiner node specifies which RapidMiner process should be called and, optionally, which is the target variable (called label) or the ID variable, as this information is needed by RapidMiner. The SQL node contains the query string and optional parameters to pass to the DBMS. Finally, the quantization node specifies the column on which to perform the discretisation, together with its base and step. By means of the ILVAMiner GUI it is possible to execute these tasks and to view the elaboration results in tabular form or by means of charts. For simple RapidMiner processes, ILVAMiner is able to automatically understand how to represent the results by inspecting the RapidMiner process itself, as depicted in chapter 2.3.11 on page 50. In Figure 45 and Figure 46 two different visualisation strategies are shown. In this specific case study, attention has been focussed on the correlation between steel chemistry and the mechanical properties of the final product, taking into account some typical steel shop requirements. This is especially true when, as in the ILVA case, the same steel shop has to supply many different customers and many steel grades have to be produced while preserving high productivity. In Figure 45 the results of clustering by means of a K-means operator are presented in a scatter plot. By means of two combo boxes it is possible to assign to each axis a variable taken from those that have been extracted from the database. The ratio between C and P content is shown, and the centroids are highlighted in blue. For the figure, data from various steel grades are used. The aim is to assist the steel shop in the melt scheduling activity: even if several steel grades are included, only three centroids need to be considered, simplifying the scheduling activities. Figure 46 shows a similar plot, but this time the results of a linear regression operation are depicted (blue dots). In this case the Y-axis is fixed to the labelled variable (i.e. the target variable of the linear regression, here Rp02). This case study is more oriented to products: the well-known correlation between Mn content and mechanical properties (yield strength Rp02) is shown (red dots). The aim of this approach is to quantify the effect of this element on different steel grades, in order to see whether the usually adopted ranges could be optimised, finding the best balance between the requested product features and steel shop scheduling flexibility. The acceptability of the mechanical properties is finally assessed by optimising the process to maximise the Cp and Cpk indices, as depicted in chapter 2.3.17 on page 74.
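To make the structure of such a task file more tangible, the following Python sketch parses a small, invented example in the spirit of Figure 44 (the actual tag and attribute names used by ILVAMiner may differ):

import xml.etree.ElementTree as ET

TASK = """
<task>
  <rapidminer process="processes/regression.rmp" label="RP02" id="COIL_ID"/>
  <sql>SELECT COIL_ID, C, MN, RP02 FROM coils WHERE PROD_DATE &gt;= {LAST_MONTH}</sql>
  <quantization column="THICKNESS" base="0.0" step="0.25"/>
</task>
"""

root = ET.fromstring(TASK)
rm = root.find("rapidminer")
print("process:", rm.get("process"), "label:", rm.get("label"), "id:", rm.get("id"))
print("query:", root.findtext("sql").strip())
q = root.find("quantization")
print("quantize column", q.get("column"), "with base", q.get("base"), "and step", q.get("step"))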


Figure 45: Clustering results representation

Figure 46: Linear regression results representation

Moreover, the exploitation of Object-Oriented Programming (OOP) paradigms and the modularity of ILVAMiner allow its extension by new functionalities and result visualisation strategies in an easy manner. The full interoperability of the solutions developed by the different partners is based on the adoption of RapidMiner as the common data mining framework.


The automatic tools developed by ThyssenKrupp Nirosta and BFI cover the following two areas:

Automatic data transfer, preparation and database maintenance: At ThyssenKrupp Nirosta a large central database exists in which the data from the three branches and from the whole production chain are stored. To reduce the network traffic and to have the data and the calculated features (e.g. time / length related data aggregated to piece data, calculation of new features, collection of data from different sources into one table) directly available, a dedicated AutoDiag database is used for the project. By means of automatic tools the transfer of these data from the central database to the AutoDiag database, as well as the necessary calculations, is done cyclically, depending on the data from every 2 hours to once a day. Furthermore, some processes for the maintenance tasks were realised: the calculation of some statistics (running once a week) and some consistency checks (running once a month).

Automatic servlet based data mining process: The user interface for the AutoDiag components is integrated into the existing tool NiCo of ThyssenKrupp Nirosta. Here the user selects the variables and the type of data mining process; furthermore, the results are visualised within this GUI. The other steps of the data mining process are done automatically by the AutoDiag servlet:

- collection of the data sample
- data pre-processing
- start of the data mining methods
- collection and evaluation of the results
- transfer of the results to the user application

Because the database and the servlet are running on the same machine, maximum performance can be expected. Moreover, due to the scalability of the selected solution (e.g. moving the database to a dedicated server located in the same rack, enhancement of the main memory and/or the number of processors), increasing requirements regarding the calculation performance can be handled, so that the realised AutoDiag system is future-proof.

2.3.15 Task 4.2 Integration into the industrial environment

For ArcelorMittal the integration of the developed tools into Mytica, the industrial data viewer in Asturias, is the main objective of this task. Mytica (see Figure 47 for a screen capture example) is the software that production people currently use to consult the data stored from the production line and quality systems. If they want to do some studies, they can export data to Excel and import them into their favourite data mining tool. In the shown example one can see an individually adapted visualisation scheme, here for the display of surface defect data. In the left box (marked with "General info") some information about the material and the selected product is shown. The product selection is done in the right box (marked with "Product selection").


Figure 47: Mytica interface, showing the graph module and the surface inspection system module.

The central area of the tool is divided into two sections. In the lower section (marked with "Surface defects map") the 2-dimensional defect map of a strip is shown, based on data from an automatic surface inspection system. In the upper section (marked with "Graph zone") some interesting process signals are shown, assigned to the length of the strip. The advantage for the target users is to get all important information for a specific, individual problem in one customised view. The first design of this integration was a deep integration, meaning that the Mytica core calls the core of RapidMiner as a library and shows the results inside the Mytica interface. As the project progressed and the knowledge of the internal structure of RapidMiner and its capabilities grew, a new way of integration was chosen. This new integration is simpler for the IT department, because it only requires few changes from their side, and it keeps the result presentation power of the RapidMiner system. The IT department has created a new installation package of Mytica that also includes a special RapidMiner installation created by ArcelorMittal. When the user of Mytica exports data, he can optionally directly start the minimalistic version of RapidMiner. The modified version starts by showing the wizards described in section 2.3.14 and shown in Figure 40 (page 63). Besides the Mytica software itself, the new installation contains:

- An installation of the latest version of RapidMiner 5.01, compiled by ArcelorMittal to include the Residual Plot, which cannot be deployed as a plug-in, and all the developed operators in the form of plug-ins. There are many advantages in developing operators as plug-ins: they can be distributed easily (they only need to be copied into a specific folder to be recognised by RapidMiner) and they can be upgraded easily. This version of RapidMiner has the complete interface and the design capability; it also has the same templates as the minimalistic version and can therefore be used to edit these templates if the user has the knowledge and if particular needs exist. Only advanced users use this version.
- The minimalistic version of RapidMiner that was described in the previous sections (see "Task 4.1 Implementation of the automatic tools" on page 62).
- The software and the configuration of the Windows environment variables to allow the integration of R in RapidMiner. This step had to be automated because the current integration of R needs several configuration steps that could be too difficult for a non-expert user.


Once all the software is installed, the user can interact with Mytica to select the populations that he wants to study. He has to use this tool anyway, so it makes no sense to design a new way to extract data from the facility's data warehouse. As soon as he exports data to Excel, he is asked whether he wants to do a data mining study. If so, the minimalistic version of RapidMiner starts. This way of working has some advantages:

- The data mining infrastructure is spread over the different departments using a tool they are used to. This technology is RapidMiner, and this helps to establish it as a standard.
- It can be upgraded easily, using the Mytica installation software or directly by overwriting the operators.
- When a normal user needs to solve a specific problem not covered by the current templates, a data mining expert can create a new template. This template can be exported as an XML file that only has to be saved in a specific folder of the minimalistic RapidMiner version; when the user opens the wizard, the new template is available.
- Easy integration of new operators: in case a new operator has to be integrated, it will be created as a plug-in and only has to be saved in the plug-ins folder to be accessible to the users.
- In the case of transferring this system to other ArcelorMittal facilities where Mytica is not available or a similar application exists, the integration will be easier than with the previous idea of a deep integration.

The common framework as well as the ILVAMiner software has been installed at the ILVA plant in Novi Ligure. The setup of the entire system requires the following steps:
- Installation and configuration of MySQL 5: the DBMS can be installed on a server on the plant intranet (shared installation) as well as on the same PC that hosts ILVAMiner (private installation). The AUTODIAG_DB schemas can be easily created by means of a SQL script.
- Installation and configuration of the automatic data importer tool: this tool takes as input the text files produced by a routine that queries the AS400 mainframe server during low-traffic periods. These text files are then translated and the data stored in the AUTODIAG_DB. The tool can be configured by means of simple INI files which specify which variables should be extracted from the text files and where they should be stored.
- Installation of the Java Runtime Environment (JRE): required by RapidMiner and necessary for the employment of JNI.
- Installation of RapidMiner: this software has been chosen by the consortium as the common framework for the AUTODIAG project.
- Installation of ILVAMiner: it must be installed on each workstation that should access AUTODIAG_DB to perform data mining elaborations.

At ThyssenKrupp Nirosta the AutoDiag functionality is integrated into the existing software called NiCo, a self-developed tool for the visualisation and statistical analysis of product and process data. The data for the analysis are stored in a central database which contains data from the three branches of ThyssenKrupp Nirosta: Dillenburg, Krefeld and Düsseldorf. NiCo is heavily used by more than 100 users from different departments: production and quality departments as well as material development and customer claim handling. This is an optimal prerequisite for the distribution of the AutoDiag functionality to a wide range of applications and a broad spectrum of users with different backgrounds in statistics, data handling and data analysis. Furthermore, some necessary modules like user account management, visualisation, data interfaces and automatic software distribution are already available, which avoids unwanted standard software development efforts in the AutoDiag project. The technique for the implementation of AutoDiag was fixed by the given software tool NiCo. It is realised as a Java Web Start application which is available in the whole intranet of ThyssenKrupp Nirosta. The user interface of the AutoDiag system is realised by means of a wizard-like structure.

71

user is guided through the data mining process. The next step can only be selected if all prerequisites are fulfilled which avoids wrong results based on incorrect parameters, for example. Due to the approach investigated by ThyssenKrupp Nirosta and BFI, to take a large amount of data into account („brute-force‟ approach), the expected time consuming data handling as well as the application of the data mining methods was implemented as a backend on a central, powerful server. The selected technology is a servlet implementation. This servlet contains the application of the data mining methods, based on the RapidMiner library. The third important component of the ThyssenKrupp Nirosta AutoDiag solution is a dedicated database located also on the AutoDiag server. Here the data of the several tables of the central database is stored after a data pre-processing. The source data are aggregated to piece data and additional features are calculated. This is done by the automatic tools. Even if this concept leads (in some cases) to duplicated data storage, the advantage of pre-calculated features and direct access to the data (servlet and database are on the same server) leads to a significant faster data access together with a reduced load of the central database server and a reduced network traffic. The following Figure 48 shows an overview of the ThyssenKrupp Nirosta solution of the AutoDiag system.

Figure 48: ThyssenKrupp Nirosta AutoDiag solution

The AutoDiag component of NiCo consists of the following modules / functionality: 

Data selection: All available product and process data are listed in a tree view. The grouping to different branches is done by production steps (e.g. steel works, casting, hot rolling, cold rolling and finishing). Furthermore some special groups like chemical analysis or inspection results are realised.



Visualisation: The first step in a data mining process is to visualise the data. The realised visualisations are y versus time, scatter plot, histogram and a table view to the data.



Target definition: Here the target of the investigation has to be defined. This can be an inspection result (product failure was detected or not) as well as a continuous variable (e.g. strip thickness deviation, flatness). The range for the two cases “good” and “bad” can be defined. For example the inspection results are stored with a failure weight: no failure, light failure and strong failure. The user has to decide, which cases have to belong to which class. The same can be done for continuous variables, for which the borders for the good and the bad case have to be provided.




Data mart preparation: After the definition of a material filter by means of a time range, material classes, production routes, production procedures etc. the data mart for following data mining process is created.



Data mining:

Categorised histograms: This kind of visualisation shows a histogram which takes the classification into account. This uni-variate, linear view to the data gives first hints to possible influencing variables.



Hypothesis tests: This feature was a request from the target users during the training course. Because of the activities of ThyssenKrupp Nirosta in the field of six-sigma the student t-test was integrated into AutoDiag.



Influencing variables: Here the data mining process is addressed. By means of a complex RapidMiner process (RapidMiner is integrated as a library into the AutoDiag servlet) a priority list of variables which are influencing the selected target is calculated.

Several hardcopies of the user interface are shown in Figure 72 to Figure 75 on page 122 in the annex.

2.3.16 Task 4.3 Briefing of target users and launch of the developed system

ArcelorMittal selected some users with experience in the usage of Mytica to be briefed and trained in the usage of the minimalistic RapidMiner version. Some people that are familiar with Mytica were also invited to these sessions. The feedback from these target users was used to improve and adjust the developed wizards. The people were selected from different positions and different facilities, ranging from plant technicians to R&D researchers and IT people. There were two kinds of profiles among the people selected: people with knowledge in the field of data analysis, and people with few technical skills in data mining. As the people involved in the training were heterogeneous, the feedback was of various types, ranging from very technical data mining comments to points related to the interface and the usability of the software. All of these comments were of course very useful to improve the developed tools.

SSSA produced a presentation that describes the usage of ILVAMiner and which may be used to brief target users. In the presentation all the main features of ILVAMiner are quickly shown, and it describes some sample elaborations that can be performed by means of the software. The main aim of ILVA was to get a powerful tool whose features could provide results close to the day-by-day steelmaking system requirements; three major fields of interest were covered by specific algorithms and visualisations:

Quality – Day by day check of key parameters



R&D – Process development



Customer – Process capacity calculation

As far as the last field of ILVAMiner usage is concerned, some indices were selected taking into account the typical requirements of the automotive market, with reference to the VDA process auditing procedure, the standard for the German automotive industry. The relation between steelmakers and car producers has in the last years become more and more oriented towards partnership rather than a pure supplier strategy; in the frame of this cooperation, steelmakers have to demonstrate that they can monitor and control the whole production chain, also by means of statistical tools and advanced data management, as this project aims to achieve. Among the available parameters, for Gaussian distributed process variables Cp and Cpk have been selected, since these indices are able to point out the process goodness; they are described in the following section.

After the development of the AutoDiag functionality and its integration into the NiCo tool at ThyssenKrupp Nirosta, the new functions were presented to a smaller group of users. These "power users" were selected according to the following requirements:




detailed knowledge regarding the usage of NiCo,



knowledge in the field of data analysis,



statistical knowledge (e.g. six-sigma trained personnel),



a kind of robustness regarding the appearance of software bugs.

During the briefing the several functionalities of the AutoDiag system were presented and discussed. By means of a showcase the users gained their first experiences during the course. After the course the AutoDiag modules were released to these users by means of the NiCo user management.

2.3.17 Task 5.1 Application to analysis of mechanical & technological properties

SSSA and ILVA developed and applied data mining elaborations to the evaluation, assessment and prediction of mechanical characteristics. The main focus of these analyses was on the Cp and Cpk parameters (Figure 49), which synthetically describe whether a certain product characteristic statistically lies within certain limits, called Lower and Upper Specification Limits (LSL and USL).

Figure 49: Cp and Cpk indexes, where x is the mean and s is the standard deviation
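For reference, the standard definitions of these indices, in the notation of Figure 49 (x̄ the mean, s the standard deviation of the characteristic), are:

\[
C_p = \frac{USL - LSL}{6s}, \qquad
C_{pk} = \min\!\left(\frac{USL - \bar{x}}{3s},\; \frac{\bar{x} - LSL}{3s}\right)
\]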

The Cp index, or process potential index, measures the distribution width against certain specifications, lower and upper (LSL and USL), while the Cpk index, or process capability index, measures the distribution centring against these specifications. The most important feature of such indices is their usability in all three regions of interest, i.e. Quality, R&D and Customer satisfaction: they can give useful information to metallurgists, researchers and customers, being simple and significant at the same time. SSSA and ILVA developed a model which allows predicting whether a certain process will optimise the Cp and Cpk values for multiple mechanical characteristics at once. A simplified block diagram that describes the training phase is depicted in Figure 50. Data regarding process parameters (such as furnace temperatures, skinpass elongation, etc.), mechanical characteristics analyses (Rp02, Rm, anisotropy, strain hardening, etc.) and target mechanical characteristic ranges (LSL and USL for each characteristic) for a specified steel grade and a specified time period are queried from AUTODIAG_DB by means of ILVAMiner. It filters anomalous observations and null values, optionally discretises them, and then passes the pre-processed data to RapidMiner by means of the JNI interface. The RapidMiner process is composed of several modules:

1. An aggregation block calculates means and standard deviations of the mechanical characteristics on a per-week basis;

2. A custom script block calculates Cp-Cpk for each mechanical characteristic and each week;

3. A custom script block performs a Pareto ranking on all Cp-Cpk values, assigning to each coil a label which describes whether its Cp-Cpk values are Pareto-dominant or not;

4. A composite block performs feature selection and decision tree model training and validation iteratively in order to optimise the model performance by identifying the most influencing variables;

5. The optimal decision tree model parameters are saved for future use and returned to ILVAMiner.


Figure 50: Optimal Cp-Cpk training process

More in detail, the Pareto ranking exploits the so-called Pareto-dominance concept, which is widely employed in Multi-objective Optimization Problems (MOPs). It states that a vector a dominates a vector b if and only if each element of a is greater than or equal to the corresponding element of b and this comparison is strict for at least one element:

\[
a \succ b \iff a_i \ge b_i \;\forall i \;\wedge\; \exists k : a_k > b_k
\]

In a set of vectors, those that are non-dominated compose the so-called Pareto front, i.e. the set of optimal vectors. After step 2, the Cp and Cpk values for each examined mechanical characteristic and for each week are known: let us call the set of Cp-Cpk values the Cp-Cpk vector. In order to establish which week has the best Cp-Cpk, a Pareto rank is calculated by assigning to each Cp-Cpk vector a value equal to the count of vectors that dominate it. Cp-Cpk vectors that have a Pareto rank equal to zero lie on the Pareto front, thus they are optimal. Cp-Cpk vectors are then divided into two classes on the basis of their rank: dominated and non-dominated. Subsequently a decision tree model is trained using process parameters as inputs and the dominated / non-dominated classes as target. The training is performed several times with a different subset of inputs at each step in order to select those process parameters that optimise the decision-tree performance, i.e. that most influence the optimality of the Cp-Cpk vectors. The selection of the inputs can be performed by means of "brute force", i.e. by trying all combinations, if the number of parameters is restricted, or by means of more sophisticated methods such as Genetic Algorithms (GAs) if the number of parameters is high.
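A minimal sketch of this ranking step is given below, assuming that the weekly Cp-Cpk values have already been collected into vectors. The numbers are invented and the functions are only illustrative, not the actual RapidMiner script.

```python
# Illustrative sketch of the Pareto ranking of weekly Cp-Cpk vectors.
# Every component is "larger is better", matching the dominance definition above.

def dominates(a, b):
    """a dominates b: a_i >= b_i for every i and a_k > b_k for at least one k."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_rank(vectors):
    """Rank = number of vectors that dominate the given one (0 = on the Pareto front)."""
    return [sum(dominates(other, v) for other in vectors if other is not v)
            for v in vectors]

# Invented example: three weeks, two characteristics, (Cp, Cpk) per characteristic.
weeks = [
    (1.3, 1.1, 1.4, 1.2),   # non-dominated
    (1.0, 0.8, 1.0, 0.9),   # dominated by both other weeks
    (1.5, 1.0, 1.1, 1.3),   # non-dominated (incomparable with the first week)
]

ranks = pareto_rank(weeks)                                    # [0, 2, 0]
labels = ["non-dominated" if r == 0 else "dominated" for r in ranks]
print(list(zip(ranks, labels)))
```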


The decision tree model has two main interesting employments: on the one hand it can be used to forecast Cp-Cpk optimality given a set of process parameters and, on the other hand, it allows the extraction of knowledge from historical data. In fact, decision-tree models produce a set of comprehensible "if-then-else" rules which may help steelmakers understand how to optimise steelmaking processes. As a case study, let us consider a set of IF steel grade coils produced by ILVA and available in AUTODIAG_DB (~800 coils). The following information is queried from the database:

Coils identity and week of production;



Process parameters such as furnace temperatures at different steps and skinpass and tension leveller elongations;



Mechanical characteristics analyses for Rp02, Rm, r, n;



Mechanical characteristics target ranges for selected steel grade (LSL and USL for each mechanical characteristic).

For each week the Cp-Cpk vector is evaluated, the Pareto rank is assigned and the dataset is subdivided into the non-dominated class (~110 coils) and the dominated class (the remaining coils). During the feature selection two input parameters have been discarded (namely the furnace temperatures of the last two steps). Table 16 shows the confusion matrix of the decision tree model.

                          Dominant                  Non-dominant
Predicted dominant        107 (true dominant)       3 (false dominant)
Predicted non-dominant    6 (false non-dominant)    761 (true non-dominant)
Class recall              94.69%                    99.61%
Class precision           97.27%                    99.22%

Table 16: Confusion matrix of the decision tree model

The overall accuracy of the model was 98.99% while precision was 99.25%, where these performance indexes are defined as:

\[
\text{accuracy} = \frac{\text{true dominant} + \text{true non-dominant}}{\text{total coils}}
\qquad
\text{precision} = \frac{\text{true dominant}}{\text{true dominant} + \text{false dominant}}
\]

Such a high performance means that there is a good chance of using such a model to understand possible production anomalies. Moreover, such a training procedure could easily be applied to a wider range of steel grades by means of ILVAMiner. The comparison of the smart components to the other approaches is summarised in chapter 2.3.21 on page 83.

2.3.18 Task 5.2 Application to analysis of strip geometry and/or strip flatness

As written in chapter 2.3.13, ArcelorMittal tested their operators on some academic populations but also on a real two-class classification problem of good and bad quality. In order to show the capabilities of the new integration of R inside RapidMiner and of the template developed to check the goodness of the MARS algorithm (see 2.3.14.5), a regression problem was presented. It was a real regression problem from one ArcelorMittal facility: a strip geometry problem that had been solved prior to the start of the AUTODIAG project. The tools originally used to solve the problem were analysis of variance (ANOVA) to select the influencing variables and a linear regression to adjust the final output. The ANOVA was done with commercial software operated by a statistical expert and required a detailed technical study. Now, with the developed template "Search of influencing variables based on MARS", any user can reach a better result and the same information just by filling in the name of the Excel file containing the data and the name of the variable under study. Figure 51 shows the result of executing the template "Search of influencing variables based on MARS". The two overview tabs are the ranking of influencing variables and the equation that solves the regression. MARS gives a linearisation by intervals, and this second tab also gives the intervals where each linear piece must be applied. This linear approximation by intervals is very easy to integrate into the process computer without the need for complex mathematical libraries.
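To illustrate why such a model is easy to deploy, the sketch below evaluates a MARS-type model as a weighted sum of hinge functions. The knots and coefficients are invented for illustration and do not come from the ArcelorMittal case.

```python
# Minimal sketch of applying a MARS-type model outside the data mining tool.
# A fitted MARS model is a weighted sum of hinge functions max(0, x - c) or
# max(0, c - x), so the exported equation can be evaluated with plain arithmetic.
# The knots and coefficients below are invented for illustration only.

def hinge(x, knot, direction):
    return max(0.0, (x - knot) if direction > 0 else (knot - x))

# model: intercept + list of (coefficient, variable index, knot, direction)
model = {
    "intercept": 0.42,
    "terms": [
        (0.031, 0, 850.0, +1),   # 0.031 * max(0, furnace_temp - 850)
        (-0.120, 1, 2.5, -1),    # -0.120 * max(0, 2.5 - elongation)
    ],
}

def predict(sample):
    """sample: list of process variable values in the order used for fitting."""
    y = model["intercept"]
    for coef, idx, knot, direction in model["terms"]:
        y += coef * hinge(sample[idx], knot, direction)
    return y

print(predict([870.0, 1.8]))   # piecewise-linear prediction for one coil
```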

Figure 51: Influence of variables using MARS in a regression problem, and the resulting linear regression equation; MARS defines the linear regression by intervals.

The comparison of the smart components to the other approaches is summarised in chapter 2.3.21 on page 83.

2.3.19 Task 5.3 Application to analysis of surface defects

ThyssenKrupp and BFI did their investigation in the field of surface defects. This task was switched with ILVA / SSSA due to the availability of surface inspection data. At ThyssenKrupp Nirosta the stainless steel coils are inspected several times along the production chain. There are manual inspections done by specialists as well as automatic surface inspection done by Automatic Surface Inspection Systems (ASIS). For this task the defect "open and closed shells", as identified by an ASIS, was selected. The data coming from this system comprise the following information, available for each detected defect:

Position on the strip and size of the defect



Side of the strip



Type of defect



Severity of the defect


The data are aggregated per coil during the transfer from the TDW to the AutoDiag database to the following figures: 

Total number of defects per coil



Defect density (summarised defect area per strip area) for each side



Total defect density

For the application of the "brute force" approach the following variables were selected:

- Steel works: 237 variables, consisting of
  - Chemical analyses: 79 variables
  - Converter: 32 variables
  - Secondary metallurgy: 48 variables
  - Casting: 33 variables
  - Slab data: 45 variables
- Hot rolling: 495 variables, consisting of
  - Oven: 88 variables
  - Hot rolling: 407 variables

For the data filter the following criteria were selected: 

Two material classes with similar features



Data from one year

Finally, the data sample used for the analysis of surface defects consists of 761 variables and 16,081 data sets. The sample was reduced by 2,994 data sets for which the target information was missing. The investigation of the defect was done completely with the software developed during the AutoDiag project. The applied steps were:

Selection of the variables



Definition of the data filter



Visualisation of the data



Definition of the target classes



Compilation of the data sample



Application of the data mining methods:
  - Hypothesis tests
  - RapidMiner process

These steps can very easily be used by the personnel of ThyssenKrupp Nirosta who are not educated in the field of data mining. The complete session is documented by means of hardcopies in the annex (Figure 72 to Figure 75 on pages 122 to 124 at the end of this report). In the following paragraphs some exemplary steps are described in more detail.

Definition of the target classes

For the following investigations the total number of defects is used. To generate a proper target classification and to sharpen the separation between "good" and "bad" coils, the continuous information was divided into two classes by means of the following definition:

< 18 occurrences of the defect per coil: “good”



> 120 occurrences of the defect per coil: “bad”


This is shown in the following Figure 52.

Figure 52: Definition of target classes

Because the "grey area" between the borders for "good" and "bad" products is left out, the differences between these two classes usually become more obvious. This is a common practice in the data mining process. The user is supported at this point by means of an automatic calculation of these borders, with the option to adapt them manually. This operation reduced the number of examples in the data sample. The next Figure 53 shows the resulting class distribution.


Figure 53: Resulting class distribution in the selected data sample
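A minimal sketch of the class definition described above, using the thresholds quoted in the list (the column and function names are illustrative, not the NiCo implementation):

```python
# Minimal sketch of the target class definition with a "grey area".
import pandas as pd

df = pd.DataFrame({"defects_per_coil": [3, 17, 45, 80, 130, 250]})

def to_class(n_defects, good_limit=18, bad_limit=120):
    if n_defects < good_limit:
        return "good"
    if n_defects > bad_limit:
        return "bad"
    return None           # grey area: excluded from the data mining sample

df["target"] = df["defects_per_coil"].map(to_class)
sample = df.dropna(subset=["target"])   # coils in the grey area are dropped
print(sample)
```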

Result presentation

After the calculation is done, the user gets a list of variables ordered by their importance. The importance is given by the following figures:



Hypothesis tests:
  - p-value (probability value) and F-value from a one-way ANOVA test for continuous variables
  - Chi-squared value for nominal variables

RapidMiner process:
  - Overall influence index

An exemplary result table is shown in the following Figure 54.

Figure 54: List of influencing variables

Clicking on one line of this table opens the view shown in the following Figure 55. Here the selected variable is shown as a histogram together with the following additional information: the number of examples in each group is written on top of the bar and additionally coded with colour (the darker the bar, the more data sets are in this group), while the frequency (in %) of the occurrence of "bad" products inside the group is coded by the height of the bar. In the shown case the frequency decreases with increasing value of the selected variable (obfuscated in the figure). So the AutoDiag software tool has found a variable that has an obvious influence on the investigated target defect.

Figure 55: Detailed presentation of a variable selected as important
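The kind of aggregation behind such a categorised view can be sketched as follows; the data and column names are invented, and this is only an approximation of what the NiCo visualisation computes.

```python
# Illustrative sketch of the aggregation behind a categorised histogram:
# bin one process variable, then report the number of coils and the share of
# "bad" coils per bin.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "process_var": rng.normal(100.0, 10.0, 500),
    "target": rng.choice(["good", "bad"], 500, p=[0.8, 0.2]),
})

df["bin"] = pd.cut(df["process_var"], bins=8)
summary = df.groupby("bin", observed=True).agg(
    n_coils=("target", "size"),
    bad_share=("target", lambda s: (s == "bad").mean()),
)
print(summary)   # bar height ~ bad_share, bar colour/label ~ n_coils in Figure 55
```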

Results and conclusions

The generated list of influencing variables was discussed with the process experts. Some of the found relations are well known and obvious (alloy additions, some chemical elements). Other relations are not as clear as expected. The usability of the software was accepted by the target users. As a final result of the application of the "brute force" approach to a real problem it can be stated that, due to the used data mining methods, which take only uni-variate and linear correlations into account, the possibilities of data mining cannot be exploited in their full range. Using many variables and a lot of data sets prevents the usage of highly sophisticated data mining methods, due to unacceptable calculation times (in the practical use by the foreseen target users) or too high demands on the computational resources. So the priority list of influencing variables can only be used as a starting point for further discussions with the process experts. One option for improvement is the restriction of the data sample to be analysed: if the number of variables is limited, methods like the investigated SOM or the decision trees can be used with acceptable calculation times. But this solution does not fully conform to the initially followed "brute force" approach, because the investigations have to be done in several iterations with changing sets of variables, which also leads to an increasing time requirement for the whole investigation. The comparison of the smart components to the other approaches is summarised in chapter 2.3.21 on page 83.


2.3.20 Task 5.4 Evaluation of usability and tuning of the system

As a result of the different training courses at ArcelorMittal (described in section 2.3.16) the feedback from the users was collected; the main points are summarised below in Table 17.

Feature requested: Create a subsample population from the training data to be used as test data.
Description: Originally, all of the developed templates ask for two Excel files, one with the training data and one with the test data. Some users want to use only one file, at least at the beginning of an investigation, and let the software automatically select a subset of the data for testing the model.
Solution: Add an option in the wizard asking the user whether only one file is used for training and testing.

Feature requested: Do some pre-processing operations before starting the experiment.
Description: In the beginning no pre-processing tasks were done before starting the data mining operations; the users had the responsibility to remove spurious points or null data.
Solution: The wizard asks whether some basic filtering and pre-processing jobs should be done to automatically remove spurious and null data.

Feature requested: Decrease the time needed to run a fast investigation.
Description: Initially the experiments were built to achieve the best possible solution. With some algorithms, and even with the multiprocessing capabilities, this could take a long time because deep searches were performed.
Solution: The wizard asks whether a deep search is needed.

Feature requested: Load a previous model to test with new data.
Description: The first templates had no possibility to load a previous model. In the non-minimalistic version of RapidMiner the experiment created by the wizard could be modified to save and load models, but there was no wizard that could directly load a model and run it.
Solution: Create a new wizard that can load a model and run new test data on it.

Table 17: New features included because of feedback from the users

SSSA took into account some observations and suggestions made by the ILVA personnel who tested the software in order to enhance and improve the software usability. In particular, the main effort has been focused on the simplification of ILVAMiner usage by means of guided wizards, which help users in the creation of complex data mining tasks without a deep knowledge of database functioning or data mining models. Moreover, some tunings of the data-importer tool were made in order to adapt it to new data formats and to make its configuration easier by means of the introduction of configuration files, as depicted in chapter 2.3.6.

ThyssenKrupp and BFI presented the AutoDiag functionality inside NiCo during a training course to the target users and then released it to them. Due to the integration into the existing NiCo tool and the experience of the target users in using this system, the application of the new AutoDiag functionality did not lead to serious problems. The feedback of the users was mainly addressed to minor software implementation deficiencies or un-intercepted invalid combinations of parameters / data collocations. Nevertheless, two major deficiencies / extension demands occurred:

1. Confusing variable selection:

The main idea of the investigated approach is to use as many input variables as possible to ensure that the variables containing the important information regarding a target problem are taken into consideration. Often some data are omitted by experts due to their process knowledge, which might sometimes be incorrect; the "brute force" approach shall avoid this kind of routine blindness. The consequence is that there are many variables which can be selected for the further analysis. For example, flat stainless steel is cold rolled with several passes in a reversing mill, and for each pass data are measured and stored. At ThyssenKrupp Nirosta the data from the first pass (where the largest reduction is done) and the last pass (definition of the final surface) are available for AutoDiag. This has the effect that the number of variables containing roll pass data is doubled. This happens for several plants. Finally the number of available variables is > 3000. The variables are grouped on the first level by the different process steps (e.g. steel works, hot rolling, annealing, cold rolling). On a second level the variables are grouped by means of single plants or aggregates. For the steel works the sub-groups are chemical analysis, converter, caster and slab data. For hot rolling the sub-groups refer to the oven and the hot rolling mill. But even if the data are grouped on two levels, a single remaining list may still contain 100 or 200 variables, and it is often very uncomfortable to find and select several variables there.

Solution: To make the variable selection easier in a first step, the knowledge of experienced process experts is used to build a special group of variables which have a higher importance than others and which shall always be taken into consideration. The group called "global VIPs" contains variables like hot/cold strip width and thickness, material, some chemical elements and some major process route information. These variables are presented in a separate group on top of the list. Furthermore, these variables are selected by default.

2. No standard methods:

Several users who were briefed initially are well educated in the field of six-sigma. They therefore missed results based on standard statistical values: the priority list of the data mining process is sorted by means of an abstract value, calculated from the results of the different applied methods, which cannot be easily interpreted by the target users. They prefer a standard statistical value like the probability value ("p-value"), which is a typical coefficient resulting from statistical tests.

Solution: A standard hypothesis test (Student's t-test) was implemented into the AutoDiag module of NiCo. As a result, users get a list of the selected variables, sorted by means of the increasing p-value. This example shows the high flexibility of the selected AutoDiag solution of ThyssenKrupp Nirosta. Due to the given prerequisites (data collection, feature calculation, existing user interface) the system could be quickly and easily extended by standard analysis functionality. It is expected that, with increasing experience of the target users with the AutoDiag system, the requests to the system by the users will also increase. The fast implementation of the hypothesis test showed that the modular system can grow together with the demands of the users.
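A minimal sketch of such a p-value ranking, using SciPy's two-sample t-test on invented data (the variable names are illustrative and this is not the NiCo implementation):

```python
# Illustrative p-value ranking: for every selected variable a two-sample
# Student's t-test between the "good" and "bad" class is computed and the
# variables are listed by increasing p-value.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
data = {
    "oven_temp":   (rng.normal(1150, 20, 300), rng.normal(1165, 20, 80)),
    "strip_width": (rng.normal(1250, 50, 300), rng.normal(1252, 50, 80)),
}

results = []
for name, (good, bad) in data.items():
    t_stat, p_value = ttest_ind(good, bad)
    results.append((p_value, name))

for p_value, name in sorted(results):        # smallest p-value first
    print(f"{name:12s}  p = {p_value:.4f}")
```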

2.3.21 Task 5.5 Comparison of the different approaches

During the presented project two general strategies were investigated, the "brute force" and the "individual adapted" approach. The first one is used to investigate the dependencies of the target quality information on all variables of the relevant production chain. Here the philosophy is that very often the unexpected influences are the most important ones. Because of the number of variables this approach needs a lot of computing power, and for this reason simpler and faster data mining techniques were selected for the realisation. For the "individual adapted" approach the experiences of the proposers over the last years were investigated, documented and used as the basis for the selection of proper data mining methods. Here the user has the possibility to make a specific selection of the variables from the processes which shall be investigated; he also has the chance to influence the data mining process. For the comparison, the approaches were applied to identical data samples coming from the industrial partners of the consortium, covering different investigation targets. In the following paragraphs the data samples are described.


Data sample 1:
- Origin: AME
- target: strip flatness (binary information: positive/negative)
- distribution (training and validation):
  - target = positive: 575
  - target = negative: 438
- 121 process variables
- 731 data sets for training
- 282 data sets for validation

Data sample 2:
- Origin: TKL-NR
- target: surface defect detected (0 = not detected; 1 = low severity; 2 = high severity)
- distribution:
  - target = 0: 3089
  - target = 1: 431
  - target = 2: 453
  - target = null: 705
- 393 process variables from
  - secondary metallurgy
  - slab caster
  - reheating oven
  - hot rolling
  - cold rolling
- 4687 data sets

Data sample 3:
- Origin: RIVA
- target: mechanical characteristic (binary information: good / bad)
- distribution:
  - target = good: 612
  - target = bad: 216
- 124 process variables
- 828 data sets

Results from the "individual adapted" approach:

Here the data sample 1 was used to select the most important variables as well as for the classification. The result of the variable selection is shown in the following Figure 56.


Figure 56: Importance of variables selected by the „individual adapted‟ approach for data sample 1

The figure shows the result of the "FScore" operator (see page 48) for the variables in descending order. The higher the value for a variable, the higher is its importance. The 15 most important variables, ordered by name, are:

V3, V16, V19, V20, V28, V37, V43, V49, V51, V54, V56, V61, V62, V94, V97
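The exact definition of the FScore operator is given earlier in the report (page 48, not repeated here). Purely as an illustration of the ranking idea, one common two-class F-score formulation can be sketched as follows; the data and variable values are invented.

```python
# Illustrative two-class F-score criterion for ranking variables.
# This is one common formulation, shown only to make the ranking idea concrete;
# it is not necessarily identical to the report's FScore operator.
import numpy as np

def f_score(x_pos, x_neg):
    """Discrimination of one variable between the two classes."""
    x_all = np.concatenate([x_pos, x_neg])
    num = (x_pos.mean() - x_all.mean()) ** 2 + (x_neg.mean() - x_all.mean()) ** 2
    den = x_pos.var(ddof=1) + x_neg.var(ddof=1)
    return num / den

rng = np.random.default_rng(2)
pos = {"V3": rng.normal(1.0, 0.2, 200), "V19": rng.normal(0.5, 0.3, 200)}
neg = {"V3": rng.normal(0.4, 0.2, 200), "V19": rng.normal(0.5, 0.3, 200)}

ranking = sorted(((f_score(pos[v], neg[v]), v) for v in pos), reverse=True)
print(ranking)   # V3 should clearly outrank V19 in this invented example
```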

To check the result, a classification model was trained with the selected variables. The classification is done by the SVM method, described in chapter 2.3.13 (page 56) in the paragraph above Figure 35, and was carried out with the wizard "Analysis of good and bad product with search of influencing variables" (see chapter 2.3.14.1 on page 63). The results for the training (with the training data sample) and the validation (with the validation data sample) are:

                 true positive   true negative   class precision
pred. positive   360             39              90.23%
pred. negative   18              314             94.58%
class recall     95.24%          88.95%

Table 18: Training results for the selected variables

                 true positive   true negative   class precision
pred. positive   99              11              90.00%
pred. negative   98              74              43.02%
class recall     50.25%          87.06%

Table 19: Validation results for the selected variables

The data sample 2 under study is pre-processed using the wizard developed as a request of the users during Task 5.4 "Evaluation of usability and tuning of the system", see Table 17 (page 82), "New features included because of feedback from the users". This pre-processing wizard removes features that have more than 20% of rows with nulls (this threshold is a parameter that can be changed). In the remaining data the rows with nulls are also removed. From the original 393 features the wizard removes 331; it also removes 2446 of the 4686 rows.
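A minimal sketch of this pre-processing rule (the data are invented; the actual wizard works on the RapidMiner example set, not on pandas):

```python
# Minimal sketch of the pre-processing rule described above: first drop columns
# with more than 20 % missing values, then drop the remaining incomplete rows.
import numpy as np
import pandas as pd

df = pd.DataFrame({                      # illustrative data with missing values
    "A": [1.0, 2.0, np.nan, 4.0, 5.0],
    "B": [np.nan, np.nan, np.nan, 1.0, 2.0],   # 60 % nulls -> column is dropped
    "C": [1.0, np.nan, 3.0, 4.0, 5.0],
})

max_null_share = 0.20                    # the adjustable threshold mentioned above
keep = df.columns[df.isna().mean() <= max_null_share]
clean = df[keep].dropna()

print(clean)                             # columns A and C, rows without nulls
```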


The pre-processed data are used in the wizard/template called "Multi-classification problem" (see chapter 2.3.14.2 on page 65), which uses SVM as algorithm. The output is a model that gives an accuracy of 76.96%:

               true 0     true 2     true 1     class precision
pred. 0        1724       252        264        76.96%
pred. 2        0          0          0          0.00%
pred. 1        0          0          0          0.00%
class recall   100.00%    0.00%      0.00%

Table 20: Result of the "individual adapted" approach to data sample 2

Due to the big difference between the number of samples of type 0 and of types 1 and 2, the algorithm is not able to find a better solution than the simplest one, which is to classify all the data as type 0. For the individual adapted approach it is better to put more attention into the data selection than into a large amount of data, which is more convenient for the brute force approach. If types 1 and 2 are merged into one new type, the algorithm starts to do real classification. When the number of samples of each type is similar and comes from a more similar environment, it is easier for the algorithm to extract knowledge from the data. The algorithm was not designed to handle this situation; there are other alternatives to compensate for this issue, but they were not in the scope of the project. The wizard also gives a ranking of the influence of the variables, listed here by relative importance:

CR_Thick, HS_Temp3, Chem5, SM_Proc45, HS_Temp4, HS_Temp2, ProcPar_15, SM_Proc39, ProcPar_2, ProcPar_28, SM_Proc9, Date, ProcPar_41, ProcPar_5, Chem29, ProcPar_6, HS_Thick, HS_Temp1, Chem25, SM_Proc48, Chem8, SM_Proc24, ProcPar_4, Chem1, ProcPar_34, Chem4, Chem30, Chem17, Chem3

The following Figure 57 shows the selected variables ordered by their importance.


Figure 57: Selected variables, ordered by their importance, for data sample 2

Using the "Analysis of good and bad product with search of influencing variables" wizard (see chapter 2.3.14.1 on page 63) on data sample 3, the "individual adapted" approach gives the following results:

               true GOOD    true BAD    class precision
pred. GOOD     592          115         83.73%
pred. BAD      20           101         83.47%
class recall   96.73%       46.76%

Table 21: Classification results of the "individual adapted" approach on data sample 3

The 31 most significant variables (ordered by name) are:

V8, V22, V27, V28, V29, V30, V37, V49, V52, V54, V55, V57, V62, V65, V70, V73, V80, V87, V88, V89, V90, V96, V98, V99, V102, V106, V117, V118, V119, V121, V122

The relative importance based on the FScore is shown in the following Figure 58.


Figure 58: Relative importance of the variables from data sample 3

Results from the "brute force" approach:

The above described data samples were applied to the "brute force" approach (see Figure 67 to Figure 71 in the annex). Here the flexibility of the developed solution could be used: the RapidMiner process file currently used in the application at ThyssenKrupp Nirosta was applied at BFI. The following result figures show the so-called "Overall Index" (OI), which is a linear combination of the results of the several used methods, sorted by descending OI.

Data sample 1:

Figure 59: Selected variables by the „brute force‟ approach


The following list shows the 15 most important variables, ordered by name. The variables that were also found by the "individual adapted" approach are coded by blue colour and italic underlined font.

V1, V3, V9, V16, V20, V25, V28, V37, V49, V51, V54, V56, V61, V94, V97

For the evaluation of the result the selected variables were used for a classification. The data sample was divided into training (80%) and test (20%) data sets by random selection. For the classification a neural network (multi-layer perceptron, MLP) was used. The results are shown in the following tables.

                 true positive   true negative   class precision
pred. positive   238             152             61.03%
pred. negative   63              357             85.00%
class recall     73.46%          57.42%

Table 22: Training results for the selected variables for data sample 1

                 true positive   true negative   class precision
pred. positive   35              13              72.92%
pred. negative   38              117             75.48%
class recall     47.94%          88.88%

Table 23: Test results for the selected variables for data sample 1

Data sample 2:

The most important variables selected by the "brute force" approach are as follows:

ProcPar_43, Temp_23, ProcPar_44, Temp_24, ProcPar_51, ProcPar_50, ProcPar_66, Temp_1, ProcPar_45, ProcPar_6, Temp_26, Temp_27, ProcPar_56, SM_Proc46, Temp_28, ProcPar_4, ProcPar_49, ProcPar_57, ProcPar_48, Temp_22, ProcPar_3, Temp_25, ProcPar_55, ProcPar_52, Temp_8, Temp_21, Temp_10, ProcPar_62, ProcPar_67, ProcPar_54, Temp_19

There is no accordance with the results of the "individual adapted" approach. The relative importance is shown in the following Figure 60.


Figure 60: Result of the „brute force‟ approach from the data sample 2

The classification of the data sample 2 using the first 15 selected variables leads to the following results:

                 true positive   true negative   class precision
pred. positive   506             82              86.05%
pred. negative   0               0               0.00%
class recall     100.00%         0.00%

Table 24: Training results for the selected variables for data sample 2

                 true positive   true negative   class precision
pred. positive   126             21              85.71%
pred. negative   0               0               0.00%
class recall     100.00%         0.00%

Table 25: Test results for the selected variables for data sample 2


Data sample 3:

Figure 61: Result of the „brute force‟ approach from the data sample 3

The 31 most important variables are:

V122, V29, V115, V70, V119, V57, V106, V71, V112, V28, V90, V16, V62, V37, V118, V73, V78, V55, V99, V82, V22, V14, V27, V102, V96, V30, V98, V87, V10, V65, V52

There are 22 of 31 variables selected by both approaches (~71%). The classification of the data sample 3 using the first 18 selected variables leads to the following results:

               true good    true bad    class precision
pred. good     468          100         82.39%
pred. bad      21           73          77.66%
class recall   95.70%       42.20%

Table 26: Training results for the selected variables for data sample 3

               true good    true bad    class precision
pred. good     113          31          78.47%
pred. bad      10           11          47.62%
class recall   91.87%       26.19%

Table 27: Test results for the selected variables for data sample 3


Results from the "smart components":

For the comparison the data samples were also applied to the "smart components". SSSA and ILVA developed a method of variable selection based on multiple variable-weighting algorithms whose outputs are aggregated and ordered by means of the concepts of Pareto dominance and Pareto rank, as explained below. The employed algorithms are:

Weight by correlation: it performs a variable weighting based upon the residuals of an unweighted local polynomial regression.



Weight by SVM: this operator uses the coefficients of a hyperplane calculated by a Support Vector Machine (SVM) as feature weights.



Weight by Information Gain Ratio: this operator calculates the relevance of a feature by computing the information gain ratio for the class distribution.

Each of these algorithms produces as output a vector that associates a weight (importance) to each variable of the input set. Different algorithms typically assign weights in different ways, so it is difficult to compare and rank them. To overcome this problem, a method to rank the most influencing variables based on Pareto dominance is proposed. A vector k Pareto-dominates a vector q if and only if each component of k is less than or equal to the corresponding component of q and at least for one component this comparison is strict:

\[
k \succ q \iff k_i \le q_i \;\forall i \;\wedge\; \exists j : k_j < q_j
\]

The Pareto rank associated to a vector is the number of vectors that it dominates. So in this method, the vector(s) with the highest Pareto rank correspond to the variable(s) that result to be the most influencing for all the weighting methods, and so on for decreasing ranks. In the following results the Pareto rank has been normalised.
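As an illustration of the aggregation idea, the sketch below computes three weight vectors with stand-ins for the operators named above (absolute correlation, absolute linear-SVM coefficients, and mutual information in place of the information gain ratio) and combines them by Pareto rank. This is not the ILVAMiner implementation, the data are invented, and the normalisation shown is only one possible choice.

```python
# Illustrative aggregation of several feature-weighting schemes via Pareto rank.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=400) > 0).astype(int)

Xs = StandardScaler().fit_transform(X)
w_corr = np.abs([np.corrcoef(Xs[:, i], y)[0, 1] for i in range(Xs.shape[1])])
w_svm = np.abs(LinearSVC(dual=False).fit(Xs, y).coef_[0])
w_info = mutual_info_classif(Xs, y, random_state=0)

weights = np.column_stack([w_corr, w_svm, w_info])    # one weight vector per variable

def dominates(a, b):
    """a dominates b: a is at least as high in every weighting and higher in one."""
    return np.all(a >= b) and np.any(a > b)

# Rank = number of other variables that this variable dominates (higher = better),
# then normalised to [0, 1]; this is only one possible normalisation.
rank = np.array([sum(dominates(w, other) for other in weights) for w in weights])
norm = rank / rank.max() if rank.max() > 0 else rank.astype(float)
print(sorted(zip(norm, [f"V{i + 1}" for i in range(Xs.shape[1])]), reverse=True))
```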

Data sample 1:

Figure 62: Normalised Pareto ranking for data sample 1

The first 15 variables are as follows. The accordance with the "individual adapted" approach is marked with blue colour, underlined and italic font; the accordance with the "brute force" approach is marked with a leading star.




*V97, *V94, *V3, *V56, *V54, *V37, *V61, V34, V101, V82, V15, V2, V10, V47, *V20

There are 8 of 15 selected variables (~53%) which also appear in the "individual adapted" and in the "brute force" approach. The classification of the data sample 1 using these selected variables leads to the following results:

                 true positive   true negative   class precision
pred. positive   337             143             70.21%
pred. negative   83              247             74.85%
class recall     80.24%          63.34%

Table 28: Training results for the selected variables for data sample 1

                 true positive   true negative   class precision
pred. positive   98              20              83.05%
pred. negative   57              28              32.94%
class recall     63.23%          58.34%

Table 29: Test results for the selected variables for data sample 1

Data sample 2:

Figure 63: Normalised pareto ranking for data sample 2


The selected variables are:

Chem30, SM_Proc35, SM_Proc1, HR_Proc169, Chem8, SM_Proc27, ProcPar_53, HS_Temp4, HS_Temp2, SM_Proc17, ProcPar_61, SM_Proc11, Chem19, SM_Proc43, Chem26, SM_Proc48, HR_Proc170, ProcPar_41, ProcPar_11, Chem3, Date, HS_Temp1, SM_Proc47, ProcPar_34, Chem27, ProcPar_64, SM_Proc5, HR_Proc173, ProcPar_14, CR_Thick, Chem1

There are 11 of 31 selected variables (~35%) which also appear in the "individual adapted" approach, and no conformity with the results of the "brute force" approach. The classification of the data sample 2 using the first 16 selected variables leads to the following results:

                 true positive   true negative   class precision
pred. positive   493             75              86.79%
pred. negative   13              7               35.00%
class recall     97.43%          8.53%

Table 30: Training results for the selected variables for data sample 2

                 true positive   true negative   class precision
pred. positive   124             20              86.11%
pred. negative   2               1               50.00%
class recall     98.41%          5.00%

Table 31: Test results for the selected variables for data sample 2


Data sample 3:

Figure 64: Normalised pareto ranking for data sample 3

The selected variables are (a leading star marks accordance with the "brute force" approach):

*V122, *V16, *V115, *V90, V96, *V71, *V106, *V62, *V29, V86, *V73, V58, V20, V8, V113, *V57, *V119, *V99, *V118, V72, V101, *V102, V51, *V14, *V22, *V87, V69, V15, *V98, *V65, *V52

There are 18 of 31 selected variables (~58%) which also appear in the "individual adapted" approach and 20 of 31 variables (~65%) identical with the results of the "brute force" approach. The classification of the data sample 3 using the first 21 selected variables leads to the following results:

               true good    true bad    class precision
pred. good     462          91          83.54%
pred. bad      27           82          75.23%
class recall   94.48%       47.40%

Table 32: Training results for the selected variables for data sample 3

               true good    true bad    class precision
pred. good     116          8           93.55%
pred. bad      7            35          83.33%
class recall   91.31%       81.40%

Table 33: Test results for the selected variables for data sample 3


Discussion of the comparison

It can be seen that the three tested approaches led to very similar results in two of the three cases. For the selection of the most important variables, the comparison of the "brute force" to the "individual adapted" approach can be summarised as follows:

data sample 1: 12 of 15 variables selected are identical (80%)



data sample 2: 0 of 31 variables selected are identical (0%)



data sample 3: 22 of 31 variables selected are identical (~71%)

For the selection of the most important variables of the „smart components‟ to the „individual adapted‟ it can be summarised as follows: 

data sample 1: 8 of 15 variables selected are identical (53%)



data sample 2: 11 of 31 variables selected are identical (35%)



data sample 3: 20 of 31 variables selected are identical (~65%)

It can be stated that for two of the exemplary cases all approaches are useful. The third one (data sample 2) leads, due to the very inhomogeneous composition of the data, to unreliable results: the pre-processing eliminates many variables and a lot of examples, which leads to a very uneven distribution of the two cases of the target variable. The result is that the "brute force" approach fails in this case due to its less sophisticated methods.

Further comparison of the approaches

During this task the consortium discussed all strategies and the results reached with them, both in the practical usage at the industrial sites and during the comparison. Furthermore, the advantages and disadvantages were balanced. The result of the discussion is shown in the following paragraphs:

Different data environment

The effort to provide the necessary product quality and process data is very different. For the „brute force‟ approach a large amount of data / variables has to be accessible by the system. Here the provision of the data during the implementation of the system is time consuming, especially if there is no common data source (e.g. a technical data warehouse) so that the data have to be gathered from different sources and have to be connected to the product. In contrast to that for the „individual adapted‟ approach only problem specific data have to be prepared. When focussing on a specific problem usually data from only few production stages are necessary which reduces the effort distinctly. 

Different users knowledge necessary

The demands to the target users regarding specific knowledge are also very different. For the „brute force‟ approach no detailed knowledge is necessary. The user selects the target and the input variables and starts the system. The result presentation is as easy as possible. For the „individual adapted‟ approach the user has at least to assign the problem to a given solution. Here a more skilled user is necessary. 

Different methods can be applied

The „brute force‟ approach tries to incorporate as much process and product quality variables as possible. This leads to a large amount of data that have to be processed. To calculate the results in a reasonable time only „simpler‟ methods can be used. In opposite to that for the „individual adapted‟ approach more sophisticated methods were used, which are adapted to the data mining problem for which they were selected. 

Different result quality

As described above, for the "brute force" approach only "simpler" methods can be used. This leads to the fact that the results are less exact than they could be when using highly sophisticated data mining methods; at this point only more general hints can be expected. Being focussed on a specific problem, the individual solution can reach more detailed results. Specialised data mining methods that reach the best results for a specific problem are used for the "individual adapted" approach. This benefit has the disadvantage that for each group of data mining problems an individual approach has to be developed and implemented.

Final validation of the solutions

Summarised, it can be stated that all approaches are useful for real industrial tasks. From the practical exercises with the target users and the experience of the test phase it can be concluded that all approaches can help the personnel in their daily work when searching for process parameters influencing the product quality. They can find the reasons for quality deficiencies faster than before, so they can eliminate such causes earlier, which leads to fewer defective products. The advantage is the relief of the personnel from this additional task, so they can focus on their main jobs. Comparing the classification results of the three approaches for the three different data samples, some similarities can be detected. Data sample 2 couldn't be handled by any of the approaches; here the necessary information for the detection of the influencing variables was not found in the available data sample. For data sample 1 the "individual adapted" approach seems to be better than the other two approaches. This is not surprising, because the approach was developed to solve this problem: it was individually adapted. The results of the other two approaches are comparable. Finally, for data sample 3 all three approaches lead to comparable results, even if the "individual adapted" approach was not tuned to this problem and the "brute force" approach uses only very common methods. The advantages of the different approaches are as follows:

The "brute force" approach can be used by non-data-mining experts for a first view of a quality problem, especially currently occurring ones. A wider circle of employees can use these techniques, which leads, together with their expert knowledge regarding the steel production processes, to a faster reaction to production problems.



The "individual adapted" approach can be used for product and process problems that occur seldom. Developing an individually adapted solution for that kind of problem can help to develop monitoring functions which detect such problems at an early stage, so that the delivery of defective products and the related customer rejections can be avoided. The only disadvantage is the larger effort for the development of an individually adapted approach for each problem that has to be investigated.



The "smart components" can easily be used to hide the complexity of the data mining techniques from the target users. With the developed components the user is supported when selecting data or getting the results visualised. Hiding the complexity of the data selection (by means of a complex SQL statement) and of the data mining methods from the target user increases the acceptance of the system by the process experts.

2.3.22 Task 5.6 Determination of the transferability

One important demand to RFCS projects is the focus to generate results that can be used in whole European steel industry, where they are applicable, respectively. For the presented project this was successfully realised. For the determination of the transferability the following can be stated: 

Open framework that can be realised in any kind of steel industry (flat products, long products etc.).

The framework developed during the project is completely independent from the type of the steel producer. There are no methods or solutions that depend on the type of the steel product. The only necessity is the availability of data describing the product which should be solved nowadays for the European steel industry. 

Individual interfaces to data supply as well as result visualisation are always necessary.

As for every software solution, individual adaptations are necessary when transferring the software to another steel production facility. For the software developed in the presented project, individual interfaces to the data supply as well as to the user interface are necessary. For both of them every steel producer has its own environment, which is usually very inhomogeneous due to the different ages of the several plants. Also, there is no standard, e.g. for the access to process or product quality data. So an installation "out of the box" will never be realisable.

Easy exchanges of methods / operators possible due to a company independent core of the framework build with RapidMiner.

One major aim of the project was to hide the underlying data mining methods to the target users. This puts them into the position to use these methods without deeper knowledge regarding e.g. the necessary prerequisites for their application. The result is that these powerful methods can be distributed to a wider range of target users. Nevertheless, the implemented methods are defined using a standard tool and are stored in a common file format. So an exchange of these methods is very easy, independent of the individual implementation of the interfaces. 

Availability of the sources opens a wider range for individual solutions.

The used data mining tool RapidMiner is freely available, including the source code. This puts the consortium into the position to adapt the software to the demands of the steel industry. Own modules were developed, like the shown template-based wizard or individual learners. These developments could be started based on available modules, so that duplicate work could be avoided.

Different used software techniques have shown a wide range of realisation approaches that can be found at the European steel industry.

As described above the IT environment of each steel producer is very individual. So a common solution of the whole system for each steel producer is not possible. During the project different implementations based on the different IT environment of the industrial partners of the project showed the transferability of the developed system. It was shown that a client / server architecture is possible as well as the integration into an existing tool or a standalone application. 

Software is open source, which minimises the costs for implementation and testing, but a commercial maintenance is also available.

The main component of the developed common framework is based on a well-known and widely distributed open source data mining software. This puts an interested steel producer into the position to test the developed solution, e.g. on a pilot plant, with low financial effort. If the test leads to the expected results, the system can be rolled out to the whole site, covering all plants. For that, also a commercial licence is available, which leads to professional support in case of the appearance of software faults.

2.4 Conclusions

The activities planned for this project were mainly reached and the achieved results were satisfactory. The conclusions gathered during the execution of this project can be summarized as follows: 

The application of data mining methods in an industrial environment for a wider range of potential users is possible.



The necessary knowledge for the application of data mining methods can be “stored” inside the developed solutions.



The usage of data mining methods can be hidden to the target users so that they do not need detailed knowledge regarding the underlying methods.



The implementation of the common framework can be done using different software techniques, ranging from client/server architecture to integration into an existing tool or a standalone application.



One common framework can be used for different approaches for the application of data mining methods, based on open source software.



The developed common framework can be integrated into different given IT environments of the steel producers.




The "brute force" approach is usable for the first investigation of a data mining problem using simpler methods. The results are of lower quality than with the individual approach, but very useful to get a first impression of the problem and to define further, more detailed investigations.



The "individual adapted" approach enables being more specific in the type of algorithm to apply in each case. The developed wizards guide the user to solve defined quality problems using the most suitable algorithm. This approach achieves better accuracy than the brute force approach, but it is a heavily time-consuming task, even using the multithreaded design of the developed operators. This time consumption is more evident when a deep optimum search is done (not always needed). The increasing number of processors in current computers will reduce this bottleneck, and the multithreaded design decision has shown that it is a good choice in spite of a more difficult internal coding.



The "smart components" developed during the presented project have successfully hidden the underlying methodology of the used data mining methods. By means of a sophisticated data selection, an optimisation and an intelligent data visualisation it was shown that real problems of flat steel production can be successfully analysed.



The approaches developed during the presented project were successfully applied to different problems that appear during the steel production.



The developed system can be transferred to any kind of steel production. The only requirement is the availability of the necessary process and product quality data.

2.5 Exploitation and impact of the research results

During the project a common framework for the application of data mining methods in an industrial environment was developed. The framework is based on the open source software RapidMiner, which covers all requirements defined during the initial phase of the project (see section 2.3.5 on page 22). The advantages are the avoidance of duplicate work for the development of such a framework, the availability of a lot of well-tested data mining functions and the availability of the source code. This enables the customisation of methods and the development of new functions to meet the special needs of the steel producers. This framework was integrated into the IT environments of the steel producers of the consortium, ThyssenKrupp Nirosta, ArcelorMittal Espana and ILVA. Here, different conditions regarding the IT environment had to be considered. This was done by using different software techniques, showing the transferability of the developed framework. The different approaches were developed inside the common framework by means of available and newly developed RapidMiner functions. With that, very complex and non-trivial tools were realised which could easily be exchanged between the partners. This exchange was done during the project duration to compare the different approaches with data coming from the industrial partners (see section 2.3.21 on page 83). Even if the implementations are of prototypic character, they are still in use and it is foreseen that they will be continuously expanded to fulfil the increasing demands of the target users. The advantages gathered with the new system can be summarised as follows:

The experience of the first application of the AutoDiag system under industrial conditions (see section 2.3.20 on page 82) shows that the people using such a system do not need to be data mining experts. Therefore a larger number of potential users can investigate more production problems or product quality deficiencies, bringing in their expertise in flat steel production. This will lead to better production and good product quality, which in turn can increase the yield and reduce customer claims.



The number of employees with a more detailed knowledge of the correlations between the production behaviour and the product quality will increase. Hiding the complexity of the data mining methods from the target user reduces the reluctance to use the system. The first successful investigations have increased acceptance, because the employees usually have a financial benefit when the yield of a plant can be increased.



A slight optimisation of the prediction or control of a process using data mining can have an enormous impact on the savings or benefits of a single factory in steel production. Multiplied by the number of different processes that can be optimised and by the tonnes of flat steel produced, a potential saving of millions of Euros could be achieved if a key parameter can be slightly improved.




Bringing data mining techniques to production people increases the possibility of improving the process by combining the power of the data mining methods with the detailed process knowledge of the plant personnel.



Making the power of data mining available to the plant personnel automatically increases the number of “experts” able to solve quality problems immediately. The workload of the few real data mining experts available in the steel industry can be dramatically reduced, so that they can focus on improving and further developing the AutoDiag system and on distributing the data mining knowledge throughout the whole company.
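To illustrate the integration technique mentioned above, the following fragment is a minimal sketch of how a stored RapidMiner process can be executed from surrounding Java code, as the partner-specific front ends (NiCo-Miner, ILVAMiner, Mytica via JNI) ultimately do. It is based on the public RapidMiner 5.x embedding API; the class name and the process file name are placeholders and not artefacts of the project.

    import java.io.File;
    import com.rapidminer.Process;
    import com.rapidminer.RapidMiner;
    import com.rapidminer.operator.IOContainer;

    public class AutoDiagProcessRunner {
        public static void main(String[] args) throws Exception {
            // Initialise the RapidMiner engine without its GUI, as an embedding
            // host application would do.
            RapidMiner.setExecutionMode(RapidMiner.ExecutionMode.COMMAND_LINE);
            RapidMiner.init();

            // Load a previously designed analysis process from its XML definition
            // (the file name is a placeholder).
            Process process = new Process(new File("autodiag_analysis.rmp"));

            // Run the process; the IOContainer holds the produced result objects
            // (models, attribute weights, example sets) for display in the host GUI.
            IOContainer results = process.run();
            System.out.println("Process finished, results: " + results);
        }
    }

In the C#/.NET environment at ArcelorMittal the same call chain is reached through the JNI bridge described in section 2.3.5 (Figure 5), so that one and the same RapidMiner process definition can be exchanged between the partners.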

The common framework developed during the project runtime was and is still used by the employees of the industrial partners. They use the data mining functionality for different applications that are described in the following paragraphs.

ArcelorMittal Espana: The tool has been integrated into the existing operative working framework called Mytica. Personnel responsible for product quality were trained to use Mytica. During the project, information from the ASIS was also filtered and introduced into Mytica with higher accuracy, so that the algorithms developed for RapidMiner are currently being applied to find relations between process variables and product variables from casting. Some rules to prevent quality defects in hot coils originating from the slabs have already been defined and validated by the process teams.

At ThyssenKrupp Nirosta the implemented NiCo-Miner tool was well accepted: the number of people investigating the relations between process parameters and product features increased from three experts to more than 50 colleagues. They belong to different departments (mainly quality management, material development and production) and regularly use the NiCo tool. This led to optimised processes with an estimated benefit of more than 150 k€ in the first year and to a much faster identification of root causes for quality deficiencies. Considering the effort for the initial training of the users, which takes about 3 hours per session, and the effort for the maintenance of the system, which consumes about 1 day per month, great value can be generated. After the project end it is planned to further extend the functionality and to train more potential users.

The first application example from ThyssenKrupp Nirosta is the optimisation of the surface quality of strip produced on the bright annealed process route. Here the target value is the occurrence of non-metallic inclusions observed by an operator on the cold strip. The relevant data for a time range of one year were selected by the NiCo-Miner tool and additionally filtered to the specific process route, the same strip geometry and the same material class. By means of the RapidMiner process, which runs hidden inside the NiCo-Miner tool, those variables from the steel melt shop and the caster that have the most important influence on the target quality problem were selected. By analysing the root causes for the formation of non-metallic inclusions, the process could be significantly improved, yielding fewer non-metallic inclusions. Another investigation concerned the fracture strain A80. Especially for ferritic strips, the influencing variables of the annealing process were investigated. Through the NiCo-Miner tool functionality, optimal process routes and parameters could be identified to meet customer demands for sufficient A80 values. In general, the application of the NiCo-Miner tool showed that these kinds of investigations can be done efficiently by process experts in a few hours. Before AutoDiag was realised, such investigations had to be done by the data mining experts, who had to collect and prepare the data for each investigation separately. This procedure was fault-prone and took several days up to weeks. 
Finally it can be concluded that for ThyssenKrupp Nirosta the usage of the AutoDiag functionality inside the NiCo-Miner tool relieves the data mining experts, enables a larger number of employees to investigate product quality problems and to find optimised process parameters, and allows a larger number of quality problems to be investigated in a shorter time.

At ILVA the smart components are installed by means of the ILVAMiner and are used for the optimisation of the schedule at the steel shop. In particular, scheduling optimisation that preserves and, if possible, improves the quality of the final product was considered. The aim was to define the effect of some elements on different steel grades, in order to see whether the ranges and groupings usually adopted in the scheduling phase could be optimised, finding the best balance between requested product features and steel shop scheduling flexibility.

The economic potential of the developed system expected by the industrial partners of the consortium cannot be stated directly. If the causes of a product quality deficiency can be identified and eliminated very quickly, the amount of scrap and/or the number of customer claims will be reduced and therefore the yield will increase. The increased know-how of the personnel in the field of process understanding and cause-and-effect relationships also cannot be expressed in purely economic terms. For the industrial partners the economic benefit and the amortisation are summarised in the following Table 34 to Table 36. Until the completion of the presented report there were no publications or conference presentations of the project results, and no patent filing is foreseen.

Table 34: Economic benefit and amortisation for ILVA


Table 35: Economic benefit and amortisation for ArcelorMittal Espana

Table 36: Economic benefit and amortisation for ThyssenKrupp Nirosta


3.

Appendices

3.1

Analysis of previous projects

QDB - ECSC Project 7210-PR/171: “Implementation of an assessment and analysing system for the utilization of a factory wide product quality data-base”, 01.07.1999 – 30.06.2002 Summary: The central goal of the project was the detection of dependencies between process/plant parameters and product quality over the whole process chain of steel strip, based on length segments of the product. This had to be realised using data-based methods operating on a factory-wide product quality database. The objectives of the project can be summarised as follows: 

To implement a data warehouse for a factory-wide process/plant and product quality database, specialised for integrated steel works and the related technological problems.



To implement different toolboxes for the detection of relationships between product properties and process data, data analysis, model construction and combination of all information which is stored in the database. Process and quality engineers are the typical users of this tool.

To select the hardware and commercial software for the global system and introduce it in the field and to demonstrate the on-line capability of the system. Partners: 



BFI



Aceralia



Centro Sviluppo Materiali (CSM)



Forschungs- und Qualitätszentrum Brandenburg GmbH



EKO Stahl GmbH

 IRSID Data pre-processing: Pre-processing of data is supported by DataTools: 

Filtering



Removing outliers



Plausibility check



Handling of NULL-entries



Ensuring of variance



Coding of text and categorical variables



Quantization

 Computation of derived variables Quality problems: 

Analyse scale defects.



Analyse shuttering problems.



Tale end clog.



Scale.



Slivers.


Analysis method and Reached results: As a demonstration example the sliver defect was studied. Slivers are defined as material overlays of different shapes and sizes occurring irregularly on the surface of the rolled material and only partly connected with the base material. Three classes (good, medium and bad) were used to categorise the strips, but this led to no direct results. In order to obtain sharper results on functional connections, the number of classes was reduced to 2. Categorised histograms, decision trees and genetic algorithms were tested on the data. The categorised histograms show an influence of mould level, casting performance, slab width and argon pressure. Decision trees and genetic algorithms identified different influencing parameters, which leads to the assumption that not all significant factors of influence could be identified yet. Nevertheless, the following significant parameters can be determined as a result of the influence analysis: 

Mold level deviations.



Argon pressure and quantity.



Immersion depth of the nozzle during the sequence.



Tundish weight.



Sheet thickness in the finishing mill.



Modification of the casting width.



Edging draft.

Less strong influence was shown by: 

Vanadium.



Shift number.

 Tundish number
DataTools offers the possibility of model identification. Using a neural network as model structure, a tendency of defect formation with the casting width could be detected. The reduction of the argon quantity and pressure, the reduction of the immersion depth of the nozzle during the sequence and the change of the casting width led to a reduced number of defects.

FACTMON - RFCS Project RFS-CR-03041: “Factory-wide and quality related production monitoring by data-warehouse exploitation”, 01.07.2003 – 21.12.2006 Summary: The central and main idea of this project is the intelligent combination of data-warehouse technology with integrated plant, process and quality monitoring approaches based on a medium- and long-term time scale. The project pursued the following objectives: 

To develop concepts for the integrated monitoring of plants, processes and products based on the information stored in a factory-wide database and focussed to the intermediate and final product quality,



to extend existing data-warehouses by additional information, especially regarding operating practices and plant information, which are necessary for the developed concepts,



to develop new automatic and through-process monitoring methods and/or adapt existing methods to the factory-wide nature of the data-warehouse,



to test the developed concepts and methods in the field and to document results and experiences,

 to verify the transferability of the new approaches. The overall aim of the project was to improve the product quality and to increase the yield of the production.


Partners: 

BFI



Aceralia



Centro Sviluppo Materiali (CSM)



Forschungs- und Qualitätszentrum Brandenburg GmbH

 EKO Stahl GmbH
The monitoring methods proposed by BFI and ARCELOR EISENHÜTTENSTADT depend on SPC techniques using raw or transformed data. A large variety of models and aggregation methods are made available by the newly developed software system DATAMON, which accesses the data on-line from the ARCELOR EISENHÜTTENSTADT data warehouse system ZQDB. The monitoring strategy proposed by Arcelor España, based on the combination of SPC and OLAP techniques, has been developed and implemented as a software tool accessible to all the staff of the line. The SPC technology has been adapted to monitor the line behaviour globally. A previous selection of the most representative variables has been done by means of graphical and multivariate data analysis techniques (clustering, projections and neural networks). The control of operative practices has been developed and implemented by CSM as a supplementary tool for the control and supervision of material quality assurance. The existing data warehouse, containing quality and process data, has been completed with new features that describe the correct application of set-point parameters during the process, in order to achieve the best quality. Data pre-processing: 

aggregation operations like computation of mean values, standard deviations, etc.



filter operations to handle for instance outliers and NULL-values,

 computation of derived features as combinations of several variables.
Quality problems: In Arcelor España, the project has been applied to monitor a tinplate line. The objective is to provide different views of the line state and its influence on product quality, according to the needs of the users: production, quality and maintenance staff. A software application has been developed that implements the techniques selected in the project. It includes tools for analysing the line from two points of view: 

State of the line (as a whole, or specific sections), related to the capacity of producing coils according to the client requirements.

 State of a particular element, related to the probability of failure.
The next table shows the quality measures on the tinplate line.

Tinplate line
 Dimensions
 Coating
   o Thickness
   o Uniformity
   o …
 Defects
   o Surface
      Anode marks
      Black edge
      …
   o Other defects
      Coiling
      Cross-bow
      …

Table 37: Quality measures on tinplate line


Analysis method and Reached results: Control charts are used to routinely monitor quality or process variables. They show the value of the feature along with the control limits; these limits define the normal behaviour of the process. To increase the sensitivity to trends or drifts and to reduce false alarms, special control charts were used: cumulative sum, moving average and exponentially weighted moving average (EWMA) charts. To handle multivariate features in a control chart, Hotelling's T² statistic can be used to transform the multidimensional variable into one dimension (the standard forms of the EWMA and T² charts are recalled at the end of this project summary). Adaptation of control limits: knowing when to recalculate control limits is a key point in the use of control charts. Process states must be known in order to change the control limits. In the multivariate case, a projection technique (Sammon projection) can be applied and the different process states can be represented in a two-dimensional scatter plot. The library of monitoring methods (such as SPC, multivariate models, etc.) is realised in the DATAMON environment, offering: 

Different SPC control charts (X-bar, EWMA-, S-, C-, U-, N-, NP-Charts)



Different aggregation functions for feature generation (mean, standard deviation, median, range, etc.)



A great amount of arithmetic functions, applicable to (multiple) features.



Data based modelling by DataDiagnose



Flexible scheduling of monitoring tasks

 Storing the models and tasks in the data-base. Besides the basic aggregation functions DATAMON offers a special grading function for monitoring surface quality of flat products providing different kinds of defect assessments: 

Number of defects per coil (absolute and specific)



Defected length per coil (absolute and specific)

 Defected area per coil (absolute and specific) DataDiagnose provides model types for classification and regression. Available Methods for classification are: 

LVQ (learning vector quantisation)



Bayes classifier



Neural networks (MLP and RBF)



Decision trees (C4.5 and OC1)

 NNK (nearest neighbour classifier) Available methods for regression type modelling are: 

Linear regression



PLS (partial least square)



Neural networks (MLP, RBF, and Filternet)

For the systems using SPC techniques it can be concluded: 

SPC-like control charts are a suitable way of displaying monitoring results, easy to understand to all possible users.



The premises for the application of SPC techniques must be obeyed, in order to get reliable results.

In certain cases, quality information can be added to SPC, monitoring not only variability in the data but also the influence on quality. ASSPC (Advanced SPC system): 




Multivariate SPC has been applied to monitor sections of a tinplate line, including different elements and equipment. Different levels of aggregation have been defined, from single variables to a global index for the whole line.



The large number of variables in a whole facility makes necessary the use of OLAP techniques to retrieve information.



When monitoring a whole facility, the capacity of the users to create their own defined aggregations should be limited. Otherwise, given the large number of variables and potential users, the complexity of the system would grow to a point where the hardware requirements would increase.
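For reference, the EWMA chart and Hotelling's T² statistic mentioned in the FACTMON summary above can be written in their standard textbook form; the symbols are generic and are not taken from the project deliverables:

    z_t = \lambda x_t + (1 - \lambda) z_{t-1}, \qquad 0 < \lambda \le 1, \; z_0 = \mu_0

    \mathrm{UCL}_t,\ \mathrm{LCL}_t = \mu_0 \pm L \sigma \sqrt{ \frac{\lambda}{2 - \lambda} \left[ 1 - (1 - \lambda)^{2t} \right] }

    T^2 = (\mathbf{x} - \bar{\mathbf{x}})^{\mathsf{T}} \, \mathbf{S}^{-1} \, (\mathbf{x} - \bar{\mathbf{x}})

Here x_t is the monitored feature, \lambda the smoothing factor, \mu_0 and \sigma the in-control mean and standard deviation, L the width of the control limits, and \bar{\mathbf{x}} and \mathbf{S} the mean vector and covariance matrix of the multivariate reference data; an alarm is raised when z_t leaves its limits or when T² exceeds its control limit.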

DAFME - ECSC-Project 72120-PR/342: “Improvement of quality management in cold rolling and finishing area by combination of failure mode and effect analysis with data-base approaches”, 01.07.2004 – 30.06.2007 Summary: The main topic of this project is the combination of rule-based methods like Failure Mode and Effect Analysis (FMEA) or the Anticipatory Failure Determination (AFD) with data-based methods. The overall aim is to improve the quality management in the cold rolling and finishing area and thereby to improve the productivity and to reduce the production costs. Partners: 

BFI



Aceralia



Centro Sviluppo Materiali (CSM)



LABEIN

 ThyssenKrupp Nirosta Data pre-processing: 

Null entry handling



Boundary value check



Outlier elimination



Visual plausibility check

Coding: data which are only available in text form have to be replaced (coded) by a numerical representation. Quality problems: 



Causes of failures or defects (11 types) in the tinplate and galvanizing lines

 “Dull surface” in stainless steel. Analysis method and Reached results: 

Statistical analysis (histogram distributions, linear correlation)



DataTools -

Categorised histograms

-

Discriminatory analysis

-

Decision tree C45

-

Multiple trees OC1

-

Self Organising Map SOM

-

Non-linear correlation using SOM


Stainless flat products: One of the product parameters detected to have an influence on the investigated quality failure 'grey surface' was the yield strength. This value is measured by means of a test sample.

Tinplate: Based on the selected failures, an AFD analysis has been performed on the "White Stains on Strip" failure, which causes secondary qualities of the foiled strip and has several potential causes that should be pointed out. The results obtained from this study corroborate the identified causes for this failure. The material of the White Stains was oil residues. The target of the analysis was to detect, supported by the Failure Analysis process, the causes that produce the White Stains at different parts of the line. The Inverted Problem was analysed, trying to detect the ways to reproduce the failure situation. Once it was identified that the most critical and harmful cause for the White Stains on the Strip is the Electrolytic Cleaning operation running under low-efficiency conditions, the case of improving its performance was approached.

Galvanizing line: The selected failure situation arises when grains are detected during visual inspections of the resulting coating. Once the system was defined, the Inverted Problem was analysed in order to identify the necessary components and to reveal the corresponding resources. The grains originate from a reaction between non-alloyed Zn on the strip surface and the steel substrate of the strip, creating Zn-Fe alloys at a ratio of about 95% Zn and 5% Fe. Furnace conditions were analysed to reduce this defect.

The conclusions can be summarised in the following bullet points: 

The combination of traditional methods (FMEA and AFD) and data-based methods leads to improved quality management by -

the analysis of causes for quality defects,

-

the anticipatory diagnosis to avoid defects.



The deficiencies of each method can be reduced by the combination of both.



Improvement of the knowledge about the process given by FMEA or AFD by means of data analysis.



Summarisation of the whole knowledge about a specific quality problem to give optimised hints to the plant operator.



The developed tools make the gathered knowledge available to all responsible people in an easily accessible way.



The realised combination of different methods improves the global view of the production.



The results achieved depend directly on the availability of reliable plant / process / quality data.



The presented data-based approach can be combined with any kind of knowledge extraction method, not only FMEA / AFD.

SOFDETECT - RFCS-Project RFS-CT-04017: “Intelligent soft-sensor technology and automatic model-based diagnosis for improved quality, control and maintenance of mill production lines”, 01.07.2005 – 30.06.2007 Summary: The system developed in this project allows data-based models to be created for any kind of facility or system in which it is installed. In this project the final system was installed at the Tandem 2 tinplate cold rolling mill located in the ArcelorMittal facilities in Avilés (Spain). There was intensive cooperation with technicians from the facility during the hardware and software specification phase. In terms of hardware, the whole system was installed in a cabinet containing a dedicated PC and three acquisition cards. In addition, a dedicated card containing anti-aliasing filters was designed. The field wiring and the Ethernet communication with the process computer were defined with the support of the facility's technical services. Several tools have been developed in order to allow data mining techniques to be applied directly from the system, without the need to export data for further Knowledge Data Discovery (KDD) processing.


During the elaboration of the models, several conditions were found in the mill that confirmed the proper working of the tool. It is clear that such a tool must be used as a dynamic tool in the mill, because a faulty condition is (fortunately) not a usual condition of the mill, so a long time is needed to build up a complete collection of bad conditions and the related rules to solve them. Finally, the developed system has proven to be a versatile and universal tool for advanced maintenance. It is very easy to transfer to another facility or system to be monitored. Partners: 

Aceralia



Centro Sviluppo Materiali (CSM)



LABEIN



TKS



IMS



RASSELSTEIN

 UNIOVI Data pre-processing: 

Eliminate variables and captures



Composition of variables



Selection and remove based on threshold or selection



Remove head and tail of coils

 … Quality problems: The on-line application acquires data from 29 field signals and 89 signals from process computer for mill status and thickness quality soft-sensing and monitoring. Analysis method and Reached results: It was considered that time plot, scatter plot, pdf scatter plot, table lens, PCA (Principal Component Analysis) and SOM (Self-Organizing Maps) will be the most helpful as visualisation tools for this project. The design was also deliberately open so that other kinds of data mining techniques could be added in the future if they were deemed necessary. The model manager application is aimed at model generation and management using the results of feature extractions as input data. It combines two kinds of models: dimension reduction (SOM) models and rule based models. These capabilities were originally planned to be included in the data analysis application, but later it was decided to implement them in a separate application. 



Features extraction: -

Mean

-

STD

-

Root mean squared

-

Root mean squared at constants and variable frequencies

-

Composition variables

Visualisation: -

Temporal Plot

-

Table lens

-

Scatter

-

PDF scatter


-

PCA

-

SOM maps

-

Spectrograms



SOM



SOM correlation maps

 Residual from the SOM model
The usual procedure is to label each state in the map with an indication of the condition of the mill for producing with high quality (good/bad). During monitoring, if the residuals are zero, the current condition corresponds to a recorded one (good or bad), which is indicated by the state pointer in the maps. These recorded or known bad conditions must be characterised beforehand by analysing the causes of their low quality, and this knowledge must be used to obtain a rule base which provides hints on how to bring the rolling mill back to known good conditions. If some residuals are non-zero, their values indicate the differences between the current state and the signalled one. In particular, the features related to quality may show whether the current quality is higher or lower than the one signalled in the state map, and if it is lower the rest of the residuals may provide a clue about the causes.

OLPREM - ECSC-Project 7210-PR/292: “On-line prediction of the mechanical properties of hot rolled strips”, 01.07.2001 - 30.06.2004 Summary: The research aimed to validate, in different hot strip mills, a numerical model for the prediction of the final mechanical properties of hot rolled steel strips immediately after production and to implement this model on-line. This model was validated on the selected hot strip mills, improved in order to reduce the observed deviations and extended in order to satisfy new conditions. The metallurgical modelling approach was compared with statistical models and neural networks. Partners: 

Aceralia



CRM



TKS



CORUS

 IRSID
Data pre-processing: The first task has been to combine the process and quality databases into a new one which contains only the necessary variables. Then the initial dataset has been built. It includes all the variables that can have an influence on the mechanical properties, even though some of them are highly correlated. Before the modelling phase can be started, the number of variables has to be reduced. The first step before analysing the data is to reject from the dataset all the samples that can lead to errors during the training phase. Two criteria were used to select the samples. In the first place, the values have to lie within fixed limits, to avoid using wrong data due to errors in the measurements or the storage. The other criterion is based on the data distribution: the outliers in the initial data set have been removed. These data, even if they are correct, do not represent the typical operating conditions in the mill, and trying to build models using them could produce less accurate models for the normal cases. The application of these two sets of limits defines the validity range of the models. PCA (Principal Components Analysis) has been used to project the multidimensional data onto a plane. From the whole set of available variables, the first step is to select the relevant ones by using knowledge about the process. Based on this, about 100 variables have been initially selected. This number is still very large, so other data-based methods were applied to refine it, using six techniques:








Graphic methods: Their main advantage is that they do not require any assumption about the nature of data or their relationships. On the other hand, they do not provide any numerical index of the relevance of a particular variable, only a graphic that has to be interpreted and they can only analyse pairs of variables. These methods help to identify strong relationships between variables, but when they are not clear, they need to be confirmed by other techniques. -

Scatter plot: This is the simplest method: one variable is represented against other. If they are correlated, the points will follow a curve.

-

Table lens: It consists on sorting the whole dataset according to the values of one variable and plotting them. Those variables related with the first one will also show some kind of order.

Linear methods: These methods are also quite simple and provide a numerical index of the strength of the relation between two variables. Their main drawback is that they are only able to identify properly linear relationships. -

Linear correlation: The correlation coefficient is an index of the linear dependency between two variables.

-

Linear regression: This is itself a modelling technique, but as it only can deal with linear relationships it cannot be used in this project. However, they results are easy to interpret, and when all the input variables are conveniently scaled, the equation that this method generates shows the relative importance of each of them to the output variable.

Advanced methods: They are more powerful than the previous ones, since they can identify linear or non-linear relationships. They can order the variables according to their relevance for the output variable and, additionally, the output of these methods can be visualised, which gives a deeper knowledge of the relationships they discover, although they have not been designated as correlation-hunting methods. However, they can be misled by the presence of highly correlated variables, so a previous filtering process is necessary. -

MARS (Multivariate Adaptive Regression Splines): This is also a modelling technique. It builds flexible models by fitting piecewise linear regressions; that is, the nonlinearity of a model is approximated through the use of separate regression slopes in distinct intervals of the predictor variable space. MARS also calculates the importance of the variables: it refits the model after dropping all the terms involving a variable and calculating the reduction in goodness of fit.

SOM (Self-Organizing Maps): This is a type of neural network. It quantises the data space formed by the training data and simultaneously performs a topology-preserving projection of the data onto a regular low-dimensional grid. The basic technique is the visualisation of component planes. Each component plane can be thought of as a slice of the map: it consists of the values of a single vector component in all map units. By comparing component planes with each other, correlations are revealed as similar patterns in identical positions of the component planes: whenever the values of one variable change, the other variable changes too.

Quality problems: Numerical model for the prediction of the final mechanical properties of hot rolled steel strips.

Analysis method and Reached results: The technique used for the modelling phase is artificial neural networks (ANN). A different network has been trained for each output variable. The selected architecture is a feedforward network. Feedforward networks often have one or more hidden layers of sigmoid neurons followed by an output layer of linear neurons. Multiple layers of neurons with nonlinear transfer functions allow the network to learn nonlinear and linear relationships between input and output vectors. The linear output layer lets the network produce values outside the range -1 to +1. They have been trained using backpropagation with momentum as the training algorithm. This is the learning rule commonly applied to feedforward multilayer networks. Momentum is used to make it less likely for a backpropagation network to get caught in a shallow minimum.
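The momentum term mentioned above modifies the plain gradient-descent weight update in the standard way; in the usual notation (generic, not specific to the OLPREM implementation):

    \Delta w_{ij}(t) = -\eta \, \frac{\partial E}{\partial w_{ij}} + \alpha \, \Delta w_{ij}(t-1), \qquad w_{ij}(t+1) = w_{ij}(t) + \Delta w_{ij}(t)

where \eta is the learning rate, E the network error and \alpha \in [0, 1) the momentum factor; the second term lets each update keep part of its previous direction, which helps the network to pass through shallow local minima.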


The main conclusion of this project is that the data mining methodology used can be successfully applied to solve this type of problem. The main constraint is the need for large amounts of data, which also determine the validity range of the models. Building the initial dataset is the most important step in the process. It helps to reduce the training time and, to a certain extent, determines the success of the project. Neural models are only able to predict the mechanical properties accurately if they have been properly trained with an adequate dataset. Several trials have to be made to find the set of parameters for the neural networks that produces the best results. The best results have been obtained when a different model has been used for each of the predicted properties. The final models have been extensively validated, showing a prediction error usually inside acceptable tolerance ranges. Simple transferability of the models to other facilities is not possible; results are unpredictable unless both mills have the same characteristics, including those related to the process and to the coils produced. However, the models have been applied to data provided by TKS and, where possible, the results were also satisfactory. The application developed acquires data from two sources and merges them in a local database, making the models independent of possible future changes in the data acquisition system. The modules themselves have been implemented in a way that they can be easily updated or replaced.

IMGALVA - RFCS-Project RFS-CR-04023: “Investigation, modelling and control of the influence of the process route on steel strip technological parameters and coating appearance after hot dip galvanising”, 01.07.2004 – 21.12.2007 Summary: A system for the prediction of quality-relevant product properties of hot dip galvanised material was developed. It gives information about the product properties to be expected while the hot dip galvanising process and the preceding steps are carried out. This prediction is based on operational variables of the process. The system is used for open-loop quality control, which means that the deviation between predicted and required properties is used to adjust process parameters towards optimum performance. The system covers both the prediction of technological parameters and that of the coating appearance. It is founded on data-based and physical models. As far as possible, operators' knowledge was utilised. Partners: 

ACERALIA (ARCELOR Group), Spain,



University of La Rioja, Spain,



Korrosions- och Metallforskningsinstitutet AB (KIMAB),Sweden,



Salzgitter Mannesmann Forschung (SZMF), Germany,

 Betriebsforschungsinstitut GmbH (BFI), Germany
Data pre-processing:  Outliers (beyond three sigma) have been removed
Quality problems: 

A model for the prediction of regular yield strength after galvanising and tempering was developed.



Prediction of roughness values.

 Irregularities in the adherence of the zinc layer. Analysis method and Reached results: 

Neuronal Network



Decision tree C4.5



Lazy LBK classifier


Classification results for regular yield strength:

Classifier                   Deep Drawing grade   Elevated strength
NN (multilayer perceptron)   85%                  >90%
Decision Tree                85%                  >90%
Lazy LBK                     >90%                 >90%

Table 38: Classification results for regular yield strength

HIGHPICK - RFCS Project RFS-CT-2005-00021: “Optimised Productivity and Quality by Online Control of Pickled Surface”, 01.07.2005 – 21.12.2008 Summary: The complete elimination of the oxide from the steel surface is an obvious quality requirement of pickling. However, overpickling, i.e. the attack of the iron by the acid solution, is also detrimental in terms of productivity and quality. Reaching the exact “just pickled” state is therefore an important issue, and that is precisely the objective of this project. It has a multidisciplinary approach: 

Development / adaptation of a number of sensors: to monitor the strip surface, follow the process and assess the aggressiveness of the acid solution

On-line tests in several industrial lines

 Statistical analysis of overall pickling phenomena.
All the data were used to define the conditions assuring optimised operation, i.e. full pickling and no or minimal overpickling. Finally the benefits were analysed and the transfer conditions were prepared. Partners: 

CRM



SSSA



ILVA



ArcelorMittal Research



BFI



TKS-RA (subcontactor of BFI)

 Università di Pisa (subcontactor of ILVA) Quality problems: The aim of the project was to optimize pickling process results by identifying under and over pickled coils by means of process parameters. In the dataset ILVA provided to us, composed by historical data, the classes of well-pickled coils and defected coils were very unbalanced. Different data mining techniques were tested as reported below, included recBFN, a peculiar neuro-fuzzy paradigm suitable for the classification of very unbalanced sets. Data pre-processing: 

NULL values elimination

 Range validation Analysis method and Reached results: 

SOM



Fuzzy system



RecBFN



Decision Tree C45


After various data mining techniques had been tested with poor performance, SSSA developed a C++ tool composed of a RecBF network and a C4.5 decision tree. The tool is able to predict both classes with good precision.

WACOOL - RFSR-CT-2005-00017: “Width-adaptable optimized controlled-cooling systems (WACOOLs) for the production of innovative Advanced High Strength Steel grades and the study of strip shape changes while cooling” (July 2005 – June 2008) Summary: Efficient production of Advanced High Strength Steels (AHSS) requires important and parallel improvements of the rolling process and especially of the cooling systems. The microstructures and the resultant mechanical properties of steel depend on the cooling patterns and the steel chemistry. The prediction of TTT-diagrams and other curves obtained from continuous cooling, which are required for planning the industrial production of AHSS grades, was performed. Symmetrically and asymmetrically controllable and edge-masking types of width-adaptable and water-efficient laminar-flow cooling systems were developed and implemented at 3 rolling mills. To complement the strip-cooling and transformation phenomena, an investigation of the strip shape changes along the cooling path between the finishing mill and the down coiler, by means of both mathematical modelling and AI techniques, was undertaken with a view to reducing rejects. Partners: 

CETTO



SSSA



ILVA



TKS



ARCELOR

 CORUS Quality problems: SIAD developed a system for predicting of the whole TTT-diagrams from the chemical composition by exploiting both physical and neural network based modelling. To this aim, many data coming from several production sites were collected. A C++ tool, composed by a model and a user-friendly graphical user interface which is able to display TTT-diagrams, was developed. Data pre-processing: 

NULL values elimination

 Range validation Analysis method and Reached results: 

parametric characterization



neural networks



linear regression

BORON - ECSC STEEL RDT PROJECT Nr 7210-PR/355: “Optimisation of the influence of Boron on the properties of steel" 01.07.2000 - 30.06.2003 Summary: Objective of the project: 

Investigating the influence of B additions on austenitic and ferritic grain size.



Investigating the effect of various micro-alloying elements and impurities on the B hardenability factor.



Developing a model for prediction of the Jominy profile in B steels.



Optimising the amount of B addition to tie up the N in the steel as stoichiometric BN.


Investigating the effect of stoichiometric and super stoichiometric unprotected B addition on the work hardening rate and static strain ageing in low carbon steel during cold working operations. Partners: 



CORUS



SSSA



ILVA



SIDENOR

 LABEIN
Quality problems: SSSA worked on the prediction of the Jominy test (steel hardenability) from the chemical composition of the steel. A model based on a chain of neural networks was developed. The optimisation of some parameters was done by means of a genetic algorithm (a generic sketch of such a parameter search is given at the end of this project summary). A C++ DLL which implements the model was developed and integrated into the common framework. Data pre-processing: 

NULL values elimination



Range validation



PCA

 ICA Analysis method and Reached results: 

Neural Networks (MLP, RBF, WRBF)



Genetic Algorithms
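To illustrate how a genetic algorithm can be used for this kind of parameter optimisation, the following self-contained sketch evolves a single real-valued parameter against a placeholder fitness function; the population size, mutation width and the fitness function itself are arbitrary choices for demonstration and are not the settings used in the BORON project.

    import java.util.Arrays;
    import java.util.Random;

    /** Minimal generational genetic algorithm for one real-valued parameter. */
    public class SimpleGeneticSearch {

        private static final Random RND = new Random(42);

        // Placeholder fitness (higher is better). In the project context this would
        // be e.g. the negated prediction error of a candidate model (assumption).
        static double fitness(double parameter) {
            return -Math.pow(parameter - 3.7, 2);   // artificial optimum at 3.7
        }

        // Sort the population so that the fittest individuals come first.
        static double[] sortByFitnessDesc(double[] population) {
            return Arrays.stream(population)
                    .boxed()
                    .sorted((a, b) -> Double.compare(fitness(b), fitness(a)))
                    .mapToDouble(Double::doubleValue)
                    .toArray();
        }

        public static void main(String[] args) {
            int populationSize = 20, generations = 50;
            double[] population = new double[populationSize];
            for (int i = 0; i < populationSize; i++) {
                population[i] = RND.nextDouble() * 10.0;   // random initialisation in [0, 10)
            }
            for (int g = 0; g < generations; g++) {
                population = sortByFitnessDesc(population);
                // Elitism: keep the best half, refill the rest by crossover and mutation.
                for (int i = populationSize / 2; i < populationSize; i++) {
                    double p1 = population[RND.nextInt(populationSize / 2)];
                    double p2 = population[RND.nextInt(populationSize / 2)];
                    population[i] = 0.5 * (p1 + p2)            // arithmetic crossover
                            + RND.nextGaussian() * 0.2;        // Gaussian mutation
                }
            }
            population = sortByFitnessDesc(population);
            System.out.printf("Best parameter found: %.3f%n", population[0]);
        }
    }

In a real application the fitness evaluation would wrap the model to be tuned, so that each fitness call corresponds to training or evaluating one candidate parameter set.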


3.2

Poll from KDNuggets

Figure 65: Poll from KDnuggets (taken from [8])


3.3

Comparison of WEKA and R

Table 39: Comparison between WEKA and R

Feature Text file – delimited ARFF C4.5 format Database SAS file SPSS file Minitab

Data Import Weka Yes Yes Yes Yes

R Yes Yes Yes Yes Yes Yes

Data Exploration / Visualisation Feature Weka Descriptive statistics Yes Frequency table Yes Scatter plot Yes Scatter plot matrices Yes Histograms Yes Tree/Graph visualisation Yes Boxplots ROC curve Yes Precision/recall curve Yes Lift chart Yes Cost curve Yes Feature

Data Preparation Weka

Sampling Oversampling/balancing Random Stratified Discretization (binning) Equal width Equal frequency Supervised Reorder fields Identifier fields Normalization/standardization Binarization Derived fields Outlier detection Principal components Random projections Attribute selection Arbitrary kernels

Bayesian Naïve Bayes

Feature

Modelling

R

Yes Yes Yes

Yes Yes Yes

Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes

Yes Yes

Weka Yes


R Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes

Yes Yes Yes Yes Yes Yes Yes Yes

R

Naïve Bayes multinomial Complement naïve Bayes Averaged one-dependence estimators Weighted averaged one-dependence estimators Bayes nets Naïve Bayes trees Bayesian additive regression trees Lazy Bayesian rules Functions Linear regression Logistic regression Isotonic regression Least median squares regression Pace regression Support vector machines

Yes Yes Yes Yes Yes Yes

Multilayer perceptron (neural net)

Yes

Radial basis function network Gaussian processes Voted perceptron Lazy K-nearest neighbours Locally weighted learning Trees ID3 C4.5 CART Decision stumps Random forests Best first tree Logistic model trees M5 model tree Alternating decision trees Interactive tree construction KNN trees Rules Decision table RIPPER Conjunctive rule M5 Rules PART Ripple down rules (Ridor) NNge OneR Ensmeble learning AdaBoost LogitBoost Additive regression Bagging Stacking Dagging Grading MultiBoost

Yes Yes Yes

Yes Yes Yes Yes Yes Yes Yes

Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes

Yes Yes Yes Yes Yes (via interface to third party app) (single hidden layer NN) Yes Yes

Yes Yes Yes

Yes

Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes


Yes Yes Yes Yes

Voted classifier MetaCost Ensembles of nested dichotomies Multi instance learning methods

Yes Yes Yes Yes

Clustering EM KMeans XMeans COBWEB (hierarchical) OPTICS Farthest first clustering Hierarchical clustering Agglomerative nesting Fuzzy C-means clustering Bagged clustering Cluster ensembles Convex clustering Association rules Apriori Predictive Apriori Tertius Generalized sequential patterns Eclat

Feature Prediction accuracy Confusion matrix AUC Information-retrieval stats Information-theoretic stats ROC / lift charts Experiment facility Feature Serialized java object Java source code (limited) PMML (limited)

Yes Yes Yes Yes Yes Yes

Yes Yes

Yes

(via interface to third pary app)

Yes Yes Yes

Evaluation

Deployment


Yes Yes Yes Yes Yes Yes

(via interface to third party app)

Weka Yes Yes Yes Yes Yes Yes Yes

R Yes Yes Yes Yes

Weka Yes Yes

R

Yes

Yes

3.4

Additional figures

Figure 66: Hardcopy of DataDiagnose

Figure 67: Overview of the „brute force‟ RapidMiner process


Figure 68: Sub-process „Remove_vars‟ for removing empty variables

Figure 69: Sub-process „AD_Pre_Process‟: variable filter, remove correlated input variables

Figure 70: The core „brute force‟ RapidMiner process


Figure 71: Sub-process 'Deviation'; the other parallel sub-processes are similar

Figure 72: Selection of the variables to be used for the „brute force‟ approach


Figure 73: Visualisation and definition of the data filter

Figure 74: Definition of the target classification


Figure 75: Result presentation, here after the application of the hypothesis test

3.5

List of Figures

Figure 1: Integration of the common framework into the industrial environment .................................. 19 Figure 2: Structure of the common framework ....................................................................................... 19 Figure 3: Poll, Data mining tools used for real projects. ......................................................................... 21 Figure 4: Schema of actual database viewer Mytica in ArcelorMittal Asturias ...................................... 23 Figure 5: Using JNI for Java-to-C# interface, taken from [7] ................................................................. 24 Figure 6: Hardware and software structure realised at ThyssenKrupp Nirosta ....................................... 25 Figure 7: Updated data acquisition scheme, totally based on ILVA Novi Ligure IT environment......... 26 Figure 8: Scheme of the data environment at ThyssenKrupp Nirosta ..................................................... 27 Figure 9: Scheme of the data acquisition system .................................................................................... 27 Figure 10: Data model used for the data mart (star schema) ................................................................... 28 Figure 11: Entity-Relationship diagram of the ILVA AutodiagDB ........................................................ 30 Figure 12: Java Native Interface.............................................................................................................. 32 Figure 13: ILVAMiner structure and JNI ................................................................................................ 33 Figure 14: ILVAMiner data flow ............................................................................................................ 34 Figure 15: Component plane of a SOM .................................................................................................. 35 Figure 16: Decision tree .......................................................................................................................... 35 Figure 17: Scheme of the core „brute force‟ approach ............................................................................ 36 Figure 18: Overview of the „brute force‟ RapidMiner process ............................................................... 37 Figure 19: Input feature space (D) and visualisation space (V) showing direct and inverse mappings, with the image M of V in D and, the projection of a feature vector x and its residual using the surface M as a model, ............................................................................................................................................... 38 Figure 20: Residual from Self-Organizing Maps to compare two data collections and to detect differences between them, ....................................................................................................................... 39 Figure 21: Layout of the experiment to test the performance of the new SOMDimensionalityReductionAndResidual operator. The number of each box is the order of execution and in Table 7 there is a description of each element. ............................................................................. 41


Figure 22: Example of the random data generate by ExampleSetGenerator when 2 attributes is selected. ................................................................................................................................................................. 41 Figure 23: Example of the random data generate by ExampleSetGenerator when 2 attributes is selected and noise is added to attrib2. ................................................................................................................... 41 Figure 24: Plot using the plotter Series of RapidMiner of the residual of attribute 1 and 2. The residual (the error of the model) for the example set used for the training (the first 700 items) is small because the model uses that data to build itself. The next 300 samples are new for the model and therefore the error (residual) is bigger. If we look carefully it can be seem that the attribute 2 has more error than 1, because the noise was added only to attribute2. ...................................................................................... 42 Figure 25: Image created with the Residual Plot developed for RapidMiner to show the Residual of experiment described at Figure 21 .......................................................................................................... 43 Figure 26: To improve the visualisation fewer points are showed using a resample operator. .............. 43 Figure 27: The Interpolate only X option gives a new option of visualisation that can be useful in some cases. ....................................................................................................................................................... 44 Figure 28: SVM maximize the margin around the separating hyperplane. ............................................. 45 Figure 29: Layout of the experiment to test the performance of the new Multi-BestLibSVMLearner operator. The number of each box is the order of execution and in Table 10 there is a description of each element. ........................................................................................................................................... 49 Figure 30: Searching of the best parameters C and gamma of the RBF kernel of the Support Vector Machine. .................................................................................................................................................. 49 Figure 31: Importance of the variables based in the F-score metric........................................................ 50 Figure 32: The main screen of ILVAMiner ............................................................................................ 51 Figure 33: ILVAMiner task wizard ......................................................................................................... 52 Figure 34: AutoDiag GUI: Definition of the data sample ....................................................................... 55 Figure 35: Feature classification and selection using F-Score and SVM on a real two-class classification. ........................................................................................................................................... 57 Figure 36: Rapid miner process developed by SSSA .............................................................................. 59 Figure 37: Measured and predicted values of SF1_FURN_TEMP vs. Rp02 for steel grade B .............. 60 Figure 38: Measured and predicted values of SF1_FURN_TEMP vs. Rp02 for steel grade A .............. 
60 Figure 39: Comparison of the results from „brute force‟ approach to DataDiagnose results .................. 61 Figure 40: Minimalistic RapidMiner User Interface developed by ArcelorMittal .................................. 63 Figure 41: First and last step of the wizard of the analysis of good-bad products. ................................. 64 Figure 42: Presentation of the results in the wizard of the analysis of good-bad products. .................... 65 Figure 43: One of the overview result tap of the template of comparison of two data collections. Attribute 2 has a different behaviour than the others variables, this points the users that this is the feature that is different in the two data collection. .................................................................................. 66 Figure 44: ILVAMiner configuration file example ................................................................................. 67 Figure 45: Clustering results representation ............................................................................................ 68 Figure 46: Linear regression results representation ................................................................................. 68 Figure 47: Mytica interface, showing graph module and surface inspection system module. ................ 70 Figure 48: ThyssenKrupp Nirosta AutoDiag solution............................................................................. 72 Figure 49: Cp and Cpk indexes, where x is the mean and s is the standard deviation .............................. 74 Figure 50: Optimal Cp-Cpk training process .......................................................................................... 75 Figure 51: Influencing of variables using MARS in a regression problem and the result liner regression equation. MARS define linear regression by intervals. ........................................................................... 77 Figure 52: Definition of target classes..................................................................................................... 79 Figure 53: Resulting class distribution in the selected data sample ........................................................ 80 Figure 54: List of influencing variables .................................................................................................. 80 Figure 55: Detailed presentation of a variable selected as important ...................................................... 81 Figure 56: Importance of variables selected by the „individual adapted‟ approach for data sample 1.... 85 Figure 57: Selected variables, ordered by their importance, for data sample 2....................................... 87 Figure 58: Relative importance of the variables from data sample 3 ...................................................... 88 Figure 59: Selected variables by the „brute force‟ approach ................................................................... 88 Figure 60: Result of the „brute force‟ approach from the data sample 2 ................................................. 90 Figure 61: Result of the „brute force‟ approach from the data sample 3 ................................................. 91 Figure 62: Normalised pareto ranking for data sample 1 ........................................................................ 92


Figure 63: Normalised pareto ranking for data sample 2 ........................................................................ 93 Figure 64: Normalised pareto ranking for data sample 3 ........................................................................ 95 Figure 65: Poll from KDnuggets (taken from [8]) ................................................................................ 116 Figure 66: Hardcopy of DataDiagnose .................................................................................................. 120 Figure 67: Overview of the „brute force‟ RapidMiner process ............................................................. 120 Figure 68: Sub-process „Remove_vars‟ for removing empty variables ................................................ 121 Figure 69: Sub-process „AD_Pre_Process‟: variable filter, remove correlated input variables ............ 121 Figure 70: The core „brute force‟ RapidMiner process ......................................................................... 121 Figure 71: Sub-process „Deviation‟; the other paralell sub-processes are similar ................................ 122 Figure 72: Selection of the variables to be used for the „brute force‟ approach .................................... 122 Figure 73: Visualisation and definition of the data filter ...................................................................... 123 Figure 74: Definition of the target classification ................................................................................... 123 Figure 75: Result presentation, here after the application of the hypothesis test .................................. 124

3.6

List of Tables

Table 1: Comments to the archived objects .......... 16
Table 2: Analysed previous projects with data mining aspects .......... 17
Table 3: Data mining methods applied in previous projects .......... 17
Table 4: Categories related to the quality problem .......... 18
Table 5: Schematic material flow at ILVA Novi Ligure .......... 29
Table 6: Quality problem categories and selected algorithms .......... 37
Table 7: Description of layout of experiment of SOMDimensionalityReductionAndResidual .......... 40
Table 8: Description of the parameters of the SOMDimensionalityReductionAndResidual operator .......... 44
Table 9: Algorithm to determine the optimum features to be used and the parameters of a Support Vector Machine Learner that uses an RBF kernel .......... 47
Table 10: Description of layout of experiment of Multi-BestLibSVMLearner .......... 48
Table 11: Importance of the variables based on the F-score metric .......... 50
Table 12: Quality problem categories and selected algorithms .......... 53
Table 13: Summary of characteristics of datasets used to check the performance of the operators developed by ArcelorMittal .......... 58
Table 14: SF1_FURN_TEMP prediction errors .......... 60
Table 15: Example of comparison between two predictions of SF1_FURN_TEMP .......... 61
Table 16: Confusion matrix of the decision tree model .......... 76
Table 17: New features included because of feedback from the users .......... 82
Table 18: Training results for the selected variables .......... 85
Table 19: Validation results for the selected variables .......... 85
Table 20: Result of the 'individual adapted' approach to data sample 2 .......... 86
Table 21: Classification results of the 'individual adapted' approach on data sample 3 .......... 87
Table 22: Training results for the selected variables for data sample 1 .......... 89
Table 23: Test results for the selected variables for data sample 1 .......... 89
Table 24: Training results for the selected variables for data sample 2 .......... 90
Table 25: Test results for the selected variables for data sample 2 .......... 90
Table 26: Training results for the selected variables for data sample 3 .......... 91
Table 27: Test results for the selected variables for data sample 3 .......... 91
Table 28: Training results for the selected variables for data sample 1 .......... 93
Table 29: Test results for the selected variables for data sample 1 .......... 93
Table 30: Training results for the selected variables for data sample 2 .......... 94
Table 31: Test results for the selected variables for data sample 2 .......... 94
Table 32: Training results for the selected variables for data sample 3 .......... 95
Table 33: Test results for the selected variables for data sample 3 .......... 95
Table 34: Economic benefit and amortisation for ILVA .......... 101
Table 35: Economic benefit and amortisation for ArcelorMittal Espana .......... 102
Table 36: Economic benefit and amortisation for ThyssenKrupp Nirosta .......... 102
Table 37: Quality measures on tinplate line .......... 105
Table 38: Classification results for regular yield strength .......... 113
Table 39: Comparison between WEKA and R .......... 117

3.7 List of References

[1] iba AG: Measurement and Automation Systems. http://www.iba-ag.com

[2] Y.-W. Chen and C.-J. Lin: Combining SVMs with various feature selection strategies. In: Feature Extraction, Foundations and Applications. Studies in Fuzziness and Soft Computing, Physica-Verlag, Springer, 2006.

[3] B. Boser, I. Guyon and V. Vapnik: A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144–152, 1992.

[4] C. Cortes and V. Vapnik: Support-vector networks. Machine Learning, 20:273–297, 1995.

[5] O. L. Mangasarian and W. H. Wolberg: Cancer diagnosis via linear programming. SIAM News, Volume 23, Number 5, September 1990, pp. 1 & 18.

[6] Jianping Zhang: In: Proceedings of the Ninth International Workshop on Machine Learning, Aberdeen, Scotland, United Kingdom, pp. 470–479, 1992. ISBN 15586-247-X.

[7] J. Bishop, R. N. Horspool and B. Worrall: Experience in integrating Java with C# and .NET. Concurrency and Computation: Practice and Experience, 2003. http://www.cs.uvic.ca/~nigelh/Publications/ccpe03.pdf

[8] Poll of the KDnuggets data mining community: What data mining tools have you used for a real project (not just for evaluation) in the past 6 months? http://www.kdnuggets.com/

[9] Comparing Weka and R, from Pentaho. http://www.pentaho.com

[10] K. Hornik, C. Buchta and A. Zeileis: Open-source machine learning: R meets Weka. Physica-Verlag, an imprint of Springer-Verlag GmbH. http://statmath.wuwien.ac.at/~zeileis/papers/Hornik+Buchta+Zeileis-2008.pdf

[11] RWeka: An R interface to Weka. http://cran.r-project.org/web/packages/RWeka/index.html

[12] Duncan Temple Lang: Calling R from Java. http://www.omegahat.org/RSJava/RFromJava.pdf

[13] FAQ for the R-Java interface. http://www.omegahat.org/RSJava/FAQ.html

[14] Wizard software technology: http://en.wikipedia.org/wiki/Wizard_(software)

[15] API and UI to develop wizards in Swing easily: https://wizard.dev.java.net/

[16] VMware Server: http://www.vmware.com/de/

[17] OpenSuse Linux: http://www.opensuse.org/

[18] Rich client technology: http://de.wikipedia.org/wiki/Fat_client

[19] The Apache Software Foundation: Apache Tomcat. http://tomcat.apache.org/

[20] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz and T. Euler: YALE: Rapid Prototyping for Complex Data Mining Tasks. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2006), ACM Press, 2006.

[21] HSQLDB Java Database: http://www.hsqldb.org/

[22] Star schema documentation: http://de.wikipedia.org/wiki/Star-Schema (in German) or http://en.wikipedia.org/wiki/Star_schema (in English)

[23] B.-N. Jiang, X.-Q. Ding, L.-T. Ma, Y. He, T. Wang and W.-W. Xie: A Hybrid Feature Selection Algorithm: Combination of Symmetrical Uncertainty and Genetic Algorithms. The Second International Symposium on Optimization and Systems Biology (OSB'08), Lijiang, China, October 31 – November 3, 2008.

[24] M. A. Hall and L. A. Smith: Feature Selection for Machine Learning: Comparing a Correlation-Based Filter Approach to the Wrapper. Department of Computer Science, University of Waikato, Hamilton, New Zealand. http://home.eng.iastate.edu/~julied/classes/ee547/Handouts/Flairs.pdf

3.8 List of Abbreviations

(A)SIS      (Automatic) Surface Inspection System
ANOVA       Analysis of Variance
ARFF        Attribute-Relation File Format
C#          C sharp (programming language for the .NET runtime environment)
CSV         Comma-separated values
DB          Database
DM          Data mining
GUI         Graphical user interface
HSQLDB      HyperSQL Database, a relational database engine written in Java
JNI         Java Native Interface
MARS        Multivariate Adaptive Regression Splines
NiCo        Nirosta Cockpit
OLAP        Online Analytical Processing
PPS         Production Planning System
RapidMiner  An open-source data mining tool
RBF         Radial Basis Function
SOM         Self-Organising Map
SQL         Structured Query Language
SVM         Support Vector Machine
TDW         Technical Data Warehouse
VDA         German Association of the Automotive Industry
XLS         File type for Microsoft Excel
XML         Extensible Markup Language


European Commission
EUR 26179 — Supporting process and quality engineers by automatic diagnosis of cause-and-effect relationships between process variables and quality deficiencies using data mining technologies (AUTODIAG)
Luxembourg: Publications Office of the European Union
2013 — 128 pp. — 21 × 29.7 cm
ISBN 978-92-79-33237-1
doi:10.2777/4329


HOW TO OBTAIN EU PUBLICATIONS

Free publications:
• one copy: via EU Bookshop (http://bookshop.europa.eu);
• more than one copy or posters/maps: from the European Union’s representations (http://ec.europa.eu/represent_en.htm); from the delegations in non-EU countries (http://eeas.europa.eu/delegations/index_en.htm); by contacting the Europe Direct service (http://europa.eu/europedirect/index_en.htm) or calling 00 800 6 7 8 9 10 11 (freephone number from anywhere in the EU) (*).
(*) The information given is free, as are most calls (though some operators, phone boxes or hotels may charge you).

Priced publications:
• via EU Bookshop (http://bookshop.europa.eu).

Priced subscriptions:
• via one of the sales agents of the Publications Office of the European Union (http://publications.europa.eu/others/agents/index_en.htm).

KI-NA-26179-EN-N

The through-process detection of cause-and-effect relationships by investigating process and quality data with data mining techniques has proved to be a powerful way to reduce quality deficiencies. Nevertheless, this method is not used area-wide in the companies because of its complexity, because the necessary specific knowledge is held by only a few people in each company, and because the available tools are not adapted to the specific problems of steel production. These are the reasons to develop, implement and test robust, practicable and easy-to-use solutions that are specialised to steel quality problems.

A generic common framework based on a well-known software tool (RapidMiner) was developed and implemented. Each industrial partner developed individual interfaces to its databases on the one hand and to the user interface on the other hand. For the data mining solutions, different approaches were investigated:
• brute force
• individual adapted
• smart components

For each approach an investigation of an actual problem was performed to show the usability of the developed solution and to analyse the requirements for transferability. After training of the personnel, the systems were rolled out for daily usage. The experience of the target users was analysed and used for improvements.

During the project it could be shown that the developed system can be used in an industrial environment. The increasing number of users, the number of investigations they perform, as well as the requests to integrate additional data sources, show that the system is fully accepted by the users.

Studies and reports

doi:10.2777/4329
