Data modeling versus simulation modeling in the big data era: case ...

Simulation Special Issue Article

Data modeling versus simulation modeling in the big data era: case study of a greenhouse control system

Simulation: Transactions of the Society for Modeling and Simulation International 2017, Vol. 93(7) 579–594 Ó The Author(s) 2017 DOI: 10.1177/0037549717692866 journals.sagepub.com/home/sim

Byeong Soo Kim, Bong Gu Kang, Seon Han Choi and Tag Gon Kim

Abstract Recently, big data has received greater attention in diverse research fields, including medicine, science, engineering, management, defense, politics, and others. Such research uses big data to predict target systems, thereby constructing a model of the system in two ways: data modeling and simulation modeling. Data modeling is a method in which a model represents correlation relationships between one set of data and the other set of data. On the other hand, physics-based simulation modeling (or simply simulation modeling) is a more classical, but more powerful, method in which a model represents causal relationships between a set of controlled inputs and corresponding outputs. This paper (i) clarifies the difference between the two modeling approaches, (ii) explains their advantages and limitations and compares each characteristic, and (iii) presents a complementary cooperation modeling approach. Then, we apply the proposed modeling to develop a greenhouse control system in the real world. Finally, we expect that this modeling approach will be an alternative modeling approach in the big data era.

Keywords Complementary cooperation, big data, data modeling, simulation modeling, greenhouse control system Publisher’s note: This article was originally part of the special issue on modelling and simulation in the era of big data and cloud computing: theory, framework and tools in volume 93, issue 4 of SIMULATION: Transactions of The Society for Modeling and Simulation International. Please refer to this issue (http://journals.sagepub.com/toc/simb/93/4) and the editorial of that issue (DOI: 10.1177/0037549717696446).

1. Introduction As a means of forecasting for a diversified society, the phrase ‘‘big data’’ has become widespread. The word means data sets with sizes beyond the ability of common software tools to capture, curate, manage, and process within a tolerable elapsed time.1 Since such big data provides more insight than existing limited data, it has received greater attention in diverse research fields, including science, engineering, defense, management, medicine, and politics. Hence, many researchers in modeling and simulation (M&S) have paid much attention to it. These researchers have usually focused on using it to validate a developed model with real big data and predict system behavior with the model. However, before such uses are advanced, more fundamentally, researchers must construct a model using big data. For this reason, modeling with big data is becoming a necessary and important issue in the era of big data. This approach, that is, modeling with big data, has been referred to as data modeling, which has focused on

representing correlations of data. In detail, such an approach has been classified and studied in each research field in two ways: data mining and machine learning. Figure 1 illustrates how data modeling can be achieved in both ways. Data mining can be a means of data modeling, as shown in Figure 1(a). By using data-mining techniques, users who want to predict the future are able to not only analyze a pattern or property of data in one dimension, but also identify a distribution function of the data from the pattern. Then, after validating the distribution function with data in the real world, using such methods as the goodness of fit (GOF) test, users can acquire a ‘‘random number

Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, Republic of Korea Corresponding author: Byeong Soo Kim, Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, 34141, Republic of Korea. Email: [email protected]

580

Simulation: Transactions of the Society for Modeling and Simulation International 93(7)

Figure 1. Overview of the data-modeling approach.

generation’’ model. Ultimately, the acquired model can be utilized in the process of prediction of future data patterns. On the other hand, machine learning can be another means of data modeling, as shown in Figure 1(b). With a machine-learning algorithm, such as an artificial neural network (ANN) and the genetic algorithm (GE), users can map associations between one set of data (d1) and another (dn). Similar to the process of data mining, after passing through a validation process of this map with data in the real world, using general performance indices such as the root-mean-square error (RMSE), users can acquire a data model. By extension, it enables the prediction of a future value of data set ‘‘dn’’ using given data set ‘‘d1.’’ Data modeling thus consists of a series of processes: acquisition, modeling, validation, and prediction. It has seen wide use in various fields, including science, engineering, economic, industries, and others, in order to predict the future behavior of the target system.2 Moreover, some researchers have claimed that, given enough information, correlation is enough to make robust and informative predictions.3 However, contrary to that expectation, data modeling is not always a powerful modeling approach. It has some limitations. One of the representative limitations is that it is able to just describe correlations between data, not represent causal relationships between controlled inputs

and corresponding outputs. It cannot cope with anomalies and changing circumstances of the system.4 It can also be influenced by how much data we have of the target system. To overcome such limitations of data modeling, it is necessary to use another approach through simulation modeling. Simulation modeling is a theory-based modeling approach generally used in the simulation field. To build a model, it uses physical or operational laws. Because the theory means a statement of what causes what and why,5 it is possible to represent clearly the causality between a set of controlled inputs and the corresponding outputs of the system contrary to the data-modeling approach. Nevertheless, simulation modeling cannot alone be a perfect solution for big data. For example, when it is difficult to acquire the knowledge on the system, it is not possible to completely build a model that satisfies the objectives of M&S.6 The simulation-modeling approach is based on the prior knowledge of the target system, and its completion depends on how much we can understand about the system. To sum up, each modeling approach has its advantages and disadvantages. To mitigate the disadvantages, a new modeling method is required. In this respect, this study presents the new modeling approach that employs advantages of both data modeling and simulation modeling, as depicted in Figure 2, ‘‘complementary cooperation modeling.’’

Kim et al.

581

Figure 2. Complementary cooperation between the two modeling approaches.

Accordingly, this study focuses on identifying the limitations of data modeling and simulation modeling, a comparison between the modeling approaches, and their complementary cooperation. Then, to demonstrate how to apply this approach, this paper applies it to a greenhouse control system in the real world. We expect our study to provide a new possibility for M&S in the era of big data. This paper is organized as follows: Section 2 presents some of the basic knowledge about the two modeling approaches, and Section 3 introduces previous works similar to our study and compares them. Section 4 discusses the limitations of both approaches and compares their characteristics. Section 5 describes the complementary cooperation between them, using an example of the greenhouse control system. Finally, Section 6 concludes the discussion.

2. Preliminaries Before moving to the central part of our work, it is imperative to clarify some issues related to modeling or analysis to understand the overall context of the study. Therefore, this section describes background knowledge for data modeling and simulation modeling, such as correlation and causality, and model usage in system analysis.

2.1. Correlation versus causality Knowing the difference between correlation and causality is essential in order to understanding system modeling. Correlation means whether and how strongly pair variables are related.7 Causality means a relationship in which one

action or event is the direct consequence of another.8 Correlation and causality may seem similar, but they have an entirely different meaning. The sentence ‘‘Correlation does not imply causation’’ is frequently used in statistics, and it means that correlation between two variables does not necessarily imply that one causes the other.9 While causation implies correlation, the reverse is not necessarily true. Even though much scientific evidence is based on the correlations of variables, they do not always imply causalities. Figure 3(a) shows the relation between quantities of imported oil and chicken consumption.10 The graph, made from observational data (data model), shows the information that quantities of imported oil increase as the consumption of chicken increases. However, this linear function does not assure causality between these two variables. Actually, rising quantities of imported oil do not cause an increase of consumption: an invigorated economy does. This shows the fact that the relationship between these two variables describes a correlation only, not a causality. On the other hand, the relation between car speed and distance moved, as shown in Figure 3(b), contrasts with the previous example. Since the graph in the figure is made from a physical law (simulation model), the simulation model can assure causality between two variables. This simple example shows that it is important to identify that the relationship means whether correlation or causality is present in the system analysis.

2.2. Levels of system analysis As Figure 4 illustrates, a user who wants to analyze certain systems can generally acquire a data set regardless of

582


Figure 3. Examples of correlation and causality.

having knowledge of the target system. The data set can comprise a data model that represents correlations between variables through the data-modeling approach, such as data-mining or machine-learning techniques. In addition, in this situation, if the user can acquire physical or operational laws of the target system, he or she can construct a simulation model representing causalities between variables by adding causal inference to the data model. The above data model and simulation model can be used to analyze the target system. They do not have the same analysis level but different levels of system analysis, as shown in Figure 4. The level of system analysis is divided into three parts: descriptive analysis, predictive analysis, and prescriptive analysis.11 Firstly, the descriptive analysis explains what has happened in the system. Secondly, the prescriptive analysis, an added pattern of data into the descriptive analysis, gives the prediction of what will happen in the future. Lastly, the prescriptive analysis, which is the highest level of system analysis, is possible with some additional information of the system structure. It gives the discernment of how we can make things happen. The data set and data model can be used in the descriptive and predictive analysis, respectively, but cannot be used for the prescriptive analysis due to lack of causality. In this respect, the two analyses are mostly used in the diagnosis of the system. On the other hand, the simulation model can be used for prescriptive analysis as well as for predictive analysis, and meeting the needs of various applications, including treatment of disease, control of policy, management of organization, and optimization.

To put it plainly, the user can model any system he or she wants to analyze as a different model according to the given information. As a result, it enables analysis of different levels, depending on the modeling approach.

3. Related works In recent years, some researches have been conducted on data modeling and simulation modeling with big data in various domains. These researches can be classified into two groups according to whether or not big data is used to build a model. Regarding the former, there is one research study on the dynamic data-driven application system (DDDAS).12 It provides better analysis and prediction of a system through inputting big data into the simulation model in real time. In other words, real-time data continually influences simulation. The DDDAS sees use in many fields, especially research on wildfire spread simulation.13 The DDDAS has improved the estimation of fire growth. It reflects the dynamic and stochastic nature of real data and informs real-time decision-making.14 However, the DDDAS is not complete enough to represent a complex system because it utilizes big data only to simulate the model in real time. It does not utilize big data for data modeling to overcome the limitations of the simulation model. In addition, simulation of traffic systems is a major field that makes good use of big data. In traffic simulations, big data from traffic sensors is used to calibrate the existing traffic model or help decision-making regarding

Kim et al.

583

Figure 4. Levels of system analysis.

traffic policy.15,16 On the other hand, some researches only use data modeling to represent and analyze the traffic system.17 However, not all of them provide the cooperation modeling approach between two modeling approaches. Researches about building energy modeling similarly apply big data to improve the simulation result.18 However, it is important to manage and process the big data of buildings.19 They focus on technical challenges to improve big data processing, but they do not approach the use of big data as a modeling means. In the latter case, relatively little research has been carried out. One representative study by Todorovski and Sasˇo Dzˇeroski20 combines modeling approaches. This study suggested a framework for modeling a dynamic system by integrating knowledge-driven and data-driven approaches. The framework is an automated modeling of dynamic systems based on equation discovery, inducing the unspecified parts of the system model from data. Even though this study tried to combine modeling approaches, it has limited relevance to the general M&S domain because it is focused on a restricted and dedicated application. Similarly, researches about combining rule-based and data-driven techniques have the same limitations.21–23 Unfortunately, to the best of our knowledge, no study explicitly uses cooperative modeling with both data and knowledge from identifying characteristics, limitations, and roles. In this respect, we suggest a new cooperation approach, which allows us to develop an enhanced model for complex systems or to improve the performance of simulation by taking advantage of two modeling approaches: data modeling and simulation modeling. For

this, prior to describing our proposed approach, the following sections deal with the limitations of each approach and the comparison of their characteristics.

4. Data modeling versus simulation modeling In this section, we explain the limitations of each existing modeling approach through brief examples, and then compare their characteristics.

4.1. Limitations of data modeling This section largely describes two weaknesses of the datamodeling approach through predictions of a simple queuing system and a traffic simulation. The former example shows its inability to predict under changed conditions, and the latter shows its inability to cope with unexpected events. 4.1.1. Prediction under the changed condition. As depicted in Figure 5, assume there is a queuing system in the real world, and we want to predict a turnaround time, that is, how long a customer stays in the system. Data sets are composed of three kinds of data: inter-arrival time, service time, and the turnaround time of past customers. From a data-modeling perspective, the most intuitive method is to construct a static map of input (inter-arrival time, service time) to output (turnaround time), using machine-learning techniques, such as the neural network and nonlinear autoregressive exogenous model (NARX),

584


Figure 5. Prediction from the data model and the simulation model: queuing system.

without knowledge of the queuing system.24 Then, when given a new customer, one can substitute the inter-arrival time and service time from the static map, as shown in the left-hand part of Figure 5. On the other hand, from a simulation modeling view, a modeler would make an abstracted single-queue single-server (SQSS) model with a first-in, first-out (FIFO) policybased knowledge of the queuing system. It would take the shape of a dynamic map of input (inter-arrival time, service time, state: status of queue and server) to output (turnaround time), using the modeling formalism of discrete event system specification (DEVS), shown in the righthand part of Figure 5.25 The overall procedure in Figure 5 shows how well the data model can predict the inter-arrival time of the queuing system. For this, we designed the input data set with the following model parameters: inter-arrival time follows exponential distribution ðl1 = 5Þ and service time follows

normal distribution ðm = 3; s = 0:3Þ. At first, with the input data set, from execution of the simulation model designed as DEVS in DEVSim++, we acquire output data set inter-arrival, and then construct a data model as NARX with the total data set (inter-arrival time, service time, inter-arrival time).26 After that, we compare the interarrival time of the new customer from the data model and the simulation model with extra testing of the data input set with the same probability distribution in training. This graph shows that if data for prediction have the same statistical property when training, then prediction from the data model is plausible within the margin of error. If conditions for prediction change after training, is the prediction with the data model reasonable? To answer this, we consider two situations: changes to the parameters of the model and to its policy. In case of the former, we change the model parameter for the input data set from the existing normal distribution (m = 3, s = 0.3) to a new

Kim et al.

585

Figure 6. Limitation of the data model under a changed structure. FIFO: first-in, first-out; LIFO: last-in, first-out.

normal distribution (m = 4, s = 0.4). For the latter, we change the policy of the queue from FIFO to last-in, firstout (LIFO). In both, we forecast the inter-arrival times of new customers with the data model constructed in the previous example. The experiment results from the graphs in Figure 6 show that prediction through the data model is no longer valid. That is, the data model cannot predict, if the model parameter or policy changes after training. 4.1.2. Unexpected events. Another limitation of the data model is that it cannot cope with unexpected events. In a real system, unexpected events, such as a rare event with a probability of 108 or less, may occur due to its high complexity and uncertainty. Usually, a data set does not include these events. When the event happens, the data model subordinated to the data set cannot predict the result of the unexpected events precisely. On the contrary, the simulation model, not dependent on the data set, may make a reasonable result in these events. To show the limitation, the vehicle allocation system (VAS) is described by the data model and the simulation model, as shown in Figure 7. One objective of the VAS is to find the average waiting time of passenger (AWT) in Daejeon, Korea, in given parameters: the number of passengers per hour (NP) and the number of vehicles in the city (NV). The data model is constructed with an ANN using the data set of the VAS, as shown in Figure 7. The simulation model of the VAS is an agent-based model using DEVS. The passengers and vehicles are described as agents who observe an environment

and decide their behaviors based on the predefined rule set. The environment in the simulation model is organized as a graph that reflects the real traffic environment of Daejeon with 223 vertexes and 390 edges.27 The simulation model is validated with the data set of the VAS. As shown in Figure 7, both the data model and the simulation model accurately predict the results of usual events. However, both models give different results when unexpected events happen, as shown in Figure 8. For example, some roads in the city may jam up due to a major traffic accident. In this case, the simulation model can predict reasonable results of the event, reflecting this traffic on the environment. For the VAS simulation model, the reflection is achieved by modification of some parameters at a lower cost. However, the data model, which did not include unexpected events, just gives a similar prediction result to the given data set. In addition, when the traffic policy of some roads, such as a speed limit, is changed, the data model requires a new data set of this environment to predict the result accurately. Acquiring the new data set entails heavy costs. Even if the data set is regarded as a rare event, it is impossible to get the data in the real system. Meanwhile, the simulation model can present a reasonable result by changing some parameters without much cost.

4.2. Limitations of simulation modeling To judge from the above example, are we sure that prediction by the simulation model is always better than that by

586


Figure 7. Prediction from the data model and the simulation model: vehicle allocation system.

the data model? If we have knowledge of the system for prediction, this goes without saying; however, it is not always possible to have such knowledge. In other words, there might exist situations in which we are able to know little or nothing about the system. Simulation requires extensive physical and operational knowledge of a target system in order to be accurate. In this condition, for prediction, we should adopt the approach of data modeling explained in the previous section. To put it another way, the simulation model has the following property: if knowledge about the system can be obtained, it should be applied to the prediction. In addition, the simulation model requires idealistic assumptions and constraints about the system, while the data model does not. For example, when we perform a modeling of the controller in the automotive engine, the controller may be valid in a narrow range of operation because the models include some ideal behavior assumptions and the model parameters are generally calibrated

using steady-state experimental data.6 This can be one of limitations of simulation modeling.

4.3. Comparison of data modeling and simulation modeling On the basis of the features mentioned above, Table 1 illustrates generalized characteristics of two modeling approaches in various perspectives. From the point of view of the modeling structure, the data-modeling approach merely associates the input variable with the output variable in static map form; for that reason, a modeler can make the data model without any knowledge. Conversely, the simulation-modeling approach is to make a dynamic map of the input and state to the output variable based on cause–effect; thereby, this approach requires knowledge of the target system. The former approach has been studied in research areas of computer science, such as data mining and machine learning, and the latter has been dealt with in

Kim et al.

587

Figure 8. Limitation of the data model under an unexpected event.

Table 1. Comparison of the data model and the simulation model.

Model representation & system knowledge Dynamics of the model

Modeling means Analysis levels Condition for valid prediction Simulation (decision) time Anomaly/non-existing system

Data model

Simulation model

Associational relationship between variables No knowledge about system required Static map of input variable to output variable

Cause–effect relationship between variables

(No state within model) Data mining Machine learning Descriptive ! Predictive (No decision on control policy) System structure remains unchanged before and after training Very short ! Real time Not applicable

research focusing on clear causality, such as physical and operational laws. The primary difference of the modeling approaches is the level of analysis. Because the data-modeling approach is built on learned information, it enables the analysis of a low level, such as to describe the past and to predict the future under the same condition when training. As this approach represents the system as a simple static model, execution of the model is fast enough to be feasible in real

System knowledge required Dynamic map of (input, state) to output

(State Q within model) Physical and/or operational laws Predictive ! Prescriptive ! Cognitive (Control policy) (Planning) Model validation Relatively long ! Hardly real time Applicable (as in rare event or new design)

time. On the other hand, since the simulation-modeling approach is based on knowledge, a higher level of analysis is possible in comparison with the data-modeling approach. For instance, it enables not only the prediction of the future under a different condition when training, but also the controlling and planning of the system. Such a difference allows the simulation-modeling approach to conduct an analysis of a system with an abnormal or nonexisting system, for example, a rare event or new design,

588


which is impossible in the data-modeling approach. However, due to the high complexity of the model, simulation or decision time through this approach is relatively longer than in the former approach. This causes difficulty in real-time execution. In summary, the data-modeling approach has an advantage in that analysis is possible regardless of knowledge of the system, but it has a weakness regarding the depth of analysis and can therefore only be used in superficial analysis. Conversely, while the simulation-modeling approach requires knowledge, it enhances the depth of analysis; in other words, profound analysis can be conducted. In these circumstances, we have a question. Can we complement the defects of both modeling approaches? The following section will the focus on this question.

5. Application of complementary cooperation between two modeling approaches: greenhouse control system From the limitations and comparison of the data- and simulation-modeling approaches in the previous section, we confirmed that it is difficult to describe a complex system using only one of the two approaches. Therefore, it is necessary to suggest a new type of modeling approach that can combine the advantages of each and yield enhanced results. It is also important to identify how the simulation and data models are constructed from prior knowledge and experimental data, respectively, and furthermore how these two models complement each other. This section applies this approach to a greenhouse control system. It presents descriptions of the system, greenhouse modeling, experimental design, and experimental results.

5.1. Greenhouse control system: overview A key challenge in greenhouse control systems is to set up a favorable environment for the growth of crops at a minimum price.28 For this, it is necessary to optimize the control parameters by maintaining a set point suitable for the development of plants: temperature, humidity, and so on. To find the optimal control parameters, it is difficult to experiment in a real greenhouse because variations of the internal environment of a greenhouse according to the specification of greenhouses is considerable. As a result, it requires a M&S approach. As the greenhouse has high complexity and various factors influencing it, there is a limitation to construct a greenhouse model for accurate simulation with only existing theory. For this reason, to acquire high accuracy, there is a complementary modeling between simulation modeling and data modeling from the real greenhouse. Some research studies have been conducted on the modeling of

Figure 9. Control of the greenhouse system based on modeling and simulation.

the greenhouse control system,29–31 but these approaches were almost all conducted with one modeling approach. Bennis et al.28 also performed a perceptive research on greenhouse modeling and robust control. However, their research is a kind of simulation modeling for both the controller and plant from the M&S standpoint. Accordingly, we arrive at cooperation modeling to amplify the synergy. The greenhouse control system is composed of a plant, a sensor, a controller, an actuator, and disturbance of the environment. Its execution process is shown in Figure 9. When the control parameters are set up in the controller, the result of operation occurs by the controller and actuator. Then, after the result and extra disturbance from the environment have influenced the plant model of the greenhouse, the plant generates output: internal temperature, internal humidity, and so on, and consequently, the output, measured through the sensors, reaches the controller as feedback. Such a process is recursively executed in every clock and, ultimately, the plant outputs can be adjusted close to the set point that we want to acquire. In brief, the objective of a greenhouse control system is to find optimized environments for crop growth through M&S.

5.2. Greenhouse control system modeling To model a greenhouse control system, one should first acquire observational data from real greenhouses. Table 2 depicts the concrete specifications of the greenhouse used in this application. After that, one must build a conceptual model that expresses the abstraction level, structure, and system elements according to the objectives of M&S. Greenhouse models broadly consist of a controller model, a plant model, an actuator model, a sensor model, and an outdoor environment, as mentioned earlier. The input of the system is the control parameter settings, and the output is the inner temperature or humidity inside the greenhouse.

Kim et al.

589

Then, the conceptual model should be partitioned into two models according to the objective of M&S and the acquisition of data/knowledge. In the greenhouse control system, the plant and actuator can be represented as simulation models because knowledge about them can be obtained sufficiently. On the other hand, the controller model can be represented as a data model with the acquisition of environmental data. This partitioning (simulation modeling: plant and actuator; data modeling: controller) is only valid in this paper. It can vary depending on the domains to apply. This partitioned conceptual model can be modeled using each modeling approach as follows. 5.2.1. Simulation modeling: plant. For simulation modeling, acquiring prior knowledge, including physical operations, laws, algorithms, structures, and mathematical equations, is essential. In the greenhouse control system, the overall structure of the system, plant, and actuator is modeled with the simulation-modeling approach. Above all, a plant demands much knowledge due to its complexity. The plant is a model depicting the internal environment of the greenhouse, calculating internal temperature and humidity based on physical laws or control factors, or external environment parameters. To carry out a simulation modeling of the plant, one must acquire knowledge or theories about how to control factors; the internal and external environments have an impact on the plant, so one must obtain environment parameter data and inner temperature (Ti)/humidity (Hi) data in the real greenhouse system. Table 3 describes the parameters of the greenhouse plant model, which consists of control input, control output, and disturbance. For simulation modeling, we use physical laws containing these parameters. Equations (1) and (2) show the physical laws derived from the energy balance for the temperature equation and the water mass balance for the hygrometry equation, respectively.32 In this case study, we only consider the temperature of the greenhouse in order to simplify the model: dTi ðtÞ = ðc1 + c2 Pw ðtÞÞðTo ðtÞ Ti ðtÞÞ + c3 Sw ðtÞ dt + c4 Ql ðtÞ

ð1Þ

dHi ðtÞ = ðd1 + d2 Pw ðtÞÞðHo ðtÞ Hi ðtÞÞ dt + ðd3 + d4 QlðtÞÞDHi ðtÞ

ð2Þ

Ti ðk + 1Þ = a1 Ti ðk Þ + a2 To ðk Þ + a3 Pw ðk Þ + a4 Ql ðk Þ + a5 Sw ðk Þ

ð3Þ

The temperature model can be converted and arranged as a discrete model, as shown in Equation (3). At that

Table 2. Specification of the greenhouse. Type

Specification

Location Dimensions Date/sampling time Kind of crop Sensors

Jinju, South Korea 30 m × 90 m × 7.5 m April–June 2015/1 min Tomato Temperature, humidity, CO2, rainfall, wind speed, light density, etc. Window, heater, curtain, etc.

Actuators

Table 3. Parameters used in the greenhouse plant model. Type

Parameter

Description

Control input Control output Disturbance

Pw (%) Ti (°C) To (°C) Ql (W/m2) Sw (cm/s)

Window (roof) position Indoor temperature Outdoor temperature Light quantity Wind speed

time, we must estimate the coefficients of the physical plant model. We use the off-line least squares method33 to estimate them using the observational greenhouse data in Table 2. Equation (4) shows the cost function of the plant model: 8 NP 1 2 > > > Ti ðk + 1Þ uT ðk Þû ðN : Num: of samplesÞ > > :

ð4Þ "

u =

N 1 X k=1

#1 T

uðk Þu ðk Þ

3

N 1 X

uðk Þ Ti ðk + 1Þ

ð5Þ

k=1

The optimal values of the coefficients can be determined to minimize the cost function J by zeroing the derivative of J, as represented in Equation (5). To obtain the optimal coefficients, we use the acquired greenhouse data displayed in Table 2. The off-line least squares method is applied to the plant model by using greenhouse data from five successive days. Table 4 shows the estimated results that complete the simulation modeling of the plant. After deriving the coefficients, a validation process should be performed to check the validity of the simulation model. We use the RMSE as a means of validation. The same greenhouse data is used to validate the model. Figure 10 shows a graph of the validation results and the RMSE value. From the validation results, we consider that the plant model can provide somewhat fine predictions of the indoor greenhouse temperature. Therefore, we use this

590


Table 4. Estimated coefficient values of the plant model.

Table 5. Description of the input parameters influencing the Pband.

Coefficient

Value

Name

Description

a1 a2 a3 a4 a5

9:992 × 101 9:201 × 103 1:563 × 104 1:884 × 105 2:813 × 104

Max.outdoor.temp.

Maximum threshold of outdoor temperature Minimum threshold of outdoor temperature Maximum threshold of wind speed Minimum threshold of wind speed Maximum threshold of P-band Minimum threshold of P-band Degree of window opening by one execution

Figure 10. Validation result of the plant model: graph and root-mean-square error (RMSE).

Figure 11. Graph of the P-band value according to outdoor temperature and wind speed.

plant model for the simulation of the greenhouse control system. 5.2.2. Data modeling: controller. One important factor in the greenhouse control system is the controller. Even though there are many factors to be controlled, this control system deals with two: the presence of a window and a heater. The window and heater play a significant role in decreasing and increasing the inner temperature, respectively. In

Min.outdoor.temp. Max.wind.speed. Min.wind.speed. Max.pband. Min.pband. Win.open.deg.

our case, we only consider the window controller. In the greenhouse system, a proportional (P) control algorithm generally controls the window. We apply the concept of P-band for the P control.34 The P-band is a kind of proportional parameter that determines the opening angle of the window according to an excess of desired temperature: the difference between the set point and the measured point is the reciprocal of the proportional gain constant (KP), which is the general proportional parameter. This P control can be operated in two ways: a fixed or variable P-band (or KP) value. The general controller, which we can usually see, uses the fixed P-band value. On the other hand, the outdoor temperature and wind speed from the real greenhouse data dynamically influence the variable P-band. It decreases with high temperature and conversely increases with high wind speed. Figure 11 shows a graph that determines the variable Pband value, depending on the outdoor temperature and wind speed. The graph is based on prior knowledge of plant growth. The shape of the variable P-band value graph is determined by the seven input parameters shown in Table 5. When the shape of the graph changes due to the input parameters, the inner temperature and humidity can be affected. As a result, if there is a model drawing optimal values regardless of parameter change, it will bring an improved result. Therefore, we need to build a data model that can fluently cope with the circumstance regardless of changes of the input parameters. To build a data model, we have to acquire experimental data through repeated simulation using the controller (shown in Figure 11). The design of the experiment for the repeated simulation is accomplished through the combination of the parameters shown in Table 5. We can obtain a data model using the simulation result. The inputs of the model are the seven parameters shown in Table 5, and its output is the P-band value. An ANN can be used to extract this data model. We use the acquired model in the following section.

Kim et al.

591

Figure 12. Experiment with the greenhouse model.

5.3. Experimental design In this section, we design experiments of the developed greenhouse model to show the effect of the cooperation of the two modeling approaches. We can show the contribution by comparing two experiments; one uses a model from the complementary cooperation approach, and the other uses a model from simulation modeling. For this comparison, we use the model shown in Figure 12 in two ways. In this case study, MATLAB/Simulink is used to build both the simulation model and data model. The plant model based on the physical laws is implemented using MATLAB. The controller model based on the observational data is learned using an ANN toolbox that MATLAB provides. We can easily integrate both models into one model and simulate using MATLAB. The enhanced result can be shown with two performance indices: the RMSE and moving (or rolling) standard deviation. Firstly, the RMSE, used in Section 5.2.1 to validate the model, is evaluated from the difference between the set point and the predicted greenhouse temperature. (The set point is determined by domain knowledge of plant growth.) That is, it shows that the smaller RMSE value means a better control performance. Next is the aspect of the moving standard deviation. Although the lower RMSE is a significant element, Kamp and Timmerman34 mentioned that it is also important to maintain a constant temperature for optimal plant growth. Therefore, in

the greenhouse system, evaluating the temperature fluctuation can be a significant performance index, apart from performance indices generally used in the control system. To evaluate it, we use the moving standard deviation frequently used for analysis of a time series model.35 The moving average and standard deviation are represented as follows: 8 > > > > > < > > > > > :

nP 1 yti ubt ðnÞ = 1n i=0 ffi sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi nP 1 1 sbt ðnÞ = n1 ðyti ubt ðnÞÞ2

ð6Þ

i=0

ðyt : time series; n : width of windowÞ

We can see the stability of the time-varying greenhouse temperature by deriving the mean value of the moving standard deviation. Its small value implies the high stability of the temperature.35 That is, it provides a stable environment for plant growth. In our case study, we use the observational data of the Jinju greenhouse as depicted in Table 2. From this data set, we use the 6000 sample data for four days to simulate the model.

5.4. Experimental results This section will discuss the experimental results. Figure 13 shows the simulation results of the greenhouse control

592


Table 6. Simulation results: performance indices. Performance Index

Simulation modeling + data modeling

Simulation modeling

RMSE Mean of moving standard deviation

2.507 0.161

2.857 0.183

RMSE: root-mean-square error.

emission varies according to not only the opening angle, but also disturbances (outdoor temperature, wind speed, etc.). This can lead to a higher temperature fluctuation. Therefore, the greenhouse model with only the simulation model makes it difficult to maintain the quantity of heat because it does not reflect the disturbances. On the other hand, a data model that reflects the change of disturbances makes it possible to keep the quantity of heat emission constant, as shown in Figure 13. To sum up, because it is difficult to completely represent the real system using physical laws inside the greenhouse, due to its complexity, the data model of the controller can assist this defect through the complementary cooperation. Defects of the simulation model can be mitigated with the data model and, from this, the cooperation of two kinds of models ensures a certain improvement of performance and more stable plant growth.

6. Conclusion Figure 13. Simulation result: greenhouse temperature.

system designed in the previous section. Table 6 also shows the numerical results in order to analyze the results specifically. It represents comparative performances between cooperation M&S modeling. Because we do not consider the heater in this case study, we only use data controlled by the window when we calculate the performance indices. The first performance index is the RMSE. Its result shows that the cooperation of the two models has enhanced control performance compared to the simulation model. To put it concretely, this cooperation brings a 14.0% performance improvement. This results from supplementing the P-band value with the data model and deducting the optimized P-band value, regardless of control parameter changes. In this regard, because the RMSE is relative to the change of the set point, its absolute value is more important than the rate of improvement. The important fact is that it reduces the error from using two models complementarily. The second performance index is the mean of the moving standard deviation. The results in Table 6 also show the performance improvement using two modeling approaches in concert. This technique brings a 14.4% stability improvement over the simulation model. It means that the data model of the controller helps the simulation models maintain temperature. The reason is as follows. In this greenhouse simulation, the roof window is the only control element that we use to reduce the interior temperature. When controlling the temperature through regulating the opening angle of the window, the quantity of heat

In modern society, a large amount of information can be stored and processed due to the development of information technology, and many researchers have thus used it to predict target systems in various research fields, such as medicine, science, engineering, management, defense, and politics. Before such a prediction, a model for forecast should be constructed, and therefore two types of models have been studied: data modeling and simulation modeling. Even though both approaches have advantages and limitations, existing studies have constituted models with one modeling perspective between two. For this reason, it is difficult to get the benefits of each model concurrently. To overcome these weaknesses, in this study, we showed a complementary cooperation modeling that is able to combine the merits of both with two issues: (i) indepth consideration of existing modeling; and (ii) empirical analysis with our proposed approach. For the intensive study of previous modeling, we first clarified the differences between the existing modeling approaches. Then, we explained their advantages and limitations in detail. From a practical queuing system and a VAS, we showed the limitations of data modeling compared with simulation modeling with a machine-learningbased NARX and a DEVS formalism-based simulation model, respectively, and addressed the opposite situations. On the basis of the in-depth considerations, we suggested a complementary cooperation modeling method, and applied it to a real-world greenhouse control system. In the process of modeling the system, we designed a plant model and a controller model using simulation and data modeling with physical equations and real data from the system, respectively. For the empirical analysis, after constructing one simulation environment including the two kinds of models, we conducted an experiment and showed

Kim et al. relatively enhanced performance on control of the temperature in comparison with the existing method. As a main contribution, we showed a possibility of a new modeling approach that differs from the existing modeling approach. We then showed how the proposed modeling approach can be applied in the real system, and how it can create synergy through the cooperation between two models. In future works, we will establish a complementary cooperation modeling methodology between the datamodeling and simulation-modeling approaches, applying the concepts of interoperability36 and composability.37 Also, although we applied only one dedicated agriculture system in this paper, we will apply the proposed works to various other research fields that are influenced by big data.

Declaration of conflicting interest

593

13.

14.

15.

16.

17.

The authors declare that there is no conflict of interest. 18.

Funding This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

19.

7. References 1. Hashem IAT, Yaqoob I, Anuar NB, et al. The rise of ‘‘big data’’ on cloud computing: review and open research issues. Inform Syst 2015; 47: 98–115. 2. Bock HG, Carraro T, Ja¨ger W, et al. Model based parameter estimation: Theory and applications. New York: Springer, 2013. 3. Anderson C. The end of theory: The data deluge makes the scientific method obsolete, http://www.wired.com/2008/06/ pb-theory (accessed 1 January 2016). 4. Wanamaker B and Bean D. Big data: the end of theory in healthcare? http://www.christenseninstitute.org/big-data-theend-of-theory-in-healthcare (accessed 1 January 2016). 5. Christensen CM. The innovator’s dilemma: when new technologies cause great firms to fail. Brighton, MS: Harvard Business School Press, 1997. 6. Janakiraman VM. Machine learning for identification and optimal control of advanced automotive engines. Doctoral Dissertation, The University of Michigan, 2013. 7. Hayter A. Probability and statistics for engineers and scientists. 4th ed. Boston, MS: Cengage Learning, 2012. 8. Beebee H, Hitchcock C and Menzies P. The Oxford handbook of causation. Oxford: Oxford University Press, 2009. 9. ldrich J. Correlations genuine and spurious in Pearson and Yule. Stat Sci 1995; 10: 364–376. 10. Spurious Media LLC. Spurious correlations, http://www.tylervigen.com/spurious-correlations (accessed 1 January 2016). 11. Evans JR. Business analytics. 2nd ed. London: Pearson, 2015. 12. Darema F. Dynamic data driven applications systems: A new paradigm for application simulations and measurements. In:

20.

21.

22.

23.

24. 25.

26.

27.

International conference on computational science, Krako´w, Poland, 6-9 June 2004, pp.662–669. New York: Springer. Ntaimo L, Hu X and Sun Y. DEVS-FIRE: towards an integrated simulation environment for surface wildfire spread and containment. Simulation 2008; 84: 137–155. Yan X, Gu F, Hu X, et al. A dynamic data driven application system for wildfire spread simulation. In: Proceedings of the 2009 winter simulation conference, Austin, TX, USA, 13–16 December 2009, pp.3121–3128. Piscataway, NJ: IEEE. Mudigonda S and Ozbay K. Using big data and efficient methods to capture stochasticity for calibration of macroscopic traffic simulation models. In: Symposium celebrating 50 years of traffic flow theory, Portland, Oregon, 11–13 August 2014. Washington, DC: Transportation Research Board. Wilkie D, Sewall J and Lin MC. Transforming GIS data into functional road models for large-scale traffic simulation. IEEE Trans Visual Comput Graph 2012; 18: 890–901. Lu H, Sun Z and Qu W. Big data-driven based real-time traffic flow state identification and prediction. Discrete Dynam Nat Soc 2015; 2015: 1–11. Sanyal J and New J. Simulation and big data challenges in tuning building energy models. In: IEEE workshop on modeling and simulation of cyber-physical energy system, 20 May 2013. Piscataway, NJ: IEEE. Sanyal J and New J. Building simulation modelers - are we big data ready? In: Proceedings of the ASHRAE/IBPSA-USA building simulation conference, Atlanta, GA, 10–12 September 2014, pp.449–456. Todorovski L and Sasˇo Dzˇeroski S. Integrating knowledgedriven and data-driven approaches to modeling. Ecol Model 2006; 194: 3–13. Sun J, Hu J, Luo D, et al. Combining knowledge and data driven insights for identifying risk factors using electronic health records. In: AMIA annual symposium proceedings, Chicago, IL, USA, November 3–7 2012, pp.901–901. Bethesda, MD: American Medical Informatics Association. Bangalore S and Johnston M. Balancing data-driven and rule-based approaches in the context of a multimodal conversational system. IEEE workshop on automatic speech recognition and understanding, 30 Nov - 4 Dec 2003, pp.221–226. Sagae K and Lavie A. Combining rule-based and data-driven techniques for grammatical relation extraction in spoken language. In: The 8th international workshop in parsing technologies, Nancy, France, 23–25 April 2003. Nelles O. Nonlinear system identification. New York: Springer, 2000. Zeigler BP, Praehofer H and Kim TG. Theory of modeling and simulation. 2nd ed. Cambridge, MA: Academic Press, 2001. Kim TG, Sung CH, Hong SY, et al. DEVSim++ toolset for defense modeling and simulation and interoperation. J Defense Model Simulat 2011; 8: 129–142. Choi SH, Lee SJ and Kim TG. Multi-fidelity modeling & simulation methodology for simulation speed up. In: Proceedings of the 2nd ACM SIGSIM/PADS conference on principles of advanced discrete simulation, Denver, CO, USA, May 18–21, 2014, pp.139–150. New York: ACM.

594


28. Bennis N, Duplaix J, Enea G, et al. Greenhouse climate modelling and robust control. Comput Electron Agric 2008; 61: 96–107. 29. Cunha JB. Greenhouse climate models: an overview. In: The 3rd conference of the European Federation for Information Technology in Agriculture, 5–9 July 2003. 30. Luan X, Shi P and Liu F. Robust adaptive control for greenhouse climate using neural networks. Int J Robust Nonlin Contr 2011; 21: 815–826. 31. Cunha JB, Couto C and Ruano AE. Real-time parameter estimation of dynamic temperature models for greenhouse environmental control. Contr Eng Pract 1997; 5: 1473–1481. 32. Bot GPA. Physical modelling of greenhouse climate. In: Proceedings of the IFAC/ISHS workshop, Matsuyama, Japan, 30 September-3 October 1991, pp.7–12. Oxford: Pergamon Press. 33. Ljung L. System identification - theory for the user. Upper Saddle River, NJ: PTR Prentice-Hall, 1987. 34. Kamp PGH and Timmerman GJ. Computerized environmental control in greenhouses. IPC-Plant, 1996. 35. Zibot E and Wang J. Modeling financial time series with SPLUS. Berlin: Springer Science & Business Media, 2007. 36. IEEE Standard 1516:2010. IEEE standard for modeling and simulation: high level architecture – HLA framework and rules. 37. Sarjoughian HS. Model composability. In: Rossetti M D, Hill R R, Johansson B, Dunkin B A and Ingalls R G (eds.) Proceedings of the 2009 winter simulation conference, 2009, pp.149–158.

Author biographies Byeong Soo Kim received his BS and MS in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST) in 2010 and 2012, respectively. Currently, he is a PhD candidate at the Department of Electrical Engineering at KAIST. His research interests include M&S of discrete event systems, big data modeling, and system identification.

Bong Gu Kang received his BS in electronics communication and computer engineering from Han Yang University and his MS in electrical engineering from KAIST in 2011 and 2013, respectively. He is currently a PhD candidate at the Department of Electrical Engineering at KAIST. His research interests include discrete event systems, defense M&S, and interoperation. Seon Han Choi received his BS and MS in electrical engineering from KAIST in 2012 and 2014, respectively. Currently, he is a PhD candidate at the Department of Electrical Engineering at KAIST. His research interests include M&S of discrete event systems, reverse simulation, and simulation-based optimization. Tag Gon Kim received his PhD in computer engineering with specialization in systems M&S from the University of Arizona, in Tucson, Arizona, in 1988. He was an assistant professor of electrical and computer engineering at the University of Kansas, in Lawrence, Kansas, from 1989 to 1991. He joined the Electrical Engineering Department at KAIST, in Daejeon, Korea, in the fall of 1991, and has been a full professor in the EECS department there since the fall of 1998. He was the President of the Korea Society for Simulation (KSS). He was the editor-in-chief for Simulation: Transactions for Society for Computer Modeling and Simulation International (SCS). He is a coauthor of the textbook Theory of modeling and simulation, Academic Press, 2000. He has published about 200 papers in M&S theory and practice in international journals and conference proceedings. He is very active in defense M&S in Korea. He is a consultant for defense M&S technology at various Korean government organizations, including the Ministry of Defense, Defense Agency for Technology and Quality (DTAQ), Korea Institute for Defense Analysis (KIDA), and Agency for Defense Development (ADD). He is a fellow of SCS and a senior member of IEEE.

Data modeling versus simulation modeling in the big data era: case ...

Data modeling versus simulation modeling in the big data era: case ...

Suggest Documents

Modeling big smart data - SoSyM

statistics in the big data era

Industrial Engineering in the Big Data Era

Publishing in the Era of Big Data

Economic Modeling in the Post Great Recession Era; Incomplete Data ...

Modeling Data Heterogeneity using Big DataSpace ...

Mega-Modeling for Big Data Analytics

Data-driven medicinal chemistry in the era of big data

Astronomy in the Big Data Era - Data Science Journal - codata

Astronomy in the Big Data Era - Data Science Journal - CoData

Small data in the era of big data - Springer Link

Astronomy in the Big Data Era - Data Science Journal - codata

Modeling and Querying Scientific Simulation Mesh Data

Big Data, NoSQL Databases and Data Modeling for a Document ...

Small Sample Learning in Big Data Era

ADVANCES IN DATA MODELING RESEARCH

ADVANCES IN DATA MODELING RESEARCH

Modeling of System of Systems via Data Analytics â Case for âBig Data ...

Modeling of System of Systems via Data Analytics â Case for âBig Data ...

MODELING SKEWNESS IN SPATIAL DATA

Predictive Modeling in a Big Data Distributed Setting ...

Big Data in Building Information Modeling Research - MedCrave

SAP BW Data Modeling

Data Modeling - Liberty University

Data modeling versus simulation modeling in the big data era: case ...