J Intell Manuf (2011) 22:289–299 DOI 10.1007/s10845-009-0291-9
DeviceNet network health monitoring using physical layer parameters Yong Lei · Dragan Djurdjanovic · Leandro Barajas · Gary Workman · Stephan Biller · Jun Ni
Received: 13 March 2008 / Accepted: 8 July 2009 / Published online: 9 August 2009 © Springer Science+Business Media, LLC 2009
Abstract Since the 1980s, the manufacturing environment has seen the introduction of numerous sensor/actuator bus and network protocols for automation systems, which led to increased manufacturing productivity, improved interchangeability of devices from different vendors, facilitated flexibility and reconfigurability for various applications and improved reliability, while reducing installation and maintenance costs. However, such heightened manufacturing integration facilitated by industrial networks also leads to dramatic consequences of improper or degraded network operation. This paper presents a novel Network Health Management system that provides diagnostic and prognostic information for DeviceNet at the device and network level. It extracts features from analog waveform communication signals, as well as logic and timing features from digital packet logs. These features are used to evaluate network system performance degradation by applying multidimensional clustering techniques. In addition, this work proposes a hybrid prognostics structure using combined physical and logic layer features to provide fault location information that cannot easily be realized with analog or digital data independently. Furthermore, an intermittent connection diagnostic algorithm which analyzes patterns of interrupted and error packets on the network was developed. This tool can be used as a packet source identification method which uses joint analog features and digital information inferred from analog waveforms. A test-bed was constructed, and experiments on network impedance mismatch, cable degradation, and intermittent connections were conducted in a laboratory environment. Experimental results show that the proposed system can detect degradations of the network and successfully identify the location of an intermittent connection. Field tests performed in an industrial environment were also conducted and their results are discussed.

Keywords Industrial networks · Field buses · DeviceNet · Fault diagnosis · Network automation systems · Network instrumentation

Y. Lei (B) · J. Ni
Department of Mechanical Engineering, University of Michigan, Ann Arbor, MI 48109, USA
e-mail: [email protected]

D. Djurdjanovic
Department of Mechanical Engineering, University of Texas, Austin, TX 78705, USA

L. Barajas · S. Biller
Manufacturing Systems Research Lab, General Motors Research and Development Center, Warren, MI 48090, USA

G. Workman
Controls, Conveyors, Robotics and Welding (CCRW), General Motors Technical Center, Warren, MI 48090, USA
Introduction

Since the development of the Fieldbus during the 1980s, the manufacturing environment has seen the introduction of numerous sensor/actuator bus and network protocols for automation systems (Cena et al. 1995; Timoney 2004). The wide-scale implementation of networked automation systems has contributed to the enhancement and streamlining of manufacturing processes, improved the interchangeability of devices from different vendors, provided flexibility and reconfigurability for various applications and improved reliability, while reducing installation and maintenance costs. The characteristics of the Fieldbus architecture motivate the change from a point-to-point architecture to a fully distributed networked architecture. However, industrial networks also introduce different forms of uncertainty
123
between controllers, sensors and actuators. Among these uncertainties, timing is one of the crucial elements in industrial automation systems: timing uncertainty can cause system performance degradation and even total system failure. It arises from the shared communication medium as well as from the communication processing time of the devices connected to the fieldbus. Most of the research on distributed automation systems has focused on three areas: communication protocols (Pleinevaux and Decotignie 1988; Lian et al. 2001), controller design (Nilsson et al. 1998; Lian et al. 2000), and specification analysis and failure diagnosis in terms of logic behavior (Sampath et al. 1996; Pranevicius 1998; Koutsoukos et al. 2001). A well-designed communication protocol can guarantee the performance of the network, while controller design can achieve optimal control performance within the constraints of network delay and uncertainty. In addition, fault diagnosis can be performed correctly under the assumption that the network is functioning normally as a transparent layer. However, it is also important to consider the performance of the network itself to ensure accuracy in locating the root cause of a failure. Moreover, as the level of sophistication of networked automation systems increases, the negative cost effects of downtimes caused by degraded network performance greatly increase. Therefore, the effects of these failures and other limitations on the performance of industrial networks must be well understood and addressed. Furthermore, early detection of networked system anomalies is critical not only for preventing failures that often cause major production losses in industrial automation facilities, but also for improving network performance and reliability. Most of the research on networked automation systems has focused on the effects of network delays on control systems.
In networked manufacturing systems, the fundamental requirement on communications is that message delivery latency must be bounded and known, even in the presence of factors such as overload or faults. Bounded transmission delay and worst-case deadline analysis for controller area network (CAN) based networks have been comprehensively addressed in Tindell et al. (1995), Navet et al. (2000), Rufino and Veríssimo (1995) and Cenesiz and Esin (2004). Packet transmission delay analysis for other network structures has also been studied in the literature. For example, Lian et al. (2001) analyzed the performance of ControlNet and Ethernet; Georges et al. (2002) studied the packet transmission characteristics of switched Ethernets. Moon et al. (1993, 1998) studied the performance of the IEEE 802.4 token bus in noisy environments. Research has also been conducted on the dependability of networked automation systems. Cauffriez et al. (2003) and Jumel et al. (2003) analyzed the network dependability of distributed systems, especially at the controller level. Corno et al. (2004) analyzed the dependability of CAN based networked systems
using simulation models. Rodriguez-Navas et al. (2003) developed a fault injection system to analyze the effect of network faults on control system performance. However, the main failure modes of DeviceNet and their corresponding signatures have not been fully studied or understood. In addition, even though industrial vendors provide some basic diagnostic functions in their products, there is still an urgent need for methods that provide node performance analysis and a quantitative description of the degradation of network performance. In order to better understand the network failure modes and to develop tools to detect and quantify network performance degradation, we introduce methods for assessing the performance of the network from physical layer performance indicators such as voltage levels, time delays, and voltage profile characteristics. This paper is organized as follows. In the next section, we introduce relevant DeviceNet background. We then define the problem addressed in this paper and summarize the major failure modes of DeviceNet systems. Next, we introduce the structure of the network health monitoring (NHM) system. The experimental analysis of the case study we considered is then reported, followed by conclusions.
DeviceNet basics

DeviceNet is a fieldbus protocol used worldwide. It is an application layer protocol that uses the standard CAN specification as its physical and data link layer protocols (Open DeviceNet Vendors Association 1997). CAN is a serial communication protocol based on the Carrier Sense Multiple Access/Arbitration on Message Priority (CSMA/AMP) media access method. The physical layer electrical connection, as defined in ISO (2003), is shown in Fig. 1. The standard specifies a bus with 2 V differential electrical signaling in which the bit stream of a transmission is synchronized at the physical layer. The logic states on the bus are defined as recessive ('1' logic) and dominant ('0' logic), where the terms recessive and dominant indicate that a dominant state will always
Fig. 1 Physical layer connection recommended by ISO 11898
Fig. 2 DeviceNet data packet format (standard CAN data frame)
cancel a recessive state. Any node that wants to transmit a packet must wait until the bus is idle. Conflicts in accessing the bus are resolved by a wired-AND process at the bit level during transmission of the arbitration field of the data frame. If a node receives a bit different from the one it sent out during arbitration, that node loses the arbitration. With this arbitration method, the transmission from the higher-priority node is not destroyed. In DeviceNet/CAN, the protocol is message oriented, and each message carries a specific priority defined by the message identifier. The data packet format of DeviceNet is shown in Fig. 2. The total data frame includes the Start of Frame (SOF), Arbitration Field (11-bit identifier), Control Field, Data Field, Cyclic Redundancy Check (CRC) Field, Acknowledgment (ACK) Field, End of Frame (EOF) and Intermission (INT) Field. The size of the data field varies between 0 and 8 bytes. The arbitration field provides message prioritization as well as source and destination identification.
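The wired-AND arbitration described above is compact enough to simulate in a few lines. The following is an illustrative sketch of ours, not code from the paper: each contending node transmits its 11-bit identifier MSB-first, the bus level is the AND of all transmitted bits (dominant '0' overrides recessive '1'), and a node that reads back a different bit than it sent drops out.

```python
# Illustrative simulation of CSMA/AMP bitwise arbitration on a CAN bus.
# Dominant '0' always overrides recessive '1' (wired-AND), so the
# numerically lowest identifier wins and is transmitted undisturbed.

def arbitrate(identifiers):
    """Return the winning identifier among contending 11-bit CAN IDs."""
    contenders = set(identifiers)
    for bit in range(10, -1, -1):                 # MSB (bit 10) first
        sent = {i: (i >> bit) & 1 for i in contenders}
        bus = min(sent.values())                  # wired-AND: any 0 -> 0
        # Nodes that sent recessive '1' but observe dominant '0' back off.
        contenders = {i for i in contenders if sent[i] == bus}
    assert len(contenders) == 1                   # unique IDs -> unique winner
    return contenders.pop()

print(arbitrate([0x18F, 0x0A2, 0x3FF]))  # -> 162 (0x0A2, the lowest ID, wins)
```

Note that the loser is not "destroyed": it simply stops driving the bus and retries after the winner's frame, which is why arbitration is non-destructive.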
The problem definition

As previously mentioned, the fundamental requirement on the network in a networked automation system is that the message delivery latency must be bounded in a noisy environment. Many noise factors, such as faults or interference, are directly related to major failure modes of the network. These failure modes are sources of delivery latency, asynchrony and inconsistency, which affect network robustness and reliability. According to our industrial network experts, the majority of the failure modes commonly occurring on DeviceNet networks are due to one or more of the following factors:
– Media degradation. Network cables are usually considered durable components, which is generally true unless the cable is used in a vibration-intensive environment or mounted under stress. Vibration and improper mounting can change the electrical properties of the cable, such as its resistance, inductance and capacitance. In some scenarios, for example, the movements of robotic arms can break the insulation and cable threads. Although a degraded cable may not affect network communication immediately, it degrades the quality of the communication signals and makes the system vulnerable to external interference.
– Network impedance mismatch. ISO 11898 recommends that each end of the network be terminated with a 120 Ω resistor. In practice, however, fewer or more than two terminating resistors may be present because of human factors or improper setting of the internal terminating resistor inside nodes. Too few terminating resistors weaken the network's robustness to noise, while too many may cause signal reflections and hence distort the signal (Lawrenz and Lawrenz 1997).
– Intermittent connections. Intermittent connection is one of the most frequent and troublesome failure modes observed in industrial networks. It usually occurs at the connections between a field device and the network trunk/drop cable when they are not reliably and consistently mated due to external factors. This phenomenon results in intermittent breaks in communication between the field device and the network; thus, intermittent connections commonly manifest themselves as the corresponding field device having transient difficulties and breaks in communication with the controller.
– Grounding problems. Although the CAN protocol is robust in noisy environments, external disturbances still affect network performance, especially when the system is not properly grounded. In applications such as automotive manufacturing plants, systems are often subjected to a high degree of electromagnetic interference (EMI) from the operational environment, such as welding equipment, motors and relays. These interferences may trigger transmission errors on the CAN bus when the connection to ground is degraded.

In this paper, we focus on the first three problems.
The grounding problems, although they do occur and cause trouble, can be easily identified by measuring the voltage and current on the shield with reference to ground (Open DeviceNet Vendors Association 2006). As mentioned in 'Introduction', currently available network monitoring and diagnostic tools can provide certain basic information about the network, such as voltage monitoring or error counters. Nevertheless, a tool that can quantify the performance of the network and identify the location of an intermittent connection is not currently available to industry.

Proposed methodology

In order to quantify the performance of the network, we propose a NHM system, designed to passively record network performance metrics and evaluate them with existing
Fig. 3 Network health monitoring system (NHM) function block diagram
reference features. A general functional block diagram of the proposed NHM system is illustrated in Fig. 3.
Data collection

The DeviceNet protocol is a thin layer implementation based on the CAN protocol, which implements the physical and data link layers of the 7-layer ISO/OSI network model (Lawrenz and Lawrenz 1997). Therefore, signals in these two layers contain most of the network performance and failure information. As illustrated in Fig. 3, signals from the physical layer are collected through analog measurements, while data link layer features are collected using a network interface device (PEAK-Systems Technik GmbH 2006). The network features collected from the different layers are gathered and then handled by the performance assessment module. Currently, only physical layer information is utilized. Detailed information on each module is given in the following sections.

Feature extraction from physical layer signals

In order to monitor the network performance and detect network anomalies using physical layer information, features need to be extracted from the recorded analog waveforms. Figure 4 shows the procedure by which features are extracted and grouped into a feature matrix according to the source addresses decoded from the waveforms. All features are collected from the header segment of the packet, given that the content of the header for a specific node is predetermined. The physical features are collected in the following six categories:
– Signal-to-Noise Ratio (SNR)
– Common mode signal features
– Overshoot features
Fig. 4 Flow chart of the NHM feature extraction process
– Rise- and fall-time features
– Bit width features
– Signal Root Mean Square (RMS) features

The features in the aforementioned categories are further described as follows:
– Signal-to-Noise Ratio. The SNR is calculated from the signals during the dominant state, where the voltage difference between CAN_H and CAN_L is around 2 V. An example of the waveform is shown in Fig. 5. As can be seen, the SNR calculation uses the static segment of the dominant state data of the packet header. The SNR is defined as

SNR = (average signal power)/(average noise power), or SNR_dB = 10 log10(SNR),

where the average power of a signal x of length N is defined as

P_x = E_x/N = (1/N) Σ_{n=0}^{N−1} |x_n|².

Fig. 5 Waveform segment during a typical dominant state

– Common mode signal features. The common mode signal features are the mean and standard deviation of the common mode signal of a packet. Features of the common mode signal represent network-wide problems, such as unbalanced differential signals, circulating currents and imperfect grounding. The desired DeviceNet (CAN) signal is the differential or balanced signal, and the undesired common mode or unbalanced signal is the main source of problems on the network. The two-wire transmission signal can be split over time t into a differential signal component U_Signal(t) = U_CAN_H(t) − U_CAN_L(t) and a common mode signal component U_Common(t) = (U_CAN_H(t) + U_CAN_L(t))/2. An ideal differential signal exists only if U_Common(t) = constant (Lawrenz and Lawrenz 1997).
– Overshoot features. Overshoots are calculated when the logic state switches from the recessive to the dominant state. The overshoot features include absolute and percentage overshoot values. The static values of the states (dominant/recessive) are the average values of the final states.
– Rise/fall time features. As in standard automatic control literature, the rise time is the time required for the response to rise from 10 to 90% of the final value (dominant state voltage). Similarly, the fall time is defined for the transition from dominant to recessive static voltage.
– Bit-width features. The bit width in this paper is the time duration for which a bit maintains its state (in the case of consecutive bits, the average bit width is calculated).
– Signal RMS features. The signal RMS features include:
  – Signal RMS: RMS of the entire signal sequence (DC RMS).
  – Signal RMS noise 1: RMS of the signal minus the ideal DC values of the recessive and dominant states (AC RMS).
  – Signal RMS noise 2: a variation of Signal RMS noise 1, measured after transients have passed; recessive and dominant RMS values are calculated separately.
  – DC values of the dominant and recessive states.
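As a concrete illustration, the SNR and RMS computations above might look as follows in Python. This is a sketch of ours, not code from the paper; the function and constant names are our own, and a real implementation would first segment the sampled waveform into per-bit dominant/recessive states.

```python
# Sketch of the SNR and RMS feature computations (our own rendering).
# `diff` is the sampled CAN_H - CAN_L differential voltage; `is_dominant`
# is a boolean array marking which samples belong to dominant bits.
import numpy as np

DOMINANT_V, RECESSIVE_V = 2.0, 0.0   # ideal DC levels of the two states

def avg_power(x):
    x = np.asarray(x, dtype=float)
    return np.mean(np.abs(x) ** 2)   # Px = (1/N) * sum |x_n|^2

def snr_db(segment, ideal=DOMINANT_V):
    # Noise is the deviation from the ideal static level of the segment.
    noise = np.asarray(segment, dtype=float) - ideal
    return 10.0 * np.log10(avg_power(segment) / avg_power(noise))

def rms_features(diff, is_dominant):
    diff = np.asarray(diff, dtype=float)
    ideal = np.where(is_dominant, DOMINANT_V, RECESSIVE_V)
    return {
        "signal_rms": float(np.sqrt(avg_power(diff))),           # DC RMS
        "rms_noise_1": float(np.sqrt(avg_power(diff - ideal))),  # AC RMS
        "dc_dominant": float(diff[is_dominant].mean()),
        "dc_recessive": float(diff[~is_dominant].mean()),
    }
```

For a perfectly clean signal, `rms_noise_1` is zero and the DC values equal the ideal levels; degradation shows up as these features drifting from their reference distributions.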
Table 1 summarizes the physical layer features used in this paper for network health monitoring and assessment.

Performance assessment

Performance of the network is assessed through comparison of the most recent features with the reference features. Reference features can be obtained from historical values or from a network known to be operating well, or can be defined using expert knowledge and theoretical values. The distributions of the feature vectors are approximated using Gaussian mixture models (Duda et al. 2000) and the performance measure is calculated based on a similarity measure between the test and reference feature distributions.

Gaussian mixture model and CV calculation

The probability density function of a given feature x can be modelled by a finite mixture of Gaussian distributions with N components:

p(x) = Σ_{k=1}^{N} ω_k p_k(x; μ_k, σ_k), with Σ_{k=1}^{N} ω_k = 1,   (1)

where

p_k(x; μ_k, σ_k) = (1/√(2π σ_k²)) exp(−(x − μ_k)²/(2σ_k²)).   (2)
The model parameters (μ_k, σ_k), k = 1, …, N can be learned using the Expectation-Maximization (EM) algorithm, where the number of components N is predetermined using an unsupervised clustering method (k-means) and Silhouette values as a rule of thumb (Duda et al. 2000). The similarity between two probability densities f(x) = Σ_{k=1}^{N_f} ω_k p_k and g(x) = Σ_{k=1}^{N_g} ω̂_k q_k can be calculated using a similarity measure (SM) defined in Eq. 3.
Table 1 List of features used in NHM

Feature category | Features
SNR | SNR
Common mode signal features | Mean of the common mode signal segment; SD of the common mode signal segment
Overshoot features | Absolute overshoot from recessive to dominant state; Percentage overshoot from recessive to dominant state; Absolute overshoot from dominant to recessive state; Percentage overshoot from dominant to recessive state
Rise/fall time features | Mean rise time; SD of rise time; Mean fall time; SD of fall time
Bit width features | Mean bit width; SD of bit width
Signal RMS features | Signal RMS; Signal RMS noise 1; Signal RMS noise 2 of recessive state; Signal RMS noise 2 of dominant state; Mean DC value of dominant state; SD of DC value of dominant state; Mean DC value of recessive state; SD of DC value of recessive state
SM = ( Σ_{i=1}^{N_f} Σ_{j=1}^{N_g} ω_i · OP(p_i, q_j) · ω̂_j ) / ( ( Σ_{i=1}^{N_f} Σ_{j=1}^{N_f} ω_i · OP(p_i, p_j) · ω_j ) × ( Σ_{i=1}^{N_g} Σ_{j=1}^{N_g} ω̂_i · OP(q_i, q_j) · ω̂_j ) )^{1/2}   (3)

where N_f, N_g are the numbers of Gaussian components in the probability densities f and g, respectively, and OP(p_i, q_j) denotes the similarity measure between two Gaussian components p_i ~ N(μ_i, σ_i) and q_j ~ N(μ_j, σ_j). OP(p_i, q_j) can be calculated as

OP(p_i, q_j) = ( (σ_i² + σ_j²)/(2 σ_i σ_j) )^{−1/2} · exp( −(μ_i − μ_j)²/(2(σ_i² + σ_j²)) ).   (4)

Using the similarity measure defined in Eq. 3, we can quantitatively determine the similarity between the test distribution and the one representing normal system behavior. Note that this measure always lies in the range from 0 to 1, and equals 1 when the two distributions match perfectly. Following Djurdjanovic et al. (2003), this measure is referred to as the Confidence Value (CV). An alarm can be raised when the CV falls below a predetermined threshold, which means that network system performance is severely degraded. The threshold can be determined using either engineering knowledge or statistical process control methods applied to historical CV observations.

Network node performance assessment
The node performance is compared based on the performance similarity of each feature category. As described previously, features of each bit in the header data are extracted and stored sequentially. As the example in Fig. 6 illustrates, the features of the common mode voltage are stored in a matrix. The matrix items in column i denote the features extracted from the ith dominant or recessive state, while the elements in each row of the matrix denote the order of observations. All the performance assessment processes, such as feature distribution fitting and CV calculation, are executed column by column. After the CV structure is obtained by calculating all the features on every bit, the average of all the CV values is taken as the final performance measure for the node.

Overall network performance assessment

The overall performance at the network level is analyzed using aggregated node features. The analysis procedure is
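The CV machinery of Eqs. 3 and 4 is compact enough to sketch directly. The following is our own Python rendering, not the paper's code: mixtures are represented as lists of (weight, mean, sigma) triples, and the node-level CV is the average of the per-column similarity values.

```python
# Sketch of the similarity measure of Eqs. 3-4 (our own rendering).
# A mixture is a list of (weight, mean, sigma) triples.
import math

def OP(mu_i, s_i, mu_j, s_j):
    # Eq. 4: equals 1 when the two Gaussian components coincide.
    scale = math.sqrt(2 * s_i * s_j / (s_i ** 2 + s_j ** 2))
    return scale * math.exp(-(mu_i - mu_j) ** 2 / (2 * (s_i ** 2 + s_j ** 2)))

def cross(f, g):
    # Weighted sum of pairwise component overlaps.
    return sum(wi * OP(mi, si, mj, sj) * wj
               for wi, mi, si in f for wj, mj, sj in g)

def similarity(f, g):
    # Eq. 3: normalized so SM = 1 when f and g match perfectly.
    return cross(f, g) / math.sqrt(cross(f, f) * cross(g, g))

def node_cv(reference_cols, test_cols):
    # Node CV: average per-column similarity between reference and test.
    return sum(similarity(r, t)
               for r, t in zip(reference_cols, test_cols)) / len(reference_cols)

ref = [(0.6, 0.0, 1.0), (0.4, 3.0, 0.5)]
print(round(similarity(ref, ref), 6))  # -> 1.0 for identical mixtures
```

A degraded node shifts its feature distributions away from the reference, driving the per-column similarities, and hence the averaged CV, toward zero.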
Fig. 6 CV calculation for node performance assessment

Fig. 7 Function block diagram of an intermittent connection detector
described as follows: Firstly, all the individual features of all dominant bits of all nodes are aggregated to form a single feature vector. Secondly, CV values are calculated by comparing the overlap between the reference and test aggregated feature distributions. Thirdly, the final CV value is obtained as the average of the CV values of each feature.

Intermittent connection detection

The occurrence of intermittent connections in a network can interrupt normal communication and result in repeated and lost packets. In a network in which intermittent connections (IC) occur at a drop cable to a node, two network traffic patterns can be observed: (1) the problematic node frequently interrupts normal communication packets when it is supposed to only listen to the network traffic; (2) the packets from the node connected via an intermittent connection are often interrupted by a 6-bit-long error packet, which indicates that errors in the packets from this node are noticed simultaneously by the other nodes. Therefore, observation of these two patterns of error source and interrupted-packet source can be used to infer the location of the IC problem. However, it is difficult to obtain the sources of the error packets and the interrupted packets using data link layer information, since the interface hardware discards a packet being received upon errors. Therefore, analog waveforms must be used to recognize the packet source. If the analog waveform contains the address segment, the source address can be obtained by decoding the waveform. Otherwise, the addresses are obtained by applying pattern classification methods to the physical layer features of the packet waveforms. Figure 7 shows the block diagram of the intermittent connection detector. The functions of each module are described as follows:
– Data acquisition module. Physical and data link layer information is recorded in this module. An error packet detector is developed to track errors on the network so that physical and data link layer information can be synchronized on the error events.
– Feature extraction and error source detection module. Physical layer features are extracted from packets. Waveform decoding and classification are applied in this module to identify the sources of the interrupted and error packets. In this study, a linear classifier (Duda et al. 2000) based on all the features described in 'Proposed methodology' is applied to identify the sources of the packets.
– Node waveform feature database. Node waveform features can be obtained on-line or off-line by extracting features from successfully transmitted packets. The database provides classification information for the error source detection module.

In order to describe the causal relationship between the sources of the interrupted and error packets, two Error Matrices are introduced to represent the patterns previously described.

Definition 1 Each element E_1(i, j) of the Error Matrix E_1 ∈ R^{N×N} is defined as

E_1(i, j) = N(X¹_ij),   (5)

where N(·) denotes the cardinality of a set and X¹_ij denotes the set of events in which the source of the error packet (longer than 6 bits) was node i and the source of the interrupted packet was node j.

Definition 2 Each element E_2(i, j) of the Error Matrix E_2 ∈ R^{N×N} is defined as

E_2(i, j) = N(X²_ij),   (6)

where X²_ij denotes the set of events in which the source of the interrupted packet was node i and the source of the error packet (6 bits long) was node j.
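Definitions 1 and 2 amount to simple event counting. The sketch below is our own illustration; the event-tuple encoding is not the paper's data format, but it shows how each observed error event increments one of the two matrices depending on the error packet length.

```python
# Sketch of building the two Error Matrices of Definitions 1 and 2.
# Each event is (error_src, interrupted_src, error_len_bits), with node
# indices 0..n_nodes-1 (an illustrative encoding of ours).
import numpy as np

def error_matrices(events, n_nodes):
    E1 = np.zeros((n_nodes, n_nodes), dtype=int)
    E2 = np.zeros((n_nodes, n_nodes), dtype=int)
    for err_src, intr_src, err_len in events:
        if err_len > 6:
            # Def. 1: error packet longer than 6 bits from node i
            # interrupted a packet from node j.
            E1[err_src, intr_src] += 1
        else:
            # Def. 2: a packet from node i was interrupted by a
            # 6-bit error packet from node j.
            E2[intr_src, err_src] += 1
    return E1, E2
```

With the matrices in hand, each row sum gives the out-degree of the corresponding vertex in the graphs described next.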
The causal relationship between the interrupted and error packets can be visualized using two graphs whose adjacency matrices are the error matrices. The vertexes in the graphs are mapped one-to-one to the addresses of the network nodes. The out-degree of a vertex in the graph based on matrix E_1 represents the number of error packets sent by that node. The out-degree of a vertex in the graph based on matrix E_2 represents the number of packets from that node that are interrupted simultaneously by other nodes. Vertexes with large out-degrees in both graphs can be inferred to be the problematic nodes, since both network traffic patterns are established. More specifically, one can first rank the out-degrees of the vertexes in both graphs. Second, in each graph, one compares each vertex with the vertexes whose out-degree values are smaller. If the out-degree of a vertex is statistically larger than those of the vertexes with smaller out-degrees, this vertex is listed as a candidate. Finally, if the candidate vertexes from both graphs represent the same network node, then we may conclude that this node has intermittent connection problems.
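The localization rule can then be sketched as follows. For brevity this sketch of ours uses a plain argmax over out-degrees in place of the paper's statistical comparison, so it should be read as an assumption-laden simplification, not the published procedure.

```python
# Sketch of IC localization from the two error matrices (our own
# simplification: argmax of out-degrees instead of a statistical test).
import numpy as np

def ic_candidate(E1, E2):
    out1 = E1.sum(axis=1)          # errors initiated per node (graph G1)
    out2 = E2.sum(axis=1)          # own packets interrupted per node (graph G2)
    c1, c2 = int(np.argmax(out1)), int(np.argmax(out2))
    # Both traffic patterns must point at the same node to declare an IC.
    return c1 if c1 == c2 else None
```

If the dominant vertexes of the two graphs disagree, no intermittent connection is declared, which mirrors the requirement that both traffic patterns be established.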
Experiments

In order to show the effectiveness of the system in detecting degradation/anomalies of network performance, laboratory experiments and plant network tests were conducted.

Laboratory experiments

Terminating resistor mismatch, cable degradation and intermittent connection problems were investigated in the lab environment.

Experimental setup

The setup of the experiment is shown in Fig. 8. The node N represents the NHM system, which analyzes the signal passively. A pair of resistors is inserted into the drop cable connecting to node 1 to emulate cable impedance degradation, while the cables to the other nodes are untouched. By changing the resistance of the resistors, the cable impedance is changed so that the effects on the physical layer features and the effectiveness of the degradation detection can be studied. For the terminating resistor mismatch scenarios, resistors can be removed from or added to the ends of the network drop cables. For the cable degradation scenario, resistors are symmetrically added to the CAN_H and CAN_L cables to emulate the impedance effect of mounting stress or cable wear-out. For the intermittent connection problem, a random on-off switch is placed on a drop cable to emulate intermittent connection scenarios. The switch is controlled by a PC that generates a sequence of
Fig. 8 Experimental setup: terminating resistors can be added or removed to change the impedance of the cable
on-off events, whose inter-event times are sampled from a predefined distribution.

Terminating resistor mismatch

The terminating resistors are used to reduce the effects of reflected signal waves and noise. When the network is not properly terminated, the whole network acts as an antenna with properties similar to an open-wire system. The resonant frequencies of such a system depend on the network geometry. In this case, the network is sensitive to undesired incoming disturbance signals or reflected waves, since the common mode signal is not properly terminated. Normally, two terminating resistors are used to reduce interference from external factors. One additional terminating resistor and then two additional terminating resistors were placed into the network to observe the impact on the analog features of the nodes. Figure 9 shows the change in the CV of node 1 between the normal network setup and the abnormal settings of terminating resistors. The procedure for calculating the CV value is as follows: for each terminating resistor mismatch condition, features are first extracted from the headers of the data packets sent by node 1. Second, Gaussian mixture models of the features under the different mismatch conditions are fitted. Using Eq. 3, the similarity measures for each bit location are calculated by comparison with the features of the normal terminating condition. The final CV value of node 1 is calculated using the procedure described in 'Network node performance assessment'. In practice, features from the last several bit locations in the header are used to avoid the effects of bit arbitration. The CV values of all other nodes show a similar trend; that is, the CV values of all nodes are reduced when there is an abnormal setting of terminating resistors on the network.

Cable degradation

Besides terminating resistor induced network impedance mismatch, the cable impedance can also be changed by cable wear, strain or aging.
A two-node test was conducted to validate the ability of the NHM system to detect network cable impedance degradation over time. As can be seen in Fig. 10, using the same procedure described previously, the physical layer features of the signal
Fig. 9 CV values for node 1 using different network impedance
Fig. 11 Schematic layout of intermittent connection testbed. In the testbed the computer controlled switch is placed on the drop cable of a node

Fig. 12 Segment of analog waveform with errors caused by intermittent connection (voltage versus sample number, sampled at 100 MS/s)

Table 2 Comparison of digital log with IC detector analysis result on the segment from Fig. 12
Fig. 10 Confidence values of nodes 1 and 2 with different impedance of the drop cable connected to node 1
from node 1 are significantly affected by the impedance of the cable connected to it, while the features from node 2 are not. The experimental results indicate that the NHM system is able to locate the source of an anomaly by observing the CV of each node on the network. In this test, the difference between the CV values of node 1 and node 2 suggests that the impedance of the cable connected to node 1 has changed, a fact confirmed by the test-bed setup.

Intermittent connection detection

Figure 11 shows the test setup for the intermittent connection detection test, in which a computer controlled switch is placed on the drop cable to emulate random intermittent
Digital log:   Error,    Error
IC detector:   Packet_3, Error_10, Packet_3, Packet_10, Error_3, Packet_3
events. In the test, we placed the switch on the drop cable to node 3, as indicated in Fig. 11. Figure 12 shows one segment of the analog waveforms recorded from the testbed; the falling edges of the error trigger signal mark the positions of the error packets. Table 2 shows, side by side, the interpretation of the data segment in Fig. 12 by the IC detector and the log recorded by a commercial digital interface card, which, per the DeviceNet protocol, logs errors without information on which packet was interrupted or which node initiated the error frame. Packet_3 denotes a packet sent by node 3, and Error_10 denotes an error frame first signaled by node 10. It can be seen that the IC detector fully recovers the sequence of events, while the digital log can only indicate the existence of communication errors. During one test, 43 errors were collected. Figure 13 shows the relationship graphs G1 and G2 of the interrupted packets and error packets, which represent the matrices E1 and E2, respectively.
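The graph analysis that follows reduces to an out-degree computation over the relationship graphs. A minimal sketch is given below; the individual edge weights are illustrative assumptions, chosen only so that the per-node totals match Table 3 (node 3 has out-degree 26 in G1 and 17 in G2), and the function names are not from the paper.

```python
from collections import Counter

def out_degrees(edges):
    """Total out-degree per node over weighted directed edges (src, dst, weight)."""
    deg = Counter()
    for src, _dst, weight in edges:
        deg[src] += weight
    return deg

def likely_ic_source(*graphs):
    """Sum out-degrees across relationship graphs (e.g. G1 from E1, G2 from E2)
    and return (node, total out-degree) for the node with the highest total,
    i.e. the most likely source of the intermittent connection."""
    total = Counter()
    for graph in graphs:
        total.update(out_degrees(graph))
    return total.most_common(1)[0]

# Illustrative edges; only the per-node totals (26 and 17) are taken from Table 3.
g1 = [("node3", "PLC", 11), ("node3", "node10", 15)]
g2 = [("node3", "PLC", 17)]
```

With these inputs, `likely_ic_source(g1, g2)` returns node 3 with a total out-degree of 43, mirroring the conclusion drawn from Table 3.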
Fig. 13 Relationship graphs of the interrupted packets and error packets

Table 3 Network nodes out-degree

Node      Graph using E1   Graph using E2   Total
Node 3    26               17               43
Node 10   0                0                0
PLC       0                0                0

Table 4 Node and overall CV values of network 1

Node address   CV       Node address   CV
11             88.01    42             88.67
12             89.82    43             92.12
13             92.06    44             93.98
14             90.55    45             91.56
25             87.90    46             91.11
32             93.82    47             88.36
33             92.71    48             90.55
35             88.40    50             88.91
36             91.47    51             94.77
40             89.21    59             94.03
41             92.13    Overall        99.4
Table 3 shows the causal relationships of the errors. As can be seen, the total out-degree of node 3 is 43, the highest value among the three nodes. The established error patterns therefore indicate that node 3 is the most likely cause of the intermittent connection problems.

Plant network testing

The NHM system was tested on several plant networks to verify its robustness and effectiveness in detecting anomalies in real industrial environments. In these tests, features extracted from packet waveforms while the production line was idling were used as reference features, and those extracted while the production line was running were used as test features. Two different networks were selected to compare their performance. Network 1 has 21 slave nodes in total; network 2 has 5 slave nodes. The PLC nodes in both networks were not evaluated. For each network, the performance differences between the times when the production line was running and idling were observed. As can be seen in Tables 4 and 5, the nodes in network 1 show lower CV values than those in network 2, as does the overall network CV computed from aggregated features. This indicates that the similarity between the test and reference feature distributions of the nodes in network 1 is lower than in network 2. The higher CV values of network 2 may be attributed to factors such as better cable shielding, power quality and grounding, which protect the network from EMI and other interference. We should also note that network 1 was experiencing faults at the time this testing was conducted.
Table 5 Node and overall CV values of network 2

Node address   CV       Node address   CV
33             97.29    48             97.15
14             95.66    21             98.42
34             97.93    Overall        99.98
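The running-versus-idling comparison in Tables 4 and 5 amounts to scoring each network from its node CVs. The sketch below uses a simple mean score and a 90.0 alarm threshold; the threshold and the mean aggregation are illustrative assumptions (the paper's aggregated overall CV is computed from aggregated features, not from node means), while the example data are the node CVs from Table 5 and a subset of Table 4.

```python
def flag_degraded_nodes(node_cvs, threshold=90.0):
    """Node addresses whose CV falls below the alarm threshold (90.0 is
    an illustrative choice; the paper does not specify a threshold)."""
    return sorted(addr for addr, cv in node_cvs.items() if cv < threshold)

def network_score(node_cvs):
    """Summarize a network by the mean of its node CV values (a simplification
    of the paper's aggregated overall CV)."""
    return sum(node_cvs.values()) / len(node_cvs)

# Node CVs of network 2 (Table 5) and a subset of network 1 (Table 4).
network2 = {33: 97.29, 48: 97.15, 14: 95.66, 21: 98.42, 34: 97.93}
network1_subset = {11: 88.01, 25: 87.90, 32: 93.82}
```

Under this scoring, no node of network 2 falls below the threshold while several nodes of network 1 do, consistent with network 1 experiencing faults during the test.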
Conclusion

In this paper, we introduced a novel systematic approach to assessing DeviceNet network performance and detecting anomalies. We generalized four major network failure modes and proposed an NHM system that can detect network degradation and anomalies. We implemented a prototype system in a laboratory environment and conducted experiments emulating media degradation, network impedance mismatch and intermittent connection problems on the network. Moreover, testing was conducted in a plant environment, with results indicating that the NHM system can successfully detect problems in the network by comparing measured and reference features. Finally, the physical layer features suggested in this paper are used to detect
the sources of error packets, which is fundamental for detecting the source of an intermittent connection. The experimental results show that the detector can successfully identify intermittent connection problems on the drop cable. The methods developed in this work can be applied to other bus-structured network systems with minimal modifications at the protocol hardware level, for example network systems based on CAN protocols or Ethernet using bus structures. Future work includes identifying intermittent connections on the trunk cable and isolating EMI-induced errors from intermittent connection errors.

Acknowledgments This research was supported in part by the NSF Industry/University Cooperative Research Center for Intelligent Maintenance Systems (IMS) at the University of Michigan and by General Motors Corp.