12th IFIP/IEEE International Symposium on Integrated Network Management, 2011
978-1-4244-9221-3/11/$26.00 ©2011 IEEE

Incorporating Virtualization Awareness in Service Monitoring Systems

Marcio Barbosa de Carvalho, Lisandro Zambenedetti Granville
Institute of Informatics
Federal University of Rio Grande do Sul
Porto Alegre, Brazil
{mbcarvalho, granville}@inf.ufrgs.br

Abstract- Traditional service monitoring systems (e.g., Nagios and Cacti) have been conceived to monitor services hosted in physical computers. With the recent popularization of server virtualization platforms (e.g., Xen and VMware), monitored services can migrate from one physical computer to another, invalidating the original monitoring logic. In this paper, we investigate which strategies should be used to modify traditional service monitoring systems so that they can still provide accurate status information even for monitored services that are constantly moving on top of a set of servers with virtualization support.

I. INTRODUCTION

Modern server virtualization solutions like Xen [1] and VMware [2], despite enabling network administrators to create more flexible and robust information technology (IT) infrastructures, break the service monitoring processes of popular monitoring systems such as Nagios [3] and Cacti [4]. Because such systems have a list of fixed servers to contact while monitoring the services of a networked environment, any virtual machine (VM) migration (e.g., to save processing power) or replacement (e.g., when a healthy server takes over a failing one) immediately invalidates the monitoring logic.

In this paper, we investigate the question of which modifications and adjustments are required in current service monitoring systems when server virtualization takes place in the managed IT infrastructures. We use an experimental approach where awareness of virtualization is incorporated into Nagios in order to monitor a set of services in a real production network environment. In such a real environment, service monitoring is critical to identify resource failures, especially in moments when the network environment where critical services must be monitored is highly demanded. If the monitoring system is unaware of server virtualization, as of today, every time a service starts running in a different physical machine, the monitoring system will be unbound to the service being observed.

The remainder of this paper is organized as follows. In Section 2 we review related work on service monitoring, especially considering the support for server virtualization. In Section 3 we present the design of our solution to incorporate virtualization awareness into Nagios. In Section 4 the evaluation architecture is introduced, and the set of experiments carried out in order to evaluate our proposed solution is described. In Section 5 we present and discuss the results from our experiments. Finally, in Section 6, we close this paper with conclusions and future work.

II. RELATED WORK

Current monitoring solutions for server virtualization are usually devoted to a specific aspect of virtualization. Security is probably one of the main aspects, for example, when a compromised virtual machine (VM) affects the performance of other VMs hosted in a shared physical computer. Back in 2003, Garfinkel and Rosenblum [5] presented a technique called VM introspection for preventing intrusions through the employment of a VM monitor running inside the physical host. In complement, Payne et al. [6] proposed a set of requirements to guide the development of VM monitors, in addition to proving the concept by introducing the XenAccess monitoring library. Fraser et al. [7] advanced the previous work by proposing a more sophisticated monitoring agent that, in order to dispense with human interventions, was able to automatically fix compromised VMs.

Like security, the performance of VMs is also an aspect that requires monitoring solutions. Shao et al. [8] presented, for example, the PMonitor, which is a lightweight monitor for observing the performance of VMs, in that case, instantiated in the Xen virtualization platform.

It is important to observe that the current solutions, especially those based on VM introspection, assume the monitoring agent to be placed inside physical machines. There are, however, other research works that physically separate the monitoring agent and the monitored system. Machida et al. [9], for example, presented a VM monitoring solution where an external monitoring agent (in fact, it was referred to as a monitoring server) was able to adapt its monitoring traffic according to the changes in the virtualized resources.

Independent from virtualization management, on another front, service monitoring in a networked environment is usually realized through very simple monitors that contact remote machines to retrieve the operational status of the services of interest running at those machines. Because they are based on such a simple architecture, open source monitoring systems like Nagios [3] and Cacti [4] have gained popularity among IT administrators. In addition, growing communities of independent developers and enthusiasts provide plugins that extend these systems so that very sophisticated monitoring scenarios can be settled.

Resource utilization monitoring becomes more complex in virtualized environments. In these environments, the administrator must monitor both the real and the virtual resources. Tools like Nagios monitor resource utilization by consulting agents installed on servers. In virtual machines, these agents only provide information about the virtual resources. VM monitoring tools, like XenCenter, provide both pieces of information but are unable to monitor the services that the virtual machine runs.

From a historical point of view, service monitoring systems have been conceived before the current popularization of server virtualization platforms. Because of that, traditional service monitoring systems are incapable of tracking monitored services when they migrate from one physical host to another. On the other side, the systems specifically designed for VM monitoring check the status of the VMs themselves, but they ignore the services running on those VMs. In addition, such VM monitoring systems are not currently as widely used as traditional service monitoring systems. In this scenario, two paths exist for integrating service monitoring and server virtualization support: (i) modifying VM monitoring systems to become aware of the running services, or (ii) modifying traditional service monitoring systems to become aware of the underlying virtualization infrastructure that is hosting the critical services. Since we believe that the popularity of service monitoring systems among IT administrators will not decrease in the near future, we understand that the option of adapting the current service monitoring systems is more feasible and realistic.

III. PROPOSED SOLUTION AND IMPLEMENTATION

In this section we first review key aspects of Nagios and Citrix XenServer for our proposal, to then present strategies to incorporate virtualization awareness into Nagios.

A. Nagios overview

Nagios is an open source monitoring tool widely used by IT administrators. Nagios' configuration is defined through a hierarchically organized set of configuration objects that map the actual structure of the managed network and the services of interest running on top of it. From top to bottom, Nagios defines, in its configuration hierarchy, hostgroup objects that contain host objects, which, in turn, contain service objects. Hostgroups use host templates to minimize the registration effort required when new hosts need to be included into the system. Host templates describe common monitoring configuration values for the hosts within the same hostgroup. That includes, for example, the time interval between status checks, the hosts' icon to be used in Nagios visual maps, and the deadline for issuing notifications of internal changes of a host. Service templates are also provided to ease the registration of network services to be monitored that share common monitoring values, such as the method used to check service status, the time interval between checks, the IT administrator responsible for the group, and the window period in which the service must be observed.

Nagios uses two types of checks: active and passive. Active checks are initiated at the Nagios server, and triggered and controlled by the main Nagios process. Passive checks, in turn, are initiated and controlled by external processes that notify Nagios when the status of monitored services changes.

In order to perform an active check, Nagios issues the execution of a command line script defined in a service object of the Nagios configuration hierarchy. The results of that execution are collected by the Nagios server from the standard output and stored in internal data structures. These command lines are in fact plugins of the system that can be coded by third-party developers to enrich the Nagios plugin library. In order to remotely execute Nagios plugins, the system complementarily offers to the IT administrator the Nagios Remote Plugin Executor (NRPE), which consists of a plugin installed in the Nagios server and a remote daemon running at the monitored hosts.

Passive checks are executed by external processes (running either locally at the Nagios server or remotely on the monitored hosts) that inform the main Nagios process about the changes in the status of monitored services. That happens when monitoring processes request Nagios to execute a so called external command (since it comes from an external process). That is done by allowing external processes to write into the Nagios external command file, which is periodically read by the Nagios main process to check whether there are new external commands to be executed. External processes running in the Nagios server are able to directly write into the external command file. External processes running on remote hosts, however, write into the external command file indirectly, through the intermediation of the Nagios Service Check Acceptor (NSCA), which is a daemon running in the Nagios server that receives remotely issued command calls.

Events in Nagios are handled in the system's server by Nagios Event Broker (NEB) modules. When a NEB module is developed to deal with a particular event, that module must subscribe its interest into a Nagios pipe. When the event of interest happens, Nagios checks which NEB modules have been subscribed and, through callback functions, passes the control of the server to the registered modules. Since Nagios employs an infinite single loop, the called NEB modules must quickly process and return the control to the Nagios core in order not to affect the performance of the whole system.

B. Citrix XenServer

Citrix XenServer is based on the open source Xen hypervisor. The free version of the solution offers virtual machine (VM) live migration, while load balancing and high availability support is only available on paid versions of the system. To support VM live migration, XenServer uses a centralized storage that is shared among all physical hosts of a cluster of servers. All physical hosts of the cluster can run any VM saved in the shared storage. Locally stored VMs can also be used, but in that case with no support for VM live migration, load balancing, and high availability. Citrix XenServer uses a master-slave strategy to orchestrate its cluster. Information about the whole cluster and all associations between VMs and physical hosts (i.e., which VMs are being hosted by each physical host) can be entirely accessed through and from any physical host.

In our investigation, we consider a Citrix XenServer cluster where the monitored services can move from one physical host to the other when the VMs that are running the monitored services migrate (e.g., to save power energy or optimize processing). Since in our case study critical services are monitored by Nagios and such services can move over a XenServer cluster, Nagios needs to be adapted to be aware of the underlying virtualization support that is taking place.

C. Active checks strategy and architecture

In order to incorporate virtualization awareness using Nagios' active checks, the first strategy consists in obtaining the list of VMs that a physical host belonging to a Citrix XenServer cluster is hosting. In this strategy, a new Nagios plugin, called check_xen_virtual_machines, is remotely executed in each physical host through the intermediation of NRPE. At the Nagios server side, a new service object is registered in the Nagios configuration hierarchy, corresponding to each physical host that hosts VMs that need to be monitored. Such a service, called virtual_machines, lists the VMs currently running on each physical host. The checking command line associated with this service uses the NRPE local plugin to contact the NRPE daemon running at the remote physical hosts of interest. The architecture to support the active checks strategy is presented in Fig. 1.

Figure 1. Active check architecture

D. Passive checks strategy and architecture

The passive checks strategy consists of physical hosts notifying the Nagios server about the VMs that are currently running. In this strategy, a different implementation of the check_xen_virtual_machines plugin is used. Physical hosts have their system cron configured to periodically execute check_xen_virtual_machines, which in its turn detects when the set of running VMs changes and notifies Nagios using the send_nsca command that contacts the NSCA daemon at the Nagios server. The passive checks strategy is illustrated in Fig. 2.

Figure 2. Passive check architecture

It's important to highlight that notifications to the Nagios server are sent only when the list of VMs of a physical host is changed, thus decreasing the network traffic generated to check the list of running VMs, when compared to the active checks strategy.

Since the check_xen_virtual_machines plugin is executed in time intervals that depend on the system cron configuration, an internal change on a physical host may be perceived much later at the Nagios side if the checking interval is too large. A variation of the passive checks strategy, called aggressive passive checks, can take advantage of the fact that each server in the Citrix XenServer cluster is aware of the status of the whole cluster. The aggressive passive checks strategy consists in modifying the check_xen_virtual_machines plugin to notify the Nagios server about the entire status of the cluster at once. In this case, the first physical host of the cluster that detects a change notifies that change to the Nagios server. All other physical hosts would eventually detect that change too, and then inform Nagios again. This strategy increases the network usage because the same change is notified several times, but it decreases the delay to detect a change. The impacts of the aggressive passive checks over the communication network, as well as the impacts of the conventional passive and active checks, are analyzed in Section IV.

E. VM and physical host association and visualization

The fact that virtualization awareness is incorporated into a service monitoring system (in our case, into Nagios) does not mean that the user of the service monitoring system will also be aware, through the system, of the underlying virtualization support; the final user may or may not be aware of it. We believe, however, that IT administrators should be conscious not only about the status of the services they need to monitor, but also about the actual location of those services running over a set of physical servers.

In our proposed solution, not only is Nagios aware of the virtualization support present in the hosting servers, but it also exposes such awareness via its graphical user interface (GUI) to the IT administrator. In essence, the IT administrator is able to visually observe the status of critical services as well as where they are currently running. In order to achieve that with Nagios, our approach consists in having, for every VM being monitored, a Nagios logical service called physical_load that presents the identification and operation status of the associated physical host. Figure 3 shows a system snapshot where, for example, a VM is running on top of the rsxen02 physical server that, in its turn, is operating adequately.

check_physical, the new plugin that makes this possible, first reads from a local Nagios folder a set of files that inform, for each VM, the identification of the physical host that is currently hosting it. Such files are updated by a NEB module called detect_host_change that listens for updates in the VM/physical host association detected by either the active or passive strategies. Afterwards, check_physical contacts, using active checks, the remote physical host to update its status information stored at the Nagios server, which is finally presented to the IT administrator, as exemplified by Fig. 3.
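The core logic of the two strategies above can be sketched as follows. This is a minimal illustrative sketch, not the authors' published implementation: the function names and the state handling are our assumptions. The only interface taken from the paper is the plugin output format "OK - vm01,vm02,vm03" and the rule that a passive notification is sent only when the list of running VMs changes.

```python
# Sketch of a check_xen_virtual_machines-style plugin (hypothetical
# helper names). In the active strategy, NRPE executes the plugin and
# Nagios consumes its standard output. In the passive strategy, cron
# executes it and the host invokes send_nsca only when the set of
# running VMs differs from the one seen in the previous execution.

NAGIOS_OK, NAGIOS_CRITICAL = 0, 2  # standard Nagios plugin exit codes

def format_plugin_output(vms):
    """Build the 'OK - vm01,vm02' line that the NEB module expects."""
    if not vms:
        return "CRITICAL - no virtual machines found", NAGIOS_CRITICAL
    return "OK - " + ",".join(sorted(vms)), NAGIOS_OK

def should_notify(current_vms, previous_vms):
    """Passive strategy: notify Nagios only when the VM set changed."""
    return set(current_vms) != set(previous_vms)
```

On a real XenServer host, the current VM list could be obtained, for example, from the output of the xe vm-list command, and the previous list from a small state file; send_nsca would then forward the formatted line to the NSCA daemon.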
Figure 3. Real and virtual resource information visualization

Figure 4. Physical machines and VMs hierarchy visualization
Another important view is the one where the primary focus is on the physical host, to then present the internal VMs. Figure 4 shows a second snapshot where our modified setup of Nagios presents a hierarchy of physical machines and VMs of a monitored Citrix XenServer cluster.

The detect_host_change NEB module updates the Nagios data structures when needed, but the Nagios Web interface does not read the monitoring information directly from those data structures. The Web interface, in fact, reads a file called object_cache_file to retrieve the information to be shown. In order to ensure visual consistency, detect_host_change must additionally patch such a file so that the Nagios interface can show the current associations of the Citrix XenServer cluster. Parsing the object_cache_file file, whose size can be larger than 500 Kbytes, can be very demanding for the Nagios main process. Consequently, Nagios may become temporarily unavailable. To avoid that, another thread had to be created by the NEB module to use internal data structures to decide when the file needs to be patched. A patch in the object_cache_file file is carried out only if the associations have changed. This is illustrated in Fig. 5.

Figure 5. Updating Nagios view of VMs and physical hosts

IV. EVALUATION

This section presents a set of evaluation experiments whose goal is to analyze some aspects of our proposed solution, showing its impact and cost. The aspects of concern are scalability, average response time, and network traffic.

A. Scalability

The goal of this evaluation experiment is to check the behavior of the Nagios plugins and modules in large systems. We have emulated large systems due to restrictions on acquiring a large number of real physical hosts, but at the same time we used an actual installation of Nagios to treat the migration of VMs over four emulated physical infrastructure setups. Our Nagios installation has been settled to support the following four different scenarios:

- 125 VMs running on top of 25 physical hosts
- 250 VMs running on top of 50 physical hosts
- 375 VMs running on top of 75 physical hosts
- 500 VMs running on top of 100 physical hosts

In all four scenarios, each physical host hosts five VMs. We have coded a script that informs Nagios about all information on associations between VMs and physical hosts of the emulated scenarios. Such a script uses passive checks to send to Nagios the list of VMs that each physical host is running.

Nagios initially shows all monitored hosts side by side, since no association information is present. Afterwards, the script that sends to Nagios the lists of all VMs of each physical host is executed. The experiment ends when the Nagios interface shows all VMs associated with their respective physical hosts. This is simple to visually check because, when finished, no VM will be displayed without a physical host associated to it. The response time that is measured is the difference between the moment when our script is started (emulating the notification of a set of migrating VMs) and the moment when the Nagios interface shows all VMs associated with their physical hosts.

This first experiment has been repeated 30 times and the results depicted in Fig. 6 show a confidence interval of 95%. As can be seen, even when the size of a scenario is increased four times, the average response time increased slightly more than 2 times. This shows that Nagios scales with the increased size of the monitored environment. However, in a scenario with 750 VMs and 150 physical hosts, the Nagios Web interface showed an erratic behavior, sometimes even crashing. This reveals that the Nagios GUI is unstable when monitoring large systems.

Figure 6. Average response time for large emulated scenarios

B. Estimated response time

A theoretical response time estimative can be calculated for the active and passive checks strategies. These theoretical values become important to verify that the implementation does not present serious construction problems. If the measured values are compatible with the estimated times, we can infer the quality of the solution.

We have conducted two comparisons: first, we compared estimated and measured times for each strategy. We are interested in the maximum time and the mean time that the solution takes to show association information between VMs and physical hosts after a change. Second, we compared the strategies on the measured mean time.

The maximum response time that the Nagios interface takes to show the correct association information using active checks is:

I + Is + fl + P + Mfl + PI    (1)

In (1), I is the check interval, Is is the time to send a check to a monitored physical host, fl is the delay to process a check at the physical host, P is the time to receive the output of a check, Mfl is the time of NEB module processing, and finally PI is the time to refresh the Nagios Web interface. In an oversized estimation, we assume that Is, fl, P, and Mfl consume 1 second each. We assume PI takes 90 seconds and I assumes 1 and 10 minutes, creating two different situations. In total, with I consuming 1 minute the total maximum time is 154 seconds; for an I consuming 10 minutes the total time is 694 seconds.

For the passive checks strategies, we can make the same analysis using the equation below:

I + fl + P + E + Mfl + PI    (2)

In (2), I is the check interval, fl is the time of processing a check, P is the time to receive the output of a check, E is the time for Nagios to read the external command file, Mfl is the time of NEB module processing, and PI is the time to refresh the Nagios Web interface. Again, we assume that fl, P, and Mfl consume 1 second each; PI takes 90 seconds; and E takes 15 seconds. Assuming I takes 1 and 10 minutes, we again have two scenarios: 168 seconds with I taking 1 minute, and 708 seconds with I taking 10 minutes.

Another interesting estimative is the expected average time. To calculate this estimative, we assume the time of intervals, like the check interval, the interface refresh, and the reading of the external command file, as half of the original maximum times presented above. This is the expected value for the intervals: they can range from 1 second up to the maximum time allowed, thus the expected time is half the maximum allowed. Equations (3) and (4) show the estimative for the active and passive checks strategies, respectively.

I/2 + Is + fl + P + Mfl + PI/2    (3)

I/2 + fl + P + E/2 + Mfl + PI/2    (4)

The aggressive passive checks strategy has a slight modification that minimizes the mean time, because the first physical host that detects an association change informs the Nagios process faster. We consider the mean time as the check interval over 2, as explained above, divided by the number of physical hosts in the cluster. Equation (5) shows this estimative.

I/(2*N) + fl + P + E/2 + Mfl + PI/2    (5)

In (5), N is the number of physical hosts in the Citrix XenServer cluster.

For the active checks strategy, the mean time estimative with I of 1 minute is 79 seconds and with I of 10 minutes is 349 seconds. For the passive checks strategy, the mean time estimative with I of 1 minute is 85.5 seconds and with I of 10 minutes is 355.5 seconds. For the aggressive passive checks strategy, the mean time estimative with N of 5 and I of 1 minute is 61.5 seconds, and with I of 10 minutes it is 115.5 seconds.

C. Response time in an actual cluster setup

After the theoretical analysis that indicates the expected results of the experiments, we present the technical details of our second experimental evaluation environment. In this case, instead of emulating larger infrastructures, we employ a real Citrix XenServer cluster composed of five physical hosts that use a Dual QuadCore Intel Xeon E5430 CPU with an L2 cache of 12 MB and 16 GB of main memory. The cluster's storage has 1 TB to store virtual machines. The Nagios server uses an Intel Core2 Duo E8400 CPU with an L2 cache of 6 MB and 2 GB of main memory, and all network connections are based on gigabit Ethernet. Figure 7 shows the network topology of our second experimental environment.

Figure 7. Second experimental infrastructure

The experiment consists of migrating a VM and measuring the delay that the Nagios interface takes to show such a change. This experiment has been carried out considering different times for the check interval: one experimental scenario used a check interval of 1 minute, which is the smallest time accepted by Nagios and the system cron, while another experimental scenario uses 10 minutes, which is the default check interval of Nagios. This experiment has been performed in a real environment, like the one described in Fig. 7, and observed the three check strategies, i.e., active checks, passive checks, and aggressive passive checks.

Table I shows the estimated and measured times for 30 executions of each scenario with each check interval for the active checks strategy.

TABLE I. MEASURED TIMES FOR ACTIVE CHECKS

  Active Checks            | Interval Check
                           | 1 minute  | 10 minutes
  Maximum time estimated   | 154 s     | 694 s
  Mean time estimated      | 79 s      | 349 s
  Maximum time measured    | 130 s     | 659 s
  Mean time measured       | 66.67 s   | 349.97 s

Table II shows the estimated and measured times for 30 executions of each scenario with each check interval for the passive checks strategy.

TABLE II. MEASURED TIMES FOR PASSIVE CHECKS

  Passive Checks           | Interval Check
                           | 1 minute  | 10 minutes
  Maximum time estimated   | 168 s     | 708 s
  Mean time estimated      | 85.5 s    | 355.5 s
  Maximum time measured    | 149 s     | 574 s
  Mean time measured       | 77.5 s    | 377.54 s

Table III shows the estimated and measured times for 30 executions of each scenario with each check interval for the aggressive passive checks strategy.

TABLE III. MEASURED TIMES FOR AGGRESSIVE PASSIVE CHECKS

  Aggressive Passive Checks | Interval Check
                            | 1 minute  | 10 minutes
  Maximum time estimated    | 168 s     | 708 s
  Mean time estimated       | 61.5 s    | 115.5 s
  Maximum time measured     | 147 s     | 557 s
  Mean time measured        | 78.37 s   | 401.27 s

A statistical analysis must be made to ensure that the mean times measured in the experiment have a correlation with the mean time expected for the solution. Figure 8 shows the confidence intervals for the scenarios with I of 1 minute and 10 minutes for all strategies presented. The confidence interval is calculated considering a confidence level of 95%.

Figure 8. Response time considering the real cluster deployment
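The estimates above can be reproduced numerically. The snippet below is a sketch in our own notation that plugs the paper's assumed component times (Is = fl = P = Mfl = 1 s, E = 15 s, PI = 90 s, and N = 5) into equations (1) through (5); the function names are ours.

```python
# Response-time estimates from equations (1)-(5), using the paper's
# assumed component times (all values in seconds).
Is, fl, P, Mfl = 1, 1, 1, 1   # send, host processing, output, NEB module
E, PI = 15, 90                # external command file read, GUI refresh

def active_max(I):    return I + Is + fl + P + Mfl + PI          # eq. (1)
def passive_max(I):   return I + fl + P + E + Mfl + PI           # eq. (2)
def active_mean(I):   return I/2 + Is + fl + P + Mfl + PI/2      # eq. (3)
def passive_mean(I):  return I/2 + fl + P + E/2 + Mfl + PI/2     # eq. (4)
def aggressive_mean(I, N=5):
    return I/(2*N) + fl + P + E/2 + Mfl + PI/2                   # eq. (5)

# With I = 60 s these yield 154, 168, 79, 85.5, and 61.5 seconds,
# matching the estimated values in Tables I-III.
```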
For both check intervals of 1 and 10 minutes, a faster response time is experienced when the active checks strategy is employed. The passive checks strategy depends on an additional delay that Nagios imposes to read the external command file. In the scenario with I of 1 minute, this difference is clearly observed because it is in the order of 15 seconds, which is the interval at which Nagios reads the external command file. If the administrator wants to achieve the best response time, we recommend the use of the active checks strategy.

The aggressive passive checks strategy does not reach its objective. This strategy tries to take advantage of non-determinism in the time of execution of the notifying script. The script is called by the system cron. Since the local clock of each physical host in the cluster is synchronized with all the other hosts, the script ends up being called almost at the same time everywhere. This makes the time of execution of the script deterministic, ending up providing no advantage over the conventional passive checks strategy. This can be observed by comparing the response time of passive checks and aggressive passive checks in Fig. 8. This situation could be improved by forcing a random sleep time when the aggressive passive checks are employed. However, this extra delay must be in the same order as the check interval, to distribute the checks along the whole interval.

D. Network traffic

This comparison consists of capturing all the traffic generated by each strategy. In order to compare our strategies with different check interval values, we observed the traffic generated in one hour of system operation. Both passive checks strategies can minimize the network usage if they detect that no change occurred in the VMs and physical hosts association information since the last execution of the checking script. Otherwise, every 10 minutes this script forces a check to Nagios. This is needed in the case of a Nagios reboot. Without this enforced check, the physical host would only send its internal information if a VM migrates. With this information we can compute the maximum and minimum transmissions of each strategy for each physical host. In the aggressive passive checks strategy, we consider five physical hosts in the cluster. In this strategy, each physical host sends the association information of all physical hosts in the cluster, which increases the number of transmissions. Table IV shows the traffic results.

TABLE IV. MESSAGES IN ONE HOUR

  Strategy                                 | Number of transmissions
                                           | Maximum | Minimum
  Active Check, I = 1 minute               | 60      | 60
  Active Check, I = 10 minutes             | 6       | 6
  Passive Check, I = 1 minute              | 60      | 6
  Passive Check, I = 10 minutes            | 6       | 6
  Aggressive Passive Check, I = 1 minute   | 300     | 30
  Aggressive Passive Check, I = 10 minutes | 30      | 30

In our experiment, we captured the TCP transmissions of each strategy and computed the average size of these transmissions for the active checks and passive checks strategies, which are 1960 bytes and 1594 bytes, respectively. Table V shows the traffic accumulated in one hour.

TABLE V. NETWORK TRAFFIC IN ONE HOUR

  Strategy                                 | Network traffic
                                           | Maximum       | Minimum
  Active Check, I = 1 minute               | 117,600 bytes | 117,600 bytes
  Active Check, I = 10 minutes             | 11,760 bytes  | 11,760 bytes
  Passive Check, I = 1 minute              | 95,640 bytes  | 9,564 bytes
  Passive Check, I = 10 minutes            | 9,564 bytes   | 9,564 bytes
  Aggressive Passive Check, I = 1 minute   | 478,200 bytes | 47,820 bytes
  Aggressive Passive Check, I = 10 minutes | 47,820 bytes  | 47,820 bytes

The passive checks strategy is more efficient in terms of network usage. This strategy takes advantage of the smaller size of its individual transmissions and of the knowledge that no migration occurred since the last execution. The active checks strategy performs better than the aggressive passive checks strategy, except in the scenario with a 1 minute check interval and in environments that rarely have a change in the VMs and physical hosts association information. If the administrator has network traffic restrictions (e.g., the administrative network is overloaded), we recommend the use of passive checks with I = 1 minute. This strategy has a slightly greater response time, but uses less traffic.

V. CONCLUSIONS AND FUTURE WORK

With the modifications made in Nagios to incorporate virtualization awareness, an IT administrator can put together the service status information of each virtual machine (VM) with the resource metrics collected from the physical machine that hosts it. This is dynamically executed in order to allow Nagios to detect migrations of VMs, thus requiring no human intervention into the conventional Nagios configurations. The Web interface of Nagios graphically shows the relationships between VMs and physical hosts in the system's map. In this map, the physical hosts are presented as parents of the VMs of the monitored environment.

Although our work has employed the Citrix XenServer virtualized environment to prove the concept, one can easily adapt our proposed solution to other virtualization environments. The only element of the solution that is platform specific is the check_xen_virtual_machines plugin. The NEB module, which performs the most complex task in the solution, expects a list of virtual machines for each physical host in the form of "OK - vm01,vm02,vm03". Anyone can develop a plugin that informs Nagios of such a list for another virtualized environment.
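Since the NEB module's only platform-specific contract is that "OK - vm01,vm02,vm03" line, porting the solution to another hypervisor only requires producing that line. A minimal sketch of the server-side parsing (our own illustrative code, not the published module) could look like this:

```python
def parse_vm_list(plugin_output):
    """Extract VM names from a plugin line such as 'OK - vm01,vm02,vm03'.
    Returns an empty list for non-OK or malformed output, so a failing
    physical host never injects bogus associations."""
    status, sep, payload = plugin_output.partition(" - ")
    if status != "OK" or not sep:
        return []
    return [name for name in payload.split(",") if name]
```

Any plugin (for KVM, VMware, etc.) whose output parses this way would plug into the same NEB module unchanged.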
Additional improvements for monitoring Citrix XenServer physical hosts include the development of further Nagios plugins that collect resource status information supplied by the XenServer hypervisor. The plugins of the Nagios community work only with the open source version of Xen and must then be adapted to work with Citrix XenServer. Alternatively, the IT administrator can use the check_snmp plugin to collect metrics, for example, using SNMP.

REFERENCES

[1] Citrix Systems, Citrix XenServer. [http://www.citrix.com/]
[2] VMware. [http://www.vmware.com/]
[3] W. Barth, Nagios: System and Network Monitoring, 2nd ed. San Francisco: No Starch Press, 2008. [http://www.nagios.org/]
[4] Cacti. [http://www.cacti.net/]
[5] T. Garfinkel, M. Rosenblum, "A Virtual Machine Introspection Based Architecture for Intrusion Detection," Network and Distributed System Security Symposium, pp. 191-206, 2003.
[6] B. D. Payne, M. D. P. de Carbone, W. Lee, "Secure and Flexible Monitoring of Virtual Machines," 23rd Annual Computer Security Applications Conference (ACSAC), pp. 385-397, 2007.
[7] T. Fraser, M. R. Evenson, W. A. Arbaugh, "VICI Virtual Machine Introspection for Cognitive Immunity," 24th Annual Computer Security Applications Conference (ACSAC), pp. 87-96, 2008.
[8] Z. Shao, H. Jin, X. Lu, "PMonitor: a Lightweight Performance Monitor for Virtual Machines," 1st International Workshop on Education Technology and Computer Science, pp. 689-693, 2009.
[9] F. Machida, M. Kawato, and Y. Maeno, "Adaptive Monitoring for Virtual Machine Based Reconfigurable Enterprise Systems," 3rd International Conference on Autonomic and Autonomous Systems (ICAS), pp. 8-8, 2007.