Performance Management of Java-based SIP Application Servers

12th IFIP/IEEE IM 2011: Mini Conference

Mauro Femminella, Emanuele Maccherani, Gianluca Reali
Department of Electronic and Information Engineering, University of Perugia, Perugia, Italy
{mauro.femminella, emanuele.maccherani, gianluca.reali}@diei.unipg.it

Abstract- Within the activities of the Java APIs for Integrated Networks (JAIN), the Java community offers a set of standard frameworks and open APIs to create advanced telecommunications services. However, recent works pointed out that Java-based implementations of the Session Initiation Protocol (SIP) stack, the fundamental signalling protocol in the convergent telephone-IP world, perform poorly if they are executed by large multi-core servers, typically used in data centres. The problem lies in the combination of the SIP protocol semantics with the Java language features, which does not allow fully exploiting the computing capabilities of multi-core architectures. To face this problem, we propose a solution to improve the throughput and signalling latency of applications implemented by using Java-based SIP application servers. It consists of the joint usage of virtualization and parallelization techniques. We have performed an extensive measurement campaign by using an open source application server compliant with the JSLEE (JAIN Service Logic Execution Environment) specifications. The rationale of this choice is that JSLEE application servers are currently regarded as very promising candidates for deploying telecom services. Results show that it is possible to improve performance in terms of throughput and signalling latency by running more instances of the JSLEE server in parallel, each of them in a separate virtual machine deployed on the same server. This approach can increase throughput values by about 64% and, in the maximum throughput condition, the call setup latency can be nearly halved.

Keywords- Java; SIP; JSLEE; virtualization; parallelization

I. INTRODUCTION

Advanced telecom services are intrinsically asynchronous and should fulfill low latency, high throughput, and high availability requirements. They may rely on different network protocols and interoperate with different software and hardware platforms. Java is a natural candidate technology for implementing telecom services. Thanks to its intrinsic features, such as platform and operating system (OS) independence, networking support, dynamic adaptability, and availability of standardized technologies and APIs, Java simplifies the service implementation. The Java Enterprise Edition (EE) is the industry standard for implementing enterprise-class services. In addition, two further Java application frameworks, namely the Java APIs for Integrated Networks (JAIN) Service Logic Execution Environment (SLEE) and SIP Servlets, have emerged for creating, deploying, and managing advanced telecom services. The development process can benefit from the usage of event-oriented architectures, modularity and reusability, new levels of abstraction, and well defined logical components. In addition, the introduction of standard open interfaces, rather than proprietary ones, brings the great advantage of enabling services to support multiple networks and devices. It also allows involving a higher number of developers to speed up the service development [1].

However, a recent paper [2] shows that Java implementations of the Session Initiation Protocol (SIP [10]) stack, which is the de facto standard signalling protocol for voice over IP (VoIP) and multimedia services, have severe scaling problems in large multi-core architectures. Since multi-core machines currently represent the most widespread solution for implementing servers in data centres, this performance issue needs to be solved urgently. The problems come from the interactions between the SIP protocol semantics and the Java language features, which cause unbalanced pipeline stages and lock contention. The same performance problem has already been illustrated in [3], where we have observed that the average CPU utilization never exceeds 50% when a server equipped with 8 CPU cores is used. In addition, in [2] the authors show that no effective solutions exist to solve this issue through best practice configurations. In fact, although some improvements can be obtained with extensive profiling and optimization, this procedure nullifies one of the most important features of the Java language, i.e. easy and rapid service development. On the other hand, applications obviously must exploit the hardware capabilities as much as possible, and it is clear that if more than half of the computing capabilities are unused on average, there are spare CPU resources to manage a larger workload.

The contribution of this paper is the proposal and analysis of an original usage of virtualization and parallelization techniques in order to better exploit the computing capabilities of servers hosting Java implementations of SIP application servers (ASs). Clearly, our goal is not to achieve 100% of the CPU utilization, but to increase the overall server throughput in terms of successfully supported SIP calls, while maintaining the setup latency below an acceptable threshold. Our results are somewhat counterintuitive. Through an extensive experimental campaign we have shown that a very convenient solution consists of introducing a hypervisor and using virtual machines (VMs) to host the SIP ASs. In more detail, our proposal is to run multiple, identical VMs on the same physical server, each of them running a single instance of the Java-based AS. Despite the increased overhead due to virtualization, the peculiarity of the Java over SIP operation allows increasing the overall SIP call throughput with such a virtualized setting.


We use an open source JSLEE AS, Mobicents [4], running a SIP-based VoIP service that performs several database queries during the call lifetime. Our choice of a rather complex test service is motivated by our conviction that the frequent approach of showing improvements with a very simple service, such as the answering service used in [2], is not enough to assess improvements in real service scenarios. Instead, using a JSLEE AS with a realistic VoIP service is reasonable, since JSLEE ASs are primary candidates for the deployment of application services in the new convergent telecom paradigms [6]. Thus, our tests represent a realistic benchmark to verify whether the proposed solution can be effective in operation.

The paper is organized as follows. In Section II, we present an overview of the JSLEE specifications, focusing on the critical performance aspects. Section III describes the proposed deployment configurations and the VoIP service implemented to test them. Section IV presents the numerical results of the experimental campaign. Section V illustrates related works and, finally, Section VI draws our conclusions.

II. BACKGROUND

A. JSLEE specifications and available platforms

The JSLEE activity aims to specify a Java-based, event-oriented container for the execution of carrier-grade telecom services [7]. The service logic is implemented in software components called Service Building Blocks (SBBs). A JSLEE AS creates a pool of SBB objects and manages them according to a well defined lifecycle. SBBs operate asynchronously by receiving, processing, and triggering events. They can be attached to data streams called Activity Contexts, by which they receive events from other entities. Also, SBBs may be linked together by parent-child relationships to implement the service logic in a modular fashion.

Events are internally managed by a functional element called the Event Router, which delivers each event to the appropriate SBB. External network events, such as SIP messages, are translated into internal Java events by the so-called Resource Adaptors (RAs). More generally, the set of implemented RAs constitutes an abstract interface layer that allows a JSLEE server to access external resources.
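To make the SBB programming model more concrete, the following minimal Java sketch shows the general shape of an SBB as defined by the JAIN SLEE specifications. It is only an illustration and is not taken from the test service described later: the class name, the event handler name, and the use of the JAIN SIP RA event class are assumptions, and the deployment descriptor that binds handlers to event types is omitted.

import javax.slee.ActivityContextInterface;
import javax.slee.CreateException;
import javax.slee.RolledBackContext;
import javax.slee.Sbb;
import javax.slee.SbbContext;

// Minimal SBB sketch: event handlers follow the "on<EventName>" convention
// and are bound to event types in the SBB deployment descriptor (not shown).
public abstract class ExampleSbb implements Sbb {

    private SbbContext sbbContext;

    // Assumed handler for an incoming INVITE delivered by the JAIN SIP RA.
    public void onInvite(javax.sip.RequestEvent event, ActivityContextInterface aci) {
        // React to the event: send SIP responses, start timers,
        // attach to Activity Contexts, or delegate to child SBBs.
    }

    // Lifecycle callbacks required by the javax.slee.Sbb interface.
    public void setSbbContext(SbbContext context) { this.sbbContext = context; }
    public void unsetSbbContext() { this.sbbContext = null; }
    public void sbbCreate() throws CreateException { }
    public void sbbPostCreate() throws CreateException { }
    public void sbbActivate() { }
    public void sbbPassivate() { }
    public void sbbLoad() { }
    public void sbbStore() { }
    public void sbbRemove() { }
    public void sbbExceptionThrown(Exception exception, Object event, ActivityContextInterface aci) { }
    public void sbbRolledBack(RolledBackContext context) { }
}

The SLEE container instantiates and pools such objects and invokes the handlers asynchronously, which is what enables the parent-child composition of the service logic described above.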

The Mobicents Communication Platform is an open source project, currently owned by Red Hat [4]. It includes a JSLEE, a Media Server, a Presence Server, and a SIP Servlet Server. Other commercial JSLEE implementations are OpenCloud Rhino [13] and the Amdocs jNetX Convergent Service Platform [14]. In our experiments we have used the v.1.2.6 GA of the Mobicents JSLEE (MSLEE). It comes with a SIP RA which is already compliant with the new JSLEE v.1.1 specifications [7]. The MSLEE includes several J2EE components, such as Container Managed Persistence (CMP) fields, which enable data persistence for SBB objects, Java Database Connectivity (JDBC) drivers, Java Management Extensions (JMX) for environment management and monitoring, and the Java Naming and Directory Interface (JNDI), which offers lookup functions for service registration. The MSLEE is installed within the JBoss AS [5], which is a hosting environment offering special capabilities, such as service and JSLEE configuration management, deployment, and thread pooling.

B. Critical issues in system configuration

The Mobicents structure is characterized by an evident complexity, as it integrates the Java Virtual Machine (JVM), the JBoss AS, and the MSLEE. In order to achieve an efficient platform setup, some critical aspects must be considered. Below we briefly illustrate those we have faced in this work.

1) Java memory management
Being based on the Java technology, the MSLEE relies on the automatic Java Garbage Collector (GC) mechanism for memory cleaning [9]. The drawback is that the developer does not have full control over the GC behavior, which may even pause and delay the application being executed [1]. Such pauses may be critical for real-time telecom applications. In fact, the AS may freeze due to post-pause avalanche restarts, since during a pause it can accumulate many unprocessed messages [8]. The JVM v.6 includes different GCs, in order to meet the requirements of different applications [1]. We have selected the Parallel GC (see also [3]), which is the default and most efficient GC (i.e., it uses the lowest CPU time), even if it produces longer pauses in program execution. In particular, the test results described in [3] show that the Parallel GC exhibits better performance when used with the UDP transport protocol.

Another issue to be considered when Java is used for real-time services is the amount of memory allocated to the Java heap. As suggested in [8], this is another tricky point, since allocating a large memory to Java may reduce the frequency of GC phases, but each collection may last much longer due to the larger amount of heap to be cleaned.
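Since GC pauses cannot be controlled directly, a practical way to quantify their impact during a load test is to periodically sample the cumulative collection counts and times exposed by the standard Java management beans. The following sketch is not part of the Mobicents code base; it simply illustrates, under these assumptions, how such measurements could be gathered alongside the experiments.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcMonitor {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            long collections = 0;
            long timeMs = 0;
            // One MXBean per collector, e.g. the young and old generation
            // collectors of the Parallel GC used in this work.
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                collections += gc.getCollectionCount();
                timeMs += gc.getCollectionTime();
            }
            System.out.printf("GC runs: %d, cumulative GC time: %d ms%n", collections, timeMs);
            Thread.sleep(10000); // sample every 10 seconds during a test run
        }
    }
}

A sudden jump of the cumulative GC time between two samples is a good indicator of the long pauses, and of the consequent message backlog, discussed above.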

2) Operating system type: 32 vs. 64 bit
Typically, when the hardware resources are suited to hosting a 64 bit operating system (OS), it is preferable over a 32 bit one. In fact, 64 bit OSs allow overcoming the well known limitation of 32 bit OSs that each application may use up to 3 GB of RAM. Furthermore, a 64 bit OS may also increase the CPU efficiency. Nevertheless, 64 bit systems may still not be mature enough for all carrier-grade applications, whilst 32 bit systems are generally more stable. We have evaluated many Linux distributions, in both 32 and 64 bit versions. The 32 vs. 64 bit choice may impact performance, but it does not affect the service implementation, since using either a 32 or a 64 bit OS basically means using a different kernel and a different JVM, whilst the Java code of the JSLEE and of the application services is unchanged.

3) Database transactions
In critical data transmission environments, transactions are typically used to preserve a computing framework (e.g. a database) in a known, consistent state after system failures. As regards transactions, the MSLEE relies on the JBoss AS, which natively supports the Java Transaction API (JTA), thus allowing the usage of any transaction manager implementation. JBoss is by default configured to use the so-called "JTA compatible in-VM" transaction manager. In some services, such as banking transactions, a provider may need a transactional behavior to ensure that all operations made on a remote database remain consistent during all system operation, as well as after a system failure. Such a guarantee comes with a large performance overhead, in the form of disk writing operations, which are directly related to the number of service transactions processed. Hence, it is of great importance for a provider to carefully evaluate whether to use transactions or not, in order to ensure maximum performance. In fact, for some services, such as VoIP ones, transactions may not be needed, whereas performance requirements are typically very stringent. This is the case of any call forwarding, redirect, and control services. For these reasons, in our experiments JTA has been disabled.
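For reference, the snippet below shows the kind of JTA-managed boundary that such a transactional behavior implies. It is a generic Java EE illustration using the standard JNDI name of the container transaction manager; it is not code from our service, which runs with JTA disabled.

import javax.naming.InitialContext;
import javax.transaction.UserTransaction;

public class TransactionalUpdate {
    public void updateProfile() throws Exception {
        // Standard Java EE JNDI name for the container-managed transaction.
        UserTransaction tx = (UserTransaction) new InitialContext().lookup("java:comp/UserTransaction");
        tx.begin();
        try {
            // ... JDBC/Hibernate operations on the remote subscriber database ...
            tx.commit();   // durable commit: the disk writes behind the performance overhead
        } catch (Exception e) {
            tx.rollback(); // keep the remote database in a consistent state
            throw e;
        }
    }
}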


III. PROPOSED SOLUTION AND TEST SERVICE DESCRIPTION

A. Application server deployment configurations

In this section, we present the three deployment solutions that we have evaluated in our experiments.

• Native configuration (Fig. 1.a): this is the classic AS deployment configuration, in which a single instance of the server is installed within a single JVM, and the OS runs directly on top of the server hardware.

• Parallel configuration (Fig. 1.b): this configuration is often used in Java-based AS deployments to improve scalability with respect to memory. In fact, it is well known that, even if 64-bit JVMs allow exploiting the availability of several GBs of RAM, an excessive Java heap allocation may cause performance degradation in real time systems due to large GC times, and usually ASs badly scale with the Java heap [20]. This solution consists of running multiple JVMs over a common OS installed directly onto the server hardware, i.e. multiple separated computing environments within the same OS. Each JVM hosts an instance of JBoss, which includes an instance of the MSLEE. Clearly, in order to correctly forward SIP calls to the appropriate MSLEE, it is necessary to introduce a SIP proxy acting as a call dispatcher. We have considered two versions of the parallel configuration. The first one, labeled "JVM", is the classical parallel deployment of n JVMs, each free to access all CPU cores without restrictions. In the second configuration, labeled "JVM-taskset", we have bound each JVM to a specific subset of CPU cores, thus allowing each core to be used by a single JVM, by means of the system command "taskset" (a launcher sketch of this binding is given after this list). The rationale of this choice is that Java-based SIP ASs do not scale well not only with the Java heap, but also with the number of CPU cores.

• Virtualized configuration (Fig. 1.c): it consists of a hypervisor that virtualizes the hardware resources and hosts some VMs. Within each VM, we replicate the native configuration (OS, JVM, JBoss, MSLEE AS), completely insulated and running separately in each of them. In this case, we do not bind the virtual CPUs of each VM to real CPU cores, leaving the task of scheduling the VM access to computing resources to the hypervisor. This guarantees more degrees of freedom with respect to the previous solution. This configuration is labeled "VM".
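As a concrete illustration of the "JVM-taskset" idea, the sketch below starts n MSLEE instances, each pinned to its own pair of CPU cores through the Linux "taskset" command. The number of instances, the core assignments, and the installation paths are hypothetical placeholders; in our experiments the binding was simply applied when launching the servers, and this sketch only shows one possible way to automate it.

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PinnedLauncher {
    public static void main(String[] args) throws IOException {
        int instances = 4;          // hypothetical number of parallel MSLEE instances
        int coresPerInstance = 2;   // hypothetical number of cores bound to each JVM

        List<Process> servers = new ArrayList<Process>();
        for (int i = 0; i < instances; i++) {
            int firstCore = i * coresPerInstance;
            String coreList = firstCore + "-" + (firstCore + coresPerInstance - 1);
            // "taskset -c <cores> <command>" restricts the launched JVM to the given cores.
            ProcessBuilder pb = new ProcessBuilder(
                    "taskset", "-c", coreList,
                    "/opt/mslee-" + i + "/bin/run.sh");   // hypothetical install path
            pb.directory(new File("/opt/mslee-" + i));
            pb.redirectErrorStream(true);
            servers.add(pb.start());
        }
        System.out.println("Started " + servers.size() + " core-pinned MSLEE instances.");
    }
}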

Looking at Fig. 1, some considerations can be made. Without any particular indication, the most efficient configuration should be the first one, since it has the minimum overhead. In fact, the parallel configuration suffers from the presence of multiple JVM, JBoss, and MSLEE instances, which clearly imply additional CPU and memory consumption. This aspect is exacerbated in the third case, the virtualized one, which also includes the overhead of the hypervisor and of multiple OS instances. However, since in the case of Java-based SIP stacks the native configuration does not exploit all the available computing resources, we expect that the best performance is achieved by the second solution, and in particular by the "JVM-taskset" one. It has less overhead than the virtualized one. In addition, by binding each JVM to a specific set of CPU cores, it emulates a physical server with few cores, thus mitigating the mentioned scalability problem. We expect that the improvement of the "JVM" configuration is inferior, since allowing all JVMs to access all CPU cores may cause resource contention and locking events.

However, by considering both pros and cons, we are convinced that the really preferable configuration is the third one. A virtualized environment, in which the hypervisor completely "emulates" a separated computing environment, including even the OS, can provide a degree of flexibility in services deployment and usage [19] that by far compensates the expected small performance degradation [27]. In fact, a virtualized environment can allow server consolidation procedures for a more efficient hardware and energy usage, service migration through VM mobility, and so on. To the best of the authors' knowledge, there are no other proposals in the literature of using virtualization technologies (in particular bare metal hypervisors) to improve the performance of a single AS through the deployment of multiple AS instances in different VMs on the same physical server. Instead, a common practice is server consolidation through virtualization, which allows avoiding running different and underutilized ASs in separate physical servers. Clearly, the benefit of using the parallel or the virtualized solutions has to be large enough to balance at least the cost of an additional entity, the SIP proxy. Nevertheless, this proxy can be easily implemented even on unpretentious hardware using a SER-based implementation [17]. It is also worth noting that a SIP proxy acting as a load balancer is needed with native configurations in large architectures as well. We show the performance of all the considered architectures in Section IV, highlighting the improvements of "JVM", "JVM-taskset", and "VM" over the native configuration.

B. Test service implementation

The Mobicents community has published a set of performance achievements [8], obtained through an automatic answering service in which the MSLEE acts like a simple User Agent Server (UAS). It is one of the examples included in the MSLEE package. It utilizes only one SBB, which responds to the incoming call, completes the SIP three-way handshake, and, after a short timer expiration, sends a BYE request to the caller User Agent Client (UAC) that has initiated the call.


Figure 1. Application server deployment configurations.

We have implemented a more complex SIP-based VoIP service to suitably characterize the MSLEE performance in a realistic telecom scenario. It could model an online charging system for pre-paid VoIP calls, which periodically updates the user credit in the subscriber profile by accessing a remote database (see also [16]). However, given the widespread usage of the SIP protocol beyond IP telephony, it could also be the skeleton of non-VoIP services. The MSLEE manages the entire signalling by implementing a call control service through a SIP Back-to-Back User Agent (B2BUA) architecture [10], which easily allows the introduction of a third party call control mechanism [15]. The MSLEE acts both as UAS and UAC, by splitting each call into two SIP dialogs over two distinct call legs (caller-MSLEE and MSLEE-callee, see Fig. 2). This service implements two timers. One of them (Call Duration Timer - CDT) is used to control the maximum call duration, while the other one (Periodic Database Query Timer - PDQT) triggers periodic updates to the subscriber profile database.

As regards the internal MSLEE service operation, when the initial INVITE is received by the MSLEE, its event routing subsystem is invoked and creates a Selector root SBB. This component queries the database to retrieve the subscriber profile, which includes the values of the CDT and PDQT timers. Then, upon receiving the answer, it activates a child SBB called CallControl and leaves the signalling control to it. In turn, this child SBB creates the second call leg towards the callee and establishes the media session between the two end points, starting both the CDT and the PDQT timers and querying the database upon each PDQT timeout. The call ends when the CDT expires and the CallControl SBB sends a BYE message on both call legs. The message exchange is illustrated in Fig. 2. Alternatively, each of the two end points can send a BYE to the other to close the call before CDT expiration. The service performs an additional, final database query on call termination, in order to update the user profile in the database. In the implementation used for this experimental campaign, the call is always closed by the MSLEE upon CDT expiration.

Figure 2. Signalling flow for the test service.
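The following plain-Java sketch paraphrases the per-call control flow described above. It uses a ScheduledExecutorService in place of the SLEE timer facility and stub methods in place of the SIP and database operations; the class and method names are illustrative and do not come from the MSLEE service code.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Paraphrase of the CallControl logic: the CDT bounds the call duration,
// while the PDQT drives the periodic subscriber-profile updates.
public class CallControlSketch {

    private final ScheduledExecutorService timers = Executors.newScheduledThreadPool(1);
    private ScheduledFuture<?> periodicQuery;

    public void onCallEstablished(final String callId, long cdtSeconds, long pdqtSeconds) {
        // PDQT: update the subscriber profile every pdqtSeconds.
        periodicQuery = timers.scheduleAtFixedRate(
                new Runnable() { public void run() { updateSubscriberProfile(callId); } },
                pdqtSeconds, pdqtSeconds, TimeUnit.SECONDS);
        // CDT: tear the call down when the maximum call duration is reached.
        timers.schedule(
                new Runnable() { public void run() { closeCall(callId); } },
                cdtSeconds, TimeUnit.SECONDS);
    }

    private void closeCall(String callId) {
        periodicQuery.cancel(false);
        sendByeOnBothLegs(callId);        // B2BUA behaviour: BYE towards caller and callee
        updateSubscriberProfile(callId);  // final database query on call termination
    }

    // Stubs standing in for the JDBC/Hibernate query and the SIP signalling.
    private void updateSubscriberProfile(String callId) { /* query the remote database */ }
    private void sendByeOnBothLegs(String callId) { /* send BYE on both call legs */ }
}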



IV. NUMERICAL RESULTS

In what follows, we first present the test bed used in the measurement campaign, then the results achieved by executing the MSLEE in all the configurations illustrated in the previous section. Finally, we compare and discuss the results obtained with the three deployment schemes.

A. Test bed description

The Mobicents JSLEE v.1.2.6 GA, deployed on the JBoss AS v.4.2.3 GA, has been installed on a Fujitsu-Siemens PRIMERGY TX300 S4 server with dual Intel Xeon E5410 @ 2.33 GHz (8 CPU cores) and 16 GB of RAM. The 64 bit OS is Novell SUSE Linux Enterprise Server x64 v.10.1. For executing the 32 bit experiments, we have used OpenSuse Linux 11.1. In the virtualized setting, we used the ESXi 4.1 hypervisor [12]. The JVM is v.1.6.x. In the MSLEE configuration, we have used the JAIN SIP RA v.1.2, tuning the number of threads to allow a greater number of events to be processed simultaneously. All logs have been disabled during the test execution to improve performance. In addition, we have used the following tools:

• The SIPp traffic generator [11], installed on two PCs with Ubuntu Linux v.9.10, acts as the UAC and UAS endpoints.

• The MySQL database v.5.0.51a and the MySQL Connector/J v.5.1.6 JDBC driver for database access, deployed on a PC with Arch Linux x64 (kernel 2.6.27). Each database interrogation relies on an object-relational mapping implemented with the Hibernate technology, provided by the JBoss AS.

• The call dispatcher is an OpenSER [17] proxy running on a PC with Arch Linux x64 (kernel 2.6.27). In the virtualized/parallel settings, it forks the SIP traffic among the different MSLEEs according to a weighted round robin policy (a sketch of this policy is given after this list). The weight is proportional to the number of CPU cores allocated to each JVM/VM.
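The dispatching policy itself is simple: with weighted round robin, an instance with weight w receives w requests per cycle. The sketch below shows a Java equivalent of the policy configured in the OpenSER proxy; the actual dispatching in our test bed is performed by OpenSER, and the instance URIs and weights used here are purely illustrative.

import java.util.ArrayList;
import java.util.List;

// Weighted round robin: weights proportional to the CPU cores of each JVM/VM.
public class WeightedRoundRobin {

    private final List<String> schedule = new ArrayList<String>();
    private int next = 0;

    public void addInstance(String sipUri, int weight) {
        for (int i = 0; i < weight; i++) {
            schedule.add(sipUri);   // an instance with weight w gets w slots per cycle
        }
    }

    public synchronized String selectInstance() {
        String target = schedule.get(next);
        next = (next + 1) % schedule.size();
        return target;
    }

    public static void main(String[] args) {
        WeightedRoundRobin wrr = new WeightedRoundRobin();
        wrr.addInstance("sip:mslee1.example.net", 2);  // e.g. an instance with 2 CPU cores
        wrr.addInstance("sip:mslee2.example.net", 1);  // e.g. an instance with 1 CPU core
        for (int i = 0; i < 6; i++) {
            System.out.println(wrr.selectInstance());
        }
    }
}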

than the number of CPU cores, which in our case are 8. In

As for the traffic statistics, the SIPp UAC has been set to

addition, since 32 bit OS are much less greedy of memory, they

generate new SIP calls with a constant rate, referred to as A,

are better candidates for parallel or virtualized deployments.

lasting 60 minutes for each A value. The value for the CDT timer (i.e. maximum call duration) has been set equal to 3 minutes. The value for the PQDT timer has been set equal to 10 plus two additional queries at the call setup (INVITE received)

60



50

2 .£

40

§

30



20

.!: OJ ::J

and upon CDT timer expiration (call tear down). We have considered the following performance metrics: •

� �

seconds. This means that 18 queries are issued for each call,

E 'x

Maximum call throughput, defmed as the rate of successfully established SIP calls in calls per seconds

2

(cps), with at least 95% of successfully handled calls. •

Figure 3.

Session Request Delay (SRD [18], see Fig. 2). SRD is measured at UAC and defmed as the time interval from response. It represents the latency experienced by the

overload status. In this condition (not shown to improve fIgure neatness), the latency needed to establish few calls exhibits a

measuring timer at the reception of the 200 OK.
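For clarity on how the reported statistics are obtained, the sketch below reduces a set of per-call SRD samples to the average and the 95th percentile used in the following figures. The measurements themselves come from the traffic generator at the UAC; the code is only an illustrative post-processing example with assumed variable names.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SrdStatistics {
    // srdMillis holds one sample per call: t(200 OK received) - t(initial INVITE sent).
    public static void report(List<Long> srdMillis) {
        List<Long> sorted = new ArrayList<Long>(srdMillis);
        Collections.sort(sorted);
        double sum = 0;
        for (long sample : sorted) {
            sum += sample;
        }
        double average = sum / sorted.size();
        // 95th percentile: the value below which 95% of the samples fall.
        int index = (int) Math.ceil(0.95 * sorted.size()) - 1;
        long p95 = sorted.get(Math.max(index, 0));
        System.out.printf("average SRD = %.1f ms, 95th percentile = %d ms%n", average, p95);
    }
}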

B. Performance of the native configuration

The initial set of experiments has been done to find the best configuration for the MSLEE when it runs in an OS natively installed in the physical server. We recall that this configuration uses UDP as the transport protocol and the Parallel GC, and that JTA is disabled. Fig. 3 shows the maximum call throughput as a function of the amount of RAM allocated to the JVM, for the 64 bit and 32 bit OSs. In the 64 bit case, the Java heap ranges from 2 GB to 15 GB, whereas for the 32 bit OS the maximum value is 2.5 GB. First let us consider the maximum throughput achieved by using the 64 bit OS. When the amount of memory allocated to the JVM increases from 2 to 7 GB, the relevant throughput increases. The major increase is observed when the allocated memory ranges from 4 to 6 GB, and it is almost negligible from 6 to 7 GB. Increasing the memory allocation beyond 7 GB does not produce any throughput increase, but rather a slight decrease, especially for values larger than 8 GB. This is in line with what the Mobicents team has stated in [8]: any increase of the memory allocation causes the GC to be executed less frequently, but increases the garbage collection time. In fact, polishing a larger memory increases the relevant service pauses, which may cause an avalanche restart and a consequent server overload. A different consideration is needed for the 32 bit OS. It is interesting to note that, even if the amount of memory allocated to the JVM is only 2 or 2.5 GB (the maximum allowed with a 32 bit JVM), its performance is comparable with that of a 64 bit system with a double amount of memory allocated to the JVM. This result is not surprising, since 32 bit systems are much more stable and can benefit from a long history of optimization. However, when increasing the Java heap, the improvement with a 64 bit OS is not negligible (about 10%), although it requires about a triple amount of memory with respect to a 32 bit system using the same CPU. The conclusion is that, for a 64 bit OS, the optimal amount of memory allocated to the Java heap is a number of GBs equal to or slightly lower than the number of CPU cores, which in our case is 8. In addition, since 32 bit OSs are much less greedy of memory, they are better candidates for parallel or virtualized deployments.

Figure 3. Maximum throughput vs. amount of memory allocated to JVM.

Fig. 4 shows the average call setup signaling latency (i.e. the SRD) as a function of the offered load. It can be seen that the SRD increases almost linearly with the offered load up to the maximum throughput. Beyond this value, the server enters the overload status. In this condition (not shown to improve figure neatness), the latency needed to establish the few successful calls exhibits a steep increase, since the Java heap needs to be continuously polished and the MSLEE has to process the queued messages, thus causing timeouts and retransmissions, which further increase the server load and slow down the overall call processing.

Figure 4. SRD vs. offered load for different sizes of JVM.

The second comment is that, up to 50 cps, one of the best performing configurations is the 32 bit OS with 2.5 GB allocated to the Java heap. For a workload of 55 cps, which is the maximum call throughput for this configuration, the SRD exhibits a sharp increase, which doubles the value achieved at 50 cps. Finally, a further significant comment is that, in the 64 bit configurations, the average SRD improves with the amount of memory allocated to the Java heap. This is quite evident for the 4, 5, and 6 GB configurations.

Performance of the 64 bit OS with 7 and 8 GB is almost equivalent, and the 7 GB configuration slightly outperforms the 8 GB one only at 60 cps (at this value the 8 GB setup reaches its maximum throughput). Any further increase of the Java heap does not cause significant performance changes, although this performance is slightly better than in all the other 64 bit configurations. Also this behavior can be explained by resorting to the GC operation: the larger the amount of memory allocated to the JVM, the fewer the pauses due to the GC collections, but the longer their duration. Thus, the backlogged SIP messages may produce an avalanche restart causing a server collapse. This is the reason why the server does not achieve the best performance in terms of call throughput with 15 GB of memory allocated to the JVM.

Finally, we have verified that, even for a complex AS like Mobicents, the proposition of paper [2] is still valid. To this end, we have set an increasing number of CPU cores to be used by the JVM in the native configuration through the system command "taskset". The results achieved are shown in Fig. 5, where the maximum call throughput is plotted versus the number of used CPU cores, for both the 32 and 64 bit OSs. We first analyze the 32 bit OS behavior. As expected, the maximum call throughput scales with the CPU cores only up to 2 cores, then it slightly increases for 3 and 4 cores, and remains stable up to 8 cores. As for the 64 bit version, it exhibits a trend which is nearly linear up to 3 cores, then the slope decreases. However, passing from 1 to 2 cores the relevant throughput does not double, and the same happens with 3 cores. In conclusion, the 32 bit OS results to be the best candidate for parallelization/virtualization purposes, since, even with much less memory allocated to the Java heap, it outperforms the 64 bit versions for the same number of CPU cores (up to 4). By using the MSLEE on a 64 bit OS, performance still improves with a number of CPU cores beyond 4, but very slowly. Thus it exhibits poor scalability properties along with a high memory requirement.

Figure 5. Call throughput vs. number of CPU cores used by the MSLEE.

C. Performance of Parallel/Virtualized configurations

In this sub-section, we analyze the results relevant to the "JVM", "JVM-taskset", and "VM" configurations.

Fig. 6 shows the maximum call throughput values achieved by the three schemes as a function of the number of MSLEE instances. The first comment is that the intuition at the basis of this paper is correct: since Java-based SIP ASs badly scale with CPU cores, the best option to improve the exploitation of the hardware resources is to use several ASs in parallel. Basically, all schemes outperform the native configuration, in both its 32 bit (55 cps) and 64 bit (61 cps) versions. Up to 3 MSLEE instances, the configuration that shows better results is the "VM" one, even if the difference with "JVM-taskset" is really small. We ascribe this result to the better insulation capabilities of the "VM" approach. From 2 up to 8 MSLEE instances, the "JVM" approach shows very little improvement, since it passes from 80 cps (2 MSLEE instances) to 90 cps (4 and 8 MSLEE instances). Beyond 3 MSLEE instances, the best configuration is definitely the "JVM-taskset". This result is reasonable, since it causes less overhead than the "VM" one, and provides better CPU resource insulation with respect to "JVM". In particular, any increase of the number of VMs beyond 3 seems to cause an excessive overhead; thus, differently from "JVM" and "JVM-taskset", performance starts decreasing. The best result in terms of maximum throughput is reached by deploying 8 MSLEE instances, each using a single CPU core with 1.8 GB of memory allocated to the Java heap (all other configurations have 2.5 GB allocated to each JVM). In this condition, the maximum throughput reaches 125 cps.

Figure 6. Maximum call throughput vs. number of MSLEE instances.

On the other hand, considering only the maximum call throughput as a performance metric is not suitable, since the SRD is also important. In fact, the delay in establishing a session cannot be arbitrarily high, especially if during an ongoing call it is necessary to redirect the call towards another participant (e.g. a media server or a person). Fig. 7 shows the average and the 95th percentile of the SRD as a function of the number of MSLEE instances, evaluated in the maximum throughput condition. We recall that the 64 bit version (native configuration) in the maximum throughput condition (61 cps) achieves an average SRD equal to about 500 ms and a 95th percentile of the SRD equal to about 2 s. It is evident that most configurations exhibit setup delays that would not be acceptable in real settings. To this end, the reader should bear in mind that the measured delay is relevant only to the component of the SRD accumulated inside the data center where the ASs run, since the UAC and UAS used in our tests represent inbound and outbound SIP proxies or PSTN gateways in the data center.

Thus, in order to obtain the actual SRD values perceived by users, we have to add the delay contributions "caller-AS" and "AS-callee", which could be two segments of wide area networks and thus introduce significant delays. To this end, Fig. 8 shows the maximum throughput for the three considered approaches as a function of the number of MSLEE instances, with the constraint of limiting the 95th percentile of the SRD to 200 and 500 ms. The results are very interesting. First of all, we observe that striving for the maximum throughput may cause unacceptable SRD values. In fact, the configuration showing the best performance in Fig. 6 (8 MSLEE instances in the "JVM-taskset" setting) is now the worst. This happens because the use of a single core for executing the GC causes excessive delays, and it is not suitable for real time systems. Instead, the best configuration consists of using 4 MSLEE instances in the "JVM-taskset" configuration (110 cps with the 500 ms constraint and 105 cps with the 200 ms constraint). This is convincing, since a 32 bit system with 2 CPU cores scales very well (see Fig. 7). A further interesting result is that the "VM" approach can support up to 95 cps under both constraints. This means that, once the reasonable constraint on the SRD is added, the performance in terms of maximum call throughput of the "JVM-taskset" and "VM" approaches gets closer. In fact, the performance loss of the "VM" approach with respect to the "JVM-taskset" one is 13.5% if the 95th percentile of the SRD is bounded to 500 ms, and 9.5% with a constraint of 200 ms. In addition, the reader should bear in mind that running each AS within its own VM provides a better insulation, which allows operating without server crashes up to higher loads. Thus, our conclusion is that the preferable deployment is the "VM" one with 3 MSLEE instances.

V. RELATED WORK

A. JVM performance

Papers [24][25] analyze the JVM scalability on multi-core systems. They analyze the performance of benchmarks using various numbers of processor cores and application threads. They correlate low-level hardware performance with the number of JVM threads and system components, in order to observe potential bottlenecks. Lock contentions and memory stall cycles, produced by insufficient L2 cache and cache-to-cache transfers, are the main observed bottlenecks. The JVM includes a parameter called thread-local allocation buffer (TLAB) that limits these issues [24]. A further performance optimization [24] is achievable by using the Parallel GC configuration and a suitable ratio of the new and old generation memory heap, to reduce the overhead of minor memory collections. It can be set by using the NewRatio flag. The TLAB and Parallel GC options have already been used in our experiments, whereas the NewRatio will be considered in future works. However, from first tests, it seems that in our case its use mainly improves the SRD and does not have a large impact on the maximum throughput.

B. OpenSER SIP server performance

OpenSER [17] is a modular SIP proxy server, SIP call router, and SIP registrar server written in the C language. It is widely adopted in VoIP environments and focuses on proxying and localization. Even if OpenSER could also benefit from our approach, our internal architecture is not proxy-centric.

C. Web server performance

The performance scalability problem in multi-core environments is relevant also for other types of ASs, e.g. web servers.
