International Review on Computers and Software (IRECOS)
Part A

Contents

Accuracy-Enhanced Power Metering Technique in Virtualized Environments by Xiao Peng, Liu Dongbo (p. 427)

Utilizing an Enhanced Cellular Automata Model for Data Mining by Omar Adwan, Ammar Huneiti, Aiman Ayyal Awwad, Ibrahim Al Damari, Alfonso Ortega, Abdel Latif Abu Dalhoum, Manuel Alfonseca (p. 435)

Investigation of In-Network Data Mining Approach for Energy Efficient Data Centric Wireless Sensor Networks by Sanam Shahla Rizvi, Tae-Sun Chung (p. 443)

Research of Two Key Techniques in Virtual Application Development by Duan Xinyu, He Engui (p. 448)

The Optimized Wavelet Filters and Real-Time Implementation of Speech Codec Base on DWT on TMS320C64xx by Noureddine Aloui, Mourad Talbi, Adnane Cherif (p. 454)

A Hybrid System of Hadoop and DBMS for Earthquake Precursor Application by Tao Luo, Wei Yuan, Pan Deng, Yunquan Zhang, Guoliang Chen (p. 463)

Model Construction for Communication Gap of Requirements Elicitation by Stepwise Refinement by Noraini Che Pa, Abdullah Mohd Zin (p. 468)

Towards a Reference Ontology for Higher Education Knowledge Domain by Leila Zemmouchi-Ghomari, Abdessamed Réda Ghomari (p. 474)

A Fuzzy Logic Based Method for Selecting Information to Materialize in Hybrid Information Integration System by Wadiî Hadi, Ahmed Zellou, Bouchaib Bounabat (p. 489)

A New Approach for Code Generation from UML State Machine by My Hafid Aabidi, Abdeslam Jakimi, El Hassan El Kinani, Mohammed Elkoutbi (p. 500)

An Energy Efficient Deployment Scheme for Ocean Sensor Networks by Bin Zeng, Lu Yao, Rui Wang (p. 507)

A Fast Exact String Matching Algorithm Based on Nested Classification by Yuwan Gu, Lei Li, Guodong Shi, Yingli Zhang, Yuqiang Sun (p. 514)

A New Solution to Defend Against Cooperative Black Hole Attack in Optimized Link State Routing Protocol by H. Zougagh, A. Toumanari, R. Latif, N. Idboufker (p. 519)

Study of Network Access Control System Featuring Collaboratively Interacting Network Security Components by Li He-Hua, Wu Chun-Ling (p. 527)
An Ontology Based Meta-Search Engine for Effective Web Page Retrieval by P. Vijaya, G. Raju, Santosh Kumar Ray (p. 533)

MMSD: a Metadata-Aware Multi-Tiered Source Deduplication Cloud Backup System in the Personal Computing Environment by Haiyan Meng, Jing Li, Weiqing Liu, Changchun Zhang (p. 542)

(continued on Part B)
International Review on Computers and Software (I.RE.CO.S.), Vol. 8, N. 2 ISSN 1828-6003 February 2013
Accuracy-Enhanced Power Metering Technique in Virtualized Environments
Xiao Peng, Liu Dongbo
Abstract – In virtualized environments, accurately metering the power consumption of an individual virtual machine (VM) is a challenging issue. Conventional VM power metering techniques rely on the assumption that power consumption is linear in hardware utilization. However, such utilization-based techniques can only provide coarse-grained power measurements with unbounded error. In this paper, we first formalize the relationship between resource utilization and the accuracy of power metering. Then, we propose a novel VM scheduling algorithm, which uses the information of performance monitoring counters (PMC) to compensate for the recursive power consumption. Theoretical analysis indicates that the proposed algorithm provides a bounded error when metering per-VM power consumption. Extensive experiments are conducted with various benchmarks on different platforms, and the results show that the error of per-VM power metering can be kept below 5.2%. Copyright © 2013 Praise Worthy Prize S.r.l. All rights reserved.

Keywords: Cloud Computing, Energy Efficiency, Resource Virtualization, Server Consolidation
Manuscript received and revised January 2013, accepted February 2013.

Nomenclature

Uj            Utilization of component j
Pivm          Dynamic power consumption of VMi
Wi            Processor utilization ratio allocated to VMi
Ui(t1,t2)     Actual resource utilization of VMi during [t1,t2]
Ej(t1,t2)     PMC events related to component j in [t1,t2]
Li            Accumulator of the relative PMC value of VMi
Ri            Credit value of VMi

I. Introduction

Recently, more and more high-performance datacenters have deployed their resources on virtualization platforms [1]. From the perspective of resource providers, virtualization technology provides an effective approach to extending the capability of their cloud infrastructures without adding too many IT devices. Meanwhile, server consolidation and live migration are commonly considered two effective mechanisms for reducing the power consumption of large-scale datacenters without significant performance degradation [2]-[5]. In conventional data centers, the power consumption of a device can be directly metered using hardware meters or sensors. In virtualized datacenters, however, the basic unit of resource management is the virtual machine (VM), which cannot be directly connected to a hardware power meter.

Therefore, one of the most frequently mentioned challenges is how to accurately meter power consumption on a per-VM basis [2], [6]-[13]. Many efforts have been made to construct power models of VMs, and most of them are based on the simple assumption that the power consumption of a VM is linear in the utilization of the physical resources it occupies [6]-[9], [11]. Although the assumption itself is correct, existing power models still cannot provide fine-grained per-VM power metering; the main difficulties are the following:
• In existing VM power models, the coefficient of each component is often obtained through empirical studies, which makes the models suitable only for certain kinds of underlying physical resources [2], [9], [13].
• As several VMs compete for common physical devices, the scheduler's decisions have significant effects on per-VM power consumption at runtime [8], [10]-[11].
• I/O requests often involve encryption and decryption operations, which consume a great deal of CPU power. Therefore, it is difficult to distinguish the power spent on I/O devices from the rest [9], [12], [14].

In this work, we introduce the concept of the relative PMC (also called the PMC ratio) to measure the power consumption of VMs. In this way, we avoid using empirical coefficients to construct the VM power model. Based on the concept of the relative PMC, we propose a novel VM scheduling algorithm, which uses the relative PMC to compensate for the recursive power consumption of a VM.

The remainder of this paper is organized as follows: in Section II, we summarize the related work; in Section III, we present the formal relationship between resource utilization and the accuracy of power metering; in Section IV, we introduce the concept of the relative PMC and the PMC-based scheduling algorithm; Section V presents the experimental results and evaluation. Finally, Section VI concludes the paper with a brief discussion of future work.
II. Related Work

Early studies on measuring VM power were often implemented by extending a power monitoring adaptor between the VM hypervisor and the device driver modules. For instance, Cherkasova et al. presented an approach to measuring vCPU power consumption on Xen platforms [8]. In [14], Stoess et al. presented a two-layer power management framework for metering, evaluating and controlling the power consumption of virtualized devices. With the increasing requirements of fine-grained power management in modern data centers, plenty of effort has been devoted to the issue of per-VM power metering. For instance, Kansal et al. proposed a VM power metering mechanism, namely Joulemeter, which uses software-based power models to track per-VM power consumption [9]. In [10], Koller et al. investigated the power modeling methodology in virtualization environments. In [11], Bohra et al. presented an empirical VM power model called VMeter, which is based on the experimental observation that the power consumption of different hardware components is highly correlated.

Recently, PMC-based power metering techniques have been extensively studied. In [13], Bircher et al. presented a comprehensive work on using PMCs to model power consumption for both server and desktop machines. In [15], Bertran et al. conducted massive experiments to investigate the effectiveness and accuracy of PMC-based power modeling in both virtualized and non-virtualized environments, and their experiments also demonstrated that PMC-based power models are more accurate and stable than utilization-based power models. In [16], Lim et al. demonstrated an empirical VM power model on an Intel Core i7 platform.

Our work also uses PMCs to collect the basic VM power consuming events. The difference is that our work does not rely on an empirical approach to obtaining the coefficients of the various PMC events, as the above studies do, because such an approach is tightly coupled with the underlying platform and is difficult to employ in heterogeneous datacenters. Instead, we only record the PMC events on a per-VM basis and calculate the distributions of the various kinds of PMC events.

III. Power Modeling in Virtualized Systems

Generally speaking, there is a multi-layered software stack in a virtualized server, which consists of physical devices, native OS, drivers, VM hypervisor, VM utilities and multiple VMs. In this architecture, the privileged VM hypervisor and the VM utilities can control the physical devices directly through driver modules or hardware interfaces. Therefore, they can obtain coarse-grained power consumption information of the physical components, such as average or peak power consumption. However, fine-grained per-VM power consumption cannot be obtained in this way, because the capacity of the physical devices is multiplexed by the VM hypervisor across many VMs, whose actual power consumption is determined by the characteristics of the applications currently running on them. Furthermore, the virtualization layer may lead to a series of complicated calling chains, which makes it difficult to separate an application's power consumption from the overall power consumption [8]-[9], [14].

For convenience of representation, we denote the power consuming components as J = {CPU, RAM, Disk, I/O}. Typically, the power consumption of a server is formulated as:

    P_server = P_static + ∑_{j∈J} k_j · U_j                                          (1)

where U_j is the utilization of each kind of component and k_j is the dynamic power coefficient. When a server is virtualized, its power model can be rewritten as:

    P_server = P_static + ∑_{i=1}^{M} P_i^vm                                         (2)

where P_i^vm is the dynamic power consumption of VM_i and M is the number of active VMs on this server. As the VMs cannot be connected to hardware power meters, their actual power consumption P_i^vm has to be measured in an indirect way. The most frequently mentioned per-VM power model is the following:

    P_i^vm = P_static / M + W_i · ∑_{j∈J} k_j · U_j                                  (3)

where W_i is the processor utilization ratio allocated to VM_i. In order to keep the power model accurate, the VM scheduler must satisfy the following equation:

    ∀i, j:   U_i(t1,t2) / W_i − U_j(t1,t2) / W_j = 0                                 (4)

where U_i(t1,t2) is the actual utilization of resources consumed by VM_i during the time period [t1,t2]. This condition implies that the VM scheduler must keep the actual utilization U_i(t1,t2) strictly consistent with its promise W_i among all VMs. Unfortunately, none of the current VM schedulers can satisfy this condition, because the runtime characteristics of applications have significant effects on the actual utilization.
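Before deriving the error bound, the following Python sketch makes the utilization-based model of Eqs. (1)-(4) concrete. Every number in it (static power, coefficients, utilizations, shares) is an invented example, not a value from the paper.

# Illustrative sketch of the utilization-based per-VM power model (Eqs. (1)-(3)).
# All numbers below are made-up examples, not measurements from the paper.
P_STATIC = 60.0                                           # static server power (W), assumed
K = {"cpu": 45.0, "ram": 12.0, "disk": 8.0, "io": 6.0}    # dynamic coefficients k_j, assumed

def server_power(util):
    """Eq. (1): P_server = P_static + sum_j k_j * U_j, with util[j] in [0, 1]."""
    return P_STATIC + sum(K[j] * util[j] for j in K)

def per_vm_power(util, weights):
    """Eq. (3): split static power evenly and dynamic power by the CPU share W_i."""
    m = len(weights)
    dynamic = sum(K[j] * util[j] for j in K)
    return [P_STATIC / m + w * dynamic for w in weights]

util = {"cpu": 0.6, "ram": 0.4, "disk": 0.2, "io": 0.1}   # measured utilizations U_j
weights = [0.5, 0.3, 0.2]                                 # promised CPU shares W_i
print(server_power(util))                                 # total server power
print(per_vm_power(util, weights))                        # per-VM estimates; accurate only if Eq. (4) holds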
Therefore, the more general form of condition (4) can be written as:

    ∀i, j:   U_i(t1,t2) / W_i − U_j(t1,t2) / W_j ≤ ψ                                 (5)

It is clear that the accuracy of the per-VM power model depends on the parameter ψ. A larger value of ψ results in lower accuracy when modeling the VM power consumption. That is:

    P_actual^VM − P_measured^VM ∝ ψ                                                  (6)

So, the parameter ψ can be considered the upper bound of the error when metering the VM power consumption.

IV. PMC-Based VM Scheduling Algorithm

IV.1. Relative PMC Based Power Model

In modern servers, performance monitoring counters (PMC) are widely supported across platforms. Typically, these PMCs can be categorized into many classes according to their relationship to specific components, such as CPU, GPU, chipset, RAM, I/O controller, and disk. As only a few PMCs are representative for modeling power consumption, we select these representative PMCs by performing extensive experiments with various benchmarks. In this work, the selected PMC set is {uOps, Halt, LLC, TLB, DMA, FSB, Interrupt}; detailed descriptions of each PMC can be found in [17]. To figure out the correlation between these PMCs and the power consumption of the different subcomponents, a series of tests was performed using various benchmarks on four different kinds of servers. The summary of the experiments is as follows:

    P_cpu  ∝ (uOps − Halt) / 2
    P_ram  ∝ (LLC + TLB + FSB) / 3
    P_disk ∝ Interrupt + DMA
    P_IO   ∝ Interrupt + DMA                                                         (7)

Based on the above summary, it is clear that the power consumption of the different subcomponents may be correlated with two or three kinds of PMC events. It is noteworthy that we do not impose any empirical coefficients in formulas (7), which is of significant importance for architecture independence. For convenience of representation, we give the following notations:

    E_cpu(t1,t2)  = (uOps_{t1→t2} − Halt_{t1→t2}) / 2
    E_ram(t1,t2)  = (LLC_{t1→t2} + TLB_{t1→t2} + FSB_{t1→t2}) / 3
    E_disk(t1,t2) = Interrupt_{t1→t2} + DMA_{t1→t2}
    E_IO(t1,t2)   = Interrupt_{t1→t2} + DMA_{t1→t2}                                  (8)

where E_j(t1,t2) is the number of PMC events related to component j (j ∈ J) in the time interval [t1,t2]. If we define the power model of VM_k as:

    P_k^vm(t1,t2) = ∑_{j∈J} P_{k,j}^vm(t1,t2)                                        (9)

where P_{k,j}^vm(t1,t2) is the power consumed on physical component j, then, for a given server running multiple VMs, the dynamic power consumption of each VM is linear in its relative PMC ratio. That is:

    P_{k,j}^vm(t1,t2) ∝ E_j^k(t1,t2) / E_j(t1,t2),   ∀j ∈ J                          (10)

where E_j^k(t1,t2) is the part of E_j(t1,t2) produced by VM_k. In this work, we use the relative PMC value (also called the PMC ratio) to describe the VM power model instead of absolute PMC counts, as shown in (10).

IV.2. PMC Accounting Based Credit Scheduling Algorithm

To take the recursive power consumption into account, we design a novel scheduling algorithm, namely PMC Accounting based Credit Scheduling (PACS), which uses both the relative PMC account and the utilization ratio as credits when scheduling VMs. The implementation of the PACS algorithm is as follows:

PACS: PMC Accounting based Credit Scheduling
Begin
1.  for each arriving VMi do
2.      Ri := Wi; Li := 1;
3.  end for
4.  while the processor is idling do
5.      Sort the VMs as {VMk1, VMk2, ..., VMkm} according to:
            (Lk1 - Rk1)/Wk1 <= (Lk2 - Rk2)/Wk2 <= ... <= (Lkm - Rkm)/Wkm;
6.      Assign the processor to VMk1 with utilization Lk1 - Rk1;
7.      for n = 2 to m do
8.          Rkn := Rkn + ((Lk1 - Rk1)/Wk1) * Wkn;
9.          Lkn := Lkn + ( ∑_{j ∈ J\{cpu}} E_j^{kn} ) / ( ∑_{j ∈ J\{cpu}} E_j );
10.     end for
11.     Rk1 := 0;
12.     Lk1 := Lk1 + ( ∑_{j ∈ {cpu}} E_j^{k1} ) / ( ∑_{j ∈ {cpu}} E_j );
13. end while
End.

In the PACS algorithm, each VMi is associated with a credit value Ri whose initial value is set to the utilization ratio Wi. Li is an accumulator that records the relative PMC value of VMi, and its initial value is set to 1. In each scheduling round, all VMs are sorted according to the criterion shown in step 5, and the first VM is scheduled. The credit value of the scheduled VM is proportionally shared by the other VMs, as shown in step 8. PACS uses the relative PMC value Li as a heuristic for recursive operations. More specifically, when a VM is scheduled, its Li accumulates the CPU-related PMC events (as shown in step 12). So, if the scheduled VM is computation-intensive, the increased Li value will result in a larger utilization ratio (Li - Ri) at its next scheduling. On the other hand, if the scheduled VM is data-intensive, the PACS algorithm will reduce its utilization ratio in its future scheduling. It can be seen that the PACS algorithm not only uses the PMC information to compensate for the recursive power consumption, it also dynamically adjusts the priority of the waiting VMs in a fair manner.
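The following Python sketch is one possible reading of the PACS pseudocode above; the PMC counts are fed in as plain dictionaries, and the helper names are ours, not part of the paper.

# A minimal, illustrative rendering of the PACS loop (steps 1-13).
# pmc[j] holds the event count E_j of component j in the last interval,
# and pmc_vm[vm][j] holds the share E_j^vm produced by that VM (assumed inputs).
def pacs_round(vms, R, L, W, pmc, pmc_vm):
    """One pass of the while-loop body: pick a VM, update credits and accumulators."""
    order = sorted(vms, key=lambda v: (L[v] - R[v]) / W[v])   # step 5: sort by (L - R)/W
    k1, rest = order[0], order[1:]
    share = L[k1] - R[k1]                                     # step 6: utilization granted to VMk1

    non_cpu = [j for j in pmc if j != "cpu"]
    for vm in rest:                                           # steps 7-10
        R[vm] += (L[k1] - R[k1]) / W[k1] * W[vm]              # step 8
        produced = sum(pmc_vm[vm][j] for j in non_cpu)
        total = sum(pmc[j] for j in non_cpu) or 1
        L[vm] += produced / total                             # step 9
    R[k1] = 0.0                                               # step 11
    L[k1] += pmc_vm[k1]["cpu"] / (pmc["cpu"] or 1)            # step 12
    return k1, share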
The error bound that the PACS algorithm can provide is given by the following theorem.

Theorem 1. When the PACS algorithm is used, any VMi and VMj satisfy, in any time period [t1,t2]:

    U_i(t1,t2)/W_i − U_j(t1,t2)/W_j  ≤  max_{t∈[t1,t2]} {R_i(t), R_j(t)} / min{W_i, W_j}          (11)

Proof. Without loss of generality, assume that the PACS algorithm has been executed k times during the time interval [t1,t2]. Accordingly, the scheduling sequence during [t1,t2] can be written as <VM_{q1}, VM_{q2}, ..., VM_{qk}>. For any VMi, its credit value R_i(t) (t ∈ [t1,t2]) can be written as R_i(n) (n ∈ {1,2,...,k}). According to the PACS algorithm, for any VMi, its credit value R_i(k) after the k-th iteration is:

    R_i(k) = ∑_{n=1}^{k−1} [ (L_{qn}(n) − R_{qn}(n)) / W_{qn} ] · W_i + R_i(1) − U_i(t1,t2)       (12)

where qn is the index of the VM scheduled in the n-th iteration during [t1,t2]. So:

    U_i(t1,t2) / W_i = ∑_{n=1}^{k−1} (L_{qn}(n) − R_{qn}(n)) / W_{qn} + (R_i(1) − R_i(k)) / W_i   (13)

Therefore, we have:

    U_i(t1,t2)/W_i − U_j(t1,t2)/W_j
      = (R_i(1) − R_i(k))/W_i − (R_j(1) − R_j(k))/W_j
      ≤ (R_i(1) − R_i(k))/W_i + (R_j(1) − R_j(k))/W_j
      ≤ max_{n∈{1,...,k}} {R_i(n)} / W_i + max_{n∈{1,...,k}} {R_j(n)} / W_j
      ≤ max_{n∈{1,...,k}} {R_i(n), R_j(n)} / min{W_i, W_j}
      = max_{t∈[t1,t2]} {R_i(t), R_j(t)} / min{W_i, W_j}                                          (14)

According to Theorem 1, it is clear that the error of the utilization-based per-VM power model is upper bounded when the PACS algorithm is used. This upper bound is related to the utilization ratio Wi and the dynamic credit value Ri(t) of each VM. Based on the conclusion of Theorem 1, we have the following corollary on the PACS algorithm.

Corollary 1. When the PACS algorithm is used, if W1 = W2 = ... = Wn then the overall upper bound of the error is reduced.

Proof. Since ∑Wi = 1, min{Wi, Wj} is maximized for all i, j when W1 = W2 = ... = Wn, so the bound in Theorem 1 is reduced. By Theorem 1 and Eq. (6), the proof is completed.

This corollary indicates that if all VMs are configured to share the processor equally, the PACS algorithm improves the accuracy of the per-VM power model. This feature is especially useful for virtualized servers that emphasize fairness when consolidating VMs.

V. Experiments and Evaluation

V.1. Experimental Settings

To investigate the performance of the PACS algorithm, we conducted a series of experiments on two platforms with various benchmarks. The configurations of the platforms are listed in Table I. The VM hypervisor used in the experiments is Xen version 4.1.2 and the underlying operating system is Linux with kernel version 2.6.2. In the experiments, we use Oprofile [18] to sample the PMC events, and the original reports produced by Oprofile are categorized per individual VM. The benchmarks used in our experiments include the SPECcpu2006 benchmark suite [19], TPC-W [20], Cachebench [21] and IOZone [22].
TABLE I. THE CONFIGURATIONS OF THE EXPERIMENTAL PLATFORMS

Parameters    | Server                          | Desktop
CPU           | Intel Xeon E5606                | Intel Pentium D830
Architecture  | quad-core                       | dual-core
CPU Freq.     | 4 x 2.13 GHz                    | 2 x 3.0 GHz
Cache         | L1: 128 KB, L2: 1 MB, L3: 8 MB  | L1: 64 KB, L2: 2 MB
CPU Voltage   | 0.75 V ~ 1.35 V                 | 1.25 V ~ 1.4 V
CPU Power     | 80 Watt                         | Idle: > 40 Watt, Peak: > 150 Watt
RAM           | DDR3 16 GB                      | DDR2 2 GB
Disk Type     | SATA 2.0 TB                     | IDE 160 GB

V.2. Accuracy of Metering per-VM Power

In this experiment, we run the benchmarks on the two platforms one by one. Since each benchmark is then the only VM on the platform, the actual power of each benchmark can be approximated by this baseline power. When running multiple VMs concurrently, we use the following formula to calculate the error of the per-VM power:

    err% = (P_measured^VM − P_baseline^VM) / P_baseline^VM × 100%                    (15)

where P_measured^VM can be obtained by formula (3) or (10). If formula (3) is used, the per-VM power is measured by the conventional utilization-based technique; if formula (10) is used, it is measured by the relative-PMC-based technique proposed in this paper. Our experiments test both techniques and compare them with each other. As for the VM scheduler, we compare our PACS algorithm with Xen's default Credit Scheduler (CS) [23], which uses only the utilization ratio as scheduling credit. Therefore, the experimental results fall into the following four groups:
• CS + UM: Credit Scheduler and utilization-based power measuring technique.
• CS + PM: Credit Scheduler and PMC-based power measuring technique.
• PACS + UM: PACS scheduler and utilization-based power measuring technique.
• PACS + PM: PACS scheduler and PMC-based power measuring technique.

The experimental results are shown in Figs. 1. In this experiment, we set all VMs to share the processor equally, that is VM1 = VM2 = VM3 = VM4 = VM5 = 20%. According to the analysis in Corollary 1, such a fair allocation strategy lowers the error bound of the power measurement when the PACS algorithm is used.

The most distinguishing result is that the error of measuring per-VM power is highly correlated with the characteristics of the benchmarks. For example, the cpu-intensive benchmarks (bzip2 and mcf) have a very low error in all cases, while the disk- or I/O-intensive benchmarks (TPC-W and IOZone) are difficult to measure accurately with the CS+UM technique. That is because those VMs are often blocked by the VM hypervisor when performing massive I/O operations, and the CS+UM technique cannot take such recursive power into account. When PMC measurements are used, all the power consumption of the physical components is accounted for, so the error for the disk- and I/O-intensive benchmarks is reduced significantly. As shown in Fig. 1(a), the error of TPC-W is reduced from 11.9% to 5.4%, and the improvement on IOZone is even more significant.

The error on the desktop platform is higher than that on the server platform in most cases. Two exceptions are mcf and IOZone. We notice that mcf is the most cpu-intensive of all the benchmarks, while IOZone is the most disk-intensive benchmark. When running the benchmarks on the desktop platform, only two VMs can be executed concurrently, so the utilization of the individual devices (i.e. cpu, ram, disk, I/O) is highly imbalanced when running mcf and IOZone. As mentioned before, the UM-based technique has to account the overall power to the current VMs, which makes its error very high. On the server platform there are four cores, which allows at most four VMs to run concurrently; this significantly balances the measuring error when the UM-based technique is used. When the PMC-based technique is used, this increased error is reduced by about 25% for the mcf benchmark and 30% for the IOZone benchmark.

Figs. 1. Error of per-VM power measuring with different benchmarks: (a) Server, (b) Desktop
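As a companion to Eq. (10) and the error metric of Eq. (15), the sketch below attributes measured dynamic power to VMs by their PMC ratios and reports the resulting error against a baseline. The sample counts and baseline values are invented for illustration only.

# Illustrative per-VM power attribution by relative PMC ratio (Eq. (10))
# and the error metric of Eq. (15). All numbers are made up, not measured.
def pmc_ratio_power(dynamic_power, pmc_total, pmc_per_vm):
    """Split the dynamic power of each component j by E_j^k / E_j."""
    est = {vm: 0.0 for vm in pmc_per_vm}
    for j, p_j in dynamic_power.items():              # dynamic power of component j
        total = pmc_total[j] or 1
        for vm, counts in pmc_per_vm.items():
            est[vm] += p_j * counts[j] / total        # Eq. (10): proportional share
    return est

def err_percent(measured, baseline):
    """Eq. (15): relative error of a per-VM power estimate."""
    return (measured - baseline) / baseline * 100.0

dynamic_power = {"cpu": 30.0, "ram": 6.0, "disk": 4.0}            # assumed watts
pmc_total = {"cpu": 1_000_000, "ram": 200_000, "disk": 50_000}
pmc_per_vm = {
    "vm1": {"cpu": 700_000, "ram": 80_000, "disk": 5_000},
    "vm2": {"cpu": 300_000, "ram": 120_000, "disk": 45_000},
}
est = pmc_ratio_power(dynamic_power, pmc_total, pmc_per_vm)
print(est, err_percent(est["vm1"], baseline=25.0))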
A most interesting result is observed for the IOZone benchmark running on the server platform when the PACS scheduling algorithm is used together with the utilization-based measuring technique: its error is dramatically reduced from 12.11% to 2.14% (as shown in Fig. 1(a)). To find the reason, we checked the intermediate data collected during the experiments. The data show that IOZone requires almost no processor time during its entire execution. So the UM-based technique results in a very high error, which is especially bad on the server platform. Although the PMC-based technique can remove part of this error, the disk-related PMC events fail to accurately capture the actual power consumption when the disk is in an overloaded state. With the combination PACS+UM, PACS tends to select the VM with the least power consumption in the recent period, so IOZone is frequently scheduled and allocated a very small processor utilization ratio. Such a scheduling decision is very effective for disk- or I/O-intensive workloads.

V.3. Performance Comparison on Power Efficiency

In this experiment, we investigate the power efficiency (PE) obtained with the different VM scheduling algorithms. It is well known that the virtualization ratio (the ratio of concurrent VMs to the number of physical processors, VR) has significant effects on the PE of virtualized servers. To compare performance, we use the VR metric to normalize the PE. As the benchmarks used in our experiments are not divisible, we run each benchmark n times on the target platform in order to simulate a divisible workload. For example, if n = 4 and VR = 2, the workload size is 4 times the original benchmark, and at most two VMs can execute the workload concurrently. The experiment is conducted four times on each platform, with VR = 1.0, 2.0, 3.0, 4.0 and n = 10 in all cases. The experimental results are shown in Figs. 2(a)-(d).

Based on the results, we notice that the PE metric of the cpu-intensive benchmarks (bzip2 and mcf) is generally higher than that of the other benchmarks. Furthermore, it improves significantly with higher values of VR. That is because the power consumption of the processor is the most dominant on the tested platforms. As for Cachebench, its normalized PE is slightly lower than bzip2 and mcf, but significantly higher than TPC-W and IOZone. By examining the PMC logs, we find that Cachebench triggers plenty of LLC-miss events during its execution. Unlike the DMA event, an LLC-miss event does not result in long-term blocking, so both the processor and the memory are kept in a working state, which makes the power consumption of Cachebench very similar to that of the cpu-intensive benchmarks. As for TPC-W, its performance is the worst in all cases; for instance, its normalized PE is always lower than 1.0 when the CS scheduling algorithm is used.

Figs. 2. Power efficiency at different virtualization ratios: (a) VR = 1.0, (b) VR = 2.0, (c) VR = 3.0, (d) VR = 4.0
As we tested two kinds of platforms (server and desktop), we noticed that the normalized PE of the desktop is higher than that of the server in many cases. By checking the datasheets of the two platforms, we found that the server is designed to provide as much performance as possible. For example, the processor of the desktop platform gates the clock signal when it detects idle phases, which has a significant effect on power saving, while the server platform does not have such a mechanism on its processors. So, when the workload is not heavy, the desktop platform appears more power efficient than the server platform (as shown in Figs. 2(a) and (b)). With increasing VR, the power efficiency of the server increases quickly, especially for the cpu-intensive benchmarks (as shown in Fig. 2(d)).

As for the effect of the scheduling algorithm on the normalized PE metric, we can see that the performance difference between the two algorithms is very small when VR = 1.0, 2.0. When VR = 3.0, 4.0, the normalized PE of TPC-W and IOZone improves more quickly with the PACS algorithm. As for bzip2 and mcf, the performance of the two algorithms stays the same when VR = 3.0; however, the PE increases significantly from about 1.81~1.95 to 2.50 when PACS is used on the server platform (as shown in Fig. 2(d)). Such an improvement indicates that the PACS algorithm is more effective at improving the power efficiency than the conventional CS algorithm.
VI. Conclusion

In this paper, we address the issue of fine-grained per-VM power measurement. Using the relative PMC value as a scheduling heuristic, a novel PMC Accounting based Credit Scheduling (PACS) algorithm is proposed. The scheduling algorithm uses the PMC information to compensate for the recursive power consumption. The experimental results obtained with various benchmarks show that the practical error of the power measurement can be kept below 5.2% when the PACS algorithm is used. Our analysis also shows that the PACS algorithm adopts a Recent Lowest Power First scheduling strategy when a server allocates its processor equally to all VMs. The experimental results indicate that such a scheduling strategy is effective at improving the power efficiency when the virtualization ratio of a server is higher than 3.0. In the future, we plan to enhance the PACS algorithm with deadline-guarantee functionality, since plenty of real-time applications have emerged in many cloud-based systems. In addition, we will devote more effort to the other QoS metrics when the PACS algorithm is used.

Acknowledgements

This work was supported by the Provincial Science & Technology Plan Project of Hunan (2012GK3075).

References

[1] R.K. Jena, P.K. Mahanti, Computing in the Cloud: Concept and Trends, (2011) International Review on Computers and Software (IRECOS), 6 (1), pp. 1-10.
[2] G. Dhiman, G. Marchetti, T. Rosing, vGreen: a System for Energy-Efficient Management of Virtual Machines, ACM Trans. on Design Automation of Electronic Systems, Vol. 16, No. 1, pp. 1-27, 2010.
[3] R.K. Jena, Green Cloud Computing: Need of the Hour, (2012) International Review on Computers and Software (IRECOS), 7 (1), pp. 45-52.
[4] X. Liao, H. Jin, H. Liu, Towards a Green Cluster through Dynamic Remapping of Virtual Machines, Future Generation Computer Systems, Vol. 28, No. 2, pp. 469-477, 2012.
[5] X. Liao, L. Hu, H. Jin, Energy Optimization Schemes in Cluster with Virtual Machines, Cluster Computing, Vol. 13, No. 2, pp. 113-126, 2010.
[6] P. Mahadevan, S. Banerjee, P. Sharma, et al., On Energy Efficiency for Enterprise and Data Center Networks, IEEE Communications Magazine, Vol. 49, No. 8, pp. 94-100, 2011.
[7] S. Kumar, V. Talwar, V. Kumar, et al., vManage: Loosely Coupled Platform and Virtualization Management in Data Centers, Int'l Conf. on Autonomic Computing, pp. 127-136, 2009.
[8] L. Cherkasova, R. Gardner, Measuring CPU Overhead for I/O Processing in the Xen Virtual Machine Monitor, USENIX Annual Technical Conference, pp. 387-390, 2005.
[9] A. Kansal, F. Zhao, J. Liu, et al., Virtual Machine Power Metering and Provisioning, ACM Symp. on Cloud Computing, pp. 39-50, 2010.
[10] R. Koller, A. Verma, A. Neogi, WattApp: an Application Aware Power Meter for Shared Data Centers, Int'l Conf. on Autonomic Computing, pp. 31-40, 2010.
[11] A. Bohra, V. Chaudhary, VMeter: Power Modelling for Virtualized Clouds, Int'l Parallel and Distributed Processing Symposium, pp. 1-8, 2010.
[12] B. Krishnan, H. Amur, A. Gavrilovska, et al., VM Power Metering: Feasibility and Challenges, ACM SIGMETRICS Performance Evaluation Review, Vol. 38, No. 3, pp. 56-60, 2010.
[13] W.L. Bircher, L.K. John, Complete System Power Estimation Using Processor Performance Events, IEEE Trans. on Computers, Vol. 61, No. 4, pp. 563-577, 2012.
[14] J. Stoess, C. Lang, F. Bellosa, Energy Management for Hypervisor-Based Virtual Machines, USENIX Annual Technical Conference, pp. 1-14, 2007.
[15] R. Bertran, Y. Becerra, D. Carrera, et al., Energy Accounting for Shared Virtualized Environments under DVFS Using PMC-Based Power Models, Future Generation Computer Systems, Vol. 28, No. 2, pp. 457-468, 2012.
[16] M.Y. Lim, A. Porterfield, R. Fowler, SoftPower: Fine-Grain Power Estimations Using Performance Counters, Int'l Symp. on High-Performance Distributed Computing, pp. 308-311, 2010.
[17] R. Bertran, M. Gonzàlez, X. Martorell, et al., Decomposable and Responsive Power Models for Multicore Processors Using Performance Counters, ACM Int'l Conf. on Supercomputing, pp. 147-158, 2010.
[18] Oprofile. http://oprofile.sourceforge.net, 2012.
[19] SPEC CPU2006. http://www.spec.org/cpu2006/, 2012.
[20] Transaction Processing Performance Council TPC-W. http://www.tpc.org/tpcw/default.asp/, 2012.
[21] Cachebench. http://icl.cs.utk.edu/projects/llcbench/cachebench/, 2012.
[22] IOZone. http://www.iozone.org/, 2012.
[23] L. Cherkasova, D. Gupta, A. Vahdat, Comparison of the Three CPU Schedulers in Xen, ACM SIGMETRICS Performance Evaluation Review, Vol. 35, No. 2, pp. 42-51, 2007.

Authors' information

Department of Computer and Communication, Hunan Institute of Engineering, Xiangtan City, Hunan Province, 411104, China.
Xiao Peng (corresponding author) was born in 1979. He received his M.S. and Ph.D. in 2004 and 2010, respectively. He now works as a researcher at the Hunan Institute of Engineering and as an engineer in the HP high-performance networking computing lab. His research interests include cloud computing, energy efficiency management and optimization, parallel and distributed architectures, and distributed intelligence. He is a member of the IEEE and the ACM.
E-mail: [email protected]

Liu Dongbo was born in 1974. He received his master's degree from Hunan University in 2001. He now works at the Hunan Institute of Engineering and is a Ph.D. candidate at Hunan University. His research interests include distributed intelligence, multi-agent systems, and high-performance applications. He is a student member of the CCF in China, and has worked as a Senior Engineer in the HP High-Performance Lab.
International Review on Computers and Software (I.RE.CO.S.), Vol. 8, N. 2 ISSN 1828-6003 February 2013
Utilizing an Enhanced Cellular Automata Model for Data Mining
Omar Adwan, Ammar Huneiti, Aiman Ayyal Awwad, Ibrahim Al Damari, Alfonso Ortega, Abdel Latif Abu Dalhoum, Manuel Alfonseca

Abstract – Data mining deals with clustering and classifying large amounts of data, in order to discover new knowledge from existing data by identifying correlations and relationships between various data sets. Cellular automata have been used before for classification purposes. This paper presents a cellular-automata-enhanced classification algorithm for data mining. Experimental results show that the proposed enhancement gives better performance in terms of accuracy and execution time than previous work using cellular automata. Copyright © 2013 Praise Worthy Prize S.r.l. - All rights reserved.
Keywords: Cellular Automata, Clustering, Classification, Data Mining, Moore Neighborhood
Manuscript received and revised January 2013, accepted February 2013.

I. Introduction

A cellular automaton (CA) is a discrete mathematical model with three main components; namely, a finite automaton, a regular lattice (grid), not necessarily finite, and a neighborhood rule that defines the set of neighboring cells for every position in the grid. The global behavior of a CA may be described locally, because each finite automaton in the grid takes the states of its neighbors as input. However, cellular automata with simple local behavior may give rise to complex dynamic systems. The set of particular states of all the automata in the grid of a CA at a given time is called a configuration. The grids can be seen as matrices of states with a given dimension. The most common grids are one- and two-dimensional.

Cellular automata have been successfully used in practice in many different ways, as simulation tools for a wide variety of disciplines, including physical modeling and simulation [1], biology [2], fluid dynamics [3], pattern recognition [4], the logical organization behind self-reproduction [5], traffic simulation [6] [7], edge detection [8], and urban development simulation [9] [10]. In the theoretical domain, they have been used as parallel computer abstract architectures [11] [12] [13] [14]. In [15] and [16] cellular automata were connected with formal languages as a standard method to study other decentralized spatially extended systems. CAs have also been used as an alternative method to solve differential equations and to simulate several physical systems where differential equations are useless or difficult to apply [14].

Data mining refers to the process of analyzing huge databases in order to find useful patterns. As in knowledge discovery, or statistical analysis, one of the most important applications of data mining is prediction. Data prediction has two main approaches, supervised and unsupervised [17].

Classification is the process of assigning an item to the class to which it belongs, so it is a prediction process based on a rule. To classify the data, one must find rules that group the provided data instances into appropriate classes [18] [19] [20]. Cellular automata can be used successfully for data mining [21] [22] [23], because all decisions are made locally, depending on the state of each cell and the states of the neighboring cells. In this paper we have used as basic data the same set of fMRI data that we classified in a previous paper [23] using a one-dimensional CA. Here we classify those data with a two-dimensional CA which enhances in several ways the model proposed in [21]. This paper is an extended version of a short communication presented at an international conference [24].
II. Cellular Automata
There are many different types of cellular automata, depending on the differences of their components. These components are the states of the cell, the geometrical form of the lattice, the neighborhood of a cell, and the local transition function [1] [25]. One of the best known cellular automata is the game of life, introduced by John Conway [26]. The game of life is a very simple cellular automaton that has been proved to be computationally complete, being able (in principle) to perform any computation which can be done by digital computers, Turing machines or neural networks. The cellular automaton associated to the game of life is defined thus:
• The grid is rectangular and potentially infinite.
• The set of neighbors to a point in the grid consists of the point itself plus the eight adjacent points in the eight main directions in the compass (Moore's neighborhood).
• Each finite automaton has two states: empty (also called dead, represented by a zero or a space character) and full (also called alive, represented by a one or a star symbol *). The set of states is thus represented by the two Boolean numbers {0,1} or the two characters space and '*'.
• The transition function is defined by the following simple rules:
  o If the automaton associated to a cell is in the empty state, it goes into the full state if and only if the number of its neighbors in the full state is exactly three.
  o If the automaton associated to a cell is in the full state, it goes into the empty state if and only if the number of its neighbors in the full state is less than two or more than three.
  o In any other case, the automaton remains in the same state.
• Each time step is called a "generation". The set of all the cells alive at a given time step is called the "population".

The fact that the grid is potentially infinite makes the game of life difficult to implement. However, restricted versions, associated with a grid of finite dimensions, are very simple, at the cost of losing computational completeness. The transition function defines the next state of a cell depending on its current state and the states of its neighbors (which act as the input to the finite automaton in the cell). There are different ways to define the neighborhood. The most common neighborhoods are the Moore neighborhood and the von Neumann neighborhood, as shown in Figs. 1 below. Every cell uses the same update rules, which are applied to all the cells in the lattice simultaneously and synchronously. The update rule depends only on the neighbors of each cell, so the process is local.

Figs. 1. Common Neighborhoods: (a) Moore Neighborhood, (b) Von Neumann Neighborhood
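As a small illustration of the neighborhoods and the game-of-life rules listed above, the following Python sketch performs one synchronous update on a finite, non-wrapping grid; the finite grid is a simplification of the potentially infinite lattice described in the text, and the cell itself is excluded from the neighbor count used by the rules.

# One synchronous game-of-life step on a finite grid (0 = dead, 1 = alive).
MOORE = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)]
VON_NEUMANN = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(grid, neighborhood=MOORE):
    rows, cols = len(grid), len(grid[0])
    nxt = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            alive = sum(grid[i + di][j + dj]
                        for di, dj in neighborhood
                        if 0 <= i + di < rows and 0 <= j + dj < cols)
            if grid[i][j] == 0:
                nxt[i][j] = 1 if alive == 3 else 0        # birth rule
            else:
                nxt[i][j] = 1 if alive in (2, 3) else 0   # survival rule
    return nxt

blinker = [[0, 0, 0], [1, 1, 1], [0, 0, 0]]
print(step(blinker))   # the horizontal blinker becomes a vertical bar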
III. Data Mining

Classification is a supervised technique. This means that prior knowledge about the data is used to classify unknown data. Given an item, we want to determine to which class it belongs. To perform a classification, one must find a function or a set of rules that classifies the given test data samples into specific groups. This function is the output of a training technique that uses the training data samples [17], [27]. In a formal way:
- Let R^n, the n-dimensional space of real numbers, be our data universe, with points x ∈ R^n.
- Let S be a sample set such that S ⊂ R^n.
- Let f : R^n -> {-1, +1} be the target function for a binary classification problem.
- Let D = { <x, f(x)> : x ∈ S } be the training set (training examples or training samples).
Then we need to compute an approximate target function f' : R^n -> {-1, +1} using D, such that f'(x) ≅ f(x) for all x ∈ R^n.

Informally, while experimenting with a new classification algorithm, there are two phases, a training phase and a testing phase. The training phase uses a part of the dataset, called the training samples, to find the approximation function f'(x). In the testing phase, we apply this function f'(x) to another part of the dataset, called the testing samples. We then compare the results of the testing phase with the original classes of the testing samples and compute a few measurements to determine the quality of the new classification algorithm.

In clustering, there is no previous information about the data and the classes, which makes the problem more difficult than classification. The initial number and the identity of the classes are decided first, and then each data sample is assigned to a specific class, based on the nature of the data sample and a specific heuristic procedure such as K-means [23].
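To make the training and testing phases concrete, here is a minimal evaluation sketch; the nearest-mean classifier merely stands in for some approximation f' of the target function and is not the paper's method, and the synthetic data are invented.

import numpy as np

# Train/test evaluation of a stand-in classifier f': R^n -> {-1, +1}.
def nearest_mean_fit(X, y):
    mu_pos, mu_neg = X[y == 1].mean(axis=0), X[y == -1].mean(axis=0)
    return lambda x: 1 if np.linalg.norm(x - mu_pos) < np.linalg.norm(x - mu_neg) else -1

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) + np.where(rng.random(200) < 0.5, 1.5, -1.5)[:, None]
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)               # synthetic labels
train, test = slice(0, 150), slice(150, 200)             # training and testing samples

f_hat = nearest_mean_fit(X[train], y[train])
pred = np.array([f_hat(x) for x in X[test]])
accuracy = (pred == y[test]).mean()                      # fraction classified correctly
print(accuracy)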
In this research our interest is classification, because we have previous knowledge about the classes and the data from our previous fMRI experiments [24] [28]. To evaluate a new classification algorithm we need standard measurements to compare the new algorithm with other related algorithms. Common evaluation tools are used for this, such as the accuracy, sensitivity and specificity.

IV. Cellular Automata for Data Mining

A number of CA models useful for classification have been proposed in the literature. To the best of our knowledge, the first of them, Classification Cellular Automata (CCA), was proposed by Kokol et al. [22]. It uses a two-dimensional CA and a parameter (energy) that changes with time to provide a more accurate classification. In this model, each feature in the dataset (e.g. age, income, height...) is mapped to a column of the CCA, each column having a predefined threshold. Each cell may be in one of five possible states, four as in Fig. 2(a), plus the dead state. Depending on their state, cells may possess a given "energy", represented by the variable "i" in Fig. 2(a), which may take three values: high (represented by a dark color), low (a light color) or dead (white). The state of a cell informs the learning procedure of the relation between the sample value and the threshold, and whether the training sample was classified correctly.

For example, as shown in Fig. 2(b), we first select income, then height, then income again (with the thresholds listed for each one) and encode them into the CCA. In more detail, the income value in the first row of the dataset is 14,000 (less than 15,000, the threshold in Fig. 2(b)), so the mapping will be: Xi,j < tj with a low energy (i is false, because it is smaller by just a little). The value of income in the second row is 32,000 (greater than 15,000 by a lot), so the mapping will be: Xi,j > tj with a high energy (i is true). In the third row, the income value is 22,000 (greater than 15,000, but not by much), so the mapping will be: Xi,j > tj with a low energy.

The transition rules define the next state by comparing the current state of a cell with the states of its neighbors. The energy of the cell increases or decreases depending on the actual combination. If a cell's energy reaches zero, the cell stops working (is dead). The rules make the bottom training cells disappear after every timestamp. In the classification phase, a single test sample is put at the top of the CCA; then the test sample works its way down the classifiers, until a majority vote can be taken over all the classifiers, as in Fig. 2(c). The pseudo-code for the CCA algorithm is shown in Listing 1. This approach requires intuition to initialize the values of the energy parameter, and for the threshold selection. The model needs to be redesigned for each dataset. Another disadvantage is that it requires feature selection preprocessing, which is a time-consuming operation.

Figs. 2. CCA: (a) the four possible states; (b) an example of mapping the dataset to the CCA for three values; (c) the classification process for seven different values
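A small sketch of the value-to-state mapping just described; the 10,000 margin that decides between "high" and "low" energy and the state encoding are illustrative choices, not values given in the paper.

# Map a feature value against its column threshold to a CCA cell state,
# roughly as in the income example above. The margin of 10,000 deciding
# between "high" and "low" energy is an assumption made for illustration.
def encode_cell(value, threshold, margin=10_000):
    above = value > threshold
    far = abs(value - threshold) > margin
    energy = "high" if far else "low"
    return ("X > t" if above else "X < t", energy)

for income in (14_000, 32_000, 22_000):
    print(income, encode_cell(income, 15_000))
# 14000 -> ('X < t', 'low'), 32000 -> ('X > t', 'high'), 22000 -> ('X > t', 'low')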
Listing 1. CCA Algorithm
// Learning phase
Input:
  1. Training set with n training samples
  2. Number of iterations T
Output: CCA with highest classification accuracy on the training set
1. for t = 1 to T
2.   choose a learning sample I
3.   fill the automaton
4.   the cells in the automaton classify the learning sample I
5.   change cells' energy according to the transition rules
6.   cells with energy below zero do not survive
7. end

// Testing phase
Input:
  A test sample
  Number of iterations V
Output: Class of the input sample
1. for t = 1 to V
2.   the cells in the automaton classify the sample
3.   change cells' energy according to the transition rules
4.   each cell with energy below zero does not survive
5. end
6. Classify the sample according to the weighted voting of the surviving cells.
A different model was proposed by Fawcett in his paper [21]. The CA grid is initialized with training instances (see Fig. 3) and the CA is run with a flat space boundary condition. Each cell state represents the class of that point in the instance space, so the cells organize themselves into regions that have the same class. The advantage of this CA model is its simplicity, which makes it possible to implement it in hardware, so it will run much faster than other data mining methods. The state transition for this model uses a voting rule which assigns a new state to each cell according to the number of neighbors (in a von Neumann neighborhood) in a specific class. A non-stable n4_V1 rule is used, which examines each cell's four neighbors and assigns the new state of the cell according to the majority class. With this procedure, the class of a given cell may change if the majority class changes. The rule is defined thus, for a CA that must classify the data into two different classes (1 and 2):

Non-stable n4_V1:
  0                      : class 1 neighbors + class 2 neighbors = 0
  1                      : class 1 neighbors > class 2 neighbors
  2                      : class 1 neighbors < class 2 neighbors
  Random(Class1, Class2) : class 1 neighbors = class 2 neighbors

Fig. 3. Mapping the training data-set to CA initial values with Fawcett's CA model

V. Proposed Algorithm

We propose an enhancement to this model by using the Moore neighborhood, which checks the states of eight cells in all directions. This neighborhood speeds up the operation and makes the classification process more accurate. We have also modified the transition rule, so that when the number of class 1 neighbors equals the number of class 2 neighbors, we assign a new class (x) to the cell, as shown in Fig. 4. This change prevents these cells from changing or affecting the voting process in the next time step. However, at the end of the process, all cells whose state corresponds to the new class are changed randomly to one of the target classes.
Modified non-stable n4_V1:
  0 : class 1 neighbors + class 2 neighbors = 0
  1 : class 1 neighbors > class 2 neighbors
  2 : class 1 neighbors < class 2 neighbors
  x : class 1 neighbors = class 2 neighbors

Figs. 4. Example of our new model voting process over generations i to i+3: (a) cells in blue are assigned to class A; (b) cells in green are assigned to class B; (c) cells in brown are currently undecided (they belong to classes 0 or x)
Listing 2: The Algorithm Description

Input : A = Array[r,c], each cell represents the class of a record in the DB
Output : A with classified values

Generation = 0
While there is a cell with class 0 do {
  For i = 1 to rows_size do
    For j = 1 to columns_size do {
      Check the classes of the 8 neighbors of A[i,j]
      If class 1 neighbors + class 2 neighbors = 0 then A-temp[i,j] = 0
      else If class 1 neighbors > class 2 neighbors then A-temp[i,j] = class 1
      else If class 1 neighbors < class 2 neighbors then A-temp[i,j] = class 2
      else A-temp[i,j] = x
    }
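The following Python sketch is one way to read Listing 2 and the modified rule above; the grid encoding (0 = unclassified, 1/2 = classes, "x" = tie) and the final random resolution follow the description in Section V, but the helper names are ours and the code assumes the grid is seeded with enough training instances for the loop to terminate.

import random

# One possible rendering of the modified Moore-neighborhood voting rule.
MOORE = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)]

def vote_step(grid):
    rows, cols = len(grid), len(grid[0])
    nxt = [row[:] for row in grid]
    for i in range(rows):
        for j in range(cols):
            counts = {1: 0, 2: 0}
            for di, dj in MOORE:
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols and grid[ni][nj] in counts:
                    counts[grid[ni][nj]] += 1
            if counts[1] + counts[2] == 0:
                nxt[i][j] = 0
            elif counts[1] > counts[2]:
                nxt[i][j] = 1
            elif counts[1] < counts[2]:
                nxt[i][j] = 2
            else:
                nxt[i][j] = "x"          # tie: excluded from the next vote
    return nxt

def classify(grid):
    while any(cell == 0 for row in grid for cell in row):
        grid = vote_step(grid)
    # at the end, tied cells are resolved randomly, as described in Section V
    return [[random.choice((1, 2)) if c == "x" else c for c in row] for row in grid]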
    y_soft = { sign(X) · (|X| − T),  |X| > T
             { 0,                    |X| ≤ T                               (4)

    y_hard = { X,  |X| > T
             { 0,  |X| ≤ T                                                 (5)
where y_soft and y_hard are the wavelet coefficients obtained by thresholding with threshold value T.

• Encoding zero values. After truncating the small-value coefficients of each subband, compression can be achieved by efficiently encoding the resulting zeros. There are several coding techniques. One way is to replace the original vector by two vectors: the first stores the coefficients without the zeros; the second stores the start and the end of each sequence of zeros. Another approach is to encode consecutive zero-value coefficients with two bytes: one byte to indicate a sequence of zeros in the thresholded vector and a second byte to represent the number of consecutive zeros [2].

• Quantization. The encoded coefficients of each subband are converted to other coefficients with fewer possible discrete values. There are many quantization methods, such as scalar and vector quantization.

• Entropy encoding. The quantized data of each subband still contain some redundancy, which wastes space. To remove it, an entropy coder such as Huffman coding [7] or arithmetic coding is used.

At the decoder, the received bit stream of each frame is used to decode (entropy decoding) and dequantize (quantization decoding) the compressed subbands. Then, the compressed subbands are decoded (zero-value decoding) to obtain the subbands. Finally, the inverse discrete wavelet transform (IDWT) is applied in order to reconstruct the audio frame.
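A compact sketch of the threshold-then-encode-zeros stage described above, using PyWavelets for the DWT; the wavelet name, the keep-ratio rule and the run-length format are illustrative choices, not the codec's actual parameters.

import numpy as np
import pywt

# Illustrative DWT compression front-end: decompose, threshold, run-length encode zeros.
def compress_frame(frame, wavelet="db4", level=3, keep=0.2):
    coeffs = pywt.wavedec(frame, wavelet, level=level)        # subband decomposition
    flat, slices = pywt.coeffs_to_array(coeffs)
    cut = np.quantile(np.abs(flat), 1.0 - keep)               # keep the largest 20% (assumed rule)
    thresholded = pywt.threshold(flat, cut, mode="hard")
    return run_length_encode(thresholded), slices

def run_length_encode(x):
    """Store non-zero values as (value, preceding zero-run) pairs; a toy zero-run format."""
    out, zeros = [], 0
    for v in x:
        if v == 0.0:
            zeros += 1
        else:
            out.append((float(v), zeros))
            zeros = 0
    out.append((0.0, zeros))                                  # trailing zeros
    return out

frame = np.random.randn(256)                                  # stand-in for one speech frame
encoded, layout = compress_frame(frame)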
III. Optimized Wavelet Filters for Speech Compression

In [12], it was shown that the choice of an optimal mother wavelet is essential for an optimum wavelet speech compressor. Several criteria can be considered for choosing an optimal mother wavelet, with the purpose of maximizing the signal to noise ratio (SNR) and minimizing the variance of the reconstruction error [11]. Therefore, in this work, different wavelet filters are optimized and used as mother wavelets for the speech compression algorithm based on the DWT. For optimizing the wavelet filters, a quadrature mirror filter (QMF) bank design has been used, based on the windowing technique and linear optimization. Fig. 2 shows the QMF analysis and synthesis. The input-output relation of a QMF bank in the Z-transform domain is given by (6):

Fig. 2. Quadrature Mirror Filter Bank (analysis: H0(z) and H1(z) followed by decimation by 2; processing; synthesis: interpolation by 2 followed by G0(z) and G1(z), summed to y(n))

    Y(z) = (1/2) [ H0(z) G0(z) + H1(z) G1(z) ] X(z)
         + (1/2) [ H0(−z) G0(z) + H1(−z) G1(z) ] X(−z)                     (6)

where X(z) is the original signal and Y(z) is the reconstructed signal. From (6), the output signal Y(z) is composed of two terms, each multiplied by the original signal X(z). The first is called the distortion transfer function, and the second is the aliasing transfer function, which can be eliminated by the condition given in (7). The QMF bank design can satisfy the symmetry condition and alias cancellation:

    H1(z) = H0(−z)
    G0(z) = H0(z)
    G1(z) = −H0(−z)                                                        (7)

Substituting (7) into Eq. (6) gives:

    Y(z) = (1/2) [ H0²(z) − H0²(−z) ] X(z)                                 (8)

Hence the complexity of the QMF bank design is reduced to the design of a single prototype filter H0(z).

Let z = e^{jω} and let H0(z) be a finite impulse response (FIR) filter of order N. Using Eq. (8), the transfer function of the QMF bank is expressed as follows:

    T(e^{jω}) = (e^{−jωN} / 2) { |H0(e^{jω})|² + |H0(e^{j(ω−π)})|² }       (9)

The perfect reconstruction condition is given by Eq. (10):

    |H0(e^{jω})|² + |H0(e^{j(ω−π)})|² = 1                                  (10)

In a QMF bank, perfect reconstruction is possible if the condition given by Eq. (11) is satisfied [13]:

    |H0(e^{jπ/2})| = 0.707                                                 (11)

Therefore, the cutoff frequency is adjusted so that the response at the frequency ω = 0.5π is, in the ideal case, approximately 0.707.
There are different algorithms for designing wavelet filter banks with perfect reconstruction, given in [13] and [14]. They are based on the windowing technique and linear optimization of the cutoff frequency. In [13], the authors improved the algorithm given in [14] in terms of reconstruction error, computation time and number of iterations. However, in speech compression using wavelets, the computation time and the number of iterations are not critical, because the wavelet filters used are generally saved in a database (MATLAB toolbox). Consequently, in this work a simple algorithm is used to optimize the wavelet filters. It consists in initializing and iteratively incrementing the value of the cutoff frequency (ωc) until equation (11) is satisfied. The block diagram of the proposed optimization is illustrated in Fig. 3.
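The iterative cutoff-frequency search just described can be sketched as follows; it assumes scipy.signal.firwin as the windowing-based prototype design, and the step size, tolerance and search range are illustrative.

import numpy as np
from scipy.signal import firwin, freqz

# Sketch of the cutoff-frequency adjustment: increase omega_c until the prototype
# filter H0 satisfies |H0(e^{j pi/2})| ~ 0.707 (Eq. (11)). Step and tolerance are assumed.
def optimize_prototype(order=32, window="hann", step=1e-4, tol=1e-3):
    wc = 0.40                                   # initial normalized cutoff (1.0 = Nyquist)
    while wc < 0.70:
        h0 = firwin(order + 1, wc, window=window)
        _, h = freqz(h0, worN=[np.pi / 2])      # response at omega = pi/2
        if abs(np.abs(h[0]) - 1 / np.sqrt(2)) < tol:
            return h0, wc
        wc += step                              # increment the cutoff and redesign
    raise RuntimeError("no cutoff found within the search range")

h0, wc = optimize_prototype()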
Fig. 3. Block diagram of the proposed optimization (initialize the window type, filter order N, ωc, pass-band ripple, stop-band attenuation, MRI, tolerance tol, epsilon ε and a counter count; design the prototype filter using the windowing technique, calculate MRC and the error = |MRI − MRC|; test whether the error is within the tolerance)

If the error is not within the tolerance, the cutoff frequency is updated, the counter (count) is incremented by one and the prototype filter is redesigned using the new cutoff value.

IV. Simulink Design and Real-Time Implementation of the DWT Codec

For rapid development and real-time implementation, the graphical programming language MATLAB/Simulink and the Real-Time Workshop (RTW) are exploited. Fig. 6 illustrates the proposed design of the DWT speech codec in MATLAB/Simulink, in which the DWT algorithm processes data from the line input or the microphone input, converted into digital data by the analogue to digital converter (ADC) of the AIC23 audio codec (Fig. 4). The McBSP (multichannel buffered serial port) and the EDMA (enhanced direct memory access) controller (Fig. 6) are used to handle the data transfer efficiently without intervention from the DSP. The bidirectional serial port McBSP2 is used to transfer the audio data back and forth from the AIC23. The EDMA receives every 16-bit signed audio sample from the McBSP2 and stores it in a buffer in memory until it can be processed. The unidirectional serial port McBSP1 is used to configure or control the AIC23 codec parameters, such as the sample rate, volume and data format. After the audio data are processed by the DSP, the EDMA controller sends the data back via McBSP2 to the digital to analogue converter (DAC). After the signal acquisition, the proposed speech processing design proceeds as follows.
error. (n is the number of replicas, nodei represents where the replicas is, and ) y If the file size of global index is bigger than a threshold, partition the global index file on index attribute, and update the index catalog. 2) Index Access When the query is with predicate of high selectivity on the index attribute, the index access method will be chosen. Before the introduction of algorithm index access, we discuss the improved output-format of global index according to our scenario, and define the concept of index-tree used in our algorithm. An important case is that there are 3 data replicas in our applications, and one replica one layer.
III.2. Overview
The system consists of four parts, as shown in Fig. 2. The bottom is the storage layer, the DBMS. On top of the DBMS is HDFS, which not only stores the system metadata and query result sets, as in HadoopDB, but also adds a cache layer and a global index layer. The top is the MapReduce system, responsible for parallelization and fault tolerance. The middleware contains a global indexer, including an index catalog and an index cache. The indexer creates a global index from the loaded local indexes of tables with large-scale data sets. It uses MapReduce to join the corresponding separate local indexes into a global index, and deletes the local indexes after success. If the global index is too big to search, the indexer automatically partitions it into files of appropriate size.
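Purely to make the join step concrete, a toy single-process stand-in (not the authors' MapReduce job; the data layout is an assumption) groups local-index entries by index-attribute value:

```python
from collections import defaultdict

def merge_local_indexes(local_indexes):
    """local_indexes: iterable of (node, [(attr_value, rowid), ...]) pairs,
    one per DBMS node.  Returns {attr_value: [(rowid, node), ...]}, i.e. one
    global-index record per index-attribute value."""
    grouped = defaultdict(list)
    for node, entries in local_indexes:          # 'map': emit (key, value) pairs
        for attr_value, rowid in entries:
            grouped[attr_value].append((rowid, node))
    # 'reduce': collect all replicas of a value into one global-index record;
    # ordering by network layer is approximated here by sorting on the node id
    return {attr: sorted(pairs, key=lambda p: p[1])
            for attr, pairs in grouped.items()}
```

In the real system this step runs as a MapReduce job over HDFS, and the result is written back as the global index file(s).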
Fig. 1. Architecture of Earthquake Precursor Network Data Management System
Besides, considering simplicity, consistency and space utilization, we write the <rowidi, nodei> pairs ordered by the layer (from top to bottom) of the node, and omit the column node1, since there is only one node in the top layer, during the index creation described above. Thus the output format of the global index in our scenario becomes <index attribute, rowid1, rowid2, node2, rowid3, node3>. The index-tree is a subset of the precursor network mentioned in Section II. Each node points to a list of pointers, each of which points to a desired rowid value that lets us locate a record of that node. In other words, the index access algorithm below constructs the index-tree from bottom to top.
a. Insert leaf nodes into the index-tree.
• Scan the (partition) global index and mark the rows whose index-attribute value satisfies the predicate; every distinct node appearing in the column node3 of these rows is a leaf node.
• Create a pointer for every marked row; each pointer consists of the identifier of a (partition) global index file and an offset within the file that identifies the row, and is assigned to the corresponding leaf node.
b. Insert non-leaf nodes into the index-tree.
• Insert the parents of the leaf nodes and the root node into the index-tree according to the topological relationships of the precursor network. Each non-leaf node temporarily points to an empty list of pointers.
c. Update the list of pointers for each node.
• Let N be the number of nodes in the index-tree and P be the total number of pointers of the leaf nodes. Then each node should have P' = ⌈P/N⌉ pointers on average.
• Do in parallel for each leaf node: if the node has more than P' pointers, keep the first P' pointers and move the rest to the list of its parent node.
• Do in parallel for each non-leaf, non-root node: if the node has more than P' pointers, keep the first P' pointers and move the rest to the list of the root node.
• Do in parallel for each node: re-point each pointer to the desired rowid value based on the previous row-location information in the pointer and the layer of the node.
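A minimal sketch of this construction (assumed data structures, not the authors' code; the final re-pointing of pointers to per-layer rowid values is omitted):

```python
import math
from collections import defaultdict

def build_index_tree(matched_rows, parent_of, root):
    """matched_rows: [(file_id, offset, node3), ...] for the global-index rows
    whose index-attribute value satisfies the predicate.
    parent_of: dict mapping each node of the precursor network to its parent
    (the root maps to None).  Returns {node: [(file_id, offset), ...]}."""
    tree = defaultdict(list)

    # a. leaf nodes: every distinct node3 of the marked rows gets the pointers
    for file_id, offset, leaf in matched_rows:
        tree[leaf].append((file_id, offset))
    leaves = set(tree)

    # b. non-leaf nodes: parents of the leaves plus the root, empty for now
    for leaf in leaves:
        node = parent_of[leaf]
        while node is not None:
            tree.setdefault(node, [])
            node = parent_of[node]

    # c. balance the lists: each node keeps at most P' = ceil(P/N) pointers
    quota = math.ceil(sum(len(v) for v in tree.values()) / len(tree))
    for leaf in leaves:                              # leaf overflow -> parent
        tree[parent_of[leaf]].extend(tree[leaf][quota:])
        tree[leaf] = tree[leaf][:quota]
    for node in [n for n in tree if n not in leaves and n != root]:
        tree[root].extend(tree[node][quota:])        # middle overflow -> root
        tree[node] = tree[node][:quota]
    return dict(tree)
```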
IV. Experiments
IV.1. Configurations
The experiments are conducted on a cluster of 4 nodes connected by gigabit Ethernet. Each node has two quad-core Intel X5550 2.6 GHz processors, 16 GB of memory, and a 128 GB RAID level 1 disk. The operating system is Ubuntu 11.04 x86_64. Hadoop 0.20.203 is set up on the cluster, and Oracle 10201 x86_64 runs on each worker node. The data schema is a single table, data. It has 1 string, 3 integer and 1 blob attributes, namely startdate, stationid, pointid, itemid and obsvalue, and its primary key is the union of the first 4 attributes. stationid, pointid and itemid are uniformly distributed in the integer ranges [1,10000), [1,10), and [1,100), respectively. startdate starts from 01-Jan-12 in the benchmark and increases by 1 day every 40000 records. obsvalue stores the binary information of an observation sequence, and its average size is 8.7 KB. The table has about 10,000,000 records, and the space occupancy is 95 GB. For ease of testing, we assume that the Precursor Network contains 7 nodes, i.e. 1 national node, 2 regional nodes, and 4 station nodes. Each regional node manages 2 station nodes, and the national node manages all regional nodes. Since there are only 4 machines in our cluster, we deploy the nodes within the same layer of the network on one machine and use the remaining machine as the master node of Hadoop. A data generator is designed to produce records. It yields 25% of the data records (2.5 million) per station node. Data records are then uploaded from lower nodes to upper nodes, so that finally each regional node contains 50% of the data records (5 million) and the national node contains the whole 10 million records. The query we use is: SELECT startdate, stationid, pointid, itemid, samplerate, obsvalue FROM data WHERE stationid IN (stationid_list) [and pointid IN (pointid_list)]; It finds data by the stationid and pointid attributes with a predicate on the stationid attribute, and the WHERE clause may vary across experiments. The data in HDFS of our system are in text format with columns separated by space characters, and are configured with 3 replicas in 64 MB block granularity. Other parameters not mentioned before use default values.
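A hypothetical data generator mirroring this schema (field names from the paper; the blob size and random ranges are simplifications) could look like:

```python
import os
import random
from datetime import date, timedelta

def generate_records(n_records, start=date(2012, 1, 1)):
    """Yield benchmark-style records: startdate advances by one day every
    40000 records; obsvalue is a fixed-size stand-in for the ~8.7 KB blob."""
    for i in range(n_records):
        yield {
            "startdate": start + timedelta(days=i // 40000),
            "stationid": random.randrange(1, 10000),
            "pointid":   random.randrange(1, 10),
            "itemid":    random.randrange(1, 100),
            "obsvalue":  os.urandom(8 * 1024),
        }
```

Each station node would draw 2.5 million such records, which are then aggregated upwards to the regional and national nodes as described above.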
III.4. Query Execution
As stated above, the (partition) global index is created before online querying. The query execution algorithm is as follows:
a. Query on a table without a predicate.
• Query the table in the national node using its DBMS.
b. Query on a table with a predicate.
• Load the desired (partition) global index file from HDFS into the index cache, according to the index catalog and the predicate. If the file is already cached, skip this step.
• Call the index access algorithm to obtain the index-tree to be accessed next.
• Do in parallel for each node of the index-tree: query the table of the corresponding DBMS directly on the rowid values of the node's pointer list, sequentially.
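As a rough illustration of the parallel fan-out in step b (the fetch callable and its signature are placeholders, not the system's API):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_fetch(index_tree, fetch_fn, max_workers=8):
    """index_tree: {node: [pointer, ...]} as returned by build_index_tree.
    fetch_fn(node, pointers): caller-supplied function that resolves the
    pointers to rowids and queries that node's DBMS, returning its records."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch_fn, node, ptrs)
                   for node, ptrs in index_tree.items()]
        return [record for f in futures for record in f.result()]
```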
IV.2. Initialization
The execution time (both cache miss and cache hit) is much shorter than that of Oracle. The cache-miss case reduces the elapsed time by 81.2% on average, and, once the cost of loading the cache (265 ms per partition global index) is excluded, the cache-hit case yields a further 76.2% reduction on average. In a word, our system (cache hit) spends on average only 5% of the normal Oracle running time. This significant performance improvement comes from the fact that the hybrid system partitions a single query among multiple DBMSs, executes the sub-queries in parallel and uses rowid to locate records directly.
We report the initialization of our system; Table I gives the results. The creation time for the global index is broken down into several phases. The local index creation is the most expensive step in the process: it takes nearly half the total time, 8.2 minutes. Of the remaining half spent on global index creation, joining the local indexes into the global index takes 6.1 minutes (76.9%), and partitioning the global index takes 1.8 minutes (23.1%). The final global index occupies 877.4 MB, which is only 0.9% of the data in the DBMS. In summary, our hybrid system required as little as 16 minutes to completely initialize the whole 95 GB dataset, and since only a small volume of indexes (0.9%) is loaded into HDFS, loading is not the bottleneck.
TABLE I
INITIALIZATION OF SYSTEM
Index creation time (s): Local index creation 493.5; Global index creation 473.7; Total time 967.2
Index size (MB): Total size 877.4; Each partition size 97.6
V. Conclusion
We propose a new system architecture that takes the DBMS as the underlying storage and execution unit, and Hadoop as an index layer and a cache. For queries with a predicate of high selectivity, the global index mechanism is much more efficient than Oracle. As for data loading, loading only a few tables and indexes is an effective way to keep the system free of a loading bottleneck. In the future, we will improve the performance and scalability of the system to further leverage the strength of distributed query processing.
IV.3. Query with Predicate
We now test the query with a predicate. The query processes records with stationid values in stationid_list and sometimes, at the same time, with pointid values in pointid_list. When a predicate of high selectivity exists, an index can be used to accelerate the execution. We select stationid=50000 [and pointid=5], and stationid∈[50000,50010], respectively, to repeat the experiment. Since the values of stationid and pointid are both uniformly distributed, the size of the result set, rather than the specific values, determines the performance. The index cache is cleared before execution. Fig. 3 illustrates the benefit of using our hybrid system.
Acknowledgements This research was supported by the National Natural Science Foundation of China (Grant No.61033009, No. 61133005 and No.61100066).
Fig. 3. Running Time for the Query with Predicate (running time in ms for Oracle, Hybrid System with cache miss, and Hybrid System with cache hit; result-set sizes of 8, 119 and 1134 rows for stationid=50000 and pointid=5, stationid=50000, and stationid∈[50000,50010], respectively)
Authors’ information
1 National High Performance Computing Center, Hefei, China.
2 University of Science and Technology of China, Hefei, China.
3 Lab. of Parallel Software and Computational Science, Institute of Software Chinese Academy of Sciences, Beijing, China.
4 State Key Lab. of Computer Science, Institute of Software Chinese Academy of Sciences, Beijing, China.
5 University of Chinese Academy of Sciences, Beijing, China.
Tao Luo received her B.E. in Computer Science from University of Science and Technology of China (USTC) in 2010. Now she is a Ph.D. candidate in Computer Science at USTC. Her research interests include Big Data, Parallel Algorithm and High Performance Computing.
Wei Yuan received his B.E. in Computer Science and Technology from Wuhan University in 2011. Now he is a Ph.D. candidate in University of Chinese Academy of Sciences. His research interests include Information Retrieval, Big Data and High Performance Computing.
Pan Deng received her B.E. in Software Engineering from Harbin Institute of Technology in 2004, and Ph.D. in Computer Science from Beihang University in 2011. Now she is a Research Associate in Lab. of Parallel Software and Computational Science, Institute of Software Chinese Academy of Sciences. Her research interests include Distributed and Parallel Computing, Big Data Parallel Processing. Yunquan Zhang received his B.E. in Computer Science from Beijing Institute of Technology in 1995, and Ph.D. in Computer Science from Institute of Software Chinese Academy of Sciences in 2000. Now he is a Professor, Vice Chief Director in Lab. of Parallel Software and Computational Science, Institute of Software Chinese Academy of Sciences. His research interests include High Performance Computing, Performance Evaluation and Parallel Computational Model. Guoliang Chen received his B.Sc in Computer Science from Xi’an Jiaotong University in 1961. Now he is a Professor in School of Computer Science and Technology, University of Science and Technology of China. His research interests include Parallel Algorithm, Computer Architecture and High Performance Computing.
International Review on Computers and Software (I.RE.CO.S.), Vol. 8, N. 2 ISSN 1828-6003 February 2013
Model Construction for Communication Gap of Requirements Elicitation by Stepwise Refinement Noraini Che Pa1, Abdullah Mohd Zin2
Abstract – Communication breakdown between customer and developer is one of the issues that affect the software requirements elicitation process. There are various methods currently being used to address this issue, among which are support tools, models, and techniques. However, more often than not, both parties have to move towards a solution that can help to resolve the communication problem. Previous studies have indicated the importance of effective communication in requirements elicitation. Nonetheless, communication models are not comprehensive enough to address such problems. This paper demonstrates a constructing model which would fill in the communication gap between customer and developer during the requirements elicitation process. The model covers three aspects of the communication gap (which are gaps between knowledge, expression, and medium). In addition, this model is able to guide in describing communication problems arising during the early stage of communication. Further, it can suggest ideas to prepare plans for solving any problems which may arise. Copyright © 2013 Praise Worthy Prize S.r.l. - All rights reserved. Keywords: Communication Model, Communication Gap, Requirements Elicitation, Requirements
I. Introduction
In many organizations, software is considered one of the main assets by which an organization is able to enhance its operations and compete at both the national and global level. Hence, it is important to ensure that the software applications used within the organization are capable of supporting their customers' requirements. It is therefore a great challenge to develop software that is useful and, at the same time, satisfies the customer's needs. This leads to careful consideration of important criteria during the software development process, and essentially during the software requirements elicitation process. Requirements engineering involves a number of stakeholders who handle a body of information about the problem domain and its solution [1], [2]. Traditionally, information systems suffer from an impedance mismatch between software requirements elicitation and system development. Requirements are understood in terms of functionality and non-functionality [3], while the system itself is conceived as a collection of modules, entities, data structures, and interfaces. This mismatch is known as the communication gap, which is one of the main factors that often results in system failure or dissatisfied customers. One cause of this mismatch is incomplete or inaccurate information during requirements gathering, whereby the software requirements are the main reference obtained through oral or written processes.
This shows that the main gap leading to the existence of the communication gap occurs during the process of requirements elicitation. Hence, it is important that each communication gap be thoroughly reviewed by both developers and customers in order to reach an agreement on the input and medium gaps. At present, several studies have been conducted on the practices of requirements elicitation. However, to the best of our knowledge, little research has been carried out to study the communication gap between customers and developers during the process of requirements elicitation. The remainder of this paper is organized as follows: Section two describes in detail the project background for software requirements elicitation. Section three elaborates upon the methodology of developing the model. Section four presents the model construction, while section five discusses the communication gap model. Finally, section six concludes with the contributions of the paper, relates them back to the literature, and gives some indications for future work.
II. Background Research
II.1. Software Requirements Elicitation
Requirements elicitation is a process of seeking, uncovering, acquiring and elaborating on requirements for developing a computer-based system. For most of the projects, the requirement elicitation process will end at
the analysis phase of software development. Requirements elicitation is essential during the software development process, since poor software requirements result in costly rework, schedule overruns [4], poor quality systems, stakeholders' dissatisfaction, and even project failure. To address this problem, a number of research studies have been carried out on practices, problems, and issues in requirements elicitation. However, the process is complex, as it involves various activities, techniques, approaches, and support tools [5], [6]. Activities involved during requirements elicitation include requirements discovery, requirements classification and organization, requirements prioritization and negotiation, as well as requirements documentation [7]. Requirements elicitation is also viewed as part of a negotiation process among stakeholders in order to achieve an agreement on the system to be developed [8]. The criteria often considered during negotiation are: the interaction to be supported and examined; the final decision to be maintained to present the project; and the negotiation process to be carried out by intuition in order to avoid bias [9]. Meanwhile, the process of documenting the requirements includes activities such as creating the software requirements specification (SRS), reviewing the SRS content and checking the SRS result. These activities are carried out to ensure that the document created adheres to the relevant quality standard and satisfies the customer.
In software engineering, stepwise refinement is a method that underlies all top-down approaches [10]. Stepwise refinement consists of an incremental development and a sequence of steps in a study context [11]. Each step implies some design decisions, underlying criteria and the existence of alternative solutions.
III.1. Model Construction
Based on stepwise refinement, a communication gap model is constructed to illustrate the communication problem in general as faced by both customers and developers. It considers the theory and concepts of requirements elicitation and communication derived from the literature and from empirical findings, before the model is applied and tested in real case studies. This process involves four phases: generalizing refinement, analytical refinement, procedural refinement and evaluation refinement.
1. Generalizing Refinement
In this phase, the processes of requirements elicitation and the communication activities are reviewed in depth. This considers findings on practices, issues and problems during the requirements elicitation process. Next, the research continues with designing questionnaires and testing the model through a pilot study. Data collected from the pilot study are then analyzed to produce pilot reports so that modifications to the questionnaires can be implemented before the real survey is conducted. A communication gap arises from differences between the system needed by the customer and the system as implemented by the developer. Fig. 1 shows the communication gap in general.
II.2. Communication for Software Requirements
Communication is the activity of sending and receiving messages from a source to a receiver through various media; in a requirements elicitation process, these are the customers and developers. There are several important components under consideration during communication: the medium, the sender, the receiver, and the content of the messages, which relates to the input and output of both parties. Information from the customer, which is often delivered verbally rather than in writing, is used to produce the Software Requirements Specification (SRS) document. Communication enables a mutual understanding to evolve between developers and customers. The basic assumption is that a lack of understanding between both parties may block effective communication. The most common problems hindering the identification of the user's needs include poor communication, resistance, articulation, and diversity of perspective.
Fig. 1. Communication Gap in General
2. Analytical Refinement Empirical study involves preparing requirements elicitation studies within an intended environment, conducting a survey, and analyzing the data collected from the environment. The purpose of the survey is to gather data and information from various agencies who are involved in the requirements elicitation process. The study then proceeds with interviews between customer and developer. Findings from this phase will be used as the basis for producing the specification and requirements for the proposed communication gap model.
III. Methodology
This research is carried out using stepwise refinement in developing the model.
Past research studies have shown that there are five categories of communication problems, which are namely: input, personality, communication skills, medium, and procedure [12]. In the process of software development, the customer will usually state their requirements in the form of expression, while the developer will implement the system based on software specifications as drawn from the customer. Knowledge gap results from system differences between what is required and what is actually conveyed by the customer, when the customer fails to describe relevant information to the developer. Normally, a customer will state their domain knowledge and their heuristics on how the system is supposed to work, while the developer carries their domain knowledge on both applications and technology, and the nature of the system [13], [14]. Both customer and developer agree that knowledge of application domains has a significant impact on project performance [15]. Siau and Tan [16] state that there are few personality factors contributing to this problem (such as unconscious or accidental implementation of some task by certain individuals). This is possible since it may be very hard for the customer to imagine their daily work accurately and completely. Sometimes they use conventional approaches in decision-making and heuristic approaches at other times in order to adapt to the new environment. Besides that, it is difficult for a customer to represent the detail of the domain, as well as allowing for clear expression of the business logic-based requirements [14]. In relation to human beings, it is known that the memory capacity for an individual is limited to certain information, whether in the form of problem solution or information processor. On the other hand, the knowledge gap occurs due to differences in translation of understanding into requirement specifications by the developer. Fig. 2 shows the knowledge gap that has contributed to the communication gap.
Usually customers face problems in translating their needs in the form of expression required by the developer, mainly due to insufficient knowledge. Normally, both customer and developer use natural language and some structured form, such as templates or other forms of communication. The developer then needs to translate this information to more formal representations as a model for software development [17]. This task implies time-consuming interpretation which is prone to error. Cybulski et al. [18] also express similar experiences in elicitation of software requirements. On the other hand, knowledge gap and expression gap are caused by differences in input obtained by developers against translation of input understanding into requirement specifications. This chasm is caused by insufficient knowledge that is only accessible through real-life experiences, or by sharing expertise and knowledge from formal education [19]. One issue is writing skill, which is crucial in order to be a good developer [19]. This skill involves the preparation of system proposals, system requirements, system documentation, training manuals, and even replying to e-mails. Such skills are needed to ensure the success of tasks performed by a developer in translating verbal information into textual form. Another issue is that of verbal skills [19]. Development of software specification must be guided by customer needs [20]. As such, a developer must use every input received from a customer to generate specification.
Fig. 3. Communication Gap in Procedure
4. Evaluation Refinement The purpose of model application is mainly for evaluation. The phase involves developing support tools and conducting case studies based on two organizations in Malaysia. Applications of the case studies are used to test and validate the proposed model by assessing the actual organizational environments, followed by model refinement if necessary. From the empirical result and evaluation refinement, it can be seen that input gap and medium gap are major contributors to the existence of communication gap. This is also supported by result analysis using measurement of a case study [21]. This occurs where customers have
Fig. 2. Communication Gap In Analytical
3. Procedural Refinement The expression gap illustrated in Fig. 3 exists because customers elicit their user requirements in a document form. Expression gap signifies the differences between the documented system requirements (as described by the customer) to the input (as conveyed to the developer).
some knowledge and understanding of the system requirements and the developer uses various medium to arrive at a shared understanding of those requirements [14]. Meanwhile, medium gap occurs due to differences between input administered by the customer and input obtained by the developer. This gap occurs due to the communication medium used, such as telephone, meeting, face-to-face discussions or email. For example, in a meeting, a stakeholder would be likely to use different terms, (ie. the same words but different meanings).At the same time, they could disregard the information that they consider to be common knowledge, since it is not necessary for the others [17]. Factors that contribute to this problem include customers and developers alike having differences in aspects of communication skills, such as being diplomatic in conveying information and giving directions, especially in the case of non-verbal communication [19]. With regard to the software development process, customers state their requirements by expression, while developers develop the system based on software specification as expressed by the customer. Nonetheless, the situation possibly contributes to information differences between what is being conveyed and what is being accepted. The proposed Communication Gap Model for Requirements Elicitation has been applied and tested using real case studies that involved collaboration with two organizations in Malaysia. In the case studies, two main systems operating in their environment have been selected and assessed. System assessment was performed by means of discussion and evaluation among customers and developers. We refer to one of the case studies as the “AbC system”. In this example, product AbC is a system that has been developed through a collaboration of several consortiums. The development of this project is classified as a large project because it involved numerous functions and high integration with other systems. The project development began in January 1999 with the intention to replace existing systems that were being used at that time. The project was divided into eleven modules, which are namely: Personal Record Management Module, Human Resource Acquisition Module, Competency Assessment Module, Career Management Module, Performance Management Module, Allowance Management Module, Interest and Reward, Communication Management Module and Employee Discipline, Formulation Module and Strategy Evaluation, Service Termination Module, Information Service Module, and System Admin Module. The requirements elicitation process for this project development involved upgrading of three modules, namely: Module Management Pension, Dividend and Reward Module, and Competency Evaluation Module. The purpose of assessment was to ensure that the
system could be practically implemented, and that implementation is easy and effective. Implementation of the evaluation process involved three phases, namely: Preparation Phase, Implementation Phase, and Assessment Phase (refers Table I). This evaluation process is also divided into two levels, both at the level of customer and developer. The evaluation process was carried out to ensure feasibility and practicability of the developed system in real situations. The results from questionnaires provided to customers and developers were analyzed by listing a few feedback comments received. Among the feedback received are the following: i. Both customer and developer agree with the fact that system implementation has given beneficial information during the requirements elicitation process. This reduced knowledge gap is faced by both customer and developer. ii. Customer and developer also agree that user commitment may be increased by using this system. iii. Customer and developer agree that the requirement information that is being sent and received when using the system is consistent and complete. Apart from that, feedback is also gained within a short period. This will reduce the medium gap as faced by customer and developer. iv. Customer and developer agree that requirement information produced is more complete and accurate when using this system. This may reduce the expression gap as faced by the developer. In general, all customers and developers agree that the system is able to reduce the communication gap between both parties. Fig. 4 shows the communication gap model for requirements elicitation.
IV. Communication Gap Model of Requirements Elicitation
This section presents acquisition of the proposed communication gap model for requirements elicitation. TABLE I PHASES, AGENDA, DELIVERABLES AND RESULTS OF EVALUATION PROCESS Phases Agenda Deliverables Results Preparation Briefing Evaluation Find out how a user session and goals could easily discussion with Input for understand a way users evaluation to use the system. Explaining and process. showing a way to use the system Implementa Observation System easily used Questionnaire by users tion Questionnaire process is Users give positive carried out feedback that shows agreement on system to reduce input gap and medium
Fig. 4. Communication Gap Model for Requirements Elicitation
This model aims to address the communication gap between customers and developers. The proposed model consists of communication gap in general and requirements elicitation gap specifically. The components of this model are the customer, the developer and different classifications of gaps. Fig. 4 explains the proposed communication gap model in detail.
TABLE II
CLASSIFICATION OF COMMUNICATION GAPS
1. Customer Knowledge: differences between the system as needed by the customer and the software requirements that have been expressed by the customer.
2. Customer Expression: differences between the software requirements as expressed by the customer and the input given by the customer.
3. Medium: differences between the input given by the customer and the input received by the developer.
4. Developer Knowledge: differences between the understanding of the system as needed by the customer and the input received by the developer.
5. Developer Expression: differences between the translation of understanding into requirements specifications and the understanding of the system as needed by the customer.
V. Conclusion
With the role of designing software requirements for optimum performance and thus producing results that meet the needs and goals of customers, there is a high liability for stakeholders to perform during the process of requirements elicitation. This paper proposed construction of a communication gap model between customers and developers during the requirements elicitation process. The proposed model shows that communication gap includes namely: input gap, expression gap, knowledge gap, and medium gap. Bridging the communication gap may be indirectly reduced through the use of training, education and support tools. This research will be further enhanced to produce a more comprehensive model by performing model checking via a more formal method. The improvement of models can be built through considering other issues, namely: political factors and social factors for a new dynamic environment (such as agile method and extreme programming).
1. The Customer A customer is an individual who is directly or indirectly involved in system development. It may be the end user or the management team. Customer involvement during the requirements elicitation process may be through several communication medium, since a customer having little knowledge in technical and computer fields may possess a certain skill and depth of knowledge in the area of system domain. 2. The Developer Developers are individuals involved in system development. The team may comprise a system analyst, a project manager, a software engineer, and a database administrator. Their involvement during requirements elicitation is mainly concerned with receiving the requirements information as conveyed by the customer through multiple medium. Although a developer may have little knowledge of field domain, they possess knowledge and high skills in the computer technical arena.
Acknowledgements This study has been supported by the funds from Universiti Putra Malaysia’s Research University Grant Scheme.
References [1]
[2]
3. Classification of Gaps Communication gap is classified into five categories, which are namely: customer knowledge, customer expression, medium, developer knowledge, and developer expression as defined in Table II. The gaps occur due to differences in knowledge, background skills and technology supporting both customer and developer.
[3]
[4]
Erra, U., Scanniello, G.: Assessing communication media richness in requirements negotiation. Software, IET, vol. 4, issue. 2, pp.134-148, 2010. Wan, J., Zhang, H., Wan, D., Huang, D.: Research on Knowledge Creation in Software Requirement Development. Journal of Software Engineering and Applications, vol. 3, n. 5, pp.487-494, 2010. Pa, N.C., Admodisastro, N., Interaction in software requirements for future computing environment, (2012) International Review on Computers and Software (IRECOS), 7 (6), pp. 3007-3011. Shrivastava, A., Tripahi, S.P., Requirements engineering process assessment: An industrial case study, (2012) International Review on Computers and Software (IRECOS), 7 (5), pp. 2149-2158.
[5]
[6]
[7] [8]
[9]
[10] [11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
Aurum, A. , Wohlin, C. Engineering And Managing Software Requirements. Germany, Springler-Verlag Berlin Heidelberg, 2005. Kheirkhah, E., Deraman, A., Analysis of current requirements engineering techniques in end-user computing, (2010) International Review on Computers and Software (IRECOS), 5 (6), pp. 724-730. Sommerville, I. Software Engineering. Ninth Edition. Pearson Education, Inc. United States of America 2011). Laporti, V., Borges, M., Braganholo, V.: Athena - A collaborative approach to requirements elicitation. Journal of Computers in Industry, vol. 60, issue. 6, pp.367–380, 2009. Grunbacher P. , Braunsberger, P. Tool Support for Distributed Requirements Negotiation: Lessons Learned. Cooperative Methods and Tools for Distributed Software Processes. Proceeding(s) of IEEE, (Page: 46-55, Year of Publication: 2003). Boger, E. The ASM Refinement Method. Formal Aspects of Computing, vol. 15, no. 2-3, pp. 237-257. Xavier, M. , Cavalcanti, A. Mechanised Refinement of Procedures. Electronic Notes in Theoretical Computer Science 184 (2007) (Page: 63-80, Year of Publication: 2007). Pa, N.C , Zin, A.M. Requirements Elicitation: A Communication Model For Developer and Customer, Proceeding(s) of the 2nd Malaysian Software Engineering Conference (MySEC’06), Kuala Lumpur, (Page: 136-141, Year of Publication: 2006). Vale, L., Albuquerque, A., Beserra, P. Relevant Skills to Requirement Analysts According to the Literature and the Project Managers Perspective, Proceeding of International Conference on the Quality of Information and Communications Technology, IEEE, Los Alamitos, (Page: 228–232, Year of Publication: 2010). Chakraborty, S., Sarker, S.: An Exploration into the Process of Requirements Elicitation: A Grounded Approach. Journal of the Association for Information Systems, vol. 11, no.1, pp. 2010. Tesch, D., Sobol, M., Klein, G., Jiang, J.: User and developer common knowledge: Effect on the success of information system development projects. International Journal of Project Management, vol. 27, no. 7, pp. 657–664, 2009. Siau, K. , Tan X. Technical Communication in information System Development: The Use of cognitive Mapping, Journal of IEEE Transactions on Professional Communication, vol. 48, no. 3, pp.269-284, 2005. Fuentes-Fernández, R., Gómez-Sanz, J., Pavón, J.: Understanding the human context in requirements elicitation. Requirements Engineering, vol. no. pp. 1-17, 2010. Cybulski, J., Nguyen, L., Thanasankit, T. , Lichtenstein, S. Understanding Problem Solving in Requirements Engineering: Debating Creativity with IS Practitioners, Proceeding(s) 7th Pacific Asia Conference on Information System, (Page: 456–481, Year of Publication: 2003). Hornik, S., Chen, H.G., Klein, G. , Jiang, J.J. Communication Skills of IS Providers: An Expectation Gap Analysis From Three Stakeholder Perspectives, Proceeding(s) IEEE Transactions on Professional Communication, (Page: 17-34, Year of Publication: 2003). Bowen, P.L., Heales, J. , Vongphakdi, M.T. Realibilty factors in business software: volatility, requirements and end-users. Info Systems Journal, vol. 12, no. 3, pp.185-213, 2002. Zin, A.M., Pa, N.C.: Measuring communication gap in software requirements elicitation process. Proceeding of World Scientific and Engineering Academy and Society (WSEAS), (Page: 66-71, Year of Publication: 2009).
Authors’ information 1
Department of Information System, Faculty of Computer Science and Information Technology. Universiti Putra Malaysia, Serdang, 43400, Selangor, Malaysia 2 Head of Programming and Software Technology Research Group, Faculty of Information Science and Technology, Universiti Kebangsaan, Bangi, 43650, Selangor, Malaysia.
Noraini Che Pa was born in Kelantan, Malaysia in 1972. She obtained her B.A. (Hons) in Economic from Universiti Kebangsaan Malaysia in 1996. She received the M.Sc in Information System and the Ph.D. degree in computer science from the Universiti Kebangsaan Malaysia. Currently, she is a senior lecturer in the department of Information System, Faculty of Computer System and Information Technology, Universiti Putra Malaysia. Her research interest includes requirements engineering, information system, knowledge management and management information system. E-mail:
[email protected] Abdullah Mohd Zin is a professor and head of Programming and Software Technology Research Group in Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia. His research interest includes computer science and programming education, software tools and advanced software development methodology, the use of formal method in software development, network and mobile applications. E-mail:
[email protected]
International Review on Computers and Software (I.RE.CO.S.), Vol. 8, N. 2 ISSN 1828-6003 February 2013
Towards a Reference Ontology for Higher Education Knowledge Domain Leila Zemmouchi-Ghomari1, Abdessamed Réda Ghomari2
Abstract – Most ontologies are application ontologies designed for specific applications. Reference ontology is able to contribute significantly in resolving or at least reducing the issue of ontology applications specificity and hence increasing ontology reusability. Particularly considering higher education domain, we think that a reference ontology dedicated to this knowledge area, can be regarded as a valuable tool for researchers and institutional employees interested in analyzing the system of higher education as a whole. This paper describes, a reference ontology called HERO ontology, which stands for “Higher Education Reference Ontology”. We explain HERO ontology building process from requirements specification until ontology evaluation using NeOn methodology. HERO ontology is projected to be a reusable and generalisable resource of academic knowledge which can be filtered to meet the needs of any knowledge-based application that requires structural information. It is distinct from application ontologies in that it is not intended as an end-user application and does not target the needs of any particular user group. Copyright © 2013 Praise Worthy Prize S.r.l. - All rights reserved.
Keywords: Reference Ontology, Higher Education Ontology, Ontology Engineering, Neon Methodology
I.
Nevertheless, none of them is dedicated to the higher education area. Besides, a reference ontology for the higher education domain can be considered a relevant instrument for sharpening an institution's mission and profile [6]-[11]. By focusing on the relevant constituents of the ontology, the institutions indicated that they would be able to strengthen their strategic orientation and develop and communicate their profile [6]. In addition, the institutions in the case studies indicated that they would be highly interested in identifying and learning from other institutions comparable to them on a number of relevant dimensions and indicators. Developing and expanding partnerships and networks with these colleague institutions and setting up benchmarking processes were seen as important benefits of this knowledge representation. In this perspective, we decided to build a reference ontology for the higher education area, called HERO, which stands for "Higher Education Reference Ontology". Since it is a Reference ontology, it is intended to have a broad coverage of the university domain; in other words, the ontology describes several aspects of the university domain such as organisational structure, administration, staff, roles, incomes, etc. The purpose of this reference ontology is to be relevant, or at least convenient, for describing any university. We adopted the NeOn methodology [12] to construct the HERO ontology since it is based on well-known ontology
Introduction
Ontology is considered the backbone of the semantic web since it provides a common, comprehensible foundation for its resources in order to allow reusability and sharing among the community of domain experts. Unfortunately, most of the available ontologies are too specific and do not stand the test of large applications [1]. Consequently, constructing ontologies from scratch to support domain applications requires a great deal of effort and time [2]. Alternatively, reusable ontologies provide opportunities for developers to exploit and reuse existing domain knowledge to build their applications with much more ease and reliability. A common belief is that reusable ontologies ought to be conceived and developed independently from the application and the context of their use. Reusable ontologies serve as a basis for communication, integration and sharing of information pertaining to experimental analysis within a collaboration [3]. Reference ontology is able to contribute significantly to resolving, or at least reducing, the issue of the application specificity of ontologies. Computer scientists recognize that a common and robust reference ontology for a particular knowledge domain might provide significant advantages over the domain and application ontologies previously used [4]-[5]. There exist some reference ontologies, mostly dedicated to the medical area, such as the Foundational Model of Anatomy ontology (FMA), the Gene Ontology (GO), etc.
engineering methodologies such as METHONTOLOGY [13], On-To-Knowledge [14] and DILIGENT [15], combined with good practices and feedback from previous experiences of the NeOn consortium members. In this paper, we explain in detail the HERO ontology building process, from requirements specification to ontology evaluation, throughout sections 6 and 7. Section 2 defines the paradigm of reference ontology by comparing it to other ontology types, followed by section 3, which relates its possible applications. In section 4, we present the related work with regard to two aspects, namely ontology knowledge domain and ontology typology; then we briefly describe the selected building methodology (section 5). We conclude our work in section 8 by presenting the main results obtained and what follows from our work.
II.
ontology development: Experts community view point versus End-user view point; • Reference ontology is a Core ontology, as illustrated by (Fig. 1). II.2.
The Reference ontology is an incontestable contribution in several research areas such as: ontology evaluation, ontology matching [17] and semantic web. TABLE I COMPARISON BETWEEN DIFFERENT ONTOLOGY TYPES Foundational Reference Core ontology Application ontology ontology ontology DomainDeclare a theory Catches the Provides a independent about a central minimal theories particular concepts and terminological domain of relations of a structure reality domain Fits the needs of a specific community High degree of Make use of Defines Offers representational methods of topconcepts terminological accuracy level ontologies which are services for Rich, axiomatic generic across semantic theories, a set of access, designed domains, checking according to constraints strict between terms ontological principles Designed to be Generalize to Focuses on a Lightweight used as controls other domains domain ontologies, on other (more specific application designed ontologies domains) without being according to types restricted to the viewpoint specific of an end-user applications in a particular domain Can be derived Built in Can be from agreement derived from Foundational with Reference ontology foundational ontology ontologies or based on well founded methodologies
Reference Ontology II.1.
Role
Definition
While a variety of reference ontology definitions have been suggested [5], this paper will use the definition proposed by Burgun [10] who said that: “Domain Reference ontologies represent knowledge about a particular part of the world in a way that is independent from specific objectives, through a theory of the domain”. In fact this definition describes reference ontology main characteristics, explicitly: • To have a realist bias; • To be independent from application specific purposes; • To represent the theory of a domain in accordance with strict knowledge representation principles belonging to ontologies. • To be validated by a large community of domain experts Burgun makes distinction between Top level Reference ontologies, such as: BFO1, DOLCE2, OpenCyc3, SUMO4 which are also called foundational ontologies and Domain Reference ontologies, like: FMA, AKT, Reference Ontology for Business Models (presented in section IV.2). More clarification is provided by a comparison between reference ontology and three ontology types in table below. To conclude this comparison we can attest that: • Reference ontology is a heavy weight ontology; • When Reference ontology is generic, foundational and reference ontologies are equivalent, otherwise, they belong to different abstraction levels; • Reference ontologies and application ontologies reflect different aspects of a single methodology of
II.2.1. Reference Ontology in Ontology Evaluation Ontology evaluation is a crucial activity, which needs to be carried out during the whole ontology life cycle; most evaluation approaches fall into one of the following categories: • those based on comparing the ontology to a “golden standard”; • those based on using the ontology in an application and evaluating the results; • those involving comparisons with a source of data about the domain to be covered by the ontology; • those where evaluation is done by humans who try to assess how well the ontology meets a set of predefined criteria, standards, requirements [18].
1
http://ontology.buffalo.edu/bfo/ http://www.loa-cnr.it/DOLCE.html 3 http://www.cyc.com/SUO/opencyc-ontology.txt 4 http://suo.ieee.org 2
Copyright © 2013 Praise Worthy Prize S.r.l. - All rights reserved
International Review on Computers and Software, Vol. 8, N. 2
475
L. Zemmouchi-Ghomari, A. R. Ghomari
This can be used for providing the missing structure when matching poorly structured resources intended to be matched; they often lack a common ground on which comparisons can be based. The focus here is on the use of intermediate formal ontologies for that purpose. These intermediate ontologies can define the common context or background knowledge for the ontologies to be matched. The intuition is that a background ontology with a comprehensive coverage of the domain of interest of the ontologies to be matched helps in the disambiguation of multiple possible meanings of terms. Fig. 1. Position of Reference ontology regarding to different perspectives [16]
II.2.3.
Our interest is in the first approach where a gold standard or a reference ontology for a particular domain is needed to attest if a given ontology is better or worse than other ontologies or if it represents correctly the intended domain of knowledge. If we have to compare several ontologies to each other, the Reference ontology could play the role of a framework to facilitate this evaluation, for example: In the Information Retrieval field, it is frequent to compare between ontologies for deducing which ontology is the more relevant for the IR task, the Reference ontology represents here the corpus in which the task will be achieved. Evaluation based on comparison to a gold standard can be incorporated into this theoretical framework as a function defined on a pair of ontologies (effectively a kind of similarity measure, or a distance function between ontologies). Similarly, data-driven evaluation can be seen as a function of the ontology and the domain-specific data corpus D, and could even be formulated probabilistically as P(O|D) [19].
The semantic web is envisioned as an evolving set of local ontologies that are gradually linked together into a global knowledge network. Many such local application ontologies are being built, but it is difficult to link them together because of incompatibilities and lack of adherence to ontology standards. Reference ontologies attempt to represent deep knowledge of basic science in a principled way that allows them to be re-used in multiple ways, just as the basic sciences are re-used in clinical applications. As such they have the potential to be a foundation for the semantic web if methods can be developed for deriving application ontologies from them [21].
III. Why a Reference Ontology for Higher Education Knowledge Domain? Multidimensional perspective tool of higher education system is necessary as [6]: • Research tool: that offers relevant information to stakeholders • Transparency instrument : That makes the diversity of higher education clear • Base for governmental policy-making: some governmental institutions consult the classifications when making decisions about institutional funding • Instrument for university profiling and strategy development: Some higher education organisations use the Classification in determining membership fees • Global ranking tool: that can contribute to the international competitiveness of higher education institutions in knowledge production and knowledge utilization Compared to ranking systems, higher education reference ontology gives a better basis for developing a diversified higher education system and quality development and benchmarking. Moreover, J. Milam [11] is convinced that ontology for higher education knowledge domain is critically needed, precisely for these applications: • Marketplace of institutions: classifications document this marketplace by identifying categories that are
II.2.2. Reference Ontology in Ontology Matching Reference ontology provides the context in which it is easier to match ontologies [20] (by managing differences that arise between these ontologies) (Fig. 2).
Pairwise alignment
Reference Ontology in the Semantic Web
Alignment through a Reference
Fig. 2. Two ways to align ontologies [5]
For example, the Foundational Model of Anatomy (FMA) can be used as the context for the other medical ontologies to be matched (as long as it is known that the reference ontology covers the ontologies to be matched).
Copyright © 2013 Praise Worthy Prize S.r.l. - All rights reserved
International Review on Computers and Software, Vol. 8, N. 2
476
L. Zemmouchi-Ghomari, A. R. Ghomari
homogenous with regard to functions, students , faculty members of these institutions. • Academic disciplines: student enrollment, degrees conferred and research expenditure data are collected at the discipline level in order to build academic disciplines classification. • Documentation of data: development of standard taxonomies for higher education data promotes best practices to improve data collection and reporting. • Metadata about learning management systems: because information about learning must be shared between different computer systems. • Nature of higher education enterprise: an agreedupon taxonomy can help to categorize types of expenditures and revenues of higher education institutions. Online resources: taxonomies make possible development of online applications such as: data warehouses, data mining, content management systems, e-learning resources, etc.
IV. Related Work

In this section we briefly present some of the related work according to two perspectives: our domain of interest (ontologies describing the higher education domain) and ontology typology (reference ontologies intended for other knowledge domains).

IV.1. With Regard to Ontology Knowledge Domain: Other University Ontologies

There exist some university domain ontologies that can be qualified as good representations with regard to aspects such as correctness of the syntax language and a satisfactory coverage of the university domain. Nonetheless, these ontologies do not meet the criteria of reference ontologies defined in Section II, in short: to be heavyweight, to contain only central concepts, and not to be intended for specific applications.
A. University Ontology (http://www.cs.umd.edu/projects/plus/SHOE/onts/univ1.0.html): The author is Jeff Heflin of Lehigh University. The current version dates from 2000 and is no longer maintained. This ontology defines elements for describing universities and the activities that occur at them. It includes concepts such as departments, faculty, students, courses, research, and publications. It is a lightweight ontology (no inference rules are defined).
B. Univ-Bench (LUBM, the Lehigh University Benchmark, http://swat.cse.lehigh.edu/projects/lubm/): The author is Zhengxiang from Lehigh University. The current version dates from 2004. This ontology has been developed to facilitate the evaluation of Semantic Web repositories in a standard and systematic way. The benchmark is intended to evaluate the performance of those repositories with respect to extensional queries over a large data set that commits to a single realistic ontology. This ontology has been designed for a specific application, namely to provide synthetic data for test queries and performance metrics.
C. Academic Institution Internal Structure Ontology (AIISO, http://purl.org/vocab/aiiso/schema): The authors are Rob Styles and Nadeem Shabir from Talis. The current version dates from 2008. AIISO provides classes and properties to describe the internal organisational structure of an academic institution. It is designed to work in partnership with Participation (http://purl.org/vocab/participation/schema), FOAF (http://xmlns.com/foaf/0.1/) and AIISO-roles (http://purl.org/vocab/aiiso-roles/schema) to describe the roles that people play within an institution. This ontology focuses on the structural perspective of a university (which is reflected by its small number of classes: 15).

IV.2. With Regard to Ontology Typology: Other Reference Ontologies

Some famous reference ontologies have been built, particularly in the biomedical domain. The OBO Foundry (www.obofoundry.org) is a good illustration of this high-quality work. In addition, other knowledge domains have been targeted, such as business and academia, as explained in what follows.
A. FMA: The most famous biomedical reference ontology is the FMA ontology (http://sig.biostr.washington.edu/projects/fm/), which stands for the Foundational Model of Anatomy. It is concerned with the representation of entities and relationships necessary for the symbolic modelling of the structure of the human body in a computable form that is also understandable by humans. Why foundational? For two reasons: (1) anatomy is fundamental to all biomedical domains; and (2) the anatomical concepts and relationships encompassed by the FMA generalize to all these domains. The FMA currently contains 70,000 distinct anatomical concepts representing structures ranging in size from some macromolecular complexes and cell components to major body parts. These concepts are associated with more than 110,000 terms, and are related to one another by more than 1.5 million instantiations of over 170 kinds of relationships [22].
B. Reference Ontology for Business Models: The Reference Ontology for Business Models [23] uses concepts from three established business model ontologies: REA (http://www.getopt.org/ecimf/contrib/onto/REA/rdf/REA.rdfs), BMO (http://www.bpiresearch.com/Resources/RE_OSSOnt/RE_BMO_DL/BMO_V_01_00.zip) and e3-value (http://docs.e3value.com/misc/example.xsvg). The basic
concepts in the reference ontology concern actors, resources, and the transfer of resources between actors. The BMO provides an ontology that allows describing the business model of a firm precisely and in detail, highlighting its environment and its concerns for facing a particular customer's demands. It consists of nine core concepts in four categories: (1) Product (Value Proposition), (2) Customer Interface (Target Customer, Distribution Channel, and Relationship), (3) Infrastructure Management (Value Configuration, Capability, and Partnership), and (4) Financial Aspects (Cost Structure and Revenue Model).
C. AKT Reference Ontology: The Advanced Knowledge Technologies (AKT) Reference Ontology [24] is built on a number of smaller ontologies. It models the domain of academia, and thus contains representations for people, conferences, projects, organisations, publications, etc. It is written in OWL and currently consists of 175 classes and 142 properties.
V. Building Methodology

The NeOn Methodology [12] is a scenario-based methodology that provides a set of nine scenarios that can be combined. These scenarios can be summarized as follows:
1. From specification to implementation. The ontology is developed from scratch.
2. Reusing and re-engineering non-ontological resources.
3. Reusing ontological resources. Developers use ontological resources (ontologies as a whole, ontology modules, and/or ontology statements) to build the new ontology.
4. Reusing and re-engineering ontological resources. Ontology developers reuse and re-engineer ontological resources.
5. Reusing and merging ontological resources.
6. Reusing, merging and re-engineering ontological resources. This scenario is similar to Scenario 5, but here developers decide to re-engineer the set of merged resources.
7. Reusing ontology design patterns. Ontology developers access repositories (e.g., http://ontologydesignpatterns.org/) to reuse ODPs.
8. Restructuring ontological resources. Ontology developers restructure (e.g., modularize, prune, extend, and/or specialize) ontological resources to be integrated in the new ontology.
9. Localizing ontological resources. Ontology developers adapt an ontology to other languages and culture communities, thus obtaining a multilingual ontology.
The ontology building process of the NeOn methodology is performed according to the following phases:
a. The Ontology Specification phase is defined as a collection of requirements that the ontology should fulfill. The output of this activity is the ORSD (Ontology Requirements Specification Document), which includes the purpose, level of formality and scope of the ontology, the target group and intended uses of the ontology, and a set of requirements, which are those needs that the ontology to be built should cover [25]. The NeOn methodology suggests identifying competency questions as a technique for establishing the ontology requirements. Competency questions (CQs) were proposed for the first time in [26]. They are defined as natural language questions that the ontology to be built should be able to answer.
b. The Ontology Conceptualisation phase is defined as an activity in which domain knowledge is structured in a conceptual model that describes the problem and its solution in terms of the domain vocabulary identified in the ontology specification activity. Once the GT (glossary of terms) is completed, terms are grouped as concepts and verbs. Each set of concepts/verbs includes concepts/verbs that are closely related to one another inside the same group, as opposed to other groups. Indeed, for each set of related concepts and related verbs, a concepts classification tree and a verbs diagram are built [27].
c. The Ontology Formalisation & Implementation phase: The resulting basic taxonomic structures are then enriched with axioms in formal ontology languages, in our case OWL 2 (www.w3.org/TR/owl2-overview/). The effort required by this phase has been reduced by ontology editors, which automatically generate ontologies in the selected formal language.
Simultaneously with the above phases, knowledge acquisition, evaluation and documentation are tasks that are carried out during the whole life of the ontology. In fact, unless the ontology developer is an expert in the application domain, most of the acquisition is done simultaneously with the requirements specification phase and decreases as the ontology development process moves forward [27].

VI. HERO Development Process

VI.1. Selected Scenarios
Ontology reuse is recommended by default in current methodologies and guidelines as a key factor in developing cost-effective and high-quality ontologies. The underlying principle is that reusing existing and already agreed-upon terminology saves time and money in the ontology development process and promotes the application of good practices [2]. Due to the complexity of the domain of interest on the one hand, and the need for a broad coverage of the
reference ontology on the other, we decided to combine development from scratch (scenario 1) with a reuse-oriented engineering strategy (scenario 2 and scenario 3), performed according to the following phases:
• During the specification phase: reuse of non-ontological resources (scenario 2), namely the Carnegie classification (http://classifications.carnegiefoundation.org/, the leading framework for recognizing and describing institutional diversity in U.S. higher education since 1973, regularly updated) and domain-related sites and documents. This search strategy focused on international and national governmental institutions, such as:
o official governmental websites, such as: United States University (http://www.usuniversity.edu/), Universities UK (http://www.universitiesuk.ac.uk), the International University of Japan (http://www.iuj.ac.jp/);
o university associations, like: the International Association of Universities (www.iau-aiu.net), the Association of African Universities (www.aau.org), the European University Association (www.eua.be), the Association of American Universities (www.aau.edu), the Association of Universities of Asia and the Pacific (auap.sut.ac.th);
o academic reports, for example: ACE (American Council on Education) reports, EPI (Educational Policy Institute, Canada) reports, EAO (Education, Audiovisual & Culture Executive Agency of the Bologna Process) reports.
Once the resources are identified, ontology developers have to decide whether these resources are useful for the development or not. The selection, in the case of a reference ontology, is guided by the reputation, the reliability and the degree of agreement of these resources. The NeOn methodology then suggests carrying out a re-engineering process on the selected non-ontological resources in order to transform them into ontologies after the specification phase. However, we decided to reuse these resources in a different way and at a different time of the ontology building process, namely at the beginning of the specification phase, because we aim to build a reference ontology which has to meet the requirements of any ontology application, so there is no need for application-specific requirements. In addition, we think that our efforts in the specification phase have to be guided by the capture of relevant and wide-spectrum knowledge rather than by the design requirements necessary to generate data models from non-ontological resources. So this is the way we proceeded: every time we found an interesting statement that is supposed to be encoded in the ontology, we considered it as a potential answer to a competency question. For example, the
classification of higher education institutions according to the Carnegie classification:
• by the highest degree granted;
• by the primary source of funding: public, independent (commonly known as private not-for-profit), or private for-profit;
• by source of control: whether an institution is established by the state (referred to as public) or is an independent entity receiving a charter from the state (referred to as private).
This sorting has been considered relevant knowledge and can constitute an answer to a competency question: "What criteria can be taken into account to generate different classifications of higher education institutions?"
• During the conceptualisation phase: reuse of ontologies (scenario 3) via the Watson tool (http://kmi-web05.open.ac.uk/WatsonWUI/), a NeOn Toolkit plug-in which integrates the search capabilities of the Watson Semantic Web gateway within the environment of the ontology editor, the NeOn Toolkit (http://neon-toolkit.org). It finds, in online ontologies, statements that are relevant to extend the description of a particular ontology entity; the statements selected by the user are then integrated in the ontology. For example, suppose a class "C" has been created (in this paper, we use the terms "Class" (OWL term) and "Concept" (Description Logic term) interchangeably, since they are equivalent). Clicking on the "Watson Search" item of its right-click menu triggers a search for any statement on the semantic web concerning a class named "C". The result of the search for a particular entity is displayed in a separate view: the list of entities that have been found is shown with, for each entity, the statements they are associated with. Each of the retrieved statements can be imported into the ontology; the statement is then attached to the original entity (the one that triggered the search) in the currently built ontology. We found that most ontologies detected by the Watson tool agree on some classes and some associations; some trivial examples are mentioned below (see the sketch after this list):
o the class "JournalArticle" is a subclass of the class "Publication";
o the class "UndergraduateStudent" is a subclass of the class "Student";
o there always exists an association named "TakesCourse" between the classes "Student" and "Course".
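Agreed statements of this kind can also be added to an ontology programmatically, outside the NeOn Toolkit; the following is a minimal sketch using the rdflib Python library, reusing the HERO namespace URI that appears later in Table VII (this is an illustration, not the toolchain actually used by the authors).

from rdflib import Graph, Namespace, RDF, RDFS, OWL

# Namespace URI taken from the queries in Table VII; adjust if the published file differs.
HERO = Namespace("http://www.UniversityReferenceOntology.org/HERO#")

g = Graph()
g.bind("hero", HERO)

# Class hierarchy statements on which most ontologies found via Watson agree.
g.add((HERO.Publication, RDF.type, OWL.Class))
g.add((HERO.JournalArticle, RDF.type, OWL.Class))
g.add((HERO.JournalArticle, RDFS.subClassOf, HERO.Publication))
g.add((HERO.UndergraduateStudent, RDFS.subClassOf, HERO.Student))

# The recurring TakesCourse association between Student and Course.
g.add((HERO.TakesCourse, RDF.type, OWL.ObjectProperty))
g.add((HERO.TakesCourse, RDFS.domain, HERO.Student))
g.add((HERO.TakesCourse, RDFS.range, HERO.Course))

print(g.serialize(format="turtle"))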
VI.2. HERO Building Phases

In this section, we present the HERO building process from specification to implementation.
A. Specification phase
The NeOn methodology (Table II) has been very helpful in the specification phase, through the Ontology
Requirements Specification Document (ORSD) [28] and the competency questions technique [26].

TABLE II
HIGHER EDUCATION REFERENCE ONTOLOGY REQUIREMENTS SPECIFICATION DOCUMENT (HERO ORSD)
Purpose: The purpose of building the reference ontology is to provide a consensual knowledge model of the university domain that can be considered as a basis to derive more specific university domain ontologies. This reference ontology is named "HERO", which stands for "Higher Education Reference Ontology".
Scope: Since the targeted ontology is a reference ontology, it must have a broad coverage of the university domain; in other words, the ontology has to describe several aspects of the university domain, such as organisational structure, administration, staff, roles, incomes and any other aspect considered important by the majority of domain experts to describe the university domain, without losing sight of the fact that it is also a core ontology, which means that it should contain only central concepts and the relationships between them. This leads us to the level of granularity, which is determined, as mentioned previously, by the level of concept specificity: the leaves (the most specific concepts in the ontology) must be relevant, or at least convenient, to any described university. Considering the level of formality, the reference ontology, as defined in [18], is a formal and heavyweight ontology; in other terms, it should contain a significant number of axioms in order to avoid any ambiguity in the interpretation of any concept, object property or data property contained in it.
Implementation Language: The university reference ontology has to be implemented in OWL 2.
Intended End-Users: Users of this ontology might be: 1. ontology developers, in collaboration with domain experts, interested in building domain or application ontologies by deriving more specific ontologies from the reference ontology; 2. ontology aligners, which may consider this reference ontology as a gold standard in the ontology alignment process; 3. ontology evaluators, in the case of comparing several ontologies to determine the best one in terms of quality or coverage (the reference ontology provides a predefined consensual context for the candidate ontologies); 4. national academic commissions, higher education governmental structures and independent accreditation organisms, which have to produce reports on higher education concerns such as university organisation, higher education policy and university rankings.
Intended Uses: The main use of a reference ontology describing the higher education domain is to provide consensual knowledge about the intended domain, in order to be shared (eliminating or reducing the interoperability problems that can occur between universities due to the lack of a common terminology) and reused among different users, different communities and different universities. In other words, its uses can be either general, because of its nature as a core ontology (semantic web, alignment, evaluation: it provides a context for all these tasks), or more specific, by specializing this ontology with regard to some predefined end-user requirements, such as describing a particular higher education organisation.
Ontology Requirements:
a. Non-Functional Requirements: These requirements refer to syntactical aspects of the resulting ontology, such as the language in which the ontology is described, namely English, and the selected terminology that will be used to describe the ontology. The designated terminology has to be consensual or, at least, used in the most widely recognized higher education institutions.
b. Functional Requirements: These refer to the particular knowledge to be represented by the ontology, or what knowledge the ontology must contain. This specification is achieved by using the technique of competency questions. The adopted strategy for identifying these questions is the middle-out approach: we start by selecting the most important questions with regard to the ontology goals; further on, we study the possibility of decomposing them (to obtain more concrete or simpler questions) or, on the contrary, of gathering some of them (to obtain more abstract or complex questions). The competency questions have to be classified by category and, optionally, by priority. This categorization facilitates highlighting the different ontology modules belonging to the same domain and hence increases their reusability. Additionally, sorting these questions into categories presents some advantages, such as highlighting key concepts (the most information-rich concepts), deduced from two indicators: the first is the intersection between categories, which overlap because of their dependencies (for example, a person can be a student and a teacher at the same time, and a dean is an administrator and a research project manager); the second is the frequency of the terms extracted from the questions and their answers (the frequency of a given term is proportional to its importance). We opted for six categories inspired by the higher education domain classification of the ACE (American Council on Education) reports, ACE being a leading organisation in analyzing the higher education domain. These categories are: 1. Faculty, appointments and research area; 2. Students and their life; 3. Administration; 4. Degrees and curriculum programs; 5. Finance; 6. Governance.

As detailed in the previous section, we relied on several academic reports on higher education systems, reliable websites, the Carnegie classification and WordNet (http://wordnet.princeton.edu/) to write the competency questions and their answers. As a first attempt to identify the HERO ontology requirements, we identified eighty-one (81) competency questions (CQs) in the specification phase of the HERO ontology development process. The repartition of the competency questions across categories indicates the priorities in the academic domain, made explicit by the number of questions dedicated to each category, since our purpose is to cover the domain as broadly as possible without having any specific application in mind. There is an overlap of some competency questions (20 common questions, which represents 18%) across some categories. Concepts pertaining to these common questions could represent articulations between the different categories, which might be useful in the case of building a network of modular ontologies. From the requirements in the form of competency questions and their respective answers, we extract the terminology (names, adjectives and verbs) that will be formally represented in the ontology by means of concepts, attributes and relations [29], as illustrated in Table III.
B. Conceptualisation phase
The conceptualisation phase includes the concepts supposed to exist in the world and their relationships.
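The term frequencies reported in Table III can be obtained by straightforward counting over the competency questions and their answers; a minimal Python sketch follows (the question strings are invented placeholders, not the actual 81 HERO competency questions).

from collections import Counter
import re

# Illustrative competency questions; the real HERO specification contains 81 of them.
competency_questions = [
    "What degree does a graduate student need to enroll in a doctoral program?",
    "Which teacher supervises a student during a research project?",
    "What criteria can be taken into account to generate different classifications "
    "of higher education institutions?",
]

# Candidate glossary terms whose frequency we want to measure.
glossary_terms = ["student", "teacher", "degree", "research", "higher education institution"]

text = " ".join(q.lower() for q in competency_questions)
frequencies = Counter({term: len(re.findall(re.escape(term), text)) for term in glossary_terms})

for term, freq in frequencies.most_common():
    print(f"{term}: {freq}")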
TABLE III
GLOSSARY OF TERMS AND THEIR FREQUENCY (EXCERPT)
Faculty, appointments and research area: Assistant professor: 9; Associate professor: 5; Faculty member: 12; Faculty: 36; Instructor: 7; Rank: 7; Research: 39; Staff: 6; Teacher evaluation: 5; Teacher tenure: 11; Teacher: 12; University: 17
Student and their life: Graduate student: 6; International student: 5; Service (campus): 11; Student: 74; To enroll: 6; Year name: 6
Degrees and curriculum programs: Associate degree: 10; Baccalaureate: 8 (total); Bachelor degree: 21; Course: 33; Credit: 29; Degree: 73; Diploma: 5 (training, vocational, high school); Grade: 19; Grading: 11; Program: 34; Student evaluation: 6
Administration: Academic year: 6; Admission: 11; Classification: 5; Community college: 6; Dean: 8; Department: 14; Governing board: 5; Higher education institution: 13; Organisation: 11; President (of university): 10; Private: 10; Public: 9; Semester: 8; Test (access to university): 7; To appoint: 6; Appointment: 5
Finance: Financial aid: 5; Tuition: 5
Governance: Accreditation: 6

This step integrates the following intermediate representation techniques: the Data Dictionary (DD) (Table IV), Concepts Classification Trees (Fig. 3), Attributes Classification Trees (Fig. 4) and the Object Properties table (Table V). The DD identifies and gathers the domain concepts, their meanings and their attributes. A few checks are necessary to detect omissions, in order to [13]:
• guarantee the completeness of the knowledge represented by each concept, that is, that the concept description is concise and all the relevant attributes have been identified;
• determine the granularity or level of detail of the concepts covered by the ontology;
• ensure the consistency of the attributes, that is, that they make sense for the concept;
• provide concept names and descriptions, to ensure the absence of redundancies and to guarantee conciseness.
Once the ontology builder has almost completed the DD, the next step is to develop the Concepts Classification Trees. Given all the concepts of the DD, a concepts classification tree usually organises the domain concepts in a class/subclass taxonomy in which concepts are linked by subclass-of relations. Fig. 3 is a visualization of the HERO ontology key concepts, or the most information-rich nodes, which means, according to Motta [30], that they have been richly characterized with properties and taxonomic relationships in the ontology.
Fig. 3. HERO Key Concept classification tree (automatically generated by the KC-Viz tool, a plugin of the NeOn Toolkit)
C. Formalisation & Implementation phase
A formal ontology must include axioms, or axiomatic theories expressed in a formal language, to constrain the possible interpretations of the ontology components. Since OWL is based on Description Logics, we used it to express property restrictions.
Fig. 4. Attributes Classification Tree (institution type example)
TABLE IV
DATA DICTIONARY (EXCERPT)
Concept Name: Course. Label(s): Course/Module. Description: A Course is a Knowledge Grouping that represents a cohesive collection of educational material referred to by the owning organisation as a course. Attributes: 1. Course Category; 2. Course ClassHours; 3. Course Code; 4. Course CreditsNumber; 5. Course GradingSystem (Criterion Referenced Grading, Grading On Curve, NonGraded Evaluation, Pass Fail System); 6. Course Level; 7. Course Material; 8. Course Prerequisites; 9. Course Room; 10. Course Session Code (Session Timing, Session Type); 11. Course Syllabus (Course Description, Course Objectives); 12. Course Title; 13. Lecture (Lecture Room, Lecture Schedule).
Concept Name: Student. Label(s): Student. Description: An individual for whom instruction is provided in an educational program under the jurisdiction of a school, school system, or other education institution; a learner who is enrolled in an educational institution. Attributes: 1. Full Time Student; 2. Student Code; 3. Student Email; 4. Student Grade; 5. Student Language Proficiency; 6. Student Major Field; 7. Student Minor Field; 8. Student Name; 9. Student Nationality; 10. Student Recommendation Letter; 11. Student Tuition (Tuition Origin); 12. Student Year Name (Freshman Year, Sophomore Year, Junior Year, Senior Year).
Concept Name: Teacher. Label(s): Teacher/Tutor. Description: A person whose occupation is teaching (activities of educating or instructing; activities that impart knowledge or skill). Attributes: 1. Teacher Competency Domains; 2. Teacher Email; 3. Teacher Home Page URL; 4. Teacher Name; 5. Teacher Nationality; 6. Teacher Rank (Assistant Professor, Associate Professor, Full Professor, Professor Emeritus); 7. Teacher Recruitment Year; 8. Teacher Role (Instructor, Lecturer, Teaching Assistant, Tutor); 9. Teacher Tenure.
These restrictions are used to limit the individuals belonging to a single class and contain anonymous classes that satisfy those limits. Some constraints are represented in Table VI. Finally, the HERO ontology has been implemented in the OWL 2 DL profile produced by the NeOn Toolkit editor. The resulting ontology is available at http://sourceforge.net/projects/heronto/?source=directory. It can be described by Fig. 6; more metadata are provided on the HERO ontology website (http://herontology.esi.dz/).
VII. HERO Ontology Evaluation
Usage of multiple independent evaluation approaches ensures development of a consistent and usable ontology.
TABLE V
HERO OBJECT PROPERTIES (EXCERPT)
ObjectProperty | Domain | Range
AppointedTo | Teacher | Department
BelongsTo | Researcher | Research Group
ComposedOf | Research Group | Researcher
CooperatesWith | Researcher | Researcher
EnrolledBy | Student | Higher Education Organisation
Organises | Laboratory | Seminar
ProvidesFinancialAidTo | Higher Education Organisation | Student
StudiesAt | Student | Department
SupervisedBy | Student | Teacher
Supervises | Teacher | Student
Writes | Researcher | Publication
TABLE VI
HERO PROPERTY RESTRICTIONS (EXCERPT)
Restricted Class | Restriction Type | Restriction in OWL | Signification
Doctorate | SubClassOf (necessary condition) | HasDegree only ResearchMaster | a doctoral degree is necessarily preceded by a research master degree
GraduateStudent | SubClassOf (necessary condition) | HasMaster only Master | a graduate student has necessarily obtained a master degree
Laboratory | EquivalentTo (necessary and sufficient condition) | Contains min 1 ResearchGroup | a laboratory contains at least one research group
PostGraduateStudent | SubClassOf (necessary condition) | HasDoctorate only Doctorate | a postgraduate student has necessarily obtained a doctoral degree
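Restrictions such as those in Table VI can be entered through the NeOn Toolkit editor or constructed programmatically. The following is a minimal sketch of the first row (Doctorate SubClassOf HasDegree only ResearchMaster) using the rdflib Python library and the HERO namespace URI that appears in Table VII; it is an illustration only, not the authors' actual workflow.

from rdflib import Graph, Namespace, BNode, RDF, RDFS, OWL

HERO = Namespace("http://www.UniversityReferenceOntology.org/HERO#")
g = Graph()
g.bind("hero", HERO)

# Object property used by the restriction.
g.add((HERO.HasDegree, RDF.type, OWL.ObjectProperty))

# Anonymous restriction class: HasDegree only ResearchMaster (owl:allValuesFrom).
restriction = BNode()
g.add((restriction, RDF.type, OWL.Restriction))
g.add((restriction, OWL.onProperty, HERO.HasDegree))
g.add((restriction, OWL.allValuesFrom, HERO.ResearchMaster))

# Necessary condition: Doctorate is a subclass of the anonymous restriction.
g.add((HERO.Doctorate, RDF.type, OWL.Class))
g.add((HERO.Doctorate, RDFS.subClassOf, restriction))

print(g.serialize(format="turtle"))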
[Fig. 5 shows classes such as Higher Education Organization, Course, Campus, Degree, Department, Alumni, Teacher, Assessment Committee, Student-AC Member and Student (with subclasses Graduate student, Undergraduate student and Postgraduate student), linked by associations such as Enrolled by, Pays tuition to, Provides financial aid to, Takes course, Has grade in, Lives in, Obtained by, Studies at, Supervises, Supervised by and Gives grade to; SubClass of denotes generalization.]
Fig. 5. Excerpt of the HERO ontology data model, with focus on the class "Student"
Fig. 6. HERO constituents (version of 30.09.2012; automatically generated by the NeOn Toolkit)

The quality of an ontology may be assessed relative to various dimensions; hence, we are going to measure its quality relative to three main groups of dimensions: structural, functional and usability-related dimensions [31].

VII.1. Structural Evaluation

Ontologies implemented in RDF(S) and OWL should be evaluated from the point of view of knowledge representation before using them in Semantic Web applications. Indeed, structural evaluation considers the logical structure of the ontology, usually depicted as a graph of elements, and focuses on the syntax and formal semantics of the ontology graph. Several language-dependent ontology verification tools and ontology platforms, such as Protégé and the NeOn Toolkit with Pellet (clarkparsia.com/pellet/), FaCT++ (http://owl.man.ac.uk/factplusplus/), HermiT (hermit-reasoner.com/) and Racer (www.sts.tu-harburg.de/~r.f.moeller/racer/), can be used to evaluate these ontologies. Such tools focus on detecting inconsistencies and redundancies in concept taxonomies. After submitting the HERO ontology to the previously mentioned reasoners, neither inconsistency nor redundancy was discovered.

VII.2. Functional Evaluation

Functional evaluation focuses on the usage of the ontology, i.e. how well it matches the intended conceptualisation or a set of contextual assumptions about a world. The goal is to find ways of measuring the extent to which an ontology mirrors a given expertise, or competency. This evaluation can include: expert agreement, user satisfaction, task assessment, and topic assessment.
• Expert agreement can be verified via several techniques such as: investigation, interview, and questionnaire.
• Task assessment evaluates the ontology with respect to its appropriateness for the intended task, for example by using the competency questions technique.
• Topic-based assessment measures the fitness of the ontology with respect to some existing repository of knowledge on the same topic.
• User satisfaction can be measured in terms of user ratings or polls.
In our case, we focused on the first two approaches, which are explained below.
A. Evaluation by domain experts
This type of evaluation is done by domain experts who try to assess how well the ontology meets a set of predefined criteria, standards and requirements [32]. To carry out this investigation, we chose to use an online questionnaire (http://herontology.esi.dz/content/questionnaire-hero-ontology) (Fig. 7), proposed to higher education domain experts including researchers, teachers, administrators and students (current and alumni). The questionnaire is intended to represent the most relevant constituents of the ontology in an informal way (natural language), so that they can be assessed by domain experts who are not necessarily proficient in a web ontology language (OWL 2 in our case). As an ontology is a fairly complex structure, it was found more practical to focus on the evaluation of different levels of the ontology separately rather than trying to evaluate the ontology directly as a whole [18]. So, we decided to divide the survey into several parts, namely:
• verification of the five levels of the ontology (hierarchy level or depth);
• verification of restrictions;
• verification of relations between concepts; and
• verification of the descriptive attributes of concepts.
The domain experts answered the questionnaire and made several comments on the knowledge encoded in the ontology, such as:
• a student could be enrolled in an undergraduate program and in a graduate program at the same time, while we had declared the undergraduate student class and the graduate student class as disjoint classes (classes with no common instances);
• the Postgraduate student class was identified as missing from the ontology;
• the class Registrar was judged unclear and needs some annotations and/or renaming.
As a result of this evaluation, the ontology has been updated according to the experts' opinions that obtained a majority, because the purpose of a reference ontology is to materialize the specialists' consensus. It is worth mentioning that obtaining expert consensus is a major challenge.
B. Evaluation via the Competency Questions technique
Competency questions (CQs) [26] are the set of requirements or needs that the ontology should fulfil.
Fig. 7. Some screenshots of HERO ontology online questionnaire
In order to enable automatic evaluation with regard to the competency questions, they need to be formalised in a query language. SPARQL (www.w3.org/TR/rdf-sparql-query/) is the language proposed by the W3C for querying RDF data published on the Web. In order to translate the natural language competency questions into SPARQL queries (the entire set of competency questions and their corresponding SPARQL queries is available at http://herontology.esi.dz/content/downloads), we adopted an intuitive approach inspired by the guidelines proposed in [33]. This approach can be summarized in five steps [34]:
1. Identifying competency question categories according to the expected answers' types [14]; there are five sets of questions:
a. Definition Questions: those that begin with "What is/are" or "What does ... mean";
b. Boolean or Yes/No Questions;
c. Factual Questions: the answer is a fact or a precise piece of information;
d. List Questions: the answer is a list of entities; and
e. Complex Questions: those that begin with "How" or "Why"; in this case, obtaining a precise answer is unlikely.
2. Determining the expected (perfect or ideal) answer;
3. Extracting the entity or entities from the questions and their corresponding expected answers identified in step 2;
4. Identifying the answer entity type (class, data property, object property, annotation, axiom, instance) and its location in the ontology;
5. Constructing the appropriate SPARQL query that gives the closest answer to the ideal answer, based on the question type identified in step 1 and the questions'/answers' characteristics extracted in steps 3 and 4, namely: entity, entity type and its location in the ontology.
In Table VII, we present some competency questions with their corresponding SPARQL queries.
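Queries such as those listed in Table VII below can also be executed programmatically against the published OWL file; a minimal Python sketch with rdflib follows (the local file name hero.owl and the RDF/XML format are assumptions, not part of the published resources).

from rdflib import Graph

g = Graph()
g.parse("hero.owl", format="xml")  # local RDF/XML copy of the HERO OWL 2 file (illustrative file name)

# CQ3. Must a university teacher be a researcher? (ASK form, as in Table VII)
cq3 = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX HERO: <http://www.UniversityReferenceOntology.org/HERO#>
ASK { HERO:Teacher rdfs:subClassOf HERO:Researcher . }
"""
print(g.query(cq3).askAnswer)

# CQ53. What higher education degrees exist? (SELECT form, as in Table VII)
cq53 = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX HERO: <http://www.UniversityReferenceOntology.org/HERO#>
SELECT ?subclass WHERE { ?subclass rdfs:subClassOf HERO:Degree }
"""
for row in g.query(cq53):
    print(row.subclass)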
TABLE VII
SOME COMPETENCY QUESTIONS WITH THEIR CORRESPONDING SPARQL QUERIES
CQ3. Must a university teacher be a researcher?
ASK { HERO:Teacher rdfs:subClassOf HERO:Researcher . }
CQ4. What is expected from university teachers?
SELECT ?prop ?range WHERE { ?prop rdfs:domain HERO:Teacher ; rdfs:range ?range ; a owl:ObjectProperty . }
CQ29. What is a campus?
SELECT ?definition WHERE { rdfs:isDefinedBy ?definition . }
SELECT ?comment WHERE { rdfs:comment ?comment }
SELECT ?superclass WHERE { HERO:Campus rdfs:subClassOf ?superclass . }
CQ41. Why are universities organised into departments?
SELECT * WHERE { HERO:Department rdfs:subClassOf ?x . OPTIONAL { ?x rdfs:subClassOf ?y . OPTIONAL { ?y rdfs:subClassOf HERO:HigherEducationOrganization } } }
SELECT ?definition WHERE { HERO:Department rdfs:isDefinedBy ?definition . }
CQ53. What higher education degrees exist?
SELECT * WHERE { ?subclass rdfs:subClassOf HERO:Degree }
CQ73. What does student tuition mean?
SELECT ?definition WHERE { <http://www.UniversityReferenceOntology.org/HERO#StudentTuition> rdfs:isDefinedBy ?definition . }
CQ77. Who are accreditors?
SELECT ?comment WHERE { HERO:UniversityAccreditation rdfs:comment ?comment }

As a result of the HERO evaluation via the competency questions technique, we can confirm that the knowledge encoded in the HERO ontology is sufficient to respond to the SPARQL queries translated from the natural language competency questions. Screenshots of queries 3 and 4 and their answers are shown in Fig. 8 (for query 4, we used TWINKLE, an offline SPARQL tool, because the NeOn Toolkit does not support the "ASK" form of SPARQL queries).

Fig. 8. Screenshots of SPARQL Queries 3 & 4 and their Answers

VII.3. Usability Issues

This evaluation dimension depends on the level of annotation of the evaluated ontology. How easy is it for users to recognize its properties? How easy is it to find out which ontology is more suitable for a given task? [31]. Usability issues focus on the ontology profile, which addresses the communication context of an ontology. These measures focus on ontology metadata about recognition, economical efficiency, and interfacing of an ontology. Moreover, usability levels are also useful for designing specific ontologies for particular user communities. Motivated by the annotation benefits described above, a significant effort has been spent in annotating the HERO ontology (classes and properties). Based on the input resources described in the selected scenarios (Section VI.1), we have documented our ontology with 97 annotations, divided as follows: 48 definitions, 37 comments and 12 labels, in order to allow a better understanding of its components.
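The annotation counts reported in the usability evaluation (48 definitions, 37 comments, 12 labels) can be checked mechanically; the following is a minimal rdflib sketch, again assuming a local RDF/XML copy of the ontology named hero.owl (an illustrative file name).

from rdflib import Graph, RDFS

g = Graph()
g.parse("hero.owl", format="xml")  # local RDF/XML copy of HERO (illustrative file name)

# Count the three kinds of annotations used to document HERO.
for name, prop in (("definitions", RDFS.isDefinedBy),
                   ("comments", RDFS.comment),
                   ("labels", RDFS.label)):
    count = sum(1 for _ in g.triples((None, prop, None)))
    print(f"{name}: {count}")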
VIII. Conclusion

This work was undertaken to construct a Higher Education Reference Ontology by following the guidelines of the NeOn methodology. Actually, very few research papers in the literature explain an ontology engineering process in detail, which might be very helpful for many stakeholders (ontology developers, domain experts, end-users) in order to
achieve ontology quality with regard to several perspectives: structural, functional and usability issues. In general, therefore, it seems that the ontology building process can be qualified as time-consuming and hard, for several reasons:
• several phases: specification, conceptualisation, formalisation, evaluation;
• numerous possible scenarios: development from scratch, merging, matching, integration; and
• potential resulting forms: taxonomy, classification, thesaurus, formal ontology.
More specifically, reference ontology development requires more effort than the development of a domain or application ontology, since it has to fulfil some particular characteristics which seem to be contradictory but are in fact complementary, as they mutually adjust each other:
• to be a core ontology, i.e. an ontology that regroups the central concepts of the knowledge domain;
• to have a broad coverage of the domain of interest, in order to suit any specific real-world application.
However, the effort spent in the construction of a reference ontology is worthwhile, since the return on investment is significant. As demonstrated in [5], a reference ontology is an incontestable contribution to several research areas, such as ontology matching, ontology evaluation, ontology modelling and the semantic web vision in general. In particular, a reference ontology for the higher education domain can serve as an instrument for university profiling and strategy development, in addition to providing a non-discriminatory ranking tool.
Based on the above arguments, we undertook the construction of a reference ontology for the higher education domain (the HERO ontology) by combining three of the nine scenarios proposed by the NeOn methodology, namely development from scratch (scenario 1) with the reuse of ontological and non-ontological resources (scenarios 2 and 3), in order to achieve a broad coverage of the relevant concepts describing the knowledge domain of interest. These concepts have been related to each other via hierarchical links and associations, described by properties and bounded in their interpretations by some axioms. Once the ontology was built, the evaluation process began, according to different perspectives:
• Structural: by using logical reasoners, mainly Pellet, which did not detect any inconsistencies or redundancies in the concept taxonomy;
• Functional: by assessing to what extent the ontology represents the real-world domain, which includes:
- expert agreement via an online questionnaire, which helped us to update the ontology according to the agreed experts' viewpoints, so that the ontology can fulfil the requirement of being a core ontology, as explained in Section II.1; and
- task assessment by way of competency questions that have been translated into SPARQL queries, to demonstrate that the ontology satisfies the specifications for which it was developed.
• Usability issues, which evaluate the documentation and annotation level of the ontology, judged to be good compared to other ontologies available on the web.
Besides, assessment should not stop at this point, since the reference ontology's objective is to reach the broadest possible agreement of domain experts in particular, and of end-users from a more general perspective. Finally, we hope that this research will serve as a foundation for future reference ontology developments intended for diverse knowledge areas, with the aim of reducing the heterogeneity of the ontological resources available on the web and of partially accomplishing the semantic web vision.
Acknowledgements

We are grateful to all the domain experts who accepted to answer our online questionnaire. We thank them for their constructive responses and helpful comments, especially Professor Nathalie Aussenac-Gilles (IRIT, Paul Sabatier University, France), Professor Houari Sahraoui (Montreal University, Canada) and Dr Abdelaziz Khadraoui (University of Geneva, Switzerland).
References
[1] D. Oberle, A. Ankolekar, P. Hitzler, P. Cimiano, M. Sintek, M. Kiesel, B. Mougouie, S. Baumann, S. Vembu, M. Romanelli, P. Buitelaar, R. Engel, D. Sonntag, N. Reithinger, B. Loos, H. Zorn, V. Micelli, R. Porzel, C. Schmidt, M. Weiten, F. Burkhardt, J. Zhou, DOLCE, ergo SUMO: On Foundational and Domain Models in the SmartWeb Integrated Ontology (SWIntO), Web Semantics: Science, Services and Agents on the World Wide Web, Vol. 5, N. 3, pp. 156-174, 2007.
[2] L. Zemmouchi-Ghomari, A.R Ghomari, Terminologies versus Ontologies from the perspective of ontologists, International Journal of Web Science, Inderscience Publisher (in press), 2013.
[3] M. Annamalai, L. Sterling, Guidelines for Constructing Reusable Domain Ontologies, Proceedings of the Workshop on Ontologies in Agent Systems, 2nd International Joint Conference on Autonomous Agents and Multi-Agent Systems, Melbourne, Australia, (Page 71, Year of Publication: 2003).
[4] B. Smith, C. Welty, Ontology: Towards a New Synthesis, Proceedings of Formal Ontology in Information Systems, ACM Press, New York, pp. 3-9, 2001.
[5] L. Zemmouchi-Ghomari, A.R Ghomari, Reference ontology, International IEEE Conference on Signal-Image Technologies and Internet-Based Systems, Marrakech, Morocco, pp. 485-491, 2009.
[6] F.A van Vught (ed.), Mapping the Higher Education Landscape: Towards a European Classification of Higher Education, Springer, 2009.
[7] M. Bucos, B. Dragulescu, M. Veltan, Designing a Semantic Web Ontology for E-learning in Higher Education, Proceedings of the IEEE 9th International Symposium on Electronics and Telecommunications, Timisoara, Romania, (Page 415, Year of Publication: 2010).
[8] A. Laoufi, S. Mouhim, E.H Megder, C. Cherkaoui, An Ontology Based Architecture to Support the Knowledge Management in Higher Education, Proceedings of the International Conference on Multimedia Computing and Systems, Ouarzazate, Morocco, (Page 1, Year of Publication: 2011).
[9] J. Mesaric, B. Dukic, An Approach to Creating Domain Ontologies for Higher Education in Economics, Proceedings of the 29th International Conference on Information Technology Interfaces, Cavtat, Croatia, (Page 75, Year of Publication: 2007).
[10] A. Burgun, Desiderata for domain reference ontologies in biomedicine, Journal of Biomedical Informatics, Vol. 39, N. 3, pp. 307-313, 2006.
[11] J. Milam, Ontologies in Higher Education, in A. Metcalfe (Ed.), Knowledge Management and Higher Education: A Critical Analysis, Hershey, PA: Information Science Publishing, Chapter 3, 2006.
[12] M. Suárez-Figueroa, et al., The NeOn Methodology for Ontology Engineering, in Ontology Engineering in a Networked World, Springer, Berlin Heidelberg, pp. 9-34, 2012.
[13] A. Gomez-Perez, et al., Towards a method to conceptualize domain ontologies, Workshop on Ontological Engineering, ECAI '96, Budapest, Hungary, (Page 41, Year of Publication: 1996).
[14] Y. Sure, R. Studer, On-To-Knowledge methodology, in (J. Davies, et al., Eds) On-To-Knowledge: Semantic Web enabled Knowledge Management, J. Wiley and Sons, 2002.
[15] H.S Pinto, et al., DILIGENT: Towards a fine-grained methodology for distributed, loosely-controlled and evolving engineering of ontologies, Proceedings of ECAI, the 16th European Conference on Artificial Intelligence, Valencia, Spain, (Page 393, Year of Publication: 2004).
[16] S. Staab, Core Ontologies, or how to make use of some ontology design patterns, Seminar ontologies, ISWeb: information systems and semantic web, at http://www.uni-koblenz-landau.de, 2006 (accessed 31.01.2009).
[17] L. Ghomari, A.R Ghomari, A Comparative Study: Syntactic versus Semantic Matching Systems, Proceedings of the International Conference on Complex, Intelligent and Software Intensive Systems, Fukuoka, Japan, (Page 700, Year of Publication: 2009).
[18] J. Brank, M. Grobelnik, D. Mladenic, A survey of ontology evaluation techniques, Proceedings of the Conference on Data Mining and Data Warehouses (SiKDD), Ljubljana, Slovenia, (Year of Publication: 2005).
[19] J. Brank, M. Grobelnik, D. Mladenic, Gold standard based ontology evaluation using instance assignment, Proceedings of the EON (Evaluation of Ontologies for the Web) Workshop, Edinburgh, United Kingdom, 2006.
[20] J. Euzenat, P. Shvaiko, Ontology Matching, Springer, Heidelberg, 2007.
[21] J.F Brinkley, et al., A framework for using reference ontologies as a foundation for the semantic web, Proceedings of the AMIA Fall Symposium, (Page 95, Year of Publication: 2006).
[22] C. Rosse, J.L.V Mejino, A reference ontology for biomedical informatics: the Foundational Model of Anatomy, Journal of Biomedical Informatics, Vol. 36, N. 6, (Page 478, Year of Publication: 2003).
[23] B. Andersson, M. Bergholtz, A. Edirisuriya, J. Ilayperuma, P. Johannesson, B. Gregoire, Towards a Reference Ontology for Business Models, 25th International Conference on Conceptual Modeling, Tucson, AZ, USA, November 6-9, (Year of Publication: 2006).
[24] N. Motta, N. Gibbins, AKT Reference Ontology v2.0, available at: http://www.aiai.ed.ac.uk/~jessicac/project/akt-maphtml/toplevel.html, 2003 (accessed 10.01.2009).
[25] M.C Suárez-Figueroa, A. Gomez-Perez, B. Villazon-Terrazas, How to write and use the Ontology Requirements Specification Document, Proceedings of ODBASE 2009, OTM'09, LNCS 5871, Vilamoura, Algarve, Portugal, (Page 966, Year of Publication: 2009).
[26] M. Gruninger, M.S Fox, Methodology for the design and evaluation of ontologies, in D. Skuce (ed), IJCAI-95 Workshop on Basic Ontological Issues in Knowledge Sharing, Montreal, Canada, (Page 6.1, Year of Publication: 1995).
[27] M. Fernandez-Lopez, A. Gomez-Perez, N. Juristo, METHONTOLOGY: From Ontological Art Towards Ontological Engineering, AAAI Symposium on Ontological Engineering, Stanford, USA, (Year of Publication: 1997).
[28] M.C Suárez-Figueroa, NeOn methodology for building ontology networks: Specification, scheduling and reuse, Ph.D. Thesis, Universidad Politécnica de Madrid, 2010.
[29] M. Suarez-Figueroa, K. Dellschaft, E. Montiel-Ponsoda, B. Villazon-Terrazas, Z. Yufei, G. Aguado-de-Cea, A. Garcia, M. Fernandez-Lopez, A. Gomez-Perez, Espinoza, M. Sabou, NeOn Methodology for Building Contextualized Ontology Networks (NeOn Deliverable D5.4.1), FP7 NeOn Project, 2008.
[30] N. Motta, S. Peroni, J.M Gomez-Perez, M. D'Aquin, N. Li, Visualizing and Navigating Ontologies with KC-Viz, in (Suarez-Figueroa et al., Eds.), Ontology Engineering in a Networked World, Berlin, Germany, Springer, 2012.
[31] A. Gangemi, C. Catenacci, M. Ciaramita, J. Lehmann, Ontology evaluation and validation, Technical report, Laboratory for Applied Ontology, 2005.
[32] A. Lozano-Tello, A. Gomez-Perez, Ontometric: A method to choose the appropriate ontology, Journal of Database Management, Vol. 15, N. 2, pp. 1-18, 2004.
[33] A. Ben Abacha, P. Zweigenbaum, Medical Question Answering: Translating Medical Questions into SPARQL Queries, Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium, (Page 41, Year of Publication: 2012).
[34] L. Zemmouchi-Ghomari, A.R Ghomari, Translating Natural Language Competency Questions into SPARQL Queries: a Case Study, WEB 2013, IARIA, Seville, Spain, January 27 - February 1, 2013 (to appear).
Authors' information

Leila Zemmouchi-Ghomari is a computer science engineer; she received a research Master in computer science from ESI, Algiers, in 2009. She has been a PhD student since 2009 and is a member of LMCS (the Laboratory of Systems Design Methodologies). Her research interests focus on ontology engineering, knowledge engineering and the semantic web.

Abdessamed Réda Ghomari received a PhD in computer science from ESI, Algiers, in 2008. From 2000 to 2012, he was head of the "Information System Management" team at LMCS (the Laboratory of Systems Design Methodologies). His research interests focus on cooperative knowledge engineering, organisation 2.0 and collective intelligence.
International Review on Computers and Software (I.RE.CO.S.), Vol. 8, N. 2 ISSN 1828-6003 February 2013
A Fuzzy Logic Based Method for Selecting Information to Materialize in Hybrid Information Integration System Wadiî Hadi, Ahmed Zellou, Bouchaib Bounabat Abstract – The virtual approach of information integration, called mediation, suffers from slow responses to user queries and from the risk of source unavailability. To cure these problems, some information may be stored at the mediator level. However, the success of this solution depends on selecting the right set of information to materialize. This task should be automated when the collection of data sources is large and evolving. Existing methods for the selection of information to materialize have some drawbacks and do not always give feasible or optimal results. Therefore, we propose a new approach based on fuzzy logic, which we call FULVIS. Copyright © 2013 Praise Worthy Prize S.r.l. - All rights reserved. Keywords: Virtual Information Integration, Mediation, Materialization, Hybrid Integration, Fuzzy Logic

Manuscript received and revised January 2013, accepted February 2013

Nomenclature
ANCFCC: National Agency of Land Conservation and Mapping
DB: Database
CM: Cluster and Merge
CRDB: Cancer Research Data Base
DDE: Directorate of State Domains of Morocco
DGI: General Directorate of Taxes
ENA: Element is Not Available
EP: Element is Predictable
ES: Element is Stable
ESL: Element is Slow
ETL: Extract Transform Load Tool
FLC: Fuzzy Logic Controller
KP: Knapsack Problem
MEP: the Materialization of the Element is Profitable
RDBMS: Relational Data Base Management System
SES: Size of the Element is Small

I. Introduction

We live in a connected world; some interdependent but autonomous entities can easily share their data sources. To boost their cooperation, we have to integrate them. Nevertheless, if the sources are numerous, non-interoperable and heterogeneous, the integration will be arduous. To facilitate this task we have witnessed, in the last decade, a strong emergence of information integration systems [1]. Information integration systems are important in several areas such as e-gov, e-learning and e-commerce, especially when the user needs to retrieve and combine information from multiple sources and when isolated information sources cannot satisfy a user query. They hide the complexity and heterogeneity of the sources. Indeed, without an integration system, the user faces several types of heterogeneity because of the sources' autonomy [2]: the information sources can offer disparate interfaces and use dissimilar query languages (syntactic heterogeneity); they can use different data models to represent data (relational, object, semi-structured, unstructured), which is data model heterogeneity; and even if they sometimes use the same syntax and data model, logical heterogeneity remains, because the distinct sources are designed autonomously, so diverse semantics, patterns and structures are used. Several information integration approaches exist; one of the most used is mediation, also known as the virtual approach [3]. Systems adopting this approach are called mediators [4]. They allow combining a set of independent, heterogeneous, distributed and dynamic information sources. In this approach, for each user query, information is extracted from the sources and integrated at runtime. However, this solution suffers from several shortcomings [5], the most important of which are the slow response to users' queries and the risk of source unavailability, since mediation needs to access the sources for each user query. Storing some information at the mediator level can solve the problem, since it can avoid accessing the sources when processing user queries. However, the success of materialization is conditioned by making the right choice of the useful information to materialize. This task is laborious, as there are several factors affecting this choice [6]. In this work, we study the criteria affecting the selection of the information to materialize, review the existing approaches for selecting the elements of information to materialize, and offer a new and more
efficient one.
This paper is organized as follows. In the second section we present an e-gov example; the third section describes the mediation architecture; and the fourth section discusses the factors affecting performance. The fifth section formalizes the problem of selecting the information to materialize. The sixth section presents the existing approaches offering a solution for the selection of the information to materialize. The seventh section presents our new solution based on fuzzy logic, and the eighth section presents the results of the simulations. Finally, we present conclusions and outlooks in the ninth section.

II. Example

The Directorate of State Domains of Morocco (DDE) [7] is the manager of the private real estate of the Moroccan state. It is responsible for the preservation and the management of a patrimony estimated at more than 500 billion MAD (more than 50 billion dollars). The DDE plays a key role in many economic and social development projects, such as housing and urbanization projects like the Zenata new city, and major infrastructure projects like Tanger Med. To better manage its assets, the DDE needs accurate, complete and updated information on them. Indeed, the DDE needs information from its partners in order to:
- Update the selling price and rental value of its patrimony. For this, the DDE has to use the database of the General Directorate of Taxes (DGI), containing all real estate transactions and their values at the national level.
- Complete the information on the assets with essential information such as limits, maps and the legal status of real estate properties. In this sense, the DDE can access the data sources of the National Agency of Land Conservation and Mapping (ANCFCC).
- Master the uses and the consistency details of the assets allocated to the public administrations (private domain of the state exploited by the administrations); for this, the DDE has to resort to the information sources of the public administrations (more than 30 sources).
- Know the position of its assets relative to infrastructures (roads, ports, airports, ...); that is why the database of the Ministry of Public Works is very useful. The DDE also needs to know the situation of its assets with respect to the development plans; for that, it should interface with the databases of the urban agencies (27 autonomous regional agencies).
Two main approaches can be used for the integration of information from the partner sources: the physical approach (or data warehouse), which consists in storing all relevant information in a common warehouse, and the virtual approach (known as mediation), where data remain in their original locations and are extracted from the sources only when responding to the user's queries. The DDE partners' sources are numerous (about sixty); the virtual approach is more suited for integrating them, because the physical approach is adapted to a small number of sources due to the constraint of the space available for storage [5].

[Fig. 1 shows the mediator components (Interface handling Queries and Results, Query Processor (Analyze, Enrich, Rewrite, Optimize), Execution engine, Catalog of global and local schemas and mappings, Cache, Integration and presentation) on top of Wrappers over XML, flat file and database sources.]
Fig. 1. The Mediation Architecture
III. Mediation Architecture The Mediation is an intermediary tool between a user (or its applications) and a set of information sources. This tool provides a service for transparent access to sources via an interface and unique access language. In the information integration by mediation [8], the user does not formulate his queries according to the schema of the source where the information is stored but the queries are expressed in terms of the global schema (mediation schema). However, the data is stored in sources using other schemas (source schema). Therefore, the mediation system must rewrite user queries into queries formulated in terms of source schemas [9]. Generally, a mediation system (see Fig. 1) must include two main levels: the central level called mediator and the second level called wrapper. The mediator hides the diversity and distribution of information sources and provides users with a unified overview of the content of those different sources. The wrapper level fits user queries to the sources. It acts like a translator that transforms the user queries from the mediator query language to another directly understandable by sources, if necessary. In the other direction, it transforms the responses of sources to the format used by the mediator [10]. In the first level, an Interface captures and manages user queries, and in the other way, displays the results. A query analyzer can be added depending on the system, to do a lexical, syntactical and semantic analysis to verify the queries accuracy. This latter can be enriched based on domain ontology [11]. Then, the user query (formulated in terms of global schema) will then be rewritten into several sub-queries formulated in terms of sources schemas. Each sub-query is dedicated to one or more sources, and is formulated according to the local schema of the source(s) in question.
The optimizer's role is to optimize the sub-queries to find a better execution plan. Subsequently, an execution controller sends the sub-queries to the sources via the wrappers. These modules communicate with a catalog that stores a set of meta-data on the sources. In the other direction, the different responses of the sources are sent, after the staging phase, to the execution engine that sends them to an integrator to form a single integrated result to be presented to the user via the interface. The execution controller uses a cache where it stores the intermediate results.

IV. Discussion

The analysis of the mediation approach, especially the user queries' treatment process, reveals two main disadvantages: the risk of source unavailability and the slowness of the user queries' response. The first annoyance is due to the deficiency of one or more data sources when processing user queries. Indeed, in some cases, sources may be unavailable due to maintenance of the source, for example, limitation of the number of connections, or other reasons. As for the second annoyance, it is justified by the fact that the mediator rewrites the user query into many sub-queries to be executed in the sources. The time to respond to a user query is penalized by the response time of the slowest sub-query in the best case, since the response time of the overall query is equal to the response time of the slowest sub-query if the sub-queries are executed in a parallel manner, or to the sum of all sub-query response times if they are executed serially, as when the execution engine has to wait for the execution of one or more other sub-queries. To overcome these drawbacks, the solution would be to store part of the sources in advance at the mediator level to prevent source unavailability and slowness of the user queries' response time. This approach is called the hybrid approach. Nevertheless, we have two major constraints: the space available for storing the information is limited, and the stored information can become outdated quickly if it changes frequently at its origin sources. Hence, we need to update it under the constraint of a limited time for the update. We then deduce the following selection criteria:
- The unavailability of information when processing the user queries is a major risk in the virtual approach of information integration. Thus the availability of information is a capital criterion in the selection of information to materialize.
- The slowness of information extraction from the sources is an important criterion in selecting the information to materialize at the mediator level, because we aim to improve the response time that is penalized by the slowest sub-queries.
- The size of the information should be considered in the selection of information to materialize in the mediation system, since the space available for storage is limited and we cannot store all the desired information.
- The stability of information is a key criterion in the selection process, since if elements of information that change frequently are materialized, they would have to be updated more often, and that is not possible.
- The predictability of the information must count in the selection process, since the storage of the most demanded information may improve the overall response time and availability. By contrast, the materialization of some information that is rarely requested will only have a limited effect on performance.
We conclude that there are mainly five factors for the selection of the information to materialize (slowness, unavailability, size, stability and predictability); the question is how to use them to make a choice?
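As a small numeric illustration of the response-time penalty discussed above, the following Python sketch compares parallel and serial execution of sub-queries; the per-source latencies are hypothetical values chosen only for illustration.

```python
# Hypothetical per-source sub-query response times (in seconds).
sub_query_times = {"DGI": 0.4, "ANCFCC": 2.5, "Public Works": 0.9}

# Parallel execution: the overall query waits for the slowest sub-query.
parallel_response = max(sub_query_times.values())   # 2.5 s

# Serial execution: the overall query pays the sum of all sub-queries.
serial_response = sum(sub_query_times.values())     # 3.8 s

print(f"parallel: {parallel_response:.1f} s, serial: {serial_response:.1f} s")
```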
V. Problematic
In order to choose the parts of the sources to materialize in a hybrid mediation system, we need to measure different criteria, that is to say, define a metric for each criterion; the first question is therefore how to quantify these criteria. In addition, the combination of these criteria makes the choice difficult [12]. In reality, some information may have criteria that push towards its materialization and others that promote its virtualization. For instance, a piece of information may be small, which encourages its storage at the mediator because it is not expensive, while it changes so frequently that its maintenance cost would be high if admitted to materialization. Hence, the second question is how to combine the different criteria to make a decision. Although the criteria we have derived are the most important, others may arise in particular contexts, such as whether access to some data sources requires a fee. Thus, any proposed solution should allow other criteria to be added easily. Moreover, the choice of the part to materialize becomes more difficult when the collection of information sources to be integrated is large and changing, hence the need to automate this task. The problematic is reduced to two main questions: how to quantify the criteria affecting the choice of the information to materialize, and how to combine them to make the best choice?
VI. Study of Existing Approaches
We studied several mediators adopting the hybrid approach, which extract and store some information, in advance, at the mediator level. We excluded the mediators that propose caching the last results of user queries, because information stored in a cache is used only to respond to queries identical to the previous ones. In addition, the information stored in the cache is not maintained; it is quickly replaced because of the very limited size of the cache.
Fig. 2. The Ariadne Architecture

The systems studied in this work are: Squirrel [13], Lore [14], HAOI [15], Yacob [16], IXIA [17], Ariadne [18] and CRDB [19] from the research world, and EXIP [20], DataPort [21] and XIP [22] from the industrial world. Among these works, we find only two solutions that provide methods for selecting the information to materialize: CRDB and Ariadne.

VI.1. Ariadne

Ariadne [23] is an extension of the SIMS [24] mediation system to support semi-structured information sources in a web environment; it is based on a mediator-wrapper architecture, as shown in Fig. 2. The main query processing components are "The information source selection", which dynamically selects an appropriate set of information sources, "The access planning", which generates an execution plan for processing the user query, and "The query-plan reformulation", which optimizes this plan to minimize the execution cost. In Ariadne, the intake unit for materialization is the pattern. A pattern can be a set of frequently requested attributes of a class or a set of objects from one class. To construct these patterns automatically, Ariadne uses an algorithm called CM (Cluster and Merge) that relies on statistics of previous user queries [18]. For the selection of patterns to materialize, Ariadne uses three factors: stability, slowness and size. Thereafter, the problem is reduced to a knapsack problem, noted in the literature as the Knapsack Problem (KP) [25]. KP is a combinatorial optimization problem that models a situation similar to filling a backpack with a set of objects, each with a weight and a value. We must put the objects in the backpack by maximizing the total value without exceeding the weight tolerated by the backpack. By analogy, the patterns are the objects, the available storage space is the weight allowed by the backpack, and the gain of materialization is the value of the objects to maximize. Ariadne uses a simple and effective algorithm for solving the Knapsack Problem called the greedy algorithm [25]. But this algorithm risks producing non-optimal outcomes in particular cases. Indeed, the greedy algorithm calculates the ratio of the gain to the size of each object, ranks the objects in descending order and fills the knapsack until there is no more space. However, suppose we had only two patterns, the first with an advantage of 4 and a size of 2 and the second with an advantage and a size both equal to 10, and the available storage capacity is 11. If we apply this algorithm in this case, it is the first pattern that will be stored, generating a profit of 4 and preventing the storage of the second pattern of profit 10. In this method, the entity of admission is the pattern. It is an entity whose granularity is editable (it can be narrowed or widened), which allows flexibility. So, for the same information source, some data may be materialized and other data can remain in their sources. The quantification of the different criteria is done through the gathering of statistical information on previous users' queries. Although this method has the advantage of being automated, it suffers from intolerance to the addition of other criteria, such as the information access cost where appropriate. Besides, knowing that the different criteria are not equally important, Ariadne does not associate different weights to the criteria. Moreover, the proposed manner of calculating the profit of the materialization of each pattern does not take into account some important factors such as availability.
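The following Python sketch reproduces the greedy ratio heuristic and the two-pattern counterexample described above (profits 4 and 10, sizes 2 and 10, capacity 11); the function name and data layout are illustrative and not taken from Ariadne.

```python
def greedy_knapsack(patterns, capacity):
    """Greedy heuristic: fill by descending profit/size ratio, as described for Ariadne."""
    chosen, used, total_profit = [], 0, 0
    for name, profit, size in sorted(patterns, key=lambda p: p[1] / p[2], reverse=True):
        if used + size <= capacity:
            chosen.append(name)
            used += size
            total_profit += profit
    return chosen, total_profit

# The counterexample from the text: greedy keeps pattern A (profit 4) and then
# has no room for pattern B (profit 10), although B alone would be optimal.
patterns = [("A", 4, 2), ("B", 10, 10)]
print(greedy_knapsack(patterns, capacity=11))   # (['A'], 4) -- the optimum is (['B'], 10)
```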
VI.2. CRDB

Cancer Research Data Base (CRDB) [26] is a research project for the integration of the information sources of a dozen research laboratories for cancer therapy. The integration was performed using the functionalities of a relational DBMS. Some sources, like relational DBMSs, are integrated through mediation by connecting to remote servers and using the features of the RDBMS. Other sources are materialized in the common database and are refreshed periodically by an Extract Transform Load tool (ETL). As shown in Fig. 3, data from the Computational Docking and Small Molecule databases are integrated via connections to the databases through mediation. Data of the Functional Assays and Structural Assays are integrated by importing them into the CRDB according to the data warehouse approach. The data warehouse part of CRDB consists of two data marts, one about Mutants and the other about Molecular data.
Fig. 3. The CRDB Architecture
This work [19] raises the issue of selecting the part to materialize and suggests ways to facilitate this selection. Indeed, for each of the five criteria (slowness, availability, size, stability and predictability), a source is rated 1 if its adherence to this criterion favors its materialization, and 0 otherwise. When the sum of the scores for a source is greater than or equal to 3, the data source must be materialized; if not, the data will remain in their original location, according to the following sequence:
A. Data sources are classified according to their characteristics (slowness, availability, size, stability and predictability).
B. Each characteristic is converted to the choice (materialize or not) that implements it best.
C. A numerical value is assigned to each choice, 1 for Materialize and 0 for Not, and a score is calculated for each data source by summing the values over all the characteristics.
D. Finally, sources with a score greater than or equal to 3 are to be materialized.
This approach is very simplistic: for quantifying the criteria, it does not specify how to rate a source's adherence to a criterion, and the scoring is done by an estimate. The method adopted for CRDB does not take into account the available storage capacity or the time limit for the maintenance of the materialized part. All sources eligible for materialization, that is to say, having accumulated a score greater than or equal to 3, are taken. So, it is possible that the available storage space or maintenance time is depleted before storing all admitted sources. In consequence, this method does not prioritize one source over another if both are eligible for materialization. In addition, if the storage of all eligible sources does not consume the available resources, the method does not advocate the materialization of some sources with scores below 3, although their materialization could improve the overall performance of the system. Also, the different criteria in this method are placed on an equal footing, but in reality the different criteria do not have the same importance. For example, it seems obvious that the availability of the sources and their speed are more important than the other criteria for the end users, while the frequency of change and the size of the sources are decisive in the choice for the system administrator. In addition, this approach proposes to take a source or leave it as a whole, while in some sources there are parts that are stable and others that are changing, and some parts that are predictable and others that are not. So the selection entity of this method is not optimal. Moreover, this approach is static, because once the selection is done for the first time, it is difficult to change it following a change in user behavior (predictability of user queries) or in source performance (availability, speed, frequency of change and size). We conclude that this approach is suited to situations where the collection of sources is small and not changing, and that it is not suitable for automation.
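A minimal Python sketch of the scoring rule described above follows; the source names and the 0/1 ratings are hypothetical, and in CRDB these ratings are assigned by subjective estimation rather than computed.

```python
CRITERIA = ("slowness", "availability", "size", "stability", "predictability")

def should_materialize(ratings, threshold=3):
    """CRDB-style rule: materialize when at least `threshold` criteria favor it."""
    return sum(ratings[c] for c in CRITERIA) >= threshold

# Hypothetical ratings: 1 means the criterion favors materialization, 0 otherwise.
sources = {
    "source_A": {"slowness": 1, "availability": 0, "size": 1, "stability": 1, "predictability": 0},
    "source_B": {"slowness": 0, "availability": 1, "size": 0, "stability": 0, "predictability": 1},
}

for name, ratings in sources.items():
    print(name, "->", "materialize" if should_materialize(ratings) else "keep virtual")
# source_A -> materialize (score 3); source_B -> keep virtual (score 2)
```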
VI.3. Discussion

We believe both studied methods have shortcomings. The one proposed in CRDB cannot be automated, since the assignment of values to the various factors is done by subjective estimation. Furthermore, it does not take into account the available resources, so it can produce unfeasible results. Although the method implemented in Ariadne is automated, it does not take into consideration all the factors affecting the choice, such as availability. In addition, it does not combine the factors affecting the selection effectively, and it is not flexible with respect to the addition of other choice factors. Hence, we need to propose a new method that is flexible for adding new factors and that effectively combines the choice factors in order to provide better results. For this, we propose a new method based on fuzzy logic, named FULVIS, which meets those requirements and takes into consideration the uncertainty of the values assigned to the selection factors.
VII. FULVIS: Our New Approach Based on Fuzzy Logic
After studying several hybrid systems that materialize, in advance, some of the information at the mediator level to prevent the unavailability of sources and to improve the processing time of user queries, we found that only two systems offer solutions to the problem of how to select the part to materialize. These two solutions present some shortcomings, among those presented below:
- the non-consideration of all necessary factors for selecting the information to materialize;
- the quantification of the factors does not take into account the uncertainty of the collected values;
- the poor combination of the factors;
- the inflexibility with respect to adding new selection factors;
- the method cannot be automated.
To address these shortcomings, we propose a new method called FULVIS (FUzzy Logic based approach for VIews Selection) based on fuzzy logic. Each criterion for the selection of the information to materialize (the stability, the size, the slowness, the availability, and the predictability of user queries) is difficult to quantify because it is affected by several underlying factors. For example, the slowness depends on the computing capacity of the machine hosting the source, on its workload at the time of information extraction, on the bandwidth, on the nature of the source (RDBMS, semi-structured data, flat files, ...), on the kind of integration, and more. As for the size of the information, it depends on the data model used, and the volatility of the information also matters, since changes to the information may alter its size. We believe that all attempts to quantify these criteria can only lead to inaccurate values. In addition, for a particular criterion such as size, saying that an element is either small or large does not make sense.
Indeed, if we consider that elements whose size is greater than 10 MB are large and the elements below this value are small, we will consider in the same manner elements of 1 MB and elements of 9 MB. Furthermore, it is not reasonable to consider one element as large and another as small when their sizes differ by one byte. The same remarks can be made on the other criteria. To take into account the uncertainty and for a better combination of the different factors affecting the choice of the part to materialize, we suggest using fuzzy logic. The fact that humans can often handle different types of complex situations by using information much of which is subjective and imprecise has stimulated the search for alternative paradigms for modeling and control. Thus, the concept of modeling called "fuzzy" found its origins in the fuzzy set theory proposed by Zadeh in 1965 [27], as a way to deal with uncertainty, based on the idea of defining sets that can contain elements gradually. This theory has introduced a way to formalize the methods of human reasoning using rule bases and linguistic variables for knowledge representation [28]. Fuzzy logic is the best suited to our situation, since it tolerates imprecise data. It is more appropriate for reasoning under uncertainty than other theories such as probability [29]. Fuzzy logic is simple, since it is based on natural language, and flexible, because we can easily add new rules or delete others if necessary [30]. That is why we propose to build a fuzzy controller for the admission and replacement of the information to materialize. For this, we will not consider the membership of an information element to a selection criterion as 1 or 0, but through a membership function. In the theory of fuzzy logic, this function is the degree to which an element x belongs to the set A, where A is the fuzzy set of size, slowness, stability, etc. This function is given by a continuous representation μ_A(x) with μ_A(x) ∈ [0, 1], unlike classical logic where:

μ_A(x) = 1 if x ∈ A, and μ_A(x) = 0 if x ∉ A

used to obtain the fuzzy rules from the expert. The parametric representation of the trapezoidal membership functions is achieved by the tuple (c_i, a_i, b_i, d_i), which characterizes the membership functions.
3. The fuzzy inference is made with a fuzzy conjunction and the generalized modus ponens constructed from the implication function I (like a t-norm).
4. The integration of all fuzzy rules is made by means of the disjunctive connective G (like a t-conorm).
Most of the existing FLCs are based on fuzzy reasoning methods; the most commonly used are the Mamdani method, called the "min-max method" [32], and the Larsen method, called the "prod-max method" [33]. However, different fuzzy implications often employed in an FLC have been described in the literature [34], [35], [36]. We will use the max-prod inference system (I = PROD, G = MAX) (note that max-min inference gives similar results). More concretely, we will define a trapezoidal membership function for the linguistic labels, each representing one criterion. We will also define a rule set to deduce the profitability of materialization. The proposed functions for each criterion are as follows.
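To make the trapezoidal membership functions and the max-prod (Larsen) inference concrete, here is a small Python sketch; the interpretation of the tuple (c, a, b, d) as support and core break-points, as well as the labels, the single rule and the input values, are assumptions made for illustration and do not reproduce the actual FULVIS rule base.

```python
def trapezoid(x, c, a, b, d):
    """Trapezoidal membership function parameterized by (c, a, b, d):
    0 before c, rising to 1 on [c, a], equal to 1 on [a, b], falling to 0 on [b, d]."""
    if x <= c or x >= d:
        return 0.0
    if a <= x <= b:
        return 1.0
    if x < a:
        return (x - c) / (a - c)
    return (d - x) / (d - b)

# Hypothetical linguistic labels for two criteria.
def size_is_small(size_mb):     return trapezoid(size_mb, -1, 0, 2, 5)
def element_is_slow(latency_s): return trapezoid(latency_s, 1, 3, 10, 12)

# One max-prod (Larsen) rule: IF size is small AND element is slow THEN
# materialization is profitable.  PROD is used for the conjunction/implication;
# MAX would aggregate several such rules (only one rule is shown here).
def rule_activation(size_mb, latency_s):
    return size_is_small(size_mb) * element_is_slow(latency_s)

print(rule_activation(size_mb=1.0, latency_s=2.0))  # 0.5: partial activation in (0, 1)
```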
VII.1. The Membership Function: Size of the Element is Small (SES)

The space available for the materialization is critical, since storing an information element at the mediator would be impossible if its size exceeds the space devoted to materialization in the mediation system. If we assume that the elements are similar with respect to all criteria except the size, then it would be wise to materialize the elements from the smallest to the largest one until the space is no longer available. We suppose that M is the size devoted to materialization, nvc the number of competing elements, and T(x) the size of the element x. When constructing the membership function related to the size of an information element, we considered as elements of small size the elements whose size is twice smaller than the size dedicated to each of the elements if the storage space is shared fairly between the elements. Then, for an element x, if T(x) < M/2nvc, x is considered as small. We consider the elements whose size is bigger than the size dedicated to each of the elements if the storage space is shared fairly as not small; that means that if T(x) ≥ M/nvc, x is not considered as small. The transition from "small" to "not small" is not discontinuous for the element x for which M/2nvc ≤ T(x) < M/nvc

> max_hops, update the tabu value; stop broadcasting of mobile agents; send the data to the Mobile Client; terminate the process.
Step 5: For each node visited by the MA, update the tabu value.
Step 6: If the tabu entry is already updated, that means the MA has already reached the node; the MA is terminated.
Step 7: If the tabu value is not updated, then update it.
Step 8: Once the tabu value is completely filled, that means the MA has listed all the nodes.
Step 9: Repeat from Step 2 to Step 8 until the ant agent reaches the final destination (i.e., until it finds the required data).
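A minimal Python sketch of the tabu-list bookkeeping in Steps 4-8 follows; Steps 1-3 are not shown in this excerpt, so the agent fields (tabu list, hop counter, max_hops) and the returned actions are assumptions made only for illustration.

```python
class MobileAgent:
    """Toy ant-like mobile agent carrying a tabu list of visited nodes."""
    def __init__(self, all_nodes, max_hops):
        self.tabu = {node: False for node in all_nodes}  # visited flags
        self.hops = 0
        self.max_hops = max_hops

    def visit(self, node, has_data):
        self.hops += 1
        if self.hops > self.max_hops:          # Step 4: hop limit reached
            self.tabu[node] = True
            return "stop: deliver data to the mobile client and terminate"
        if self.tabu[node]:                    # Step 6: node already visited
            return "terminate: node already covered by this agent"
        self.tabu[node] = True                 # Steps 5 and 7: mark the node
        if all(self.tabu.values()):            # Step 8: every node listed
            return "all nodes listed"
        return "data found" if has_data else "forward to next node"  # Step 9

agent = MobileAgent(all_nodes=["n1", "n2", "n3"], max_hops=5)
print(agent.visit("n1", has_data=False))   # forward to next node
```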
Fig. 1. DCS Architecture

IV. Performance Evaluation

Our proposed algorithm was evaluated using NS2
[21]. We assume that each node generates a sequence of data requests with exponentially distributed time intervals. The simulation was carried out in an area of 500m × 500m with 50 nodes roaming within that simulation area.
TABLE I. SIMULATION PARAMETERS
Number of nodes: 50
Routing protocol: AODV
Traffic model: CBR
Area: 500 m × 500 m
Number of data items: 1000

The following performance metrics are used for evaluating DCS, the proposed cache discovery algorithm.
• End-to-End Delay or Mean Overall Packet Latency: the delay a packet suffers between leaving the sender application and arriving at the receiver application.
• Routing Overhead: the total number of routing packets transmitted during the simulation.
• Throughput or Packet Delivery Ratio: the ratio between the number of packets sent out by the sender application and the number of packets correctly received by the corresponding peer application.
• Average Query Distance or Route Length: the average distance (number of hops) covered by a successful request.
Fig. 3 shows the throughput in terms of packet delivery ratio. The data packets are delivered without any packet drop in the DCS scheme, whereas AODV drops data packets when routes are disconnected. The packet delivery ratio is almost 100% during the packet interval 950 s to 1000 s.
Fig. 4 shows the performance of the DCS scheme with respect to end-to-end delay. DCS takes less time to deliver the packets at the packet interval of 70 s when compared with AODV.
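As a small illustration of how the metrics defined above are computed from simulation counters, consider the following Python sketch; the counter values are hypothetical and are not taken from the NS2 traces used in the paper.

```python
# Hypothetical counters extracted from a simulation trace.
packets_sent     = 1000
packets_received = 968
routing_packets  = 4200
total_latency_s  = 125.8            # summed over all delivered packets
hops_per_success = [2, 3, 2, 4, 3]

packet_delivery_ratio = packets_received / packets_sent        # throughput metric
mean_packet_latency   = total_latency_s / packets_received     # end-to-end delay
routing_overhead      = routing_packets                        # total routing packets
average_route_length  = sum(hops_per_success) / len(hops_per_success)

print(packet_delivery_ratio, mean_packet_latency, routing_overhead, average_route_length)
```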
Fig. 5 presents the energy consumption. The DCS scheme utilizes less energy when compared with the AODV algorithm. As the mobile agents serve the data requests of the mobile nodes, the communication with the server is minimized. This leads to the preservation of energy.
Fig. 3. Packet Interval vs Packet Delivery Ratio
Fig. 5. Packet Interval vs Energy Consumption
Fig. 4. Packet Interval vs End-End-Delay
Fig. 6 shows the routing overhead. Compared with AODV, DCS has less overhead. As the MSS and the mobile agents cooperate with each other to serve the mobile nodes' data requests, there is no need for much routing between nodes. This results in a lower routing overhead for the proposed algorithm.
Fig. 6. Packet Interval vs Overhead

Fig. 7 reports the average hop distance of each protocol. We can see that the DCS scheme has a shorter hop length when compared with AODV. As the maximum hop count is checked during each broadcast in the proposed algorithm, the route length is minimized. The minimum route length at each packet interval level improves the performance of the network.

Fig. 7. Packet Interval vs Route Length

V. Conclusion

Caching frequently accessed items in mobile clients improves the performance of the network through the reduction of channel bandwidth usage, energy consumption and cost. The ant colony system helps to find an optimized solution based on the positive feedback of pheromone. Each ant in the ant colony is seen as a mobile agent. An agent traverses from one client to another client. This gives the agents the ability to travel and gather information at different sites and to collaborate with other agents on behalf of their clients. The concept of the mobile agent is applied in this paper to implement the data cache strategy. In this paper we have proposed the cache discovery process only. Future work will focus on incorporating the cache replacement and cache consistency algorithms.

References
[1] H. Yan, J. Li, G. Sun, and H. Chen, "An optimistic power control MAC protocol for mobile ad hoc networks," in Proc. IEEE ICC, 2006, pp. 3615–3620.
[2] Z. Haas and J. Deng, "Dual busy tone multiple access (DBTMA)—a multiple access control scheme for ad hoc networks," IEEE Trans. Comput., vol. 50, no. 6, pp. 975–985, Jun. 2002.
[3] Thanasis Korakis, Zhifeng Tao, Yevgeniy Slutskiy, Shivendra Panwar, "A Cooperative MAC protocol for Ad Hoc Wireless Networks", in Proceedings of the Fifth Annual IEEE International Conference on Pervasive Computing and Communications Workshops (PerComW'07), 2007.
[4] Wei Wang, Vikram Srinivasan and Kee-Chaing Chua, "Power Control for Distributed MAC Protocols in Wireless Ad Hoc Networks," IEEE Transactions on Mobile Computing, vol. 7, no. 8, August 2008.
[5] A. Boukerche, K. El-Khatib, L. Xu, and L. Korba, "An efficient secure distributed anonymous routing protocol for mobile and wireless ad hoc networks," Comput. Commun., vol. 28, pp. 1193–1203, Jun. 2005.
[6] I. D. Chakeres and E. M. Belding-Royer, "AODV routing protocol implementation design," in Proc. 24th Int. Conf. Distributed Computing Systems Workshops (ICDCSW'04), 2004, pp. 698–703.
[7] A. H. Altalhi and G. Richard III, "Load-Balanced Routing through Virtual Paths: Highly Adaptive and Efficient Routing Scheme for Ad Hoc Wireless Networks," 23rd IPCCC, 2004.
[8] J. Li, C. Blake, D. S. J. D. Couto, H. I. Lee, and R. Morris, "Capacity of ad hoc wireless networks," in Proc. 7th Annu. Int. Conf. on Mobile Computing and Networking (MobiCom'01), 2001, pp. 61–69.
[9] H. Artail, Haidar Safa, Khaleel Mershad, Zahy Abou-Atme, Nabeel Sulieman, COACS: A Cooperative and Adaptive Caching System for MANETs, IEEE Transactions on Mobile Computing, Vol. 7, No. 8, pp. 961-977, August 2008.
[10] Liangzhong Yin and Guohong Cao, Supporting Cooperative Caching in Ad Hoc Networks, IEEE Infocom 2004.
[11] Liangzhong Yin and Guohong Cao, Supporting Cooperative Caching in Ad Hoc Networks, IEEE Transactions on Mobile Computing, Vol. 5, No. 1, pp. 77-89, 2006.
[12] Narottam Chand, R. C. Joshi, and Manoj Misra, An Efficient Caching Strategy in Mobile Ad Hoc Networks Based on Clusters, IEEE 2006.
[13] Narottam Chand, R. C. Joshi and Manoj Misra, Cooperative Caching in Mobile Ad Hoc Networks Based on Data Utility, International Journal of Mobile Information Systems, Vol. 3, No. 1, pp. 19-37, 2007.
[14] Yi-Wei Ting and Yeim-Kuan Chang, A Novel Cooperative Caching Scheme for Wireless Ad Hoc Networks: GroupCaching, in International Conference on Networking, Architecture, and Storage (NAS 2007), IEEE 2007.
[15] Yu Du and Sandeep K. S. Gupta, COOP – A cooperative caching service in MANETs, in International Conference on Autonomic and Autonomous Systems and International Conference on Networking and Services, IEEE (ICAS/ICNS 2005).
[16] Xin Yao, Evolving Artificial Neural Networks, Proceedings of the IEEE, Vol. 87, No. 9, pp. 1423-1447, 1999.
[17] Naveen Chauhan, Lalit K. Awasthi, Narottam Chand, R. C. Joshi and Manoj Misra, Global Cluster Cooperation Strategy in Mobile Ad Hoc Networks, International Journal on Computer Science and Engineering, Vol. 02, No. 07, pp. 2268-2273, 2010.
[18] M. Dorigo, V. Maniezzo, A. Colorni, Ant system: optimization by a colony of cooperating agents, IEEE Trans. Systems, Man, Cybernet.-Part B 26 (1) (1996) 29-41.
[19] M. Dorigo, M. Birattari and T. Stützle, special section on "Ant Colony Optimization", IEEE Computational Intelligence Magazine, November 2006.
[20] Peter B., Wilhelm R., "Mobile Agents – Basic Concept, Mobility models and the Tracy Toolkit", Morgan Kaufmann Publishers (2005).
[21] Network Simulator 2, http://www.isi.edu/nsnam/ns
[22] A. Jeyasekar, S. V. Kasmir Raja, "A Survey on Cross Layer Approaches in Wireless Networks", International Review on Computers and Software, Vol. 7, N. 4, pp. 1639-1649, July 2012.
[23] Lipardi, M., Mattera, D., Sterle, F., MMSE equalization in presence of transmitter and receiver IQ imbalance, (2007) 2007 International Waveform Diversity and Design Conference, WDD, art. no. 4339402, pp. 165-168.
[24] Mattera, D., Tanda, M., Blind symbol timing and CFO estimation for OFDM/OQAM systems, (2013) IEEE Transactions on Wireless Communications, 12 (1), art. no. 6397549, pp. 268-277.
[25] Mattera, D., Tanda, M., Bellanger, M., Frequency-spreading implementation of OFDM/OQAM systems, (2012) Proceedings of the International Symposium on Wireless Communication Systems, art. no. 6328353, pp. 176-180.
Authors' information
1 Research Scholar & Assistant Professor, School of Information Technology & Science, Dr G R Damodaran College of Science, Coimbatore, India. E-mail: [email protected]
2 Professor & Director, School of Information Technology & Science, Dr G R Damodaran College of Science, Coimbatore, India. E-mail: [email protected]
S. Umamaheswari received her MCA degree from Bharathidasan University, Trichy, India and M.Phil degree from Bharathiar University, Coimbatore, India. She is currently working as Assistant Professor, School of Information Technology and Science, Dr G R Damodaran College of Science, Coimbatore, India. Her current research interests include Web Programming, Cloud Computing, Mobile Databases and Mobile Ad Hoc Networks. She is a member of IEEE. Dr. G. Radhamani is presently working as Professor and Director, School of Information Technology and Science, Dr. G R Damodaran College of Science, affiliated to Bharathiar University, India. Formerly, she worked as Head, Department of IT, IbriCT, Ministry of Manpower, Sultanate of Oman. Prior to that she served as a Research Associate in IIT (India) and as a faculty in Department of Information Technology, Multimedia University, Malaysia. She received her M.Sc and M.Phil degrees from the P.S.G College of Technology, India, Ph.D degree from the Multimedia University, Malaysia. Her research interests are Databases, Computer Security and Mobile Computing. She is a Senior Member of IEEE.
International Review on Computers and Software (I.RE.CO.S.), Vol. 8, N. 2 ISSN 1828-6003 February 2013
An Efficient Provably Secure Certificateless Signcryption without Random Oracles Hua Sun
Abstract – Most existing signcryption schemes that have been proven secure were proposed in the random oracle model; however, the random oracle cannot always be instantiated by concrete constructions in practical applications. By analyzing several certificateless signcryption schemes in the standard model, it was pointed out that none of them is secure. Based on Au's scheme, a new provably secure certificateless signcryption scheme was presented in the standard model by using the bilinear pairing technique of elliptic curves. Finally, it was proved that the scheme satisfies indistinguishability against adaptive chosen ciphertext attacks and existential unforgeability against adaptive chosen message and identity attacks under complexity assumptions such as the decisional bilinear Diffie-Hellman problem. So the scheme is secure and reliable. Copyright © 2013 Praise Worthy Prize S.r.l. - All rights reserved.
Keywords: Certificateless PKC, Signcryption, Provable Security, Without Random Oracles
Nomenclature
CDH: Computational Diffie-Hellman
DBDH: Decisional bilinear Diffie-Hellman
q-ABDHE: q-augmented bilinear Diffie-Hellman exponent
q-SDH: q-strong Diffie-Hellman
CLSC: Certificateless signcryption
I. Introduction
In 1984, the ID-based cryptosystem was proposed by Shamir [1], which was utilized to solve the complex certificate management in traditional public key infrastructure. However, it has an inherent key-escrow problem and cannot in fact achieve non-repudiation. In 2003, [2] proposed the certificateless public key cryptosystem. As the key generation center only produces the partial private key of the user and no certificate is used, it eliminates the inherent problem of ID-based cryptography and has the merits of both. In 1997, the cryptographic primitive of signcryption was first proposed by Zheng [3]. However, signcryption schemes [4]-[5] containing a formal security proof were not presented until several years later. In 2008, the concept of certificateless signcryption was proposed by Barbosa et al. [6], but it needed six pairing computations and was not secure. Subsequently, Aranha et al. [7] gave a signcryption scheme, but without a formal security proof. Another certificateless signcryption scheme was proposed by Wu et al. [8] in the same year; it needed four pairings and was pointed out to be insecure by [9]. In 2008, Selvi et al. [10] gave a multi-recipient signcryption scheme for the first time.

Manuscript received and revised January 2013, accepted February 2013
In 2010, Xie et al. [11] presented a certificateless signcryption scheme that only needs two pairings. However, [12] pointed out that it is not unforgeable against a type I attacker. All of the above signcryption schemes were given in the random oracle model. However, viewing the hash functions in the random oracle model as completely random ones is a very strong demand, and corresponding instantiations cannot always be constructed in practical applications. So it is meaningful to design schemes without random oracles. In 2010, [13] gave a certificateless signcryption scheme in the standard model, but it was pointed out to be insecure by [14]. Later on, two certificateless signcryption schemes were proposed by Xiang [15] and Wang et al. [16] respectively, but they were both insecure. In this paper, we put forward a provably secure certificateless signcryption (CLSC) scheme without random oracles. The paper is organized as follows. Some mathematical preliminaries are given in Section 2. Our proposed certificateless signcryption scheme is presented in Section 3. The security analysis of the scheme is given in Section 4. Finally, we conclude the paper in Section 5.
II. Preliminaries

II.1. Pairings

Let G, G_T be cyclic groups of prime order p and g be a generator of G. A bilinear pairing is a map e: G × G → G_T that satisfies the following properties:
1. Bilinearity: e(g^a, g^b) = e(g, g)^{ab} for all a, b ∈ Z_p.
2. Non-degeneracy: e(g, g) ≠ 1.
3. Computability: it is efficient to compute e(u, v) for all u, v ∈ G.

II.2. Intractability Assumptions

Definition 1 (CDH problem): Given a group G of prime order p and elements g^a, g^b ∈ G, where a, b ∈_R Z_p^*, the CDH problem is to compute g^{ab}.
Definition 2 (DBDH problem): Given a group G of prime order p and elements g^a, g^b, g^c ∈ G, h ∈ G_T, where a, b, c ∈_R Z_p^*, the DBDH problem is to decide whether h = e(g, g)^{abc}.
Definition 3 (Truncated decisional q-ABDHE problem): Given a vector of q+3 elements (g', g'^{a^{q+2}}, g, g^a, g^{a^2}, ..., g^{a^q}) ∈ G^{q+3} and an element Z ∈ G_T, where a ∈_R Z_p^*, the truncated decisional q-ABDHE problem is to decide whether Z = e(g^{a^{q+1}}, g').
Definition 4 (One-generator q-SDH problem): Given a vector of q+1 elements (g, g^x, ..., g^{x^q}) ∈ G^{q+1}, where x ∈_R Z_p^*, the one-generator q-SDH problem is to compute (c, g^{1/(x+c)}) for some c ∈_R Z_p^*.
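As a quick sanity check of the bilinearity property in II.1, the following Python sketch verifies e(g^a, g^b) = e(g, g)^{ab} numerically; it assumes the third-party py_ecc package and uses its asymmetric BN254 pairing (e: G_1 × G_2 → G_T) as a stand-in for the symmetric pairing e: G × G → G_T used in this paper.

```python
# Assumes `pip install py_ecc`; the BN254 (alt_bn128) pairing is used as an
# asymmetric stand-in for the symmetric pairing of the paper.
from py_ecc.bn128 import G1, G2, multiply, pairing, curve_order

a, b = 1234567, 7654321

lhs = pairing(multiply(G2, a), multiply(G1, b))        # e(g2^a, g1^b)
rhs = pairing(G2, G1) ** ((a * b) % curve_order)       # e(g2, g1)^(a*b)

print(lhs == rhs)   # True: bilinearity holds
```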
III. The Proposed CLSC Scheme

In this section, we propose an efficient certificateless signcryption scheme without random oracles, which consists of the following algorithms:
Setup: Let G, G_T be groups of the same prime order p, g be a generator of G, and e: G × G → G_T be the bilinear pairing. A collision-resistant hash function H: {0,1}^* → {0,1}^{n_m} can be used to create message digests of length n_m. The KGC randomly chooses h_1, h_2, h_3 ∈_R G and a vector M̂ = (m_i) of length n_m, where m_i ∈ Z_p^*. The KGC chooses a random number α ∈ Z_p^*, and sets g_1 = g^α, z_0 = e(g, g), z_1 = e(g, h_1). The system parameters are params = (G, G_T, e, g, g_1, h_1, h_2, h_3, H, M̂, z_0, z_1) and the system private key is msk = α.
Partial-Private-Key-Extract: Given an identity information ID to the KGC, the KGC randomly chooses r_{ID} ∈ Z_p and computes d_1 = (h_1 g^{-r_{ID}})^{1/(α - ID)}, d_2 = r_{ID}, then the KGC returns D_{ID} = (d_1, d_2) to the user as his partial private key.
Set-Secret-Value: For an identity ID, the user randomly chooses x_{ID} ∈ Z_p^* as his secret value.
Set-Public-Key: For an identity ID, he computes PK_{ID} = g^{x_{ID}} as his public key.
Set-Private-Key: For an identity ID, he uses SK_{ID} = (s_1, s_2, s_3) = (d_1, d_2, x_{ID}) as his private key.
Signcrypt: Suppose that M ∈ G_T is the message to be signcrypted, the sender identity is ID_S, and the receiver identity is ID_R. The certificateless signcryption can be produced as follows:
1. Let W = H(M) be the bit string of length n_m representing the message M, let M ⊆ {1, 2, ..., n_m} be the set of indices k such that W[k] = 1, where W[k] is the kth bit of W, then compute T = Σ_{i∈M} m_i.
2. The sender randomly chooses a number s ∈ Z_p and, by using his private key SK_{ID_S} = (s_{ID_S,1}, s_{ID_S,2}, s_{ID_S,3}), then computes:
C_1 = g_1^s g^{-s·ID_S}, C_2 = g_1^s g^{-s·ID_R}, C_3 = e(g, g)^s = z_0^s, C_4 = M · e(h_1, g^{x_{ID_R}})^{-s}, C_5 = s_{ID_S,1} · (h_3 h_2^{ID_S})^{s·T}, C_6 = s_{ID_S,2} = r_{ID_S}.
So the certificateless signcryption is C = (C_1, C_2, C_3, C_4, C_5, C_6, T).
Unsigncrypt: Let ID_S be the identity of the signcryption sender and S_{ID_R} be the private key of the signcryption receiver ID_R. When ID_R receives a ciphertext C = (C_1, C_2, C_3, C_4, C_5, C_6, T), he computes as follows:
1. To verify whether C is a valid signcryption, ID_R firstly verifies equation (1):

e(g_1 g^{-ID_S}, C_5) = z_1 · z_0^{-C_6} · e(C_1, (h_3 h_2^{ID_S})^T)   (1)

and then accepts it and outputs True if the equation holds; otherwise, outputs False.
2. The message can be recovered by computing

M = C_4 · [e(C_2, s_{ID_R,1}) · C_3^{s_{ID_R,2}}]^{s_{ID_R,3}}

by using the private key S_{ID_R} of the receiver ID_R.

IV. Analysis of the Proposed CLSC Scheme

IV.1. Correctness

The verification of the signcryption is justified by the following Eqs. (2) and (3):
) ⎛ = e⎜ g g ,( h g hh ) ( ( ) ⎝ = e ( g ,h ) e ( g ,g ) e ⎛⎜⎝ g g ,( h h )( (
e g1 g
− IDS
1
When the adversary AI issues a number of queries, B responses as follows: H queries: When AI queries H ( M ) , let
,C5 =
− IDS
− rIDS 1 / α − IDS
1
− rIDS
1
(
)
IDS 3 2
)⎞=
i∈M
s ⎡ ⎤ C4 ⋅ ⎢e C2 ,sIDR ,1 C3 IDR ,2 ⎥ ⎣ ⎦
sIDR ,3
i
⎟ ⎠
∑ mi
i∈M
∑ m )( ) ⎞
(
)
− sIDS
s 1
⎛ = z1 ⋅ z0−C6 ⋅ e ⎜ C1 , h3 h2IDS ⎝
(
IDS s ⋅ i∈∑M mi 3 2
)⎞=
s ⎤ − s ⎡e g ,h1 ⋅ xIDR ⎞ ⎢ ⎛ ⎥ = M ⋅ e ⎜ h1 ,g ⎟ ⋅⎢ ⎥ ⎝ ⎠ ⎢e g s ,g − rIDR e ( g,g )srIDR ⎥ ⎣ ⎦
)
(
)
(
(
= M ⋅ e h1 ,g
) ⋅ e ( g ,h )
xIDR − s
s
xIDR
1
)
d1 = g
(3)
f q−1 ( a )
adds IDi ,DIDi
=M
2
q
(
its goal is to decide whether Z = e g ,g'
) where a ∈ Z )
a
* p
then
randomly
chooses
u
The
,
u
h3 = g1
. B
xIDi ∈ Z *p then returns
and
computes
PK IDi
and adds
i
i
returns SK IDi .
(
) ( and L3 , let SK ID = ( DID ,xID ) , B and adds ( IDi ,SK ID ) to L4 .
Otherwise, B finds IDi ,DIDi , IDi ,PK IDi ,xIDi ,c
,
in L2
. B simulates
SK IDi
i
i
i
)
then returns
i
Public-Key-Replacement queries: When AI makes a query on public key replacement of input PK 'IDi , if the list
(
)
L3 contains IDi ,PK IDi ,xIDi ,1 , B sets PK IDi = PK 'IDi
polynomial
and c = 0 ; Otherwise, B makes a public-key query on IDi , and then sets PK IDi = PK 'IDi and c = 0 in the list L3 . Signcrypt queries: When AI queries a signcryption on message M , the sender identity IDS and the receiver identity IDR , if IDS = a , B can use a to solve the truncated decisional q-ABDHE problem; Otherwise, B can construct the private key of IDS , then he may produce the corresponding ciphertext C by executing signcrypt algorithm.
parameters are ˆ params = G,GT ,e,g ,g1 ,h1 ,h2 ,h3 ,H ,M ,z0 ,z1 and the
(
public
xIDi
( IDi ,SK ID ) , then B
, z0 = e ( g ,g ) , z1 = e ( g ,h1 ) .
system
key
Private-Key queries: When AI makes a query on the private key of input IDi , if the list L4 contains
f a f q ( x ) ∈ Z p [ x ] of degree q , let g1 = g a , h1 = g ( ) ,
h2 = g
private
B returns PK IDi ; otherwise B
chooses
i
q +1
a
i
( IDi ,PK ID ,xID ,1) to L3 .
the Setup algorithm of the scheme as follows: B firstly chooses a hash function * nm ˆ = ( m ) of length H : {0 ,1} → {0.1} and vector M i nm . B
the
) to L2 .
( IDi ,PK ID ,xID ,c ) , PK IDi = g
,g ,g a ,g a ,...,g a ,Z
d 2 = f q ( a ) as
computes
Public-Key queries: When AI makes a query on the public key of input IDi , if the list L3 contains
Theorem 1. Our CLSC scheme is IND-CCA secure against type I adversary under the assumption that the truncated decisional q-ABDHE problem is intractable. Proof. Let AI be a Type I adversary against the proposed scheme, there will exist an algorithm B that can use AI to solve the truncated decisional q-ABDHE problem. B is given a truncated decisional q-ABDHE ' a q+2
k such
∑ mi and adds ( M ,T ) to L1 .
and
(
=
randomly
'
index
DIDi = ( d1 ,d 2 ) of identity IDi , B then returns DIDi and
IV.2. Security Proofs
( g ,g
of
be f q −1 ( x ) = f q ( x ) − f ( IDi ) / x − IDi , B
=
i
instance
set
Partial-Private-Key queries: When AI makes a query on the partial private key of input IDi , if IDi = a , then B can use a to solve the truncated decisional q-ABDHE problem. Otherwise, let polynomial of degree q − 1 to
=
⎡ ⎛ g1s g − sIDR , ⋅ ⎢⎢e ⎜⎜ − rIDR 1 / α − IDR ⎢⎣ ⎜⎝ h1 g
(
computes T =
⎟ ⎠
xIDR
xIDR
the
i∈M
xIDR
= M ⋅ e h1 ,g
be
that W [ k ] = 1 , where W [ k ] is the k bit of W , B then
⎟ ⎠
⎤ ⎞ ⎟ z srIDR ⎥ ⎟ 0 ⎥ ⎟ ⎥⎦ ⎠
−s
M ⊆ {1, 2 ," ,nm }
(2)
)
system master key is msk = a . Finally, B sends params to AI . Phase 1: We assume that any query of Set-Private-Key and Signcrypt is preceded by H1 and Set-Public-Key query. B maintains four lists L1 , L2 , L3 and L4 , where all of them are initially empty. Copyright © 2013 Praise Worthy Prize S.r.l. - All rights reserved
International Review on Computers and Software, Vol. 8, N. 2
601
Hua Sun
decisional q-ABDHE problem; otherwise, B outputs Z as a random element in GT . Theorem 2. Our certificateless signcryption scheme is IND-CCA secure against type II adversary AII under the assumption that the DBDH problem is intractable. Proof. Let AII be a Type II adversary against the proposed scheme, there will exist an algorithm B that can use AII to solve the DBDH problem. B is given a
Unsigncrypt queries: When AI queries an unsigncryption on the ciphertext C , if IDR = a , B can use a to solve the truncated decisional q-ABDHE problem; Otherwise, B finds private key SK IDR of IDR from L4 , B then executes unsigncrypt algorithm to obtain the message M and returns. Challenge Phase: AI outputs two distinct messages M 0 and M1 of the equal length, a sender identity ID*S
index k such that W T* =
∑ mi
.
fq+2 ( x ) = xq+2
i∈M
chooses γ
= g'
2
(
C3* = Z ⋅ e g' ,∏ i = 0 g q
(
)
)
(
C5* = sID* ,1 g1u g S
uID*S
)
s ⋅T *
ID*R ,3
( )
(
' q +1 Let s = ⎛⎜ log g g ⎞⎟ f q +1 ( a ) , if Z = e g a ,g' ⎝ ⎠ have:
and a receiver identity ID*R . B computes the private key
*
(
−s
x * ⎞ * ⎛ C*4 = M b ⋅ e ⎜ h1 ,g IDR ⎟ , C5* = sID* ,1 g1u g uIDS S ⎝ ⎠
( )
SK ID* of ID*S and randomly chooses b ∈ ( 0,1) .
S
),
C1* = g1s g − sIDS , C*2 = g1s g − sIDR , C3* = e ( g ,g ) *
S
Set W * = H ( M b ) , M* ⊆ {1, 2 ," ,nm } be the set of we
index T* =
so
(
C* = C1* ,C*2 ,C3* ,C*4 ,C5* ,C6* ,T *
s
is
)
s ⋅T *
( )
C1* = g c
(
B outputs Z = e g
,g
B
computes
then construct the challenge ciphertext
γ − ID*S
( )
= g1c g − cIDS , C2* = g c *
(
)
a
( )
C5 = sID* ,1 g c
valid
S
γ − ID*R
= g1c g − cIDR *
c
( v +uID )T * S
*
(
*
= sID* ,1 h3 h2IDS S
)
c⋅T *
C6 = sID* ,2 = rID* S
Let PK ID* = g b , R
C4* = M b ⋅ e ( g ,g )
unsigncryption on C* . Guess Phase: AI outputs a guess b' for b . If b = b' , '
W * [k ] = 1 ,
C3* = e g ,g c = e ( g ,g ) = z0c , C*4 = M b ⋅ h −1
Phase 1, but he can’t query the private key of ID*R and the
a
that
as follows:
signcryption, B then return it to AI . Phase 2: AI may issues a number of queries as in
q +1
∑ mi . B
S
)
such
k i∈M*
C6* = sID* ,2 = f q ID*S = rID* S
)
M 0 and M1 of the equal length, a sender identity ID*S
, C6* = sID* ,2 = f q ID*S = rID* S
, let g1 = g , h1 = g a , h2 = g u ,
and the master key is msk = γ . Finally, B sends params and msk to AII . Phase 1: AII can issue a number of queries as in Theorem 1, but he could not make Partial-Private-Key query and Public-Key-Replacement query. Challenge Phase: AII outputs two distinct message
( )
s
. B then randomly
nm
ˆ ,z ,z params = G,GT ,e,g ,g1 ,h1 ,h2 ,h3 ,H ,M 0 1
f q+2 ( x ) − f q+ 2 ID*R
s * ⎤ ⎡ C*4 = M b / ⎢ e C*2 ,sID* ,1 C3* IDR ,2 ⎥ R ⎣ ⎦
nm
γ
,u,v ∈ Z *p
(
f q +1 ( x ) , where
Fq +1,i ai
.
parameters are:
0 ≤ i ≤ q + 1 . B then construct the challenge ciphertext as follows:
( ) , C*
abc
h3 = g v , z0 = e ( g ,g ) , z1 = e ( g ,h1 ) , the system public
,
( )
f q+2 ( x ) − f q+ 2 ID*S
where a,b,c ∈ Z *P , its
ˆ = ( m ) of length and vector M i
f q +1 ( x ) = f q + 2 ( x ) − f ID*S / x − ID*S , Fq +1,i be the
C1* = g'
)
*
*
coefficients of xi in polynomial
,g b ,g c ,h
B firstly chooses a hash function H : {0 ,1} → {0.1}
[ k ] = 1 , B then computes
Let
a
goal is to decide whether h = e ( g ,g )
Set W * = H ( M b ) , M* ⊆ {1, 2 ," ,nm } be the set of *
(g
DBDH instance
and a receiver identity ID*R , if ID*S = a , B can solve the truncated decisional q-ABDHE problem; otherwise, B randomly chooses b ∈ ( 0 ,1) .
− abc
S
if h = e ( g ,g )
(
= M b ⋅ e h1 ,g b
(
C* = C1* ,C*2 ,C3* ,C*4 ,C5* ,C6* ,T *
) as the solution of truncated
)
abc
−c
,
we ,
have so
) is a valid signcryption,
B then return it to AII .
(
If b = b' , B outputs h = e ( g ,g )
abc
If A−1 = 0 , then B outputs FALL and aborts the simulation; otherwise, we can get:
as the solution of
DBDH problem; otherwise, B outputs h as a random element in GT . Theorem 3. Our certificateless signcryption scheme is UF-CMA secure against type I adversary AI under the assumption that the one-generator q-SDH problem is intractable. Proof. Let AI be a Type I adversary against the proposed scheme, there will exist an algorithm B that can use AI to solve the one-generator q-SDH problem. B is given a one-generator q-SDH instance
(
2
g ,g a ,g a ,...,g a
( c,g
1 / a+c
)
q
)
C1* = g
(
C5* = h1 g −C6
g1 /
*
[1]
[2]
is the
[3]
[4]
)
signcrypt query on M * ,ID*S ,ID*R . ,
)
s ⋅T *
=
⎡ C5* =⎢ ⎢ * uT* ∑ q −1 Ak a k k =0 ⎣⎢ C1 ⋅ g
1 / A−1
⎤ ⎥ ⎥ ⎦⎥
Conclusion
References
must never query the private key of ID*S and make a f q −1 ( x ) = f q ( x ) − C6* / x − ID*S
(
*
⋅ h3 h2IDS
In this paper, we presented an efficient certificateless signcryption scheme that is provably secure without random oracles. It only needs four pairings computation in this scheme. The reductionist security proofs of IND-CCA of the scheme rely on the truncated decisional q-ABDHE and DBDH intractability assumptions, while the security proofs of UF-CMA rely on the one-generator q-SDH and CDH intractability assumptions. Furthermore, we will try to propose a more efficient certificateless signcryption scheme with less pairings computation without random oracles.
forgery certificateless signcryption on message M , a sender identity ID*S and a receiver identity ID*R . AI
Let
a − ID*S
V.
*
(
1 / a − ID*S
Theorem 4. Our certificateless signcryption scheme is UF-CMA secure against type II adversary AII under the assumption that the CDH problem is intractable. Proof. The proof is quite straightforward, so we omit it here.
parameters as in Theorem 1, the only difference is that h2 = g −u , and sends params to AI . Queries Phase: When the adversary AI issues a number of queries, B responses as follows: H1 queries: B responses as in Theorem 1. Partial-Private-Key queries: When AI makes a query on the partial private key of input IDi , if IDi = a , then B can use a to solve the one-generator q-SDH problem; Otherwise, B responses as in Theorem 1. Public-Key queries: B responses as in Phase 1 of Theorem 1. Private-Key queries: B responses as in Phase 1 of Theorem 1. Public-Key-Replacement queries: B responses as in Phase 1 of Theorem 1. Signcrypt queries: When AI queries a signcryption on message M , the sender identity IDS and the receiver identity IDR , if IDS = a , then B can use a to solve the one-generator q-SDH problem; Otherwise, B responses as in Phase 1 of Theorem 1. Unsigncrypt queries: When AI makes an unsigncryption query on ciphertext C , if IDR = a , then B can use a to solve the one-generator q-SDH problem; Otherwise, B responses as in Phase 1 of Theorem 1. Forgery Phase: Finally, AI outputs a tuple
) which means C
)
)
so the solution of the problem one-generator q-SDH is:
, c ∈ Z *p . B then construct the system
(
*
(
s a − ID*S
* * f a −C* / a − ID*S =g ( ) 6 ⋅ g u( a − IDS )s⋅T
where a ∈ Z *p , its goal is to compute
C* = C1* ,C*2 ,C3* ,C*4 ,C5* ,C6* ,T *
)
have f q −1 ( a ) = ∑ qk −=10 Ak a k + A−1 / a − ID*S .
Phase 2: When AII makes a number of queries, B responses as in Theorem 1. Guess Phase: AII outputs a guess b' for b .
we
[1] A. Shamir, Identity-based cryptosystems and signature schemes, Proceedings of CRYPTO 84 on Advances in Cryptology (Page: 47 Year of Publication: 1985 ISBN: 0-387-15658-5)
[2] S. S. Al-Riyami, K.G. Paterson, Certificateless public key cryptography, Proceedings of the 9th International Conference on the Theory and Application of Cryptology and Information Security (Page: 452 Year of Publication: 2003 ISBN: 978-3-540-20592-0)
[3] Y. Zheng, Digital signcryption or how to achieve cost(signature & encryption)
A. Shamir, Identity-based cryptosystems and signature schemes, Proceedings of CRYPTO 84 on Advances in Cryptology (Page: 47 Year of Publication: 1985 ISBN: 0-387-15658-5) S. S. Al-Riyami, K.G. Paterson, Certificateless public key cryptography, Proceedings of the 9th International Conference on the Theory and Application of Cryptology and Information Security (Page: 452 Year of Publication: 2003 ISBN: 978-3-540-20592-0) Y. Zheng, Digital signcryption or how to achieve cost(signature & encryption)