A Big Data Architecture for Spectrum Monitoring in Cognitive Radio Applications

Giuseppe Baruffa · Mauro Femminella · Matteo Pergolesi · Gianluca Reali
Abstract Cognitive radio has emerged as a promising candidate solution to improve spectrum utilization in next generation wireless networks. A crucial requirement for future cognitive radio networks is wideband spectrum sensing, which allows detecting spectral opportunities across a wide frequency range. At the same time, the Internet of Things concept has revolutionized the usage of sensors and of the data they produce. Connecting sensors to a cloud computing infrastructure enables the so-called paradigm of Sensing as a Service (S2aaS). In this paper, we present an S2aaS architecture that offers Spectrum Sensing as a Service (S3aaS), by exploiting the flexibility of software defined radio. We believe that S3aaS is a crucial step to simplify the implementation of spectrum sensing in cognitive radio. We illustrate the system components for S3aaS, highlighting the system design choices, especially for the management and processing of the large amount of data coming from the spectrum sensors. We analyze the connectivity requirements between the sensors and the processing platform, and evaluate the trade-offs between required bandwidth and target service delay. Finally, we show the implementation of a proof-of-concept prototype, used for assessing the effectiveness of the whole system in operation with respect to a legacy processing architecture.

Keywords spectrum sensing · Big Data · NoSQL · MapReduce · performance evaluation
1 Introduction

Radio spectrum is a valuable and strictly regulated resource for wireless communications. With the proliferation of wireless services, the demand for radio spectrum is constantly increasing, leading to contention for spectrum resources. On the other hand, since the utilization of the radio spectrum at a given time and location is often low, spectrum re-use is extremely useful. Cognitive radio [1, 2] is a promising solution to face this issue in next generation wireless networks, by exploiting transmission opportunities in multiple dimensions [3].

Giuseppe Baruffa, Mauro Femminella, Matteo Pergolesi, and Gianluca Reali
Department of Engineering, University of Perugia, via G. Duranti 93, 06125 Perugia, Italy
E-mail: [email protected], [email protected], [email protected], [email protected]
Recently, standardization activities have concerned the re-use of spectrum unused or underutilized by digital TV signals, the so-called TV white space, by exploiting it to extend the coverage of WiFi signals with the IEEE 802.11af standard [4]. In this case, there are opportunities to be exploited not only in the frequency dimension, but also in the spatial one; moreover, the standard hinges upon a coordinated access to the spectrum managed by a spatial geolocation database. A fundamental task in the development of cognitive radio is spectrum sensing, which allows identifying usage patterns in the various dimensions (time, frequency, space, angle, etc.), useful to exploit free spectrum. Since continuous spectrum sensing is a very demanding task for a wireless device, we propose the introduction of a monitoring platform that makes spectrum sensing available as a service. This service consists of a number of geographically distributed spectrum sensors, implemented through software defined radio (SDR), which carry out the sensing operation and report results to a storage and computing platform, whose task is to disseminate the available data to any requesting device. Although this service cannot completely eliminate the need for spectrum sensing by the wireless terminals, it can be extremely effective for instructing mobile devices to carry out the sensing only where (and when) unused radio bands are expected to be found, thus reducing energy consumption and extending stand-by times.

The above description brings to mind the emerging Internet of Things (IoT) paradigm [5]. This paradigm is a huge step in the evolution of Internet applications [6]. In fact, the network role has changed, from merely connecting computers and people to an entity that comprises smart objects that can interact with the surrounding environment [7]. The most representative IoT components include (networks of) sensors. In fact, the low cost of electronic devices and the development of wireless networks have favored the growth of connected devices. Smart objects autonomously collect and send information to be stored for further analysis, thus allowing a deeper understanding of the monitored environments [8]. IoT data are typically classified as “Big Data”, characterized by high Volume, Velocity, and Variety (3V) [9,10]. These features demand a novel computing model that enables fast resource scalability and reliability. Cloud computing and Big Data analytical tools meet these constraints by offering distributed storage and computing resources with ubiquitous network access [11]. In addition to the capabilities of transmitting, storing, and processing data, the new technologies are expected to provide easy access to data by hiding complex technicalities. In this way, developers should benefit from such support for quickly creating new applications, while other users, such as data analysts, should at least be allowed to visualize the data.

In this framework, our contribution is the design and performance assessment of a storage and computing platform offering Sensing as a Service (S2aaS) [12, 13] and, more specifically, Spectrum Sensing as a Service (S3aaS) [14]. We propose to adopt Apache Flink [15], which allows using a single platform for running both batch and streaming-oriented processing [16]. In this way, we can process data for any later usage as soon as they enter the system, so as to decrease the time needed for executing batch processing of large amounts of raw data.
As for the database, we selected MongoDB [17] for its widespread acceptance, excellent scalability, and speed [18]. Finally, data ingestion is managed through Apache Kafka [19], a distributed messaging queue that makes it easy to change or integrate different types of data consumers [20]. Our platform can integrate not only heterogeneous sensors, with different sensing capabilities, but also opportunistic sensors, implemented, for instance, by means of smartphones and tablets, with minimal sensing features. A potential application of this service is clearly the autonomic management, deployment, and personalization of wireless networks, as shown in [21]. With respect to our previous work [22], in this paper
we present a full-fledged architecture with a relevant testbed implementation, together with a performance comparison between the classic management of sensor data collections and Big Data alternatives. The rest of this paper is organized as follows. In Section 2 we briefly present the key concepts and the previous work on the subject. In Section 3 we show the system architecture, focusing on the design and interconnection of sensors, the implementation of the Big Data analysis platform, and the presentation of a specific tool enabling visual analysis of spectrum data. Section 4 presents the performance evaluation, which illustrates the trade-off between required bandwidth and service latency, and the gain of using a Big Data analysis platform over a legacy one, evaluated experimentally by means of a prototype system implementation. Finally, Section 5 discusses conclusions and future work.
2 Background and related work

An extensive literature on cognitive radio exists. The state of the art can be found in some interesting surveys, such as [1,2]. A significant literature also exists on spectrum sensing, specifically on sensing algorithms [3,23]. The typical dimensions of the sensing operations are frequency, time, geographical space, code, and angle. Spectrum sensing can be conducted either non-cooperatively (individually), when each user performs radio detection and takes decisions, or cooperatively, in which a group of users cooperatively senses the spectrum to detect the presence of signals, e.g., by using consensus algorithms [24]. Further sources of data, useful to extract insights for predicting wireless network load, are the social networks, such as Twitter [25].

Some proposals on spectrum sensing leverage the IoT paradigm to make use of common mobile devices [26,27]. The goal is to achieve an almost full geographical coverage and have real-time data while guaranteeing good sensing performance by using low cost mobile devices. Great attention has been devoted to the sensor implementation by means of SDR¹ applied to portable dongles. However, the selection of appropriate storage and computing technologies has not yet received significant attention. In fact, legacy SQL databases and a traditional computing scheme cannot cope with the fast growing number of sensors and data. The objectives of the Open IoT initiative [29], started in 2012, include the integration of cloud computing services and IoT. In this paper we follow this approach. In fact, when focusing on spectrum sensors, we consider devices able to produce high bit rates, thus we need suitable technologies to manage such amounts of data in a short time. First, we make use of pre-processing functions to clean up and compress raw data before sending them through the network. We realistically assume that the network is shared with other users and services and that congestion should be avoided. Second, we need suitable data structures, load balancing systems, and scalable storage to collect data and to store them so that users can access them with tolerable waiting times.

The requirement for fast scalability can be met thanks to the cloud computing paradigm: by leveraging virtualization technologies, cloud providers can offer large amounts of computing, networking, and storage resources to multiple users. ICT companies offer their commercial cloud services (like Amazon Web Services [30] and Google Cloud Platform [31]),
but it is also possible to realize private cloud infrastructures by using an open source cloud management system such as OpenStack [32]. While commercial services allow for short deployment times and convenient subscription fees, private clouds can also provide better privacy management. In order to efficiently store data for future analysis, NoSQL databases are the mandatory choice, since they are designed to be distributed over large clusters of servers. Furthermore, many of them provide a non-fixed schema data model that allows great flexibility.

¹ SDR can be considered among the enabling technologies that allow dynamic reconfiguration and quick adaptation to the offered communication opportunities, since physical layer (PHY) processing is carried out in software on general purpose processors, which can be reconfigured in real time and continuously [28].
3 System architecture
Fig. 1 System architecture of the FPMP and FMP configurations.
Our prototype system architecture is depicted in Fig. 1. On the left-hand side of the figure there are the spectrum sensors, each including the sensing device and a software agent running on a local computer (even one with limited computing capabilities, such as a Raspberry Pi or similar device), located near the sensor. The agent realizes an interface between the sensors and our system. It collects data from the sensor through a TCP socket, serializes them with Apache Avro [33,34], and sends them to a queue realized by using Apache Kafka. This configuration enables our system to support the ingestion of large data sets coming from multiple sensors. In fact, Apache Kafka can be distributed over multiple machines, allowing for horizontal scalability. Data stored in the queue can be consumed by multiple applications at the same time. In our architecture, a software agent receives data from Kafka and stores them in a database. This agent can be extended to support multiple databases. In the present architecture we support MongoDB [18], a popular NoSQL, document-oriented database. We selected this database since it does not have a fixed data schema, which allows for improved flexibility with respect to relational databases. In addition, it natively supports a binary data type, which is suitable to store sensor data. Additional details on this topic are provided in Section 3.1.

We selected the Apache Flink framework for analyzing sensor data, since it supports both batch and streaming jobs. While Flink processes the records retrieved from the database, it can store results in the same database for any future (re-)use. The specific database can be abstracted to Flink by using PrestoDB, which allows issuing SQL-like queries to numerous
NoSQL databases. The right-hand side of Fig. 1 shows how users can interact with the database via a PHP web interface. They can explore data by directly issuing queries to PrestoDB or by interacting with Flink for more complex analyses. Note that the system is independent of the nature of the database used to store data. A change of database would have no impact on the operation of the sensors and their agents, due to the presence of Kafka, while Flink and the web interface would require only minor configuration changes to interact with PrestoDB. Furthermore, the usage of PrestoDB allows users to explore sensor data with the familiar and highly expressive SQL language. We denote this configuration with the acronym FPMP (Flink, PrestoDB, MongoDB, PHP). To evaluate performance, we also explored an alternative configuration where Flink reads data directly from MongoDB, bypassing PrestoDB; we identify it with the acronym FMP (Flink, MongoDB, PHP). In any case, PrestoDB remains in the architecture as an interactive SQL interface for users to retrieve raw data. Subsection 3.1 illustrates the operation and characteristics of the sensors, with a particular focus on data rates, sensing delay, and the resulting trade-offs. Subsection 3.2 describes the application that produces spectrum scan images.
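To make the ingestion path concrete, the following Python sketch illustrates how a sensor agent could serialize a scan record with Avro and publish it to Kafka. It is only a minimal sketch: the broker address, topic name, and the reduced set of schema fields are hypothetical placeholders rather than the exact ones used in our prototype (the complete record layout is given in Table 2).

```python
import io
import time

from fastavro import parse_schema, schemaless_writer
from kafka import KafkaProducer  # pip install kafka-python fastavro

# Hypothetical, reduced Avro schema; the complete field list is reported in Table 2.
SCAN_SCHEMA = parse_schema({
    "type": "record",
    "name": "SpectrumScan",
    "fields": [
        {"name": "dvid", "type": "string"},         # device ID
        {"name": "utc_time", "type": "double"},     # UTC timestamp of the scan
        {"name": "start_freq", "type": "double"},   # start frequency of the sub-band (Hz)
        {"name": "sample_rate", "type": "double"},  # digitization bandwidth fs (Hz)
        {"name": "data", "type": "bytes"},          # averaged energy values (NFFT * Pb bytes)
    ],
})

def serialize(record: dict) -> bytes:
    """Encode one scan record with the schemaless Avro binary encoding."""
    buf = io.BytesIO()
    schemaless_writer(buf, SCAN_SCHEMA, record)
    return buf.getvalue()

# Broker address and topic name are placeholders, not the prototype configuration.
producer = KafkaProducer(bootstrap_servers="kafka:9092", value_serializer=serialize)

def publish_scan(dvid: str, start_freq: float, sample_rate: float, energy_blob: bytes):
    """Push one averaged scan (metadata + BLOB) to the ingestion queue."""
    record = {
        "dvid": dvid,
        "utc_time": time.time(),
        "start_freq": start_freq,
        "sample_rate": sample_rate,
        "data": energy_blob,
    }
    producer.send("spectrum-scans", record)
    producer.flush()
```

Decoupling the agents from the storage layer in this way is what allows the database to be replaced without touching the sensors: any consumer that understands the Avro schema can subscribe to the same topic.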
3.1 Spectrum Sensors

The spectrum sensing operation is performed by SDR sensors, which are connected to the core platform through a dedicated agent. We have designed the platform to be general enough to integrate different SDR sensors with heterogeneous features. This is achieved by a suitable design of both the agent and the component of the core system (Kafka) that provides the API for transmitting spectrum sensing data. The agent is also equipped with a global navigation satellite system (GNSS) sensor, which allows it to accurately synchronize and geolocate data coming from spatially and temporally heterogeneous sources. In this regard, Table 1 reports the functional features of three different SDR spectrum sensors.

Table 1 Spectrum sensing features of three SDR devices.
Parameter               Realtek SDR   USRP N210   USRP X300
Digitization bw, fs     3 MHz         25 MHz      60 MHz
FFT size, NFFT          512           4096        8192
Data averaging, Navg    100           100         100
Scan bandwidth, BS      1.68 GHz      4 GHz       4 GHz
Data precision, Pb      2 B           4 B         4 B
Sensor data rate, RS    48 Mb/s       800 Mb/s    1.92 Gb/s
Freq. resolution, ∆f    5.9 kHz       6.1 kHz     7.3 kHz
Time resolution, T      17.1 ms       16.3 ms     13.7 ms
Whole scan time, D      9.5 s         2.62 s      0.91 s
Agent data rate, RA     509 kb/s      8 Mb/s      19.2 Mb/s
Data size per hour      223.5 MiB     3.45 GiB    8.26 GiB
The first device, a Realtek RTL2832U chipset-based SDR, is a low cost USB dongle that needs to be connected to a PC or a smartphone/tablet (the agent) to send data to the IoT platform. Its maximum digitization frequency, fs , is 3 MHz, which generates a gross data rate from the sensor to the agent through the USB link equal to 48 Mb/s. It is obtained by multiplying fs by the number of bits used to encode I & Q samples (16 bits). Since this rate
is modest in comparison with the maximum rate supported by USB 2.0 and beyond, a single agent could manage a few SDR sensors, depending on its processing capabilities. The second and third devices are high-end SDR sensors. In fact, their typical digitization frequency is much higher (25 and 60 MHz, respectively), which translates into very high data rates (800 Mb/s and 1.92 Gb/s, respectively). Their data can be streamed over an IP network by using the UDP transport protocol, which adds a negligible bandwidth overhead (just 8 B per packet).

While it could be possible to interconnect the high-end SDR sensors directly to the core system, their high data rates make this quite difficult. In fact, unlike most common IoT scenarios, where sensors are power, CPU, and bandwidth constrained, in this case they are connected to the power grid and the produced data flow is significant. For this reason, they would need to be directly interconnected to the core system by means of a high-speed LAN, namely a Gigabit Ethernet for the USRP N210 and a 10 Gigabit Ethernet for the USRP X300. In addition, the high data rates would require the SDR sensors to be very close to the location where the core system runs, in order to interconnect them by means of such a high-speed LAN; otherwise, it would be necessary to deploy at least a 10 Gigabit WAN link for interconnecting each sensor to the core system. Finally, with all these high-bandwidth sources, it is necessary to size the bandwidth of the core system so that it does not become the system bottleneck. This means that the ingress connection of the core system has to offer a net bandwidth higher than the sum of the throughput produced by these high-end SDR sensors, which limits the number of supported sensors and their placement. A possible workaround could be to implement the core system in a distributed fashion, in order to benefit from multiple high speed ingress connections, with portions of the core system always very close to the SDR sensors. However, even if this solution could be implemented by means of a distributed, virtual data center, it would require very high speed connections between its distributed instances, with some known networking challenges when running large scale distributed data processing (e.g., see [35, 36]). In addition, it would still pose significant limitations on the placement options of the sensors. Since these requirements are excessive for a platform designed to manage a large number of geographically distributed spectrum sensors, they call for the use of an intermediate agent also with high-end SDR sensors.

The role of the agent is twofold. First, it decouples the binary samples captured by the SDR sensors from the core system, also adding a high precision timestamp; this allows transparently integrating sensors with heterogeneous features. The second function is the pre-processing of the received data, which allows reducing the data rate RA towards the core system (see Table 1). The operation performed by the agent can be easily explained by inspecting Fig. 2. The spectrum to be analyzed is in the range [BS,min, BS,max], with BS = BS,max − BS,min. At each time step, an SDR sensor senses a bandwidth fs, which is at most the value indicated in Table 1 and depends on the sensor type.
The agent performs a fast Fourier transform (FFT) of these samples with NFFT points [26,27,37], which produces NFFT frequency samples, representing the received amplitude spectrum over sub-channels of size ∆f = fs/NFFT. Then, the energy on each sub-channel is calculated by squaring the magnitude of the corresponding complex frequency bin. However, since the results of a single acquisition are generally not statistically significant, we consider Navg consecutive FFT samples at the same frequency, which are averaged by the agent. Each acquisition is carried out upon a request from the agent, with an optional and customizable idle time, equal to Tidle, between two consecutive requests. This idle time can be used for different purposes. For instance, if the agent is implemented in a battery-powered mobile device, as in [26,27], relaxing the rate of acquisition may extend the device up-time or simply decrease
Fig. 2 Visualization of the spectrum sensing operation.
the wireless data rate. Additionally, this idle time can also be used to lock the SDR sensor to a new carrier frequency, before the new acquisition time slot takes place. Since the time allocated to perform an FFT is equal to ∆t = NFFT / fs, the time resolution of the scan of a sub-band is equal to

T = Navg Tscan = Navg (NFFT / fs + Tidle).   (1)

Once a portion of the spectrum is scanned (horizontal shift in Fig. 2), the system starts analyzing the adjacent one (vertical shift in Fig. 2). This operation is performed BS / fs times, thus the total scan time is equal to

D = T BS / fs = Navg (NFFT / fs + Tidle) BS / fs.   (2)

The data rate produced by the agent is denoted RA in Table 1, with RA = 8 NFFT Pb / T bit/s and Tidle = 0, and evaluated for each device type. In fact, the sensor data are organized as binary large objects (BLOB), which some database engines adopt for an easy storage of medium-large binary data chunks, such as images, video/audio clips, programs, etc. In our case, each BLOB field stores the uncompressed raw energy values produced after averaging the FFTs, in machine-endian format, as a consecutive array of NFFT elements, each one of Pb bytes. This average is calculated over Navg FFTs. Table 2 reports the metadata that are associated with each transmission of the agent, which occurs every T seconds and consists of 120 additional bytes. Thus, for each sensor, the net data rate transmitted by the agent is equal to

RT = 8 (120 + NFFT Pb) / T = 8 (120 + NFFT Pb) / (Navg (NFFT / fs + Tidle)).   (3)
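As an illustration of the pre-processing step described above, the following Python/NumPy sketch computes the averaged energy spectrum for one sub-band. It is a minimal sketch of the idea only: windowing, calibration, quantization to Pb bytes, and the actual SDR driver calls are omitted, and the helper name get_iq_block is a hypothetical placeholder for the sensor read-out.

```python
import numpy as np

def averaged_energy_spectrum(get_iq_block, n_fft: int, n_avg: int) -> np.ndarray:
    """Average the energy of n_avg consecutive FFTs of n_fft I/Q samples each.

    get_iq_block(n) is a hypothetical callable returning n complex baseband
    samples from the SDR; the result is the NFFT-long array that fills the
    BLOB of Table 2 (here kept in floating point, before quantization).
    """
    acc = np.zeros(n_fft)
    for _ in range(n_avg):
        iq = get_iq_block(n_fft)          # one acquisition of NFFT samples
        bins = np.fft.fft(iq, n=n_fft)    # spectrum over sub-channels of size fs/NFFT
        acc += np.abs(bins) ** 2          # energy on each sub-channel
    return acc / n_avg                    # average over Navg acquisitions

# Example with a synthetic source (white noise) standing in for the SDR.
rng = np.random.default_rng(0)
fake_source = lambda n: rng.normal(size=n) + 1j * rng.normal(size=n)
spectrum = averaged_energy_spectrum(fake_source, n_fft=512, n_avg=100)
```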
Clearly, in order to reduce the acquisition delay D, it is necessary to deploy n SDR sensors, which can be used to parallelize the acquisition phase, thus scaling the acquisition period to D/n, but also the data rate to n RT. However, increasing n also implies increasing the number of agents. In fact, the number of SDR sensors attached to each agent cannot be unbounded, since it depends not only on the processing and networking capabilities of the agent, but also on the sensor type and the processing and network load generated by the sensor data. The value of Tidle can be used to trade off RT for D, as will be shown in Section 4.
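The following Python sketch simply encodes equations (1)–(3), so that the trade-off between RT and D can be explored numerically; the example values reuse the Realtek SDR parameters of Table 1 and are meant only as an order-of-magnitude check, not as a reproduction of the published figures.

```python
def scan_metrics(fs: float, n_fft: int, n_avg: int, p_b: int,
                 b_s: float, t_idle: float = 0.0):
    """Return (T, D, RT) from equations (1)-(3) for a single sensor.

    fs: digitization bandwidth (Hz), n_fft: FFT size, n_avg: number of averages,
    p_b: data precision (bytes), b_s: scan bandwidth (Hz), t_idle: idle time (s).
    """
    t = n_avg * (n_fft / fs + t_idle)      # time resolution T, eq. (1)
    d = t * b_s / fs                       # whole scan time D, eq. (2)
    r_t = 8 * (120 + n_fft * p_b) / t      # agent net data rate RT (bit/s), eq. (3)
    return t, d, r_t

# Realtek SDR parameters from Table 1: fs = 3 MHz, NFFT = 512, Navg = 100, Pb = 2 B, BS = 1.68 GHz.
t, d, r_t = scan_metrics(fs=3e6, n_fft=512, n_avg=100, p_b=2, b_s=1.68e9)
print(f"T = {t * 1e3:.1f} ms, D = {d:.1f} s, RT = {r_t / 1e3:.0f} kb/s")
# With n sensors in parallel, the delay scales to D/n and the aggregate rate to n * RT.
```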
Table 2 Schema of data produced by the sensor agents.
Field name    Type and size      Description
dvid          string, 36 B       Device ID
oid           int, 4 B           Order ID
utc_time      double, 8 B        UTC time stamp
start_freq    double, 8 B        Start frequency
sample_rate   double, 8 B        Sampling rate fs
fft_size      int, 4 B           Analysis FFT size NFFT
n_avg         int, 4 B           Number of averages Navg
prec          int, 4 B           Data precision Pb
gain          double, 8 B        Device RX gain
f_off         double, 8 B        Device frequency offset
format        int, 4 B           Data storage format
lat           double, 8 B        Device position latitude
lon           double, 8 B        Device position longitude
elv           double, 8 B        Device position elevation
data          blob, NFFT Pb B    Energy scan data
Total         120 B + blob size  Metadata + data size
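For reference, the record of Table 2 maps naturally onto an Avro schema. The sketch below is an illustrative declaration only (the schema actually used by our agents may differ in naming and options), written as a Python dictionary ready to be passed to an Avro library such as fastavro; Avro has no fixed-size string or integer variants, so the byte sizes of Table 2 are enforced by convention, not by the schema.

```python
# Illustrative Avro schema mirroring Table 2 (field names and types taken from the table).
SENSOR_RECORD_SCHEMA = {
    "type": "record",
    "name": "SensorRecord",
    "fields": [
        {"name": "dvid",        "type": "string"},  # Device ID (36 B)
        {"name": "oid",         "type": "int"},     # Order ID
        {"name": "utc_time",    "type": "double"},  # UTC time stamp
        {"name": "start_freq",  "type": "double"},  # Start frequency
        {"name": "sample_rate", "type": "double"},  # Sampling rate fs
        {"name": "fft_size",    "type": "int"},     # Analysis FFT size NFFT
        {"name": "n_avg",       "type": "int"},     # Number of averages Navg
        {"name": "prec",        "type": "int"},     # Data precision Pb
        {"name": "gain",        "type": "double"},  # Device RX gain
        {"name": "f_off",       "type": "double"},  # Device frequency offset
        {"name": "format",      "type": "int"},     # Data storage format
        {"name": "lat",         "type": "double"},  # Device position latitude
        {"name": "lon",         "type": "double"},  # Device position longitude
        {"name": "elv",         "type": "double"},  # Device position elevation
        {"name": "data",        "type": "bytes"},   # Energy scan data (NFFT * Pb bytes)
    ],
}
```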
3.2 Radio spectrum visualization application

The web interface can produce pictures representing the radio spectrum, to allow users to explore and/or visualize sensor data. An example is shown in Fig. 5. The abscissa reports the acquisition time, while the ordinate axis reports the scanned frequency. The image is built up in two steps. In the first step, Apache Flink performs a MapReduce [38] job on raw data; the job phases are shown in Fig. 3. The Map phase receives records from sensors containing basic data (timestamp, starting frequency of the considered band, and the relevant estimated power), and maps each record to one pixel of the image. Multiple records can be associated with the same pixel. The result of this phase is a dataset containing tuples with the coordinates of a pixel and the power reported by the sensor record. The tuples are then grouped by pixel coordinates and, in the Reduce phase, we compute the average of all the power values associated with each pixel. The result of the Reduce phase is a set of tuples, where each tuple represents exactly one pixel with its associated power value. In the second step, the web interface receives this matrix and produces the final image by associating a color with each power value.
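The sketch below reproduces the logic of the Map, GroupBy, and Reduce steps in plain Python; our prototype implements them as an Apache Flink job, so this is only an illustration of the algorithm, and the record layout and pixel-mapping parameters are simplified assumptions.

```python
from collections import defaultdict

def map_to_pixel(record, t0, f0, dt_per_px, df_per_px):
    """Map phase: one sensor record -> ((x, y) pixel coordinates, power)."""
    x = int((record["utc_time"] - t0) / dt_per_px)     # time axis (abscissa)
    y = int((record["start_freq"] - f0) / df_per_px)   # frequency axis (ordinate)
    return (x, y), record["power"]

def pixel_averages(records, t0, f0, dt_per_px, df_per_px):
    """GroupBy pixel and Reduce: average all power values falling on each pixel."""
    acc = defaultdict(lambda: [0.0, 0])                 # pixel -> [power sum, count]
    for rec in records:                                 # Map phase
        pixel, power = map_to_pixel(rec, t0, f0, dt_per_px, df_per_px)
        acc[pixel][0] += power
        acc[pixel][1] += 1
    # Output: exactly one tuple per pixel, carrying its averaged power value.
    return {pixel: s / n for pixel, (s, n) in acc.items()}
```

The web interface then only has to color-map the returned dictionary to obtain the heat map of Fig. 5.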
Fig. 3 The MapReduce job.
4 Performance evaluation

In this section, we first illustrate the performance evaluation of the service in terms of bandwidth and delay. In particular, we discuss the trade-off between limiting the acquisition delay and maintaining the required bandwidth reasonably low. Then, we present our initial prototype implementation, and evaluate its gain with respect to a classic LAMP (Linux, Apache, MySQL, PHP) service implementation.
4.1 Service characteristics

We have considered two different scenarios, one based on the Realtek SDR and the other based on the USRP N210 (see Table 1). In the first case, we considered a sampling frequency of fs = 1 MHz, whereas in the second fs = 25 MHz; in both cases, each acquisition is analyzed by an FFT with NFFT = 512 points. The idle time Tidle is set equal to either 0 or 1 ms. The number of averages Navg varies between 10 and 1000. Fig. 4 shows four sub-figures, reporting the performance of the above configurations in log-log scale. For each sub-figure,
the abscissa reports Navg, the left ordinate reports the throughput RT in Mb/s (black curve), and the right ordinate reports the acquisition delay D in seconds (red curve). The ordinate quantities are calculated for n = 100 sensors, whereas (2) and (3) refer to a single sensor. Thus, assuming that the sensors are attached to different agents, so as to avoid scalability issues at the agent level, we can conclude that reasonable values are Navg ≈ 100 and Tidle = 0 ms for the case of the Realtek SDR, and Navg ≈ 1000 and Tidle = 1 ms for the USRP N210. In fact, in both cases we would obtain reasonable input rates, on the order of a few Mb/s, and tolerable delays, on the order of a few seconds. A further increase of the number of supported sensors (e.g., by two orders of magnitude, to monitor different areas) would not imply scalability issues for the platform, since the aggregate input rate would be on the order of a few thousand Mb/s. In particular, when using the high-end SDR sensor, not only is the scalability improved, due to the high sampling frequency of the device, but also the reliability of the measurements, since we can use higher values of Navg with even lower acquisition delays.
Fig. 4 Throughput and acquisition delay as a function of FFT data averaging (Navg), for different values of fs and Tidle, with n = 100 sensors. Sub-figures: a) fs = 1 Msps, Tidle = 0 s; b) fs = 1 Msps, Tidle = 0.001 s; c) fs = 25 Msps, Tidle = 0 s; d) fs = 25 Msps, Tidle = 0.001 s. Each sub-figure reports the throughput RT (Mb/s, left axis) and the delay D (s, right axis).
4.2 Prototype implementation

Here we present a proof-of-concept prototype of the proposed architecture. All the software components in Fig. 1 have been deployed on virtual machines (VMs) in a private cloud computing environment implemented by using OpenStack (Mitaka release). The computing infrastructure consists of two servers with a total of 72 CPU cores and 180 GB of RAM. The servers are connected to a private Gigabit Ethernet LAN. Two additional machines provide distributed storage with GlusterFS. Each VM in the cluster is equipped with 4 virtual CPUs and 8 GB of RAM. The agent serving the spectrum sensors has been implemented by a low-end PC connected to the same LAN. Four RTL2832U devices were connected to the PC by means of a USB 2.0 hub. See Table 3 for further details on the hardware configuration.
Table 3 Prototype hardware and software specification.
Device type | Role | Count | Configuration | Operating System | Software
RTL-SDR | Spectrum sensor | 4 | fs = 1 MHz, aggregate range from 40 to 1700 MHz | — | librtlsdr v0.5.3
USB hub | Sensor connector | 1 | 5× USB 2.0 ports | — | —
PC | Sensor agent | 1 | Intel Pentium D CPU @ 3.00 GHz (2 cores), 4 GB RAM @ 667 MHz, 60 GB disk | Ubuntu 14.04 | Java 1.8.0
Server | OpenStack Cloud Controller + OpenStack Cloud Compute | 1 | AMD Opteron 6128 @ 2 GHz (32 cores), 120 GB RAM @ 1333 MHz, 60 GB disk | Ubuntu 14.04 | LAMP Server VM, Kafka VM, PrestoDB VM, MongoDB VM
Server | OpenStack Cloud Compute | 1 | Intel Xeon E5-2650 v3 @ 2.30 GHz (40 cores), 64 GB RAM @ 2133 MHz, 260 GB disk | Ubuntu 14.04 | Flink JobManager VM, 4× Flink TaskManager VMs
Server | Storage | 2 | Intel Xeon E5410 @ 2.33 GHz (8 cores), 20 GB RAM @ 667 MHz, 24 TB disk | Ubuntu 14.04 | GlusterFS
Switch | Network | 1 | 52× Gigabit Ethernet ports | — | —
With this configuration, we evaluated the performance of a legacy LAMP system versus the two more modern ones, namely FPMP and FMP. For the sake of this prototype, we configured both databases on single VMs, even though they support operation in distributed cluster mode. The LAMP server, Kafka, PrestoDB, and MongoDB VMs are hosted on the server acting as OpenStack Cloud Controller, while the Flink cluster, composed of one VM acting
as JobManager and four VMs acting as TaskManagers, is hosted on the OpenStack Cloud Compute server.

The FMP configuration, in which we bypass the interaction between Flink and PrestoDB, is worth considering since, at the time of writing, the open source JDBC driver for PrestoDB does not fully integrate with Apache Flink and, more generally, with Hadoop-compatible frameworks. These frameworks use dedicated classes, implementing the Hadoop InputFormat interface, to read data from various sources. The open source PrestoDB driver does not support prepared statements [39] and, thus, cannot be used with the aforementioned classes. To use the data retrieved through PrestoDB, it is necessary to load the entire returned information (i.e., the query result) into a collection object and then pass this object to the MapReduce job. The creation of this object takes a considerable amount of time and is performed on a single Flink node, with the risk of out-of-memory errors if the query produces a large result set. On the other hand, a third-party library [40] is available, providing a MongoDB Hadoop InputFormat that reads data in a fully compatible way with Apache Flink. This means that the query is split into multiple sub-queries, distributed across the Flink nodes and executed in parallel. Additionally, the rows in the result set of each sub-query can be processed by the Map function as soon as they are received, without the need to store all the received data in a collection object for later processing. This is the strategy we have implemented in our system.

We point out that the designed platform provides two main services. The first one allows downloading raw data (as sent by the agent) by specifying, through a web interface, the desired time window and frequency range. This service is provided by querying the PrestoDB module, and data are returned as a CSV file (an example query is sketched at the end of this subsection). These data can be processed to identify potential sub-bands to be used for IoT transmissions, a typical use in cognitive systems. The second service is the possibility to visualize these raw data as a picture, which represents the heat map of the radio spectrum. This picture is provided directly by the web interface of our platform, and the heat map can be customized so that each pixel is representative of a time window and a frequency window. This service provides the user with an immediate idea of potential spectrum white spaces, driving further analysis of the raw data. Fig. 5 shows a spectrum heat map example generated by our visualization application for the band from 400 MHz to 1700 MHz (reported on the vertical axis) and an acquisition time window of approximately one day and a half (horizontal axis). On the right side, the color bar indicates the relative power, in dB, associated with each pixel. For instance, it is easy to identify in the figure the typical night and day profile of the LTE band at 800 MHz, as well as the GSM band at 900 MHz.
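As an example of the raw-data download service, the sketch below issues a hypothetical SQL query through the PrestoDB Python client and dumps the result to CSV. Host, catalog, schema, table, and column names are illustrative placeholders, not the ones configured in our prototype; the selected columns follow the record of Table 2.

```python
import csv

import prestodb  # pip install presto-python-client

# Connection parameters and catalog/schema names are placeholders.
conn = prestodb.dbapi.connect(
    host="presto.example.org", port=8080, user="analyst",
    catalog="mongodb", schema="spectrum",
)
cur = conn.cursor()

# Select raw records in a given time window (epoch seconds) and frequency range (Hz).
cur.execute(
    "SELECT dvid, utc_time, start_freq, sample_rate, data "
    "FROM scans "
    "WHERE utc_time BETWEEN 1496275200 AND 1496361600 "
    "AND start_freq BETWEEN 400000000 AND 470000000"
)

with open("raw_scans.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])  # header from column metadata
    writer.writerows(cur.fetchall())
```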
4.3 Experimental results

We measured the time needed to produce a radio spectrum heat map picture (such as that shown in Fig. 5). This amount of time includes querying the sensor data from the database, computing the initial array by means of a MapReduce job, and finally rendering the actual picture. We repeated this benchmark multiple times for different sensing acquisition time windows with durations of 12, 24, 36, and 48 hours and a fixed frequency bandwidth spanning from 40 MHz to 1700 MHz, equally partitioned among the four sensors. Clearly, a larger acquisition time window implies a larger data volume served from the database. Fig. 6 shows the results of this measurement activity. The abscissa reports the data acquisition time window (in hours), whereas the ordinate reports the service time (in seconds).
Fig. 5 Heat map of the radio spectrum produced by the PHP user interface: RF acquisition frequencies are shown on the vertical axis, and acquisition time is reported on the horizontal axis. On the right edge, a colormap legend details the measured power, in dB.
We can observe that the performance of the different approaches (in particular, LAMP and FPMP) is comparable for the smallest dataset, while for the larger ones the LAMP system shows poor performance. In particular, up to an acquisition time window of 36 hours, the service time increases linearly, followed by a significant increase for an acquisition time window of 48 hours, which demonstrates the unsuitability of the LAMP architecture to support this type of service. In addition, during the queries, we have recorded a number of data losses in the writing operation, meaning that the database has dropped some data received from the sensors. By contrast, the FPMP system shows a more interesting performance, with a service time slowly increasing with the acquisition window duration, without exhibiting any data loss, thanks to the presence of the Kafka messaging system. In this test, we have used Flink as a traditional batch job engine: data are retrieved from the database by means of SQL queries through PrestoDB, then they are processed with MapReduce, and the relevant results are sent to external clients and, in addition, stored in the database for future (re-)use.

Finally, the FMP configuration shows the best performance, with an almost constant behavior over time. This is due to the parallel computing capabilities of Flink, together with the usage of the MongoDB Hadoop InputFormat, whose main advantages have been explained in Subsection 4.2. The constant behavior may appear optimistic. This happens since Flink can easily manage these dataset sizes, without the bottleneck due to the PrestoDB data preparation. Adding more sensors and integrating their data in the same time window would produce a slightly different behavior, but this does not invalidate our results: the legacy solution and the FPMP configuration would be infeasible with larger datasets.

However, it is worth mentioning other promising ways to process data. The most interesting one consists of configuring Flink to process raw data served by Kafka in real time, and to store the relevant results in the same NoSQL database, so as to decrease the service time of future requests. This should allow dramatically reducing the service time, especially for requests that process a very large batch of data. We will investigate the feasibility and benefits of this additional processing mode in future work.
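A minimal sketch of that streaming idea, assuming the kafka-python and pymongo clients instead of the Flink streaming API we plan to use: records are consumed as they arrive, a per-pixel running aggregate is updated, and the pre-aggregated results are written back to MongoDB so that later heat-map requests only read aggregates. Topic, collection, field names, and bin sizes are hypothetical, and for simplicity the records are assumed to be JSON-encoded (an Avro deserializer would mirror the producer sketch of Section 3).

```python
import json

from kafka import KafkaConsumer    # pip install kafka-python pymongo
from pymongo import MongoClient

# Placeholders for broker, topic, database, and collection names.
consumer = KafkaConsumer(
    "spectrum-scans", bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
pixels = MongoClient("mongodb://mongo:27017")["spectrum"]["pixel_averages"]

for msg in consumer:
    rec = msg.value
    key = {"t_bin": int(rec["utc_time"] // 60),      # 1-minute time bin
           "f_bin": int(rec["start_freq"] // 1e6)}   # 1-MHz frequency bin
    # Keep a running sum/count per (time, frequency) bin; the average is sum/count.
    pixels.update_one(key, {"$inc": {"sum_power": rec["power"], "count": 1}},
                      upsert=True)
```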
Fig. 6 Service time versus acquisition time for the legacy LAMP and the proposed FPMP and FMP approaches, using the MapReduce programming technique.
Finally, we expect to improve the performance by installing all the components in a distributed fashion on the computing cluster, so as to better exploit parallelism.
5 Conclusion and future work

This paper presents the design and prototype implementation of a Sensing as a Service system, specialized for a Spectrum Sensing as a Service use case. We have presented a detailed system architecture and motivated the main design choices. We have also compared our system with a solution realized with legacy components. The presented results show that the designed system clearly outperforms the legacy one, thanks to the usage of modern Big Data technologies, namely the MongoDB database and the Flink processing engine, and it is a candidate reference implementation for other platforms providing similar services. In addition, the extensive usage of open source components and the modular system design allow easily upgrading the system by replacing or adding nodes/functions.

As future work, we will deploy the modern software components in a cluster configuration to fully exploit their distributed nature. We will also implement a streaming job with Flink to process sensor data and produce spectrum occupancy data in real time in the web dashboard. Finally, we will apply our radio monitoring system also to the management of radio access in 5G networks.
Acknowledgements This work is supported by CLOUD and HYDRA, two research projects funded by the University of Perugia.
References
1. Akyildiz, I.F., Lee, W.Y., Vuran, M.C., Mohanty, S.: A survey on spectrum management in cognitive radio networks. IEEE Communications Magazine 46(4) (April 2008) 40–48
2. Wang, B., Liu, K.J.R.: Advances in cognitive radio networks: A survey. IEEE Journal of Selected Topics in Signal Processing 5(1) (Feb 2011) 5–23
3. Yucek, T., Arslan, H.: A survey of spectrum sensing algorithms for cognitive radio applications. IEEE Communications Surveys Tutorials 11(1) (First Quarter 2009) 116–130
4. Flores, A.B., et al.: IEEE 802.11af: a standard for TV white space spectrum sharing. IEEE Communications Magazine 51(10) (2013) 92–100
5. Gubbi, J., Buyya, R., Marusic, S., Palaniswami, M.: Internet of Things (IoT): A vision, architectural elements, and future directions. Future Generation Computer Systems 29(7) (September 2013) 1645–1660
6. Perera, C., et al.: Context aware computing for the Internet of Things: A survey. IEEE Communications Surveys & Tutorials 16(1) (2014) 414–454
7. Miorandi, D., et al.: Internet of things: Vision, applications and research challenges. Ad Hoc Networks 10(7) (September 2012) 1497–1516
8. Perera, C., Zaslavsky, A., Christen, P., Georgakopoulos, D.: Sensing as a service model for smart cities supported by Internet of Things. Transactions on Emerging Telecommunications Technologies 25(1) (2014) 81–93
9. De Mauro, A., Greco, M., Grimaldi, M.: A formal definition of Big Data based on its essential features. Library Review (2016)
10. Zaslavsky, A., Perera, C., Georgakopoulos, D.: Sensing as a service and big data. arXiv preprint arXiv:1301.0159 (2013)
11. Mell, P., Grance, T.: The NIST definition of cloud computing (2011)
12. Sheng, X., Tang, J., Xiao, X., Xue, G.: Sensing as a Service: Challenges, Solutions and Future Directions. IEEE Sensors Journal 13(10) (October 2013) 3733–3741
13. Zaslavsky, A., et al.: Sensing-as-a-Service and Big Data. Proceedings of the International Conference on Advances in Cloud Computing (ACC), Bangalore, India (2012)
14. Ghasemi, A., Sousa, E.S.: Spectrum sensing in cognitive radio networks: requirements, challenges and design trade-offs. IEEE Communications Magazine 46(4) (2008) 32–39
15. Apache: Flink. https://flink.apache.org/ Accessed: 2017-04-11.
16. Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi, S., Tzoumas, K.: Apache Flink: Stream and batch processing in a single engine. Data Engineering 38(4) (2015)
17. MongoDB: MongoDB. https://www.mongodb.com/ Accessed: 2017-04-11.
18. Győrödi, C., Győrödi, R., Pecherle, G., Olah, A.: A comparative study: MongoDB vs. MySQL. In: 13th International Conference on Engineering of Modern Electric Systems (EMES), 2015, IEEE (2015) 1–6
19. Apache: Kafka. http://kafka.apache.org/ Accessed: 2017-04-11.
20. Ranjan, R.: Streaming big data processing in datacenter clouds. IEEE Cloud Computing 1(1) (2014) 78–83
21. Blefari-Melazzi, N., Sorte, D.D., Femminella, M., Reali, G.: Autonomic control and personalization of a wireless access network. Computer Networks 51(10) (2007)
22. Baruffa, G., Femminella, M., Pergolesi, M., Reali, G.: A cloud computing architecture for spectrum sensing as a service. In: 2016 Cloudification of the Internet of Things (CIoT) (Nov 2016) 1–5
23. Sun, H., Nallanathan, A., Wang, C.X., Chen, Y.: Wideband spectrum sensing for cognitive radio networks: a survey. IEEE Wireless Communications 20(2) (April 2013) 74–81
24. Li, Z., Yu, F.R., Huang, M.: A distributed consensus-based cooperative spectrum-sensing scheme in cognitive radios. IEEE Transactions on Vehicular Technology 59(1) (Jan 2010) 383–393
25. Kotobi, K., et al.: Data-throughput enhancement using data mining-informed cognitive radio. Electronics 4(2) (2015) 221
26. Zhang, T., et al.: A Wireless Spectrum Analyzer in Your Pocket. In: Proceedings of HotMobile '15, New York, NY, USA, ACM (2015) 69–74
27. Chakraborty, A., Das, S.R.: Designing a Cloud-Based Infrastructure for Spectrum Sensing: A Case Study for Indoor Spaces. IEEE DCOSS 2016, Washington DC (May 2016) 17–24
28. Ulversoy, T.: Software defined radio: Challenges and opportunities. IEEE Communications Surveys Tutorials 12(4) (Fourth Quarter 2010) 531–550
29. Open IoT Consortium: Open IoT. http://openiot.eu Accessed: 2017-04-11.
30. Amazon: Amazon AWS. http://aws.amazon.com/ Accessed: 2017-04-11.
31. Google: Google Cloud. https://cloud.google.com/compute Accessed: 2017-04-11.
32. OpenStack: OpenStack. https://www.openstack.org/ Accessed: 2017-12-15.
33. Apache: Avro. http://avro.apache.org/ Accessed: 2017-04-11.
34. Maeda, K.: Performance evaluation of object serialization libraries in XML, JSON and binary formats. In: 2012 Second International Conference on Digital Information and Communication Technology and its Applications (DICTAP), IEEE (2012) 177–182
35. Popa, L., et al.: FairCloud: Sharing the network in cloud computing. In: ACM SIGCOMM 2012, ACM (2012) 187–198
36. Ousterhout, K., et al.: Making sense of performance in data analytics frameworks. In: USENIX NSDI '15, Oakland, CA (May 2015)
37. Chakraborty, A., Gupta, U., Das, S.R.: Benchmarking Resource Usage for Spectrum Sensing on Commodity Mobile Devices. ACM HotWireless 2016, New York City
38. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Communications of the ACM 51(1) (2008) 107–113
39. Oracle: Using prepared statements. The Java Tutorials. http://docs.oracle.com/javase/tutorial/jdbc/basics/prepared.html Accessed: 2017-04-12.
40. MongoDB: MongoDB connector for Hadoop. https://github.com/mongodb/mongo-hadoop Accessed: 2017-04-11.