
Rakeness-based Compressed Sensing on Ultra-Low Power Multi-Core Biomedical Processors

Daniele Bortolotti†, Mauro Mangia‡, Andrea Bartolini†§, Riccardo Rovatti†‡, Gianluca Setti∗ and Luca Benini†§

† DEI, University of Bologna, Italy - Email: [email protected]
§ Integrated Systems Laboratory, ETH Zurich, Switzerland - Email: {barandre, lbenini}@iis.ee.ethz.ch
‡ ARCES, University of Bologna, Italy - Email: {mauro.mangia2,riccardo.rovatti}@unibo.it
∗ ENDIF, University of Ferrara, Italy - Email: [email protected]

Abstract—Technology scaling enables today the design of ultra-low cost wireless body sensor networks for wearable biomedical monitors. The typical behaviour of such systems consists of multi-channel input biosignals acquisition, data compression and final output transmission or storage. To achieve minimal energy operation and extend battery life, several aspects must be considered, ranging from signal processing to architectural optimizations. The recently proposed Rakeness-based Compressed Sensing (CS) paradigm deploys the localization of input signal energy to further increase compression without sensible RSNR degradation. Such output size reduction allows for trading off energy from the compression stage to the transmission or storage stage. In this paper we analyze such tradeoffs considering a multi-core DSP for input biosignal computation and different technologies for either transmission or local storage. The experimental results show the effectiveness of the Rakeness approach (on average ≈ 44% more efficient than the baseline) and assess the energy gains in a technological perspective.

I. INTRODUCTION AND RELATED WORK

Modern human behavior-related diseases, such as cardiovascular diseases, require precise and long-term medical supervision, which is unsustainable for the traditional healthcare system due to increasing costs and medical management needs [1]. Personal health monitoring systems are able to offer a large-scale and cost-effective solution to this problem. As a matter of fact, emerging and future healthcare policies are fueling a shift toward long term monitoring of biosignals by means of embedded ultra-low power (ULP) devices [2]. Wearable, miniaturized and wireless sensors nodes to continuously measure and remotely report biomedical signals, can indeed provide the ubiquitous, long-term and real-time monitoring required by the patients, and enable faster coordination with medical personnel. Such health monitoring devices, enabled by Wireless Body Sensor Networks (WBSNs), are subject to conflicting strict requirements. On one side they are constrained to a reduced power budget to extend battery life, while on the other side there is an increasing demand for computation capabilities to process locally the sensed data so as to reduce the amount of data to be remotely transmitted or stored in the device. To achieve the ultimate goal of minimal energy operation, several aspects must be considered combining optimizations of the signal processing aspects and of the different technological layers composing the ULP architecture. Compressed Sensing (CS) signal acquisition and compression paradigm [11], [12], has recently proved to be effective

in reducing energy consumption in embedded biomedical monitors. The aim of CS is to represent the information content of the input signal using fewer digital words than Nyquist-rate sampling. For instance, the work presented in [3] shows a ≈ 40% improved lifetime compared to state-of-the-art compression techniques for an embedded ECG monitor. The general CS theory was recently extended with the introduction of the concept of rakeness [16], [17], [18]. The basic idea behind this approach is to exploit the localization of signals, i.e. the assumption that the information of the signal is not equally distributed in its domain. Roughly speaking, localization implies that some realizations of the input process have a higher probability than others. The hypothesis of localization is not a limitation, since the only class of signals where all possible realizations have the same probability is white noise. In more detail, one of the CS assumptions is that measurements are obtained through the projection of the input signal on random independent and identically distributed (i.i.d.) sampling functions. In [16] the authors showed that the latter is only a sufficient condition, and that using random sampling functions which are not independent but show a limited correlation makes it possible to achieve either a higher quality of the reconstructed signal or a reduced number of measurements for a given target reconstruction quality. With the goal of reducing the computational load, a new rakeness-based approach called zeroing is introduced in this work. The idea is to use a sparse sensing matrix, i.e. a matrix with fewer non-null elements than a full sensing matrix. To apply zeroing we consider sensing matrices generated following the rakeness approach, in which randomly chosen elements are then set to zero.
Such approach outperforms the standard rakeness approach in terms of memory footprint and computation requirements and moreover the sparsity of its sensing matrix makes it suitable for a more efficient algorithmic implementation. The drawbacks of the zeroing technique consist of a distortion of the statistical characterization and an increase of the number of measurements required, nevertheless it opens up a new range of possible tradeoffs. If we consider a typical wearable biomedical monitoring system based on CS, three phases can be identified: (i) input biosignals acquisition (data collection), (ii) sensors data processing (computation) and finally (iii) management of the processed input (transmission or storage). In the first two stages the analog input biosignal is sampled and made available to a digital signal processor (DSP) for processing. Subsequently the compressed output can be transmitted (to

a personal server such as a smartphone) or locally stored for later off-line medical analysis. Motivated by the inherently parallel nature of medical-grade biomedical monitoring, where multi-channel signal analysis is often embarrassingly parallel, a multi-core architecture for the DSP is considered. Such an architectural template proved its efficiency compared to single-core solutions [4], [5]. In [4] the authors introduced a multi-core architecture where individual leads are processed on different cores in parallel. Parallel processing enables more aggressive voltage-frequency scaling than single-core solutions, though at low workload requirements the single-core solution proved to be more efficient due to the dominating leakage power. The initial architecture proposed in [4] was further consolidated by optimizing different aspects to increase its energy efficiency. In [7] the authors propose a hardware/software approach to synchronize the execution of biosignal processing applications, while in [6] a hybrid memory is devised to reduce the consumption by adapting the active memory portions to different CS workload requirements. The main contributions of this paper are the following:

• The recently proposed rakeness-based CS is compared to a standard CS implementation in terms of computational efficiency considering a multi-core biomedical DSP. The rakeness approach leads to ≈ 44% improvements in terms of energy efficiency and can substantially reduce (≈ 43%) the number of measurements.

• A novel technique based on sparsification of rakeness matrices is proposed, and the tradeoffs in terms of input data compression, reconstruction quality and computation requirements are evaluated. Combined with an efficient implementation, the zeroing approach proved its efficiency (≈ 64%) w.r.t. the baseline.

• To complete the analysis, we consider different technologies for output transmission or storage, assessing the rakeness benefits in a technological perspective.

The rest of the paper is organized as follows. In Section II the rakeness-based compressed sensing approaches are introduced. Section III presents the overall system and the multi-core DSP. Next, in Section IV we describe the experimental setup and the results of the evaluation of the proposed algorithms in terms of reconstruction quality, memory footprint and energy efficiency considering different technological solutions for output transmission or storage. Finally, the conclusions of this work are presented in Section V.

II. RAKENESS-BASED COMPRESSED SENSING

In the area of signal processing, CS is a recently introduced paradigm that merges signal acquisition and compression tasks. The aim of CS is to represent the information content of the input signal using fewer digital words with respect to Nyquist-rate sampling. It is possible to overcome the intrinsic limit imposed by Nyquist-Shannon theorem thanks to prior knowledge on the considered class of input signals, which must be sparse. Referring to a fixed time window T , we name x the vector containing the N Nyquist-rate input signal samples. The signal x is sparse if there exists a fixed and suitable basis Ψ = {ψ1 , . . . , ψN } such that x = Ψα, where Ψ is a N × N matrix with column vectors ψj , j = 1, . . . , N and the coefficients vector α has at most K non null elements. The

implicit meaning is that the information content associated to x is less than N, thus data compression can be achieved [13]. To further increase the compression of the sensed data, a different approach was proposed in [15], [17], [16], where the authors couple the sparsity hypothesis with another assumption on the acquired class of signals: x must be sparse and localized, i.e. the information content is non-uniformly distributed in the whole signal domain. In this scenario, a more compressed representation is achieved by introducing a new guideline in the sensing, called rakeness, defined as the average energy that the sensing vectors are able to collect ("rake") when the input signal is projected onto them. In the following paragraphs a description of the standard CS approach is given and then an overview of the rakeness approach is presented.

A. CS Standard Approach

In the encoder stage the information extraction is achieved by projecting the input signal x onto a suitable set of N-dimensional sensing vectors φj, j = 1, . . . , M, arranged row by row in the Sensing Matrix Φ. The signal information extraction (i.e. the measurement vector y) can be represented as follows:

\[ y = \Phi x + \nu = \Phi\Psi\alpha + \nu \qquad (1) \]

where ν is an additive noise that takes into account all non-idealities like the quantization error or the intrinsic noise of the input signal. A graphical representation of the entire CS encoder is depicted in Figure 1.

Fig. 1: Matrix representation of the information extraction based on Compressed Sensing.

The decoder stage receives the measurement vector and, thanks to the knowledge of both sensing and sparsity matrices shared with the encoder, the reconstruction of x can be achieved by solving the following optimization problem [11], relying on the sparsity assumption:

\[ \hat{\alpha} = \arg\min_{\alpha} \|\alpha\|_{l_1} \quad \text{s.t.} \quad \|\Phi\Psi\alpha - y\|_{l_2} < \epsilon \qquad (2) \]

where \|\cdot\|_{l_1} = \sum |\cdot| and \|\cdot\|_{l_2} = \sqrt{\sum (\cdot)^2} are the standard l1 and l2 norms, ε bounds the effects of the noise ν, and the reconstructed signal can therefore be written as x̂ = Ψα̂. The goodness of the reconstruction is guaranteed by the so-called Restricted Isometry Property (RIP) of the matrix ΦΨ, ensuring that the sensing stage is able to conserve the input signal l2 norm also with M < N, i.e. with a data compression equal to N/M. From the standard CS theory [11] it is known that RIP is always satisfied by adopting Φ composed of instances of a collection of i.i.d. random variables, such as a set of random antipodal sequences with equal probability to present −1 or +1. In this situation the reconstruction is guaranteed by adopting M ≥ Mmin = 4K log(N/K) [11].
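The encoder of the standard approach can be illustrated with a minimal C sketch that fills Φ with i.i.d. antipodal entries and evaluates the noise-free part of (1), y = Φx. The function names `fill_antipodal` and `measure`, the row-major layout and the use of `rand()` are assumptions made for illustration only and are not taken from the paper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Fill an m-by-n sensing matrix (row-major) with i.i.d. antipodal entries:
 * each element is +1 or -1 with (approximately) equal probability.
 * rand() is used only for illustration; it is not a high-quality generator. */
void fill_antipodal(int8_t *phi, size_t m, size_t n, unsigned seed)
{
    assert(phi != NULL);
    srand(seed);
    for (size_t i = 0; i < m * n; i++)
        phi[i] = (rand() < RAND_MAX / 2) ? -1 : 1;
}

/* Noise-free measurement y = Phi * x (hypothetical helper). */
void measure(const int8_t *phi, const int16_t *x, int32_t *y,
             size_t m, size_t n)
{
    for (size_t i = 0; i < m; i++) {
        int32_t acc = 0;
        for (size_t j = 0; j < n; j++)
            acc += phi[i * n + j] * x[j];  /* entry is +1 or -1: add or subtract */
        y[i] = acc;
    }
}
```

Because every entry of Φ is ±1, each measurement reduces to a signed sum of the input samples, so no multiplier is strictly required.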

B. CS Rakeness Approach

As already introduced, rakeness is an extension of the standard CS theory. It relaxes the Restricted Isometry Property (RIP) when the class of signals to acquire is also localized. By exploiting this approach it is possible to reduce Mmin while still guaranteeing a correct reconstruction [17], [16]. The goal of rakeness-based CS is to increase the energy collected in the sensing stage while preserving the RIP of the ΦΨ operator.

Considering two stochastic processes φ and x, generating respectively the sensing vectors φj and the signal instances x, we define the rakeness ρ as

\[ \rho(\phi, x) = E_{\phi,x}\left[ |\langle \phi_j, x \rangle|^2 \right] \]

where ⟨·, ·⟩ stands for the standard inner product such that \langle \phi_j, x \rangle = \sum_{i=1}^{N} \phi_{j,i} x_i. The idea is to increase the signal energy collected by the generic φj, with the constraint that the sensing vectors are random enough to preserve the RIP. This translates into solving the following optimization problem:

\[ \max_{\phi} \; \rho(\phi, x) \quad \text{s.t.} \quad \langle \phi_j, \phi_j \rangle = e, \quad \rho(\phi, \phi) \le \tau e^2 \qquad (3) \]

where e is the energy of each sampling vector¹ and the second constraint is an upper bound related to the randomness of the process generating all the φj involved in the sensing. Tuning τ within a proper range is not critical, since it does not appreciably alter the overall system performance [15]. The output of this optimization problem is the second-order statistical characterization of the stochastic process φ, in the form of its correlation matrix. In Section IV-A practical examples will show the benefit introduced by rakeness with respect to standard CS for the case of ECG biosignals.

¹ Note that for antipodal sampling sequences it is always e = N.

III. TARGET ARCHITECTURE

In this section we first introduce the WBSN-based biosignal monitoring system, then we present the target multi-core DSP architecture to perform Compressed Sensing.

A. System Overview

In the present work we consider a biomedical system where three phases can be identified: Data Collection, Computation and Transmission/Storage. A block diagram of the system is shown in Figure 2.

Fig. 2: Block diagram of the considered biomedical system. (Blocks: AFE with one ADC per channel CH1 ... CHK, multi-core DSP, TX or NVM; phases: Data Collection, Computation, Transmission/Storage.)

Data Collection Phase: The input k-channel biosignal is sampled by the Analog Front-End (AFE) in this phase, with a sampling frequency (fs) chosen according to the dynamics of the signal to analyze and the accuracy needed. During the data collection phase the DSP waits for the number of samples (N) required to perform compression. Considering typical sampling frequencies for biomedical signals, this phase lasts longer than the computation phase. For instance, with fs = 256Hz and N = 512, the data collection phase lasts 2 seconds. During this phase, for most of the time, the whole system is idle, thus we assume a deep low power state (almost zero power) for both the DSP and the transmission/storage subsystem to avoid unnecessary power consumption. Once a set of k new samples is ready in the AFE buffer, the DMA is triggered to move the data to the DSP memory for later processing.

Computation Phase: Once all the samples from the different leads are collected, the multi-core DSP performs the compression algorithm (described in Section IV-B). The considered system performs a burst of computation on the available data for future transmission. During this phase the DSP is in an operating point characterized by high workload requirements and high memory footprint. As will be detailed in Section IV-A, the zeroing introduced by the rakeness-based compressed sensing allows for energy tradeoffs between the computational load and the size of the measurement vectors for subsequent transmission or storage.

Transmission/Storage Phase: In the last stage, within a given time window, the compressed data are either transmitted (TX) or stored in a non-volatile memory (NVM) for future offline usage (e.g. medical analysis of a day-long monitoring). The different power figures following the choice between transmission or storage, and the respective protocol or technology, will be presented in Section IV-D. We further assume a power efficient architecture with a deep low power state with negligible consumption for the DSP, once the computation phase is over.

B. Multi-Core DSP

Motivated by the inherently parallel nature of medical-grade biomedical monitoring, where multi-channel signal analysis is often embarrassingly parallel, a multi-core architecture for the DSP is considered. We consider a target DSP architecture similar to several current multi-core architectures targeting digital biosignal processing [5], [4].
The considered architecture, presented in Figure 4, features 8 Processing Elements (PEs) with separate instruction and data bus paths (i.e. Harvard Architecture). PEs have neither instruction nor data caches, therefore avoiding refill costs and coherency protocol overheads. On the instruction side each PE has a private single-cycle Instruction Memory (IM) where the CS code is stored, while they all share an L1 multi-banked tightly coupled data memory (TCDM) acting as a shared data scratchpad memory. The number of memory ports of the TCDM is equal to the number of banks, to allow concurrent access to different memory locations. Once a read or write request is brought to the memory interface, the

Fig. 3: Visual representation of a single input ECG signal using the generator in [19] and the associated VERY GOOD reconstructed signals for both standard CS approach (labeled Base) and Rakeness CS approach (labeled Rake). (Three panels: Input Signal, Base, Rake; x-axis: time (s); y-axis: ECG.)

Fig. 4: Multi-core DSP architecture for CS. (Blocks: AFE Samples Buffer (SB) holding the sample vectors S1 ... SK of channels CH1 ... CHK; processing elements PE0 to PE7, each with a private IM; LIC; TCDM with memory banks MB0 to MB15.)

data is available on the negative edge of the same clock cycle, leading to two clock cycles latency for conflict-free TCDM access. If conflicts occur there is no extra latency between pending requests: once a given bank is active, it responds with no wait cycles. The communication is based on a high-bandwidth low-latency interconnect (LIC). It consists of a Mesh-of-Trees interconnection network able to support single-cycle communication between PEs and memory banks (MBs), resembling the hardware module presented in [8]. In case of multiple conflicting requests, a round-robin scheduler arbitrates the accesses to ensure fair access to the memory banks. To ease the negative impact of banking conflicts we consider a banking factor of 2 (16 banks). With reference to Figure 4, the digital samples (si), corresponding to the different leads (k), are stored in a buffer inside the AFE during the data collection phase. To reduce the AFE buffer size, we consider the DMA (not shown in Figure 4) to be triggered to move the k samples from the AFE buffer to the TCDM. Once the last series of samples is moved to the DSP memory, the multi-core compression algorithm executes, with each core active and working on its own subset of the sampled data. We assume that the computation phase must be completed before the first sample of the next time window (N + 1) is available, to avoid double buffering overhead.

IV. EVALUATION

In this section we first present the analysis of the different CS algorithms in terms of reconstruction quality and computational load, and then focus on their algorithmic implementation. We then introduce the simulation framework and present the evaluation in terms of energy efficiency.

A. Rakeness Analysis

To confirm the announced benefits introduced by CS based on rakeness with respect to standard CS, we present here the results related to a set of simulations on synthetic ECG biosignals using the generator in [19], where an intrinsic noise level of 50 dB is taken into account to deal with ADC and AFE nonlinearities. The considered setup is characterized by time windows of 2 seconds and input signals sampled at 256Hz, leading to N = 512. It is known that ECGs are sparse over the Symmlet-6 orthonormal basis [20], therefore this class of signals is a good candidate to prove rakeness advantages. All results are obtained by performing Monte Carlo simulations over 300 trials, where measurement vectors are collected by (1) and the input signal reconstruction is obtained by solving (2) adopting the SPGL1 optimization toolbox². As figure of merit we consider the Reconstruction Signal to Noise Ratio (RSNR), expressed by

\[ \mathrm{RSNR} = \left( \frac{\|x\|_{l_2}}{\|x - \hat{x}\|_{l_2}} \right)_{\mathrm{dB}} \]

In addition to RSNR we adopt a global classification labeling for signal quality as perceived by medical specialists [21]. We mark x̂ instances as VERY GOOD when the associated RSNR ≥ 34 dB, as GOOD when 21 dB ≤ RSNR < 34 dB, and as NO GOOD otherwise. A visual representation of a single input signal and the associated VERY GOOD reconstructions is shown in Figure 3, for both the standard CS approach (labeled Base) and the Rakeness CS approach (labeled Rake).

² http://www.cs.ubc.ca/~mpf/spgl1

Fig. 5: Average RSNR as a function of the compression ratio (CR) for standard CS (Base), rakeness based CS (Rake) and rakeness with zeroing (Rake64). (x-axis: CR; y-axis: Average RSNR (dB); quality bands: VERY GOOD, GOOD, NO GOOD.)

The standard CS approach was tested adopting an i.i.d. antipodal random sensing matrix Φ and the associated average

RSNR as a function of compression ratio (CR), defined as CR = N/M, is shown in Figure 5. In this case a maximum of CR = 2.7 must be considered to have a GOOD reconstruction, and CR = 2 is the largest allowed value to reach the VERY GOOD standard.

After solving the optimization problem (3), the correlation matrix of the process generating the Φ matrices is known. A Linear Probability Feedback process [22] is then used to generate the rakeness-based sensing matrix Φ. The results of this approach are also shown in Figure 5, with an appreciable increase of the compression ratio associated to either GOOD or VERY GOOD reconstruction levels with respect to the Base case. In particular, VERY GOOD performance is obtained with CR ≈ 3.5, which corresponds to M = 145 measurements, while GOOD reconstructions are always guaranteed for CR ≤ 5.5.

It is important to highlight that a reduced M translates into a reduced amount of bits that must be transmitted or stored every time window, and also into a reduced amount of computation and memory accesses needed by each processing element to compute its own measurement vector y. We quantify this second aspect by defining the Computational Load (CL) as the number of sums needed to evaluate the entire vector y.

Fig. 6: Design space exploration in terms of Computational Load (CL) and measurements vector size (M). (x-axis: Computational Load (CL); y-axis: Output Size (M); points: Base, Rake, Rake64.)

With the goal of reducing CL for the CS computation we introduce here a new rakeness-based approach called zeroing. The idea is to use a sparse sensing matrix Φ, i.e. a matrix with fewer non-null elements than both Base and Rake, which are based on full sensing matrices (matrices where all elements are nonzero). To apply zeroing we consider sensing matrices generated following the rakeness approach, in which randomly chosen elements are then set to zero.
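The zeroing step can be sketched in C as follows. The sketch assumes that a fixed number k of entries is kept in every column and that the kept rows are drawn uniformly with a partial Fisher-Yates shuffle; the exact selection procedure is not specified in the text, and the names `zero_columns` and `count_nonzero_in_column` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Zero randomly chosen entries of an m-by-n matrix (row-major) so that each
 * column keeps exactly k of its original entries. Assumption: rows to keep
 * are chosen per column with rand(), which is adequate for a sketch only. */
void zero_columns(int8_t *phi, size_t m, size_t n, size_t k, unsigned seed)
{
    assert(k <= m);
    size_t *rows = malloc(m * sizeof *rows);
    assert(rows != NULL);
    srand(seed);
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < m; i++)
            rows[i] = i;
        /* Partial Fisher-Yates shuffle: rows[0..k-1] become the kept rows. */
        for (size_t i = 0; i < k; i++) {
            size_t r = i + (size_t)rand() % (m - i);
            size_t t = rows[i];
            rows[i] = rows[r];
            rows[r] = t;
        }
        /* Every row that was not selected is set to zero in column j. */
        for (size_t i = k; i < m; i++)
            phi[rows[i] * n + j] = 0;
    }
    free(rows);
}

/* Count the non-zero entries of column j (helper for checking the result). */
size_t count_nonzero_in_column(const int8_t *phi, size_t m, size_t n, size_t j)
{
    size_t c = 0;
    for (size_t i = 0; i < m; i++)
        if (phi[i * n + j] != 0)
            c++;
    return c;
}
```

With k non-zero entries per column, the number of sums needed for one window is k · N instead of M · N, which is the source of the CL reduction.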
It is clear that this enforcement will limit the benefit introduced by rakeness by distorting the statistical characterization imposed by (3), and an increase in the dimensionality of y is expected. Nevertheless this approach can produce a reduction in terms of CL. In our simulation settings we tested this approach by zeroing Φ matrices obtained by rakeness, leaving only 64 non-zero elements per column; this configuration is named Rake64. The performance in terms of average RSNR is shown in Figure 5. This plot highlights how zeroing globally reduces the RSNR with respect to Rake, but the sparsification of the sensing matrix makes this approach appealing in terms of CL reduction. Results in terms of CL and M associated to Base, Rake and Rake64 are reported in Figure 6, where a VERY GOOD reconstruction case was taken into account.

B. Algorithmic Implementation

The different tradeoffs introduced by the rakeness approach, in terms of computational effort and number of measurements, were presented in the previous section. The three

different cases for the CS algorithms introduced (Base, Rake and Rake64) can be divided into two algorithmic implementations due to their sensing matrix structure. In both cases the multi-core DSP operates in a SIMD fashion, where each core compresses the input data related to the associated input channel. The Base and the Rake cases deploy a full matrix (density = 100 %) where all elements are either +1 or −1. In this case the projection of the input vector with the sensing matrix can be implemented with a standard Full Matrix (FM) multiplication, shown in Listing 1, and the Φ matrix can be represented with the signed char datatype.

    void cmp_smpl_fm(short *x, short *y) {
        unsigned short i, j;
        /* y[] is accumulated in place and is assumed to be zeroed by the
           caller; M, N and Phi[M][N] are assumed to be defined at file scope. */
        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                y[i] += Phi[i][j] * x[j];
    }

Listing 1: Full Matrix (FM) version of CS.

If we consider the Rake64 case, the sensing matrix Φ is sparse (+1, 0 or −1 elements), and this motivated a more efficient algorithm implementation based on Look-Up Tables (LUTs), shown in Listing 2. This version encodes the information related to the position and the sign of the non-zero elements of the sensing matrix in two different LUTs, respectively *lut_idx_ptr and *lut_sign_ptr. In Listing 2 a case with input size N = 512 and an output size reduced by at least 50% (M ≤ 256) is considered, allowing for unsigned char and signed char data types respectively, while a short representation applies for x and y due to the 12-bit resolution ADC considered.

    void cmp_smpl_lut(short *x, short *y) {
        unsigned short i, j;
        unsigned char *lut_idx_ptr = &lut[0];    /* row index of each non-zero */
        signed char *lut_sign_ptr = &lut_s[0];   /* sign (+1 or -1) of each non-zero */
        /* n input samples (columns of Phi), k non-zero entries per column;
           y[] is assumed to be zeroed by the caller. */
        for (i = 0; i < n; i++)
            for (j = 0; j < k; j++)
                y[*(lut_idx_ptr++)] += (*(lut_sign_ptr++)) * x[i];
    }

Listing 2: Look-Up Table (LUT) version of CS.

Due to the varying number of measurements (M) required to achieve a VERY GOOD reconstruction quality and to the different algorithmic implementations, the Base, Rake and Rake64 cases lead to different memory footprints in the DSP data memory. In all cases the TCDM has to hold the input samples si; with N = 512, 8 channels and 12-bit ADC resolution this amounts to 8KB using short datatypes. For the measurement vectors y associated to each PE and the sensing matrix Φ (or the LUT structures), the requirements differ, as reported in Table I.

TABLE I: TCDM memory footprint requirements.

CASE     M     OUTPUT   ALGORITHM   SENSING
Base     256   4096B    FM          128KB
Rake     145   2320B    FM          72.5KB
Rake64   190   3040B    LUT         32KB
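To make the relation between the two listings concrete, the following C sketch builds the index and sign LUTs from a sparse matrix and checks that the LUT-based projection gives the same result as a full-matrix projection. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`build_luts`, `project_lut`, `project_full`), the row-major layout and the column-by-column LUT ordering are choices made here so that the traversal matches the loop structure of Listing 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build the two LUTs from an m-by-n matrix (row-major) with entries in
 * {+1, 0, -1}. Entries are emitted column by column, so that a column-wise
 * traversal as in Listing 2 can consume them sequentially.
 * Returns the number of LUT entries written. */
size_t build_luts(const int8_t *phi, size_t m, size_t n,
                  uint8_t *lut_idx, int8_t *lut_sign)
{
    assert(m <= 256);  /* row indices must fit in an unsigned char */
    size_t out = 0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < m; i++)
            if (phi[i * n + j] != 0) {
                lut_idx[out] = (uint8_t)i;
                lut_sign[out] = phi[i * n + j];
                out++;
            }
    return out;
}

/* LUT-based projection; assumes exactly k non-zero entries per column. */
void project_lut(const int16_t *x, int16_t *y, size_t m, size_t n, size_t k,
                 const uint8_t *lut_idx, const int8_t *lut_sign)
{
    for (size_t i = 0; i < m; i++)
        y[i] = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < k; j++, lut_idx++, lut_sign++)
            y[*lut_idx] = (int16_t)(y[*lut_idx] + *lut_sign * x[i]);
}

/* Reference full-matrix projection y = Phi * x. */
void project_full(const int8_t *phi, const int16_t *x, int16_t *y,
                  size_t m, size_t n)
{
    for (size_t i = 0; i < m; i++) {
        y[i] = 0;
        for (size_t j = 0; j < n; j++)
            y[i] = (int16_t)(y[i] + phi[i * n + j] * x[j]);
    }
}
```

The LUT version performs k accumulations per input sample regardless of M, whereas the full-matrix version performs M, which is why the sparse case benefits from a different loop structure.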

C. Simulation Framework

The considered multi-core DSP architecture has been modeled and integrated in a SystemC-based cycle-accurate virtual platform [9], with back-annotated power numbers for the memory subsystem and the rest of the logic (LIC, PEs) extracted from a RTL-equivalent architecture [10]. Table II presents the comparison, in terms of execution cycles of a matrix multiplication benchmark, of the SystemC and the RTL platforms, showing their alignment. Based on the analysis of the CS code (see next section), the 8-core DSP architecture has been configured with a 192KB 16-bank TCDM and 1KB Instruction Memory per core. Static data allocation is performed by means of cross-compiler attributes and linker script sections. A stack portion of 512B is assigned to each core and it resides in TCDM.

TABLE II: SystemC and RTL platforms alignment.

PEs   RTL (cycles)   SystemC (cycles)   DELTA (%)
1     244645         220532             9.86
2     122426         111041             9.30
4     61339          57069              6.96
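The DELTA column is not defined explicitly in the text. A small C sketch, assuming that it is the cycle difference relative to the RTL count, reproduces the tabulated percentages; the function name `delta_percent` is hypothetical.

```c
#include <assert.h>

/* Relative difference between RTL and SystemC cycle counts, in percent.
 * Assumption: the RTL count is the reference (denominator). */
double delta_percent(long rtl_cycles, long systemc_cycles)
{
    assert(rtl_cycles > 0);
    return 100.0 * (double)(rtl_cycles - systemc_cycles) / (double)rtl_cycles;
}
```

For example, `delta_percent(244645, 220532)` evaluates to about 9.86, matching the single-PE row of Table II.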

D. Experimental Results 1) CS Computation Phase: To evaluate the energy cost of the different CS algorithms we instrumented the SystemC simulator to collect activity from each architectural element of the modeled DSP architecture. For the back-annotated power numbers we consider the design corner (RVT, 25C, 0.6V) @ 100MHz in a 28nm FDSOI technology. Moreover, considering that the compression task is memory-bound by nature, the requirements in terms of core-memory bandwidth imply higher supply voltage for the memory (0.8V) in order to sustain the throughput. The results of the evaluation are shown in Figure 7 where on the x-axis are presented the different CS algorithms and on the y-axis the energetic cost for multichannel input compression is reported. Considering the Base and the Rake cases we note the effectiveness of the rakeness approach. The Rake approach requires ≈ 44% less energy than Base and moreover it enables further output compression. If we consider the Rake64 case, thanks to the increased sparsity of the sensing matrix and the efficient LUT-based algorithmic implementation, the energy gain increases further, reaching a maximum of ≈ 64% with respect to the Base case.

Fig. 7: Energy for the Compression Stage required by the DSP for Base, Rake and Rake64 cases. (Bar chart; y-axis: DSP Energy for Computation [µJ]; annotated savings with respect to Base: ≈ 44% for Rake and ≈ 64% for Rake64.)

2) Transmission/Storage Phase: As a final experiment we evaluated the energy requirements of the last stage of the considered biomedical monitor. We derived a simple model to cope with some of the most widespread or forthcoming technologies for either transmission or storage. The power required to transmit or store the compressed data is calculated as follows:

\[ P_{TX,ST} = E_{TX,ST} \cdot f_s \cdot R_{ADC} \cdot CR^{-1} \]

where R_ADC stands for the baseline ADC resolution, fs for the input sampling frequency, CR for the compression ratio (N/M) and E_TX,ST is the energy required for transmitting or storing one bit of the compressed data. Different protocols were considered for the transmission option, ranging from the power-hungry Bluetooth LE to the efficient Impulse Radio Ultra Wide Band (IR-UWB). In the storage scenario, i.e. saving the compressed data in a Non Volatile Memory (NVM), we considered different promising technologies, from the resistive RAM (ReRAM) to the Conductive Bridging RAM (CBRAM). We dimensioned the system with fs = 256Hz, R_ADC = 12 bit/sample and CR as determined in Section IV-A for the VERY GOOD case, while the different values of E_TX,ST are reported in Table III.

TABLE III: E_TX,ST for the different transmission (TX) and storage (NVM) technologies.

TYPE   TECHNOLOGY      E_TX,ST [nJ/bit]   Ref.
TX     Bluetooth LE    5                  [23]
TX     IEEE 802.15.6   0.5                [24]
TX     Body Channel    0.25               [25]
TX     IR-UWB          0.012              [26]
NVM    ReRAM           2                  [27]
NVM    STT-MRAM        0.1                [28]
NVM    FG Flash        0.01               [29]
NVM    CBRAM           0.001              [30]
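The power model above can be evaluated with a few lines of C. The sketch below is illustrative: the function name `tx_st_power_nw` is hypothetical, the compression ratio is passed as N and M rather than as CR, and the result is expressed in nW because E_TX,ST is given in nJ/bit. Whether the formula refers to a single channel or to all channels is not stated in the text and is not modeled here.

```c
#include <assert.h>

/* P = E * fs * R_ADC / CR, with CR = N / M.
 * Units: nJ/bit * samples/s * bit/sample = nJ/s = nW. */
double tx_st_power_nw(double e_nj_per_bit, double fs_hz, double adc_bits,
                      double n, double m)
{
    assert(n > 0.0 && m > 0.0);
    return e_nj_per_bit * fs_hz * adc_bits * (m / n);
}
```

With the Bluetooth LE figure of 5 nJ/bit, fs = 256 Hz, R_ADC = 12 bit/sample and N = 512, the model gives 7680 nW for Base (M = 256) and 4350 nW for Rake (M = 145), i.e. a reduction of 1 − 145/256 ≈ 43.4% in this stage.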

The results of the evaluation are presented in Figure 8, where the energy of the computation and transmission or storage stages is stacked³ on the y-axis in logarithmic scale. On the x-axis the CS algorithms are grouped by technology, and the percentage energy savings are reported on top of the Rake and Rake64 bars. The total energy consumption is clearly dominated by the transmission or storage stage in almost all cases. Thanks to the reduced output footprint, the Rake approach outperforms the baseline with an average energy saving over the considered technologies of $E_{s,Rake} = 43.46\%$. Due to the lower compression achieved by the Rakeness cases with zeroing, the transmission or storage cost places them as an intermediate case, with $E_{s,Rake64} = 32.15\%$. This trend holds for seven of the eight cases considered, while for CBRAM the Rake64 algorithm proved to be the most efficient (reported in Figure 9). With the simple model considered it is hard to draw conclusions on a specific technology, but it is clear that when the Compression and Transmission or Storage stages are comparable in terms of energy requirements, which is the desired trend in ultra-low power WBSNs, the zeroing technique yields the larger energy gains. Thanks to the efficient LUT-based algorithmic implementation, the energy impact of the compression stage is reduced, and the smaller size of the measurement vector ($M$) with respect to the baseline allows for further improvements. Moreover, in case the DSP cannot embed a sufficient amount of memory to store the sensing data required by the Rake algorithm (see Table I), the zeroing approach can be the preferred choice.

³ The zoomed area in the figure is intended to clarify that the $E_{TX,ST}$ contribution is dominant and that this is amplified by the logarithmic scale.

[Figure 8: stacked bar chart. y-axis: Total Energy [J], logarithmic scale from 1E-05 to 1E+00, split into $E_{CS}$ and $E_{TX,ST}$. x-axis: Base, Rake and Rake64 bars for Bluetooth LE, IEEE 802.15.6, Body Channel, IR-UWB, ReRAM, FG Flash, STT-MRAM and CBRAM, with the percentage savings printed above the Rake and Rake64 bars.]

Fig. 8: Total Energy for Compression and Transmission or Storage considering the different technologies presented in Table III.
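The trade-off discussed above, where the scheme with the smallest output wins while the per-bit cost dominates and the scheme with the cheapest compression wins once the two stages become comparable, can be sketched in Python as follows. The per-window compression energies and output sizes in `SCHEMES` are hypothetical numbers chosen only to reproduce the qualitative behaviour; they are not values from the paper, and all names are illustrative.

```python
# ASSUMPTION: hypothetical per-window figures, NOT measurements from the paper.
SCHEMES = {
    # name: (compression energy in nJ, output size in bits)
    "Base":   (400.0, 1200),
    "Rake":   (300.0, 680),
    "Rake64": (100.0, 890),
}


def total_energy_nj(e_cs_nj, n_bits, e_bit_nj):
    """Compression energy plus transmission/storage energy for one window."""
    return e_cs_nj + n_bits * e_bit_nj


def savings_vs_base(e_bit_nj):
    """Percentage energy saving of each scheme relative to Base."""
    totals = {
        name: total_energy_nj(e_cs, bits, e_bit_nj)
        for name, (e_cs, bits) in SCHEMES.items()
    }
    base = totals["Base"]
    return {name: 100.0 * (1.0 - t / base) for name, t in totals.items()}


def best_scheme(e_bit_nj):
    """Name of the scheme with the largest saving for a given per-bit energy."""
    savings = savings_vs_base(e_bit_nj)
    return max(savings, key=savings.get)


# Per-bit energies of the two extremes in Table III (nJ/bit).
for e_bit in (5.0, 0.001):
    rounded = {k: round(v, 2) for k, v in savings_vs_base(e_bit).items()}
    print(f"E_bit = {e_bit} nJ/bit -> best: {best_scheme(e_bit)}, savings: {rounded}")
```

With these illustrative inputs, the scheme with the fewest output bits is preferred at 5 nJ/bit, while the scheme with the lowest compression energy is preferred at 0.001 nJ/bit, which mirrors the qualitative behaviour reported for the CBRAM case.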

[Figure 9: stacked bar chart. y-axis: Total Energy [J], linear scale from 0.0E+00 to 2.5E-04, split into $E_{CS}$ and $E_{TX,ST}$. x-axis: BASE, RAKE and RAKE64 bars for CBRAM.]

Fig. 9: CBRAM case as a technological perspective.

V. CONCLUSIONS

To achieve minimal energy consumption in low-cost WBSN-based biosignal monitors, both architectural and signal processing aspects must be considered. The Rakeness approach for CS enables trading off the computation workload with the number of measurements for later transmission or storage. In this paper we evaluated such tradeoffs considering a multi-core DSP and different technologies for storage or transmission. Rakeness-based CS proved to be more energy efficient (on average ≈ 44%) than the classic CS approach, and such trend will be consolidated with forthcoming low-power transmission or storage technologies.

ACKNOWLEDGMENTS

This work was supported by the FP7 project PHIDIAS (g.a. 318013) and the ICYSoC RTD project (no. 20NA21 150939), evaluated by the Swiss NSF and funded by Nano-Tera.ch with Swiss Confederation financing. The authors would like to thank David Bellasi ([email protected]), ETH Zurich, for the useful information on the transceiver power model.

REFERENCES

[1] World Health Organization [Online] http://www.who.int/mediacentre/factsheets/fs317.
[2] Munir A. et al., “Multi-core Embedded Wireless Sensor Networks: Architecture and Applications”, (2013): 1–1.
[3] Mamaghanian H. et al., “Compressed sensing for real-time energy-efficient ECG compression on wireless body sensor nodes”, In: IEEE Transactions on Biomedical Engineering, vol. 58, no. 9, pp. 2456–2466, 2011.
[4] Dogan A. Y. et al., “Multi-core architecture design for ultra-low-power wearable health monitoring systems”, In: Proceedings of the ACM/IEEE DATE, 2012.
[5] Dreslinski R. G. et al., “An energy efficient parallel architecture using near threshold operation”, In: Proceedings of PACT, 2007.
[6] Bortolotti D. et al., “Hybrid memory architecture for voltage scaling in ultra-low power multi-core biomedical processors”, In: Proceedings of the ACM/IEEE DATE, 2014.
[7] Braojos R. et al., “Hardware/software approach for code synchronization in low-power multi-core sensor nodes”, In: Proceedings of the ACM/IEEE DATE, 2014.
[8] Rahimi A. et al., “A Fully-Synthesizable Single-Cycle Interconnection Network for Shared-L1 Processor Clusters”, In: Proceedings of the ACM/IEEE DATE, 2011.
[9] Bortolotti D. et al., “VirtualSoC: a Full-System Simulation Environment for Massively Parallel Heterogeneous System-on-Chip”, In: Proceedings of IPDPWS, 2013.
[10] Gautschi M. et al., “Customizing an Open Source Processor to Fit in an Ultra-Low Power Cluster with a Shared L1 Memory”, In: Proceedings of GLSVLSI, 2014.
[11] Candes E. J. et al., “Stable signal recovery from incomplete and inaccurate measurements”, In: Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, Aug. 2006.
[12] Donoho D. L., “Compressed Sensing”, In: IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, Apr. 2006.
[13] Haboba J. et al., “A pragmatic Look at Some Compressive Sensing Architectures with Saturation and Quantization”, In: IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 2, no. 3, pp. 443–459, 2012.
[14] Candes E. J., “The restricted isometry property and its implications for compressed sensing”, In: Comptes Rendus Mathematique, vol. 346, no. 9, pp. 589–592, 2008.
[15] Cambareri V. et al., “A rakeness-based design flow for Analog-to-Information conversion by Compressive Sensing”, In: IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1360–1363, May 2013.
[16] Mangia M. et al., “Rakeness in the design of analog-to-information conversion of sparse and localized signals”, In: IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 59, no. 5, pp. 1001–1014, May 2012.
[17] Mangia M. et al., “Analog-to-Information Conversion of Sparse and non-White Signals: Statistical Design of Sensing Waveforms”, In: IEEE International Symposium on Circuits and Systems (ISCAS), pp. 2129–2132, May 2011.
[18] Mangia M. et al., “Rakeness-based approach to compressed sensing of ecgs”, In: IEEE Biomedical Circuits and Systems Conference (BioCAS), pp. 424–427, Nov. 2011.
[19] McSharry P. E. et al., “A dynamical model for generating synthetic electrocardiogram signals”, In: IEEE Transactions on Biomedical Engineering, vol. 50, no. 3, pp. 289–294, 2003.
[20] Mallat S., “A wavelet tour of signal processing”, Access Online via Elsevier, 1999.
[21] Zigel Y. et al., “The weighted diagnostic distortion (wdd) measure for ecg signal compression”, In: IEEE Transactions on Biomedical Engineering, vol. 47, no. 11, pp. 1422–1430, Nov. 2000.
[22] Rovatti R. et al., “Memory-antipodal processes: Spectral analysis and synthesis”, In: IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 56, no. 1, pp. 156–167, Jan. 2009.
[23] Liu Y. et al., “A 1.9 nJ/b 2.4 GHz multistandard (Bluetooth Low Energy/Zigbee/IEEE802.15.6) transceiver for personal/body-area networks”, In: Proceedings of ISSCC, 2013.
[24] Vidojkovic M. et al., “9.7 A 0.33 nJ/b IEEE802.15.6/proprietary-MICS/ISM-band transceiver with scalable data-rate from 11kb/s to 4.5 Mb/s for medical applications”, In: Proceedings of ISSCC, 2014.
[25] Bae J. et al., “A 0.24-nJ/b wireless body-area-network transceiver with scalable double-FSK Modulation”, In: IEEE Journal of Solid-State Circuits, vol. 47, no. 1, pp. 310–322, 2012.
[26] Kulkarni V. V. et al., “A 750 Mb/s, 12 pJ/b, 6-to-10 GHz CMOS IR-UWB transmitter with embedded on-chip antenna”, In: IEEE Journal of Solid-State Circuits, vol. 44, no. 2, pp. 394–403, 2009.
[27] Chang M. et al., “A 0.5 V 4Mb logic-process compatible embedded resistive RAM (ReRAM) in 65nm CMOS using low-voltage current-mode sensing scheme with 45ns random read time”, In: Proceedings of ISSCC, 2012.
[28] Halupka D. et al., “Negative-resistance read and write schemes for STT-MRAM in 0.13µm CMOS”, In: Proceedings of ISSCC, 2010.
[29] Shum D. et al., “Highly Reliable Flash Memory with Self-Aligned Split-Gate Cell Embedded into High Performance 65nm CMOS for Automotive & Smartcard Applications”, In: Proceedings of IMW, 2012.
[30] Gilbert N. et al., “A 0.6 V 8 pJ/write Non-Volatile CBRAM Macro Embedded in a Body Sensor Node for Ultra Low Energy Applications”, In: Proceedings of VLSIC, 2013.