Massively Parallel Network Coding on GPUs

Xiaowen Chu, Kaiyong Zhao
Department of Computer Science, Hong Kong Baptist University, Hong Kong, P.R.C.
[email protected], [email protected]

Mea Wang
Department of Computer Science, University of Calgary, Alberta, Canada
[email protected]

Abstract

Network coding has recently been widely applied in various networks for system throughput improvement and/or resilience to network dynamics. However, the computational overhead introduced by the network coding operations is not negligible and has become an obstacle to the real deployment of network coding. In this paper, we exploit the computing power of contemporary Graphics Processing Units (GPUs) to accelerate the network coding operations. We propose three parallel algorithms that maximize the parallelism of the encoding and decoding processes, so that the power of GPUs is fully utilized. This paper also shares our optimization design choices and our workarounds to the challenges encountered in working with GPUs. With our implementation of the algorithms, we are able to achieve up to 12 times speedup over the highly optimized CPU counterpart, using an NVIDIA GPU and the Compute Unified Device Architecture (CUDA) programming model.

Keywords

Network coding, GPU computing, CUDA

1. Introduction

Recent advances in Graphics Processing Units (GPUs) have opened a new era of GPU computing [18]. A commodity GPU like NVIDIA's GTX 280 has 240 processing cores and can achieve 933 GFLOPS of computational horsepower. Traditionally, GPUs have mainly been used for graphical applications. The release of the NVIDIA CUDA programming model makes it easier to develop non-graphical applications on GPUs [1] [3]. CUDA treats the GPU as a dedicated coprocessor of the host CPU, and allows the same code to run simultaneously on different GPU cores as threads.

Network coding was originally proposed to improve throughput in a multicast session. Since the landmark paper on randomized network coding by Ho et al. [21], there has been a gradual shift in research focus from theoretical studies to more practical ones. The studies in [10] [11] [12] have shown that the computational complexity of network coding is the main obstacle to practical network coding applications, especially for large data sets. In this paper, we seek to take advantage of GPU computing power to mitigate the computational overhead, so that network coding can be more practical. The contributions of this paper are as follows:

• A parallel algorithm that maximizes the parallelism of the encoding process on the GPU.
• A combined algorithm that utilizes both the multi-core CPU and the GPU to speed up the decoding process.
• A full implementation of the proposed algorithms.
• Experimental results showing a speedup of 4 to 12 in terms of encoding and decoding throughput, in comparison with the current state-of-the-art network coding accelerator [11].

The implementation of network coding operations on a GPU is not as simple as it seems; there are quite a number of challenges to address and optimization decisions to make. To name a few: first, we wish to utilize every core available on the GPU; second, we need to minimize the overhead introduced by memory access; third, we must design the implementation to fit the special GPU architecture; last but not least, we have to tailor the memory usage to avoid memory access conflicts on the GPU. We will present most of these challenges, together with our workarounds, throughout this paper.

The rest of the paper is organized as follows. Sec. 2 provides background information on network coding, the GPU architecture, and the CUDA programming model. Sec. 3 presents the design of massively parallel network coding. Experimental results are presented in Sec. 4, and we conclude the paper in Sec. 5.

2. Background and Related Work

This section presents the basic concepts of network coding, the GPU architecture, and the CUDA programming model.

2.1 Network Coding

Network coding was originally proposed in information theory [5] to achieve the optimal throughput in a multicast session. Since then, it has been applied in various communication networks for better throughput and robustness to network dynamics. The essence of network coding is a paradigm shift that allows coding at intermediate nodes between the source and the receivers in one or multiple communication sessions. It has been shown that random linear codes over Galois fields are sufficient to implement network coding in a practical network setting [7]. Linear network coding regards the messages as vectors of elements in a finite field, and the encoding function is a simple linear combination over the finite field.

In our implementation, we define the linear network coding operations as follows. The data to be distributed is divided into n original blocks (b_1, b_2, ..., b_n), where each block b_i consists of m codewords (b_{i,1}, b_{i,2}, ..., b_{i,m}). An encoded block e_i, where i can be any positive integer, is a linear combination of the n original blocks. Each of the m codewords in e_i is calculated as:

e_{i,k} = \sum_{j=1}^{n} c_{i,j} \cdot b_{j,k}, \quad k \in \{1, \ldots, m\},   (1)

where (c_{i,1}, c_{i,2}, ..., c_{i,n}) is the vector of coding coefficients. We can rewrite Eqn. 1 in matrix form as E = C × B, where E = {e_{i,k}}, C = {c_{i,j}}, and B = {b_{j,k}}.

Upon receiving n encoded blocks (e_1, e_2, ..., e_n) whose coding coefficient vectors (c_1, c_2, ..., c_n) are all linearly independent of each other, the receiver can recover the original blocks as follows:

b_{i,k} = \sum_{j=1}^{n} c'_{i,j} \cdot e_{j,k}, \quad k \in \{1, \ldots, m\}, \ i \in \{1, \ldots, n\}.   (2)

In other words, we have B = C^{-1} × E, where C^{-1} = {c'_{i,j}} is the inverse of the coding matrix C.

The key to the practical implementation of network coding is generating the coding coefficients to be used by each of the intermediate nodes in the session, so that the coded blocks at the receivers are guaranteed to be decodable. Deterministic algorithms have been proposed and shown to be computable in polynomial time [4], but they require extensive exchanges of control messages among participating nodes. Ho et al. [7] proposed the concept of randomized network coding, in which the coding coefficients are generated independently and randomly at each node.

Network coding operations are performed in a Galois field, which preserves the size of the original data, i.e., no additional bandwidth is required [6] [8]. The actual computational overhead incurred by network coding operations in GF(2^r) has been extensively studied in [10] [11] [12]. In this paper, we work in GF(2^8), which implies that each codeword consists of one byte of data, and the coding coefficients are integers between 0 and 255. Nonetheless, our design and implementation can be easily extended to any field size. In general, network coding is computationally expensive for n larger than 100 on conventional CPUs. A network-coding accelerator was proposed in [11] to exploit the SSE2 instructions on x86 and the AltiVec SIMD vector instructions on PowerPC processors. It has been shown that the performance of network coding can be boosted by several times by utilizing symmetric multiprocessor (SMP) systems. For instance, when n = 128 and m = 1024 bytes, a quad-processor PowerPC G5 server can achieve around 20 MB/s of throughput. However, as n increases, the encoding and decoding throughput drop quickly. In this paper, we aim to further improve the scalability of network coding operations by using GPUs.
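Before moving to the GPU specifics, a plain C reference of Eqn. 1 may help fix the arithmetic. This is an illustrative sketch rather than the paper's code; in particular, the reduction polynomial (0x11D) is an assumption, since the paper only states that it operates in GF(2^8).

#include <stdint.h>
#include <stddef.h>

/* Bitwise multiplication in GF(2^8). The reduction polynomial
 * x^8 + x^4 + x^3 + x^2 + 1 (0x11D) is an assumption; any
 * irreducible degree-8 polynomial defines a valid GF(2^8). */
static uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;                 /* addition in GF(2^8) is XOR */
        b >>= 1;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1D : 0x00));
    }
    return p;
}

/* Eqn. 1: e[k] = sum over j of c[j] * B[j][k], for k = 0..m-1.
 * B holds the n original blocks row by row, m codewords per block. */
void encode_block(const uint8_t *c, const uint8_t *B,
                  uint8_t *e, size_t n, size_t m)
{
    for (size_t k = 0; k < m; k++) {
        uint8_t sum = 0;
        for (size_t j = 0; j < n; j++)
            sum ^= gf256_mul(c[j], B[j * m + k]);
        e[k] = sum;
    }
}

Decoding (Eqn. 2) has exactly the same structure, with the rows of C^{-1} taking the place of the coefficient vector and the encoded blocks taking the place of B.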

2.2 GPU Computing

GPUs are dedicated hardware for manipulating computer graphics. Due to the huge demand for powerful computing for real-time, high-definition 3D graphics, GPUs have evolved into highly parallel, multithreaded, multi-core processors. The NVIDIA GeForce GTX260 has 24 Streaming Multiprocessors (SMs), and each SM has 8 Scalar Processors (SPs), resulting in a total of 192 processor cores. The design of the SMs is based on the Single-Instruction Multiple-Data (SIMD) architecture: at any given clock cycle, all SPs of the same SM must execute the same instruction, but they can operate on different data. Each SP can perform 32-bit single-precision floating-point arithmetic as well as 32-bit integer arithmetic.

In a GPU, each SM has four different types of on-chip memory, namely constant cache, texture cache, registers, and shared memory; their properties are summarized in [16]. Constant cache and texture cache are both read-only memories shared by all SPs. On the GeForce GTX260, each SM has 16384 32-bit registers and 16KB of shared memory that is almost as fast as the registers. In general, the shared memory should be carefully utilized to amortize the global memory latency cost. Shared memory is divided into equally sized banks that can be accessed simultaneously. The banks are organized such that successive 32-bit words belong to consecutive banks. If two memory requests fall into the same bank, a bank conflict occurs and the memory accesses must be serialized. For optimal memory access performance, one should minimize the chance of bank conflicts and also utilize on-chip memory as much as possible, since off-chip memories such as local memory and global memory have relatively long access times, usually 400 to 600 clock cycles [3].
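As a concrete illustration of the bank layout just described (a toy example of ours, not taken from the paper), the kernel below contrasts a conflict-free access pattern with one that serializes; the 16-bank organization assumed here matches GT200-class hardware.

// Launch with a 16 x 16 thread block, e.g., bank_conflict_demo<<<1, dim3(16, 16)>>>(d_out);
__global__ void bank_conflict_demo(int *out)
{
    // 16 x 16 tile of 32-bit words; successive words fall into
    // successive banks (16 banks on GT200-class GPUs).
    __shared__ int tile[16][16];

    int tx = threadIdx.x;   // 0..15, varies within a half-warp
    int ty = threadIdx.y;   // 0..15

    tile[ty][tx] = tx;
    __syncthreads();

    // Conflict-free: a half-warp reads 16 consecutive words,
    // which land in 16 different banks.
    int fast = tile[ty][tx];

    // 16-way bank conflict: the same half-warp reads words that are
    // 16 words apart, so all 16 accesses hit one bank and serialize.
    int slow = tile[tx][ty];

    out[ty * 16 + tx] = fast + slow;
}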

2.3 CUDA Programming Model

The exceptional GPU computing power is very attractive for general-purpose system development, an approach referred to as general-purpose computing on GPUs (GPGPU). The first generation of GPGPU required non-graphics applications to be mapped through the graphics application programming interfaces; interested readers can refer to [14] for an overview of GPGPU. Recently, one of the major GPU vendors, NVIDIA, announced its general-purpose parallel programming model, Compute Unified Device Architecture (CUDA) [1] [3], which extends the C programming language for general-purpose application development. Meanwhile, another GPU vendor, AMD, introduced the Close To Metal (CTM) programming model that provides an assembly language for application development [2]. Intel will release Larrabee [20], a new multi-core GPU architecture specially designed for GPU computing.

Currently, CUDA is the best available programming model and the most widely accepted by the research and development community. Since its release, CUDA has been used to speed up a large number of applications [15] [17] [18] [19]. Ryoo et al. give a comprehensive introduction to CUDA in [16]. For these reasons, we chose to use CUDA in our research.

In the CUDA model, the GPU is regarded as a coprocessor capable of executing a great number of threads in parallel. A single source program consists of the host code to be executed on the CPU and the kernel code to be executed on the GPU. The kernel code is usually computation-intensive and data-parallel, and it is executed on the GPU in the Single-Program Multiple-Data (SPMD) fashion; intuitively, the kernel code is naturally multi-threaded. In CUDA, threads are organized into thread blocks, where each block is associated with one SM. A thread block can have at most 512 threads, and threads belonging to the same thread block can share data through the shared memory and can perform barrier synchronization. When a thread block terminates, a new block can be launched on the vacant SM.

Another important concept in CUDA is the warp, which is formed by 32 parallel threads and is the basic scheduling unit on each SM. When a warp stalls, the SM can schedule another warp for execution. An SM is only fully utilized if all 32 threads in a warp follow the same execution path. If the threads in a warp take different execution paths due to conditional branching, the instructions are serialized, resulting in longer processing time. If the number of threads in a block is not a multiple of the warp size, portions of the instruction cycles are wasted. For these reasons, we must be careful in organizing and managing threads in our design.
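To make the host/kernel split and the thread-block terminology concrete, here is a minimal CUDA example; it is illustrative only, and the kernel name, array size, and block size are arbitrary choices rather than anything used in the paper.

#include <cuda_runtime.h>

// Kernel code: runs on the GPU, one thread per array element.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] *= factor;
}

// Host code: runs on the CPU and configures the grid of thread blocks.
int main(void)
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    int threadsPerBlock = 256;                          // a multiple of the warp size
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}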

3. Massively Parallel Network Coding

Our objective is to conduct large-scale network coding in a parallel fashion on GPUs. This section discusses the design, the challenges encountered, and our workarounds. As described in Sec. 2.1, there are two types of network coding operations, encoding and decoding, which are discussed in Sec. 3.1 and Sec. 3.2, respectively.

3.1 Encoding

The inputs of the encoding process are the matrix B (n original blocks consisting of m codewords each) and the matrix C (a series of 1 × n coding coefficient vectors). The output of the process is E, a series of encoded blocks, each consisting of m codewords. Hence, we first generate C and then perform the encoding operation, E = C × B, to produce the encoded blocks. The random elements in C can be easily generated by the random number generator available in the CUDA library, which uses the Mersenne Twister method and can generate tens of millions of random numbers per second. As the time required to generate C is negligible, we focus on the optimization of the actual encoding operation.

In the following sections, we first discuss how to achieve block-by-block encoding on GPUs. We then accelerate the process by introducing batched encoding, which encodes several blocks at a time.

3.1.1 Encoding a single block

When computing an encoded block, the n original blocks are first transferred from the host (CPU) memory to the GPU global memory. After the encoding process, the encoded block is transferred back to the host memory. Nonetheless, the use of the GPU global memory is expensive in terms of access time (hundreds of GPU cycles), which could become the bottleneck of the entire process. In contrast, the shared memory has much faster access time but much smaller capacity. A coefficient vector can obviously be stored in the shared memory, given its relatively small size (normally hundreds of bytes).


However, the data blocks are too large to fit into the shared memory. For this reason, we divide the n × m data matrix B into BLOCKs (B'_{1,1}, B'_{1,2}, ..., B'_{2,1}, B'_{2,2}, ..., B'_{s,t}), where s = n/k and t = m/k, such that each BLOCK is a smaller k × k matrix that can be stored in the shared memory. Correspondingly, the coefficient vector c is divided into sub-vectors of size 1 × k, (c'_1, c'_2, ..., c'_s). With this design, we can exploit the parallel nature of the GPU. We divide the C × B operation into several smaller tasks, where each task computes a partial encoded block e'_{j,l} = c'_j × B'_{j,l}, with 1 ≤ j ≤ s and 1 ≤ l ≤ t. The actual encoded block is obtained by concatenating \sum_{j=1}^{s} e'_{j,l} for all l. The concept of this algorithm is illustrated in Fig. 1.

Figure 1. Encoding a single data block

Naturally, we can assign each task, i.e., the computation of one BLOCK, to one thread block. The conventional implementation of the encoding process produces one codeword at a time following Eqn. 1. Since the access time of a 1-byte char is the same as that of a 4-byte int in CUDA, we choose to use int to represent our data for faster table lookup and data fetching. In other words, each thread in the thread block computes 4 codewords that are stored in one integer. Without loss of generality, we assume that n and m are multiples of k, and k is a multiple of 4. The size of a BLOCK is therefore limited by the number of threads supported by a thread block, which is 512 in the current CUDA model. The pseudo code of the kernel function is shown in Fig. 2.

Algorithm 1: Encoding a single block
1.  sum = 0; ty = threadIdx.y;
2.  for (j = 0; j < s; j++) {
3.      __shared__ uchar Bs[k][k];
4.      __shared__ uchar Cs[k];
5.      load the corresponding elements to Bs and Cs;
6.      __syncthreads();
7.      for (i = 0; i < k; i++)
8.          sum = XOR( sum, Mul(Cs[i], Bs[i][ty]) );
9.      __syncthreads();
10. }
11. output sum;

Figure 2. Pseudo kernel code for encoding a single data block

Line 8 in Fig. 2 invokes addition and multiplication in GF(2^8). Addition in GF(2^8) is the same as the XOR operation, whilst multiplication in GF(2^8) is more complicated. One common method is to look up logarithmic and exponential tables, as shown in Fig. 3. This algorithm requires three memory reads and one addition operation. The logarithmic and exponential tables are both 256 bytes and can be stored in the shared memory. The drawback of this method is that it leaves less shared memory for storing the data, i.e., a smaller k value. To alleviate this problem, we propose to store the two tables in the texture memory, whose on-chip cache helps to hide part of the memory latency.

Algorithm 2: Multiplication in GF(2^8)
Input: x, y, log[ ], exp[ ]
Output: x·y
1. int Mul(uchar x, uchar y)
2. {
3.     if (x == 0 || y == 0)
4.         return 0;
5.     temp1 = log[x];
6.     temp2 = log[y];
7.     return exp[temp1 + temp2];
8. }

Figure 3. Multiplication in GF(2^8) based on logarithmic and exponential tables
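The paper does not show how the two tables of Fig. 3 are generated; one standard construction is sketched below. The generator (0x03) and the polynomial 0x11B are assumptions of ours, and the exponential table is doubled to 512 entries so that exp[log x + log y] needs no modular reduction (with a 256-entry table, the index would be taken modulo 255).

#include <stdint.h>

static uint8_t gf_exp[512];   /* doubled: log x + log y can reach 508 */
static uint8_t gf_log[256];

/* Build the tables by repeated multiplication with the generator 0x03
 * under the polynomial x^8 + x^4 + x^3 + x + 1 (0x11B). */
void gf256_init_tables(void)
{
    uint8_t x = 1;
    for (int i = 0; i < 255; i++) {
        gf_exp[i] = x;
        gf_log[x] = (uint8_t)i;
        uint8_t x2 = (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1B : 0x00)); /* x * 2 */
        x ^= x2;                                                       /* x * 3 */
    }
    for (int i = 255; i < 512; i++)
        gf_exp[i] = gf_exp[i - 255];
    gf_log[0] = 0;   /* log(0) is undefined; Mul() screens out zero operands */
}

/* Table-based multiplication, mirroring Fig. 3. */
uint8_t gf256_mul_table(uint8_t a, uint8_t b)
{
    if (a == 0 || b == 0) return 0;
    return gf_exp[gf_log[a] + gf_log[b]];
}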

Based on Eqn. 1 and line 8 in Fig. 2, encoding a single codeword requires n multiplications and n−1 additions. Each multiplication requires 3 table lookups; thus, 3n texture memory accesses are performed for each codeword encoded, and the GPU memory throughput has a significant influence on the coding throughput. Overall, it takes O(mn) time to encode a block with m codewords from n original blocks. Next, we further explore ways to optimize the coding throughput.
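For concreteness, one possible CUDA realization of Algorithm 1 is sketched below. It is a simplification, not the authors' kernel: it uses one thread per codeword instead of packing four codewords into an int, keeps the log/exp tables in global memory rather than texture memory, and assumes n and m are multiples of the tile width K.

#include <stdint.h>

#define K 64   // tile width; launch with gridDim.x = m / K and blockDim.x = K

__device__ uint8_t gf_mul_dev(uint8_t a, uint8_t b,
                              const uint8_t *log_t, const uint8_t *exp_t)
{
    if (a == 0 || b == 0) return 0;
    return exp_t[log_t[a] + log_t[b]];   // exp_t holds 512 entries
}

__global__ void encode_single_block(const uint8_t *B,      // n x m, row-major
                                    const uint8_t *c,      // n coefficients
                                    uint8_t *e,            // m output codewords
                                    int n, int m,
                                    const uint8_t *log_t,
                                    const uint8_t *exp_t)
{
    __shared__ uint8_t Bs[K][K];
    __shared__ uint8_t Cs[K];

    int ty  = threadIdx.x;             // column inside the current BLOCK
    int col = blockIdx.x * K + ty;     // global codeword index
    uint8_t sum = 0;

    for (int j = 0; j < n / K; j++) {  // walk down the row strips of B
        // Cooperative load of one K x K BLOCK and K coefficients;
        // consecutive threads touch consecutive addresses of B.
        for (int row = 0; row < K; row++)
            Bs[row][ty] = B[(j * K + row) * m + col];
        Cs[ty] = c[j * K + ty];
        __syncthreads();

        for (int i = 0; i < K; i++)    // line 8 of Fig. 2
            sum ^= gf_mul_dev(Cs[i], Bs[i][ty], log_t, exp_t);
        __syncthreads();
    }
    e[col] = sum;
}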

3.1.2 Batched Encoding

So far, to encode a single block, we have m/k thread blocks, each with k/4 threads, and each thread is assigned to compute 4 codewords in the encoded block. Today's GPUs require thousands of threads in order to fully exploit their computing power and memory bandwidth. This section demonstrates how to take full advantage of the powerful GPU processor by computing the encoded blocks in batches.


Let p be the batch size, i.e., the number of encoded blocks to be produced; the p coefficient vectors then form a p × n coefficient matrix. We employ tiled matrix multiplication, a well-known optimization technique [3] [16], to bring the parallelism of the GPU to the next level. The encoded data matrix is divided into a set of square sub-matrices of size k × k, (E'_{1,1}, E'_{1,2}, ..., E'_{2,1}, E'_{2,2}, ..., E'_{s,t}), where s = p/k and t = m/k. Without loss of generality, we assume that n, m, and p are all multiples of k. Correspondingly, the coefficient matrix is divided into sub-matrices of size k × n, (C'_1, C'_2, ..., C'_s), and the original data matrix is also divided into sub-matrices of size n × k, (B'_1, B'_2, ..., B'_t). The same divide-and-conquer algorithm as in Sec. 3.1.1 is applied here: each sub-matrix of E is computed as E'_{i,j} = C'_i × B'_j, where 1 ≤ i ≤ s and 1 ≤ j ≤ t. The concept of this algorithm is illustrated in Fig. 4.

Figure 4. Batched encoding (B: original data matrix; C: coefficient matrix; E: encoded data matrix; k: block size)

To utilize the parallel nature of the GPU, we let each thread block be responsible for the calculation of one sub-matrix E'_{i,j}. Since there might not be enough shared memory to hold the two rectangular sub-matrices C'_i and B'_j, the computation of E'_{i,j} is done through another level of divide-and-conquer: both C'_i and B'_j are further divided into k × k sub-matrices, (C'_{i,1}, C'_{i,2}, ..., C'_{i,r}) and (B'_{j,1}, B'_{j,2}, ..., B'_{j,r}), respectively, where r = n/k. In each round of the calculation, we load two k × k sub-matrices into the shared memory and compute D_l = C'_{i,l} × B'_{j,l}, where 1 ≤ l ≤ r. After r rounds, we obtain E'_{i,j} = \sum_{l=1}^{r} D_l. The value of k is carefully selected such that the two k × k sub-matrices can be loaded into the shared memory. Another consideration is that k should be a multiple of 16 so that the number of threads per thread block is a multiple of the warp size, 32. In a thread block, each thread is responsible for computing 4 matrix elements in a row, given that an integer is used to store 4 codewords. The pseudo kernel code for batched encoding is shown in Fig. 5.

Algorithm 3: Batched encoding
1.  sum = 0;
2.  tx = threadIdx.x;
3.  ty = threadIdx.y;
4.  for (j = 0; j < r; j++) {
5.      __shared__ uchar Cs[k][k];
6.      __shared__ uchar Bs[k][k];
7.      load elements to Cs and Bs;
8.      __syncthreads();
9.      for (i = 0; i < k; i++)
10.         sum = XOR( sum, Mul(Cs[ty][i], Bs[i][tx]) );
11.     __syncthreads();
12. }
13. output sum;

Figure 5. Pseudo kernel code for batched encoding

With batched encoding, we increase the number of thread blocks to pm/k² and the number of threads in each block to k²/4. Again, each thread computes 4 codewords at a time. We can now take full advantage of the parallelism of the GPU.
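The launch geometry implied by these counts might look as follows; the kernel name and signature are placeholders of ours, not the authors' code, and k = 16 is just one admissible choice.

#include <cuda_runtime.h>
#include <stdint.h>

// Hypothetical batched-encoding kernel using the k x k tiling of Fig. 5.
__global__ void batched_encode(const uint8_t *C, const uint8_t *B,
                               uint8_t *E, int n, int m, int p);

// pm / k^2 thread blocks, k^2 / 4 threads per block.
void launch_batched_encode(const uint8_t *d_C, const uint8_t *d_B,
                           uint8_t *d_E, int n, int m, int p)
{
    const int k = 16;              // multiple of 16, so k*k/4 is a multiple of 32
    dim3 grid(m / k, p / k);       // one thread block per k x k tile of E
    dim3 block(k / 4, k);          // each thread produces 4 codewords of one row

    batched_encode<<<grid, block>>>(d_C, d_B, d_E, n, m, p);
    cudaDeviceSynchronize();
}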

3.2 Decoding

As mentioned in Sec. 2, the decoding process solves a set of linear equations after receiving n linearly independent encoded blocks (e_1, e_2, ..., e_n). These n coded blocks form an n × m matrix E, in which each row corresponds to one encoded block. The n coefficient vectors form an n × n coefficient matrix C, in which each row corresponds to the coefficients of one encoded block. The decoding process recovers the original blocks; in other words, we have B = C^{-1} × E = (b_1, b_2, ..., b_n), where C^{-1} is the inverse of the coefficient matrix C. Let us first assume that C^{-1} is given; the decoding process is then just a matrix multiplication, which can be implemented in the same way as the batched encoding process (Sec. 3.1.2) by replacing p with n. The concept is the same as illustrated in Fig. 4, except that we replace C with C^{-1}, B with E, and E with B.

Now, we switch our focus to the computation of the matrix inversion. Although parallel algorithms for matrix inversion exist, they are not easy to implement on the CUDA platform, mainly because CUDA does not provide any direct synchronization method between threads that belong to different thread blocks.


There are two approaches to address this issue: (1) use a single thread block, which means that only one SM is involved in the matrix inversion; or (2) use multiple thread blocks and rely on the CPU to synchronize them. The GPU power is not fully utilized in the first approach, while the kernel function calls and memory copies in the second approach might introduce excessive overhead. In our design, we took neither approach, since neither poses an obvious advantage over the other.

Instead, we utilize the multiple cores available in the CPU. We implemented a multi-threaded algorithm that utilizes all cores to compute the inverse of the matrix. As shown in our experimental results, for moderate values of n it is acceptable to use the CPU to perform the matrix inversion and then use the GPU to perform the data decoding; when m is large, the time for matrix inversion is amortized. For most random network coding applications, n = 256 is sufficient [9] [12] [13]. For large values of n (e.g., n ≥ 1024), matrix inversion becomes very time consuming because the time complexity of a practical matrix inversion algorithm is normally O(n³). It would be worthwhile to exploit the power of the GPU to speed up the matrix inversion for large n; we leave this to our future research.
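The paper does not list its CPU inversion routine. For reference, a Gauss-Jordan elimination over GF(2^8) might look like the sketch below; it is an assumption of ours, not the authors' code. The bitwise field helpers keep the sketch self-contained (a table-based version would be faster), and the row-update loop, whose iterations are independent, is annotated with OpenMP to suggest how the work can be spread over the CPU cores as in the multi-threaded design.

#include <stdint.h>
#include <string.h>

/* Bitwise GF(2^8) multiply (polynomial 0x11D, as in the earlier sketch). */
static uint8_t gmul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        b >>= 1;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1D : 0x00));
    }
    return p;
}

/* Multiplicative inverse: x^254, since x^255 = 1 for every nonzero x. */
static uint8_t ginv(uint8_t x)
{
    uint8_t r = 1;
    for (int i = 0; i < 254; i++) r = gmul(r, x);
    return r;
}

/* Gauss-Jordan inversion of an n x n matrix over GF(2^8).
 * A is reduced in place; Inv receives A^-1. Returns 0 if A is singular. */
int gf256_invert(uint8_t *A, uint8_t *Inv, int n)
{
    memset(Inv, 0, (size_t)n * n);
    for (int i = 0; i < n; i++) Inv[i * n + i] = 1;

    for (int col = 0; col < n; col++) {
        int pivot = -1;                                /* find a nonzero pivot */
        for (int r = col; r < n; r++)
            if (A[r * n + col]) { pivot = r; break; }
        if (pivot < 0) return 0;

        for (int c = 0; c < n && pivot != col; c++) {  /* swap pivot row up */
            uint8_t t;
            t = A[col * n + c];   A[col * n + c]   = A[pivot * n + c];   A[pivot * n + c]   = t;
            t = Inv[col * n + c]; Inv[col * n + c] = Inv[pivot * n + c]; Inv[pivot * n + c] = t;
        }

        uint8_t s = ginv(A[col * n + col]);            /* scale pivot row to 1 */
        for (int c = 0; c < n; c++) {
            A[col * n + c]   = gmul(A[col * n + c],   s);
            Inv[col * n + c] = gmul(Inv[col * n + c], s);
        }

        /* Eliminate the column from every other row; rows are independent,
         * so this loop can be split across CPU threads. */
        #pragma omp parallel for
        for (int r = 0; r < n; r++) {
            if (r == col || !A[r * n + col]) continue;
            uint8_t f = A[r * n + col];
            for (int c = 0; c < n; c++) {
                A[r * n + c]   ^= gmul(f, A[col * n + c]);
                Inv[r * n + c] ^= gmul(f, Inv[col * n + c]);
            }
        }
    }
    return 1;
}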

4. Experimental Results

We have implemented network coding in GF(2^8) using CUDA and tested it on an NVIDIA GeForce GTX260. The GTX260 uses the latest GT200 architecture and has 24 SMs, i.e., 192 processing cores running at 1.24GHz, with 896MB of onboard memory. The GPU is installed in a desktop computer equipped with a 2.4GHz Intel quad-core CPU (Q6600).

The following optimization techniques have been applied in our implementation to improve the coding throughput:

• Use a large number of thread blocks and a large number of threads in each thread block.
• Minimize bank conflicts to make effective use of the shared memory.
• Use texture or constant memory as much as possible, and exploit coalesced memory accesses.
• Use loop unrolling to reduce loop overhead (see the sketch below).
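As an illustration of the last two items (an example sketch of ours, not the paper's code), the fragment below keeps the GF(2^8) tables in constant memory, which is cached on chip, and unrolls a short fixed-trip-count inner loop.

#include <cuda_runtime.h>
#include <stdint.h>

// GF(2^8) lookup tables in constant memory (cached on chip); texture
// memory, as discussed in Sec. 3.1.1, is an alternative home for them.
__constant__ uint8_t c_log[256];
__constant__ uint8_t c_exp[512];

__device__ __forceinline__ uint8_t gf_mul_const(uint8_t x, uint8_t y)
{
    return (x && y) ? c_exp[c_log[x] + c_log[y]] : 0;
}

// Unrolling a short, fixed-trip-count loop removes branch and index
// bookkeeping from the innermost GF(2^8) accumulation.
#define TILE 16
__device__ uint8_t combine_tile(const uint8_t *coeff, const uint8_t *codew)
{
    uint8_t sum = 0;
    #pragma unroll
    for (int i = 0; i < TILE; i++)
        sum ^= gf_mul_const(coeff[i], codew[i]);
    return sum;
}

// Host side: upload the precomputed tables once, before any kernel launch.
void upload_gf_tables(const uint8_t *h_log, const uint8_t *h_exp)
{
    cudaMemcpyToSymbol(c_log, h_log, 256);
    cudaMemcpyToSymbol(c_exp, h_exp, 512);
}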


Despite all the detailed attention paid to memory access and thread management, we chose to focus on the measurement of the encoding and decoding throughputs, i.e., the number of bytes coded per second, as the block size and the number of blocks vary. We seek to identify the maximum performance gain on GPUs in comparison with CPUs. Furthermore, we studied the performance gain obtained by batched encoding. Finally, we examined the time taken by the GPU and the CPU to invert matrices of various sizes. All experiments were run a number of times and the average results are presented. For comparison purposes, we use the currently fastest multi-threaded network coding implementation on a quad-CPU Intel server [11] as the benchmark.

4.1 Encoding Performance

First, we studied the impact of the number of blocks, n, and the block size, m, on the encoding throughput. We varied the number of blocks from 128 to 512, and the block size from 1KB to 32KB. As shown in Fig. 6, on one hand, the throughput drops as n increases, since a larger n incurs more memory references, which verifies that memory bandwidth is more critical to the encoding process than computing power. On the other hand, the throughput grows as each block gets larger, because a larger block size yields more concurrent threads; a huge number of threads is needed to fully utilize the computing power of the GPU and to hide the global memory latency.

Figure 6. Encoding throughput (single block case); throughput (MB/s) versus block size (1KB to 32KB) for n = 128, 256, and 512

We observed from Fig. 6 that the encoding throughput is relatively low for smaller block sizes, due to the lack of parallelism on the GPU. The batched encoding mechanism is specially designed to fully exploit the parallel nature of the GPU. We varied the batch size from 4 to 16, and the results are shown in Fig. 7(a)-(c). It is obvious that batched encoding offers a significant gain in throughput, especially for block sizes smaller than 16KB. In general, the smaller the block size, the larger the batch size p that is needed; in our experiments, p = 16 is sufficient to achieve the best encoding performance. Fig. 7(c) also compares our results with the benchmark values: we are able to achieve a speedup of 6 to 12, depending on the block size and the number of blocks. In general, the combination of small n and large m offers the most speedup.

Figure 7. Batched encoding throughput: (a) p = 4, (b) p = 8, (c) p = 16
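For reference, throughput numbers of the kind reported in this section could be collected with CUDA event timers as sketched below; the kernel, its launch geometry, and the byte count are placeholders rather than the authors' measurement harness.

#include <cuda_runtime.h>
#include <stdint.h>

// Hypothetical batched-encoding kernel (see Sec. 3.1.2).
__global__ void batched_encode(const uint8_t *C, const uint8_t *B,
                               uint8_t *E, int n, int m, int p);

// Times `iters` launches and returns throughput in MB/s, counting the
// p * m encoded output bytes produced per launch.
float measure_encode_mbps(const uint8_t *d_C, const uint8_t *d_B, uint8_t *d_E,
                          int n, int m, int p, int iters)
{
    const int k = 16;
    dim3 grid(m / k, p / k), block(k / 4, k);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; i++)
        batched_encode<<<grid, block>>>(d_C, d_B, d_E, n, m, p);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    double bytes = (double)p * m * iters;
    return (float)(bytes / (ms / 1000.0) / (1024.0 * 1024.0));
}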

4.2 Decoding Performance

Next, we reused the settings in Sec. 4.1 and measured the decoding throughput. We observed that the number of blocks, n, and the block size, m, have a similar impact on the decoding throughput as they do on the encoding throughput, as shown in Fig. 8(a). As expected, due to the complexity of matrix inversion, the decoding throughput degrades significantly when n is large. We then enabled our multi-threaded approach to fully utilize the computing power of the quad-core CPU. This approach improves the decoding throughput (Fig. 8(b)) to a level that is comparable to the encoding throughput (Fig. 6). Fig. 8(b) shows that our achievable decoding throughput is up to 11 times higher than the benchmark results.

Figure 8. Decoding throughput: (a) single-thread matrix inversion, (b) multi-threaded matrix inversion

We further examined how the time is spent between the CPU and the GPU, i.e., the time spent on matrix inversion and on data decoding, respectively. Fig. 9 shows the time taken by the single-thread matrix inversion on the CPU and by the block decoding on the GPU. For smaller block sizes, the CPU time dominates the GPU time; hence, the matrix inversion is the performance bottleneck. In contrast, the CPU time drops dramatically when using multiple threads for matrix inversion, as shown in Fig. 10. However, when n is large and m is small, the matrix inversion on the CPU is still the performance bottleneck. We leave it as our future work to migrate the matrix inversion onto the GPU.

Figure 9. Decomposition of decoding time when using single-thread matrix inversion

Figure 10. Decomposition of decoding time when using multi-threaded matrix inversion

5. Conclusions

In this paper, we proposed a high-performance random network coding implementation on GPUs. More specifically, we designed a massively parallel implementation of random linear network coding using the CUDA programming model. We are able to achieve a speedup of 4 to 12 in terms of encoding and decoding throughput, depending on the number of blocks, n, and the block size, m; the combination of small n and large m offers the most speedup. For the decoding process, we proposed a multi-threaded model that computes the matrix inversion on the multi-core CPU, in combination with the parallel model for data block decoding on the GPU. With this design, the achievable decoding throughput is comparable to the encoding throughput; hence, the decoding process is no longer the bottleneck. By exploiting the GPU computing power, we show that the computational complexity of random network coding need not become the performance bottleneck in practical applications, even when coding a very large number of blocks.


6. Acknowledgement

This work is partially supported by Hong Kong RGC under grant HKBU 210406, and FRG grant FRG/0708/II-36.

7. References

[1] NVIDIA CUDA. http://developer.nvidia.com/object/cuda.html
[2] AMD CTM Guide: Technical Reference Manual. 2006. http://ati.amd.com/companyinfo/researcher/documents/ATI_CTM_Guide.pdf
[3] NVIDIA CUDA Compute Unified Device Architecture: Programming Guide, Version 2.0beta2, Jun. 2008.
[4] Sanders, P., Egner, S., and Tolhuizen, L. Polynomial time algorithms for network information flow. In Proceedings of the 15th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2003), June 2003.
[5] Ahlswede, R., Cai, N., Li, S.-Y. R., and Yeung, R. W. Network information flow. IEEE Transactions on Information Theory, 46(4), July 2000, 1204-1216.
[6] Koetter, R. and Medard, M. An algebraic approach to network coding. IEEE/ACM Transactions on Networking, 11(5), Oct. 2003, 782-795.
[7] Ho, T., Koetter, R., Medard, M., Karger, D. R., and Effros, M. The benefits of coding over routing in a randomized setting. In Proceedings of IEEE ISIT, 2003.
[8] Li, S.-Y. R., Yeung, R. W., and Cai, N. Linear network coding. IEEE Transactions on Information Theory, vol. 49, 2003, 371-381.
[9] Gkantsidis, C. and Rodriguez, P. Network coding for large scale content distribution. In Proceedings of IEEE INFOCOM, 2005.
[10] Wang, M. and Li, B. How practical is network coding? In Proceedings of the 14th International Workshop on Quality of Service (IWQoS), 2006, 274-278.
[11] Shojania, H. and Li, B. Parallelized progressive network coding with hardware acceleration. In Proceedings of the 15th International Workshop on Quality of Service (IWQoS), 2007.
[12] Wang, M. and Li, B. Lava: a reality check of network coding in peer-to-peer live streaming. In Proceedings of IEEE INFOCOM, 2007.
[13] Wang, M. and Li, B. R2: random push with random network coding in live peer-to-peer streaming. IEEE Journal on Selected Areas in Communications, Dec. 2007, 1655-1666.
[14] Owens, J. D., Luebke, D., Govindaraju, N., Harris, M., Krüger, J., Lefohn, A. E., and Purcell, T. J. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, 26(1), 2007, 80-113.
[15] Manavski, S. A. CUDA compatible GPU as an efficient hardware accelerator for AES cryptography. In Proceedings of the IEEE International Conference on Signal Processing and Communication, Nov. 2007, 65-68.
[16] Ryoo, S., Rodrigues, C. I., Baghsorkhi, S. S., Stone, S. S., Kirk, D. B., and Hwu, W. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In Proceedings of ACM PPoPP'08, Feb. 2008.
[17] Falcao, G., Sousa, L., and Silva, V. Massive parallel LDPC decoding on GPU. In Proceedings of ACM PPoPP'08, Feb. 2008.
[18] Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., and Phillips, J. C. GPU computing. Proceedings of the IEEE, May 2008, 879-899.
[19] Silberstein, M., Geiger, D., Schuster, A., Patney, A., and Owens, J. D. Efficient computation of sum-products on GPUs through software-managed cache. In Proceedings of the 22nd ACM International Conference on Supercomputing, Jun. 2008.
[20] Seiler, L., et al. Larrabee: a many-core x86 architecture for visual computing. ACM Transactions on Graphics, 27(3), Aug. 2008.
[21] Ho, T., Medard, M., Shi, J., Effros, M., and Karger, D. On randomized network coding. In Proceedings of the 41st Allerton Conference on Communication, Control, and Computing, October 2003.
