Int. J. High Performance Computing and Networking, Vol. X, No. Y, 2016
Towards Distributed Acceleration of Image Processing Applications Using Reconfigurable Active SSD Clusters: a Case Study of Seismic Data Analysis

Mageda Sharafeddin*, Hmayag Partamian, Mariette Awad, Mazen A.R. Saghir, Haitham Akkary, Hassan Artail, Hazem Hajj and Mohammed Baydoun
Department of Electrical and Computer Engineering, American University of Beirut, Beirut, Lebanon
Email: [email protected] (*Corresponding author), [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
Abstract: In this work, we propose a high performance distributed system that consists of several middleware servers, each connected to a number of FPGAs with extended solid state storage that we call reconfigurable active solid state device (RASSD) nodes. A full data communication solution between the middleware and RASSD nodes is presented. We use seismic data analysis as a case study to quantify how and by how much RASSD nodes can accelerate computational throughput, and we examine the speedup of seismic data prediction time when both GLCM and Haralick features are accelerated. The distributed system achieves 102× speedup compared to a four-thread OpenMP implementation and 265× speedup compared to single-thread modern CPU performance. Performance is 5× better than previous work reporting speedup on GLCM and Haralick feature analysis when data is local to the FPGA, and 20× better than an identical CUDA implementation using a modern GPU.

Keywords: Active solid state devices, Data analysis, Distributed systems, FPGA, GLCM, Haralick attributes, Image processing applications, Intelligent systems, Machine learning, Parallel architectures, Reconfigurable computing, Seismic data analysis.

Reference to this paper should be made as follows: Sharafeddin, M., Partamian, H., Awad, M., Saghir, M.A.R., Akkary, H., Artail, H., Hajj, H. and Baydoun, M. (xxxx) 'Towards Distributed Acceleration of Image Processing Applications Using Reconfigurable Active SSD Clusters: a Case Study of Seismic Data Analysis', Int. J. High Performance Computing and Networking, Vol. X, No. Y, pp.xxx–xxx.

Biographical notes: Mageda Sharafeddin is a Lecturer at AUB and was a Software Design Automation Engineer at IBM. She finished her BE in Computer and Communication Engineering from AUB in 1993, her MS degree from The Ohio State University in 1995, and her PhD degree from AUB in 2013. Her interests include power efficient micro-architectures, and reconfigurable and embedded computing.

Hmayag Partamian is a graduate student in the Electrical and Computer Engineering department at the American University of Beirut. His research interests include Machine Learning and Pattern Recognition.

Mariette Awad is an Associate Professor of Electrical and Computer Engineering at the American University of Beirut. She has also been a visiting professor at Virginia Commonwealth University and the Intel Mobile Group in the summers of 2008 and 2009, respectively. Prior to her academic position, she worked for the IBM System and Technology Group in Vermont as a wireless product engineer, where she earned management recognition, several business awards and multiple patents. Her research interests include Machine Learning, Pattern Recognition, Game Theory and Artificial Intelligence.

Mazen A.R. Saghir is a Visiting Professor of Electrical and Computer Engineering at the American University of Beirut. He received his BE in Computer and Communication Engineering from AUB, and his MASc and PhD in Electrical and Computer Engineering from the University of Toronto. His research interests include reconfigurable computing, computer architecture, compilers and EDA tools, and embedded systems design. He is a senior member of the IEEE.

Haitham Akkary is an Associate Professor of Electrical and Computer Engineering at the American University of Beirut. Before becoming a Professor, he was a Principal Research Scientist at Intel Labs. While working at Intel for 20 years, he contributed to the design and development of seven different generations of microprocessors.
He holds 41 patents and has authored more than 35 technical papers. His research interests include microprocessor architecture, architecture support for parallel programming, and computer security.
Copyright © 2016 Inderscience Enterprises Ltd.
Hassan Artail is a Professor at AUB doing research in internet and mobile computing, and in mobile and vehicle ad hoc networks. Before joining AUB, he was a System Development Supervisor at Chrysler, where he worked on system development for vehicle testing applications. He obtained his BS degree with high distinction and his MS in Electrical Engineering from the University of Detroit in 1985 and 1986, and his PhD degree from Wayne State University in 1999. In the past ten years, he has published over 150 papers in journals and conference proceedings.

Hazem Hajj is an Associate Professor at AUB and was a Principal Engineer at Intel, where he led research and development for manufacturing automation and received several awards. He received his BE in ECE from AUB in 1987 with distinction, and his PhD degree from the University of Wisconsin-Madison in 1996. His interests include data mining, energy-aware and high performance computing.

Mohammed Baydoun received the BE and ME degrees in mechanical engineering from the Lebanese University, Beirut, Lebanon, and the American University of Beirut, Beirut, in 2005 and 2007, respectively. He later obtained his PhD degree in electrical and computer engineering from the American University of Beirut in 2014, after working in the field of oil and gas in Abu Dhabi. His research interests include digital image and signal processing, machine intelligence, parallel programming and other related domains.
1. INTRODUCTION

According to a study by Gantz et al. (2012), it is estimated that 40% of data generated globally will be handled by cloud computing by 2020. Hence, there is growing interest in introducing innovative cloud computing solutions for data intensive scientific applications. A range of cloud computing platforms for data-intensive scientific applications, covering systems that deliver infrastructure as a service, was presented by Li et al. (2014). Data intensive applications in areas such as image processing require scalable and efficient solutions to deliver cost-efficient scheduling and management for the respective workflow. Image processing applications rely on sliding windows for analysis. FPGAs allow multiple copies of the same computation to run by unrolling loops, and provide alternative implementations of complex mathematical functions, as in the work by Paul et al. (2009). In this work, we exploit the proven orders of magnitude speedups possible with FPGAs to enhance throughput in image processing applications dealing with big data that are distributed across multiple geographic locations.

There are several advantages to distributed data analysis. First, having multiple centers analyzing data is a more efficient use of the expertise in each center. Second, scientific data sets are increasingly exceeding the processing capacity of modern computers, while machines with large amounts of on-board memory, on the order of terabytes, are expensive. Finally, solving problems in a divide and conquer fashion, such as the work by Shan (2010), has proven to work efficiently in analyzing distributed tasks, especially when considering the fact that image processing applications rely heavily on sliding windows for analysis.

Our proposed system uses the Reconfigurable Active Solid State Device (RASSD) platform introduced by Abbani et al. (2011), which connects a middleware server (MWS) to RASSD nodes consisting of FPGA/SSD pairs. That system had no filesystem, a problem that was later addressed by an enhanced RASSD system introduced by Ali et al. (2012). While this latter work builds an operating system for the RASSD node based on the Xilinx Xil Kernel to partially solve the storage problem by using the onboard SYSACE flash device for data storage, it is still limited by the flash capacity of 2 GB. In this work, we propose a distributed system based on the same RASSD framework to solve modern problems that deal with big data. Our distributed
system takes advantage of the higher computational throughput of FPGAs and uses the Linux operating system to connect the FPGAs to external SSD storage. We propose a scheduling algorithm that manages communication between the MWS and the RASSD nodes as well as the data processing in the RASSD nodes. Given the distributed system we are targeting, the MWSs of the various data centers collaborate to complete tasks whose data are distributed across these centers. In our system the MWS sends hardware configuration files as well as the corresponding high level language driving code to the RASSD nodes. These files vary in terms of structure and content depending on the data analysis tasks. We show how our MWSs collaborate in a distributed environment. Given the power of FPGAs in building fast custom hardware, we explore the potential of improving datacenter throughput in our proposed distributed system and compare it against 1) the original RASSD system with SYSACE flash for storage, 2) a multicore multithread CPU machine, and 3) a modern GPU machine. In this paper, we use seismic data analysis as a case study to demonstrate the promising potential of a distributed system with clusters of RASSD nodes having reconfigurable accelerators with local access to big data.

The main contributions of our work are:
1. Designing a distributed system of middleware servers connected to several RASSD nodes for data analysis.
2. Implementing hardware accelerators that access local data using extended SSD storage. This, as we will show, is a high performance solution that overcomes the limited storage capacity of modern FPGAs.
3. Developing a middleware-to-RASSD communication protocol, coupled with analyzing the performance of our system and comparing it to that of a multicore multithread CPU machine and a modern GPU machine.
4. Showing how our solution is applicable to computationally intensive tasks. Using seismic applications as a case study, our system can achieve double digit performance enhancements when compared to state of the art CPUs. This is achieved by first carefully choosing a parallel algorithm for analysis with data stored on SSDs; second, the algorithm has to limit data communication.

The rest of this paper is organized as follows: Section 2 describes related work and Section 3 describes our proposed reconfigurable distributed environment. Section 4 examines
seismic analysis as a case study. In Section 5, we show our experimental setup and results, and we finally conclude our work in Section 6.

2. RELATED WORK

Advancements in IC capacity and IPs in Field Programmable Gate Array (FPGA) devices have introduced an attractive environment for accelerating complex processing applications such as image analysis. The new generation of FPGAs at 28nm, as specified by Wu (2013), consumes less static power, improves system level performance by 50%, and doubles IC capacity as compared to the 32nm generation of FPGAs. Using a Serial Advanced Technology Attachment (SATA) connection to Solid State Device (SSD) storage, FPGA hardware accelerators can gain access to big data locally. However, when large data is dispersed in several datacenters, the additional delay from accessing big data for processing becomes a potential bottleneck to improving performance. Active disk drives examined by Keeton (1998) and Gibson (1998) can partially solve the data access problem by processing streamed data from the disk using disk controllers with embedded microprocessors.

FPGAs have also been used in the literature to support specific tasks in a network, as proposed by Krasteva et al. (2011). In their work, reconfigurable FPGAs are combined with wireless sensor network nodes in a system that uses a local FPGA middleware for reconfiguring the hardware in real time. The work is, however, application specific to wireless sensors and cannot handle processing data in dispersed locations.

A middleware architecture for distributed real-time and embedded systems that attempts to provide fault tolerance was proposed by Balasubramanian et al. (2008). Their architecture uses an adaptive and resource-utilization-aware failover backup management system. The system decentralizes failure detection by adding monitors that collect resource utilization readings and failure event notifications and report them to a replication manager to provide transparent failure recovery. Fault tolerance is examined in a theoretical framework by Saifullah et al. (2011). Their distributed system builds a depth first search tree of the system such that, upon stabilization of the algorithm, each processor knows all other processors that are 3-edge connected to it. A study by Eswari et al. (2015) examines the effectiveness of the firefly algorithm for static task scheduling in heterogeneous distributed systems. They show that it performs better than the particle swarm optimization based algorithm.

A Java middleware for heterogeneous multidomain non-routable networks was examined by Frattolillo et al. (2011). The significance of their middleware system is the ability to connect to private networks connected to the internet through publicly addressable IP front-end nodes. A self-extensible middleware system for distributed data sources, named MOCHA and proposed by Rodriguez-Martinez (2000), ships Java code to remote sites to be used in manipulating local data of interest, while striving to minimize data movements. The system we propose in this work ships hardware configuration files and their corresponding C drivers to RASSD nodes in order to set them up. Choosing the suitable RASSD processing nodes for a particular job is mainly based on data distribution. Additionally, our system manages and tracks the whole application flow, in contrast to MOCHA and other middleware systems in the literature that lack application flow management.

A study by Guo et al. (2004) analyzes the inherent advantage of hardware acceleration over a Von Neumann platform both qualitatively and quantitatively. They analyzed image processing applications while relying on sliding windows for analysis, using the Xilinx Virtex Pro architecture. In their quantitative analysis of the speedup of FPGAs over processors, they find that the streaming of data from memory to the accelerated logic, the overlap of control and data operations, and the elimination of instructions are the main contributing factors. Since our platform takes advantage of the FPGA speedup, we do not repeat this analysis; rather, we analytically analyze the impact of data movement on speedup in our proposed distributed system.
Chopra's (2006) study of seismic texture analysis showed the usefulness of texture attributes in identifying oil and gas bearing sites. A study by He (2004) uses an FPGA as a coprocessor for handling computationally expensive seismic data analysis and achieves an overall speedup of ten times over a CPU. The FPGA communicates with the main processor through the PCI bus, which lets the main processor memory handle data storage and management. FPGAs have been used to handle distributed tasks by Walter et al. (2007), where they compare Hidden Markov Models to sequence databases. They also show computation reductions from hours to minutes using multiple FPGAs. A study by Dondo (2014) uses FPGAs in a distributed architecture to support indoor localization and orientation services. Our system handles both data storage beyond the traditional capacity of a reconfigurable board and communication among RASSD nodes within a cluster in a distributed system, with the goal of improving performance.

Hitachi Accelerated Flash (HAF) by Hitachi (2015) combines flash storage with accelerated data access using dedicated FPGAs, with twice the IO throughput, promising higher density and better storage reclamation. This architecture is proposed as a solution for the gas and oil industry to deal with processing their Big Data, as described by Feblowitz (2012). Turning disk storage devices into smart storage that can perform useful queries has been introduced by Chamberlain (2005) and Brodie (2006). Both systems introduce reconfigurable FPGA systems with disk controllers and demonstrate the potential of using smart storage devices. We implement a variation of this concept by introducing SSDs for data storage in our distributed system.

Gipp (2012) and Gipp (2012) examine GLCM and Haralick attribute extraction using GPUs, where the target GPU graphics device is equipped with 768 MB of memory. Gipp (2012) reports 19× and Gipp (2012) reports 32× speedup compared to a single CPU runtime. Their algorithm also uses lookup tables, as is the case in our algorithm. The GPU implementation lacks the ability to retrieve images of various sizes from extension storage and requires packing of the GLCM matrix to reduce its size. Our approach, on the other hand, differs from this work by proposing a scalable solution to GLCM and Haralick attribute computation within a distributed system framework. We do not set limits on the image size to fit local memory, and we provision RASSD nodes to cooperate in order to analyze either neighboring or remote areas of interest.

Finally, a study by Asano et al. (2009) examines three algorithms: two dimensional filters, stereo-vision and k-means clustering. The study examines on-chip performance with no distribution of task calculation or result aggregation. It shows that FPGAs perform better than GPUs when data structures are shared; only for naive computational algorithms can GPUs match the FPGA performance.

3. DISTRIBUTED RECONFIGURABLE ENVIRONMENT FOR ACCELERATING DATA ANALYSIS

While data proliferation creates new challenges for data storage and processing, it also presents new opportunities for innovative systems targeting more accurate results. In a distributed environment, speeding up the execution time of a predetermined class of tasks means designing a system with optimized data access and processing. In this section, we examine the effect of having data local to the clients in the system and having most of the processing done locally on that data to serve the various clients within the distributed environment. Additionally, we check on the efficiency of having online data presented to clients for further processing.

3.1 Overall System
Figure 1 shows an overall view of our system. It includes an application and middleware client machine, which we refer to as Client Local Middleware (CLM), connected through the local area network to other CLMs.

Figure 1: Proposed Distributed System Showing CLM and RASSD nodes

Each CLM is connected to one or more RASSD nodes and has the following main responsibilities:
• Running user applications such as the seismic application.
• Keeping an indexed table of all data, referred to as the Data Table (DT), in the system. Data can be saved either in the CLM, on the corresponding RASSD nodes of the CLM, or on RASSD nodes of other CLMs in the system. For the seismic application, the DT keeps track of all user seismic images.
• Using the TCP/IP protocol to communicate with the RASSD nodes. Our communication protocol defines the format of exchanged messages, including loading hardware accelerators, processing data, and returning results. Other commands in our application flow management for data manipulation include adding, deleting, or modifying RASSD files.
• Distributing the user application into tasks and delegating them to RASSD nodes based on DT information. The CLM is also responsible for collecting results from each
RASSD node to present final results to users.
• Maintaining an Accelerator Timing Table (ATT) of all hardware accelerators and an estimate of the time required for servicing small, medium, and large data sizes. This allows the CLM to delegate tasks to other CLMs when all RASSD nodes in the cluster are busy for an extended period of time. We employ caching in the RASSD node to save on reconfiguration time when the hardware is already properly configured.
• Updating the Training Table (TT) to keep track of the time spent by RASSD nodes processing a certain accelerator working with a specific data size. In this regard, data from RASSD nodes in the cluster as well as other clusters are used to train the CLM on how to distribute a task based on previous runs of a similar task/data combination.
• Storing a Task Buffer (TB), a FIFO structure used to keep track of CLM requests. Every time a new request is issued by the CLM, an entry is allocated in the TB. Each entry in the TB holds a task and pointers to the data needed. This is used for bookkeeping CLM activities and freeing allocated resources.
• Being responsible for a group of RASSD nodes, based on the geographic configuration. Scripts handling application flow management in the CLM aggregate the results that are obtained from the different nodes.

A RASSD node, on the other hand, has the following main responsibilities:
• Updating the corresponding CLM, in the same cluster, with filename information as users add or remove files to or from the SSD storage local to each RASSD node.
• Processing tasks and sending data ready signals followed by results to the CLM.
• Using a lightweight Linux OS to implement a modified version of the communication protocol presented in Ali et al. (2012) in order to suit our system. Each node has an Ethernet Media Access Controller (EMAC) handler for TCP/IP connections with the CLM. The RASSD node is also responsible for sending a set of RASSD reply messages to the CLM to acknowledge receipt of commands.
• Pulling data out from the attached SSD for seismic analysis.

3.2 Job Partitioning and Data Management
When a client initiates a request, the master CLM interprets the application to be serviced. This interpretation is a one to one mapping between an application and the corresponding script which constitutes the required operations. The first operation typically is checking the DT to determine where the corresponding data resides. Second, the master CLM determines which CLM is the data owner. Third, the CLM decides which nodes should process the analysis based on the following configurations:
• If the RASSD nodes are associated with the same CLM, the master CLM figures out the number of threads to be spawned. This depends on which data is available in the corresponding local RASSD nodes. To maximize parallelism, we keep duplicate copies of frequently accessed data on several local RASSD nodes. For example, if the data required for the application resides on three RASSD nodes only, then the number of parallel CLM threads is bound by three. Finally, the master CLM sends the hardware configuration file and the drivers for the serviced application to the RASSD nodes.
• If the master CLM owns the data but the RASSD nodes are busy, we need to determine whether sending the data to a remote CLM is faster than waiting for the current RASSD nodes to finish, a process that we examine in detail next.
• If the data is owned by a remote CLM, the local CLM sends a request to the remote CLM, which in turn employs the previous two policies.

We use the following scheduling algorithm to resolve the scheduling problem presented above when the input data is on a local RASSD node but that node is busy. Here we have the following alternatives for getting the input data:
• A RASSD node is idle but does not have the required input data. It does not matter whether the RASSD node is in the same CLM cluster or in another cluster.
• A copy of the required input data is available in the CLM.

Note that modern FPGA nodes, such as the Virtex 6 board with the MicroBlaze soft processor used in this work, while highly competitive in terms of computation and energy consumption, are not optimized for common IO operations. There are several RASSD node on-board transfer bottlenecks that limit optimal data transfer. Our experiments show that the MicroBlaze performs poorly when multitasking a data transfer task, such as the one needed to move data from one RASSD node to another, while also transferring results from a running hardware acceleration task to the CLM. We therefore eliminate the possibility of copying data stored in one RASSD node to another idle RASSD node, regardless of whether the idle node is within the same CLM cluster or in a remote one. Hence, we are left with only one decision to tackle: whether to hold copies of the data in the CLM. Since requiring the CLM to hold all data would defeat the purpose of datacenter storage efficiency in our system, this alternative is only available to tasks with small input data sizes which are practical to save in the CLM. Small sized data has low transfer time and low transmission error rate and is especially useful for data of high interest that are often updated for repeated analysis.

We use the following algorithm to determine whether the master CLM should wait or transfer data. Let $T_{thresh}$ denote the time threshold used to determine whether the CLM should wait for the RASSD node to finish the task or should send the data to other CLMs. This threshold is defined as:

$T_{thresh} = T_{S\_ATT} - (DataSize / BW)$
In the above, $T_{S\_ATT}$ is the service time estimate of the current task in the ATT given the data size to be examined, while $BW$ is the network bandwidth, which depends on the node location (local/remote). When $T_{thresh}$ is positive, the CLM delegates the task to any available RASSD node within the same cluster or to that of another CLM cluster. In the latter case, the corresponding CLM updates its DT and ATT. In all cases the RASSD node computes the required functionality using local data and sends the results to the CLM. Finally, the CLM aggregates the results and delivers the outcome to the serviced client. A summary of the proposed algorithm is shown in Figure 2.

3.3 FPGA Node Architecture
Figure 3 shows a block diagram of the RASSD node. The data transfer path for images consists of the SSD drive connected through the SATA driver to the on-chip memory in a Direct Memory Access (DMA) fashion. Alternatively, data can reside on System ACE (SYSACE) flash, an on-board flash memory with a 2 GB capacity. In our experiments we used a Xilinx ML605 board with a MicroBlaze soft processor (MBP). Instead of reserving the Processor Local Bus (PLB) for file transfers from the SSD to the FPGA local memory, a dedicated DMA channel is established between the SSD and on-board memory, an approach provided by the open source code from Mendon (2012). All our hardware accelerators are also connected to local memory through a dedicated DMA channel, similar to a video processing implementation from OpenCores (2015). The accelerator communicates its results to the MBP through the Fast Simplex Link (FSL) shown in Figure 3. Finally, the MBP passes the results to the CLM through the EMAC for aggregation and further processing.
Figure 2: Scheduling in our Distributed CLM RASSD System
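To make this decision concrete, the following is a minimal C sketch of the wait-or-delegate test summarized in Figure 2. The att_lookup() helper and the bandwidth figures are hypothetical placeholders, not part of the actual CLM implementation.

```c
/* Hedged sketch of the CLM wait-or-delegate decision of Section 3.2.
 * att_lookup() and the bandwidth figures are illustrative placeholders. */
#include <stdbool.h>

typedef enum { CLUSTER_LOCAL, CLUSTER_REMOTE } Cluster;

/* Service-time estimate (seconds) read from the Accelerator Timing Table
 * (ATT) for a given task and data size. Stubbed here for illustration. */
double att_lookup(int task_id, double data_size_bytes)
{
    (void)task_id;
    return 1e-7 * data_size_bytes;   /* placeholder estimate */
}

/* Returns true when T_thresh = T_S_ATT - DataSize/BW is positive, i.e.
 * shipping the data to an idle node is expected to beat waiting for the
 * busy local RASSD node. */
bool should_delegate(int task_id, double data_size_bytes, Cluster c)
{
    /* Example link bandwidths in bytes/s; local cluster vs. remote CLM. */
    double bw = (c == CLUSTER_LOCAL) ? 10e6 / 8.0 : 2e6 / 8.0;

    double t_service = att_lookup(task_id, data_size_bytes);
    double t_thresh  = t_service - data_size_bytes / bw;

    return t_thresh > 0.0;
}
```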
Figure 3: RASSD Node Microarchitecture (block diagram: SATA-attached SSD, multi-ported memory controller with two DMA channels, Virtex 6 reconfigurable fabric, MicroBlaze, FSL, PLB, SYSACE, ICAP, and EMAC)
3.4 Analysis of Proposed Distributed Execution Times as Data Size Communicated between RASSD and CLM Increases
We quantify in this section the speedup from utilizing RASSD nodes in a distributed system to solve computationally intensive image processing tasks. In a general purpose distributed environment, the time taken to implement a task is denoted by $T_{t\_CPU}$. On the other hand, the time taken to implement a task using our proposed distributed system is denoted by $T_{t\_RASSD}$. $T_{X\_RASSD}$ is the transfer time between the CLM node and the local RASSD node; the corresponding link is measured at 10 Mbps. We have three more RASSD timing parameters to define for our distributed execution time. $T_{SSD}$ is the data transfer time between the RASSD RAM and the SSD using the SATA protocol; the measured rate is 12 MBps. $T_H$ is the execution time of the hardware accelerator, $[K + (p \times NW)] \times CP$, where $K$ is the number of required pipelining stages (15 and 33, respectively, for the two accelerators of the seismic algorithm presented in Section 4 and shown in Figure 9), $p$ is the window size, and $NW$ is the number of windows in the given input image, which is proportional (with factor $k$) to the size of the input data. Finally, $CP$ is the clock period of the hardware accelerator, and $T_{FSL}$ is the transfer time of the computed results from the hardware configuration logic to the EMAC handler through the MicroBlaze core; the measured rate is 5 Mbps. Therefore:

$T_{t\_RASSD} = T_{X\_RASSD} + T_H + T_{FSL} + T_{SSD}$   (1)
$T_{X\_RASSD} = Data_{out} / BW_{EMAC}$   (2)
$T_{FSL} = Data_{out} / BW_{FSL}$   (3)
$T_{SSD} = Data_{in} / BW_{SSD}$   (4)
$T_H = [K + (p \times NW)] \times CP$   (5)
$NW = k \times Data_{in}$   (6)

where $BW_{EMAC}$ is the bandwidth of the FPGA EMAC Ethernet connection to the CLM, $BW_{FSL}$ is the bandwidth of the FSL bus between the MicroBlaze and the configurable logic, $BW_{SSD}$ is the bandwidth between the SSD and the Virtex 6 memory, $Data_{out}$ is the size of the output data, and $Data_{in}$ is the size of the input data. Plugging in the constant bandwidth values specified above and substituting Equations 2 through 6 in Equation 1, we get:

$T_{t\_RASSD} = \left\{ \left(\frac{1}{BW_{SSD}} + p \times k\right) Data_{in} + K + \left(\frac{1}{BW_{EMAC}} + \frac{1}{BW_{FSL}}\right) Data_{out} \right\} \times CP$   (7)
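As a concrete illustration of this model, the C sketch below evaluates Equations 1 through 6 directly (rather than the substituted form of Equation 7). The bandwidths and clock period are passed in as parameters; the values used in main() are only one illustrative reading of the rates quoted above.

```c
/* Hedged sketch of the RASSD task-time model of Equations 1-6.
 * All rates are passed in explicitly; the values in main() are
 * illustrative readings of the figures quoted in the text. */
#include <stdio.h>

double rassd_time(double data_in,  double data_out,
                  double K,        double p,       double k,
                  double cp,       /* accelerator clock period, s    */
                  double bw_ssd,   /* SSD -> RAM bandwidth, bytes/s  */
                  double bw_emac,  /* FPGA -> CLM bandwidth, bytes/s */
                  double bw_fsl)   /* fabric -> MicroBlaze, bytes/s  */
{
    double t_ssd  = data_in / bw_ssd;                 /* Eq. 4 */
    double nw     = k * data_in;                      /* Eq. 6 */
    double t_hw   = (K + p * nw) * cp;                /* Eq. 5 */
    double t_fsl  = data_out / bw_fsl;                /* Eq. 3 */
    double t_emac = data_out / bw_emac;               /* Eq. 2 */
    return t_ssd + t_hw + t_fsl + t_emac;             /* Eq. 1 */
}

int main(void)
{
    /* Illustrative numbers only: a 32 MB input, 1 MB of results, K = 15
     * pipeline stages, p = 8, one window per 64 input bytes, 100 MHz clock,
     * 12 MB/s SSD, 10 Mb/s Ethernet and 5 Mb/s FSL links. */
    double t = rassd_time(32e6, 1e6, 15, 8, 1.0 / 64, 1e-8,
                          12e6, 10e6 / 8, 5e6 / 8);
    printf("estimated RASSD task time: %.3f s\n", t);
    return 0;
}
```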
Equation 7 shows that the execution time in our distributed system is linear. Given that we typically do not have control over the size of the input data nor over the SSD transfer bandwidth, the RASSD time according to Equation 7 is proportional to the window size and the corresponding number of windows required. These two factors depend on the algorithm chosen and the level of accuracy desired in the results. Equation 7 also shows that execution time is proportional to $Data_{out}$, making our system ideal for tasks whose results are small in size. $K$ is of negligible value compared to the application runtime. Finally, since we do not repeat the CPU theoretical analysis in this work, a topic discussed by Guo et al. (2004), we show experimental speedup in Section 5.

$speedup = T_{t\_CPU} / T_{t\_RASSD}$   (8)

To summarize, if an application is accelerated by 100×, typical in image processing applications, this means that the first two terms in Equation 7, $(\frac{1}{BW_{SSD}} + p \times k) Data_{in} + K$, contribute to the 100×. Results are generated in a streamed fashion and take $(\frac{1}{BW_{EMAC}} + \frac{1}{BW_{FSL}}) Data_{out}$ to be received by the CLM. The $Data_{out}$ size of the accelerated task can either fully exploit the 100× or become the bottleneck in our system.

4. CASE STUDY: SEISMIC DATA ANALYSIS

In this section we explain the fundamentals of seismic texture analysis. We focus on showing the computations involved in our adopted seismic analysis, namely GLCM and Haralick feature extraction. We do not discuss the LS-SVM computations since they were not accelerated and are considered part of the client application to be handled by our CLM. Additionally, we describe windowing and labeling, two techniques commonly used to simplify the latter calculations and to isolate the various processing steps for exposing parallelism. We also provide in this section a complexity analysis to justify our choice of tasks for hardware acceleration.

4.1 An Overview of the Adopted Seismic Texture Analysis
Seismic attributes are defined as the measured, calculated or implied quantities generated from the seismic data. A plethora of seismic attributes have been studied, classified, and analyzed by White (1991), Taner (1999), Taner (2001) and Barnes (2006). In this paper we examine textural or Haralick attributes defined by Haralick (1973), which can be applied directly to regions of interest. We employ four Haralick texture attributes: energy, entropy, contrast and variance. These attributes are indicative of gas and hydrocarbon, an approach followed by Chopra (2010) as well. Energy is a textural attribute that is reflected in a given image by bright spots which indicate the presence of hydrocarbon. Entropy is another textural attribute that is reflected in an image lacking uniformity, where, for example, low entropy can indicate uniform accumulations of sand and oil. The contrast attribute indicates the nature of the hydrocarbon distribution in the image. We computed these features using gray level co-occurrence matrices (GLCM) on small windows of the images. Haralick features in the form of 2D and 3D texture attributes in seismic analysis were proven to enhance the understanding of the reservoirs according to Yenugu (2010). We base our work on a study by Chopra (2010) which concludes that low-frequency and high-amplitude anomalies, indicative of hydrocarbon accumulation, exhibit high energy, low contrast and low entropy when compared to non-hydrocarbon sediments. We follow the nomenclature presented in Table 1.

Table 1: Table of Nomenclature
M×N: Image dimension
m'×n': Subimage size
N_g: Number of gray tones
Ɵ: Relationship angle
C(Ɵ,h): Horizontal matrix if Ɵ = 0
V_{i,j}: Symmetrical GLCM elements
P_{i,j}: Normalized GLCM elements
p: Window size
X_i: Matrix containing the window samples on node i
D_i: Diagonal matrix containing the labels of the windows on node i
E_i: Adjusted sample matrix on node i
E: Vector of ones
P_NZZ: Non-zero-zero probability
P_x: Sum of probabilities (horizontal)
µ: Mean
ω: Hyperplane weight vector
b: Hyperplane bias
C: The penalty term
m: Number of selected attributes
I_0: Identity matrix of size m+1 whose last diagonal value is 0
Y: Predicted label
N_samples: Number of generated samples

4.2 GLCM Computation
Haralick (1973) and Costianes (2010) introduced the concept of texture analysis in the field of image processing based on the GLCM.

$P_{i,j} = V_{i,j} \,/\, \sum_{i,j=1}^{N_g} V_{i,j}$   (9)

In Figure 4, we show the steps in a simplified example that creates a GLCM matrix from a 4×4 illustrative image. The 4×4 image is quantized into 4 gray tones (1, 2, 3, and 4). Next, matrix C is constructed by calculating the relative frequencies of the different gray tone combinations within the image. We look for a 0° angular relationship between matrix entries that are separated by a distance d = 1, that is, the first neighbor to the right. As shown in Figure 4, if the value 2 is adjacent to the value 4 two times, then C(2,4) is set to 2. The symmetric matrix V is then constructed as shown in Figure 4. Finally, the GLCM matrix is constructed according to Equation 9.

Figure 4: Steps for calculating the GLCM matrix of a 4×4 sample image
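For reference, the following is a minimal C sketch of this construction for a single window: the d = 1, θ = 0 case of Figure 4 followed by the normalization of Equation 9. It assumes the window has already been quantized to gray tones 0 through N_g - 1 and is a software illustration, not the accelerator RTL.

```c
/* Hedged sketch of GLCM construction for one quantised window:
 * horizontal first-neighbour pairs (theta = 0, d = 1) as in Figure 4,
 * symmetrised, then normalised per Equation 9. Gray tones are assumed
 * to be 0..NG-1. */
#define NG  8    /* number of gray tones */
#define WIN 8    /* window size p        */

void glcm_window(const int win[WIN][WIN], double P[NG][NG])
{
    int V[NG][NG] = {{0}};
    double sum = 0.0;

    /* Count each horizontal neighbour pair in both orders (symmetric V). */
    for (int r = 0; r < WIN; r++)
        for (int c = 0; c + 1 < WIN; c++) {
            int a = win[r][c], b = win[r][c + 1];
            V[a][b]++;
            V[b][a]++;
        }

    for (int i = 0; i < NG; i++)
        for (int j = 0; j < NG; j++)
            sum += V[i][j];

    /* Equation 9: P_ij = V_ij / sum over all V entries. */
    for (int i = 0; i < NG; i++)
        for (int j = 0; j < NG; j++)
            P[i][j] = (sum > 0.0) ? V[i][j] / sum : 0.0;
}
```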
4.3 Haralick Features Computation
Haralick features can be divided into three major groups: contrast, orderliness and statistical features. Features in the contrast group use weights related to the distance from the GLCM diagonal. Contrast, shown in Equation 10, reflects contrast and similarity among individual pixels in an image. Angular second moment, energy, maximum probability and entropy are some of the measures in the orderliness group, which indicates the pixel value regularity within a specific window area. Energy and Entropy are textural attributes that indicate uniformity in an image. The Energy attribute shows bright spots and dim spots of hydrocarbon concentration. Similarly, low Entropy can show uniform accumulations of sand and oil. These features are typically computed using the GLCM on windows of the original seismic images. Energy and Entropy are represented in Equations 11 and 12. Finally, the statistical group includes the mean and variance shown in Equation 13 and the correlation, all of which are derived from the GLCM.

$Contrast = \sum_{i,j=1}^{N_g} P_{i,j} (i - j)^2$   (10)
$Energy = \sum_{i,j=1}^{N_g} P_{i,j}^2$   (11)
$Entropy = \sum_{i,j=1}^{N_g} P_{i,j} (-\ln P_{i,j})$   (12)
$Variance = \sum_{i=1}^{N_g} p_x(i) (i - \mu_x)^2$, where $p_x(i) = \sum_{j=1}^{N_g} P_{i,j}$ and $\mu_x = \sum_{i=1}^{N_g} i \, p_x(i)$   (13)
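A direct software rendering of these four formulas, useful as a reference model for the accelerated versions described later, could look as follows. Double precision is used here for clarity, whereas the hardware uses 16-bit fixed point, and the 0-based indices stand in for the 1..N_g range of the equations.

```c
/* Hedged sketch of Equations 10-13: contrast, energy, entropy and variance
 * computed from a normalised GLCM P (NG x NG). */
#include <math.h>

#define NG 8

typedef struct { double contrast, energy, entropy, variance; } Haralick;

Haralick haralick_features(const double P[NG][NG])
{
    Haralick h = {0.0, 0.0, 0.0, 0.0};
    double px[NG] = {0.0}, mu_x = 0.0;

    for (int i = 0; i < NG; i++)
        for (int j = 0; j < NG; j++) {
            double pij = P[i][j];
            h.contrast += pij * (i - j) * (i - j);    /* Eq. 10 */
            h.energy   += pij * pij;                  /* Eq. 11 */
            if (pij > 0.0)
                h.entropy += -pij * log(pij);         /* Eq. 12 */
            px[i] += pij;                             /* p_x(i) */
        }

    for (int i = 0; i < NG; i++)
        mu_x += i * px[i];                            /* mu_x   */
    for (int i = 0; i < NG; i++)
        h.variance += px[i] * (i - mu_x) * (i - mu_x);/* Eq. 13 */

    return h;
}
```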
The rest of our seismic analysis involves incremental least squares support vector machines and is adopted from the work by Do (2006).

4.4 Windowing in GLCM and Haralick Features Computations
Overlapping windows of size p×p are taken as shown in Figure 5. In our experimental section we show results for 8×8 windows. A total of 49 overlapping 8×8 windows, with halfway offsets, are produced for a 32×32 sample image. Each of these windows is processed separately to compute its GLCM, producing 49 samples with four Haralick attributes each, while keeping track of the locations by proper indexing. All the resulting samples are aggregated into a single data matrix.

4.5 Labeling
In any supervised classification problem, samples need to be labeled. Since in our application the samples represent image windows of seismic data, we build a labeled matrix which contains ones and zeros: a label of 1 corresponds to oil and gas and 0 to none. Every window can now be considered as a two-tone gray level image. The normalized GLCM of these label images is a 2×2 matrix since we have only two possible values. We find the sum of the probabilities of the non-zero-zero occurrences, $P_{NZZ}$, and consider the label of a window to be 1 if it is greater than a certain threshold (default value of 0.5). In Figure 5, three label matrices are shown with their corresponding labels. Figure 5(a) has zero probability of zero-zero occurrence, hence a positive label. Figure 5(b) has a high zero-zero occurrence and therefore a negative label. Figure 5(c) shows alternating zeros and ones; however, its $P_{NZZ}$ is 1 and it is labeled positive.
Figure 5: Labeling Examples. Three 4×4 label matrices and their 2×2 normalized GLCMs: (a) P_NZZ = 1.00, Label = 1; (b) P_NZZ = 0.416, Label = 0; (c) P_NZZ = 1.00, Label = 1.
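The labeling rule of Section 4.5 can be summarized by the short sketch below. The horizontal, distance-1 pairing mirrors the GLCM construction of Section 4.2 and is an assumption on our part, as is the 4×4 window size taken from Figure 5.

```c
/* Hedged sketch of the labeling rule: a window of binary labels is reduced
 * to a 2x2 normalised GLCM and labelled 1 when the probability mass of
 * pairs that are not (0,0), P_NZZ, exceeds a threshold (default 0.5).
 * Window entries are assumed to be 0 or 1. */
#define LWIN 4            /* label-window size used in Figure 5 */

int label_window(const int w[LWIN][LWIN], double threshold)
{
    int counts[2][2] = {{0, 0}, {0, 0}};
    int total = 0;

    /* Horizontal first-neighbour pairs, symmetrised as for the image GLCM. */
    for (int r = 0; r < LWIN; r++)
        for (int c = 0; c + 1 < LWIN; c++) {
            counts[w[r][c]][w[r][c + 1]]++;
            counts[w[r][c + 1]][w[r][c]]++;
            total += 2;
        }

    /* P_NZZ = 1 - P(0,0): everything except the zero-zero co-occurrence. */
    double p_nzz = 1.0 - (double)counts[0][0] / total;

    return p_nzz > threshold ? 1 : 0;
}
```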
4.6 Seismic Prediction
Prediction involves a three-step process. First, an index to a training image is used to produce the image and label matrices. Second, the Haralick attributes are computed. Finally, the LS-SVM parameters are used to predict results for each window, and the prediction results are aggregated into a single, properly indexed vector. In our system a script was developed to manage the various tasks involved. Figure 6 shows a flow diagram for a typical image prediction functionality modeled after the popular map/reduce model by Dean (2008). The reduce functionality of the middleware is summarized in the flow, and K RASSD nodes, each responsible for a mapping function, are used in prediction.

When data is saved on RASSD nodes, the data administrator is responsible for updating storage and deleting outdated files to make room for new data. Even though this requires communication with the CLM for updates, it significantly improves the storage utilization of individual RASSD nodes. Additionally, this is an offline operation which does not affect prediction or training runtime.

Figure 6: Prediction in our Proposed Distributed CLM RASSD System (map/reduce flow: a request is sent to K RASSD nodes, each running a mapper function that calculates the GLCM, or the GLCM and Haralick features, and sends them to the CLM; the CLM reduce function calculates the labels and intermediate matrices, collects the matrices from the K nodes, and computes the ω and b parameters by inversion)
4.7 Complexity Analysis
The workflow of the application at hand has an overall computational complexity that is shown in Table 2.

Table 2: Order of Complexity
Step 1: $\Theta(M \times N)$
Step 2: $\Theta\left(\frac{M}{m'} \times \frac{N}{n'} \times \left[m' \times n' + p^4 (N_g^2 + N_g) + p^2 N_g^2 (N_g + \log)\right]\right)$
Step 3: $\Theta\left(\frac{M}{m'} \times \frac{N}{n'} \times \left[m' \times n' + p^2 (p^2 + N_g^2)\right]\right)$
The most complex operation is Step 2, where the 2D GLCM and the Haralick attributes are computed. The GLCM complexity is $\Theta(p^2(N_g + 6p^2))$, while that of the Haralick attributes is $\Theta(N_g^2(\ln + 3p^2))$. In the case of the 3D GLCM discussed by Chopra et al. (2006), the fetching of the pixels is performed in a 3D volume, or moving box, instead of a 2D window. The GLCM complexity in the 3D case is $\Theta(p^3(N_g + 6p^2))$, while the Haralick attribute computation still has the same complexity as in the 2D case. There are two main insights from this discussion. First, the GLCM and Haralick calculations are complex and are needed for the prediction phase; hence, accelerating both of these computations is expected to improve runtime. Second, the complexity of the 3D GLCM computation increases linearly when compared to the 2D GLCM computations, and hence we limit our experimental analysis to 2D. Since complexity increases linearly for the GLCM and Haralick attributes when we move from 2D to 3D image analysis, the speedup gained from our distributed system on 2D images is linearly scalable in data size and hence can be extended to handle 3D images. Therefore, in this paper we do not pursue accelerating the 3D GLCM and focus instead on understanding the potential for speedup on 2D images.

To better understand the impact of acceleration on seismic analysis, we profiled our seismic code on an Intel quad-core i7 2.3 GHz processor with 8 GB RAM using the GNU GCC compiler. Our profiling of a 32×32 image during the prediction phase shows that 87% of the total time is spent in building the GLCM and extracting the Haralick features. Given that these are the most time consuming processing steps in our seismic analysis, we examine accelerating them in our distributed environment. The GLCM windows have to be sent to the CLM client as they are calculated on the RASSD node, which means that the size of $Data_{out}$ is large. We also investigate building an accelerator for the Haralick features. In addition to the obvious interest in accelerating the Haralick feature extraction process, we are interested in leveraging distributed processing by reducing communication with the CLM. Since Haralick feature extraction creates the four features discussed per window of an image, we examine the speedup in the distributed task as the communicated data $Data_{out}$ is reduced. This complements our theoretical analysis of speedup in Section 3.4.

4.8 Hardware Accelerators
Figure 7 shows a block diagram of our implemented hardware to accelerate building the GLCM. Images are transferred to the local memory from the SSD through the direct memory access (DMA) channel. The accelerator communicates with the local memory through its own DMA channel. The image is processed in terms of partial subimages, where in this work a subimage has 32×32 dimensions and is conceptually divided
into windows of 8×8 intensity levels. An 8×8 window in the native seismic image represents a suitable spot size for rich oil locations. To identify bright and dim spots with high or low amplitudes in a seismic image relative to other structures in the volume, 8 levels are typically sufficient to cover all intensity levels. Therefore, with a DMA-to-accelerator channel width of 64 bits and 8 bits needed to represent the intensity level of a pixel, a vector of 8 intensity levels can be transmitted every cycle. Hence, every cycle a vector of 8 pixel values is read. Additionally, with pipelining, every cycle a new window is ready for GLCM calculations (by sliding windows which differ in one column only). Reading the full image (49 windows) thus takes 8 + 48 = 56 cycles: 8 cycles to fill the 8 pipe stages and obtain the first full 8×8 window, and one cycle for every new sliding window. The reconfigurable accelerator core for the GLCM calculation runs at 100 MHz.

The maximum and minimum intensity levels required for quantizing the matrix values to 8 gray levels within each window are updated during data reading. When the window is fully available, we calculate the eight quantization levels based on the maximum and minimum values, or what we call the epsilon threshold levels Eps0 through Eps7 shown in Figure 7, to generate the gray level window. The Window Read (WR) vectors in Figure 7 read 8 pixel values every cycle, where initially WR0 through WR7 represent the first window. In every cycle the matrix is shifted by one column, and after 48 cycles a subimage is fully read. The maximum-minimum (MM) blocks keep track of the maximum and minimum intensity values for 8 windows simultaneously. After 8 cycles, MM calculates the maximum of the maximum values for the first window. Also, in every cycle, a new maximum and minimum are determined by the MM block, and hence the corresponding window gray level values can be calculated.

Checking for co-occurring gray level values happens last. Since we are working with 8×8 windows, the distances 1 through 8 are checked for co-occurrence in both the row and column directions. For all 64 entries in the 8×8 window we check for co-occurrence with respect to the rest of the entries in the window at the distances specified. The GLCM is a symmetrical matrix, and therefore we build 36 entries instead of 64: the 28 top right triangular intensity level entries and the 8 diagonal entries. Our accelerator takes one cycle to copy the top right triangular values to the corresponding 28 symmetrical bottom left triangular entries. We use 8 RAM tables to compute the GLCM values for the 36 entries in our matrix. To do this for an 8×8 window, we found that 14 bits representing all possible adjacency relationships are sufficient. Hence we need a total of 36×8 RAM cells with 14 inputs each to compute the upper half of the GLCM. Note that by waiting one cycle to assign the 28 symmetrical values to the final GLCM we save an area of 28×8 RAM cells with 14 inputs each. Finally, note that given the choice of DMA data transfer, 8 pixel values are transmitted per cycle, which is sufficient for our design. In Table 9 in the results section, we show that our design consumes 36% of the available RAMs. Given that we want to pursue parallelization of the Haralick features, we did not pursue processing larger windows.
Nevertheless, minor modifications to our RTL would be needed to process larger windows, given that we only use a fraction of the DMA channel bandwidth for data transfer (window values). The controller we use from OpenCores (2015) can provide a bandwidth of up to 25.6 Gbps to hardware accelerators in streaming mode. This means that we can maintain high speeds when reading intensity levels for larger windows. In terms of computation, if the same number of intensity levels is maintained, then no modification to calculating the intensity thresholds is needed. If more intensity levels are desired, the thresholds have to be calculated in terms of the new choice of intensity levels. Finally, more RAMs would be required to parallelize the co-occurrence tests.

Figure 7: GLCM Accelerator Block Diagram (reconfigurable fabric with Window Read vectors WR0 through WR7, min/max update vectors MM0 through MM7, threshold calculation producing Eps0 through Eps7, the gray level co-occurrence window, an array of 36 parallel blocks of 8 RAMs each, and the GLCM output sent through the FSL to the MicroBlaze)

As shown in Section 4.3, Haralick feature extraction involves floating point operations, which we implement using a 16-bit fixed point representation. An error of less than 1% is measured, including the effect of replacing the logarithmic function in the Entropy feature with a look-up-table implementation. Our empirical analysis on seismic images confirms similar studies from the literature by Tahir (2005) in terms of result accuracy using fixed point representation. This error rate does not affect overall gas and oil exploration outcomes.

Figure 8 shows a block diagram of the Haralick feature extraction accelerator. One change to the GLCM accelerator, not shown in Figure 8 to avoid redundancy, is replacing the 8 RAMs for the co-occurrence check with one RAM only. The reason for this design decision is the area limitation of the reconfigurable fabric in our Virtex-6 device. Additionally, in the results section we show that the computation time of our hardware accelerators is a small fraction of the overall distributed computing time. We therefore chose to keep our modular design flexible in terms of window size changes rather than modify the design to use all available RAMs in order to improve acceleration time. As shown in Figure 8, a probability matrix denoted by P is generated from the GLCM. This includes division operations using IP logic cores from Xilinx (2015) and Xilinx (2015) that are fully pipelined. To create the energy feature, 64 multipliers work in parallel to calculate the individual square product terms whose outputs feed a summation operation. Entropy is built similarly, with the only difference being the logarithmic table introduced to calculate one of the product parameters; we use this optimization to replace the logarithmic function. The computation of the Contrast feature, which is by far the most time consuming feature, is simplified in Figure 8 to show the main operators needed. Pipelining the contrast logic requires more stages compared to the other features and, finally, the variance is the least computationally intensive operation. Note that the number of operators in series or parallel in Figure 8 is not exact, and we also avoid showing control logic to keep the diagram simple.
Figure 8: Haralick Accelerator Block Diagram (the GLCM feeds a probability matrix calculation; the normalized entries P0_0 through P7_7 drive four parallel feature pipelines: Haralick1/Energy, Haralick2/Contrast, Haralick3/Variance, and Haralick4/Entropy, the latter using an ln-function lookup table)
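As an illustration of the look-up-table replacement for the logarithm mentioned above, the following hedged C sketch precomputes -p·ln(p) over a coarse probability grid and accumulates the entropy of Equation 12 in fixed point. The table size and Q0.16 format are illustrative choices, not the values used in our RTL.

```c
/* Hedged sketch of the logarithm look-up table used for the Entropy
 * feature. Probabilities are held in Q0.16 fixed point; the table stores
 * -p*ln(p) for a coarse set of probability values (ROM in hardware). */
#include <math.h>
#include <stdint.h>

#define LUT_BITS 8                    /* index on the top 8 bits of p */
#define LUT_SIZE (1 << LUT_BITS)
#define Q        16                   /* Q0.16 fixed point */

static uint16_t neg_plnp_lut[LUT_SIZE];

/* Fill the table once at configuration time. */
void init_entropy_lut(void)
{
    for (int i = 0; i < LUT_SIZE; i++) {
        double p = (double)i / LUT_SIZE;            /* representative p */
        double v = (p > 0.0) ? -p * log(p) : 0.0;   /* -p ln p          */
        neg_plnp_lut[i] = (uint16_t)(v * (1 << Q)); /* back to Q0.16    */
    }
}

/* Entropy of an 8x8 normalised GLCM whose entries are Q0.16 values. */
uint32_t entropy_q16(const uint16_t P[8][8])
{
    uint32_t acc = 0;
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            acc += neg_plnp_lut[P[i][j] >> (Q - LUT_BITS)];
    return acc;   /* Q0.16 sum of -P_ij ln P_ij (Equation 12) */
}
```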
The pseudo code for the main GLCM function is shown in Algorithm 1, while the pseudo code for the alternate GLCM function required to accommodate the Haralick hardware resources is shown in Algorithm 2.

Algorithm 1: GLCM Co-occurrence Calculations (8×8) in RTL
for int j = 0; j < num of rows in window; j++ do
    Build "condition vector" that calculates co-occurrence against quantization levels;
end
for int k = 0; k < num of rows in condition vector; k++ do
    if condition vector[k] == 1 then
        update GLCM, a new co-occurrence was found;
    end
end
return GLCM
Algorithm 2: Modified GLCM (8×8) in RTL
Build "condition vector" that calculates co-occurrence against quantization levels;
for int k = 0; k < num of rows in condition vector; k++ do
    if condition vector[k] == 1 then
        update GLCM row, a new co-occurrence was found;
    end
end
return GLCM row
To cover all rows we make eight instantiations of the above pseudo code. To speed up the calculation of each pixel's co-occurrence, a RAM is used as described in Figure 7, where the input to the RAM covers all possible values of the condition vector, which is implemented as a multiplexer. The contents of the RAM hold the corresponding output value for each condition vector. In Algorithm 1, the second for statement represents the RAM lookup table. Thus 8 RAMs are needed to calculate each entry in the GLCM matrix. Additionally, 36 unique values exist for the GLCM, and the remaining 28 entries are symmetrical and can be copied as discussed above. In Haralick feature extraction, energy and entropy lend themselves well to parallelism while contrast and variance do not, as shown in Figure 7. We unroll all loops and pipeline the computation steps to maximize the potential of FPGAs in distributed computing. We initially used a mix of structural and dataflow RTL design to simplify the verification of simulation and synthesis results. Then, we replaced various high level blocks with Xilinx CORE Generator (CoreGen) IP objects to take full advantage of on-board optimized and pipelined division and multiplication operators.

Our FPGA design uses 2 DMA channels, each running at a 200 MHz clock, as was shown previously in Figure 3. The first DMA channel is used for moving data from the SSD to the local FPGA memory, while the second DMA channel is used to supply data from local memory to the reconfigurable hardware accelerator implementing the GLCM and Haralick features. The measured transfer rate from the SSD device to the local memory, however, is 12 MBps, for two reasons. First, our MBP is a rather low performing processor with a clock rate of 150 MHz and the PLB bus is used for the acknowledgement protocol, so the result is a 12 MBps transfer rate instead of the potential 200 MBps rate. Second, the SATA driver needs to wait for an acknowledgement from the MBP for every 4K block transferred, and thus falls short of reaching its full capacity. On the other hand, the hardware accelerator to MBP FSL channel runs at a measured rate of 5 MBps. The FSL channel is used to propagate hardware accelerated results to the MBP and then to the CLM through the Ethernet connection between the RASSD node and the CLM, which runs at 10 Mbps. The FSL and Ethernet bandwidth rates are the main obstacle to achieving better runtime gains from distributing tasks using our enhanced RASSD system. Figure 9 shows the detailed pipeline stages implementing the GLCM of Algorithm 2 and the Haralick feature extraction depicted in Figures 7 and 8.
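To make the condition-vector scheme of Algorithms 1 and 2 concrete, the following C model evaluates the same adjacency tests directly in software: row and column neighbours at distances 1 through 7 inside the 8×8 window (distance 8 falls outside the window). It is a behavioural illustration rather than the shipped RTL, and it materializes the full symmetric matrix instead of only the 36 unique entries.

```c
/* Hedged software model of the co-occurrence tests that the condition
 * vectors encode: for each window position, neighbours at distances
 * 1..WIN-1 in the row and column directions contribute to the (a, b)
 * gray-tone pair count. In hardware these tests address RAM lookup
 * tables; here they are evaluated directly. */
#define WIN 8
#define NG  8

void glcm_cooccurrence(const int win[WIN][WIN], int glcm[NG][NG])
{
    for (int i = 0; i < NG; i++)
        for (int j = 0; j < NG; j++)
            glcm[i][j] = 0;

    for (int r = 0; r < WIN; r++)
        for (int c = 0; c < WIN; c++)
            for (int d = 1; d < WIN; d++) {
                if (c + d < WIN) {                    /* row direction    */
                    int a = win[r][c], b = win[r][c + d];
                    glcm[a][b]++; glcm[b][a]++;       /* symmetric update */
                }
                if (r + d < WIN) {                    /* column direction */
                    int a = win[r][c], b = win[r + d][c];
                    glcm[a][b]++; glcm[b][a]++;
                }
            }
}
```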
4.9 GPU Based Accelerators
Recent GPUs, which can support a large number of cores running in parallel, have gained popularity in high performance computing systems such as Amazon Web Services (AWS). Figure 10 shows a block diagram of a typical GPU. A CUDA code is often called a kernel and is executed as a grid of thread blocks; a thread block, in turn, is composed of a batch of threads. A CUDA GPU has a set of streaming multiprocessors (SMs), which in turn are composed of several scalar processors (SPs). Each SM is usually assigned to a block, with each SP assigned to a thread.

We developed the CUDA code for the GLCM and Haralick features after testing several possible techniques in order to maximize the performance of the CUDA device. There are two main options for dividing the work among the blocks and threads of the GPU. The first approach lets each block work on a 32×32 subimage, where each thread handles calculating the features for an 8×8 window. In the second approach, a block performs the feature calculation for an 8×8 window, and the threads of the block cooperate and use shared memory to perform the required computations, especially since shared memory usually leads to the best speedup. We found that the first approach is 2× more efficient due to the atomic computations involved. Additionally, the data transfer mechanism is an important factor in GPU runtime. We therefore experimented with the data transfer setup that can best utilize the memory hierarchy: the global, constant and texture memory components. We follow in this work a streaming model, which we found to have the best performance. The streaming model uses the texture memory and allows streaming data from memory to the processing components without disrupting the execution of tasks (kernels in CUDA terms). In addition, we utilize memory coalescing to send the computed data back from the GPU in order to reduce memory transfer time.
5. EXPERIMENTAL SETUP AND RESULTS
We implemented a CLM and a middleware system using an Intel dual-core i7-4702MQ machine with 8 GB memory and a 2.4 GHz clock supporting 4 threads, which we connected to one RASSD node running a lightweight Linux OS. By using mappers representing partial tasks, namely GLCM and Haralick attributes, the middleware reducer is responsible for aggregating results: GLCM data to be used for Haralick attribute calculations in the case when only the GLCM calculations are accelerated. This setup helps us understand the potential of our system without the complexity of having to build the whole network. The RASSD node is a Xilinx ML605 board with 512 MB RAM. We use a low power OCZ Vertex Plus R2 SATA II solid state drive with 115 GB storage, which we connect to the FPGA board through an external daughter card (FMC XM104). Note that data-to-processor communication is consistently provided by the SATA II protocol in both the RASSD and CPU experimental setups. Using this hardware and software setup in the RASSD node, the GLCM as well as the GLCM and Haralick accelerators were implemented as pipelined streaming accelerators running at 100 MHz, as was shown in Figure 9. The OpenMP software was implemented on an Intel dual-core i7-4702MQ machine similar to the one used for the middleware in the CLM-RASSD system.
Figure 9: GLCM and Haralick Accelerator Pipeline Stages
Figure 10: Block Diagram of Typical Nvidia GPU
5.1 Overall Prediction Speedup Results

We discuss results for a distributed system with image sizes ranging from 308 KB to 2.25 GB saved locally on the RASSD node. We first examine the improvement in speedup from accelerating the GLCM only. In the C code we parallelize the GLCM using OpenMP at the outer for loop handling each of the 49 windows discussed in Section 4.4. This parallelization style takes advantage of the identical analysis required for each window to build the GLCM and calculate the Haralick attributes, and hence is a natural choice when parallelizing this algorithm. We tried finer grained parallelization, which requires synchronizing common variables accumulated in the inner for loops, and obtained sub-optimal performance. Using four threads in the OpenMP implementation for image sizes ranging between 308 KB and 2.25 GB, performance improves by 2.8× on average. We use Intel VTune™ Amplifier XE 2016 for Windows to profile our OpenMP implementation of the GLCM and Haralick attributes in order to understand its performance; VTune™ shows that a significant portion of CPU time is spent in synchronization and context switching. The CLM-RASSD system achieves 5× speedup compared to the four-thread CPU implementation and 14× compared to the single-thread CPU runs. In this configuration, the CLM requests GLCM calculations, for prediction, from the RASSD node with the accelerated hardware and the SSD-resident data.

Accelerating only the GLCM cannot improve performance further, mainly due to the data communication required between the GLCM windows, accelerated in the RASSD node, and the Haralick attribute calculation routine, implemented in the CLM; this was analyzed earlier in Section 3.4. Since the output data size is large, the returns from hardware acceleration diminish due to communication. Table 3 shows a detailed breakdown of the time, reported in seconds, spent by the FPGA to calculate the GLCM matrices and send them to the CLM. The first column in Table 3 shows the image size, the second the time needed to transfer the data from the SSD to the FPGA memory, the third the GLCM calculation time in the configurable fabric, the fourth the time needed to transfer the GLCM matrices from the configurable fabric to the MBP, the fifth the time taken to send the GLCM matrices from the MBP to the CLM, and the last column the total time taken for the GLCM calculation using the RASSD node. Note the small fraction of time spent on the hardware accelerated calculation compared to the time spent on the various data transfers.
Table 3: Distribution of Total Time Needed to Calculate GLCM Matrices for Various Image Sizes by the FPGA and Send them to the CLM (time in seconds).

Image Size | SSD Xfer | GLCM  | MBP Xfer | CLM Xfer | Total
308KB      | 0.030    | 0.001 | 0.030    | 0.240    | 0.290
3.1MB      | 0.260    | 0.010 | 0.310    | 2.480    | 3.057
32MB       | 2.670    | 0.090 | 3.200    | 25.600   | 31.560
80MB       | 6.670    | 0.250 | 8.000    | 64.000   | 78.910
318MB      | 26.500   | 0.980 | 31.800   | 254.400  | 313.680
1GB        | 85.330   | 2.940 | 102.40   | 819.20   | 1009.87
2.25GB     | 192.00   | 6.610 | 230.40   | 1843.20  | 2272.21
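For reference, the window-level OpenMP parallelization used as the CPU baseline in Section 5.1 can be sketched as follows; processWindow() is a hypothetical stand-in for the GLCM and Haralick routine, and the schedule clause is illustrative rather than the paper's exact setting.

```cpp
#include <omp.h>

// Coarse-grain parallelization over independent windows: no synchronization
// is needed inside the loop, unlike the finer-grain scheme that accumulated
// shared variables in the inner loops and performed worse.
static void processWindow(int windowId)
{
    // build the GLCM for window `windowId` and compute its Haralick attributes
    (void)windowId;   // placeholder body for this sketch
}

void analyzeImageRegion(int numWindows /* e.g. 49 windows per region */)
{
    #pragma omp parallel for schedule(static)
    for (int w = 0; w < numWindows; ++w)
        processWindow(w);
}
```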
Figure 11: Overall Speedup by CLM-RASSD system when Accelerating GLCM and Haralick Features
As discussed in Section 2.7, the GLCM and Haralick attribute calculations are the most time consuming functions in seismic analysis. Additionally, the Haralick attribute calculations return four attributes per image as opposed to full matrices; given that data communication is a bottleneck in our CLM-RASSD system, we therefore also accelerate the Haralick feature calculations. Finally, to shed more light on the potential of FPGAs in image analysis, we compare the execution time of our CLM-RASSD system to a CUDA implementation using NVIDIA's Tesla C2050 GPU computing processor. As in the OpenMP implementation, we try several alternatives in the CUDA code to increase performance; the results shown here are from distributing the window calculations while overlapping data reads from memory with computation. The OpenMP implementation with four threads also performs best for this scenario.

Figure 11 shows the runtime advantage of our system from distributing the GLCM and Haralick computation to RASSD nodes. While multithreading can achieve up to 3× speedup, our distributed hardware accelerated system now achieves on average 102× speedup compared to the four-thread (4T) run and 265× compared to the single-thread CPU run. Furthermore, the CLM-RASSD system improves performance on average by 20× compared to the CUDA GPU implementation. The CLM-RASSD to GPU speedup is on average 11× for large image sizes, while it reaches 71× for relatively small files that fit in memory, about 308 KB. This is mainly due to the optimized memory to reconfigurable logic interface in modern FPGAs. Compare this to the 8.4-9.6× speedup reported in the literature for up to 4 MB image sizes by Tahir (2005). A Haralick features computation accelerator using GPUs for biological applications is reported by Gipp (2012) to achieve 32% speedup.

Table 4 shows a detailed breakdown of the time, reported in seconds, spent on the FPGA to calculate the GLCM and Haralick attributes and send them to the CLM. The column labels are similar to those in Table 3, with the exception of the third column, which shows the combined GLCM and Haralick feature extraction time in the configurable fabric. To keep the results of the seismic analysis accurate, we need 40 B for our fixed point implementation of Haralick feature extraction. Note the reduction in data transfer time due to moving the Haralick features to the CLM as opposed to the 64 B GLCM matrices (8 × 8 entries with 8 quantization levels).
Table 4: Distribution of Total Time Needed to Calculate Haralick Attributes for Various Image Sizes on the FPGA and Send them to the CLM (time in seconds).

Image Size | SSD Xfer | GLCM & Haralick | MBP Xfer | CLM Xfer | Total
308KB      | 0.002    | 0.025           | 0.017    | 0.140    | 0.180
3.1MB      | 0.020    | 0.258           | 0.180    | 1.340    | 1.850
32MB       | 0.190    | 2.670           | 1.750    | 14.010   | 18.610
80MB       | 0.490    | 6.670           | 4.580    | 36.610   | 48.340
318MB      | 1.940    | 26.500          | 18.310   | 146.45   | 193.200
1GB        | 5.820    | 85.330          | 54.920   | 439.36   | 585.400
2.25GB     | 13.090   | 192.000         | 123.57   | 988.57   | 1317.23
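For completeness, a plain C++ model of four Haralick attributes computed from a normalized GLCM is sketched below, assuming the attributes are the energy, entropy, contrast and variance mentioned earlier; it is a floating point illustration, not the 40 B fixed point pipeline used on the FPGA.

```cpp
#include <array>
#include <cmath>

// Software model of four Haralick attributes derived from a normalized GLCM
// (entries sum to 1). Illustrative only; the FPGA uses a fixed-point pipeline.
constexpr int NG = 8;                                   // quantization levels (assumption)
using Glcm = std::array<std::array<double, NG>, NG>;

struct HaralickFeatures { double energy, entropy, contrast, variance; };

HaralickFeatures haralick(const Glcm& p)
{
    HaralickFeatures f{0.0, 0.0, 0.0, 0.0};
    double mean = 0.0;
    for (int i = 0; i < NG; ++i)
        for (int j = 0; j < NG; ++j)
            mean += i * p[i][j];                        // mean gray level weighted by p(i,j)

    for (int i = 0; i < NG; ++i)
        for (int j = 0; j < NG; ++j) {
            double pij = p[i][j];
            f.energy   += pij * pij;                    // sum of squared probabilities
            if (pij > 0.0) f.entropy -= pij * std::log(pij);
            f.contrast += (i - j) * (i - j) * pij;      // weighted squared level difference
            f.variance += (i - mean) * (i - mean) * pij;// spread around the mean level
        }
    return f;
}
```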
5.2 Examining the Effect of Output Data Size on Overall System Speedup

Given the gap between I/O performance and computational performance on FPGAs, in this section we quantify the effect of the output data size, relative to the input data size, on overall speedup. Table 5 shows, for each image size used in our case study (first column), the number of windows required (second column). Note that when only the GLCM is accelerated, more data is sent back to the CLM to finish the seismic analysis; when both the GLCM and the Haralick attributes are accelerated, the hardware sends less data back to the CLM. Column 3 in Table 5 shows the percentage of output to input data when the GLCM and Haralick attributes are accelerated, and Column 4 shows this percentage when only the GLCM is accelerated. As a result, our system should be used with the following two rules to guarantee an order of magnitude speedup:

1. The size of the results (output data) computed by the RASSD node should be smaller than the size of the input data; otherwise, the returns from hardware acceleration are diminished. When the output data sent from the RASSD node to the CLM is less than 60% of the input data examined by the RASSD node, our system achieves an average speedup of 265× compared to a single-thread CPU system and 102× compared to a 4-thread CPU system. Table 5 shows the exact percentages of output to input data size for the various image sizes used in our seismic experiments. On the other hand, when the output data is 3× the size of the input data (300% in Table 5), the returns in speedup diminish to 14× compared to serial CPU execution and 5× compared to 4-thread CPU execution. Hence, to handle large output results in compute intensive applications such as seismic analysis, which are ideal for hardware acceleration, we delegate complete tasks to the RASSD node, where reduced output data sizes are expected.
2. Our CLM-RASSD system is designed to exploit the hardware acceleration performance advantage of configurable nodes. For example, the 4-thread CLM machine can run four separate seismic applications, with the CLM communicating with four local RASSD nodes that analyze locally stored images for a specific geographic area of interest. Given that we established in the previous rule that the RASSD node should work mostly in an autonomous fashion, the speedup from the RASSD nodes is therefore linear in the proposed distributed system.

Table 5: Percentage of Output to Input Data Sizes as Images Grow in Size.

Image Size | Number of Windows | % of Data to CLM, Haralick + GLCM Accelerated | % of Data to CLM, GLCM Accelerated
308KB      | 15288             | 58.17% | 310.23%
3.1MB      | 152880            | 56.44% | 301.00%
32MB       | 1529976           | 54.72% | 291.82%
80MB       | 3999184           | 57.21% | 305.11%
318MB      | 15996736          | 57.57% | 307.03%
1GB        | 47990208          | 53.63% | 286.04%
2.25GB     | 107977968         | 53.63% | 286.04%
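Rule 1 amounts to a simple offload test that a middleware scheduler could apply; the fragment below is a hypothetical illustration of that policy (the structure, names and placement of the 60% threshold are ours), not part of the actual CLM implementation.

```cpp
// Hypothetical helper illustrating the deployment rule above: a task is
// delegated to a RASSD node only when its expected output-to-input ratio
// stays below a threshold (60% in our experiments); otherwise the CLM
// keeps the remaining computation local.
struct TaskEstimate {
    double inputBytes;    // data the RASSD node reads from its SSD
    double outputBytes;   // results returned to the CLM over Ethernet
};

bool offloadToRassd(const TaskEstimate& t, double maxRatio = 0.60)
{
    return t.inputBytes > 0.0 && (t.outputBytes / t.inputBytes) < maxRatio;
}
```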
The software implementation of the GLCM and Haralick feature extraction has, for a single window, a complexity of Θ(p^2 (Ng + 6p^2)) and Θ(Ng^2 (log(Ng^2) + 3p^2)), respectively, as discussed in Section 3.8. The hardware implementation, on the other hand, has a complexity of Θ(p).

5.3 Examining the Effect of RASSD Node Design Parameters on Overall System Speedup

In this section we shed some light on the design choices we make in the RASSD node to achieve our 102× speedup. Following the hardware implementation discussed in Section 4.8, Table 6 summarizes the clock frequency and the area requirements in the reconfigurable fabric of the FPGA when we build the GLCM accelerator alone, as well as the requirements after combining the GLCM and Haralick accelerators. Table 6 shows that both designs are well under the area budget of the FPGA's reconfigurable fabric. As we discussed in Section 4.8, we replace the 8 RAMs used in the highly parallel design of the GLCM with just 1 RAM to keep the combined GLCM and Haralick feature extraction under the area resource budget.

Table 6: Clock and Area Utilization for GLCM Alone and Combined GLCM and Haralick.

Resource         | GLCM    | Haralick & GLCM
Clock            | 100MHz  | 100MHz
Slice Registers  | 4%      | 18%
Slice LUTs       | 6%      | 38%
Occupied Slices  | 13%     | 52%
RAMs             | 36%     | 51%
Bonded IOBs      | 19%     | 19%
Table 7: Time in Seconds for the Proposed System versus the Base System (Ali et al. and Abbani et al.).

Image Size | Proposed Dist_System: CLM-RASSD | Base Dist_System (Ali et al. & Abbani et al.)
308KB      | 0.03   | 0.08
3.1MB      | 0.26   | 0.65
32MB       | 2.67   | 6.71
80MB       | 6.67   | 16.78
318MB      | 26.50  | 66.71
1GB        | 85.33  | 214.82

Finally, we compare our proposed distributed system with the base system introduced by Abbani et al. (2011) and enhanced by Ali et al. (2012). A major limitation of the base system is its data storage capacity, since it relies solely on the on-board ACE flash device. This means that the 2 GB of storage provided by the flash device must hold the kernel in addition to user data. Given that we propose this system for compute intensive applications, which typically deal with big data, the original system falls short of this goal. In terms of performance, Table 7 shows the runtime, in seconds, of the proposed system compared to the original system. The proposed system is on average 2.6× faster than the base system.

6 CONCLUSION
In this paper we presented a distributed system for seismic image processing. We presented a proof of concept through an implementation of the client local middleware (CLM) system connected to one RASSD node, and discussed two distributed computing tasks: one in which the GLCM computation for every window in an image is implemented in the RASSD node and its results are sent to the CLM, which waits for the partial results before resuming the remaining processing steps, and another in which both the GLCM and Haralick feature computations for every window are performed in the RASSD node before the results are sent to the CLM. In both cases the RASSD node acts as an active solid state device that pulls relevant data for processing and sends results to the middleware. The CLM completes the task by performing further calculations to be reported to the seismic application. For the first task our system achieves 3× speedup, while for the second it achieves 102× speedup compared to a state-of-the-art Intel Core i7 processor running four threads. We conclude that, with current FPGA technology, tasks must not only be computationally intensive but must also return results of small size in order to be considered candidates for deployment in a distributed architecture. We also conclude that current SATA technology provides fast data storage for modern FPGAs. For future work, we would like to address the two main bottlenecks in our system: first, the DMA synchronization speed, bound to the MBP clock speed, which limits the data transfer rate between the SSD and the FPGA memory to 12 MBps, and second, the need for a faster Ethernet connection to improve the communication between the RASSD node and the cloud. It would also be useful to configure our system using the latest generation of Xilinx boards, including Zynq FPGAs; this can mainly shed light on the bus capacity of the new system. We will also need to develop a protocol for arbitrating the choice of the CLM master in the event that data is present in more than one CLM cluster, as well as a failure recovery mechanism for data transfer failures within a cluster and among various clusters.
References
15
Dondo, J., Villanueva, F., Garcia, D., Vallejo, D., GlezMorcillo, C., Lopez J. C. (2014), 'Distributed FPGA-based architecture to support indoor localization and orientation services', Journal of Network and Computer Applications.
Ali, A., Jomaa, M., Romanos, B., Sharafeddin, M., Saghir, M. Operating System for a Reconfigurable Active SSD Processing Eleyan, A., Demirel, H., 'Co-occurrence based Statistical Node', 19th International Conference on Telecommunications, Approach for Face Recognition', IS-SCI, pp. 611-615. pp. 1-6. Abbani, N., Ali, A., Al Otoom, D., Jomaa, M., Sharafeddine, M., Artail, H., Hajj, H., Akkary, H., Awad, M. (2011), 'A Distributed Reconfigurable Active SSD Platform for Data Intensive Applications', International Conference on High Performance Computing and Communications, pp. 1116-1121.
Eswari, R., Nickolas, S. (2015), ‘Effective Task Scheduling for Heterogeneous Distributed Systems using Firefly Algorithm’, Vol. 11, No. 2, pp. 132-142. Gantz, J., Reinsel, D. (2012), ‘The Digital Universe in 2020: big data, bigger digital shadows, and biggest growth in the far east’, IDC iView, Vol. 1414_v3.
Asano, S., Maruyama, T, Yamaguchi, Y. (2009), `Performance Comparison of FPGA, GPU and CPU in Image Processing`, Ganguly, J., Freed, A., Saxena, S. (2009), 'Density Profiles of IEEE International Conference on Field Programmable Logic Oceanic Slabs and Surrounding Mantle: Integrated and Applications, pp. 126-131. Thermodynamic and Thermal Modeling, and Implications for the Fate of Slabs at the 660- Km Discontinuity', Physics of the Balasubramanian, J., Gokhale,A., Schmidt, D., Wang, N. Earth and Planetary Interiors,Vol. 172, No. 3-4, pp. 256-267. (2008), ‘“Towards Middleware for Fault-Tolerance in Distributed Real-Time and Embedded Systems”, LNCS, Vol. Gibson, G., Naglet, D., Amirit, K., Butler, J., Chang, F. W., 5053, pp. 72-85. Gobioff, H., Zelenka. J. (1998), 'A Cost-Effective, HighBandwidth Storage Architecture', International Conference on Brodie, B. C., Chamberlain, R. D., Shands, B., White, J., Architectural Support for Programming Languages and 'Dynamic recon_gurable computing', Proceedings of the 9th Operating Systems. Military and Aerospace Programmable Logic Devices International Conference (MAPLD). Gipp, M., Marcus, G., Harder, N., Suratanee, A., Rohr, K., Knig, R., Mnner, R. (2012), 'Haralick's Texture Features Chaddad, A., Tanougast, C., Dandache, A., Al Houseini, A., Computation Accelerated by GPUs for Biological Bouridane, A. (2011), 'Improving of Colon Cancer Cells Applications', Proceedings of the Fourth International Detection Based on Haralicks Features on Segmented Conference on High Performance Scientific Computing. Histopathological Images', ICCAIE , pp. 87-90. Gipp, M., Marcus, G., Harder, N., Suratanee, A., Rohr, K., Chamberlain, R., 'Embedding applications within a storage Konig, R., Manner, R. (2012), 'Haralick's Texture Features appliance', Proceedings of the High Performance Embedded Computation Accelerated by GPUs for Biological Computing Workshop (HPEC). Applications', Modeling, Simulation and Optimization of Complex Processes, Springer. Chopra, S., Alexeev, V. (2006), 'Application of texture attribute analysis to 3D seismic data', The Leading Edge, pp. Gao, D. (2003), 'Volume Texture Extraction for 3D Seismic 934-940. Visualization and Interpretation', Geophysics, Vol. 68, No. 4, pp. 1294-1302. Chopra, S., Marfurt, K. (2010), 'Detecting Stratigraphic Features via Cross-Plotting of Seismic Discontinuity Attributes Haralick, R., shanmugam, K., Dinstein, I. (1973), 'Textural and Their Volume Visualization', Search and Discovery. Features for Image Classi_cation', IEEE Transactions on Systems, Man, and Cybernetics, pp. 610-621. Costianes, P., Plock, J. (2010), 'Gray-level co-occurrence matrices as features in edge enhanced images', 39th Applied He, C., Lu, M., Sun, C. (2004), 'Accelerating Seismic Imagery Pattern Recognition Workshop (AIPR), pp. 1-6. Migration Using FPGA-based Coprocessor Platform', Proceedings of the 12th Annual IEEE Symposium on FieldDean, J., Ghemawat, S., 'MapReduce: simplified data Programmable Custom Computing Machines. processing on large clusters', Commuications of the ACM: 50th anniversary, pp. 107-113. Hitachi Data Systems (2015), 'Hitach Accelerated Flash', url = http://www.hds.com/assets/pdf/hitachi-white-paperDo, T. N., Poulet, F. (2006), 'Classifying one billion data with accelerated-ash-storage.pdf a new distributed SVM algorithm', 4th International Conference on Computer Science, Research, Innovation and Feblowitz, J. (2015), 'The Big Deal About Big Data in Vision for the Future, pp. 59-66. Upstream Oil and Gas', url =
M. Sharafeddin et al. http://www.hds.com/assets/pdf/the-big-deal-about-big- data-in- Shahbahrami, A., Pham, T. A., Bertels, K. (2011), 'Parallel upstream-oil-and-gas.pdf Implementation of Gray Level Co-occurrence Matrices and Haralick Texture Features on Cell Architecture', Journal of Frattolillo, F., Landolfi, F. (2011), ‘Parallel and Distributed Supercomputing. Barnes, B. (2006), 'Too many seismic Computing on Multidomain non-routable Networks’, attributes?', CSEG Recorder, pp. 41-45. International Journal on High Performance Computing and Networking, Vol. 7, No. 1, pp. 63-73. Shan, Y., Wang, B., Yan, Y., Wang, Y., Xu, N., Yang, H. (2010), 'FPMR: MapReduce framework on FPGA', Krasteva, Y. E., Portilla, J., de la Torre, E., Riesgo, T. (2001), Proceedings of the 18th annual ACM/SIGDA international ‘Embedded Runtime Reconfigurable Nodes for Wireless symposium on Field programmable gate arrays, pp. 93-102. Sensor Networks Applications,’ Sensors Journal, IEEE , Vol.11, No.9, pp.1800-1810. Sheline, H. E. (1987), 'Arctic Marine Seismic Acquisition Feasibility: A Case Study Northeast of Greenland', Proceedings Keeton, K., Patterson, D., Hellerstein, J. (1998), 'A Case for of the 19th Annual Offshore Technology Conference. Intelligent Disks (IDISKs)', ACM SIGMOD Record, pp. 4252. Smith, K. (2012), 'Deep-Sea Explorers: The World of Seismic Survey Vessels', url = www.maritimeLi, H. L., Tuo, X. G., Liu, M. Z. (2013), 'Distributed Wireless executive.com/article/deep-sea-explorers-the-world-of-seismicAcquisition System for Seismic Signal with Vibration and survey-vessels Notes', Applied Mechanics and Material, No. 340, pp. 75-79. Sun, S. (2011), 'Analysis and Acceleration of Data Mining Li, X., Qiu, J. (2014), ‘Cloud computing for data intensive Algorithms on High Performance Reconfigurable Computing applications’, Springer-Verlag New York. Platforms', Ph.D. dissertation, Iowa State University, IA. Lopez, L. M., Moctezuma, M., Parmiggiani, F. (2005), 'Oil Suykens, J., Vanderwalle, J. (1999), 'Least Squares Support Spill detection using GLCM and MRF', IGARSS, pp. 1781- Vector Machines Classifiers', Neural Processing Letters, pp. 1784. 293-300. Marine Geoscience Data System (2014), 'Academic Seismic Tahir, A., Bouridane, A., Kurugollu, F., Amira, A. (2003), Portal at LDEO: News', url = http://www.marine- 'FPGA Based Coprocessor For Calculating Grey Level Cogeo.org/portals/seismic/news.php. occurence Matrix', 46th Midwest Symposium on Circuits and Systems. OGP (2011), 'An Overview of Marine Seismic Operations', url = www.ogp.org.uk/pubs/448.pdf. Tahir, M. A., Bouridane, F. K. (2005), 'An FPGA Based Coprocessor for GLCM and Haralick Texture Features and Rodriguez-Martinez, M., Roussopoulos, N. (2000), ‘MOCHA: their Application in Prostate Cancer Classification', Analog A Self-Extensible Database Middleware System for Integrated Circuits and Signal Processing, pp. 205-215. Distributed Data Sources,’ Proceedings of the 2000 ACM SIGMOD international conference on Management of data Taner, M. T. (1999), 'Seismic Attributes, Their Classification SIGMOD ’00, Vol. 29, pp. 213-224. And Projected Utilization', 6 th International Congress of the Brazilian Geophysical Society. Mendon, A., Huang, B., Sass, R. (2012), 'A high performance, open source SATA2 core', International Conference on Field- Taner, M. T. (2001), 'Seismic Attributes', Rock Solid Images. Programmable Logic and Applications, pp. 421-428. 
Heavy Oil Science Centre (2015), 'Trap Types', url =http://www.lloydminsterheavyoil.com/traptypes.htm OpenCore (2015), 'NPI graphics controller', url = http://opencores.org/project,npigrctrl Vapnik, V., 'The Nature of Statistical Learning Theory', Springer. Paul, S., Jayakumar, N., Khatri, S.P. (2009), ‘A Fast Hardware Approach for Approximate, Efficient Logarithm and Walter, J. P., Meng, X., Chaudhary, V., Oliver, T., Yeow, L., Antilogarithm Computations’, IEEE Transactions on Very Nathan, D., Landman, J. (2007), 'MPI-HMMER-Boost. Large Scale Integration (VLSI) Systems, Vol. 17, No. 2, pp. Distributed FPGA Acceleration', Journal of VLSI Signal 269-277. Processing, pp. 223-238. Saifullah, A. M., Tsin, Y. H. (2011), ‘A Self Stabilising Algorithm for 3-edge-connectivity’, International Journal of High Performance Computing and Networking, Vol. 7, No. 1, pp. 40-52.
Weatherill, G., Esposito, S., Lervolino, L., Franchin, P., Cavalieri, P. (2014), 'Framework for Seismic Hazard Analysis of Spacially Distributed Systems', Geological and Earthquake Engineering, No. 31, pp. 57-88.
Towards Distributed Acceleration of Image Processing Applications Using Reconfigurable Active SSD Clusters: a Case Study of Seismic Data Analysis White, R. E. (1991), 'Properties of instantaneous seismic attributes', The Leading Edge, Vol. 10, No. 7, pp. 26-32. Wu, X., Gopalan, P. (2103), 'Xilinx Next Generation 28nm FPGA Technology Overview'. Xilinx White Paper. Xilinx (2015), 'IP Core Divider', =http://www.xilinx.com/products/intellectualproperty/Divider.htm
url
Xilinx (2015), 'Working with CORE Generator IP', url = www.xilinx.com/support/documentation/sw manuals/ xilinx11/ise c using coregen ip.htm
17