Mean-shift Clustering for Heterogeneous Architecture

FOUTSE YUEHGOH ([email protected])
African Institute for Mathematical Sciences (AIMS), Senegal

Supervised by: Pr. Mustapha Lebbah & Pr. Christophe Cérin
Université de Paris 13, France

January 2018 Submitted in Partial Fulfillment of a Masters II at AIMS

Dedication

To my late Father, Foutse Abraham My mother, Dora Njawe Jessie And all my Family.


Acknowledgements

“Feeling gratitude, and not expressing it, is like wrapping a present and not giving it.” — William Arthur Ward (1921-1994)

I have had the pleasure of meeting people who have extended a helping hand to me in diverse ways during my masters studies and the writing of this thesis. I am deeply grateful for their support, and I thank the Almighty God for surrounding me with such good role models.

Firstly, I thank my supervisors, Pr. CHRISTOPHE CERIN and Pr. MUSTAPHA LEBBAH, for showing me the joy of research by guiding me with extensive knowledge and unending encouragement, support and advice; and CHICK MARKLEY from the Berkeley Aspire Lab for his technical assistance and guidance with Chisel programming, and for always answering my questions.

I express my sincere gratitude to the MasterCard Foundation, to the African Institute for Mathematical Sciences in general, and in particular to AIMS-SENEGAL for giving me this great opportunity. I especially salute the brilliant idea of Professor Neil Turok. Special thanks also go to Pr. Joseph Ben Geloun for the moral and technical support he has given me; it contributed to making my stay in France a peaceful one. I thank all the staff at LIPN for their warm welcome and care.

Great thanks go to my lecturers Dr. NANA CYRILLE and Dr. WILLIAM S. SHU for recommending me for the scholarship that led to this thesis, and to all the tutors who have been with us, helping us throughout our training. I also thank all my friends at AIMS Senegal, especially my course mates, for encouraging me and thus contributing to the success of my studies.

I am grateful to my parents, Dora and my late dad Abraham, for their love, for their unconditional support, and for believing in me whenever and wherever. To my brothers and sisters.


Abstract

The exponential growth in data size poses new challenges for computer scientists, giving rise to a new set of methodologies under the term Big Data. Many efficient machine learning algorithms have been proposed to address time and memory requirements. Meanwhile, with hardware acceleration, multiple software instructions can be integrated and executed on a single hardware die, and current research aims at eliminating the burden on the user of working with multiple processor types. In this master thesis, we propose a new way of implementing machine learning algorithms on heterogeneous hardware. To explore our vision, we use a parallel Mean-shift algorithm developed at LIPN as our case study to investigate issues in building efficient machine learning libraries for heterogeneous systems. The ultimate goal is to provide a core set of building blocks for machine learning programming that could serve either to build new applications on heterogeneous architectures or to control the evolution of the underlying platform. We thus examine the difficulties encountered during the implementation of the algorithm, with the aim of discovering methodologies for building systems based on heterogeneous hardware. We also identify issues and building blocks for solving concrete machine learning (ML) problems on the Chisel software stack we use for this purpose.

Keywords: Machine Learning; Field Programmable Gate Array (FPGA); Parallel Computing; Heterogeneous Architecture; Constructing Hardware in Scala Embedded Language (Chisel)

Declaration

I, the undersigned, hereby declare that the work contained in this essay is my original work, and that any work done by others or by myself previously has been acknowledged and referenced accordingly.

Foutse Yuehgoh, 31st January 2018

Contents

List of Figures  viii
Abbreviations  viii

1 General Introduction  1
  1.1 Preliminaries  1
  1.2 Work Context  2
  1.3 Problem Statement  3
  1.4 Contribution  4
  1.5 Document plan  5

2 State of the art  6
  2.1 Heterogeneous architectures  6
  2.2 FPGA Architecture  10
  2.3 Machine learning and heterogeneous architecture  14
  2.4 Constructing Hardware In Scala Embedded Language (Chisel)  16

3 Machine Learning  20
  3.1 Cluster Analysis  20
  3.2 The Mean-shift algorithm under study  21

4 Experiments and open questions with Chisel  24
  4.1 Comparison of Scala and Chisel functions for the square distance  24
  4.2 Comparison of Scala and Chisel hash functions  25
  4.3 Additional technical difficulties  26
  4.4 Some open questions with Chisel  27

5 Conclusion  29

A Implementation example for maximum of a vector in Chisel  30

B Hardware K-NN in chisel  34
  B.1 MostFrequentlyOccurring.scala  34
  B.2 CombinationalSortIndexAndTake.scala  35
  B.3 SortIndexAndTake.scala  36
  B.4 HardwareNearestNeighbours.scala  38

References  45

List of Figures

1.1 Huge volume of data requires new techniques  3
2.1 [cla] CPU architecture vs. GPU architecture  7
2.2 [cla] GPU acceleration  7
2.3 [BDH+10] Three types of heterogeneous architectures  8
2.4 Serial and parallel computing, image by David Taylor  9
2.5 FPGA memory trade-off [MDS09]  10
2.6 DE2-115 FPGA Development Board  11
2.7 Basic FPGA structure [KTR08]  12
2.8 FPGA Architecture  13
2.9 Simulation and synthesis; Memory IP is target-specific  16

Abbreviations

Chisel  Constructing Hardware in Scala Embedded Language
FPGA  Field Programmable Gate Array
ASIC  Application-Specific Integrated Circuit
ARM  Advanced RISC Machine
RISC  Reduced Instruction Set Computing
CPU  Central Processing Unit
GPU  Graphical Processing Unit
DSP  Digital Signal Processing
ALU  Arithmetic Logic Unit
NRE  Non-Recurring Expense
ISA  Instruction Set Architecture
EPROM  Erasable Programmable Read-Only Memory
EEPROM  Electrically Erasable Programmable Read-Only Memory
JVM  Java Virtual Machine
DBSCAN  Density-Based Spatial Clustering of Applications with Noise
AI  Artificial Intelligence
ML  Machine Learning
NNMS  Nearest Neighbor Mean Shift
kNN  k-Nearest Neighbor
LSH  Locality Sensitive Hashing
HDL  Hardware Description Language
SISD  Single Instruction Single Data
SIMD  Single Instruction Multiple Data
MISD  Multiple Instruction Single Data
MIMD  Multiple Instruction Multiple Data
SPMD  Single Program Multiple Data
MPMD  Multiple Program Multiple Data
CBEA  Cell Broadband Engine Architecture
AMD  Advanced Micro Devices
CAD  Computer-Aided Design
SRAM  Static Random Access Memory
PRAM  Parallel Random Access Memory
SLA  Service Level Agreement
Verilog  Verification Logic
HLS  High-Level Synthesis
KDE  Kernel Density Estimation
VHSIC  Very High Speed Integrated Circuit
PDF  Probability Density Function
AST  Abstract Syntax Tree
RTL  Register Transfer Level
FIRRTL  Flexible Intermediate Representation for RTL
IR  Intermediate Representation

1. General Introduction

1.1 Preliminaries

The landscape of hardware and software is traditionally represented as a stack: the gates and transistors level, then the digital circuits level, the micro-architecture level, the ISA (Instruction Set Architecture) level, the level of the various intermediate representations of programs, the compiler level, the programming language level, and finally the application level. Our motivation in this study is to rethink program development with the idea of reducing the number of levels in this stack. Ideally, we would like a toolchain reduced to a programming language level plus a digital circuit level. Much effort has been made in the past to make CPU and GPU architectures more complex, reaching a peak of about 1800 ISA instructions for common CPUs and GPUs [soc17a, soc17c, soc17d, soc17e, LAR+15]. Chip designers such as ARM (originally Acorn RISC Machine, later Advanced RISC Machine, a family of reduced instruction set computing (RISC) architectures configured for various environments) continue to use the huge number of available gates to duplicate cores or to implement new algorithms for the branch prediction problem [soc17c], inspired by deep-learning technologies. This is one way to go further. The rapid growth and popularity of heterogeneous architectures such as Field Programmable Gate Arrays (FPGAs) as implementation media for digital circuits has also drawn research attention. Advances in process technology have greatly enhanced the logic capacity of FPGAs and made them a viable implementation alternative for large and complex designs. This is another way forward for industry. But the big picture, for researchers, could take the form of new discussions on pioneering processor paradigms [soc17a].
An FPGA is a chip whose logic circuit is hardware-programmable. It comprises many small blocks, called slices, that can each be programmed to perform a simple operation such as an addition or a simple conditional statement. All signals are processed simultaneously, so certain problems can be handled at much higher speed, which makes FPGAs advantageous over other computing platforms. Our thesis is to use the gates not to increase chip complexity but to implement circuits, generated on the fly, that solve building blocks or skeletons of well-isolated, well-defined (sub)problems. It is also well known, however, that the programmable nature of FPGA logic and routing resources has a dramatic effect on the final device's area, speed and power consumption. Their reconfigurability makes them very effective for low- to medium-volume production: they are easy to program and debug, and have lower non-recurring expense (NRE) costs and shorter time-to-market. FPGAs allow us to think in terms of mathematical expressions and all sorts of operations rather than in terms of a fixed register file and a collection of predefined instructions (as in all CPUs). The implementation of machine learning in a wide range of applications has had a great impact on science and society. Machine learning has tremendous potential for value creation, and none of that potential should go to waste. For this reason, the research world has been working on how to advance the field, and many initiatives have been put in place so that people can readily access the knowledge and tools they
will need to realize their full potential. Moreover, efforts continue to inspire capable people all over the world to dedicate their talents to value creation through Artificial Intelligence (AI) and machine learning. AI creates a fantastic number of opportunities, and these opportunities should be open to all rather than reserved for a few. In the near future, machine learning and artificial intelligence will automate many jobs. If everyone can use AI to solve their own problems, then AI becomes a tool that empowers individuals. As such, value creation through AI should be made as broadly available as possible, making economic control more distributed and preventing a potentially dangerous centralization of power. A recent survey by Gartner concluded that 75% of companies are either investing now or planning to invest in analytics and big data solutions within two years. Due to this increased interest, analysts speculate that big data initiatives will soon amount to $242 billion [Tri17]. Machine learning systems have been around since the 1950s, so why are we suddenly seeing breakthroughs in so many diverse areas? Three factors are at play: enormously increased data, significantly improved algorithms, and substantially more powerful computer hardware. Over the past two decades (depending on the application), data availability has increased as much as 1,000-fold, key algorithms have improved 10-fold to 100-fold, and hardware speed has improved by at least 100-fold. According to MIT's Tomaso Poggio, these can combine to generate improvements of up to a millionfold in applications such as the pedestrian-detection vision systems used in self-driving cars [BM17]. The most important thing to note about ML is that it represents a fundamentally different approach to creating software: the machine learns from examples, rather than being explicitly programmed for a particular outcome. This is an important break from previous practice. In the next sections, we present our context, the problem statement, our contribution and the work plan.
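To make the earlier description of FPGA slices concrete (small blocks each performing a simple operation such as addition), the following plain-Scala sketch models how 1-bit full adders built only from logic gates can be chained into a ripple-carry adder. All names and the decomposition are illustrative, not taken from the thesis code.

```scala
// Software model of a 1-bit full adder built only from logic gates,
// the kind of tiny function a single FPGA slice (LUT) can implement.
object FullAdder {
  def xor(a: Boolean, b: Boolean): Boolean = a != b

  // Returns (sum, carryOut) for one bit position.
  def fullAdder(a: Boolean, b: Boolean, cin: Boolean): (Boolean, Boolean) = {
    val s  = xor(xor(a, b), cin)
    val co = (a && b) || (cin && xor(a, b))
    (s, co)
  }

  // Chaining slices yields a ripple-carry adder over bit vectors (LSB first).
  def rippleAdd(a: Seq[Boolean], b: Seq[Boolean]): Seq[Boolean] = {
    var carry = false
    val sums = a.zip(b).map { case (x, y) =>
      val (s, co) = fullAdder(x, y, carry); carry = co; s
    }
    sums :+ carry
  }
}
```

On an FPGA all bit positions exist as physical circuits and switch in parallel; the sequential loop here only mimics the wiring, not the timing.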

1.2 Work Context

As data get bigger and bigger, big problems arise. Studying the impact of this increase in data volume on machine learning algorithms, and the possible solutions related to their implementation on heterogeneous hardware, was at the heart of my internship at the University of Paris 13, more precisely at the Laboratoire d'Informatique de Paris Nord (LIPN), a computer science laboratory of the Institut Galilée, from 1 April to 29 September 2017, under the supervision of Pr. Mustapha LEBBAH and Pr. Christophe CERIN, entitled "Mean Shift Clustering for Heterogeneous Architecture". The internship involves a preliminary investigation of the problems and solutions related to the implementation of an efficient clustering algorithm (a building block of many fundamental algorithms) on heterogeneous hardware. It is based on the mean-shift algorithm created by Gaël Beck during his internship at the Computer Science Laboratory (Laboratoire d'Informatique de Paris Nord, LIPN) of the University of Paris 13, with Mustapha Lebbah, Tarn Duong and Hanene Azzag [soc17f]. Its purpose was to provide an efficient distributed implementation to cluster large multivariate, multidimensional data sets (Big Data). Nearest neighbor mean shift (NNMS) defines clusters in terms of locally dense regions in
the data density. The main advantages of NNMS are that it can automatically detect the number of clusters in the data set and can detect non-ellipsoidal clusters, in contrast to k-means clustering. Exact nearest-neighbor calculations in the standard NNMS cannot be used on Big Data, so approximate nearest neighbors via Locality Sensitive Hashing (LSH), based on random scalar projections of the data, were introduced. To further improve scalability, they implemented NNMS-LSH in the Spark/Scala ecosystem for distributed computing.
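The random-scalar-projection hashing mentioned above can be sketched in a few lines of Scala. This is a generic illustration of the scheme, not the NNMS-LSH implementation itself; the parameter names (w, b, r) are illustrative.

```scala
// Sketch of locality-sensitive hashing by random scalar projection:
// the bucket id is floor((w . x + b) / r), so nearby points tend to
// land in the same bucket while distant points rarely do.
import scala.util.Random

object ScalarProjectionLSH {
  def hash(x: Array[Double], w: Array[Double], b: Double, r: Double): Long = {
    val dot = x.zip(w).map { case (xi, wi) => xi * wi }.sum
    math.floor((dot + b) / r).toLong
  }

  // A random Gaussian direction to project onto.
  def makeProjection(dim: Int, rng: Random): Array[Double] =
    Array.fill(dim)(rng.nextGaussian())
}
```

In practice several independent projections are combined so that candidate nearest neighbors are looked up only inside matching buckets, avoiding the exact all-pairs distance computation.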

1.3 Problem Statement

Big data are now rapidly expanding in all science and engineering domains. While the potential of these massive data is undoubtedly significant, fully making sense of them requires new ways of thinking and novel learning techniques to address the various challenges that result. Moreover, as data size increases, the performance of algorithms becomes more dependent upon the architecture used to store and move data. Parallel data structures, data partitioning and placement, and data reuse become more important with growth in data size [LGEC17]. Figure 1.1 illustrates the kind of problem one can face due to huge volumes of data.

Figure 1.1: Huge volume of data requires new techniques

Many learning algorithms rely on the assumption that the data being processed can be held entirely in memory or in a single file on disk [LGEC17]. In such cases, planning how the data will be stored in the hardware is complicated. Questions arise such as: will the input data be loaded in chunks, or will it be streamed as a sequence of bytes? Most ML algorithms as written in textbooks assume easy access to the entirety of the input data, which is not the case on hardware, so this needs to be planned first.
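The "load in chunks" option above can be sketched in plain Scala: an iterator yields the input in fixed-size blocks, so an algorithm never needs the whole data set in memory at once. The chunk size and the running-mean example are illustrative assumptions, not part of the thesis implementation.

```scala
// Sketch of chunked input: process data block by block instead of
// assuming the entire data set fits in memory.
object ChunkedInput {
  def inChunks[A](data: Iterator[A], chunkSize: Int): Iterator[Seq[A]] =
    data.grouped(chunkSize).map(_.toSeq)

  // Example: a running mean computed one chunk at a time.
  def streamingMean(data: Iterator[Double], chunkSize: Int): Double = {
    var sum = 0.0; var n = 0L
    for (chunk <- inChunks(data, chunkSize)) {
      sum += chunk.sum; n += chunk.length
    }
    sum / n
  }
}
```

The same structure maps naturally to hardware, where a chunk corresponds to whatever fits in the fast on-chip memory at one time.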


One objective of our work, in the medium term, was to implement a notebook on heterogeneous hardware and to exploit the potential of this hardware to accelerate machine learning algorithms. With memory scaling lagging behind processor speed, proper memory management normally dictates the performance of an application. Parallel computing has only exacerbated the problem by introducing additional congestion on the memory access pathways with an increased number of processing nodes. As shown in previous work on padding and tiling, not all memory usage patterns yield the same performance. Since fast memory is limited, memory management has become a game of accelerating the memory accesses in the sections of code that bring the largest benefit to the entire program execution. Proper memory management during the program development phase is lost on all but the most experienced programmers. It is easy to see that performance comes from servicing a larger portion of memory accesses from the memory closer to the processor; power, on the other hand, has not benefited from the same amount of attention. To achieve our objective, we need to answer the question of how the data will be stored in the hardware. Once this is addressed, and the algorithm planned, we need to implement it on the hardware in a traditional hardware description language (HDL) such as Verilog or VHDL. Because the semantics of these dominant traditional HDLs are based around simulation, synthesisable designs must be inferred from a subset of the language, complicating tool development and designer education. These languages also lack the powerful abstraction facilities common in modern software languages, leading to low designer productivity by making it difficult to reuse components. To work around these limitations, Chisel was developed on top of the Scala programming language.
In our project, we translate our algorithm into Chisel syntax.

1.4 Contribution

Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms with good performance is difficult or infeasible; it often consists of matrix and tensor operations. Such calculations benefit greatly from parallel computing, which is why model training is often performed on graphics cards. This brings the need to invent new languages and concepts, or adapt existing ones, to address the difficulties faced when implementing an algorithm on heterogeneous hardware, which requires new compiler transformations and optimization rules. Porting applications to heterogeneous architectures thus often requires complete algorithm redesign, as some programming patterns require great care for efficient implementation on a heterogeneous platform. Our work does not solve the problem of implementing machine learning algorithms on heterogeneous architectures, but rather provides tools and possible directions for further studies, as this problem raises many research questions. Our investigation brings out some important points:

• the advantages of the new programming language Chisel, and its challenges;

• comparisons of some Scala and Chisel functions;

• technical difficulties that resulted from our experiments;

• some open questions with Chisel.

These points can help both in the development of Chisel and in the implementation of machine
learning algorithms on heterogeneous architectures. They are elaborated further in Chapter 4, where we present our experimentation with Chisel. Moreover, one of the main challenges encountered in computations with Big Data comes from the simple principle that scale, or volume, adds computational complexity; machine learning algorithms therefore tend to be slow. One part of our work has been to study the various available approaches to this issue, which led to our interest in FPGAs: we studied their characteristics as well as ways to exploit their parallelism efficiently. However, operations on floating-point numbers (Double in this case) are not supported directly by any HDL. The reason is that, while addition, subtraction and multiplication of fixed-point numbers are well defined, there are many design-space trade-offs for floating-point hardware, which is a much more complex piece of hardware. A high-performance floating-point unit is a significant piece of hardware in its own right and would be time-shared in any realistic design. Fortunately, Chisel does have a native FixedPoint type, found in the experimental package, which may be of use.
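The fixed-point idea behind such a type can be sketched in plain Scala: a real number is stored as an integer scaled by 2^bp, where bp is the number of fractional (binary point) bits. This is only an illustration of the representation, not the Chisel FixedPoint API; all helper names are ours.

```scala
// Sketch of fixed-point arithmetic: values are integers scaled by 2^bp.
// Addition of two values with the same binary point is plain integer
// addition; multiplication doubles the scale, so one rescale is needed.
object FixedPointSketch {
  def toFixed(x: Double, bp: Int): Long = math.round(x * (1L << bp))
  def toDouble(f: Long, bp: Int): Double = f.toDouble / (1L << bp)

  def add(a: Long, b: Long): Long = a + b                  // same bp assumed
  def mul(a: Long, b: Long, bp: Int): Long = (a * b) >> bp // rescale once
}
```

This well-defined integer behavior is exactly why fixed point maps cleanly to hardware, whereas floating point requires a far more complex unit.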

1.5 Document plan

This thesis is organized as follows. Chapter 1 presents the context and problem statement of our work, as well as an overview of our contributions. Chapter 2 introduces notions necessary for the understanding of our work and some related work in this area. Chapter 3 focuses on our algorithm of study, the mean-shift algorithm. Chapter 4 presents our experimentation and open questions with Chisel. Finally, Chapter 5 concludes our work.

2. State of the art

This chapter introduces the tool chain used in this study.

2.1 Heterogeneous architectures

A heterogeneous architecture is a computing system architecture in which processors use more than one instruction set, all sharing a single memory. This requires programs to be written differently for each of the dissimilar instruction sets. The goal is to offer substantially better performance or cost by devoting the appropriate parts of the application to machine designs optimized for specific types of computing. The past decade has seen an explosion in processor parallelism with the move to multi-cores. The next major architectural trend is the move from symmetric multiprocessors to heterogeneous systems with multiple different processors and memories. Node-level heterogeneous architectures have become attractive during the last decades due to their high peak performance and energy and/or cost efficiency compared to traditional symmetric CPUs. With the increase of fine-grained parallelism in high-performance computing, and the introduction of parallelism in workstations, a good overview and understanding of these architectures is crucial. Heterogeneous computing refers to systems that use more than one kind of processor. A major challenge in utilizing heterogeneous resources is the diversity of devices on different machines, which provide widely varying performance characteristics. Researchers are proposing ever more asymmetric heterogeneous architectures with varied types of cores on a single chip. In heterogeneous machines there are now multiple processors with wildly different features and performance characteristics. As a result, it becomes extremely difficult for a single, comprehensible model to capture behavior across the many processors and memory subsystems on which code may run, in order to reason about mapping a program to a heterogeneous machine.
Before compiling a program for a heterogeneous machine, questions such as which algorithm to use, on which resources to place computations and data, how much parallelism to exploit, and how to use specialized memories need to be addressed. Graphical Processing Units (GPUs) are very good at parallelizing mathematical operations, which is the basis of both computer graphics and cryptography. They are specially designed to be fast at mathematical operations, since drawing things onto the screen is all math (plotting vertex positions, matrix manipulations, mixing RGB values, and reading texture space, to name a few), and they make heavy use of parallelism. CPUs have a component dedicated to arithmetic, the Arithmetic Logic Unit (ALU). Notice the difference between CPU and GPU architecture in Figure 2.1.


Figure 2.1: [cla] CPU architecture vs. GPU architecture

Heterogeneous computing environments allow for making the best use of both:

Figure 2.2: [cla] GPU acceleration

Phothilimthana et al. [PARKA13] present a solution to the problem of efficiently programming diverse heterogeneous systems based on empirical auto-tuning. Their autotuner is able to switch between entirely different algorithmic techniques encoded in the program and between multiple strategies for mapping them to heterogeneous computational resources, and to combine the different techniques into a large search space of poly-algorithms. They showed experimentally how different heterogeneous systems require different algorithmic choices for optimal performance. The current tendency to increase performance through parallelism instead of clock frequency relies on multi-chip, multi-core or multi-context parallelism. Flynn's taxonomy [BDH+10] defines four levels of parallelism in hardware:

• single instruction single data (SISD),

• single instruction multiple data (SIMD),

• multiple instruction single data (MISD), and

• multiple instruction multiple data (MIMD).


In addition, two subdivisions of MIMD are single program multiple data (SPMD), and multiple program multiple data (MPMD). These terms are used to describe the architectures.
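The SPMD style in particular can be illustrated in plain Scala: the same program runs over multiple data partitions, one task per partition, and the partial results are combined at the end. The partition count and partial-sum example are illustrative assumptions.

```scala
// Sketch of SPMD (single program, multiple data): identical code is
// applied to different partitions of the data in parallel tasks.
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

object Spmd {
  def parallelSum(data: Vector[Int], partitions: Int): Int = {
    val chunks = data.grouped(math.max(1, data.length / partitions)).toVector
    val tasks  = chunks.map(c => Future(c.sum))   // same code, different data
    Await.result(Future.sequence(tasks), Duration.Inf).sum
  }
}
```

This is also the execution model of the Spark implementation of NNMS-LSH mentioned earlier: one program, many data partitions.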

Figure 2.3: [BDH+10] Three types of heterogeneous architectures

The single-chip CBEA, illustrated in Figure 2.3 (a), consists of a traditional CPU core and eight SIMD accelerator cores, where each core can run a separate program in MPMD fashion and communicate through a fast on-chip bus, making it very flexible; its main design criterion is to maximise performance while consuming a minimum of power. Figure 2.3 (b) shows a GPU with 30 highly multi-threaded SIMD accelerator cores in combination with a standard multicore CPU. The GPU has vastly superior bandwidth and computational performance, and is optimized for running SPMD programs with little or no synchronization; it is designed for high-performance graphics, where throughput of data is key. Finally, Figure 2.3 (c) shows an FPGA, consisting of an array of logic blocks, in combination with a standard multi-core CPU; an FPGA can also incorporate regular CPU cores on-chip, making it a heterogeneous chip by itself. FPGAs can be viewed as user-defined application-specific integrated circuits (ASICs) that are reconfigurable. They offer fully deterministic performance and are designed for high throughput, for example in telecommunication applications. These three heterogeneous architectures differ in level of parallelism, communication possibilities, performance and cost. Parallelization provides a key pathway for scaling up machine learning to large datasets and complex methods, and as large-scale machine learning has become increasingly popular in both industrial and academic research communities, the computing industry is approaching a formidable obstacle course where anyone wishing to drive advances in computing technology must carefully negotiate several key trade-offs. Figure 2.4 shows an illustration of serial and parallel computing.


Figure 2.4: Serial and parallel computing, image by David Taylor

Consumers demand improved battery life, size and weight for their laptops, tablets and smartphones; likewise, data center power demands and cooling costs continue to rise. We want to access our devices through more natural interfaces (speech and gesture), and we want devices to manage ever-expanding volumes of data (home movies, pictures, and a world of content available in the cloud). To navigate this complex set of requirements, the computer industry needs a more efficient approach to computer architecture, one that promises to deliver improvement across all four of these vectors: power, performance, programmability and portability. Distributed and parallel processing of very large datasets has been employed for decades in specialized, high-budget settings, such as financial and petroleum industry applications. There has been dramatic progress in the usability, cost-effectiveness and diversity of parallel computing platforms, whose popularity is growing for a broad set of data analysis and machine learning tasks. The evolution of hardware architectures and programming frameworks that make it easy to exploit the types of parallelism realizable in many learning algorithms has led to rising interest in scaling up machine learning applications. The increased attention to large-scale machine learning is also due to the spread of very large datasets (often accumulated on distributed storage platforms) across many modern applications, motivating the development of learning algorithms that can be distributed appropriately. FPGA architecture has a dramatic effect on the quality of the final device's speed performance, area efficiency and power consumption. FPGAs leave more architectural decisions open than CPUs or GPUs, whose vendors (e.g. Intel, AMD, NVIDIA or ARM) have already fixed the instruction set architecture, the capabilities of the I/O subsystem and the sizes of the caches. FPGAs are very powerful, but they have a limited number of built-in memory resources. If more memory is required than the FPGA can store, then the data must be held outside the FPGA in slower memory resources. Accessing this slow memory can often be a bottleneck in memory-intensive processing operations.
Therefore, optimizing the amount of memory required and the memory access patterns is very important in FPGA designs. Fortunately, work has been done in [MDS09] to show the trade-offs of using one memory component over another; by exploiting these trade-offs, a designer can find the optimum on-chip memory system for a given application [MDS09]. Figure 2.5, taken from [MDS09], shows four different memory designs. The authors found that for a Sobel edge detector, combining registers with block memory works best. All these properties of an FPGA are what motivate us to take it as our platform of interest.

Figure 2.5: FPGA memory trade-off [MDS09]

2.2 FPGA Architecture

Field-Programmable Gate Arrays (FPGAs) have been one of the key digital circuit implementation media over the last decade. One of the most important parts of their creation lies in their architecture, which governs the nature of their programmable logic functionality and their programmable interconnect. They are pre-fabricated silicon devices that can be electrically programmed to become almost any kind of digital circuit or system. They provide a number of compelling advantages over fixed-function Application Specific Integrated Circuit (ASIC) technologies such as standard cells. FPGAs are configured in less than a second (and can often be reconfigured if a mistake is made) and cost anywhere from a few dollars to a few thousand dollars. Nevertheless, the flexible nature of an FPGA comes at a significant cost in area, delay, and power consumption: an FPGA requires approximately 20 to 35 times more area than a standard cell ASIC, has a speed performance roughly 3 to 4 times slower than an ASIC and consumes roughly 10 times as much dynamic power [KTR08]. These disadvantages arise largely from an FPGA's programmable routing fabric, which trades area, speed, and power in return for "instant" fabrication. Despite these disadvantages, FPGAs present a compelling alternative for digital system implementation based on their fast turnaround and low volume cost. For small enterprises or small entities within large corporations, FPGAs provide the only economical access to the scalability and performance provided by Moore's law [soc17h]. Designing parallel multicore systems from available standard intellectual-property blocks while maintaining high performance is also a challenging issue. Softcore processors and field programmable gate arrays (FPGAs)


are a cheap and fast option to develop and test such systems. For this study, we used the DE2-115 FPGA Development Board shown in Fig. 2.6, donated by Altera and Intel, to conduct our experiments. The two essential technologies which distinguish FPGAs are their architecture and the computer-aided design (CAD) tools that a user must employ to create FPGA designs. An example of a multicore/FPGA accelerator framework is the GRVI Phalanx (a massively parallel RISC-V FPGA accelerator framework: a 1680-core, 26 MB SRAM parallel processor overlay on a Xilinx UltraScale+ VU9P) [soc18b].

Figure 2.6: DE2-115 FPGA Development Board

FPGA programming is actually (re)configuring FPGAs using a Hardware Description Language (Verilog/VHDL) to connect their logic blocks and interconnects in a way that they can perform a specific functionality (adders, multipliers, processors, filters, dividers, etc.). The FPGA programming language is commonly called a Hardware Description Language since it is actually used to describe or design hardware. Programming an FPGA is necessary in order to make it ready for use. The two most popular synthesis and implementation tools for FPGAs are the Xilinx ISE/Vivado Design Suite for Xilinx FPGAs and Quartus II for Intel Altera FPGAs. We can also simulate our designs with these tools; the simulators (Xilinx ISIM and ModelSim-Altera) are integrated into Xilinx ISE and Quartus II respectively. The tools enable synthesizing code and generating the bit stream file for FPGA programming. FPGAs consist of an array of programmable logic blocks of potentially different types, including general logic, memory and multiplier blocks, surrounded by a programmable routing fabric that allows blocks to be programmably interconnected, as shown in Figure 2.7. The array is surrounded by programmable


input/output (I/O) blocks that connect the chip to the outside world.

Figure 2.7: Basic FPGA structure [KTR08]

The "programmable" term in FPGA indicates an ability to program a function into the chip after silicon fabrication is complete. This customization is made possible by the programming technology, a method that can change the behavior of the pre-fabricated chip after fabrication, in the "field", where system users create designs. The first FPGAs came onto the market in 1985, and sales since then have increased dramatically. The market is now dominated by a couple of major players, and is expected to amount to USD 9.8 billion in 2020 (Wikipedia) [soc17g]. The first static memory-based FPGA (commonly called an SRAM-based FPGA) was proposed by Wahlstrom in 1967 [KTR08]. It allowed both logic and interconnect to be configured using a stream of configuration bits. The first modern-era FPGA was introduced by Xilinx in 1984 [KTR08]. It contained the now classic array of Configurable Logic Blocks, with 64 logic blocks and 58 inputs and outputs [KTR08]. Since then, FPGAs have grown enormously in complexity; modern FPGAs can contain approximately 330,000 equivalent logic blocks and around 1100 inputs and outputs, in addition to a large number of more specialized blocks that have greatly expanded the capabilities of FPGAs [KTR08]. These massive increases in capabilities have been accompanied by significant architectural changes.


Figure 2.8: FPGA Architecture

Every FPGA relies on an underlying programming technology that is used to control the programmable switches that give it its programmability. There are a number of programming technologies available, and their differences have a significant effect on programmable logic architecture. The approaches that have been used historically include EPROM, EEPROM, flash, static memory, and antifuses [KTR08]. Only the flash, static memory and anti-fuse approaches are widely used in modern FPGAs.

2.2.1 Language for FPGA design. Programming an FPGA can be done in two ways:

• Graphical design: logic gates and tools in the library of the compiler program (ISE, Quartus, etc.) are used.

• Hardware Description Language (HDL).

The two major Hardware Description Languages are Verilog HDL and VHDL. Verilog = "Verification" + "Logic"; it was originally created to operate as a logic simulator by P. Goel, P. Moorby, C.-L. Huang, and D. Warmke in 1984. Verilog was acquired by Cadence in 1990 and became IEEE Standard 1364 in 1995. Today Verilog is commonly used for the design and verification of digital circuits, and of analog or mixed-signal circuits as well. VHDL, on the other hand, is the VHSIC (Very High Speed Integrated Circuits) Hardware Description Language, initially developed by the US Defense Department in the 1980s to research and develop very high speed integrated circuits. VHDL became IEEE Standard 1076 in 1987 and is now mostly used to model digital and analog mixed-signal circuits for devices such as FPGAs and ASICs. The Hardware Description Language (HDL) was invented as a simulation language; synthesis was an afterthought. Many of the basic techniques for synthesis were developed at Berkeley in the 80s and applied commercially in the 90s. Despite the recent push toward high-level synthesis (HLS), hardware description languages (HDLs) remain king in field programmable gate array (FPGA) development. However, despite the increased


availability of rich IP libraries and even IP generators that span a wide range of application domains, developing hardware today is limited to experts, takes more time and is more expensive than ever. Moreover, because these languages are used for simulation, a synthesizable design must be inferred from a subset of the language, complicating tool development and designer education. These languages also lack the powerful abstraction facilities that are common in modern software languages, which leads to low designer productivity by making it difficult to reuse components. Constructing efficient hardware designs requires extensive design-space exploration of alternative system microarchitectures, but traditional HDLs have limited module generation facilities and are ill-suited to producing and composing the highly parametrized module generators required to support thorough design-space exploration. More importantly, FPGA programming or FPGA design is about designing digital logic circuits to define the behavior of FPGAs, while software programming is about the execution of a sequence of sequential instructions to perform a specific behavior in software. Simulation is used to verify the functional correctness of the Verilog/VHDL code. There are many simulators out there, but ModelSim is the one most often used for functional simulation. Also, a test bench which gives all the possible combinations of the input values to verify the design should be written. The test bench coding style need not be as restricted as the synthesizable coding style, so software behavioral constructs (for loops, while loops, etc.) can be used to generate the input patterns for functional simulation. After successful verification of the design in functional simulation, we can next think of synthesis and implementation.
The synthesis and implementation tools mentioned above (Xilinx ISE/Vivado and Quartus II) then synthesize the code and generate the bit stream file for FPGA programming. FPGAs allow us to think in terms of mathematical expressions and all sorts of operations rather than in terms of a fixed register file and a collection of predefined instructions (as in all CPUs). For example, if we want to implement a modulo-10 Gray-code counter, we can build this directly in logic and use a small 4-bit register for storing the count state.
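As a software sketch of what such a counter cycles through, here is one possible 4-bit, 10-state cyclic Gray-code sequence modeled in Python (the particular sequence is an illustrative choice, not necessarily the one a synthesis tool would produce):

```python
# One possible modulo-10 Gray-code sequence: 10 four-bit states in which
# consecutive states (including the wrap-around) differ in exactly one bit.
GRAY10 = [0b0000, 0b0001, 0b0011, 0b0010, 0b0110,
          0b0111, 0b0101, 0b0100, 0b1100, 0b1000]

def next_state(state):
    """Behavioral model of the counter: advance to the next Gray code."""
    i = GRAY10.index(state)
    return GRAY10[(i + 1) % len(GRAY10)]

# Every transition flips exactly one bit, the defining Gray-code property.
for s in GRAY10:
    assert bin(s ^ next_state(s)).count("1") == 1
```

In hardware, the same next-state function would be a small block of combinational logic feeding the 4-bit state register.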

2.3 Machine learning and heterogeneous architecture

Machine learning is a field of research that formally focuses on the theory, performance, and properties of learning systems and algorithms; it is a highly interdisciplinary field built upon ideas from many different fields such as artificial intelligence, optimization theory, information theory, statistics, cognitive science, optimal control, and many other disciplines of science, engineering, and mathematics. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms with good performance is difficult or unfeasible; it often consists of matrix and tensor operations. Thus, such calculations benefit greatly from parallel computing, which leads to model training being performed on graphics cards. Several works have shown that FPGAs are a real opportunity for efficient embedded hardware implementations of image feature detectors [AH16] for instance, but specific problems need to be solved by means of architectural innovations. Modern FPGA devices incorporate embedded resources that facilitate architecture design and optimization. In the following subsection, we introduce some important works on the implementation of machine learning algorithms on FPGAs. In [AH16], the authors observed that FPGA devices were limited in high clock frequencies and high


bandwidth external memory, on which image interest point detector performance depends to some extent. As such, specialized memory management units had to be developed to optimize the scheduling of the external memory accesses in the FPGA. Also, most of the reviewed research works on image interest point detector implementations sacrifice accuracy by avoiding floating-point arithmetic and/or altering the original detector algorithm so as to ease the hardware implementation of hardware-greedy arithmetic operators, for instance division and square root, at the cost of some precision [AH16] in the internal arithmetic operations. The probability density function (PDF) is a theoretical interpretation of the data at hand, which usually assumes that the available data are distributed as the population they came from; it is thus a key concept in statistics with many practical applications. Constructing the most adequate PDF from the observed data is still an important and interesting research problem, especially for large datasets. PDFs are often estimated using nonparametric, data-driven methods, of which one of the most popular is kernel density estimation (KDE). However, this nonparametric method faces a very serious drawback due to the large number of computations required to find the optimal bandwidth (smoothing) parameter (time complexity O(n^2)). Artur et al. [GSG16] presented the possibility of utilizing Field-Programmable Gate Arrays (FPGA) to accelerate the search for such an optimal bandwidth. To achieve this, they made use of a popular and often-used algorithm called PLUGIN, speeding up this complex numerical algorithm using a hardware-based approach on FPGA chips. Porting an algorithm to a hardware implementation is often very time consuming, since traditional hardware description languages such as VHDL and Verilog are quite low level and require a hardware development mindset.
To overcome this disparity and increase the productivity of developers, a number of high-level languages have been introduced that aim at simplifying the porting of algorithms onto FPGAs. Different approaches are taken by such languages. One approach is to extend a common software language with parallel and hardware constructs so that it can be used as an algorithmic hardware description language, with the advantage of hiding much of the low-level control; an example is Handel-C. Another approach is to compile "standard" serial code to a parallel FPGA; this lets the compiler analyze the serial code to identify potential parallelism and exploit it when mapping the code to hardware. Examples include the Matlab-based Match and the C-based SA-C. A characteristic of these languages is that they hide the parallel nature of the hardware completely, allowing existing serial code to be used directly. Yet another approach is to construct a new hardware construction language that supports advanced hardware design using highly parametrized generators and layered domain-specific hardware languages, and can generate corresponding hardware code for implementation on FPGAs; an example is the Chisel language (used within programs written in Scala). The limitation of the first two approaches is that the underlying algorithm is serial, since all effort is placed in developing an efficient algorithm rather than optimizing the underlying architecture for the necessary computation. To achieve a significant speed improvement, a large fraction of the algorithm must be parallelizable (Amdahl's law, usually used to calculate how much a computation can be sped up by running part of it in parallel). As a result, the performance of a parallel implementation is particularly sensitive to the quality of the design [BJ10].
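Amdahl's law can be stated concretely; the following minimal Python sketch computes the ideal speedup for a parallel fraction p of the work on n processors:

```python
def amdahl_speedup(p, n):
    """Ideal overall speedup when a fraction p of the work is
    parallelized across n processors (Amdahl's law)."""
    return 1.0 / ((1.0 - p) + p / n)

# With 95% of an algorithm parallelized, even very many processors
# cannot push the overall speedup past 1 / (1 - 0.95) = 20x.
print(amdahl_speedup(0.95, 8))
print(amdahl_speedup(0.95, 1024))
```

This is why the serial residue left by the first two approaches matters: it caps the achievable speedup regardless of how much hardware parallelism is available.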
Therefore, particular attention needs to be paid when implementing an algorithm on an FPGA, so that we not only exploit the obvious and relatively straightforward transformations, but also transform the underlying algorithm to exploit other types of parallelism that are compatible with the data flow. The key to an efficient hardware


implementation is not to port an existing serial algorithm but to transform the algorithm [BJ10]. This makes the last approach (Chisel) the most appropriate, since it generates a hardware version of the code that is, to an extent, equivalent to hand-written hardware code and can effectively exploit the parallelism available on FPGAs.

2.4 Constructing Hardware In Scala Embedded Language (Chisel)

This section is about Chisel [BVR+12].

Figure 2.9: Simulation and synthesis; Memory IP is target-specific

Chisel is a new open-source hardware construction language developed at UC Berkeley that supports advanced hardware design using highly parameterized generators and layered domain-specific hardware languages. Its latest version is Chisel3, which uses Firrtl as an intermediate hardware representation language. To work around the limitations of traditional HDLs, Chisel was developed based on the Scala programming language. Chisel is a platform that provides modern programming language features for accurately specifying low-level hardware blocks, and it can readily be extended to capture many useful high-level hardware design patterns. By using a flexible platform, each module in a project can employ whichever design pattern best fits that design, and designers can freely combine multiple modules regardless of their programming model. Moreover, Chisel comprises a set of Scala libraries that define new hardware data types and a set of routines to convert a hardware data structure into either a fast C++ simulator or low-level Verilog for emulation or synthesis. Chisel was created by embedding hardware construction primitives within the Scala programming language. In so doing, the level of hardware abstraction was raised by providing concepts including object


orientation, functional programming, parametrized types, and type inference. Abstraction is an important aspect of Chisel, as it allows users to conveniently create reusable objects and functions, to define their own data types, and to better capture particular design patterns by writing their own domain-specific languages on top of Chisel. Chisel is a high-level, highly parameterized embedded DSL for generating hardware designs. Scala was the chosen language for the following reasons:

1. It has interesting features for building circuit generators (e.g., cache generators, sorting networks and memories). A key motivation for embedding Chisel in Scala is to support highly parameterized circuit generators, which is a weakness of traditional HDLs.

2. It is developed as a base for domain-specific languages.

3. It compiles to the JVM.

4. It has a large set of development tools and IDEs.

5. It has a fairly large and growing user community.

At the simplest level, Scala is used to create circuit components and to connect these components together, but the real power of Scala is to provide a framework for expressing parameterized models of circuit generators. In Chisel syntax, some keywords are part of Scala, and ports are used as the interface to a hardware component. Chisel supports recursive creation of hardware subsystems, which is not the case for Verilog. Memories are given special treatment in Chisel since hardware implementations of memory have many variations. Chisel defines a memory abstraction that can map either to a simple Verilog behavioral description, or to instances of memory modules that are available from external memory generators provided by foundry or IP vendors.
In short, Chisel allows a memory to be defined with a single write port and multiple read ports as follows:

    Mem(depth: Int, target: Symbol = 'default, readLatency: Int = 0)

where depth is the number of memory locations, target is the type of memory used, and readLatency is the latency of the read ports. Additional parameters are available to mimic common memory behaviors and to aid with the process of mapping to real-world hardware. Fast simulation is crucial to reduce hardware development time; as such, a fast C++ simulator for RTL debugging was developed for Chisel. The Chisel compiler produces a C++ class for each Chisel design, with a C++ interface including clock-low and clock-high methods. Chisel is still in its development phase in the labs of UC Berkeley, and it is supported by some of the biggest players, such as StarNet, C-FAR, LBNL, Intel, Google, LG, Nvidia, Samsung, Oracle and Huawei. We can conclude that Chisel makes the power of a modern software programming language available for hardware design, supporting high-level abstractions and parameterized generators without mandating a particular computational model, while also providing high-quality Verilog RTL output and a fast C++ simulator. A Chisel program typically consists of several steps:

• A Chisel3 program first constructs an internal representation of an idealized circuit as an abstract syntax tree (AST). At the end of generation, the AST is serialized into a FIRRTL (intermediate representation) representation.
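For intuition about what a readLatency parameter means, here is a small Python behavioral model (an approximation for illustration only, not Chisel's actual Mem semantics): a read issued at cycle t yields its value readLatency clock ticks later.

```python
class MemModel:
    """Behavioral sketch of a memory with one write port and
    latency-delayed reads (illustrative, not Chisel's Mem)."""
    def __init__(self, depth, read_latency=0):
        self.data = [0] * depth
        self.read_latency = read_latency
        self.in_flight = []  # [ticks_remaining, captured_value, result]

    def write(self, addr, value):
        self.data[addr] = value

    def read(self, addr):
        """Issue a read; the result dict is filled after read_latency ticks."""
        result = {"value": None}
        if self.read_latency == 0:
            result["value"] = self.data[addr]  # combinational read
        else:
            self.in_flight.append([self.read_latency, self.data[addr], result])
        return result

    def tick(self):
        """Advance one clock cycle, completing reads whose latency elapsed."""
        for entry in self.in_flight:
            entry[0] -= 1
            if entry[0] == 0:
                entry[2]["value"] = entry[1]
        self.in_flight = [e for e in self.in_flight if e[0] > 0]

mem = MemModel(depth=4, read_latency=1)
mem.write(2, 7)
r = mem.read(2)   # value not visible yet
mem.tick()        # one cycle later the read completes
```

With read_latency = 0 the model behaves like a combinational (register-file-style) read; with a positive latency it mimics the pipelined behavior of block RAM.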


• The Firrtl transformation engine processes the high-level FIRRTL produced with some number of transformation passes. These passes can optimize the code and do width inference, and finally emit Verilog or low-level Firrtl.

Typically during development the circuit is then unit tested. There are two simple ways to do this:

• The emitted Verilog can be converted into an executable simulation via Verilator [soc17i] (the fastest free Verilog HDL simulator) and a C++ compiler, and the simulation executed with a test harness that validates the circuit. Alternatively, the emitted Firrtl can be simulated using the Firrtl interpreter, a lightweight Scala program capable of running the same unit tests used with the chisel-testers.

These steps can be run together (the chisel-testers can execute all the above steps automatically) or individually, with each step producing output files for the user to add custom integration, or to target the Verilog for an FPGA or a chip tape-out. The JVM is simply the execution environment used to run Scala programs; it is not necessary to understand or interact with it in order to build circuits using Chisel. Also, the netlist that is output by Chisel undergoes a series of very aggressive transformations by the synthesis system before it is laid out on the chip / loaded into the FPGA. The synthesis system requires a number of parameters for tuning, for instance trading energy consumption against critical path length. Depending on the setting of these parameters, one will get very different circuits at the end of the transformations. So to optimize for size and/or energy, one needs to first look at those parameters in the synthesis system before trying to alter the Chisel code. The mapping is not straightforward. Operations on floating point numbers (like the Double type frequently used in machine learning) are not supported directly by any HDL.
This is because, while addition/subtraction/multiplication of fixed-point numbers is well defined, there are a lot of design-space trade-offs for floating-point hardware, as it is a much more complex piece of hardware. Note that Chisel does have a native FixedPoint type, found in the experimental package, which could be used instead. Fundamentally, writing Chisel is writing a Scala program to generate a circuit. This is quite different from High-Level Synthesis: rather than mapping Scala (or Java) primitives to hardware, Chisel executes Scala code to construct a hardware AST that is then compiled to Verilog. With the following annotated example, we look in detail at each part of a Chisel module:

1]  class MyModule(width: Int) extends Module {
2]    val io = IO(new Bundle {
3]      val in = Input(UInt(width.W))
4]      val out = Output(UInt())
5]    })
6]    println(s"Constructing MyModule with width $width")
7]    val counter = RegInit(io.in)
8]    val inc = counter + 1.U
9]    counter := inc
10]   printf("counter = %d\n", counter)
11] }

• (1) The body of a Scala class is the default constructor. MyModule's default constructor has a single Int argument. The superclass Module is a chisel3 class that begins construction of a hardware module. Implicit clock and reset inputs are added by the Module constructor.


• (2) io is a required field for subclasses of Module; new Bundle creates an instance of an anonymous subclass of Chisel's Bundle (much like a struct in C). When executing the function IO(...), Chisel adds ports to the Module based on the Bundle object.

• (3) Input port with width defined by the parameter.

• (4) Output port with width inferred by Chisel.

• (6) A Scala println that will print at elaboration time each time this Module is instantiated. This does NOT create a node in the Module AST.

• (7) Adds a register declaration node to the Module AST. This counter register resets to the value of input port io.in. The implicit clock and reset inputs feed into this node.

• (8) Adds an addition node to the hardware AST with operands counter and 1. + is overloaded, so this is actually a Chisel function call.

• (9) Connects the output of the addition node to the "next" value of counter.

• (10) Adds a printf node to the Module AST that will print at simulation time. The value of counter feeds into this node.

The Scala description itself exports the Firrtl (an intermediate representation (IR) for digital circuits, designed as a platform for writing circuit-level transformations). Chisel function calls like UInt and := actually mutate an AST being constructed by Chisel inside a given Module. Chisel walks this AST, starting at the top-level Module, in order to emit Firrtl. Reflection is used only for naming wires and registers, not for constructing the AST.
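The build-an-AST-then-emit flow can be illustrated with a deliberately tiny Python toy (not Chisel, and far simpler than FIRRTL): the host program builds an expression tree, and a separate pass walks the tree to emit Verilog-like text.

```python
# Toy illustration of the Chisel idea: the host-language program builds
# an AST of the circuit, and a later pass emits HDL-like text from it.
class Wire:
    def __init__(self, name):
        self.name = name

class Const:
    def __init__(self, value):
        self.value = value

class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right

def emit(node):
    """Walk the AST and emit a Verilog-like expression string."""
    if isinstance(node, Wire):
        return node.name
    if isinstance(node, Const):
        return str(node.value)
    if isinstance(node, Add):
        return f"({emit(node.left)} + {emit(node.right)})"
    raise TypeError(node)

# "counter + 1" built as an AST, then emitted as text.
counter_next = Add(Wire("counter"), Const(1))
line = f"assign counter_next = {emit(counter_next)};"
print(line)  # assign counter_next = (counter + 1);
```

In Chisel the AST nodes are created as a side effect of executing overloaded operators like + and :=, rather than being constructed explicitly as here.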

3. Machine Learning

In this section, we say more about machine learning and about the part of machine learning that was used as a case study. Machine learning is a branch of Artificial Intelligence, born in the 1950s when extremely rapid scientific progress made many optimistic about the possibility that computers could achieve human-like intelligence in a matter of decades. Machine learning is the science of building hardware or software that can achieve tasks such as extracting features from data in order to solve predictive problems, automatically learning to recognize complex patterns, and making intelligent decisions based on insight generated by learning from examples; in other words, it is the use of algorithms to create knowledge from data. For accuracy, models must be trained, tested and calibrated to detect patterns using previous experience before being deployed. The field evolved from the study of pattern recognition and computational learning theory in artificial intelligence. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms with good performance is difficult or unfeasible; it often consists of matrix and tensor operations. Thus, such calculations benefit greatly from parallel computing, which leads to model training being performed on graphics cards. There are four types of algorithms used in machine learning:

• Supervised (predictive): the vast majority of systems today. These systems are 'trained' on past data in an attempt to predict future outcomes.

• Unsupervised (exploratory): these systems try to build models of the analyzed process by themselves; this is also known as clustering.

• Semi-supervised: a combination of the first two, where a small amount of data is 'labeled' (i.e., related to known training rules) and the machine uses this as a seed to label the rest of the data.

• Reinforcement: the algorithm creates its rules through trial and error.
Modern data analysis requires a number of tools to uncover hidden structures. To discover these hidden structures, initial exploration of the data by animated scatter diagrams and by nonparametric density estimation, in its many forms and varieties, are the techniques of choice. In the following subsection, we introduce important notions on the techniques that are of interest in our work, and elaborate on the mean-shift algorithm that is the focus of this study.

3.1 Cluster Analysis

Informally, one would say that clustering is finding natural groupings among objects. Cluster analysis groups data objects based only on information found in the data that describes the objects and their relationships. The grouping (clustering) is done so that the clusters are either meaningful, useful, or both. If the grouping is aimed at having meaningful clusters, then the clusters should capture the natural structure of the data. If the grouping is aimed at usefulness, then the analysis provides an abstraction from individual data objects to the clusters in which those data objects reside. However, in some cases, cluster analysis is only a useful starting point, for instance for data summarization. Clustering is key for many exploratory tasks, and as such many clustering algorithms have been proposed, for example: centroid-based clustering (e.g., K-means); distribution-based clustering (e.g., Expectation-Maximization with Gaussian mixtures); and density-based clustering (e.g., Mean-shift, DBSCAN).


Most of these existing clustering methods require prior knowledge of the number of clusters as an input parameter, whereas in practice this number is unknown. The Mean-shift and DBSCAN algorithms [RKDZ14], however, are exceptionally appealing nonparametric clustering techniques that do not require prior knowledge of the number of clusters. Mean-shift estimates the number of clusters directly from the data and is able to find clusters with irregular shapes, but it may sometimes fail to find the proper cluster structure when multiple modes exist within one cluster. Unlike Mean-shift, DBSCAN has the drawback of being sensitive to the choice of the neighborhood's radius (called Eps) [RKDZ14]. Given these drawbacks (i.e., the sensitivity to parameter values and the difficulty of handling overlapping clusters) of DBSCAN and Mean-shift, researchers have worked to overcome these disadvantages while preserving the nonparametric nature of the algorithms, as we can see in the paper by Yazhou et al. [RKDZ14].

3.2 The Mean-shift algorithm under study

Mean-shift was first proposed by Fukunaga and Hostetler [FH75], later adapted by Cheng [Che95] for the purpose of image analysis, and recently extended by Comaniciu, Meer and Ramesh [Kon05] to low-level vision problems, including segmentation, adaptive smoothing and tracking. Mean-shift treats the feature space as an empirical probability density function. If the input is a set of points, Mean-shift considers them as sampled from an underlying probability density function. If dense regions (or clusters) are present in the feature space, then they correspond to the modes (or local maxima) of the probability density function. We can also identify the clusters associated with a given mode using Mean-shift.

3.2.1 How it works. Given a data set, Mean-shift associates each data point with a nearby peak of the dataset's probability density function. For each data point, it defines a window around it and computes the mean of the data within the window. It then shifts the center of the window to that mean and repeats until convergence. After each iteration, the window shifts to a denser region of the dataset. In summary, it does three things: it fixes a window around each data point; it computes the mean of the data within the window; then it shifts the window to the mean, repeating until convergence. Mean-shift can be considered a gradient ascent on the density contour. The generic formula for gradient ascent is given by

    x_1 = x_0 + \eta f'(x_0)          (3.2.1)
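As a minimal numeric sketch of this update rule, with an illustrative, made-up objective f(x) = -(x - 2)^2, whose single maximum is at x = 2:

```python
def gradient_ascent(df, x0, eta=0.1, steps=200):
    """Iterate x <- x + eta * f'(x), the generic gradient-ascent update."""
    x = x0
    for _ in range(steps):
        x = x + eta * df(x)
    return x

# f(x) = -(x - 2)^2 has derivative f'(x) = -2 (x - 2); ascent climbs to x = 2.
df = lambda x: -2.0 * (x - 2.0)
x_star = gradient_ascent(df, x0=0.0)
print(x_star)  # close to 2.0
```

Mean-shift applies the same climb, but on a density estimate rather than an explicit f.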

Here, the feature space is treated through a probability density function: dense regions in feature space correspond to local maxima, or modes. For each data point, gradient ascent is performed on the local density estimate until convergence. The stationary points obtained via gradient ascent represent the modes of the density function, and all points associated with the same stationary point belong to the same cluster.

3.2.2 Application of gradient ascent to the kernel density estimator. The kernel density estimator in d dimensions is given by:

    \hat{f}(x) = \frac{1}{n h^{d}} \sum_{i=1}^{n} K[(x - x_i)/h]          (3.2.2)
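A direct numeric transcription of estimator (3.2.2) in one dimension, using the Gaussian kernel K(u) = (2π)^{-1/2} exp(-u²/2) as an illustrative choice (the sample data and bandwidth below are made up for the example):

```python
import math

def kde(x, samples, h):
    """f_hat(x) = (1 / (n h^d)) * sum_i K((x - x_i) / h), here with d = 1."""
    K = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    n = len(samples)
    return sum(K((x - xi) / h) for xi in samples) / (n * h)

samples = [-1.2, -0.9, -1.0, 0.9, 1.1, 1.0]  # two apparent concentrations
h = 0.3
# The estimate is higher near the sample concentrations than between them.
print(kde(-1.0, samples, h), kde(0.0, samples, h), kde(1.0, samples, h))
```

Evaluating the estimate at m query points costs O(m n), which is the source of the O(n^2) burden mentioned earlier when every sample is also a query point.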

Section 3.2. The Mean-shift algorithm under study

Page 22

and its gradient is

    \nabla \hat{f}(x) = \frac{1}{n h^{d+1}} \sum_{i=1}^{n} K'\left(\frac{x - x_i}{h}\right)    (3.2.3)

At a density maximum, \nabla \hat{f}(x) = 0. Hence we get

    \frac{1}{n h^{d+1}} \sum_{i=1}^{n} K'\left(\frac{x - x_i}{h}\right) x = \frac{1}{n h^{d+1}} \sum_{i=1}^{n} K'\left(\frac{x - x_i}{h}\right) x_i    (3.2.4)

    x = \frac{\sum_{i=1}^{n} K'\left(\frac{x - x_i}{h}\right) x_i}{\sum_{i=1}^{n} K'\left(\frac{x - x_i}{h}\right)}    (3.2.5)

Assuming

    g(x) = -K'(x)    (3.2.6)

we thus have

    m(x) = \frac{\sum_{i=1}^{n} g\left(\frac{x - x_i}{h}\right) x_i}{\sum_{i=1}^{n} g\left(\frac{x - x_i}{h}\right)} - x    (3.2.7)

The quantity m(x) is the mean shift. The Mean-shift procedure for a given point x_i is therefore:

1. Compute the mean shift vector m(x_i^t).
2. Translate the density estimation window by m(x_i^t), i.e. x_i^{t+1} = x_i^t + m(x_i^t).
3. Repeat until convergence, i.e. until \nabla \hat{f}(x_i) = 0.

3.2.3 Proof of convergence. Using the kernel profile k, the update reads

    y^{t+1} = \frac{\sum_{i=1}^{n} x_i \, k\left(\left\|\frac{y^t - x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} k\left(\left\|\frac{y^t - x_i}{h}\right\|^2\right)}

To prove convergence, it suffices to prove that f(y^{t+1}) \geq f(y^t). We have

    f(y^{t+1}) - f(y^t) = \sum_{i=1}^{n} k\left(\left\|\frac{y^{t+1} - x_i}{h}\right\|^2\right) - \sum_{i=1}^{n} k\left(\left\|\frac{y^t - x_i}{h}\right\|^2\right)

Since the kernel profile k is convex (and monotonically decreasing, so that k' \leq 0), we have

    k(y^{t+1}) - k(y^t) \geq k'(y^t)\,(y^{t+1} - y^t)    (3.2.8)


Thus, using (3.2.8), we get the following:

    f(y^{t+1}) - f(y^t) \geq \sum_{i=1}^{n} k'\left(\left\|\frac{y^t - x_i}{h}\right\|^2\right)\left(\left\|\frac{y^{t+1} - x_i}{h}\right\|^2 - \left\|\frac{y^t - x_i}{h}\right\|^2\right)

    = \frac{1}{h^2} \sum_{i=1}^{n} k'\left(\left\|\frac{y^t - x_i}{h}\right\|^2\right)\left(\|y^{t+1}\|^2 - 2 (y^{t+1})^T x_i + \|x_i\|^2 - \left(\|y^t\|^2 - 2 (y^t)^T x_i + \|x_i\|^2\right)\right)

    = \frac{1}{h^2} \sum_{i=1}^{n} k'\left(\left\|\frac{y^t - x_i}{h}\right\|^2\right)\left(\|y^{t+1}\|^2 - \|y^t\|^2 - 2 (y^{t+1} - y^t)^T x_i\right)

    = \frac{1}{h^2} \sum_{i=1}^{n} k'\left(\left\|\frac{y^t - x_i}{h}\right\|^2\right)\left(\|y^{t+1}\|^2 - \|y^t\|^2 - 2 (y^{t+1} - y^t)^T y^{t+1}\right)

(replacing x_i by y^{t+1}, by the definition of y^{t+1})

    = \frac{1}{h^2} \sum_{i=1}^{n} k'\left(\left\|\frac{y^t - x_i}{h}\right\|^2\right)\left(\|y^{t+1}\|^2 - \|y^t\|^2 - 2\|y^{t+1}\|^2 + 2 (y^t)^T y^{t+1}\right)

    = \frac{1}{h^2} \sum_{i=1}^{n} k'\left(\left\|\frac{y^t - x_i}{h}\right\|^2\right)\left(-\|y^{t+1}\|^2 - \|y^t\|^2 + 2 (y^t)^T y^{t+1}\right)

    = \frac{1}{h^2} \sum_{i=1}^{n} \left(-k'\left(\left\|\frac{y^t - x_i}{h}\right\|^2\right)\right)\left(\|y^{t+1}\|^2 + \|y^t\|^2 - 2 (y^t)^T y^{t+1}\right)

    = \frac{1}{h^2} \sum_{i=1}^{n} \left(-k'\left(\left\|\frac{y^t - x_i}{h}\right\|^2\right)\right)\|y^{t+1} - y^t\|^2

    \geq 0,

since -k' = g \geq 0 by (3.2.6). Hence f(y^{t+1}) \geq f(y^t), and thus we have convergence.
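The iteration and the monotonicity property f(y^{t+1}) \geq f(y^t) proved above can be checked numerically with a small sketch in plain Python (the Gaussian profile and the data values are illustrative choices, not taken from the thesis):

```python
import math

def gaussian_profile(r):
    """Kernel profile k(r) with r = ||(y - x_i)/h||^2 (Gaussian choice)."""
    return math.exp(-0.5 * r)

def mean_shift_step(y, points, h):
    """One update y^{t+1} = sum_i x_i k(||(y-x_i)/h||^2) / sum_i k(...)."""
    w = [gaussian_profile(((y - x) / h) ** 2) for x in points]
    return sum(wi * xi for wi, xi in zip(w, points)) / sum(w)

def f(y, points, h):
    """Density at y, up to a constant factor."""
    return sum(gaussian_profile(((y - x) / h) ** 2) for x in points)

points = [1.0, 1.2, 0.9, 5.0, 5.1]
y = 2.0
for _ in range(50):
    y_next = mean_shift_step(y, points, h=1.0)
    # the density never decreases along the iteration, as proved above
    assert f(y_next, points, 1.0) >= f(y, points, 1.0) - 1e-12
    y = y_next
```

Starting between the two clusters, the iterate drifts to the mode of the nearer one.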

4. Experiments and open questions with Chisel

We conduct our study using the Mean-shift algorithm that was previously developed and tested on Spark Notebook [soc17j], as described in Chapter 3. Our main goal was to code the Mean-shift algorithm entirely in Chisel, so as to exploit its advantages by implementing it on an FPGA. To achieve this, we first coded the software version in Scala [soc17b]. To obtain a Chisel version, we then took each function, wrote its Chisel equivalent, and connected the resulting modules. As a more practical example, we built a Chisel version of k-Nearest Neighbors (k-NN), since it is the base of many machine learning algorithms; its functions are also part of what we used in our Mean-shift algorithm, so demonstrating the k-NN algorithm in Chisel is helpful for many machine learning algorithms. Finding the k data items most similar to a user's query item is a common building block of many important applications; in machine learning, fast k-NN techniques boost the classification speed of non-parametric classifiers. We implemented two parts of the algorithm, the square distance and the sort function, both of which are necessary for k-NN. Joining these two modules then yields the k-NN in Chisel.
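As a point of reference for the hardware modules discussed below, the whole k-NN building block (square distance, sort, majority vote) fits in a few lines of software; the following plain-Python sketch (with made-up toy data) mirrors the structure of our Scala version:

```python
def square_distance(a, b):
    """Sum of squared component differences, as in the SqDist module."""
    return sum((y - x) * (y - x) for x, y in zip(a, b))

def knn_predict(key, data_x, data_y, k):
    """k-NN classification: compute distances, sort indices, majority vote."""
    order = sorted(range(len(data_x)),
                   key=lambda i: square_distance(key, data_x[i]))
    top_k = [data_y[i] for i in order[:k]]
    # most frequently occurring label among the k nearest neighbors
    return max(set(top_k), key=top_k.count)

data_x = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [5.0, 5.2]]
data_y = ["a", "a", "b", "b", "b"]
assert knn_predict([0.05, 0.1], data_x, data_y, k=3) == "a"
```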

4.1 Comparison of Scala and Chisel functions for the square distance

Algorithm 4.1 shows some details of the Chisel implementation of the square distance. We built a test for our module, compared the hardware result to the software result, and measured their execution times. To achieve this, we needed functions such as peekFixedPoint, pokeFixedPoint and expectFixedPoint, which have since been added to the Chisel tester library. Expanding vector sizes, we observed that reducing the size of the FixedPoints we use changes the precision errors. We run a simulation to test the functionality of our circuit. With FixedPoint(64.W, 32.BP), we get a considerably good precision error (of about 2.3713366845655075E-9%). These results allow us to conclude that one can use the FixedPoint data type and still get reasonable results.
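The precision behaviour of FixedPoint can be mimicked in software by quantizing each operand to a given number of fractional bits; the following plain-Python sketch (the input values and bit widths are illustrative) shows how the relative error of the square distance shrinks as the binary point grows:

```python
def to_fixed(x, bp):
    """Quantize a real number to a fixed-point value with bp fractional bits."""
    return round(x * (1 << bp)) / (1 << bp)

def sq_dist_fixed(a, b, bp):
    """Square distance computed on fixed-point-quantized inputs."""
    qa = [to_fixed(x, bp) for x in a]
    qb = [to_fixed(x, bp) for x in b]
    return sum((y - x) * (y - x) for x, y in zip(qa, qb))

a, b = [0.123456789, 2.718281828], [1.414213562, 3.141592653]
exact = sum((y - x) ** 2 for x, y in zip(a, b))
err32 = abs(sq_dist_fixed(a, b, 32) - exact) / exact  # 32 fractional bits
err8  = abs(sq_dist_fixed(a, b, 8) - exact) / exact   # 8 fractional bits
assert err32 < err8        # more fractional bits -> smaller relative error
assert err32 < 1e-8
```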



Algorithm 1 Square Distance

    object GenFIR {
      def apply[T <: Data with Num[T]](in1: Vec[T], in2: Vec[T]): T = {
        val tmp = in1.zip(in2).map { case (x, y) => (y - x) * (y - x) }
        tmp.reduceLeft(_ + _)
      }
    }

    class SqDist(val n: Int, val fixedType: FixedPoint) extends Module {
      val io = IO(new Bundle {
        val in1 = Input(Vec(n, fixedType))
        val in2 = Input(Vec(n, fixedType))
        val out = Output(fixedType)
      })
      io.out := GenFIR(io.in1, io.in2)
    }

4.2 Comparison of Scala and Chisel hash functions

Approximate k-NN (k-nearest neighbor) techniques using binary hash functions are among the most commonly used approaches for overcoming the prohibitive cost of performing exact k-NN queries. The success of these techniques largely depends on their hash functions' ability to distinguish k-NN items; that is, the k-NN items retrieved based on the data items' hash codes should include as many true k-NN items as possible. The square distance and locality-sensitive hashing are thus two key functions of the k-NN used in the Mean-shift algorithm of interest, which relies on Locality Sensitive Hashing (LSH) to approximate nearest neighbors, both for the gradient ascent and for cluster labeling. It was shown in [BDAL16] that the time complexity of the proposed fast procedure is considerably lower than that of the traditional Mean-shift. These improvements with approximate near-neighbor calculations were sufficient for moderately sized data sets, though they are not sufficient for Big Data. The hash function helps optimize the search for the k-NN by reducing the search size. Here we could not use our Chisel hash function in the k-NN written in Chisel, since some operations, such as division, were not yet supported for FixedPoint in Chisel. Algorithm 4.2 shows some details of the Chisel implementation: we generate the hash values for a given vector x depending on the parameters w, b and tabHash1, as sketched in the next code. From these comparisons, we could make a couple of observations that we can summarize as follows:

• += is not a supported hardware construct.
• Division is not supported for FixedPoint, so we used wInverse so that a multiplication can be used instead.
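The hash computation itself is a dot product followed by an affine transform, summed over a family of random projection vectors; the following plain-Python sketch (the dimensions, w, b and the random projections are illustrative choices) shows why nearby points receive nearby hash values:

```python
import random

def lsh_hash(x, w, b, tab_hash):
    """Sum over projections of h_j(x) = (a_j . x + b) / w, as in hashfunc."""
    vals = []
    for a in tab_hash:
        dot = sum(ai * xi for ai, xi in zip(a, x))
        vals.append((dot + b) / w)  # in hardware: multiply by 1/w (wInverse)
    return sum(vals)

random.seed(0)
dim, n_hash = 4, 3
tab_hash = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_hash)]
x  = [1.0, 2.0, 3.0, 4.0]
x2 = [1.01, 2.0, 3.0, 4.0]  # a nearby point
hx  = lsh_hash(x,  w=2.5, b=0.7, tab_hash=tab_hash)
hx2 = lsh_hash(x2, w=2.5, b=0.7, tab_hash=tab_hash)
assert abs(hx - hx2) < 0.1  # close points get close hash values
```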


Algorithm 2 Hash Function

Scala version:

    def hashfunc(x: Array[Double], w: Double, b: Double,
                 tabHash1: Array[Array[Double]]): Double = {
      val tabHash = new Array[Double](tabHash1.size)
      for (ind <- tabHash1.indices) {
        var sum = 0.0
        for (ind2 <- x.indices) {
          sum += x(ind2) * tabHash1(ind)(ind2)
        }
        tabHash(ind) = (sum + b) / w
      }
      tabHash.reduce(_ + _)
    }

Chisel version (core expression, with wInverse replacing the division):

    io.out := io.tabHash.map { row =>
      (row.zip(io.x).map { case (a, b) => a * b }.reduce(_ + _) + io.b) * io.wInverse
    }.reduce(_ + _)

4.3 Additional technical difficulties

From the experimentation we did, we arrived at the following additional observations:

• With the reduce function, we observed that reduceLeft(), used together with the function Mux to compute the maximum of a vector, does not provide a circuit of optimal size. Recall that the size of a circuit is the number of gates it contains, its depth is the maximal length of a path from an input gate to the output gate, and its width is the maximum number of components in a stage. Indeed, Chisel generates a circuit of size O(N log N) (the product of the width and the depth) for an input of size N, which is not optimal since the sequential


reduce requires O(N) operations/circuits. This means that, for instance, the energy consumption is too high, since we have too many gates/circuits.

• For the k-NN in Chisel, we had to design a sort-with-index in Chisel that is not yet optimal. With it, we obtained a precision of 0.7103174603174603. The current implementation of the nearest-neighbors classifier took 300 cycles per key for a data size of 300. This is not particularly fast, but there are many additional options for improving performance. Depending on how big a circuit can be and still fit on the FPGA, it is possible to unroll the sort routine where the cycles happen and reduce them (at the cost of more gates) to any desired number of cycles. It is also possible to use multiple smaller sorters: for example, sort the 300 elements with several simultaneous sorters of, say, input size 30, and then sort the samples they return with another sorter; by doing so we could reduce the number of cycles. Another interesting optimization would be to add a row of registers to each intermediate sort stage (or unrolled batch of stages), i.e. to pipeline the sorter. Combined with the knowledge that the cycle delays of the circuit are deterministic, and that in actual use we would run a large number of keys through the engine, the engine would then have a fixed cycle latency of, say, 300 or less, but once it is up and running a new key can be fed at the input each cycle and a new output (for the 300th-previous key) appears each cycle. If we had a large number of keys N >> 300, the total cycle time to process them all would be N + 300 cycles: almost one key per cycle when the number of keys is large enough. All these hypotheses could be further investigated in future work, but from all this we can conclude that having hardware generators makes it easier to get to a really fast solution.
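The size/depth trade-off mentioned above for the reduce operator can be quantified with a small plain-Python count (a sketch; the exact circuits Chisel emits may differ): both a sequential reduceLeft chain and a balanced reduction tree need N − 1 two-input operators, but their depths, and hence the width × depth "size" metric, differ greatly:

```python
import math

def chain_reduce_stats(n):
    """reduceLeft: a linear chain of n-1 two-input Mux/max stages."""
    return {"gates": n - 1, "depth": n - 1}

def tree_reduce_stats(n):
    """Balanced reduction tree over n inputs."""
    gates, depth, width = 0, 0, n
    while width > 1:
        gates += width // 2           # one operator per pair in this stage
        width = (width + 1) // 2      # survivors move to the next stage
        depth += 1
    return {"gates": gates, "depth": depth}

n = 256
chain, tree = chain_reduce_stats(n), tree_reduce_stats(n)
assert chain["gates"] == tree["gates"] == n - 1  # both need N-1 operators
assert tree["depth"] == math.ceil(math.log2(n))  # but the tree is much shallower
```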
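The throughput argument for the pipelined sorter is simple arithmetic, sketched below in plain Python (the 300-cycle latency is the figure quoted above):

```python
def total_cycles(num_keys, pipeline_latency):
    """With a fully pipelined engine of fixed latency, one key enters per
    cycle and one result leaves per cycle once the pipeline is full, so
    processing num_keys keys costs num_keys + pipeline_latency cycles."""
    return num_keys + pipeline_latency

latency = 300
assert total_cycles(1, latency) == 301
# for N >> 300 keys the amortized cost approaches one key per cycle
n = 1_000_000
assert total_cycles(n, latency) / n < 1.001
```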

4.4 Some open questions with Chisel

On the path of transforming the Mean-shift algorithm to map it efficiently onto hardware logic, we encountered the following research questions.

1. The first question is: how will the input data be loaded? Many learning algorithms rely on the assumption that the data being processed can be held entirely in memory or in a single file on disk [LGEC17]. As such, planning how the data will be stored in the hardware is more complicated. We proposed using Chisel registers to hold the loaded data. Registers move the values at their inputs to their outputs on the positive transition of the clock; their use brings in the concept of cycles, which is based on the clock. The intermediate values used in the computation are kept in registers, and the number of cycles a computation takes depends on the specific input supplied to the circuit. Because of this, the circuit needs to use Ready/Valid logic: the IO Bundle needs ports to tell the unit when there are values to compute and when it is done.

2. Second, how can we implement an efficient hardware sorting algorithm? This problem arises because sorting is one of the functions used in the Mean-shift in Scala. Efficient sorting is important for optimizing the use of other algorithms (such as search and merge algorithms) that require their input data to be in sorted lists; it is also often useful for canonicalizing data and for producing human-readable output. Sorting a large number of elements needs a high computational rate; consequently, accelerating sorting algorithms using a hardware implementation turns out to be an interesting way to speed up computations. This can be seen in [JAD+17], where Yomna et al. developed hardware-accelerated versions of the Timsort and MergeSort algorithms from high-level descriptions. As experimental


results, they compared the performance of the two algorithms in terms of execution time and resource utilization, and showed that Timsort runs from 1.07x to 1.16x faster than MergeSort when optimized hardware is used. What we implemented here for illustration is a naive hardware sort.

5. Conclusion

This thesis relates our experience in developing a clustering algorithm on an FPGA board. The overall objective of our work is to shorten the software stack that goes from a high-level programming language down to logic gates. We examined several building blocks of Chisel, a tool chain based on Scala, and we noticed different problems with the generated circuits. The main problem relates to the non-optimality, in terms of circuit size, of basic building blocks, for instance the reduce instruction. If the algorithm written in Scala is not optimal and the circuit is not optimal, the drawback is minor; by contrast, if the algorithm is optimal but the circuit produced is not optimal in terms of size, then the problem is major. This example brings us back to the question of which criterion or criteria of optimality to choose. The size of the circuit may not be the only metric to consider: we could also optimize energy consumption, possibly coupled with a metric on the execution time. If the FPGA circuit is rented from a cloud provider, we could also put the user in the loop by letting them specify a budget or any other SLA, qualitative or quantitative. Producing a circuit in this multi-criteria context seems to be a challenging task. In our view, the second challenge is the choice of the skeletons that we have to consider, with the idea that any machine learning problem can be specialized based on these skeletons. Skeleton programming is not new in the landscape, but the choice of a model remains, in our context, a challenging question. Common skeletons used in the framework of skeleton-based parallel programming include pipe, farm, loop, map, reduce, divide & conquer, and reduce and map over pairs. Basic PRAM (Parallel Random Access Machine) building blocks such as parallel prefix, pointer jumping, Euler tour, sorting and merging may also be useful for the model.
The results of this study have been published in December 2017 (Christophe Cérin, Jean-Luc Gaudiot, Mustapha Lebbah and Foutse Yuehgoh, Return of Experience on the Mean-shift Clustering for Heterogeneous Architecture Use Case, 4th Workshop on Advances in Software and Hardware for Big Data Science, IEEE Big Data, Boston, US); they draw interesting perspectives and demonstrate the potential of our approach. Further studies to continue this work could be to: envision, based on the basic blocks for machine learning identified in our preliminary work, the implementation of ML on heterogeneous architectures using Scala, at least on an FPGA available at Amazon; carry out a performance and scalability analysis; generalize the previous point to the major clustering algorithms available at this time; and propose new designs in case of performance degradation and/or optimize them for the FPGA ecosystem.


Appendix A. Implementation example for the maximum of a vector in Chisel

Chisel3 is a new Firrtl-based Chisel whose compilation flow looks like the following:
• Chisel3 (Scala) to Firrtl (this is your "Chisel RTL").
• Firrtl to Verilog (which can then be passed into FPGA or ASIC tools).
• Verilog to C++ for simulation and testing using Verilator.

So to get started, one needs the following software:
1. Java 8.
2. sbt, the preferred Scala build system and what Chisel uses.
3. Firrtl, which compiles Chisel's IR down to Verilog.
• Prerequisite: Verilator (requires at least v3.886).
– Prerequisite: g++.

This installation is simpler on Windows with the help of Cygwin. Once we have all the required software, we create a project file as follows:
• Step 1:


• Step 2: Create the .sbt file with the following content.

    name := "name_of_project"
    version := "1.0"
    scalaVersion := "2.11.7"
    resolvers ++= Seq(
      Resolver.sonatypeRepo("snapshots"),
      Resolver.sonatypeRepo("releases")
    )
    // Provide a managed dependency on X if -DXVersion="" is supplied on the command line.
    val defaultVersions = Map(
      "chisel3" -> "3.0-SNAPSHOT",
      "chisel-iotesters" -> "1.1-SNAPSHOT"
    )
    libraryDependencies ++= (Seq("chisel3", "chisel-iotesters").map { dep: String =>
      "edu.berkeley.cs" %% dep % sys.props.getOrElse(dep + "Version", defaultVersions(dep))
    })
    libraryDependencies ++= Seq(
      "org.scalatest" %% "scalatest" % "2.2.5",
      "org.scalacheck" %% "scalacheck" % "1.12.4")

Put the build.sbt file in the root directory of your project.
• Content of the build.properties file, specifying the sbt version:

    sbt.version = 0.13.11

• Content of the plugins.sbt file:

    logLevel := Level.Warn

• Run sbt test to get


[Success] means everything worked! You will then see a number of newly generated files in a new folder test_run_dir. This procedure follows the Chisel template available on the official Chisel website. Alternatively, one could add the following code to the MaxN.scala file:

    object MaxNDriver extends App {
      chisel3.Driver.execute(args, () => new MaxN)
    }

and put the following in the build.sbt file instead:

    scalaVersion := "2.11.8"
    resolvers ++= Seq(
      Resolver.sonatypeRepo("snapshots"),
      Resolver.sonatypeRepo("releases")
    )
    libraryDependencies += "edu.berkeley.cs" %% "chisel3" % "3.0-SNAPSHOT"

This alternative way is for those who really want to do the most bare-bones thing possible.

Code explanations
• MaxN.scala file: the command import chisel3._ imports the Chisel library files that allow us to leverage the Scala language as a hardware construction language.

Next is a Scala class defined for the Chisel component you are implementing; this is similar to a module declaration in Verilog. Next we define a function Max2 that takes in two parameters and returns the greater one, thanks to the Chisel operator Mux(a, b, c), interpreted as: if a, then b, else c.
• Generated files:
– The C++ code is used to build an emulation of your design. The tester you define will drive the emulation, using poke() to set signal values, and peek() or expect() to read them. You should not be compiling the C++ yourself: the --genHarness and --test options passed to Chisel make Chisel compile the C++ code, build the emulation and run your tester to drive it. The test bench (written in Scala) communicates (currently via shared memory) with the emulation built from the C++ code.

Appendix B. Hardware K-NN in Chisel

B.1 MostFrequentlyOccurring.scala

This module figures out which of the k samples kept after the sort occurs most frequently. It is the hardware equivalent of the following line of the Scala software version, which can be seen online¹:

    topKClasses.groupBy(identity).mapValues(_.size).maxBy(_._2)._1

    package chiselknn

    import chisel3._
    import chisel3.iotesters.PeekPokeTester
    import hardwaresort.CombinationalSortIndexAndTake

    /** A surprisingly complicated way of determining the most frequently occurring number
      * in a Vec of UInt's. Ties will generally pick the last value that sits at the highest
      * index of all tied elements.
      * @param elementCount Number of elements to choose from
      * @param uIntSize Width of element containers (perhaps this should be maxValue)
      */
    class MostFrequentlyOccurring(val elementCount: Int, val uIntSize: Int) extends Module {
      val io = IO(new Bundle {
        val input = Input(Vec(elementCount, UInt(uIntSize.W)))
        val mostFrequentlyOccurringValue = Output(UInt(uIntSize.W))
      })

      /** First we compute sums, a list of how many occurrences of the number at index occur
        * from that index to the right end of the inputs.
        * We need to create a temp with a known size to make sure we get UInt's big enough
        * to contain sum (why is this?)
        */
      val sums = io.input.indices.map { index =>
        io.input.drop(index).map { x => (x === io.input(index)).asUInt }.reduce(_ +& _)
      }

      // printf("Sums: " + ("%d " * sums.length) + "\n", sums: _*)

      /** Next we sort an array of indices, using them as keys to the number of occurrences
        * for the value at that index in the inputs.
        * The last value of that sorted list is the index with the most occurrences,
        * so return the value of the inputs at that index.
        */
      val maxSelector = Module(new CombinationalSortIndexAndTake(sums.size, sums.size, UInt(uIntSize.W)))
      maxSelector.io.inputs := sums

      io.mostFrequentlyOccurringValue := io.input(maxSelector.io.outputs.last)
    }

    class MostFrequentlyOccurringTester(c: MostFrequentlyOccurring) extends PeekPokeTester(c) {
      val inputVectors = Seq(
        Seq(1, 1, 1, 2, 2, 3, 4, 1, 5, 2, 4, 7).take(c.elementCount),
        Seq(1, 1, 1, 3, 3, 3, 2, 2, 2, 2, 2, 2).take(c.elementCount),
        Seq(1, 4, 1, 4, 2, 4, 2, 3, 4, 3, 3, 4).take(c.elementCount),
        Seq(1, 3, 1, 3, 2, 3, 2, 3, 4, 3, 3, 4).take(c.elementCount)
      )
      val expectedValues = Seq(1, 2, 4, 3)

      inputVectors.zip(expectedValues).foreach { case (inputVector, expectedValue) =>
        println("inputs " + inputVector.map { x => f"$x%4d" }.mkString(" "))

        inputVector.zipWithIndex.foreach { case (value, index) =>
          poke(c.io.input(index), value)
        }
        step(1)

        if (peek(c.io.mostFrequentlyOccurringValue) != expectedValue) {
          println(s"ERROR: MFO value is ${peek(c.io.mostFrequentlyOccurringValue)} Should be $expectedValue\n\n\n")
        } else {
          println(s"MFO value is ${peek(c.io.mostFrequentlyOccurringValue)}\n\n\n")
        }
        expect(c.io.mostFrequentlyOccurringValue, expectedValue)

        step(1)
      }
    }

    object MostFrequentlyOccurring {
      def main(args: Array[String]): Unit = {
        iotesters.Driver.execute(Array.empty,
          () => new MostFrequentlyOccurring(elementCount = 10, uIntSize = 8)) { c =>
          new MostFrequentlyOccurringTester(c)
        }
      }
    }

1 https://github.com/Foutse/Internship work/blob/master/finalprojects/scalakNN/src/main/scala/example/NNExample.scala

B.2 CombinationalSortIndexAndTake.scala

    package hardwaresort

    import chisel3._

    // scalastyle:off magic.number
    /** Implements a naive and non-optimized combinational hardware sort of FixedPoint
      * numbers. Sorts the inputs and returns either the outputSize highest or lowest
      * depending on reverseSort. Similar to SortAndTake but fully unrolled. Very
      * difficult to synthesize except for very small input sizes.
      *
      * @param inputSize how many values to sort
      * @param outputSize how many of the top or bottom sorted values to return
      * @param elementType size of the numbers in input
      */
    class CombinationalSortIndexAndTake(val inputSize: Int, val outputSize: Int,
                                        val elementType: UInt) extends Module {
      val io = IO(new Bundle {
        val inputs    = Input(Vec(inputSize, elementType))
        val newInputs = Input(Bool())
        val outputs   = Output(Vec(outputSize, UInt(inputSize.W)))
        val sortDone  = Output(Bool())
      })

      val sortedInputs = io.inputs.indices.foldLeft(io.inputs.indices.map(_.U).toList) {
        case (ll, index) =>
          def reorderPairs(list: List[UInt]) = {
            list.sliding(2, 2).toList.map {
              case a :: b :: Nil =>
                Mux(io.inputs(a) > io.inputs(b), b, a) ::
                Mux(io.inputs(a) > io.inputs(b), a, b) :: Nil
              case b :: Nil => b :: Nil
              case _        => Nil
            }.flatten
          }

          if (index % 2 == 0) {
            reorderPairs(ll)
          }
          else {
            List(ll.head) ++ reorderPairs(ll.tail)
          }
      }

      io.outputs.zip(sortedInputs).foreach { case (out, reg) =>
        out := reg
      }
      io.sortDone := true.B
    }

B.3 SortIndexAndTake.scala

    package hardwaresort

    import chisel3._
    import chisel3.experimental.FixedPoint
    import chisel3.internal.firrtl.KnownBinaryPoint
    import chisel3.iotesters.PeekPokeTester
    import chisel3.util.log2Ceil

    // scalastyle:off magic.number
    /** Implements a naive and non-optimized hardware sort of FixedPoint numbers.
      * Sorts the inputs and returns the selected number (outputSize) of indices of the
      * lowest values. This has very primitive flow control: the parent sets when inputs
      * are to be read by toggling newInputs and then should wait until the sort is
      * complete. The sort may be complete before the circuit realizes it (this could
      * be fixed). The basic strategy is to copy the indices of the input to a register
      * vector, then compare adjacent registers and flip them if the value at the first
      * index is greater than the value at the second index. When selecting register
      * pairs, on even cycles compare 0 to 1, 2 to 3, ...; on odd cycles compare
      * 1 to 2, 3 to 4, ...
      *
      * @param inputSize how many values to sort
      * @param outputSize how many of the top or bottom sorted values to return
      * @param fixedType type template for the FixedPoint numbers in input
      */
    class SortIndexAndTake(val inputSize: Int, val outputSize: Int,
                           val fixedType: FixedPoint) extends Module {
      val io = IO(new Bundle {
        val inputs    = Input(Vec(inputSize, fixedType))
        val newInputs = Input(Bool())
        val outputs   = Output(Vec(outputSize, UInt((log2Ceil(inputSize) + 1).W)))
        val sortDone  = Output(Bool())
      })

      val sortReg     = Reg(Vec(inputSize, UInt((log2Ceil(inputSize) + 1).W)))
      val busy        = RegInit(false.B)
      val sortCounter = RegInit(0.U(log2Ceil(inputSize).W))
      val isEvenCycle = RegInit(false.B)

      when(io.newInputs) {
        // when parent module loads new inputs to be sorted, we load registers and prepare to sort
        sortReg.zipWithIndex.foreach { case (reg, index) => reg := index.U }

        busy := true.B
        sortCounter := 0.U
        isEvenCycle := false.B
      }
      .elsewhen(busy) {
        isEvenCycle := !isEvenCycle

        sortCounter := sortCounter + 1.U
        when(sortCounter >= inputSize.U) {
          busy := false.B
        }

        when(isEvenCycle) {
          sortReg.toList.sliding(2, 2).foreach {
            case regA :: regB :: Nil =>
              when(io.inputs(regA) > io.inputs(regB)) {
                // a is bigger than b, so flip this pair
                regA := regB
                regB := regA
              }
            case _ =>
            // this handles the end case when there is nothing to compare the register to
          }
        }
        .otherwise {
          sortReg.tail.toList.sliding(2, 2).foreach {
            case regA :: regB :: Nil =>
              when(io.inputs(regA) > io.inputs(regB)) {
                // a is bigger than b, so flip this pair
                regA := regB
                regB := regA
              }
            case _ =>
            // this handles the end case when there is nothing to compare the register to
          }
        }
      }

      io.sortDone := !busy

      io.outputs.zip(sortReg).foreach { case (out, reg) =>
        out := reg
      }
    }

    class SortIndexTester(c: SortIndexAndTake) extends PeekPokeTester(c) {
      val valuesToSort = (0 until c.inputSize).map { i => (c.inputSize - i).toDouble / 2.0 }

      def showOutputs(): Unit = {
        for (i <- 0 until c.outputSize) {
          val index = peek(c.io.outputs(i)).toInt
          print(f"$index%3d ${valuesToSort(index)}%8.4f ")
        }
        println()
      }

      for (i <- 0 until c.inputSize) {
        pokeFixedPoint(c.io.inputs(i), valuesToSort(i))
      }
      poke(c.io.newInputs, 1)
      step(1)

      poke(c.io.newInputs, 0)
      step(1)

      showOutputs()

      // wait for sort to finish
      while (peek(c.io.sortDone) == 0) {
        showOutputs()
        step(1)
      }

      showOutputs()
    }

    object SortIndexTest {
      def main(args: Array[String]): Unit = {
        iotesters.Driver.execute(Array.empty[String],
          () => new SortIndexAndTake(5, 5, FixedPoint(16.W, 8.BP))) { c =>
          new SortIndexTester(c)
        }

        iotesters.Driver.execute(Array.empty[String],
          () => new SortIndexAndTake(20, 5, FixedPoint(16.W, 8.BP))) { c =>
          new SortIndexTester(c)
        }
      }
    }

B.4 HardwareNearestNeighbours.scala

    package chiselknn

    import chisel3._
    import chisel3.experimental.FixedPoint
    import chisel3.iotesters
    import chisel3.iotesters.PeekPokeTester
    import chisel3.util.log2Ceil
    import hardwaresort.SortIndexAndTake

    import scala.io.Source
    import scalaknn.NearestNeighbours

    /** Compute the nearest neighbor for a key
      * @param fixedType
      * @param k
      * @param keySize
      * @param dataY
      * @param dataX
      */
    class HardwareNearestNeighbours(val fixedType: FixedPoint,
                                    val k: Int,
                                    val keySize: Int,
                                    val dataY: Seq[Int],
                                    val dataX: Array[Array[Double]]) extends Module {
      val io = IO(new Bundle {
        val dataLoaded = Input(Bool())
        val key        = Input(Vec(keySize, fixedType))
        val out        = Output(UInt(16.W))
        val ready      = Output(Bool())
      })

      val busy = RegInit(false.B)

      val sorter = Module(new SortIndexAndTake(dataX.length, k, fixedType))
      val mfo    = Module(new MostFrequentlyOccurring(k, log2Ceil(dataX.length) + 1))

      private val tabHash0 = dataX.map(_.map(_.F(fixedType.getWidth.W, fixedType.binaryPoint)))

      private val distances = tabHash0.indices.map { ind1 =>
        val dist: FixedPoint = io.key.zip(tabHash0(ind1)).foldLeft(0.F(fixedType.binaryPoint)) {
          case (accum, (x, t)) => accum + ((x - t) * (x - t))
        }
        dist
      }

      val predictions = VecInit(dataY.map { y => y.U(log2Ceil(dataY.length).W) })

      sorter.io.inputs := distances

      sorter.io.outputs.zip(mfo.io.input).foreach { case (sortOutput, mfoInput) =>
        mfoInput := predictions(sortOutput)
      }

      io.out := mfo.io.mostFrequentlyOccurringValue

      io.ready := !busy

      when(io.dataLoaded) {
        sorter.io.newInputs := true.B
        busy := true.B
      }.elsewhen(busy && !io.dataLoaded) {
        sorter.io.newInputs := false.B

        when(sorter.io.sortDone) {
          busy := false.B
        }
      }
    }

    object HardwareNearestNeighboursDriver {
      val TrainingSize = 300

      val euclideanDist = (v1: Array[Double], v2: Array[Double]) =>
        v1.zip(v2).map(x => math.pow((x._1 - x._2), 2)).sum

      // function to transform the data into a couple of a list of Doubles and a string
      def line2Data(line: String): (List[Double], String) = {
        val elts = line.split(",")
        val y = elts.last
        val x = elts.dropRight(1).map(_.toDouble).toList
        (x, y)
      }

      // scalastyle:off method.length regex
      def main(args: Array[String]): Unit = {

        // loading the data from file
        val data = Source.fromFile("ionosphere.data.txt").getLines().map(x => line2Data(x)).toList

        val outputs = data.map(_._2)          // take the last column of strings
        val inputs  = data.map(_._1).toArray  // convert the list of lists of doubles into an array of lists of doubles

        println(s"The full dataset size is ${outputs.size}")

        val distinctOutputs = outputs.distinct.toArray
        val outputToIndex = distinctOutputs.zipWithIndex.map { case (x, index) => x -> index }.toMap

        println(s"distinctOutputs: ${outputToIndex.map { case (s, n) => f"$s: $n" }.mkString(",")}")

        val dataX = inputs.map(_.toArray).take(TrainingSize)  // take just TrainingSize rows of the input data
        val dataY = outputs.take(TrainingSize).map { string => outputToIndex(string) }  // the first TrainingSize labels
        val keySize = dataX(2).length  // the length of one feature vector of dataX
        println(s"The key size is $keySize")
        val fixedWidth = 64
        val binaryPoint = 32
        val k = 5

        def makeModule() = {
          () => new HardwareNearestNeighbours(
            FixedPoint(fixedWidth.W, binaryPoint.BP), k, keySize, dataY, dataX)
        }

        iotesters.Driver.execute(args, makeModule()) { c =>
          new PeekPokeTester(c) {
            val traininputs  = inputs.take(TrainingSize)  // use the first TrainingSize data points of our data set
            val trainoutputs = outputs.take(TrainingSize)
            val myNN = new NearestNeighbours(k = 4, dataX = traininputs.map(x => x.toArray),
              dataY = trainoutputs, euclideanDist)

var correct = 0 ( TrainingSize until outputs . length ) . foreach { exampleId = > val pred = myNN . predict ( inputs ( exampleId ) . toArray ) val target = outputs ( exampleId ) if ( pred = = target ) correct += 1

129 130 131 132 133 134

poke ( c . io . dataLoaded , 1) inputs ( exampleId ) . zipWithIndex . foreach { case ( value , index ) = > poke FixedPoi nt ( c . io . key ( index ) , value ) } step (1)

135 136

poke ( c . io . dataLoaded , 0)

137 138 139 140 141 142 143 144 145 146

var count = 0 while ( peek ( c . io . ready ) = = 0 & & count < 1000) { println ( s " Waiting for KNN , cycle is $count " ) step (1) count = count + 1 } if ( count > = 1000) { println ( s " H a r d w a r e N e a r e s t N e i g h b o r failed to complete " ) }

147 148 149 150

val prediction = peek ( c . io . out ) . toInt val p re di c te dS tr i ng = di st in c tO ut pu t s ( prediction ) println ( s " Ne ar e st Ne i gh bo r : output = > $prediction which corresponds to prediction $ p r e d i c t e d S t r i ng " )

Section B.4. HardwareNearestNeighbours.scala

151

Page 41

}

152 153

println ( " The accuracy is \ n " + correct . toDouble / ( TrainingSize to outputs . length ) . length )

154 155 156 157 158



} } } }
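The driver above instantiates a software `NearestNeighbours` model as a reference against which the hardware output is checked, but its definition does not appear in this listing. A minimal sketch of such a class is shown below, assuming only the constructor signature and `predict` method that the driver actually uses; the class body here is a hypothetical illustration of majority-vote KNN, not the thesis implementation:

```scala
// Hypothetical software reference model matching the constructor used in the
// driver: NearestNeighbours(k, dataX, dataY, distance). Sketch only.
class NearestNeighbours[L](
    k: Int,
    dataX: Array[Array[Double]],
    dataY: Seq[L],
    distance: (Array[Double], Array[Double]) => Double) {

  // Predict the label of `x` by majority vote among the k nearest
  // training points under the supplied distance function.
  def predict(x: Array[Double]): L = {
    val nearest = dataX.indices
      .sortBy(i => distance(dataX(i), x))
      .take(k)
    nearest.map(dataY).groupBy(identity).maxBy(_._2.size)._1
  }
}

object KnnDemo extends App {
  // Squared Euclidean distance, as in the driver above.
  val euclideanDist = (v1: Array[Double], v2: Array[Double]) =>
    v1.zip(v2).map { case (a, b) => (a - b) * (a - b) }.sum

  val xs = Array(Array(0.0, 0.0), Array(0.1, 0.0), Array(5.0, 5.0))
  val ys = Seq("g", "g", "b")
  val nn = new NearestNeighbours(k = 2, xs, ys, euclideanDist)
  println(nn.predict(Array(0.05, 0.05))) // expected: g
}
```

Note that this software model sorts all N distances for every query, whereas the hardware module streams the same squared distances through SortIndexAndTake and MostFrequentlyOccurring, which is what makes the comparison in the driver meaningful.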


