Hardware Transactional Memory on Multi-processor FPGA Platform

4 downloads 3670 Views 536KB Size Report
Multi-processor FPGA Platform ... Faculty of Biosciences and Medical Engineering, Universiti Teknologi Malaysia, 81310 Skudai, Johor, Malaysia. Email: ∗.
Hardware Transactional Memory on Multi-processor FPGA Platform Jeevan Sirkunan∗ , Chia Yee Ooi† , N. Shaikh-Husin‡ , Yuan Wen Hau§ , M. N. Marsono¶ ∗‡¶

Faculty of Electrical Engineering, Universiti Teknologi Malaysia, 81310 Skudai, Johor, Malaysia International Institute of Technology, Universiti Teknologi Malaysia Kuala Lumpur, 54100 KL, Malaysia § Faculty of Biosciences and Medical Engineering, Universiti Teknologi Malaysia, 81310 Skudai, Johor, Malaysia Email: ∗ [email protected], † [email protected], ‡ [email protected], § [email protected], ¶ [email protected],

† Malaysia-Japan

Abstract—Transactional memory (TM) is a promising approach in creating an abstraction layer for multi-threaded programming. However, the performance of TM is applicationspecific. Previous embedded system TM implementations exploit only conflict management to suit the application requirements. In this paper, we propose a hardware transactional memory (HTM) which exploits both version and conflict management. The proposed architecture is targeted for embedded applications and is area efficient compared to current methods that apply cache coherence protocols. The proposed system was tested with random requests at different contention levels. We implemented the HTM with four model processors on Cyclone IV Field Programmable Gate Array. Our results show that it offers up to 14% improvement in terms of clock cycle over the HTM scheme that only exploits conflict management. Keywords—Hardware transactional memory, FPGA, Embedded Hardware, Multi-processor

I.

I NTRODUCTION

Parallel programming model partitions a singular task to be executed into several smaller tasks. Message passing and shared memory are the most common parallel programming models. Message passing needs explicit communication, in which programmers are required to synchronize memory access from the software. On the other hand, shared memory requires blocking synchronization or lock which are usually done implicitly by hardware [1]. Fine-grained locks yields better performance but at the cost of development time. On top of that, it requires expert programmers to develop it. Meanwhile, coarse-grained locks is much easier to implement but performs poorly since it limits parallelism. Transactional Memory (TM) provides non-blocking waitfree synchronization among memory sections. Each transaction is atomic, isolated, and consistent. Its main target is to simplify multi-threaded programming while making full use of multiprocessor hardware capacity. The magnitude of simplification was quantified by Rossbach et al. [2] on a multi-player game programming assignment. In TM, changes made by conflicting transactions are undone and the transactions are either aborted or restarted. On the other hand, changes from successful transaction become permanent. Several HTM architectures have been proposed [3]–[7]. However, most are aimed for high performance system with cache coherence protocols [8]. Nonetheless, there are many embedded applications such as network processing that

978-1-4799-3432-4/14/$31.00 ©2014 IEEE

2744

use multiple light-weight Reduced Instruction Set Computer (RISC) or micro-engines. References [8]–[10] are some of the works that presented HTM for embedded systems. These HTM schemes are usually assembled based on specific applications. The performance of the HTM is dependent both on the configuration and also its application. In general, HTM configuration is divided into version and conflict management, and the application is categorized based on its contention level. Previous implementation on embedded system [8] focused only on conflict management. In this work, we propose a light-weight HTM architecture for embedded system which combines both version management and conflict management to give maximum performance for low contention application. The system is tested with random requests with different contention levels and is compared with [8]. The area trade-off between these architectures are also analysed. Our findings show improvement up to 14% in terms of performance and similar hardware utilization. II.

R ELATED WORKS

Currently, lock-based synchronization schemes are widely used for synchronizing a multi-processor system. The increment in programming complexity has created the need for research on TM. Software implementation of transactional memory (STM) [11] gives poor performance. Most of the current works focus on HTM based on cache coherent protocols. References [3], [4] are some of the earliest work that introduce additional instruction set by adding additional cache line for TM. HTM placed at the cache level are bounded by the maximum cache size, and software programmers need to take this into consideration. Consequently, [5] and [6] were proposed though these works suffer performance loss at certain conditions for a better abstraction at the software level. DynTM [7] and FlexTM [12] were proposed with an interchangeable HTM configuration in order to adapt the changes in application behaviour. The aforementioned works focus on building the HTM for high performance cache coherent system. These architectures were implemented and tested in simulation environment. Several other works targeted the implementation of HTM for FPGA platform. ATLAS [13], Real Time Transactional Memory (RTTM) [14] and NetTM [9] were all hardware implementations that were built based on fixed configurations. Its performance will be ambiguous based on running applications.

Configurable Transactional Memory (CTM) [8] was introduced with a generic approach in building HTM for embedded system. In this design, the system can be configured to lazy or eager conflict management to suit the application demand (probability of conflict). However in their work, version management context has not been exploited since on transaction commit, the transactional memory cache inside CTM still needs to update main memory one word at a time. Another architecture that is targeted for embedded system is EmbeddedTM [10]. Its main concern was about reducing power consumption of the HTM. However, it also uses cache coherent protocols which is usually absent in multi-processor FPGA platform [8]. We propose a similar approach as CTM [8] in building the HTM, but we integrate eager version management to cater for application with low contention. In our work, we represent different types of application with random request at different contention levels and also different transaction sizes. A similar approach was proposed in NetTM [9], which has fast commit and late resolution on conflict. Their work [9] uses an undo-log instead of a fully associative cache like in CTM [8]. Undo-log has a simpler structure since it does not have a cache like organization and it can support larger transactions at lower hardware cost. However, having fully associative cache allows both eager and lazy version management. III.

HTM CONFIGURATION OVERVIEW

A. HTM criteria Various architectures proposed by [5], [8], [9], [15] can be categorized into two main aspects: Version Management and Conflict Management [7]. Fig. 1 is a rough depiction of the HTM architecture, where n is the number of processors. TM buffer and Main memory are both running with similar frequency and thus, the access time of both memories are similar. The proposed architecture is based on [8]. This architecture was chosen since it is lightweight and is able to work in a Multi-Processor System on Chip (MPSoC) system with or without cache coherence support. It can also work with heterogeneous core accelerators making it ideal for embedded system implementation.

Fig. 1.

System Overview of HTM in MPSoC

1) Version Management: Version management is classified based on the location of the modified transaction [7]. For eager version management, old data are kept in the TM buffer and any updates are written directly on the Main memory. For the lazy version management, it is vice versa. Additional clock cycles are needed when the TM buffer updates the Main memory regardless with new or old data. Consequently, eager version management would have lower Tcommit (3) and lazy version management has lower Tabort (3).

2745

2) Conflict Management: Conflict management defines how the conflict is detected and managed [7]. Eager conflict management detects conflicts during read or write phases and then resolves it immediately. On the other hand, lazy conflict management detects conflict on the read, write or commit phases, but it will only be resolved during the commit phase. Both managements resolve conflicts by aborting and restarting the transaction to avoid dead-lock. For eager conflict management, the processor will be given a random delay time before it restarts in order to avoid live-lock. B. Processing Time The processing time needed for HTM can be defined as follow. Tprocess = Taccess + Tcommit abort (1) Taccess = Tread write × n × s (2) From (1), Tprocess represents the total clock cycles needed by HTM to handle a finite amount of transactions, whereas Tcommit abort is the additional processing time needed for HTM to handle commit and abort. It is similar to the hit miss penalty in a cache system. Taccess from (2) is the total time taken for read/write request, regardless whether it is useful or not. n represents the number of transaction whereas s is the size of transaction. Tread write depends on the architecture of the memory. In our design, we use a fully associative buffer to reduce the address search time to one clock cycle. The TM buffer can hold either the old or new value of the transaction and the access flag of each address. In HTM, penalty occurs during commit and abort. During the update period other processors are not allowed to access the memory. n X ((Tcommit )(Pcommit )+ Tcommit abort = (3) i=0 (Tabort )(Pabort )) Pcommit = 1 − (Pabort )

(4)

Tcommit and Tabort depend on the configuration of the HTM. From (3), at higher contention (Pabort is high), Tabort needs to be reduced to get a lower penalty and vice versa for low conflict situations. In reference [8], the structure uses lazy version management but it does not integrate fast abort. The CTM cache and Main memory update their memory one word at a time during commit and abort based on the address FIFOs. In order to have fast abort for lazy version management or fast commit for eager version management, each entry would require its own shared flag. For low contention situation, lazy conflict management is preferred. Overhead caused by the processors checking for conflict status can be avoided. On the other hand, eager conflict management can minimize penalty caused by conflicting transactions accessing the memory, making it suitable for high contention application [8]. Table I shows the relationship of version and conflict management configuration towards the contention level that it is most suited for. In order to get a maximum performance for low contention application, we propose a HTM that employ eager version management and lazy conflict management.

TABLE I.

C ONTENTION

Version Management

Fig. 2.

LEVEL TOWARDS

Lazy Eager

HTM C ONFIGURATION

Conflict Management Lazy Eager High/Low High/High Low/Low Low/High

Top level of proposed HTM architecture

IV.

S YSTEM ARCHITECTURE

Fig. 2 shows the overview of proposed scheme for 4 processors that has access to the shared memory. The TM buffer has a maximum size of 256 entries. Round robin arbitration is used to give each processor equal priority. The TM buffer consists four sections: Valid, Address, Read Write Set and Data. Each processor is given 2 flags to keep track on its transactions. A conflict is detected when the TM buffer is being updated. The Read Write Set for current address is compared with the flags of current access. Write-on-write, read-after-write, and write-after-read from two transactions are the conditions that are considered as conflict [4]. The conflict flag is stored and during the commit phase of the processor, it will be notified and the respective transaction will be aborted. For a conflicted transaction, any consequent request made by the processor will not cause changes in the HTM. Such transaction is called zombie transaction [14]. It reduces the abort penalty and protects other transactions from the corrupted transaction. The valid bit is placed in the cache to allow fast commit. The state diagram of the control unit from Fig. 2 is depicted in Fig. 3. During every read and write phase, the TM buffer is updated. If a miss occurs, the Main memory data will be modified, TM buffer will hold the old data and update the flags, while address FIFO will hold the location of memory access. If it is a hit, the the Main memory data will be modified and the TM buffer will update the flags. During the commit request, corresponding processor checks its conflict buffer. If there is no conflict, the transaction commit is a success. Old data from TM buffer will be discarded. For an entry that is shared and only contain the corresponding processor read flag, only its flag is removed, else the whole entry is removed. During commit, the TM buffer updates itself within one clock cycle. However, when an abort occur, the old data from TM buffer will replace the Main memory data corresponding to the address FIFO. If the location is shared, only the corresponding processor flag is removed, else the whole entry is removed. The lazy conflict management allows conflicting transactions to continue its operation until commit. There will be no overhead for the processor to check at every read and write phase. Random stalls are applied on the processor only if its transactions consequently abort. On top of that, eager version

2746

Fig. 3.

State diagram of the HTM

management allows commit to be done in a single clock cycle. The proposed scheme takes full advantage of version and conflict management in dealing with low contention situations. V.

P ERFORMANCE EVALUATION

The evaluation was done by comparing the proposed architecture with a similar model that does not imply version management. The aim is to observe the performance of the new system towards different levels of contention. In order to model an application at different contention levels, we alter the distribution and isolate some of the memory access. The experiment was carried out with 4 identical hardware modules which were designed to resemble processors pumping in random request to the HTM. All requests are normally distributed, and the contention level is adjusted by changing the standard deviation. A smaller standard deviation would result in a higher contention rate. For cases where lower contention level is needed, some of the memory access for each processor are isolated from each other. We used Quartus II v13.0 to evaluate the system. The low contention HTM was modelled in Verilog. RTL simulation in Quartus was done using ModelSim 6.6 to obtain cycle accurate results.

Fig. 4. Clock cycle versus standard deviation of memory access for With and Without version management

Fig. 4 shows the difference in performance in the number of clock cycles the HTM requires with and without version management for the size of transaction of 8 at different contention levels. Each processor performs 500 successful transactions to the HTM. Fail transactions are retried until all the transactions are committed. With version management requires lesser clock cycles and its performance is more significant and larger standard deviation (less conflicts)

hardware must be able to adapt its configuration based on the application without the need for programmer to specify it. An integrated decision making system is needed to determine the optimized configuration based on the application contention. These are the topics of our future research. ACKNOWLEDGEMENT This work is supported in part by the Ministry of Science, Technology & Innovation of Malaysia (MOSTI) under ScienceFund Grant 06-01-06-SF1098 (UTM Vote No. 4S061) and Universiti Teknologi Malaysia IJN-UTM Flagship grant (Vote No. 00G77). R EFERENCES [1] [2]

[3] Fig. 5. Performance improvement versus percentage of commit at different transaction sizes TABLE II.

Maximum memory location 64 128 256

[4]

A REA COMPARISON IN LUT’ S

With version management 4009 7318 14125

Without Version management 4134 7346 15200

[5]

[6]

Similar experiment was created for different transaction sizes. Fig. 5 shows the percentage of performance improvement at different contention levels for different transaction sizes. The percentage of performance represents the ratio of clock cycle saved towards the overall time taken to complete 2000 transactions (500 for each processor). On the other hand, percentage of commit is ratio of successful transactions to the total transactions attempted. The percentage of improvement is binned at interval of 10 corresponding to the percentage of commit. The HTM with version management recorded up to 14% improvement for transaction sizes between 4 to 16. The reduction in the number of clock cycles is more significant at lower contention level. However, the minimum values of improvement is mainly due to the cpu request spatially locale addresses, making the improvement during commit less significant. The resources usage of the proposed HTM with different sizes is shown in Table II. The addition version management does not incurs more resources usage but improves the performance significantly.

[7]

[8]

[9]

[10]

[11]

[12]

[13]

VI.

C ONCLUSION AND FUTURE WORK

This paper proposed a hardware architecture for low contention HTM on embedded FPGA MPSoC. With version management, we are able to improve the performance of the system up to 14% for transaction between 4 to 16. The similar approach can be done for application with high contention. Since the main objective of applying HTM is to create an abstraction layer for programmers to do multi-threaded programming, having the system configurable is inadequate. The

2747

[14]

[15]

M. L. Navazo, “Hardware approaches for transactional memory,” M.Sc. Thesis, Technical University of Catalonia, 2008. C. J. Rossbach, O. S. Hofmann, and E. Witchel, “Is transactional programming actually easier?” in Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Bangalore, India, Jan 2010, pp. 47–56. M. Herlihy and J. E. B. Moss, “Transactional memory: Architectural support for lock-free data structures,” SIGARCH Comput. Archit. News, vol. 21, no. 2, pp. 289–300, May 1993. L. Hammond, V. Wong, M. Chen, B. D. Carlstrom, J. D. Davis, B. Hertzberg, M. K. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun, “Transactional memory coherence and consistency,” SIGARCH Comput. Archit. News, vol. 32, no. 2, Mar 2004. C. S. Ananian, K. Asanovic, B. C. Kuszmaul, C. E. Leiserson, and S. Lie, “Unbounded transactional memory,” in Proceedings of the 11th International Symposium on High-Performance Computer Architecture, Washington, DC, USA, Feb 2005, pp. 316–327. L. Yen, “Signatures in transactional memory systems,” Ph.D. Dissertation, University of Wisconsin, 2009. M. Lupon, G. Magklis, and A. Gonz´alez, “A dynamically adaptable hardware transactional memory,” in Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, Dec 2010, pp. 27–38. C. Kachris and C. Kulkarni, “Transactional memories for multiprocessor fpga platforms,” Journal of Systems Architecture, vol. 57, no. 1, pp. 160–168, Jan 2011. M. Labrecque and J. G. Steffan, “The case for hardware transactional memory in software packet processing,” in Proceedings of the 6th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, La Jolla, California, USA, Oct 2010, p. 37. C. Ferri, S. Wood, T. Moreshet, R. I. Bahar, and M. Herlihy, “Embedded-TM: Energy and complexity-effective hardware transactional memory for embedded multicore systems,” Journal of Parallel and Distributed Computing, vol. 70, no. 10, pp. 1042–1052, Oct 2010. N. Shavit and D. Touitou, “Software transactional memory,” in Proceedings of the 14th annual ACM symposium on Principles of distributed computing, Ottowa, Ontario, Canada, Aug 1995, pp. 204–213. A. Shriraman, S. Dwarkadas, and M. L. Scott, “Flexible decoupled transactional memory support,” in Proceedings of the 35th Annual International Symposium on Computer Architecture, Beijing, China, June 2008, pp. 139–150. N. Njoroge, J. Casper, S. Wee, Y. Teslyar, D. Ge, C. Kozyrakis, and K. Olukotun, “ATLAS: A chip-multiprocessor with transactional memory support,” in Proceedings of the conference on Design, automation and test in Europe, Nice, France, Apr 2007, pp. 3–8. M. Schoeberl and P. Hilber, “Design and implementation of realtime transactional memory,” in Proceedings of the 20th International Conference on Field Programmable Logic and Applications (FPL), Milan,Lombardy,Italy, Aug/Sept 2010, pp. 279–284. M. Herlihy, V. Luchangco, M. Moir, and W. N. Scherer III, “Software transactional memory for dynamic-sized data structures,” in Proceedings of the 22nd annual symposium on Principles of distributed computing, Boston, MA, USA, July 2003, pp. 92–101.

Suggest Documents