Conflict-address buffer: In signature-based eager conflict detection, the read and write signatures are generated by hashing the addresses accessed by the current transaction, and conflict detection is triggered by the transactional (read or write) request. Specifically, given a transactional request from another core, if a conflict is detected based on the signatures, a NACK (conflict) is returned. We do not change the signature-based conflict detection. We simply use an additional CAB (Fig. 1) to improve it. The CAB is a fully associative memory with a limited number of entries. Each CAB entry consists of four fields: address, valid (V) bit, read (R) bit and write (W) bit. The CAB also contains the logic for entry management and conflict detection. The CAB captures those addresses that generate a conflict during the execution of a thread on the associated core. All CAB entries are invalidated and cleared at the beginning of a thread. A valid CAB entry is created when, given a transactional request, a conflict is detected
Introduction: With all major processor vendors shipping shared-memory multicore processors, multithreaded parallel programming has become important. Transactional memory (TM) simplifies multithreaded programming by executing transactions atomically and in isolation. To do this, a TM system performs version management, conflict detection and conflict resolution. This Letter focuses on eager conflict detection in TM systems implemented with hardware support (hardware TM (HTM)). A conflict occurs when two transactions access the same data and at least one access is a write. A conflict detection strategy is eager if it detects offending memory reads or writes immediately. Approaches for precise conflict detection have been proposed [1], but they require changes in the memory system design. To avoid such architectural coupling, many HTMs use signature-based conflict detection [2]. In this approach, arrays of hardware bits, called signatures, track the addresses read and the addresses written by each transaction. Using a separate bit for each address would give precise conflict detection, but it would require too much hardware. To reduce the hardware overhead, hash functions are used to map addresses to signature bits. This approximation causes aliasing (multiple different addresses mapping to the same set of signature bits), which leads to false conflicts (detection of non-existent conflicts) and consequently to unnecessary transaction aborts. Since false conflicts can significantly affect performance, a variety of signature designs have been proposed to reduce them [3]. In this Letter, we propose a different approach to reducing false conflicts. Our analysis shows that conflicts do not occur evenly over all addresses and that most conflicts occur when accessing a very small set of addresses.
Given this locality, we can potentially reduce false conflicts by performing more precise conflict detection for those addresses that generate conflicts. The issues are therefore how to capture such addresses at runtime and how to perform more precise conflict detection for them. In the proposed approach, in addition to using signature-based conflict detection, we record those addresses that generate a conflict in a buffer called the conflict-address buffer (CAB) and track each transaction's access to the addresses captured in the CAB. Using this information, we perform more precise conflict detection for the captured addresses. The approach allows us to reduce the false conflicts and the associated unnecessary transaction aborts that occur in a signature-based eager HTM.
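The difference between precise and signature-based detection, and the way aliasing produces a false conflict, can be illustrated with a small software model. This is only a sketch under stated assumptions: the 2 kbit signature size, the 64 byte block size, the low-order-bit index function and all names (`Transaction`, `sig_index`, `precise_conflict`, `signature_conflict`) are illustrative choices, not a description of any particular hardware.

```python
from dataclasses import dataclass, field

SIG_BITS = 2048     # assumed signature size (2 kbit)
BLOCK_BYTES = 64    # assumed cache block size


def sig_index(addr: int) -> int:
    """Map a byte address to one signature bit (low bits of the block address)."""
    return (addr // BLOCK_BYTES) % SIG_BITS


@dataclass
class Transaction:
    read_set: set = field(default_factory=set)    # precise: one entry per block
    write_set: set = field(default_factory=set)
    read_sig: int = 0                             # approximate: bit vector
    write_sig: int = 0

    def read(self, addr: int) -> None:
        self.read_set.add(addr // BLOCK_BYTES)
        self.read_sig |= 1 << sig_index(addr)

    def write(self, addr: int) -> None:
        self.write_set.add(addr // BLOCK_BYTES)
        self.write_sig |= 1 << sig_index(addr)


def precise_conflict(tx: Transaction, addr: int, is_write: bool) -> bool:
    """Same block accessed by both sides and at least one access is a write."""
    block = addr // BLOCK_BYTES
    if is_write:
        return block in tx.read_set or block in tx.write_set
    return block in tx.write_set


def signature_conflict(tx: Transaction, addr: int, is_write: bool) -> bool:
    """Same rule, but evaluated on the hashed signature bits."""
    bit = 1 << sig_index(addr)
    if is_write:
        return bool((tx.read_sig | tx.write_sig) & bit)
    return bool(tx.write_sig & bit)


tx = Transaction()
tx.write(0x1000)
# A different block that happens to select the same signature bit.
alias = 0x1000 + SIG_BITS * BLOCK_BYTES
print(precise_conflict(tx, alias, is_write=False))    # False: no real conflict
print(signature_conflict(tx, alias, is_write=False))  # True: false conflict
```

In this model a remote read of `alias` is a false conflict: the blocks differ, but both select the same bit of the write signature.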
…
The use of a conflict-address buffer (CAB) for reducing false conflicts in signature-based eager hardware transactional memory (HTM) is proposed. On the basis of the observation that most conflicts occur when accessing a very small set of addresses, the CAB captures those addresses that generate a conflict at runtime and performs more precise conflict detection for the captured addresses. Using the CAB can reduce false conflicts and the associated unnecessary transaction aborts and, consequently, improve the performance of the multicore processors that implement the signature-based eager HTM. When running the Stanford transactional applications for multiprocessing (STAMP) benchmark on a 16-core processor that implements the LogTM-SE, the speedup (decrease in execution time) achieved with a 4-entry CAB is 9.4% on average.
…
Jinku Kang, Jaeil Jung and Inhwan Lee
A valid CAB entry is created when, given a transactional request, a conflict is detected based on the signatures and the requested address does not exist in the CAB. In this case, the requested address is stored in an invalid CAB entry and its V bit is set. (If the CAB is full, the least-recently-used entry is used to store the new address.) At the same time, if the read and write signature bits associated with the requested address are set, we set the R and W bits of the new CAB entry, respectively. Since aliasing might have already occurred, the value of 1 in the R (W) bit means that the address in the new CAB entry might have been read (written) by the current transaction. The R and W bits of the valid CAB entries are also updated when the transaction running on the associated core performs a memory access. If the address read (written) by the current transaction exists in a valid CAB entry, the R (W) bit of the entry is set if it has not already been set. Since the R and W bits indicate possible access to the address by the current transaction, they are cleared at the time of the transaction commit or abort, similar to the read and write signatures. However, the V bits of the valid CAB entries are not cleared at the time of commit or abort. Instead, they remain set until the end of the current thread. This is because we want to perform more precise conflict detection for the addresses captured in the CAB until the end of the thread.
Let us discuss how the CAB can reduce the number of false conflicts. On receiving a transactional request from another core, the conflict detection controller passes the request to the CAB as well as to the signature-based conflict detection (Fig. 1). Given the request, if the CAB can be sure that there is no conflict, it asks the conflict detection controller not to send a NACK even if signature-based conflict detection declares a conflict. This is how the CAB reduces false conflicts. The question is then how the CAB can be sure that there is no conflict.
Suppose the requested address for read or write exists in the CAB and the associated R and W bits are both clear. This means that the address has never been accessed by the current transaction. Therefore, the CAB can be sure that there is no conflict. Moreover, on a transactional read request, if the requested address exists in the CAB and the associated W bit is clear, the CAB can be sure that there is no conflict because the current transaction has never written to that address. In both cases, if signature-based conflict detection declares a conflict, that is a false conflict due to aliasing, and the CAB can mask it.
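The entry management and masking rules described above can be sketched as a small software model. It is a minimal sketch, not the hardware design: the class and function names are hypothetical, membership in the dictionary stands in for the V bit, and treating every CAB hit as a "use" for LRU purposes is an assumption, since the text only states that the least-recently-used entry is replaced.

```python
from collections import OrderedDict


class CAB:
    """Toy software model of one core's conflict-address buffer (hypothetical names)."""

    def __init__(self, entries=4):
        self.capacity = entries
        # address -> [R, W]. Being present in the dict stands for V = 1.
        # Dict order tracks recency, so the first item is the LRU victim.
        self.table = OrderedDict()

    def reset(self):
        """Start of a thread: invalidate every entry."""
        self.table.clear()

    def record_conflict(self, addr, sig_read_hit, sig_write_hit):
        """Signatures reported a conflict on addr: allocate an entry if absent."""
        if addr in self.table:
            return
        if len(self.table) >= self.capacity:
            self.table.popitem(last=False)          # evict least recently used
        # R and W are seeded from the signature bits selected by addr.
        self.table[addr] = [bool(sig_read_hit), bool(sig_write_hit)]

    def local_access(self, addr, is_write):
        """The local transaction touched addr: set R or W if addr is captured."""
        if addr in self.table:
            self.table[addr][1 if is_write else 0] = True
            self.table.move_to_end(addr)            # assumed LRU update on hit

    def end_transaction(self):
        """Commit or abort: clear R and W, keep the entries valid."""
        for bits in self.table.values():
            bits[0] = bits[1] = False

    def can_mask(self, addr, is_write):
        """True only when the CAB is sure a remote request cannot conflict."""
        if addr not in self.table:
            return False                            # no information
        self.table.move_to_end(addr)                # assumed LRU update on hit
        r, w = self.table[addr]
        if w:
            return False                            # possibly written locally
        return not (is_write and r)                 # remote write vs local read


def nack_decision(cab, addr, is_write, sig_read_hit, sig_write_hit):
    """Return True if a NACK is sent for a remote transactional request."""
    sig_conflict = sig_write_hit or (is_write and sig_read_hit)
    if not sig_conflict:
        return False
    if cab.can_mask(addr, is_write):
        return False                                # false conflict masked
    cab.record_conflict(addr, sig_read_hit, sig_write_hit)
    return True
```

In this model, the first aliased request for an address still produces a NACK and only creates the entry. The benefit appears on later requests to the same address, once the R and W bits have been cleared at a commit or abort.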
Reducing false conflicts in signature-based eager hardware transactional memory
Fig. 1 Conflict detection with signatures and CAB
a External connections
b Data fields in CAB
Unless the CAB can be sure that there is no conflict, it takes no action and relies on signature-based conflict detection. Specifically, the CAB takes no action if at least one of three conditions is met: (i) the requested address does not exist in the CAB, (ii) the requested address exists in the CAB and its W bit is set and (iii) on a transactional write request, the requested address exists in the CAB and its R bit is set. In the first case, the CAB has no information on which to judge the conflict. In the latter two cases, the CAB has to assume that there is a conflict, but it takes no action because signature-based conflict detection will detect the conflict and return a NACK. Note that the R (W) bit can have the value of 1 not because the current transaction has read (written to) the address after the CAB entry was created but because it was set based on the signature when the CAB entry was created. Then, because of aliasing, it is possible that the address with its R (W) bit set has not actually been read (written) by the current transaction. Therefore, while the CAB does reduce false conflicts, it cannot remove all false conflicts associated with the captured addresses. Note that the CAB entries with the W bit set do not contribute to reducing false conflicts while executing the current transaction. However, these entries can contribute when executing future transactions of the current thread because, while their R and W bits are cleared at the commit or abort of the current transaction, their V bits remain set until the end of the current thread.
So far, we have discussed the CAB in the context of a single-threaded core. In a multithreaded core, on a thread switch, we can simply clear all information in the CAB because the CAB is used only for optimisation. Signature-based conflict detection provides correctness by properly managing the signatures on a thread switch.
ELECTRONICS LETTERS 20th November 2014 Vol. 50 No. 24 pp. 1821–1823
Experimental results: We take the LogTM-SE [2] as the baseline signature-based eager HTM for performance evaluation. In the LogTM-SE, conflict detection is performed at the cache block level and works on top of the cache coherence mechanism. The function of the conflict detection controller in Fig. 1a is integrated into the cache coherence controller, and conflict detection is triggered by a coherence request from another core. The read and write signatures are generated by hashing the cache block addresses accessed by the current transaction, and the CAB stores the cache block address in the address field in Fig. 1b. We use SESC [4], an execution-driven multicore architecture simulator, to implement the CAB as well as the baseline 16-core LogTM-SE system. Each core is a 5 GHz 4-issue out-of-order superscalar machine. A core has private 4-way 32 kB L1 instruction and data caches with 64 byte blocks and 2 cycles of access latency. All cores share an 8-way 8 MB L2 cache with 64 byte blocks and 34 cycles of access latency. The MESI directory protocol is employed for cache coherence. The main memory is 4 GB and has 500 cycles of access latency. The inter-core communication latency is 10 cycles.
The sizes of the read and write signatures are 2 kbit each, and the least-significant 11 bits of the cache block address are decoded to select a signature bit. We use timestamp-based conflict resolution with the oldest-transaction-wins policy [5]. We run the Stanford transactional applications for multiprocessing (STAMP) benchmark suite [6].
Table 1 shows the relative execution time on the LogTM-SE with the CAB, compared with the execution time on the LogTM-SE without it. The results show that a small CAB suffices. With a 4-entry CAB, the speedup (decrease in execution time) is 9.4% on average. The speedup is achieved by reducing false conflicts. Table 2 shows the proportion of false conflicts among all conflicts.
One would expect more speedup with the CAB when contention is high. In fact, the CAB is effective for Bayes. However, the speedup can depend strongly on the dynamic behaviour of each application. For example, we obtain more speedup for Yada and Vacation, which have medium contention, than for Intruder and Labyrinth, which have high contention. The data show that most of the conflicts in Intruder are true conflicts (Table 2), so the CAB is less effective. Labyrinth has long transactions that are often blocked by a series of true as well as false conflicts. Although the CAB is effective in reducing false conflicts, the execution time of Labyrinth turns out to be determined mainly by the true conflicts. Genome has low contention, and the speedup for Genome is moderate. We obtain no speedup for K-means and SSCA2 because they have low contention and do not spend much time in transactions. Small fluctuations in the speedup in Table 1 are attributed to the inherent non-determinism of the workload.
Table 1: Relative execution time (%)

             2 kbit read and 2 kbit write signatures             4 kbit signatures
Application  2-entry   4-entry   8-entry   16-entry  ∞-entry     w/o CAB
             CAB       CAB       CAB       CAB       CAB
Bayes        86.8      79.9      76.4      72.1      72.7        89.3
Intruder     94.1      93.7      94.2      93.5      94          96.4
Labyrinth    99.3      96.6      94.6      95.4      94.9        95.7
Yada         98.6      91.8      85.1      82.7      75.3        73.9
Vacation     77.7      71.9      69.7      67.7      64.8        70
Genome       90.7      91.1      90.6      91        90.6        91.6
SSCA2        100       100       100       100       100         100
K-means      100       100       100       100       100         100
Average      93.4      90.6      88.8      87.8      86.5        89.6
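As a cross-check, the Average row for the CAB columns and the 9.4% figure quoted for the 4-entry CAB can be recomputed from the per-application rows. The sketch below assumes the row values as transcribed from Table 1 and takes speedup to mean 100 minus the relative execution time, following the Letter's "decrease in execution time" wording. The variable and function names are illustrative.

```python
# Relative execution time (%) from Table 1.
# Columns: 2-, 4-, 8-, 16- and unbounded-entry CAB with 2 kbit signatures.
cab_sizes = ["2", "4", "8", "16", "inf"]
relative_time = {
    "Bayes":     [86.8, 79.9, 76.4, 72.1, 72.7],
    "Intruder":  [94.1, 93.7, 94.2, 93.5, 94.0],
    "Labyrinth": [99.3, 96.6, 94.6, 95.4, 94.9],
    "Yada":      [98.6, 91.8, 85.1, 82.7, 75.3],
    "Vacation":  [77.7, 71.9, 69.7, 67.7, 64.8],
    "Genome":    [90.7, 91.1, 90.6, 91.0, 90.6],
    "SSCA2":     [100.0, 100.0, 100.0, 100.0, 100.0],
    "K-means":   [100.0, 100.0, 100.0, 100.0, 100.0],
}


def column_average(col: int) -> float:
    """Arithmetic mean of one column over all applications."""
    return sum(row[col] for row in relative_time.values()) / len(relative_time)


def speedup(col: int) -> float:
    """Decrease in execution time relative to the baseline, in percent."""
    return 100.0 - column_average(col)


for col, size in enumerate(cab_sizes):
    print(f"{size}-entry CAB: average {column_average(col):.2f}%, "
          f"speedup {speedup(col):.2f}%")
```

The 4-entry column averages about 90.6%, which corresponds to the reported speedup of about 9.4%.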
Table 2: False conflict rate (%)

             2 kbit read and 2 kbit write signatures                       4 kbit signatures
Application  Without   2-entry   4-entry   8-entry   16-entry  ∞-entry     w/o CAB
             CAB       CAB       CAB       CAB       CAB       CAB
Bayes        95.7      83.2      62.5      31.9      12        8           95.4
Intruder     11.4      0.9       0.7       0.5       0.3       0.3         5
Labyrinth    95.7      62.6      56.4      49.5      28.8      4.9         65.7
Yada         65.9      37.4      23.1      26.6      21        18.9        54.3
Vacation     97.5      84.2      80.1      69.9      68.4      59.5        86.1
Genome       86.9      58.1      51.2      50.2      51.2      50.2        62.6
SSCA2        61        50.5      48.6      48.2      49.7      48.2        40.2
K-means      3.5       3.1       2.8       2.5       2.5       2.5         1.4
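Both tables include a configuration that doubles the signatures to 4 kbit instead of adding a CAB. To get a rough feel for the storage involved in each option, the sketch below counts storage bits only. It rests on assumptions that are not stated in the Letter: each CAB entry holds a full block address plus the V, R and W bits, the block address is 26 bits wide (a 4 GB physical address space with 64 byte blocks), and the associative comparators and LRU state of the CAB are ignored. The sketch therefore understates the CAB's real cost, and the function names are illustrative.

```python
def block_address_bits(memory_bytes: int, block_bytes: int) -> int:
    """Bits needed to name every block in memory."""
    blocks = memory_bytes // block_bytes
    return (blocks - 1).bit_length()


def cab_storage_bits(entries: int, addr_bits: int) -> int:
    """Storage cells only: address field plus V, R and W per entry."""
    return entries * (addr_bits + 3)


def signature_storage_bits(bits_per_signature: int) -> int:
    """One read signature and one write signature per core."""
    return 2 * bits_per_signature


addr_bits = block_address_bits(4 * 2**30, 64)   # assumed: 4 GB memory, 64 byte blocks
cab_bits = cab_storage_bits(4, addr_bits)       # 4-entry CAB
extra_sig_bits = signature_storage_bits(4096) - signature_storage_bits(2048)

print(addr_bits, cab_bits, extra_sig_bits)      # prints: 26 116 4096
```

Under these assumptions a 4-entry CAB adds on the order of a hundred storage bits per core, whereas doubling both signatures adds 4096 bits per core.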
A question is: given that false conflicts can also be reduced by using larger signatures, should we use the CAB or increase the signature size? The last column of Table 1 shows the relative execution time when we do not use the CAB but instead use signatures twice as large (4 kbit read and 4 kbit write signatures). Note that we can obtain a comparable speedup with a 4-entry CAB (the third column of Table 1), which requires much less hardware. In real designs, we can make a trade-off between the CAB size and the signature size to maximise the speedup while minimising the hardware overhead.
Conclusion: This Letter proposes the use of the CAB. The CAB allows us to reduce false conflicts and the associated unnecessary transaction aborts that occur in signature-based eager HTM. Therefore, using the CAB can improve the performance of the multicore processors that implement the signature-based eager HTM. When running the STAMP benchmark on a 16-core processor that implements the LogTM-SE, the speedup achieved with a 4-entry CAB is 9.4% on average.
© The Institution of Engineering and Technology 2014
17 September 2014
doi: 10.1049/el.2014.3375
Jinku Kang, Jaeil Jung and Inhwan Lee (School of Electrical and Computer Engineering, Hanyang University, Seoul 133-791, Republic of Korea)
E-mail: [email protected]
References
1 Bobba, J., Goyal, N., Hill, M.D., Swift, M.M., and Wood, D.A.: ‘TokenTM: efficient execution of large transactions with hardware transactional memory’. Proc. ISCA, Beijing, China, June 2008, pp. 127–138, doi: 10.1109/ISCA.2008.24
2 Yen, L., Bobba, J., Marty, M.R., Moore, K.E., Volos, H., Hill, M.D., Swift, M.M., and Wood, D.A.: ‘LogTM-SE: decoupling hardware transactional memory from caches’. Proc. HPCA, Phoenix, AZ, USA, February 2007, pp. 261–272, doi: 10.1109/HPCA.2007.346204
3 Quislant, R., Gutierrez, E., Plata, O., and Zapata, E.L.: ‘Hardware signature designs to deal with asymmetry in transactional data sets’, IEEE Trans. PDS, 2013, 24, (3), pp. 506–519, doi: 10.1109/TPDS.2012.138
4 Renau, J., Fraguela, B., Tuck, J., Liu, W., Prvulovic, M., Ceze, L., Sarangi, S., Sack, P., Strauss, K., and Montesinos, P.: ‘SESC simulator’. Available at http://www.sesc.sourceforge.net, January 2005
5 Rajwar, R., and Goodman, J.R.: ‘Transactional lock-free execution of lock-based programs’. Proc. ASPLOS, San Jose, CA, USA, October 2002, pp. 5–17, doi: 10.1145/605397.605399
6 Minh, C.C., Chung, J., Kozyrakis, C., and Olukotun, K.: ‘STAMP: Stanford transactional applications for multi-processing’. Proc. IISWC, Seattle, WA, USA, September 2008, pp. 35–46, doi: 10.1109/IISWC.2008.4636089
89.3