Solving Out of Order communication using CAM memory; an implementation

Claudiu Zissulescu-Ianculescu
Alexandru Turjan
Bart Kienhuis
Ed Deprettere
Leiden Embedded Research Center, Leiden Institute of Advanced Computer Science, Leiden, The Netherlands
e-mail: {zissules, aturjan, kienhuis}, [email protected]

Abstract— At the Leiden Embedded Research Center, we are working towards a framework called Compaan that automates the transformation of digital signal processing (DSP) applications to Kahn Process Networks (KPNs). These applications are written in Matlab as parameterized nested loop programs. This transformation is interesting as KPNs are well suited for mapping onto parallel architectures or FPGAs. One of the key problems in the Compaan framework is solving out-of-order communication. In such a case, a FIFO is not sufficient to linearize data and to restore the correct order of the received tokens. As a consequence, a control mechanism is required for temporarily holding and reordering the tokens communicated over the Kahn channels. In this paper, we present a hardware implementation of such a reordering mechanism.

Keywords— Digital Signal Processing, Out of Order Communication, Mapping, FPGA, CAM
I. INTRODUCTION

At the Leiden Embedded Research Center, we are working towards a framework called Compaan [1] that automates the transformation of digital signal processing (DSP) applications to Kahn Process Networks (KPNs). These applications are written in Matlab as parameterized nested loop programs. This transformation is interesting as KPNs are well suited for mapping onto parallel architectures or FPGAs. A KPN consists of concurrent autonomous processes that communicate in a point-to-point fashion over unbounded FIFO channels using blocking-read synchronization. In this network, each process executes an internal function following a local schedule. At each execution (also referred to as an iteration) this function reads/writes data from/to different FIFOs. An input port domain (IPD) of a process is the union of the iterations at which the process' function reads data from the same FIFO. An output port domain (OPD) of a process is the union of the iterations at which the process' function writes data to the same FIFO. Each FIFO uniquely relates an input port to an output port, forming an instance of the classical Producer/Consumer pair. An instance of such a Producer/Consumer pair is given in Figure 1.
[Fig. 1. A Producer-Consumer pair. The Producer's OPD executes: for i = 1:1:N, for j = 1:1:M, token = Read(); FIFO.Put(token); end; end. The Consumer's IPD executes: for i = 1:1:N, for j = 1:1:M, token = FIFO.get(); Write(token); end; end. The two processes are connected by a FIFO.]
The Producer and Consumer processes are connected to each other through a FIFO channel. Of the Producer we show one output port domain (OPD) and of the Consumer we show one input port domain (IPD). Each OPD is uniquely connected to an IPD via a FIFO. There are, however, cases in which a FIFO is not sufficient to linearize data and to restore the correct order of the received tokens. This problem is caused by the different read/write schedules of a Producer/Consumer pair in a KPN. As a consequence, a control mechanism is required for temporarily holding and reordering the tokens communicated over the Kahn channels. In [2], we have proposed the Extended Linearization Model (ELM) to solve this reordering problem. So far, the ELM has only been implemented in software. In this paper we propose a hardware realization of the ELM.

II. THE ELM

In Compaan, one of the steps involved in transforming a nested-loop program to a KPN is Linearization. Linearization is the transformation of an N-dimensional data structure into a single one-dimensional data stream. Normally, the linearization model (LM) used in a Kahn process network is a FIFO buffer. This allows the Consumer to read the tokens in the order in which they were written by the Producer. There are, however, cases in which a FIFO no longer holds as the linearization model. This is the case when the order in which data is produced differs from the order in which it is consumed. Consequently, an extension is needed to handle this out-of-order situation. We have introduced such a new linearization model, which we call the Extended Linearization Model. If no reordering takes place, only FIFOs are needed. In [3], such an implementation has been studied.
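To make this out-of-order situation concrete, the following small C program (an illustration of the ordering mismatch only, not part of Compaan) prints the order in which a Producer fills a two-dimensional array row by row next to the order in which a transposing Consumer, like the one used later in Section V, would need the same tokens; because the two orders differ, a plain FIFO between the two processes cannot deliver the tokens correctly.

  #include <stdio.h>

  #define N 3
  #define M 3

  int main(void) {
      /* Producer visits (i,j) row by row and pushes token number i*M + j. */
      printf("produced: ");
      for (int i = 0; i < N; i++)
          for (int j = 0; j < M; j++)
              printf("%d ", i * M + j);

      /* A transposing Consumer needs the same tokens column by column,
         i.e. token j*M + i at its iteration (i,j): 0 3 6 1 4 7 ... */
      printf("\nconsumed: ");
      for (int i = 0; i < N; i++)
          for (int j = 0; j < M; j++)
              printf("%d ", j * M + i);
      printf("\n");
      return 0;
  }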
The main elements in the Extended Linearization Model are the reordering memory and the controller. In Figure 2, a schematic representation of the ELM is given. It shows the Consumer process (A), the Reorder Memory (B), and the Controller (C).
[Fig. 2. The Extended Linearization Model. The Producer executes: for i = 1:1:N, for j = i:1:N, fifo.Put(FPP(i,j)); end; end, writing into a bounded FIFO. The Consumer process (A) executes: for x = 1:1:N, for y = x:1:N, token = Controller.getFrom(i,j); FCC(token); end; end. Between the FIFO and the Consumer sit the reorder Memory (B) and the Controller (C).]
The Process Description (A) in the ELM is different from the process description when a FIFO is enough to linearize data. Instead of getting tokens directly from a FIFO, the function gets its tokens from the Controller. The Memory (B) stores the tokens produced by the Producer. The memory is randomly accessible, allowing the Controller to reorder tokens into the required order. The Controller (C) converts the sequence in which the tokens are produced into the sequence in which they have to be consumed. The ELM can be realized in software in different ways, as presented in [4]. In this paper, however, we focus on the realization of the ELM in hardware. As hardware platform, we target the Xilinx Virtex-II FPGA. On this platform, special block memories are present that are called SelectRAM+ [5]. This component uses a so-called one-hot bit decoder mechanism, making fast search operations possible. The presence of these memories makes this platform very suitable for implementing KPNs, generated by Compaan, in hardware. A SelectRAM+ block can be used to implement either FIFOs [6] or small Hash maps [7]. In Table I, we show the total number of SelectRAM+ blocks available in the complete Virtex-II family.

TABLE I
AVAILABLE SELECTRAM+ BLOCKS IN DIFFERENT VIRTEX-II DEVICES

  Virtex-II Device   Number of Blocks
  XC2V40             4
  XC2V80             8
  XC2V250            24
  XC2V500            32
  XC2V1000           40
  XC2V1500           48
  XC2V2000           56
  XC2V3000           96
  XC2V4000           120
  XC2V6000           144
  XC2V8000           168

III. THE HARDWARE SOLUTION FOR THE ELM

To realize the ELM in hardware, we want to use a Content Addressable Memory (CAM). According to the analysis done in [4], the CAM realization leads to the most efficient implementation in terms of complexity of the reordering controller and memory usage. A CAM differs from a RAM in that a key is used instead of an address to store and access the content of the memory. The performance of a CAM depends on the speed of finding a key in it. Because we use the special SelectRAM+ component of the Xilinx Virtex-II FPGA, we can perform fast search operations on the content, thanks to the one-hot bit decoder mechanism [7].
Our high-level concept of the hardware solution for the ELM is shown in Figure 3. As one can see, this structure differs from the structure of the ELM given in Figure 2. This change of structure results from our goal to minimize the control and memory usage of a network in which out-of-order communication between nodes takes place. We combined the FIFO channel with the reorder memory and control, moving them from the Consumer side into the Channel. In addition, key units are required at both the Producer and the Consumer side to produce unique keys. The hardware implementation works as follows: a Producer produces a token together with a unique key. This key is generated in the Key Unit on the basis of the iteration at which the token is produced. Next, the token is stored in the Channel memory and referenced by its unique key. When the Consumer requires a particular token, it sends a request for that token to the controller. This request is a key that relates to the correct token at the Producer side. The key is generated in the Key Unit at the Consumer side. For the Consumer to be able to find the correct key, we make use of the fact that an affine function exists, called the mapping function, which relates the Producer's and the Consumer's node domains. In this way, an iteration in the Consumer domain can be expressed as an iteration in the Producer domain. This property ensures that a unique key produced at the Producer side can be regenerated by the Consumer at the correct iteration. The way keys are generated is shown later in this paper. A strong point of the conceptual hardware model of the ELM is that only a single memory is used instead of two memories, as is the case in the ELM model.
[Fig. 3. The high-level conceptual view of the hardware implementation of an ELM. The Producer executes: for i = 1:1:N, for j = i:1:N, Controller.Put(i, j, FPP(i,j)); end; end. The Consumer executes: for x = 1:1:N, for y = x:1:N, token = Controller.get(i,j); FCC(token); end; end. The Channel between them contains the Memory and the Controller. Key Units at the Producer (OPD) and Consumer (IPD) sides generate the keys; a token travels to the Channel as Token + Key (Store), and the Consumer issues a Request with a Key to obtain a Token.]
In this way, we use only one memory buffer to implement the functionality of the FIFO and of the reorder memory. Also, the number of read and write accesses is reduced: a token is written only once and read only once. In the ELM, a token is first written into the FIFO, read from the FIFO, written into the reorder memory, and finally read by the Consumer process. The new Channel acts like a FIFO buffer for the Producer and like a reorder memory for the Consumer, but it still preserves the Kahn Process Network semantic model, as the ELM does. The size of the memory in the Channel must be chosen such that no artificial deadlock is introduced. If the dimension of the reordering memory is larger than the minimum required, the extra free locations simply act as an additional buffer for the incoming tokens. To make sure enough memory is used in the Channel implementation, we make use of a Process Network Simulator that indicates the minimum required memory per Channel. To further explain the conceptual hardware model for the ELM, we now look at the individual components of the hardware model as shown in Figure 3.
A. The Node

Each process is implemented in hardware as a Node. The structure of a hardware Node consists of four elements: a Read unit, a Write unit, an Execute unit, and a Controller. The structure of the Node is shown in Figure 4. The Read unit is the hardware unit that implements the IPDs. The Write unit is the hardware implementation of the OPDs. The Execute unit is the implementation of the function of the process. The Node Controller uses a naive architecture, enabling the Read, Execute, and Write units one at a time. As a result, only one unit is active at any given time instance. A more advanced controller can be envisioned that pipelines the execution of the units. We define the Read-Execute-Write time as the total amount of time that is required for processing two consecutive tokens by a Node. A set of Nodes connected via Channels makes up a network. The execution time of the network is given by the communication delays and the internal execution of the Node components.

[Fig. 4. A Node and its components that implement a process in hardware: a Read Unit (IPDs) with its Key Unit, an Execute Unit taking Arguments, a Write Unit (OPDs) with its Key Unit, and the Node Controller issuing the Control signals.]
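A minimal behavioral sketch of the naive Node Controller described above (plain C, with placeholder unit functions; the actual units are hardware blocks) simply enables the Read, Execute, and Write units one after the other for every iteration:

  extern void read_unit(void);     /* placeholder: implements the IPDs  */
  extern void execute_unit(void);  /* placeholder: fires the function   */
  extern void write_unit(void);    /* placeholder: implements the OPDs  */

  void node_controller(int iterations) {
      /* One Read-Execute-Write cycle per iteration of the process;
       * exactly one unit is active at a time. A pipelined controller
       * would overlap these three phases instead. */
      for (int it = 0; it < iterations; it++) {
          read_unit();     /* only the Read unit is enabled    */
          execute_unit();  /* then only the Execute unit       */
          write_unit();    /* finally only the Write unit      */
      }
  }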
B. The Key Unit

A Key Unit is present in the Read and Write units. Each unit computes a unique key according to the current iteration of a process. The Key Unit is an important unit in our design. It provides a unique key used to index a token and to access the content of the reorder memory. A unique key can be generated based on the shape of the Producer's node domain. Usually, these shapes are expressed as pseudo-polynomial functions, which in general are very difficult to compute [2] and require a lot of resources. In order to overcome this computational problem, we relax the iteration space to an N-dimensional rectangle. For this shape, we can always determine a unique key using the following formula:
f = \sum_{k=1}^{N} C_k X_k + X_0,    (1)

where C_k represents a constant and X_k is a variable.
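As an illustration of Eq. (1), the following C sketch (hypothetical code, not the generated hardware) computes a key for an iteration vector inside the relaxed rectangular domain. The constants C_k are chosen here as the row-major strides of that rectangle, which is one possible choice that guarantees a unique key per iteration; the Consumer obtains the same key by first mapping its own iteration to the corresponding Producer iteration with the affine mapping function and then evaluating the same expression.

  #define MAX_DIM 5

  /* Key for iteration x[0..n-1] inside a rectangular domain whose k-th
   * dimension has size[k] points: f = sum_k C_k * x_k (+ offset).
   * The constants C_k are the row-major strides of the box, so two
   * different iterations can never map to the same key. */
  unsigned key(const unsigned x[], const unsigned size[], int n) {
      unsigned c = 1;   /* C_k, built up from the innermost dimension */
      unsigned f = 0;   /* the key                                    */
      for (int k = n - 1; k >= 0; k--) {
          f += c * x[k];
          c *= size[k];
      }
      return f;         /* the constant offset X_0 is taken as 0 here */
  }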
C. The Channel

The Channel implements the reordering function with the help of a CAM memory. This CAM memory is used for storing tokens and for searching them on the basis of a key. The Controller is responsible for delivering the right token to the Consumer.

D. The Producer

The Producer's OPD produces a unique key for a given iteration. The key is sent together with the produced token to the channel logic. The Producer communicates with the Channel (via the Write Unit implementing the OPD) using a blocking-write protocol. A signal from the Channel that the memory is full blocks the write operation until the Consumer frees up a location in the reorder memory.
E. The Consumer

The Consumer reads from the Channel (via the Read Unit implementing the IPD) a token that was stored with a key identical to the key requested by the Consumer. An empty channel or a non-matching key makes the current read operation an unsuccessful request and the Consumer blocks, implementing blocking-read semantics. The Consumer remains blocked until the correct token becomes available in the Channel.
IV. IMPLEMENTATION

The main focus of this paper is the design and implementation of the CAM channel and the key units. The key units are implemented using the 19-bit multipliers and adders present in current FPGAs. We implemented the key unit as a four-level tree structure: the leaves are multipliers and the nodes are adders. We collect the result from the root adder's output. The structure is fixed to handle indices of up to five dimensions. That is, it can generate a key for an index with 5 iterators (e.g., a(i,j,k,l,m)). The delay in calculating the key is constant for dimensions 1 to 5. Note that the dimension of 5 is arbitrary.

The SelectRAM+ memory built into the Virtex-II devices can be used as a Hash memory [7]. In this way, we can realize a CAM of 32 words of 9 bits using a single SelectRAM+ block. The CAM can be made wider as well as deeper by using more than one SelectRAM+ block to construct the Hash memory. The maximum size of a CAM depends on the type of device used (see Table I). To realize the Channel that performs the reordering, we used the Xilinx technique to build a CAM memory. This CAM architecture uses a Hash memory to store a key and a pointer, and an additional RAM memory to store the tokens, as shown in Figure 5.

In the Hash map, a key/address pair is stored as shown in Figure 6. Each address points toward a location in the RAM holding the token associated with the key/address pair. The address can take values between 0 and 31. When a match occurs (i.e., a key is found), the address of the found key is also known. As a condition, the depth of the RAM is the same as the depth of the CAM realization. In our implementation, we use a 9-bit key and a 36-bit token. The capacity of the Channel is 32 locations, as the Hash map can hold 32 key/address pairs.
[Fig. 5. The CAM architecture. A Hash map built from a SelectRAM+ block stores key/address pairs; Port A is used to store the key attached by the Producer together with the address (the index into the token RAM), and Port B is used to search with the key requested by the Consumer, yielding the match vector. A separate token RAM (Token 00 to Token 31) holds the tokens; a multiplexer selects between store and search.]

[Fig. 6. The pointers. Each Hash-map entry consists of a key (Key[8:0]) and an address (Address[4:0]); the address points to one of the 32 token locations (Token[35:0]) in the RAM. The Producer key is used to index a token; the Consumer key is used to search for a token.]
A. Channel's write operation

For a write operation, the Producer delivers to the Channel a token and its unique key. The controller searches for a free location in the SelectRAM+ block. If one is found, an entry is made from the key/address pair (see Figure 7). This entry is stored via port A into the SelectRAM+. Because there is a one-to-one relation between Hash map entries and the RAM, a free Hash entry relates to a free RAM address. Consequently, the token is stored at this free RAM location. The Channel's write operation has higher priority than the read operation. So, when these two operations occur at the same time, the Read cycle is delayed until the Write cycle has finished.
[Fig. 7. The Hash memory entry. The storage key (the key attached by the Producer Controller) and the address (the index into the token RAM) together form the entry written into the Hash map; the search key (the key requested by the Consumer) is later applied to find it via the match vector.]
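For illustration, the key/address pair of Figure 7 can be viewed as a 14-bit entry. The packing below is a hypothetical C sketch; the exact bit layout inside the SelectRAM+ block is not prescribed here.

  #include <stdint.h>

  /* A Hash-map entry: 9-bit key (Key[8:0]) and 5-bit address (Address[4:0])
   * pointing into the 32-location token RAM. The layout is illustrative. */
  static inline uint16_t make_entry(uint16_t key, uint8_t addr) {
      return (uint16_t)(((key & 0x1FF) << 5) | (addr & 0x1F));
  }

  static inline uint16_t entry_key(uint16_t e)  { return (e >> 5) & 0x1FF; }
  static inline uint8_t  entry_addr(uint16_t e) { return (uint8_t)(e & 0x1F); }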
B. Channel's read operation

For a read operation, the Controller accesses port B of the Hash map with the key required by the Consumer's IPD.
This step of the read operation is called the check cycle. The Hash map returns to the controller a 32-bit wide match vector, which contains a '1' at the position where the key is located, or is all zeros if the key is not found. This match vector is represented in the one-hot bit format. We implemented a one-hot-to-binary decoder to derive, from the match vector, the read address of the token stored in the RAM. A pointer to the free addresses is kept in a vector whose length equals the depth of the CAM. A zero in this vector represents a free CAM address; when all the values in this vector are one, the memory is full. When a token is found and read by the Consumer, the channel implementation releases the corresponding location in the reordering memory. We define this second step of the read operation as the erase cycle. As a consequence, the current channel implementation does not support a token being written once but read multiple times, a feature known as broadcast. The fact that the CAM architecture releases a token as soon as it has been read makes this CAM realization very efficient: the absolute minimum amount of memory is consumed to perform the reordering.

C. Limitations of the design

A write operation takes two cycles, and a successful read operation takes four cycles: two for reading from memory and two for erasing the memory location that was read. While the check cycle and the write operation can run concurrently, the erase cycle inhibits the write operation until it has completed successfully. Furthermore, the design does not support the broadcast feature, nor different clocks for the read and write operations.
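The following C fragment is a purely behavioral model of the Channel operations just described, under our own illustrative naming and not the SelectRAM+-based hardware: a write stores a key/token pair in the first free of the 32 locations, a read performs the check cycle (search for the requested key) and, on success, the erase cycle that frees the location. As in the hardware, a token can be read only once, so broadcast is not supported.

  #include <stdint.h>
  #include <stdbool.h>

  #define DEPTH 32                 /* one SelectRAM+ based CAM: 32 locations */

  typedef struct {
      uint16_t key[DEPTH];         /* Hash map: 9-bit keys                   */
      uint32_t token[DEPTH];       /* token RAM, one-to-one with the keys    */
      uint32_t used;               /* bit k set => location k is occupied    */
  } channel_t;

  /* Write operation: find a free location, store key and token there.
   * Returns false (blocking-write) when the reorder memory is full.         */
  bool channel_put(channel_t *ch, uint16_t key, uint32_t token) {
      for (int k = 0; k < DEPTH; k++) {
          if (!(ch->used & (1u << k))) {
              ch->key[k]   = key & 0x1FF;
              ch->token[k] = token;
              ch->used    |= 1u << k;
              return true;
          }
      }
      return false;                /* full: Producer must retry later        */
  }

  /* Read operation: check cycle (search the key), then erase cycle.
   * Returns false (blocking-read) when the key is not present yet.          */
  bool channel_get(channel_t *ch, uint16_t key, uint32_t *token) {
      for (int k = 0; k < DEPTH; k++) {          /* check cycle              */
          if ((ch->used & (1u << k)) && ch->key[k] == (key & 0x1FF)) {
              *token = ch->token[k];
              ch->used &= ~(1u << k);            /* erase cycle: free slot   */
              return true;                       /* token is read only once  */
          }
      }
      return false;                /* no match: Consumer blocks              */
  }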
V. EXAMPLES

In this section we show the main characteristics of a CAM channel compared with a FIFO channel. We also present how the reordering affects the network execution time.

A. In-order communication with a FIFO channel

Let us analyze a Producer/Consumer pair that uses a FIFO; a simple Matlab algorithm exposing this behavior is presented in Figure 8. We remark that the Producer/Consumer pair given in Figure 1 is described by this Matlab code. The first pair of nested loops represents the Producer process. A sequence of data is produced by the function Read and stored in the two-dimensional array x. The second pair of nested loops represents the Consumer process. The function Write reads data from the array x, consuming the data produced by the first process. Because the order in which data is produced is the same as the order in which the data is consumed, a FIFO buffer is enough to linearize array x.

Fig. 8. An in-order Producer-Consumer pair:

  %parameter M 1 10;
  %parameter N 1 10;
  for i= 1:1:N,
    for j= 1:1:M,
      [x(i,j)] = Read();
    end
  end
  for i= 1:1:N,
    for j= 1:1:M,
      [Sink(i,j)] = Write(x(i,j));
    end
  end

If we consider that both Nodes have the same 40ns clock and that the parameters are N=4 and M=4, we find that the Producer stops after 7100ns and the Consumer after 7820ns. To perform a write operation the OPD needs 2 cycles, and for a read operation the IPD needs 2 cycles. These numbers are valid for the write operation when the channel is not full and for the read operation when it is not empty. The Producer has a setup time of 60ns and an average Read-Execute-Write time of 440ns. One additional cycle has to be added for the situation in which the FIFO is full. The total amount of time needed for a Producer to finish is:

T_{producer} = T_{setup} + T_{REW} \cdot No_{tokens}.    (2)

And, for each token that cannot be written into the channel because the channel is full, the maximum extra time is:

T_{extra} = T_{REW} \cdot \mathrm{int}\!\left( \frac{CLK_{producer}}{CLK_{consumer}} \right),    (3)

where T_{producer} is the total amount of time needed by a Producer to finish, T_{setup} the setup time needed by a Node to enter the normal read-execute-write cycle, T_{REW} the time taken by one Read-Execute-Write cycle, CLK_{producer} the Producer clock rate, and CLK_{consumer} the Consumer clock rate. For the Consumer the setup time is 140ns and the Read-Execute-Write cycle is 480ns. We can apply the same equations as in the case of the Producer to find the total amount of time needed for this Consumer to finish.
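As a check, substituting the figures quoted above (16 tokens for N = M = 4, equal clocks so Eq. (3) adds no extra term, setup times of 60ns and 140ns, and Read-Execute-Write times of 440ns and 480ns) into Eq. (2) reproduces the reported totals:

T_{producer} = 60\,ns + 16 \cdot 440\,ns = 7100\,ns, \qquad T_{consumer} = 140\,ns + 16 \cdot 480\,ns = 7820\,ns.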
B. In-order communication with a CAM channel

As a second experiment, we replaced the FIFO from the first network with a CAM channel. We repeated the entire experiment using the same parameters and clock rate.
Because no reordering takes place in the algorithm from Figure 8, the CAM has to behave just like a FIFO. This experiment gives insight into the communication overhead of the CAM versus a FIFO. To read a token from the channel, the Read unit needs 4 cycles (160ns). If the channel is full, one extra cycle is required to read the data. The Write unit uses the same communication protocol as in the case of the FIFO channel. From the point of view of the Producer, the channel characteristics are the same whether it is a FIFO or a CAM. As a result, the token production finishes after 7100ns. For the Consumer, the Read-Execute-Write time is 560ns, exactly 2 clock cycles more than in the FIFO version. So, the total amount of time needed by the Consumer to read all the tokens from the channel is 9420ns. The setup time is 140ns. An extra time of 320ns is required for the first read because the Consumer finds the channel busy/empty. We can apply the same equations to compute the total amount of time needed to finish the nodes.
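The 9420ns quoted above is consistent with the same accounting, now with the 560ns Read-Execute-Write time and the extra 320ns spent on the first read while the channel was still empty (this decomposition of the reported figure is our own):

T_{consumer} = 140\,ns + 16 \cdot 560\,ns + 320\,ns = 9420\,ns.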
C. Out-of-order communication

In the third experiment we interchange the indices i and j of the array x in the Consumer from Figure 8 to create an out-of-order situation. This new algorithm is given in Figure 9 and is also known as the transpose algorithm, used in applications such as Picture-in-Picture (PiP) [8]. Our goal is to analyze this network and to see how the reordering affects the communication performance. Again, we use the same parameters as in the previous examples.

Fig. 9. An out-of-order Producer-Consumer pair:

  %parameter M 1 10;
  %parameter N 1 10;
  for i= 1:1:N,
    for j= 1:1:M,
      [x(i,j)] = Read();
    end
  end
  for i= 1:1:M,
    for j= 1:1:N,
      [Sink(i,j)] = Write(x(j,i));
    end
  end

The data is written into the array x in the order 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, and it is read out in the order 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15, 4, 8, 12, 16. As we can see, a simple FIFO is not enough to linearize the array x. The ELM solves this problem using a reorder mechanism, such that the Consumer reads the right sequence of tokens. In our hardware implementation, we use a CAM channel to accomplish this task. In this example the Producer ends after 7100ns; it has the same behavior as in the FIFO experiment. The out-of-order communication introduces delays, as the Consumer must wait for the right token, which might not have been produced yet. As an example, let us look at token number 5 in the Consumer sequence. The Consumer must wait for 4 tokens to be produced (including token 5 itself) before it can consume it. For those tokens, which were produced but not yet consumed, the process waits T_{wait} = No_{tokens} \cdot T_{REW,producer}, which here means 1760ns, where No_{tokens} is the number of tokens produced with a different key than the search key. We can see that up to token 13 the Consumer is delayed; from then on it finds the tokens in the channel's memory. The Consumer schedule finishes after 13000ns. The reordering affects the read time for one token by at most T_{extra} = Dim_{MAXCAM} \cdot T_{REW,producer}, where Dim_{MAXCAM} represents the maximum dimension of the reorder memory.

D. Compared results

The results of the experiments are shown in Table II. It shows that for a Producer it does not matter whether the channel is implemented as an LM or an ELM. However, the Consumer's time depends on the type of communication and on whether the channel has to reorder or not. For each token that has to be reordered, the Consumer must wait until the required data is produced; the total number of tokens it has to wait for directly affects the overall execution time of the network.
TABLE II
THE EXPERIMENT RESULTS

  Name of the Test     Producer   Consumer   Time/Token
  In-order with FIFO   7100ns     7820ns     488.75ns
  In-order with CAM    7100ns     9420ns     588.75ns
  Out-of-order         7100ns     13000ns    812.50ns
The first two examples show us that, for the Consumer, the performance level of the CAM implementation is lower than that of the FIFO implementation. Furthermore, the execution time of an out-of-order network depends on the number of tokens that need to be reordered. In Table III and Table IV, we show the main characteristics of a CAM, a FIFO, and a Key unit. To implement just a 32-location CAM, we need one SelectRAM+ block and an extra RAM memory to store the tokens.
TABLE III
THE COMPONENT DELAYS

  Component   Cycles for Write   Cycles for Read
  FIFO        2                  2
  CAM         2                  4
  Key Unit    NA                 1
TABLE IV
THE COMPONENT SIZE CHARACTERISTICS

  Component   Size (gates)   Capacity (tokens)   XC2V6000 area
  FIFO        66,805         511                 1.11%
  CAM         70,021         32                  1.16%
  Key Unit    17,608         NA                  0.29%
The Key component is smaller than the other components, and it is fast enough to perform one evaluation per cycle. We mention that the average clock delay for a KPN mapped onto the Virtex-II platform is 40.18ns. This delay was collected after the place-and-route step. This network uses mixed FIFO/CAM communication. In terms of silicon efficiency, one FIFO location requires fewer gates than a CAM location. The Virtex-II 6000 platform allows us to implement 144 FIFO channels of 511 locations of 36 bits; the size of the FIFOs is always 511 locations. On the Virtex-II 6000 platform, we can also implement 144 CAM channels of 32 locations of 36 bits with a 9-bit key. In our examples, we use a 9-bit key and 9-bit wide tokens. A CAM can be enlarged in increments of 32 locations, each increment requiring an additional SelectRAM+ block. Hence it is possible to implement a single CAM of 144*32 locations. It is interesting to remark that in the KPNs seen so far, the sizes of the FIFOs are typically small: values of 1 to 5 tokens have been observed. For the CAM implementations, we typically see that many more locations than 32 are required. This depends on the parameters of the algorithm and on the reordering schedule.

As can be seen in Table IV, the Key Units are very small. At the same time, the key mechanism employed in the ELM implementation gives a lot of flexibility. If the iteration domains of the IPD and OPD are not dense, the key mechanism still works correctly. A non-dense iteration domain results when, for example, step sizes larger than 1 are used in the for-statements, or when pseudo-linear operators such as mod, div, floor, ceil, max, or min are used.
VI. CONCLUSIONS

This paper has presented a hardware implementation of the Extended Linearization Model based on a Content Addressable Memory (CAM) architecture. We have realized the CAM architecture on a Xilinx FPGA. We compared the performance of this solution with the implementation of the Linearization Model, in which a FIFO is sufficient to linearize an array. We have conducted three experiments in order to assess the communication overhead introduced by the new type of channel. The number of tokens that have to be reordered in one channel influences the execution time of the network. Our future work will focus on adding a broadcast mechanism and on using the FIFO and CAM realizations in real DSP applications.
REFERENCES

[1] Bart Kienhuis, Edwin Rypkema, and Ed Deprettere, "Compaan: Deriving process networks from Matlab for embedded signal processing architectures," in Proceedings of the 8th International Workshop on Hardware/Software Codesign (CODES), San Diego, USA, May 2000.
[2] Alexandru Turjan, Bart Kienhuis, and Ed Deprettere, "A compile time based approach for solving out-of-order communication in Kahn process networks," in Proceedings of the IEEE 13th International Conference on Application-specific Systems, Architectures and Processors, July 17-19, 2002.
[3] Tim Harriss, Richard Walke, Bart Kienhuis, and Ed F. Deprettere, "Compilation from Matlab to process networks realized in FPGA," in Proceedings of the 35th Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, November 4-7, 2001.
[4] Alexandru Turjan, Bart Kienhuis, and Ed Deprettere, "Realizations of the extended linearization model in the Compaan tool chain," Samos, Greece, 2002.
[5] Xilinx, "Using the Virtex Block SelectRAM+ Features," www.xilinx.com, December 2000.
[6] Xilinx, "FIFOs Using Virtex-II Block RAM," www.xilinx.com, June 2001.
[7] Xilinx, "Using Virtex-II Block RAM for High Performance Read/Write CAMs," www.xilinx.com, February 2002.
[8] Bart A.C.J. Kienhuis, Design Space Exploration of Stream-based Dataflow Architectures: Methods and Tools, Ph.D. thesis, Delft University of Technology, The Netherlands, January 1999.