IEEE JOURNAL OF SELECTED TOPICS IN QUANTUM ELECTRONICS, VOL. 5, NO. 2, MARCH/APRIL 1999
Two-Dimensional Parallel Pipeline Smart Pixel Array Cellular Logic (SPARCL) Processors— Chip Design and System Implementation Charles B. Kuznia, Member, IEEE, Jen-Ming Wu, Member, IEEE, Chih-Hao Chen, Bogdan Hoanca, Lily Cheng, Allan G. Weber, Member, IEEE, and Alexander A. Sawchuk, Fellow, IEEE
Abstract— We describe the chip design and system implementation of an optoelectronic parallel pipeline processing system composed of cascaded stages of smart pixel array cellular logic (SPARCL) processors interconnected with free-space digital optic channels. The SPARCL processing elements are arranged in a two-dimensional array, and each contains an independent optical input/output port and electrical nearest-neighbor local interconnections. The smart pixels are implemented using GaAs–GaAlAs multiple-quantum-well diode arrays flip-chip bonded onto complementary metal–oxide–semiconductor circuitry through the Bell Labs Lucent Technologies/George Mason University optoelectronic VLSI foundry. This system provides efficient execution of single-instruction multiple-data algorithms on large data fields and images. Index Terms—Cellular logic arrays, CMOS integrated circuits, integrated optoelectronics, optical interconnections, smart pixels.
I. SMART PIXELS FOR SIMD COMPUTING
SINGLE-INSTRUCTION multiple-data (SIMD) systems maintain low cost and high performance by replicating many simple processing elements (PE’s) on a very large scale integrated (VLSI) chip to form a two-dimensional (2-D) processing array, and interconnecting the PE’s using a mesh topology. By distributing the computing power uniformly across one or more of the VLSI planes, high processing performance per chip area can be achieved. The PE can be a low-complexity 1-bit processor implemented with a few hundred transistors. All data paths are local, within a single PE or to neighboring PE’s, allowing for high on-chip clock rates. This results in an architecture that can perform very fast data processing [1]–[4]. In fact, SIMD systems were the first
Manuscript received November 9, 1998; revised April 13, 1999. This work was supported by the Air Force Office of Scientific Research through the Joint Services Electronics Program under Contract F49620-97-10238, by the National Center for Integrated Photonic Technology (NCIPT) program funded by DARPA under Contract MDA972-94-1-0001, and by the Integrated Media Systems Center, a National Science Foundation Engineering Research Center, and in part by the Annenberg Center for Communication, University of Southern California, Los Angeles, CA, and in part by the California Trade and Commerce Agency. C. B. Kuznia, C.-H. Chen, B. Hoanca, L. Cheng, A. G. Weber, and A. A. Sawchuk are with the Signal and Image Processing Institute, University of Southern California, Los Angeles, CA 90089-2564 USA. J.-M. Wu was with the Signal and Image Processing Institute, University of Southern California, Los Angeles, CA 90089-2564 USA. He is now with Sun Microsystems, Palo Alto, CA 94303 USA. Publisher Item Identifier S 1077-260X(99)05476-3.
machines to achieve billion operations per second capabilities in computing systems [5]. A commonly cited advantage of SIMD systems is their scaling properties. Decreasing VLSI feature sizes allows for higher density PE implementation and thus larger processing array sizes per chip. Ideally, the processing throughput per chip, defined as the number of data elements processed divided by the processing time, should increase linearly with the processing array size. However, since the processing time includes the time required to load the data into the processing array, process the data, and then unload the data, the processing speed is also sensitive to the data input/output (I/O) bandwidth of the chip. The fundamental problem of data I/O in SIMD systems stems from the 2-D nature of the processing array and the one-dimensional (1-D) nature of the data I/O ports and electronic busses. The 2-D data fields typically enter the processing array in a row-parallel format along a border of the array. This I/O bottleneck creates latency and reduces processing throughput [6]. The bandwidth of interchip interconnects on printed circuit boards (PCB’s) and backplanes further limits the processing speed when the processing array is distributed across several chips. This paper describes the hardware realization of a multistage smart pixel array cellular logic (SPARCL) processor, a SIMD machine whose PE’s are implemented with smart pixel technology. Smart pixels integrate high-density optical signal detection and transmission circuitry within high-performance VLSI circuitry. The SPARCL system eliminates the I/O bottleneck by providing each PE with an off-chip optical I/O port operating at on-chip clock rates. The SPARCL transfers image or 2-D data fields directly between PE arrays using 2-D parallel free-space digital optical channels. 
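The throughput definition above (data elements processed divided by the total load, process, and unload time) can be illustrated with a short sketch. The function names, the 20-MHz clock, and the three-cycle processing time are illustrative assumptions. The sketch compares an edge-fed array, which moves one column per clock cycle, with an array that has one optical port per PE and therefore loads or unloads in a single cycle.

```python
def throughput(n_elements, load_cycles, process_cycles, unload_cycles, clock_hz):
    """Data elements processed per second for one load/process/unload pass."""
    total_cycles = load_cycles + process_cycles + unload_cycles
    return n_elements * clock_hz / total_cycles


def electronic_throughput(rows, cols, process_cycles, clock_hz):
    """Edge-fed array: one column enters or leaves per clock cycle."""
    return throughput(rows * cols, cols, process_cycles, cols, clock_hz)


def optical_throughput(rows, cols, process_cycles, clock_hz):
    """One optical port per PE: the whole array loads or unloads in one cycle."""
    return throughput(rows * cols, 1, process_cycles, 1, clock_hz)


if __name__ == "__main__":
    CLOCK_HZ = 20e6      # illustrative clock rate (assumption)
    PROCESS_CYCLES = 3   # illustrative instruction count per data page (assumption)
    for rows, cols in [(5, 10), (50, 100), (500, 1000)]:
        e = electronic_throughput(rows, cols, PROCESS_CYCLES, CLOCK_HZ)
        o = optical_throughput(rows, cols, PROCESS_CYCLES, CLOCK_HZ)
        print(f"{rows}x{cols}: electronic {e:.3g}/s, optical {o:.3g}/s, ratio {o / e:.1f}")
```

Under these assumptions the optical throughput grows in proportion to the number of PE’s, while the edge-fed throughput grows only with the array's side length once loading dominates the cycle count.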
The SPARCL system exhibits several key features: the processing throughput scales linearly with the number of PE’s; each SPARCL VLSI plane is completely utilized for PE implementation, maximizing the processing power per chip area; interchip communication occurs at on-chip rates; array loading and unloading occur in a single clock cycle, since the number of data I/O ports equals the number of PE’s; and the SPARCL system scales in three dimensions: with the 2-D processor array size contained in each VLSI plane and by increasing the number of SPARCL stages. Our goal is to demonstrate an optoelectronic smart pixel architecture that alleviates the I/O bottleneck due to data
1077–260X/99$10.00 © 1999 IEEE
KUZNIA et al.: 2-D PARALLEL PIPELINE SPARCL PROCESSORS—CHIP DESIGN AND SYSTEM IMPLEMENTATION
rate and bus width mismatch between printed circuit board interconnects and 2-D processor arrays. We target applications where data processing algorithms exhibit a natural 2-D format and require large throughput parallel processing, such as video processing, high-resolution image processing, and 3-D graphics rendering. In the following sections, we describe the SPARCL chip design, its implementation and operation in an optical system for realizing a multistage processor, and discuss its scalability.
II. SPARCL CHIP DESIGN
The SPARCL chip was implemented using the optoelectronic VLSI (OE/VLSI) technology offered by Bell Labs, Lucent Technologies through the DARPA-sponsored CO-OP program at George Mason University.1 This technology flip-chip bonds GaAs–AlGaAs multiple-quantum-well (MQW) diode arrays onto CMOS circuitry fabricated through a standard CMOS foundry [7]. The MQW diodes can operate as optical detectors, modulators, or light-emitting diodes to create free-space optical data I/O ports within the VLSI circuitry. Since the I/O ports operate at on-chip clock rates and are densely integrated (28 000 MQW’s/cm²), this technology can achieve very high aggregate bandwidth (>1 Tb/s·cm²) between the core VLSI circuits of chips connected with free-space optical links. OE/VLSI technology has spawned the creation of several advanced free-space digital optoelectronic systems [8]–[14]. Perhaps the most sophisticated is a 256 × 256 ATM switch chip with more than 4000 optical I/O’s operating at 200 Mb/s [15]. Each CO-OP participant was allowed to design CMOS logic on a 2 × 2 mm² area of silicon real estate. Five copies of each such chip were fabricated and had an array of 20 × 10 MQW diodes flip-chip bonded to the CMOS logic. With appropriate circuitry and contacts, the diodes can function as a detector, a light-emitting diode (LED), or an optical modulator at 850 nm.
The circuitry was fabricated using Hewlett-Packard’s 0.8-μm complementary metal–oxide–semiconductor (CMOS) process, which provides three metal layers for interconnect. In the OE/VLSI process, the third metal layer is used to make flip-chip bond pads for contacting the MQW diodes. The diodes have optical windows of 18 × 18 μm² and are placed on a 62.5 × 125 μm pitch, covering an area of 1.25 × 1.25 mm² centered within the chip. We used the core 1.25 × 1.25 mm² of the chip to create a 5 × 10 array of smart pixel processing elements, each of size 250 × 125 μm². Accordingly, each SPARCL chip has the capability to process 50 data elements in parallel. Fig. 1 is a photo of the SPARCL chip after flip-chip bonding of the MQW diodes. Overlaid on the photo is a 5 × 10 array of rectangles showing the smart pixel locations. On the perimeter of the 5 × 10 smart pixel array are memory elements (indicated by the small squares) used as dummy border pixels for image processing and buffers for driving global instruction and clock lines. The remaining outer ring of circuitry is dedicated to the 40 wire-bond pads and occupies 1.75 mm² or 43.5% of the chip’s total area.
1 [On-line]. Available HTTP: http://co-op.gmu.edu/
When processing arrays of data or images that are larger than 5 × 10 (data fields that must be processed in sequential blocks), the dummy pixel memory elements contain the data bordering the block being processed. This was designed so that smart pixels lying on the array border can access valid neighboring data elements. Each smart pixel is connected bidirectionally to its four neighboring smart pixels or memory elements in a mesh topology. The SPARCL array can be loaded and unloaded either optically or electronically. When electronically loading data, the SPARCL brings in a new column of data from the seven input wire-bond pads on each clock cycle and each column of data shifts to the right (Fig. 1). The SPARCL unloads data in the same manner, by outputting a five-bit column of data on each clock cycle (data in the bordering memory elements are not output). Loading or unloading of the SPARCL array by purely electronic means (through the mesh network) requires ten clock cycles. Optical loading and unloading can be performed simultaneously on the entire array and occurs in a single clock cycle.
A. Smart Pixel Circuitry
The SPARCL processing core consists of a 5 × 10 array of identically replicated smart pixel circuits. Each smart pixel contains 187 transistors implementing a cellular logic processor element within an area of 250 × 125 μm², as shown in Fig. 2. Because of the limited availability of CMOS/MQW foundry runs, we chose a conservative design strategy using standard cell logic from Tanner, Inc. and standard cell optical I/O circuits from Lucent Technologies. Using custom layout methods would result in a much more compact and higher performance smart pixel layout [16]. Fig. 3 shows the functional block diagram of the smart pixel. A 16-bit bus broadcasts the 15-bit instruction and clock signals to all smart pixels in the array. The 15-bit instruction sets the function of the smart pixel on each clock cycle. Three D-flip-flops implement the pixel memory.
On each clock cycle, the union/complement section performs the logic OR and logic NOT operations on any combination of the three bits stored in memory. The output of the union/complement section is routed to the four nearest neighbors of the pixel. The dilation section performs a logic OR operation on any combination of the outputs of the union/complement section within its own pixel and four neighbors. The output of the dilation is optically transmitted (to the next SPARCL chip) and also fed back to a 2 : 1 multiplexer. The multiplexer either passes the output of the dilation section or the output of the optical detector. The memory select routes the bit from the 2 : 1 multiplexer into selected memory elements. On each clock cycle, the data from the memory travels through a maximum of four logic gates and is stored back into memory (although this memory might be located on another optically interconnected SPARCL chip). Because the maximum data path length is very short (compared to most data processing architectures), this architecture can exhibit high performance data processing. Essentially, this smart pixel implementation adds a bit serial optical input/output port to each cellular logic PE. This optical port adds minimal overhead circuitry to the PE, namely, the 2 : 1 multiplexer, receiver, and transmitter circuitry.
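The data path just described can be modeled in a few lines. The sketch below is a behavioral illustration only. The `Instruction` fields are a hypothetical grouping of the control signals and do not reproduce the chip's actual 15-bit encoding, which is not given here. Off-array neighbors are read as zero for simplicity, whereas the chip supplies them from the border memory elements.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Instruction:
    """Hypothetical control fields; not the chip's actual 15-bit encoding."""
    union_select: Tuple[bool, bool, bool]                # memory bits fed to the OR
    complement: bool                                     # invert the OR result
    dilate_select: Tuple[bool, bool, bool, bool, bool]   # self, N, S, E, W
    use_optical_input: bool                              # 2:1 multiplexer setting
    memory_write: Tuple[bool, bool, bool]                # memory bits to overwrite


def union_complement(memory, instr):
    """OR the selected memory bits, then optionally invert the result."""
    value = any(bit for bit, sel in zip(memory, instr.union_select) if sel)
    return (not value) if instr.complement else value


def clock_cycle(grid, instr, optical_in=None):
    """Advance every pixel by one clock; returns (new_grid, optical_out)."""
    rows, cols = len(grid), len(grid[0])
    uc = [[union_complement(grid[r][c], instr) for c in range(cols)]
          for r in range(rows)]

    def uc_at(r, c):
        # Assumption: off-array neighbors read as 0 in this model.
        return uc[r][c] if 0 <= r < rows and 0 <= c < cols else False

    new_grid, optical_out = [], []
    for r in range(rows):
        new_row, out_row = [], []
        for c in range(cols):
            neighborhood = (uc[r][c], uc_at(r - 1, c), uc_at(r + 1, c),
                            uc_at(r, c + 1), uc_at(r, c - 1))
            dilated = any(v for v, sel in zip(neighborhood, instr.dilate_select) if sel)
            out_row.append(dilated)  # value sent to the optical transmitter
            if instr.use_optical_input:
                if optical_in is None:
                    raise ValueError("optical_in is required when use_optical_input is set")
                bit = optical_in[r][c]
            else:
                bit = dilated
            new_row.append(tuple(bit if w else old
                                 for old, w in zip(grid[r][c], instr.memory_write)))
        new_grid.append(new_row)
        optical_out.append(out_row)
    return new_grid, optical_out
```

For example, selecting memory bit 0, enabling all five dilation inputs, and writing to memory bit 1 spreads a single set pixel into a plus-shaped neighborhood in one call to `clock_cycle`.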
Fig. 1. Photo of the 2 × 2 mm² SPARCL chip with flip-chip bonded MQW diodes. Overlaid on this photo are rectangles indicating each smart pixel in the 5 × 10 array and squares representing bordering “dummy-pixel” memory elements. The SPARCL performs data I/O either directly through smart pixel optical I/O ports in a single clock cycle or by shifting data from left to right through the mesh connected smart pixel array in ten clock cycles.
The SPARCL design, based on the digital optical cellular image processor (DOCIP) architecture [17], efficiently performs morphological image processing, and is the second implementation of a DOCIP machine. The first DOCIP system was constructed using holographically interconnected optical logic gates implemented with liquid crystal light valve technology. The first DOCIP was truly an optical computer, in that no processing was performed electronically [18]. We program the SPARCL using a set of three morphological image processing instructions based on binary image algebra (BIA) [19]. The three instructions, which operate on binary images, are: complement—invert each pixel’s bit in an image; union—perform a logical OR between two images; and dilation—perform a dilation on an image using another image as the dilation kernel. Because the circuitry
can perform the OR and NOT functions, it is “logically complete” and can perform general logic processing. After verifying that all SPARCL functions were operational, we implemented several algorithms on SPARCL. Binary image processing algorithms can be implemented very efficiently. For example, edge detection requires only three instructions (and, therefore, three clock cycles). SPARCL also performed parallel arithmetic algorithms, in which the bits constituting a data element are stored and processed across several smart pixel PE’s in parallel. The most complex algorithm implemented performed a critical processing block within an MPEG video compression algorithm. Further discussion and examples of SPARCL performing image and digital video processing are given by Wu et al. [20].
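As an illustration of how the three BIA primitives compose, the following sketch implements complement, union, and dilation on small binary images stored as lists of 0/1 rows, and builds one possible edge detector from them. The cross-shaped kernel, the helper names, and the particular composition are assumptions chosen for clarity; they are not necessarily the three-instruction sequence run on the chip. Pixels outside the image are treated as 0 during dilation.

```python
def complement(img):
    """Invert every pixel of a binary image."""
    return [[1 - p for p in row] for row in img]


def union(a, b):
    """Pixel-wise logical OR of two binary images of equal size."""
    return [[p | q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def dilation(img, kernel):
    """Dilate img by kernel; the kernel origin is taken at its center."""
    rows, cols = len(img), len(img[0])
    kr, kc = len(kernel) // 2, len(kernel[0]) // 2
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not img[r][c]:
                continue
            for i, krow in enumerate(kernel):
                for j, k in enumerate(krow):
                    rr, cc = r + i - kr, c + j - kc
                    if k and 0 <= rr < rows and 0 <= cc < cols:
                        out[rr][cc] = 1
    return out


# Four-neighbor (mesh) kernel, matching the nearest-neighbor connectivity.
CROSS = [[0, 1, 0],
         [1, 1, 1],
         [0, 1, 0]]


def edge(img):
    """Foreground pixels that touch the background (inner boundary)."""
    background = complement(img)
    grown_background = dilation(background, CROSS)
    # img AND grown_background, written with OR and NOT only (De Morgan).
    return complement(union(background, complement(grown_background)))
```

Applied to a filled 3 × 3 square inside a 5 × 5 image, `edge` keeps the eight boundary pixels of the square and clears its center.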
Fig. 4. The receiver circuit detects a dual-rail optical intensity coded signal using the two MQW diodes (shaded gray) and converts it to a CMOS voltage signal (Vout). This circuit operates on the difference of the two detected optical powers. A higher power on the top MQW diode (Pdet1) results in Vout equal to the logic low value. A higher power on the lower MQW diode (Pdet0) results in Vout equal to the logic high value.
Fig. 2. Each 250 × 125 μm² SPARCL smart pixel implements a cellular logic processing element with direct optical data I/O. Each pixel contains a three-bit memory, CMOS logic for complement, union, and dilation (morphological image operations), and an optical receiver and transmitter. The upper pair of MQW diodes are used for optical detection and the lower pair for optical transmission.
Fig. 3. Function block diagram of SPARCL smart pixel. Each smart pixel contains digital logic for performing operations on combinations of the bits within the pixel and neighboring pixels.
While single-rail operation of a CMOS/MQW device has been successfully accomplished [10], we chose to use a dual-rail representation of the optical signals. The transmitter sets the state of two reflecting MQW diode modulators (the lower pair of MQW’s in Fig. 2) so that one is more absorbing than the other. An optical link is established by reflecting a pair of equal intensity beams from a transmitter MQW pair onto a detector MQW pair. The receivers make a decision based on the ratio of the power in the dual-rail optical beams. Using the dual-rail representation eliminates the need for a global threshold optical intensity value, relaxes the tolerances in the optical system design, and reduces the receiver complexity by eliminating the need for threshold adjustment circuitry.
B. Receiver Circuit
We used receiver and transmitter circuits previously designed and tested at Lucent Technologies [21]. The receiver
circuit, shown in Fig. 4, converts a dual-rail optical signal to a CMOS-compatible voltage signal. This circuitry consists of four stages: the MQW diodes, a transimpedance amplifier (TIA), a voltage amplifier (VA), and a decision circuit. The two serially connected MQW diodes produce a photogenerated current input to the TIA. The sign of this current is either positive or negative, depending on the logic state of the optical signal. The VA boosts the voltage output of the TIA, and a decision circuit produces a voltage level, Vout, a CMOS signal. The supply rails bias the MQW diodes to their maximum responsivity, which is roughly 0.5 A/W. The receiver requires 1.5 μW of difference in optical power between the two diodes to properly distinguish between logic states at 20 MHz. Because of the analog nature of the receiver circuit design, each receiver draws over 4 mW of static electrical power. Overall power dissipation for the entire chip operating at 20 MHz was 500 mW, 80% due to the optical receivers. This power dissipation can be greatly reduced by using a clocked-sense-amplifier-based smart pixel receiver (CSABSPR). This receiver design is applicable to SPARCL since the optical links are synchronous and the clock signal is available at each receiver. Woodward et al. [22], [23] have designed and experimentally tested a CSABSPR that operates at 320 Mb/s while dissipating only 1 mW of power and occupying only 44 × 22 μm² of CMOS area. This receiver also has a higher sensitivity, with an optical switching energy of 30 fJ.
C. Transmitter Circuit
The transmitter circuit, shown in Fig. 5, consists of an inverter driving the central node of a pair of serially connected MQW diodes. With the supply and bias voltages applied to this circuit, each MQW diode has a voltage modulation of 5 V around an 8.5-V reverse bias. When two equal intensity optical beams reflect from the MQW diodes, one is more attenuated than the other. The attenuation of each beam depends on the state of the transmitter output.
The optical system passes these two beams to the receiver circuitry. The
Fig. 5. Transmitter circuit for converting CMOS level voltages (Vin ) to dual-rail optical intensity coded signals. The MQW diodes (shaded gray) modulate, through reflection, a pair of off-chip generated optical beams at 850 nm.
Fig. 6. The 2-D parallel pipeline SPARCL system architecture. The first stage performs page composition (serial to parallel data format conversion), the intermediate stages (only one shown here) perform image or data processing, and the final stage performs page decomposition. Data flows between SPARCL chip stages on 50 parallel optical channels operating at on-chip data rates.
receiver circuitry detects the difference in power of the two beams and determines whether the input bit is high or low. Due to the limited bandwidth of our instruction/data buffer interface, we had planned to operate our optical links at data rates lower than 50 Mb/s. Other smart pixel systems have established optical links at speeds up to 1 Gb/s [24]. The single-stage CMOS inverter, shown in Fig. 5, can drive the MQW diode pair at speeds greater than 1 Gb/s, since the flip-chip bonded diode capacitance is roughly 62 fF/diode [25].
D. Flip-Chip Bonding
It is interesting to compare the amount of VLSI area used by the optical I/O and electrical I/O circuits. Each SPARCL optical receiver circuit occupies an area of 32 × 36 μm² and each transmitter occupies 12 × 46 μm². Each of these I/O circuits is connected to flip-chip bonding pads of size 20 × 20 μm². Each electrical I/O circuit, which contains circuitry for driving wire-bond pads or signal receiving and electrostatic discharge (ESD) protection, occupies 130 × 160 μm² of chip area and is connected to a 100 × 100 μm² wire-bonding pad.
III. SPARCL SYSTEM DESIGN
Fig. 6 shows the system schematic of a multistage 2-D parallel pipeline SPARCL processor system. A 5 × 10 array of optical channels (one bit per smart pixel) links SPARCL stages in a unidirectional manner. A host computer (not shown) sends a 15-bit instruction to each of the SPARCL stages timed with a clock common to all stages. In our demonstration system, all stages are implemented with identical copies of our SPARCL chip. The first SPARCL stage performs page composition by reformatting a 5-bit-wide electronic data stream into a 5 × 10-bit-wide optical data page. Ideally, a specialized page composer, such as a memory or video camera with 2-D parallel optical output, would perform this reformatting. OE/VLSI or other smart pixel technology could implement such a page composing device or other appropriate memory [10], [11]. After page composition, the data pass through the pipeline system in a 2-D parallel optical format. The one or more intermediate SPARCL stages perform data processing on the pages and pass them along the pipeline. The final stage is a page decomposer, which converts the data back to the 5-bit-wide electrical form. Similar to the first stage, this final stage is ideally implemented with a dedicated memory or display device having 2-D optical receivers. The pipelining of the system allows processing of data on the intermediate SPARCL chips during the page composing and decomposing, each of which requires ten clock cycles on a SPARCL chip. The number of intermediate SPARCL stages is variable and can be scaled to meet the required processing power [20]. Our demonstration system has three stages.
A. Integration with Optics
We packaged the demonstration SPARCL system on a slotted baseplate designed by Optivision, Inc. This baseplate is 10 in × 14 in and has five slots for magnetically holding optical components mounted in 35-mm-diameter rings. This baseplate can accommodate up to five SPARCL stages. The slotted baseplate is an optomechanical system that allows easy and rugged setup of demonstration smart pixel systems.
Improvements in packaging have further reduced the size of the optical system and have been shown to be compatible with electronic chassis dimensions [15], [26], [27]. Fig. 7 shows the optical module that creates the 50 parallel free-space optical links between SPARCL stages. The optical signals enter and leave each stage’s module in the same optical format (i.e., the same polarization and unit magnification), and therefore the module can be cascaded to provide more stages in the SPARCL processor. The optical module performs two imaging functions: 1) it reflectively reads out the states of the MQW modulator based transmitters with a 2-D array of uniform intensity beams and 2) it performs unit magnification imaging of transmitter windows onto detector windows between two SPARCL chips. The optical module uses a polarization-controlled beam path separation for reading out modulators. The path of the optical signals entering the SPARCL chip is as follows: the polarizing beam splitter (PBS) reflects the optical signals from the previous stage (shown in gray) toward the patterned mirror (PM); the PM reflects the signals back through the PBS; and the triplet images the optical signals onto the detectors on the SPARCL chip, which is mounted on a printed circuit board. The optical signals leave the SPARCL chip through optical readout of the modulators.
Fig. 7. This optical module provides one-to-one imaging between an array of transmitter outputs and an array of detection inputs on adjacent stages. All components are commercial-off-the-shelf, except for the diffractive optical element and patterned mirror. The size of the module is 1 in × 2 in × 14 in, and modules can be cascaded with a 2-in center-to-center pitch between stages.
The laser diode module, an external cavity tunable laser diode, generates a collimated beam. The diffractive optical element (DOE) splits this beam into an array of 5 × 20 equal intensity beams, to read out the 5 × 10 dual-rail channels. These beams pass through the PM and PBS, and the triplet images them onto the modulator windows. The PBS reflects the optical signals from the modulators to the next stage. The optical system is constructed entirely with commercial-off-the-shelf optical devices, except for the PM and DOE. The PM was fabricated by a lithographic service by patterning chrome stripes onto a transparent substrate. The optical system images the SPARCL detector rows onto chrome stripes and SPARCL transmitters onto transparent areas of the PM, providing the path separation needed for optical I/O channels.
B. Host Computer Interface
We constructed PCB’s containing FIFO buffer circuitry to interface the SPARCL stages with a desktop computer supplying the instructions. This demonstrated the compatibility of the SPARCL technology with standard logic circuitry and packaging. Each SPARCL chip is mounted on a board containing a 4-kbyte first in/first out (FIFO) buffer with a maximum clock speed of 30 MHz. We chose a FIFO design because it was commercially available; however, its clock rate is much slower than the 150-MHz rate at which SPARCL was simulated. We perform the SPARCL experiments by first loading the FIFO buffers at lower computer port speeds, running the instructions through the FIFO into the SPARCL chips, and then downloading data from the FIFO. Fig. 8 shows the boards for two of the SPARCL stages. The SPARCL chips are located at the top of the boards, covered with a thin glass cover plate. These boards are placed on the edge of the baseplate containing the optical system described above.
IV. DOE AND MQW DEVICE CHARACTERISTICS
This OE/VLSI foundry was the first attempt by the DOE and MQW device providers to distribute these components on a large production scale to device users. This represents the first step of transferring this technology from an experimental to a mass production environment. Slightly worse device performance (as compared to experimental data) is expected due to uniformity and defect issues in fabrication, as well as more mundane issues of handling and packaging. Here, we measure the performance of the DOE and MQW devices and
Fig. 8. Each SPARCL chip is mounted on a PCB containing FIFO circuitry for interfacing slower computer ports with the higher speed SPARCL chips. The SPARCL chips are near the tops of the PCB’s in a package allowing optical access to the chip surface.
consider their impact on the overall SPARCL system optical power budget.
A. Characterization of MQW Devices
We received five copies of the SPARCL chip from the CO-OP foundry. Each chip contains 200 MQW diode devices; 100 are utilized as modulators and 100 are utilized as detectors. We characterized the reflectivity of all 500 modulators on the five chips. Setting the input of the transmitter in Fig. 5 to ground fixes the midpoint voltage of the diode pair at the CMOS logic high voltage level (due to photogenerated currents fed back into the inverter stage, this voltage swings by a few millivolts). By varying the two modulator bias voltages, we can measure the reflectivity versus voltage of each diode independently. We used a single optical beam at 851.2 nm from a diode laser by SDL, Inc., as our optical source. The reflected beam is collected by an objective lens and detected by a discrete p-i-n photodiode. A grating-assisted external cavity and TE cooler maintained wavelength stability. The incident optical power on the modulator window is 30 μW and the reflected power is measured. The reflectivity of the diode is defined as the ratio of reflected power to the incident power. With an F/2.8 Cooke triplet objective lens, we have a spot with an FWHM of 10 μm in diameter on the 18 × 18 μm² modulator window. We aligned the spot to the window in two steps. We first perform a
Fig. 9. Measured modulator reflectivity versus voltage across a typical SPARCL chip. Each subplot shows the reflectivity of the two MQW diodes within each smart pixel.
coarse adjustment by visual alignment. When the spot is off the modulator window surface, the reflected light tends to scatter due to the roughness of the window’s edge. In contrast, we observe a bright and round spot when the spot hits the surface of the modulator window. In the next step, we fine-tune the alignment with the aid of a current meter. A spot landing on the active area of the diode generates a current that can be measured. We adjust the alignment by maximizing the current, which aligns the spot to the most sensitive region within the modulator window. We measured the reflected optical power at reverse bias voltages ranging from 2 to 12 V at 0.5-V increments for all 500 modulators on our five sample chips. Fig. 9 shows a typical result of the reflectivities measured from transmitter diode pairs across the 5 × 10 smart pixel array. Each subplot shows the reflectivity curve of a pair of modulators within a smart pixel. The solid and dashed lines show the upper diode and lower diode reflectivities, respectively. In this dual-rail signaling approach, the contrast ratio between the pair of modulators within a transmitter is of interest. The important contrast ratio CR for sending a logic “1” is between the upper diode’s low reflectivity state and the lower diode’s high reflectivity state. When sending a logic “0,” the reflectivity states for both diodes are reversed. In Fig. 10, we plot the histograms of contrast ratios for the SPARCL chip measurements shown in Fig. 9. The average contrast ratio for both states was found to be 1.9 and the worst case (among operational devices) to be 1.2.
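The contrast-ratio bookkeeping described above can be sketched as follows. The reflectivity values in the usage note are made-up placeholders, not measured data, and the rule that a pair counts as operational only when its contrast ratio exceeds 1 is an assumption introduced for this illustration.

```python
from statistics import mean


def contrast_ratios(upper_low, upper_high, lower_low, lower_high):
    """Return (CR for logic "1", CR for logic "0") for one transmitter pair.

    Logic "1": upper diode in its low-reflectivity state, lower diode high.
    Logic "0": the two states are reversed.
    """
    cr_one = lower_high / upper_low
    cr_zero = upper_high / lower_low
    return cr_one, cr_zero


def summarize(pairs, min_operational=1.0):
    """Average and worst-case CR over both logic states of all pairs.

    A ratio at or below min_operational is treated as a non-operational
    device and excluded (an assumption of this sketch).
    """
    ratios = []
    for pair in pairs:
        ratios.extend(contrast_ratios(*pair))
    operational = [cr for cr in ratios if cr > min_operational]
    return mean(operational), min(operational)
```

For instance, `summarize([(0.20, 0.40, 0.25, 0.45), (0.30, 0.42, 0.28, 0.40)])` averages the four contrast ratios of two hypothetical transmitter pairs and reports the smallest one as the worst case.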
Both the individual device characteristics and the uniformity between the two devices within a transmitter determine the contrast ratio of the dual-rail signals in the optical link. The contrast ratio affects the external optical power required for modulator readout. Enough power must be supplied to achieve an adequate difference in power between the two MQW detectors at the dual-rail receiver. The relationship between the power difference output from the transmitter, $\Delta P_{\mathrm{out}}$, and the contrast ratio, CR, is
$\Delta P_{\mathrm{out}} = P_{\mathrm{in}} R_{\mathrm{low}} (\mathrm{CR} - 1)$ (1)
where $P_{\mathrm{in}}$ is the optical power incident on each modulator for readout and $R_{\mathrm{low}}$ is the reflectivity of the modulator in its low reflectance state.
B. Characterization of Diffractive Optical Element Spot-Array-Generator for Modulator Readout
The DOE design generates a 5 × 20 array of spots with highly uniform intensity positioned on the transmitter modulator windows when used in conjunction with a 30-mm focal length objective. We chose the 2-D DOE period area as 102 × 816 μm², with a minimum feature size of 5.1 × 5.1 μm². This results in 20 × 160 independent features in the DOE period to solve for, each feature being one of eight possible phase levels. We designed the DOE using a Gerchberg–Saxton phase retrieval algorithm to solve for the optimum phase values for the features [28]. Fig. 11 shows the solution to the DOE period with gray levels representing
Fig. 10. Histogram of contrast ratios between transmitter pair MQW devices for SPARCL chip.
Fig. 11. Design for the DOE that provides the spot array shown in Fig. 12 for SPARCL modulator readout. This period is divided into 20 × 160 square phase elements (or features of size 5.1 × 5.1 μm²), each having a value between 0 and 2π. The phase element values were computed using iterative techniques. This pattern was replicated in a 6 × 48 array and etched onto a quartz substrate to create a DOE of size 5 × 5 mm².
2
0
2
the phase levels. The DOE period was replicated in a 49 6 array to create a DOE of roughly 5 5 mm inside a chrome masking aperture. The design was submitted to Honeywell for fabrication through the DARPA/GMU CO-OP sponsored diffractive optics foundry run. Illuminating the DOE at 850 nm results in the optical spectrum shown in Fig. 12(a) (captured with a charged-coupled-device (CCD) camera). The CCD scan of the central row in Fig. 12(b) shows the high uniformity of the spots. The CCD camera faintly detects the zero order of the DOE spectrum, an indication of precise fabrication by the Honeywell foundry. We experimentally measured the uniformity of the spot array to be 25% with 13 dB of suppression of unwanted diffraction orders. Here, we define uniformity as the maximum difference in spot intensity across the array divided by the average spot intensity. More important for dual-rail channel operation is the ratio of the power contained in pairs of spots that read out the pairs of modulators within the smart pixel transmitters. In the transmitter of Fig. 5, the power of the optical beams reading and should be very out the two modulators, uniform to equalize the optical powers incident on the two modulator windows. If the power incident on the modulators is not equal, the optical digital signal output is biased toward logic “one” or “zero.” Here, we define the ratio of power of any two pairs of spots that read out a transmitter to be a pair-wise power ratio, given by PWPR
(2)
Measuring the DOE spectrum shown in Fig. 12 gives an average pair-wise power ratio of 1.04 and a worst case pairwise power ratio (WCPR) of 1.11.
2
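The two spot-array figures of merit defined above can be sketched in a few lines of Python. The uniformity follows the stated definition (maximum difference in spot intensity divided by the average). The pair-wise power ratio is taken here as the stronger spot over the weaker one, so that it is never below 1; that convention is an assumption consistent with the reported values of 1.04 and 1.11. The function names and spot powers are hypothetical.

```python
def uniformity(spot_powers):
    """Maximum difference in spot power divided by the average spot power."""
    mean = sum(spot_powers) / len(spot_powers)
    return (max(spot_powers) - min(spot_powers)) / mean


def pairwise_power_ratios(spot_pairs):
    """Ratio of the stronger to the weaker spot in each readout pair (>= 1)."""
    return [max(a, b) / min(a, b) for a, b in spot_pairs]


# Hypothetical spot powers in arbitrary units, one tuple per transmitter.
pairs = [(1.00, 1.04), (0.95, 0.90), (1.10, 1.00)]
ratios = pairwise_power_ratios(pairs)
all_spots = [p for pair in pairs for p in pair]

print("uniformity:", round(uniformity(all_spots), 3))
print("average PWPR:", round(sum(ratios) / len(ratios), 3))
print("worst-case PWPR:", round(max(ratios), 3))
```

Note that an array can have poor overall uniformity and still have pair-wise ratios close to 1, which is why the two metrics are reported separately.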
C. Optical Power Budget Calculations

Knowing the WCPR, the optical efficiencies, and the contrast ratio of the MQW modulator, we can calculate the optical power budget. We divide the optical efficiencies into three categories: the optical link efficiency; the efficiency with which the DOE forms the 5 × 20 array of beams; and the efficiency (or optical throughput) of the MQW diodes, which is proportional to the worst-case modulator reflectivity. Each of these efficiencies is the ratio of useful output optical power to total input optical power. The optical power budget is the amount of optical power (supplied by the laser diode) required to produce a sufficient difference in optical power on the receiver windows (see Fig. 4) to create valid CMOS logic levels for the entire array of smart pixel links. There is one transmitter in the array whose two readout beams represent the WCPR. Of all the optical links, this one requires the highest optical power for proper operation. Consider the case where more power is incident on the top modulator than on the bottom modulator in Fig. 5. This difference in optical power biases the transmitter output toward logic “one” and so reduces the contrast between the two modulators when they are sending a logic “zero” (which requires more optical power reflected from the bottom modulator than from the top modulator). Increasing the optical power of both readout beams does not change the ratio between the reflected beams, but it does increase the difference between them. Since the receiver operates on the difference of the two detected beams, increasing the power in the modulator readout beams overcomes the biasing problem due to power nonuniformity in the readout array.
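The argument above can be illustrated numerically. The sketch below uses a deliberately simple model that is an assumption of this example and is not the paper's power budget equation: each modulator reflects its incident readout power times its state reflectivity, the stronger readout beam carries WCPR times the power of the weaker one, and the high-state reflectivity is CR times the low-state reflectivity. Link and DOE efficiencies are left out. The function names, the low-state reflectivity, and the beam powers are hypothetical; CR = 1.2 and WCPR = 1.11 are the worst-case figures quoted in the text.

```python
def worst_case_difference(p_weak, r_low, cr, wcpr):
    """Reflected power difference and ratio for the hardest logic state.

    The weaker beam (power p_weak) must produce the brighter reflection,
    while the stronger beam (power wcpr * p_weak) lands on the modulator
    in its low-reflectivity state.
    """
    bright = p_weak * r_low * cr      # weak beam, high-reflectivity state
    dark = p_weak * wcpr * r_low      # strong beam, low-reflectivity state
    return bright - dark, bright / dark


def required_weak_beam_power(delta_needed, r_low, cr, wcpr):
    """Smallest weak-beam power that yields the needed reflected difference."""
    if cr <= wcpr:
        raise ValueError("contrast ratio must exceed the pair-wise power ratio")
    return delta_needed / (r_low * (cr - wcpr))


CR, WCPR = 1.2, 1.11   # worst-case values reported in the text
R_LOW = 0.2            # hypothetical low-state reflectivity

for p in (10.0, 20.0, 40.0):   # hypothetical readout power per beam, in uW
    diff, ratio = worst_case_difference(p, R_LOW, CR, WCPR)
    print(f"P = {p:5.1f} uW  ratio = {ratio:.3f}  difference = {diff:.3f} uW")
```

Under this model the ratio of the reflected beams stays fixed at CR/WCPR while the difference grows linearly with readout power, and no amount of power helps once the pair-wise power ratio reaches the contrast ratio.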
Fig. 12. (a) CCD image of the optical Fourier transform of the DOE described in Fig. 11. This array of spots performed optical readout of the 5 × 20 array of SPARCL smart pixel modulators (each smart pixel contained two MQW diodes operating as modulators). (b) CCD scan of the central row of (a), showing good uniformity.

TABLE I
PARAMETERS FOR OPTICAL POWER BUDGET CALCULATION
The amount of optical power supplied to each SPARCL stage by the diode laser supply depends on these technological and design-dependent parameters.

The amount of optical power that must be supplied by the diode laser for each stage in the SPARCL system is given by (3), an expression in CR, WCPR, and the number of optical links between SPARCL stages, where CR is the worst-case contrast ratio of the MQW modulators (described above). Table I shows the measured parameters used to calculate our optical power budget.

V. SCALING THE SPARCL SYSTEM
For experimental purposes, the SPARCL system contains three stages of SPARCL chips, each containing an array of 5 × 10 smart pixels. Scaling the SPARCL system size (i.e., increasing the number of pixels) produces a more powerful processor, since the processing speed is linearly proportional to the number of pixels. The number of smart pixels in a SPARCL system is the number of pixels per SPARCL chip multiplied by the number of SPARCL stages. Since the SPARCL system is essentially a mesh-connected SIMD processor, scaling within a chip is straightforward. The smart pixel circuit, once designed, can be replicated to form an array of arbitrary size, limited only by the chip size and smart pixel density. Similarly, the number of SPARCL stages can be increased by replicating the SPARCL optical modules, limited by the ability to synchronize a clock across several stages.

As CMOS technology advances, the circuitry becomes denser and the chip sizes larger, which points to larger smart pixel arrays within each chip. However, the optical I/O port density also determines the number of pixels per chip. The SPARCL pixel (shown in Fig. 2), implemented in 0.8-µm CMOS, could be scaled to less than one quarter of its area by using the 0.35-µm CMOS process currently available. The shrinking of feature sizes will allow for denser, faster, and more complex circuit realizations of the SPARCL smart pixels. Since each smart pixel requires an optical I/O port, the pixel density also depends on the smart pixel technology’s ability to provide dense optical I/O devices. A recent study on the performance scaling of optoelectronic technologies over the next six generations of CMOS predicts a steady growth in optical I/O capabilities stemming from enhanced CMOS circuitry [29]. This study shows that the 0.35-µm CMOS process allows for about 6000 optical I/O per 4 cm² (a typical chip size) using MQW-based modulators. This represents the limit on the maximum number of SPARCL smart pixels per chip.
This number is mainly limited by the power dissipation of the optical receiver [29]. Comparing the CMOS circuitry density and the optical I/O density for the 0.3-µm feature size, the SPARCL circuitry can be realized eight times more densely than the available optical I/O. Therefore, a better SPARCL layout might be found by increasing the complexity of the PE design.

The optical system packaging will provide a practical limit on the number of SPARCL stages. In our optical packaging method, the electronics contained on printed circuit boards are placed along the perimeter of a baseplate holding the components of the optical system. This system is designed for easy setup and reconfigurability. It was not designed to optimize the number of SPARCL stages, nor is it practical for large-scale manufacturing. A more desirable optical packaging technique would scale to a larger number of SPARCL stages using packaging methods compatible with current electronic standards. Because of the complexity of optical system alignment tolerances and optomechanical constraints, packaging is an active area of research, and many promising techniques are being developed. One example is a technique that attaches optical components directly to the chip carrier
or the printed circuit board [27]. Optical systems that use microlens or minilens arrays for imaging offer cost-effective scaling and can be manufactured with methods already used in IC fabrication [14], [26]. There is also research attempting to relax the alignment tolerances by using current-mode receiver technology, which enables a larger detector area without sacrificing bandwidth [30], [31].

In summary, with current CMOS technology and optomechanical technology, we could expand our current design so that each SPARCL stage operates on image segments of roughly 64 × 64 pixels. Predictions in CMOS and optoelectronic technology show that arrays larger than 200 × 200 pixels per chip should be possible when the feature size reaches 0.1 µm in CMOS. With a predicted clock speed of 1 GHz, the SPARCL system throughput for a single stage would be 40 Tb/s.
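The scaling figures quoted in this section can be checked with simple arithmetic. The sketch below assumes ideal layout scaling (area proportional to the square of the feature size) and one bit of optical I/O per pixel per clock cycle; both are simplifying assumptions of this example, and the function names are hypothetical.

```python
def area_scale(old_feature_um, new_feature_um):
    """Layout area ratio under ideal scaling (area goes as feature size squared)."""
    return (new_feature_um / old_feature_um) ** 2


def stage_throughput_bps(rows, cols, clock_hz, bits_per_pixel_per_cycle=1):
    """Aggregate optical I/O rate of one stage, in bits per second."""
    return rows * cols * clock_hz * bits_per_pixel_per_cycle


print(f"0.8 um -> 0.35 um area ratio: {area_scale(0.8, 0.35):.3f}")
print(f"64 x 64 array: {64 * 64} smart pixels (quoted limit: about 6000)")
print(f"200 x 200 at 1 GHz: {stage_throughput_bps(200, 200, 1e9) / 1e12:.0f} Tb/s")
```

The area ratio comes out below one quarter, a 64 × 64 array fits within the quoted optical I/O limit, and a 200 × 200 array clocked at 1 GHz gives the 40 Tb/s figure.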
VI. CONCLUSION

This demonstration SPARCL system was constructed to illustrate experimentally the use of smart pixel technology to greatly enhance cellular computing by overcoming the I/O bottleneck currently found in SIMD systems. The CMOS/MQW optoelectronic smart pixel technology, developed at Lucent Technologies, provides a free-space digital optical communication channel to each processing element. These channels allow large data fields to be transferred in parallel among arrays of smart pixels and parallel optical input and output devices. The SPARCL system is ideally suited for high-performance image processing, in which a few operations must be performed on a vast amount of data. By operating in parallel on the data field and providing 2-D parallel optical data I/O, SPARCL performs high-throughput computation.
ACKNOWLEDGMENT

The authors wish to thank M. Derstine and S. Wakelin of Optivision, Inc. (now with Optical Networks, Inc.) for help on optomechanical packaging and laser diode module packaging. They also wish to thank J.-F. Lin of the Industrial Technology Research Institute in Taiwan for his aid in designing the diffractive optical element.
REFERENCES

[1] K. E. Batcher, “Design of a massively parallel processor,” IEEE Trans. Comput., vol. C-29, pp. 836–844, Sept. 1980.
[2] R. M. Hord, The ILLIAC IV: The First Supercomputer. New York: Springer-Verlag, 1982.
[3] W. D. Hillis, The Connection Machine. MIT Press, 1985.
[4] MasPar Computer Corporation, “The design of the MasPar MP-2: A cost effective massively parallel computer,” Nov. 1992.
[5] R. M. Hord, Parallel Supercomputing in SIMD Architectures. Boca Raton, FL: CRC, 1990.
[6] A. Broggi and F. Gregoretti, “Performance evaluation and optimization in low-cost cellular SIMD systems,” Microprocessing and Microprogramming, vol. 41, pp. 659–678, 1996.
[7] K. W. Goossen, J. A. Walker, L. A. D’Asaro, S. P. Hui, B. J. Tseng, R. E. Leibenguth, D. Kossives, L. M. F. Chirovsky, A. L. Lentine, and D. A. B. Miller, “GaAs MQW modulators integrated with silicon CMOS,” IEEE Photon. Technol. Lett., vol. 7, pp. 360–362, 1995.
[8] D. J. Goodwill, K. E. Devenport, and H. S. Hinton, “An ATM-based intelligent optical backplane using CMOS-SEED smart pixel arrays and free-space optical interconnect modules,” IEEE J. Select. Topics Quantum Electron., vol. 2, pp. 85–96, Apr. 1996.
[9] O. Kibar, P. J. Marchand, and S. C. Esener, “High-speed CMOS switch designs for free-space optoelectronic MIN’s,” IEEE Trans. VLSI Syst., vol. 6, pp. 372–386, Sept. 1998.
[10] A. V. Krishnamoorthy, J. E. Ford, K. W. Goossen, J. A. Walker, A. L. Lentine, S. P. Hui, B. Tseng, L. M. F. Chirovsky, R. E. Leibenguth, D. Kossives, D. Dahringer, L. A. D’Asaro, F. E. Kiamilev, G. F. Apin, R. G. Rozier, and D. A. B. Miller, “Photonic page buffer based on GaAs multiple-quantum-well complementary-metal-oxide-semiconductor (CMOS) circuits: Errata,” Appl. Opt., vol. 35, pp. 4637–4640, 1996.
[11] A. V. Krishnamoorthy, R. G. Rozier, J. E. Ford, and F. E. Kiamilev, “Demonstration of a CMOS static RAM chip with high-speed optical read and write,” in Optics in Computing (OSA Tech. Dig. Ser.), vol. 8. Washington, DC: Opt. Soc. Amer., 1997, pp. 1–2, postdeadline paper.
[12] D. T. Neilson, S. M. Prince, D. A. Baillie, and F. A. P. Tooley, “Optical design of a 1024-channel free-space sorting demonstrator,” Appl. Opt., vol. 36, pp. 9243–9252, 1997.
[13] D. V. Plant, B. Robertson, H. S. Hinton, M. H. Ayliffe, G. C. Boisset, D. J. Goodwill, D. N. Kabal, R. Iyer, Y. S. Liu, D. R. Rolston, W. M. Robertson, and M. R. Taghizedah, “A multi-stage CMOS-SEED optical backplane demonstrator system,” in Int. Topical Meeting Optical Computing. Washington, DC: Opt. Soc. Amer., 1996, vol. 96, pp. 14–15.
[14] D. R. Rolston, D. V. Plant, H. S. Hinton, W. S. Hsiao, M. H. Ayliffe, D. Kabal, M. B. Venditti, T. H. Szymanski, A. V. Krishnamoorthy, K. W. Goossen, J. A. Walker, B. Tseng, S. P. Hui, J. C. Cunningham, and W. Y. Jan, “A hybrid-SEED smart pixel array for a four-stage intelligent optical backplane demonstrator,” IEEE J. Select. Topics Quantum Electron., vol. 2, pp. 97–105, Apr. 1996.
[15] A. L. Lentine, D. J. Reiley, R. A. Novotny, R. L. Morrison, J. M. Sasian, M. G. Beckman, D. B. Buchholz, S. J. Hinterlong, T. J. Cloonan, G. W. Richards, and F. B. McCormick, “Asynchronous transfer mode distribution network by use of an optoelectronic VLSI switching chip,” Appl. Opt., vol. 36, pp. 1804–1814, 1997.
[16] N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A Systems Perspective, 2nd ed. New York: Addison-Wesley, 1993.
[17] K. S. Huang, C. B. Kuznia, B. K. Jenkins, and A. A. Sawchuk, “Parallel architectures for digital optical cellular image processing,” in Proc. IEEE, vol. 82, pp. 1711–1723, 1994.
[18] K.-S. Huang, A. A. Sawchuk, B. K. Jenkins, P. Chavel, J.-M. Wang, A. G. Weber, C.-H. Wang, and I. Glaser, “Digital optical cellular image processor (DOCIP): Experimental implementation,” Appl. Opt., vol. 32, pp. 166–173, 1993.
[19] K. S. Huang, B. K. Jenkins, and A. A. Sawchuk, “Binary image algebra and optical cellular logic design,” Computer Vision Graphics and Image Processing, vol. 45, pp. 295–345, 1989.
[20] J. M. Wu, C. B. Kuznia, B. Hoanca, C.-H. Chen, and A. A. Sawchuk, “Demonstration and architecture analysis of complementary metal-oxide semiconductor/multiple-quantum-well smart-pixel array cellular logic (SPARCL) processors for single instruction multiple-data parallel-pipeline processing,” Appl. Opt., vol. 38, no. 11, pp. 2270–2281, Apr. 10, 1999.
[21] A. L. Lentine, K. W. Goossen, G. A. Walker, L. M. F. Chirovsky, L. A. D’Asaro, S. P. Hui, B. T. Tseng, R. E. Leibenguth, D. P. Kossives, D. W. Dahringer, D. D. Bacon, T. K. Woodward, and D. A. B. Miller, “700 Mb/s operation of optoelectronic switching nodes comprised of flip-chip-bonded GaAs/AlGaAs MQW modulators and detectors on silicon CMOS circuitry,” in IEEE Conf. Lasers and Electro Optics (IEEE/LEOS), 1995, postdeadline paper CPD11.
[22] T. K. Woodward, A. V. Krishnamoorthy, K. W. Goossen, J. A. Walker, W. Y. Jan, J. E. Cunningham, L. M. F. Chirovsky, S. P. Hui, B. Tseng, D. Kossives, D. Dahringer, and A. R. E. L. D. Bacon, “Clock-sense-amplifier-based smart-pixel optical receivers,” IEEE Photon. Technol. Lett., vol. 8, pp. 1067–1069, 1996.
[23] T. K. Woodward, A. V. Krishnamoorthy, A. L. Lentine, and L. M. F. Chirovsky, “Optical receivers for optoelectronic VLSI,” IEEE J. Select. Topics Quantum Electron., vol. 2, pp. 106–116, Apr. 1996.
[24] T. K. Woodward, A. V. Krishnamoorthy, A. L. Lentine, K. W. Goossen, J. A. Walker, J. E. Cunningham, W. Y. Jan, L. A. D’Asaro, L. M. F. Chirovsky, S. P. Hui, B. Tseng, D. Kossives, D. Dahringer, and R. E. Leibenguth, “1 Gb/s two-beam transimpedence smart-pixel optical receivers made from hybrid GaAs MQW modulators bonded to 0.8 micron silicon CMOS,” IEEE Photon. Technol. Lett., vol. 8, pp. 422–424, Mar. 1996.
[25] R. A. Novotny, A. L. Lentine, C. B. Buchholz, and A. V. Krishnamoorthy, “Analysis of parasitic front-end capacitance and thermal resistance in hybrid flip-chip bonded GaAs SEED/Si CMOS receivers,” in Opt. Computing (OSA Tech. Dig. Ser.). Washington, DC: Opt. Soc. Amer., 1995, vol. 10, pp. 207–209.
[26] F. B. McCormick, “Smart pixel optics and packaging,” in Proc. 1996 IEEE/LEOS Summer Topical Meeting on Smart Pixels, 1996, pp. 45–46.
[27] D. N. Kabal, G. C. Boisset, and a. D. V. P. D. R. Rolston, “Packaging of two-dimensional smart pixel arrays,” in Proc. 1996 IEEE/LEOS Summer Topical Meeting on Smart Pixels, 1996, pp. 53–54.
[28] J.-F. Lin, “Optoelectronic cellular array processor with reduced cellular hypercube interconnections,” Ph.D. dissertation, Dept. of Elect. Eng., Univ. of Southern California, Los Angeles, CA, 1996, ch. 6.
[29] A. V. Krishnamoorthy and D. A. B. Miller, “Scaling optoelectronic-VLSI circuits into the 21st century: A technology roadmap,” IEEE J. Select. Topics Quantum Electron., vol. 2, pp. 55–76, Apr. 1996.
[30] T. M. W. Slagle and H. Kelvin, “Optical smart-pixel-based Clos crossbar switch,” Appl. Opt., vol. 36, pp. 8336–8351, 1997.
[31] F. A. P. Tooley, A. Z. Shang, and B. Robertson, “Alignment tolerant smart pixels,” in Proc. 1996 IEEE/LEOS Summer Topical Meeting on Smart Pixels, 1996, pp. 55–56.
Charles B. Kuznia (M’95), for photograph and biography, see this issue, p. 329.
Jen-Ming Wu (S’88–M’99), for photograph and biography, see this issue, p. 329.
Chih-Hao Chen, for photograph and biography, see this issue, p. 329.
Bogdan Hoanca, for photograph and biography, see this issue, p. 329.
Lily Cheng, photograph and biography not available at the time of publication.
Allan G. Weber (M’75) received the B.S., M.S., and Ph.D. degrees from the University of Southern California in 1975, 1979, and 1986, respectively. He has been employed by the Naval Ocean System Center and Hughes Aircraft Corporation. Weber is currently employed by the Signal and Image Processing Institute at USC as a Research Scientist, Instructor, and System Administrator. His research interests include image processing and optical interconnects.
Alexander A. Sawchuk (S’65–M’71–SM’79–F’94), for photograph and biography, see this issue, p. 329.