# A 32Gb/s Time-Based PAM-4 Transceiver for High-Speed DRAM Interfaces With In-Situ Channel Loss and Bit-Error-Rate Monitors

Po-Wei Chiu and Chris H. Kim, *Fellow, IEEE*

*Abstract*—A digital-intensive four-level pulse amplitude modulation (PAM-4) transceiver featuring a 2-tap time-based decision feedback equalization (TB-DFE) circuit was demonstrated in a 65 nm GP CMOS process. A novel inverter-based differential voltage-to-time converter (DVTC) increases the linearity and dynamic range compared to a prior time-based DFE approach, enabling reliable PAM-4 operation. The four-level signal comparison and DFE operation were performed entirely in the time domain using programmable delays and a phase detector (PD). Using an on-chip bit error rate (BER) monitor, we verified a BER less than  $10^{-12}$  while achieving an energy efficiency of 0.97 pJ/b at a 32 Gb/s data rate. The transmitter (TX) and receiver (RX) circuits occupy an area of 0.009 mm<sup>2</sup>.

*Index Terms*—Digital-intensive, differential voltage-to-time converter, time-based decision feedback equalizer, in-situ channel loss monitor, eye-diagram.

## I. INTRODUCTION

Big data and artificial intelligence applications continue to push the data volume between the CPU and DRAM to unprecedented levels, necessitating extremely high throughput memory interfaces [1], [2]. Memory I/O bandwidth can be improved either by increasing the number of I/O pins or by increasing the bandwidth per pin. Single-ended transceivers are an attractive option as they can achieve high data rates using a single pin per data link, which reduces power consumption and simplifies the hardware. The data rate of next-generation graphics double data rate (GDDR6) links is expected to reach 16 Gb/s or higher in a few years. At the same time, the popularity of multi-drop bus (MDB) memory interfaces has led to increased channel reflections, which require a higher number of taps in the equalization filter. Meanwhile, analog circuit performance has not kept up with the exponential performance improvement of digital circuits. This situation has made digital-friendly time-based receivers [3]–[6] an attractive alternative for high speed memory interfaces

Manuscript received August 11, 2021; revised November 2, 2021 and December 19, 2021; accepted January 10, 2022. This article was recommended by Associate Editor S. Gupta. (*Corresponding author: Chris H. Kim.*)

Po-Wei Chiu was with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455 USA. He is now with Apple Inc., Cupertino, CA 95014 USA (e-mail: chiux148@umn.edu).

Chris H. Kim is with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455 USA (e-mail: chriskim@umn.edu).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSI.2022.3143876.

Digital Object Identifier 10.1109/TCSI.2022.3143876

Fig. 1. Circuit diagram of TB-DFE.



Fig. 2. Waveforms of (a) two level (NRZ) and (b) four level (PAM-4) signal in voltage domain and time domain.

since they can take full advantage of the process scaling benefits. Compared to traditional analog implementations, they are amenable to automation and their performance can be tuned using programmable delay cells. Time-based receivers utilize inverters and programmable delays so they can achieve better energy-proportionality compared to traditional analog receivers; i.e. the supply voltage can be lowered during low data rate periods to save energy [4].

Most serial link transceivers for memory applications are based on non-return-to-zero (NRZ) modulation [7] where a low voltage represents a logic '0' and a high voltage represents a logic '1'. This simple modulation scheme works relatively well at low operating speeds where the voltage comparator or slicer can reliably detect the signal threshold and decode

1549-8328 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I: REGULAR PAPERS



Fig. 3. Comparison between conventional VTC and the proposed DVTC; (top row) Schematic, (middle row) timing diagram and (bottom row) post-layout simulation results. The proposed DVTC improves the linearity and dynamic range by modulating the delays of both delay paths in opposite directions.

the data. However, for high data rates, the channel loss and the resulting inter-symbol-interference (ISI) noise limit the performance of NRZ links. Equalization techniques such as continuous time linear equalization (CTLE) and DFE have shown promise in enhancing the signal integrity. However, their effectiveness and energy-efficiency degrade at higher data rates due to the design complexity of analog components [8]. Furthermore, the design complexity, circuit area, and power consumption become intractable for designs targeted for extremely high data rates.

To overcome the bandwidth limitation of NRZ links, new modulation schemes have been proposed [9]–[13]. For example, duobinary modulation, where the output signal is a combination of the current bit and the preceding symbol, can introduce ISI in a controlled manner to achieve a higher signal amplitude [9], [10]. The higher TX signal amplitude of duobinary links can also reduce the gain boosting requirements in the equalization circuits. However, duobinary requires a full rate clock generator to sample the data. A multi-tone signaling technique has also been proposed for MDB applications. This technique divides the signal spectrum into several bands and

utilizes the one with less frequency dependent loss. Due to the reduced loss in each sub-band, the equalization technique can be simplified [11], [12]. In [12], a self-equalization technique was proposed to eliminate power hungry equalization circuits. However, this approach employs complicated RF design techniques including an up/down frequency conversion mixer. Moreover, the frequency band must be carefully chosen to achieve a flat frequency response, making the design even more challenging. Recently, multi-level pulse amplitude modulation schemes such as PAM-3 and PAM-4 [9], [12], [13] have been gaining popularity, where the signal-to-noise ratio (SNR) is traded off for a higher number of signal levels. In particular, PAM-4 signaling utilizes four distinct signal levels to send 2 bits per unit interval, at the expense of more complex TX and RX circuits. Compared to NRZ, PAM-4 links consume more power and require a larger chip area. For MDB or higher loss channels, a large number of DFE taps is required, which significantly increases the hardware complexity [14]. While this approach may be promising for ultra high speed (e.g. >50 Gb/s) links [14]–[17], a compact low energy alternative is needed for DFEs in memory interface applications.



Fig. 4. The DVTC circuit is effectively a sampling circuit.



Fig. 5. DVTC SFDR simulation results.

In short, PAM-3 and PAM-4 schemes trade off excess SNR for a higher data rate. Duobinary can achieve two times the channel bandwidth. The data rate of the multi-tone modulation scheme depends on the number of frequency bands and the type of modulation adopted, so making a direct comparison with other modulation schemes is tricky. Generally speaking, multi-tone signaling has a higher bandwidth efficiency than other schemes provided that several frequency bands and modulation techniques are employed. However, the complicated RF circuit components make this design style less attractive in scaled technologies.

Time-based DFE links have been recently demonstrated in [3], [4]. The basic operating concept is shown in Fig. 1. The voltage signal is converted to a time delay signal by a voltage-to-time converter (VTC). The clock enters the VTC and generates two delays; namely, the data dependent delay T<sub>Data</sub> and the reference delay T<sub>REF</sub>. The two signals propagate through separate delay lines where each stage delay is programmed using the preceding data and the corresponding DFE filter weight. Finally, a phase detector (PD) detects the delay difference at the end of the delay chain and generates the final data bit. Unlike traditional current mode logic, time-based circuits can be realized using simple inverters and programmable loads, making them ideally suited for energy-efficient memory interfaces with low supply voltages. Previous TB-DFE links were demonstrated on NRZ links using standard VTC circuits. In this work, we demonstrated the first PAM-4 link with a TB-DFE using a novel differential VTC (DVTC) circuit which increases the linearity and dynamic range with minimal hardware overhead. DFE taps control the delay of each inverter stage of the differential delay lines, offering two times the DFE tap efficiency with high immunity against common mode noise compared to a single-ended delay line.



Fig. 6. Post-layout simulation results of DVTC conversion gain at different (top) process, (middle) temperature and (bottom) voltage corners.

The remainder of this paper is organized as follows. Section II describes PAM-4 modulation in the time domain. The proposed DVTC is described in Section III. Section IV focuses on the detection of PAM-4 signals in the time domain. Implementation details of a single-ended PAM-4 transceiver with a 2-tap TB-DFE are given in Section V. Measurement results are presented in Section VI. Finally, conclusions are drawn in Section VII. The conference version of this work was published in [18].

### II. TIME-BASED PAM-4 SIGNALING

Two level amplitude modulation, also known as PAM-2 or NRZ, is commonly used in wireline communication systems due to its simplicity. As shown in Fig. 2 (a), the two voltage levels  $V_1$  and  $V_0$  are used to represent logic '1' and '0', and the detection threshold voltage is denoted as  $V_{TH}$ . A voltage comparator is used to detect the signal level to determine the data value. In time-based NRZ operation, the VTC converts the voltage amplitude to the corresponding time delay. The two voltage levels are mapped to two time delay levels  $T_1$  and  $T_0$ , respectively. A phase detector is used to detect the early or late signal edge to determine the data value. The threshold delay  $T_{TH}$  is set in the middle of the  $T_1$  and  $T_0$  delays. Fig. 2 (b)



Fig. 7. Delay comparison circuit for (left) NRZ and (right) PAM-4 operation. By adding tunable delay buffers in both delay lines, different threshold delays can be implemented for detecting multiple PAM-4 signal levels.

shows the TB-DFE operation for the PAM-4 signaling case. PAM-4 modulation can transmit two bits in a single unit interval (UI) by encoding the 2 bit data into four voltage levels denoted as  $V_{11}$ ,  $V_{10}$ ,  $V_{01}$  and  $V_{00}$ . Three threshold levels  $V_{TH,H}$ ,  $V_{TH,M}$  and  $V_{TH,L}$  are used to detect the four voltage levels. Similar to the TB-DFE operation for NRZ, the VTC converts the four voltage levels into corresponding time delays  $T_{11}$ ,  $T_{10}$ ,  $T_{01}$  and  $T_{00}$ . These delays are compared with three threshold delays  $T_{TH,H}$ ,  $T_{TH,M}$  and  $T_{TH,L}$  to determine the signal value. Regardless of the circuit type, PAM-4 links require extra circuits to compare the sampled voltage or delay against three reference levels.
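The four-level detection described above can be illustrated with a short behavioral sketch. The delay values below are hypothetical, and the sketch assumes that a higher voltage level maps to a longer delay and that the three thresholds sit midway between adjacent delay levels; neither assumption is a measured property of the test chip.

```python
# Illustrative delay levels in picoseconds (hypothetical values, not from the paper).
DELAY_PS = {"00": 10.0, "01": 20.0, "10": 30.0, "11": 40.0}


def midpoint_thresholds(delays):
    """Return three threshold delays placed midway between adjacent levels."""
    ordered = sorted(delays.values())
    return [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]


def slice_delay(t_ps, thresholds):
    """Compare a delay against T_TH,L, T_TH,M, T_TH,H and return the 2-bit symbol."""
    level = sum(t_ps > th for th in thresholds)  # thresholds exceeded: 0..3
    return format(level, "02b")


thresholds = midpoint_thresholds(DELAY_PS)
decoded = {bits: slice_delay(t, thresholds) for bits, t in DELAY_PS.items()}
```

Each nominal delay decodes back to its own 2-bit symbol, and a delay that lands between two nominal levels is assigned to the level on the same side of the nearest threshold.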

## III. PROPOSED VOLTAGE TO TIME CONVERTER

The VTC converts the channel voltage to the corresponding delay and thus plays an important role in the overall time-based operation. Compared to NRZ, linearity and dynamic range of the VTC become important considerations in PAM-4 receivers because of the multi-level signaling requirement and smaller SNR. In particular, accurate mapping of the four voltage levels to the corresponding time delays is a critical requirement for a TB-DFE. Two types of VTCs have been proposed in previous works [3], [4]. The design in [3] utilized the clock-to-q delay of a voltage comparator biased in a metastable condition. The offset voltage of the metastable voltage comparator was tuned by adjusting the source degeneration resistance. This technique is not suitable for PAM-4 signaling due to the larger TX signal swing and high RX linearity requirements. Another shortcoming of the metastability based VTC is the increased sensitivity to process-voltage-temperature variation. A current starved inverter stage was used in [4] to modulate the pull-up delay using the input voltage as shown in Fig. 3 (upper left).

The VTC consists of two delay lines with the same input clock signal. Each delay line has two inverter stages. Delay of the first delay line is modulated by the channel voltage while the other delay line serves as the reference delay which is set at the middle of data '0' and data '1' delays. Conceptual waveforms are shown in Fig. 3 (middle left). The final data value is determined by comparing the phase difference between the RX and REF signals. The inverter-based VTC in [4] is simpler compared to the metastability based VTC but the conversion gain is smaller. Furthermore, the transfer curve in Fig. 3 (bottom left) shows a limited input voltage range, which makes the design inadequate for PAM-4 links.

To achieve a high sensitivity and good linearity over a wider voltage range, we propose the DVTC circuit in Fig. 3 (upper right) where the incoming analog voltage Vin is connected not only to the PMOS header of the upper inverter but also to the NMOS footer of the lower inverter. The previous VTC only modulates the signal in one delay line. In the new design, we modulate the delays of both delay lines in opposite directions to improve the voltage-to-time gain and linearity. As illustrated in the timing diagram in Fig. 3 (middle right), the RX<sub>P</sub> delay and RX<sub>N</sub> delay move in opposite directions due to the same Vin voltage controlling the pull-up and pull-down delays of the two paths, respectively. For instance, as the signal level becomes higher, the RX<sub>P</sub> delay increases while the RX<sub>N</sub> delay decreases. This operation is analogous to that of a differential amplifier, which amplifies the differential input voltage while rejecting the common mode input voltage. Vin controls the delay of the first stage for the RX<sub>N</sub> path but the delay of the second stage for the RX<sub>P</sub> path, leading to an asymmetric circuit structure. To eliminate any systematic delay mismatch between the RX<sub>N</sub> and RX<sub>P</sub> paths, we added a buffer



Fig. 8. Example of how the tunable delay is used to produce different thresholds. (a) Timing diagram before the tunable delay. (b) After the tunable delay, both edges are delayed by  $T_{TH}$ . The threshold delay is aligned to the middle of  $T_{RXN}$  and  $T_{RXP}$ .





Fig. 9. Block diagram of the proposed PAM-4 transceiver.

stage after the DVTC with carefully sized loading. This helps improve the linearity of the DVTC.

In the ideal case, the proposed DVTC can attain two times the voltage-to-time gain of the previous VTC design as it utilizes the entire input voltage range, from 0V to Vdd. The differential operation can also cancel out the non-linearity in the two delay paths, improving the linearity and increasing the input range. Post-layout simulation results are shown in Fig. 3 (bottom right). The dashed lines denoted as RXP and RXN correspond to the transfer curves of the previous VTC approach. The differential configuration of the proposed DVTC expands the delay range from 42ps to 70ps over the entire input voltage range, from VSS to VDD. The input sensitivity of the proposed DVTC is 59 ps/V, which is lower than the 70 ps/V of the previous VTC, but the sensitivity is maintained over the entire rail-to-rail voltage range. The SNR of the proposed DVTC is 50 dB using the noise values from [4]. The output delay difference is slightly less than the ideal 2X improvement due to the PMOS header and NMOS footer having different drive currents. This asymmetry can be resolved by carefully adjusting the device sizing ratio to match the sensitivities of the two delay lines. For example, if signal RXN controlled by the NMOS has a shorter delay compared to signal RXP which is controlled by the PMOS, we can either increase the PMOS width or decrease the NMOS width to bring the two delays closer to each other.
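The ideal-case argument can be checked with a toy model. The sketch below assumes each delay path has a linear term plus an identical quadratic (even-order) term around a mid-rail input; the slope, curvature, base delay and mid-rail values are hypothetical and are not fitted to the post-layout data in Fig. 3.

```python
V_MID = 0.5  # assumed mid-rail input for a 1 V supply (illustrative)


def path_delay_ps(v_in, slope_ps_per_v, curvature_ps_per_v2, t0_ps=100.0):
    """Delay of one inverter path: base delay + linear term + quadratic term."""
    x = v_in - V_MID
    return t0_ps + slope_ps_per_v * x + curvature_ps_per_v2 * x * x


def dvtc_delta_ps(v_in, k=30.0, c=20.0):
    """Differential output T_RXP - T_RXN with opposite slopes on the two paths."""
    t_rxp = path_delay_ps(v_in, +k, c)  # delay rises with Vin
    t_rxn = path_delay_ps(v_in, -k, c)  # delay falls with Vin
    return t_rxp - t_rxn


inputs = [0.0, 0.25, 0.5, 0.75, 1.0]
single_ended = [path_delay_ps(v, 30.0, 20.0) for v in inputs]
differential = [dvtc_delta_ps(v) for v in inputs]
```

In this idealized model the quadratic terms cancel and the differential output is exactly `2 * k * (v_in - V_MID)`, i.e. twice the single-path gain. A real DVTC falls short of this because the PMOS and NMOS paths do not have matched slopes or curvature, as noted above.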

Fig. 10. Implementation of 3-tap FFE and voltage mode driver.

To study the DVTC's operation speed, we can simplify the DVTC as shown in Fig. 4. Transistor MP2 can be modeled as a current source whose value is modulated by the input data. MP1 is the sampling transistor gating the MP2 current. The simplified circuit on the right is basically a standard sampling circuit. Readers can follow the analysis in [20] for deriving the aperture time and AC bandwidth of this circuit. To quantify the linearity of the proposed DVTC circuit, we simulated the DVTC's spurious-free dynamic range (SFDR) as shown in Fig. 5. The Y-axis is the power and the X-axis is the frequency harmonic tone. The simulated SFDR of the proposed DVTC is 30dB.
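For readers unfamiliar with the metric, the sketch below shows one common way to compute SFDR: apply a coherently sampled sine to a transfer function, take a DFT, and compare the fundamental with the largest spur. The cubic transfer function and its coefficient are hypothetical stand-ins chosen for easy checking; they do not represent the DVTC transfer curve.

```python
import cmath
import math


def tone_amplitude(samples, k):
    """Amplitude of the tone in DFT bin k of a real, coherently sampled signal."""
    n = len(samples)
    acc = sum(s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(samples))
    return 2 * abs(acc) / n


def sfdr_db(samples, fundamental_bin):
    """Ratio (in dB) of the fundamental to the largest other tone below Nyquist."""
    n = len(samples)
    fund = tone_amplitude(samples, fundamental_bin)
    spurs = [tone_amplitude(samples, k) for k in range(1, n // 2) if k != fundamental_bin]
    return 20 * math.log10(fund / max(spurs))


N, CYCLES, A3 = 64, 3, 0.1  # record length, input cycles, hypothetical cubic term
x = [math.sin(2 * math.pi * CYCLES * i / N) for i in range(N)]
y = [v + A3 * v ** 3 for v in x]  # weakly nonlinear transfer (illustrative)
sfdr = sfdr_db(y, CYCLES)
```

For this cubic example the result can be verified by hand: since $\sin^3\theta = (3\sin\theta - \sin 3\theta)/4$, the fundamental amplitude is $1 + 3a_3/4$ and the third harmonic is $a_3/4$, so the SFDR is $20\log_{10}(1.075/0.025)$, roughly 32.7 dB.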

Variation of the DVTC circuit will affect the overall transceiver performance. We simulated the DVTC circuit under extreme process, temperature, and voltage conditions to study the variation issue (Fig. 6). The TT, FF and SS corner curves show a similar linearity because the PMOS and NMOS have the same skew. On the other hand, when the PMOS and NMOS have opposite skews, as in the FS and SF corners, the gain curves deviate from the ideal straight line. The inverter chain delay increases at higher temperatures, and as a result, the conversion gain improves at the higher temperature corner.



Fig. 11. Block diagram of the proposed time-based PAM-4 DFE. The timing diagram illustrates how the delay difference is manifested after each delay stage pair for the four PAM-4 signal levels. Besides the low voltage friendly design, our time-based approach obviates the need for DAC circuits, which are typically used in PAM-4 systems for generating threshold voltage levels.

We reduced the supply voltage from 1V to 0.8V while limiting the input voltage range accordingly. The conversion gain increases at lower supply voltages due to the longer inverter chain delay. Overall, the proposed DVTC circuit can achieve good linearity and a reasonably consistent gain even under extreme PVT variations.

## IV. PAM-4 TIME-BASED DELAY DETECTION

In voltage-based PAM-4 links, the input signal level is compared with three threshold voltages to determine the two bit data. The threshold voltages are usually generated by separate digital-to-analog converters (DACs), which makes PAM-4 designs more complicated than NRZ designs. In time-based links, however, different threshold delays can be implemented using programmable delay stages, which is equivalent to designing a built-in digital-to-time converter (DTC). Fig. 7(a) shows the DVTC circuit implementation for the NRZ scheme. By properly controlling the pull-up and pull-down strengths, we can make the delay difference  $\Delta t = T_{RXP} - T_{RXN}$  either positive or negative based on the input data value. PAM-4, on the other hand, requires  $\Delta t$  to be compared with three threshold delays  $T_{TH,H}$ ,  $T_{TH,M}$ , and  $T_{TH,L}$  in order to distinguish between the four delay levels. This is achieved by programming the individual delays of each signal pair as shown in Fig. 7 (b). That is, different delay offsets are added to the delay lines to generate three delay differences. The programmable delay stages can also be used to calibrate out any static offset. As shown in Fig. 7 (b), the time offset T<sub>offset</sub> induced by static variation sources can be accounted for when configuring the threshold delays. Three parallel PDs followed by a decoder in the DFE block can detect and decode the 4 distinct delay levels.
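The three-PD detection with per-branch delay offsets can be sketched behaviorally as follows. The function names, the threshold values, the binary (non-Gray) output mapping and the way the static offset is folded into the threshold are illustrative assumptions, not the decoder logic of the test chip.

```python
def phase_detector(t_rxp_ps, t_rxn_ps):
    """Return 1 if the RXP edge arrives later than the RXN edge, else 0."""
    return 1 if t_rxp_ps > t_rxn_ps else 0


def pam4_detect(t_rxp_ps, t_rxn_ps, thresholds_ps, t_offset_ps=0.0):
    """Decode one symbol from three PDs, each fed through its own delay offset.

    Delaying the RXN edge by (threshold + static offset) makes each PD test
    whether delta_t = T_RXP - T_RXN exceeds that threshold. A negative value
    stands for delay that would physically be added to the RXP path instead.
    """
    pd_bits = [phase_detector(t_rxp_ps, t_rxn_ps + th + t_offset_ps)
               for th in thresholds_ps]  # order: L, M, H
    level = sum(pd_bits)  # thermometer code -> level 0..3
    return pd_bits, (level >> 1, level & 1)


TH_PS = [-20.0, 0.0, 20.0]  # hypothetical T_TH,L, T_TH,M, T_TH,H
pd_bits, bits = pam4_detect(10.0, 0.0, TH_PS)
uncalibrated = pam4_detect(2.0, 0.0, TH_PS)[1]
calibrated = pam4_detect(2.0, 0.0, TH_PS, t_offset_ps=12.0)[1]
```

In the last two lines, a nominal delay difference of -10 ps is observed as +2 ps because of a 12 ps static offset. Without calibration the symbol is decoded one level too high; shifting all three thresholds by the same 12 ps recovers the intended level.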

Fig. 8 shows the timing diagrams for PAM-4 voltage levels  $V_{11}$  and  $V_{10}$ . Without the threshold delay stage,  $T_{RXP}$  is always earlier than  $T_{RXN}$  as shown in Fig. 8(a). The PD cannot detect the delay difference between the two delay lines under this condition. After applying the offset delay  $T_{TH}$  as shown in Fig. 8(b), the two rising edges are shifted to the right by the same amount of delay. By properly selecting  $T_{TH}$ , the threshold can be aligned to the middle of two rising




Fig. 12. Implementation of digital control delay cell [4] and phase detector [19].



Fig. 13. Timing diagram of the odd cycle and middle level of the TB-DFE design in Fig. 11. The DFE loop should satisfy the timing requirement  $\Delta T_{Tap} + \Delta T_{PD} < T$ .



Fig. 14. Measured BER bathtub curves with and without TB-DFE.

edges, allowing the PD to detect either a positive or negative delay difference. The threshold delay buffers are also used to measure the time-domain eye-diagram as described in Section VI.

Jitter in the inverter based delay line is an important consideration for TB-DFE operation. The total jitter increases with a longer delay line, which must be accurately accounted for in the circuit design. A detailed jitter analysis was performed in our previous work [4], showing that the jitter impact is minimal compared to the clock period.

## V. PAM-4 TRANSCEIVER IMPLEMENTATION

The block diagram of the full PAM-4 transceiver system is shown in Fig. 9. On the TX side, we adopted a 3-tap half-rate feed forward equalizer (FFE) and parallel voltage-mode drivers previously published in [4]. An on-chip 2<sup>7</sup>-1 pseudo random



Fig. 15. Measured BER eye-diagram.



Fig. 16. Die photo and feature summary table.

bit sequence (PRBS) generator is included to generate the data stream. The RX side includes the proposed DVTC, a half rate 2-tap TB-DFE, a PAM-4 decoder and a BER monitor. The on-chip channel is terminated by a 60 ohm impedance at the RX side. An on-chip clock generator is implemented to provide the 8GHz clock.
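The 8 GHz clock follows from the data rate and the half-rate architecture. The short calculation below makes the arithmetic explicit; the variable names are ours.

```python
DATA_RATE_GBPS = 32.0   # link data rate from the paper
BITS_PER_SYMBOL = 2     # PAM-4 carries 2 bits per symbol

symbol_rate_gbaud = DATA_RATE_GBPS / BITS_PER_SYMBOL  # symbols per nanosecond
half_rate_clock_ghz = symbol_rate_gbaud / 2            # two symbols per clock period
ui_ps = 1e3 / symbol_rate_gbaud                        # unit interval in picoseconds
```

This gives a 16 GBaud symbol rate, an 8 GHz half-rate clock, and a 62.5 ps unit interval.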

Detailed implementation of the 3-tap half-rate FFE and output combiner is shown in Fig. 10. It consists of two parallel FFE paths, one for the PAM-4 upper bit and one for the lower bit. An 8 Gb/s random bit stream is fed to the half-rate FFE for signal pre-emphasis. Two voltage mode drivers with 2X and 1X drive strengths combine the upper and lower bits through a resistive divider circuit to produce a PAM-4 signal. The simulated minimum and maximum voltage levels of the TX output were 320mV and 1V, respectively. The output combiner has 4-bit programmability to support different pre-emphasis levels.
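The 2X/1X binary weighting can be illustrated with a simple level calculation. The sketch assumes ideal, equally spaced levels between the simulated 320 mV and 1 V extremes and ignores pre-emphasis; the intermediate levels it produces are therefore estimates, not values reported in the paper.

```python
V_MIN, V_MAX = 0.320, 1.000  # simulated TX output extremes (volts)


def tx_level(msb, lsb, v_min=V_MIN, v_max=V_MAX):
    """Ideal PAM-4 level: the 2X driver carries the upper bit, the 1X driver the lower."""
    code = 2 * msb + lsb  # binary weighting -> 0..3
    return v_min + (v_max - v_min) * code / 3


levels = {f"{m}{l}": tx_level(m, l) for m in (0, 1) for l in (0, 1)}
step_v = (V_MAX - V_MIN) / 3
```

Under these assumptions the level spacing is about 227 mV, so the intermediate levels fall near 547 mV and 773 mV.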

Fig. 11 shows the implementation of the time-based PAM-4 DFE along with the signal waveforms for each delay stage. To support half rate operation, the input data is applied to two DVTC paths clocked by opposite phase clocks CLKP and CLKN. Differential output signals  $RX_N$  and  $RX_P$  from the odd and even DVTCs enter the time-based DFE block. The input voltage is converted to a delay difference between  $RX_N$  and

|                          | ISSCC'15 [7]           | JSSC'14 [10]           | ISSCC'19 [13]          | ISSCC'16 [12]        | JSSC'17 [3]            | This work                    |
|--------------------------|------------------------|------------------------|------------------------|----------------------|------------------------|------------------------------|
| Signaling                | NRZ                    | Duobinary              | PAM-3                  | Multi-Band           | NRZ                    | PAM-4                        |
| Single/Differential      | Single-Ended           | Single-Ended           | Single-Ended           | Differential         | Single-Ended           | Single-Ended                 |
| RX Circuit Type          | Voltage-Based          | Voltage-Based          | Voltage-Based          | Voltage-Based        | Time-Based             | Time-Based                   |
| RX Equalization          | IIR+FIR DFE            | 1-Tap DFE              | 1-Tap DFE              | Self-Equalization    | 2-Tap DFE              | 2-Tap DFE                    |
| Data Rate                | 10 Gb/s                | 7 Gb/s                 | 27 Gb/s                | 10 Gb/s              | 12.5 Gb/s              | 32 Gb/s                      |
| Technology               | 65nm                   | 65nm                   | 28nm                   | 28nm                 | 65nm                   | 65nm                         |
| Voltage                  | 1.0V                   | 1.05V                  | 0.6V                   | 1.2V                 | 0.8V                   | 1.2V                         |
| Channel Loss             | 10.11dB@5GHz           | 0.8dB@3.5GHz           | 20mm                   | 6dB@6GHz             | 14dB@6.25GHz           | *9.2dB@8GHz<br>**11.6dB@8GHz |
| Eye Width @<br>BER=1E-12 | 0.62 UI                | 0.28 UI                | N/A                    | N/A                  | 0.41 UI                | 0.012 UI                     |
| BER                      | <1E-12                 | <1E-12                 | <1E-12                 | <1E-12               | <1E-12                 | <1E-12                       |
| TRX Area                 | 0.0091 mm <sup>2</sup> | 0.0333 mm <sup>2</sup> | 0.0135 mm <sup>2</sup> | 0.01 mm <sup>2</sup> | 0.0094 mm <sup>2</sup> | 0.009 mm <sup>2</sup>        |
| TRX Power<br>Efficiency  | 4.18 pJ/b              | 0.56 pJ/b              | 1.03 pJ/b              | 0.95 pJ/b            | 0.49 pJ/b              | 0.97 pJ/b                    |

\*EM simulation results

\*\*Proposed channel loss monitor measured results

Fig. 17. Comparison with other high speed memory interface papers.

$RX_P$  by the DVTC circuit. The four delay levels corresponding to voltage levels  $V_{00}$ ,  $V_{01}$ ,  $V_{10}$ , and  $V_{11}$  must be compared with three threshold delays:  $T_{REF,H}$ ,  $T_{REF,M}$ , and  $T_{REF,L}$ . This operation is performed by the three delay chain blocks denoted as H, M, and L. Each block contains two separate delay paths for  $RX_N$  and  $RX_P$  signals, respectively. The first buffer stage performs the delay comparison as discussed in Section IV, while the second buffer stage performs the 2-tap DFE operation. The number of stages was reduced by implementing 6-bit DFE weights  $w_1$  and  $w_2$  in the upper and lower paths, respectively.
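A behavioral view of the 2-tap DFE may help here. The sketch below works in normalized level units rather than picoseconds, uses a hypothetical two-post-cursor channel, and applies ideal tap weights equal to the channel post-cursors; the 6-bit delay codes and half-rate interleaving of the actual design are not modeled.

```python
LEVELS = (-3, -1, 1, 3)  # normalized PAM-4 delay-difference levels


def slicer(value):
    """Three-threshold decision at -2, 0 and +2 (normalized units)."""
    return LEVELS[(value > -2) + (value > 0) + (value > 2)]


def channel(symbols, h1, h2):
    """Add first and second post-cursor ISI to each symbol (noise-free)."""
    out = []
    for n, s in enumerate(symbols):
        prev1 = symbols[n - 1] if n >= 1 else 0
        prev2 = symbols[n - 2] if n >= 2 else 0
        out.append(s + h1 * prev1 + h2 * prev2)
    return out


def receive(samples, w1, w2):
    """Subtract w1*d[n-1] + w2*d[n-2] from each sample, then slice."""
    decisions = []
    for n, d in enumerate(samples):
        prev1 = decisions[n - 1] if n >= 1 else 0
        prev2 = decisions[n - 2] if n >= 2 else 0
        decisions.append(slicer(d - w1 * prev1 - w2 * prev2))
    return decisions


tx = [3, 3, -1, -3, -3, 1, 3, -1, 1, -3]
rx = channel(tx, h1=0.3, h2=0.1)  # hypothetical post-cursor taps
errors_no_dfe = sum(a != b for a, b in zip(tx, receive(rx, 0.0, 0.0)))
errors_dfe = sum(a != b for a, b in zip(tx, receive(rx, 0.3, 0.1)))
```

With the taps disabled, symbols that follow two large same-sign symbols are pushed across a threshold and decoded incorrectly. With matched weights the ISI is removed before slicing and the sequence is recovered without error in this noise-free model.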

The implementation of the 6-bit digitally-controlled delay cell is shown in Fig. 12. The same delay cell is used for realizing the DFE taps and threshold delays. It consists of an always-on inverter connected in parallel with three tri-state inverters and three MOS capacitors, which are individually controlled. Note that the control logic and switches are different depending on the functionality of the delay cell. To support half-rate operation, a total of 6 delay chain blocks with 12 delay paths and 6 PDs are implemented in our design. One notable advantage of our proposed time-based implementation is the absence of any DAC circuits for generating reference voltages V<sub>TH,H</sub>, V<sub>TH,M</sub>, and V<sub>TH,L</sub>. These analog voltages are required in conventional voltage-based PAM-4 designs to detect the PAM-4 voltage levels. In our time-based implementation, simple programmable delay stages are used in lieu of DACs, which significantly reduces the design complexity and circuit area. Moreover, the proposed receiver employs mostly digital circuits which can take full advantage of the technology scaling benefits. The timing waveforms in the bottom of Fig. 11 show how the delay signals are manifested in each delay stage. Signals RX<sub>N</sub> and RX<sub>P</sub> have different relative delays depending on the four signal levels. These delays are compared with different reference delays in the first delay stage. ISI noise is cancelled out in the second delay stage and the delay polarity is sampled by a PD circuit. The results of the PD are decoded by a PAM-4 decoder and the BER is measured using an on-chip error counting circuit. The phase detector is composed of an SR-latch followed by a flip-flop circuit as shown in Fig. 12 [19]. The minimum time difference resolution of the phase detector is 3ps for a 10% delay shift criterion [4].

The timing diagram of the DFE feedback loop is shown in Fig. 13. For simplicity, we only show the signals related to the odd clock and middle level block in Fig. 11. RXP and RXN are the two output signals of the DVTC block. THP and THN refer to the signals after the threshold delay. The signals after the DFE delay stage are denoted as DFEP and DFEN.  $\Delta T_{Tap}$  represents the DFE tap delay and  $\Delta T_{PD}$  represents the phase detector delay. The decision data should be ready before the next clock edge arrives at the DFE taps as shown in the timing diagram. This means that the delay of the DFE feedback loop should fit within one unit interval; i.e.  $\Delta T_{Tap} + \Delta T_{PD} < T$ .
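The loop constraint can be expressed as a simple check. Following the text, T is taken here as one unit interval, which is 62.5 ps for a 32 Gb/s PAM-4 link; the tap and PD delay values are hypothetical placeholders, not measured numbers.

```python
UI_PS = 62.5  # unit interval of a 16 GBaud (32 Gb/s PAM-4) link


def dfe_loop_meets_timing(dt_tap_ps, dt_pd_ps, t_ps=UI_PS):
    """True if the feedback loop delay dT_Tap + dT_PD fits within T."""
    return dt_tap_ps + dt_pd_ps < t_ps


def loop_slack_ps(dt_tap_ps, dt_pd_ps, t_ps=UI_PS):
    """Remaining margin (positive) or violation (negative) in picoseconds."""
    return t_ps - (dt_tap_ps + dt_pd_ps)


ok = dfe_loop_meets_timing(25.0, 30.0)       # hypothetical delays
too_slow = dfe_loop_meets_timing(40.0, 30.0)  # hypothetical delays
slack = loop_slack_ps(25.0, 30.0)
```

A check of this form is useful when choosing the maximum programmable tap delay, since a larger tap range directly reduces the slack left for the phase detector.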

## VI. MEASUREMENT RESULTS

A PAM-4 test chip featuring the aforementioned techniques was implemented in a 65nm GP process. All high frequency signals were generated on-chip while the error rate was measured using an on-chip BER monitor specifically designed for the time-based DFE block. This allows all chip performance parameters to be measured using a simple test setup with no high speed equipment. For ease of testing, the channel was also implemented inside the chip using a folded M9 metal line with a length of 26mm and a width of 2 $\mu$m. A PXI based data acquisition system was used to serially scan in the control signals and read out the BER data. The error rate of the PAM-4 link was measured using an on-chip BER monitor adopted from our previous work [4]. We first swept the 6 bit weight of the first tap, which requires 64 scan-in trials. Once we found the optimal code for the first tap, we fixed the value and swept the 6 bit weight of the second tap. In total, it takes 128 scan-in trials to find the optimal weight configuration. The measured

bathtub curves in Fig. 14 show a wider operating window enabled by the proposed TB-DFE. A time-domain (i.e. time versus time) BER eye-diagram is shown in Fig. 15 for BER levels below  $10^{-9}$ . The small discrepancy between the BER data in Figs. 14 and 15 might be due to the long measurement time (hours to days) and the environmental noise difference between the testing runs.
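The two-step tap weight search described earlier in this section can be sketched as a coordinate-wise sweep. The `toy_ber` function is a hypothetical stand-in for a scan-in BER measurement, and holding the second tap at code 0 during the first sweep is our assumption; the paper does not state the value used.

```python
def sequential_tap_search(measure_ber, bits=6):
    """Sweep tap 1 with tap 2 held at 0, then sweep tap 2 with tap 1 fixed."""
    codes = range(2 ** bits)
    trials = 0

    def trial(w1, w2):
        nonlocal trials
        trials += 1  # one scan-in and BER readout per call
        return measure_ber(w1, w2)

    best_w1 = min(codes, key=lambda c: trial(c, 0))
    best_w2 = min(codes, key=lambda c: trial(best_w1, c))
    return best_w1, best_w2, trials


def toy_ber(w1, w2):
    """Hypothetical bowl-shaped BER surface with its minimum at codes (37, 12)."""
    return 1e-12 + 1e-6 * ((w1 - 37) ** 2 + (w2 - 12) ** 2)


best_w1, best_w2, trials = sequential_tap_search(toy_ber)
exhaustive_trials = (2 ** 6) ** 2
```

The sequential sweep needs 128 trials, compared with 4096 for an exhaustive search over both 6-bit codes. It finds the joint optimum only when the two taps interact weakly, as they do on the separable toy surface used here.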

Fig. 16 shows the test chip die photo and summary table. The TX and RX circuits occupy a chip area of  $31 \times 72\ \mu m^2$  and  $73 \times 89\ \mu m^2$ , respectively. The PAM-4 transceiver operates at a data rate of 32 Gb/s. The overall energy efficiency of the transceiver was 0.97 pJ/b at a 1.2V supply voltage while achieving a BER less than  $10^{-12}$ . Fig. 17 compares the proposed design with previous high speed memory interfaces. The proposed design is highly competitive in terms of data rate, area and power efficiency. The eye width is narrow for BER less than 1E-12, possibly due to (1) the non-linearity of the DVTC and (2) the high data rate we chose for a time-based scheme. It is worth emphasizing the main advantage of time-based designs, which is a digital-friendly implementation based on inverters and programmable delays. This ensures good scalability, low voltage compatibility, and a reduced design effort.
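The reported figures can be cross-checked with a few lines of arithmetic. The total power below is derived from the stated energy efficiency and data rate; it is our calculation rather than a number quoted in the paper.

```python
TX_AREA_UM2 = 31 * 72  # TX footprint in square micrometers
RX_AREA_UM2 = 73 * 89  # RX footprint in square micrometers
total_area_mm2 = (TX_AREA_UM2 + RX_AREA_UM2) / 1e6

ENERGY_PJ_PER_BIT = 0.97
DATA_RATE_GBPS = 32.0
power_mw = ENERGY_PJ_PER_BIT * DATA_RATE_GBPS  # pJ/b x Gb/s = mW
```

The combined area is 0.008729 mm<sup>2</sup>, which rounds to the 0.009 mm<sup>2</sup> quoted in the abstract and in Fig. 17, and the implied transceiver power is about 31 mW.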

## VII. CONCLUSION

In this work, we demonstrated for the first time a time-based PAM-4 transceiver for high speed memory interfaces. The time-based operation allows the entire receiver circuit to be implemented using digital gates, ensuring good scalability and energy-proportionality. The BER of the 65nm transceiver circuit was measured down to  $10^{-12}$  using an on-chip time-based BER monitor. Our studies show that the proposed digital intensive design can reduce the design complexity of PAM-4 signal modulation circuits while delivering performance comparable to that of traditional analog transceivers.

#### REFERENCES

- [1] T. M. Hollis *et al.*, "Recent evolution in the DRAM interface: Milemarkers along memory lane," *IEEE Solid State Circuits Mag.*, vol. 11, no. 2, pp. 14–30, Jun. 2019.
- [2] B. Dehlaghi, N. Wary, and T. C. Carusone, "Ultra-short-reach interconnects for die-to-die links: Global bandwidth demands in microcosm," *IEEE Solid State Circuits Mag.*, vol. 11, no. 2, pp. 42–53, Jun. 2019.
- [3] I.-M. Yi *et al.*, "A time-based receiver with 2-tap DFE for a 12 Gb/s/pin single-ended transceiver of mobile DRAM interface in 0.8 V 65 nm CMOS," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2017, pp. 400–401.
- [4] P.-W. Chiu, S. Kundu, Q. Tang, and C. H. Kim, "A 65-nm 10-Gb/s 10-mm on-chip serial link featuring a digital-intensive time-based decision feedback equalizer," *IEEE J. Solid-State Circuits*, vol. 53, no. 4, pp. 1203–1213, Apr. 2018.
- [5] P.-W. Chiu, M. Liu, Q. Tang, and C. H. Kim, "A 2.1 pJ/bit, 8 Gb/s ultra-low power in-package serial link featuring a time-based front-end and a digital equalizer," in *Proc. IEEE Asian Solid-State Circuits Conf.* (A-SSCC), Nov. 2018, pp. 187–190.
- [6] Y. Chun, A. Ramachandran, and T. Anand, "A PAM-8 wireline transceiver with receiver side PWM (Time-Domain) feed forward equalization operating from 12-to-39.6Gb/s in 65nm CMOS," in *Proc. IEEE 45th Eur. Solid State Circuits Conf. (ESSCIRC)*, Cracow, Poland, Sep. 2019, pp. 269–272.
- [7] J. Song, H.-W. Lee, J. Kim, S. Hwang, and C. Kim, "1 V 10 Gb/s/pin single-ended transceiver with controllable active-inductor-based driver and adaptively calibrated cascade-DFE for post-LPDDR4 interfaces," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2015, pp. 320–321.

- [8] I. A. Young *et al.*, "Optical I/O technology for tera-scale computing," *IEEE J. Solid-State Circuits*, vol. 45, no. 1, pp. 235–248, Jan. 2010.
- [9] J. Lee, M.-S. Chen, and H.-D. Wang, "Design and comparison of three 20-Gb/s backplane transceivers for duobinary, PAM4, and NRZ data," *IEEE J. Solid-State Circuits*, vol. 43, no. 9, pp. 2120–2133, Sep. 2008.
- [10] S.-M. Lee *et al.*, "An 80 mV-swing single-ended duobinary transceiver with a TIA RX termination for the point-to-point DRAM interface," *IEEE J. Solid-State Circuits*, vol. 49, no. 11, pp. 2618–2630, Nov. 2014.
- [11] K. Gharibdoust, A. Tajalli, and Y. Leblebici, "Hybrid NRZ/multi-tone serial data transceiver for multi-drop memory interfaces," *IEEE J. Solid-State Circuits*, vol. 50, no. 12, pp. 3133–3144, Dec. 2015.
- [12] W. Cho *et al.*, "A 38 mW 40 Gb/s 4-lane tri-band PAM-4 / 16-QAM transceiver in 28 nm CMOS for high-speed memory interface," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2017, pp. 184–185.
- [13] H. Park, J. Song, Y. Lee, J. Sim, J. Choi, and C. Kim, "A 3-bit/2UI 27 Gb/s PAM-3 single-ended transceiver using one-tap DFE for next-generation memory interface," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2019, pp. 382–384.
- [14] J. Im *et al.*, "A 40-to-56 Gb/s PAM-4 receiver with ten-tap direct decision-feedback equalization in 16-nm FinFET," *IEEE J. Solid-State Circuits*, vol. 52, no. 12, pp. 3486–3502, Dec. 2017.
- [15] P.-J. Peng, J.-F. Li, L.-Y. Chen, and J. Lee, "A 56 Gb/s PAM-4/NRZ transceiver in 40 nm CMOS," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2017, pp. 110–111.
- [16] Y. Chang, A. Manian, L. Kong, and B. Razavi, "An 80-Gb/s 44-mW wireline PAM4 transmitter," *IEEE J. Solid-State Circuits*, vol. 53, no. 8, pp. 2214–2226, Aug. 2018.
- [17] A. Roshan-Zamir *et al.*, "A 56-Gb/s PAM4 receiver with low-overhead techniques for threshold and edge-based DFE FIR- and IIR-tap adaptation in 65-nm CMOS," *IEEE J. Solid-State Circuits*, vol. 54, no. 3, pp. 672–684, Mar. 2019.
- [18] P.-W. Chiu and C. Kim, "A 32 Gb/s digital-intensive single-ended PAM-4 transceiver for high-speed memory interfaces featuring a 2-tap time-based decision feedback equalizer and an *in-situ* channel-loss monitor," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2020, pp. 336–338.
- [19] S. Kundu, B. Kim, and C. H. Kim, "A 0.2-to-1.45 GHz subsampling fractional-N digital MDLL with zero-offset aperture PD-based spur cancellation and *in-situ* static phase offset detection," *IEEE J. Solid-State Circuits*, vol. 52, no. 3, pp. 799–811, Jan. 2017.
- [20] H. O. Johansson and C. Svensson, "Time resolution of NMOS sampling switches used on low-swing signals," *IEEE J. Solid-State Circuits*, vol. 33, no. 2, pp. 237–245, Feb. 1998.
- [21] J. Lee. Communication Integrated Circuits. [Online]. Available: https://cc.ee.ntu.edu.tw/~jrilee/publications/Comm\_IC.pdf



**Po-Wei Chiu** was born in Tainan, Taiwan, in 1989. He received the B.S. and M.S. degrees in electrical engineering from the National Tsing Hua University (NTHU), Hsinchu, Taiwan, in 2011 and 2013, respectively, and the Ph.D. degree from the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN, USA, in 2019.

He is currently a SerDes Circuit Design Engineer with Apple Inc., Cupertino, CA, USA. His research interests include high-speed mixed-signal integrated circuit design, such as high-speed optical I/O for optical links and high-speed serial links.



**Chris H. Kim** (Fellow, IEEE) is currently the Louis John Schnell Professor in electrical and computer engineering at the University of Minnesota, Minneapolis, MN, USA. His group has expertise in digital, mixed-signal, and memory IC design, with an emphasis on circuit reliability, hardware security, memory circuits, radiation effects, time-based circuits, machine learning, and quantum-inspired hardware design.

He was a recipient of the SRC Technical Excellence Award for his Silicon Odometer work, the Taylor Award for Distinguished Research at the University of Minnesota, the NSF CAREER Award, the McKnight Foundation Land-Grant Professorship, and the IBM Faculty Partnership Award.