22.4 A 32Gb/s Digital-Intensive Single-Ended PAM-4 Transceiver for High-Speed Memory Interfaces Featuring a 2-Tap Time-Based Decision Feedback Equalizer and an In-Situ Channel-Loss Monitor

Po-Wei Chiu, Chris Kim
University of Minnesota, Minneapolis, MN

Single-ended transceivers that can deliver high data rates at reduced supply voltages are required to meet the ever-growing demands of future memory interfaces. The performance of conventional non-return-to-zero (NRZ) links is usually limited by inter-symbol interference (ISI) caused by high channel losses. Alternative schemes such as duobinary [1], three- or four-level pulse amplitude modulation (PAM-3, PAM-4) [2], and multi-band signaling [3] were proposed to increase bandwidth efficiency. In particular, PAM-4 signaling utilizes four signal levels to send 2b per unit interval, at the expense of more complex TX and RX circuits that result in higher power consumption and larger chip area. While this approach has been gaining popularity for ultra-high-speed (>50Gb/s) links, a more compact implementation is needed for memory interface applications. In this paper, we propose a digital-intensive PAM-4 receiver targeted at memory interfaces; time-based circuits are used for the decision feedback equalization (DFE). Unlike traditional current-mode logic, time-based circuits can be realized using inverters and programmable loads, making them well suited for low-voltage, energy-efficient memory interfaces.
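As a small illustration of how PAM-4 carries 2b per unit interval, the sketch below maps bit pairs onto four levels and derives the symbol rate from the bit rate. The Gray mapping and the function names are assumptions made for this example; the paper does not state the bit-to-level mapping used on the chip.

```python
# Assumed Gray mapping from bit pairs to four level indices (0 = lowest level).
# The actual on-chip mapping is not given in the text.
GRAY_PAM4 = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}


def pam4_encode(bits):
    """Pack a flat bit list into PAM-4 level indices, two bits per symbol."""
    if len(bits) % 2:
        raise ValueError("PAM-4 needs an even number of bits")
    return [GRAY_PAM4[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]


def symbol_rate_gbaud(bit_rate_gbps, bits_per_symbol=2):
    """Symbol rate in GBaud for a given bit rate in Gb/s."""
    return bit_rate_gbps / bits_per_symbol


symbols = pam4_encode([0, 0, 0, 1, 1, 1, 1, 0])
baud = symbol_rate_gbaud(32)
```

Under these assumptions, eight bits become four symbols, and a 32Gb/s link signals at half that rate in symbols per second.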

Figure 22.4.1(left) shows the block diagram of the full transceiver system. The PAM-4 TX consists of a pseudo-random bit sequence (PRBS) generator for testing purposes, a 3-tap half-rate feed-forward equalizer (FFE), and parallel voltage-mode drivers. The PAM-4 RX includes a differential voltage-to-time converter (DVTC), a 2-tap time-based decision feedback equalizer (TB-DFE), a PAM-4 decoder, and an on-chip bit error rate (BER) monitor. In-situ channel loss monitors were implemented on both the TX and RX sides. The detailed implementation of the 3-tap FFE and output combiner is shown in Fig. 22.4.1(right). The 8Gb/s bit stream is fed to the half-rate FFE for signal pre-emphasis.
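The pre-emphasis performed by a 3-tap FFE can be sketched behaviorally as a short FIR filter over the symbol stream. The tap arrangement (one pre-cursor, one main, one post-cursor) and the coefficient values below are illustrative assumptions, not values from the paper, and the half-rate hardware structure is not modelled.

```python
def ffe_3tap(symbols, c_pre=-0.1, c_main=0.8, c_post=-0.1):
    """Symbol-rate 3-tap FIR: y[k] = c_pre*x[k+1] + c_main*x[k] + c_post*x[k-1].

    Tap weights are hypothetical; samples outside the sequence are taken as 0.
    """
    n = len(symbols)
    out = []
    for k in range(n):
        nxt = symbols[k + 1] if k + 1 < n else 0.0
        prev = symbols[k - 1] if k > 0 else 0.0
        out.append(c_pre * nxt + c_main * symbols[k] + c_post * prev)
    return out


# A lone pulse is spread into a negative pre-cursor, main cursor, and
# negative post-cursor, which boosts transitions relative to long runs.
pulse_response = ffe_3tap([0.0, 1.0, 0.0])
```

With these assumed weights, a steady run of identical symbols is attenuated to 0.6 of its amplitude while an isolated symbol keeps 0.8, which is the emphasis of high-frequency content that the channel would otherwise attenuate.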

Linearity and dynamic range are two important considerations in PAM-4 receiver design because of the multiple signal levels. In particular, accurate mapping of the four voltage levels to the corresponding time delays is a critical requirement for the TB-DFE. Two main types of VTCs have been used in previous works [4]-[5]. The design in [4] utilizes the clock-to-Q delay of a voltage comparator biased in a metastable condition; the offset voltage of the metastable comparator is tuned by the input voltage. The VTC design in [4] is more complex than the inverter-based VTC in [5] but can achieve a higher voltage-to-time sensitivity. Both VTCs suffer from linearity issues. To achieve high sensitivity, good linearity, and robust operation, we propose the DVTC circuit in Fig. 22.4.2(upper, right), where the incoming analog voltage Vin is connected to the PMOS header of the upper inverter as well as the NMOS footer of the lower inverter. As illustrated in the timing diagram, the RXp and RXn delays have opposite polarities because the same Vin voltage controls the pull-up delay of one path and the pull-down delay of the other. For instance, when the data is high, the RXp delay increases while the RXn delay decreases. This unique configuration expands the delay range from 42ps to 70ps, as shown in the simulation results in Fig. 22.4.2(bottom, right). Non-linearities in the two delay paths cancel out, enabling good linearity over the entire voltage range, from VSS to VDD.
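One way to see why taking the difference of two opposite-polarity delay paths can improve linearity is a toy model in which both paths share the same curvature term. Everything in the sketch below (the quadratic form, the coefficients `d0`, `k`, `a`, and the function names) is an assumption chosen for illustration; it is not the simulated behavior of the DVTC in Fig. 22.4.2.

```python
def path_delays_ps(vin, vdd=1.2, d0=50.0, k=20.0, a=15.0):
    """Toy delays (ps) of the two paths versus Vin.

    Assumed model: each path has a linear term of opposite sign plus an
    identical quadratic (even-order) term around mid-rail.
    """
    v = vin - vdd / 2.0
    rxp = d0 + k * v + a * v * v   # delay grows as Vin rises
    rxn = d0 - k * v + a * v * v   # delay shrinks as Vin rises
    return rxp, rxn


def differential_delay_ps(vin, **kwargs):
    """Delay difference RXp - RXn; the shared quadratic term drops out."""
    rxp, rxn = path_delays_ps(vin, **kwargs)
    return rxp - rxn


# Equal Vin steps give equal steps in differential delay in this model,
# whereas a single path alone gives unequal steps.
steps_diff = [differential_delay_ps(v) for v in (0.3, 0.6, 0.9)]
steps_single = [path_delays_ps(v)[0] for v in (0.3, 0.6, 0.9)]
```

In this model the differential output also has twice the sensitivity of either path, which is consistent in spirit with the expanded delay range reported above, although the real circuit's nonlinearity need not be purely quadratic.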

Figure 22.4.3 shows the implementation of the fully time-based PAM-4 DFE along with the signal waveforms for each delay stage. Differential output signals RXp and RXn from the odd and even DVTCs are fed to the time-based DFE block. The delay difference between RXp and RXn contains the signal information. The four delay levels corresponding to the four PAM-4 voltage levels must be compared with three threshold delays T_REF,H, T_REF,M, and T_REF,L. This operation is performed by the three delay chain blocks denoted H, M, and L. Each block contains two separate delay paths, one for the RXp signal and one for the RXn signal. The first buffer stage performs the delay comparison while the second buffer stage performs the 2-tap DFE operation. The length of the delay chain was reduced by implementing the 6-bit DFE weights w1 and w2 in the upper and lower paths, respectively. To support half-rate operation, a total of 6 delay chain blocks with 12 delay paths and 6 phase detectors (PDs) are implemented in our design. A notable advantage of our proposed time-based implementation is the absence of any DAC circuits for generating reference voltages VREF,M and VREF,L. These analog voltages are required in conventional voltage-based PAM-4 designs to detect the different voltage levels. In our time-based implementation, simple programmable delay stages are used in lieu of DACs, which significantly reduces the design complexity and circuit area. The timing waveforms in Fig. 22.4.3(bottom) show how the delay signals are manifested in each delay stage. Signals RXn and RXp have different relative delays depending on the four signal levels. These delays are compared with different reference delays in the first delay stage. ISI is cancelled out in the second delay stage and the delay polarity is sampled by a PD circuit. The results generated by the PDs are decoded by a PAM-4 decoder and the BER is measured using an on-chip monitor circuit.
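The slicing and feedback steps described above can be sketched behaviorally in the time domain. In the model below, each symbol arrives as a delay difference (RXp minus RXn, in ps), three reference delays play the role of the L, M, and H blocks, and two feedback weights remove post-cursor ISI before the polarity decision. The numeric delays, the symmetric symbol values, and the function name are assumptions for illustration; half-rate operation and the 6-bit weight quantization are not modelled.

```python
def tb_dfe_slice(dt_ps, t_ref_ps=(-20.0, 0.0, 20.0), w1_ps=0.0, w2_ps=0.0):
    """Behavioral 2-tap time-domain DFE slicer.

    dt_ps    : received delay differences per symbol, in ps
    t_ref_ps : assumed (L, M, H) reference delays
    w1_ps    : assumed first post-cursor weight, ps per unit symbol value
    w2_ps    : assumed second post-cursor weight, ps per unit symbol value
    Returns level indices 0..3.
    """
    prev1 = prev2 = 0.0          # previous decisions as values in {-3,-1,1,3}
    decisions = []
    for dt in dt_ps:
        corrected = dt - w1_ps * prev1 - w2_ps * prev2
        # Each phase detector reports the polarity against one reference.
        thermometer = [corrected > t for t in t_ref_ps]
        level = sum(thermometer)
        decisions.append(level)
        prev2, prev1 = prev1, 2.0 * level - 3.0
    return decisions


# Levels [3, 0, 2, 1, 3, 0] sent with an ideal spacing of 10 ps per unit
# value, plus 4 ps per unit of first post-cursor ISI (hypothetical channel).
received = [30.0, -18.0, -2.0, -6.0, 26.0, -18.0]
with_dfe = tb_dfe_slice(received, w1_ps=4.0)
without_dfe = tb_dfe_slice(received)
```

In this example the second symbol is misread without feedback because the ISI pushes its delay across the lowest reference, while the run with `w1_ps=4.0` recovers the transmitted sequence.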

S-parameters are the de facto measure of channel loss, but measuring them requires an extensive test setup including a high-frequency sinusoidal signal source. In this work, we designed an in-situ monitor that can indirectly measure the channel loss by sensing the signal swings of the TX and RX signals for a random bit sequence. The monitor circuit detects the TX and RX signal levels by comparing them with a known reference voltage Vthp. By sweeping Vthp and measuring the average toggling frequency of the comparator output using a divider circuit, we can extract the signal swing information without an extensive setup. Figure 22.4.4 shows how the signal swings are extracted from the measured frequency versus reference voltage data. We also introduce a channel loss parameter T1, which is essentially the ratio between the TX and RX signal swings. The area and power consumption of the proposed channel loss monitor are negligible.
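The swing extraction described above can be sketched as follows: sweep a reference level, count comparator output toggles at each setting, and take the span of reference values that still produce toggling as the signal swing. The sample waveforms, the 50mV sweep step, the use of a plain toggle count in place of the divider-based frequency measurement, and the dB expression for the swing ratio are all assumptions for this illustration; the exact definition of the loss parameter is given in Fig. 22.4.4 and is not reproduced here.

```python
import math


def toggle_count(samples_mv, vref_mv):
    """Number of comparator output transitions for one reference setting."""
    bits = [s > vref_mv for s in samples_mv]
    return sum(a != b for a, b in zip(bits, bits[1:]))


def extract_swing_mv(samples_mv, step_mv=50, vmax_mv=1200):
    """Estimate swing as the span of reference levels that cause toggling.

    The result is quantized to the sweep step (an assumed 50 mV here).
    """
    active = [v for v in range(0, vmax_mv + 1, step_mv)
              if toggle_count(samples_mv, v) > 0]
    if not active:
        return 0
    return max(active) - min(active) + step_mv


def loss_db(tx_swing_mv, rx_swing_mv):
    """Swing ratio expressed in dB (an assumed way to report the ratio)."""
    return 20.0 * math.log10(tx_swing_mv / rx_swing_mv)


# Hypothetical waveforms: full-swing TX, attenuated RX centered at mid-rail.
tx = [0, 1200] * 8
rx = [400, 800] * 8
tx_swing = extract_swing_mv(tx)
rx_swing = extract_swing_mv(rx)
estimated_loss = loss_db(tx_swing, rx_swing)
```

A finer sweep step tightens the swing estimate at the cost of more measurement points.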

A PAM-4 test chip featuring the aforementioned techniques was implemented in a 65nm GP process. Figure 22.4.5(upper row) shows the data from the in-situ channel loss monitor. The average frequencies of the TX and RX comparator outputs reach the same level at 1GHz due to the relatively small loss. As the frequency increases to 7GHz or higher, the RX comparator frequency saturates early due to the severe channel loss while the TX comparator frequency continues to rise. From the frequency versus voltage plot, we calculated the loss parameter defined in Fig. 22.4.4 and compared the results with S-parameter values obtained from electromagnetic simulations. The small discrepancy can be attributed to the non-sinusoidal random bit stream used for the channel characterization. The error rate of the PAM-4 link was measured using an on-chip BER monitor. The bathtub curves in Fig. 22.4.5(lower left) show that the TB-DFE enables an operating window with a BER less than 10^-12. A time-domain BER eye-diagram is shown in Fig. 22.4.5(lower right) down to a BER of <10^-12. Figure 22.4.6 compares the proposed transceiver with relevant previous works. The proposed design has competitive performance while offering the unique benefits of a time-based design. Figure 22.4.7 shows the die photo and feature summary of the 65nm chip. When operating at a data rate of 32Gb/s, the PAM-4 transceiver achieves an energy efficiency of 0.97pJ/b. The circuit areas of the TX and RX blocks are 31×72μm² and 89×73μm², respectively.
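The headline figures can be cross-checked with simple arithmetic. The sketch below derives the transceiver power implied by the reported energy per bit and data rate, and the block areas implied by the reported dimensions; the function names are hypothetical and the derived power is a back-of-the-envelope value, not a number quoted in the paper.

```python
def link_power_mw(energy_pj_per_bit, data_rate_gbps):
    """Power in mW: (pJ/b) * (Gb/s) = 1e-12 J/b * 1e9 b/s = 1e-3 W."""
    return energy_pj_per_bit * data_rate_gbps


def block_area_um2(width_um, height_um):
    """Rectangular block area in square micrometers."""
    return width_um * height_um


implied_power_mw = link_power_mw(0.97, 32)   # about 31 mW
tx_area_um2 = block_area_um2(31, 72)
rx_area_um2 = block_area_um2(89, 73)
total_area_um2 = tx_area_um2 + rx_area_um2
```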

Acknowledgment:
This research was supported in part by the National Science Foundation under award number CCF-1763761.

References:
Figure 22.4.1: (Left) Block diagram of the proposed PAM-4 transceiver. (Right) Implementation of 3-tap FFE and voltage mode driver.

Figure 22.4.2: Schematic, operating principle and timing diagram of the inverter-based VTC [5] and proposed DVTC. Post-layout simulation results show that the linearity and dynamic range are significantly improved.

Figure 22.4.3: Block diagram of proposed time-based PAM-4 DFE. The timing diagram illustrates how the delay difference is manifested after each delay stage pair for the four PAM-4 signal levels.

Figure 22.4.4: Proposed in-situ channel loss monitor and methodology for extracting channel loss parameter from the monitor output.

Figure 22.4.5: (Upper) Channel loss monitor data and reconstructed channel characteristics. (Lower) Bathtub curves with and without DFE for BER<10^-12, and time-domain BER eye-diagram.

Figure 22.4.6: Performance comparison with previous work.

Figure 22.4.7: Chip microphotograph and feature summary.

<table>
<thead>
<tr>
<th>Feature</th>
<th>Specification</th>
</tr>
</thead>
<tbody>
<tr>
<td>Technology</td>
<td>65nm CMOS</td>
</tr>
<tr>
<td rowspan="2">Circuit Area</td>
<td>TX: 31×72μm²</td>
</tr>
<tr>
<td>RX: 89×73μm²</td>
</tr>
<tr>
<td>VDD</td>
<td>1.2V</td>
</tr>
<tr>
<td>Data Rate</td>
<td>32 Gb/s</td>
</tr>
<tr>
<td>Channel Loss</td>
<td>11.6dB@8GHz</td>
</tr>
<tr>
<td>BER</td>
<td>&lt;10⁻¹²</td>
</tr>
<tr>
<td>Power Efficiency</td>
<td>0.97 pJ/b</td>
</tr>
</tbody>
</table>