Overview

As parallel interfaces push to higher speeds, keeping all data bits synchronized becomes increasingly difficult. This post explains why global clock distribution hits a wall — and how source-synchronous signaling solves it.


Signal Propagation on PCB

Before diving into timing, it helps to understand how fast signals actually travel on a PCB:

  • Typical propagation speed on FR4: 15–17 cm/ns
  • A 1 cm trace length difference ≈ 60–70 ps of timing skew

At 1 GHz (1 ns period), 70 ps of skew from a single centimeter of mismatch already consumes 7% of the clock period, before setup time, hold time, and jitter are accounted for. As data rates climb, this becomes unmanageable.
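The arithmetic above can be reproduced with a short sketch. The function names are illustrative, and the propagation speeds are simply the FR4 range quoted in the list; real boards vary with stackup and routing layer.

```python
def skew_ps(mismatch_cm: float, speed_cm_per_ns: float) -> float:
    """Skew in picoseconds caused by a trace-length mismatch."""
    return mismatch_cm / speed_cm_per_ns * 1000.0


def fraction_of_period(skew: float, clock_hz: float) -> float:
    """Share of one clock period consumed by the given skew (in ps)."""
    period_ps = 1e12 / clock_hz
    return skew / period_ps


if __name__ == "__main__":
    for speed in (15.0, 17.0):  # FR4 range quoted above, in cm/ns
        print(f"{speed} cm/ns -> {skew_ps(1.0, speed):.1f} ps per cm")
    print(f"{fraction_of_period(70.0, 1e9):.0%} of a 1 GHz period")
```

Doubling the clock rate halves the period, so the same centimeter of mismatch costs twice the fraction.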


The Problem with Global Synchronous Clocking

In a globally synchronous system, one central clock is distributed to all receivers across the board.

[Figure: Global synchronous clocking. Clock skew accumulates across traces. Signals shown: CLK, D[0] (short trace), D[1] (long trace), D[2] (longer trace).]

The core problem: As trace lengths diverge, each data bit arrives at a slightly different time. The receiver must sample all bits with a single clock edge — but the valid window shrinks as skew grows.
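To see how the window shrinks, here is a deliberately simplified model: each bit is assumed valid for exactly one bit period starting at its arrival time, and setup time, hold time, and jitter are ignored. The function name and the example delays are assumptions made for illustration.

```python
from typing import Sequence


def common_valid_window_ps(arrival_ps: Sequence[float], bit_period_ps: float) -> float:
    """Width of the interval during which every bit is valid at once.

    Bit i is modeled as valid from arrival_ps[i] to arrival_ps[i] + bit_period_ps.
    The shared window opens when the slowest bit arrives and closes when
    the fastest bit changes again.
    """
    opens = max(arrival_ps)
    closes = min(arrival_ps) + bit_period_ps
    return max(0.0, closes - opens)


if __name__ == "__main__":
    arrivals = [0.0, 70.0, 140.0]  # hypothetical per-bit trace delays, in ps
    for period in (1000.0, 500.0, 250.0):
        width = common_valid_window_ps(arrivals, period)
        print(f"bit period {period:.0f} ps -> shared window {width:.0f} ps")
```

In this model the shared window is the bit period minus the spread between the fastest and slowest bit, so a fixed skew takes a larger share as the period shrinks.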


Source-Synchronous: Send the Clock with the Data

The source-synchronous approach eliminates the global clock problem by having the transmitter send its own sampling strobe alongside the data.

In DDR memory, this is the DQS (Data Strobe) signal — one strobe per byte lane, traveling with its 8 data bits (DQ).

[Figure: Source-synchronous signaling. The DQS strobe travels with the DQ data. Signals shown: DQS (strobe), DQ[0], DQ[1], DQ[2], DQ[3], with the sample point marked.]

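Key insight: The DQS strobe and DQ data are launched together and routed as a length-matched group, so they experience nearly the same board-level delay. The receiver uses DQS edges to sample DQ, so the absolute trace length largely drops out. Only the mismatch between DQ and DQS within the lane matters.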
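A small sketch makes the cancellation concrete. It assumes an idealized lane in which the only timing error is the difference between each DQ delay and the DQS delay. The delay values are invented for the example.

```python
from typing import List, Sequence


def residual_skew_ps(dq_delays_ps: Sequence[int], dqs_delay_ps: int) -> List[int]:
    """Skew of each DQ bit relative to the strobe that samples it."""
    return [dq - dqs_delay_ps for dq in dq_delays_ps]


if __name__ == "__main__":
    # Hypothetical short lane: about 500 ps of board delay.
    short_dq, short_dqs = [500, 505, 495, 502], 500
    # Same lane routed 1200 ps longer; DQ and DQS grow together.
    extra = 1200
    long_dq = [d + extra for d in short_dq]
    long_dqs = short_dqs + extra

    print(residual_skew_ps(short_dq, short_dqs))
    print(residual_skew_ps(long_dq, long_dqs))
```

Both calls produce the same residuals, because the common board delay appears in every DQ term and in the DQS term and subtracts away.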


Global Synchronous vs. Source-Synchronous

                 Global Synchronous          Source-Synchronous
Clock source     Central, shared             Transmitted with data
Skew challenge   All traces vs. clock tree   Only intra-pair (DQ↔DQS)
Trace matching   Every signal to clock       DQ matched to DQS per byte lane
Example          PCI, ISA                    DDR2/3/4/5, LPDDR
Speed limit      ~200–400 MHz                3200+ MT/s

DDR4 in Practice

DDR4 organizes its signals into byte lanes. Each lane has:

  • 8× DQ bits
  • 1× DQS (differential) — the strobe for that lane
  • 1× DM (data mask)

Byte Lane 0          Byte Lane 1
DQ[7:0] ←→ DQS0     DQ[15:8] ←→ DQS1

On an FPGA, the memory controller IP (for example, the Xilinx MIG, Memory Interface Generator) or the PHY automatically handles:

  • Write Leveling — aligns DQS to the memory clock on write
  • Read Training — calibrates DQ sampling window on read

These calibration routines compensate for any remaining board-level skew at startup.
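The general idea behind read training can be sketched as a delay sweep: step a programmable delay through its taps, record which settings read a known pattern back correctly, and park the sampling point in the middle of the widest passing run. This is a generic illustration, not the algorithm any particular MIG or PHY uses. The `read_ok` callback and the tap count are hypothetical stand-ins for hardware access.

```python
from typing import Callable, Optional


def center_of_passing_window(read_ok: Callable[[int], bool], num_taps: int) -> Optional[int]:
    """Return the center tap of the longest contiguous passing run, or None."""
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for tap in range(num_taps):
        if read_ok(tap):
            if run_len == 0:
                run_start = tap  # a new passing run begins here
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0  # a failing tap ends the current run
    if best_len == 0:
        return None  # no tap produced a correct read
    return best_start + (best_len - 1) // 2


if __name__ == "__main__":
    # Simulated lane: reads succeed only for taps 10 through 20.
    simulated = lambda tap: 10 <= tap <= 20
    print(center_of_passing_window(simulated, num_taps=32))
```

Centering the sample point leaves roughly equal margin on both sides of the data eye, which is why a sweep is used rather than stopping at the first passing tap.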


PCB Design Guidelines

  • Intra-pair matching (DQ↔DQS within same byte lane): within ±10 ps (≈ 1–2 mm on FR4)
  • Inter-pair matching (between byte lanes): within ±25 ps — relaxed because each lane has its own strobe
  • Use serpentine (meander) routing for length matching
  • Keep all DQ/DQS traces of a byte lane on the same PCB layer with consistent impedance (typically 40–50 Ω single-ended for DQ and 80–100 Ω differential for DQS)
  • Minimize stubs and via transitions
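The matching rule for a byte lane can be checked with a small script. This sketch assumes a single propagation speed of 16 cm/ns (the middle of the FR4 range quoted earlier) and takes routed lengths in millimeters. The net names and lengths are made up for the example.

```python
from typing import Dict


def lane_violations(
    dq_lengths_mm: Dict[str, float],
    dqs_length_mm: float,
    speed_cm_per_ns: float = 16.0,  # assumed mid-range FR4 value
    limit_ps: float = 10.0,         # DQ-to-DQS limit from the guidelines
) -> Dict[str, float]:
    """Return the DQ nets whose skew against DQS exceeds the limit, in ps."""
    violations = {}
    for net, length_mm in dq_lengths_mm.items():
        delta_cm = (length_mm - dqs_length_mm) / 10.0
        skew = delta_cm / speed_cm_per_ns * 1000.0
        if abs(skew) > limit_ps:
            violations[net] = skew
    return violations


if __name__ == "__main__":
    lane0 = {"DQ0": 40.0, "DQ1": 41.0, "DQ2": 43.0}  # hypothetical lengths
    for net, skew in lane_violations(lane0, dqs_length_mm=40.0).items():
        print(f"{net}: {skew:+.2f} ps exceeds the limit")
```

Here DQ1 is 1 mm long and passes, while DQ2 is 3 mm long and is flagged. A layout tool's own constraint manager would normally enforce this, so the script is only a sanity check on exported lengths.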

Summary

  • Global synchronous clocking struggles beyond a few hundred MHz due to skew accumulation across the board
  • Source-synchronous moves the reference clock to the data source — the strobe travels with the data, so board delay cancels out
  • DDR memory’s DQ+DQS pairing is the canonical example: each byte lane has its own strobe, limiting skew control to just that group
  • PCB design requires intra-pair trace matching to within about ±10 ps; write leveling and read training compensate for the remaining skew at startup