EECS 4340 · Spring 2026 · Columbia University

Deep Dive

A long-form walkthrough of the design: the seven advanced features layered on the base out-of-order machine, what each one buys, the performance numbers across 33 programs, and where the synthesis timing actually lands.

Headline numbers

Every number on this page is sourced from the same const arrays in lib/results.ts that drive the charts. The four cards below are the geomean over the full 33-program suite at the “all-on” operating point.

Programs passing

33 / 33

RTL + synthesized netlist

Geomean cycle reduction

-27.46%

OoO base → all-on

Geomean branch accuracy

74.03%

+8.87 pp over bimodal

Worst slack at 1000 ps

-797.58 ps

functionally bit-equivalent

Per-feature ablation

Marginal value at the operating point: disable one feature at a time, leave the other four ablate-able features on, measure the cycle-count regression. The two structural features (2-way superscalar, 2-way set-associative D-cache) cannot be rolled back with a +define and are discussed in their own subsections below.

Disabling all five ablate-able features together raises geomean cycle count by 37.86%. Next-line prefetch alone accounts for 37.14 pp of that; the other four contribute about 0.7 pp combined. Read that not as “the smaller features are broken” but as “the prefetcher already absorbs most of the cycles those features could have saved on their own.”

Architecture overview

P6-style out-of-order RV32IM core. 2-way superscalar in fetch, decode, dispatch, ALU, CDB, and commit; the LSQ, multiplier, and both caches stay 1-wide. The RAT lives inside the ROB — ROB entries double as physical registers, so there is no separate rename file.

Diagram: pipeline block overview. The 2-way, 256 B I-cache (backed by a next-line-prefetch stream buffer) and the branch predictor (gshare + BTB + RAS) feed 2-wide fetch and decode; dispatch + ROB allocate 2-wide with the RAT inside; the reservation station issues 2-wide to ALU0, ALU1, the MULT (drawn as 5 stages, with ETB), and the branch resolver, while the FIFO, head-only LSQ feeds the 2-way, 256 B D-cache; a 2-slot CDB (MULT > LD > ALU per slot) handles wakeup, and commit retires 2-wide into the architectural register file.

The fetch unit pulls 8-byte aligned lines from the I-cache. Decode cracks each line into two parallel decoders. Dispatch allocates two ROB slots per cycle and either two RS slots or one RS + one LSQ slot per cycle, in lockstep so the structures never disagree about what is in flight. The CDB has 2 slots with priority MULT > LD > ALU per slot.
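As a sketch of that per-slot priority, the pick for one CDB slot reduces to a short fixed-priority chain. The module and signal names below (cdb_slot_arb, mult_req, and so on) and the 5-bit tag width are invented for illustration, not taken from the project's RTL:

```systemverilog
// Illustrative only: fixed-priority pick for a single CDB slot.
module cdb_slot_arb (
    input  logic        mult_req, ld_req, alu_req,
    input  logic [4:0]  mult_tag, ld_tag, alu_tag,
    input  logic [31:0] mult_val, ld_val, alu_val,
    output logic        cdb_valid,
    output logic [4:0]  cdb_tag,
    output logic [31:0] cdb_val
);
    // MULT > LD > ALU: a finishing multiply always gets the slot,
    // a completing load beats an ALU result, ALU goes last.
    always_comb begin
        cdb_valid = mult_req | ld_req | alu_req;
        if (mult_req)    begin cdb_tag = mult_tag; cdb_val = mult_val; end
        else if (ld_req) begin cdb_tag = ld_tag;   cdb_val = ld_val;   end
        else             begin cdb_tag = alu_tag;  cdb_val = alu_val;  end
    end
endmodule
```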

The LSQ is a FIFO and only the head can issue to the cache, which keeps memory ordering simple at the cost of some throughput on adjacent loads. Branch resolution happens inside the two ALUs alongside alu_result; the diagram draws a dedicated “Branch resolver” box for pedagogical clarity.

2-way Superscalar

A one-instruction-wide pipeline retires at most one instruction per cycle. Real programs often have two unrelated instructions sitting ready in the RS, both waiting for an issue slot that will only ever serve one of them. Going from one issue per cycle to two roughly doubles the IPC ceiling.

The widening runs from fetch through commit. Two parallel decoders on the 8-byte fetch line, two ROB allocations per cycle, two RS picks (or one RS + one LSQ if one of the pair is a memory op), two ALUs at the end of issue, two CDB slots, and two commit slots in program order. A few units stayed single: there is still one MULT unit (multiply is rare on the suite, the queue of consumers sits ahead of it) and a single-port LSQ (a second LSQ port and a second cache port would have been a much larger surgery than the rest of the widening combined).
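As a sketch of the two RS picks, assuming an 8-entry RS where entry index stands in for age (the real selection policy and sizing may differ), the issue-slot selection is a scan that grants the first two ready entries:

```systemverilog
// Sketch of a 2-wide issue pick over an RS ready vector.
module rs_pick2 #(parameter N = 8) (
    input  logic [N-1:0]         ready,             // one bit per RS entry
    output logic [1:0]           grant_valid,        // slot 0 / slot 1 granted
    output logic [$clog2(N)-1:0] grant_idx [1:0]
);
    always_comb begin
        grant_valid  = '0;
        grant_idx[0] = '0;
        grant_idx[1] = '0;
        // Lowest-indexed ready entry takes issue slot 0; the next ready
        // entry (if any) takes issue slot 1.
        for (int i = 0; i < N; i++) begin
            if (ready[i]) begin
                if (!grant_valid[0]) begin
                    grant_valid[0] = 1'b1;
                    grant_idx[0]   = i[$clog2(N)-1:0];
                end else if (!grant_valid[1]) begin
                    grant_valid[1] = 1'b1;
                    grant_idx[1]   = i[$clog2(N)-1:0];
                end
            end
        end
    end
endmodule
```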

This is one of the two structural features. It is wired into port widths, RS issue logic, ROB allocate-and-commit, and the CDB count; no single +define rolls it back. Its contribution shows up indirectly: several ILP-rich programs reach all-on CPI well below the structural ceiling of a strictly one-wide machine (fib_rec 2.44, sort_search 3.30, insertionsort 3.88), so the widening is doing real work.

The single-port LSQ is the most visible IPC limit that remains. Tight memory-stream inner loops still see one of their two issue slots sit idle because two adjacent loads serialize at the cache.

Early Tag Broadcast

A consumer that depends on a multiply normally waits for the multiplier’s value to land on the CDB before it can issue. But the multiplier’s latency is fixed and known the moment the multiply enters the unit, so a dependent could in principle wake up earlier and be ready to issue on the cycle the value actually arrives.

Diagram: pipelined multiplier with an early-tag tap at the final stage (stage N-1). The early-tag wire drives the RS wakeup logic (a registered ready bit) while the value and tag still ride the CDB broadcast. Timing: the early tag fires as the multiply enters its last stage, the RS ready bit flips the next cycle, and the CDB broadcasts the value the cycle after that, when the dependent issues: one cycle saved versus CDB-only wakeup.

The mechanism is one extra wire. When a multiply enters the multiplier’s final stage (the stage-N-1 tap in the diagram), the unit drives the destination tag onto a dedicated early-tag sideband that runs alongside the CDB but is not part of it. RS entries watch that wire the same way they watch the CDB tag; a match flips the source’s registered ready bit one cycle early. The value itself still rides the CDB at the original cycle — ETB is purely an issue-eligibility wakeup, not a value bypass. The MULT pipeline in the actual RTL is 8 stages (MULT_STAGES = 8 in sys_defs.svh); the diagram draws 5 schematic stages for legibility.
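A minimal sketch of how one RS source operand can watch the two CDB slots and the early-tag sideband at once; names and the 5-bit tag width are assumptions, and clearing the ready bit when the entry issues is left out:

```systemverilog
// Sketch of one RS source-operand ready bit with an ETB sideband tap.
module rs_src_ready (
    input  logic       clk, rst,
    input  logic       alloc,           // entry (re)allocated this cycle
    input  logic       alloc_ready,     // operand already available at dispatch
    input  logic [4:0] src_tag,         // tag this operand waits on otherwise
    input  logic [1:0] cdb_valid,       // two CDB slots
    input  logic [4:0] cdb_tag [2],
    input  logic       early_valid,     // ETB sideband: tag only, no value
    input  logic [4:0] early_tag,
    output logic       src_ready
);
    // A hit on either CDB slot or on the early-tag wire wakes the operand.
    // The early-tag match lands one cycle before the matching CDB broadcast,
    // so the registered ready bit is already set when the value arrives.
    wire match = (cdb_valid[0] && cdb_tag[0] == src_tag) ||
                 (cdb_valid[1] && cdb_tag[1] == src_tag) ||
                 (early_valid  && early_tag  == src_tag);

    always_ff @(posedge clk) begin
        if (rst)        src_ready <= 1'b0;
        else if (alloc) src_ready <= alloc_ready;
        else if (match) src_ready <= 1'b1;
    end
endmodule
```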

Disabling ETB raises geomean cycle count by 0.10%; the worst case is outer_product at +1.07%. That program runs a tight back-to-back multiply loop and is the workload most directly aligned with what ETB optimizes. Elsewhere the multiplier is not on the critical path of the program, so an earlier wakeup does not change much. The single-CDB-broadcast rule from the base design also gates the win: a non-MULT consumer woken by the early tag still has to wait its turn on the bus when the MULT broadcast lands the same cycle.

gshare Predictor

The bimodal baseline indexed the 64-entry BHT with PC bits alone. Two unrelated branches whose PCs landed on the same six-bit slice shared a counter, so a loop-control branch and a data-dependent branch sitting next to each other in the binary trained the same counter and polluted each other’s prediction. Loop-control and data-dependent branches tend to sit near each other in compiled code, so this collision was common rather than rare.

Diagram: front-end predictor. The fetch PC indexes a 32-entry direct-mapped BTB (PC[6:2] index, PC[31:7] tag; entries hold {valid, tag, target, is_uncond}) and, through the gshare XOR fold PC[7:2] ⊕ GHR[5:0], a 64-entry BHT of 2-bit saturating counters (predict taken if counter[1] = 1). The 6-bit GHR is a shift register updated on committed conditional branches, and a 16-entry circular, speculative-only RAS overrides the BTB target on detected returns. Highlighted in pink: the gshare XOR fold and the RAS-override path that bypasses the BTB on detected returns.

The gshare design indexes the BHT by PC[7:2] XOR GHR, where the 6-bit Global History Register holds the taken/not-taken outcomes of recent committed conditional branches. The GHR width matches the BHT index width so every history bit affects the index. A single hot branch with a TNTNTNT… pattern now spreads across different counters depending on which history led into it, instead of fighting itself on one.
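A behavioral sketch with the widths from the text (64-entry BHT of 2-bit counters, 6-bit GHR); the interface is invented, and for brevity the update reuses the commit-time GHR rather than carrying the predict-time history alongside the branch, which a real implementation would more typically do:

```systemverilog
// Sketch of the gshare index fold and 2-bit counter update.
module gshare_bht (
    input  logic        clk, rst,
    input  logic [31:0] predict_pc,
    output logic        pred_taken,
    input  logic        commit_valid,    // a conditional branch commits
    input  logic [31:0] commit_pc,
    input  logic        commit_taken
);
    logic [1:0] bht [64];
    logic [5:0] ghr;

    wire [5:0] pred_idx   = predict_pc[7:2] ^ ghr;   // the gshare XOR fold
    wire [5:0] commit_idx = commit_pc[7:2]  ^ ghr;   // simplification: commit-time GHR

    assign pred_taken = bht[pred_idx][1];            // taken if counter[1] = 1

    always_ff @(posedge clk) begin
        if (rst) begin
            ghr <= '0;
            for (int i = 0; i < 64; i++) bht[i] <= 2'b01;   // weakly not-taken
        end else if (commit_valid) begin
            // Saturating update, then shift the outcome into the GHR.
            if (commit_taken  && bht[commit_idx] != 2'b11) bht[commit_idx] <= bht[commit_idx] + 2'b01;
            if (!commit_taken && bht[commit_idx] != 2'b00) bht[commit_idx] <= bht[commit_idx] - 2'b01;
            ghr <= {ghr[4:0], commit_taken};
        end
    end
endmodule
```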

Disabling gshare raises geomean cycle count by 0.19%; the worst case is fib_rec at +9.65% — the workload pattern gshare was built for, where the same call site sees a short, regular taken/not-taken history that the XOR fold can specialize on but the bimodal table cannot. Branch accuracy across the suite lifts by 8.87 pp arithmetic mean over the bimodal baseline (67.25%) on the same 30 programs that execute at least one conditional branch.

Return Address Stack

Function returns are nearly 100% predictable in principle, because the right target is always “go back to the call site that wrote the return address into the link register.” The BTB cannot exploit that — it caches one target per branch, so a recursive function whose return sees a different call site on every frame leaves the BTB constantly relearning the wrong target. Every switch between frames mispredicts.

The RAS is a 16-entry hardware stack alongside the gshare BHT (see the diagram above). When the front end sees a JAL that writes the link register, the next-PC is pushed. When the front end sees a return-shaped JALR (x0, ra, 0), the top of the stack supplies the predicted target and the entry is popped. On returns, the RAS overrides the BTB. When the stack is empty, the override does not fire and the predictor falls back to the BTB, so cold-start behavior matches the pre-RAS design.
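A sketch of a 16-entry circular RAS with those push/pop rules; the interface is invented and a same-cycle push plus pop is not handled:

```systemverilog
// Sketch of a circular return address stack.
module ras #(parameter DEPTH = 16) (
    input  logic        clk, rst,
    input  logic        push,          // fetch sees a link-writing JAL
    input  logic [31:0] push_pc,       // next-PC of the call
    input  logic        pop,           // fetch sees a return-shaped JALR
    output logic        pred_valid,    // stack non-empty: override the BTB
    output logic [31:0] pred_target
);
    logic [31:0]              stack [DEPTH];
    logic [$clog2(DEPTH)-1:0] sp;      // next free slot; wraps mod DEPTH
    logic [$clog2(DEPTH):0]   count;

    assign pred_valid  = (count != 0);
    assign pred_target = stack[sp - 1'b1];   // top of stack (sp-1)

    always_ff @(posedge clk) begin
        if (rst) begin
            sp    <= '0;
            count <= '0;
        end else if (push) begin
            stack[sp] <= push_pc;
            sp        <= sp + 1'b1;          // circular: old entries get overwritten
            if (count != DEPTH) count <= count + 1'b1;
        end else if (pop && count != 0) begin
            sp    <= sp - 1'b1;
            count <= count - 1'b1;
        end
    end
endmodule
```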

Disabling the RAS raises geomean cycle count by 0.10%; the worst case is basic_malloc at +0.52%. The benchmark suite does not have enough deep recursion in its hot paths to make the RAS dominant at the geomean level. Where it does help, it replaces a steady stream of BTB mispredicts with a stack lookup that gets the target right on every call frame.

Store-to-Load Forwarding

A load that asks for an address an older store has just written should not have to wait for the cache. In a write-back design, the store sits in the LSQ until commit, then waits for a cache port, then marks the line dirty. A dependent load behind it pays every one of those cycles for what is logically a register-to-register move.

Diagram: store-to-load forwarding in the 8-entry LSQ. The load at the head is checked against every older store, walked oldest → youngest so the youngest matching store wins, with an address comparator asking “same 8-byte line? full byte-mask cover?”; a mux selects the forwarded store data over the D-cache return, and a 1-cycle load buffer carries the completion value to the CDB. Blocking conditions (load waits for the cache): an older store with an unresolved address, an older store whose data has not arrived, or a partial overlap where the load needs bytes the store didn't write.

When a load reaches the head of the LSQ, the queue compares its address against every older un-committed store on the same 8-byte line. If the most recent matching store fully covers the load’s byte range and its data is already known, the load completes in one cycle out of the LSQ and the cache never sees the request. Anything that breaks the chain blocks the forward: an older store with an unresolved address, an older store whose data has not arrived yet, or a partial overlap where the load needs bytes the store did not write. Partial overlap is intentionally not merged; the regression set does not contain programs where it would fire often enough to matter, and the correctness corners are easier to reason about when partial overlap simply waits.
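A behavioral sketch of that forwarding decision, with entry index standing in for age (the real LSQ is a circular FIFO) and every name invented for illustration:

```systemverilog
// Sketch of the forward-vs-block decision for the load at the LSQ head.
module stlf_check #(parameter N = 8) (
    input  logic [28:0]  load_line,          // load address >> 3 (8-byte line)
    input  logic [7:0]   load_be,            // bytes the load needs
    input  logic [N-1:0] st_valid,           // older stores, index 0 = oldest
    input  logic [N-1:0] st_addr_known,
    input  logic [N-1:0] st_data_known,
    input  logic [28:0]  st_line [N],
    input  logic [7:0]   st_be   [N],
    input  logic [63:0]  st_data [N],
    output logic         forward_ok,         // load completes out of the LSQ
    output logic [63:0]  forward_data
);
    always_comb begin
        forward_ok   = 1'b0;
        forward_data = '0;
        // Walk oldest -> youngest so the youngest matching store wins.
        for (int i = 0; i < N; i++) begin
            if (st_valid[i]) begin
                if (!st_addr_known[i]) begin
                    forward_ok = 1'b0;                       // unresolved address: wait
                end else if (st_line[i] == load_line) begin
                    if (st_data_known[i] && (load_be & ~st_be[i]) == 8'b0) begin
                        forward_ok   = 1'b1;                 // full byte-mask cover
                        forward_data = st_data[i];
                    end else begin
                        forward_ok = 1'b0;                   // data pending or partial overlap
                    end
                end
            end
        end
    end
endmodule
```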

Disabling STLF raises geomean cycle count by 0.20%; the worst case is insertionsort at +1.64%. The canonical pattern is an inner loop that writes an array element and reads it back on the next iteration. A late timing-cleanup pass deferred the forward by one cycle (forwarded loads broadcast via the existing STLF latch), so each forwarded load now lands one cycle later than it would have. That cost only hits the small set of loads that actually forward.

Next-line Prefetch

Programs spend a lot of their time walking through memory in order: instruction fetch through straight-line code, array sweeps in matrix kernels, struct-field reads in image processing. Every cold line in that walk costs the full 100 ns memory latency, and the cache cannot start the fetch until the program asks for the line. A small prefetcher can fetch the next line in the background while the program is still using the current one.

The mechanism is a one-line stream buffer. On a miss fill for some line N, the buffer queues a request for line N+1 and issues it once the bus is free. If the program later misses on N+1 while the buffer is holding it, the line transfers into the cache without a fresh round trip to memory. The I-cache uses the shared stream_buffer.sv module; the D-cache has its own internal next-line prefetch logic in dcache.sv. Both are real, and bus arbitration in pipeline.sv makes sure speculative prefetch requests yield to demand fetches from either cache.
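A sketch of the idea as a small state machine; states and ports are invented, and the real stream_buffer.sv handles cases this omits, such as a new demand miss arriving while a prefetch is still outstanding:

```systemverilog
// Sketch of a one-line next-line stream buffer.
module stream_buffer_sketch (
    input  logic        clk, rst,
    input  logic        demand_fill,       // cache just filled line N from memory
    input  logic [28:0] fill_line,         // N
    input  logic        bus_free,          // arbitration: demand fetches go first
    input  logic        mem_done,          // prefetch response arrived
    input  logic [63:0] mem_data,
    input  logic        lookup_valid,      // cache misses again; check the buffer
    input  logic [28:0] lookup_line,
    output logic        prefetch_req,      // speculative request for line N+1
    output logic [28:0] prefetch_line,
    output logic        hit,               // buffer holds the missing line
    output logic [63:0] hit_data
);
    typedef enum logic [1:0] {IDLE, WAIT_BUS, WAIT_MEM, HOLD} state_t;
    state_t      state;
    logic [28:0] line;
    logic [63:0] data;

    assign prefetch_req  = (state == WAIT_BUS) && bus_free;
    assign prefetch_line = line;
    assign hit           = (state == HOLD) && lookup_valid && (lookup_line == line);
    assign hit_data      = data;

    always_ff @(posedge clk) begin
        if (rst) state <= IDLE;
        else begin
            case (state)
                IDLE:     if (demand_fill) begin line <= fill_line + 29'd1; state <= WAIT_BUS; end
                WAIT_BUS: if (bus_free)    state <= WAIT_MEM;   // request issues this cycle
                WAIT_MEM: if (mem_done)    begin data <= mem_data; state <= HOLD; end
                HOLD:     if (hit)         state <= IDLE;       // line transfers into the cache
                          else if (demand_fill) begin line <= fill_line + 29'd1; state <= WAIT_BUS; end
            endcase
        end
    end
endmodule
```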

Disabling prefetch raises geomean cycle count by 37.14%; the worst case is alexnet at +94.22%. For comparison, removing all five ablate-able features together raises geomean by 37.86%, so the prefetcher alone accounts for nearly the entire gap. The worst-regressing programs under no_prefetch (alexnet, btest2, sampler) are the ones whose cycle count is most sensitive to instruction-fetch latency, which is why we read this as evidence that the I-cache side is the dominant beneficiary.

2-way Set-Associative D-Cache

A direct-mapped 256-byte D-cache with 8-byte lines has 32 slots, and any two addresses whose index bits collide have to share one slot. If a program walks two arrays whose strides happen to fold onto the same slot, every reference evicts the other one even though the rest of the cache is sitting empty. Sort-style inner loops do this regularly: an array element, a loop counter, and a comparison key all touching memory in the same iteration.

Diagram: D-cache geometry, 16 sets × 2 ways of 8-byte lines (256 B total). The address splits into tag [15:7] (9 bits), index [6:3] (4 bits), and offset [2:0] (3 bits). Each set carries one LRU bit; each line carries one valid bit and one dirty bit; an 8-bit proc_wr_be byte enable selects which bytes a store overwrites. Bus arbitration in pipeline.sv masks mem2proc_response per cache with fixed priority D-cache demand > I-cache demand > stream buffer, in front of main memory at 100 ns latency.

The 2-way layout splits the same 256 bytes into 16 sets of 2 lines each. An address now picks a set, and either of the two ways inside that set can hold the line. Each set carries a single LRU bit that tracks which way was touched more recently — at 2-way, one bit of state is exact LRU rather than an approximation. On a miss into a set whose ways are both valid, the eviction picks the LRU way.

Worth being precise about cache state: each line carries one valid bit and one dirty bit, not byte-granular masks. Byte granularity exists only on the write path: the LSQ supplies an 8-bit proc_wr_be byte enable that selects which bytes of a line get rewritten on a store hit, and which bytes of a fill response are overridden on a miss-and-store. Sub-word stores work; the cache itself just doesn’t track per-byte status.
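A sketch of the hit path only, with fills and evictions omitted and lines assumed already valid, showing the 2-way lookup, the one-bit LRU update, and the byte-enable merge on a store hit; the struct layout and names are illustrative, not dcache.sv's:

```systemverilog
// Sketch of the 2-way set-associative hit path (16 sets, 8-byte lines).
module dcache_hit_path (
    input  logic        clk, rst,
    input  logic [15:0] addr,
    input  logic        is_store,
    input  logic [7:0]  wr_be,            // one bit per byte of the line
    input  logic [63:0] wr_data,
    output logic        hit,
    output logic [63:0] rd_data
);
    typedef struct packed {
        logic        valid;
        logic        dirty;
        logic [8:0]  tag;
        logic [63:0] data;
    } line_t;

    line_t lines [16][2];                  // 16 sets x 2 ways
    logic  lru   [16];                     // which way was touched least recently

    wire [3:0] set = addr[6:3];
    wire [8:0] tag = addr[15:7];
    wire hit0 = lines[set][0].valid && lines[set][0].tag == tag;
    wire hit1 = lines[set][1].valid && lines[set][1].tag == tag;
    wire way  = hit1;                      // which way hit (meaningful only when hit)

    assign hit     = hit0 | hit1;
    assign rd_data = lines[set][way].data;

    always_ff @(posedge clk) begin
        if (!rst && hit) begin
            lru[set] <= ~way;              // the other way is now least recently used
            if (is_store) begin
                lines[set][way].dirty <= 1'b1;
                // Byte-enable merge: only the enabled bytes are rewritten.
                for (int b = 0; b < 8; b++)
                    if (wr_be[b]) lines[set][way].data[b*8 +: 8] <= wr_data[b*8 +: 8];
            end
        end
    end
endmodule
```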

This is the second structural feature. The 2-way geometry is wired into the array dimensions of dcache.sv and into the address decode; there is no +define that turns it back into a direct-mapped cache without a substantial RTL rebuild. Its contribution is best read as the residual on programs whose inner loops walk two stride-aligned arrays (sort and search kernels, plus bfs and graph): those post the largest reductions from OoO base to all-on, and the prefetch ablation alone does not account for the full size of those cuts.

Per-program speedup

Δ% per program from OoO base to all-on, sorted by largest speedup first. Negative is faster. The two regressions (evens +0.34% and insertion +2.13%) are shown in orchid; both come down to gshare BHT aliasing on the programs’ specific branch pairs.

Geomean speedup over the suite is -27.46% (arithmetic mean -25.54%). The biggest speedups land on ALU-bound code with predictable branches (alexnet, btest2, sampler); the smallest land on programs whose inner loops are dominated by the single-port LSQ or by serial dependences the OoO machine cannot break.

Branch accuracy

Conditional-branch prediction accuracy at the all-on operating point, per program, sorted high to low. The dashed line is the geomean (74.03%) over the 30 programs that execute at least one conditional branch. Three programs (haha, halt, no_hazard) execute zero conditional branches and are excluded.

Arithmetic mean is 76.13%, a lift of 8.87 pp over the bimodal baseline (67.25%). The largest lifts land on programs whose branch behavior the bimodal table could not specialize on (omegalul, btest1, priority_queue, bfs, basic_malloc); programs whose branches are mostly counted loop tests already sit near 85–88% accuracy under bimodal and gain fractions of a percentage point or none at all.

CPI distribution

All-on cycles-per-instruction across the 33-program suite, binned. Geomean CPI is 14.87; arithmetic mean is 20.77, pulled up by a handful of toy programs that pay full memory latency on a few fetches (halt at CPI 106 is the extreme). The kernel-style workloads that are representative of real use sit between 3 and 30 CPI.

Synthesis & timing

All seven module-level testbenches (mult, rob, rs, lsq, dcache, icache, branch_predictor) meet timing at the 1000 ps target. mult and lsq are the tightest at +0.23 ps and +0.05 ps; the other five clear by hundreds of picoseconds.

The full-pipeline netlist (synth/pipeline.vg) does not meet timing at 1000 ps. Worst slack is -797.58 ps on the path lsq_0/head_reg[2] → rob_0/entries_reg[2][take_branch], with a companion endpoint at -797.55 ps on the same cone. Every other endpoint in the design meets at the same clock.

The cone runs LSQ broadcast → RS operand mux → ALU 32-bit adder → {ROB take_branch, LSQ addr}: a single 32-bit ripple-carry adder with high fanout sits in the middle of a long combinational chain. Closing it would mean either registering load_complete_value and load_complete_tag between the LSQ broadcast arbiter and the CDB (one extra cycle on every completing load) or splitting the ALU’s 32-bit adder into two pipeline stages (one extra cycle on every ALU op). Both sacrifice common-case performance to fix the static-timing residual, so this pass ships as-is.

Functional correctness and static-timing closure are different things. Every .syn.wb writeback trace produced by the gate-level netlist is byte-identical to the corresponding RTL .wb across all 33 programs on the pre-multiplier-operand-register baseline. The synthesized machine commits the same architectural register-write stream as the RTL on every program, modulo the standard one-cycle reset offset.

Verification methodology

Three layers, each catching a different class of bug. At the bottom, every nontrivial module has its own SystemVerilog testbench. A test passes only if the run prints @@@ Passed. The RTL unit suite covers mult, rob, rs, lsq, dcache, icache, and branch_predictor.

Above that, the full pipeline runs all 33 RV32IM programs end to end, checked for a clean halt on @@@ System halted on WFI instruction and a correct register-write trace. On top of both layers, the same tests run a second time against the gate-level netlist that Synopsys DC produces. A small wrapper per module (synth/<m>_svsim.sv) sits between the testbench and the netlist and uses the SystemVerilog stream operator {>>{ }} to repack DC’s flattened ports into the shape the testbench expects.
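For reference, the repack in isolation looks like the following; the array shape is invented and is not the project's actual port list:

```systemverilog
// Illustration of the {>>{ }} repack. DC flattens an unpacked-array port
// into one wide bus; the streaming operator restores the shape the
// testbench expects.
module svsim_repack_example (
    input  logic [31:0]  regs_unpacked [4],   // shape the testbench drives
    output logic [127:0] regs_flat            // shape the netlist exposes
);
    // Pack: stream left-to-right, so element 0 lands in the top bits.
    assign regs_flat = {>>{regs_unpacked}};
    // The opposite direction works the same way on nets coming back from
    // the netlist:  assign {>>{regs_unpacked}} = regs_flat;
endmodule
```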

Two byte-equivalence checks back up the regression. First, every .syn.wb trace is byte-identical to its .wb on the pre-multiplier-operand-register baseline. Second, every .wb on the post-merge build is byte-identical to the trace produced by rebuilding the same commit under +define+SERIALIZE_BRANCHES, which forces the front end to serialize on every conditional branch. If out-of-order issue plus speculation drifted from what a serialized front end would have done on the same program, this check would fail. It does not.

Five compile-time disable knobs (DISABLE_EARLY_TAG, DISABLE_GSHARE, DISABLE_RAS, DISABLE_STLF, DISABLE_PREFETCH) drive the ablation sweep. Each compiles cleanly and produces a working binary that halts every program. The two structural features cannot be rolled back with a +define.
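The guard pattern behind a knob is ordinary conditional compilation; a hypothetical shape, with signal names invented for illustration:

```systemverilog
// Hypothetical shape of one disable knob; the actual guards live in the RTL.
module early_tag_gate (
    input  logic mult_final_stage_valid,   // multiply reaching its last stage
    output logic early_tag_valid           // the ETB sideband's valid bit
);
`ifdef DISABLE_EARLY_TAG
    // Sideband never fires; dependents wait for the normal CDB wakeup.
    assign early_tag_valid = 1'b0;
`else
    assign early_tag_valid = mult_final_stage_valid;
`endif
endmodule
```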

Limitations & future work

Pipeline timing residual. The -797.58 ps slack on the LSQ → RS → ALU adder cone is the one limitation we would close first. Both fixes (registering the LSQ broadcast or splitting the ALU adder) cost a cycle on the common case to recover ~800 ps of static-timing slack, which we judged the wrong trade at this point in the project.

Single-port LSQ on a 2-way machine. Two adjacent loads still serialize at the cache, so the front-end widening is not always matched by back-end memory bandwidth. The natural next step on a wider machine is a dual-ported LSQ paired with a dual-ported or banked D-cache.

Single multiplier. Multiply-heavy code caps out at one issue per cycle through the multiplier. ETB recovers a cycle by waking dependent consumers ahead of the CDB broadcast, but a second multiplier would do better. We did not add one because doubling area for a single functional class was hard to justify against an ablation that suggested the marginal cycles available were small.

Rename style at wider issue. If we were widening past 2-way, we would revisit the embedded-RAT-in-ROB rename approach. A unified physical register pool in the R10K style would make more sense at four-wide, where the simplicity payoff of treating ROB entries as physical registers starts to fall off.

An ablation lesson worth naming. Once the prefetcher is in place, the simpler features stacked on top of an already-tuned base contribute less on geomean than they would in isolation, because the prefetcher has already taken the cycles they would have saved. The bottleneck on this suite at this clock is memory latency. That is not a failure of the smaller features; it is a description of the workload.

References & team

Team

Chenhao Yang · Xuepeng Han · Gavin Zou · Pingchuan Dong · Hins Lyu · Xueer Qian

References

  1. Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 6th ed., Morgan Kaufmann, 2017.
  2. McFarling, "Combining Branch Predictors", DEC WRL Technical Note TN-36, 1993.
  3. EECS 4340 starter codebase (VeriSimpleV in-order pipeline), Columbia University, Spring 2026.
  4. RISC-V Instruction Set Manual, Volume I: Unprivileged ISA, Document Version 20191213.
  5. Yeh & Patt, "Two-level adaptive training branch prediction", MICRO-24, 1991.

The full IEEE-format paper is available as a PDF: 4340-final-report.pdf. The complete source is on GitHub. Presentation slides and other supporting material live in the project Google Drive folder.