
0x55) and the color/state lane occupies bit positions 1, 3, 5 and 7 (mask 0xAA). No new instruction sets. No ISA extensions required. The machine was already capable; it only needed the right encoding framework to unlock the hidden dimension.

    Pos:  7     6     5     4     3     2     1     0
          C[3]  P[3]  C[2]  P[2]  C[1]  P[1]  C[0]  P[0]

    Physical lane (mask 0x55): bits 0,2,4,6
    Color lane    (mask 0xAA): bits 1,3,5,7

PDEP/PEXT: single-instruction nibble spread and compress. Maximum performance on modern server and desktop silicon.

VPSHUFB: processes 16 legacy bytes per iteration using vectorized physical/color lane separation in a single pass.

prefetcht0-prefetched, cache-line-aligned two-lane buffers for simultaneous L1 population of both sub-planes in a single prefetch operation.

0x55 / 0xAA bitmasks. Full lane isolation achieved in a single vector pass.

llvm/tb3d_decoder.ll defines all TB3-D operations, including the spec-aligned vectorized decoder tb3d_decode_v4 and the DMA lane-separation functions.

# Configure and build
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
cd build && ctest --output-on-failure
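
The lane layout above (physical nibble on mask 0x55, color nibble on mask 0xAA) can be illustrated with a small portable C sketch. It is not taken from the project sources: `pdep_soft` and `pext_soft` are hypothetical software models of what the BMI2 PDEP and PEXT instructions compute, and `tb3d_encode_byte` / `tb3d_decode_byte` are illustrative names, assuming one 4-bit value per lane packed into each byte.

```c
#include <assert.h>
#include <stdint.h>

#define TB3D_PHYS_MASK  0x55u  /* bits 0,2,4,6 */
#define TB3D_COLOR_MASK 0xAAu  /* bits 1,3,5,7 */

/* Software model of PDEP: scatter the low bits of src into the set bits of mask. */
static uint32_t pdep_soft(uint32_t src, uint32_t mask) {
    uint32_t out = 0;
    for (uint32_t bit = 1; mask != 0; bit <<= 1) {
        uint32_t lowest = mask & (~mask + 1u);  /* lowest set bit of mask */
        if (src & bit) out |= lowest;
        mask &= mask - 1u;                      /* clear that bit */
    }
    return out;
}

/* Software model of PEXT: gather the bits of src selected by mask into the low bits. */
static uint32_t pext_soft(uint32_t src, uint32_t mask) {
    uint32_t out = 0;
    for (uint32_t bit = 1; mask != 0; bit <<= 1) {
        uint32_t lowest = mask & (~mask + 1u);
        if (src & lowest) out |= bit;
        mask &= mask - 1u;
    }
    return out;
}

/* Interleave two nibbles into one byte (hypothetical helper name). */
static uint8_t tb3d_encode_byte(uint8_t phys, uint8_t color) {
    return (uint8_t)(pdep_soft(phys & 0xFu, TB3D_PHYS_MASK) |
                     pdep_soft(color & 0xFu, TB3D_COLOR_MASK));
}

/* Split one byte back into its two nibbles (hypothetical helper name). */
static void tb3d_decode_byte(uint8_t packed, uint8_t *phys, uint8_t *color) {
    *phys  = (uint8_t)pext_soft(packed, TB3D_PHYS_MASK);
    *color = (uint8_t)pext_soft(packed, TB3D_COLOR_MASK);
}

/* Exhaustive round trip over all 16 x 16 nibble pairs. */
static void tb3d_roundtrip_check(void) {
    for (unsigned p = 0; p < 16; ++p) {
        for (unsigned c = 0; c < 16; ++c) {
            uint8_t dp, dc;
            tb3d_decode_byte(tb3d_encode_byte((uint8_t)p, (uint8_t)c), &dp, &dc);
            assert(dp == p && dc == c);
        }
    }
}
```

For example, `tb3d_encode_byte(0x5, 0x3)` places P = 0101 on bits 0 and 4 and C = 0011 on bits 1 and 3, giving 0x1B. On hardware with BMI2, each software loop would presumably collapse to a single PDEP or PEXT instruction.
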
# Assemble to bitcode
llvm-as-18 llvm/tb3d_decoder.ll \
-o build/tb3d_decoder.bc
# Compile to native object
llc-18 -filetype=obj \
-mcpu=x86-64 -mattr=+avx2,+bmi2 \
build/tb3d_decoder.bc \
-o build/tb3d_decoder.o

BMI2=ON AVX2=ON AVX-512=off. 5 warmup iterations, 20 timed iterations. The numbers validate the theoretical 2× bandwidth claim and demonstrate that encode throughput scales linearly with buffer size from 256 B through 16 MB.

xcvp1802-lsvc4072-2mp-e-s). The VP1802 was not chosen arbitrarily: its silicon feature set maps directly onto the TB3-D dual-lane architecture at every level of the datapath.

ch0_lpddr4_trip1) with minimal latency, sustaining the throughput demanded by continuous dual-lane encode/decode.

    PCIe Gen5 ×16 (GTYP/PAM4, 32 GT/s)
                   │ handled inside CPM5 (CIPS)
                   ▼
    ┌────────────────────────────────┐
    │  CIPS CPM5 QDMA                │
    │  H2C/C2H AXI4-Stream           │
    │  AXI4-MM → NoC                 │
    └──────────────┬─────────────────┘
                   │ AXI4-Stream
                   │ (~100 MHz pl0_ref_clk)
                   ▼
    ┌────────────────────────────────┐
    │  tb3d_axi_engine               │
    │  Encode: 1-cycle latency       │
    │  Decode: 2-cycle latency       │
    └──────────────┬─────────────────┘
                   │ AXI4-MM (via NoC)
                   ▼
    ┌────────────────────────────────┐
    │  Versal NoC + LPDDR4 MC        │
    │  4 GB (ch0_lpddr4_trip1)       │
    └────────────────────────────────┘

tb3d_encode.sv / tb3d_decode.sv) is purely combinational wiring, implemented as 8 LUT1 (wire) cells with zero routing overhead. This is not a performance optimization; it is a direct hardware proof of the specification's "1:1 parallel resolution via interleave" claim. The interleave is not computed; it is wired.

tb3d_axi_engine.sv) adds only the pipeline registers required for AXI handshaking: 1-cycle encode latency, 2-cycle decode latency. Everything else is fabric.

1. vivado -mode batch -source scripts/create_project.tcl: generates the project with all RTL sources, VPK180 XDC constraints, and the Block Design (CIPS + CPM5 QDMA + NoC + LPDDR4) fully assembled.
2. Run launch_runs synth_1, then impl_1, then write_bitstream sequentially in the Vivado Tcl Console, or set the RUN_SYNTH/IMPL/BITSTREAM=1 flags in the project script for a fully automated run.
3. write_hw_platform -fixed -include_bit -force fpga/output/dyasmart_tb3d_vpk180.xsa: exports the fixed hardware platform with the bitstream embedded for Vitis ingestion.
4. vitis -s fpga/vitis/platform.tcl: creates the Vitis platform with both A72 (Linux/bare-metal) and R5F (real-time) domains configured against the exported XSA.
5. vitis -s fpga/vitis/app/create_app.tcl: compiles the bare-metal bring-up application for Cortex-A72. Program the VPK180 via JTAG and connect a serial terminal at 115200 baud to observe bring-up output.

docs/architecture.md. RTL constraints including pin assignments, clock definitions, and GT configuration are in constraints/vpk180_top.xdc.
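
To cross-check the interleave wiring described for tb3d_encode.sv / tb3d_decode.sv, a software reference model can mirror the wiring bit by bit and be compared against captured engine output. The sketch below is an assumption-laden illustration, not project code: it assumes the wiring is out[2i] = P[i] and out[2i+1] = C[i] (as implied by the 0x55 / 0xAA lane masks), and the names `tb3d_ref_encode`, `tb3d_ref_decode` and `tb3d_ref_first_mismatch` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference encode: wire P[i] to bit 2i and C[i] to bit 2i+1 (assumed wiring). */
static uint8_t tb3d_ref_encode(uint8_t phys, uint8_t color) {
    unsigned out = 0;
    for (int i = 0; i < 4; ++i) {
        out |= ((phys  >> i) & 1u) << (2 * i);
        out |= ((color >> i) & 1u) << (2 * i + 1);
    }
    return (uint8_t)out;
}

/* Reference decode: the inverse wiring, one bit at a time. */
static void tb3d_ref_decode(uint8_t packed, uint8_t *phys, uint8_t *color) {
    unsigned p = 0, c = 0;
    for (int i = 0; i < 4; ++i) {
        p |= ((packed >> (2 * i))     & 1u) << i;
        c |= ((packed >> (2 * i + 1)) & 1u) << i;
    }
    *phys  = (uint8_t)p;
    *color = (uint8_t)c;
}

/* Compare observed encoder output against the model.
 * Returns the index of the first mismatching byte, or -1 if all n bytes agree. */
static long tb3d_ref_first_mismatch(const uint8_t *phys, const uint8_t *color,
                                    const uint8_t *observed, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        if (tb3d_ref_encode(phys[i], color[i]) != observed[i]) {
            return (long)i;
        }
    }
    return -1;
}
```

A bring-up test could feed known nibble pairs through the engine, capture the resulting bytes, and call `tb3d_ref_first_mismatch` on the captured buffer; how the bytes are captured (DMA readback, serial dump) is left open here.
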