DyaSmart — TB3-D Logic Engine
We didn't teach the machine a new language. We taught it to speak its native language more efficiently — moving binary from flat 2-D normalized state into a full 3-D ternary dimension, all within the same physical bit parameters your hardware already operates on.
Core Concept
Ternary Binary Multi-Dimensional Logic
Traditional binary is a flat plane — every bit is either 0 or 1, and that's the end of the story. TB3-D breaks that ceiling without breaking the machine. Every physical bit is paired with a Color/State satellite bitRED = active, GREEN = passive — creating a third information dimension that lives inside the same 8-bit byte your processor already knows how to handle.
The engine interleaves 4 physical bits and 4 color bits into a single byte using a clean lane architecture. The physical lane sits on odd bit positions (mask 0x55) and the color/state lane occupies even bit positions (mask 0xAA). No new instruction sets. No ISA extensions required. The machine was already capable — it just needed the right encoding framework to unlock the hidden dimension.

This is not a new language for the machine. It is the machine's native binary, restructured to carry twice the semantic density in the same physical footprint.
TB3-D Byte Layout
Bit positions 7 → 0 interleaved:
Pos: 7    6    5    4    3    2    1    0
     C[3] P[3] C[2] P[2] C[1] P[1] C[0] P[0]

Physical lane  (mask 0x55): bits 0,2,4,6
Color lane     (mask 0xAA): bits 1,3,5,7
Key Properties
  • 50% physical footprint reduction
  • 2× effective bandwidth (parallel lanes)
  • 1:1 resolution — no ALU summation
  • 100% lossless binary reconstruction
50% Footprint Reduction
8-bit data stored in 4 physical bits — the color lane carries full state without expanding the byte boundary.
2× Effective Bandwidth
Physical and color lanes read in parallel, doubling throughput with zero additional clock cycles.
1:1 Parallel Resolution
Pure interleave — no ALU summation, no decompression pipeline, no latency penalty.
100% Lossless
Full round-trip fidelity guaranteed. 256×256 byte+color word round-trips all pass in stress testing.
Software Architecture
Three-Tier ISA Implementation
The TB3-D engine is implemented in C11 targeting x86-64, with automatic ISA detection at build time. CMake probes the host toolchain and enables the highest available feature tier — from single-cycle BMI2 nibble spread on Haswell/Zen hardware, through 256-bit AVX2 SIMD vectorization, down to a portable constant-time lookup table that runs correctly on every platform.
1
Tier 1 — BMI2
Haswell+ / Zen+
PDEP/PEXT — single-cycle nibble spread and compress. Maximum performance on modern server and desktop silicon.
2
Tier 2 — AVX2
256-bit SIMD
VPSHUFB — processes 16 legacy bytes per iteration using vectorized physical/color lane separation in a single pass.
3
Tier 3 — Portable
All Platforms
Constant-time 16-entry lookup table. Correct behavior guaranteed across any target, including embedded and cross-compiled environments.
Three Hardware Integration Subsystems
Dragonfly Cache Interface
prefetcht0-prefetched, cache-line-aligned two-lane buffers for simultaneous L1 population of both sub-planes in a single prefetch operation.
SIMD Bitmask DMA Layer
AVX2-vectorized physical/color lane separation using the 0x55 / 0xAA bitmasks. Full lane isolation achieved in a single vector pass.
State Buffer
O(1) per-bit color look-ahead without touching the physical data lane — models the dedicated L2 metadata lane in the hardware specification.
Build & LLVM IR
CMake automatically detects and enables BMI2, AVX2, and the LLVM toolchain. The authoritative LLVM IR module at llvm/tb3d_decoder.ll defines all TB3-D operations including the spec-aligned vectorized decoder tb3d_decode_v4 and DMA lane-separation functions.
# Configure and build
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
cd build && ctest --output-on-failure

# Assemble to bitcode
llvm-as-18 llvm/tb3d_decoder.ll \
 -o build/tb3d_decoder.bc

# Compile to native object
llc-18 -filetype=obj \
 -mcpu=x86-64 -mattr=+avx2,+bmi2 \
 build/tb3d_decoder.bc \
 -o build/tb3d_decoder.o
Benchmark Results
Measured Performance — BMI2 + AVX2 Enabled
All results captured on the software implementation (not bare-metal) with ISA flags BMI2=ON AVX2=ON AVX-512=off. 5 warmup iterations, 20 timed iterations. The numbers validate the theoretical 2× bandwidth claim and demonstrate that encode throughput scales linearly with buffer size from 256 B through 16 MB.
8.6K
MB/s Peak Encode
Buffer encode throughput at 4 KB–16 MB buffer sizes using BMI2 PDEP/PEXT
27.9K
MB/s DMA Fetch
Peak DMA bitmask fetch throughput at 256 B packed buffer — 0.034 ns/byte
59.6K
MB/s Cache Read
Dragonfly Cache Interface read at 4 KB — exploiting L1 prefetch alignment
61K
MB/s Transcode
transcode_from_cache peak at 4 KB — Secretary/Executive transcoder pipeline
Buffer Encode/Decode Throughput (MB/s)
Dragonfly Cache Interface (MB/s)

Correctness suite: 142,451 / 142,451 checks passed — including alignment boundary tests (1–128 bytes), 1 MB random buffer with 6 color mask variants, null-guard audits, all 256 packed byte patterns, color mutation idempotency (256×4), lane isolation (16×16×4), and full 256×256 word round-trips.
FPGA Implementation
VPK180 / Versal Premium VP1802
The TB3-D engine moves from software into silicon on the VPK180 Evaluation Kit — Xilinx Versal Premium VP1802 adaptive SoC (xcvp1802-lsvc4072-2mp-e-s). The VP1802 was not chosen arbitrarily: its silicon feature set maps directly onto the TB3-D dual-lane architecture at every level of the datapath.
Why VP1802
Silicon That Speaks TB3-D Natively
CPM5 PCIe Gen5 ×16 + Integrated QDMA
Eliminates the PCIe read/write bottleneck entirely. H2C/C2H AXI4-Stream channels feed the TB3-D encode/decode engine directly — no separate DMA IP license required, no intermediate bridging logic.
GTYP Transceivers + PAM4 at 32 GT/s
TB3-D's dual-lane structure (physical + color) maps naturally onto PAM4 symbol dimensions. The 64 GTYP transceivers provide ample high-speed I/O headroom for expansion far beyond the base PCIe link.
Versal NoC + LPDDR4 Memory Controller
The high-bandwidth Network-on-Chip fabric connects the TB3-D engine directly to 4 GB LPDDR4 (ch0_lpddr4_trip1) with minimal latency, sustaining the throughput demanded by continuous dual-lane encode/decode.
CIPS — Cortex-A72 ×4 + Cortex-R5F ×2
Full SoC integration: the A72 application processors handle host control and bring-up, the PL implements the TB3-D datapath, and the R5F real-time cores schedule DMA operations with deterministic latency.
RTL Architecture
Hardware Datapath — Zero-Latency Encode Core
RTL Signal Flow
PCIe Gen5 ×16 (GTYP/PAM4, 32 GT/s)
│  handled inside CPM5 (CIPS)
▼
┌────────────────────────────────┐
│ CIPS CPM5 QDMA                │
│ H2C/C2H AXI4-Stream           │
│ AXI4-MM → NoC                 │
└──────────────┬─────────────────┘
               │ AXI4-Stream
               │ (~100 MHz pl0_ref_clk)
               ▼
┌────────────────────────────────┐
│ tb3d_axi_engine                │
│ Encode: 1-cycle latency        │
│ Decode: 2-cycle latency        │
└──────────────┬─────────────────┘
               │ AXI4-MM (via NoC)
               ▼
┌────────────────────────────────┐
│ Versal NoC + LPDDR4 MC        │
│ 4 GB (ch0_lpddr4_trip1)       │
└────────────────────────────────┘
The Combinational Core
The TB3-D encode/decode core (tb3d_encode.sv / tb3d_decode.sv) is purely combinational wiring — implemented as 8 LUT1 (wire) cells with zero routing overhead. This is not a performance optimization; it is a direct hardware proof of the specification's "1:1 parallel resolution via interleave" claim. The interleave is not computed — it is wired.
The registered AXI4-Stream wrapper (tb3d_axi_engine.sv) adds only the pipeline registers required for AXI handshaking: 1-cycle encode latency, 2-cycle decode latency. Everything else is fabric.
RTL Source Structure
tb3d_pkg.sv
SystemVerilog package: types, constants, encode/decode functions shared across all modules
tb3d_axi_engine.sv
Registered AXI4-Stream engine — the integration boundary between QDMA and the combinational core
tb3d_top.sv
Top-level: external ports, TB3-D engine instantiation, Block Design stub connections
Build Flow
Five-Step FPGA Build Flow
Prerequisites: Vivado 2025.2 and Vitis 2025.2 with Versal Premium device support, VPK180 board files v1.2 from the Xilinx Board Store, and CPM5 QDMA (included in CIPS — no separate IP license). Once prerequisites are satisfied, the full flow from RTL to running application is five steps.
01
Create Vivado Project
vivado -mode batch -source scripts/create_project.tcl — generates the project with all RTL sources, VPK180 XDC constraints, and the Block Design (CIPS + CPM5 QDMA + NoC + LPDDR4) fully assembled.
02
Synthesis → Implementation → Bitstream
Run launch_runs synth_1, then impl_1, then write_bitstream sequentially in the Vivado Tcl Console, or set the RUN_SYNTH/IMPL/BITSTREAM=1 flags in the project script for a fully automated run.
03
Export Hardware Platform (XSA)
write_hw_platform -fixed -include_bit -force fpga/output/dyasmart_tb3d_vpk180.xsa — exports the fixed hardware platform with bitstream embedded for Vitis ingestion.
04
Create Vitis Platform
vitis -s fpga/vitis/platform.tcl — creates the Vitis platform with both A72 (Linux/bare-metal) and R5F (real-time) domains configured against the exported XSA.
05
Build, Deploy, and Verify
vitis -s fpga/vitis/app/create_app.tcl — compiles the bare-metal bring-up application for Cortex-A72. Program the VPK180 via JTAG and connect a serial terminal at 115200 baud to observe bring-up output.

Full architecture and design decision documentation is maintained in docs/architecture.md. RTL constraints including pin assignments, clock definitions, and GT configuration are in constraints/vpk180_top.xdc.