Advanced Code Optimisation
In high-performance domains like AI game engines, every nanosecond counts. We rely on "Zero-Cost Abstractions" to keep our C++ code running as close to the metal as possible. Here is a granular breakdown of our optimization techniques.
1. Bitboard Architecture (The Foundation)
The Problem
A standard implementation uses an array int board[20][20].
- Memory: 400 integers * 4 bytes = 1600 bytes.
- Speed: Checking a line requires iterating over board[x][y], board[x+1][y], etc. This involves multiple memory fetches (cache misses) and comparison instructions.
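As a point of reference, here is a minimal sketch of that naive check. The function name, cell encoding and loop bounds are illustrative, not the engine's actual code.
// Naive horizontal check over the 2D array (illustrative sketch).
bool hasHorizontalFive(const int board[20][20], int player) {
    for (int y = 0; y < 20; ++y) {
        for (int x = 0; x + 4 < 20; ++x) {
            bool five = true;
            for (int k = 0; k < 5; ++k)
                if (board[x + k][y] != player) { five = false; break; }
            if (five) return true;             // 5 consecutive stones found
        }
    }
    return false;                              // dozens of loads and branches per call
}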
The Solution: std::bitset
We map the 2D board to a 1D sequence of bits.
- Memory: 400 bits ≈ 50 bytes. This fits entirely into a single L1 CPU cache line.
- Access: Reading a bit is a single mask-and-shift on a word that is already in a register or in L1 cache.
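A minimal sketch of this layout, assuming one std::bitset per player and a row-major mapping (the names W, idx and stones are illustrative):
#include <bitset>

constexpr int W = 20;
std::bitset<W * W> stones;                     // bit i == 1  <=>  stone on that cell

// 2D coordinate -> 1D bit index (row-major).
inline int idx(int x, int y) { return y * W + x; }

// Placing and testing a stone are single bit operations:
//   stones.set(idx(3, 7));
//   bool occupied = stones.test(idx(3, 7));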
2. SIMD-like Bitwise Parallelism
This is our critical optimization for the MCTS simulations. We need to check win conditions (5 in a row) millions of times.
Instead of writing:
for (int x=0; x<20; x++) { ... check horizontal ... }
We use bitwise shifts. This effectively simulates SIMD (Single Instruction, Multiple Data) behavior using standard registers.
The Algorithm
To check if Player P has 5 stones in a row horizontally:
- Take the bitboard B (where 1 = stone present).
- Compute B1 = B & (B >> 1). This keeps a 1 only where there is an adjacent stone (2 aligned).
- Compute B2 = B1 & (B1 >> 1). This keeps a 1 only where two such pairs overlap (3 aligned).
- If B2 & (B2 >> 2) is not zero, two triples overlap with an offset of 2, i.e. 5 aligned.
Impact: This checks the entire board for horizontal wins with a handful of word-level shift/AND instructions, instead of the hundreds of loads and branches a cell-by-cell loop would need.
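Below is a minimal sketch of this check on a flat std::bitset<400>. One detail glossed over above: a plain shift wraps from the end of one row into the start of the next, so the sketch masks out the last column once. The NOT_LAST_COL mask and the function name are assumptions of this sketch, not names from the engine.
#include <bitset>

constexpr int W = 20;
using Board = std::bitset<W * W>;              // row-major: bit index = y * W + x

// Mask with the last column of every row cleared, so a >>1 never crosses rows.
static const Board NOT_LAST_COL = [] {
    Board m;
    m.set();
    for (int y = 0; y < W; ++y) m.reset(y * W + (W - 1));
    return m;
}();

bool hasFiveHorizontal(const Board& b) {
    Board b1 = b  & (b >> 1) & NOT_LAST_COL;   // 2 aligned
    Board b2 = b1 & (b1 >> 1);                 // 3 aligned
    return (b2 & (b2 >> 2)).any();             // two triples 2 apart => 5 aligned
}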
3. Memory Layout & Buffering
Why new and malloc are bad
Allocating memory (new Tensor, growing a std::vector) goes through the heap allocator and, when the allocator needs more pages, through the Operating System (system calls, locks). Inside a hot loop this is slow and, worse, unpredictable.
Static Buffers in Neural Network
In Network.cpp, we declare our working memory once at startup.
Tensor _buffer1(1, 64, 20, 20);
Tensor _buffer2(1, 64, 20, 20);
During the 1500+ MCTS simulations per move:
- We never allocate memory.
- We reuse these buffers locally.
- The convolution output writes directly into _buffer1, which then becomes the input for the next layer. This is "Zero-Allocation Inference".
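A minimal sketch of this ping-pong pattern; the Tensor struct and the convLayer stand-in below are simplified placeholders, not the real Network.cpp API.
#include <array>
#include <cstddef>

// Simplified stand-in for the engine's Tensor: fixed size, no heap allocation.
struct Tensor {
    std::array<float, 1 * 64 * 20 * 20> values{};
};

// Stand-in for a convolution layer: reads `in`, writes `out` (real kernel omitted).
void convLayer(const Tensor& in, Tensor& out) {
    for (std::size_t i = 0; i < out.values.size(); ++i)
        out.values[i] = in.values[i];          // placeholder computation
}

struct Network {
    Tensor _buffer1;                           // declared once, live for the whole run
    Tensor _buffer2;

    void forward(const Tensor& input) {
        convLayer(input,    _buffer1);         // layer 1 output lands in _buffer1
        convLayer(_buffer1, _buffer2);         // layer 2 reads it in place
        convLayer(_buffer2, _buffer1);         // buffers alternate; zero allocations
    }
};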
4. Branchless Logic
Modern CPUs (like an Apple M4 or an Intel i9) use "Branch Prediction": they guess which way an if will go and speculatively execute that path ahead of time. If they guess wrong, they flush the pipeline, a large performance penalty on the order of 15 cycles.
We write "Branchless Code" to avoid unpredictable ifs in hot paths.
Bad (Branching):
if (isMyTurn) value += 1;
else value -= 1;
Good (Branchless):
// assumes isMyTurn is exactly 0 or 1
value += (isMyTurn * 2) - 1;
The compiler translates this into pure arithmetic instructions (e.g. a shift or IMUL plus a SUB) which flow linearly through the CPU pipeline without stalls.
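A self-contained version of the same idea (the function name is illustrative); the assert documents the precondition the trick relies on.
#include <cassert>

int applyTurnBonus(int value, int isMyTurn) {
    assert(isMyTurn == 0 || isMyTurn == 1);    // the trick only works for 0/1
    // Branchy:     if (isMyTurn) value += 1; else value -= 1;
    // Branchless:  maps 0 -> -1 and 1 -> +1 with pure arithmetic, no jump.
    return value + (isMyTurn * 2 - 1);
}
Inspecting the generated assembly (for example on Compiler Explorer with -O2) is the easiest way to confirm that no conditional jump survives.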
5. Binary Serialization
Loading the neural network weights from a text file (JSON/CSV) requires parsing strings to floats (atof), which is slow.
We implemented a custom Memory Dump format.
- Python: dumps the GPU weight arrays as raw float32 bytes.
- C++: file.read((char*)ptr, count * 4).
This copies the raw bytes from disk straight into the float array with no parsing at all; load time is then bounded by SSD throughput rather than by CPU string conversion.
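A minimal sketch of the C++ side; the function name, the path argument and passing the element count in from outside are assumptions of this sketch, and it presumes the file was written with the same float32 layout and endianness as the machine reading it.
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

std::vector<float> loadWeights(const std::string& path, std::size_t count) {
    std::vector<float> weights(count);         // one allocation, at load time only
    std::ifstream file(path, std::ios::binary);
    // Raw byte copy from disk into the float array: no string parsing at all.
    file.read(reinterpret_cast<char*>(weights.data()),
              static_cast<std::streamsize>(count * sizeof(float)));
    return weights;
}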
6. Loop Unrolling & Pointer Arithmetic
In our Convolution kernel (conv2d):
- We avoid std::vector::at() or tensor[x][y] inside tight loops because they often include bounds checking.
- We use raw pointers:
float* outPtr = output.values.data();
const float* inPtr = input.values.data();
for (int i = 0; i < N; i++) *outPtr++ = *inPtr++ * weight;
- The compiler can auto-vectorize this easily (using AVX/NEON instructions) because it sees a contiguous block of memory with no side effects.
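As a self-contained version of that pattern (names are illustrative, not the engine's conv2d signature):
#include <cstddef>

// One contiguous pass over raw pointers: unit stride, no bounds checks,
// no side effects, so compilers typically vectorize it at -O2/-O3
// (AVX on x86, NEON on ARM).
void scaleInto(const float* in, float* out, std::size_t n, float weight) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * weight;
}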