ViST – Vision Transformer from Scratch

ViST is a complete Vision Transformer implemented entirely in C++ using only the standard library and stb_image. It classifies images into three categories using a from-scratch patching, attention, and classification pipeline — no frameworks, no external ML libs.

How to Run

Clone and build with CMake:

git clone https://github.com/allanhanan/ViST.git
cd ViST
mkdir build
cd build
cmake ..
make

To train the model:

./ViT

To test on a saved model and image:

./ViT /path/to/model_checkpoint.bin /path/to/image.png
(Screenshot: ViST demo output showing the classification result.)


ViST (Vision Transformer from Scratch) – C++ Implementation Overview

ViST is a fully hand-built Vision Transformer implemented in C++ using only the standard library (plus stb_image for image I/O). It reconstructs the entire ViT image-classification pipeline from first principles: loading and preprocessing images, splitting into patches, adding positional encodings, applying Transformer blocks (self-attention + feedforward layers), and classifying with a linear head.

All matrix and vector operations (dot products, normalization, activations, etc.) are coded manually with std::vector loops, and even basic backpropagation and an optimizer are implemented by hand. This pure-CPU, no-framework design exists so I could test my own skills and understand how a Transformer actually works.

Below is a breakdown of the architecture, file responsibilities, and implementation details.

Architecture and Data Flow

The entire pipeline is load → patch → encode → transform → classify.

Module Responsibilities and File Roles

image_loader.hpp

Handles image loading and resizing using stb_image. Returns a normalized 3D vector (H×W×3). Resizing uses nearest-neighbor interpolation.
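The nearest-neighbor resize mentioned above can be sketched roughly like this. This is a hedged reconstruction, not the actual source: the `Image` alias and function name are assumptions, with pixels stored as `image[y][x][c]` and values normalized to [0, 1].

```cpp
#include <vector>
#include <cstddef>

// Hypothetical stand-in for the loader's image representation: H x W x 3 floats.
using Image = std::vector<std::vector<std::vector<float>>>;

// Nearest-neighbor resize: each output pixel copies the closest source pixel.
Image resize_nearest(const Image& src, std::size_t out_h, std::size_t out_w) {
    std::size_t in_h = src.size(), in_w = src[0].size();
    Image dst(out_h, std::vector<std::vector<float>>(
                         out_w, std::vector<float>(3, 0.0f)));
    for (std::size_t y = 0; y < out_h; ++y) {
        for (std::size_t x = 0; x < out_w; ++x) {
            // Integer mapping back into the source grid.
            std::size_t sy = y * in_h / out_h;
            std::size_t sx = x * in_w / out_w;
            dst[y][x] = src[sy][sx];
        }
    }
    return dst;
}
```

Nearest-neighbor is the simplest choice: no new pixel values are invented, at the cost of blocky results when upscaling.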

image_utils.hpp

Has augmentation functions like flips, rotations, and brightness adjustments, but these aren’t used in training yet. Intended for future expansion.

patch_embedding.cpp

Extracts 16×16 patches from an image. Each patch is flattened into a 1D float vector (RGB interleaved). Stored in a vector<vector<float>>.
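The patch extraction step above can be sketched as follows. This is a minimal illustration under the stated assumptions (16×16 patches, RGB interleaved, one flat vector per patch); the function name is hypothetical.

```cpp
#include <vector>
#include <cstddef>

// Split an H x W x 3 image into non-overlapping patch x patch tiles,
// flattening each tile into one float vector with RGB interleaved.
std::vector<std::vector<float>> extract_patches(
    const std::vector<std::vector<std::vector<float>>>& img,
    std::size_t patch = 16) {
    std::size_t h = img.size(), w = img[0].size();
    std::vector<std::vector<float>> patches;
    for (std::size_t py = 0; py + patch <= h; py += patch) {
        for (std::size_t px = 0; px + patch <= w; px += patch) {
            std::vector<float> flat;
            flat.reserve(patch * patch * 3);
            for (std::size_t y = py; y < py + patch; ++y)
                for (std::size_t x = px; x < px + patch; ++x)
                    for (int c = 0; c < 3; ++c)
                        flat.push_back(img[y][x][c]);  // RGB interleaved
            patches.push_back(std::move(flat));
        }
    }
    return patches;  // stored as vector<vector<float>>
}
```

For a 224×224 input this yields 14×14 = 196 patches of 768 floats each.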

positional_encoding.cpp

Builds sinusoidal encodings per patch (the same method as the original Transformer) and adds them element-wise to each patch vector. The scaling factor is 0.1.
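A sketch of that sinusoidal scheme, assuming the standard sin/cos frequency ladder from the original Transformer; the 0.1 scale comes from the text above, while the function signature is an assumption.

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// Classic sinusoidal positional encoding: even dims use sin, odd dims use cos,
// with geometrically spaced frequencies. Scaled down by 0.1 before being
// added to the patch vectors.
std::vector<std::vector<float>> positional_encoding(std::size_t num_patches,
                                                    std::size_t dim,
                                                    float scale = 0.1f) {
    std::vector<std::vector<float>> pe(num_patches, std::vector<float>(dim));
    for (std::size_t pos = 0; pos < num_patches; ++pos) {
        for (std::size_t i = 0; i < dim; ++i) {
            double freq = std::pow(10000.0, -2.0 * (i / 2) / (double)dim);
            pe[pos][i] = scale * (i % 2 == 0 ? (float)std::sin(pos * freq)
                                             : (float)std::cos(pos * freq));
        }
    }
    return pe;
}
```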

utils.hpp

This file is where the math lives. Matrix ops, layer norm, GELU, random init, softmax, loss, etc.
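Two of the helpers named above, sketched with plain loops in the spirit of the file: the tanh-approximation GELU and per-vector layer normalization. The epsilon value and exact formulas are assumptions, not lifted from the source.

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// GELU activation (tanh approximation).
float gelu(float x) {
    return 0.5f * x *
           (1.0f + std::tanh(0.7978845608f * (x + 0.044715f * x * x * x)));
}

// Layer norm over one vector: subtract the mean, divide by the std deviation.
std::vector<float> layer_norm(const std::vector<float>& v, float eps = 1e-5f) {
    float mean = 0.0f;
    for (float x : v) mean += x;
    mean /= v.size();
    float var = 0.0f;
    for (float x : v) var += (x - mean) * (x - mean);
    var /= v.size();
    std::vector<float> out(v.size());
    for (std::size_t i = 0; i < v.size(); ++i)
        out[i] = (v[i] - mean) / std::sqrt(var + eps);
    return out;
}
```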

multi_head_attention.hpp

Defines the multi-head self-attention operation. It computes Q, K, V projections, splits into heads, performs scaled dot-product, and projects the result back. The catch? No softmax. Scores are scaled and clamped instead.

This mostly does what you’d expect from attention, minus the softmax. Logs Q/K/V sizes to help debug.
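The softmax-free variant described above can be sketched like this, for a single head. The clamp bounds here are illustrative assumptions; the point is that scores are scaled by 1/√d and clamped rather than normalized into a probability distribution.

```cpp
#include <vector>
#include <cmath>
#include <algorithm>
#include <cstddef>

// Scaled dot-product attention with clamping instead of softmax.
// Q, K, V are (num_tokens x d) matrices as vector<vector<float>>.
std::vector<std::vector<float>> attention_no_softmax(
    const std::vector<std::vector<float>>& Q,
    const std::vector<std::vector<float>>& K,
    const std::vector<std::vector<float>>& V) {
    std::size_t n = Q.size(), d = Q[0].size(), dv = V[0].size();
    float inv_sqrt_d = 1.0f / std::sqrt((float)d);
    std::vector<std::vector<float>> out(n, std::vector<float>(dv, 0.0f));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float score = 0.0f;
            for (std::size_t k = 0; k < d; ++k) score += Q[i][k] * K[j][k];
            // No softmax: scale, then clamp to a fixed range (bounds assumed).
            score = std::clamp(score * inv_sqrt_d, -1.0f, 1.0f);
            for (std::size_t k = 0; k < dv; ++k)
                out[i][k] += score * V[j][k];
        }
    }
    return out;
}
```

Without softmax the attention weights no longer sum to one, which changes the output scale; the clamp keeps scores bounded but is a much cruder normalization.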

feedforward.hpp

Implements the 2-layer MLP inside each transformer block.
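A minimal sketch of that two-layer MLP: Linear → GELU → Linear. Weight layouts and the hidden-size expansion factor are assumptions; in a real ViT the hidden dimension is typically 4× the embedding dimension.

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// Feedforward block for one token vector x:
// out = GELU(x * W1 + b1) * W2 + b2
std::vector<float> feedforward(const std::vector<float>& x,
                               const std::vector<std::vector<float>>& W1,
                               const std::vector<float>& b1,
                               const std::vector<std::vector<float>>& W2,
                               const std::vector<float>& b2) {
    std::vector<float> h(b1.size());
    for (std::size_t j = 0; j < h.size(); ++j) {
        float s = b1[j];
        for (std::size_t i = 0; i < x.size(); ++i) s += x[i] * W1[i][j];
        // GELU (tanh approximation) on the hidden layer.
        h[j] = 0.5f * s *
               (1.0f + std::tanh(0.7978845608f * (s + 0.044715f * s * s * s)));
    }
    std::vector<float> out(b2.size());
    for (std::size_t j = 0; j < out.size(); ++j) {
        float s = b2[j];
        for (std::size_t i = 0; i < h.size(); ++i) s += h[i] * W2[i][j];
        out[j] = s;
    }
    return out;
}
```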

transformer_block.cpp

Class wrapper for a single Transformer block. Instantiates attention weights, FFN weights, and layer norm params. The forward() function applies everything in order:

  1. QKV Attention + residual + layer norm
  2. FFN + residual + layer norm

It follows the standard Transformer sub-layer logic, with shape checks and logs.

vit_model.hpp

This is where the full Vision Transformer model is defined. It ties together image loading, patching, encoding, the transformer stack, and final classification.

Note: this head is not trained. The second classifier in VitModelWrapper is trained instead.

training_utils.hpp

Utilities for training logic — loss, dropout, gradient update, early stopping.
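For illustration, here is how the loss utility might look: a numerically stable softmax followed by cross-entropy. The function names and signatures are assumptions, not the file's actual API.

```cpp
#include <vector>
#include <cmath>
#include <algorithm>
#include <cstddef>

// Numerically stable softmax: subtract the max logit before exponentiating.
std::vector<float> softmax(const std::vector<float>& logits) {
    float mx = *std::max_element(logits.begin(), logits.end());
    std::vector<float> p(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        p[i] = std::exp(logits[i] - mx);
        sum += p[i];
    }
    for (float& x : p) x /= sum;
    return p;
}

// Cross-entropy loss for an integer class label; small epsilon guards log(0).
float cross_entropy(const std::vector<float>& logits, int label) {
    return -std::log(softmax(logits)[label] + 1e-9f);
}
```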

gradients.hpp

Low-level backpropagation logic. This file prepares for full gradient descent through layers, but only the classifier backprop is wired up currently.

optimizers.hpp

Manual Adam optimizer for weights and biases.
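A hand-rolled Adam update looks roughly like this. The hyperparameter values below are the usual defaults, not necessarily what ViST uses, and the struct layout is an assumption.

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// Adam for a flat weight vector: first/second moment estimates with
// bias correction, then a per-parameter scaled step.
struct Adam {
    float lr = 1e-3f, b1 = 0.9f, b2 = 0.999f, eps = 1e-8f;
    std::vector<float> m, v;  // moment buffers, lazily initialized
    int t = 0;                // time step for bias correction

    void step(std::vector<float>& w, const std::vector<float>& grad) {
        if (m.empty()) { m.assign(w.size(), 0.0f); v.assign(w.size(), 0.0f); }
        ++t;
        for (std::size_t i = 0; i < w.size(); ++i) {
            m[i] = b1 * m[i] + (1.0f - b1) * grad[i];
            v[i] = b2 * v[i] + (1.0f - b2) * grad[i] * grad[i];
            float mhat = m[i] / (1.0f - std::pow(b1, t));
            float vhat = v[i] / (1.0f - std::pow(b2, t));
            w[i] -= lr * mhat / (std::sqrt(vhat) + eps);
        }
    }
};
```

Note the moment buffers `m` and `v` are exactly the optimizer state that the checkpoint format (below) has to persist alongside the weights.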

checkpoint.hpp

Handles saving and loading model checkpoints in binary. This only covers the classifier layer and Adam optimizer state — transformer weights aren’t stored.

Simple binary format, used after every training epoch.
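A simple binary format of this kind usually boils down to length-prefixed blobs of floats. This sketch shows one plausible shape for it; the actual on-disk layout of ViST's checkpoints may differ (and note this format is not portable across endianness or size_t widths).

```cpp
#include <vector>
#include <fstream>
#include <cstddef>

// Write a float vector as: [count : size_t][count * float].
void save_vec(std::ofstream& out, const std::vector<float>& v) {
    std::size_t n = v.size();
    out.write(reinterpret_cast<const char*>(&n), sizeof(n));
    out.write(reinterpret_cast<const char*>(v.data()),
              (std::streamsize)(n * sizeof(float)));
}

// Read it back in the same order.
std::vector<float> load_vec(std::ifstream& in) {
    std::size_t n = 0;
    in.read(reinterpret_cast<char*>(&n), sizeof(n));
    std::vector<float> v(n);
    in.read(reinterpret_cast<char*>(v.data()),
            (std::streamsize)(n * sizeof(float)));
    return v;
}
```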

train.cpp

The main training logic. Runs through all images in a folder, applies the model, computes loss, and updates the classifier via Adam. Batches are processed in parallel.

Note: Only the final classifier is trained. The actual Vision Transformer blocks are frozen random weights.

test.cpp

Simple inference code. Loads a model checkpoint and an image, runs a forward pass, and prints logits + predicted class.

main.cpp

CLI entry point. With no args, it trains from ../train. With args, it loads the checkpoint and runs inference using 12 transformer blocks (a count that isn’t consistent with the rest of the code).

Training and Testing Flow

Training

Triggered by running ./ViT with no arguments. This launches the full training pipeline in plain C++.

Testing

Triggered by running ./ViT model_checkpoint.bin image.png

Implementation Details

Manual Linear Algebra

All matrix operations are explicit — no Eigen or BLAS. For example, matmul(A, B) is done with triple nested loops. It’s slow but makes the math very clear.
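The triple-loop matmul mentioned above, sketched as it would look with plain std::vector (a reconstruction for illustration, not the exact source):

```cpp
#include <vector>
#include <cstddef>

// C = A * B with explicit loops: A is (n x k), B is (k x m), C is (n x m).
// O(n * m * k) with no blocking or vectorization -- slow but transparent.
std::vector<std::vector<float>> matmul(
    const std::vector<std::vector<float>>& A,
    const std::vector<std::vector<float>>& B) {
    std::size_t n = A.size(), k = B.size(), m = B[0].size();
    std::vector<std::vector<float>> C(n, std::vector<float>(m, 0.0f));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < m; ++j)
            for (std::size_t p = 0; p < k; ++p)
                C[i][j] += A[i][p] * B[p][j];
    return C;
}
```

Nested std::vector rows are not contiguous in memory, which adds to the slowdown compared to a flat buffer; the upside is that the indexing mirrors the math one-to-one.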

Stability Checks

A few hacks help avoid numeric instability, such as the scaled-and-clamped attention scores described earlier.

Code Structure

Some headers include their `.cpp` counterparts inline (like vit_model.hpp including transformer_block.cpp). Not typical C++ structure, but works for a bundled setup.

Limitations

While ViST was built to push me to see what’s possible with just the C++ standard library, there are still a few practical and architectural limitations (i.e., skill issues) worth noting.

These aren’t dealbreakers; most are by design, for simplicity. But they’re worth keeping in mind if you plan to try this out and contribute (which would be great).

What’s Next

There’s a lot to add, improve, and rework, but a few items matter most for turning this into somewhat polished code.

This version of ViST was already a wild ride, but taking it one step further would turn it into a fully usable transformer or a minimal production-grade ViT prototype in C++.

Conclusion

ViST rebuilds a full Vision Transformer from scratch using just C++ and stb_image. It patches, encodes, transforms, and classifies images through custom layers — no frameworks, no ML libs, no GPU, and no braincells.

Nearly everything is hand-written (once I replace the image loader, this will be fully true). There’s a Transformer model with layer norm and attention, positional encodings, and even model checkpointing: all working, all clean(?)

The only thing that isn’t wired in is full end-to-end gradient descent — right now, just the final classifier gets trained. The rest stays frozen.

Still, this was a great way for me to learn how Transformers work, even if it’s full of holes and carries plenty of tech debt for future changes.
