ViST – Vision Transformer from Scratch

ViST is a complete Vision Transformer implemented entirely in C++ using only the standard library and stb_image. It classifies images into three categories using a from-scratch patching, attention, and classification pipeline — no frameworks, no external ML libs.

How to Run

Clone and build with CMake:

git clone https://github.com/allanhanan/ViST.git
cd ViST
mkdir build
cd build
cmake ..
make

To train the model:

./ViT

To test on a saved model and image:

./ViT /path/to/model_checkpoint.bin /path/to/image.png
(Screenshot: ViST demo output showing the classification result.)


ViST (Vision Transformer from Scratch) – C++ Implementation Overview

ViST is a fully hand-built Vision Transformer implemented in C++ using only the standard library (plus stb_image for image I/O). It reconstructs the entire ViT image-classification pipeline from first principles: loading and preprocessing images, splitting into patches, adding positional encodings, applying Transformer blocks (self-attention + feedforward layers), and classifying with a linear head.

All matrix and vector operations (dot products, normalization, activations, etc.) are coded manually with std::vector loops, and even basic backpropagation and an optimizer are implemented by hand. This pure-CPU, no-framework design exists so I could test my own skills and understand how a Transformer actually works.

Below is a breakdown of the architecture, file responsibilities, and implementation details.

Architecture and Data Flow

The entire pipeline is load → patch → encode → transform → classify.

Module Responsibilities and File Roles

image_loader.hpp

Handles image loading and resizing using stb_image. Returns a normalized 3D vector (H×W×3). Resizing uses nearest-neighbor interpolation.
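The nearest-neighbor resize mentioned above can be sketched roughly like this. This is a hedged reconstruction, not the actual source: the `Image` alias and function name are assumptions, with pixels stored as `image[y][x][c]` and values normalized to [0, 1].

```cpp
#include <vector>
#include <cstddef>

// Hypothetical stand-in for the loader's image representation: H x W x 3 floats.
using Image = std::vector<std::vector<std::vector<float>>>;

// Nearest-neighbor resize: each output pixel copies the closest source pixel.
Image resize_nearest(const Image& src, std::size_t out_h, std::size_t out_w) {
    std::size_t in_h = src.size(), in_w = src[0].size();
    Image dst(out_h, std::vector<std::vector<float>>(
                         out_w, std::vector<float>(3, 0.0f)));
    for (std::size_t y = 0; y < out_h; ++y) {
        for (std::size_t x = 0; x < out_w; ++x) {
            // Integer mapping back into the source grid.
            std::size_t sy = y * in_h / out_h;
            std::size_t sx = x * in_w / out_w;
            dst[y][x] = src[sy][sx];
        }
    }
    return dst;
}
```

Nearest-neighbor is the simplest choice: no new pixel values are invented, at the cost of blocky results when upscaling.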

image_utils.hpp

Has augmentation functions like flips, rotations, and brightness adjustments, but these aren’t used in training yet. Intended for future expansion.

patch_embedding.cpp

Extracts 16×16 patches from an image. Each patch is flattened into a 1D float vector (RGB interleaved). Stored in a vector<vector<float>>.
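The patch extraction step above can be sketched as follows. This is a minimal illustration under the stated assumptions (16×16 patches, RGB interleaved, one flat vector per patch); the function name is hypothetical.

```cpp
#include <vector>
#include <cstddef>

// Split an H x W x 3 image into non-overlapping patch x patch tiles,
// flattening each tile into one float vector with RGB interleaved.
std::vector<std::vector<float>> extract_patches(
    const std::vector<std::vector<std::vector<float>>>& img,
    std::size_t patch = 16) {
    std::size_t h = img.size(), w = img[0].size();
    std::vector<std::vector<float>> patches;
    for (std::size_t py = 0; py + patch <= h; py += patch) {
        for (std::size_t px = 0; px + patch <= w; px += patch) {
            std::vector<float> flat;
            flat.reserve(patch * patch * 3);
            for (std::size_t y = py; y < py + patch; ++y)
                for (std::size_t x = px; x < px + patch; ++x)
                    for (int c = 0; c < 3; ++c)
                        flat.push_back(img[y][x][c]);  // RGB interleaved
            patches.push_back(std::move(flat));
        }
    }
    return patches;  // stored as vector<vector<float>>
}
```

For a 224×224 input this yields 14×14 = 196 patches of 768 floats each.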

positional_encoding.cpp

Builds sinusoidal encodings per patch (the same method as the original Transformer) and adds them element-wise to each patch vector. The scaling factor is 0.1.
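A sketch of that sinusoidal scheme, assuming the standard sin/cos frequency ladder from the original Transformer; the 0.1 scale comes from the text above, while the function signature is an assumption.

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// Classic sinusoidal positional encoding: even dims use sin, odd dims use cos,
// with geometrically spaced frequencies. Scaled down by 0.1 before being
// added to the patch vectors.
std::vector<std::vector<float>> positional_encoding(std::size_t num_patches,
                                                    std::size_t dim,
                                                    float scale = 0.1f) {
    std::vector<std::vector<float>> pe(num_patches, std::vector<float>(dim));
    for (std::size_t pos = 0; pos < num_patches; ++pos) {
        for (std::size_t i = 0; i < dim; ++i) {
            double freq = std::pow(10000.0, -2.0 * (i / 2) / (double)dim);
            pe[pos][i] = scale * (i % 2 == 0 ? (float)std::sin(pos * freq)
                                             : (float)std::cos(pos * freq));
        }
    }
    return pe;
}
```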

utils.hpp

This file is where the math lives. Matrix ops, layer norm, GELU, random init, softmax, loss, etc.
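Two of the helpers named above, sketched with plain loops in the spirit of the file: the tanh-approximation GELU and per-vector layer normalization. The epsilon value and exact formulas are assumptions, not lifted from the source.

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// GELU activation (tanh approximation).
float gelu(float x) {
    return 0.5f * x *
           (1.0f + std::tanh(0.7978845608f * (x + 0.044715f * x * x * x)));
}

// Layer norm over one vector: subtract the mean, divide by the std deviation.
std::vector<float> layer_norm(const std::vector<float>& v, float eps = 1e-5f) {
    float mean = 0.0f;
    for (float x : v) mean += x;
    mean /= v.size();
    float var = 0.0f;
    for (float x : v) var += (x - mean) * (x - mean);
    var /= v.size();
    std::vector<float> out(v.size());
    for (std::size_t i = 0; i < v.size(); ++i)
        out[i] = (v[i] - mean) / std::sqrt(var + eps);
    return out;
}
```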

multi_head_attention.hpp

Defines the multi-head self-attention operation. It computes Q, K, V projections, splits into heads, performs scaled dot-product, and projects the result back. The catch? No softmax. Scores are scaled and clamped instead.

This mostly does what you’d expect from attention, minus the softmax. Logs Q/K/V sizes to help debug.
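The softmax-free variant described above can be sketched like this, for a single head. The clamp bounds here are illustrative assumptions; the point is that scores are scaled by 1/√d and clamped rather than normalized into a probability distribution.

```cpp
#include <vector>
#include <cmath>
#include <algorithm>
#include <cstddef>

// Scaled dot-product attention with clamping instead of softmax.
// Q, K, V are (num_tokens x d) matrices as vector<vector<float>>.
std::vector<std::vector<float>> attention_no_softmax(
    const std::vector<std::vector<float>>& Q,
    const std::vector<std::vector<float>>& K,
    const std::vector<std::vector<float>>& V) {
    std::size_t n = Q.size(), d = Q[0].size(), dv = V[0].size();
    float inv_sqrt_d = 1.0f / std::sqrt((float)d);
    std::vector<std::vector<float>> out(n, std::vector<float>(dv, 0.0f));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float score = 0.0f;
            for (std::size_t k = 0; k < d; ++k) score += Q[i][k] * K[j][k];
            // No softmax: scale, then clamp to a fixed range (bounds assumed).
            score = std::clamp(score * inv_sqrt_d, -1.0f, 1.0f);
            for (std::size_t k = 0; k < dv; ++k)
                out[i][k] += score * V[j][k];
        }
    }
    return out;
}
```

Without softmax the attention weights no longer sum to one, which changes the output scale; the clamp keeps scores bounded but is a much cruder normalization.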

feedforward.hpp

Implements the 2-layer MLP inside each transformer block.
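A minimal sketch of that two-layer MLP: Linear → GELU → Linear. Weight layouts and the hidden-size expansion factor are assumptions; in a real ViT the hidden dimension is typically 4× the embedding dimension.

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// Feedforward block for one token vector x:
// out = GELU(x * W1 + b1) * W2 + b2
std::vector<float> feedforward(const std::vector<float>& x,
                               const std::vector<std::vector<float>>& W1,
                               const std::vector<float>& b1,
                               const std::vector<std::vector<float>>& W2,
                               const std::vector<float>& b2) {
    std::vector<float> h(b1.size());
    for (std::size_t j = 0; j < h.size(); ++j) {
        float s = b1[j];
        for (std::size_t i = 0; i < x.size(); ++i) s += x[i] * W1[i][j];
        // GELU (tanh approximation) on the hidden layer.
        h[j] = 0.5f * s *
               (1.0f + std::tanh(0.7978845608f * (s + 0.044715f * s * s * s)));
    }
    std::vector<float> out(b2.size());
    for (std::size_t j = 0; j < out.size(); ++j) {
        float s = b2[j];
        for (std::size_t i = 0; i < h.size(); ++i) s += h[i] * W2[i][j];
        out[j] = s;
    }
    return out;
}
```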

transformer_block.cpp

Class wrapper for a single Transformer block. Instantiates attention weights, FFN weights, and layer norm params. The forward() function applies everything in order:

  1. QKV Attention + residual + layer norm
  2. FFN + residual + layer norm

It follows the standard Transformer sub-layer logic, with shape checks and logs.

vit_model.hpp

This is where the full Vision Transformer model is defined. It ties together image loading, patching, encoding, the transformer stack, and final classification.

Note: this head is not trained. The second classifier in VitModelWrapper is trained instead.

training_utils.hpp

Utilities for training logic — loss, dropout, gradient update, early stopping.
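For illustration, here is how the loss utility might look: a numerically stable softmax followed by cross-entropy. The function names and signatures are assumptions, not the file's actual API.

```cpp
#include <vector>
#include <cmath>
#include <algorithm>
#include <cstddef>

// Numerically stable softmax: subtract the max logit before exponentiating.
std::vector<float> softmax(const std::vector<float>& logits) {
    float mx = *std::max_element(logits.begin(), logits.end());
    std::vector<float> p(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        p[i] = std::exp(logits[i] - mx);
        sum += p[i];
    }
    for (float& x : p) x /= sum;
    return p;
}

// Cross-entropy loss for an integer class label; small epsilon guards log(0).
float cross_entropy(const std::vector<float>& logits, int label) {
    return -std::log(softmax(logits)[label] + 1e-9f);
}
```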

gradients.hpp

Low-level backpropagation logic. This file prepares for full gradient descent through layers, but only the classifier backprop is wired up currently.

optimizers.hpp

Manual Adam optimizer for weights and biases.
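A hand-rolled Adam update looks roughly like this. The hyperparameter values below are the usual defaults, not necessarily what ViST uses, and the struct layout is an assumption.

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// Adam for a flat weight vector: first/second moment estimates with
// bias correction, then a per-parameter scaled step.
struct Adam {
    float lr = 1e-3f, b1 = 0.9f, b2 = 0.999f, eps = 1e-8f;
    std::vector<float> m, v;  // moment buffers, lazily initialized
    int t = 0;                // time step for bias correction

    void step(std::vector<float>& w, const std::vector<float>& grad) {
        if (m.empty()) { m.assign(w.size(), 0.0f); v.assign(w.size(), 0.0f); }
        ++t;
        for (std::size_t i = 0; i < w.size(); ++i) {
            m[i] = b1 * m[i] + (1.0f - b1) * grad[i];
            v[i] = b2 * v[i] + (1.0f - b2) * grad[i] * grad[i];
            float mhat = m[i] / (1.0f - std::pow(b1, t));
            float vhat = v[i] / (1.0f - std::pow(b2, t));
            w[i] -= lr * mhat / (std::sqrt(vhat) + eps);
        }
    }
};
```

Note the moment buffers `m` and `v` are exactly the optimizer state that the checkpoint format (below) has to persist alongside the weights.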

checkpoint.hpp

Handles saving and loading model checkpoints in binary. This only covers the classifier layer and Adam optimizer state — transformer weights aren’t stored.

Simple binary format, used after every training epoch.
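A simple binary format of this kind usually boils down to length-prefixed blobs of floats. This sketch shows one plausible shape for it; the actual on-disk layout of ViST's checkpoints may differ (and note this format is not portable across endianness or size_t widths).

```cpp
#include <vector>
#include <fstream>
#include <cstddef>

// Write a float vector as: [count : size_t][count * float].
void save_vec(std::ofstream& out, const std::vector<float>& v) {
    std::size_t n = v.size();
    out.write(reinterpret_cast<const char*>(&n), sizeof(n));
    out.write(reinterpret_cast<const char*>(v.data()),
              (std::streamsize)(n * sizeof(float)));
}

// Read it back in the same order.
std::vector<float> load_vec(std::ifstream& in) {
    std::size_t n = 0;
    in.read(reinterpret_cast<char*>(&n), sizeof(n));
    std::vector<float> v(n);
    in.read(reinterpret_cast<char*>(v.data()),
            (std::streamsize)(n * sizeof(float)));
    return v;
}
```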

train.cpp

The main training logic. Runs through all images in a folder, applies the model, computes loss, and updates the classifier via Adam. Batches are processed in parallel.

Note: Only the final classifier is trained. The actual Vision Transformer blocks are frozen random weights.

test.cpp

Simple inference code. Loads a model checkpoint and an image, runs a forward pass, and prints logits + predicted class.

main.cpp

CLI entry point. With no args, it trains from ../train. With args, it loads the checkpoint and runs inference using 12 transformer blocks (a count that isn’t consistent with the rest of the code).

Training and Testing Flow

Training

Triggered by running ./ViT with no arguments. This launches the full training pipeline in plain C++.

Testing

Triggered by running ./ViT model_checkpoint.bin image.png

Implementation Details

Manual Linear Algebra

All matrix operations are explicit — no Eigen or BLAS. For example, matmul(A, B) is done with triple nested loops. It’s slow but makes the math very clear.
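The triple-loop matmul mentioned above, sketched as it would look with plain std::vector (a reconstruction for illustration, not the exact source):

```cpp
#include <vector>
#include <cstddef>

// C = A * B with explicit loops: A is (n x k), B is (k x m), C is (n x m).
// O(n * m * k) with no blocking or vectorization -- slow but transparent.
std::vector<std::vector<float>> matmul(
    const std::vector<std::vector<float>>& A,
    const std::vector<std::vector<float>>& B) {
    std::size_t n = A.size(), k = B.size(), m = B[0].size();
    std::vector<std::vector<float>> C(n, std::vector<float>(m, 0.0f));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < m; ++j)
            for (std::size_t p = 0; p < k; ++p)
                C[i][j] += A[i][p] * B[p][j];
    return C;
}
```

Nested std::vector rows are not contiguous in memory, which adds to the slowdown compared to a flat buffer; the upside is that the indexing mirrors the math one-to-one.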

Stability Checks

A few hacks help avoid numeric instability, such as the scaled-and-clamped attention scores described earlier.

Code Structure

Some headers include their `.cpp` counterparts inline (like vit_model.hpp including transformer_block.cpp). Not typical C++ structure, but works for a bundled setup.

Limitations

While ViST was built to push me to see what’s possible with just the C++ standard library, there are still a few practical and architectural limitations (i.e., skill issues) worth noting.

These aren’t dealbreakers; most are by design, for simplicity. But they’re worth keeping in mind if you plan to try this out and contribute (which would be great).

What’s Next

There’s a lot to add, improve, and rework, but a few items matter most for turning this into somewhat polished code.

This version of ViST was already a wild ride, but taking it one step further would turn it into a fully usable transformer or a minimal production-grade ViT prototype in C++.

Conclusion

ViST rebuilds a full Vision Transformer from scratch using just C++ and stb_image. It patches, encodes, transforms, and classifies images through custom layers — no frameworks, no ML libs, no GPU, and no braincells.

Nearly everything is hand-written (once I replace the image loader, this will be fully true). There’s a Transformer model with layer norm and attention, positional encodings, and even model checkpointing: all working, all clean(?)

The only thing that isn’t wired in is full end-to-end gradient descent — right now, just the final classifier gets trained. The rest stays frozen.

Still, this was a great way for me to learn how Transformers work, even if it’s full of holes and carries plenty of tech debt for future changes.
