Computer Vision, 2024

Vision Transformer (ViT) Reproduction

From-scratch PyTorch reproduction of “An Image is Worth 16×16 Words” with modular OOP design and custom training pipelines.

Source Code

Overview

This project is a faithful reproduction of the Vision Transformer paper (“An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale”), implemented from scratch in PyTorch.

The implementation follows a modular, object-oriented design: patch embedding, multi-head self-attention, and transformer encoder blocks are separate, composable components. Data is loaded through custom DataLoader pipelines, and training uses the Adam optimizer with learning-rate scheduling.
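As a rough illustration of how such components can compose, here is a minimal sketch. It is not the project's actual code. The class names (PatchEmbedding, EncoderBlock, ViT), the small default sizes (32×32 images, 4×4 patches, width 64), and the cosine schedule are all assumptions chosen to keep the example short. PyTorch's built-in nn.MultiheadAttention stands in for a hand-written attention module.

```python
import torch
from torch import nn


class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each to `dim`."""

    def __init__(self, img_size=32, patch_size=4, in_chans=3, dim=64):
        super().__init__()
        assert img_size % patch_size == 0, "image size must be divisible by patch size"
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch
        # and applying one shared linear projection.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)

    def forward(self, x):
        x = self.proj(x)                      # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)      # (B, N, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)        # (B, N + 1, dim)
        return x + self.pos_embed             # learned position embeddings


class EncoderBlock(nn.Module):
    """Pre-norm encoder block: self-attention and MLP, each with a residual."""

    def __init__(self, dim=64, heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Assumption: built-in attention used here instead of a custom module.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        return x + self.mlp(self.norm2(x))


class ViT(nn.Module):
    """Patch embedding, a stack of encoder blocks, and a classification head."""

    def __init__(self, img_size=32, patch_size=4, in_chans=3,
                 num_classes=10, dim=64, depth=2, heads=4):
        super().__init__()
        self.embed = PatchEmbedding(img_size, patch_size, in_chans, dim)
        self.blocks = nn.Sequential(*[EncoderBlock(dim, heads) for _ in range(depth)])
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        x = self.norm(self.blocks(self.embed(x)))
        return self.head(x[:, 0])             # classify from the class token


# One training step on random data (hypothetical hyperparameters).
model = ViT()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
# Assumption: cosine annealing; the project's actual schedule is not stated.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

images = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, 10, (8,))
logits = model(images)
loss = nn.functional.cross_entropy(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()
```

Because each stage is its own nn.Module, the patch size, depth, or attention implementation can be swapped without touching the others. The paper's ViT-Base configuration (224×224 input, 16×16 patches, width 768, depth 12, 12 heads) corresponds to different constructor arguments on the same structure.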

Technologies

Python, PyTorch
