Computer Vision · 2024
Vision Transformer (ViT) Reproduction
From-scratch PyTorch reproduction of “An Image is Worth 16×16 Words” with modular OOP design and custom training pipelines.
Overview
A faithful reproduction of the Vision Transformer paper (“An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale”) implemented from scratch in PyTorch.
The implementation uses a modular, object-oriented design in which patch embedding, multi-head self-attention, and transformer encoder blocks are composable components. Data is fed through custom DataLoader pipelines, and training uses the Adam optimizer with learning-rate scheduling.
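To make the composable design concrete, here is a minimal sketch of how such components might fit together. It is not the project's actual code: the class names (`PatchEmbedding`, `EncoderBlock`, `ViT`), the small default sizes, and the cosine schedule are illustrative assumptions, and PyTorch's built-in `nn.MultiheadAttention` stands in for a hand-written attention module to keep the example short.

```python
import torch
from torch import nn


class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each to embed_dim."""

    def __init__(self, img_size=32, patch_size=8, in_chans=3, embed_dim=64):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):
        x = self.proj(x).flatten(2).transpose(1, 2)      # (batch, patches, embed_dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)  # one class token per image
        return torch.cat([cls, x], dim=1) + self.pos_embed


class EncoderBlock(nn.Module):
    """Pre-norm encoder block: self-attention, then MLP, each with a residual."""

    def __init__(self, embed_dim=64, num_heads=4, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        # Assumption: the built-in module replaces a from-scratch attention layer.
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        hidden = int(embed_dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.GELU(), nn.Linear(hidden, embed_dim)
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class ViT(nn.Module):
    """Patch embedding, a stack of encoder blocks, and a classification head."""

    def __init__(self, num_classes=10, embed_dim=64, depth=2, num_heads=4):
        super().__init__()
        self.embed = PatchEmbedding(embed_dim=embed_dim)
        self.blocks = nn.Sequential(*[EncoderBlock(embed_dim, num_heads) for _ in range(depth)])
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.norm(self.blocks(self.embed(x)))
        return self.head(x[:, 0])  # classify from the class token


model = ViT()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Assumption: cosine annealing is one possible schedule; the page does not name one.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

# One training step on random data, purely to show the wiring.
images = torch.randn(2, 3, 32, 32)
labels = torch.randint(0, 10, (2,))
logits = model(images)
loss = nn.functional.cross_entropy(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()
```

Because each stage is its own `nn.Module`, the patch size, depth, or attention implementation can be swapped without touching the rest of the model.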
Technologies
Python, PyTorch
Tags
Deep Learning, Transformers, Paper Reproduction