Skip to content
All projects
Machine Learning2026

TinyGPT Language Model

A GPT-style language model trained on diverse text corpora (Amazon, IMDB, Reddit), covering architecture optimization and tokenization.

Overview

An academic big-data/NLP project: training a GPT-style language model on heterogeneous text datasets including Amazon reviews, IMDB, and Reddit.

The work covers the full stack of small-scale language modeling — tokenization strategy, architecture sizing and optimization, and training across mixed-domain corpora.

Technologies

PythonPyTorchNLP

Tags

NLPLanguage ModelsBig Data