Machine Learning, 2026
TinyGPT Language Model
A GPT-style language model trained on diverse text corpora (Amazon, IMDB, Reddit), covering architecture optimization and tokenization.
Overview
An academic big-data/NLP project: training a GPT-style language model on heterogeneous text datasets including Amazon reviews, IMDB, and Reddit.
The work covers the full stack of small-scale language modeling: tokenization strategy, architecture sizing and optimization, and training across mixed-domain corpora.
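To make the architecture-sizing step concrete, here is a minimal sketch of how the parameter budget of a GPT-style decoder can be estimated before training. It assumes a GPT-2-style layout (learned positional embeddings, a 4x MLP expansion, biases on all linear layers, and an output head tied to the token embedding). The `GPTConfig` name and its default values are hypothetical illustrations, not the project's actual settings.

```python
from dataclasses import dataclass


@dataclass
class GPTConfig:
    # Illustrative defaults only; the project's real dimensions are not stated.
    vocab_size: int = 8000
    context_len: int = 256
    d_model: int = 256
    n_layers: int = 6


def count_params(cfg: GPTConfig) -> int:
    """Estimate trainable parameters for a GPT-2-style decoder-only model.

    Assumes learned positional embeddings, a 4x MLP expansion, biases on
    every linear layer, and an output head tied to the token embedding.
    """
    d = cfg.d_model

    # Token embedding plus learned positional embedding.
    embeddings = cfg.vocab_size * d + cfg.context_len * d

    # Fused QKV projection (d -> 3d) and attention output projection (d -> d).
    attention = (3 * d * d + 3 * d) + (d * d + d)

    # Feed-forward network: d -> 4d -> d, each layer with a bias.
    mlp = (4 * d * d + 4 * d) + (4 * d * d + d)

    # Two LayerNorms per block, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d)

    per_block = attention + mlp + layer_norms
    final_norm = 2 * d

    # The tied output head reuses the token embedding, so it adds nothing.
    return embeddings + cfg.n_layers * per_block + final_norm


if __name__ == "__main__":
    cfg = GPTConfig()
    total = count_params(cfg)
    print(f"{total:,} parameters ({total / 1e6:.2f}M)")
```

At this scale the embedding tables take a large share of the budget (roughly 30% with the illustrative defaults above), which is one reason tokenizer vocabulary size and model width are usually tuned together in small models.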
Technologies
Python, PyTorch, NLP
Tags
NLP, Language Models, Big Data