NanoVLM

NanoVLM is a Hugging Face project designed to train and optimize small Vision-Language Models (VLMs) with a focus on efficiency and simplicity.

It features a modular multimodal architecture: a vision encoder, a lightweight language decoder, and a modality projection layer that maps vision features into the language model's embedding space for vision-language fusion, as sketched below.
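
As a rough illustration of this layout, here is a minimal PyTorch sketch. The class names, argument names, and tensor shapes are illustrative assumptions, not NanoVLM's actual implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the three-part layout (vision encoder -> modality
# projector -> language decoder); names and dimensions are assumptions,
# not nanoVLM's real classes.
class TinyVLM(nn.Module):
    def __init__(self, vision_encoder: nn.Module, language_decoder: nn.Module,
                 vision_dim: int, text_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder                         # e.g. a small ViT backbone
        self.modality_projector = nn.Linear(vision_dim, text_dim)    # vision -> text embedding space
        self.language_decoder = language_decoder                     # e.g. a small causal LM

    def forward(self, pixel_values: torch.Tensor, input_embeds: torch.Tensor) -> torch.Tensor:
        # Encode the image into patch features, project them into the decoder's
        # embedding space, then prepend them to the text token embeddings.
        image_feats = self.vision_encoder(pixel_values)        # (B, N_patches, vision_dim)
        image_tokens = self.modality_projector(image_feats)    # (B, N_patches, text_dim)
        fused = torch.cat([image_tokens, input_embeds], dim=1) # (B, N_patches + T, text_dim)
        return self.language_decoder(fused)
```

Keeping the projector as a separate module is what makes the design modular: either backbone can be swapped without touching the other.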

Planned additions include multi-GPU training, multi-image support, image splitting, and integration with VLMEvalKit for broader evaluation.

The repository includes scripts such as train.py for training and generate.py for testing generation, with Weights & Biases (WandB) logging for tracking experiments; a minimal logging sketch follows.
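
A minimal sketch of this kind of experiment tracking is shown below. The project name, config keys, and metric names are illustrative assumptions, not the exact values train.py uses; offline mode keeps the snippet runnable without a WandB account.

```python
import wandb  # experiment-tracking backend

# Illustrative tracking loop; project name, config values, and metric keys
# are assumptions, not train.py's actual code.
run = wandb.init(project="nanovlm-demo",
                 config={"batch_size": 32, "lr": 1e-4},
                 mode="offline")

for step in range(100):
    loss = 1.0 / (step + 1)  # placeholder metric for illustration
    wandb.log({"train/loss": loss}, step=step)

run.finish()
```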

Contributions are encouraged, with a focus on keeping the implementation in pure PyTorch and avoiding dependencies such as DeepSpeed or Accelerate.

It supports tasks such as multimodal chat, video sequence processing, live streaming analysis, and educational experimentation with small VLMs.

NanoVLM integrates with the Hugging Face ecosystem, supporting pretrained model loading, saving, and sharing via the Hugging Face Hub.
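
A hedged sketch of what that Hub round-tripping can look like is below. The module path, class name, checkpoint id, and local directory are assumptions based on the project's description, so check the repository for the exact API before relying on them.

```python
# Sketch of Hub integration; names below are assumptions, not a verified API.
from models.vision_language_model import VisionLanguageModel  # assumed module path

model = VisionLanguageModel.from_pretrained("lusxvr/nanoVLM-222M")  # load a pretrained checkpoint
model.save_pretrained("my-nanovlm-checkpoint")                      # save locally
model.push_to_hub("your-username/my-nanovlm")                       # share on the Hugging Face Hub
```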

Training requires about 4.5 GB of VRAM at batch size 1 and up to 65 GB at batch size 256; a provided script, measure_vram.py, lets you check VRAM usage for your own setup.
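
For reference, peak VRAM can also be checked with a generic PyTorch pattern like the one below. This is not measure_vram.py's actual code, just the underlying idea, and the stand-in model is purely illustrative.

```python
import torch

# Generic pattern for measuring peak GPU memory of a forward/backward pass;
# not measure_vram.py itself.
def measure_peak_vram(model: torch.nn.Module, batch: torch.Tensor) -> float:
    torch.cuda.reset_peak_memory_stats()
    out = model(batch)
    out.sum().backward()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024**3  # GiB

if torch.cuda.is_available():
    model = torch.nn.Linear(1024, 1024).cuda()           # stand-in model for the sketch
    batch = torch.randn(256, 1024, device="cuda")
    print(f"peak VRAM: {measure_peak_vram(model, batch):.2f} GiB")
```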

NanoVLM emphasizes transparency, modularity, and forward compatibility, allowing users to experiment with model and training configurations.
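
Such configurations are typically plain Python objects that are easy to edit and version. The dataclasses below are a hedged sketch of that idea; the field names and default values are assumptions, not the project's actual config classes.

```python
from dataclasses import dataclass

# Illustrative configuration sketch; field names and defaults are assumptions
# chosen to show the plain-Python, hackable style, not nanoVLM's real configs.
@dataclass
class VLMConfigSketch:
    vision_hidden_dim: int = 768     # width of the vision encoder
    text_hidden_dim: int = 576       # width of the language decoder
    num_image_tokens: int = 49       # projected image tokens fed to the decoder
    max_seq_len: int = 1024          # decoder context length

@dataclass
class TrainConfigSketch:
    batch_size: int = 32
    learning_rate: float = 1e-4
    epochs: int = 3
    log_with_wandb: bool = True

config = VLMConfigSketch(text_hidden_dim=768)  # tweak one field and retrain
```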

The 222M-parameter NanoVLM model achieves competitive results for its size, such as 35.3% accuracy on the MMStar benchmark.