NanoVLM is a Hugging Face project designed to train and optimize small Vision-Language Models (VLMs) with a focus on efficiency and simplicity
It features a modular multimodal architecture with a vision encoder, a lightweight language decoder, and a modality projection layer for vision-language fusion
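To make the fusion pattern concrete, here is a minimal sketch of that three-part layout; the class, argument, and dimension names are illustrative placeholders, not NanoVLM's actual module names:

```python
# Illustrative sketch only: names and shapes are placeholders, not NanoVLM's real classes,
# but the encoder -> projector -> decoder fusion pattern is the same idea.
import torch
import torch.nn as nn


class TinyVLM(nn.Module):
    def __init__(self, vision_encoder: nn.Module, language_decoder: nn.Module,
                 vision_dim: int, text_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder        # e.g. a small ViT
        self.language_decoder = language_decoder    # e.g. a small causal LM
        # Modality projection: map vision features into the text embedding space
        self.modality_projector = nn.Linear(vision_dim, text_dim)

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # Encode the image into a sequence of patch features
        vision_feats = self.vision_encoder(pixel_values)       # (B, N_patches, vision_dim)
        # Project patch features into the language model's embedding space
        image_tokens = self.modality_projector(vision_feats)   # (B, N_patches, text_dim)
        # Prepend projected image tokens to the text embeddings and decode jointly
        fused = torch.cat([image_tokens, text_embeds], dim=1)  # (B, N_patches + T, text_dim)
        return self.language_decoder(fused)
```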
Planned additions include multi-GPU training, multi-image support, image splitting, and integration with VLMEvalKit for broader benchmark evaluation
Includes scripts such as train.py for training and generate.py for testing generation, with Weights & Biases (wandb) logging for tracking experiments
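The following is a minimal sketch of the wandb tracking pattern a training script like train.py typically relies on; the project name, config values, and metric keys are assumptions for illustration, not taken from the repo:

```python
# Minimal experiment-tracking sketch with Weights & Biases;
# project name, config, and metric keys are hypothetical, not train.py's actual values.
import wandb

run = wandb.init(project="nanovlm-experiments", config={"batch_size": 1, "lr": 1e-4})

for step in range(10):
    loss = 1.0 / (step + 1)                  # stand-in for the real training loss
    wandb.log({"train/loss": loss}, step=step)

run.finish()
```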
Contributions are encouraged, with a focus on maintaining pure PyTorch implementations and avoiding dependencies like DeepSpeed or Accelerate
It supports tasks like multimodal chat, video sequence processing, live streaming analysis, and educational experimentation with small VLMs
NanoVLM integrates with the Hugging Face ecosystem, supporting pretrained model loading, saving, and sharing via the Hugging Face Hub
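A sketch of that Hub round-trip is shown below, assuming a transformers-style from_pretrained / save_pretrained / push_to_hub interface; the import path, class name, and repo ids are assumptions rather than verified NanoVLM API:

```python
# Hub round-trip sketch, assuming a transformers-style interface;
# the import path, class name, and repo ids are assumptions, not verified API.
from models.vision_language_model import VisionLanguageModel  # hypothetical import path

# Load a pretrained checkpoint from the Hugging Face Hub
model = VisionLanguageModel.from_pretrained("lusxvr/nanoVLM-222M")

# Save a local copy, then share a fine-tuned version back to the Hub
model.save_pretrained("my-nanovlm-checkpoint")
model.push_to_hub("my-username/my-nanovlm")
```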
Training requires ~4.5 GB of VRAM at batch size 1 and up to 65 GB at batch size 256, with a provided script (measure_vram.py) to measure VRAM usage for a given configuration
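The sketch below shows how such a measurement can be taken with PyTorch's CUDA memory statistics; it illustrates the technique only and is not the contents of measure_vram.py, and the model/batch handling is a placeholder:

```python
# Peak-VRAM measurement sketch using PyTorch's CUDA memory statistics;
# the model/batch handling is a placeholder, not measure_vram.py itself.
import torch


def measure_peak_vram(model: torch.nn.Module, batch: dict) -> float:
    """Run one forward/backward pass and return peak allocated memory in GB."""
    device = torch.device("cuda")
    model = model.to(device)
    batch = {k: v.to(device) for k, v in batch.items()}

    torch.cuda.reset_peak_memory_stats(device)
    loss = model(**batch).mean()     # placeholder: assumes the model returns a tensor
    loss.backward()
    torch.cuda.synchronize(device)

    return torch.cuda.max_memory_allocated(device) / 1024**3
```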
NanoVLM emphasizes transparency, modularity, and forward compatibility, allowing users to experiment with model and training configurations
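As a hypothetical illustration of what such experimentation looks like, a simple dataclass config could expose the backbone and projection choices as overridable fields; every field name and default below is an assumption, not NanoVLM's actual configuration:

```python
# Hypothetical configuration sketch: field names and defaults are illustrative only,
# not NanoVLM's real config, but show the kind of knobs a modular VLM exposes.
from dataclasses import dataclass


@dataclass
class VLMConfig:
    vision_model_name: str = "siglip-base-patch16-224"  # vision backbone (assumption)
    language_model_name: str = "SmolLM2-135M"            # language decoder (assumption)
    projection_hidden_dim: int = 768                      # modality projector width
    image_size: int = 224
    max_seq_len: int = 1024


config = VLMConfig(image_size=384)  # experiment by overriding individual fields
print(config)
```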
The 222M-parameter NanoVLM model achieves competitive results for its size, such as 35.3% accuracy on the MMStar benchmark