Dataflux: Machine Learning Data Loading Efficiency

Large datasets are ideal for training machine learning (ML) models, and fast data loading is essential for keeping ML training affordable.

Google Cloud advises using Dataflux Dataset for training workflows instead of alternative libraries or direct calls to the Cloud Storage API.

Dataflux optimises performance, achieving up to 3.5 times faster training, particularly for datasets made up of smaller files.

PyTorch Dataset primitive: Dataflux integrates easily with familiar PyTorch concepts.

Dataflux is a Dataset implementation for teams that use PyTorch and store their training data in Cloud Storage; a minimal sketch of the underlying idea follows.
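As a rough illustration only, not Dataflux's actual API, the sketch below wraps Cloud Storage objects in a standard torch.utils.data.Dataset. The class name, bucket, prefix, and raw-bytes samples are hypothetical placeholders.

```python
# A minimal sketch of a map-style PyTorch Dataset over Cloud Storage objects.
# Bucket and prefix names are hypothetical; Dataflux's real implementation
# layers listing and download optimisations on top of this basic idea.
from google.cloud import storage
from torch.utils.data import Dataset


class GcsObjectDataset(Dataset):
    def __init__(self, bucket_name: str, prefix: str = ""):
        client = storage.Client()
        self._bucket = client.bucket(bucket_name)
        # Materialise object names once so __len__ and __getitem__ stay cheap.
        self._names = [blob.name for blob in client.list_blobs(bucket_name, prefix=prefix)]

    def __len__(self) -> int:
        return len(self._names)

    def __getitem__(self, index: int) -> bytes:
        # Each sample is the raw object content; a real pipeline would decode
        # these bytes into tensors here.
        return self._bucket.blob(self._names[index]).download_as_bytes()
```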

If reading data and assembling a batch takes longer than the GPU takes to process the previous batch, the GPU sits blocked and underutilised, increasing training times; the standard PyTorch mitigation is shown below.
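As background, a common way to hide loading latency in plain PyTorch is to read batches in parallel worker processes so the GPU rarely waits. The values below are illustrative, and GcsObjectDataset is the hypothetical sketch from above.

```python
from torch.utils.data import DataLoader

# "my-training-bucket" is a placeholder; GcsObjectDataset is the sketch above.
dataset = GcsObjectDataset("my-training-bucket")
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,      # background processes fetch objects in parallel
    prefetch_factor=2,  # each worker keeps two batches queued ahead
    pin_memory=True,    # speeds up host-to-GPU copies
)
```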

Dataflux takes advantage of a Cloud Storage feature called Compose objects, which lets it dynamically merge multiple smaller objects into a larger one, as sketched below.
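The compose operation is part of the public Cloud Storage API; the snippet below invokes it directly through the google-cloud-storage Python client. The bucket and object names are hypothetical, and this is not how Dataflux calls it internally.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-training-bucket")  # hypothetical bucket

# Merge up to 32 small objects into one composite object server-side,
# so a training job issues one read instead of 32.
sources = [bucket.blob(f"shards/shard-{i:04d}.bin") for i in range(32)]
composite = bucket.blob("composed/shard-0000-0031.bin")
composite.compose(sources)
```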

Dataflux also uses a sophisticated work-stealing algorithm to keep parallel workers busy, which matters because typical AI training works on datasets with tens of millions of objects. A generic sketch of the technique follows.
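For readers unfamiliar with the technique, here is a generic work-stealing sketch, not Dataflux's implementation: idle workers steal tasks from the tail of a busy worker's queue, so even a heavily skewed initial split finishes in balance.

```python
import random
import threading
from collections import deque

NUM_WORKERS = 4
# Deliberately skewed start: worker 0 holds all 4000 tasks.
queues = [deque(range(4000)) if i == 0 else deque() for i in range(NUM_WORKERS)]
lock = threading.Lock()  # one lock for simplicity; real schedulers lock per queue
completed = [0] * NUM_WORKERS

def worker(wid: int) -> None:
    while True:
        with lock:
            if queues[wid]:
                task = queues[wid].popleft()          # take from own queue's head
            else:
                victims = [q for q in queues if q]
                if not victims:
                    return                            # every queue is empty: done
                task = random.choice(victims).pop()   # steal from a victim's tail
        completed[wid] += 1  # stand-in for real work, e.g. listing a range of objects

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(completed)  # counts end up roughly balanced despite the skewed start
```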