Overview
Neurenix provides comprehensive support for distributed training across multiple GPUs and compute nodes. The framework integrates with industry-standard distributed training backends including MPI, Horovod, and DeepSpeed.
MPI Backend
The Message Passing Interface (MPI) backend provides low-level distributed computing primitives for parallel training.
MPIManager
The MPIManager class provides an interface to MPI functionality for distributed training.
neurenix/distributed/mpi.py:15
Collective Operations
Barrier Synchronization
neurenix/distributed/mpi.py:135
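As a framework-independent illustration of what a barrier guarantees, the sketch below uses Python's standard `threading.Barrier`, with threads standing in for ranks. It does not use the MPIManager API; `worker`, `NUM_RANKS`, and `events` are names chosen here for illustration only.

```python
import threading

NUM_RANKS = 4  # threads stand in for distributed ranks (illustrative)
barrier = threading.Barrier(NUM_RANKS)
events = []                      # shared log of (phase, rank) tuples
events_lock = threading.Lock()


def worker(rank: int) -> None:
    # Phase 1: every rank does some local work.
    with events_lock:
        events.append(("before", rank))
    # No rank proceeds past this point until all ranks have arrived.
    barrier.wait()
    # Phase 2: safe to assume every rank finished phase 1.
    with events_lock:
        events.append(("after", rank))


threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

phases = [phase for phase, _ in events]
```

Every "before" entry is logged ahead of every "after" entry, regardless of how the threads are scheduled. A distributed barrier gives the same ordering guarantee across processes.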
Broadcast
neurenix/distributed/mpi.py:148
All-Reduce
neurenix/distributed/mpi.py:168
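The semantics of all-reduce can be shown with a small pure-Python simulation in which a list of per-rank vectors stands in for the distributed state. This is a sketch of the general collective, not the Neurenix method signature; the function name `all_reduce` and its `op` parameter are assumptions made for this example.

```python
from typing import List


def all_reduce(per_rank: List[List[float]], op: str = "sum") -> List[List[float]]:
    """Simulate all-reduce: every rank ends up with the same reduced vector."""
    world_size = len(per_rank)
    # Element-wise sum across ranks.
    reduced = [sum(column) for column in zip(*per_rank)]
    if op == "mean":
        reduced = [value / world_size for value in reduced]
    elif op != "sum":
        raise ValueError(f"unsupported op: {op}")
    # Each rank receives its own copy of the result.
    return [list(reduced) for _ in range(world_size)]


# Three ranks, each holding a local gradient of length 2.
local_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
summed = all_reduce(local_grads, op="sum")
averaged = all_reduce(local_grads, op="mean")
```

Averaging rather than summing is the usual choice for gradients, because it keeps the update magnitude independent of the number of workers.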
All-Gather
neurenix/distributed/mpi.py:188
Scatter
neurenix/distributed/mpi.py:207
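Scatter and all-gather are roughly inverse operations: scatter hands each rank one slice of the root's data, and all-gather reassembles the slices on every rank. The simulation below illustrates that round trip with plain lists. The function names and the equal-chunk requirement are assumptions of this sketch, not documented Neurenix behavior.

```python
from typing import List, Sequence


def scatter(data: Sequence[int], world_size: int) -> List[List[int]]:
    """Root splits `data` into equal contiguous chunks, one per rank."""
    if len(data) % world_size != 0:
        raise ValueError("data length must be divisible by world_size")
    chunk = len(data) // world_size
    return [list(data[r * chunk:(r + 1) * chunk]) for r in range(world_size)]


def all_gather(per_rank: List[List[int]]) -> List[List[int]]:
    """Every rank receives the concatenation of all chunks, in rank order."""
    gathered = [item for chunk in per_rank for item in chunk]
    return [list(gathered) for _ in per_rank]


chunks = scatter(list(range(8)), world_size=4)
# Each rank transforms only its own chunk.
squared = [[x * x for x in chunk] for chunk in chunks]
full = all_gather(squared)
```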
Context Manager
neurenix/distributed/mpi.py:227
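A context manager is a natural fit for distributed setup because it pairs initialization with guaranteed cleanup. The pattern can be sketched with `contextlib`; `FakeBackend` and `distributed_session` are hypothetical names used only for illustration, and the actual MPIManager interface may differ.

```python
from contextlib import contextmanager


class FakeBackend:
    """Hypothetical stand-in for a distributed backend (not a Neurenix class)."""

    def __init__(self) -> None:
        self.initialized = False

    def initialize(self) -> None:
        self.initialized = True

    def finalize(self) -> None:
        self.initialized = False


@contextmanager
def distributed_session(backend: FakeBackend):
    """Initialize the backend on entry and always finalize it on exit."""
    backend.initialize()
    try:
        yield backend
    finally:
        # Runs even if the training code raises, so resources are released.
        backend.finalize()


backend = FakeBackend()
with distributed_session(backend) as session:
    active_inside = session.initialized
active_after = backend.initialized
```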
Horovod Backend
Horovod provides a unified API for distributed training with automatic gradient aggregation.
HorovodManager
neurenix/distributed/horovod.py:15
Broadcasting Parameters
neurenix/distributed/horovod.py:207
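Parameters are broadcast so that every worker starts from identical weights; otherwise averaged gradients would be applied to diverging models. The simulation below illustrates the idea with plain dictionaries. `init_params` and `broadcast_parameters` are illustrative helpers written for this sketch and do not represent the HorovodManager implementation.

```python
import copy
import random
from typing import Dict, List

Params = Dict[str, List[float]]


def init_params(seed: int) -> Params:
    """Each rank initializes with a different seed, so replicas disagree."""
    rng = random.Random(seed)
    return {"weight": [rng.uniform(-1, 1) for _ in range(3)], "bias": [0.0]}


def broadcast_parameters(replicas: List[Params], root_rank: int = 0) -> None:
    """Overwrite every replica's parameters with a copy of the root's."""
    source = replicas[root_rank]
    for rank in range(len(replicas)):
        if rank != root_rank:
            replicas[rank] = copy.deepcopy(source)


replicas = [init_params(seed=rank) for rank in range(4)]
differed_before = replicas[0] != replicas[1]
broadcast_parameters(replicas, root_rank=0)
```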
Distributed Optimizer
neurenix/distributed/horovod.py:224
Complete Training Example
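To make the overall flow concrete without depending on any particular backend, here is a pure-Python simulation of synchronous data-parallel training on a toy regression problem. It is a sketch of the general technique (shard the data, compute local gradients, average them, apply identical updates), not the Neurenix or Horovod API. All names and constants in it are chosen for this example.

```python
from typing import List, Tuple

WORLD_SIZE = 4
LEARNING_RATE = 0.01

# Toy dataset for y = 2x, sharded round-robin across ranks.
dataset: List[Tuple[float, float]] = [(float(x), 2.0 * x) for x in range(1, 9)]
shards = [dataset[rank::WORLD_SIZE] for rank in range(WORLD_SIZE)]


def local_gradient(w: float, shard: List[Tuple[float, float]]) -> float:
    """Gradient of mean squared error for the model y_hat = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


# Step 1: every replica starts from the same parameters (as after a broadcast).
weights = [0.0] * WORLD_SIZE

for step in range(50):
    # Step 2: each rank computes a gradient on its own shard.
    grads = [local_gradient(weights[r], shards[r]) for r in range(WORLD_SIZE)]
    # Step 3: all-reduce (average) so every rank sees the same gradient.
    avg_grad = sum(grads) / WORLD_SIZE
    # Step 4: every rank applies the identical update.
    weights = [w - LEARNING_RATE * avg_grad for w in weights]
```

Because every replica starts from the same weight and applies the same averaged gradient, the replicas stay in lockstep and converge toward the true slope of 2.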
DeepSpeed Backend
DeepSpeed provides advanced optimization techniques including ZeRO optimization, pipeline parallelism, and mixed precision training.
DeepSpeedManager
neurenix/distributed/deepspeed.py:15
Initialize Model with DeepSpeed
neurenix/distributed/deepspeed.py:152
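DeepSpeed itself is driven by a JSON configuration, so a minimal config is a useful reference point. The sketch below builds one as a Python dictionary using DeepSpeed's documented config keys. How DeepSpeedManager accepts such a config is not shown here, and the batch sizes and GPU count are assumed values.

```python
import json

WORLD_SIZE = 4            # number of GPUs (assumed for this sketch)
MICRO_BATCH_PER_GPU = 8
GRAD_ACCUM_STEPS = 2

ds_config = {
    # DeepSpeed expects: train_batch_size ==
    #   micro_batch_per_gpu * gradient_accumulation_steps * world_size
    "train_batch_size": MICRO_BATCH_PER_GPU * GRAD_ACCUM_STEPS * WORLD_SIZE,
    "train_micro_batch_size_per_gpu": MICRO_BATCH_PER_GPU,
    "gradient_accumulation_steps": GRAD_ACCUM_STEPS,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}

config_json = json.dumps(ds_config, indent=2)
```

Deriving `train_batch_size` from the other two values avoids the batch-size mismatch error DeepSpeed raises when the three numbers are inconsistent.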
ZeRO Optimization Stages
ZeRO Stage 1: Optimizer state partitioning
- Partitions optimizer states across GPUs
- Reduces model-state memory by up to ~4x when using Adam
ZeRO Stage 2: Gradient partitioning
- Partitions gradients in addition to optimizer states
- Further reduces memory usage
ZeRO Stage 3: Parameter partitioning
- Partitions model parameters across GPUs
- Enables training very large models
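The memory effect of each stage can be estimated with the accounting used in the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for optimizer state. The helper below is a rough sketch under those assumptions. It ignores activations and temporary buffers, and `zero_memory_bytes` is a name made up for this example.

```python
def zero_memory_bytes(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU model-state memory for mixed-precision Adam.

    Accounting: 2 bytes/param for FP16 weights, 2 bytes/param for FP16
    gradients, and 12 bytes/param for optimizer state (FP32 weights,
    momentum, variance). Activations and buffers are ignored.
    """
    params, grads, optim = 2.0 * num_params, 2.0 * num_params, 12.0 * num_params
    if stage >= 1:
        optim /= num_gpus   # Stage 1: partition optimizer state
    if stage >= 2:
        grads /= num_gpus   # Stage 2: also partition gradients
    if stage >= 3:
        params /= num_gpus  # Stage 3: also partition parameters
    return params + grads + optim


GB = 1e9
# 7.5B-parameter model on 64 GPUs; stage 0 means no partitioning.
estimates = {s: zero_memory_bytes(7.5e9, 64, s) / GB for s in range(4)}
```

For this configuration the estimate falls from 120 GB per GPU without ZeRO to roughly 31.4 GB, 16.6 GB, and 1.9 GB for stages 1 to 3.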
Training with DeepSpeed
Data Parallel Training
For simple multi-GPU training on a single node:
neurenix/distributed/__init__.py:9
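Data parallelism rests on each GPU seeing a disjoint slice of the dataset while holding a full copy of the model. A strided partition is one common way to produce those slices. The helper below is an illustrative sketch, not the Neurenix data-parallel wrapper, and `shard_indices` is a hypothetical name.

```python
from typing import List


def shard_indices(num_samples: int, rank: int, world_size: int) -> List[int]:
    """Return the sample indices assigned to `rank` (strided partition)."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return list(range(rank, num_samples, world_size))


NUM_SAMPLES, NUM_GPUS = 10, 4
per_gpu = [shard_indices(NUM_SAMPLES, r, NUM_GPUS) for r in range(NUM_GPUS)]
```

When the dataset size is not divisible by the number of GPUs, as here, some ranks receive one fewer sample. Real samplers often pad or drop samples so that every rank runs the same number of steps.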
Best Practices
Device Placement
Learning Rate Scaling
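A widely used heuristic is the linear scaling rule: when the global batch size grows by a factor equal to the number of workers, multiply the learning rate by the same factor and ramp up to it over a short warmup. The sketch below implements that heuristic as an assumption. This page does not state which rule Neurenix recommends, and `scaled_lr` is a hypothetical helper.

```python
def scaled_lr(base_lr: float, world_size: int, step: int, warmup_steps: int) -> float:
    """Linear scaling rule with linear warmup from base_lr to base_lr * world_size."""
    target = base_lr * world_size
    if warmup_steps <= 0 or step >= warmup_steps:
        return target
    # Interpolate from base_lr (step 0) to target (step == warmup_steps).
    return base_lr + (target - base_lr) * step / warmup_steps


schedule = [scaled_lr(0.1, world_size=8, step=s, warmup_steps=4) for s in range(6)]
```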
Gradient Accumulation
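Gradient accumulation trades time for memory: gradients from several small micro-batches are summed before a single optimizer step, which emulates a larger batch. The loop below illustrates the bookkeeping for one scalar parameter. The gradient values are made up for the example.

```python
from typing import List

ACCUM_STEPS = 4
LEARNING_RATE = 0.1

# Hypothetical per-micro-batch gradients for a single scalar parameter.
micro_batch_grads: List[float] = [0.4, 0.8, 1.2, 1.6, 2.0, 2.0, 2.0, 2.0]

weight = 1.0
accumulated = 0.0
updates = 0

for i, grad in enumerate(micro_batch_grads, start=1):
    # Divide by ACCUM_STEPS so the sum equals the mean over the large batch.
    accumulated += grad / ACCUM_STEPS
    if i % ACCUM_STEPS == 0:
        weight -= LEARNING_RATE * accumulated  # one optimizer step
        accumulated = 0.0                      # reset for the next window
        updates += 1
```

In a distributed run, the effective batch size is the micro-batch size times the accumulation steps times the number of workers.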
Checkpointing
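In data-parallel training all ranks hold the same model state, so a common practice is to write checkpoints from rank 0 only and to write them atomically. The following sketch illustrates that pattern with JSON files. The helper names and file layout are assumptions for this example and are not Neurenix APIs.

```python
import json
import os
import tempfile
from typing import Dict


def save_checkpoint(state: Dict[str, float], path: str, rank: int) -> bool:
    """Write `state` to `path` on rank 0 only; return True if a file was written."""
    if rank != 0:
        return False  # other ranks hold identical state in data-parallel training
    # Write to a temp file first, then rename, so a crash never leaves a
    # half-written checkpoint at `path`.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)
    return True


def load_checkpoint(path: str) -> Dict[str, float]:
    """Read a checkpoint written by save_checkpoint."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


checkpoint_dir = tempfile.mkdtemp()
checkpoint_path = os.path.join(checkpoint_dir, "step_100.json")
written = [save_checkpoint({"weight": 0.7, "step": 100}, checkpoint_path, rank=r)
           for r in range(4)]
restored = load_checkpoint(checkpoint_path)
```

Sharded schemes such as ZeRO Stage 3 are an exception: each rank owns part of the state, so every rank has to participate in saving.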
Environment Variables
Common environment variables for distributed training:
Launching Distributed Training
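Launchers usually tell each process its place in the job through environment variables. As a sketch, the helper below reads the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` names used by `torch.distributed`-style launchers and falls back to the `OMPI_COMM_WORLD_*` variables set by Open MPI. Which variables Neurenix itself reads is an assumption here; `ProcessInfo` and `read_process_info` are hypothetical names.

```python
import os
from typing import Mapping, NamedTuple


class ProcessInfo(NamedTuple):
    rank: int
    local_rank: int
    world_size: int


def read_process_info(env: Mapping[str, str] = os.environ) -> ProcessInfo:
    """Read rank information, falling back to single-process defaults."""
    # RANK / LOCAL_RANK / WORLD_SIZE are the torch.distributed-style names;
    # OMPI_COMM_WORLD_* are set by Open MPI's mpirun.
    rank = env.get("RANK", env.get("OMPI_COMM_WORLD_RANK", "0"))
    local_rank = env.get("LOCAL_RANK", env.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
    world_size = env.get("WORLD_SIZE", env.get("OMPI_COMM_WORLD_SIZE", "1"))
    return ProcessInfo(int(rank), int(local_rank), int(world_size))


example = read_process_info({"RANK": "5", "LOCAL_RANK": "1", "WORLD_SIZE": "8"})
mpi_example = read_process_info({"OMPI_COMM_WORLD_RANK": "2",
                                 "OMPI_COMM_WORLD_LOCAL_RANK": "2",
                                 "OMPI_COMM_WORLD_SIZE": "4"})
single = read_process_info({})
```

The local rank is typically what selects the GPU on a node, while the global rank decides which process logs and writes checkpoints.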
MPI Launch
Horovod Launch
DeepSpeed Launch
Performance Tips
- Use the NCCL backend for GPU-to-GPU communication; it is usually the fastest option on NVIDIA GPUs
- Enable tensor fusion to reduce communication overhead
- Use gradient compression (FP16) to reduce bandwidth
- Batch data loading to maximize GPU utilization
- Profile communication to identify bottlenecks
- Use InfiniBand for multi-node training when available
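To illustrate the FP16 gradient compression tip, the sketch below packs a small gradient vector as half-precision floats using Python's standard `struct` module and measures the size and round-trip error. It only demonstrates the bandwidth and precision trade-off. The function names are made up for this example, and real backends perform the cast on the device.

```python
import struct
from typing import List


def compress_fp16(values: List[float]) -> bytes:
    """Pack values as IEEE 754 half-precision floats (2 bytes each)."""
    return struct.pack(f"<{len(values)}e", *values)


def decompress_fp16(payload: bytes) -> List[float]:
    """Unpack a half-precision payload back into Python floats."""
    return list(struct.unpack(f"<{len(payload) // 2}e", payload))


gradients = [0.1, -2.5, 3.14159, 0.0]
fp32_bytes = struct.pack(f"<{len(gradients)}f", *gradients)
fp16_bytes = compress_fp16(gradients)
restored = decompress_fp16(fp16_bytes)
max_error = max(abs(a - b) for a, b in zip(gradients, restored))
```

The payload is half the size of its FP32 equivalent, at the cost of a small rounding error on values that are not exactly representable in half precision.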
Related Topics
- Quantization - Reduce model size and memory usage
- Performance Optimization - General optimization techniques
- Mixed Precision Training - FP16/FP32 training