The dataset command provides dataset management capabilities: listing, downloading, registering, inspecting, splitting, and converting datasets.
Usage
Actions
| Action | Description |
|---|---|
| list | List available registered datasets |
| download | Download a dataset from a URL or the registry |
| register | Register a dataset in the local registry |
| info | Get detailed information about a dataset |
| split | Split a dataset into train/val/test sets |
| convert | Convert a dataset to a different format |
Available Formats
- csv - Comma-separated values
- json - JSON format
- npy - NumPy binary format
- hdf5 - HDF5 format
- tfrecord - TensorFlow record format
- parquet - Apache Parquet format
Examples
List registered datasets
List in JSON format
Download a dataset
Download to specific location
Download from URL
Register a dataset
Register with format
Register with metadata
Get dataset info
Get info for local dataset
Split a dataset
Split with shuffling
Convert dataset format
Convert with explicit input format
Action Details
list
List all registered datasets in the local registry.

Options:
- --format: Output format (text, json)
download
Download a dataset from a URL or registered name.

Arguments:
- source: Dataset URL or registered name

Options:
- --output: Output directory or file (default: data)
- --format: Dataset format (auto-detected if not specified)
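Because source accepts either a URL or a registered name, the two cases have to be told apart somehow. The sketch below shows one plausible way to do that with the Python standard library. The function name `classify_source` and the set of accepted URL schemes are assumptions made for illustration; the command's actual resolution logic is not described here.

```python
from urllib.parse import urlparse

# Assumed set of schemes treated as remote URLs (hypothetical, not from the tool).
URL_SCHEMES = {"http", "https", "ftp"}

def classify_source(source: str) -> str:
    """Return "url" if source looks like a remote URL, otherwise "name"."""
    parsed = urlparse(source)
    if parsed.scheme in URL_SCHEMES and parsed.netloc:
        return "url"
    return "name"

print(classify_source("https://example.com/data/iris.csv"))  # url
print(classify_source("iris"))                               # name
```

Anything that does not parse as a URL with a known scheme and a host is treated as a registry name, so a typo in a URL would surface as a failed registry lookup rather than a network error.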
register
Register a dataset in the local registry for easy access.

Arguments:
- name: Dataset name
- url: Dataset URL or file path

Options:
- --format: Dataset format
- --metadata: Metadata (JSON string or file path)
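The --metadata option takes either a JSON string or a path to a file. A minimal sketch of how such a dual-purpose value could be handled follows. `load_metadata` is a hypothetical helper, and the rule that metadata must be a JSON object is an assumption rather than documented behavior.

```python
import json
import os

def load_metadata(value: str) -> dict:
    """Parse metadata given as a JSON string or as a path to a JSON file."""
    if os.path.isfile(value):
        # The value names an existing file: read the JSON from disk.
        with open(value, encoding="utf-8") as fh:
            data = json.load(fh)
    else:
        # Otherwise treat the value itself as JSON text.
        data = json.loads(value)
    if not isinstance(data, dict):
        # Assumption: metadata is a key/value mapping, not a list or scalar.
        raise ValueError("metadata must be a JSON object")
    return data

print(load_metadata('{"source": "internal", "rows": 150}'))
```

Checking for a file first means a value that is neither an existing path nor valid JSON fails with a JSON parse error, which is worth keeping in mind when a file path is mistyped.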
info
Get detailed information about a dataset.

Arguments:
- name: Dataset name or path

Options:
- --format: Output format (text, json)
split
Split a dataset into train/validation/test sets.

Arguments:
- input: Input dataset file or directory

Options:
- --output: Output directory (default: data)
- --ratio: Split ratios (default: 0.8,0.2)
- --shuffle: Shuffle data before splitting
- --seed: Random seed for reproducibility
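To make the interaction of --ratio, --shuffle, and --seed concrete, here is a small sketch of ratio-based splitting in plain Python. It is an illustration under stated assumptions, not the command's implementation: `parse_ratios` and `split_rows` are hypothetical names, and the validation rules (two or three positive ratios that sum to 1) are assumed.

```python
import random

def parse_ratios(text: str) -> list[float]:
    """Parse a comma-separated ratio string such as "0.8,0.2"."""
    ratios = [float(part) for part in text.split(",")]
    if len(ratios) not in (2, 3):
        raise ValueError("expected two or three ratios")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    if abs(sum(ratios) - 1.0) > 1e-6:
        raise ValueError("ratios must sum to 1")
    return ratios

def split_rows(rows, ratios, shuffle=False, seed=None):
    """Split rows into consecutive parts sized according to ratios."""
    rows = list(rows)
    if shuffle:
        # A dedicated Random instance keeps the shuffle reproducible per seed.
        random.Random(seed).shuffle(rows)
    parts, start, cumulative = [], 0, 0.0
    for ratio in ratios[:-1]:
        cumulative += ratio
        end = round(cumulative * len(rows))
        parts.append(rows[start:end])
        start = end
    parts.append(rows[start:])  # the last part takes whatever remains
    return parts

train, val = split_rows(range(10), parse_ratios("0.8,0.2"), shuffle=True, seed=42)
print(len(train), len(val))  # 8 2
```

Cutting at cumulative boundaries, rather than rounding each part separately, guarantees that every row lands in exactly one part even when the ratios do not divide the row count evenly.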
convert
Convert a dataset to a different format.

Arguments:
- input: Input dataset file or directory
- output: Output file or directory

Options:
- --input-format: Input format (auto-detected if not specified)
- --output-format: Output format (required)
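The sketch below illustrates the two ideas in this action: detecting the input format when --input-format is omitted, and converting between two of the listed formats. It assumes detection is based on the file extension, which this page does not confirm. The extension table, `detect_format`, and `csv_to_json` are hypothetical, and only the CSV to JSON case is shown because it needs nothing beyond the standard library.

```python
import csv
import json
from pathlib import Path

# Hypothetical extension-to-format table; the tool's real rules may differ.
EXTENSION_FORMATS = {
    ".csv": "csv",
    ".json": "json",
    ".npy": "npy",
    ".h5": "hdf5",
    ".hdf5": "hdf5",
    ".tfrecord": "tfrecord",
    ".parquet": "parquet",
}

def detect_format(path) -> str:
    """Guess a dataset format from the file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTENSION_FORMATS[suffix]
    except KeyError:
        raise ValueError(f"cannot detect format of {path!r}") from None

def csv_to_json(src, dst) -> int:
    """Convert a CSV file with a header row to a JSON array of records."""
    with open(src, newline="", encoding="utf-8") as fh:
        records = list(csv.DictReader(fh))  # all values are read as strings
    with open(dst, "w", encoding="utf-8") as fh:
        json.dump(records, fh, indent=2)
    return len(records)

print(detect_format("data/train.CSV"))  # csv
```

Note that `csv.DictReader` yields every value as a string, so a real converter would also need a type-inference or schema step to preserve numeric columns.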
Error Handling
Dataset not found
Invalid split ratio
File not found
Use Cases
1. Download and prepare dataset
2. Register custom dataset
3. Convert dataset format
4. Create reproducible splits
5. Manage multiple datasets
Best Practices
1. Always shuffle when splitting
2. Use consistent split ratios
Standard splits:
- 70/15/15: Balanced three-way split
- 80/20: Simple train/val split
- 80/10/10: More training data
3. Register datasets with metadata
4. Convert to efficient formats
For large datasets, use efficient formats.
5. Organize dataset directories
Integration Example
Complete dataset preparation pipeline:
See Also
- Preprocess command - Preprocess downloaded datasets
- Init command - Initialize projects with datasets
- Run command - Train with prepared datasets