neurenix dataset

The dataset command manages datasets: listing, downloading, registering, inspecting, splitting, and converting them.

Usage

neurenix dataset <action> [options]

Actions

Action	Description
`list`	List available registered datasets
`download`	Download a dataset from URL or registry
`register`	Register a dataset in the local registry
`info`	Get detailed information about a dataset
`split`	Split a dataset into train/val/test sets
`convert`	Convert a dataset to a different format

Available Formats

csv - Comma-separated values
json - JSON format
npy - NumPy binary format
hdf5 - HDF5 format
tfrecord - TensorFlow record format
parquet - Apache Parquet format

Examples

List registered datasets

neurenix dataset list

Available datasets:

cifar10:
  URL: https://datasets.neurenix.ai/cifar10
  Format: auto-detect
  Metadata: {"classes": 10, "size": 60000}

imagenet:
  URL: https://datasets.neurenix.ai/imagenet
  Format: auto-detect
  Metadata: {"classes": 1000, "size": 1281167}

List in JSON format

neurenix dataset list --format json

{
  "cifar10": {
    "url": "https://datasets.neurenix.ai/cifar10",
    "format": "auto-detect",
    "metadata": {"classes": 10, "size": 60000}
  },
  "imagenet": {
    "url": "https://datasets.neurenix.ai/imagenet",
    "format": "auto-detect",
    "metadata": {"classes": 1000, "size": 1281167}
  }
}

Download a dataset

neurenix dataset download cifar10

Downloading dataset from cifar10...
Saving dataset to data/cifar10.csv...
Dataset downloaded and saved to data/cifar10.csv

Download to specific location

neurenix dataset download cifar10 --output datasets/cifar10.csv

Downloading dataset from cifar10...
Saving dataset to datasets/cifar10.csv...
Dataset downloaded and saved to datasets/cifar10.csv

Download from URL

neurenix dataset download https://example.com/data.csv --output data/external.csv

Downloading dataset from https://example.com/data.csv...
Saving dataset to data/external.csv...
Dataset downloaded and saved to data/external.csv

Register a dataset

neurenix dataset register my_dataset https://example.com/dataset.csv

Dataset 'my_dataset' registered successfully.

Register with format

neurenix dataset register my_images /path/to/images --format hdf5

Dataset 'my_images' registered successfully.

Register with metadata

neurenix dataset register custom_data data.csv \
  --metadata '{"classes": 5, "samples": 10000}'

Dataset 'custom_data' registered successfully.

Get dataset info

neurenix dataset info cifar10

Dataset: cifar10
URL: https://datasets.neurenix.ai/cifar10
Format: auto-detect
Metadata: {"classes": 10, "size": 60000}

Get info for local dataset

neurenix dataset info data/my_dataset.csv

Loading dataset from data/my_dataset.csv...
Dataset: my_dataset.csv
Path: /home/user/project/data/my_dataset.csv
Format: csv
Size: 5000
Metadata: {}

Split a dataset

neurenix dataset split data/full_dataset.csv --ratio 0.7,0.15,0.15

Loading dataset from data/full_dataset.csv...
Splitting dataset with ratio 0.7,0.15,0.15...
Saving train split (700 samples) to data/train/train_data.csv...
Saving val split (150 samples) to data/val/val_data.csv...
Saving test split (150 samples) to data/test/test_data.csv...
Dataset split successfully. Results saved to data
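The sample counts in the output above follow from the ratios: the three splits total 1000 rows, and 0.7,0.15,0.15 yields 700, 150, and 150. The following is a minimal sketch of that arithmetic, not the CLI's actual implementation. It assumes each split size is the row count times the ratio, rounded down, with any leftover rows assigned to the last split; the source does not state how rounding is handled. The helper name `split_counts` is hypothetical.

```shell
# split_counts TOTAL RATIOS
# Prints the number of samples per split for a comma-separated ratio list.
# Assumption: sizes are floor(total * ratio); leftover rows go to the last split.
split_counts() {
  local total="$1" ratios="$2"
  awk -v total="$total" -v ratios="$ratios" 'BEGIN {
    n = split(ratios, r, ",")
    used = 0
    for (i = 1; i <= n; i++) {
      # The small epsilon guards against floating-point error in total * ratio.
      count = (i < n) ? int(total * r[i] + 1e-9) : total - used
      used += count
      printf "%s%d", (i > 1 ? " " : ""), count
    }
    printf "\n"
  }'
}

# Example:
#   split_counts 1000 0.7,0.15,0.15   # prints: 700 150 150
```

This can help estimate how large each split will be before running the command on a big dataset.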

Split with shuffling

neurenix dataset split data/dataset.csv \
  --ratio 0.8,0.2 \
  --shuffle \
  --seed 42

Loading dataset from data/dataset.csv...
Splitting dataset with ratio 0.8,0.2...
Saving train split (800 samples) to data/train/train_data.csv...
Saving val split (200 samples) to data/val/val_data.csv...
Dataset split successfully. Results saved to data

Convert dataset format

neurenix dataset convert data.csv data.json --output-format json

Loading dataset from data.csv...
Converting dataset to json format...
Dataset converted and saved to data.json

Convert with explicit input format

neurenix dataset convert input.txt output.csv \
  --input-format csv \
  --output-format csv

Loading dataset from input.txt...
Converting dataset to csv format...
Dataset converted and saved to output.csv

Action Details

list

List all registered datasets in the local registry. Options:

--format: Output format (text, json)

neurenix dataset list [--format text|json]

download

Download a dataset from a URL or registered name. Arguments:

source: Dataset URL or registered name

Options:

--output: Output directory or file (default: data)
--format: Dataset format (auto-detected if not specified)

neurenix dataset download <source> [--output <path>] [--format <format>]

register

Register a dataset in the local registry. Arguments:

name: Dataset name
url: Dataset URL or file path

Options:

--format: Dataset format
--metadata: Metadata (JSON string or file path)

neurenix dataset register <name> <url> [--format <format>] [--metadata <json>]

info

Get detailed information about a dataset. Arguments:

name: Dataset name or path

Options:

--format: Output format (text, json)

neurenix dataset info <name> [--format text|json]

split

Split a dataset into train/validation/test sets. Arguments:

input: Input dataset file or directory

Options:

--output: Output directory (default: data)
--ratio: Split ratios (default: 0.8,0.2)
--shuffle: Shuffle data before splitting
--seed: Random seed for reproducibility

neurenix dataset split <input> [--output <dir>] [--ratio <ratios>] [--shuffle] [--seed <int>]

convert

Convert a dataset to a different format. Arguments:

input: Input dataset file or directory
output: Output file or directory

Options:

--input-format: Input format (auto-detected if not specified)
--output-format: Output format (required)

neurenix dataset convert <input> <output> [--input-format <format>] --output-format <format>
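Because --output-format is required, scripts that convert many files may want to derive it from the output file name. The sketch below maps file extensions to the format names listed under Available Formats. The helper `format_from_path` is hypothetical, and treating `.h5` as hdf5 is an assumption based on the `data.h5` example in this page; the source does not describe how the CLI's own auto-detection works.

```shell
# format_from_path PATH
# Prints the format name for PATH based on its extension.
# Returns 1 (with a message on stderr) when the extension is not recognized.
format_from_path() {
  case "${1##*.}" in
    csv)      echo csv ;;
    json)     echo json ;;
    npy)      echo npy ;;
    h5|hdf5)  echo hdf5 ;;
    tfrecord) echo tfrecord ;;
    parquet)  echo parquet ;;
    *)        echo "Unknown format for '$1'" >&2; return 1 ;;
  esac
}

# Example (requires the neurenix CLI):
#   out=data.h5
#   neurenix dataset convert data.csv "$out" --output-format "$(format_from_path "$out")"
```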

Error Handling

Dataset not found

neurenix dataset download unknown_dataset

Error managing dataset: Dataset 'unknown_dataset' not found in registry

Invalid split ratio

neurenix dataset split data.csv --ratio 0.5,0.3

Error: Invalid split ratio: Split ratios must sum to 1.0
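In a script, this error can be caught before the CLI is invoked. The following sketch checks that a comma-separated ratio list sums to 1.0. The helper name `ratios_sum_to_one` and the tolerance of 1e-6 are assumptions for illustration; the tolerance the CLI itself applies is not documented here.

```shell
# ratios_sum_to_one RATIOS
# Returns 0 when the comma-separated ratios sum to 1.0 within a small
# tolerance (assumed: 1e-6), and 1 otherwise.
ratios_sum_to_one() {
  awk -v ratios="$1" 'BEGIN {
    n = split(ratios, r, ",")
    sum = 0
    for (i = 1; i <= n; i++) sum += r[i]
    diff = sum - 1.0
    if (diff < 0) diff = -diff
    if (diff < 1e-6) exit 0
    exit 1
  }'
}

# Example (requires the neurenix CLI):
#   ratio=0.7,0.15,0.15
#   ratios_sum_to_one "$ratio" && neurenix dataset split data.csv --ratio "$ratio"
```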

File not found

neurenix dataset split missing.csv

Error: Input dataset 'missing.csv' not found.
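A script can also check for the input before calling the command, which is useful when a missing file should skip a step rather than abort the run. This is a minimal sketch; the helper name `require_input` and its message are hypothetical. It accepts either a file or a directory, since split takes both as input.

```shell
# require_input PATH
# Succeeds when PATH is an existing file or directory; otherwise prints a
# message to stderr and returns 1.
require_input() {
  if [ ! -e "$1" ]; then
    echo "Input dataset '$1' not found; skipping." >&2
    return 1
  fi
}

# Example (requires the neurenix CLI):
#   require_input data.csv && neurenix dataset split data.csv --ratio 0.8,0.2
```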

Use Cases

1. Download and prepare dataset

# Download
neurenix dataset download cifar10 --output data/cifar10.csv

# Split into train/val/test
neurenix dataset split data/cifar10.csv --ratio 0.7,0.15,0.15 --shuffle

2. Register custom dataset

# Register
neurenix dataset register my_data /path/to/data.csv \
  --metadata '{"description": "Custom dataset", "version": "1.0"}'

# Download later
neurenix dataset download my_data

3. Convert dataset format

# CSV to JSON
neurenix dataset convert data.csv data.json --output-format json

# CSV to NumPy
neurenix dataset convert data.csv data.npy --output-format npy

4. Create reproducible splits

neurenix dataset split full_data.csv \
  --output splits \
  --ratio 0.8,0.1,0.1 \
  --shuffle \
  --seed 42

5. Manage multiple datasets

# Register datasets
neurenix dataset register train_data data/train.csv
neurenix dataset register test_data data/test.csv

# List all
neurenix dataset list

# Get info
neurenix dataset info train_data

Best Practices

1. Always shuffle when splitting

neurenix dataset split data.csv --ratio 0.8,0.2 --shuffle --seed 42

2. Use consistent split ratios

Standard splits:

70/15/15: Balanced three-way split
80/20: Simple train/val split
80/10/10: More training data

neurenix dataset split data.csv --ratio 0.7,0.15,0.15

3. Register datasets with metadata

neurenix dataset register my_dataset data.csv \
  --metadata '{"version": "1.0", "date": "2024-01-15", "samples": 10000}'

4. Convert to efficient formats

For large datasets, use efficient formats:

# Convert to Parquet for efficient storage
neurenix dataset convert large_data.csv large_data.parquet \
  --output-format parquet

# Convert to HDF5 for numerical data
neurenix dataset convert data.csv data.h5 --output-format hdf5

5. Organize dataset directories

mkdir -p datasets/{raw,processed,splits}

# Download to raw
neurenix dataset download cifar10 --output datasets/raw/cifar10.csv

# Split into datasets/splits
neurenix dataset split datasets/raw/cifar10.csv \
  --output datasets/splits \
  --ratio 0.8,0.1,0.1

Integration Example

Complete dataset preparation pipeline:

#!/bin/bash

# 1. Register dataset
neurenix dataset register my_data https://example.com/data.csv

# 2. Download
neurenix dataset download my_data --output data/raw/data.csv

# 3. Convert to efficient format
neurenix dataset convert \
  data/raw/data.csv \
  data/processed/data.parquet \
  --output-format parquet

# 4. Split for training
neurenix dataset split \
  data/processed/data.parquet \
  --output data/splits \
  --ratio 0.7,0.15,0.15 \
  --shuffle \
  --seed 42

# 5. Preprocess
neurenix preprocess \
  --input data/splits/train \
  --output data/ready/train \
  --normalize

echo "Dataset prepared and ready for training!"
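As written, the pipeline above continues even if an earlier step fails, so the final message can be misleading. One way to guard against that is to wrap each step in a small helper that reports the step and stops on failure. This sketch assumes the CLI returns a non-zero exit status on error, which the source does not confirm; `run_step` is a hypothetical helper, not part of Neurenix.

```shell
# run_step DESCRIPTION COMMAND [ARGS...]
# Prints the step description, runs the command, and returns 1 with a
# message on stderr if the command fails.
run_step() {
  local desc="$1"
  shift
  echo "==> $desc"
  if ! "$@"; then
    echo "Step failed: $desc" >&2
    return 1
  fi
}

# Example (requires the neurenix CLI):
#   run_step "Register" neurenix dataset register my_data https://example.com/data.csv || exit 1
#   run_step "Download" neurenix dataset download my_data --output data/raw/data.csv || exit 1
```

Chaining each call with `|| exit 1` stops the script at the first failed step instead of reporting success at the end.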
