DatasetHub - Neurenix

Overview

The DatasetHub module provides utilities for loading datasets from file paths or URLs. It handles the formats listed under Supported Formats, caches downloaded files, and lets a transform function preprocess samples as they are accessed.

DatasetFormat Enum

class DatasetFormat(Enum):
    CSV = auto()
    JSON = auto()
    NUMPY = auto()
    PICKLE = auto()
    TEXT = auto()
    IMAGE = auto()
    AUDIO = auto()
    VIDEO = auto()
    SQL = auto()
    CUSTOM = auto()

Methods

from_extension

@classmethod
def from_extension(cls, extension: str) -> DatasetFormat

Determine format from file extension.

extension (str, required): File extension (e.g., 'csv', 'json', 'npy').

Returns (DatasetFormat): The corresponding dataset format.
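As a rough illustration of what extension-based inference can look like, the sketch below maps a few extensions to formats. It is not the Neurenix implementation: the `FormatSketch` enum, the `_EXTENSION_MAP` table, the normalization of case and leading dots, and the fallback to `CUSTOM` for unknown extensions are all assumptions made for this example.

```python
from enum import Enum, auto


class FormatSketch(Enum):
    """Hypothetical stand-in for DatasetFormat (subset of members)."""
    CSV = auto()
    JSON = auto()
    NUMPY = auto()
    CUSTOM = auto()

    @classmethod
    def from_extension(cls, extension: str) -> "FormatSketch":
        # Normalize so that "csv", ".csv", and "CSV" are treated alike.
        key = extension.lower().lstrip(".")
        # Assumed behavior: unknown extensions fall back to CUSTOM.
        return _EXTENSION_MAP.get(key, cls.CUSTOM)


# Assumed mapping, based on the extensions named in this page.
_EXTENSION_MAP = {
    "csv": FormatSketch.CSV,
    "json": FormatSketch.JSON,
    "jsonl": FormatSketch.JSON,
    "npy": FormatSketch.NUMPY,
    "npz": FormatSketch.NUMPY,
}
```

Whether the real method raises an error or returns a fallback for an unknown extension is not stated here, so check the behavior before relying on it.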

Dataset Class

class Dataset:
    def __init__(
        self,
        data: Any,
        format: DatasetFormat,
        name: str = None,
        metadata: Dict = None,
        transform: Callable = None
    )

Parameters

data (Any, required): The dataset content.

format (DatasetFormat, required): Format of the dataset.

name (str, optional): Name of the dataset.

metadata (Dict, optional): Additional information about the dataset.

transform (Callable, optional): Function applied to data samples when they are accessed.

Methods

__len__

def __len__(self) -> int

Return the number of samples in the dataset.

__getitem__

def __getitem__(self, idx: Union[int, slice]) -> Any

Get a sample or batch from the dataset.

idx (Union[int, slice], required): Index or slice to retrieve.

Returns (Any): The sample or batch at the specified index.
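To make the indexing contract concrete, here is a minimal sketch of a dataset-like class that supports `len()`, integer indexing, and slicing, and applies an optional transform per sample. `DatasetSketch` is a hypothetical, simplified stand-in. In particular, returning a list for a slice is an assumption; the real class may return a different container.

```python
from typing import Any, Callable, Optional, Sequence, Union


class DatasetSketch:
    """Hypothetical, simplified stand-in for Dataset."""

    def __init__(
        self,
        data: Sequence[Any],
        transform: Optional[Callable[[Any], Any]] = None,
    ) -> None:
        self.data = data
        self.transform = transform

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, idx: Union[int, slice]) -> Any:
        if isinstance(idx, slice):
            # Assumed: a slice yields a list of individually transformed samples.
            return [self._apply(sample) for sample in self.data[idx]]
        return self._apply(self.data[idx])

    def _apply(self, sample: Any) -> Any:
        # Leave the sample untouched when no transform was given.
        return self.transform(sample) if self.transform is not None else sample


ds = DatasetSketch([1.0, 2.0, 3.0], transform=lambda x: x * 2.0)
first = ds[0]    # a single transformed sample
batch = ds[0:2]  # a list of transformed samples
```

Applying the transform at access time, not at construction, keeps the stored data unchanged.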

to_tensor

def to_tensor(self, framework: str = 'auto') -> Any

Convert the dataset to a tensor.

framework (str, default: "auto"): The framework to use ('torch', 'tensorflow', 'numpy', or 'auto').

Returns (Any): Tensor representation of the dataset.
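This page does not say how 'auto' picks a framework. One plausible approach, sketched below purely as an assumption, is to choose the first supported framework that is installed. The function `resolve_framework`, the preference order in `_CANDIDATES`, and the errors raised are all hypothetical and are not part of the documented API.

```python
from importlib.util import find_spec
from typing import Callable

# Assumed preference order for 'auto'; the library may choose differently.
_CANDIDATES = ("torch", "tensorflow", "numpy")


def _is_installed(module_name: str) -> bool:
    # find_spec returns None when a top-level module cannot be found.
    return find_spec(module_name) is not None


def resolve_framework(
    framework: str = "auto",
    is_installed: Callable[[str], bool] = _is_installed,
) -> str:
    """Return a concrete framework name for the requested one."""
    if framework != "auto":
        if framework not in _CANDIDATES:
            raise ValueError(f"Unknown framework: {framework!r}")
        return framework
    for name in _CANDIDATES:
        if is_installed(name):
            return name
    raise RuntimeError("No supported tensor framework is installed")
```

The `is_installed` parameter exists only so the selection logic can be exercised without the frameworks present.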

DatasetHub Class

class DatasetHub:
    def __init__(self, cache_dir: Optional[str] = None)

Main class for loading and managing datasets.

Parameters

cache_dir (Optional[str]): Directory in which downloaded datasets are cached. If None, the default cache directory is used.

Methods

load_dataset

def load_dataset(
    self,
    source: Union[str, Path],
    format: Optional[DatasetFormat] = None,
    download: bool = True,
    **kwargs
) -> Dataset

Load a dataset from a file path or URL.

source (Union[str, Path], required): File path or URL to load the dataset from.

format (Optional[DatasetFormat]): Format of the dataset. If None, it is inferred from the file extension.

download (bool, default: True): Whether to download the dataset if the source is a URL.

Returns (Dataset): The loaded dataset.
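A loader like this has to decide whether the source is remote and where a downloaded copy should live. The sketch below shows one way to do both with the standard library. The helpers `is_url` and `cache_path_for` are hypothetical, and the hashed file-name scheme is an assumption; it does not describe the actual Neurenix cache layout.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse


def is_url(source: str) -> bool:
    # Treat only http(s) sources as remote in this sketch.
    return urlparse(str(source)).scheme in ("http", "https")


def cache_path_for(url: str, cache_dir: str) -> Path:
    # Hash the URL so that different URLs sharing a file name do not collide,
    # and keep the original suffix so the format can still be inferred.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    suffix = Path(urlparse(url).path).suffix
    return Path(cache_dir) / f"{digest}{suffix}"


target = cache_path_for("https://example.com/data.csv", "./my_cache")
```

Because the path is derived deterministically from the URL, a later call for the same URL can check whether `target` already exists before downloading again.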

register_dataset

def register_dataset(
    self,
    name: str,
    source: Union[str, Path],
    format: DatasetFormat,
    metadata: Optional[Dict] = None
) -> None

Register a dataset under a name so that it can later be loaded by that name.

name (str, required): Name to register the dataset under.

source (Union[str, Path], required): File path or URL of the dataset.

format (DatasetFormat, required): Format of the dataset.

metadata (Optional[Dict]): Additional metadata about the dataset.
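Conceptually, registration is a name-to-source lookup that the loader consults first. The following sketch illustrates that idea with a plain dictionary. `RegistrySketch`, `RegistryEntry`, and the rule that unregistered strings pass through unchanged are assumptions for illustration, not the library's internals.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class RegistryEntry:
    """Hypothetical record stored for each registered dataset."""
    source: str
    format: str
    metadata: Dict = field(default_factory=dict)


class RegistrySketch:
    def __init__(self) -> None:
        self._entries: Dict[str, RegistryEntry] = {}

    def register(
        self,
        name: str,
        source: str,
        format: str,
        metadata: Optional[Dict] = None,
    ) -> None:
        # Copy the metadata so later caller-side mutations do not leak in.
        self._entries[name] = RegistryEntry(source, format, dict(metadata or {}))

    def resolve(self, source: str) -> str:
        # A registered name wins; anything else is passed through unchanged.
        entry = self._entries.get(source)
        return entry.source if entry is not None else source


registry = RegistrySketch()
registry.register("my_dataset", "https://example.com/dataset.json", "JSON")
```

With this design, a file that happens to share a registered name would be shadowed by the registry entry, so unique dataset names are the safer choice.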

Convenience Functions

load_dataset

def load_dataset(*args, **kwargs) -> Dataset

Convenience function to load a dataset using the default DatasetHub instance.

register_dataset

def register_dataset(*args, **kwargs)

Convenience function to register a dataset using the default DatasetHub instance.

Example Usage

import neurenix as nx
from neurenix.data import DatasetHub, Dataset, DatasetFormat, load_dataset

# Load a CSV dataset
dataset = load_dataset(
    "https://example.com/data.csv",
    format=DatasetFormat.CSV
)

print(f"Dataset size: {len(dataset)}")

# Access individual samples
sample = dataset[0]
print(f"First sample: {sample}")

# Access a batch
batch = dataset[0:10]
print(f"Batch size: {len(batch)}")

# Load a NumPy dataset
numpy_dataset = load_dataset(
    "data.npy",
    format=DatasetFormat.NUMPY
)

# Convert to tensor (documented frameworks: 'torch', 'tensorflow', 'numpy', 'auto')
tensor_data = numpy_dataset.to_tensor(framework='numpy')

# Create DatasetHub with custom cache directory
hub = DatasetHub(cache_dir="./my_cache")

# Register a dataset
hub.register_dataset(
    name="my_dataset",
    source="https://example.com/dataset.json",
    format=DatasetFormat.JSON,
    metadata={"version": "1.0", "author": "Neurenix Team"}
)

# Load registered dataset by name
registered_dataset = hub.load_dataset("my_dataset")

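# Wrap in-memory data with a custom transformation
import numpy as np

def preprocess(sample):
    # Custom preprocessing logic: scale each sample
    return sample * 2.0

# Placeholder data for illustration
my_data = np.arange(10, dtype=np.float32)

dataset = Dataset(
    data=my_data,
    format=DatasetFormat.NUMPY,
    transform=preprocess
)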

# Transformed data is returned when accessing samples
transformed_sample = dataset[0]

# Load image dataset
image_dataset = load_dataset(
    "images/",
    format=DatasetFormat.IMAGE
)

# Load from Parquet (DatasetFormat has no Parquet member, so CUSTOM is passed)
parquet_dataset = load_dataset(
    "large_dataset.parquet",
    format=DatasetFormat.CUSTOM
)

# Use with DataLoader for training
from neurenix.data import DataLoader

train_loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4
)

for batch in train_loader:
    # Training logic
    pass

Supported Formats

CSV: Comma-separated values files
JSON: JSON and JSONL formats
NumPy: .npy and .npz arrays
Pickle: Python pickle files
Text: Plain text files
Images: JPG, PNG, BMP, GIF
Audio: WAV, MP3, OGG, FLAC
Video: MP4, AVI, MOV, MKV
SQL: SQLite databases

Best Practices

Caching: DatasetHub automatically caches downloaded datasets. Pass a persistent cache_dir so that repeated runs reuse the downloaded files instead of fetching them again.

Transformations: Pass a transform to the Dataset constructor so that preprocessing is applied automatically whenever samples are accessed.

Large datasets: For datasets that don't fit in memory, prefer lazy loading or a format that can be read in chunks, such as Parquet.
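The lazy-loading advice can be illustrated without any framework: read a file in fixed-size batches instead of materializing every row. The sketch below uses only the standard library. `iter_csv_batches` is a hypothetical helper, not a Neurenix function, and the in-memory `StringIO` stands in for a large file on disk.

```python
import csv
import io
from typing import Iterator, List, TextIO


def iter_csv_batches(handle: TextIO, batch_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most batch_size rows without reading the whole file."""
    reader = csv.DictReader(handle)
    batch: List[dict] = []
    for row in reader:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


# In-memory stand-in for a large CSV file.
sample = io.StringIO("x,y\n1,2\n3,4\n5,6\n")
batches = list(iter_csv_batches(sample, batch_size=2))
```

Only one batch is held in memory at a time while iterating; calling `list()` here is just to inspect the result on a tiny input.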
