Overview
The DatasetHub module provides utilities for loading datasets from URLs or file paths. It supports formats such as CSV, JSON, NumPy, and pickle, and lets you attach a transform that preprocesses samples as they are accessed.
class DatasetFormat(Enum):
    CSV = auto()
    JSON = auto()
    NUMPY = auto()
    PICKLE = auto()
    TEXT = auto()
    IMAGE = auto()
    AUDIO = auto()
    VIDEO = auto()
    SQL = auto()
    CUSTOM = auto()
Methods
from_extension
@classmethod
def from_extension(cls, extension: str) -> DatasetFormat
Determine format from file extension.
extension (str, required): File extension (e.g., 'csv', 'json', 'npy').
Returns: The corresponding dataset format.
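As a rough illustration of extension-based detection, the sketch below declares a small stand-in enum and maps extensions to its members. The mapping table, the handling of a leading dot, and the fallback to CUSTOM for unknown extensions are assumptions made for this example, not confirmed library behavior.

```python
from enum import Enum, auto


class DatasetFormat(Enum):
    """Stand-in for the documented enum (subset of members)."""
    CSV = auto()
    JSON = auto()
    NUMPY = auto()
    PICKLE = auto()
    TEXT = auto()
    CUSTOM = auto()

    @classmethod
    def from_extension(cls, extension: str) -> "DatasetFormat":
        # Hypothetical mapping; the real table may differ.
        mapping = {
            "csv": cls.CSV,
            "json": cls.JSON,
            "jsonl": cls.JSON,
            "npy": cls.NUMPY,
            "npz": cls.NUMPY,
            "pkl": cls.PICKLE,
            "txt": cls.TEXT,
        }
        # Normalize ".CSV" and "csv" to the same key.
        key = extension.lower().lstrip(".")
        # Assumed fallback: unknown extensions map to CUSTOM.
        return mapping.get(key, cls.CUSTOM)


print(DatasetFormat.from_extension(".NPY"))
```

Normalizing the extension before the lookup keeps the mapping table small and makes the call tolerant of inputs taken directly from a file name.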
Dataset Class
class Dataset:
    def __init__(
        self,
        data: Any,
        format: DatasetFormat,
        name: str = None,
        metadata: Dict = None,
        transform: Callable = None
    )
Parameters
metadata (Dict, optional): Additional information about the dataset.
transform (Callable, optional): Function to transform data samples.
Methods
__len__
Return the number of samples in the dataset.
__getitem__
def __getitem__(self, idx: Union[int, slice]) -> Any
Get a sample or batch from the dataset.
idx (Union[int, slice], required): Index or slice to retrieve.
Returns: The sample or batch at the specified index.
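To make the indexing contract concrete, here is a minimal sketch of a sequence-backed container that supports both integer and slice access. MiniDataset and its _apply helper are hypothetical names, and applying the transform per sample for slices is an assumption rather than documented behavior.

```python
from typing import Any, Callable, Optional, Sequence, Union


class MiniDataset:
    """Hypothetical stand-in for Dataset, backed by an in-memory sequence."""

    def __init__(self, data: Sequence[Any],
                 transform: Optional[Callable[[Any], Any]] = None):
        self.data = data
        self.transform = transform

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, idx: Union[int, slice]) -> Any:
        if isinstance(idx, slice):
            # Assumed behavior: a slice yields a list of transformed samples.
            return [self._apply(sample) for sample in self.data[idx]]
        return self._apply(self.data[idx])

    def _apply(self, sample: Any) -> Any:
        # Pass samples through unchanged when no transform was given.
        return self.transform(sample) if self.transform else sample


ds = MiniDataset([1.0, 2.0, 3.0], transform=lambda x: x * 2.0)
first = ds[0]
batch = ds[0:2]
```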
to_tensor
def to_tensor(self, framework: str = 'auto') -> Any
Convert the dataset to a tensor.
framework (str, default 'auto'): The framework to use ('torch', 'tensorflow', 'numpy', or 'auto').
Returns: Tensor representation of the dataset.
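One plausible way to handle the 'auto' option is to probe for installed frameworks in a fixed order. The sketch below assumes that order (torch, then tensorflow, then numpy) and uses a hypothetical resolve_framework helper; how the library actually resolves 'auto' is not stated here.

```python
import importlib.util
from typing import Callable, Optional

# Assumed preference order for 'auto'; the library may choose differently.
CANDIDATES = ("torch", "tensorflow", "numpy")


def _installed(name: str) -> bool:
    # find_spec returns None when a top-level module cannot be located.
    return importlib.util.find_spec(name) is not None


def resolve_framework(framework: str = "auto",
                      is_available: Optional[Callable[[str], bool]] = None) -> str:
    """Return a concrete framework name for the requested one."""
    probe = is_available or _installed
    if framework != "auto":
        # An explicit choice passes through unchanged.
        return framework
    for name in CANDIDATES:
        if probe(name):
            return name
    raise RuntimeError("No supported tensor framework is installed")


# Inject a fake probe so the example does not depend on installed packages.
chosen = resolve_framework("auto", is_available=lambda name: name == "numpy")
```

Accepting the probe as a parameter keeps the resolution logic testable without installing any of the frameworks.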
DatasetHub Class
class DatasetHub:
    def __init__(self, cache_dir: Optional[str] = None)
Main class for loading and managing datasets.
Parameters
cache_dir (Optional[str], default None): Directory to cache downloaded datasets. If None, uses the default cache directory.
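The following sketch shows one way a cache directory could map URLs to local files. The default location, the cache_path_for helper, and the hash-based file naming are all assumptions for illustration; the library's real default directory and naming scheme are not documented here.

```python
import hashlib
from pathlib import Path
from typing import Optional

# Hypothetical default location, used only by this example.
DEFAULT_CACHE_DIR = Path.home() / ".cache" / "datasethub_example"


def cache_path_for(url: str, cache_dir: Optional[str] = None) -> Path:
    """Map a URL to a stable file path inside the cache directory."""
    root = Path(cache_dir) if cache_dir is not None else DEFAULT_CACHE_DIR
    # Hash the URL so different URLs with the same file name do not collide.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    # Keep the original suffix so the format can still be inferred later.
    suffix = Path(url.split("?", 1)[0]).suffix
    return root / f"{digest}{suffix}"


path = cache_path_for("https://example.com/data.csv", cache_dir="./my_cache")
```

Because the path depends only on the URL and the cache directory, a second run with the same arguments finds the previously downloaded file.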
Methods
load_dataset
def load_dataset(
    self,
    source: Union[str, Path],
    format: Optional[DatasetFormat] = None,
    download: bool = True,
    **kwargs
) -> Dataset
Load a dataset from a file path or URL.
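source (Union[str, Path], required): File path or URL to load the dataset from.
format (Optional[DatasetFormat], default None): Format of the dataset. If None, inferred from the file extension.
download (bool, default True): Whether to download the dataset if it's a URL.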
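Inferring the format requires deciding whether the source is a URL and extracting its extension. A minimal sketch of those two steps follows; is_url and extension_of are hypothetical helpers, and the set of URL schemes treated as remote is an assumption.

```python
from pathlib import Path, PurePosixPath
from typing import Union
from urllib.parse import urlparse


def is_url(source: Union[str, Path]) -> bool:
    """Treat http(s) and ftp sources as remote; everything else as local."""
    return urlparse(str(source)).scheme in ("http", "https", "ftp")


def extension_of(source: Union[str, Path]) -> str:
    """Return the lowercase extension without the dot, or '' if none."""
    text = str(source)
    # For URLs, look only at the path so query strings are ignored.
    path = urlparse(text).path if is_url(text) else text
    return PurePosixPath(path).suffix.lstrip(".").lower()


remote_ext = extension_of("https://example.com/data.CSV?token=1")
local_ext = extension_of(Path("data.npy"))
```

A source with no extension, such as a directory, yields an empty string, which is the case where passing format explicitly becomes necessary.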
register_dataset
def register_dataset(
    self,
    name: str,
    source: Union[str, Path],
    format: DatasetFormat,
    metadata: Optional[Dict] = None
) -> None
Register a dataset for easy loading by name.
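name (str, required): Name to register the dataset under.
source (Union[str, Path], required): File path or URL of the dataset.
metadata (Optional[Dict], default None): Additional metadata about the dataset.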
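Registration by name can be pictured as a dictionary lookup that runs before the source is treated as a path or URL. The sketch below uses hypothetical MiniRegistry and RegistryEntry classes and represents the format as a plain string; it is not the library's implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class RegistryEntry:
    source: str
    format: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class MiniRegistry:
    """Hypothetical name-to-source registry."""

    def __init__(self) -> None:
        self._entries: Dict[str, RegistryEntry] = {}

    def register(self, name: str, source: str, format: str,
                 metadata: Optional[Dict[str, Any]] = None) -> None:
        # Copy the metadata so later caller-side edits do not leak in.
        self._entries[name] = RegistryEntry(source, format, dict(metadata or {}))

    def resolve(self, name_or_source: str) -> RegistryEntry:
        # Registered names take priority; anything else is a raw source.
        if name_or_source in self._entries:
            return self._entries[name_or_source]
        return RegistryEntry(name_or_source, format="unknown")


registry = MiniRegistry()
registry.register("my_dataset", "https://example.com/dataset.json", "json",
                  metadata={"version": "1.0"})
entry = registry.resolve("my_dataset")
```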
Convenience Functions
load_dataset
def load_dataset(*args, **kwargs) -> Dataset
Convenience function to load a dataset using the default DatasetHub instance.
register_dataset
def register_dataset(*args, **kwargs)
Convenience function to register a dataset using the default DatasetHub instance.
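Module-level convenience functions of this kind are commonly built on a lazily created shared instance. The sketch below shows that pattern with a hypothetical MiniHub class and get_default_hub helper; whether the library creates its default instance lazily or at import time is an assumption.

```python
from typing import Any, List, Optional


class MiniHub:
    """Hypothetical hub that records what it was asked to load."""

    def __init__(self) -> None:
        self.loaded: List[str] = []

    def load_dataset(self, source: str, **kwargs: Any) -> str:
        self.loaded.append(source)
        return f"dataset:{source}"


_default_hub: Optional[MiniHub] = None


def get_default_hub() -> MiniHub:
    # Create the shared instance lazily, on first use.
    global _default_hub
    if _default_hub is None:
        _default_hub = MiniHub()
    return _default_hub


def load_dataset(*args: Any, **kwargs: Any) -> str:
    # Forward everything to the shared instance.
    return get_default_hub().load_dataset(*args, **kwargs)


result = load_dataset("a.csv")
```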
Example Usage
import neurenix as nx
from neurenix.data import DatasetHub, Dataset, DatasetFormat, load_dataset
# Load a CSV dataset
dataset = load_dataset(
    "https://example.com/data.csv",
    format=DatasetFormat.CSV
)
print(f"Dataset size: {len(dataset)}")
# Access individual samples
sample = dataset[0]
print(f"First sample: {sample}")
# Access a batch
batch = dataset[0:10]
print(f"Batch size: {len(batch)}")
# Load a NumPy dataset
numpy_dataset = load_dataset(
    "data.npy",
    format=DatasetFormat.NUMPY
)
# Convert to tensor
tensor_data = numpy_dataset.to_tensor(framework='neurenix')
# Create DatasetHub with custom cache directory
hub = DatasetHub(cache_dir="./my_cache")
# Register a dataset
hub.register_dataset(
    name="my_dataset",
    source="https://example.com/dataset.json",
    format=DatasetFormat.JSON,
    metadata={"version": "1.0", "author": "Neurenix Team"}
)
# Load registered dataset by name
registered_dataset = hub.load_dataset("my_dataset")
# Load with custom transformations
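import numpy as np
def preprocess(sample):
    # Custom preprocessing logic
    return sample * 2.0
# Placeholder in-memory data so the example is self-contained
my_data = np.arange(10, dtype=np.float32)
dataset = Dataset(
    data=my_data,
    format=DatasetFormat.NUMPY,
    transform=preprocess
)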
# Transformed data is returned when accessing samples
transformed_sample = dataset[0]
# Load image dataset
image_dataset = load_dataset(
    "images/",
    format=DatasetFormat.IMAGE
)
# Load from Parquet (for large datasets)
parquet_dataset = load_dataset(
    "large_dataset.parquet",
    format=DatasetFormat.CUSTOM
)
# Use with DataLoader for training
from neurenix.data import DataLoader
train_loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4
)
for batch in train_loader:
    # Training logic
    pass
Supported Formats
CSV: Comma-separated values files
JSON: JSON and JSONL formats
NumPy: .npy and .npz arrays
Pickle: Python pickle files
Best Practices
Caching: DatasetHub automatically caches downloaded datasets. Set cache_dir to a persistent directory so downloads are reused across runs.
Transformations: Pass a transform to the Dataset constructor so preprocessing is applied automatically when samples are accessed.
Large datasets: For datasets that don't fit in memory, use lazy loading or a format that can be read in chunks, such as Parquet.
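As an illustration of lazy loading, the sketch below reads a CSV source in fixed-size batches using only the standard library, so that only one batch is held in memory at a time. The iter_csv_batches helper is hypothetical and is not part of the DatasetHub API.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterator, List, TextIO


def iter_csv_batches(handle: TextIO,
                     batch_size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of rows so only one batch is in memory at a time."""
    reader = csv.DictReader(handle)
    while True:
        # islice pulls at most batch_size rows from the reader.
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# In-memory text stands in for a large file opened with open(path, newline="").
sample = io.StringIO("x,y\n1,2\n3,4\n5,6\n")
batches = list(iter_csv_batches(sample, batch_size=2))
```

Collecting the batches into a list is done here only to inspect the result; in a real pipeline each batch would be processed and discarded inside a loop.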