The preprocess command transforms raw input data into a format suitable for model training, with support for normalization, resizing, augmentation, and dataset splitting.

Usage

neurenix preprocess --input <data_path> --output <output_path> [options]

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --input | string | required | Input data file or directory |
| --output | string | required | Output directory for processed data |
| --config | string | None | Preprocessing configuration file |
| --normalize | flag | false | Normalize data |
| --resize | string | None | Resize images to WxH (e.g., 224x224) |
| --augment | flag | false | Apply data augmentation |
| --split | string | None | Split data into train/val/test (e.g., 0.7,0.15,0.15) |

Examples

Basic preprocessing

neurenix preprocess --input data/raw --output data/processed
Loading data from data/raw...
Preprocessing data...
Saving processed data (1000 samples)...
Preprocessing completed successfully. Results saved to data/processed

Normalize data

neurenix preprocess \
  --input data/raw.csv \
  --output data/normalized \
  --normalize
Loading data from data/raw.csv...
Preprocessing data...
Saving processed data (1000 samples)...
Preprocessing completed successfully. Results saved to data/normalized

Resize images

neurenix preprocess \
  --input images/raw \
  --output images/processed \
  --resize 224x224
Loading data from images/raw...
Preprocessing data...
Saving processed data (5000 samples)...
Preprocessing completed successfully. Results saved to images/processed

Apply data augmentation

neurenix preprocess \
  --input data/train \
  --output data/augmented \
  --augment
Loading data from data/train...
Preprocessing data...
Saving processed data (1500 samples)...
Preprocessing completed successfully. Results saved to data/augmented

Split dataset

neurenix preprocess \
  --input data/full_dataset.csv \
  --output data/splits \
  --split 0.7,0.15,0.15
Loading data from data/full_dataset.csv...
Preprocessing data...
Saving training data (700 samples)...
Saving validation data (150 samples)...
Saving test data (150 samples)...
Preprocessing completed successfully. Results saved to data/splits

Combined preprocessing

neurenix preprocess \
  --input images/raw \
  --output images/ready \
  --resize 224x224 \
  --normalize \
  --augment \
  --split 0.8,0.2
Loading data from images/raw...
Preprocessing data...
Saving training data (4000 samples)...
Saving validation data (1000 samples)...
Preprocessing completed successfully. Results saved to images/ready

Use configuration file

neurenix preprocess \
  --input data/raw \
  --output data/processed \
  --config configs/preprocess.json
Loading data from data/raw...
Preprocessing data...
Saving processed data (1000 samples)...
Preprocessing completed successfully. Results saved to data/processed

Configuration File

Create a JSON configuration file for complex preprocessing pipelines:
{
  "normalize": true,
  "resize": {
    "width": 224,
    "height": 224
  },
  "augment": {
    "rotation": 15,
    "flip_horizontal": true,
    "brightness": 0.2,
    "contrast": 0.2
  },
  "split": [0.7, 0.15, 0.15]
}
Save it (here as preprocess_config.json) and pass it with --config:
neurenix preprocess \
  --input data/raw \
  --output data/processed \
  --config preprocess_config.json
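
When configuration files are produced by a script (for example, one file per experiment), writing them with a JSON library avoids hand-editing mistakes. The sketch below builds the settings shown above using only keys that appear on this page. The helper name `write_preprocess_config` and its sanity checks are assumptions for illustration and are not part of neurenix.

```python
import json
from pathlib import Path


def write_preprocess_config(path, *, width, height, split, normalize=True, augment=None):
    """Write a preprocessing config using only the keys shown on this page.

    Hypothetical helper: neurenix does not ship this function.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Tolerance because sums of decimal fractions are not always exact in floats.
    if abs(sum(split) - 1.0) > 1e-9:
        raise ValueError("split ratios must sum to 1.0")

    config = {
        "normalize": normalize,
        "resize": {"width": width, "height": height},
    }
    if augment is not None:
        # Either True or a dict of augmentation parameters, as in the examples.
        config["augment"] = augment
    config["split"] = list(split)

    Path(path).write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Calling `write_preprocess_config("preprocess_config.json", width=224, height=224, split=[0.7, 0.15, 0.15])` writes a file that can then be passed to --config as shown above.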

Data Splitting

With --split, each subset is written to its own subdirectory of the output directory:

Two-way split (train/val)

neurenix preprocess --input data.csv --output data --split 0.8,0.2
Creates:
data/
├── train/
│   └── train_data.csv
└── val/
    └── val_data.csv

Three-way split (train/val/test)

neurenix preprocess --input data.csv --output data --split 0.7,0.15,0.15
Creates:
data/
├── train/
│   └── train_data.csv
├── val/
│   └── val_data.csv
└── test/
    └── test_data.csv
Split ratios must sum to 1.0, for example 0.8,0.2 or 0.7,0.15,0.15. Two values produce train/val splits; three values produce train/val/test splits.
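
The ratio rule and the sample counts in the Split dataset example (700/150/150 from 1000 samples) can be checked with a small sketch. This is not neurenix's implementation: the function names `parse_split` and `split_counts`, the floating-point tolerance, and the choice to give any rounding remainder to the training set are all assumptions made for illustration.

```python
import math


def parse_split(spec):
    """Parse a --split style string such as "0.7,0.15,0.15" (hypothetical helper)."""
    try:
        ratios = [float(part) for part in spec.split(",")]
    except ValueError:
        raise ValueError(f"non-numeric ratio in {spec!r}") from None
    if len(ratios) not in (2, 3):
        raise ValueError("expected two (train/val) or three (train/val/test) ratios")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    # Compare with a tolerance: sums of decimal fractions are not always
    # exact in binary floating point.
    if not math.isclose(sum(ratios), 1.0, abs_tol=1e-9):
        raise ValueError(f"ratios sum to {sum(ratios)}, expected 1.0")
    return ratios


def split_counts(n_samples, ratios):
    """Turn ratios into per-split sample counts that add up to n_samples."""
    # Round every split except the first, then give the remainder to train
    # (an assumed tie-breaking rule, chosen so no sample is dropped).
    rest = [round(n_samples * r) for r in ratios[1:]]
    return [n_samples - sum(rest)] + rest
```

With these definitions, `split_counts(1000, parse_split("0.7,0.15,0.15"))` gives the train, validation, and test sizes, and `parse_split("0.5,0.3")` raises a `ValueError` because the ratios sum to 0.8.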

Preprocessing Configuration Output

The preprocessing settings are saved to preprocess_config.json in the output directory:
{
  "normalize": true,
  "resize": {
    "width": 224,
    "height": 224
  },
  "augment": true,
  "split": [0.8, 0.2]
}
This allows you to reproduce the preprocessing pipeline later.
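
One way to reuse the saved file is to rebuild the equivalent command-line flags from it. The sketch below assumes the saved JSON has exactly the shape shown above and maps each key to the flag documented in the Options table. `flags_from_config` is a hypothetical helper, not a neurenix function. This page does not state whether the saved file can be passed straight back through --config, so the sketch goes through the individual flags instead.

```python
import json


def flags_from_config(config):
    """Map a saved preprocessing config to the documented CLI flags."""
    flags = []
    resize = config.get("resize")
    if resize:
        flags += ["--resize", f"{resize['width']}x{resize['height']}"]
    if config.get("normalize"):
        flags.append("--normalize")
    if config.get("augment"):
        # A dict of augmentation parameters also maps to the bare flag here,
        # so the individual parameters are not carried over.
        flags.append("--augment")
    split = config.get("split")
    if split:
        flags += ["--split", ",".join(str(r) for r in split)]
    return flags


# The saved configuration shown above, inlined so the example is self-contained.
SAVED = json.loads("""
{
  "normalize": true,
  "resize": {"width": 224, "height": 224},
  "augment": true,
  "split": [0.8, 0.2]
}
""")

command = ["neurenix", "preprocess", "--input", "images/raw",
           "--output", "images/ready"] + flags_from_config(SAVED)
```

Here `command` holds the same arguments as the Combined preprocessing example and could be handed to `subprocess.run`.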

Error Handling

Input not found

neurenix preprocess --input missing.csv --output data
Error: Input 'missing.csv' not found.

Invalid resize format

neurenix preprocess --input data --output out --resize invalid
Error: Invalid resize format. Use WxH (e.g., 224x224).
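
A pre-flight check in a wrapper script can catch this before the CLI runs. The sketch below assumes the accepted format is exactly two positive integers joined by a lowercase `x`, as in the documented `224x224` example. `parse_resize` is a hypothetical helper, and the real parser may accept other forms.

```python
import re

# Assumed pattern: positive integer, lowercase "x", positive integer.
_RESIZE_RE = re.compile(r"([1-9]\d*)x([1-9]\d*)")


def parse_resize(spec):
    """Return (width, height) parsed from a WxH string such as "224x224"."""
    match = _RESIZE_RE.fullmatch(spec)
    if match is None:
        raise ValueError(f"invalid resize format {spec!r}; use WxH (e.g., 224x224)")
    return int(match.group(1)), int(match.group(2))
```

Using `fullmatch` rejects trailing characters as well as zero or negative dimensions, so values such as `224x224px` or `0x224` fail early.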

Invalid split ratio

neurenix preprocess --input data --output out --split 0.5,0.3
Error: Invalid split format. Use comma-separated values that sum to 1.0 (e.g., 0.7,0.15,0.15).

Preprocessing error

neurenix preprocess --input corrupted.csv --output out
Loading data from corrupted.csv...
Error preprocessing data: Failed to parse input data

Use Cases

1. Prepare images for training

neurenix preprocess \
  --input raw_images/ \
  --output processed_images/ \
  --resize 224x224 \
  --normalize \
  --split 0.8,0.1,0.1

2. Augment training data

neurenix preprocess \
  --input data/train \
  --output data/train_augmented \
  --augment

3. Create train/val/test splits

neurenix preprocess \
  --input full_dataset.csv \
  --output dataset_splits \
  --split 0.7,0.2,0.1

4. Standardize dataset

neurenix preprocess \
  --input raw_data.csv \
  --output standardized_data \
  --normalize

5. Complex pipeline with config

# Create config file with all settings
cat > preprocess.json << 'EOF'
{
  "normalize": true,
  "resize": {"width": 256, "height": 256},
  "augment": true,
  "split": [0.7, 0.15, 0.15]
}
EOF

# Run preprocessing
neurenix preprocess \
  --input images/ \
  --output processed/ \
  --config preprocess.json

Best Practices

1. Always split your data

Hold out validation and test sets so that model selection and final evaluation use data the model was not trained on:
neurenix preprocess \
  --input data.csv \
  --output splits \
  --split 0.7,0.15,0.15

2. Use configuration files for reproducibility

Store preprocessing settings in version control:
neurenix preprocess \
  --input data \
  --output processed \
  --config preprocessing/experiment1.json

3. Normalize numerical data

Normalize numerical features so they share a comparable scale, which generally helps gradient-based training converge:
neurenix preprocess --input data.csv --output norm_data --normalize
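
This page does not say which normalization --normalize applies. As an illustration of one common choice, the sketch below standardizes a numeric column to zero mean and unit variance (z-scores) using only the standard library. Min-max scaling to [0, 1] is another common choice. The function name `standardize` is hypothetical.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Rescale a list of numbers to zero mean and unit (population) variance."""
    mean = fmean(column)
    std = pstdev(column)
    if std == 0:
        # A constant column carries no scale information; map it to zeros
        # instead of dividing by zero.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]
```

In practice the mean and standard deviation are usually computed on the training split and then reused for the validation and test data, so that no information from held-out samples influences the transformation.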

4. Resize images consistently

Use standard image sizes for vision models:
# ResNet/VGG
neurenix preprocess --input images --output resized --resize 224x224

# Inception
neurenix preprocess --input images --output resized --resize 299x299

5. Augment only training data

Split first, then augment only the training set, so that validation and test metrics are measured on unmodified data:
# Split data
neurenix preprocess --input data --output splits --split 0.8,0.2

# Augment training data only
neurenix preprocess \
  --input splits/train \
  --output splits/train_augmented \
  --augment
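
The ordering matters because held-out samples should stay exactly as collected. The sketch below illustrates the idea in memory, with images as nested lists and a horizontal flip as the only transform (`flip_horizontal` is one of the augmentation keys in the configuration example). It is not how neurenix implements --augment, and both function names are hypothetical.

```python
def flip_horizontal(image):
    """Mirror an image given as a list of pixel rows."""
    return [row[::-1] for row in image]


def augment_training_only(splits):
    """Return new splits in which only "train" gains flipped copies."""
    augmented = dict(splits)  # shallow copy; val/test lists are reused unchanged
    train = splits["train"]
    # Build a new list so the caller's original training list is not modified.
    augmented["train"] = train + [flip_horizontal(img) for img in train]
    return augmented
```

The training set doubles in size while the validation and test sets keep their original contents.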

Pipeline Example

Complete preprocessing pipeline for image classification:
#!/bin/bash
# Stop at the first failing step (assumes neurenix exits non-zero on error)
set -euo pipefail

# 1. Preprocess and split
neurenix preprocess \
  --input raw_images/ \
  --output data/ \
  --resize 224x224 \
  --normalize \
  --split 0.7,0.15,0.15

# 2. Augment training data
neurenix preprocess \
  --input data/train \
  --output data/train_augmented \
  --augment

# 3. Train model
neurenix run train.py --config config.json
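
The same pipeline can be driven from Python when the steps need to be parameterized. The sketch below only assembles argument lists from the flags documented on this page and hands them to `subprocess.run`. `build_preprocess_cmd` and `run_pipeline` are hypothetical names, and `check=True` assumes the `neurenix` executable is on PATH and signals failure with a non-zero exit code.

```python
import subprocess


def build_preprocess_cmd(input_path, output_path, *, resize=None, normalize=False,
                         augment=False, split=None, config=None):
    """Build an argument list for `neurenix preprocess` from documented flags."""
    cmd = ["neurenix", "preprocess",
           "--input", str(input_path), "--output", str(output_path)]
    if config:
        cmd += ["--config", str(config)]
    if resize:
        cmd += ["--resize", resize]
    if normalize:
        cmd.append("--normalize")
    if augment:
        cmd.append("--augment")
    if split:
        cmd += ["--split", ",".join(str(r) for r in split)]
    return cmd


def run_pipeline():
    """Run the three steps of the shell script above, stopping on the first failure."""
    steps = [
        # 1. Preprocess and split
        build_preprocess_cmd("raw_images/", "data/", resize="224x224",
                             normalize=True, split=(0.7, 0.15, 0.15)),
        # 2. Augment training data only
        build_preprocess_cmd("data/train", "data/train_augmented", augment=True),
        # 3. Train model
        ["neurenix", "run", "train.py", "--config", "config.json"],
    ]
    for cmd in steps:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on non-zero exit
```

Passing arguments as a list avoids shell quoting issues with paths that contain spaces. Call `run_pipeline()` to execute the steps.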
