ARM Acceleration - Neurenix

Overview

Neurenix supports ARM processors for AI inference and training on ARM-based devices. The framework includes:

ARM NEON: SIMD instructions for accelerated vector operations
ARM SVE: Scalable Vector Extension for flexible vectorization
ARM Compute Library: Optimized neural network primitives
ARM Ethos-U: Neural Processing Unit for edge AI
CPU optimization: Multi-threading and cache-aware algorithms

Supported Platforms

ARM Cortex-A series (A53, A55, A57, A72, A76, A78, X1, X2)
ARM Neoverse (N1, N2, V1, V2)
Apple Silicon (M1, M2, M3 series)
Qualcomm Snapdragon
MediaTek Dimensity
NVIDIA Jetson (ARM CPU components)

Requirements

ARM processor with NEON support (ARMv7-A or later)
ARM Compute Library (optional, recommended)
GCC 9.0+ or Clang 10.0+ with ARM extensions

Installation

# Install Neurenix for ARM
pip install neurenix

# Build from source with ARM optimizations
export NEURENIX_WITH_ARM=1
export ARM_COMPUTE_LIB=/path/to/arm_compute
pip install -e .

# Install ARM Compute Library (optional)
git clone https://github.com/ARM-software/ComputeLibrary.git
cd ComputeLibrary
scons Werror=0 -j8 debug=0 neon=1 opencl=0 embed_kernels=0 os=linux arch=arm64-v8a

Device Detection

Check ARM Availability

import neurenix as nx

# Check ARM acceleration
if nx.arm.is_available():
    print("ARM acceleration is available")
    print(f"ARM devices: {nx.arm.device_count()}")
else:
    print("ARM acceleration not available")

// C++ ARM detection
#include <iostream>
#include "phynexus/hardware/arm.h"

using namespace phynexus::hardware;

if (arm_is_available()) {
    int device_count = arm_get_device_count();
    std::cout << "ARM devices: " << device_count << std::endl;
}

Get Device Properties

props = nx.arm.get_device_properties(0)
print(f"Device: {props.device_name}")
print(f"NEON support: {props.has_neon}")
print(f"SVE support: {props.has_sve}")
if props.has_sve:
    print(f"SVE vector length: {props.sve_vector_length} bits")
print(f"Ethos-U NPU: {props.has_ethos_u}")

// C++ device properties
ARMDeviceProperties props;
if (arm_get_device_properties(0, &props)) {
    std::cout << "Device: " << props.device_name << std::endl;
    std::cout << "NEON: " << (props.has_neon ? "Yes" : "No") << std::endl;
    std::cout << "SVE: " << (props.has_sve ? "Yes" : "No") << std::endl;
    if (props.has_sve) {
        std::cout << "SVE vector length: " << props.sve_vector_length << std::endl;
    }
}

ARM NEON

Overview

NEON provides 128-bit SIMD instructions for parallel processing of data:

import neurenix as nx

# Operations automatically use NEON when available
a = nx.randn(1000, device='cpu')
b = nx.randn(1000, device='cpu')
c = a + b  # Uses NEON acceleration

Explicit NEON Operations

import neurenix as nx
from neurenix.hardware import arm

# NEON-accelerated addition
a = nx.randn(1024)
b = nx.randn(1024)
c = arm.neon_add(a, b)

# NEON-accelerated multiplication
d = arm.neon_multiply(a, b)

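// C++ NEON operations
#include <vector>
#include "phynexus/hardware/arm.h"

using namespace phynexus::hardware;

// std::vector owns the buffers, so nothing leaks and the inputs are initialized
std::vector<float> a(1024, 1.0f);
std::vector<float> b(1024, 2.0f);
std::vector<float> sum(1024);
std::vector<float> product(1024);

// NEON-accelerated add
arm_neon_add(a.data(), b.data(), sum.data(), 1024);

// NEON-accelerated multiply (separate output so the sum is not overwritten)
arm_neon_multiply(a.data(), b.data(), product.data(), 1024);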

NEON Data Types

NEON supports various data types:

float32x4_t - 4x 32-bit floats
int32x4_t - 4x 32-bit integers
int16x8_t - 8x 16-bit integers
int8x16_t - 16x 8-bit integers
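
The lane count in each type name follows from dividing the 128-bit register width by the element width. A minimal sketch of that arithmetic, where `neon_lanes` is a hypothetical helper for illustration and not part of Neurenix:

```python
NEON_REGISTER_BITS = 128  # width of a full NEON (Q) register


def neon_lanes(element_bits):
    """Return how many elements of the given width fit in one 128-bit register."""
    if element_bits <= 0 or NEON_REGISTER_BITS % element_bits != 0:
        raise ValueError(f"unsupported element width: {element_bits}")
    return NEON_REGISTER_BITS // element_bits


# Reproduce the lane counts encoded in the type names listed above.
for name, bits in [("float32x4_t", 32), ("int32x4_t", 32),
                   ("int16x8_t", 16), ("int8x16_t", 8)]:
    print(f"{name}: {neon_lanes(bits)} lanes of {bits} bits")
```

Narrower element types process more values per instruction, which is one reason int8 quantized models tend to benefit from SIMD.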

ARM SVE (Scalable Vector Extension)

Overview

SVE provides vector operations with runtime-determined vector lengths:

import neurenix as nx
from neurenix.hardware import arm

# Check SVE support
if arm.has_sve():
    vector_length = arm.get_sve_vector_length()
    print(f"SVE vector length: {vector_length} bits")
    
    # SVE-accelerated operations
    a = nx.randn(2048)
    b = nx.randn(2048)
    c = arm.sve_add(a, b)

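// C++ SVE operations
#ifdef __ARM_FEATURE_SVE
#include <arm_sve.h>
#include <vector>
#include "phynexus/hardware/arm.h"

using namespace phynexus::hardware;

// std::vector owns the buffers, so nothing leaks and the inputs are initialized
std::vector<float> a(2048, 1.0f);
std::vector<float> b(2048, 2.0f);
std::vector<float> sum(2048);
std::vector<float> product(2048);

// SVE-accelerated operations (separate outputs so neither result is overwritten)
arm_sve_add(a.data(), b.data(), sum.data(), 2048);
arm_sve_multiply(a.data(), b.data(), product.data(), 2048);
#endif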

SVE Advantages

Vector length agnostic code
Future-proof for longer vectors
Improved performance on Neoverse and future ARM CPUs
Better handling of loop remainders
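
Vector-length-agnostic code and remainder handling can be illustrated without SVE hardware. The sketch below simulates, in plain Python, how an SVE-style loop processes an array in chunks whose size is known only at run time, with a per-lane predicate masking off the tail. `vla_add` and its `lanes` parameter are illustrative names, not Neurenix or ACLE APIs.

```python
def vla_add(a, b, lanes):
    """Add two equal-length lists in chunks of `lanes` elements.

    Mimics an SVE loop: the same code works for any vector length, and the
    final partial chunk is handled by a predicate instead of a scalar tail loop.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    if lanes <= 0:
        raise ValueError("lanes must be positive")
    n = len(a)
    out = [0.0] * n
    for start in range(0, n, lanes):
        # Predicate: lane i is active only while start + i is in bounds
        # (the role the WHILELT instruction plays on real hardware).
        active = [start + i < n for i in range(lanes)]
        for i, lane_on in enumerate(active):
            if lane_on:
                out[start + i] = a[start + i] + b[start + i]
    return out


a = [float(i) for i in range(10)]
b = [1.0] * 10
# 4 lanes models a 128-bit vector of float32, 16 lanes a 512-bit vector.
# The result is identical even though 10 is not a multiple of either.
assert vla_add(a, b, lanes=4) == vla_add(a, b, lanes=16)
```

The point of the design is that the loop body never mentions a fixed width, so one binary can run on implementations with different vector lengths.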

ARM Compute Library

Overview

ARM Compute Library provides highly optimized functions for computer vision and machine learning:

import neurenix as nx
from neurenix.hardware import arm

# Enable ARM Compute Library
arm.set_compute_library_enabled(True)

# Operations use ACL when beneficial
conv = nx.nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
input = nx.randn(1, 3, 224, 224)
output = conv(input)  # Uses ACL optimized convolution

Convolution with ACL

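import neurenix as nx
from neurenix.hardware.arm import acl_conv2d

# Example tensors; the weight layout (out_channels, in_channels, kH, kW) is assumed here
input = nx.randn(1, 3, 224, 224)
weights = nx.randn(64, 3, 3, 3)
bias = nx.randn(64)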

# Explicit ACL convolution
params = {
    'batch_size': 1,
    'input_channels': 3,
    'input_height': 224,
    'input_width': 224,
    'output_channels': 64,
    'kernel_height': 3,
    'kernel_width': 3,
    'stride_height': 1,
    'stride_width': 1,
    'padding_height': 1,
    'padding_width': 1
}

output = acl_conv2d(input, weights, bias, params)

// C++ ACL convolution
#include "phynexus/hardware/arm.h"

using namespace phynexus::hardware;

Conv2DParams params;
params.batch_size = 1;
params.input_channels = 3;
params.input_height = 224;
params.input_width = 224;
params.output_channels = 64;
params.kernel_height = 3;
params.kernel_width = 3;
params.stride_height = 1;
params.stride_width = 1;
params.padding_height = 1;
params.padding_width = 1;

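// input_data, weights_data, bias_data, and output_data are caller-allocated
// buffers sized to match params
arm_acl_conv2d(input_data, weights_data, bias_data, output_data, params);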
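
The output buffer for an explicit convolution has to be sized from these parameters. The following is a sketch of the standard output-size formula for a convolution without dilation, using the same field names as the parameter dictionary above. `conv2d_output_shape` is a hypothetical helper and the NCHW ordering of the result is an assumption; this page does not say whether Neurenix exposes an equivalent.

```python
def conv2d_output_shape(params):
    """Return (batch, channels, height, width) of the convolution output."""
    def out_dim(size, kernel, stride, padding):
        # Standard formula for a convolution without dilation.
        return (size + 2 * padding - kernel) // stride + 1

    out_h = out_dim(params['input_height'], params['kernel_height'],
                    params['stride_height'], params['padding_height'])
    out_w = out_dim(params['input_width'], params['kernel_width'],
                    params['stride_width'], params['padding_width'])
    return (params['batch_size'], params['output_channels'], out_h, out_w)


params = {
    'batch_size': 1, 'input_channels': 3,
    'input_height': 224, 'input_width': 224,
    'output_channels': 64,
    'kernel_height': 3, 'kernel_width': 3,
    'stride_height': 1, 'stride_width': 1,
    'padding_height': 1, 'padding_width': 1,
}
print(conv2d_output_shape(params))  # (1, 64, 224, 224)
```

With a 3x3 kernel, stride 1, and padding 1, the spatial size is preserved, so the output needs 1 * 64 * 224 * 224 elements.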

ARM Ethos-U NPU

Overview

Ethos-U is ARM’s neural processing unit for edge AI, providing:

Efficient inference for quantized models
Low power consumption
Integration with Cortex-M processors

import neurenix as nx
from neurenix.hardware import arm

# Check Ethos-U availability
if arm.has_ethos_u():
    print("Ethos-U NPU available")

    # Compile model for Ethos-U (MyModel is a user-defined model class)
    model = MyModel()
    ethos_model = arm.compile_for_ethos_u(
        model,
        input_shape=(1, 3, 224, 224),
        quantization='int8'
    )

    # Run inference on Ethos-U
    input = nx.randn(1, 3, 224, 224)
    output = ethos_model(input)

Quantization for Ethos-U

from neurenix.hardware import arm
from neurenix.quantization import quantize_model

# Quantize model for Ethos-U
quantized_model = quantize_model(
    model,
    calibration_data=calibration_loader,
    quantization_scheme='int8',
    target='ethos-u'
)

# Deploy to Ethos-U
ethos_model = arm.deploy_to_ethos_u(quantized_model)
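
To make concrete what int8 quantization does to individual values, here is a generic symmetric per-tensor scheme in plain Python. It illustrates the general technique only: the scheme that `quantize_model` applies for the `ethos-u` target is not described on this page, and `quantize_int8` and `dequantize_int8` are hypothetical names.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q, scale):
    """Map quantized integers back to approximate float values."""
    return [x * scale for x in q]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# The round-trip error of each value is bounded by half a quantization step.
assert all(abs(w - r) <= scale / 2 + 1e-12 for w, r in zip(weights, restored))
print(q, scale)
```

Calibration data matters because the scale is derived from the observed value range: a few outliers widen the range and coarsen the step for every other value.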

Multi-Threading

Configure Thread Pool

import os

import neurenix as nx

# Set number of threads for ARM operations
nx.set_num_threads(4)

# Get current thread count
num_threads = nx.get_num_threads()
print(f"Using {num_threads} threads")

# Use all available cores (os.cpu_count() can return None, so fall back to 1)
nx.set_num_threads(os.cpu_count() or 1)

Thread Affinity

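# Set thread affinity for better performance.
# Core numbering is SoC-specific; check which cores are efficiency cores
# and which are performance cores on the target device.
nx.arm.set_thread_affinity([
    [0, 1],      # Threads 0-1 on cores 0-1 (efficiency cores)
    [2, 3, 4, 5] # Threads 2-5 on cores 2-5 (performance cores)
])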

Memory Management

Aligned Allocation

# Allocate aligned memory for SIMD
tensor = nx.empty(1024, dtype=nx.float32, aligned=64)  # 64-byte alignment

// C++ aligned memory
void* ptr;
arm_malloc(&ptr, 1024 * sizeof(float));  // 64-byte aligned
// ... use memory ...
arm_free(ptr);
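
Alignment means the start address of the buffer is a multiple of the requested boundary. The sketch below shows the usual round-up arithmetic for power-of-two alignments. `align_up` is a hypothetical helper for illustration; how `arm_malloc` obtains aligned memory internally is not stated on this page.

```python
def align_up(address, alignment):
    """Round address up to the next multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    # Adding alignment - 1 and clearing the low bits rounds up.
    return (address + alignment - 1) & ~(alignment - 1)


assert align_up(1000, 64) == 1024   # unaligned address moves to the next boundary
assert align_up(1024, 64) == 1024   # aligned address is unchanged
assert align_up(1025, 64) == 1088
```

A 64-byte boundary matches the cache-line size of many ARM cores, so an aligned buffer does not straddle a cache line at its start.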

Memory Copy

// Optimized memory copy
#include <vector>

std::vector<float> src(1024, 1.0f);
std::vector<float> dst(1024);

arm_memcpy_device_to_device(dst.data(), src.data(), 1024 * sizeof(float));

Performance Optimization

Best Practices

Use appropriate data types

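# FP16 halves the tensor memory footprint relative to FP32. Compute speedups
# need hardware FP16 arithmetic (an optional extension from Armv8.2-A onward).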
model = model.half()  # Convert to FP16
input = input.half()
output = model(input)

Enable kernel fusion

nx.arm.set_fusion_enabled(True)

Optimize tensor layout

# Use NHWC layout for better cache locality
tensor = tensor.to_nhwc()  # Convert from NCHW to NHWC

Use ARM Compute Library

nx.backends.arm_compute_lib.enabled = True
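
The layout recommendation above comes down to how far apart the channels of one pixel sit in memory. The sketch below computes flat element offsets for both layouts. The two helpers are hypothetical and only illustrate the indexing; they are not Neurenix functions.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Flat element offset of (n, c, h, w) in an NCHW tensor."""
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, c, h, w, C, H, W):
    """Flat element offset of (n, c, h, w) in an NHWC tensor."""
    return ((n * H + h) * W + w) * C + c


C, H, W = 3, 224, 224

# Distance in elements between channel 0 and channel 1 of the same pixel.
print(offset_nchw(0, 1, 0, 0, C, H, W) - offset_nchw(0, 0, 0, 0, C, H, W))  # 50176
print(offset_nhwc(0, 1, 0, 0, C, H, W) - offset_nhwc(0, 0, 0, 0, C, H, W))  # 1
```

In NHWC, all channels of a pixel are contiguous, so a kernel that reduces over channels reads sequential memory instead of striding by a full image plane.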

Profiling

# Profile ARM operations
with nx.arm.profiler.profile():
    output = model(input)

# Print profiling results
print(nx.arm.profiler.key_averages())

Mobile and Edge Deployment

Model Optimization

from neurenix.mobile import optimize_for_mobile

# Optimize model for ARM mobile devices
mobile_model = optimize_for_mobile(
    model,
    target='arm',
    quantization='int8',
    optimize_for_latency=True
)

# Export for deployment
mobile_model.save('model_arm.ptl')

Android Deployment

# Export for Android (ARM)
from neurenix.mobile import export_to_android

export_to_android(
    model,
    'model.pt',
    use_neon=True,
    use_arm_compute_lib=True
)

Environment Variables

# Set number of threads
export OMP_NUM_THREADS=4
export NEURENIX_NUM_THREADS=4

# Enable ARM Compute Library
export NEURENIX_ARM_USE_ACL=1

# Set thread affinity
export NEURENIX_ARM_THREAD_AFFINITY=0-3

# Enable NEON
export NEURENIX_ARM_USE_NEON=1
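
A value such as `0-3` denotes a range of core indices. The sketch below parses that notation and shows process-level pinning with the Python standard library on Linux. The grammar accepted by `NEURENIX_ARM_THREAD_AFFINITY` beyond the `0-3` form is an assumption here (comma-separated indices and ranges, as in `taskset`), and `parse_core_list` and `pin_current_process` are hypothetical helpers.

```python
import os


def parse_core_list(spec):
    """Parse a core list such as "0-3" or "0,2,4-5" into sorted core indices."""
    cores = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-", 1))
            if lo > hi:
                raise ValueError(f"invalid range: {part}")
            cores.update(range(lo, hi + 1))
        else:
            cores.add(int(part))
    return sorted(cores)


def pin_current_process(cores):
    """Restrict the calling process to the given cores (Linux only)."""
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("sched_setaffinity is not available on this platform")
    # Only request cores this process is currently allowed to run on.
    allowed = set(cores) & os.sched_getaffinity(0)
    if not allowed:
        raise ValueError("none of the requested cores are available")
    os.sched_setaffinity(0, allowed)


print(parse_core_list("0-3"))  # [0, 1, 2, 3]
```

Calling `pin_current_process(parse_core_list("0-3"))` would restrict the process to the first four cores that are available to it.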

Benchmarking

import neurenix as nx
from neurenix.benchmark import benchmark

# Benchmark ARM operations
results = benchmark(
    model,
    input_shape=(1, 3, 224, 224),
    device='cpu',
    num_iterations=100
)

print(f"Average latency: {results.mean_ms:.2f} ms")
print(f"Throughput: {results.throughput:.2f} fps")
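
The two reported numbers are related: throughput is the batch size divided by the mean latency. The sketch below shows how such figures can be derived from raw timings using only the standard library. It is a generic illustration, not the implementation of `neurenix.benchmark`, and `time_callable` and `summarize` are hypothetical names.

```python
import statistics
import time


def time_callable(fn, num_iterations=100, warmup=10):
    """Return per-call latencies in milliseconds after a warm-up phase."""
    for _ in range(warmup):
        fn()  # warm caches and lazy initialization before measuring
    latencies_ms = []
    for _ in range(num_iterations):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def summarize(latencies_ms, batch_size=1):
    """Compute mean and median latency and throughput in items per second."""
    mean_ms = statistics.fmean(latencies_ms)
    if mean_ms <= 0:
        raise ValueError("mean latency must be positive")
    return {
        "mean_ms": mean_ms,
        "median_ms": statistics.median(latencies_ms),
        "throughput": batch_size * 1000.0 / mean_ms,
    }


# Stand-in workload; replace with a model forward pass.
stats = summarize(time_callable(lambda: sum(range(10_000))))
print(f"Average latency: {stats['mean_ms']:.4f} ms")
print(f"Throughput: {stats['throughput']:.2f} calls/s")
```

Reporting the median alongside the mean helps on mobile SoCs, where frequency scaling and thermal throttling can skew the mean with a few slow iterations.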
