ARM64 NEON SIMD Acceleration

Last Updated: January 2026

Summary

Replaced the vek library dependency on ARM64 with a native NEON SIMD implementation written in C++, targeting Apple Silicon and ARM64 servers.

Implementation

Files Created

pkg/simd/neon_simd_arm64.cpp - C++ NEON SIMD implementation
    Uses ARM NEON intrinsics (arm_neon.h)
    Optimized for the ARMv8-A architecture
    Processes 4 float32 elements per iteration (one 128-bit NEON register)
pkg/simd/neon_simd_arm64.h - C interface header for CGO bindings
pkg/simd/neon_simd.go - Go CGO bindings; build tag arm64 && cgo && !nosimd. ARM64 builds without CGO fall back to the pure-Go paths in simd_arm64.go.
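
As a rough illustration of the "4 float32 elements per iteration" pattern, the following is a minimal sketch of a NEON dot-product kernel with a C-linkage entry point. The signature is an assumption (the real declaration lives in neon_simd_arm64.h and may differ), and the preprocessor guard is added here only so the sketch also compiles on non-ARM machines, where it degrades to the scalar loop.

```cpp
#include <cstddef>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Assumed signature for illustration; see neon_simd_arm64.h for the real one.
extern "C" float neon_dot_product(const float* a, const float* b, std::size_t n) {
    std::size_t i = 0;
    float sum = 0.0f;
#if defined(__aarch64__) && defined(__ARM_NEON)
    float32x4_t acc = vdupq_n_f32(0.0f);      // four running partial sums
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);    // load a[i..i+3]
        float32x4_t vb = vld1q_f32(b + i);    // load b[i..i+3]
        acc = vmlaq_f32(acc, va, vb);         // acc += va * vb, lane by lane
    }
    sum = vaddvq_f32(acc);                    // horizontal add of the four lanes
#endif
    // Scalar tail for n % 4 leftovers (and the whole loop when NEON is absent).
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

The extern "C" linkage is what lets a CGO binding call the function by its unmangled name. The scalar tail matters for vector lengths that are not a multiple of 4.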

Functions Implemented

neon_dot_product - Dot product: sum(a[i] * b[i])
neon_norm - Euclidean norm: sqrt(sum(v[i]^2))
neon_distance - Euclidean distance: sqrt(sum((a[i] - b[i])^2))
neon_cosine_similarity - Cosine similarity: dot(a,b) / (norm(a) * norm(b))
neon_normalize_inplace - Normalize vector in-place: v[i] = v[i] / norm(v)
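
The formulas above compose from a single dot-product primitive. The sketch below spells them out as plain scalar C++, which can serve as a reference for checking a SIMD version. The ref_ names are hypothetical, and the zero-vector handling (returning 0 for cosine similarity, leaving a zero vector unchanged when normalizing) is an assumption rather than documented behavior of the package.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical scalar reference helpers; names are not from the package.
float ref_dot(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

// Euclidean norm: sqrt(sum(v[i]^2)) == sqrt(dot(v, v)).
float ref_norm(const float* v, std::size_t n) {
    return std::sqrt(ref_dot(v, v, n));
}

// Euclidean distance: sqrt(sum((a[i] - b[i])^2)).
float ref_distance(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        float d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Cosine similarity: dot(a, b) / (norm(a) * norm(b)).
float ref_cosine_similarity(const float* a, const float* b, std::size_t n) {
    float denom = ref_norm(a, n) * ref_norm(b, n);
    // Assumption: return 0 when either vector has zero norm.
    return denom == 0.0f ? 0.0f : ref_dot(a, b, n) / denom;
}

// In-place normalization: v[i] = v[i] / norm(v).
void ref_normalize_inplace(float* v, std::size_t n) {
    float norm = ref_norm(v, n);
    if (norm == 0.0f) return;  // Assumption: leave a zero vector unchanged.
    for (std::size_t i = 0; i < n; ++i) v[i] /= norm;
}
```

Comparing against a reference like this with a small tolerance, rather than exact equality, is usually necessary because a SIMD kernel sums the products in a different order.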

Performance Results

Benchmark: Dot Product (Apple M3 Max)

Vector Size	NEON SIMD	Go Reference	Speedup
128-dim	44 ns	40 ns	0.9x
256-dim	67 ns	77 ns	1.1x
512-dim	111 ns	148 ns	1.3x
1024-dim	208 ns	285 ns	1.4x
1536-dim	305 ns	428 ns	1.4x
3072-dim	598 ns	839 ns	1.4x

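Key Findings:

- NEON SIMD is about 1.4x faster than the Go reference for typical embedding sizes (1024-1536 dim)
- The speedup grows with vector size, from 1.1x at 256-dim to 1.4x at 1024-dim and above
- At 128-dim NEON is slightly slower (0.9x): call overhead outweighs the SIMD gain for very small vectors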

Build Configuration

Build Tags

NEON SIMD (ARM64): arm64 && cgo && !nosimd
Fallback (ARM64, no CGO or nosimd set): arm64 && (!cgo || nosimd) - uses the pure-Go fallback in simd_arm64.go
x86/amd64: Still uses vek library (AVX2 support)

Compiler Flags

CXXFLAGS: -O3 -march=armv8-a+simd -std=c++11
LDFLAGS: -lm
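
With these flags, GCC and Clang targeting AArch64 define the ACLE macro __ARM_NEON, which a source file can test at compile time before including arm_neon.h. The sketch below shows one way to report which path was compiled in. The simd_backend name is hypothetical and is not part of the package's interface.

```cpp
#include <string>

// Hypothetical helper: reports which code path this translation unit
// was compiled with, based on compiler-defined macros only.
inline std::string simd_backend() {
#if defined(__aarch64__) && defined(__ARM_NEON)
    return "neon";    // AArch64 with Advanced SIMD enabled
#else
    return "scalar";  // any other target, or SIMD disabled
#endif
}
```

On AArch64 toolchains Advanced SIMD is normally enabled by default, so +simd mostly makes the requirement explicit.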

Testing

✅ All tests pass:

$ go test ./pkg/simd -v
--- PASS: TestInfo (0.00s)
    SIMD Info: neon (accelerated=true, features=[NEON ARMv8-A])
--- PASS: TestDotProduct
--- PASS: TestCosineSimilarity
--- PASS: TestEuclideanDistance
--- PASS: TestNorm
--- PASS: TestNormalizeInPlace
--- PASS: TestLargeVectors
--- PASS: TestEdgeCases

Dependencies

Removed

vek library for ARM64: No longer needed on ARM64 platforms

Kept

vek library for x86/amd64: Still used for AVX2 SIMD on x86 platforms
CGO required: ARM64 NEON implementation requires CGO

Benefits

No External Dependency (ARM64): Self-contained NEON implementation
Better Performance: About 1.4x faster dot products for typical embedding sizes (1024-1536 dim) on an Apple M3 Max
Full Control: Can optimize further without waiting for upstream
Native Code: Direct NEON intrinsics, no library overhead

Future Improvements

Unroll Loops: Process 8 or 16 elements per iteration for larger vectors
FMA Instructions: Use fused multiply-add where available
Prefetching: Add memory prefetching for very large vectors
AVX2 Implementation: Replace vek for x86/amd64 as well
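
To make the first three ideas concrete, here is a minimal sketch of a dot product that processes 16 elements per iteration with four independent accumulators, uses the fused multiply-add intrinsic vfmaq_f32, and issues a software prefetch hint. The function name and the prefetch distance of 64 floats are illustrative assumptions, not tuned values, and whether any of this beats the current kernel would need to be measured. __builtin_prefetch is a GCC/Clang builtin.

```cpp
#include <cstddef>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical name; illustrates the listed ideas, not the shipped kernel.
float dot_unrolled_fma(const float* a, const float* b, std::size_t n) {
    std::size_t i = 0;
    float sum = 0.0f;
#if defined(__aarch64__) && defined(__ARM_NEON)
    // Four independent accumulators hide the latency of each FMA.
    float32x4_t acc0 = vdupq_n_f32(0.0f);
    float32x4_t acc1 = vdupq_n_f32(0.0f);
    float32x4_t acc2 = vdupq_n_f32(0.0f);
    float32x4_t acc3 = vdupq_n_f32(0.0f);
    for (; i + 16 <= n; i += 16) {
        if (i + 64 < n) {                     // stay inside the arrays
            __builtin_prefetch(a + i + 64);   // hint only; assumed distance
            __builtin_prefetch(b + i + 64);
        }
        acc0 = vfmaq_f32(acc0, vld1q_f32(a + i),      vld1q_f32(b + i));
        acc1 = vfmaq_f32(acc1, vld1q_f32(a + i + 4),  vld1q_f32(b + i + 4));
        acc2 = vfmaq_f32(acc2, vld1q_f32(a + i + 8),  vld1q_f32(b + i + 8));
        acc3 = vfmaq_f32(acc3, vld1q_f32(a + i + 12), vld1q_f32(b + i + 12));
    }
    // Combine the accumulators, then handle remaining groups of 4.
    float32x4_t acc = vaddq_f32(vaddq_f32(acc0, acc1), vaddq_f32(acc2, acc3));
    for (; i + 4 <= n; i += 4) {
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    }
    sum = vaddvq_f32(acc);                    // horizontal add of the lanes
#endif
    // Scalar tail (and full fallback when NEON is not available).
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Because the summation order differs from a sequential loop, results can differ from the reference in the last bits, so tests would typically compare with a tolerance.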

Conclusion

The ARM64 NEON SIMD implementation is complete and tested. In the dot-product benchmark it is roughly 1.4x faster than the Go reference for typical embedding sizes (1024-1536 dim). It eliminates the vek dependency on ARM64 platforms while maintaining compatibility with x86 builds.