Optimization Principles: How NPipeline Achieves High Performance

Prerequisites

Before understanding optimization principles, you should be familiar with:

NPipeline's performance advantages don't come by accident. They are the result of deliberate architectural decisions made at the framework level. This document explains the why behind NPipeline's design and how these choices combine to deliver measurable performance benefits.


The Performance Challenge

Data processing frameworks face inherent tradeoffs:

  • Flexibility (supporting diverse use cases) vs. Optimization (pre-computing for specific scenarios)
  • Developer Experience (intuitive APIs, reduced boilerplate) vs. Performance (minimal overhead, zero-cost abstractions)
  • Safety (preventing errors) vs. Speed (avoiding runtime checks)

Most frameworks compromise by choosing reasonable defaults while preserving flexibility. NPipeline takes a different approach: optimize for the most common, highest-impact scenarios while maintaining flexibility for the rest.


Core Optimization Principles

1. Plan-Based Execution Model

The Principle: Pre-compute everything that doesn't change per-item.

Traditional Approach:

// Interpreted: evaluate routing/execution strategy for every item
foreach (var item in items)
{
    var strategy = DetermineStrategy(item);
    var result = ExecuteStrategy(strategy, item);
}

NPipeline Approach:

// Compiled: determine execution plan once, execute same plan for all items
var executionPlan = CompileExecutionPlan(pipeline);
foreach (var item in items)
{
    executionPlan.Execute(item);
}
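
As a rough, hypothetical sketch (not NPipeline's actual API), a compiled plan can be thought of as an ordered array of strongly typed delegates that is built once and then reused for every item:

using System;

public sealed class ExecutionPlan<T>
{
    private readonly Func<T, T>[] _steps;

    // All routing and configuration decisions happen here, once,
    // before any data flows through the pipeline.
    public ExecutionPlan(params Func<T, T>[] steps) => _steps = steps;

    // The hot path: no lookups, no branching on configuration, just calls.
    public T Execute(T item)
    {
        foreach (var step in _steps)
        {
            item = step(item);
        }
        return item;
    }
}

// Usage: compile once, execute for every item.
// var plan = new ExecutionPlan<int>(x => x * 2, x => x + 1);
// foreach (var item in items) { plan.Execute(item); }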

Why This Matters:

  • Eliminating per-item branching reduces CPU cache misses
  • Predictable instruction patterns improve branch prediction
  • The CPU pipeline can optimize the hot path more effectively
  • In high-throughput scenarios: thousands of decisions per second become zero decisions

Impact: Measurable CPU efficiency improvement, especially on modern CPUs with deep pipelines.


2. Zero Reflection During Steady State

The Principle: Pay reflection cost upfront, then never again.

Reflection Overhead:

  • Runtime type introspection (method resolution, property access)
  • Dynamic method invocation via delegates
  • Argument marshalling and unmarshalling
  • GC pressure from temporary objects created during reflection

Traditional Approach:

// Per-item reflection: look up methods, invoke dynamically
foreach (var item in items)
{
    var method = GetMethod(item.GetType());   // ← Reflection
    method.Invoke(transform, new[] { item }); // ← Dynamic dispatch
}

NPipeline Approach:

// Plan compilation: pre-compile a delegate to the actual method, once
var compiledDelegate = CompileDelegate<T>(method);
foreach (var item in items)
{
    compiledDelegate(item); // ← Direct, compiled dispatch
}
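
One common way to achieve this in .NET, shown here as an illustrative sketch rather than NPipeline's internal code, is to pay the reflection cost once by compiling a MethodInfo into a strongly typed delegate with expression trees:

using System;
using System.Linq.Expressions;
using System.Reflection;

public static class DelegateCompiler
{
    // Builds a Func<TTarget, TArg, TResult> that invokes `method` directly.
    // Reflection runs once here; the returned delegate is JIT-compiled
    // and dispatches like an ordinary method call on the hot path.
    public static Func<TTarget, TArg, TResult> Compile<TTarget, TArg, TResult>(MethodInfo method)
    {
        var target = Expression.Parameter(typeof(TTarget), "target");
        var arg = Expression.Parameter(typeof(TArg), "arg");
        var call = Expression.Call(target, method, arg);
        return Expression.Lambda<Func<TTarget, TArg, TResult>>(call, target, arg).Compile();
    }
}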

Why This Matters:

  • Reflection is expensive: 100-1000x slower than direct method calls
  • Pre-compiled delegates are statically typed, JIT-optimizable
  • Reflection GC pressure is eliminated during steady state
  • The JIT compiler can inline delegate calls

Impact: Particularly noticeable in scenarios with millions of items, where per-item reflection overhead becomes the dominant cost.


3. ICountable Optimization for Memory Efficiency

The Principle: Know the size of your data upfront to make smart allocation decisions.

The Problem:

  • IEnumerable<T> has no size information
  • Buffers must be over-allocated or reallocated (expensive)
  • Collections often allocate larger capacity than needed (wasting memory)

NPipeline's Solution - ICountable:

public interface ICountable
{
    long Count { get; }
}

Pipes and collections implementing ICountable expose their size, enabling:

// Allocate exactly the right buffer size, no overshooting
if (input is ICountable countable)
{
    var buffer = new T[countable.Count];
    // Fill buffer with no reallocation
}
else
{
    // Fall back to dynamic resizing for unknown sizes
}
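
For illustration, a self-contained version of that logic might look like this; the helper name is hypothetical and not part of NPipeline's public API:

using System.Collections.Generic;

public static class BufferHelper
{
    // Materializes the input into an array, sizing the buffer exactly
    // when the source reports its size via ICountable.
    public static T[] ToBuffer<T>(IEnumerable<T> input)
    {
        if (input is ICountable countable)
        {
            var buffer = new T[countable.Count];
            long i = 0;
            foreach (var item in input)
            {
                buffer[i++] = item;
            }
            return buffer; // single allocation, no resizing
        }

        // Unknown size: fall back to List<T>'s doubling growth strategy.
        return new List<T>(input).ToArray();
    }
}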

Why This Matters:

  • Right-sized buffers = reduced memory waste
  • Fewer reallocations = reduced allocation pressure
  • Predictable memory usage = easier capacity planning
  • Smaller GC working set = better cache locality

Impact: Especially important for pipelines with large intermediate collections (batching, aggregation).


4. Streaming-First Design with Lazy Evaluation

The Principle: Process data incrementally, never buffer unnecessarily.

Traditional Batch Processing Approach:

Load entire dataset → Filter → Transform → Load output

Memory usage = entire dataset in memory at once

Large GC pauses when data is collected

NPipeline Streaming Approach:

Item 1 → Filter → Transform → Output
Item 2 → Filter → Transform → Output
Item 3 → Filter → Transform → Output

Memory usage = only current item + accumulated state

Tiny, predictable GC pauses

Implementation:

  • IAsyncEnumerable<T> for lazy evaluation
  • Pull-based data flow (demand-driven)
  • State is only accumulated when explicitly required (aggregation, joins)
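
A minimal sketch of this pull-based, lazy flow in user code (step names and logic are illustrative):

using System.Collections.Generic;

public static class StreamingExample
{
    // Each item flows through the filter and transform as soon as it is
    // produced; nothing is buffered while waiting for a full batch.
    public static async IAsyncEnumerable<string> Process(IAsyncEnumerable<int> source)
    {
        await foreach (var item in source)
        {
            if (item % 2 != 0)
            {
                continue; // filter: drop odd values
            }

            yield return $"value:{item}"; // transform and emit immediately
        }
    }
}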

Why This Matters:

  • Memory usage scales with state complexity, not data volume
  • GC pause times are predictable and minimal
  • Latency for the first item is low (no waiting for batch assembly)
  • Natural backpressure: slow consumers slow down producers

Impact: Enables processing of datasets far larger than available memory, with minimal latency impact.


5. Allocation Reduction and ValueTask Optimization

The Principle: Eliminate unnecessary heap allocations in hot paths.

The Problem - Traditional Async Framework:

With a Task-per-item framework, a pipeline processing 1M items/sec with 90% cache hits still allocates a Task object for every synchronously completed item: 900,000 unnecessary heap allocations per second.

NPipeline Approach - ValueTask:

ValueTask<T> is a struct-based alternative to Task<T> that:

  • Allocates on stack (not heap) when result is available synchronously
  • Zero allocations for common case in cache-hit or synchronous scenarios
  • Seamlessly transitions to true async work when needed
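
A brief sketch of the pattern with a hypothetical in-memory cache (see the dedicated guide referenced below for the full pattern and its constraints):

using System.Collections.Concurrent;
using System.Threading.Tasks;

public sealed class UserEnricher
{
    private readonly ConcurrentDictionary<int, string> _cache = new();

    // Cache hit: wrap the result in a ValueTask with no Task allocation.
    // Cache miss: fall back to genuinely asynchronous work.
    public ValueTask<string> GetNameAsync(int userId)
    {
        if (_cache.TryGetValue(userId, out var name))
        {
            return new ValueTask<string>(name); // synchronous fast path, zero heap allocations
        }

        return new ValueTask<string>(LoadAndCacheAsync(userId)); // asynchronous slow path
    }

    private async Task<string> LoadAndCacheAsync(int userId)
    {
        await Task.Delay(10); // stand-in for a database or HTTP lookup
        var name = $"user-{userId}";
        _cache[userId] = name;
        return name;
    }
}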

Why This Matters:

  • ValueTask<T> is a struct (stack-allocated)
  • For synchronous results: zero heap allocations
  • For asynchronous results: seamlessly transitions to Task<T>
  • No performance penalty for async fallback path

Measured Impact:

  • Up to 90% reduction in GC pressure in high-cache-hit scenarios
  • Smoother throughput: fewer GC pauses
  • More predictable latency: fewer "garbage spikes"

Common Scenarios:

  • Data validation (usually synchronous)
  • Filtering (usually synchronous)
  • Cached enrichment (high synchronous fast path rate)
  • These represent everyday pipeline tasks, not edge cases

For complete implementation guidance, see Synchronous Fast Paths and ValueTask Optimization, the dedicated deep-dive guide covering the full implementation pattern, real-world examples, and the critical constraints you must understand before applying it.


6. Graph-Based Execution for Dependency Clarity

The Principle: Make execution flow explicit and optimizable.

Why a Graph?

  • Clarity: Visual representation of data dependencies
  • Optimization: Can identify parallelizable segments
  • Composability: Nodes can be chained, reused, tested independently
  • Debuggability: Clear data provenance (lineage)

Execution Strategy:

  • The graph is traversed once during the "plan compilation" phase
  • Execution strategy (sequential, parallel) is determined from graph structure
  • No per-item graph traversal overhead
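
As an illustrative sketch (not NPipeline's actual planner), compiling a plan from a graph amounts to ordering nodes by their dependencies once, so execution never has to revisit the graph per item:

using System.Collections.Generic;

public static class PlanCompiler
{
    // nodes maps each node id to the ids of its upstream dependencies.
    // Returns an order in which every node appears after its dependencies
    // (a topological sort, computed exactly once during plan compilation).
    public static List<string> CompileOrder(Dictionary<string, List<string>> nodes)
    {
        var visited = new HashSet<string>();
        var order = new List<string>();

        void Visit(string node)
        {
            if (!visited.Add(node)) return;
            foreach (var dependency in nodes[node])
            {
                Visit(dependency);
            }
            order.Add(node);
        }

        foreach (var node in nodes.Keys)
        {
            Visit(node);
        }
        return order;
    }
}

// Usage: source → filter → transform → sink, ordered once at startup.
// var order = PlanCompiler.CompileOrder(new()
// {
//     ["source"] = new(),
//     ["filter"] = new() { "source" },
//     ["transform"] = new() { "filter" },
//     ["sink"] = new() { "transform" },
// });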

7. Memory Layout and Cache Efficiency

The Principle: Respect CPU cache behavior.

Key Decisions:

  • Value types for small data: structs of roughly 16 bytes or less avoid heap allocation and GC overhead
  • Array-backed collections: Better cache locality than linked lists
  • Contiguous buffers: CPU prefetcher can predict access patterns
  • Minimize indirection: Reduce pointer chasing in hot paths

Example:

// ✓ GOOD: Value type, contiguous memory when stored in arrays
public readonly struct Event
{
    public Event(long timestamp, int value)
    {
        Timestamp = timestamp;
        Value = value;
    }

    public long Timestamp { get; }
    public int Value { get; }
}

// ❌ LESS OPTIMAL: Reference type, instances scattered across the heap
public class Event
{
    public long Timestamp { get; set; }
    public int Value { get; set; }
}

How These Principles Work Together

The principles don't operate in isolation; they combine synergistically:

Plan-Based Execution
↓ Eliminates per-item decisions
├→ Enables JIT optimization
└→ Improves CPU cache behavior

Zero Reflection at Runtime
↓ Direct method dispatch
├→ Inlinable and optimizable by JIT
└→ Reduces memory allocations

ValueTask Optimization
↓ Eliminates allocations in fast paths
├→ Reduces GC pressure
└→ Smaller GC working set = better cache locality

Streaming + Lazy Evaluation
↓ Process incrementally
├→ Predictable memory usage
└→ Minimal GC pauses

ICountable for Right-Sizing
↓ Allocate exactly what's needed
├→ Fewer reallocations
└→ Better memory cache behavior

Measurable Results

The combination of these principles produces observable performance characteristics:

Metric                     Typical Benefit
GC Pause Duration          50-80% reduction vs. naive async approach
Memory Allocations         Up to 90% fewer in cache-hit scenarios
Throughput (items/sec)     2-5x improvement vs. interpreted frameworks
Latency (p99)              More predictable, fewer spikes
CPU Efficiency             Better branch prediction, cache locality

Trade-offs and When These Optimizations Matter

When Optimization Matters Most

These optimizations provide the most benefit in:

  • High-throughput scenarios: Millions of items per second
  • Multi-tenant systems: GC pauses directly impact other tenants
  • Real-time processing: Latency spikes from GC pauses are unacceptable
  • Long-running processes: Accumulated allocation pressure matters
  • Latency-sensitive workloads: Predictable performance critical

When Optimization Matters Less

These optimizations have minimal impact if:

  • Throughput is low (< 1,000 items/sec): the bottleneck is elsewhere
  • Items are large (> 100 KB): allocation cost is tiny compared to processing cost
  • Processing is CPU-bound: GC pressure is a secondary concern
  • Latency spikes are acceptable: SLAs allow for GC pauses
