Resilience Overview

Resilience in NPipeline refers to the ability of your data pipelines to detect, handle, and recover from failures without complete system breakdown. This section provides a comprehensive guide to building robust, fault-tolerant pipelines.


Where to Start

If you want to enable node restarts and common retry patterns, start here: Getting Started with Resilience

This guide provides a quick-start checklist and step-by-step instructions for configuring the three mandatory prerequisites for node restarts. This is the canonical starting point for most users.


Why Resilience Matters

In production environments, pipelines inevitably encounter failures from various sources:

  • Transient infrastructure issues: Network timeouts, database connection failures
  • Data quality problems: Invalid formats, missing values, unexpected data types
  • Resource constraints: Memory pressure, CPU saturation, I/O bottlenecks
  • External service dependencies: API rate limits, service outages, authentication failures

Without proper resilience mechanisms, these failures can cascade through your pipeline, causing data loss, system instability, and costly manual intervention.

Resilience Strategy Comparison

| Strategy | Best For | Memory Requirements | Complexity | Key Benefits |
| --- | --- | --- | --- | --- |
| Simple Retry | Transient failures (network timeouts, temporary service issues) | Low | Low | Quick recovery from temporary issues |
| Node Restart | Persistent node failures, resource exhaustion | Medium (requires materialization) | Medium | Complete recovery from node-level failures |
| Circuit Breaker | Protecting against cascading failures, external service dependencies | Low | Medium | Prevents system overload during outages |
| Dead-Letter Queues | Handling problematic items that can't be processed | Low | High | Preserves problematic data for manual review |
| Combined Approach | Production systems with multiple failure types | High | High | Comprehensive protection against all failure types |

Choosing the Right Strategy

  • For simple pipelines with basic needs: Start with Simple Retry
  • For streaming data processing: Use Node Restart with materialization
  • For external service dependencies: Add Circuit Breaker to prevent cascade failures
  • For critical data pipelines: Implement Dead-Letter Queues to preserve failed items
  • For production systems: Combine multiple strategies for comprehensive protection

Core Resilience Components

NPipeline's resilience framework is built around several interconnected components:

| Component | Role | Best For |
| --- | --- | --- |
| Getting Started with Resilience | Quick-start checklist for node restarts and retry delays | New users; configuring resilience for the first time |
| Error Handling | How to respond to failures at node and pipeline levels | Understanding error recovery strategies |
| Retry Options | Configure retry limits, delays, and materialization | Fine-tuning resilience behavior |
| Materialization & Buffering | How buffering enables replay during restarts | Understanding the replay mechanism |
| Circuit Breakers | Prevent cascading failures to external services | Protecting against external service outages |
| Dead-Letter Queues | Handle problematic items separately | Preserving failed items for manual review |
| Troubleshooting | Diagnose and resolve common resilience issues | Debugging failed configurations |

Choosing Your Resilience Approach

Simple Retry Logic: For transient failures, use error decision handlers with retry limits. See Getting Started for quick examples.
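As a language-agnostic illustration of the retry-with-limits pattern (a minimal sketch, not NPipeline's actual API; `retry`, `max_attempts`, and `delay_seconds` are hypothetical names):

```python
import time

def retry(operation, max_attempts=3, delay_seconds=0.0):
    """Call `operation`; retry on failure, up to `max_attempts` calls in total."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as err:  # in practice, catch only transient error types
            last_error = err
            if attempt < max_attempts:
                time.sleep(delay_seconds)  # wait before the next attempt
    raise last_error  # all attempts exhausted: surface the last failure

# Usage: an operation that fails twice, then succeeds on the third call.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network timeout")
    return "ok"

result = retry(flaky, max_attempts=3)
```

A bounded attempt count is what keeps this strategy cheap: a genuinely persistent failure is re-raised after the limit instead of looping forever, at which point a node restart or dead-letter queue can take over.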

Node Restart: For recovering from node-level failures, follow the 3-step checklist in Getting Started with Resilience.

Circuit Breakers: For protecting against cascading failures to external services, see Circuit Breakers.
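The circuit-breaker idea can be sketched generically (this is an illustrative pattern, not NPipeline's API; the class name and parameters `failure_threshold` and `reset_timeout` are assumptions):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: trips open after `failure_threshold`
    consecutive failures, fails fast while open, and allows one trial
    call (half-open) after `reset_timeout` seconds."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: permit one trial call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit again
        return result

# Usage: two failures trip the breaker; subsequent calls are rejected
# immediately instead of hammering the failing external service.
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=60.0)
def boom():
    raise IOError("service down")
for _ in range(2):
    try:
        breaker.call(boom)
    except IOError:
        pass
try:
    breaker.call(lambda: "ok")
    tripped = False
except RuntimeError:
    tripped = True
```

Failing fast while the circuit is open is what prevents the cascade: downstream nodes stop queuing work against an outage, and the external service gets time to recover.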

Dead-Letter Queues: For preserving problematic items, see Dead-Letter Queues.
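The dead-letter pattern itself is simple; as a generic sketch (hypothetical helper, not NPipeline's API), failed items are diverted with their error context rather than aborting the run:

```python
def process_with_dlq(items, handler):
    """Process `items` with `handler`; route failures to a dead-letter
    list with their error message, preserving them for manual review."""
    processed, dead_letters = [], []
    for item in items:
        try:
            processed.append(handler(item))
        except Exception as err:
            dead_letters.append({"item": item, "error": str(err)})
    return processed, dead_letters

# Usage: one malformed record goes to the DLQ; the rest flow through.
ok, dlq = process_with_dlq(["1", "oops", "3"], int)
```

Capturing the original item alongside the error is the key design choice: nothing is lost, and the problematic records can be inspected and replayed once the data-quality issue is fixed.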

  1. Getting Started with Resilience ← Start here for a quick checklist
  2. Error Handling ← Understand error recovery strategies
  3. Retry Options ← Fine-tune retry behavior
  4. Materialization & Buffering ← Understand how replay works
  5. Troubleshooting ← Debug issues

Advanced Topics