Preview Feature: This feature is in preview and under active development. APIs and functionality may change, so test thoroughly before using it in production.

Agent Resilience

Build reliable AI agents on Ductape's existing resilience infrastructure. Agents reference product-level healthchecks, quotas, and fallbacks so they can handle failures gracefully, distribute load across providers, and maintain high availability.

Overview

Agents get their resilience by referencing configurations defined at the product level. Three patterns are available:

Pattern	Purpose	Use Case
Healthchecks	Monitor provider availability	Detect outages, track latency
Quotas	Weighted load distribution	Balance traffic across providers
Fallbacks	Sequential failover	Automatic failover on errors

Important

Agents do not define their own healthchecks, quotas, or fallbacks. Instead, they reference configurations that already exist at the product level. This ensures consistency across your entire product and avoids duplication.

Setting Up Resilience

Step 1: Define Resilience at the Product Level

First, create your resilience configurations using the Ductape resilience API:

import Ductape from '@ductape/sdk';

const ductape = new Ductape({...});

// Create a healthcheck (the 'openai-health' tag referenced below is assumed to be defined the same way)
await ductape.resilience.healthcheck.create('my-product', {
  tag: 'anthropic-health',
  name: 'Anthropic API Health',
  probe: {
    type: 'app',
    app: 'anthropic-app',
    event: 'health-check',
  },
  interval: 30000, // Check every 30 seconds
  retries: 2,
  envs: [
    { slug: 'production' },
    { slug: 'staging' },
  ],
  onFailure: {
    notifications: [{
      notification: 'alerts',
      message: 'llm-provider-down',
      channels: {
        email: { recipients: ['oncall@company.com'] },
      },
    }],
  },
});

// Create a quota for load balancing
await ductape.resilience.quota.create('my-product', {
  tag: 'llm-quota',
  name: 'LLM Provider Quota',
  input: {
    prompt: { type: 'string', required: true },
  },
  options: [
    {
      provider: 'anthropic-primary',
      app: 'anthropic-app',
      event: 'generate',
      quota: 70, // 70% of traffic
      healthcheck: 'anthropic-health',
      retries: 2,
    },
    {
      provider: 'openai-secondary',
      app: 'openai-app',
      event: 'generate',
      quota: 30, // 30% of traffic
      healthcheck: 'openai-health',
      retries: 2,
    },
  ],
});

// Create a fallback chain
await ductape.resilience.fallback.create('my-product', {
  tag: 'llm-fallback',
  name: 'LLM Fallback Chain',
  input: {
    prompt: { type: 'string', required: true },
  },
  options: [
    {
      provider: 'anthropic-primary',
      app: 'anthropic-app',
      event: 'generate',
      healthcheck: 'anthropic-health',
      retries: 3,
    },
    {
      provider: 'openai-fallback',
      app: 'openai-app',
      event: 'generate',
      healthcheck: 'openai-health',
      retries: 2,
    },
  ],
});

Step 2: Reference Resilience in Your Agent

Now reference these configurations in your agent definition:

const agent = await ductape.agents.define({
  product: 'my-product',
  tag: 'resilient-agent',
  name: 'Resilient Agent',
  model: 'claude-model',
  systemPrompt: 'You are a helpful assistant.',
  tools: [...],

  // Reference existing resilience configurations
  resilience: {
    defaults: {
      healthcheck: 'anthropic-health', // Check health before operations
      fallback: 'llm-fallback',        // Use fallback chain by default
    },
  },
});
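
Because an agent only references tags, a mistyped tag is an easy mistake to make. The following is a minimal sketch of a pre-flight check you could run before defining an agent. `ProductResilienceTags`, `ResilienceRefs`, and `findMissingRefs` are hypothetical helpers written for this illustration, not part of the Ductape SDK, and the list of product-level tags is assumed to be something you maintain yourself.

```typescript
// Hypothetical shapes for illustration; these are not Ductape SDK types.
interface ProductResilienceTags {
  healthchecks: string[];
  quotas: string[];
  fallbacks: string[];
}

interface ResilienceRefs {
  healthcheck?: string;
  quota?: string;
  fallback?: string;
}

// Return every reference that does not match a known product-level tag.
function findMissingRefs(known: ProductResilienceTags, refs: ResilienceRefs): string[] {
  const missing: string[] = [];
  if (refs.healthcheck !== undefined && known.healthchecks.indexOf(refs.healthcheck) === -1) {
    missing.push(`healthcheck:${refs.healthcheck}`);
  }
  if (refs.quota !== undefined && known.quotas.indexOf(refs.quota) === -1) {
    missing.push(`quota:${refs.quota}`);
  }
  if (refs.fallback !== undefined && known.fallbacks.indexOf(refs.fallback) === -1) {
    missing.push(`fallback:${refs.fallback}`);
  }
  return missing;
}

const productTags: ProductResilienceTags = {
  healthchecks: ["anthropic-health"],
  quotas: ["llm-quota"],
  fallbacks: ["llm-fallback"],
};

const missingRefs = findMissingRefs(productTags, {
  healthcheck: "anthropic-health",
  fallback: "llm-fallbak", // deliberate typo to show the check
});
console.log(missingRefs); // [ 'fallback:llm-fallbak' ]
```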

Using Resilience in Tools

Checking Health Status

{
  tag: 'smart-query',
  description: 'Query with health-aware provider selection',
  parameters: {...},
  handler: async (ctx, params) => {
    // Check provider health before making a decision
    const healthStatus = await ctx.resilience.healthcheck.status();

    if (healthStatus['anthropic-health']?.status === 'healthy') {
      ctx.log.info('Using Anthropic provider');
      // Use primary provider
    } else if (healthStatus['openai-health']?.status === 'healthy') {
      ctx.log.info('Falling back to OpenAI');
      // Use fallback provider
    } else {
      throw new Error('All LLM providers are unavailable');
    }

    // ... perform the query
  },
}
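
The branching in the handler above can be factored into a small pure function, which is easier to unit test than a handler. This is a sketch under assumptions: it relies only on the `status` field of each entry, and `HealthMap` and `firstHealthy` are hypothetical names, not SDK exports.

```typescript
// Assumed shape: only the `status` field of each healthcheck entry is used here.
type HealthStatus = "healthy" | "unhealthy" | "degraded" | "unknown";
type HealthMap = Record<string, { status: HealthStatus } | undefined>;

// Return the first healthcheck tag, in preference order, that reports 'healthy'.
// Returns undefined when none of the preferred tags is healthy.
function firstHealthy(statuses: HealthMap, preference: string[]): string | undefined {
  for (const tag of preference) {
    const entry = statuses[tag];
    if (entry !== undefined && entry.status === "healthy") return tag;
  }
  return undefined;
}

const sampleStatuses: HealthMap = {
  "anthropic-health": { status: "degraded" },
  "openai-health": { status: "healthy" },
};

const chosenTag = firstHealthy(sampleStatuses, ["anthropic-health", "openai-health"]);
console.log(chosenTag); // openai-health
```

A handler could call `firstHealthy` with the result of `ctx.resilience.healthcheck.status()` and throw when it returns `undefined`.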

Running Through Quotas

{
  tag: 'analyze-text',
  description: 'Analyze text using load-balanced LLM providers',
  parameters: {
    text: { type: 'string', description: 'Text to analyze', required: true },
  },
  handler: async (ctx, params) => {
    // Run through quota - automatically selects provider based on weight
    const result = await ctx.resilience.quota.run({
      tag: 'llm-quota',
      input: {
        prompt: `Analyze the following text: ${params.text}`,
      },
    });

    return result;
  },
}
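
To build intuition for weight-based selection, here is a minimal sketch of one common approach: walking cumulative weights with a random roll. It is not Ductape's implementation, which this page does not describe; `WeightedProvider` and `pickByWeight` are hypothetical names, and the weights are assumed to be relative shares such as 70 and 30.

```typescript
interface WeightedProvider {
  provider: string;
  quota: number; // relative weight, e.g. 70 and 30
}

// Pick a provider by walking cumulative weights.
// `roll` is a number in [0, 1), injected so the choice is reproducible in tests.
function pickByWeight(providers: WeightedProvider[], roll: number): string {
  if (providers.length === 0) throw new Error("No providers configured");
  let total = 0;
  for (const p of providers) total += p.quota;
  if (!(total > 0)) throw new Error("Weights must sum to a positive number");

  let threshold = roll * total;
  for (const p of providers) {
    if (threshold < p.quota) return p.provider;
    threshold -= p.quota;
  }
  // Guards against floating-point drift when roll is very close to 1.
  return providers[providers.length - 1].provider;
}

const llmProviders: WeightedProvider[] = [
  { provider: "anthropic-primary", quota: 70 },
  { provider: "openai-secondary", quota: 30 },
];

console.log(pickByWeight(llmProviders, 0.5));  // anthropic-primary
console.log(pickByWeight(llmProviders, 0.85)); // openai-secondary
```

In real use the roll would come from `Math.random()`, so roughly 70% of calls would land on the first provider over many runs.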

Running Through Fallbacks

{
  tag: 'generate-response',
  description: 'Generate response with automatic failover',
  parameters: {
    prompt: { type: 'string', description: 'User prompt', required: true },
  },
  handler: async (ctx, params) => {
    // Run through fallback chain - automatically fails over on errors
    const result = await ctx.resilience.fallback.run({
      tag: 'llm-fallback',
      input: {
        prompt: params.prompt,
      },
    });

    return result;
  },
}
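
Sequential failover can be pictured as trying each option in order and moving on once its retries are used up. The sketch below illustrates that idea only; it is not Ductape's implementation. `FailoverOption` and `runWithFailover` are hypothetical, the provider calls are synchronous stand-ins for what would be async calls, and `retries` is assumed to mean extra attempts after the first failure.

```typescript
interface FailoverOption<T> {
  provider: string;
  retries: number; // extra attempts after the first failure (assumed meaning)
  call: () => T;   // synchronous stand-in for a provider call
}

interface FailoverResult<T> {
  provider: string;
  attempts: number; // total attempts across all providers
  value: T;
}

// Try each option in order; move to the next one once its retries are exhausted.
function runWithFailover<T>(options: FailoverOption<T>[]): FailoverResult<T> {
  let attempts = 0;
  const errors: string[] = [];
  for (const option of options) {
    for (let attempt = 0; attempt <= option.retries; attempt++) {
      attempts++;
      try {
        return { provider: option.provider, attempts, value: option.call() };
      } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        errors.push(`${option.provider}: ${message}`);
      }
    }
  }
  throw new Error(`All providers failed (${errors.join("; ")})`);
}

const failoverDemo = runWithFailover<string>([
  { provider: "anthropic-primary", retries: 1, call: () => { throw new Error("timeout"); } },
  { provider: "openai-fallback", retries: 2, call: () => "ok" },
]);
console.log(failoverDemo.provider, failoverDemo.attempts); // openai-fallback 3
```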

Tool-Level Resilience

You can configure resilience for individual tools for fine-grained control, either inline on the tool definition or through toolOverrides in the agent-level resilience block:

const agent = await ductape.agents.define({
  product: 'my-product',
  tag: 'granular-agent',
  name: 'Granular Resilience Agent',
  model: 'default-model',
  systemPrompt: 'You are a helpful assistant.',

  tools: [
    {
      tag: 'critical-operation',
      description: 'A critical operation that needs fallback protection',
      parameters: {...},
      handler: async (ctx, params) => {...},
      // Tool-specific resilience (overrides defaults)
      resilience: {
        fallback: 'critical-fallback',
        healthcheck: 'critical-service-health',
      },
    },
    {
      tag: 'load-balanced-operation',
      description: 'An operation that should be load balanced',
      parameters: {...},
      handler: async (ctx, params) => {...},
      resilience: {
        quota: 'standard-quota',
      },
    },
  ],

  resilience: {
    // Default resilience for tools without explicit config
    defaults: {
      healthcheck: 'default-health',
      quota: 'default-quota',
    },

    // Override resilience for specific tools, keyed by tool tag ('special-tool' is not shown in the tools array above)
    toolOverrides: {
      'special-tool': {
        fallback: 'special-fallback',
        healthcheck: 'special-health',
      },
    },
  },
});
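
With three places to set resilience (defaults, toolOverrides, and inline on the tool), it helps to reason about which value a tool ends up with. The sketch below shows one plausible resolution order. This page does not document the actual precedence or whether values merge field by field, so treat both as assumptions; `resolveToolResilience` and the two interfaces are hypothetical.

```typescript
interface ToolResilience {
  healthcheck?: string;
  quota?: string;
  fallback?: string;
}

interface AgentResilience {
  defaults?: ToolResilience;
  toolOverrides?: Record<string, ToolResilience>;
}

// Merge field by field; later sources win over earlier ones.
// ASSUMPTION: the order defaults < toolOverrides < inline is not documented behavior.
function resolveToolResilience(
  toolTag: string,
  inline: ToolResilience | undefined,
  agent: AgentResilience
): ToolResilience {
  const override = agent.toolOverrides ? agent.toolOverrides[toolTag] : undefined;
  return { ...agent.defaults, ...override, ...inline };
}

const agentResilience: AgentResilience = {
  defaults: { healthcheck: "default-health", quota: "default-quota" },
  toolOverrides: {
    "special-tool": { fallback: "special-fallback", healthcheck: "special-health" },
  },
};

// healthcheck and fallback come from the override; quota falls through from defaults.
const resolvedSpecial = resolveToolResilience("special-tool", undefined, agentResilience);
console.log(resolvedSpecial);
```

If your testing shows a different order, for example a tool-level block replacing the defaults wholesale, only the final spread expression needs to change.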

Resilience Context API

The ctx.resilience object in tool handlers provides these namespaces:

`ctx.resilience.quota`

`quota.run<T>(options)`

Run an operation through an existing quota for weighted load distribution.

const result = await ctx.resilience.quota.run<ResponseType>({
  tag: 'quota-tag',               // Tag of existing quota
  input: { /* operation input */ },
  session: { tag: 'session-tag', token: 'token' }, // Optional
});

`quota.status(options)`

Get current status and usage of an existing quota.

const status = await ctx.resilience.quota.status({
  tag: 'quota-tag',
});

// Returns:
// {
//   tag: 'quota-tag',
//   totalQuota: 1000,
//   usedQuota: 450,
//   remainingQuota: 550,
//   providers: [
//     { name: 'primary', weight: 70, uses: 315, status: 'available' },
//     { name: 'secondary', weight: 30, uses: 135, status: 'available' },
//   ]
// }
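
One use for this status object is checking whether observed traffic matches the configured weights. The sketch below compares each provider's share of `uses` with its share of `weight`. The interfaces are copied from the example return value above and should be treated as illustrative; `shareDrift` is a hypothetical helper.

```typescript
// Shapes copied from the example return value above; treat them as illustrative.
interface QuotaProviderStatus {
  name: string;
  weight: number;
  uses: number;
  status: string;
}

interface QuotaStatus {
  tag: string;
  totalQuota: number;
  usedQuota: number;
  remainingQuota: number;
  providers: QuotaProviderStatus[];
}

// For each provider, return observed share minus expected share.
// A positive value means the provider received more traffic than its weight implies.
function shareDrift(quotaStatus: QuotaStatus): Record<string, number> {
  let totalWeight = 0;
  let totalUses = 0;
  for (const p of quotaStatus.providers) {
    totalWeight += p.weight;
    totalUses += p.uses;
  }
  const drift: Record<string, number> = {};
  for (const p of quotaStatus.providers) {
    const expected = totalWeight > 0 ? p.weight / totalWeight : 0;
    const observed = totalUses > 0 ? p.uses / totalUses : 0;
    drift[p.name] = observed - expected;
  }
  return drift;
}

const sampleQuota: QuotaStatus = {
  tag: "quota-tag",
  totalQuota: 1000,
  usedQuota: 450,
  remainingQuota: 550,
  providers: [
    { name: "primary", weight: 70, uses: 315, status: "available" },
    { name: "secondary", weight: 30, uses: 135, status: "available" },
  ],
};

const quotaDrift = shareDrift(sampleQuota);
console.log(sampleQuota.usedQuota / sampleQuota.totalQuota); // 0.45
console.log(quotaDrift); // { primary: 0, secondary: 0 }
```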

`ctx.resilience.fallback`

`fallback.run<T>(options)`

Run an operation through an existing fallback chain for sequential failover.

const result = await ctx.resilience.fallback.run<ResponseType>({
  tag: 'fallback-tag',            // Tag of existing fallback
  input: { /* operation input */ },
  session: { tag: 'session-tag', token: 'token' }, // Optional
});

`ctx.resilience.healthcheck`

`healthcheck.check(options)`

Check the health status of a specific healthcheck.

const health = await ctx.resilience.healthcheck.check({
  tag: 'healthcheck-tag',         // Tag of existing healthcheck
  env: 'production',              // Optional, defaults to current env
});

// Returns:
// {
//   tag: 'healthcheck-tag',
//   status: 'healthy' | 'unhealthy' | 'degraded' | 'unknown',
//   lastChecked: 1699900000000,
//   lastAvailable: 1699900000000,
//   lastLatency: 150,
//   averageLatency: 145,
//   error?: 'Connection timeout'
// }
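
A `healthy` status is only as useful as it is recent. The sketch below types the result shown above and adds a freshness check. It assumes `lastChecked` is a Unix timestamp in milliseconds, which the sample value suggests but this page does not state; `HealthcheckResultSketch` and `isFreshAndHealthy` are hypothetical names.

```typescript
// Mirrors the documented return value; the interface name is hypothetical.
interface HealthcheckResultSketch {
  tag: string;
  status: "healthy" | "unhealthy" | "degraded" | "unknown";
  lastChecked: number;   // assumed: epoch milliseconds
  lastAvailable: number; // assumed: epoch milliseconds
  lastLatency: number;
  averageLatency: number;
  error?: string;
}

// True only when the check reports 'healthy' and ran within the last maxAgeMs.
function isFreshAndHealthy(
  result: HealthcheckResultSketch,
  nowMs: number,
  maxAgeMs: number
): boolean {
  return result.status === "healthy" && nowMs - result.lastChecked <= maxAgeMs;
}

const sampleHealth: HealthcheckResultSketch = {
  tag: "healthcheck-tag",
  status: "healthy",
  lastChecked: 1699900000000,
  lastAvailable: 1699900000000,
  lastLatency: 150,
  averageLatency: 145,
};

// 45 seconds after the last check, with a 60 second tolerance: still fresh.
console.log(isFreshAndHealthy(sampleHealth, 1699900045000, 60000)); // true
// Two minutes after the last check: treated as stale.
console.log(isFreshAndHealthy(sampleHealth, 1699900120000, 60000)); // false
```

A tolerance of about two probe intervals is one reasonable starting point, for example 60000 when the healthcheck `interval` is 30000.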

`healthcheck.status()`

Get health status for all configured healthchecks referenced by the agent.

const allHealth = await ctx.resilience.healthcheck.status();

// Returns: Record<string, IAgentHealthcheckResult>
// {
//   'anthropic-health': { tag: '...', status: 'healthy', ... },
//   'openai-health': { tag: '...', status: 'degraded', ... },
// }

Complete Example

Here's a complete agent definition that uses existing resilience configurations:

import Ductape from '@ductape/sdk';

const ductape = new Ductape({...});

// Assume these resilience configs already exist at the product level:
// - healthchecks: 'anthropic-health', 'openai-health'
// - quotas: 'cost-optimized-quota'
// - fallbacks: 'llm-fallback'

const agent = await ductape.agents.define({
  product: 'production-app',
  tag: 'resilient-support-agent',
  name: 'Resilient Support Agent',
  model: 'claude-primary',
  systemPrompt: `You are a customer support agent with access to multiple
AI providers for reliability. Always provide helpful responses.`,

  tools: [
    {
      tag: 'answer-question',
      description: 'Answer customer questions using resilient LLM calls',
      parameters: {
        question: { type: 'string', description: 'Customer question', required: true },
        context: { type: 'string', description: 'Additional context' },
      },
      handler: async (ctx, params) => {
        // Check overall health first
        const health = await ctx.resilience.healthcheck.status();
        const healthyProviders = Object.values(health)
          .filter(h => h.status === 'healthy').length;

        ctx.log.info(`${healthyProviders} healthy providers available`);

        // Use fallback for critical customer responses
        const response = await ctx.resilience.fallback.run({
          tag: 'llm-fallback',
          input: {
            prompt: `Context: ${params.context || 'None'}

Question: ${params.question}

Please provide a helpful, accurate response.`,
          },
        });

        return response;
      },
    },
    {
      tag: 'summarize-ticket',
      description: 'Summarize support tickets using load-balanced providers',
      parameters: {
        ticketId: { type: 'string', description: 'Ticket ID', required: true },
      },
      // Use quota for non-critical operations to optimize costs
      resilience: {
        quota: 'cost-optimized-quota',
      },
      handler: async (ctx, params) => {
        const ticket = await ctx.database.query({
          database: 'tickets-db',
          event: 'get-ticket',
          params: { id: params.ticketId },
        });

        return ctx.resilience.quota.run({
          tag: 'cost-optimized-quota',
          input: {
            prompt: `Summarize this support ticket: ${JSON.stringify(ticket)}`,
          },
        });
      },
    },
  ],

  resilience: {
    defaults: {
      healthcheck: 'anthropic-health',
      fallback: 'llm-fallback',
    },
  },

  termination: {
    maxIterations: 10,
    timeout: '5m',
  },
});

// Run the agent
const result = await ductape.agents.run({
  product: 'production-app',
  env: 'production',
  tag: 'resilient-support-agent',
  input: {
    question: 'How do I reset my password?',
  },
});

Best Practices

1. Define Resilience at the Product Level

// Good: Define resilience configs once at product level
await ductape.resilience.healthcheck.create('my-product', {
  tag: 'api-health',
  // ... config
});

// Then reference in multiple agents
const agent1 = await ductape.agents.define({
  resilience: { defaults: { healthcheck: 'api-health' } },
  // ...
});

const agent2 = await ductape.agents.define({
  resilience: { defaults: { healthcheck: 'api-health' } },
  // ...
});

2. Use Fallbacks for Critical Operations

// Critical customer-facing operations should use fallbacks
{
  tag: 'process-payment',
  resilience: { fallback: 'payment-fallback' },
  handler: async (ctx, params) => {
    return ctx.resilience.fallback.run({
      tag: 'payment-fallback',
      input: params,
    });
  },
}

3. Use Quotas for Cost Optimization

// Non-critical operations can use quotas to optimize costs
{
  tag: 'generate-suggestions',
  resilience: { quota: 'cost-optimized' },
  handler: async (ctx, params) => {
    return ctx.resilience.quota.run({
      tag: 'cost-optimized',
      input: params,
    });
  },
}

4. Check Health Before Critical Operations

handler: async (ctx, params) => {
  // Always check health for critical operations
  const health = await ctx.resilience.healthcheck.check({
    tag: 'primary-service',
  });

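  if (health.status !== 'healthy') {
    // status may be 'degraded', 'unhealthy', or 'unknown'
    ctx.log.warn(`Primary service is ${health.status}, using fallback`);
  }

  // Continue with operation (for example via ctx.resilience.fallback.run)...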
}

Next Steps

Resilience Overview - Learn how to define healthchecks, quotas, and fallbacks
Agent Tools - Learn more about building agent tools
Human in the Loop - Add approval gates
Multi-Agent Systems - Orchestrate multiple agents
