Reliability

Rubicon's oracle is designed for high availability. This page documents our reliability architecture and monitoring.

Design Goals

| Goal | Target |
|---|---|
| Uptime | 99.9% |
| Max latency | <1 second |
| Max staleness | 60 seconds |
| Recovery time | <5 minutes |
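
To put the uptime target in concrete terms, the short calculation below converts an uptime percentage into an allowed downtime budget. The 30-day window is an assumption for illustration; this page does not state the measurement period.

```python
def downtime_budget_minutes(uptime_pct: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted in the window for a given uptime target."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - uptime_pct / 100)


# 99.9% over an assumed 30-day window leaves roughly 43 minutes of downtime.
print(round(downtime_budget_minutes(99.9), 1))  # 43.2
```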

Reliability Features

1. Multi-Source Fallback

Three-tier price source hierarchy:

Polygon → Yahoo → Cache

When a source fails, the oracle falls through to the next tier automatically, without human intervention.
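
The fallback chain can be sketched as an ordered list of fetchers that is tried until one succeeds. This is a minimal illustration, not the oracle's actual code: `fetch_with_fallback`, the `Fetcher` type, and the stub fetchers are hypothetical names, and real Polygon or Yahoo clients would replace the stubs.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical fetcher signature: returns a price or raises on failure.
Fetcher = Callable[[str], float]


def fetch_with_fallback(
    symbol: str, sources: Sequence[Tuple[str, Fetcher]]
) -> Tuple[str, float]:
    """Try each source in order; return (source_name, price) from the first success."""
    errors = []
    for name, fetch in sources:
        try:
            price = fetch(symbol)
            if price > 0:
                return name, price
            errors.append(f"{name}: non-positive price {price}")
        except Exception as exc:  # any source failure moves on to the next tier
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all sources failed: " + "; ".join(errors))


# Stub fetchers standing in for real clients (illustration only).
def polygon_stub(symbol: str) -> float:
    raise TimeoutError("request timed out")


def yahoo_stub(symbol: str) -> float:
    return 225.0


source, price = fetch_with_fallback(
    "SOXX", [("polygon", polygon_stub), ("yahoo", yahoo_stub)]
)
print(source, price)  # yahoo 225.0
```

A cache lookup can be appended as the last entry in `sources`, so the third tier uses the same code path as the first two.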

2. State Persistence

Oracle state survives restarts:

// ./data/oracle-state.json
{
  "prices": {
    "SOXX": "225.00",
    "timestamp": "2024-01-15T15:30:00Z"
  },
  "emaState": {
    "SOXX": { "value": 224.95, "samples": 1200 }
  },
  "errorCount": 0,
  "lastSuccess": "2024-01-15T15:30:00Z"
}

Recovery flow:

  1. Process starts

  2. Loads state from disk

  3. Validates state freshness

  4. Resumes with minimal disruption
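
Steps 2 and 3 of the recovery flow might look like the sketch below. It assumes the state file is plain JSON with the fields shown above (without the leading path comment), and that freshness is judged by comparing `lastSuccess` against the 60-second staleness goal. The function names are hypothetical, and what the oracle does with stale state is not specified here.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Optional, Tuple

MAX_STALENESS = timedelta(seconds=60)  # mirrors the "Max staleness" design goal


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp such as 2024-01-15T15:30:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_fresh(state: dict, now: datetime) -> bool:
    """True if the last successful update is within MAX_STALENESS of now."""
    last_success = state.get("lastSuccess")
    if not last_success:
        return False
    return now - parse_timestamp(last_success) <= MAX_STALENESS


def recover(path: Path, now: Optional[datetime] = None) -> Tuple[dict, bool]:
    """Load state from disk and report whether it is still fresh."""
    now = now or datetime.now(timezone.utc)
    state = json.loads(path.read_text())
    return state, is_fresh(state, now)
```

Returning the freshness flag alongside the state leaves the decision to the caller, for example reusing the saved EMA but refetching prices.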

3. EMA Smoothing

Mark prices use Exponential Moving Average:

  • Reduces impact of momentary spikes

  • Provides stability during volatility

  • Prevents manipulation via thin liquidity
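
As an illustration of the smoothing effect, the standard EMA recurrence is `ema = alpha * price + (1 - alpha) * previous_ema`. The smoothing factor `alpha = 0.1` below is an assumed value chosen for the example; this page does not state the oracle's actual parameter.

```python
from typing import Optional


def ema_update(prev: Optional[float], price: float, alpha: float = 0.1) -> float:
    """One EMA step; the first sample seeds the average."""
    if prev is None:
        return price
    return alpha * price + (1 - alpha) * prev


# A momentary spike from 224.95 to 240.00 moves the mark price only slightly.
mark = ema_update(224.95, 240.00)
print(round(mark, 3))  # 226.455
```

A smaller `alpha` damps spikes more strongly but makes the mark price slower to follow genuine moves.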

4. Degraded Mode

When reliability is compromised:

| Condition | Response |
|---|---|
| 3 consecutive errors | Enter degraded mode |
| Cache >60s stale | Halt oracle updates |
| Extended outage | Block new positions |
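
The conditions above can be expressed as a small rule check. This is a sketch under stated assumptions: the `OracleHealth` fields and action names are hypothetical, and the 300-second threshold for an "extended" outage is a placeholder, since this page does not define it.

```python
from dataclasses import dataclass
from typing import List

ERROR_LIMIT = 3            # "3 consecutive errors"
MAX_CACHE_AGE_S = 60.0     # "Cache >60s stale"
EXTENDED_OUTAGE_S = 300.0  # assumption: "extended" is not defined in this page


@dataclass
class OracleHealth:
    consecutive_errors: int
    cache_age_s: float
    outage_s: float


def required_responses(h: OracleHealth) -> List[str]:
    """Return every response whose condition currently holds."""
    actions = []
    if h.consecutive_errors >= ERROR_LIMIT:
        actions.append("enter_degraded_mode")
    if h.cache_age_s > MAX_CACHE_AGE_S:
        actions.append("halt_oracle_updates")
    if h.outage_s >= EXTENDED_OUTAGE_S:
        actions.append("block_new_positions")
    return actions


print(required_responses(OracleHealth(3, 75.0, 0.0)))
# ['enter_degraded_mode', 'halt_oracle_updates']
```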

Monitoring

Health Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| oracle.success_rate | % successful submissions | <99% |
| oracle.latency_p99 | 99th percentile latency | >500 ms |
| oracle.source_fallback | Secondary source usage | >1% |
| oracle.cache_usage | Cache fallback usage | >0.1% |
| oracle.error_count | Consecutive errors | >2 |
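
A threshold check over these metrics could be written as below. The metric names and limits come from the table; the `THRESHOLDS` structure and `firing_alerts` function are hypothetical and not tied to any particular monitoring system.

```python
import operator
from typing import Dict, List

# (comparison, limit): an alert fires when comparison(value, limit) is true.
THRESHOLDS = {
    "oracle.success_rate": (operator.lt, 99.0),    # percent
    "oracle.latency_p99": (operator.gt, 500.0),    # milliseconds
    "oracle.source_fallback": (operator.gt, 1.0),  # percent
    "oracle.cache_usage": (operator.gt, 0.1),      # percent
    "oracle.error_count": (operator.gt, 2),        # consecutive errors
}


def firing_alerts(metrics: Dict[str, float]) -> List[str]:
    """Return the sorted names of metrics that breach their alert threshold."""
    firing = []
    for name, value in metrics.items():
        if name in THRESHOLDS:
            compare, limit = THRESHOLDS[name]
            if compare(value, limit):
                firing.append(name)
    return sorted(firing)


sample = {
    "oracle.success_rate": 98.5,
    "oracle.latency_p99": 420.0,
    "oracle.error_count": 3,
}
print(firing_alerts(sample))  # ['oracle.error_count', 'oracle.success_rate']
```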

Alerting

Alerts notify the on-call team for immediate response.

Dashboards

The operations team monitors:

  • Real-time price feed status

  • Source response times

  • Error rates by source

  • Historical uptime

Incident Response

Automated Response

| Event | Automatic Action |
|---|---|
| Source timeout | Switch to fallback |
| Rate limiting | Back off and retry |
| Invalid price | Skip and log |
| All sources fail | Use cache, alert |
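
The "backoff and retry" action for rate limiting is commonly implemented as exponential backoff with jitter. The sketch below shows that general technique. The attempt count and delays are assumed values, and `retry_with_backoff` is a hypothetical helper, not the oracle's actual retry policy.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry call() with exponentially growing, jittered delays between attempts."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller switch to a fallback
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out so clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.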

Manual Response

For extended outages:

  1. On-call receives alert

  2. Assess source status

  3. Contact provider if needed

  4. Communicate status to users

  5. Monitor recovery

Historical Reliability

We track and publish:

  • Monthly uptime percentage

  • Incident reports (if any)

  • Source availability stats

Architecture Decisions

Why Multiple Sources?

A single source is a single point of failure. With the multi-source design:

  • Polygon outage → Yahoo takes over (~0.1% of time)

  • Both down → Cache covers brief outages

  • Result: Near-continuous availability

Why 3-Second Updates?

A 3-second update interval balances:

  • Freshness — Prices reflect recent trades

  • Rate limits — Stay within API quotas

  • Hyperliquid — Matches their update cadence

Why 60-Second Cache Limit?

A longer cache window risks:

  • Trading on outdated prices

  • Missing significant price movements

  • Compromising user trust

60 seconds covers brief connectivity issues while limiting staleness.

Failure Modes

Mode 1: Graceful Degradation

The oracle switches to a fallback source. Users may not notice anything.

Mode 2: Stale Price Warning

Users are informed that prices are stale and can still exit positions.

Mode 3: Halt

Oracle updates are halted as a protective measure during severe issues.

Disaster Recovery

Data Loss Recovery

If the state file is corrupted:

  1. Start with empty state

  2. Fetch fresh prices

  3. EMA rebuilds over time
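
Falling back to an empty state when the file is unreadable might look like the following. The empty-state shape mirrors the fields in the example state file; the function names are hypothetical, and the sketch assumes the file holds plain JSON.

```python
import copy
import json
from pathlib import Path

EMPTY_STATE = {"prices": {}, "emaState": {}, "errorCount": 0, "lastSuccess": None}


def parse_or_reset(raw: str) -> dict:
    """Parse state JSON; return a fresh empty state if it is corrupted."""
    try:
        state = json.loads(raw)
        if not isinstance(state, dict):
            raise ValueError("state must be a JSON object")
        return state
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return copy.deepcopy(EMPTY_STATE)


def load_or_reset(path: Path) -> dict:
    """Read the state file; a missing or unreadable file also yields empty state."""
    try:
        return parse_or_reset(path.read_text())
    except (OSError, UnicodeDecodeError):
        return copy.deepcopy(EMPTY_STATE)
```

With an empty `emaState`, the first fetched price seeds the average and the EMA rebuilds from there.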

Complete Restart

  1. Load persisted state

  2. Verify state freshness

  3. Resume submissions

  4. EMA continues from saved state

Provider Outage

If a source is unavailable long-term:

  1. Add alternative sources

  2. Deploy updated oracle

  3. Resume operations

User-Facing Reliability

What traders experience:

| Oracle Status | Trading | New Positions | Indication |
|---|---|---|---|
| Healthy | Normal | Allowed | Green indicator |
| Fallback | Normal | Allowed | Yellow indicator |
| Stale | Normal | Cautioned | Orange indicator |
| Degraded | Limited | Blocked | Red indicator |
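
A client could encode the table above as a lookup from oracle status to trading policy. The enum, `Policy` tuple, and `can_open_position` helper are hypothetical names for illustration; only the values are taken from the table.

```python
from enum import Enum
from typing import NamedTuple


class OracleStatus(Enum):
    HEALTHY = "healthy"
    FALLBACK = "fallback"
    STALE = "stale"
    DEGRADED = "degraded"


class Policy(NamedTuple):
    trading: str
    new_positions: str
    indicator: str


POLICY = {
    OracleStatus.HEALTHY: Policy("normal", "allowed", "green"),
    OracleStatus.FALLBACK: Policy("normal", "allowed", "yellow"),
    OracleStatus.STALE: Policy("normal", "cautioned", "orange"),
    OracleStatus.DEGRADED: Policy("limited", "blocked", "red"),
}


def can_open_position(status: OracleStatus) -> bool:
    """New positions are permitted unless the policy blocks them outright."""
    return POLICY[status].new_positions != "blocked"


print(can_open_position(OracleStatus.STALE))     # True
print(can_open_position(OracleStatus.DEGRADED))  # False
```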

Continuous Improvement

We actively improve reliability:

  • Monitor for new failure modes

  • Test fallback paths regularly

  • Evaluate additional sources

  • Update based on incidents
