Reliability
Rubicon's oracle is designed for high availability. This page documents our reliability architecture and monitoring.
Design Goals
Uptime: 99.9%
Max latency: <1 second
Max staleness: 60 seconds
Recovery time: <5 minutes
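To make the uptime target concrete, the arithmetic below converts it into a downtime budget. It is a small sketch; the 30-day period and the function name are assumptions chosen for illustration.

```python
def downtime_budget_minutes(uptime_target: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted by an uptime target over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - uptime_target)


budget = downtime_budget_minutes(0.999)
print(f"{budget:.1f} minutes of downtime per 30 days")
```

At 99.9% this works out to roughly 43 minutes per 30 days, so a recovery time under 5 minutes leaves room for only a handful of full outages per month.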
Reliability Features
1. Multi-Source Fallback
Three-tier price source hierarchy:
Polygon → Yahoo → Cache
Each source failure is handled automatically without human intervention.
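The hierarchy can be sketched as an ordered list of fetchers tried in turn. This is a minimal illustration, not the oracle's actual code: `fetch_with_fallback`, the `Fetcher` signature, and the stub functions are hypothetical names, and the positive-price check is an assumed validity rule.

```python
from decimal import Decimal
from typing import Callable, Sequence, Tuple

# Hypothetical signature: a fetcher returns a price or raises on failure.
Fetcher = Callable[[str], Decimal]


def fetch_with_fallback(
    symbol: str, sources: Sequence[Tuple[str, Fetcher]]
) -> Tuple[str, Decimal]:
    """Try each source in priority order; return (source name, price)."""
    errors = []
    for name, fetch in sources:
        try:
            price = fetch(symbol)
        except Exception as exc:  # any failure moves on to the next tier
            errors.append(f"{name}: {exc}")
            continue
        if price > 0:  # assumed validity rule: prices must be positive
            return name, price
        errors.append(f"{name}: non-positive price {price}")
    raise RuntimeError("all sources failed: " + "; ".join(errors))


# Stub fetchers standing in for real provider clients.
def polygon_stub(symbol: str) -> Decimal:
    raise TimeoutError("simulated timeout")


def yahoo_stub(symbol: str) -> Decimal:
    return Decimal("225.00")


def cache_stub(symbol: str) -> Decimal:
    return Decimal("224.90")


SOURCES = [("polygon", polygon_stub), ("yahoo", yahoo_stub), ("cache", cache_stub)]
print(fetch_with_fallback("SOXX", SOURCES))
```

With the simulated Polygon timeout, the call falls through to the Yahoo stub and reports which tier answered, which is the kind of signal a fallback-usage metric could be built on.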
2. State Persistence
Oracle state survives restarts:
// ./data/oracle-state.json
{
  "prices": {
    "SOXX": "225.00",
    "timestamp": "2024-01-15T15:30:00Z"
  },
  "emaState": {
    "SOXX": { "value": 224.95, "samples": 1200 }
  },
  "errorCount": 0,
  "lastSuccess": "2024-01-15T15:30:00Z"
}
Recovery flow:
Process starts
Loads state from disk
Validates state freshness
Resumes with minimal disruption
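The load and freshness steps of this flow might look like the sketch below. It assumes freshness is judged from the `lastSuccess` field and that the 60-second staleness limit also applies to persisted state; neither is confirmed by this page, and `load_fresh_state` is a hypothetical helper name.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Optional

# Assumption: persisted state is subject to the same 60-second staleness limit.
MAX_STATE_AGE = timedelta(seconds=60)


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp such as 2024-01-15T15:30:00Z."""
    # A trailing "Z" is not accepted by fromisoformat before Python 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def load_fresh_state(path: Path, now: Optional[datetime] = None) -> Optional[dict]:
    """Return the persisted state if present and fresh, otherwise None."""
    if not path.exists():
        return None
    state = json.loads(path.read_text())
    now = now or datetime.now(timezone.utc)
    age = now - parse_timestamp(state["lastSuccess"])
    return state if age <= MAX_STATE_AGE else None
```

A caller that receives `None` would fetch fresh prices before resuming submissions rather than trusting the file.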
3. EMA Smoothing
Mark prices use Exponential Moving Average:
Reduces impact of momentary spikes
Provides stability during volatility
Prevents manipulation via thin liquidity
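The standard EMA recurrence shows why a single spike has limited effect. The sketch below mirrors the `value` and `samples` fields of the persisted state; the smoothing factor of 0.1 is an assumption, since this page does not state the real one.

```python
from dataclasses import dataclass
from typing import Optional

ALPHA = 0.1  # assumed smoothing factor; the production value is not documented here


@dataclass
class EmaState:
    value: float
    samples: int


def ema_update(state: Optional[EmaState], price: float, alpha: float = ALPHA) -> EmaState:
    """Apply one EMA step: new = alpha * price + (1 - alpha) * previous."""
    if state is None:
        # The first sample seeds the average.
        return EmaState(value=price, samples=1)
    value = alpha * price + (1.0 - alpha) * state.value
    return EmaState(value=value, samples=state.samples + 1)


state = None
for tick in [225.0, 225.0, 250.0, 225.0]:  # the third tick is a momentary spike
    state = ema_update(state, tick)
    print(f"tick={tick:.2f} mark={state.value:.2f}")
```

With these inputs the 25-point spike moves the mark price by only 2.50, and the next normal tick begins pulling it back.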
4. Degraded Mode
When reliability is compromised:
3 consecutive errors → Enter degraded mode
Cache >60s stale → Halt oracle updates
Extended outage → Block new positions
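The first two triggers can be expressed as a small decision function. This is a sketch under assumptions: the `Mode` names and `evaluate_mode` are hypothetical, and the "extended outage" trigger is left out because this page does not define its threshold.

```python
from enum import Enum


class Mode(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    HALTED = "halted"


ERROR_THRESHOLD = 3      # consecutive errors before entering degraded mode
MAX_CACHE_AGE_S = 60.0   # cache older than this halts oracle updates


def evaluate_mode(consecutive_errors: int, using_cache: bool, cache_age_s: float) -> Mode:
    """Map current error and cache conditions to an operating mode."""
    # The halt condition is checked first because it is the more severe state.
    if using_cache and cache_age_s > MAX_CACHE_AGE_S:
        return Mode.HALTED
    if consecutive_errors >= ERROR_THRESHOLD:
        return Mode.DEGRADED
    return Mode.HEALTHY


print(evaluate_mode(consecutive_errors=3, using_cache=True, cache_age_s=12.0))
```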
Monitoring
Health Metrics
oracle.success_rate: % successful submissions (alert: <99%)
oracle.latency_p99: 99th percentile latency (alert: >500ms)
oracle.source_fallback: Secondary source usage (alert: >1%)
oracle.cache_usage: Cache fallback usage (alert: >0.1%)
oracle.error_count: Consecutive errors (alert: >2)
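Reading the third value of each metric as its alert threshold, the checks could be encoded as data and evaluated in one pass. The rule table and `firing_alerts` are illustrative names, not part of any real monitoring API, and the units (percent, milliseconds) are inferred from the thresholds.

```python
import operator

# Metric name -> (comparison, threshold). An alert fires when
# comparison(value, threshold) is true.
ALERT_RULES = {
    "oracle.success_rate": (operator.lt, 99.0),    # percent
    "oracle.latency_p99": (operator.gt, 500.0),    # milliseconds
    "oracle.source_fallback": (operator.gt, 1.0),  # percent
    "oracle.cache_usage": (operator.gt, 0.1),      # percent
    "oracle.error_count": (operator.gt, 2),        # consecutive errors
}


def firing_alerts(metrics: dict) -> list:
    """Return the sorted names of metrics that breach their threshold."""
    return sorted(
        name
        for name, (breaches, threshold) in ALERT_RULES.items()
        if name in metrics and breaches(metrics[name], threshold)
    )


sample = {
    "oracle.success_rate": 98.5,
    "oracle.latency_p99": 320.0,
    "oracle.error_count": 3,
}
print(firing_alerts(sample))
```

Note that the `oracle.error_count` threshold of more than 2 lines up with the 3-consecutive-error trigger for degraded mode.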
Alerting
Alerts notify the on-call team for immediate response.
Dashboards
The operations team monitors:
Real-time price feed status
Source response times
Error rates by source
Historical uptime
Incident Response
Automated Response
Source timeout → Switch to fallback
Rate limiting → Backoff and retry
Invalid price → Skip and log
All sources fail → Use cache, alert
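For the rate-limiting case, "backoff and retry" is commonly done with exponential delays plus jitter. The sketch below assumes that approach; the `RateLimited` exception, the delay values, and the attempt limit are all hypothetical, and the oracle's real policy may differ.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error raised when a provider signals rate limiting."""


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry a rate-limited call, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller fall back
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries so clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the function testable without real waiting. When the final attempt also fails, the exception propagates so the caller can switch to the next source.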
Manual Response
For extended outages:
On-call receives alert
Assess source status
Contact provider if needed
Communicate status to users
Monitor recovery
Historical Reliability
We track and publish:
Monthly uptime percentage
Incident reports (if any)
Source availability stats
Architecture Decisions
Why Multiple Sources?
Single source = single point of failure. Our multi-source design:
Polygon outage → Yahoo takes over (~0.1% of time)
Both down → Cache covers brief outages
Result: Near-continuous availability
Why 3-Second Updates?
Balances:
Freshness — Prices reflect recent trades
Rate limits — Stay within API quotas
Hyperliquid — Matches their update cadence
Why 60-Second Cache Limit?
A longer cache limit risks:
Trading on outdated prices
Significant price movement missed
User trust compromised
60 seconds covers brief connectivity issues while limiting staleness.
Failure Modes
Mode 1: Graceful Degradation
Users may not notice anything.
Mode 2: Stale Price Warning
Users informed, can still exit positions.
Mode 3: Halt
Protective measure during severe issues.
Disaster Recovery
Data Loss Recovery
State file corrupted:
Start with empty state
Fetch fresh prices
EMA rebuilds over time
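A corrupted file can be handled by treating any read or parse failure as "no state". The sketch below assumes the state file is JSON with the top-level keys shown earlier on this page; `load_state_or_empty` and `empty_state` are hypothetical helper names.

```python
import json
from pathlib import Path


def empty_state() -> dict:
    """A blank state: no prices and no EMA history."""
    return {"prices": {}, "emaState": {}, "errorCount": 0, "lastSuccess": None}


def load_state_or_empty(path: Path) -> dict:
    """Load persisted state, falling back to an empty state if unreadable."""
    try:
        state = json.loads(path.read_text())
    except (OSError, ValueError):
        # Missing file, I/O error, bad encoding, or invalid JSON
        # (json.JSONDecodeError is a subclass of ValueError).
        return empty_state()
    if not isinstance(state, dict):
        # Valid JSON of the wrong shape is treated as corrupted too.
        return empty_state()
    return state
```

Starting from an empty `emaState` means the first fresh price seeds the average, which then rebuilds as new samples arrive.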
Complete Restart
Load persisted state
Verify state freshness
Resume submissions
EMA continues from saved state
Provider Outage
Long-term source unavailable:
Add alternative sources
Deploy updated oracle
Resume operations
User-Facing Reliability
What traders experience:
Healthy: Normal, Allowed, Green indicator
Fallback: Normal, Allowed, Yellow indicator
Stale: Normal, Cautioned, Orange indicator
Degraded: Limited, Blocked, Red indicator
Continuous Improvement
We actively improve reliability:
Monitor for new failure modes
Test fallback paths regularly
Evaluate additional sources
Update based on incidents