DEV Community

# reliability

General discussions on building and maintaining reliable software systems.

Posts

Sliding-Window Spend Guard: the $47K Loop Per-Call Caps Miss
11 min read
How Many Times a Month Do AI Agents Crash in Production? The Truth Behind LLM API Reliability Data
1 min read
Graceful Degradation: Circuit Breakers for External API Dependencies
5 min read
Building a Chaos Testing Harness for Multi-Region Video API Endpoints
10 min read
Error budgets when downtime costs money: reliability engineering for payment-critical systems
10 min read
Distributed Tracing 101: The Mental Model, the Standards, and Your First Pipeline
5 min read
Safe Operating Throughput (SOT) as a First-Class SRE Metric: Derivation and Operationalization
17 min read
AI SRE: What an Autonomous Agent Doing On-Call Actually Looks Like
6 min read
Monitoring and Logging: How They Work Together and When You Need Both
8 min read
MCP Server Monitoring: How to Keep AI Agent Infrastructure Reliable
6 min read
Deploying Production Systems on Raspberry Pi: Lessons from the Field
7 min read
maskedcauses: Maximum Likelihood Estimation for Masked Series System Failures
5 min read
Model Selection for Weibull Series Systems: When Simpler Models Suffice
3 min read
The Economics of Reliability: When to Invest, When to Accept Risk
2 min read
Your Scraper Died at Row 12,000. The Rerun Pattern.
13 min read