Technical
5 Jan 2025
10 min read

Building Scalable Web Applications: Lessons from CheckID SA

Technical insights from building CheckID SA, which serves thousands of South African users monthly with 99.9% uptime.

CheckID SA is an identity verification platform serving South African users. Built to handle thousands of monthly validations with 99.9% uptime, the project taught me crucial lessons about building applications that scale reliably.

This article covers the technical decisions, architecture choices, and operational practices that keep CheckID SA running smoothly.

The Challenge

CheckID SA validates South African ID numbers in real-time. The requirements were clear:

  • Instant validation (under 500ms response time)
  • Handle traffic spikes (validation requests can surge)
  • Maintain uptime during peak usage periods
  • Scale cost-effectively

Architecture Decisions

Serverless-First Approach

I chose Next.js with serverless functions for the API layer (a minimal route sketch follows this list). This provides:

  • Automatic scaling (no server management)
  • Pay-per-use pricing
  • Built-in redundancy
  • Global edge distribution
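
To make the shape of this concrete, here is a minimal sketch of what a validation endpoint looks like as a Next.js route handler. The route path, request shape, and validateIdNumber helper are placeholders for illustration, not CheckID SA's actual code.

```ts
// app/api/validate/route.ts — hypothetical path, illustrative only
import { NextResponse } from "next/server";

// Stand-in for the real validation logic covered in later sections.
function validateIdNumber(id: string): { valid: boolean; reason?: string } {
  return /^\d{13}$/.test(id)
    ? { valid: true }
    : { valid: false, reason: "ID number must be 13 digits" };
}

export async function POST(request: Request) {
  const { idNumber } = await request.json();

  if (typeof idNumber !== "string") {
    return NextResponse.json({ error: "idNumber is required" }, { status: 400 });
  }

  // Each invocation runs in an isolated serverless function instance,
  // so scale-out is handled by the platform rather than by us.
  return NextResponse.json(validateIdNumber(idNumber.trim()));
}
```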

Database Strategy

Rather than a single, traditionally hosted database server, I used the following (see the sketch after this list):

  • Edge-compatible database (Vercel Postgres, queryable from edge functions)
  • Connection pooling for efficient queries
  • Read replicas for geographic distribution
  • Caching layer for frequently accessed data
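
The exact driver and topology aren't reproduced here, but the pooling and read/write routing pattern looks roughly like the sketch below, using the node-postgres (pg) package. The connection strings, pool sizes, and replica setup are illustrative assumptions; in a serverless environment you would normally also put a shared pooler (such as PgBouncer) in front of the database.

```ts
// lib/db.ts — illustrative pooling plus read/write routing with node-postgres (pg)
import { Pool } from "pg";

// The primary handles writes; the replica URL points at a read replica.
// Both environment variables are placeholders.
const primary = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10,                   // cap concurrent connections per function instance
  idleTimeoutMillis: 30_000, // release idle connections promptly
});

const replica = new Pool({
  connectionString: process.env.DATABASE_REPLICA_URL ?? process.env.DATABASE_URL,
  max: 10,
  idleTimeoutMillis: 30_000,
});

// Route reads to the replica and writes to the primary.
export function query(text: string, params: unknown[] = [], opts?: { write?: boolean }) {
  const pool = opts?.write ? primary : replica;
  return pool.query(text, params);
}
```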

Caching Layer

A multi-tier caching strategy:

  1. Edge caching: Static validation rules cached at CDN edge
  2. Application caching: Frequently validated IDs cached in memory
  3. Database caching: Query results cached with TTL

This reduces database load by 80% during peak traffic.
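
The application-level tier is the simplest to sketch: a small in-memory map with a TTL keeps hot validation results out of the database while a function instance stays warm. The shape below is illustrative, not the production cache.

```ts
// lib/cache.ts — minimal in-memory TTL cache (illustrative)
type Entry<T> = { value: T; expiresAt: number };

const store = new Map<string, Entry<unknown>>();

export function cacheGet<T>(key: string): T | undefined {
  const entry = store.get(key);
  if (!entry) return undefined;
  if (Date.now() > entry.expiresAt) {
    store.delete(key); // lazily evict expired entries
    return undefined;
  }
  return entry.value as T;
}

export function cacheSet<T>(key: string, value: T, ttlMs = 60_000): void {
  store.set(key, { value, expiresAt: Date.now() + ttlMs });
}
```

The edge tier, by contrast, is mostly a matter of returning appropriate Cache-Control headers on the responses that serve the static validation rules, so the CDN can answer those requests without reaching the origin at all.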

Performance Optimization

Response Time Targets

  • API response: < 200ms
  • Database queries: < 50ms
  • Cache hits: < 10ms

Every component was optimized to meet these targets.

Database Query Optimization

The ID validation algorithm runs entirely in the database as a stored procedure, which the application calls in a single query (sketched after this list). This:

  • Eliminates application-level processing time
  • Reduces network round trips
  • Leverages database query optimization
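
I won't reproduce the stored procedure itself, but from the application's side the whole validation collapses into a single round trip. The procedure name, signature, and result shape below are placeholders.

```ts
// Illustrative call to a hypothetical validate_sa_id() function in Postgres
import { query } from "@/lib/db"; // pooled client from the database section; path is illustrative

export async function validateInDatabase(idNumber: string) {
  // Parsing, checksum, and rule checks all happen inside the database,
  // so the function instance does no per-request computation of its own.
  const { rows } = await query("SELECT * FROM validate_sa_id($1)", [idNumber]);
  return rows[0]; // e.g. { valid: boolean, reason: string | null } — shape is illustrative
}
```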

Edge Deployment

All static assets and API routes are deployed to edge locations close to users. This reduces latency for South African users significantly.

Reliability and Uptime

Monitoring and Alerting

I implemented comprehensive monitoring (a health-check endpoint sketch follows the alert thresholds below):

  • Uptime monitoring (checks every 60 seconds)
  • Error rate tracking
  • Response time monitoring
  • Database connection pool monitoring

Alerts trigger if:

  • Response time exceeds 1 second
  • Error rate exceeds 1%
  • Database connections approach limits
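
The uptime checks need something cheap and honest to hit. A minimal health endpoint that also exercises a database connection might look like the sketch below; the route path and response shape are assumptions rather than CheckID SA's actual setup.

```ts
// app/api/health/route.ts — illustrative health endpoint for uptime checks
import { NextResponse } from "next/server";
import { query } from "@/lib/db"; // pooled client from the database section; path is illustrative

export async function GET() {
  const startedAt = Date.now();
  try {
    await query("SELECT 1"); // verifies a connection can be acquired
    return NextResponse.json({ status: "ok", dbLatencyMs: Date.now() - startedAt });
  } catch {
    // A non-200 response is what the uptime monitor alerts on.
    return NextResponse.json({ status: "degraded" }, { status: 503 });
  }
}
```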

Graceful Degradation

If the primary database is unavailable, the system:

  1. Falls back to cached validation rules
  2. Returns partial results (basic validation only)
  3. Logs errors for post-processing

Users always get a response, even if full validation isn't available.
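
In code, the degradation path is little more than a try/catch around the full validation call. The basic structural check below (13 digits plus a plausible date prefix) stands in for whatever the cached rules actually cover; it is a sketch, not the production fallback.

```ts
// Illustrative fallback: full validation when the database is up, basic checks otherwise
import { validateInDatabase } from "@/lib/validate"; // from the query-optimization section; path is illustrative

function basicStructuralCheck(idNumber: string) {
  // Minimal offline rules: 13 digits and a plausible YYMMDD date prefix.
  if (!/^\d{13}$/.test(idNumber)) return { valid: false, partial: true };
  const month = Number(idNumber.slice(2, 4));
  const day = Number(idNumber.slice(4, 6));
  return { valid: month >= 1 && month <= 12 && day >= 1 && day <= 31, partial: true };
}

export async function validateWithFallback(idNumber: string) {
  try {
    return { ...(await validateInDatabase(idNumber)), partial: false };
  } catch (error) {
    console.error("validation fallback engaged", { error: String(error) }); // logged for post-processing
    return basicStructuralCheck(idNumber);
  }
}
```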

Database Redundancy

The database setup includes:

  • Primary database with automatic failover
  • Read replicas in multiple regions
  • Automated backups every 6 hours
  • Point-in-time recovery capability

Load Testing

Before launch, I ran load tests simulating:

  • 100 concurrent requests
  • 1,000 requests per minute
  • Sustained traffic over 1 hour

This identified bottlenecks before they affected users.
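
The actual test harness isn't reproduced here, but a short Node script along these lines is enough to recreate the concurrent-request scenario against a staging deployment. The URL, payload, and concurrency figure are placeholders.

```ts
// loadtest.ts — crude concurrency test (illustrative); run with: npx tsx loadtest.ts
const TARGET = "https://staging.example.com/api/validate"; // placeholder URL
const CONCURRENCY = 100;

async function timedRequest(): Promise<number> {
  const start = Date.now();
  await fetch(TARGET, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ idNumber: "0000000000000" }), // dummy payload
  });
  return Date.now() - start;
}

async function main() {
  const latencies = await Promise.all(
    Array.from({ length: CONCURRENCY }, () => timedRequest()),
  );
  latencies.sort((a, b) => a - b);
  console.log({
    requests: CONCURRENCY,
    p95Ms: latencies[Math.floor(latencies.length * 0.95)],
    maxMs: latencies[latencies.length - 1],
  });
}

main();
```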

Scaling Patterns

Horizontal Scaling

The serverless architecture scales horizontally automatically. As traffic increases:

  • More function instances spin up
  • Load distributes across instances
  • No manual intervention required

Database Scaling

Database scaling follows a tiered approach:

  1. Connection pooling (maximize existing resources)
  2. Read replicas (distribute read load)
  3. Vertical scaling (upgrade instance size)
  4. Horizontal scaling (sharding if needed)

The system is currently at stage 2, with capacity for 10x the current load.

Cost Management

Serverless pricing means costs scale with usage:

  • Low traffic periods: minimal costs
  • Peak traffic: higher costs, but still predictable
  • No idle server costs

I monitor costs weekly and set alerts for unusual spikes.

Security Considerations

Input Validation

All ID numbers are validated before processing (see the sketch after this list):

  • Format validation (correct length and structure)
  • Sanitization (remove whitespace, normalize)
  • Rate limiting (prevent abuse)
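
Concretely, the format and sanitization steps look something like the sketch below. South African ID numbers are 13 digits with a YYMMDD date prefix and a Luhn check digit; the helper here covers those structural checks only and is not the production validator.

```ts
// Illustrative structural checks for a 13-digit South African ID number
export function sanitizeIdNumber(raw: string): string {
  return raw.replace(/\s+/g, ""); // strip whitespace before validating
}

function passesLuhn(digits: string): boolean {
  // Standard Luhn checksum; the rightmost digit is the check digit.
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = Number(digits[digits.length - 1 - i]);
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}

export function checkFormat(raw: string): { valid: boolean; reason?: string } {
  const id = sanitizeIdNumber(raw);
  if (!/^\d{13}$/.test(id)) return { valid: false, reason: "must be exactly 13 digits" };
  const month = Number(id.slice(2, 4));
  const day = Number(id.slice(4, 6));
  if (month < 1 || month > 12 || day < 1 || day > 31) {
    return { valid: false, reason: "invalid date-of-birth segment" };
  }
  if (!passesLuhn(id)) return { valid: false, reason: "checksum failed" };
  return { valid: true };
}
```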

Data Privacy

  • ID numbers are never stored (only validation results)
  • Results are cached with short TTLs
  • No personal information collected
  • GDPR-compliant data handling

API Security

  • API key authentication
  • Rate limiting per key (a minimal limiter sketch follows this list)
  • Request logging for audit trails
  • DDoS protection via Cloudflare
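
A per-key limiter can be as small as a fixed window of counters, as sketched below. In production the counters would normally live at the edge or in a shared store such as Redis, since an in-memory map only covers a single function instance; the limits shown are placeholders.

```ts
// Illustrative fixed-window rate limiter keyed by API key (per-instance only)
const WINDOW_MS = 60_000;  // one-minute window (placeholder)
const MAX_REQUESTS = 100;  // requests allowed per key per window (placeholder)

const windows = new Map<string, { count: number; resetAt: number }>();

export function allowRequest(apiKey: string): boolean {
  const now = Date.now();
  const current = windows.get(apiKey);

  if (!current || now > current.resetAt) {
    windows.set(apiKey, { count: 1, resetAt: now + WINDOW_MS });
    return true;
  }
  if (current.count >= MAX_REQUESTS) return false; // caller should respond with HTTP 429
  current.count += 1;
  return true;
}
```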

Operational Practices

Deployment Strategy

  • Zero-downtime deployments
  • Feature flags for gradual rollouts (a minimal flag sketch follows this list)
  • Automated rollback on errors
  • Database migrations run separately from code deployments
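
The flags don't require a heavyweight service; even a percentage rollout driven by an environment variable covers the gradual-rollout case. The flag name and hashing scheme below are hypothetical.

```ts
// Illustrative percentage-based feature flag (flag name and mechanism are hypothetical)
export function isEnabled(flag: string, userKey: string): boolean {
  // e.g. FLAG_NEW_VALIDATOR=25 rolls the feature out to roughly 25% of keys
  const rolloutPercent = Number(process.env[`FLAG_${flag}`] ?? 0);

  // Cheap stable hash so a given key consistently lands in or out of the rollout.
  let bucket = 0;
  for (const char of userKey) bucket = (bucket * 31 + char.charCodeAt(0)) % 100;

  return bucket < rolloutPercent;
}
```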

Error Handling

Every error is:

  • Logged with context (a minimal logging sketch follows this list)
  • Categorized by severity
  • Tracked in error monitoring
  • Reviewed weekly for patterns
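
The logging side of this is deliberately boring: one helper that attaches context and a severity so the weekly review can group errors by pattern. The shape is illustrative.

```ts
// Illustrative structured error logger with severity
type Severity = "info" | "warning" | "critical";

export function logError(
  error: unknown,
  context: Record<string, unknown>,
  severity: Severity = "warning",
) {
  // JSON lines are easy for an error-monitoring service to ingest and group.
  console.error(
    JSON.stringify({
      severity,
      message: error instanceof Error ? error.message : String(error),
      stack: error instanceof Error ? error.stack : undefined,
      ...context,
      timestamp: new Date().toISOString(),
    }),
  );
}
```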

Performance Monitoring

I track:

  • Average response times
  • P95 and P99 response times
  • Error rates by endpoint
  • Database query performance

Weekly reviews identify optimization opportunities.

Lessons Learned

1. Start with Edge Computing

Deploying to edge locations from day one improved performance significantly. Users in South Africa experience sub-200ms response times even though the origin servers are geographically distant.

2. Cache Aggressively

The caching layer handles most requests without hitting the database. This reduces costs and improves reliability.

3. Monitor Everything

Comprehensive monitoring caught issues before they affected users. Response time alerts prevented degraded performance.

4. Plan for Failure

Graceful degradation ensures users always get value, even when components fail. This builds trust and reliability.

5. Load Test Early

Load testing before launch identified bottlenecks that would have caused issues under real traffic. Fixing these proactively saved headaches later.

Current Performance

CheckID SA now handles:

  • Thousands of validations monthly
  • 99.9% uptime (monitored)
  • Average response time: 150ms
  • P95 response time: 300ms
  • Zero unplanned downtime

The architecture continues to scale as usage grows, with clear paths for further optimization when needed.

Future Considerations

As traffic grows, potential optimizations include:

  • Additional read replicas in more regions
  • More aggressive caching strategies
  • Database query optimization
  • CDN integration for static assets

The foundation is solid, and scaling is straightforward.