Building Scalable Web Applications: Lessons from CheckID SA
Technical insights from building CheckID SA, which serves thousands of South African users monthly with 99.9% uptime.
CheckID SA is an identity verification platform serving South African users. Built to handle thousands of monthly validations with 99.9% uptime, the project taught me crucial lessons about building applications that scale reliably.
This article covers the technical decisions, architecture choices, and operational practices that keep CheckID SA running smoothly.
The Challenge
CheckID SA validates South African ID numbers in real-time. The requirements were clear:
- Near-instant validation (under 500ms response time)
- Handle traffic spikes (validation requests can surge)
- Maintain uptime during peak usage periods
- Scale cost-effectively
Architecture Decisions
Serverless-First Approach
I chose Next.js with serverless functions for the API layer. This provides:
- Automatic scaling (no server management)
- Pay-per-use pricing
- Built-in redundancy
- Global edge distribution
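As a rough sketch of what this looks like in practice (the route path, handler shape, and the `validateIdNumber` helper below are illustrative, not the production code), a validation endpoint in a Next.js App Router project runs as its own serverless function:

```typescript
// app/api/validate/route.ts -- hypothetical route path
import { NextResponse } from "next/server";

// Placeholder for the real validation logic described later in this article.
function validateIdNumber(id: string): { valid: boolean; reason?: string } {
  return /^\d{13}$/.test(id)
    ? { valid: true }
    : { valid: false, reason: "ID number must be 13 digits" };
}

export async function POST(request: Request) {
  const { idNumber } = await request.json();

  if (typeof idNumber !== "string") {
    return NextResponse.json({ error: "idNumber is required" }, { status: 400 });
  }

  // Each request runs in its own serverless invocation, so the platform
  // scales instances up and down with traffic automatically.
  return NextResponse.json(validateIdNumber(idNumber.trim()));
}
```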
Database Strategy
Rather than a traditionally hosted, single-region relational database, I used:
- Edge-compatible database (Vercel Postgres, accessible from edge runtimes)
- Connection pooling for efficient queries
- Read replicas for geographic distribution
- Caching layer for frequently accessed data
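The exact database client isn't the point here, but connection pooling is what keeps serverless functions from exhausting the database. A minimal sketch using node-postgres (the pool limits, environment variable name, and `validation_rules` table are assumptions for illustration):

```typescript
import { Pool } from "pg";

// Reuse one pool per function instance instead of opening a connection
// per request; the pool caps concurrent connections to protect the database.
const pool = new Pool({
  connectionString: process.env.DATABASE_URL, // assumed env var name
  max: 10,                   // hard cap on connections per instance
  idleTimeoutMillis: 30_000, // release idle connections quickly
});

export async function loadValidationRules() {
  // Pooled query: borrows a connection, runs the statement, returns it.
  const { rows } = await pool.query("SELECT * FROM validation_rules");
  return rows;
}
```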
Caching Layer
A multi-tier caching strategy:
- Edge caching: Static validation rules cached at CDN edge
- Application caching: Frequently validated IDs cached in memory
- Database caching: Query results cached with TTL
This reduces database load by 80% during peak traffic.
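The application-level tier can be as simple as an in-memory map with a TTL sitting in front of the database. A minimal sketch (the cache key scheme, TTL value, and `fetchFromDatabase` callback are illustrative):

```typescript
type CacheEntry<T> = { value: T; expiresAt: number };

const cache = new Map<string, CacheEntry<unknown>>();
const TTL_MS = 60_000; // short TTL keeps results fresh and limits retention

export async function cachedLookup<T>(
  key: string,
  fetchFromDatabase: () => Promise<T>
): Promise<T> {
  const hit = cache.get(key) as CacheEntry<T> | undefined;

  // Cache hit that hasn't expired: answer without touching the database.
  if (hit && hit.expiresAt > Date.now()) return hit.value;

  // Cache miss: fall through to the database and remember the result.
  const value = await fetchFromDatabase();
  cache.set(key, { value, expiresAt: Date.now() + TTL_MS });
  return value;
}
```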
Performance Optimization
Response Time Targets
- API response: < 200ms
- Database queries: < 50ms
- Cache hits: < 10ms
Every component was optimized to meet these targets.
Database Query Optimization
The ID validation algorithm runs entirely in the database using stored procedures. This:
- Eliminates application-level processing time
- Reduces network round trips
- Leverages database query optimization
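From the API layer, pushing the work into the database means a single call per validation. The stored procedure name below is hypothetical (the real routine isn't shown in this article), but the shape of the call is the same:

```typescript
import { Pool } from "pg";

// Shared pool, as in the earlier connection-pooling sketch.
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function validateInDatabase(idNumber: string) {
  // One round trip: the database runs the full validation routine and
  // returns a single row, rather than shipping data back for app-side checks.
  const { rows } = await pool.query(
    "SELECT * FROM validate_sa_id($1)", // hypothetical stored procedure
    [idNumber]
  );
  return rows[0];
}
```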
Edge Deployment
All static assets and API routes are deployed to edge locations close to users. This reduces latency for South African users significantly.
Reliability and Uptime
Monitoring and Alerting
I implemented comprehensive monitoring:
- Uptime monitoring (checks every 60 seconds)
- Error rate tracking
- Response time monitoring
- Database connection pool monitoring
Alerts trigger if:
- Response time exceeds 1 second
- Error rate exceeds 1%
- Database connections approach limits
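Those thresholds translate directly into a small alerting check. A sketch of the idea (the metric shapes and threshold constants mirror the rules above; how metrics are collected and where alerts are sent is left out):

```typescript
// Illustrative thresholds matching the alerting rules above.
const MAX_RESPONSE_TIME_MS = 1_000;
const MAX_ERROR_RATE = 0.01;
const MAX_POOL_USAGE = 0.9;

interface Metrics {
  p95ResponseTimeMs: number;
  errorRate: number; // errors / total requests over the window
  poolUsage: number; // active connections / pool limit
}

export function checkAlerts(metrics: Metrics): string[] {
  const alerts: string[] = [];
  if (metrics.p95ResponseTimeMs > MAX_RESPONSE_TIME_MS)
    alerts.push("Response time exceeds 1 second");
  if (metrics.errorRate > MAX_ERROR_RATE)
    alerts.push("Error rate exceeds 1%");
  if (metrics.poolUsage > MAX_POOL_USAGE)
    alerts.push("Database connections approaching limit");
  return alerts; // caller forwards non-empty results to the paging channel
}
```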
Graceful Degradation
If the primary database is unavailable, the system:
- Falls back to cached validation rules
- Returns partial results (basic validation only)
- Logs errors for post-processing
Users always get a response, even if full validation isn't available.
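In code, the fallback is a straightforward try/catch around the database path. A sketch only (the `./db` import refers to the earlier database sketch, and `basicFormatCheck` stands in for the cached, format-only rules):

```typescript
import { validateInDatabase } from "./db"; // hypothetical module path

function basicFormatCheck(idNumber: string): boolean {
  // Offline fallback: structural check only (13 digits).
  return /^\d{13}$/.test(idNumber.trim());
}

export async function validateWithFallback(idNumber: string) {
  try {
    // Full validation path against the primary database.
    const result = await validateInDatabase(idNumber);
    return { ...result, degraded: false };
  } catch (error) {
    // Primary database unavailable: degrade to format-only validation
    // and log the failure for post-processing.
    console.error("Full validation unavailable, degrading", { error });
    return { valid: basicFormatCheck(idNumber), degraded: true };
  }
}
```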
Database Redundancy
The database setup includes:
- Primary database with automatic failover
- Read replicas in multiple regions
- Automated backups every 6 hours
- Point-in-time recovery capability
Load Testing
Before launch, I ran load tests simulating:
- 100 concurrent requests
- 1,000 requests per minute
- Sustained traffic over 1 hour
This identified bottlenecks before they affected users.
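A dedicated load-testing tool does this more rigorously, but even a short script firing concurrent requests surfaces obvious bottlenecks. A rough sketch (the URL and request payload are placeholders):

```typescript
// Fire N concurrent requests and report a rough p95 latency.
async function loadTest(url: string, concurrency: number) {
  const latencies: number[] = [];

  await Promise.all(
    Array.from({ length: concurrency }, async () => {
      const start = Date.now();
      await fetch(url, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ idNumber: "0000000000000" }), // dummy payload
      });
      latencies.push(Date.now() - start);
    })
  );

  latencies.sort((a, b) => a - b);
  const p95 = latencies[Math.floor(latencies.length * 0.95)];
  console.log(`p95: ${p95}ms over ${concurrency} concurrent requests`);
}

loadTest("https://example.com/api/validate", 100); // placeholder URL
```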
Scaling Patterns
Horizontal Scaling
The serverless architecture scales horizontally automatically. As traffic increases:
- More function instances spin up
- Load distributes across instances
- No manual intervention required
Database Scaling
Database scaling follows a tiered approach:
- Connection pooling (maximize existing resources)
- Read replicas (distribute read load)
- Vertical scaling (upgrade instance size)
- Horizontal scaling (sharding if needed)
Currently at the second tier (read replicas), with capacity for 10x the current load.
Cost Management
Serverless pricing means costs scale with usage:
- Low traffic periods: minimal costs
- Peak traffic: higher costs, but still predictable
- No idle server costs
I monitor costs weekly and set alerts for unusual spikes.
Security Considerations
Input Validation
All ID numbers are validated before processing:
- Format validation (correct length and structure)
- Sanitization (remove whitespace, normalize)
- Rate limiting (prevent abuse)
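For context, a South African ID number is 13 digits ending in a Luhn check digit, so the format check can reject most bad input before anything else runs. This is a sketch of the structural checks only, not the full validation pipeline:

```typescript
// Structural checks on a South African ID number: 13 digits, a plausible
// date-of-birth prefix (YYMMDD), and a Luhn check digit at the end.
export function isStructurallyValid(raw: string): boolean {
  const id = raw.replace(/\s+/g, ""); // sanitize: strip whitespace

  if (!/^\d{13}$/.test(id)) return false;

  const month = Number(id.slice(2, 4));
  const day = Number(id.slice(4, 6));
  if (month < 1 || month > 12 || day < 1 || day > 31) return false;

  // Luhn checksum over all 13 digits must be divisible by 10.
  let sum = 0;
  for (let i = 0; i < 13; i++) {
    let digit = Number(id[i]);
    if ((13 - i) % 2 === 0) {
      // Double every second digit from the right.
      digit *= 2;
      if (digit > 9) digit -= 9;
    }
    sum += digit;
  }
  return sum % 10 === 0;
}
```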
Data Privacy
- ID numbers are never stored (only validation results)
- Results are cached with short TTLs
- No personal information collected
- GDPR-compliant data handling
API Security
- API key authentication
- Rate limiting per key
- Request logging for audit trails
- DDoS protection via Cloudflare
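A minimal version of the API key check can live in Next.js middleware, so unauthenticated requests never reach the handlers. The header name, environment variable, and the rate-limiting hook below are assumptions, not the production setup:

```typescript
// middleware.ts -- runs before matched API routes
import { NextResponse, type NextRequest } from "next/server";

export function middleware(request: NextRequest) {
  const apiKey = request.headers.get("x-api-key"); // assumed header name

  // Reject requests without a recognised key before they reach the handler.
  if (!apiKey || apiKey !== process.env.API_KEY) {
    return NextResponse.json({ error: "Unauthorized" }, { status: 401 });
  }

  // Per-key rate limiting and audit logging would hook in here.
  return NextResponse.next();
}

export const config = { matcher: "/api/:path*" };
```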
Operational Practices
Deployment Strategy
- Zero-downtime deployments
- Feature flags for gradual rollouts
- Automated rollback on errors
- Database migrations run separately from code deployments
Error Handling
Every error is:
- Logged with context
- Categorized by severity
- Tracked in error monitoring
- Reviewed weekly for patterns
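A thin logging helper keeps this consistent across handlers. The severity levels and the logging sink below are illustrative; in production the output is forwarded to the error-monitoring service:

```typescript
type Severity = "info" | "warning" | "critical";

// Log with enough context to categorize and group errors for weekly review.
export function logError(
  error: unknown,
  context: Record<string, unknown>,
  severity: Severity = "warning"
) {
  console.error(
    JSON.stringify({
      severity,
      message: error instanceof Error ? error.message : String(error),
      ...context,
      timestamp: new Date().toISOString(),
    })
  );
}
```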
Performance Monitoring
I track:
- Average response times
- P95 and P99 response times
- Error rates by endpoint
- Database query performance
Weekly reviews identify optimization opportunities.
Lessons Learned
1. Start with Edge Computing
Deploying to edge locations from day one improved performance significantly. Users in South Africa experience sub-200ms response times despite servers being geographically distant.
2. Cache Aggressively
The caching layer handles most requests without hitting the database. This reduces costs and improves reliability.
3. Monitor Everything
Comprehensive monitoring caught issues before they affected users. Response time alerts prevented degraded performance.
4. Plan for Failure
Graceful degradation ensures users always get value, even when components fail. This builds trust and reliability.
5. Load Test Early
Load testing before launch identified bottlenecks that would have caused issues under real traffic. Fixing these proactively saved headaches later.
Current Performance
CheckID SA now handles:
- Thousands of validations monthly
- 99.9% uptime (monitored)
- Average response time: 150ms
- P95 response time: 300ms
- Zero unplanned downtime
The architecture continues to scale as usage grows, with clear paths for further optimization when needed.
Future Considerations
As traffic grows, potential optimizations include:
- Additional read replicas in more regions
- More aggressive caching strategies
- Database query optimization
- CDN integration for static assets
The foundation is solid, and scaling is straightforward.