Feature Request: Kubernetes Health and Metrics Endpoints
🚀 Feature Description
Is your feature request related to a problem? Please describe.
Currently, QQQ applications lack standardized health check and metrics endpoints that are compatible with modern cloud-native infrastructure like Kubernetes, Prometheus, and Grafana. This makes it difficult to:
- Implement Kubernetes liveness and readiness probes
- Monitor application health in containerized environments
- Collect and aggregate metrics across distributed QQQ applications
- Integrate with observability platforms (Prometheus, Grafana, Datadog, etc.)
- Follow cloud-native best practices for application monitoring
Describe the solution you'd like:
Add optional middleware modules (qqq-middleware-health and qqq-middleware-metrics) that provide:
- Health Endpoint (/health):
  - Kubernetes-compatible health checks (liveness and readiness probes)
  - Configurable health indicators (database connectivity, external service availability, etc.)
  - JSON response format compatible with standard health check specifications
  - Extensible health check registration system
- Metrics Endpoint (/metrics):
  - Prometheus-compatible metrics exposition format
  - Standard JVM metrics (memory, threads, GC)
  - Application-level metrics (request counts, response times, error rates)
  - Custom metric registration API for application-specific measurements
  - Optional OpenMetrics format support
- Configurability:
  - Customizable endpoint paths (defaults: /health and /metrics)
  - Optional authentication/authorization for endpoints
  - Selective metric collection (enable/disable specific metric groups)
  - Health check component registration API
  - Configurable metric labels and tags
Describe alternatives you've considered:
- Manual Implementation: Users could implement their own health/metrics endpoints in each application, but this:
  - Lacks standardization across QQQ applications
  - Requires boilerplate code in every project
  - Doesn't leverage QQQ's meta-data architecture
  - Misses opportunity for framework-level optimizations
- Third-Party Libraries: Direct integration of libraries like Micrometer or Spring Boot Actuator, but:
  - Introduces heavy dependencies
  - May not align with QQQ's architectural patterns
  - Doesn't integrate with QQQ's existing middleware abstractions
  - Could create version conflicts with existing dependencies
- Middleware-Specific Solutions: Implementing only for Javalin or other specific middleware, but:
  - Creates inconsistency across middleware implementations
  - Doesn't benefit Lambda, PicoCLI, or future middleware options
  - Misses opportunity for shared health/metrics abstractions
💡 Use Case
What is the use case for this feature?
Who would benefit from this feature?
- DevOps engineers deploying QQQ applications to Kubernetes clusters
- SRE teams monitoring QQQ application health and performance
- Enterprise users requiring compliance with observability standards
- Development teams debugging production issues with metrics data
- Platform engineers building multi-tenant QQQ hosting infrastructure
What scenarios would this feature be useful in?
- Kubernetes Deployments: Configure liveness/readiness probes for automatic pod lifecycle management
- Load Balancer Health Checks: Integrate with AWS ELB, GCP Load Balancer, or Azure Application Gateway
- Monitoring & Alerting: Scrape metrics into Prometheus and create Grafana dashboards
- Auto-Scaling: Use metrics to drive Kubernetes HPA (Horizontal Pod Autoscaler) decisions
- Incident Response: Quickly identify unhealthy components during production issues
- Capacity Planning: Analyze historical metrics to forecast resource requirements
- Performance Optimization: Identify bottlenecks through request latency metrics
How would this improve the QQQ framework?
- Cloud-Native Readiness: Positions QQQ as a first-class citizen in containerized environments
- Production Maturity: Demonstrates enterprise-grade operational capabilities
- Ecosystem Integration: Enables QQQ to integrate with standard observability tooling
- Developer Experience: Reduces boilerplate for production-ready deployments
- Operational Excellence: Provides visibility into application health without custom code
- Competitive Positioning: Aligns QQQ with expectations for modern application frameworks
🔧 Implementation Ideas
Do you have any ideas about how this could be implemented?
Technical Approach
Module Structure:
qqq-middleware-health/
├── src/main/java/com/kingsrook/qqq/middleware/health/
│   ├── HealthCheckRegistry.java           # Central registry for health checks
│   ├── HealthCheckResult.java             # Health check result model
│   ├── HealthIndicator.java               # Interface for health indicators
│   ├── model/
│   │   ├── HealthCheckMetaData.java       # MetaData for health endpoint config
│   │   ├── HealthStatus.java              # Enum: UP, DOWN, DEGRADED, UNKNOWN
│   │   └── HealthResponse.java            # JSON response structure
│   ├── indicators/
│   │   ├── DatabaseHealthIndicator.java   # Check database connectivity
│   │   ├── MemoryHealthIndicator.java     # Check JVM memory thresholds
│   │   ├── DiskSpaceHealthIndicator.java  # Check available disk space
│   │   └── CustomHealthIndicator.java     # Base for user-defined checks
│   └── middleware/
│       ├── JavalinHealthRouteProvider.java
│       ├── LambdaHealthHandler.java
│       └── PicoCLIHealthCommand.java
qqq-middleware-metrics/
├── src/main/java/com/kingsrook/qqq/middleware/metrics/
│   ├── MetricsRegistry.java               # Central metrics registry
│   ├── MetricCollector.java               # Interface for metric collectors
│   ├── model/
│   │   ├── MetricsMetaData.java           # MetaData for metrics endpoint config
│   │   ├── Metric.java                    # Individual metric representation
│   │   └── MetricType.java                # Enum: COUNTER, GAUGE, HISTOGRAM, SUMMARY
│   ├── collectors/
│   │   ├── JvmMetricsCollector.java       # JVM memory, threads, GC
│   │   ├── RequestMetricsCollector.java   # HTTP request stats
│   │   ├── ProcessMetricsCollector.java   # QQQ process execution metrics
│   │   └── DatabaseMetricsCollector.java  # Connection pool, query times
│   ├── formatters/
│   │   ├── PrometheusFormatter.java       # Prometheus exposition format
│   │   ├── OpenMetricsFormatter.java      # OpenMetrics format
│   │   └── JsonFormatter.java             # JSON format for other tools
│   └── middleware/
│       ├── JavalinMetricsRouteProvider.java
│       ├── LambdaMetricsHandler.java
│       └── PicoCLIMetricsCommand.java
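To make the health module layout more concrete, here is a minimal sketch of how a HealthIndicator interface and a registry-style aggregation might fit together. None of this is existing QQQ API: the method names, the `HealthCheckSketch` class, and the aggregation rule (overall status is DOWN only when some indicator is DOWN, so a DEGRADED check is reported without failing the endpoint) are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;

/** Mirrors the status values listed for the proposed HealthStatus enum. */
enum HealthStatus { UP, DOWN, DEGRADED, UNKNOWN }

/** Hypothetical shape of the proposed HealthIndicator interface. */
interface HealthIndicator
{
   String getName();

   HealthStatus check() throws Exception;
}

/** Hypothetical stand-in for HealthCheckRegistry: runs indicators and aggregates. */
final class HealthCheckSketch
{
   /** Convenience factory so a check can be written as a lambda. */
   static HealthIndicator indicator(String name, Callable<HealthStatus> body)
   {
      return new HealthIndicator()
      {
         @Override
         public String getName()
         {
            return name;
         }

         @Override
         public HealthStatus check() throws Exception
         {
            return body.call();
         }
      };
   }

   /** Runs every indicator; an indicator that throws is recorded as DOWN. */
   static Map<String, HealthStatus> runAll(List<HealthIndicator> indicators)
   {
      Map<String, HealthStatus> results = new LinkedHashMap<>();
      for (HealthIndicator indicator : indicators)
      {
         HealthStatus status;
         try
         {
            status = indicator.check();
         }
         catch (Exception e)
         {
            status = HealthStatus.DOWN;
         }
         results.put(indicator.getName(), status == null ? HealthStatus.UNKNOWN : status);
      }
      return results;
   }

   /** Assumed rule: only a DOWN indicator fails the overall status. */
   static HealthStatus overall(Map<String, HealthStatus> results)
   {
      return results.containsValue(HealthStatus.DOWN) ? HealthStatus.DOWN : HealthStatus.UP;
   }
}
```

Catching exceptions inside `runAll` keeps one misbehaving indicator from breaking the whole endpoint. A stricter rule (for example, treating DEGRADED as a readiness failure) would only change `overall`.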
Configuration via QInstance MetaData:
QInstance qInstance = QInstance.create()
   .withHealthCheck(new HealthCheckMetaData()
      .withEnabled(true)
      .withEndpointPath("/health") // Configurable, defaults to /health
      .withIndicators(List.of(
         new DatabaseHealthIndicator(),
         new MemoryHealthIndicator().withThreshold(90), // 90% memory threshold
         new CustomHealthIndicator("external-api", this::checkExternalApi)
      ))
      .withAuthenticationRequired(false))
   .withMetrics(new MetricsMetaData()
      .withEnabled(true)
      .withEndpointPath("/metrics") // Configurable, defaults to /metrics
      .withFormat(MetricsFormat.PROMETHEUS)
      .withCollectors(List.of(
         new JvmMetricsCollector(),
         new RequestMetricsCollector(),
         new ProcessMetricsCollector()
      ))
      .withAuthenticationRequired(true) // Optional authentication
      .withLabels(Map.of(
         "application", "my-qqq-app",
         "environment", "production"
      )));
Health Check API Integration:
// In middleware (e.g., Javalin)
public class JavalinHealthRouteProvider implements RouteProviderInterface
{
   @Override
   public void defineRoutes(Javalin app, QInstance qInstance)
   {
      HealthCheckMetaData healthConfig = qInstance.getHealthCheck();
      if (healthConfig != null && healthConfig.getEnabled())
      {
         String path = healthConfig.getEndpointPath();
         app.get(path, ctx -> {
            HealthResponse response = HealthCheckRegistry.check(qInstance);
            ctx.json(response);
            ctx.status(response.getStatus() == HealthStatus.UP ? 200 : 503);
         });
      }
   }
}
Prometheus Metrics Format:
# HELP qqq_http_requests_total Total HTTP requests processed
# TYPE qqq_http_requests_total counter
qqq_http_requests_total{method="GET",path="/api/order",status="200"} 1547.0
qqq_http_requests_total{method="POST",path="/api/order",status="201"} 328.0
# HELP qqq_http_request_duration_seconds HTTP request duration
# TYPE qqq_http_request_duration_seconds histogram
qqq_http_request_duration_seconds_bucket{le="0.1"} 1234.0
qqq_http_request_duration_seconds_bucket{le="0.5"} 1850.0
qqq_http_request_duration_seconds_bucket{le="+Inf"} 1875.0
qqq_http_request_duration_seconds_sum 456.78
qqq_http_request_duration_seconds_count 1875.0
# HELP jvm_memory_used_bytes Memory used by JVM
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap"} 536870912.0
jvm_memory_used_bytes{area="nonheap"} 134217728.0
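As a rough illustration of what a PrometheusFormatter would have to do to produce lines like the ones above, the sketch below renders a metric header and a single sample. The class and method names are hypothetical. Label values are escaped for backslash, double quote, and newline as the text exposition format requires, and labels are sorted by key only so that the output is deterministic.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical helper showing how one exposition-format line could be rendered. */
final class PrometheusLineSketch
{
   /** Escapes backslash, double quote, and newline inside a label value. */
   static String escapeLabelValue(String value)
   {
      return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
   }

   /** Renders the "# HELP" and "# TYPE" lines that precede a metric family. */
   static String header(String name, String help, String type)
   {
      return "# HELP " + name + " " + help + "\n# TYPE " + name + " " + type + "\n";
   }

   /** Renders one sample line: name{label="value",...} value */
   static String sample(String name, Map<String, String> labels, double value)
   {
      String labelText = new TreeMap<>(labels).entrySet().stream()
         .map(e -> e.getKey() + "=\"" + escapeLabelValue(e.getValue()) + "\"")
         .collect(Collectors.joining(","));
      String labelBlock = labelText.isEmpty() ? "" : "{" + labelText + "}";
      return name + labelBlock + " " + value;
   }
}
```

For example, `PrometheusLineSketch.sample("qqq_http_requests_total", Map.of("method", "GET", "path", "/api/order", "status", "200"), 1547.0)` yields the first counter line shown above. Very large values are rendered by Java in scientific notation, which a real formatter may want to control explicitly.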
Health Check Response Format:
{
  "status": "UP",
  "timestamp": "2025-11-26T10:30:00Z",
  "checks": {
    "database": {
      "status": "UP",
      "duration": 23,
      "details": {
        "type": "PostgreSQL",
        "version": "15.3",
        "connectionPool": {
          "active": 5,
          "idle": 10,
          "max": 20
        }
      }
    },
    "memory": {
      "status": "UP",
      "duration": 2,
      "details": {
        "used": 512,
        "max": 2048,
        "percentage": 25
      }
    },
    "externalApi": {
      "status": "DEGRADED",
      "duration": 1523,
      "details": {
        "message": "Response time exceeds threshold",
        "threshold": 1000,
        "actual": 1523
      }
    }
  }
}
Similar Features in Other Frameworks
- Spring Boot Actuator: Provides /actuator/health and /actuator/metrics with similar functionality
- Micronaut Health: Built-in health endpoint with extensible indicators
- Quarkus SmallRye Health: MicroProfile Health implementation for Kubernetes
- Dropwizard Metrics: Comprehensive metrics library with Prometheus integration
- Micrometer: Vendor-neutral metrics facade supporting multiple monitoring systems
Constraints and Considerations
- Minimal Dependencies: Avoid heavy frameworks; implement core functionality in pure Java
- Performance Overhead: Metrics collection should have negligible performance impact (<1% CPU)
- Security: Health/metrics endpoints should support optional authentication (QQQ security keys)
- Backward Compatibility: Fully optional modules that don't affect existing applications
- Middleware Agnostic: Core abstractions should work across Javalin, Lambda, PicoCLI
- Thread Safety: Metrics collection must be thread-safe for concurrent requests
- Memory Footprint: Keep metric storage efficient (circular buffers, sampling strategies)
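The thread-safety and low-overhead constraints can be met for simple counters with JDK classes alone. The following is a minimal sketch, assuming a hypothetical `CounterRegistrySketch` class that is not part of QQQ: it keeps one `LongAdder` per metric key in a `ConcurrentHashMap`, so concurrent request threads can increment without explicit locking.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical counter store; one LongAdder per metric key. */
final class CounterRegistrySketch
{
   private final ConcurrentHashMap<String, LongAdder> counters = new ConcurrentHashMap<>();

   /** Adds one to the counter for the key, creating the counter on first use. */
   void increment(String key)
   {
      counters.computeIfAbsent(key, k -> new LongAdder()).increment();
   }

   /** Returns the current total, or 0 when the key has never been incremented. */
   long value(String key)
   {
      LongAdder adder = counters.get(key);
      return adder == null ? 0L : adder.sum();
   }
}
```

`LongAdder.sum()` is not an atomic snapshot while updates are in flight, which is normally acceptable for scraped metrics. The number of distinct keys still has to be bounded (for example by avoiding unbounded label values) to keep the memory footprint under control.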
📊 Impact Assessment
What is the impact of this feature?
Scope:
- Size: Medium to large - two new modules with comprehensive functionality
- Complexity: Medium - requires understanding of Prometheus format, Kubernetes probes, and middleware integration
- Integration: Touches multiple middleware modules but remains fully optional
- Documentation: Requires wiki pages, code examples, and deployment guides
Users:
- Immediate Benefit: Organizations deploying QQQ to Kubernetes/cloud environments
- Medium-Term Benefit: All users seeking production-grade observability
- Long-Term Benefit: Entire QQQ ecosystem as monitoring becomes standard practice
- Estimated Adoption: 40-60% of enterprise QQQ deployments within 12 months
Complexity:
- Implementation Complexity: Medium
  - Core abstractions: 3-5 days
  - Middleware integrations: 2-3 days per middleware
  - Standard collectors/indicators: 3-5 days
  - Testing and documentation: 5-7 days
  - Total: 3-4 weeks for complete implementation
- API Design: Must follow QQQ's MetaData patterns and fluent-style conventions
- Testing: Requires integration tests with actual Prometheus/K8s environments
Maintenance:
- Ongoing Maintenance: Low to medium
  - Prometheus format is stable (changes rare)
  - Kubernetes health check spec is mature
  - Primary maintenance: adding new built-in collectors/indicators
  - Security updates for authentication mechanisms
- Community Contributions: Likely source of new collector implementations
- Version Compatibility: Must maintain compatibility with multiple K8s versions
🔗 Related Resources
Before submitting, please check:
- Wiki Documentation for existing functionality
- Existing Issues for similar requests
- Architecture Guide for design context
- Feature Development Guide for implementation details
Related External Standards:
- Kubernetes Liveness/Readiness Probes
- Prometheus Exposition Format
- OpenMetrics Specification
- MicroProfile Health Specification
- Spring Boot Actuator Reference
📚 Getting Help
Need more information?
- 📖 Complete Documentation Wiki - Start here for comprehensive guides
- 🏗️ Architecture Overview - Understand QQQ's design
- 🔧 Feature Development - Learn how to extend QQQ
- 💬 GitHub Discussions - Discuss ideas with the community
🎯 Next Steps
If this feature is accepted:
- Create detailed design document covering:
  - API surface area and MetaData structures
  - Health check indicator interface and built-in implementations
  - Metrics collector interface and standard collectors
  - Middleware integration points (Javalin, Lambda, PicoCLI)
  - Authentication/authorization integration
  - Configuration examples for common scenarios
- Implement feature following Feature Development Guide:
  - Create qqq-middleware-health module structure
  - Create qqq-middleware-metrics module structure
  - Implement core abstractions (HealthIndicator, MetricCollector)
  - Implement standard indicators (Database, Memory, DiskSpace)
  - Implement standard collectors (JVM, Request, Process)
  - Implement Prometheus formatter
  - Integrate with Javalin middleware
  - Integrate with Lambda middleware
  - Integrate with PicoCLI middleware
  - Add optional authentication support
- Add tests following Testing Guide:
  - Unit tests for all collectors and indicators (>70% instruction coverage)
  - Integration tests with mock Prometheus scraper
  - Kubernetes probe simulation tests
  - Performance benchmarks for metrics collection overhead
  - Thread safety tests for concurrent metric updates
- Update documentation in the Wiki:
  - Health endpoint configuration guide
  - Metrics endpoint configuration guide
  - Custom health indicator tutorial
  - Custom metrics collector tutorial
  - Kubernetes deployment examples with probe configuration
  - Prometheus/Grafana integration guide
  - Troubleshooting common monitoring issues
- Create sample implementations:
  - Example Kubernetes deployment YAML with health probes
  - Example Prometheus scrape configuration
  - Example Grafana dashboard JSON
  - Sample QInstance configuration with health and metrics
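As a possible starting point for the Kubernetes probe simulation tests listed above, the sketch below stands up a throwaway HTTP server with the JDK's built-in `com.sun.net.httpserver.HttpServer`, serves a `/health` route, and issues one GET against it the way an httpGet probe would. It does not use Javalin or any QQQ class. `ProbeSimulationSketch` and `probeOnce` are hypothetical names, and the response body is a simplified stand-in for the proposed health response.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.function.BooleanSupplier;

/** Hypothetical test helper simulating a single HTTP health probe. */
final class ProbeSimulationSketch
{
   /** Serves /health on an ephemeral local port, probes it once, and returns the status code. */
   static int probeOnce(BooleanSupplier healthy)
   {
      try
      {
         HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
         server.createContext("/health", exchange ->
         {
            boolean up = healthy.getAsBoolean();
            byte[] body = ("{\"status\":\"" + (up ? "UP" : "DOWN") + "\"}").getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            // Same mapping as the Javalin route in this proposal: 200 when UP, 503 otherwise.
            exchange.sendResponseHeaders(up ? 200 : 503, body.length);
            try (OutputStream out = exchange.getResponseBody())
            {
               out.write(body);
            }
         });
         server.start();
         try
         {
            URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/health");
            HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
            HttpResponse<String> response = HttpClient.newHttpClient()
               .send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode();
         }
         finally
         {
            server.stop(0);
         }
      }
      catch (IOException e)
      {
         throw new UncheckedIOException(e);
      }
      catch (InterruptedException e)
      {
         Thread.currentThread().interrupt();
         throw new IllegalStateException(e);
      }
   }
}
```

Kubernetes treats an httpGet probe response with a status code of at least 200 and below 400 as success, so a test built on this helper would expect 200 for a healthy instance and 503 (probe failure) for an unhealthy one.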
Thank you for helping improve QQQ! 🚀