[FEATURE] Create Kubernetes Health and Metrics Endpoints #268

@KofTwentyTwo

Feature Request: Kubernetes Health and Metrics Endpoints

🚀 Feature Description

Is your feature request related to a problem? Please describe.

Currently, QQQ applications lack standardized health check and metrics endpoints that are compatible with modern cloud-native infrastructure like Kubernetes, Prometheus, and Grafana. This makes it difficult to:

  • Implement Kubernetes liveness and readiness probes
  • Monitor application health in containerized environments
  • Collect and aggregate metrics across distributed QQQ applications
  • Integrate with observability platforms (Prometheus, Grafana, Datadog, etc.)
  • Follow cloud-native best practices for application monitoring

Describe the solution you'd like:

Add optional middleware modules (qqq-middleware-health and qqq-middleware-metrics) that provide:

  1. Health Endpoint (/health):

    • Kubernetes-compatible health checks (liveness and readiness probes)
    • Configurable health indicators (database connectivity, external service availability, etc.)
    • JSON response format compatible with standard health check specifications
    • Extensible health check registration system
  2. Metrics Endpoint (/metrics):

    • Prometheus-compatible metrics exposition format
    • Standard JVM metrics (memory, threads, GC)
    • Application-level metrics (request counts, response times, error rates)
    • Custom metric registration API for application-specific measurements
    • Optional OpenMetrics format support
  3. Configurability:

    • Customizable endpoint paths (defaults: /health and /metrics)
    • Optional authentication/authorization for endpoints
    • Selective metric collection (enable/disable specific metric groups)
    • Health check component registration API
    • Configurable metric labels and tags

Describe alternatives you've considered:

  1. Manual Implementation: Users could implement their own health/metrics endpoints in each application, but this:

    • Lacks standardization across QQQ applications
    • Requires boilerplate code in every project
    • Doesn't leverage QQQ's meta-data architecture
    • Misses opportunity for framework-level optimizations
  2. Third-Party Libraries: Direct integration of libraries like Micrometer or Spring Boot Actuator, but:

    • Introduces heavy dependencies
    • May not align with QQQ's architectural patterns
    • Doesn't integrate with QQQ's existing middleware abstractions
    • Could create version conflicts with existing dependencies
  3. Middleware-Specific Solutions: Implementing only for Javalin or other specific middleware, but:

    • Creates inconsistency across middleware implementations
    • Doesn't benefit Lambda, PicoCLI, or future middleware options
    • Misses opportunity for shared health/metrics abstractions

💡 Use Case

What is the use case for this feature?

Who would benefit from this feature?

  • DevOps engineers deploying QQQ applications to Kubernetes clusters
  • SRE teams monitoring QQQ application health and performance
  • Enterprise users requiring compliance with observability standards
  • Development teams debugging production issues with metrics data
  • Platform engineers building multi-tenant QQQ hosting infrastructure

What scenarios would this feature be useful in?

  • Kubernetes Deployments: Configure liveness/readiness probes for automatic pod lifecycle management
  • Load Balancer Health Checks: Integrate with AWS ELB, GCP Load Balancer, or Azure Application Gateway
  • Monitoring & Alerting: Scrape metrics into Prometheus and create Grafana dashboards
  • Auto-Scaling: Use metrics to drive Kubernetes HPA (Horizontal Pod Autoscaler) decisions
  • Incident Response: Quickly identify unhealthy components during production issues
  • Capacity Planning: Analyze historical metrics to forecast resource requirements
  • Performance Optimization: Identify bottlenecks through request latency metrics

How would this improve the QQQ framework?

  • Cloud-Native Readiness: Positions QQQ as a first-class citizen in containerized environments
  • Production Maturity: Demonstrates enterprise-grade operational capabilities
  • Ecosystem Integration: Enables QQQ to integrate with standard observability tooling
  • Developer Experience: Reduces boilerplate for production-ready deployments
  • Operational Excellence: Provides visibility into application health without custom code
  • Competitive Positioning: Aligns QQQ with expectations for modern application frameworks

🔧 Implementation Ideas

Do you have any ideas about how this could be implemented?

Technical Approach

Module Structure:

qqq-middleware-health/
├── src/main/java/com/kingsrook/qqq/middleware/health/
│   ├── HealthCheckRegistry.java           # Central registry for health checks
│   ├── HealthCheckResult.java             # Health check result model
│   ├── HealthIndicator.java               # Interface for health indicators
│   ├── model/
│   │   ├── HealthCheckMetaData.java       # MetaData for health endpoint config
│   │   ├── HealthStatus.java              # Enum: UP, DOWN, DEGRADED, UNKNOWN
│   │   └── HealthResponse.java            # JSON response structure
│   ├── indicators/
│   │   ├── DatabaseHealthIndicator.java   # Check database connectivity
│   │   ├── MemoryHealthIndicator.java     # Check JVM memory thresholds
│   │   ├── DiskSpaceHealthIndicator.java  # Check available disk space
│   │   └── CustomHealthIndicator.java     # Base for user-defined checks
│   └── middleware/
│       ├── JavalinHealthRouteProvider.java
│       ├── LambdaHealthHandler.java
│       └── PicoCLIHealthCommand.java

qqq-middleware-metrics/
├── src/main/java/com/kingsrook/qqq/middleware/metrics/
│   ├── MetricsRegistry.java               # Central metrics registry
│   ├── MetricCollector.java               # Interface for metric collectors
│   ├── model/
│   │   ├── MetricsMetaData.java           # MetaData for metrics endpoint config
│   │   ├── Metric.java                    # Individual metric representation
│   │   └── MetricType.java                # Enum: COUNTER, GAUGE, HISTOGRAM, SUMMARY
│   ├── collectors/
│   │   ├── JvmMetricsCollector.java       # JVM memory, threads, GC
│   │   ├── RequestMetricsCollector.java   # HTTP request stats
│   │   ├── ProcessMetricsCollector.java   # QQQ process execution metrics
│   │   └── DatabaseMetricsCollector.java  # Connection pool, query times
│   ├── formatters/
│   │   ├── PrometheusFormatter.java       # Prometheus exposition format
│   │   ├── OpenMetricsFormatter.java      # OpenMetrics format
│   │   └── JsonFormatter.java             # JSON format for other tools
│   └── middleware/
│       ├── JavalinMetricsRouteProvider.java
│       ├── LambdaMetricsHandler.java
│       └── PicoCLIMetricsCommand.java
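As a rough illustration of what a collector such as `JvmMetricsCollector` might gather, the sketch below reads heap, non-heap, thread, and GC figures from the JDK's standard `java.lang.management` beans, with no third-party dependencies. The class name `JvmSnapshot`, the `collect()` method, and the metric names other than `jvm_memory_used_bytes` are assumptions made for this example, not part of the proposal.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: takes a point-in-time snapshot of basic JVM figures.
final class JvmSnapshot
{
   // Returns sample name (including labels) -> current value.
   static Map<String, Double> collect()
   {
      MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
      Map<String, Double> samples = new LinkedHashMap<>();

      samples.put("jvm_memory_used_bytes{area=\"heap\"}", (double) memory.getHeapMemoryUsage().getUsed());
      samples.put("jvm_memory_used_bytes{area=\"nonheap\"}", (double) memory.getNonHeapMemoryUsage().getUsed());

      // Metric name below is an assumption for this sketch.
      samples.put("jvm_threads_live", (double) ManagementFactory.getThreadMXBean().getThreadCount());

      // getCollectionCount() returns -1 when the count is undefined, so clamp at 0.
      long gcCount = 0;
      for(GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans())
      {
         gcCount += Math.max(0, gc.getCollectionCount());
      }
      samples.put("jvm_gc_collections_total", (double) gcCount);

      return samples;
   }
}
```

A real collector would presumably return typed `Metric` objects rather than pre-formatted sample names; the flat map keeps the sketch short.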

Configuration via QInstance MetaData:

QInstance qInstance = QInstance.create()
   .withHealthCheck(new HealthCheckMetaData()
      .withEnabled(true)
      .withEndpointPath("/health")  // Configurable, defaults to /health
      .withIndicators(List.of(
         new DatabaseHealthIndicator(),
         new MemoryHealthIndicator().withThreshold(90), // 90% memory threshold
         new CustomHealthIndicator("external-api", this::checkExternalApi)
      ))
      .withAuthenticationRequired(false))
   .withMetrics(new MetricsMetaData()
      .withEnabled(true)
      .withEndpointPath("/metrics")  // Configurable, defaults to /metrics
      .withFormat(MetricsFormat.PROMETHEUS)
      .withCollectors(List.of(
         new JvmMetricsCollector(),
         new RequestMetricsCollector(),
         new ProcessMetricsCollector()
      ))
      .withAuthenticationRequired(true)  // Optional authentication
      .withLabels(Map.of(
         "application", "my-qqq-app",
         "environment", "production"
      )));
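The configuration above relies on fluent `with...` setters. A minimal sketch of how `HealthCheckMetaData` could carry its defaults is shown below. The field set, the disabled-by-default choice, and the path validation are assumptions for illustration; indicators are left out to keep the example self-contained.

```java
// Hypothetical shape for the health endpoint configuration object.
final class HealthCheckMetaData
{
   private boolean enabled                = false;     // assumed: opt-in, since the module is optional
   private String  endpointPath           = "/health"; // default path named in this proposal
   private boolean authenticationRequired = false;

   HealthCheckMetaData withEnabled(boolean enabled)
   {
      this.enabled = enabled;
      return this;
   }

   HealthCheckMetaData withEndpointPath(String endpointPath)
   {
      // Assumed validation: reject paths that could not be registered as a route.
      if(endpointPath == null || !endpointPath.startsWith("/"))
      {
         throw new IllegalArgumentException("endpointPath must start with '/'");
      }
      this.endpointPath = endpointPath;
      return this;
   }

   HealthCheckMetaData withAuthenticationRequired(boolean authenticationRequired)
   {
      this.authenticationRequired = authenticationRequired;
      return this;
   }

   boolean getEnabled()                { return enabled; }
   String  getEndpointPath()           { return endpointPath; }
   boolean getAuthenticationRequired() { return authenticationRequired; }
}
```

With defaults held in the object, an application that only calls `withEnabled(true)` would still get `/health` without further configuration.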

Health Check API Integration:

// In middleware (e.g., Javalin)
public class JavalinHealthRouteProvider implements RouteProviderInterface
{
   @Override
   public void defineRoutes(Javalin app, QInstance qInstance)
   {
      HealthCheckMetaData healthConfig = qInstance.getHealthCheck();
      if (healthConfig != null && healthConfig.getEnabled())
      {
         String path = healthConfig.getEndpointPath();
         app.get(path, ctx -> {
            HealthResponse response = HealthCheckRegistry.check(qInstance);
            ctx.json(response);
            ctx.status(response.getStatus() == HealthStatus.UP ? 200 : 503);
         });
      }
   }
}

Prometheus Metrics Format:

# HELP qqq_http_requests_total Total HTTP requests processed
# TYPE qqq_http_requests_total counter
qqq_http_requests_total{method="GET",path="/api/order",status="200"} 1547.0
qqq_http_requests_total{method="POST",path="/api/order",status="201"} 328.0

# HELP qqq_http_request_duration_seconds HTTP request duration
# TYPE qqq_http_request_duration_seconds histogram
qqq_http_request_duration_seconds_bucket{le="0.1"} 1234.0
qqq_http_request_duration_seconds_bucket{le="0.5"} 1850.0
qqq_http_request_duration_seconds_bucket{le="+Inf"} 1875.0
qqq_http_request_duration_seconds_sum 456.78
qqq_http_request_duration_seconds_count 1875.0

# HELP jvm_memory_used_bytes Memory used by JVM
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap"} 536870912.0
jvm_memory_used_bytes{area="nonheap"} 134217728.0
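To show how lines like those above could be produced without a metrics library, here is a small sketch of the rendering step a `PrometheusFormatter` might perform. The class and method names (`PrometheusLines`, `sample`, `header`, `escapeLabelValue`) are hypothetical. Labels are sorted by name only to make the output deterministic.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper that renders Prometheus text exposition lines.
final class PrometheusLines
{
   // Label values must escape backslash, double quote, and newline.
   static String escapeLabelValue(String value)
   {
      return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
   }

   // Renders one sample line: name{k1="v1",k2="v2"} value
   static String sample(String name, Map<String, String> labels, double value)
   {
      String labelPart = new TreeMap<>(labels).entrySet().stream()
         .map(e -> e.getKey() + "=\"" + escapeLabelValue(e.getValue()) + "\"")
         .collect(Collectors.joining(","));
      String prefix = labelPart.isEmpty() ? name : name + "{" + labelPart + "}";
      return prefix + " " + value;
   }

   // Renders the HELP and TYPE lines; help text is assumed to contain no backslashes or newlines.
   static String header(String name, String type, String help)
   {
      return "# HELP " + name + " " + help + "\n# TYPE " + name + " " + type + "\n";
   }
}
```

For example, `PrometheusLines.sample("qqq_http_requests_total", Map.of("method", "GET", "path", "/api/order", "status", "200"), 1547.0)` yields the first counter line shown above. Java's default `double` formatting switches to scientific notation for large values, which the text format should still accept as a float, though a real formatter may prefer explicit formatting.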

Health Check Response Format:

{
  "status": "UP",
  "timestamp": "2025-11-26T10:30:00Z",
  "checks": {
    "database": {
      "status": "UP",
      "duration": 23,
      "details": {
        "type": "PostgreSQL",
        "version": "15.3",
        "connectionPool": {
          "active": 5,
          "idle": 10,
          "max": 20
        }
      }
    },
    "memory": {
      "status": "UP",
      "duration": 2,
      "details": {
        "used": 512,
        "max": 2048,
        "percentage": 25
      }
    },
    "externalApi": {
      "status": "DEGRADED",
      "duration": 1523,
      "details": {
        "message": "Response time exceeds threshold",
        "threshold": 1000,
        "actual": 1523
      }
    }
  }
}
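In the sample response, one check is DEGRADED while the overall status remains UP. One aggregation rule consistent with that is "overall is DOWN only if some check is DOWN". The sketch below implements that rule. The rule itself, the single-method `HealthIndicator` shape, and the `HealthAggregator` name are assumptions; the proposal does not specify how `HealthCheckRegistry.check` combines results.

```java
import java.util.LinkedHashMap;
import java.util.Map;

enum HealthStatus { UP, DOWN, DEGRADED, UNKNOWN }

// Hypothetical indicator contract: run a check and report a status.
interface HealthIndicator
{
   HealthStatus check() throws Exception;
}

// Overall status plus the status reported by each named check.
final class HealthResponse
{
   final HealthStatus              status;
   final Map<String, HealthStatus> checks;

   HealthResponse(HealthStatus status, Map<String, HealthStatus> checks)
   {
      this.status = status;
      this.checks = checks;
   }
}

final class HealthAggregator
{
   // Runs every indicator. An indicator that throws is recorded as DOWN
   // instead of failing the whole endpoint.
   static HealthResponse check(Map<String, HealthIndicator> indicators)
   {
      Map<String, HealthStatus> checks  = new LinkedHashMap<>();
      HealthStatus              overall = HealthStatus.UP;

      for(Map.Entry<String, HealthIndicator> entry : indicators.entrySet())
      {
         HealthStatus status;
         try
         {
            status = entry.getValue().check();
         }
         catch(Exception e)
         {
            status = HealthStatus.DOWN;
         }

         checks.put(entry.getKey(), status);
         if(status == HealthStatus.DOWN)
         {
            overall = HealthStatus.DOWN;
         }
      }

      return new HealthResponse(overall, checks);
   }
}
```

Under this rule a slow external API would not cause Kubernetes to restart or unroute the pod, since the endpoint would still answer 200. Whether DEGRADED should instead fail a readiness probe is a design decision this issue leaves open.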

Similar Features in Other Frameworks

  • Spring Boot Actuator: Provides /actuator/health and /actuator/metrics with similar functionality
  • Micronaut Health: Built-in health endpoint with extensible indicators
  • Quarkus SmallRye Health: MicroProfile Health implementation for Kubernetes
  • Dropwizard Metrics: Comprehensive metrics library with Prometheus integration
  • Micrometer: Vendor-neutral metrics facade supporting multiple monitoring systems

Constraints and Considerations

  • Minimal Dependencies: Avoid heavy frameworks; implement core functionality in pure Java
  • Performance Overhead: Metrics collection should have negligible performance impact (<1% CPU)
  • Security: Health/metrics endpoints should support optional authentication (QQQ security keys)
  • Backward Compatibility: Fully optional modules that don't affect existing applications
  • Middleware Agnostic: Core abstractions should work across Javalin, Lambda, PicoCLI
  • Thread Safety: Metrics collection must be thread-safe for concurrent requests
  • Memory Footprint: Keep metric storage efficient (circular buffers, sampling strategies)
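For the thread-safety constraint, one low-overhead option in pure Java is a `ConcurrentHashMap` of `LongAdder` instances, which avoids locking on the hot path. The following is a sketch of that technique under assumed names (`CounterStore`, `increment`, `get`), not a proposed API.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical counter storage that is safe for concurrent request threads.
final class CounterStore
{
   private final ConcurrentHashMap<String, LongAdder> counters = new ConcurrentHashMap<>();

   // Creates the counter on first use, then increments it without locking.
   void increment(String key)
   {
      counters.computeIfAbsent(key, k -> new LongAdder()).increment();
   }

   // Returns the current total, or 0 for a key that was never incremented.
   long get(String key)
   {
      LongAdder adder = counters.get(key);
      return adder == null ? 0L : adder.sum();
   }
}
```

`LongAdder` trades a slightly more expensive read (`sum()`) for cheaper contended writes, which fits a workload where increments happen on every request and reads happen only when the endpoint is scraped.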

📊 Impact Assessment

What is the impact of this feature?

Scope:

  • Size: Medium to large - two new modules with comprehensive functionality
  • Complexity: Medium - requires understanding of Prometheus format, Kubernetes probes, and middleware integration
  • Integration: Touches multiple middleware modules but remains fully optional
  • Documentation: Requires wiki pages, code examples, and deployment guides

Users:

  • Immediate Benefit: Organizations deploying QQQ to Kubernetes/cloud environments
  • Medium-Term Benefit: All users seeking production-grade observability
  • Long-Term Benefit: Entire QQQ ecosystem as monitoring becomes standard practice
  • Estimated Adoption: 40-60% of enterprise QQQ deployments within 12 months

Complexity:

  • Implementation Complexity: Medium
    • Core abstractions: 3-5 days
    • Middleware integrations: 2-3 days per middleware
    • Standard collectors/indicators: 3-5 days
    • Testing and documentation: 5-7 days
    • Total: 3-4 weeks for complete implementation
  • API Design: Must follow QQQ's MetaData patterns and fluent-style conventions
  • Testing: Requires integration tests with actual Prometheus/K8s environments

Maintenance:

  • Ongoing Maintenance: Low to medium
    • Prometheus format is stable (changes rare)
    • Kubernetes health check spec is mature
    • Primary maintenance: adding new built-in collectors/indicators
    • Security updates for authentication mechanisms
  • Community Contributions: Likely source of new collector implementations
  • Version Compatibility: Must maintain compatibility with multiple K8s versions

🎯 Next Steps

If this feature is accepted:

  • Create detailed design document covering:

    • API surface area and MetaData structures
    • Health check indicator interface and built-in implementations
    • Metrics collector interface and standard collectors
    • Middleware integration points (Javalin, Lambda, PicoCLI)
    • Authentication/authorization integration
    • Configuration examples for common scenarios
  • Implement feature following Feature Development Guide:

    • Create qqq-middleware-health module structure
    • Create qqq-middleware-metrics module structure
    • Implement core abstractions (HealthIndicator, MetricCollector)
    • Implement standard indicators (Database, Memory, DiskSpace)
    • Implement standard collectors (JVM, Request, Process)
    • Implement Prometheus formatter
    • Integrate with Javalin middleware
    • Integrate with Lambda middleware
    • Integrate with PicoCLI middleware
    • Add optional authentication support
  • Add tests following Testing Guide:

    • Unit tests for all collectors and indicators (>70% instruction coverage)
    • Integration tests with mock Prometheus scraper
    • Kubernetes probe simulation tests
    • Performance benchmarks for metrics collection overhead
    • Thread safety tests for concurrent metric updates
  • Update documentation in the Wiki:

    • Health endpoint configuration guide
    • Metrics endpoint configuration guide
    • Custom health indicator tutorial
    • Custom metrics collector tutorial
    • Kubernetes deployment examples with probe configuration
    • Prometheus/Grafana integration guide
    • Troubleshooting common monitoring issues
  • Create sample implementations:

    • Example Kubernetes deployment YAML with health probes
    • Example Prometheus scrape configuration
    • Example Grafana dashboard JSON
    • Sample QInstance configuration with health and metrics

Thank you for helping improve QQQ! 🚀

Labels: enhancement