Observability

Three streams

| Stream | Source | Use for |
| --- | --- | --- |
| PostHog audit log | AuditLogMiddleware + AuditLoggerService | Request-level traffic, errors, role-based access patterns |
| CloudWatch / stdout logs | NestJS logger | Stack traces, application-level details |
| Stripe events | stripe_events table | Billing lifecycle |

PostHog queries

Event name: api_request. Properties: requestId, userId, userRole, orgId, method, path, statusCode, result, errorCode, duration_ms.
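As a rough illustration of the payload described above, the sketch below models the api_request event as a TypeScript type and builds one from request and response facts. The property names come from the list above; the types, the "success" value for result, the rule that status codes of 400 and up count as failures, the "anonymous" fallback, and the helper name buildApiRequestEvent are all assumptions, not the actual AuditLoggerService code.

```typescript
// Hypothetical shape of the api_request event. Property names are taken from
// the documented list; the types are assumptions.
interface ApiRequestProperties {
  requestId: string;
  userId: string | null;
  userRole: string | null;
  orgId: string | null;
  method: string;
  path: string;
  statusCode: number;
  result: "success" | "failure"; // "failure" appears in the docs; "success" is assumed
  errorCode: string | null;
  duration_ms: number;
}

interface ApiRequestEvent {
  event: "api_request";
  distinctId: string;
  properties: ApiRequestProperties;
}

// Illustrative helper (not the real service): assemble the event payload.
function buildApiRequestEvent(
  req: {
    requestId: string;
    userId?: string;
    userRole?: string;
    orgId?: string;
    method: string;
    path: string;
  },
  res: { statusCode: number; errorCode?: string },
  startedAtMs: number,
  finishedAtMs: number,
): ApiRequestEvent {
  return {
    event: "api_request",
    // Assumption: unauthenticated requests are grouped under "anonymous".
    distinctId: req.userId ?? "anonymous",
    properties: {
      requestId: req.requestId,
      userId: req.userId ?? null,
      userRole: req.userRole ?? null,
      orgId: req.orgId ?? null,
      method: req.method,
      path: req.path,
      statusCode: res.statusCode,
      // Assumption: any 4xx or 5xx status is recorded as a failure.
      result: res.statusCode >= 400 ? "failure" : "success",
      errorCode: res.errorCode ?? null,
      duration_ms: finishedAtMs - startedAtMs,
    },
  };
}
```

Keeping the payload in one typed builder makes it harder for a property to be renamed in code without the saved queries noticing.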

Useful queries:

| Question | Query |
| --- | --- |
| Top error codes last 24h | event = api_request AND result = 'failure' BREAKDOWN BY errorCode |
| Slowest endpoints | event = api_request, avg(duration_ms) BREAKDOWN BY path |
| Per-user activity | event = api_request BREAKDOWN BY userId |
| Admin actions | event = api_request, userRole = 'platform_admin' BREAKDOWN BY path |
| Quota exceeded | event = api_request, errorCode = 'QUOTA_EXCEEDED' BREAKDOWN BY path |

Build dashboards once, reuse forever.
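To make the shorthand in the query table concrete, here is a minimal sketch of what two of the breakdowns compute, written as plain TypeScript over an in-memory list of events (for example, an export). It does not call PostHog; the ApiRequestRow type and both function names are hypothetical.

```typescript
// Hypothetical subset of api_request properties needed for these breakdowns.
interface ApiRequestRow {
  result: string;
  errorCode: string | null;
  path: string;
  duration_ms: number;
}

// "Top error codes": count failed requests per errorCode, most frequent first.
function topErrorCodes(rows: ApiRequestRow[]): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const row of rows) {
    if (row.result !== "failure" || row.errorCode === null) continue;
    counts.set(row.errorCode, (counts.get(row.errorCode) ?? 0) + 1);
  }
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}

// "Slowest endpoints": average duration_ms per path, slowest first.
function slowestEndpoints(rows: ApiRequestRow[]): Array<[string, number]> {
  const totals = new Map<string, { sum: number; n: number }>();
  for (const row of rows) {
    const t = totals.get(row.path) ?? { sum: 0, n: 0 };
    t.sum += row.duration_ms;
    t.n += 1;
    totals.set(row.path, t);
  }
  return Array.from(totals.entries())
    .map(([path, t]): [string, number] => [path, t.sum / t.n])
    .sort((a, b) => b[1] - a[1]);
}
```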

CloudWatch / stdout

NestJS's default logger prints structured lines:

[Nest] 12345 - 2026-05-14T... [HTTP] GET /agents 200 +87ms (req-uuid alice@example.com)

requestId is included so you can grep CloudWatch with the same UUID you saw in PostHog.
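The correlation step can be sketched as a small log helper. The regular expression below assumes the HTTP portion of each line looks exactly like the sample above (method, path, status, +duration, then the requestId and a user identifier in parentheses); if the real format differs, the pattern needs adjusting. The names parseHttpLogLine and grepByRequestId are hypothetical.

```typescript
// Fields recovered from the "[HTTP] ..." part of a log line.
interface HttpLogLine {
  method: string;
  path: string;
  statusCode: number;
  durationMs: number;
  requestId: string;
  user: string;
}

// Assumption: the layout matches the sample line shown in this section.
const HTTP_LINE = /\[HTTP\] (\S+) (\S+) (\d{3}) \+(\d+)ms \((\S+) (\S+)\)/;

// Parse one log line; returns null for lines that are not HTTP access lines.
function parseHttpLogLine(line: string): HttpLogLine | null {
  const m = HTTP_LINE.exec(line);
  if (!m) return null;
  return {
    method: m[1],
    path: m[2],
    statusCode: Number(m[3]),
    durationMs: Number(m[4]),
    requestId: m[5],
    user: m[6],
  };
}

// Equivalent of grepping for the UUID: keep every line that mentions it,
// including stack-trace lines that are not HTTP access lines.
function grepByRequestId(lines: string[], requestId: string): string[] {
  return lines.filter((line) => line.includes(requestId));
}
```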

Stripe event log

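-- Latest 50 Stripe webhook events, newest first.
-- A non-NULL error_message means processing failed.
SELECT type, received_at, processed_at, error_message
FROM stripe_events
ORDER BY received_at DESC
LIMIT 50;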

Events with error_message IS NOT NULL are stuck — see Troubleshooting → Stripe.

Tracing

Tracing is not currently wired up. The requestId UUID serves as a poor man's trace. If you need real distributed tracing:

  • Add @opentelemetry/api + auto-instrumentation.
  • Configure an exporter (Datadog, Honeycomb, AWS X-Ray).
  • Propagate traceparent headers between marketplace and agent backends.
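For orientation, the traceparent header mentioned in the last step follows the W3C Trace Context layout: version, trace id, parent id, and flags. The sketch below formats and parses that header by hand, purely to show what gets propagated. In practice OpenTelemetry's propagators would do this for you, and the helper names here are hypothetical.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters
  sampled: boolean;
}

// Illustration only: Math.random is not a suitable id source for production.
function randomHex(bytes: number): string {
  let out = "";
  for (let i = 0; i < bytes; i++) {
    out += Math.floor(Math.random() * 256).toString(16).padStart(2, "0");
  }
  return out;
}

// Start a new trace at the edge service.
function newTraceContext(sampled = true): TraceContext {
  return { traceId: randomHex(16), parentId: randomHex(8), sampled };
}

// Version "00" header: 00-<trace-id>-<parent-id>-<flags>.
function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.parentId}-${ctx.sampled ? "01" : "00"}`;
}

const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Parse an incoming header; returns null if it is malformed or all-zero.
function parseTraceparent(header: string): TraceContext | null {
  const m = TRACEPARENT.exec(header.trim());
  if (!m) return null;
  if (/^0+$/.test(m[1]) || /^0+$/.test(m[2])) return null;
  return { traceId: m[1], parentId: m[2], sampled: (parseInt(m[3], 16) & 1) === 1 };
}

// Outgoing call to another backend: same trace id, fresh parent (span) id.
function childContext(parent: TraceContext): TraceContext {
  return { traceId: parent.traceId, parentId: randomHex(8), sampled: parent.sampled };
}
```

The trace id stays constant across services while each hop gets its own parent id, which is what lets a tracing backend stitch the calls into one tree.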

Not yet a priority: a single service plus PostHog covers ~90% of debugging.

Alerts

Recommended PostHog alerts:

  • 5xx rate > 1% over 5 min.
  • 4xx rate > 5% over 5 min (catches deploy regressions).
  • QUOTA_EXCEEDED rate spike (signals a popular agent's quotas are too tight).
  • Stripe webhook failures (correlate to stripe_events table).
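The first two thresholds can be expressed as a simple windowed rate check. This is a minimal sketch of the arithmetic only, not PostHog's alerting API; the RequestSample shape and the evaluateWindow name are hypothetical, and an empty window is treated as a rate of zero by assumption.

```typescript
interface RequestSample {
  timestampMs: number;
  statusCode: number;
}

interface AlertResult {
  rate5xx: number;
  rate4xx: number;
  alert5xx: boolean;
  alert4xx: boolean;
}

const WINDOW_MS = 5 * 60 * 1000; // 5-minute window, as in the list above

// Compute 4xx and 5xx rates over the last 5 minutes and compare them to the
// recommended thresholds (1% for 5xx, 5% for 4xx).
function evaluateWindow(samples: RequestSample[], nowMs: number): AlertResult {
  const recent = samples.filter(
    (s) => s.timestampMs > nowMs - WINDOW_MS && s.timestampMs <= nowMs,
  );
  const total = recent.length;
  const count5xx = recent.filter((s) => s.statusCode >= 500 && s.statusCode <= 599).length;
  const count4xx = recent.filter((s) => s.statusCode >= 400 && s.statusCode <= 499).length;
  // Assumption: no traffic in the window means no alert.
  const rate5xx = total === 0 ? 0 : count5xx / total;
  const rate4xx = total === 0 ? 0 : count4xx / total;
  return {
    rate5xx,
    rate4xx,
    alert5xx: rate5xx > 0.01,
    alert4xx: rate4xx > 0.05,
  };
}
```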

Performance

duration_ms is reported on every api_request event. Thresholds worth watching:

  • p95 > 1000 ms is a problem.
  • Anything > 5000 ms is almost certainly a leaked DB connection or an unbounded query.
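To check these thresholds against a batch of duration_ms values, a nearest-rank percentile is enough. The sketch below is one possible way to compute it; the percentile and classifyDurations names are hypothetical, and PostHog's own percentile aggregation may interpolate differently.

```typescript
// Nearest-rank percentile: the smallest value such that at least p percent
// of the samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

interface DurationReport {
  p95: number;
  p95TooSlow: boolean; // p95 above 1000 ms
  suspects: number[]; // individual requests above 5000 ms
}

// Apply the two thresholds from the list above to a batch of durations.
function classifyDurations(durationsMs: number[]): DurationReport {
  const p95 = percentile(durationsMs, 95);
  return {
    p95,
    p95TooSlow: p95 > 1000,
    suspects: durationsMs.filter((d) => d > 5000),
  };
}
```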

Drizzle Studio + EXPLAIN ANALYZE are your friends.