Observability
Three streams
| Stream | Source | Use for |
|---|---|---|
| PostHog audit log | AuditLogMiddleware + AuditLoggerService | Request-level traffic, errors, role-based access patterns |
| CloudWatch / stdout logs | NestJS logger | Stack traces, application-level details |
| Stripe events | stripe_events table | Billing lifecycle |
PostHog queries
Event name: api_request. Properties: requestId, userId, userRole, orgId, method, path, statusCode, result, errorCode, duration_ms.
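To make the event shape concrete, here is a minimal TypeScript sketch of one api_request payload. The property names come from the list above. The value types, the "success" variant of result, the rule that a status of 400 or higher counts as a failure, and the RequestSummary input and buildApiRequestEvent helper are all assumptions for illustration, not the actual AuditLoggerService code.

```typescript
// Property names follow the documented list; the value types are assumed.
type ApiRequestResult = "success" | "failure"; // "success" is an assumed counterpart to 'failure'

interface ApiRequestProperties {
  requestId: string;
  userId: string | null;
  userRole: string | null;
  orgId: string | null;
  method: string;
  path: string;
  statusCode: number;
  result: ApiRequestResult;
  errorCode: string | null;
  duration_ms: number;
}

interface ApiRequestEvent {
  event: "api_request";
  properties: ApiRequestProperties;
}

// Hypothetical input: what a middleware might know once the response is sent.
interface RequestSummary {
  requestId: string;
  userId?: string;
  userRole?: string;
  orgId?: string;
  method: string;
  path: string;
  statusCode: number;
  errorCode?: string;
  startedAtMs: number;
  finishedAtMs: number;
}

function buildApiRequestEvent(req: RequestSummary): ApiRequestEvent {
  return {
    event: "api_request",
    properties: {
      requestId: req.requestId,
      userId: req.userId ?? null,
      userRole: req.userRole ?? null,
      orgId: req.orgId ?? null,
      method: req.method,
      path: req.path,
      statusCode: req.statusCode,
      // Assumption: any 4xx/5xx status is recorded as a failure.
      result: req.statusCode >= 400 ? "failure" : "success",
      errorCode: req.errorCode ?? null,
      duration_ms: Math.max(0, req.finishedAtMs - req.startedAtMs),
    },
  };
}
```

Keeping every property present (with null for unknowns) tends to make breakdowns simpler, since no query has to distinguish a missing property from an empty one.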
Useful queries:
| Question | Query |
|---|---|
| Top error codes last 24h | event = api_request AND result = 'failure' BREAKDOWN BY errorCode |
| Slowest endpoints | event = api_request, avg(duration_ms) BREAKDOWN BY path |
| Per-user activity | event = api_request BREAKDOWN BY userId |
| Admin actions | event = api_request, userRole = 'platform_admin' BREAKDOWN BY path |
| Quota exceeded | event = api_request, errorCode = 'QUOTA_EXCEEDED' BREAKDOWN BY path |
Build dashboards once, reuse forever.
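For readers unfamiliar with breakdowns, the "Slowest endpoints" row groups events by path and averages duration_ms within each group. The sketch below reproduces that aggregation locally over exported events. It illustrates the semantics only and is not how PostHog evaluates the query; the slowestEndpoints helper and its types are hypothetical.

```typescript
// Only the two properties the aggregation needs.
interface ApiRequestSample {
  path: string;
  duration_ms: number;
}

interface EndpointLatency {
  path: string;
  avgMs: number;
  count: number;
}

// Group by path, average duration_ms per group, slowest first.
function slowestEndpoints(events: ApiRequestSample[], limit = 5): EndpointLatency[] {
  const totals = new Map<string, { sum: number; count: number }>();
  for (const e of events) {
    const t = totals.get(e.path) ?? { sum: 0, count: 0 };
    t.sum += e.duration_ms;
    t.count += 1;
    totals.set(e.path, t);
  }
  return Array.from(totals, ([path, t]) => ({
    path,
    avgMs: t.sum / t.count,
    count: t.count,
  }))
    .sort((a, b) => b.avgMs - a.avgMs)
    .slice(0, limit);
}
```

Returning the count alongside the average helps avoid chasing an endpoint whose high average comes from a single request.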
CloudWatch / stdout
NestJS's default logger prints structured lines:
[Nest] 12345 - 2026-05-14T... [HTTP] GET /agents 200 +87ms (req-uuid alice@example.com)
requestId is included so you can grep CloudWatch with the same UUID you saw in PostHog.
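As a sketch of that correlation step, the helper below pulls the fields out of an [HTTP] line and filters exported log lines by requestId. The regular expression is an assumption that matches only the sample line shown above; adjust it if the real format differs. parseHttpLogLine and findByRequestId are hypothetical names.

```typescript
interface HttpLogEntry {
  method: string;
  path: string;
  statusCode: number;
  durationMs: number;
  requestId: string;
  user: string;
}

// Matches "[HTTP] GET /agents 200 +87ms (req-uuid alice@example.com)".
const HTTP_LINE = /\[HTTP\] (\S+) (\S+) (\d{3}) \+(\d+)ms \((\S+) (\S+)\)/;

function parseHttpLogLine(line: string): HttpLogEntry | null {
  const m = HTTP_LINE.exec(line);
  if (m === null) return null; // not an HTTP access line
  return {
    method: String(m[1]),
    path: String(m[2]),
    statusCode: Number(m[3]),
    durationMs: Number(m[4]),
    requestId: String(m[5]),
    user: String(m[6]),
  };
}

// Keep only the access lines that belong to one request.
function findByRequestId(lines: string[], requestId: string): HttpLogEntry[] {
  const hits: HttpLogEntry[] = [];
  for (const line of lines) {
    const entry = parseHttpLogLine(line);
    if (entry !== null && entry.requestId === requestId) hits.push(entry);
  }
  return hits;
}
```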
Stripe event log
SELECT type, processed_at, error_message
FROM stripe_events
ORDER BY received_at DESC
LIMIT 50;
Events with error_message IS NOT NULL are stuck — see Troubleshooting → Stripe.
Tracing
Not currently wired. The requestId UUID is a poor man's trace. If you need real distributed tracing:
- Add @opentelemetry/api + auto-instrumentation.
- Configure an exporter (Datadog, Honeycomb, AWS X-Ray).
- Propagate traceparent headers between marketplace and agent backends.
Not yet a priority — single-service plus PostHog covers ~90% of debugging.
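If tracing is added later, the propagation step amounts to forwarding a W3C Trace Context traceparent header (version, trace id, parent span id, flags) on outgoing calls. OpenTelemetry's propagators normally do this for you; the hand-rolled sketch below only shows what travels on the wire. The helper names are hypothetical, and the caller is assumed to supply a valid 16-hex-character span id.

```typescript
interface TraceContext {
  traceId: string;  // 32 lowercase hex chars
  parentId: string; // 16 lowercase hex chars
  sampled: boolean;
}

// Version 00 layout: 00-<trace-id>-<parent-id>-<flags>
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.parentId}-${ctx.sampled ? "01" : "00"}`;
}

function parseTraceparent(header: string): TraceContext | null {
  const m = TRACEPARENT.exec(header.trim());
  if (m === null) return null;
  const traceId = String(m[1]);
  const parentId = String(m[2]);
  // All-zero ids are invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  const flags = parseInt(String(m[3]), 16);
  return { traceId, parentId, sampled: (flags & 1) === 1 };
}

// Headers for an outgoing call: same trace id, current span as the new parent.
function childHeaders(incoming: string | undefined, spanId: string): Record<string, string> {
  const parent = incoming === undefined ? null : parseTraceparent(incoming);
  if (parent === null) return {}; // nothing valid to propagate
  return {
    traceparent: formatTraceparent({
      traceId: parent.traceId,
      parentId: spanId,
      sampled: parent.sampled,
    }),
  };
}
```

The trace id stays constant across services while the parent id changes at every hop, which is what lets an exporter stitch the marketplace and agent-backend spans into one trace.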
Alerts
Recommended PostHog alerts:
- 5xx rate > 1% over 5 min.
- 4xx rate > 5% over 5 min (catches deploy regressions).
- QUOTA_EXCEEDED rate spike (signals a popular agent's quotas are too tight).
- Stripe webhook failures (correlate to stripe_events table).
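The two rate thresholds above can be sketched as a plain function over the status codes seen in one 5-minute window. This is an illustration of the arithmetic rather than PostHog alert configuration; evaluateErrorRates and its types are hypothetical, and the caller is assumed to have already selected the window.

```typescript
interface StatusSample {
  statusCode: number;
}

interface ErrorRateCheck {
  rate5xx: number;
  rate4xx: number;
  alert5xx: boolean; // 5xx rate > 1%
  alert4xx: boolean; // 4xx rate > 5%
}

function evaluateErrorRates(windowSamples: StatusSample[]): ErrorRateCheck {
  const total = windowSamples.length;
  if (total === 0) {
    // No traffic: report zero rates rather than dividing by zero.
    return { rate5xx: 0, rate4xx: 0, alert5xx: false, alert4xx: false };
  }
  let count4xx = 0;
  let count5xx = 0;
  for (const s of windowSamples) {
    if (s.statusCode >= 500 && s.statusCode < 600) count5xx += 1;
    else if (s.statusCode >= 400 && s.statusCode < 500) count4xx += 1;
  }
  const rate5xx = count5xx / total;
  const rate4xx = count4xx / total;
  return {
    rate5xx,
    rate4xx,
    alert5xx: rate5xx > 0.01,
    alert4xx: rate4xx > 0.05,
  };
}
```

On low-traffic windows a single error can cross either threshold, so a minimum request count before alerting may be worth adding.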
Performance
duration_ms is reported. Outliers:
- p95 > 1000 ms is a problem.
- Anything > 5000ms is almost certainly a leaked DB connection or unbounded query.
Drizzle Studio + EXPLAIN ANALYZE are your friends.
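To check these thresholds against a batch of exported duration_ms values, a percentile can be computed locally. The sketch below uses the nearest-rank method, which is one of several percentile definitions and may differ slightly from what PostHog reports; percentile and flagLatency are hypothetical helpers.

```typescript
// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

interface LatencyReport {
  p95: number;
  p95TooHigh: boolean;   // p95 > 1000 ms
  suspectCount: number;  // requests slower than 5000 ms
}

function flagLatency(durationsMs: number[]): LatencyReport {
  const p95 = percentile(durationsMs, 95);
  let suspectCount = 0;
  for (const d of durationsMs) {
    if (d > 5000) suspectCount += 1;
  }
  return { p95, p95TooHigh: p95 > 1000, suspectCount };
}
```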