feat(otel): instrument runtime with GenAI semantic conventions #2620
Open
tdabasinskas wants to merge 12 commits into docker:main from
fa4a01d to 2a69313
Member
@tdabasinskas not sure why, GitHub doesn't want to merge this one, because of hypothetical merge conflicts. Could you rebase?
- `pkg/telemetry/genai/` provides the GenAI semantic-conventions surface: span helpers (`ChatSpan`, `EmbeddingSpan`, `FallbackSpan`, `SandboxSpan`, runtime helpers), attribute / operation-name / provider-name constants per the OTel GenAI semconv, conversation-id baggage round-trippers, error classification, content-capture gating (`OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT`), stability gating (`OTEL_SEMCONV_STABILITY_OPT_IN`), `gen_ai.client.token.usage` and operation-duration histograms, the `gen_ai.evaluation.result` log emitter, and process-boundary helpers (`InjectSandboxEnv`, `InjectTraceContextEnv`)
- `pkg/telemetry/mcp/` provides MCP-specific telemetry: `ConversationIDFromBaggage`, span starters for client / server, `params._meta` propagation carrier, attribute constants, and metrics
- Test files cover content gating, stability defaults, conversation propagation, and span lifecycle invariants
- `cmd/root/otel.go`: stand up `TracerProvider` / `MeterProvider` / `LoggerProvider` from a single `initOTelSDK` entry, configure OTLP/HTTP exporters with explicit-scheme endpoint normalization, set the global W3C trace-context + baggage propagator unconditionally, flush providers in dependency order, attach `service.*` / `host.*` / `process.*` / `os.type` / `host.arch` resource attributes, and use `AlwaysSample` so local agent sessions are not dropped by an upstream sampling decision
- `pkg/httpclient/client.go`: add a `WrapWithOTel` round-tripper gated on a single `atomic.Bool` flipped by `initOTelSDK` (avoids the prior mismatch between `--otel` and the otelhttp wrap), plus `TracedDefaultClient` / `TracedClient` helpers for one-off HTTP calls
- `cmd/root/sandbox.go`: open a host-side `sandbox.exec` span and inject the active W3C trace context as `-e KEY=VALUE` flags so processes inside the container chain onto the host trace
- `cmd/root/new.go`, `cmd/root/otel_test.go`: wire tracer scope and cover the endpoint normalization / localhost detection cases
- `go.mod` / `go.sum`: pull in `go.opentelemetry.io/otel` SDK + OTLP/HTTP exporters
…s and metrics
- `pkg/model/provider/instrument.go`: decorator that wraps any `Provider` with a `chat {model}` CLIENT span (per OTel GenAI semconv), opt-in capture of `gen_ai.input.messages` / `gen_ai.output.messages` / `gen_ai.tool.definitions`, request/response attributes including the Anthropic spec-sum input-token computation (input + cache_read + cache_creation), `gen_ai.client.token.usage` histogram, and `gen_ai.client.operation.duration` histogram. Six wrapper variants preserve the EmbeddingProvider / RerankingProvider capability surfaces so RAG fallbacks round-trip correctly
- `pkg/model/provider/factory.go`, `factory_test.go`: route construction through the decorator
- `pkg/model/provider/anthropic/client.go`, `files.go`: add `anthropic.tokens.count` and `anthropic.files.get_or_upload` spans for the overflow-retry token-counting path and the file-upload cache-or-create path; drop the unnecessary `string(model)` cast
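The Anthropic input-token computation mentioned above (input + cache_read + cache_creation) is plain addition. The sketch below only illustrates that arithmetic; the `usage` struct and its field names are hypothetical, not the provider's actual types.

```go
package main

import "fmt"

// usage mirrors the three input-side counters named in the commit message.
// Field names are illustrative assumptions.
type usage struct {
	InputTokens         int64
	CacheReadTokens     int64
	CacheCreationTokens int64
}

// totalInputTokens returns the sum reported as the input-token count:
// uncached input plus tokens read from and written to the prompt cache.
func totalInputTokens(u usage) int64 {
	return u.InputTokens + u.CacheReadTokens + u.CacheCreationTokens
}

func main() {
	u := usage{InputTokens: 120, CacheReadTokens: 2000, CacheCreationTokens: 300}
	fmt.Println(totalInputTokens(u)) // 2420
}
```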
…n, skills, and background agents
- `pkg/runtime/loop.go`: open `runtime.session` and `runtime.stream` INTERNAL spans seeded with `gen_ai.conversation.id` baggage at session start; mark the session span with `error.type=loop_detected` + `codes.Error` when the loop detector terminates
- `pkg/runtime/fallback.go`, `pkg/runtime/cache.go`: wrap the fallback chain with a `runtime.fallback` span carrying primary/final model, attempts, outcome, cooldown state; record provider-cache hit/backing on the cache span
- `pkg/runtime/agent_delegation.go`: emit `runtime.task_transfer` and `runtime.handoff` spans with `gen_ai.operation.name=invoke_agent` and `gen_ai.agent.name`
- `pkg/runtime/skill_runner.go`: emit `invoke_workflow {skill}` per spec
- `pkg/runtime/toolexec/dispatcher.go`: open `runtime.tool.call` and `runtime.tool.handler` spans with the GenAI execute_tool semconv, capture `gen_ai.tool.call.{arguments,result}` under the content-capture opt-in, and stamp `cagent.approval.{decision,source}` from `notifyApproval` so denied / canceled / read-only-allowed calls are distinguishable in trace dashboards
- `pkg/runtime/compactor/compactor.go`: wrap compaction with a span that carries summary tokens and cost
- `pkg/tools/builtin/agent/agent.go`: open a `background_agent.run` root span with a link back to the spawning context, and stamp `gen_ai.conversation.id` from baggage so the span participates in conversation-scoped queries
- `pkg/tools/startable.go`, `pkg/toolinstall/registry.go`: wrap toolset Start with a `toolset.start` span so capability discovery latency is attributable
…race context
- `pkg/hooks/executor.go`: open a single `hook.{event}` INTERNAL span per Dispatch covering every matched hook, then `annotateHookSpan` stamps the aggregated `Result` so denied / asked / allowed / modified-input / summary-provided cases are distinguishable. Verdict booleans and the structured decision/reason are unconditional; free-text `message` / `additional_context` / `system_message` / `summary` are gated on `OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT`
- `pkg/hooks/handler.go`: append `genai.InjectTraceContextEnv(ctx)` to the hook subprocess env so script-driven hooks that emit OTel spans (or call instrumented CLIs / LLM endpoints) chain onto the parent `hook.{event}` span instead of producing orphaned roots
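The split between unconditional verdict attributes and opt-in free text can be sketched as follows. This is an illustration under assumptions: only a case-insensitive `true` is treated as opting in, and the `hookResult` type and the `hook.decision` / `hook.message` keys are hypothetical stand-ins rather than the attribute names the PR emits.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

const captureEnv = "OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT"

// captureContent reports whether free-text content may be recorded.
// Assumption: only a case-insensitive "true" opts in.
func captureContent(getenv func(string) string) bool {
	return strings.EqualFold(strings.TrimSpace(getenv(captureEnv)), "true")
}

// hookResult is a hypothetical, trimmed-down aggregate of a hook dispatch.
type hookResult struct {
	Decision string // structured verdict, always recorded
	Message  string // free text, recorded only under the opt-in
}

// hookAttrs builds the attribute set for a hook span.
func hookAttrs(r hookResult, capture bool) map[string]string {
	attrs := map[string]string{"hook.decision": r.Decision}
	if capture && r.Message != "" {
		attrs["hook.message"] = r.Message
	}
	return attrs
}

func main() {
	r := hookResult{Decision: "deny", Message: "command touches ~/.ssh"}
	fmt.Println(hookAttrs(r, captureContent(os.Getenv)))
}
```

Passing `getenv` as a parameter keeps the gate testable without mutating the process environment.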
- `pkg/mcp/server.go`: route the MCP HTTP transport through `otelhttp.NewHandler` and `otelmcp.StartServer` so inbound requests carry `traceparent` / `baggage` and emit a SERVER span per call
- `pkg/tools/mcp/session_client.go`: wrap MCP client calls (`tools/list`, `tools/call`, `prompts/list`) with CLIENT spans using the `params._meta` propagation carrier. Iterator wrappers open the span inside the iterator closure (not at call time) so unused iterators do not leak spans, and end on every exit path including early `yield` returns
- `pkg/tools/mcp/oauth.go`, `oauth_helpers.go`, `oauth_login.go`, `oauth_server.go`: wrap interactive OAuth flow and token refresh with `oauth.flow` / `oauth.token.refresh` CLIENT spans, route metadata HTTP calls through `httpclient.TracedClient` / `TracedDefaultClient`, and emit `oauth.step` span events at each network sub-step boundary (`fetch_protected_resource_metadata`, `fetch_authorization_server_metadata`, `dynamic_client_registration`, `request_authorization_code`, `token_exchange`) so a failure can be attributed to a specific stage without descending into HTTP children
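The iterator-wrapper behaviour described for the MCP client (span opened inside the closure, ended on every exit path) can be sketched without any OTel dependency. The `Seq` type below has the same shape as `iter.Seq` from Go 1.23 but is declared locally so the example builds on older toolchains; `traced` and the `start` callback are hypothetical names, with `start` standing in for span creation.

```go
package main

import "fmt"

// Seq has the same shape as iter.Seq; declared locally for portability.
type Seq[T any] func(yield func(T) bool)

// traced wraps seq so that start runs when iteration begins, not when traced
// is called, and the returned end function runs on every exit path.
func traced[T any](seq Seq[T], start func() (end func())) Seq[T] {
	return func(yield func(T) bool) {
		end := start() // stands in for opening the span
		defer end()    // runs on normal completion and on early stop
		seq(yield)
	}
}

func main() {
	var events []string
	start := func() func() {
		events = append(events, "start")
		return func() { events = append(events, "end") }
	}
	tools := Seq[string](func(yield func(string) bool) {
		for _, name := range []string{"read", "write", "exec"} {
			if !yield(name) {
				return
			}
		}
	})

	_ = traced(tools, start) // never iterated: start is never called

	traced(tools, start)(func(name string) bool {
		events = append(events, name)
		return false // consumer stops after the first item
	})
	fmt.Println(events) // [start read end]
}
```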
…nt semconv
- `pkg/a2a/server.go`: wrap the agent-card and JSON-RPC endpoints with `otelhttp.NewHandler` so inbound A2A requests extract `traceparent` / `tracestate` / `baggage` and emit a SERVER span. The outer `agent-a2a` server wrap covers any auxiliary routes
- `pkg/a2a/adapter.go`: in `runDockerAgent`, decorate the active SERVER span with `gen_ai.operation.name=invoke_agent`, `gen_ai.agent.name`, and `cagent.agent.name`. Wires the runtime tracer scope so per-invocation `runtime.session` / `runtime.stream` / `runtime.tool.call` chain onto the inbound A2A span instead of starting fresh trace ids per request
…ints, and add cold-start spans
- `pkg/server/server.go`: wrap the agent-api Echo handler with `otelhttp.NewHandler` so inbound API requests extract `traceparent` / `tracestate` / `baggage` and the runtime spans started downstream chain onto the calling client trace
- `pkg/server/session_manager.go`: wire the runtime tracer scope into per-session runtime construction; open a `session.runtime_init` INTERNAL span on the cold path (team load + runtime construction) so per-request first-use latency is attributable. Cached hits skip the span because they are a pointer load
- `pkg/chatserver/server.go`, `pkg/chatserver/runtime_pool.go`: wrap the chat completions HTTP server with `otelhttp.NewHandler` and propagate the runtime tracer through the per-session pool
- `pkg/teamloader/teamloader.go`: open a `teamloader.load` INTERNAL span around `LoadWithConfig` so the cold-start path (config parse, model alias resolution, OCI agent pulls, toolset starts) becomes attributable
- `pkg/acp/agent.go`: wire the runtime tracer into the ACP entry point so its sub-spans share scope with CLI / API runs
- `pkg/memory/database/sqlite/sqlite.go`: open `memory.{op}` spans on `AddMemory`, `SearchMemories`, etc., with named-return error capture so failures attach to the span via `RecordError`. The search path additionally emits a `retrieval` semconv span for cross-tool dashboards
- `pkg/rag/manager.go`: open `retrieval` (semconv) spans on `Query`, plus `rag.init` / `rag.reindex` / `rag.file_watcher` for lifecycle visibility
- `pkg/sessiontitle/generator.go`: wrap title generation with a `sessiontitle.generate` span; named-return errors fold onto the span on failure
- `pkg/evaluation/judge.go`: emit `gen_ai.evaluation.result` log events from the LLM-as-judge evaluator with score / explanation / error.type, linked to the active span via context for cross-signal join
- `pkg/tools/builtin/shell.go`, `script_shell.go`: stamp `cagent.tool.{shell,script_shell}.{cmd,cwd,timeout_seconds}` on the active `runtime.tool.handler` span. Cmd ships unconditionally because it is the main signal of what the agent did; redact at the OTel collector if commands carry secrets
- `pkg/tools/builtin/filesystem.go`: stamp `cagent.tool.filesystem.{op,path,paths,path_count}` covering all file operations. Paths ship unconditionally for the same incident-response reason
- `pkg/tools/builtin/fetch.go`: stamp `cagent.tool.fetch.{urls,url_count,format}`; each fetched URL still emits its own HTTP CLIENT child span via `httpclient.WrapWithOTel`
- `pkg/tools/builtin/lsp.go`: wrap every tool from `lspTool` so each LSP RPC stamps `cagent.tool.lsp.{tool,read_only}` on the parent span
- `pkg/tools/builtin/lsp_lifecycle.go`: inject `genai.InjectTraceContextEnv(ctx)` into the LSP server spawn env so OTel-aware language servers chain onto the agent trace
- `pkg/tools/builtin/openapi.go`, `pkg/tools/builtin/api.go`: route the user-facing HTTP clients through `httpclient.WrapWithOTel(remote.NewTransport(ctx))` so each API call emits a CLIENT span and propagates `traceparent`
- `pkg/tools/codemode/exec.go`: stamp `cagent.tool.codemode.{script,script_length,tool_call_count}` so a code-mode turn is visible as "ran N lines of JS that called M tools"
… attribute
- Change `tool_call_response` parts to use `result` field instead of `content` to align with OTel GenAI semconv example schema
- Cap `cagent.tool.filesystem.paths` attribute to 32 entries to prevent backends from dropping oversized attributes on multi-hundred-path calls
- Always record `path_count` to preserve total fidelity when paths are truncated
- Fix typo in `ApprovalSourcePermissionRequestHook` constant name (add missing `Allow` suffix)
- Remove `t.Parallel()` from MCP tests that mutate global OTel state
…ttrs
- `pkg/tools/codemode/exec.go`: emit `cagent.tool.codemode.script_hash` (SHA-256) + `script_length` unconditionally so dashboards can correlate identical scripts and spot oversize submissions, but gate the full `cagent.tool.codemode.script` body behind `OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT`. Codemode scripts are kilobyte-scale arbitrary JS that routinely embed auth tokens / pasted user data / inline secrets, so the bundle decision (Option B, ship body unconditionally) was the wrong call for this attribute specifically
- `pkg/tools/builtin/fetch.go`: strip query strings, fragments, and userinfo from `cagent.tool.fetch.urls` so the attribute can ship by default without leaking signed-URL tokens, OAuth codes, or inline credentials. Path stays intact so dashboards still answer "which sites/endpoints did the agent hit?". Unparseable URLs are emitted as `<unparseable>` rather than passed through verbatim

Both span attributes were flagged on the upstream PR review for the same root cause: emitting unbounded user-controlled content as a default-on telemetry attribute creates a PII/secret-exfiltration surface. The other Option B attributes (`shell.cmd`, `filesystem.path`, `script_shell.cmd`) stay unconditional: they are short, do not carry the same query-token / arbitrary-content risk, and remain decision-relevant for incident response
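The URL scrubbing described for `cagent.tool.fetch.urls` maps directly onto `net/url`. A minimal sketch, assuming a hypothetical helper name `sanitizeURL`; the `<unparseable>` placeholder is taken from the commit message.

```go
package main

import (
	"fmt"
	"net/url"
)

// sanitizeURL drops userinfo, query string, and fragment while keeping
// scheme, host, and path. Unparseable input becomes a fixed placeholder.
func sanitizeURL(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return "<unparseable>"
	}
	u.User = nil // inline credentials
	u.RawQuery = "" // signed-URL tokens, OAuth codes
	u.ForceQuery = false
	u.Fragment = ""
	u.RawFragment = ""
	return u.String()
}

func main() {
	fmt.Println(sanitizeURL("https://user:pw@example.com/a/b?sig=abc#frag"))
	fmt.Println(sanitizeURL("http://[::1"))
}
```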
2a69313 to 9b08feb
Contributor Author
Done!
Adds end-to-end OpenTelemetry instrumentation following the GenAI semantic conventions:
- chat / embeddings / rerank CLIENT spans with `gen_ai.*` attributes and the `gen_ai.client.token.usage` / `operation.duration` histograms.
- … (`runtime.session`, `runtime.stream`, `runtime.fallback`, `runtime.tool.call`, `runtime.run_skill`, `runtime.task_transfer`, `runtime.handoff`, `background_agent.run`).
- … `params._meta` propagation, plus OAuth flow spans.
- … `otelhttp` and marked as `invoke_agent`.
- … `docker exec`.
- … (`service.*`, `host.*`, `process.*`, `os.type`)

This PR wires two opt-in env vars beyond the default OTel SDK ones:
- `OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT`: capture prompts, responses, tool arguments and tool results as span attributes. Off by default (PII surface).
- `OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental`: emit only the spec-defined `gen_ai.*` keys. Default is dual-emit (both `gen_ai.*` and the legacy `tool.name` / `agent` / `session.id` keys), so existing dashboards keep working alongside spec-aware tooling.

The diff is large: ~50 files, ~5k lines. It's split into 10 topical commits (telemetry primitives → SDK init → providers → runtime → hooks → MCP → A2A → servers/cold-start → memory/RAG → tool internals) so each commit is independently reviewable. Most of the volume is in the new `pkg/telemetry/genai/` and `pkg/telemetry/mcp/` packages, which are pure helpers; the surface-area changes elsewhere are 1-3 lines per call site.
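The dual-emit default controlled by `OTEL_SEMCONV_STABILITY_OPT_IN` can be sketched as below. The function names are hypothetical, the variable is treated as a comma-separated list (the usual OTel convention, assumed here), and the tool-name pair is only one example of a spec key alongside its legacy counterpart.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// latestOnly reports whether the opt-in list contains the GenAI token.
// Assumption: the value is a comma-separated list.
func latestOnly(optIn string) bool {
	for _, v := range strings.Split(optIn, ",") {
		if strings.TrimSpace(v) == "gen_ai_latest_experimental" {
			return true
		}
	}
	return false
}

// toolAttrs emits the spec key always and the legacy key only in the
// default dual-emit mode.
func toolAttrs(toolName string, latest bool) map[string]string {
	attrs := map[string]string{"gen_ai.tool.name": toolName}
	if !latest {
		attrs["tool.name"] = toolName // legacy key kept for existing dashboards
	}
	return attrs
}

func main() {
	latest := latestOnly(os.Getenv("OTEL_SEMCONV_STABILITY_OPT_IN"))
	fmt.Println(toolAttrs("shell", latest))
}
```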