Add diff, multi-account, spot, K8s GPU metrics, Prometheus #26
Merged
Three-source fallback chain: CloudWatch Container Insights, DCGM exporter scrape, and Prometheus query. Per-node fallback with new ruleK8sLowGPUUtil analysis rule.
Discovers dcgm-exporter pods via label selectors and scrapes their Prometheus metrics endpoint via kubectl proxy to populate GPU and GPU memory utilization on K8s node instances. Skips nodes that already have utilization data and gracefully handles scrape errors.
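The scrape step above boils down to reading Prometheus text-format output and pulling out the GPU utilization samples. A minimal sketch of that parsing, assuming the exporter emits the standard `DCGM_FI_DEV_GPU_UTIL` gauge; the function name `parseGPUUtil` and the choice to average across GPUs are illustrative assumptions, not taken from this PR:

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseGPUUtil averages every DCGM_FI_DEV_GPU_UTIL sample found in a
// Prometheus text-format body. ok is false when no sample was found.
// Hypothetical helper: the PR does not show its actual parser.
func parseGPUUtil(body string) (avg float64, ok bool) {
	const metric = "DCGM_FI_DEV_GPU_UTIL"
	var sum float64
	var n int
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		// Keep only sample lines for this exact metric; comment lines
		// start with '#' and therefore never match the prefix.
		if !strings.HasPrefix(line, metric+"{") && !strings.HasPrefix(line, metric+" ") {
			continue
		}
		// The sample value is the first field after the label set.
		rest := line[len(metric):]
		if i := strings.LastIndex(rest, "}"); i >= 0 {
			rest = rest[i+1:]
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue // tolerate malformed lines instead of failing the scrape
		}
		sum += v
		n++
	}
	if n == 0 {
		return 0, false
	}
	return sum / float64(n), true
}

func main() {
	body := `# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 40
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 60
`
	avg, ok := parseGPUUtil(body)
	fmt.Println(avg, ok)
}
```

Skipping malformed lines rather than returning an error mirrors the "gracefully handles scrape errors" behavior described above.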
Add --prom-url and --prom-endpoint flags (mutually exclusive) for Prometheus GPU metrics. Orchestrate the 3-source fallback chain (CloudWatch Container Insights → DCGM scrape → Prometheus) between K8s discovery and analysis.
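The fallback order described here can be sketched as a loop over sources in which each source only touches nodes that still lack data. This is an illustration under assumptions: the `Node`, `Source`, and `enrich` names are hypothetical and the fetch functions stand in for the real CloudWatch, DCGM, and Prometheus clients.

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a hypothetical stand-in for a K8s GPU node record.
type Node struct {
	Name    string
	GPUUtil *float64 // nil until some source provides a value
	Source  string   // which source filled GPUUtil
}

// Source is one step of the fallback chain (hypothetical shape).
type Source struct {
	Name  string
	Fetch func(node string) (float64, error)
}

var errNoData = errors.New("no data")

// enrich walks the sources in priority order. A node that already has
// utilization data is skipped, so later sources act only as fallbacks.
func enrich(nodes []*Node, sources []Source) {
	for _, src := range sources {
		for _, n := range nodes {
			if n.GPUUtil != nil {
				continue
			}
			v, err := src.Fetch(n.Name)
			if err != nil {
				continue // leave the node for the next source
			}
			n.GPUUtil, n.Source = &v, src.Name
		}
	}
}

func main() {
	sources := []Source{
		{Name: "cloudwatch", Fetch: func(string) (float64, error) { return 0, errNoData }},
		{Name: "dcgm", Fetch: func(n string) (float64, error) {
			if n == "node-a" {
				return 12.5, nil
			}
			return 0, errNoData
		}},
		{Name: "prometheus", Fetch: func(string) (float64, error) { return 80, nil }},
	}
	nodes := []*Node{{Name: "node-a"}, {Name: "node-b"}}
	enrich(nodes, sources)
	for _, n := range nodes {
		fmt.Printf("%s: %.1f%% via %s\n", n.Name, *n.GPUUtil, n.Source)
	}
}
```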
DCGM enrichment matched pods to instances by InstanceID, but pod.Spec.NodeName is the K8s hostname (e.g. ip-10-22-1-100.ec2.internal) while InstanceID is the EC2 ID (i-0671...). Add K8sNodeName field to GPUInstance and use it for DCGM matching. Also stop retrying CW queries after the first error — all nodes will get the same AccessDenied when credentials aren't available.
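To illustrate the mismatch: a lookup keyed by the pod's node name never hits when instances are indexed by EC2 ID, but it does once the instance carries its K8s hostname. A small sketch, assuming a `GPUUtil` field and an `applyNodeMetrics` helper that are hypothetical; the instance ID in the example is made up.

```go
package main

import "fmt"

// GPUInstance mirrors only the fields relevant here. InstanceID and
// K8sNodeName come from the description above; GPUUtil is an assumed field.
type GPUInstance struct {
	InstanceID  string // EC2 ID
	K8sNodeName string // K8s hostname, same value as pod.Spec.NodeName
	GPUUtil     *float64
}

// applyNodeMetrics copies per-node utilization (keyed by K8s node name)
// onto instances that have no data yet and returns how many matched.
func applyNodeMetrics(instances []*GPUInstance, utilByNode map[string]float64) int {
	matched := 0
	for _, inst := range instances {
		if inst.GPUUtil != nil {
			continue // already enriched by an earlier source
		}
		v, ok := utilByNode[inst.K8sNodeName]
		if !ok {
			continue
		}
		inst.GPUUtil = &v
		matched++
	}
	return matched
}

func main() {
	instances := []*GPUInstance{
		{InstanceID: "i-0123456789abcdef0", K8sNodeName: "ip-10-22-1-100.ec2.internal"},
	}
	utilByNode := map[string]float64{"ip-10-22-1-100.ec2.internal": 7.5}
	fmt.Println(applyNodeMetrics(instances, utilByNode))
}
```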
DCGM: stop spamming per-node warnings when scrapes fail consistently (likely RBAC). Log one warning, bail after 3 consecutive failures. Prometheus: use K8sNodeName (the actual K8s hostname) in the PromQL node=~ regex instead of InstanceID (EC2 ID). The Prometheus node label matches K8s hostnames, not EC2 instance IDs.
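One way to read the DCGM change is as a consecutive-failure counter with a single warning. The sketch below is an interpretation, not the PR's code: `scrapeAll`, the `warn` callback, and the decision to warn on the first failure are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// maxConsecutiveFailures matches the threshold stated above.
const maxConsecutiveFailures = 3

var errForbidden = errors.New("forbidden")

// scrapeAll scrapes each node in order. It warns once on the first failure
// and gives up after maxConsecutiveFailures failures in a row, on the theory
// that a systematic cause (for example RBAC) will affect every node.
func scrapeAll(nodes []string, scrape func(string) (float64, error), warn func(string)) map[string]float64 {
	out := make(map[string]float64)
	failures := 0
	warned := false
	for _, n := range nodes {
		v, err := scrape(n)
		if err != nil {
			if !warned {
				warn("DCGM scrape failed on " + n + ": " + err.Error())
				warned = true
			}
			failures++
			if failures >= maxConsecutiveFailures {
				break
			}
			continue
		}
		failures = 0 // a success resets the streak
		out[n] = v
	}
	return out
}

func main() {
	calls := 0
	scrape := func(string) (float64, error) {
		calls++
		return 0, errForbidden
	}
	nodes := []string{"n1", "n2", "n3", "n4", "n5"}
	scrapeAll(nodes, scrape, func(msg string) { fmt.Println("warning:", msg) })
	fmt.Println("scrape attempts:", calls)
}
```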
Compares two scan results by instance ID. Detects added, removed, and changed instances across 6 fields (instance type, pricing model, cost, state, GPU allocation, waste severity). Computes cost deltas.
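The comparison can be pictured as a keyed join on instance ID. A reduced sketch, assuming a hypothetical `Instance` shape that carries only two of the six compared fields; `diffScans` and `Diff` are illustrative names and the data is invented.

```go
package main

import "fmt"

// Instance is a trimmed, hypothetical view of one scan entry.
type Instance struct {
	InstanceID   string
	InstanceType string
	MonthlyCost  float64
}

// Diff lists instance IDs per category plus the total cost change.
type Diff struct {
	Added, Removed, Changed []string
	CostDelta               float64 // total new cost minus total old cost
}

// diffScans joins the two snapshots by InstanceID.
func diffScans(oldScan, newScan []Instance) Diff {
	oldByID := make(map[string]Instance, len(oldScan))
	for _, o := range oldScan {
		oldByID[o.InstanceID] = o
	}
	var d Diff
	seen := make(map[string]bool, len(newScan))
	for _, n := range newScan {
		seen[n.InstanceID] = true
		d.CostDelta += n.MonthlyCost
		o, ok := oldByID[n.InstanceID]
		switch {
		case !ok:
			d.Added = append(d.Added, n.InstanceID)
		case o != n: // any compared field differs
			d.Changed = append(d.Changed, n.InstanceID)
		}
	}
	for _, o := range oldScan {
		d.CostDelta -= o.MonthlyCost
		if !seen[o.InstanceID] {
			d.Removed = append(d.Removed, o.InstanceID)
		}
	}
	return d
}

func main() {
	oldScan := []Instance{{"i-a", "g5.xlarge", 100}, {"i-b", "g5.xlarge", 200}}
	newScan := []Instance{{"i-b", "g5.2xlarge", 250}, {"i-c", "g4dn.xlarge", 75}}
	fmt.Printf("%+v\n", diffScans(oldScan, newScan))
}
```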
gpuaudit diff old.json new.json [--format table|json]
Closes #5
The recommendation said "No GPU pods scheduled for X days" but X was the node's total uptime, not the idle duration. We don't know when the node became idle — only that it currently has zero GPU pods. Changed wording to "Node up X days with 0 GPU pods scheduled."
Covers CLI flags (--targets, --role, --org), architecture for parallel cross-account scanning via STS AssumeRole, output changes with per-target sub-summaries, and IAM role setup docs (Terraform + CloudFormation).
Add ResolveTargets function that resolves scan targets based on the --targets, --org, --role, and --skip-self options. The self account uses the original credentials (no AssumeRole); failed role assumptions are collected as TargetError values rather than treated as fatal. Add STSClient and OrgClient interfaces, Target and TargetError types, multi-target fields to ScanOptions, and the organizations SDK dependency. Includes 6 tests covering: self-only, explicit targets, skip-self, partial failure, org discovery with suspended account filtering, and self-in-targets deduplication.
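The resolution rules can be sketched without any AWS calls by injecting the role assumption as a function. Everything below is a simplified illustration: the lowercase `resolveTargets`, its signature, the `Self` field, and the `assume` callback are assumptions, and org discovery is left out. It also assumes that a self account listed among the targets is treated as the self entry rather than assumed into.

```go
package main

import (
	"errors"
	"fmt"
)

// Target and TargetError are named in the description above; the fields
// shown here are guesses made for this sketch.
type Target struct {
	AccountID string
	Self      bool // true when the original credentials are used
}

type TargetError struct {
	AccountID string
	Err       error
}

var errAccessDenied = errors.New("access denied")

// resolveTargets builds the target list. Failed assumptions are collected
// instead of aborting, and duplicates (including self) are dropped.
func resolveTargets(self string, requested []string, skipSelf bool, assume func(accountID string) error) ([]Target, []TargetError) {
	var targets []Target
	var errs []TargetError
	seen := map[string]bool{self: true}
	if !skipSelf {
		targets = append(targets, Target{AccountID: self, Self: true})
	}
	for _, id := range requested {
		if seen[id] {
			continue
		}
		seen[id] = true
		if err := assume(id); err != nil {
			errs = append(errs, TargetError{AccountID: id, Err: err})
			continue
		}
		targets = append(targets, Target{AccountID: id})
	}
	return targets, errs
}

func main() {
	assume := func(id string) error {
		if id == "333333333333" {
			return errAccessDenied
		}
		return nil
	}
	targets, errs := resolveTargets("111111111111",
		[]string{"111111111111", "222222222222", "333333333333"}, false, assume)
	fmt.Println(len(targets), "targets,", len(errs), "errors")
}
```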
ResolveTargets now returns selfAccount separately so Scan() always gets the correct caller identity regardless of --skip-self. Extracted models.MaxSeverity to replace three copies of severity classification. Removed dead error return from scanTarget. Added missing copyright headers.
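A shared severity helper of this kind might look like the following. The PR names `models.MaxSeverity` but does not show its signature, so the string-slice signature and the three severity levels used here are assumptions made for illustration.

```go
package main

import "fmt"

// severityRank orders severities; the level names are assumed, not taken
// from the PR. Unknown strings rank 0 and are therefore ignored.
var severityRank = map[string]int{
	"info":     1,
	"warning":  2,
	"critical": 3,
}

// MaxSeverity returns the highest-ranked severity in the list, or "" when
// the list holds no known severity. Hypothetical signature.
func MaxSeverity(severities []string) string {
	best := ""
	for _, s := range severities {
		if severityRank[s] > severityRank[best] {
			best = s
		}
	}
	return best
}

func main() {
	fmt.Println(MaxSeverity([]string{"info", "critical", "warning"}))
}
```

Centralizing the ranking in one function is what removes the need for each caller to repeat the classification logic.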
Share --prom-url across both EC2 and K8s scan paths. EC2 instances are matched to DCGM Prometheus metrics via private DNS hostname, with fallback to IP extracted from the instance label.
- Add PrivateDnsName field to GPUInstance, populated from EC2 API
- Extract shared Prometheus HTTP query/parse into internal/prometheus
- Refactor K8s metrics.go to use shared prometheus package
- Add EnrichEC2PrometheusGPUMetrics with hostname + IP matching
- Show GPU utilization in signal column for flagged instances
- Wire --prom-url into AWS ScanOptions
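The hostname-then-IP matching can be sketched as two passes over the returned series. This is an assumption-laden illustration: the `Series` and `EC2Instance` shapes, the `PrivateIP` field, and `matchSeries` are hypothetical, and it assumes the series carry dcgm-exporter's `Hostname` label and a Prometheus `instance` label of the form `host:port`.

```go
package main

import (
	"fmt"
	"net"
)

// Series is one Prometheus result with the two labels used for matching.
type Series struct {
	Hostname string // dcgm-exporter "Hostname" label
	Instance string // Prometheus "instance" label, usually host:port
	Value    float64
}

// EC2Instance carries the identifiers to match against. PrivateDnsName is
// named in the description above; PrivateIP is an assumed field.
type EC2Instance struct {
	InstanceID     string
	PrivateDnsName string
	PrivateIP      string
}

// hostOf strips the port from an instance label, if there is one.
func hostOf(instanceLabel string) string {
	host, _, err := net.SplitHostPort(instanceLabel)
	if err != nil {
		return instanceLabel // no port present; use the label as-is
	}
	return host
}

// matchSeries prefers a private DNS hostname match and only then falls
// back to the IP extracted from the instance label.
func matchSeries(inst EC2Instance, series []Series) (float64, bool) {
	for _, s := range series {
		if s.Hostname != "" && s.Hostname == inst.PrivateDnsName {
			return s.Value, true
		}
	}
	for _, s := range series {
		if inst.PrivateIP != "" && hostOf(s.Instance) == inst.PrivateIP {
			return s.Value, true
		}
	}
	return 0, false
}

func main() {
	inst := EC2Instance{
		InstanceID:     "i-0123456789abcdef0",
		PrivateDnsName: "ip-10-22-1-100.ec2.internal",
		PrivateIP:      "10.22.1.100",
	}
	series := []Series{{Hostname: "some-pod-name", Instance: "10.22.1.100:9400", Value: 42}}
	fmt.Println(matchSeries(inst, series))
}
```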
Replace fleet-specific numbers with generic examples that don't expose real infrastructure details.
9f53f5c to 0c33990
maksimov added a commit that referenced this pull request on Apr 21, 2026: Add diff, multi-account, spot, K8s GPU metrics, Prometheus
Summary
- Diff command (gpuaudit diff old.json new.json): compare two scan snapshots, report added/removed/changed instances with cost deltas
- Multi-account scanning (--targets, --org, --role): parallel execution, per-target summaries
- Shared --prom-url across K8s and EC2 scanners, matching by hostname with IP fallback

Test plan
- go test ./... passes (all existing + new tests)
- go vet ./... clean
- go build succeeds
- gpuaudit diff with two JSON snapshots