Add diff, multi-account, spot, K8s GPU metrics, Prometheus #26

Merged
maksimov merged 39 commits into master from feature/k8s-gpu-metrics
Apr 21, 2026
@maksimov
Collaborator

Summary

  • Diff command (gpuaudit diff old.json new.json): compare two scan snapshots, report added/removed/changed instances with cost deltas
  • Multi-account scanning: scan multiple AWS accounts via STS AssumeRole (--targets, --org, --role), parallel execution, per-target summaries
  • Spot recommendations: enrich EC2 instances with live spot pricing, flag on-demand instances eligible for spot (rule 8)
  • K8s GPU metrics: three-source fallback chain — CloudWatch Container Insights, DCGM exporter scrape, Prometheus query — with low GPU utilization detection (rule 9)
  • Prometheus EC2 enrichment: share --prom-url across K8s and EC2 scanners, match by hostname and IP fallback
  • README sanitized: generic example output, no identifiable fleet details

Test plan

  • go test ./... passes (all existing + new tests)
  • go vet ./... clean
  • go build succeeds
  • Manual: single-account scan
  • Manual: gpuaudit diff with two JSON snapshots

maksimov added 30 commits April 19, 2026 22:42
Three-source fallback chain: CloudWatch Container Insights,
DCGM exporter scrape, and Prometheus query. Per-node fallback
with new ruleK8sLowGPUUtil analysis rule.

Discovers dcgm-exporter pods via label selectors and scrapes their
Prometheus metrics endpoint via kubectl proxy to populate GPU and
GPU memory utilization on K8s node instances. Skips nodes that
already have utilization data and gracefully handles scrape errors.

Add --prom-url and --prom-endpoint flags (mutually exclusive) for
Prometheus GPU metrics. Orchestrate the 3-source fallback chain
(CloudWatch Container Insights → DCGM scrape → Prometheus) between
K8s discovery and analysis.
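The per-node fallback described in these commits can be sketched roughly as below. This is only an illustration, not the code in this PR: the `Node`, `Source`, and `Enrich` names and the fake fetchers are assumptions made up for the example. It shows each source being asked only about nodes that still lack utilization data, with a failing source skipped rather than treated as fatal.

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a hypothetical stand-in for a K8s GPU node record.
type Node struct {
	Name    string
	GPUUtil *float64 // nil until some source supplies a value
	Source  string   // name of the source that supplied GPUUtil
}

// Source is a hypothetical metrics source returning utilization keyed by node name.
type Source struct {
	Name  string
	Fetch func(nodes []string) (map[string]float64, error)
}

// Enrich tries each source in order, asking only about nodes that still lack data.
func Enrich(nodes []*Node, sources []Source) {
	for _, src := range sources {
		var missing []string
		for _, n := range nodes {
			if n.GPUUtil == nil {
				missing = append(missing, n.Name)
			}
		}
		if len(missing) == 0 {
			return // every node already has utilization data
		}
		vals, err := src.Fetch(missing)
		if err != nil {
			fmt.Printf("warning: %s unavailable: %v\n", src.Name, err)
			continue // fall through to the next source
		}
		for _, n := range nodes {
			if n.GPUUtil != nil {
				continue // an earlier source already answered for this node
			}
			if v, ok := vals[n.Name]; ok {
				n.GPUUtil = &v
				n.Source = src.Name
			}
		}
	}
}

// demo wires three fake sources in the order the PR describes.
func demo() []*Node {
	nodes := []*Node{{Name: "node-a"}, {Name: "node-b"}, {Name: "node-c"}}
	Enrich(nodes, []Source{
		{Name: "cloudwatch", Fetch: func([]string) (map[string]float64, error) {
			return nil, errors.New("AccessDenied")
		}},
		{Name: "dcgm", Fetch: func([]string) (map[string]float64, error) {
			return map[string]float64{"node-a": 12.5}, nil
		}},
		{Name: "prometheus", Fetch: func([]string) (map[string]float64, error) {
			return map[string]float64{"node-a": 99, "node-b": 40}, nil
		}},
	})
	return nodes
}

func main() {
	for _, n := range demo() {
		if n.GPUUtil == nil {
			fmt.Printf("%s: no data\n", n.Name)
			continue
		}
		fmt.Printf("%s: %.1f%% (%s)\n", n.Name, *n.GPUUtil, n.Source)
	}
}
```

In this sketch a value from an earlier source is never overwritten by a later one, which mirrors the "skips nodes that already have utilization data" behaviour described above.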
DCGM enrichment matched pods to instances by InstanceID, but
pod.Spec.NodeName is the K8s hostname (e.g. ip-10-22-1-100.ec2.internal)
while InstanceID is the EC2 ID (i-0671...). Add K8sNodeName field to
GPUInstance and use it for DCGM matching.

Also stop retrying CW queries after the first error — all nodes will
get the same AccessDenied when credentials aren't available.
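To make the mismatch concrete, here is a minimal sketch of matching exporter pods to instances by K8s node name rather than by EC2 instance ID. Only `InstanceID` and `K8sNodeName` come from the commit message; the `exporterPod` type, the `matchPods` helper, and the instance ID in the demo are made up for illustration.

```go
package main

import "fmt"

// GPUInstance carries only the two fields named in the commit message.
type GPUInstance struct {
	InstanceID  string // EC2 ID
	K8sNodeName string // K8s hostname, the same value as pod.Spec.NodeName
}

// exporterPod is a hypothetical reduction of a dcgm-exporter pod to what matching needs.
type exporterPod struct {
	Name     string
	NodeName string // value of pod.Spec.NodeName
}

// matchPods returns, per pod name, the instance whose K8sNodeName equals the pod's node.
func matchPods(instances []*GPUInstance, pods []exporterPod) map[string]*GPUInstance {
	byNode := make(map[string]*GPUInstance, len(instances))
	for _, inst := range instances {
		if inst.K8sNodeName != "" {
			byNode[inst.K8sNodeName] = inst
		}
	}
	matched := make(map[string]*GPUInstance)
	for _, p := range pods {
		if inst, ok := byNode[p.NodeName]; ok {
			matched[p.Name] = inst
		}
	}
	return matched
}

// demo uses a made-up instance ID and the hostname format quoted in the commit.
func demo() map[string]*GPUInstance {
	instances := []*GPUInstance{
		{InstanceID: "i-0123456789abcdef0", K8sNodeName: "ip-10-22-1-100.ec2.internal"},
	}
	pods := []exporterPod{
		{Name: "dcgm-exporter-a", NodeName: "ip-10-22-1-100.ec2.internal"},
		{Name: "dcgm-exporter-b", NodeName: "ip-10-22-1-200.ec2.internal"},
	}
	return matchPods(instances, pods)
}

func main() {
	for pod, inst := range demo() {
		fmt.Printf("%s -> %s\n", pod, inst.InstanceID)
	}
}
```

Keying the lookup on `InstanceID` instead would leave every pod unmatched, which is the failure the commit fixes.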
DCGM: stop spamming per-node warnings when scrapes fail consistently
(likely RBAC). Log one warning, bail after 3 consecutive failures.

Prometheus: use K8sNodeName (the actual K8s hostname) in the PromQL
node=~ regex instead of InstanceID (EC2 ID). The Prometheus node label
matches K8s hostnames, not EC2 instance IDs.
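The "one warning, bail after 3 consecutive failures" behaviour could look something like the following. This is a sketch under assumptions: `scrapeAll`, `failingExcept`, and the node names are invented for the example, and the real scrape goes through kubectl proxy rather than a plain function call.

```go
package main

import (
	"errors"
	"fmt"
)

// maxConsecutiveFailures is the bail-out threshold mentioned in the commit message.
const maxConsecutiveFailures = 3

// scrapeAll scrapes nodes in order, logs only the first failure, and stops after
// maxConsecutiveFailures failures in a row. It returns the successful values and
// the number of scrape attempts made.
func scrapeAll(nodes []string, scrape func(string) (float64, error)) (map[string]float64, int) {
	results := make(map[string]float64)
	failures, attempts := 0, 0
	warned := false
	for _, n := range nodes {
		attempts++
		v, err := scrape(n)
		if err != nil {
			failures++
			if !warned {
				fmt.Printf("warning: DCGM scrape failed (%v); suppressing further warnings\n", err)
				warned = true
			}
			if failures >= maxConsecutiveFailures {
				break // consistent failures, likely RBAC: stop trying
			}
			continue
		}
		failures = 0 // a success resets the streak
		results[n] = v
	}
	return results, attempts
}

// failingExcept returns a fake scraper that succeeds only for the named node.
func failingExcept(okNode string) func(string) (float64, error) {
	return func(node string) (float64, error) {
		if node == okNode {
			return 42, nil
		}
		return 0, errors.New("forbidden")
	}
}

func main() {
	nodes := []string{"n1", "n2", "n3", "n4", "n5", "n6"}
	results, attempts := scrapeAll(nodes, failingExcept("n2"))
	fmt.Printf("scraped %d node(s) in %d attempt(s)\n", len(results), attempts)
}
```

Resetting the counter on success is a design choice in this sketch: it keeps one flaky node from aborting a scan that is otherwise working.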
Compares two scan results by instance ID. Detects added, removed,
and changed instances across 6 fields (instance type, pricing model,
cost, state, GPU allocation, waste severity). Computes cost deltas.
gpuaudit diff old.json new.json [--format table|json]

Closes #5
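For readers unfamiliar with the approach, comparing snapshots by instance ID might be sketched as follows. The `Instance` and `Diff` types and the `DiffScans` name are assumptions for this example, and only two of the six compared fields are modelled; the actual JSON schema is not shown in this PR description.

```go
package main

import (
	"fmt"
	"sort"
)

// Instance is a hypothetical subset of one scan record.
type Instance struct {
	InstanceID   string
	InstanceType string
	MonthlyCost  float64
}

// Diff lists instance IDs per category plus the net monthly cost change.
type Diff struct {
	Added, Removed, Changed []string
	CostDelta               float64
}

// DiffScans compares two snapshots keyed by instance ID.
func DiffScans(prev, curr []Instance) Diff {
	prevByID := make(map[string]Instance, len(prev))
	for _, in := range prev {
		prevByID[in.InstanceID] = in
	}
	var d Diff
	seen := make(map[string]bool, len(curr))
	for _, in := range curr {
		seen[in.InstanceID] = true
		old, ok := prevByID[in.InstanceID]
		switch {
		case !ok:
			d.Added = append(d.Added, in.InstanceID)
			d.CostDelta += in.MonthlyCost
		case old != in: // any modelled field differs
			d.Changed = append(d.Changed, in.InstanceID)
			d.CostDelta += in.MonthlyCost - old.MonthlyCost
		}
	}
	for _, in := range prev {
		if !seen[in.InstanceID] {
			d.Removed = append(d.Removed, in.InstanceID)
			d.CostDelta -= in.MonthlyCost
		}
	}
	sort.Strings(d.Added)
	sort.Strings(d.Removed)
	sort.Strings(d.Changed)
	return d
}

// demo builds two tiny snapshots with made-up IDs and costs.
func demo() Diff {
	prev := []Instance{
		{InstanceID: "i-a", InstanceType: "g5.xlarge", MonthlyCost: 100},
		{InstanceID: "i-b", InstanceType: "p3.2xlarge", MonthlyCost: 300},
	}
	curr := []Instance{
		{InstanceID: "i-a", InstanceType: "g5.2xlarge", MonthlyCost: 150},
		{InstanceID: "i-c", InstanceType: "g4dn.xlarge", MonthlyCost: 50},
	}
	return DiffScans(prev, curr)
}

func main() {
	d := demo()
	fmt.Printf("added=%v removed=%v changed=%v delta=%+.2f\n",
		d.Added, d.Removed, d.Changed, d.CostDelta)
}
```

The net delta here equals the sum of the new snapshot's costs minus the sum of the old one's, so it stays correct however the changes are split between categories.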
The recommendation said "No GPU pods scheduled for X days" but X was
the node's total uptime, not the idle duration. We don't know when
the node became idle, only that it currently has zero GPU pods.
Changed wording to "Node up X days with 0 GPU pods scheduled."

Covers CLI flags (--targets, --role, --org), architecture for
parallel cross-account scanning via STS AssumeRole, output changes
with per-target sub-summaries, and IAM role setup docs (Terraform
+ CloudFormation).

Add ResolveTargets function that resolves scan targets based on
--targets, --org, --role, and --skip-self options. Self account uses
original credentials (no AssumeRole), failed assumptions are collected
as TargetError rather than being fatal.

Add STSClient and OrgClient interfaces, Target and TargetError types,
multi-target fields to ScanOptions, and organizations SDK dependency.

Includes 6 tests covering: self-only, explicit targets, skip-self,
partial failure, org discovery with suspended account filtering, and
self-in-targets deduplication.
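A simplified picture of the resolution logic, for illustration only. The real `ResolveTargets` signature is not shown here, so `resolveTargets`, `assumeFn`, `fakeAssume`, the account IDs, and the role name below are all hypothetical. The sketch also assumes that the self account is never assumed into even when it appears in the explicit list, and that `--skip-self` drops it from the results entirely.

```go
package main

import (
	"errors"
	"fmt"
)

// Target and TargetError are named in the commit; these fields are assumptions.
type Target struct {
	AccountID string
	Self      bool // true when the caller's original credentials are used
}

type TargetError struct {
	AccountID string
	Err       error
}

// assumeFn is a hypothetical stand-in for an STS AssumeRole call.
type assumeFn func(accountID, role string) error

// resolveTargets deduplicates accounts, keeps the self account on original
// credentials, and collects failed role assumptions instead of aborting.
func resolveTargets(self string, accounts []string, role string, skipSelf bool, assume assumeFn) ([]Target, []TargetError) {
	var targets []Target
	var failed []TargetError
	seen := map[string]bool{self: true} // never AssumeRole into the caller's own account
	if !skipSelf {
		targets = append(targets, Target{AccountID: self, Self: true})
	}
	for _, id := range accounts {
		if seen[id] {
			continue // duplicate entry, or the self account listed explicitly
		}
		seen[id] = true
		if err := assume(id, role); err != nil {
			failed = append(failed, TargetError{AccountID: id, Err: err})
			continue
		}
		targets = append(targets, Target{AccountID: id})
	}
	return targets, failed
}

// fakeAssume fails for one made-up account to exercise the partial-failure path.
func fakeAssume(accountID, role string) error {
	if accountID == "333333333333" {
		return errors.New("AccessDenied")
	}
	return nil
}

func main() {
	accounts := []string{"222222222222", "111111111111", "333333333333", "222222222222"}
	targets, failed := resolveTargets("111111111111", accounts, "example-audit-role", false, fakeAssume)
	fmt.Println("targets:", targets)
	for _, f := range failed {
		fmt.Printf("failed: %s (%v)\n", f.AccountID, f.Err)
	}
}
```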
ResolveTargets now returns selfAccount separately so Scan() always
gets the correct caller identity regardless of --skip-self. Extracted
models.MaxSeverity to replace three copies of severity classification.
Removed dead error return from scanTarget. Added missing copyright
headers.
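As a rough idea of what a shared severity helper might look like: the sketch below is not the actual `models.MaxSeverity`. Its variadic signature, the `Severity` type, and the level names are assumptions, since the commit message does not show them.

```go
package main

import "fmt"

// Severity is a hypothetical ordered severity scale; higher values are worse.
type Severity int

const (
	SeverityNone Severity = iota
	SeverityInfo
	SeverityWarning
	SeverityCritical
)

// String returns a lowercase label for printing.
func (s Severity) String() string {
	switch s {
	case SeverityInfo:
		return "info"
	case SeverityWarning:
		return "warning"
	case SeverityCritical:
		return "critical"
	default:
		return "none"
	}
}

// MaxSeverity returns the highest severity among levels, or SeverityNone if empty.
func MaxSeverity(levels ...Severity) Severity {
	highest := SeverityNone
	for _, l := range levels {
		if l > highest {
			highest = l
		}
	}
	return highest
}

func main() {
	fmt.Println(MaxSeverity(SeverityInfo, SeverityCritical, SeverityWarning))
}
```

Centralizing the comparison in one function means the ordering of levels is defined once, which is the point of replacing the three copies.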
Share --prom-url across both EC2 and K8s scan paths. EC2 instances are
matched to DCGM Prometheus metrics via private DNS hostname, with
fallback to IP extracted from the instance label.

- Add PrivateDnsName field to GPUInstance, populated from EC2 API
- Extract shared Prometheus HTTP query/parse into internal/prometheus
- Refactor K8s metrics.go to use shared prometheus package
- Add EnrichEC2PrometheusGPUMetrics with hostname + IP matching
- Show GPU utilization in signal column for flagged instances
- Wire --prom-url into AWS ScanOptions
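The hostname-then-IP matching can be sketched as below. Only `PrivateDnsName` is taken from the commit; the `Sample` type, the `PrivateIP` field, and the assumption that the Prometheus instance label looks like `host:port` are mine, so treat this as one plausible shape rather than what `EnrichEC2PrometheusGPUMetrics` actually does.

```go
package main

import (
	"fmt"
	"net"
)

// GPUInstance shows only the fields this sketch needs; PrivateIP is an assumption.
type GPUInstance struct {
	InstanceID     string
	PrivateDnsName string
	PrivateIP      string
	GPUUtil        *float64
}

// Sample is a hypothetical parsed Prometheus series with two labels and a value.
type Sample struct {
	Hostname string
	Instance string // assumed to look like "10.0.0.2:9400"
	Value    float64
}

// ipFromInstanceLabel strips an optional port and returns the host only if it is an IP.
func ipFromInstanceLabel(label string) string {
	host := label
	if h, _, err := net.SplitHostPort(label); err == nil {
		host = h
	}
	if net.ParseIP(host) == nil {
		return ""
	}
	return host
}

// enrich matches by private DNS hostname first and falls back to the IP.
func enrich(instances []*GPUInstance, samples []Sample) {
	byHost := make(map[string]float64)
	byIP := make(map[string]float64)
	for _, s := range samples {
		if s.Hostname != "" {
			byHost[s.Hostname] = s.Value
		}
		if ip := ipFromInstanceLabel(s.Instance); ip != "" {
			byIP[ip] = s.Value
		}
	}
	for _, inst := range instances {
		if v, ok := byHost[inst.PrivateDnsName]; ok {
			inst.GPUUtil = &v
			continue
		}
		if v, ok := byIP[inst.PrivateIP]; ok {
			inst.GPUUtil = &v
		}
	}
}

// demo uses made-up instances: one matched by hostname, one by IP, one unmatched.
func demo() []*GPUInstance {
	instances := []*GPUInstance{
		{InstanceID: "i-a", PrivateDnsName: "ip-10-0-0-1.ec2.internal", PrivateIP: "10.0.0.1"},
		{InstanceID: "i-b", PrivateDnsName: "ip-10-0-0-2.ec2.internal", PrivateIP: "10.0.0.2"},
		{InstanceID: "i-c", PrivateDnsName: "ip-10-0-0-3.ec2.internal", PrivateIP: "10.0.0.3"},
	}
	enrich(instances, []Sample{
		{Hostname: "ip-10-0-0-1.ec2.internal", Instance: "10.0.0.1:9400", Value: 7},
		{Hostname: "dcgm-exporter-abcde", Instance: "10.0.0.2:9400", Value: 55},
	})
	return instances
}

func main() {
	for _, inst := range demo() {
		if inst.GPUUtil == nil {
			fmt.Printf("%s: no match\n", inst.InstanceID)
			continue
		}
		fmt.Printf("%s: %.0f%%\n", inst.InstanceID, *inst.GPUUtil)
	}
}
```

The IP fallback matters when the exporter reports its pod name as the hostname, as in the second sample, so the hostname label never equals the EC2 private DNS name.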
Replace fleet-specific numbers with generic examples that
don't expose real infrastructure details.
maksimov force-pushed the feature/k8s-gpu-metrics branch from 9f53f5c to 0c33990 April 21, 2026 12:03
maksimov merged commit a18e50a into master Apr 21, 2026
2 checks passed
maksimov deleted the feature/k8s-gpu-metrics branch April 21, 2026 12:06
maksimov added a commit that referenced this pull request Apr 21, 2026
Add diff, multi-account, spot, K8s GPU metrics, Prometheus
