Add diff, multi-account, spot, K8s GPU metrics, Prometheus by maksimov · Pull Request #26 · gpuaudit/cli

maksimov · 2026-04-21T11:59:00Z

Summary

Diff command (gpuaudit diff old.json new.json): compare two scan snapshots, report added/removed/changed instances with cost deltas
Multi-account scanning: scan multiple AWS accounts via STS AssumeRole (--targets, --org, --role), parallel execution, per-target summaries
Spot recommendations: enrich EC2 instances with live spot pricing, flag on-demand instances eligible for spot (rule 8)
K8s GPU metrics: three-source fallback chain — CloudWatch Container Insights, DCGM exporter scrape, Prometheus query — with low GPU utilization detection (rule 9)
Prometheus EC2 enrichment: share --prom-url across K8s and EC2 scanners, match by hostname and IP fallback
README sanitized: generic example output, no identifiable fleet details

Test plan

go test ./... passes (all existing + new tests)
go vet ./... clean
go build succeeds
Manual: single-account scan
Manual: gpuaudit diff with two JSON snapshots

Discovers dcgm-exporter pods via label selectors and scrapes their Prometheus metrics endpoint via kubectl proxy to populate GPU and GPU memory utilization on K8s node instances. Skips nodes that already have utilization data and gracefully handles scrape errors.

DCGM enrichment matched pods to instances by InstanceID, but pod.Spec.NodeName is the K8s hostname (e.g. ip-10-22-1-100.ec2.internal) while InstanceID is the EC2 ID (i-0671...). Add a K8sNodeName field to GPUInstance and use it for DCGM matching. Also stop retrying CloudWatch queries after the first error, because every node will get the same AccessDenied when credentials aren't available.

DCGM: stop spamming per-node warnings when scrapes fail consistently (likely RBAC). Log one warning, bail after 3 consecutive failures. Prometheus: use K8sNodeName (the actual K8s hostname) in the PromQL node=~ regex instead of InstanceID (EC2 ID). The Prometheus node label matches K8s hostnames, not EC2 instance IDs.
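The DCGM behaviour described above (one warning, bail after 3 consecutive failures) could look roughly like the sketch below. This is an illustration only, not the code in this PR: `scrapeAll`, `scrapeFunc`, and `errScrape` are hypothetical names, and the real scraper goes through the K8s API rather than a plain callback.

```go
package main

import (
	"errors"
	"fmt"
)

// maxConsecutiveFailures mirrors "bail after 3 consecutive failures".
const maxConsecutiveFailures = 3

// errScrape is a hypothetical error standing in for an RBAC denial.
var errScrape = errors.New("scrape forbidden")

// scrapeFunc is a hypothetical stand-in for scraping one node's dcgm-exporter pod.
type scrapeFunc func(node string) (gpuUtil float64, err error)

// scrapeAll collects utilization per node. It logs at most one warning and
// stops early once maxConsecutiveFailures scrapes in a row have failed.
func scrapeAll(nodes []string, scrape scrapeFunc) (map[string]float64, int) {
	utils := make(map[string]float64)
	consecutive, warnings := 0, 0
	for _, node := range nodes {
		u, err := scrape(node)
		if err != nil {
			if warnings == 0 {
				fmt.Printf("warning: DCGM scrape failed, suppressing further warnings: %v\n", err)
				warnings++
			}
			consecutive++
			if consecutive >= maxConsecutiveFailures {
				break // failures are consistent, likely RBAC; stop trying
			}
			continue
		}
		consecutive = 0 // a success resets the streak
		utils[node] = u
	}
	return utils, warnings
}

func main() {
	alwaysFail := func(string) (float64, error) { return 0, errScrape }
	utils, warnings := scrapeAll([]string{"a", "b", "c", "d", "e"}, alwaysFail)
	fmt.Println(len(utils), warnings)
}
```

Resetting the counter on success means a single flaky node does not abort the whole pass; only an unbroken run of failures does.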

The recommendation said "No GPU pods scheduled for X days" but X was the node's total uptime, not the idle duration. We don't know when the node became idle — only that it currently has zero GPU pods. Changed wording to "Node up X days with 0 GPU pods scheduled."

Add ResolveTargets function that resolves scan targets based on the --targets, --org, --role, and --skip-self options. The self account uses the original credentials (no AssumeRole); failed role assumptions are collected as TargetError values rather than being fatal. Add STSClient and OrgClient interfaces, Target and TargetError types, multi-target fields to ScanOptions, and the organizations SDK dependency. Includes 6 tests covering: self-only, explicit targets, skip-self, partial failure, org discovery with suspended account filtering, and self-in-targets deduplication.
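A minimal sketch of the resolution rules described above, assuming a callback in place of the real STS client. The `Target` and `TargetError` type names come from this PR, but their fields, the `resolveTargets` signature, `assumeFunc`, and the role name in `main` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errAccessDenied is a hypothetical error standing in for a failed AssumeRole.
var errAccessDenied = errors.New("AccessDenied")

// Target is one account to scan. Fields are assumptions for this sketch.
type Target struct {
	AccountID string
	Assumed   bool // false means the original credentials are used
}

// TargetError records an account whose role could not be assumed.
type TargetError struct {
	AccountID string
	Err       error
}

// assumeFunc is a hypothetical stand-in for an STS AssumeRole call.
type assumeFunc func(accountID, role string) error

// resolveTargets applies the rules from the commit message: the self account
// never goes through AssumeRole, duplicates of self are dropped, and failed
// assumptions are collected instead of aborting.
func resolveTargets(self string, accounts []string, role string, skipSelf bool, assume assumeFunc) ([]Target, []TargetError) {
	var targets []Target
	var failed []TargetError
	seen := map[string]bool{self: true}
	if !skipSelf {
		targets = append(targets, Target{AccountID: self})
	}
	for _, id := range accounts {
		if seen[id] {
			continue // self-in-targets or repeated account
		}
		seen[id] = true
		if err := assume(id, role); err != nil {
			failed = append(failed, TargetError{AccountID: id, Err: err})
			continue
		}
		targets = append(targets, Target{AccountID: id, Assumed: true})
	}
	return targets, failed
}

func main() {
	assume := func(accountID, role string) error {
		if accountID == "333" {
			return errAccessDenied
		}
		return nil
	}
	// "gpuaudit-scan" is a made-up role name for illustration.
	targets, failed := resolveTargets("111", []string{"111", "222", "333"}, "gpuaudit-scan", false, assume)
	fmt.Printf("%+v\n%+v\n", targets, failed)
}
```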

ResolveTargets now returns selfAccount separately so Scan() always gets the correct caller identity regardless of --skip-self. Extracted models.MaxSeverity to replace three copies of severity classification. Removed dead error return from scanTarget. Added missing copyright headers.

Share --prom-url across both EC2 and K8s scan paths. EC2 instances are matched to DCGM Prometheus metrics via private DNS hostname, with fallback to IP extracted from the instance label.

- Add PrivateDnsName field to GPUInstance, populated from EC2 API
- Extract shared Prometheus HTTP query/parse into internal/prometheus
- Refactor K8s metrics.go to use shared prometheus package
- Add EnrichEC2PrometheusGPUMetrics with hostname + IP matching
- Show GPU utilization in signal column for flagged instances
- Wire --prom-url into AWS ScanOptions
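The hostname-then-IP matching could be sketched as follows. This is not the body of EnrichEC2PrometheusGPUMetrics: the `sample` and `instance` types, their fields, and the assumption that the instance label has the form `host:port` are illustrative only.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// sample is a hypothetical, already-parsed Prometheus result row.
type sample struct {
	Hostname string // value of a hostname-style label (label name is an assumption)
	Instance string // value of the instance label, assumed to look like "10.22.1.100:9400"
	GPUUtil  float64
}

// instance holds the two EC2 attributes used for matching.
type instance struct {
	PrivateDNSName string
	PrivateIP      string
}

// ipFromInstanceLabel returns the IP part of an instance label, or "" when
// the label does not contain a literal IP address.
func ipFromInstanceLabel(label string) string {
	host, _, err := net.SplitHostPort(label)
	if err != nil {
		host = label // no port present; treat the whole label as the host
	}
	if net.ParseIP(host) == nil {
		return ""
	}
	return host
}

// matchUtil prefers a hostname match and falls back to the IP in the
// instance label only when no hostname matched.
func matchUtil(inst instance, samples []sample) (float64, bool) {
	for _, s := range samples {
		if s.Hostname != "" && strings.EqualFold(s.Hostname, inst.PrivateDNSName) {
			return s.GPUUtil, true
		}
	}
	for _, s := range samples {
		if ip := ipFromInstanceLabel(s.Instance); ip != "" && ip == inst.PrivateIP {
			return s.GPUUtil, true
		}
	}
	return 0, false
}

func main() {
	samples := []sample{
		{Hostname: "ip-10-22-1-100.ec2.internal", Instance: "10.22.1.100:9400", GPUUtil: 7},
		{Instance: "10.22.1.101:9400", GPUUtil: 63},
	}
	fmt.Println(matchUtil(instance{PrivateDNSName: "ip-10-22-1-101.ec2.internal", PrivateIP: "10.22.1.101"}, samples))
}
```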

maksimov added 30 commits April 19, 2026 22:42

Add K8s GPU metrics collection design spec

7a2802e

Three-source fallback chain: CloudWatch Container Insights, DCGM exporter scrape, and Prometheus query. Per-node fallback with new ruleK8sLowGPUUtil analysis rule.

Add K8s GPU metrics collection implementation plan

d271797

Add EnrichK8sGPUMetrics for CloudWatch Container Insights GPU metrics

1c2d3d8

Add ProxyGet to K8sClient interface for pod API proxy

96155c1

Add Prometheus query enrichment for K8s GPU metrics

0f460c4

Add ruleK8sLowGPUUtil for utilization-based K8s GPU waste detection

d605cb4

Wire K8s GPU metrics fallback chain into CLI scan flow

54dc0ce

Add --prom-url and --prom-endpoint flags (mutually exclusive) for Prometheus GPU metrics. Orchestrate the 3-source fallback chain (CloudWatch Container Insights → DCGM scrape → Prometheus) between K8s discovery and analysis.
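As a rough illustration of the per-node fallback order (CloudWatch Container Insights, then DCGM scrape, then Prometheus), the sketch below fills each node from the first source that has data and skips nodes that are already populated. The `node`, `source`, and `enrich` names are hypothetical, and each real source is reduced to a lookup callback.

```go
package main

import "fmt"

// node is a hypothetical K8s GPU node; GPUUtil is nil until some source fills it.
type node struct {
	Name    string
	GPUUtil *float64
}

// source is a hypothetical metrics source reduced to a per-node lookup.
type source struct {
	Name  string
	Fetch func(nodeName string) (float64, bool)
}

// enrich walks the sources in priority order. A node that already has
// utilization data is skipped, so later sources only fill the gaps.
// It returns which source filled each node.
func enrich(nodes []*node, sources []source) map[string]string {
	filledBy := make(map[string]string)
	for _, src := range sources {
		for _, n := range nodes {
			if n.GPUUtil != nil {
				continue
			}
			if v, ok := src.Fetch(n.Name); ok {
				n.GPUUtil = &v
				filledBy[n.Name] = src.Name
			}
		}
	}
	return filledBy
}

// fromMap builds a Fetch callback backed by a fixed map, for the demo only.
func fromMap(m map[string]float64) func(string) (float64, bool) {
	return func(name string) (float64, bool) {
		v, ok := m[name]
		return v, ok
	}
}

func main() {
	nodes := []*node{{Name: "node-a"}, {Name: "node-b"}, {Name: "node-c"}}
	sources := []source{
		{Name: "cloudwatch", Fetch: fromMap(map[string]float64{"node-a": 12})},
		{Name: "dcgm", Fetch: fromMap(nil)},
		{Name: "prometheus", Fetch: fromMap(map[string]float64{"node-a": 99, "node-b": 40})},
	}
	fmt.Println(enrich(nodes, sources))
}
```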

Include time window in low GPU utilization recommendation text

b08c025

Skip CW enrichment when AWS creds unavailable, reduce DCGM noise

e846cc8

Add diff package with Compare function and tests

ff5b7f5

Compares two scan results by instance ID. Detects added, removed, and changed instances across 6 fields (instance type, pricing model, cost, state, GPU allocation, waste severity). Computes cost deltas.
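A simplified sketch of comparing two snapshots by instance ID is shown below. It is not the diff package from this PR: the `instance` and `diffResult` types are hypothetical, only two of the six compared fields are modelled, and the cost delta is taken as the new total minus the old total.

```go
package main

import (
	"fmt"
	"sort"
)

// instance is a cut-down scan record with two of the compared fields.
type instance struct {
	ID           string
	InstanceType string
	MonthlyCost  float64
}

// diffResult lists instance IDs per category plus the overall cost delta.
type diffResult struct {
	Added, Removed, Changed []string
	CostDelta               float64 // new total minus old total
}

// compare indexes the old snapshot by ID, then classifies every instance in
// the new snapshot as added or changed, and every unseen old one as removed.
func compare(oldScan, newScan []instance) diffResult {
	oldByID := make(map[string]instance, len(oldScan))
	for _, in := range oldScan {
		oldByID[in.ID] = in
	}
	var res diffResult
	seen := make(map[string]bool, len(newScan))
	for _, cur := range newScan {
		seen[cur.ID] = true
		res.CostDelta += cur.MonthlyCost
		prev, ok := oldByID[cur.ID]
		switch {
		case !ok:
			res.Added = append(res.Added, cur.ID)
		case prev != cur:
			res.Changed = append(res.Changed, cur.ID)
		}
	}
	for _, prev := range oldScan {
		res.CostDelta -= prev.MonthlyCost
		if !seen[prev.ID] {
			res.Removed = append(res.Removed, prev.ID)
		}
	}
	sort.Strings(res.Added)
	sort.Strings(res.Removed)
	sort.Strings(res.Changed)
	return res
}

func main() {
	oldScan := []instance{{"i-a", "g5.xlarge", 100}, {"i-b", "g5.xlarge", 200}}
	newScan := []instance{{"i-b", "g5.2xlarge", 260}, {"i-c", "g4dn.xlarge", 50}}
	fmt.Printf("%+v\n", compare(oldScan, newScan))
}
```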

Add diff table and JSON output formatters

1eed3b8

Add diff subcommand to compare two scan results

5672229

gpuaudit diff old.json new.json [--format table|json]

Closes #5

Fix box alignment in diff table output

2e95784

Update README with K8s scanning, diff command, and current output format

c9ee92d

Add multi-target scanning design spec

cd27862

Covers CLI flags (--targets, --role, --org), architecture for parallel cross-account scanning via STS AssumeRole, output changes with per-target sub-summaries, and IAM role setup docs (Terraform + CloudFormation).

Add multi-target scanning implementation plan

b8c6a35

Add TargetSummary and TargetErrorInfo model types for multi-target scanning

408399e

Extract BuildSummary to summary.go and add BuildTargetSummaries

e856adb

Refactor Scan() for parallel multi-target scanning

cd49f91

Add --targets, --role, --org, --external-id, --skip-self flags to scan command

698b0f6

Add per-target summary table and target column to table formatter

19fbbb7

Add per-target summaries to markdown and Slack formatters

8737a44

Add cross-account and Organizations permissions to iam-policy output

68fbeaa

Add multi-account scanning docs to README

1f58d92

maksimov added 9 commits April 21, 2026 13:02

Add SpotHourlyCost field to GPUInstance model

41b0867

Implement EnrichSpotPrices with DescribeSpotPriceHistory

b081710

Wire EnrichSpotPrices into scanRegion after EC2 discovery

82ae997

Correct spot instance cost using live spot prices

8f4973e

Add ruleSpotEligible analysis rule for spot recommendations

6b7bfab

Add ec2:DescribeSpotPriceHistory to IAM policy output

4b8c1c6

Address review: update signal type comment, add pagination note, guard div-by-zero

1abda17
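The commit does not say where the division happens. One plausible spot is a spot-vs-on-demand savings percentage, so the sketch below guards that calculation. `spotSavingsPercent` and its parameters are hypothetical. In Go a float division by zero yields Inf or NaN rather than a panic, so the guard here is about keeping meaningless percentages out of the report.

```go
package main

import "fmt"

// spotSavingsPercent returns how much cheaper spot is than on-demand, as a
// percentage of the on-demand price. It reports false when either price is
// missing or non-positive, which avoids dividing by zero.
func spotSavingsPercent(onDemandHourly, spotHourly float64) (float64, bool) {
	if onDemandHourly <= 0 || spotHourly <= 0 {
		return 0, false
	}
	return (onDemandHourly - spotHourly) / onDemandHourly * 100, true
}

func main() {
	if pct, ok := spotSavingsPercent(4.0, 1.0); ok {
		fmt.Printf("spot saves %.0f%%\n", pct)
	}
	if _, ok := spotSavingsPercent(0, 1.0); !ok {
		fmt.Println("no on-demand price, skipping recommendation")
	}
}
```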

Sanitize example output in README

0c33990

Replace fleet-specific numbers with generic examples that don't expose real infrastructure details.

maksimov force-pushed the feature/k8s-gpu-metrics branch from 9f53f5c to 0c33990 on April 21, 2026 12:03

maksimov merged commit a18e50a into master Apr 21, 2026
2 checks passed

maksimov deleted the feature/k8s-gpu-metrics branch April 21, 2026 12:06

maksimov added a commit that referenced this pull request Apr 21, 2026

Merge pull request #26 from gpuaudit/feature/k8s-gpu-metrics

1b4a207

Add diff, multi-account, spot, K8s GPU metrics, Prometheus

maksimov merged 39 commits into master from feature/k8s-gpu-metrics
