Conversation
MuneebUllahKhan222
left a comment
Just need to address a couple of small changes.
ctx.Logger().Error(err, "Visit failed")
}
collector.Wait() // blocks until all requests finish
close(done)
It should be defer close(done), outside the goroutine.
Outside the goroutine? Any reason why?
It is generally good practice that the owner of a channel should be the one to close it.
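The ownership rule being discussed can be sketched as follows. This is a generic illustration, not the PR's code: produce and the int channel are made up for the example; the point is that the goroutine that sends on the channel is also the one that defers the close, so receivers only ever read.

```go
package main

import "fmt"

// produce owns the channel: it creates it, sends on it, and closes it
// with defer so that receivers' range loops terminate cleanly.
// Receivers must never close a channel they only read from.
func produce(n int) <-chan int {
	ch := make(chan int)
	go func() {
		defer close(ch) // the owner closes, even if a send panics
		for i := 0; i < n; i++ {
			ch <- i
		}
	}()
	return ch
}

func main() {
	// range exits automatically once the owner closes the channel.
	for v := range produce(3) {
		fmt.Println(v)
	}
}
```

Returning a receive-only `<-chan int` enforces the ownership split at compile time: callers cannot close or send on the channel even by accident.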
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 56e35d6.
MuneebUllahKhan222
left a comment
This PR has matured and looks solid now.
The only small thing that could be added is validation for negative values of delay and crawl depth (they do not affect the source, but it is better to add them).

Description:
Adds a new web source that crawls and scans websites for exposed secrets. The source uses the Colly framework to fetch pages starting from one or more seed URLs, with configurable crawl depth, per-domain request delay, and a per-URL timeout. Link following is opt-in via --crawl, robots.txt is respected by default, and linked JavaScript files are enqueued alongside HTML pages since they are a common location for hardcoded credentials. Each scanned page produces a chunk carrying the page title, URL, content type, crawl depth, and a UTC timestamp in the metadata.
Checklist:
- make test-community passing?
- make lint passing? (this requires golangci-lint)
Note
Medium Risk
Introduces new network-facing crawling logic and multiple new dependencies (Colly/HTML parsing), which can impact performance and operational behavior (timeouts, robots enforcement, domain scoping) despite being largely additive.
Overview
Adds a new web scan source and CLI command to crawl one or more seed URLs and scan fetched page contents for secrets, with configurable crawling (--crawl, --depth), rate limiting (--delay), an overall timeout, a custom User-Agent, and optional robots.txt bypass. Wires the new source through the engine (ScanWeb), introduces sources.WebConfig, expands the protobufs (sourcespb.SourceType_SOURCE_TYPE_WEB plus a sourcespb.Web config and source_metadatapb.Web metadata), and updates the docs/manpage; the implementation uses Colly, emits per-page chunks (including linked JS when crawling) with URL/title/content-type/depth/timestamp metadata, and adds a Prometheus metric plus a comprehensive test suite.
Reviewed by Cursor Bugbot for commit a1cd399.
ScanWeb), introducessources.WebConfig, expands protobufs (sourcespb.SourceType_SOURCE_TYPE_WEBplussourcespb.Webconfig andsource_metadatapb.Webmetadata), and updates docs/manpage; the implementation uses Colly, emits per-page chunks (including linked JS when crawling) with URL/title/content-type/depth/timestamp metadata and adds a Prometheus metric plus a comprehensive test suite.Reviewed by Cursor Bugbot for commit a1cd399. Bugbot is set up for automated code reviews on this repo. Configure here.