Conversation
MuneebUllahKhan222
left a comment
Just need to address a couple of small changes.
ctx.Logger().Error(err, "Visit failed")
}
collector.Wait() // blocks until all requests finish
close(done)
It should be defer close(done), outside the goroutine.
Outside the goroutine? Any reason why?
It is generally good practice that the owner of a channel should be the one to close it.
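The ownership rule being discussed can be sketched as follows. This is a generic illustration, not the PR's code: produce and the int channel are made up for the example; the point is that the goroutine that sends on the channel is also the one that defers the close, so receivers only ever read.

```go
package main

import "fmt"

// produce owns the channel: it creates it, sends on it, and closes it
// with defer so that receivers' range loops terminate cleanly.
// Receivers must never close a channel they only read from.
func produce(n int) <-chan int {
	ch := make(chan int)
	go func() {
		defer close(ch) // the owner closes, even if a send panics
		for i := 0; i < n; i++ {
			ch <- i
		}
	}()
	return ch
}

func main() {
	// range exits automatically once the owner closes the channel.
	for v := range produce(3) {
		fmt.Println(v)
	}
}
```

Returning a receive-only `<-chan int` enforces the ownership split at compile time: callers cannot close or send on the channel even by accident.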
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 56e35d6.
MuneebUllahKhan222
left a comment
This PR has matured and looks solid now.
The only small thing that could be added is validation for negative values of delay and crawl depth (they do not affect the source, but it is better to add them).

Description:
Adds a new web source that crawls and scans websites for exposed secrets. The source uses the Colly framework to fetch pages starting from one or more seed URLs, with configurable crawl depth, per-domain request delay, and a per-URL timeout. Link following is opt-in via --crawl, robots.txt is respected by default, and linked JavaScript files are enqueued alongside HTML pages since they are a common location for hardcoded credentials. Each scanned page produces a chunk carrying the page title, URL, content type, crawl depth, and a UTC timestamp in the metadata.
Checklist:
- make test-community passing?
- make lint passing? (this requires golangci-lint)
Note
Medium Risk
Introduces new network-facing crawling logic and multiple new dependencies (Colly/HTML parsing), which can impact performance and operational behavior (timeouts, robots enforcement, domain scoping) despite being largely additive.
Overview
Adds a new web scan source and CLI command to crawl one or more seed URLs and scan fetched page contents for secrets, with configurable crawling (--crawl, --depth), rate limiting (--delay), an overall timeout, a custom User-Agent, and optional robots.txt bypass. Wires the new source through the engine (ScanWeb), introduces sources.WebConfig, expands the protobufs (sourcespb.SourceType_SOURCE_TYPE_WEB plus a sourcespb.Web config and source_metadatapb.Web metadata), and updates the docs/manpage; the implementation uses Colly, emits per-page chunks (including linked JS when crawling) with URL/title/content-type/depth/timestamp metadata, and adds a Prometheus metric plus a comprehensive test suite.
Reviewed by Cursor Bugbot for commit a1cd399.
ScanWeb), introducessources.WebConfig, expands protobufs (sourcespb.SourceType_SOURCE_TYPE_WEBplussourcespb.Webconfig andsource_metadatapb.Webmetadata), and updates docs/manpage; the implementation uses Colly, emits per-page chunks (including linked JS when crawling) with URL/title/content-type/depth/timestamp metadata and adds a Prometheus metric plus a comprehensive test suite.Reviewed by Cursor Bugbot for commit a1cd399. Bugbot is set up for automated code reviews on this repo. Configure here.