
Web Source #4848

Open
kashifkhan0771 wants to merge 22 commits into trufflesecurity:main from kashifkhan0771:feature/web-source

Conversation

Contributor

@kashifkhan0771 kashifkhan0771 commented Mar 30, 2026

Description:

Adds a new web source that crawls and scans websites for exposed secrets. The source uses the Colly framework to fetch pages starting from one or more seed URLs, with configurable crawl depth, per-domain request delay, and a per-URL timeout. Link following is opt-in via --crawl, robots.txt is respected by default, and linked JavaScript files are enqueued alongside HTML pages since they are a common location for hardcoded credentials. Each scanned page produces a chunk carrying the page title, URL, content type, crawl depth, and a UTC timestamp in the metadata.
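The traversal rules in the description (opt-in link following, a depth limit, and linked JavaScript files queued next to HTML pages) can be sketched without any network access. This is a stdlib-only illustration over an in-memory site map, not the PR's Colly-based implementation. The `page` type, the `crawlOrder` function, and the depth convention (seeds at depth 0, `maxDepth` counting link hops) are all assumptions made for the example and may differ from how the source or Colly counts depth.

```go
package main

import "fmt"

// page is a hypothetical in-memory stand-in for a fetched document.
type page struct {
	links   []string // targets of <a href> elements
	scripts []string // targets of <script src> elements
}

// crawlOrder returns the URLs that would be scanned, in breadth-first order.
// Assumption: seeds sit at depth 0 and maxDepth is the number of link hops
// allowed from a seed. When follow is false only the seeds are scanned.
func crawlOrder(site map[string]page, seeds []string, follow bool, maxDepth int) []string {
	type item struct {
		url   string
		depth int
	}
	seen := make(map[string]bool)
	var queue []item
	for _, s := range seeds {
		if !seen[s] {
			seen[s] = true
			queue = append(queue, item{url: s, depth: 0})
		}
	}

	var order []string
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]

		p, ok := site[cur.url]
		if !ok {
			continue // unknown URL: nothing to scan
		}
		order = append(order, cur.url)

		if !follow || cur.depth >= maxDepth {
			continue
		}
		// Queue linked pages first, then linked scripts, skipping duplicates.
		targets := append(append([]string{}, p.links...), p.scripts...)
		for _, next := range targets {
			if !seen[next] {
				seen[next] = true
				queue = append(queue, item{url: next, depth: cur.depth + 1})
			}
		}
	}
	return order
}

func main() {
	site := map[string]page{
		"https://example.com/": {
			links:   []string{"https://example.com/a"},
			scripts: []string{"https://example.com/app.js"},
		},
		"https://example.com/a":      {links: []string{"https://example.com/b"}},
		"https://example.com/b":      {},
		"https://example.com/app.js": {},
	}
	fmt.Println(crawlOrder(site, []string{"https://example.com/"}, true, 1))
}
```

The `seen` set keeps a page that is linked from several places from being scanned twice, which matters once link following is enabled.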

Checklist:

  • Tests passing (make test-community)?
  • Lint passing (make lint; this requires golangci-lint)?

Note

Medium Risk
Introduces new network-facing crawling logic and multiple new dependencies (Colly/HTML parsing), which can impact performance and operational behavior (timeouts, robots enforcement, domain scoping) despite being largely additive.

Overview
Adds a new web scan source and CLI command to crawl one or more seed URLs and scan fetched page contents for secrets, with configurable crawling (--crawl, --depth), rate limiting (--delay), overall timeout, custom User-Agent, and optional robots.txt bypass.

Wires the new source through the engine (ScanWeb), introduces sources.WebConfig, expands protobufs (sourcespb.SourceType_SOURCE_TYPE_WEB plus sourcespb.Web config and source_metadatapb.Web metadata), and updates docs/manpage; the implementation uses Colly, emits per-page chunks (including linked JS when crawling) with URL/title/content-type/depth/timestamp metadata and adds a Prometheus metric plus a comprehensive test suite.

Reviewed by Cursor Bugbot for commit a1cd399.

@kashifkhan0771 kashifkhan0771 requested a review from a team March 30, 2026 11:40
@kashifkhan0771 kashifkhan0771 requested review from a team as code owners March 30, 2026 11:40
Comment thread pkg/engine/web.go
Comment thread pkg/sources/web/web.go
Comment thread pkg/sources/web/web.go Outdated
Comment thread pkg/sources/web/web.go Outdated
Comment thread pkg/sources/web/web.go Outdated
Comment thread main.go
Comment thread pkg/sources/web/web.go
Comment thread pkg/sources/web/web.go
Contributor

@MuneebUllahKhan222 MuneebUllahKhan222 left a comment


Just need to address a couple of small changes.

Comment thread pkg/sources/web/web.go
Comment thread pkg/sources/web/web.go
Comment thread pkg/sources/web/web.go Outdated
ctx.Logger().Error(err, "Visit failed")
}
collector.Wait() // blocks until all requests finish
close(done)
Contributor

It should be defer close(done), outside the goroutine.

Contributor Author

Outside the goroutine? Any reason why?

Contributor

It is generally good practice for the owner of the channel to be the one to close it.
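One common way to express the ownership convention in Go is sketched below. The `startCrawl` function and its `visit` callback are hypothetical stand-ins, not code from the PR. Whether the close belongs inside or outside the goroutine in the actual source depends on surrounding code that is not shown in this thread. In this sketch the function that creates `done` hands callers a receive-only view, so only code written inside that function can close the channel, and the close is deferred so that it runs on every exit path.

```go
package main

import "fmt"

// startCrawl is a hypothetical stand-in for the crawl loop discussed above.
// It creates done and returns it as receive-only, so callers cannot close it.
func startCrawl(urls []string, visit func(string) error) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		// The deferred close runs on every exit path of this goroutine,
		// including an early return added later.
		defer close(done)
		for _, u := range urls {
			if err := visit(u); err != nil {
				fmt.Println("visit failed:", u, err)
			}
		}
	}()
	return done
}

func main() {
	visited := 0
	done := startCrawl(
		[]string{"https://example.com/a", "https://example.com/b"},
		func(string) error {
			visited++
			return nil
		},
	)
	// Receiving from done blocks until the goroutine has closed it. The close
	// happens after every write to visited, so reading it here is race-free.
	<-done
	fmt.Println("visited:", visited)
}
```

Returning `<-chan struct{}` rather than `chan struct{}` lets the compiler enforce the convention: a caller that tries to close the channel will not build.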

Comment thread pkg/sources/web/web.go
Comment thread pkg/sources/web/web.go
Comment thread pkg/sources/web/web.go
Comment thread pkg/sources/web/web.go
Comment thread pkg/sources/web/web.go
Comment thread pkg/sources/web/web.go

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 56e35d6.

Comment thread pkg/sources/web/web_test.go
Contributor

@MuneebUllahKhan222 MuneebUllahKhan222 left a comment


This PR has matured and looks solid now.
The only small thing that could be added is validation for negative values of delay and crawl (they do not affect the source, but it is better to add them).


2 participants