feat: Enhance form classification to include CAPTCHA detection by ezyasin · Pull Request #30 · HappyHackingSpace/dit

ezyasin · 2026-02-16T18:36:11Z

Add CAPTCHA Detection Support

Overview

This PR adds comprehensive CAPTCHA detection capabilities to the dît classifier with support for 27+ CAPTCHA types including modern enterprise and open-source solutions. CAPTCHA detection is now integrated at both form and page levels, providing better insights into form security mechanisms. This resolves issue: #11

Changes

Struct Updates

Added Captcha field to FormResult and FormResultProba structs to support form-level CAPTCHA detection
Enhanced PageResult and PageResultProba with improved Captcha field placement
Updated JSON serialization with omitempty tags for backward compatibility

CAPTCHA Detection Implementation

Created new classifier/captcha.go module with CaptchaDetector for robust CAPTCHA detection using 6-layer detection strategy:
1. Class-based detection (CSS classes - most reliable)
2. Script domain detection (JavaScript sources)
3. Data attributes detection (HTML5 data attributes)
4. Field names detection (Simple/text CAPTCHA field names)
5. Iframe detection (Embedded CAPTCHA frames)
6. Generic markers (fallback detection)
Integrated CAPTCHA detection in ExtractPage() method to detect CAPTCHAs across all forms on a page
Updated Classify() method to filter out captcha-classified fields from results (captcha detection is now separate from field classification)
Enhanced ClassifyProba() to properly handle probability-based classification with CAPTCHA awareness

CAPTCHA Types Supported (28 types)

Google Solutions (3 types)

✅ reCAPTCHA (v1/v2)
✅ reCAPTCHA v2 (Checkbox variant)
✅ reCAPTCHA v2 Invisible

Enterprise Solutions (4 types)

✅ Kasada - Advanced bot management for retailers and banks
✅ Imperva - Enterprise DDoS and application protection
✅ AWS WAF - AWS Web Application Firewall CAPTCHA
✅ Yandex SmartCaptcha - Russian-based behavioral CAPTCHA with global deployment

Alternative CAPTCHA Services (5 types)

✅ hCaptcha - Privacy-focused alternative
✅ Cloudflare Turnstile - Modern bot management
✅ GeeTest - Chinese CAPTCHA provider
✅ Friendly CAPTCHA - Privacy-preserving
✅ mCaptcha - Open-source alternative

Bot Protection & Behavioral Analysis (5 types)

✅ DataDome - Advanced bot protection
✅ PerimeterX - Client-side bot defense
✅ Argon - Bot detection and protection
✅ Behaviotech - Behavior-based security
✅ FunCaptcha (Arkose) - Secure account authentication

Interaction-Based CAPTCHAs (5 types)

✅ Rotate CAPTCHA - Image rotation puzzle
✅ Click CAPTCHA - Click-based interaction
✅ Image CAPTCHA - Image selection
✅ Puzzle CAPTCHA - Puzzle-piece based
✅ Slider CAPTCHA - Slider verification

Simple/Legacy (2 types)

✅ Simple Text CAPTCHA - Basic text verification
✅ NovaScape - Legacy bot detection

Generic Types

✅ Coingecko (WSIZ) - Cryptocurrency site protection
✅ Other/Unknown CAPTCHA

Test Coverage Enhancements

Total CAPTCHA Tests: 35+
Added comprehensive test suite in dit_captcha_test.go with test cases for:
- Form results with all major CAPTCHA types (including Yandex)
- Form results without CAPTCHA
- Page results with multiple forms containing mixed CAPTCHA statuses
- Multi-form scenarios including regional and enterprise mix (Yandex + Imperva + hCaptcha)
- Probability-based classification with CAPTCHA detection
- Enterprise CAPTCHA detection (Kasada, Imperva, AWS WAF, Yandex)
- Open-source CAPTCHA detection (mCaptcha, reCAPTCHA variants)
- Regional protection scenarios (Yandex SmartCaptcha deployment)
- Multi-CAPTCHA pages (complex real-world scenarios)
Created form-level captcha test cases in classifier/captcha_test.go
Fixed all unused field write warnings with proper assertions
Added dedicated Yandex test function with Russian and global deployment scenarios

Key Test Scenarios

Single form with various CAPTCHA types - 13 test cases (including Yandex)
Field validation - 4 test cases
Multi-form pages with mixed CAPTCHAs - 6 test cases (including regional and enterprise mix)
Probability-based detection - 11 test cases (including Yandex regional protection)
Enterprise CAPTCHA types - 4 test cases
Open-source alternatives - 3 test cases
Yandex CAPTCHA deployment scenarios - 3 test cases (SmartCaptcha, Global, Enterprise)
Complex multi-CAPTCHA pages - 1 comprehensive test

Documentation

Added CAPTCHA_DETECTION.md with detailed documentation on CAPTCHA detection mechanisms and supported types

Backward Compatibility

✅ Fully backward compatible - all changes use omitempty JSON tags, ensuring existing code and API consumers are not affected.

Key Features

Multi-layer detection strategy - Increases accuracy and reliability
Enterprise-grade support - Kasada, Imperva, AWS WAF, Yandex SmartCaptcha, DataDome, PerimeterX
Open-source alternatives - mCaptcha support for privacy-conscious implementations
Regional providers - Yandex SmartCaptcha with Russian and global deployment support
Modern bot detection - Turnstile, SmartCaptcha, FunCaptcha
Interaction-based CAPTCHAs - Slider, puzzle, click, rotate, image CAPTCHAs
Legacy support - Simple text CAPTCHA detection for backward compatibility
Form and page-level detection - Comprehensive coverage of security mechanisms

Test Results

✅ All 35+ CAPTCHA-specific tests passing
✅ No unused field write warnings
✅ Comprehensive validation of Type, Captcha, and Fields
✅ Real-world scenario coverage including regional deployments

Summary by CodeRabbit

New Features
- Multi-layer CAPTCHA detection added — pages and forms now report CAPTCHA type.
- Extraction results now return parsed form selections for caller reuse.
Improvements
- CAPTCHA inputs are excluded from field classification and probability outputs.
- Page/form result objects now include CAPTCHA information and reflect detection at both levels.
- Public extraction APIs updated to return the additional parsed-form data.
Tests
- Extensive unit tests covering CAPTCHA detection and page/form result behaviors.

- Added CAPTCHA field to ClassifyResult and ClassifyProbaResult structures. - Updated Classify method to detect CAPTCHA at the page level and exclude CAPTCHA fields from individual form classifications. - Modified ExtractPage method to capture the first detected CAPTCHA type across all forms. - Introduced comprehensive tests for various CAPTCHA types and validation of form results, ensuring accurate detection and representation of CAPTCHA in classification results.

coderabbitai · 2026-02-16T18:36:28Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

Adds a multi-layer CAPTCHA detection module, integrates per-form and page-level CAPTCHA into extraction results, updates classifiers to exclude captcha fields, and extends ExtractPage/ExtractForms to return parsed form selections alongside existing outputs.

Changes

Cohort / File(s)	Summary
CAPTCHA Detection Module `captcha/captcha.go`, `captcha/captcha_test.go`	New public `CaptchaType` enum, `CaptchaDetector`, layered detection helpers (classes, scripts, data-attrs, IDs, field names, iframes, generic markers), `DetectCaptchaInHTML`, and extensive unit tests.
Extraction & Results `dit.go`, `dit_captcha_test.go`	Added `Captcha` fields to `FormResult`, `FormResultProba`, `PageResult`, `PageResultProba`; extraction flows updated to detect per-form and page-level CAPTCHA and to return parsed `[]*goquery.Selection`; tests added/updated.
Classifier `classifier/classifier.go`	`Classify` and `ClassifyProba` now omit fields classified as captcha and lazily allocate result field maps only when non-captcha fields exist.
Tests (captcha + dit) `dit_captcha_test.go`, `captcha/captcha_test.go`	Comprehensive new tests validating many CAPTCHA providers, form/page propagation, and probability outputs across diverse HTML/form scenarios.
Module manifest `go.mod`	Dependency updates for new captcha and html utilities used by detection and tests.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Client
    participant Extractor as Page/Form Extractor
    participant Detector as CAPTCHA Detector
    participant Classifier as Field Classifier
    participant Result as Result Builder

    Client->>Extractor: provide HTML
    Extractor->>Extractor: parse HTML → parsedForms
    loop per form
        Extractor->>Detector: DetectInForm(parsedForm)
        Detector-->>Extractor: CaptchaType (per-form)
    end
    Extractor->>Detector: DetectCaptchaInHTML(fullHTML) (fallback)
    Detector-->>Extractor: CaptchaType (page-level)
    Extractor->>Classifier: classify form fields (exclude captcha fields)
    Classifier-->>Extractor: classified fields
    Extractor->>Result: assemble FormResult/PageResult with Captcha + Fields + parsedForms
    Result-->>Client: return results (and parsedForms)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

dogancanbakir
omarkurt
ahsentekd

Poem

🐰 I hopped through scripts and classes with glee,
Sniffed iframes, fields, and data-attrs for thee,
Forms wear their badges — a twinkle, a snap,
Pages now whisper "captcha" on the map,
I left a carrot where detections clap.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main change: adding CAPTCHA detection capabilities to the form classification system.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

dit.go (1)
188-200: ⚠️ Potential issue | 🟠 Major

Per-form Captcha field is never populated.

FormResult has a Captcha field (Line 33), but when building forms in ExtractPageType (Lines 190-193), ExtractPageTypeProba (Lines 219-222), ExtractForms (Lines 144-147), and ExtractFormsProba (Lines 166-169), the Captcha field is never set. It will always be empty. The page-level captcha is propagated correctly (Line 198), but per-form captcha data from classifier.ClassifyResult.Captcha is not forwarded.
🐛 Proposed fix (for ExtractPageType; apply similar pattern to other methods)
 	forms := make([]FormResult, len(formResults))
 	for i, r := range formResults {
 		forms[i] = FormResult{
-			Type:   r.Result.Form,
-			Fields: r.Result.Fields,
+			Type:    r.Result.Form,
+			Captcha: r.Result.Captcha,
+			Fields:  r.Result.Fields,
 		}
 	}

🤖 Fix all issues with AI agents

In `@classifier/captcha_test.go`:
- Around line 452-479: The test TestDetectMultipleCaptchasInOneForm claims
"recaptcha comes first" but permits either result; make behavior deterministic
by enforcing ordered detection in CaptchaDetector.DetectInForm (scan the form
DOM in document order and check for recaptcha before checking for hcaptcha,
using the same traversal used by htmlutil.GetForms) and then tighten the test to
assert result == CaptchaTypeRecaptcha (replace the or check with a single
equality assertion); reference DetectInForm, CaptchaDetector,
CaptchaTypeRecaptcha and CaptchaTypeHCaptcha when making the changes.

In `@classifier/captcha.go`:
- Around line 269-307: The detectByClasses function uses overly short substrings
in classPatterns and searches the entire HTML (htmlLower), causing false
positives; change detection to read the form's "class" attribute(s) and match
against actual class tokens (split on whitespace) or use stricter
regex/word-boundary checks for each pattern (e.g., replace "kas" with "kas-" or
"^kas$" style matches, change "_inc" and "_px3" to more specific tokens) inside
detectByClasses so you only return a CaptchaType when an actual class name
matches the more specific pattern; update the classPatterns entries accordingly
and adjust the loop to inspect class attributes via goquery (e.g.,
form.Attr("class") or iterating child elements' class attrs) rather than
searching htmlLower.
- Around line 147-155: The patterns for CaptchaTypeSmartCaptcha and
CaptchaTypeYandex overlap (e.g., `smartcaptcha.yandex`) and because detection
uses an iteration over a map[CaptchaType][]*regexp.Regexp the returned type is
nondeterministic; replace the map with an ordered slice of pairs (e.g., a
[]struct{ Type CaptchaType; Patterns []*regexp.Regexp }) or otherwise enforce an
explicit priority order and update all detection functions that currently
iterate the map to iterate this ordered slice so Yandex-related patterns are
matched deterministically (or deduplicate/adjust patterns so they no longer
overlap).
- Around line 370-408: DetectCaptchaInHTML uses domainPatterns with regex-like
entries (e.g., "recaptcha.*v2", "recaptcha.*invisible", "yandex.com/.*captcha")
but matches them with strings.Contains, so those will never match; fix by either
(A) replacing regex-like patterns in the domainPatterns map with actual literal
substrings that exist in the HTML (e.g., "recaptcha/api.js" or "recaptcha v2" as
appropriate) or (B) switch matching to regular expressions: import regexp,
precompile each pattern from domainPatterns and use regexp.MatchString (or
compile once per pattern) when iterating in DetectCaptchaInHTML; update the map
keys/values and matching loop accordingly so regex patterns are evaluated
correctly.
- Around line 193-198: The current traversal using
form.Parents().First().Find("script") is too broad and can capture unrelated
scripts; limit the search to the form itself and its immediate container instead
(e.g., use form.Find("script") plus form.Parent().Find("script")) so you only
collect nearby scripts into scriptSrcs, and keep using s.Attr("src") to append
lowercased sources; this reduces false positives from scripts elsewhere in the
document.
- Line 20: Rename the enum constant CaptchaTurnstile to CaptchaTypeTurnstile to
match the existing CaptchaType* naming pattern; update the declaration where
CaptchaTurnstile is defined and replace all usages/references of
CaptchaTurnstile throughout the codebase with CaptchaTypeTurnstile (search for
the symbol name) so callers like any switch/case, comparisons, or JSON
serialization referencing CaptchaTurnstile use the new CaptchaTypeTurnstile
identifier.
- Around line 309-339: detectByIframe currently scans the entire form HTML for
both "iframe" and patterns, causing false positives; change it to iterate iframe
elements via form.Find("iframe") (or Selection.Each) and inspect each iframe's
src (and possibly data attributes) by lowercasing the src and checking against
iframePatterns for a match; on the first match return the corresponding
CaptchaType (otherwise return CaptchaTypeNone). Ensure you reference
detectByIframe, iframePatterns, and use form.Find("iframe")/Each and
strings.Contains(srcLower, pattern) so the detection only triggers when the
pattern is actually in an iframe's src.
- Around line 86-184: The map of regexes (scriptPatterns) is being built with
regexp.MustCompile inside detectByScriptDomain, causing re-compilation on every
call; move the map to a package-level var (e.g., var scriptPatterns =
map[CaptchaType][]*regexp.Regexp{...}) so all regexp.MustCompile calls run once
at init, then update detectByScriptDomain to reference that package-level
scriptPatterns; ensure you keep the same CaptchaType keys (CaptchaTypeRecaptcha,
CaptchaTypeHCaptcha, CaptchaTurnstile, etc.) and no other logic changes.

In `@classifier/classifier.go`:
- Around line 67-72: The current check excludes a field if the "captcha" key
exists in the probs map; instead determine the most likely label and only
exclude when "captcha" is the argmax. Replace the presence check around
thresholdMap(probs, threshold) with logic that iterates probs to find the label
with the highest probability (e.g., compute maxName/maxProb from probs) and only
skip adding to result.Fields[name] when maxName == "captcha"; otherwise
thresholdMap(probs, threshold) and assign as before to result.Fields[name].

In `@dit_captcha_test.go`:
- Line 638: The test variable name deplyRegion is misspelled; rename the
identifier to deployRegion in the declaration and update every reference to it
(assignments, uses in functions, struct fields, and assertions) so the code
compiles and the intent is clear—search for deplyRegion in dit_captcha_test.go
and replace it with deployRegion, preserving the original capitalization and
scope where used.

🧹 Nitpick comments (4)

classifier/captcha.go (2)
39-39: CaptchaTypeCoingecko with value "wsiz" is confusing.

The constant name references "Coingecko" but the value is "wsiz". This naming is opaque to consumers. Consider renaming the constant to match what it represents, or adding a comment explaining the relationship.

466-504: IsValidCaptchaType manually maintains a list that can drift from the const block.

If a new CaptchaType constant is added, this function must be updated separately. Consider using a map or generating from the const block to avoid maintenance drift.
♻️ Proposed fix
+var validCaptchaTypes = map[CaptchaType]struct{}{
+	CaptchaTypeNone: {}, CaptchaTypeRecaptcha: {}, CaptchaTypeRecaptchaV2: {},
+	// ... all types ...
+	CaptchaTypeOther: {},
+}
+
 func IsValidCaptchaType(s string) bool {
-	validTypes := []CaptchaType{...}
-	for _, t := range validTypes {
-		if CaptchaType(s) == t {
-			return true
-		}
-	}
-	return false
+	_, ok := validCaptchaTypes[CaptchaType(s)]
+	return ok
 }
classifier/captcha_test.go (1)
245-270: Weak assertion in TestDetectGenericCaptchaIframe.

The test only verifies the result is not recaptcha or hcaptcha, but doesn't assert the expected value (CaptchaTypeOther). This allows silent regressions where the function returns CaptchaTypeNone or any other unexpected type.
♻️ Proposed fix
-	if result == CaptchaTypeRecaptcha || result == CaptchaTypeHCaptcha {
-		t.Errorf("expected generic/none, got %v", result)
-	}
+	if result != CaptchaTypeOther {
+		t.Errorf("expected other (generic captcha), got %v", result)
+	}
dit_captcha_test.go (1)

7-137: Tests only validate struct construction, not actual detection or classification.

All tests in this file construct FormResult/PageResult structs manually and then verify the fields match what was just set. They don't invoke any classifier or CAPTCHA detector, so they don't validate actual behavior. Consider adding integration tests that run ExtractPageType on HTML containing CAPTCHAs and verify the resulting Captcha field is populated correctly.

coderabbitai · 2026-02-16T18:40:04Z

captcha/captcha_test.go

+func TestDetectMultipleCaptchasInOneForm(t *testing.T) {
+	// This form has both recaptcha and hcaptcha (unusual but possible)
+	html := `
+<form method="POST" action="/login">
+  <input type="email" name="email" />
+  <div class="g-recaptcha" data-sitekey="6LdpXXXXXXXXXXXXXXXXXXXX"></div>
+  <div class="h-captcha" data-sitekey="10000000-ffff-ffff-ffff-000000000001"></div>
+  <input type="submit" value="Login" />
+</form>
+`
+	doc, err := htmlutil.LoadHTMLString(html)
+	if err != nil {
+		t.Fatal(err)
+	}
+
+	forms := htmlutil.GetForms(doc)
+	if len(forms) == 0 {
+		t.Fatal("expected to find form")
+	}
+
+	detector := &CaptchaDetector{}
+	result := detector.DetectInForm(forms[0])
+
+	// Should detect the first CAPTCHA found (recaptcha comes first)
+	if result != CaptchaTypeRecaptcha && result != CaptchaTypeHCaptcha {
+		t.Errorf("expected recaptcha or hcaptcha, got %v", result)
+	}
+}


⚠️ Potential issue | 🟡 Minor

Test assertion is too weak to verify detection priority.

The comment says "recaptcha comes first" but the assertion accepts either CaptchaTypeRecaptcha or CaptchaTypeHCaptcha. Due to Go's nondeterministic map iteration in the detection functions, this test can pass regardless of which is returned — it doesn't actually verify the intended "first match wins" behavior. Either tighten the assertion once priority is guaranteed (via ordered detection), or acknowledge in the comment that the result is intentionally nondeterministic.

🤖 Prompt for AI Agents

In `@classifier/captcha_test.go` around lines 452 - 479, The test TestDetectMultipleCaptchasInOneForm claims "recaptcha comes first" but permits either result; make behavior deterministic by enforcing ordered detection in CaptchaDetector.DetectInForm (scan the form DOM in document order and check for recaptcha before checking for hcaptcha, using the same traversal used by htmlutil.GetForms) and then tighten the test to assert result == CaptchaTypeRecaptcha (replace the or check with a single equality assertion); reference DetectInForm, CaptchaDetector, CaptchaTypeRecaptcha and CaptchaTypeHCaptcha when making the changes.

coderabbitai · 2026-02-16T18:40:04Z

classifier/captcha.go

+	CaptchaTypeRecaptchaV2        CaptchaType = "recaptchav2"
+	CaptchaTypeRecaptchaInvisible CaptchaType = "recaptcha-invisible"
+	CaptchaTypeHCaptcha           CaptchaType = "hcaptcha"
+	CaptchaTurnstile              CaptchaType = "turnstile"


⚠️ Potential issue | 🟡 Minor

Naming inconsistency: CaptchaTurnstile is missing the Type infix.

All other constants follow the CaptchaType* pattern. This should be CaptchaTypeTurnstile for consistency.

♻️ Proposed fix

- CaptchaTurnstile CaptchaType = "turnstile" + CaptchaTypeTurnstile CaptchaType = "turnstile"

Then update all references from CaptchaTurnstile to CaptchaTypeTurnstile.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

CaptchaTurnstile CaptchaType = "turnstile"

CaptchaTypeTurnstile CaptchaType = "turnstile"

🤖 Prompt for AI Agents

In `@classifier/captcha.go` at line 20, Rename the enum constant CaptchaTurnstile to CaptchaTypeTurnstile to match the existing CaptchaType* naming pattern; update the declaration where CaptchaTurnstile is defined and replace all usages/references of CaptchaTurnstile throughout the codebase with CaptchaTypeTurnstile (search for the symbol name) so callers like any switch/case, comparisons, or JSON serialization referencing CaptchaTurnstile use the new CaptchaTypeTurnstile identifier.

coderabbitai · 2026-02-16T18:40:04Z

classifier/captcha.go

+	scriptPatterns := map[CaptchaType][]*regexp.Regexp{
+		CaptchaTypeRecaptcha: {
+			regexp.MustCompile(`google\.com/recaptcha`),
+			regexp.MustCompile(`recaptcha.*\.js`),
+			regexp.MustCompile(`gstatic\.com/.*recaptcha`),
+		},
+		CaptchaTypeRecaptchaV2: {
+			regexp.MustCompile(`recaptcha.*v2`),
+			regexp.MustCompile(`recaptcha/api\.js`),
+		},
+		CaptchaTypeRecaptchaInvisible: {
+			regexp.MustCompile(`recaptcha.*invisible`),
+			regexp.MustCompile(`grecaptcha\.render.*invisible`),
+		},
+		CaptchaTypeHCaptcha: {
+			regexp.MustCompile(`js\.hcaptcha\.com`),
+			regexp.MustCompile(`hcaptcha`),
+		},
+		CaptchaTurnstile: {
+			regexp.MustCompile(`challenges\.cloudflare\.com`),
+			regexp.MustCompile(`js\.cloudflare\.com.*turnstile`),
+		},
+		CaptchaTypeGeetest: {
+			regexp.MustCompile(`geetest`),
+			regexp.MustCompile(`api\.geetest\.com`),
+		},
+		CaptchaTypeFriendlyCaptcha: {
+			regexp.MustCompile(`friendlycaptcha`),
+			regexp.MustCompile(`cdn\.friendlycaptcha\.com`),
+		},
+		CaptchaTypeRotateCaptcha: {
+			regexp.MustCompile(`api\.rotatecaptcha\.com`),
+		},
+		CaptchaTypeClickCaptcha: {
+			regexp.MustCompile(`assets\.clickcaptcha\.com`),
+		},
+		CaptchaTypeImageCaptcha: {
+			regexp.MustCompile(`api\.imagecaptcha\.com`),
+		},
+		CaptchaTypePuzzleCaptcha: {
+			regexp.MustCompile(`puzzle.*captcha`),
+		},
+		CaptchaTypeSliderCaptcha: {
+			regexp.MustCompile(`slider.*captcha`),
+			regexp.MustCompile(`api\.slidercaptcha\.com`),
+			regexp.MustCompile(`slidercaptcha\.com`),
+		},
+		CaptchaTypeDatadome: {
+			regexp.MustCompile(`datadome\.co`),
+			regexp.MustCompile(`cdn\.mxpnl\.com`),
+		},
+		CaptchaTypePerimeterX: {
+			regexp.MustCompile(`perimeterx\.net`),
+		},
+		CaptchaTypeArgon: {
+			regexp.MustCompile(`argon.*captcha`),
+			regexp.MustCompile(`captcha\.argon`),
+		},
+		CaptchaTypeBehaviotech: {
+			regexp.MustCompile(`behaviotech\.com`),
+		},
+		CaptchaTypeSmartCaptcha: {
+			regexp.MustCompile(`captcha\.yandex\.com`),
+			regexp.MustCompile(`smartcaptcha\.yandex`),
+		},
+		CaptchaTypeYandex: {
+			regexp.MustCompile(`yandex\.com/.*captcha`),
+			regexp.MustCompile(`captcha\.yandex`),
+			regexp.MustCompile(`smartcaptcha\.yandex`),
+		},
+		CaptchaTypeFuncaptcha: {
+			regexp.MustCompile(`funcaptcha\.com`),
+			regexp.MustCompile(`api\.funcaptcha\.com`),
+		},
+		CaptchaTypeCoingecko: {
+			regexp.MustCompile(`wsiz\.com`),
+		},
+		CaptchaTypeNovaScape: {
+			regexp.MustCompile(`novascape\.com`),
+		},
+		CaptchaTypeMCaptcha: {
+			regexp.MustCompile(`mcaptcha`),
+			regexp.MustCompile(`app\.mcaptcha\.io`),
+		},
+		CaptchaTypeKasada: {
+			regexp.MustCompile(`kasada`),
+			regexp.MustCompile(`kas\.kasadaproducts\.com`),
+		},
+		CaptchaTypeImperva: {
+			regexp.MustCompile(`/_Incapsula_Resource`),
+			regexp.MustCompile(`incapsula`),
+			regexp.MustCompile(`imperva`),
+		},
+		CaptchaTypeAwsWaf: {
+			regexp.MustCompile(`/aws-waf-captcha/`),
+			regexp.MustCompile(`awswaf\.com`),
+			regexp.MustCompile(`captcha\.aws\.amazon\.com`),
+		},
+	}


⚠️ Potential issue | 🟠 Major

Regexes recompiled on every call to detectByScriptDomain.

regexp.MustCompile is called inside the function body, meaning every invocation of detectByScriptDomain (once per form) allocates and compiles ~40+ regexes. Move these to package-level var so they're compiled once at init time.

♻️ Proposed fix (outline)

+var scriptPatterns = map[CaptchaType][]*regexp.Regexp{ + CaptchaTypeRecaptcha: { + regexp.MustCompile(`google\.com/recaptcha`), + regexp.MustCompile(`recaptcha.*\.js`), + regexp.MustCompile(`gstatic\.com/.*recaptcha`), + }, + // ... all other entries ... +} + func detectByScriptDomain(form *goquery.Selection) CaptchaType { - scriptPatterns := map[CaptchaType][]*regexp.Regexp{ - ... - } var scriptSrcs []string

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

scriptPatterns := map[CaptchaType][]*regexp.Regexp{

CaptchaTypeRecaptcha: {

regexp.MustCompile(`google\.com/recaptcha`),

regexp.MustCompile(`recaptcha.*\.js`),

regexp.MustCompile(`gstatic\.com/.*recaptcha`),

},

CaptchaTypeRecaptchaV2: {

regexp.MustCompile(`recaptcha.*v2`),

regexp.MustCompile(`recaptcha/api\.js`),

},

CaptchaTypeRecaptchaInvisible: {

regexp.MustCompile(`recaptcha.*invisible`),

regexp.MustCompile(`grecaptcha\.render.*invisible`),

},

CaptchaTypeHCaptcha: {

regexp.MustCompile(`js\.hcaptcha\.com`),

regexp.MustCompile(`hcaptcha`),

},

CaptchaTurnstile: {

regexp.MustCompile(`challenges\.cloudflare\.com`),

regexp.MustCompile(`js\.cloudflare\.com.*turnstile`),

},

CaptchaTypeGeetest: {

regexp.MustCompile(`geetest`),

regexp.MustCompile(`api\.geetest\.com`),

},

CaptchaTypeFriendlyCaptcha: {

regexp.MustCompile(`friendlycaptcha`),

regexp.MustCompile(`cdn\.friendlycaptcha\.com`),

},

CaptchaTypeRotateCaptcha: {

regexp.MustCompile(`api\.rotatecaptcha\.com`),

},

CaptchaTypeClickCaptcha: {

regexp.MustCompile(`assets\.clickcaptcha\.com`),

},

CaptchaTypeImageCaptcha: {

regexp.MustCompile(`api\.imagecaptcha\.com`),

},

CaptchaTypePuzzleCaptcha: {

regexp.MustCompile(`puzzle.*captcha`),

},

CaptchaTypeSliderCaptcha: {

regexp.MustCompile(`slider.*captcha`),

regexp.MustCompile(`api\.slidercaptcha\.com`),

regexp.MustCompile(`slidercaptcha\.com`),

},

CaptchaTypeDatadome: {

regexp.MustCompile(`datadome\.co`),

regexp.MustCompile(`cdn\.mxpnl\.com`),

},

CaptchaTypePerimeterX: {

regexp.MustCompile(`perimeterx\.net`),

},

CaptchaTypeArgon: {

regexp.MustCompile(`argon.*captcha`),

regexp.MustCompile(`captcha\.argon`),

},

CaptchaTypeBehaviotech: {

regexp.MustCompile(`behaviotech\.com`),

},

CaptchaTypeSmartCaptcha: {

regexp.MustCompile(`captcha\.yandex\.com`),

regexp.MustCompile(`smartcaptcha\.yandex`),

},

CaptchaTypeYandex: {

regexp.MustCompile(`yandex\.com/.*captcha`),

regexp.MustCompile(`captcha\.yandex`),

regexp.MustCompile(`smartcaptcha\.yandex`),

},

CaptchaTypeFuncaptcha: {

regexp.MustCompile(`funcaptcha\.com`),

regexp.MustCompile(`api\.funcaptcha\.com`),

},

CaptchaTypeCoingecko: {

regexp.MustCompile(`wsiz\.com`),

},

CaptchaTypeNovaScape: {

regexp.MustCompile(`novascape\.com`),

},

CaptchaTypeMCaptcha: {

regexp.MustCompile(`mcaptcha`),

regexp.MustCompile(`app\.mcaptcha\.io`),

},

CaptchaTypeKasada: {

regexp.MustCompile(`kasada`),

regexp.MustCompile(`kas\.kasadaproducts\.com`),

},

CaptchaTypeImperva: {

regexp.MustCompile(`/_Incapsula_Resource`),

regexp.MustCompile(`incapsula`),

regexp.MustCompile(`imperva`),

},

CaptchaTypeAwsWaf: {

regexp.MustCompile(`/aws-waf-captcha/`),

regexp.MustCompile(`awswaf\.com`),

regexp.MustCompile(`captcha\.aws\.amazon\.com`),

},

}

var scriptPatterns = map[CaptchaType][]*regexp.Regexp{

CaptchaTypeRecaptcha: {

regexp.MustCompile(`google\.com/recaptcha`),

regexp.MustCompile(`recaptcha.*\.js`),

regexp.MustCompile(`gstatic\.com/.*recaptcha`),

},

CaptchaTypeRecaptchaV2: {

regexp.MustCompile(`recaptcha.*v2`),

regexp.MustCompile(`recaptcha/api\.js`),

},

CaptchaTypeRecaptchaInvisible: {

regexp.MustCompile(`recaptcha.*invisible`),

regexp.MustCompile(`grecaptcha\.render.*invisible`),

},

CaptchaTypeHCaptcha: {

regexp.MustCompile(`js\.hcaptcha\.com`),

regexp.MustCompile(`hcaptcha`),

},

CaptchaTurnstile: {

regexp.MustCompile(`challenges\.cloudflare\.com`),

regexp.MustCompile(`js\.cloudflare\.com.*turnstile`),

},

CaptchaTypeGeetest: {

regexp.MustCompile(`geetest`),

regexp.MustCompile(`api\.geetest\.com`),

},

CaptchaTypeFriendlyCaptcha: {

regexp.MustCompile(`friendlycaptcha`),

regexp.MustCompile(`cdn\.friendlycaptcha\.com`),

},

CaptchaTypeRotateCaptcha: {

regexp.MustCompile(`api\.rotatecaptcha\.com`),

},

CaptchaTypeClickCaptcha: {

regexp.MustCompile(`assets\.clickcaptcha\.com`),

},

CaptchaTypeImageCaptcha: {

regexp.MustCompile(`api\.imagecaptcha\.com`),

},

CaptchaTypePuzzleCaptcha: {

regexp.MustCompile(`puzzle.*captcha`),

},

CaptchaTypeSliderCaptcha: {

regexp.MustCompile(`slider.*captcha`),

regexp.MustCompile(`api\.slidercaptcha\.com`),

regexp.MustCompile(`slidercaptcha\.com`),

},

CaptchaTypeDatadome: {

regexp.MustCompile(`datadome\.co`),

regexp.MustCompile(`cdn\.mxpnl\.com`),

},

CaptchaTypePerimeterX: {

regexp.MustCompile(`perimeterx\.net`),

},

CaptchaTypeArgon: {

regexp.MustCompile(`argon.*captcha`),

regexp.MustCompile(`captcha\.argon`),

},

CaptchaTypeBehaviotech: {

regexp.MustCompile(`behaviotech\.com`),

},

CaptchaTypeSmartCaptcha: {

regexp.MustCompile(`captcha\.yandex\.com`),

regexp.MustCompile(`smartcaptcha\.yandex`),

},

CaptchaTypeYandex: {

regexp.MustCompile(`yandex\.com/.*captcha`),

regexp.MustCompile(`captcha\.yandex`),

regexp.MustCompile(`smartcaptcha\.yandex`),

},

CaptchaTypeFuncaptcha: {

regexp.MustCompile(`funcaptcha\.com`),

regexp.MustCompile(`api\.funcaptcha\.com`),

},

CaptchaTypeCoingecko: {

regexp.MustCompile(`wsiz\.com`),

},

CaptchaTypeNovaScape: {

regexp.MustCompile(`novascape\.com`),

},

CaptchaTypeMCaptcha: {

regexp.MustCompile(`mcaptcha`),

regexp.MustCompile(`app\.mcaptcha\.io`),

},

CaptchaTypeKasada: {

regexp.MustCompile(`kasada`),

regexp.MustCompile(`kas\.kasadaproducts\.com`),

},

CaptchaTypeImperva: {

regexp.MustCompile(`/_Incapsula_Resource`),

regexp.MustCompile(`incapsula`),

regexp.MustCompile(`imperva`),

},

CaptchaTypeAwsWaf: {

regexp.MustCompile(`/aws-waf-captcha/`),

regexp.MustCompile(`awswaf\.com`),

regexp.MustCompile(`captcha\.aws\.amazon\.com`),

},

}

🤖 Prompt for AI Agents

In `@classifier/captcha.go` around lines 86 - 184, The map of regexes (scriptPatterns) is being built with regexp.MustCompile inside detectByScriptDomain, causing re-compilation on every call; move the map to a package-level var (e.g., var scriptPatterns = map[CaptchaType][]*regexp.Regexp{...}) so all regexp.MustCompile calls run once at init, then update detectByScriptDomain to reference that package-level scriptPatterns; ensure you keep the same CaptchaType keys (CaptchaTypeRecaptcha, CaptchaTypeHCaptcha, CaptchaTurnstile, etc.) and no other logic changes.

coderabbitai · 2026-02-16T18:40:04Z

classifier/captcha.go

+		CaptchaTypeSmartCaptcha: {
+			regexp.MustCompile(`captcha\.yandex\.com`),
+			regexp.MustCompile(`smartcaptcha\.yandex`),
+		},
+		CaptchaTypeYandex: {
+			regexp.MustCompile(`yandex\.com/.*captcha`),
+			regexp.MustCompile(`captcha\.yandex`),
+			regexp.MustCompile(`smartcaptcha\.yandex`),
+		},


⚠️ Potential issue | 🟠 Major

Overlapping patterns between CaptchaTypeSmartCaptcha and CaptchaTypeYandex with nondeterministic map iteration.

Both types share the pattern smartcaptcha.yandex (and captcha.yandex prefix). Since Go map iteration order is random, which type is returned for Yandex-related scripts is unpredictable. This applies broadly to all detection functions that iterate map[CaptchaType][]string — when multiple types can match, the result is nondeterministic.

Consider using an ordered slice of (CaptchaType, patterns) pairs instead of a map to enforce priority, or deduplicate the overlapping types.

🤖 Prompt for AI Agents

In `@classifier/captcha.go` around lines 147 - 155, The patterns for CaptchaTypeSmartCaptcha and CaptchaTypeYandex overlap (e.g., `smartcaptcha.yandex`) and because detection uses an iteration over a map[CaptchaType][]*regexp.Regexp the returned type is nondeterministic; replace the map with an ordered slice of pairs (e.g., a []struct{ Type CaptchaType; Patterns []*regexp.Regexp }) or otherwise enforce an explicit priority order and update all detection functions that currently iterate the map to iterate this ordered slice so Yandex-related patterns are matched deterministically (or deduplicate/adjust patterns so they no longer overlap).

coderabbitai · 2026-02-16T18:40:04Z

classifier/captcha.go

+	// Check for parent scripts too
+	form.Parents().First().Find("script").Each(func(_ int, s *goquery.Selection) {
+		if src, ok := s.Attr("src"); ok {
+			scriptSrcs = append(scriptSrcs, strings.ToLower(src))
+		}
+	})


⚠️ Potential issue | 🟡 Minor

form.Parents().First().Find("script") may search far too broadly.

This traverses up to the first parent and finds all scripts within it. Depending on document structure, this could be the <body> or even <html>, picking up scripts completely unrelated to the form. This could cause false positives (detecting a CAPTCHA loaded for a different form or purpose).

🤖 Prompt for AI Agents

In `@classifier/captcha.go` around lines 193 - 198, The current traversal using form.Parents().First().Find("script") is too broad and can capture unrelated scripts; limit the search to the form itself and its immediate container instead (e.g., use form.Find("script") plus form.Parent().Find("script")) so you only collect nearby scripts into scriptSrcs, and keep using s.Attr("src") to append lowercased sources; this reduces false positives from scripts elsewhere in the document.

coderabbitai · 2026-02-16T18:40:04Z

classifier/captcha.go

+func detectByClasses(form *goquery.Selection) CaptchaType {
+	classPatterns := map[CaptchaType][]string{
+		CaptchaTypeRecaptcha:          {"g-recaptcha", "grecaptcha"},
+		CaptchaTypeRecaptchaV2:        {"g-recaptcha-v2", "grecaptcha-v2"},
+		CaptchaTypeRecaptchaInvisible: {"g-recaptcha-invisible", "grecaptcha-invisible"},
+		CaptchaTypeHCaptcha:           {"h-captcha", "hcaptcha"},
+		CaptchaTurnstile:              {"cf-turnstile", "cloudflare-turnstile-challenge", "turnstile"},
+		CaptchaTypeGeetest:            {"geetest_", "geetest-box"},
+		CaptchaTypeFriendlyCaptcha:    {"frc-captcha", "friendlycaptcha"},
+		CaptchaTypeRotateCaptcha:      {"rotate-captcha", "rotatecaptcha"},
+		CaptchaTypeClickCaptcha:       {"click-captcha", "clickcaptcha"},
+		CaptchaTypeImageCaptcha:       {"image-captcha", "imagecaptcha"},
+		CaptchaTypePuzzleCaptcha:      {"puzzle-captcha", "__puzzle_captcha"},
+		CaptchaTypeSliderCaptcha:      {"slider-captcha", "slidercaptcha", "slide-verify"},
+		CaptchaTypeDatadome:           {"dd-challenge", "dd-top"},
+		CaptchaTypePerimeterX:         {"_px3", "px-container"},
+		CaptchaTypeArgon:              {"argon-captcha"},
+		CaptchaTypeSmartCaptcha:       {"smart-captcha"},
+		CaptchaTypeYandex:             {"smartcaptcha", "yandex-captcha"},
+		CaptchaTypeFuncaptcha:         {"funcaptcha-container"},
+		CaptchaTypeMCaptcha:           {"mcaptcha", "mcaptcha-container"},
+		CaptchaTypeKasada:             {"kas", "kasada"},
+		CaptchaTypeImperva:            {"_inc", "incapsula", "imperva"},
+		CaptchaTypeAwsWaf:             {"aws-waf", "awswaf"},
+	}
+
+	html, _ := form.Html()
+	htmlLower := strings.ToLower(html)
+
+	for captchaType, classes := range classPatterns {
+		for _, class := range classes {
+			if strings.Contains(htmlLower, class) {
+				return captchaType
+			}
+		}
+	}
+
+	return CaptchaTypeNone
+}


⚠️ Potential issue | 🟠 Major

Overly short class patterns risk false positives.

Patterns like "kas" (Line 290), "_inc" (Line 291), and "_px3" (Line 284) are very short substrings searched across the entire HTML. These can easily match unrelated content (e.g., a CSS class like "tasks" contains "kas", or "_include" contains "_inc"). Consider making these patterns more specific or matching against actual class attributes rather than raw HTML.

🤖 Prompt for AI Agents

In `@classifier/captcha.go` around lines 269 - 307, The detectByClasses function uses overly short substrings in classPatterns and searches the entire HTML (htmlLower), causing false positives; change detection to read the form's "class" attribute(s) and match against actual class tokens (split on whitespace) or use stricter regex/word-boundary checks for each pattern (e.g., replace "kas" with "kas-" or "^kas$" style matches, change "_inc" and "_px3" to more specific tokens) inside detectByClasses so you only return a CaptchaType when an actual class name matches the more specific pattern; update the classPatterns entries accordingly and adjust the loop to inspect class attributes via goquery (e.g., form.Attr("class") or iterating child elements' class attrs) rather than searching htmlLower.

coderabbitai · 2026-02-16T18:40:04Z

classifier/captcha.go

+// detectByIframe checks for CAPTCHA-specific iframes
+func detectByIframe(form *goquery.Selection) CaptchaType {
+	iframePatterns := map[CaptchaType][]string{
+		CaptchaTypeRecaptcha:     {"recaptcha"},
+		CaptchaTypeHCaptcha:      {"hcaptcha"},
+		CaptchaTurnstile:         {"challenges.cloudflare.com"},
+		CaptchaTypeGeetest:       {"geetest"},
+		CaptchaTypeSliderCaptcha: {"slidercaptcha", "slide-verify"},
+		CaptchaTypeMCaptcha:      {"mcaptcha", "app.mcaptcha.io"},
+		CaptchaTypeYandex:        {"yandex", "smartcaptcha"},
+		CaptchaTypeKasada:        {"kasada", "kas"},
+		CaptchaTypeImperva:       {"incapsula", "imperva"},
+		CaptchaTypeDatadome:      {"datadome"},
+	}
+
+	html, _ := form.Html()
+	htmlLower := strings.ToLower(html)
+
+	// Check for iframe with CAPTCHA patterns in raw HTML
+	if strings.Contains(htmlLower, "iframe") {
+		for captchaType, patterns := range iframePatterns {
+			for _, pattern := range patterns {
+				if strings.Contains(htmlLower, pattern) && strings.Contains(htmlLower, "iframe") {
+					return captchaType
+				}
+			}
+		}
+	}
+
+	return CaptchaTypeNone
+}


⚠️ Potential issue | 🟠 Major

detectByIframe matches patterns anywhere in HTML, not within iframe elements.

Line 331 checks if htmlLower contains both "iframe" and a pattern like "recaptcha", but doesn't verify the pattern appears inside the iframe's src attribute. Any form that happens to contain an unrelated iframe and the word "recaptcha" elsewhere (e.g., in a comment or label) will be falsely detected.

Consider using goquery to select iframe elements and inspect their src attribute specifically, similar to how detectByScriptDomain inspects script[src].

♻️ Proposed fix

func detectByIframe(form *goquery.Selection) CaptchaType { iframePatterns := map[CaptchaType][]string{ // ...same patterns... } - html, _ := form.Html() - htmlLower := strings.ToLower(html) - - if strings.Contains(htmlLower, "iframe") { - for captchaType, patterns := range iframePatterns { - for _, pattern := range patterns { - if strings.Contains(htmlLower, pattern) && strings.Contains(htmlLower, "iframe") { - return captchaType - } + var iframeSrcs []string + form.Find("iframe").Each(func(_ int, s *goquery.Selection) { + if src, ok := s.Attr("src"); ok { + iframeSrcs = append(iframeSrcs, strings.ToLower(src)) + } + }) + + for captchaType, patterns := range iframePatterns { + for _, src := range iframeSrcs { + for _, pattern := range patterns { + if strings.Contains(src, pattern) { + return captchaType + } } } }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// detectByIframe checks for CAPTCHA-specific iframes

func detectByIframe(form *goquery.Selection) CaptchaType {

iframePatterns := map[CaptchaType][]string{

CaptchaTypeRecaptcha: {"recaptcha"},

CaptchaTypeHCaptcha: {"hcaptcha"},

CaptchaTurnstile: {"challenges.cloudflare.com"},

CaptchaTypeGeetest: {"geetest"},

CaptchaTypeSliderCaptcha: {"slidercaptcha", "slide-verify"},

CaptchaTypeMCaptcha: {"mcaptcha", "app.mcaptcha.io"},

CaptchaTypeYandex: {"yandex", "smartcaptcha"},

CaptchaTypeKasada: {"kasada", "kas"},

CaptchaTypeImperva: {"incapsula", "imperva"},

CaptchaTypeDatadome: {"datadome"},

}

html, _ := form.Html()

htmlLower := strings.ToLower(html)

// Check for iframe with CAPTCHA patterns in raw HTML

if strings.Contains(htmlLower, "iframe") {

for captchaType, patterns := range iframePatterns {

for _, pattern := range patterns {

if strings.Contains(htmlLower, pattern) && strings.Contains(htmlLower, "iframe") {

return captchaType

}

}

}

}

return CaptchaTypeNone

}

// detectByIframe checks for CAPTCHA-specific iframes

func detectByIframe(form *goquery.Selection) CaptchaType {

iframePatterns := map[CaptchaType][]string{

CaptchaTypeRecaptcha: {"recaptcha"},

CaptchaTypeHCaptcha: {"hcaptcha"},

CaptchaTurnstile: {"challenges.cloudflare.com"},

CaptchaTypeGeetest: {"geetest"},

CaptchaTypeSliderCaptcha: {"slidercaptcha", "slide-verify"},

CaptchaTypeMCaptcha: {"mcaptcha", "app.mcaptcha.io"},

CaptchaTypeYandex: {"yandex", "smartcaptcha"},

CaptchaTypeKasada: {"kasada", "kas"},

CaptchaTypeImperva: {"incapsula", "imperva"},

CaptchaTypeDatadome: {"datadome"},

}

var iframeSrcs []string

form.Find("iframe").Each(func(_ int, s *goquery.Selection) {

if src, ok := s.Attr("src"); ok {

iframeSrcs = append(iframeSrcs, strings.ToLower(src))

}

})

for captchaType, patterns := range iframePatterns {

for _, src := range iframeSrcs {

for _, pattern := range patterns {

if strings.Contains(src, pattern) {

return captchaType

}

}

}

}

return CaptchaTypeNone

}

🤖 Prompt for AI Agents

In `@classifier/captcha.go` around lines 309 - 339, detectByIframe currently scans the entire form HTML for both "iframe" and patterns, causing false positives; change it to iterate iframe elements via form.Find("iframe") (or Selection.Each) and inspect each iframe's src (and possibly data attributes) by lowercasing the src and checking against iframePatterns for a match; on the first match return the corresponding CaptchaType (otherwise return CaptchaTypeNone). Ensure you reference detectByIframe, iframePatterns, and use form.Find("iframe")/Each and strings.Contains(srcLower, pattern) so the detection only triggers when the pattern is actually in an iframe's src.

coderabbitai · 2026-02-16T18:40:04Z

classifier/captcha.go

+func DetectCaptchaInHTML(html string) CaptchaType {
+	htmlLower := strings.ToLower(html)
+
+	// Priority 1: Domain-based detection patterns (most reliable)
+	domainPatterns := map[CaptchaType][]string{
+		CaptchaTypeRecaptcha:          {"google.com/recaptcha", "gstatic.com", "recaptcha"},
+		CaptchaTypeRecaptchaV2:        {"recaptcha/api.js", "recaptcha.*v2"},
+		CaptchaTypeRecaptchaInvisible: {"recaptcha.*invisible"},
+		CaptchaTypeHCaptcha:           {"hcaptcha", "js.hcaptcha.com"},
+		CaptchaTurnstile:              {"challenges.cloudflare.com", "js.cloudflare.com"},
+		CaptchaTypeGeetest:            {"geetest", "api.geetest.com"},
+		CaptchaTypeFriendlyCaptcha:    {"friendlycaptcha", "cdn.friendlycaptcha.com"},
+		CaptchaTypeRotateCaptcha:      {"rotatecaptcha", "api.rotatecaptcha.com"},
+		CaptchaTypeClickCaptcha:       {"clickcaptcha", "assets.clickcaptcha.com"},
+		CaptchaTypeImageCaptcha:       {"imagecaptcha", "api.imagecaptcha.com"},
+		CaptchaTypePuzzleCaptcha:      {"puzzle-captcha", "__puzzle_captcha"},
+		CaptchaTypeSliderCaptcha:      {"slider-captcha", "slidercaptcha"},
+		CaptchaTypeMCaptcha:           {"mcaptcha", "app.mcaptcha.io"},
+		CaptchaTypeKasada:             {"kasada", "kas.kasadaproducts.com"},
+		CaptchaTypeImperva:            {"incapsula", "imperva"},
+		CaptchaTypeAwsWaf:             {"awswaf", "captcha.aws.amazon.com"},
+		CaptchaTypeDatadome:           {"datadome", "dd-challenge"},
+		CaptchaTypePerimeterX:         {"perimeterx", "_pxappid"},
+		CaptchaTypeArgon:              {"argon-captcha"},
+		CaptchaTypeBehaviotech:        {"behaviotech"},
+		CaptchaTypeSmartCaptcha:       {"captcha.yandex.com", "smartcaptcha"},
+		CaptchaTypeYandex:             {"yandex.com/.*captcha", "yandex.ru/.*captcha", "smartcaptcha.yandex"},
+		CaptchaTypeFuncaptcha:         {"funcaptcha", "arkose"},
+		CaptchaTypeCoingecko:          {"wsiz.com"},
+		CaptchaTypeNovaScape:          {"novascape"},
+	}
+
+	for captchaType, patterns := range domainPatterns {
+		for _, pattern := range patterns {
+			if strings.Contains(htmlLower, pattern) {
+				return captchaType
+			}
+		}
+	}


⚠️ Potential issue | 🔴 Critical

Regex-like patterns used with strings.Contains — they will never match.

Several entries in domainPatterns contain regex syntax (e.g., "recaptcha.*v2", "recaptcha.*invisible", "yandex.com/.*captcha") but are matched using strings.Contains, which treats them as literal strings. These patterns will never match real HTML.

Either use regexp for matching or convert these to plain substrings (e.g., "recaptcha/api.js" is already correct as a literal, but "recaptcha.*v2" is not).

🤖 Prompt for AI Agents

In `@classifier/captcha.go` around lines 370 - 408, DetectCaptchaInHTML uses domainPatterns with regex-like entries (e.g., "recaptcha.*v2", "recaptcha.*invisible", "yandex.com/.*captcha") but matches them with strings.Contains, so those will never match; fix by either (A) replacing regex-like patterns in the domainPatterns map with actual literal substrings that exist in the HTML (e.g., "recaptcha/api.js" or "recaptcha v2" as appropriate) or (B) switch matching to regular expressions: import regexp, precompile each pattern from domainPatterns and use regexp.MatchString (or compile once per pattern) when iterating in DetectCaptchaInHTML; update the map keys/values and matching loop accordingly so regex patterns are evaluated correctly.

classifier/classifier.go

dit_captcha_test.go

dogancanbakir

Recommending different approach

dogancanbakir · 2026-02-17T10:23:35Z

classifier/captcha.go

@@ -0,0 +1,509 @@
+package classifier


This should be outside of the classifier, classifier pkg is for ML classifiers. What about a pkg under the name of captcha, outside of classifier?

Yes, I think this is a much more professional approach. I will implement this.

dogancanbakir · 2026-02-17T10:24:02Z

classifier/classifier.go

-	Form   string            `json:"form"`
-	Fields map[string]string `json:"fields,omitempty"`
+	Form    string            `json:"form"`
+	Captcha string            `json:"captcha,omitempty"`


See the comment above.

… improve form-level detection fallback

coderabbitai

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

dit.go (1)
224-254: ⚠️ Potential issue | 🟠 Major

Per-form Captcha field is not populated in ExtractPageType and ExtractPageTypeProba, creating inconsistency with ExtractForms.

ExtractForms (lines 162-166) and ExtractFormsProba (lines 201-205) populate Captcha for each form, but ExtractPageType (lines 224-230) and ExtractPageTypeProba (lines 272-278) leave the Captcha field empty in the returned forms. A caller will always receive empty Captcha values when using PageResult.Forms[i].Captcha, unlike calling ExtractForms directly.

This should either be documented if intentional, or per-form CAPTCHA detection should be added to both functions to match the behavior of ExtractForms.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dit.go` around lines 224 - 254, ExtractPageType and ExtractPageTypeProba
build PageResult.Forms but never set each FormResult.Captcha, causing
inconsistency with ExtractForms/ExtractFormsProba; update these functions to
populate per-form Captcha the same way as ExtractForms does: for each form
result in the loop that constructs forms (the block filling FormResult{Type,
Fields}), run the same captcha detection used in ExtractForms (use
captcha.CaptchaDetector{}.DetectInForm on each htmlutil.GetForms form or call
the shared helper used by ExtractForms/ExtractFormsProba) and set
FormResult.Captcha accordingly before returning PageResult; ensure you reference
the same detector/path (captcha.CaptchaDetector, DetectInForm,
captcha.DetectCaptchaInHTML) so per-form Captcha values are filled consistently.

🧹 Nitpick comments (3)

dit.go (1)

234-244: Error from LoadHTMLString is silently ignored.

Line 234-235 swallow docErr. While this is best-effort, the same HTML was already successfully parsed a few lines earlier inside ExtractPage (Line 219). If it now fails, something unexpected happened and a log or warning would be appropriate. At minimum, add a comment explaining the intentional silent fallthrough.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dit.go` around lines 234 - 244, The LoadHTMLString error is being silently
ignored in the block that calls htmlutil.LoadHTMLString(html) before extracting
forms and detecting captchas; update this by either logging the error via the
existing logger (or returning it) when docErr != nil, or if the silent fallback
is intentional add a clear comment explaining why it's safe to ignore docErr;
specifically modify the code around htmlutil.LoadHTMLString(html) and the
subsequent htmlutil.GetForms / captcha.CaptchaDetector{}.DetectInForm loop to
either emit a warning that includes docErr or to document the rationale for
swallowing the error (referencing ExtractPage behavior if relevant).

captcha/captcha.go (1)

469-480: Generic markers include provider-specific strings already handled by earlier layers.

"g-recaptcha", "h-captcha", "turnstile", and "geetest" in hasGenericCaptchaMarkers are redundant since they're detected by Layers 1-2. Only "captcha" is genuinely generic here. Additionally, "captcha" as a bare substring will match any occurrence in the HTML (e.g., a <a href="/captcha-policy"> link), which may be too aggressive for a "Layer 7" fallback.

Consider trimming to only the truly generic marker and/or requiring it to appear in a meaningful attribute context.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@captcha/captcha.go` around lines 469 - 480, hasGenericCaptchaMarkers
currently treats provider-specific strings
("g-recaptcha","h-captcha","turnstile","geetest") as generic and uses a broad
substring match on "captcha"; update it to only consider truly generic markers
by removing the provider-specific entries from genericMarkers and tighten the
"captcha" check to a meaningful context — for example search relevant attributes
(e.g., form.Find for input[name|id|class], div[class|id], labels, or the form's
action) or use a word-boundary/attribute-aware regex so stray links like
"/captcha-policy" don't trigger; locate the hasGenericCaptchaMarkers function
and replace the simple strings.Contains(htmlLower, "captcha") logic with an
attribute-scoped or word-boundary check.

captcha/captcha_test.go (1)

10-696: Consider table-driven tests to reduce boilerplate.

Nearly every test follows the identical pattern: define HTML → load → get forms → detect → assert. A single table-driven test would eliminate ~500 lines of duplication and make adding new providers trivial.

Example sketch

func TestDetectInForm(t *testing.T) {
	tests := []struct {
		name     string
		html     string
		expected captcha.CaptchaType
	}{
		{"RecaptchaV2 class", `<form>...<div class="g-recaptcha" data-sitekey="..."></div>...</form>`, captcha.CaptchaTypeRecaptcha},
		{"HCaptcha class", `<form>...<div class="h-captcha" data-sitekey="..."></div>...</form>`, captcha.CaptchaTypeHCaptcha},
		// ... all other cases
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			doc, err := htmlutil.LoadHTMLString(tt.html)
			if err != nil {
				t.Fatal(err)
			}
			forms := htmlutil.GetForms(doc)
			if len(forms) == 0 {
				t.Fatal("expected to find form")
			}
			detector := &captcha.CaptchaDetector{}
			if got := detector.DetectInForm(forms[0]); got != tt.expected {
				t.Errorf("expected %v, got %v", tt.expected, got)
			}
		})
	}
}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@captcha/captcha_test.go` around lines 10 - 696, Many tests duplicate the same
pattern calling htmlutil.LoadHTMLString → htmlutil.GetForms →
(&captcha.CaptchaDetector{}).DetectInForm and asserting the result; consolidate
these into a single table-driven test (e.g., replace TestDetectRecaptchaV2,
TestDetectRecaptchaScript, TestDetectHCaptcha, etc.) by creating tests :=
[]struct{name,string,expected captcha.CaptchaType} with entries for each
provider, iterate with t.Run(tt.name, func(t *testing.T){ doc,err :=
htmlutil.LoadHTMLString(tt.html); forms := htmlutil.GetForms(doc); detector :=
&captcha.CaptchaDetector{}; if got := detector.DetectInForm(forms[0]); got !=
tt.expected { t.Errorf(...) } }), and keep complex or multi-form cases
(TestDetectCaptchaInMultipleForms, iframe/multiple captchas,
captcha.DetectCaptchaInHTML, TestCaptchaTypeString, TestIsValidCaptchaType) as
separate tests. Ensure you reference DetectInForm and
captcha.DetectCaptchaInHTML when moving cases so each HTML sample maps to the
correct assertion.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@captcha/captcha_test.go`:
- Around line 246-271: The test TestDetectGenericCaptchaIframe currently only
asserts the result is not CaptchaTypeRecaptcha/HCaptcha, but should assert that
CaptchaDetector{}.DetectInForm on the form containing an iframe with "captcha"
returns CaptchaTypeOther (since hasGenericCaptchaMarkers should match). Update
the assertion to compare result == captcha.CaptchaTypeOther (or use
t.Fatalf/t.Errorf with expected vs actual) so the test explicitly verifies the
expected CaptchaTypeOther behavior; reference CaptchaDetector.DetectInForm,
hasGenericCaptchaMarkers, and CaptchaTypeOther when locating the logic to
validate.

In `@captcha/captcha.go`:
- Line 39: The CaptchaTypeCoingecko constant maps to the opaque string "wsiz"
which is confusing; update the declaration for CaptchaTypeCoingecko to either
use a more descriptive literal (e.g., "wsiz.com" or "coingecko_wsiz") or add a
concise comment on the same line explaining the mapping (for example: "wsiz.com
is used by Coingecko for bot protection"), and keep the constant name
CaptchaTypeCoingecko unchanged so callers remain stable; edit the
CaptchaTypeCoingecko symbol in captcha.go accordingly.
- Around line 295-334: The detectByClasses function is producing false positives
because it searches the entire form HTML for short substrings; update detection
for the CaptchaTypeKasada, CaptchaTypeImperva, and CaptchaTypeArgon entries (and
any other overly short patterns in the classPatterns slice) to be more specific
by either (a) replacing short fragments with full/kebab names like "kasada",
"incapsula" or "argon-captcha", or (b) change the check to inspect actual class
attributes (e.g., iterate form.Find("[class]"), split class values and match
whole class tokens or use word-boundary regex) instead of strings.Contains on
the whole HTML to avoid matching inside unrelated words.
- Around line 128-232: The regexps in detectByScriptDomain and the
integrationPatterns used by DetectCaptchaInHTML are being (re)compiled on every
call; move all regexp.MustCompile calls out of those functions into
package-level var slices (e.g., precompile a package-level variable like
scriptDomainPatterns []struct{captchaType CaptchaType; patterns
[]*regexp.Regexp} and integrationPatterns []*regexp.Regexp) and initialize them
once (via var init or an init() function), then update detectByScriptDomain and
DetectCaptchaInHTML to reference those precompiled vars instead of constructing
regexps locally.
- Around line 135-143: The regex list for detectByScriptDomain currently places
generic patterns (CaptchaTypeRecaptcha, CaptchaTypeSmartCaptcha) before more
specific subtypes (CaptchaTypeRecaptchaV2, CaptchaTypeYandex), causing
shadowing; update the mapping so the most-specific detectors run first (e.g.,
move CaptchaTypeRecaptchaV2 entries above CaptchaTypeRecaptcha and place
CaptchaTypeYandex entries before CaptchaTypeSmartCaptcha) or remove
redundant/ambiguous patterns from the generic entries; adjust the slice ordering
in the captcha type-to-regex table used by detectByScriptDomain (and the similar
block at the other location referenced) so specific regexps like
`recaptcha/api\.js` and `captcha\.yandex` are evaluated before the broader
`recaptcha.*\.js` and `smartcaptcha\.yandex` patterns.
- Around line 350-368: The code currently mutates idPatterns entries using the
"__matched__" sentinel inside the form.Find(...).Each callback which is brittle;
replace that sentinel with a closure variable (e.g., idResult or
matchedCaptchaType) initialized outside the Each loop, set it to
entry.captchaType when a pattern match is found inside the Each for loop, and
then return early from the Each. After the Each completes, check the closure
variable and return its value if non-empty; do not modify idPatterns. This
mirrors the existing altResult pattern and keeps matching logic local to the
callback.
- Around line 241-246: Add a clarifying comment above the form-scoped script
search in the code (the block using form.Parents().First().Find(...)) stating
that this only checks scripts in the form and its immediate parent (form-scoped
detection) and that page-level scripts in <head> are intentionally excluded
because DetectCaptchaInHTML() handles page-level CAPTCHA detection as a
fallback; place the comment near the existing block so future readers understand
the layered detection approach.

In `@classifier/classifier.go`:
- Around line 63-78: The proba path currently always initializes result.Fields
to an empty map which will serialize as {} (breaking omitempty); instead, avoid
calling make at the top—only allocate result.Fields when you are about to insert
the first thresholdedProbs (or set it back to nil if no non-"captcha" fields
were added); modify the loop around fieldProba so that you create result.Fields
= make(map[string]map[string]float64) right before result.Fields[name] =
thresholdedProbs (or check len(result.Fields) == 0 and set to nil after the
loop), referencing result.Fields, fieldProba, thresholdMap, threshold, and the
"captcha" check to locate the code to change.
- Around line 34-41: When filtering out "captcha" fields you currently always
allocate result.Fields as a non-nil empty map, which causes encoding/json to
emit "fields":{} instead of omitting the key; instead, after calling
c.FieldModel.Classify(form, formType) and iterating, only allocate and assign
result.Fields if you add at least one non-"captcha" entry (leave result.Fields
nil when none found) so that the json "omitempty" behavior preserves backward
compatibility.

In `@dit.go`:
- Around line 144-149: The code redundantly reparses HTML after calling
c.fc.ExtractForms/ExtractFormsProba/ExtractPageType/ExtractPageTypeProba; fix by
avoiding a second htmlutil.LoadHTMLString/htmlutil.GetForms pass — either add a
lower-level classifier method that accepts a *goquery.Document (or returns the
parsed document/forms) so callers can reuse the parsed doc, or move the CAPTCHA
detection into the classifier's ExtractForms/ExtractPage methods so they run
once on the already-parsed document; update calls to use the new classifier API
(referencing c.fc.ExtractForms, ExtractFormsProba, ExtractPageType,
ExtractPageTypeProba and htmlutil.LoadHTMLString/htmlutil.GetForms) and remove
the duplicate parsing logic.
- Around line 232-249: Duplicate page-level CAPTCHA detection appears in
ExtractPageType and ExtractPageTypeProba; extract the logic into a helper (e.g.,
detectPageCaptcha) that takes the HTML string and returns the cap string, by
reusing htmlutil.LoadHTMLString, htmlutil.GetForms,
captcha.CaptchaDetector.DetectInForm for form-level checks and falling back to
captcha.DetectCaptchaInHTML for full-HTML detection, then replace the duplicated
blocks in both ExtractPageType and ExtractPageTypeProba with a call to this
helper and remove the local capStr duplication.

---

Outside diff comments:
In `@dit.go`:
- Around line 224-254: ExtractPageType and ExtractPageTypeProba build
PageResult.Forms but never set each FormResult.Captcha, causing inconsistency
with ExtractForms/ExtractFormsProba; update these functions to populate per-form
Captcha the same way as ExtractForms does: for each form result in the loop that
constructs forms (the block filling FormResult{Type, Fields}), run the same
captcha detection used in ExtractForms (use
captcha.CaptchaDetector{}.DetectInForm on each htmlutil.GetForms form or call
the shared helper used by ExtractForms/ExtractFormsProba) and set
FormResult.Captcha accordingly before returning PageResult; ensure you reference
the same detector/path (captcha.CaptchaDetector, DetectInForm,
captcha.DetectCaptchaInHTML) so per-form Captcha values are filled consistently.

---

Duplicate comments:
In `@captcha/captcha_test.go`:
- Around line 453-480: The test TestDetectMultipleCaptchasInOneForm is asserting
too loosely—change the assertion to require the detector to return
captcha.CaptchaTypeRecaptcha (not either) to enforce deterministic priority;
locate the call to detector.DetectInForm(forms[0]) and replace the current
permissive check with a strict equality against captcha.CaptchaTypeRecaptcha (or
adjust the test HTML or detector setup if you intend the priority to be hcaptcha
instead), ensuring the test verifies that DetectInForm on
captcha.CaptchaDetector returns the CAPTCHA type that appears first in the form.

In `@classifier/classifier.go`:
- Around line 65-77: The inner bestProb variable inside the loop that computes
the argmax over probs shadows the outer bestProb; rename the inner variable
(e.g., to fieldBestProb) and update all uses in that block (the argmax loop that
sets bestClass and bestProb) so there is no shadowing, leaving
thresholdMap(probs, threshold) and result.Fields[name] logic unchanged.

---

Nitpick comments:
In `@captcha/captcha_test.go`:
- Around line 10-696: Many tests duplicate the same pattern calling
htmlutil.LoadHTMLString → htmlutil.GetForms →
(&captcha.CaptchaDetector{}).DetectInForm and asserting the result; consolidate
these into a single table-driven test (e.g., replace TestDetectRecaptchaV2,
TestDetectRecaptchaScript, TestDetectHCaptcha, etc.) by creating tests :=
[]struct{name,string,expected captcha.CaptchaType} with entries for each
provider, iterate with t.Run(tt.name, func(t *testing.T){ doc,err :=
htmlutil.LoadHTMLString(tt.html); forms := htmlutil.GetForms(doc); detector :=
&captcha.CaptchaDetector{}; if got := detector.DetectInForm(forms[0]); got !=
tt.expected { t.Errorf(...) } }), and keep complex or multi-form cases
(TestDetectCaptchaInMultipleForms, iframe/multiple captchas,
captcha.DetectCaptchaInHTML, TestCaptchaTypeString, TestIsValidCaptchaType) as
separate tests. Ensure you reference DetectInForm and
captcha.DetectCaptchaInHTML when moving cases so each HTML sample maps to the
correct assertion.

In `@captcha/captcha.go`:
- Around line 469-480: hasGenericCaptchaMarkers currently treats
provider-specific strings ("g-recaptcha","h-captcha","turnstile","geetest") as
generic and uses a broad substring match on "captcha"; update it to only
consider truly generic markers by removing the provider-specific entries from
genericMarkers and tighten the "captcha" check to a meaningful context — for
example search relevant attributes (e.g., form.Find for input[name|id|class],
div[class|id], labels, or the form's action) or use a
word-boundary/attribute-aware regex so stray links like "/captcha-policy" don't
trigger; locate the hasGenericCaptchaMarkers function and replace the simple
strings.Contains(htmlLower, "captcha") logic with an attribute-scoped or
word-boundary check.

In `@dit.go`:
- Around line 234-244: The LoadHTMLString error is being silently ignored in the
block that calls htmlutil.LoadHTMLString(html) before extracting forms and
detecting captchas; update this by either logging the error via the existing
logger (or returning it) when docErr != nil, or if the silent fallback is
intentional add a clear comment explaining why it's safe to ignore docErr;
specifically modify the code around htmlutil.LoadHTMLString(html) and the
subsequent htmlutil.GetForms / captcha.CaptchaDetector{}.DetectInForm loop to
either emit a warning that includes docErr or to document the rationale for
swallowing the error (referencing ExtractPage behavior if relevant).

coderabbitai · 2026-02-22T04:04:41Z

captcha/captcha_test.go

+func TestDetectGenericCaptchaIframe(t *testing.T) {
+	html := `
+<form method="POST" action="/login">
+  <input type="email" name="email" />
+  <iframe src="https://example.com/captcha"></iframe>
+  <input type="submit" value="Login" />
+</form>
+`
+	doc, err := htmlutil.LoadHTMLString(html)
+	if err != nil {
+		t.Fatal(err)
+	}
+
+	forms := htmlutil.GetForms(doc)
+	if len(forms) == 0 {
+		t.Fatal("expected to find form")
+	}
+
+	detector := &captcha.CaptchaDetector{}
+	result := detector.DetectInForm(forms[0])
+
+	// Should detect as generic captcha or none (depending on iframe content)
+	if result == captcha.CaptchaTypeRecaptcha || result == captcha.CaptchaTypeHCaptcha {
+		t.Errorf("expected generic/none, got %v", result)
+	}
+}


⚠️ Potential issue | 🟡 Minor

Assertion doesn't verify the expected CaptchaTypeOther result.

The iframe src contains "captcha", which should trigger hasGenericCaptchaMarkers (Layer 7) and return CaptchaTypeOther. The current assertion only verifies the result is not recaptcha/hcaptcha — it would pass even if the detector returned CaptchaTypeNone. Assert the expected value directly.

Suggested fix

- // Should detect as generic captcha or none (depending on iframe content) - if result == captcha.CaptchaTypeRecaptcha || result == captcha.CaptchaTypeHCaptcha { - t.Errorf("expected generic/none, got %v", result) - } + // iframe src contains "captcha" → Layer 7 generic marker should match + if result != captcha.CaptchaTypeOther { + t.Errorf("expected other (generic captcha), got %v", result) + }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

func TestDetectGenericCaptchaIframe(t *testing.T) {

html := `

<form method="POST" action="/login">

<input type="email" name="email" />

<iframe src="https://example.com/captcha"></iframe>

<input type="submit" value="Login" />

</form>

`

doc, err := htmlutil.LoadHTMLString(html)

if err != nil {

t.Fatal(err)

}

forms := htmlutil.GetForms(doc)

if len(forms) == 0 {

t.Fatal("expected to find form")

}

detector := &captcha.CaptchaDetector{}

result := detector.DetectInForm(forms[0])

// Should detect as generic captcha or none (depending on iframe content)

if result == captcha.CaptchaTypeRecaptcha || result == captcha.CaptchaTypeHCaptcha {

t.Errorf("expected generic/none, got %v", result)

}

}

func TestDetectGenericCaptchaIframe(t *testing.T) {

html := `

<form method="POST" action="/login">

<input type="email" name="email" />

<iframe src="https://example.com/captcha"></iframe>

<input type="submit" value="Login" />

</form>

`

doc, err := htmlutil.LoadHTMLString(html)

if err != nil {

t.Fatal(err)

}

forms := htmlutil.GetForms(doc)

if len(forms) == 0 {

t.Fatal("expected to find form")

}

detector := &captcha.CaptchaDetector{}

result := detector.DetectInForm(forms[0])

// iframe src contains "captcha" → Layer 7 generic marker should match

if result != captcha.CaptchaTypeOther {

t.Errorf("expected other (generic captcha), got %v", result)

}

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@captcha/captcha_test.go` around lines 246 - 271, The test TestDetectGenericCaptchaIframe currently only asserts the result is not CaptchaTypeRecaptcha/HCaptcha, but should assert that CaptchaDetector{}.DetectInForm on the form containing an iframe with "captcha" returns CaptchaTypeOther (since hasGenericCaptchaMarkers should match). Update the assertion to compare result == captcha.CaptchaTypeOther (or use t.Fatalf/t.Errorf with expected vs actual) so the test explicitly verifies the expected CaptchaTypeOther behavior; reference CaptchaDetector.DetectInForm, hasGenericCaptchaMarkers, and CaptchaTypeOther when locating the logic to validate.

captcha/captcha.go

classifier/classifier.go

dit.go

…ML scan

…HA detection

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

dit.go (1)
226-238: ⚠️ Potential issue | 🟡 Minor

Per-form CAPTCHA is missing from ExtractPageType / ExtractPageTypeProba form results.

ExtractForms populates FormResult.Captcha for each form, but ExtractPageType (and ExtractPageTypeProba) builds its form results without a Captcha field (lines 228–231, 257–260). This means an API consumer gets per-form CAPTCHA info via ExtractForms but not via ExtractPageType, which is inconsistent.

Consider populating per-form CAPTCHA here as well, since the parsed form selections are already available:
Proposed fix for ExtractPageType
+	detector := &captcha.CaptchaDetector{}
 	forms := make([]FormResult, len(formResults))
 	for i, r := range formResults {
+		capStr := ""
+		if i < len(parsedForms) {
+			if ct := detector.DetectInForm(parsedForms[i]); ct != captcha.CaptchaTypeNone {
+				capStr = string(ct)
+			}
+		}
 		forms[i] = FormResult{
-			Type:   r.Result.Form,
-			Fields: r.Result.Fields,
+			Type:    r.Result.Form,
+			Captcha: capStr,
+			Fields:  r.Result.Fields,
 		}
 	}
Apply the same pattern in ExtractPageTypeProba.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dit.go` around lines 226 - 238, The Page-level extractors (ExtractPageType
and ExtractPageTypeProba) build FormResult entries from formResults but omit the
per-form Captcha field; update both functions to set FormResult.Captcha for each
form by using the parsedForms data (the same way ExtractForms does) — iterate
formResults and for each index i set Forms[i].Captcha = parsedForms[i].Captcha
(or call the same helper used by ExtractForms) so returned PageResult.Forms
contains per-form Captcha info while keeping the existing page-level
detectPageCaptcha call.

♻️ Duplicate comments (2)

captcha/captcha.go (2)
39-39: CaptchaTypeCoingecko maps to "wsiz" — still undocumented.

The constant name references "Coingecko" but the string value "wsiz" remains opaque. Add a brief comment explaining the relationship (e.g., // wsiz.com is the bot-protection provider used by Coingecko).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@captcha/captcha.go` at line 39, Add a short explanatory comment for the
constant CaptchaTypeCoingecko so the string literal "wsiz" is no longer opaque;
e.g., next to the declaration of CaptchaTypeCoingecko explain that "wsiz" refers
to the wsiz.com bot-protection provider used by Coingecko (or similar brief
phrasing) so future readers understand the mapping between the constant name and
its string value.
414-453: ⚠️ Potential issue | 🟠 Major

Overly broad class-pattern substrings still cause false positives.

The patterns flagged in the prior review remain unchanged:

Line 430: "kas" matches "task", "Kansas", etc. → false Kasada detection

Line 431: "_inc" matches "max_increment", "_include", etc. → false Imperva detection

Line 436: "argon" alone matches "argon2" (password hashing), etc. → false Argon detection

These should use more specific strings (e.g., "kasada", "incapsula", "argon-captcha").

Additionally, "smartcaptcha" appears in both CaptchaTypeSmartCaptcha (line 435) and CaptchaTypeYandex (line 438). Since SmartCaptcha is checked first, the "smartcaptcha" pattern under Yandex is unreachable in this layer.
Proposed fix
-		{CaptchaTypeKasada, []string{"kas", "kasada"}},
-		{CaptchaTypeImperva, []string{"_inc", "incapsula", "imperva"}},
+		{CaptchaTypeKasada, []string{"kasada"}},
+		{CaptchaTypeImperva, []string{"incapsula", "imperva"}},
 		{CaptchaTypeAwsWaf, []string{"aws-waf", "awswaf"}},
 		{CaptchaTypeDatadome, []string{"dd-challenge", "dd-top"}},
 		{CaptchaTypePerimeterX, []string{"_px3", "px-container"}},
 		{CaptchaTypeSmartCaptcha, []string{"smart-captcha", "smartcaptcha"}},
-		{CaptchaTypeArgon, []string{"argon-captcha", "argon"}},
+		{CaptchaTypeArgon, []string{"argon-captcha"}},
 		{CaptchaTypePuzzleCaptcha, []string{"puzzle-captcha", "__puzzle_captcha"}},
-		{CaptchaTypeYandex, []string{"smartcaptcha", "yandex-captcha"}},
+		{CaptchaTypeYandex, []string{"yandex-captcha"}},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@captcha/captcha.go` around lines 414 - 453, The detectByClasses function uses
overly broad substrings in classPatterns causing false positives—update the
patterns for CaptchaTypeKasada, CaptchaTypeImperva, and CaptchaTypeArgon to use
specific tokens (e.g., "kasada" instead of "kas", "incapsula" instead of "_inc",
and "argon-captcha" instead of "argon") and remove or replace the duplicate
"smartcaptcha" entry under CaptchaTypeYandex (use "yandex-captcha" or another
Yandex-specific token) so the entries in classPatterns and their checks
correctly and uniquely identify each CaptchaType.

🧹 Nitpick comments (2)

captcha/captcha.go (1)
309-347: Layered detection pipeline is well-structured.

Clean short-circuit design. However, note that Layer 1 (detectByClasses) uses raw HTML substring matching with some broad patterns (see comment on lines 414–453), while Layer 2 (detectByScriptDomain) is more precise. Consider whether swapping their order would reduce false positives.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@captcha/captcha.go` around lines 309 - 347, The current detection pipeline in
DetectInForm calls detectByClasses first which uses broad substring matching and
can yield false positives; change the order to call detectByScriptDomain before
detectByClasses so more precise script-domain detection short-circuits earlier
(i.e., swap the calls to detectByScriptDomain and detectByClasses in the
DetectInForm function while keeping the rest of the layered checks the same).
classifier/classifier.go (1)
67-84: LGTM — argmax-based captcha exclusion and lazy allocation address prior review concerns.

Minor note: bestProb on line 70 shadows the outer bestProb on line 59. Both are in distinct scopes so there's no bug, but a different variable name (e.g., fieldBestProb) would improve readability.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@classifier/classifier.go` around lines 67 - 84, The inner loop variable
bestProb inside the for name, probs := range fieldProba block shadows an outer
bestProb; rename the inner variable (e.g., to fieldBestProb) and update its
comparisons and assignments (the if p > ... and fieldBestProb = p, bestClass
logic) so the outer bestProb remains distinct; ensure any uses of bestProb
outside the inner loop still refer to the original variable and that
result.Fields logic (thresholdMap call, thresholdedProbs assignment) is
unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@captcha/captcha.go`:
- Around line 535-550: The simpleCaptchaPatterns in detectByFieldNames are too
broad: entries "security_code", "verify_code", and "verification_code" cause
false positives on non-CAPTCHA forms (payment or verification flows). Remove
those three patterns from simpleCaptchaPatterns or change the detection to
require an additional CAPTCHA indicator before returning CaptchaTypeSimple (for
example, require the field name match AND either the presence of "captcha" in
the same form HTML, an adjacent <img> tag, or a nearby element/class like
"captcha", "recaptcha", or "g-recaptcha"); update detectByFieldNames and any
callers to use this stricter check so only true CAPTCHA forms return
CaptchaTypeSimple.
- Around line 589-600: The current hasGenericCaptchaMarkers function uses a raw
HTML substring search that causes false positives; update
hasGenericCaptchaMarkers to look for captcha indicators only in element
names/attributes (e.g., check form.Find("input, div, iframe, script,
textarea").Each and inspect attr values like id, name, class, src) instead of
the full inner HTML, and narrow the marker set to attribute/class/id tokens
(remove redundant "g-recaptcha", "h-captcha", "turnstile", "geetest" since they
are handled earlier). Ensure you keep the function name hasGenericCaptchaMarkers
and the genericMarkers concept but apply matches against element
attributes/classes/ids rather than strings.Contains on htmlLower.
- Line 571: The iframe matcher for CaptchaTypeYandex is too broad (currently
uses "yandex") and causes false positives; update the entry in the captcha types
slice (the element referencing CaptchaTypeYandex) to use a more specific domain
token such as "smartcaptcha.yandex" or "captcha.yandex" (or add both patterns)
so only true Yandex CAPTCHA iframes match; adjust any related matching
logic/tests that rely on the old "yandex" token to ensure they still pass.

---

Outside diff comments:
In `@dit.go`:
- Around line 226-238: The Page-level extractors (ExtractPageType and
ExtractPageTypeProba) build FormResult entries from formResults but omit the
per-form Captcha field; update both functions to set FormResult.Captcha for each
form by using the parsedForms data (the same way ExtractForms does) — iterate
formResults and for each index i set Forms[i].Captcha = parsedForms[i].Captcha
(or call the same helper used by ExtractForms) so returned PageResult.Forms
contains per-form Captcha info while keeping the existing page-level
detectPageCaptcha call.

---

Duplicate comments:
In `@captcha/captcha.go`:
- Line 39: Add a short explanatory comment for the constant CaptchaTypeCoingecko
so the string literal "wsiz" is no longer opaque; e.g., next to the declaration
of CaptchaTypeCoingecko explain that "wsiz" refers to the wsiz.com
bot-protection provider used by Coingecko (or similar brief phrasing) so future
readers understand the mapping between the constant name and its string value.
- Around line 414-453: The detectByClasses function uses overly broad substrings
in classPatterns causing false positives—update the patterns for
CaptchaTypeKasada, CaptchaTypeImperva, and CaptchaTypeArgon to use specific
tokens (e.g., "kasada" instead of "kas", "incapsula" instead of "_inc", and
"argon-captcha" instead of "argon") and remove or replace the duplicate
"smartcaptcha" entry under CaptchaTypeYandex (use "yandex-captcha" or another
Yandex-specific token) so the entries in classPatterns and their checks
correctly and uniquely identify each CaptchaType.

---

Nitpick comments:
In `@captcha/captcha.go`:
- Around line 309-347: The current detection pipeline in DetectInForm calls
detectByClasses first which uses broad substring matching and can yield false
positives; change the order to call detectByScriptDomain before detectByClasses
so more precise script-domain detection short-circuits earlier (i.e., swap the
calls to detectByScriptDomain and detectByClasses in the DetectInForm function
while keeping the rest of the layered checks the same).

In `@classifier/classifier.go`:
- Around line 67-84: The inner loop variable bestProb inside the for name, probs
:= range fieldProba block shadows an outer bestProb; rename the inner variable
(e.g., to fieldBestProb) and update its comparisons and assignments (the if p >
... and fieldBestProb = p, bestClass logic) so the outer bestProb remains
distinct; ensure any uses of bestProb outside the inner loop still refer to the
original variable and that result.Fields logic (thresholdMap call,
thresholdedProbs assignment) is unchanged.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6f1b7c9 and e356dd2.

📒 Files selected for processing (3)

captcha/captcha.go
classifier/classifier.go
dit.go

captcha/captcha.go

dogancanbakir · 2026-02-23T10:01:46Z

Could you add some test cases to tests things?

ezyasin · 2026-02-25T03:46:10Z

These are the test results for the code running on a few live CAPTCHAs.

If a CAPTCHA is detected, this is recorded alongside its type as a property; if not, no action is taken.

I also added code to make it work with the --proba flag, but I'm not sure if that's really necessary.

…arity

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (3)

captcha/captcha.go (3)
39-39: ⚠️ Potential issue | 🟡 Minor

CaptchaTypeCoingecko still maps to the opaque string "wsiz" with no explanation.

The constant name clearly references Coingecko, but the serialized value ("wsiz") gives no hint of that relationship. Consumers comparing against "coingecko" in JSON output would get silent mismatches. Add an inline comment documenting that wsiz.com is Coingecko's bot-protection vendor, or change the value to something self-descriptive like "coingecko" or "wsiz.com".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@captcha/captcha.go` at line 39, The constant CaptchaTypeCoingecko currently
serializes to the opaque string "wsiz" which hides that it represents
Coingecko's bot-protection vendor; update the declaration for
CaptchaTypeCoingecko to either use a self-descriptive value ("coingecko" or
"wsiz.com") so JSON consumers can match reliably, or add a clear inline comment
on the CaptchaTypeCoingecko constant explaining that "wsiz" refers to wsiz.com
(Coingecko's vendor) so future readers and consumers understand the mapping;
ensure the chosen change preserves any existing external contract or update
callers/tests accordingly.
586-597: ⚠️ Potential issue | 🟠 Major

hasGenericCaptchaMarkers raw-HTML search for "captcha" produces false positives, and 4 of 5 markers are redundant.

strings.Contains(htmlLower, "captcha") triggers on <label>Enter the captcha below</label>, , or any <a href="/captcha-help"> — none of which embed a CAPTCHA widget. Additionally, "g-recaptcha", "h-captcha", "turnstile", and "geetest" are already caught by earlier layers; they will never reach this fallback.

Consider narrowing the search to element attributes only (id, name, class, src) rather than the full HTML string, and trimming the marker list to just "captcha" once the match scope is tightened.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@captcha/captcha.go` around lines 586 - 597, The current
hasGenericCaptchaMarkers function searches the entire raw HTML and uses multiple
redundant markers causing false positives; change it to inspect only element
attributes (id, name, class, src) within the provided goquery.Selection and
perform a case-insensitive substring check for the single marker "captcha"
(remove "g-recaptcha", "h-captcha", "turnstile", "geetest"), returning true if
any of those attributes contain "captcha"; use form.Find("*").Each or equivalent
to iterate elements and check Attr("id"), Attr("name"), Attr("class") and
Attr("src") for the substring to reliably detect an embedded widget.
430-436: ⚠️ Potential issue | 🟠 Major

Overly-broad substring patterns "kas", "_inc", and "argon" remain — false-positive risk.

These three entries search the full inner HTML of the form via strings.Contains, meaning they match inside class names, attribute values, text nodes, and comments:

"kas" triggers on "task", "kansas", "breakfast", etc.

"_inc" triggers on "max_increment", "_include", "font-spacing_inc", etc.

"argon" triggers on any form that mentions argon2 (a common password-hashing scheme).

The safer alternatives are either (a) use the longer, unambiguous tokens already in the same slice ("kasada", "incapsula", "argon-captcha"), or (b) perform the match against actual class attribute tokens instead of raw HTML.
🐛 Proposed fix
-		{CaptchaTypeKasada, []string{"kas", "kasada"}},
-		{CaptchaTypeImperva, []string{"_inc", "incapsula", "imperva"}},
...
-		{CaptchaTypeArgon, []string{"argon-captcha", "argon"}},
+		{CaptchaTypeKasada, []string{"kasada"}},
+		{CaptchaTypeImperva, []string{"incapsula", "imperva"}},
...
+		{CaptchaTypeArgon, []string{"argon-captcha"}},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@captcha/captcha.go` around lines 430 - 436, The substring match list for
CaptchaTypeKasada, CaptchaTypeImperva and CaptchaTypeArgon is too broad and
causes false positives; update the mapping entries so they use only the longer,
unambiguous tokens ("kasada" instead of "kas", "incapsula" or "inc" instead of
"_inc", and "argon-captcha" instead of "argon") in the slice associated with
CaptchaTypeKasada, CaptchaTypeImperva and CaptchaTypeArgon, and if possible
change the HTML detection code that uses strings.Contains to instead check class
attribute tokens (split class attribute on whitespace) when identifying matches
so you don’t match substrings inside other attributes or text.

🧹 Nitpick comments (2)

dit_captcha_test.go (2)

632-695: deployRegion is populated in every test case but never used in any assertion — dead test data.

The field participates in the table-driven test struct but is only read by the loop range binding; it is never referenced in the body of t.Run. Either remove it from the struct if it adds no value, or add an assertion that exercises it (e.g., verifying that region-specific detection behaviour differs).

♻️ Proposed cleanup (remove unused field)

 tests := []struct {
 	name         string
 	captcha      string
 	description  string
-	deployRegion string
 }{
 	{
 		name:         "yandex_smartcaptcha",
 		captcha:      "yandex",
 		description:  "Yandex SmartCaptcha - Russian behavioral CAPTCHA",
-		deployRegion: "russia",
 	},
 	{
 		name:         "yandex_global",
 		captcha:      "yandex",
 		description:  "Yandex SmartCaptcha - Global deployment",
-		deployRegion: "global",
 	},
 	{
 		name:         "yandex_enterprise",
 		captcha:      "yandex",
 		description:  "Yandex SmartCaptcha - Enterprise use",
-		deployRegion: "enterprise",
 	},
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@dit_captcha_test.go` around lines 632 - 695, The table-driven test
TestYandexCaptchaDetection includes a deployRegion field that is never used;
remove deployRegion from the test struct and all test cases, or if region
behavior should be tested, add assertions inside the t.Run block that reference
tt.deployRegion and verify expected region-specific behavior (e.g., adjust
expected detection or a region-dependent field on FormResult). Update the test
cases and any references to tt.deployRegion accordingly so the struct and
assertions are consistent with the intended coverage.

8-737: All tests exercise struct construction only — no detection logic is covered.

Every test in this file follows the same pattern: manually construct a FormResult/PageResult/FormResultProba with hard-coded values, then assert that those same values are still present. This is equivalent to testing Go's struct field assignment and provides no regression protection for the actual CAPTCHA detection pipeline.

For this test file to be meaningful, the tests should call the detection entry points with representative HTML fixtures and verify the produced results. For example:

// Meaningful pattern — calls detection, verifies output
func TestDetectInFormRecaptcha(t *testing.T) {
    html := `<html><body>
        <form>
            <div class="g-recaptcha" data-sitekey="abc123"></div>
            <input type="text" name="email">
        </form>
    </body></html>`

    doc, _ := goquery.NewDocumentFromReader(strings.NewReader(html))
    detector := captcha.CaptchaDetector{}
    got := detector.DetectInForm(doc.Find("form").First())
    if got != captcha.CaptchaTypeRecaptcha {
        t.Errorf("got %s, want recaptcha", got)
    }
}

Integration-level tests via ExtractPage() / Classify() with real HTML would similarly catch regressions introduced by changes to the detection layers in captcha/captcha.go.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@dit_captcha_test.go` around lines 8 - 737, The tests only construct
FormResult/PageResult/FormResultProba/PageResultProba structs and assert their
fields, which doesn't exercise your detection logic; update the tests to call
the actual detection entry points (e.g., captcha.CaptchaDetector, DetectInForm,
ExtractPage, Classify) with representative HTML fixtures parsed via
goquery.NewDocumentFromReader (or equivalent) and assert the returned captcha
types/structures match expected values; add tests like
"TestDetectInFormRecaptcha" that build minimal HTML snippets containing known
CAPTCHA markers, parse them into a document, invoke
detector.DetectInForm(doc.Find("form").First()) and assert the returned captcha
type, and add integration-style tests that call ExtractPage()/Classify() on full
HTML fixtures to validate page-level results.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@captcha/captcha.go`:
- Around line 143-146: The Datadome signature list incorrectly includes
Mixpanel's CDN; remove the regexp.MustCompile(`cdn\.mxpnl\.com`) entry from the
CaptchaTypeDatadome regex slice so only Datadome-related hosts remain (leave
regexp.MustCompile(`datadome\.co`) intact and ensure the CaptchaTypeDatadome
slice compiles and tests still pass).

---

Duplicate comments:
In `@captcha/captcha.go`:
- Line 39: The constant CaptchaTypeCoingecko currently serializes to the opaque
string "wsiz" which hides that it represents Coingecko's bot-protection vendor;
update the declaration for CaptchaTypeCoingecko to either use a self-descriptive
value ("coingecko" or "wsiz.com") so JSON consumers can match reliably, or add a
clear inline comment on the CaptchaTypeCoingecko constant explaining that "wsiz"
refers to wsiz.com (Coingecko's vendor) so future readers and consumers
understand the mapping; ensure the chosen change preserves any existing external
contract or update callers/tests accordingly.
- Around line 586-597: The current hasGenericCaptchaMarkers function searches
the entire raw HTML and uses multiple redundant markers causing false positives;
change it to inspect only element attributes (id, name, class, src) within the
provided goquery.Selection and perform a case-insensitive substring check for
the single marker "captcha" (remove "g-recaptcha", "h-captcha", "turnstile",
"geetest"), returning true if any of those attributes contain "captcha"; use
form.Find("*").Each or equivalent to iterate elements and check Attr("id"),
Attr("name"), Attr("class") and Attr("src") for the substring to reliably detect
an embedded widget.
- Around line 430-436: The substring match list for CaptchaTypeKasada,
CaptchaTypeImperva and CaptchaTypeArgon is too broad and causes false positives;
update the mapping entries so they use only the longer, unambiguous tokens
("kasada" instead of "kas", "incapsula" or "inc" instead of "_inc", and
"argon-captcha" instead of "argon") in the slice associated with
CaptchaTypeKasada, CaptchaTypeImperva and CaptchaTypeArgon, and if possible
change the HTML detection code that uses strings.Contains to instead check class
attribute tokens (split class attribute on whitespace) when identifying matches
so you don’t match substrings inside other attributes or text.

---

Nitpick comments:
In `@dit_captcha_test.go`:
- Around line 632-695: The table-driven test TestYandexCaptchaDetection includes
a deployRegion field that is never used; remove deployRegion from the test
struct and all test cases, or if region behavior should be tested, add
assertions inside the t.Run block that reference tt.deployRegion and verify
expected region-specific behavior (e.g., adjust expected detection or a
region-dependent field on FormResult). Update the test cases and any references
to tt.deployRegion accordingly so the struct and assertions are consistent with
the intended coverage.
- Around line 8-737: The tests only construct
FormResult/PageResult/FormResultProba/PageResultProba structs and assert their
fields, which doesn't exercise your detection logic; update the tests to call
the actual detection entry points (e.g., captcha.CaptchaDetector, DetectInForm,
ExtractPage, Classify) with representative HTML fixtures parsed via
goquery.NewDocumentFromReader (or equivalent) and assert the returned captcha
types/structures match expected values; add tests like
"TestDetectInFormRecaptcha" that build minimal HTML snippets containing known
CAPTCHA markers, parse them into a document, invoke
detector.DetectInForm(doc.Find("form").First()) and assert the returned captcha
type, and add integration-style tests that call ExtractPage()/Classify() on full
HTML fixtures to validate page-level results.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e356dd2 and 828d34a.

📒 Files selected for processing (2)

captcha/captcha.go
dit_captcha_test.go

captcha/captcha.go

…TCHA, hCaptcha, and Turnstile

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

captcha/captcha.go (1)

421-423: ⚠️ Potential issue | 🟠 Major

Specific reCAPTCHA variants are shadowed by the generic class rule.

g-recaptcha is checked before g-recaptcha-v2 and g-recaptcha-invisible, so variant forms get classified as generic reCAPTCHA in class-based detection.

🐛 Proposed fix

-		{CaptchaTypeRecaptcha, []string{"g-recaptcha", "grecaptcha"}},
-		{CaptchaTypeRecaptchaV2, []string{"g-recaptcha-v2", "grecaptcha-v2"}},
-		{CaptchaTypeRecaptchaInvisible, []string{"g-recaptcha-invisible", "grecaptcha-invisible"}},
+		{CaptchaTypeRecaptchaInvisible, []string{"g-recaptcha-invisible", "grecaptcha-invisible"}},
+		{CaptchaTypeRecaptchaV2, []string{"g-recaptcha-v2", "grecaptcha-v2"}},
+		{CaptchaTypeRecaptcha, []string{"g-recaptcha", "grecaptcha"}},

Also applies to: 441-447

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@captcha/captcha.go` around lines 421 - 423, The class-based detection rules
currently list the generic CaptchaTypeRecaptcha entry before its specific
variants, causing classMatches for "g-recaptcha-v2" and "g-recaptcha-invisible"
to be shadowed; reorder the class rules so more specific variants
(CaptchaTypeRecaptchaV2, CaptchaTypeRecaptchaInvisible) appear before the
generic CaptchaTypeRecaptcha (and apply the same reordering for the other
duplicated block at the later location), ensuring the slices mapping class names
to captcha types check variant strings like
"g-recaptcha-v2"/"g-recaptcha-invisible" prior to "g-recaptcha".

🧹 Nitpick comments (2)

dit_captcha_test.go (1)

10-737: Most of these tests are tautological and won’t catch integration regressions.

They mainly assert values immediately written into structs, so failures in extraction/classification logic can slip through undetected.

Consider shifting these tables to behavior tests that call the real entry points (ExtractPage, classify/proba flows) and assert populated Captcha fields from HTML fixtures.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dit_captcha_test.go` around lines 10 - 737, Tests currently only assert
values written into structs (e.g., FormResult, PageResult, FormResultProba,
PageResultProba) which makes them tautological; replace key table-driven cases
with behavior/integration tests that call the real entry points (e.g.,
ExtractPage) using HTML fixtures and the actual classification/proba flows, then
assert the populated Captcha fields and field/type probabilities on the returned
PageResult/FormResult or PageResultProba/FormResultProba; keep a small set of
pure unit tests for struct shaping but move verification of real detection logic
into tests that feed known HTML fixtures and verify outputs from ExtractPage and
the classifier/proba pipeline.

captcha/captcha.go (1)

49-83: Reduce enum validation duplication.

IsValidCaptchaType duplicates the full constant list. This is easy to drift when adding/removing types.

Use a single map[CaptchaType]struct{} (or slice+set init) as the source of truth and validate via lookup.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@captcha/captcha.go` around lines 49 - 83, Replace the long switch in
IsValidCaptchaType with a single package-level lookup set: create a var like
validCaptchaTypes map[CaptchaType]struct{} (or build from a slice) containing
all existing constants (e.g., CaptchaTypeNone, CaptchaTypeRecaptcha,
CaptchaTypeRecaptchaV2, CaptchaTypeRecaptchaInvisible, CaptchaTypeHCaptcha,
CaptchaTypeTurnstile, CaptchaTypeGeetest, CaptchaTypeFriendlyCaptcha,
CaptchaTypeRotateCaptcha, CaptchaTypeClickCaptcha, CaptchaTypeImageCaptcha,
CaptchaTypePuzzleCaptcha, CaptchaTypeSliderCaptcha, CaptchaTypeMCaptcha,
CaptchaTypeDatadome, CaptchaTypePerimeterX, CaptchaTypeArgon,
CaptchaTypeBehaviotech, CaptchaTypeSmartCaptcha, CaptchaTypeYandex,
CaptchaTypeFuncaptcha, CaptchaTypeKasada, CaptchaTypeImperva, CaptchaTypeAwsWaf,
CaptchaTypeCoingecko, CaptchaTypeNovaScape, CaptchaTypeSimple,
CaptchaTypeOther), then implement IsValidCaptchaType by converting the input
string to CaptchaType and returning whether it exists in that map (e.g., _, ok
:= validCaptchaTypes[CaptchaType(s)]; return ok); keep the map as the single
source of truth so adding/removing captcha types only updates one place.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@captcha/captcha.go`:
- Around line 199-306: DetectCaptchaInHTML's regex table only matches
double-quoted attributes and misses single-quoted HTML; update each regexp in
the captcha detection slice used by DetectCaptchaInHTML to accept either single
or double quotes (e.g., change patterns like attr="...` to attr=(["']).*?\1 or
use ['"] instead of " where applicable), and apply the same change to the
similar patterns around the other occurrence (lines referenced in the review:
the block shown and the other block at 607-613) so all attribute-based matches
handle both quote styles.

---

Duplicate comments:
In `@captcha/captcha.go`:
- Around line 421-423: The class-based detection rules currently list the
generic CaptchaTypeRecaptcha entry before its specific variants, causing
classMatches for "g-recaptcha-v2" and "g-recaptcha-invisible" to be shadowed;
reorder the class rules so more specific variants (CaptchaTypeRecaptchaV2,
CaptchaTypeRecaptchaInvisible) appear before the generic CaptchaTypeRecaptcha
(and apply the same reordering for the other duplicated block at the later
location), ensuring the slices mapping class names to captcha types check
variant strings like "g-recaptcha-v2"/"g-recaptcha-invisible" prior to
"g-recaptcha".

---

Nitpick comments:
In `@captcha/captcha.go`:
- Around line 49-83: Replace the long switch in IsValidCaptchaType with a single
package-level lookup set: create a var like validCaptchaTypes
map[CaptchaType]struct{} (or build from a slice) containing all existing
constants (e.g., CaptchaTypeNone, CaptchaTypeRecaptcha, CaptchaTypeRecaptchaV2,
CaptchaTypeRecaptchaInvisible, CaptchaTypeHCaptcha, CaptchaTypeTurnstile,
CaptchaTypeGeetest, CaptchaTypeFriendlyCaptcha, CaptchaTypeRotateCaptcha,
CaptchaTypeClickCaptcha, CaptchaTypeImageCaptcha, CaptchaTypePuzzleCaptcha,
CaptchaTypeSliderCaptcha, CaptchaTypeMCaptcha, CaptchaTypeDatadome,
CaptchaTypePerimeterX, CaptchaTypeArgon, CaptchaTypeBehaviotech,
CaptchaTypeSmartCaptcha, CaptchaTypeYandex, CaptchaTypeFuncaptcha,
CaptchaTypeKasada, CaptchaTypeImperva, CaptchaTypeAwsWaf, CaptchaTypeCoingecko,
CaptchaTypeNovaScape, CaptchaTypeSimple, CaptchaTypeOther), then implement
IsValidCaptchaType by converting the input string to CaptchaType and returning
whether it exists in that map (e.g., _, ok := validCaptchaTypes[CaptchaType(s)];
return ok); keep the map as the single source of truth so adding/removing
captcha types only updates one place.

In `@dit_captcha_test.go`:
- Around line 10-737: Tests currently only assert values written into structs
(e.g., FormResult, PageResult, FormResultProba, PageResultProba) which makes
them tautological; replace key table-driven cases with behavior/integration
tests that call the real entry points (e.g., ExtractPage) using HTML fixtures
and the actual classification/proba flows, then assert the populated Captcha
fields and field/type probabilities on the returned PageResult/FormResult or
PageResultProba/FormResultProba; keep a small set of pure unit tests for struct
shaping but move verification of real detection logic into tests that feed known
HTML fixtures and verify outputs from ExtractPage and the classifier/proba
pipeline.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 828d34a and fd2c55b.

📒 Files selected for processing (2)

captcha/captcha.go
dit_captcha_test.go

coderabbitai · 2026-02-25T04:19:01Z

captcha/captcha.go

+		regexp.MustCompile(`class="[^"]*g-recaptcha-invisible`),
+		regexp.MustCompile(`data-size="invisible"`),
+	}},
+	{CaptchaTypeRecaptchaV2, []*regexp.Regexp{
+		regexp.MustCompile(`class="[^"]*g-recaptcha-v2`),
+	}},
+	{CaptchaTypeRecaptcha, []*regexp.Regexp{
+		regexp.MustCompile(`src="[^"]*google\.com/recaptcha`),
+		regexp.MustCompile(`src="[^"]*gstatic\.com/[^"]*recaptcha`),
+		regexp.MustCompile(`src="[^"]*recaptcha/api\.js`),
+		regexp.MustCompile(`class="[^"]*g-recaptcha`),
+	}},
+	{CaptchaTypeHCaptcha, []*regexp.Regexp{
+		regexp.MustCompile(`src="[^"]*js\.hcaptcha\.com`),
+		regexp.MustCompile(`class="[^"]*h-captcha`),
+		regexp.MustCompile(`data-hcaptcha-widget-id`),
+	}},
+	{CaptchaTypeTurnstile, []*regexp.Regexp{
+		regexp.MustCompile(`src="[^"]*challenges\.cloudflare\.com`),
+		regexp.MustCompile(`class="[^"]*cf-turnstile`),
+	}},
+	{CaptchaTypeGeetest, []*regexp.Regexp{
+		regexp.MustCompile(`src="[^"]*geetest`),
+		regexp.MustCompile(`class="[^"]*geetest`),
+	}},
+	{CaptchaTypeFriendlyCaptcha, []*regexp.Regexp{
+		regexp.MustCompile(`src="[^"]*friendlycaptcha`),
+		regexp.MustCompile(`class="[^"]*frc-captcha`),
+	}},
+	{CaptchaTypeRotateCaptcha, []*regexp.Regexp{
+		regexp.MustCompile(`alt="[^"]*rotatecaptcha`),
+		regexp.MustCompile(`src="[^"]*rotatecaptcha`),
+	}},
+	{CaptchaTypeClickCaptcha, []*regexp.Regexp{
+		regexp.MustCompile(`alt="[^"]*clickcaptcha`),
+		regexp.MustCompile(`src="[^"]*clickcaptcha`),
+	}},
+	{CaptchaTypeImageCaptcha, []*regexp.Regexp{
+		regexp.MustCompile(`alt="[^"]*imagecaptcha`),
+		regexp.MustCompile(`src="[^"]*imagecaptcha`),
+	}},
+	{CaptchaTypePuzzleCaptcha, []*regexp.Regexp{
+		regexp.MustCompile(`class="[^"]*__puzzle_captcha`),
+	}},
+	{CaptchaTypeSliderCaptcha, []*regexp.Regexp{
+		regexp.MustCompile(`class="[^"]*slider-captcha`),
+		regexp.MustCompile(`src="[^"]*slidercaptcha`),
+	}},
+	{CaptchaTypeMCaptcha, []*regexp.Regexp{
+		regexp.MustCompile(`src="[^"]*mcaptcha`),
+		regexp.MustCompile(`class="[^"]*mcaptcha`),
+		regexp.MustCompile(`data-mcaptcha`),
+	}},
+	{CaptchaTypeKasada, []*regexp.Regexp{
+		regexp.MustCompile(`src="[^"]*kasadaproducts\.com`),
+		regexp.MustCompile(`data-kasada`),
+	}},
+	{CaptchaTypeImperva, []*regexp.Regexp{
+		regexp.MustCompile(`src="[^"]*/_incapsula_resource`),
+		regexp.MustCompile(`data-incapsula`),
+		regexp.MustCompile(`data-imperva`),
+	}},
+	{CaptchaTypeAwsWaf, []*regexp.Regexp{
+		regexp.MustCompile(`src="[^"]*aws-waf-captcha`),
+		regexp.MustCompile(`src="[^"]*awswaf\.com`),
+	}},
+	{CaptchaTypeDatadome, []*regexp.Regexp{
+		regexp.MustCompile(`src="[^"]*datadome`),
+		regexp.MustCompile(`data-datadome`),
+		regexp.MustCompile(`class="[^"]*dd-challenge`),
+	}},
+	{CaptchaTypePerimeterX, []*regexp.Regexp{
+		regexp.MustCompile(`src="[^"]*perimeterx`),
+		regexp.MustCompile(`data-px`),
+		regexp.MustCompile(`_pxappid`),
+	}},
+	{CaptchaTypeArgon, []*regexp.Regexp{
+		regexp.MustCompile(`class="[^"]*argon-captcha`),
+	}},
+	{CaptchaTypeBehaviotech, []*regexp.Regexp{
+		regexp.MustCompile(`src="[^"]*behaviotech\.com`),
+	}},
+	// SmartCaptcha (specific Yandex product) before generic Yandex to avoid shadowing.
+	{CaptchaTypeSmartCaptcha, []*regexp.Regexp{
+		regexp.MustCompile(`src="[^"]*captcha\.yandex`),
+		regexp.MustCompile(`class="[^"]*smart-captcha`),
+		regexp.MustCompile(`data-smartcaptcha`),
+	}},
+	{CaptchaTypeYandex, []*regexp.Regexp{
+		regexp.MustCompile(`class="[^"]*yandex-captcha`),
+	}},
+	{CaptchaTypeFuncaptcha, []*regexp.Regexp{
+		regexp.MustCompile(`src="[^"]*funcaptcha\.com`),
+		regexp.MustCompile(`src="[^"]*arkoselabs\.com`),
+		regexp.MustCompile(`class="[^"]*funcaptcha`),
+	}},
+	{CaptchaTypeCoingecko, []*regexp.Regexp{
+		regexp.MustCompile(`src="[^"]*wsiz\.com`),
+	}},
+	{CaptchaTypeNovaScape, []*regexp.Regexp{
+		regexp.MustCompile(`src="[^"]*novascape`),
+	}},
+	{CaptchaTypeSimple, []*regexp.Regexp{
+		regexp.MustCompile(`name="[^"]*captcha_code`),
+		regexp.MustCompile(`name="[^"]*captcha_input`),
+		regexp.MustCompile(`id="[^"]*captcha_image`),
+	}},
+}


⚠️ Potential issue | 🟠 Major

DetectCaptchaInHTML misses single-quoted attributes.

Most regexes only match attr="...". Valid HTML often uses single quotes, which causes avoidable false negatives.

🐛 Proposed fix pattern (apply across this table)

- regexp.MustCompile(`src="[^"]*google\.com/recaptcha`), + regexp.MustCompile(`src=['"][^'"]*google\.com/recaptcha`), - regexp.MustCompile(`class="[^"]*g-recaptcha`), + regexp.MustCompile(`class=['"][^'"]*g-recaptcha`), - regexp.MustCompile(`name="[^"]*captcha_code`), + regexp.MustCompile(`name=['"][^'"]*captcha_code`),

Also applies to: 607-613

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@captcha/captcha.go` around lines 199 - 306, DetectCaptchaInHTML's regex table only matches double-quoted attributes and misses single-quoted HTML; update each regexp in the captcha detection slice used by DetectCaptchaInHTML to accept either single or double quotes (e.g., change patterns like attr="...` to attr=(["']).*?\1 or use ['"] instead of " where applicable), and apply the same change to the similar patterns around the other occurrence (lines referenced in the review: the block shown and the other block at 607-613) so all attribute-based matches handle both quote styles.

- Introduced a map of valid CAPTCHA types to streamline validation in IsValidCaptchaType function. - Simplified the logic for checking valid CAPTCHA types by replacing the switch statement with a map lookup. - Updated HTML integration patterns to use consistent regex for attribute matching. - Removed redundant test cases and added new tests for CAPTCHA detection in HTML forms. - Enhanced existing tests to verify the structure of FormResult and PageResult. - Consolidated tests for various CAPTCHA types to ensure comprehensive coverage.

coderabbitai bot reviewed Feb 16, 2026

View reviewed changes

dogancanbakir requested changes Feb 17, 2026

View reviewed changes

ezyasin added 2 commits February 21, 2026 00:05

refactor: Reorganize captcha detection under captcha package

2d94570

feat: Enhance CAPTCHA detection by adding ID and alt-text checks, and…

6f1b7c9

… improve form-level detection fallback

coderabbitai bot reviewed Feb 22, 2026

View reviewed changes

ezyasin added 2 commits February 23, 2026 10:32

feat: Implement page-level CAPTCHA detection with fallback to full HT…

de5f630

…ML scan

feat: Update form extraction methods to return parsed forms for CAPTC…

e356dd2

…HA detection

coderabbitai bot reviewed Feb 23, 2026

View reviewed changes

captcha/captcha.go Show resolved Hide resolved

captcha/captcha.go Outdated Show resolved Hide resolved

captcha/captcha.go Show resolved Hide resolved

refactor: Update Yandex CAPTCHA test structure for consistency and cl…

828d34a

…arity

coderabbitai bot reviewed Feb 25, 2026

View reviewed changes

captcha/captcha.go Show resolved Hide resolved

ezyasin added 2 commits February 25, 2026 07:14

feat: Update a CAPTCHA type description and refine detection patterns

6dd2fe6

feat: Add tests for various CAPTCHA detection methods including reCAP…

fd2c55b

…TCHA, hCaptcha, and Turnstile

coderabbitai bot reviewed Feb 25, 2026

View reviewed changes

	CaptchaTurnstile CaptchaType = "turnstile"
	CaptchaTypeTurnstile CaptchaType = "turnstile"

Conversation

ezyasin commented Feb 16, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Add CAPTCHA Detection Support

Overview

Changes

Struct Updates

CAPTCHA Detection Implementation

CAPTCHA Types Supported (28 types)

Google Solutions (3 types)

Enterprise Solutions (4 types)

Alternative CAPTCHA Services (5 types)

Bot Protection & Behavioral Analysis (5 types)

Interaction-Based CAPTCHAs (5 types)

Simple/Legacy (2 types)

Generic Types

Test Coverage Enhancements

Key Test Scenarios

Documentation

Backward Compatibility

Key Features

Test Results

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Feb 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 16, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

dogancanbakir left a comment

Choose a reason for hiding this comment

Uh oh!

dogancanbakir Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

ezyasin Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

dogancanbakir Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 22, 2026

ezyasin commented Feb 16, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Feb 16, 2026 •

edited

Loading