feat: Enhance form classification to include CAPTCHA detection#30
feat: Enhance form classification to include CAPTCHA detection#30ezyasin wants to merge 9 commits intoHappyHackingSpace:mainfrom
Conversation
- Added CAPTCHA field to ClassifyResult and ClassifyProbaResult structures. - Updated Classify method to detect CAPTCHA at the page level and exclude CAPTCHA fields from individual form classifications. - Modified ExtractPage method to capture the first detected CAPTCHA type across all forms. - Introduced comprehensive tests for various CAPTCHA types and validation of form results, ensuring accurate detection and representation of CAPTCHA in classification results.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds a multi-layer CAPTCHA detection module, integrates per-form and page-level CAPTCHA into extraction results, updates classifiers to exclude captcha fields, and extends ExtractPage/ExtractForms to return parsed form selections alongside existing outputs. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Client
participant Extractor as Page/Form Extractor
participant Detector as CAPTCHA Detector
participant Classifier as Field Classifier
participant Result as Result Builder
Client->>Extractor: provide HTML
Extractor->>Extractor: parse HTML → parsedForms
loop per form
Extractor->>Detector: DetectInForm(parsedForm)
Detector-->>Extractor: CaptchaType (per-form)
end
Extractor->>Detector: DetectCaptchaInHTML(fullHTML) (fallback)
Detector-->>Extractor: CaptchaType (page-level)
Extractor->>Classifier: classify form fields (exclude captcha fields)
Classifier-->>Extractor: classified fields
Extractor->>Result: assemble FormResult/PageResult with Captcha + Fields + parsedForms
Result-->>Client: return results (and parsedForms)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
dit.go (1)
188-200:⚠️ Potential issue | 🟠 MajorPer-form
Captchafield is never populated.
FormResulthas aCaptchafield (Line 33), but when building forms inExtractPageType(Lines 190-193),ExtractPageTypeProba(Lines 219-222),ExtractForms(Lines 144-147), andExtractFormsProba(Lines 166-169), theCaptchafield is never set. It will always be empty. The page-level captcha is propagated correctly (Line 198), but per-form captcha data fromclassifier.ClassifyResult.Captchais not forwarded.🐛 Proposed fix (for ExtractPageType; apply similar pattern to other methods)
forms := make([]FormResult, len(formResults)) for i, r := range formResults { forms[i] = FormResult{ - Type: r.Result.Form, - Fields: r.Result.Fields, + Type: r.Result.Form, + Captcha: r.Result.Captcha, + Fields: r.Result.Fields, } }
🤖 Fix all issues with AI agents
In `@classifier/captcha_test.go`:
- Around line 452-479: The test TestDetectMultipleCaptchasInOneForm claims
"recaptcha comes first" but permits either result; make behavior deterministic
by enforcing ordered detection in CaptchaDetector.DetectInForm (scan the form
DOM in document order and check for recaptcha before checking for hcaptcha,
using the same traversal used by htmlutil.GetForms) and then tighten the test to
assert result == CaptchaTypeRecaptcha (replace the or check with a single
equality assertion); reference DetectInForm, CaptchaDetector,
CaptchaTypeRecaptcha and CaptchaTypeHCaptcha when making the changes.
In `@classifier/captcha.go`:
- Around line 269-307: The detectByClasses function uses overly short substrings
in classPatterns and searches the entire HTML (htmlLower), causing false
positives; change detection to read the form's "class" attribute(s) and match
against actual class tokens (split on whitespace) or use stricter
regex/word-boundary checks for each pattern (e.g., replace "kas" with "kas-" or
"^kas$" style matches, change "_inc" and "_px3" to more specific tokens) inside
detectByClasses so you only return a CaptchaType when an actual class name
matches the more specific pattern; update the classPatterns entries accordingly
and adjust the loop to inspect class attributes via goquery (e.g.,
form.Attr("class") or iterating child elements' class attrs) rather than
searching htmlLower.
- Around line 147-155: The patterns for CaptchaTypeSmartCaptcha and
CaptchaTypeYandex overlap (e.g., `smartcaptcha.yandex`) and because detection
uses an iteration over a map[CaptchaType][]*regexp.Regexp the returned type is
nondeterministic; replace the map with an ordered slice of pairs (e.g., a
[]struct{ Type CaptchaType; Patterns []*regexp.Regexp }) or otherwise enforce an
explicit priority order and update all detection functions that currently
iterate the map to iterate this ordered slice so Yandex-related patterns are
matched deterministically (or deduplicate/adjust patterns so they no longer
overlap).
- Around line 370-408: DetectCaptchaInHTML uses domainPatterns with regex-like
entries (e.g., "recaptcha.*v2", "recaptcha.*invisible", "yandex.com/.*captcha")
but matches them with strings.Contains, so those will never match; fix by either
(A) replacing regex-like patterns in the domainPatterns map with actual literal
substrings that exist in the HTML (e.g., "recaptcha/api.js" or "recaptcha v2" as
appropriate) or (B) switch matching to regular expressions: import regexp,
precompile each pattern from domainPatterns and use regexp.MatchString (or
compile once per pattern) when iterating in DetectCaptchaInHTML; update the map
keys/values and matching loop accordingly so regex patterns are evaluated
correctly.
- Around line 193-198: The current traversal using
form.Parents().First().Find("script") is too broad and can capture unrelated
scripts; limit the search to the form itself and its immediate container instead
(e.g., use form.Find("script") plus form.Parent().Find("script")) so you only
collect nearby scripts into scriptSrcs, and keep using s.Attr("src") to append
lowercased sources; this reduces false positives from scripts elsewhere in the
document.
- Line 20: Rename the enum constant CaptchaTurnstile to CaptchaTypeTurnstile to
match the existing CaptchaType* naming pattern; update the declaration where
CaptchaTurnstile is defined and replace all usages/references of
CaptchaTurnstile throughout the codebase with CaptchaTypeTurnstile (search for
the symbol name) so callers like any switch/case, comparisons, or JSON
serialization referencing CaptchaTurnstile use the new CaptchaTypeTurnstile
identifier.
- Around line 309-339: detectByIframe currently scans the entire form HTML for
both "iframe" and patterns, causing false positives; change it to iterate iframe
elements via form.Find("iframe") (or Selection.Each) and inspect each iframe's
src (and possibly data attributes) by lowercasing the src and checking against
iframePatterns for a match; on the first match return the corresponding
CaptchaType (otherwise return CaptchaTypeNone). Ensure you reference
detectByIframe, iframePatterns, and use form.Find("iframe")/Each and
strings.Contains(srcLower, pattern) so the detection only triggers when the
pattern is actually in an iframe's src.
- Around line 86-184: The map of regexes (scriptPatterns) is being built with
regexp.MustCompile inside detectByScriptDomain, causing re-compilation on every
call; move the map to a package-level var (e.g., var scriptPatterns =
map[CaptchaType][]*regexp.Regexp{...}) so all regexp.MustCompile calls run once
at init, then update detectByScriptDomain to reference that package-level
scriptPatterns; ensure you keep the same CaptchaType keys (CaptchaTypeRecaptcha,
CaptchaTypeHCaptcha, CaptchaTurnstile, etc.) and no other logic changes.
In `@classifier/classifier.go`:
- Around line 67-72: The current check excludes a field if the "captcha" key
exists in the probs map; instead determine the most likely label and only
exclude when "captcha" is the argmax. Replace the presence check around
thresholdMap(probs, threshold) with logic that iterates probs to find the label
with the highest probability (e.g., compute maxName/maxProb from probs) and only
skip adding to result.Fields[name] when maxName == "captcha"; otherwise
thresholdMap(probs, threshold) and assign as before to result.Fields[name].
In `@dit_captcha_test.go`:
- Line 638: The test variable name deplyRegion is misspelled; rename the
identifier to deployRegion in the declaration and update every reference to it
(assignments, uses in functions, struct fields, and assertions) so the code
compiles and the intent is clear—search for deplyRegion in dit_captcha_test.go
and replace it with deployRegion, preserving the original capitalization and
scope where used.
🧹 Nitpick comments (4)
classifier/captcha.go (2)
39-39:CaptchaTypeCoingeckowith value"wsiz"is confusing.The constant name references "Coingecko" but the value is
"wsiz". This naming is opaque to consumers. Consider renaming the constant to match what it represents, or adding a comment explaining the relationship.
466-504:IsValidCaptchaTypemanually maintains a list that can drift from the const block.If a new
CaptchaTypeconstant is added, this function must be updated separately. Consider using a map or generating from the const block to avoid maintenance drift.♻️ Proposed fix
+var validCaptchaTypes = map[CaptchaType]struct{}{ + CaptchaTypeNone: {}, CaptchaTypeRecaptcha: {}, CaptchaTypeRecaptchaV2: {}, + // ... all types ... + CaptchaTypeOther: {}, +} + func IsValidCaptchaType(s string) bool { - validTypes := []CaptchaType{...} - for _, t := range validTypes { - if CaptchaType(s) == t { - return true - } - } - return false + _, ok := validCaptchaTypes[CaptchaType(s)] + return ok }classifier/captcha_test.go (1)
245-270: Weak assertion inTestDetectGenericCaptchaIframe.The test only verifies the result is not recaptcha or hcaptcha, but doesn't assert the expected value (
CaptchaTypeOther). This allows silent regressions where the function returnsCaptchaTypeNoneor any other unexpected type.♻️ Proposed fix
- if result == CaptchaTypeRecaptcha || result == CaptchaTypeHCaptcha { - t.Errorf("expected generic/none, got %v", result) - } + if result != CaptchaTypeOther { + t.Errorf("expected other (generic captcha), got %v", result) + }dit_captcha_test.go (1)
7-137: Tests only validate struct construction, not actual detection or classification.All tests in this file construct
FormResult/PageResultstructs manually and then verify the fields match what was just set. They don't invoke any classifier or CAPTCHA detector, so they don't validate actual behavior. Consider adding integration tests that runExtractPageTypeon HTML containing CAPTCHAs and verify the resultingCaptchafield is populated correctly.
| func TestDetectMultipleCaptchasInOneForm(t *testing.T) { | ||
| // This form has both recaptcha and hcaptcha (unusual but possible) | ||
| html := ` | ||
| <form method="POST" action="/login"> | ||
| <input type="email" name="email" /> | ||
| <div class="g-recaptcha" data-sitekey="6LdpXXXXXXXXXXXXXXXXXXXX"></div> | ||
| <div class="h-captcha" data-sitekey="10000000-ffff-ffff-ffff-000000000001"></div> | ||
| <input type="submit" value="Login" /> | ||
| </form> | ||
| ` | ||
| doc, err := htmlutil.LoadHTMLString(html) | ||
| if err != nil { | ||
| t.Fatal(err) | ||
| } | ||
|
|
||
| forms := htmlutil.GetForms(doc) | ||
| if len(forms) == 0 { | ||
| t.Fatal("expected to find form") | ||
| } | ||
|
|
||
| detector := &CaptchaDetector{} | ||
| result := detector.DetectInForm(forms[0]) | ||
|
|
||
| // Should detect the first CAPTCHA found (recaptcha comes first) | ||
| if result != CaptchaTypeRecaptcha && result != CaptchaTypeHCaptcha { | ||
| t.Errorf("expected recaptcha or hcaptcha, got %v", result) | ||
| } | ||
| } |
There was a problem hiding this comment.
Test assertion is too weak to verify detection priority.
The comment says "recaptcha comes first" but the assertion accepts either CaptchaTypeRecaptcha or CaptchaTypeHCaptcha. Due to Go's nondeterministic map iteration in the detection functions, this test can pass regardless of which is returned — it doesn't actually verify the intended "first match wins" behavior. Either tighten the assertion once priority is guaranteed (via ordered detection), or acknowledge in the comment that the result is intentionally nondeterministic.
🤖 Prompt for AI Agents
In `@classifier/captcha_test.go` around lines 452 - 479, The test
TestDetectMultipleCaptchasInOneForm claims "recaptcha comes first" but permits
either result; make behavior deterministic by enforcing ordered detection in
CaptchaDetector.DetectInForm (scan the form DOM in document order and check for
recaptcha before checking for hcaptcha, using the same traversal used by
htmlutil.GetForms) and then tighten the test to assert result ==
CaptchaTypeRecaptcha (replace the or check with a single equality assertion);
reference DetectInForm, CaptchaDetector, CaptchaTypeRecaptcha and
CaptchaTypeHCaptcha when making the changes.
classifier/captcha.go
Outdated
| CaptchaTypeRecaptchaV2 CaptchaType = "recaptchav2" | ||
| CaptchaTypeRecaptchaInvisible CaptchaType = "recaptcha-invisible" | ||
| CaptchaTypeHCaptcha CaptchaType = "hcaptcha" | ||
| CaptchaTurnstile CaptchaType = "turnstile" |
There was a problem hiding this comment.
Naming inconsistency: CaptchaTurnstile is missing the Type infix.
All other constants follow the CaptchaType* pattern. This should be CaptchaTypeTurnstile for consistency.
♻️ Proposed fix
- CaptchaTurnstile CaptchaType = "turnstile"
+ CaptchaTypeTurnstile CaptchaType = "turnstile"Then update all references from CaptchaTurnstile to CaptchaTypeTurnstile.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| CaptchaTurnstile CaptchaType = "turnstile" | |
| CaptchaTypeTurnstile CaptchaType = "turnstile" |
🤖 Prompt for AI Agents
In `@classifier/captcha.go` at line 20, Rename the enum constant CaptchaTurnstile
to CaptchaTypeTurnstile to match the existing CaptchaType* naming pattern;
update the declaration where CaptchaTurnstile is defined and replace all
usages/references of CaptchaTurnstile throughout the codebase with
CaptchaTypeTurnstile (search for the symbol name) so callers like any
switch/case, comparisons, or JSON serialization referencing CaptchaTurnstile use
the new CaptchaTypeTurnstile identifier.
classifier/captcha.go
Outdated
| scriptPatterns := map[CaptchaType][]*regexp.Regexp{ | ||
| CaptchaTypeRecaptcha: { | ||
| regexp.MustCompile(`google\.com/recaptcha`), | ||
| regexp.MustCompile(`recaptcha.*\.js`), | ||
| regexp.MustCompile(`gstatic\.com/.*recaptcha`), | ||
| }, | ||
| CaptchaTypeRecaptchaV2: { | ||
| regexp.MustCompile(`recaptcha.*v2`), | ||
| regexp.MustCompile(`recaptcha/api\.js`), | ||
| }, | ||
| CaptchaTypeRecaptchaInvisible: { | ||
| regexp.MustCompile(`recaptcha.*invisible`), | ||
| regexp.MustCompile(`grecaptcha\.render.*invisible`), | ||
| }, | ||
| CaptchaTypeHCaptcha: { | ||
| regexp.MustCompile(`js\.hcaptcha\.com`), | ||
| regexp.MustCompile(`hcaptcha`), | ||
| }, | ||
| CaptchaTurnstile: { | ||
| regexp.MustCompile(`challenges\.cloudflare\.com`), | ||
| regexp.MustCompile(`js\.cloudflare\.com.*turnstile`), | ||
| }, | ||
| CaptchaTypeGeetest: { | ||
| regexp.MustCompile(`geetest`), | ||
| regexp.MustCompile(`api\.geetest\.com`), | ||
| }, | ||
| CaptchaTypeFriendlyCaptcha: { | ||
| regexp.MustCompile(`friendlycaptcha`), | ||
| regexp.MustCompile(`cdn\.friendlycaptcha\.com`), | ||
| }, | ||
| CaptchaTypeRotateCaptcha: { | ||
| regexp.MustCompile(`api\.rotatecaptcha\.com`), | ||
| }, | ||
| CaptchaTypeClickCaptcha: { | ||
| regexp.MustCompile(`assets\.clickcaptcha\.com`), | ||
| }, | ||
| CaptchaTypeImageCaptcha: { | ||
| regexp.MustCompile(`api\.imagecaptcha\.com`), | ||
| }, | ||
| CaptchaTypePuzzleCaptcha: { | ||
| regexp.MustCompile(`puzzle.*captcha`), | ||
| }, | ||
| CaptchaTypeSliderCaptcha: { | ||
| regexp.MustCompile(`slider.*captcha`), | ||
| regexp.MustCompile(`api\.slidercaptcha\.com`), | ||
| regexp.MustCompile(`slidercaptcha\.com`), | ||
| }, | ||
| CaptchaTypeDatadome: { | ||
| regexp.MustCompile(`datadome\.co`), | ||
| regexp.MustCompile(`cdn\.mxpnl\.com`), | ||
| }, | ||
| CaptchaTypePerimeterX: { | ||
| regexp.MustCompile(`perimeterx\.net`), | ||
| }, | ||
| CaptchaTypeArgon: { | ||
| regexp.MustCompile(`argon.*captcha`), | ||
| regexp.MustCompile(`captcha\.argon`), | ||
| }, | ||
| CaptchaTypeBehaviotech: { | ||
| regexp.MustCompile(`behaviotech\.com`), | ||
| }, | ||
| CaptchaTypeSmartCaptcha: { | ||
| regexp.MustCompile(`captcha\.yandex\.com`), | ||
| regexp.MustCompile(`smartcaptcha\.yandex`), | ||
| }, | ||
| CaptchaTypeYandex: { | ||
| regexp.MustCompile(`yandex\.com/.*captcha`), | ||
| regexp.MustCompile(`captcha\.yandex`), | ||
| regexp.MustCompile(`smartcaptcha\.yandex`), | ||
| }, | ||
| CaptchaTypeFuncaptcha: { | ||
| regexp.MustCompile(`funcaptcha\.com`), | ||
| regexp.MustCompile(`api\.funcaptcha\.com`), | ||
| }, | ||
| CaptchaTypeCoingecko: { | ||
| regexp.MustCompile(`wsiz\.com`), | ||
| }, | ||
| CaptchaTypeNovaScape: { | ||
| regexp.MustCompile(`novascape\.com`), | ||
| }, | ||
| CaptchaTypeMCaptcha: { | ||
| regexp.MustCompile(`mcaptcha`), | ||
| regexp.MustCompile(`app\.mcaptcha\.io`), | ||
| }, | ||
| CaptchaTypeKasada: { | ||
| regexp.MustCompile(`kasada`), | ||
| regexp.MustCompile(`kas\.kasadaproducts\.com`), | ||
| }, | ||
| CaptchaTypeImperva: { | ||
| regexp.MustCompile(`/_Incapsula_Resource`), | ||
| regexp.MustCompile(`incapsula`), | ||
| regexp.MustCompile(`imperva`), | ||
| }, | ||
| CaptchaTypeAwsWaf: { | ||
| regexp.MustCompile(`/aws-waf-captcha/`), | ||
| regexp.MustCompile(`awswaf\.com`), | ||
| regexp.MustCompile(`captcha\.aws\.amazon\.com`), | ||
| }, | ||
| } |
There was a problem hiding this comment.
Regexes recompiled on every call to detectByScriptDomain.
regexp.MustCompile is called inside the function body, meaning every invocation of detectByScriptDomain (once per form) allocates and compiles ~40+ regexes. Move these to package-level var so they're compiled once at init time.
♻️ Proposed fix (outline)
+var scriptPatterns = map[CaptchaType][]*regexp.Regexp{
+ CaptchaTypeRecaptcha: {
+ regexp.MustCompile(`google\.com/recaptcha`),
+ regexp.MustCompile(`recaptcha.*\.js`),
+ regexp.MustCompile(`gstatic\.com/.*recaptcha`),
+ },
+ // ... all other entries ...
+}
+
func detectByScriptDomain(form *goquery.Selection) CaptchaType {
- scriptPatterns := map[CaptchaType][]*regexp.Regexp{
- ...
- }
var scriptSrcs []string📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| scriptPatterns := map[CaptchaType][]*regexp.Regexp{ | |
| CaptchaTypeRecaptcha: { | |
| regexp.MustCompile(`google\.com/recaptcha`), | |
| regexp.MustCompile(`recaptcha.*\.js`), | |
| regexp.MustCompile(`gstatic\.com/.*recaptcha`), | |
| }, | |
| CaptchaTypeRecaptchaV2: { | |
| regexp.MustCompile(`recaptcha.*v2`), | |
| regexp.MustCompile(`recaptcha/api\.js`), | |
| }, | |
| CaptchaTypeRecaptchaInvisible: { | |
| regexp.MustCompile(`recaptcha.*invisible`), | |
| regexp.MustCompile(`grecaptcha\.render.*invisible`), | |
| }, | |
| CaptchaTypeHCaptcha: { | |
| regexp.MustCompile(`js\.hcaptcha\.com`), | |
| regexp.MustCompile(`hcaptcha`), | |
| }, | |
| CaptchaTurnstile: { | |
| regexp.MustCompile(`challenges\.cloudflare\.com`), | |
| regexp.MustCompile(`js\.cloudflare\.com.*turnstile`), | |
| }, | |
| CaptchaTypeGeetest: { | |
| regexp.MustCompile(`geetest`), | |
| regexp.MustCompile(`api\.geetest\.com`), | |
| }, | |
| CaptchaTypeFriendlyCaptcha: { | |
| regexp.MustCompile(`friendlycaptcha`), | |
| regexp.MustCompile(`cdn\.friendlycaptcha\.com`), | |
| }, | |
| CaptchaTypeRotateCaptcha: { | |
| regexp.MustCompile(`api\.rotatecaptcha\.com`), | |
| }, | |
| CaptchaTypeClickCaptcha: { | |
| regexp.MustCompile(`assets\.clickcaptcha\.com`), | |
| }, | |
| CaptchaTypeImageCaptcha: { | |
| regexp.MustCompile(`api\.imagecaptcha\.com`), | |
| }, | |
| CaptchaTypePuzzleCaptcha: { | |
| regexp.MustCompile(`puzzle.*captcha`), | |
| }, | |
| CaptchaTypeSliderCaptcha: { | |
| regexp.MustCompile(`slider.*captcha`), | |
| regexp.MustCompile(`api\.slidercaptcha\.com`), | |
| regexp.MustCompile(`slidercaptcha\.com`), | |
| }, | |
| CaptchaTypeDatadome: { | |
| regexp.MustCompile(`datadome\.co`), | |
| regexp.MustCompile(`cdn\.mxpnl\.com`), | |
| }, | |
| CaptchaTypePerimeterX: { | |
| regexp.MustCompile(`perimeterx\.net`), | |
| }, | |
| CaptchaTypeArgon: { | |
| regexp.MustCompile(`argon.*captcha`), | |
| regexp.MustCompile(`captcha\.argon`), | |
| }, | |
| CaptchaTypeBehaviotech: { | |
| regexp.MustCompile(`behaviotech\.com`), | |
| }, | |
| CaptchaTypeSmartCaptcha: { | |
| regexp.MustCompile(`captcha\.yandex\.com`), | |
| regexp.MustCompile(`smartcaptcha\.yandex`), | |
| }, | |
| CaptchaTypeYandex: { | |
| regexp.MustCompile(`yandex\.com/.*captcha`), | |
| regexp.MustCompile(`captcha\.yandex`), | |
| regexp.MustCompile(`smartcaptcha\.yandex`), | |
| }, | |
| CaptchaTypeFuncaptcha: { | |
| regexp.MustCompile(`funcaptcha\.com`), | |
| regexp.MustCompile(`api\.funcaptcha\.com`), | |
| }, | |
| CaptchaTypeCoingecko: { | |
| regexp.MustCompile(`wsiz\.com`), | |
| }, | |
| CaptchaTypeNovaScape: { | |
| regexp.MustCompile(`novascape\.com`), | |
| }, | |
| CaptchaTypeMCaptcha: { | |
| regexp.MustCompile(`mcaptcha`), | |
| regexp.MustCompile(`app\.mcaptcha\.io`), | |
| }, | |
| CaptchaTypeKasada: { | |
| regexp.MustCompile(`kasada`), | |
| regexp.MustCompile(`kas\.kasadaproducts\.com`), | |
| }, | |
| CaptchaTypeImperva: { | |
| regexp.MustCompile(`/_Incapsula_Resource`), | |
| regexp.MustCompile(`incapsula`), | |
| regexp.MustCompile(`imperva`), | |
| }, | |
| CaptchaTypeAwsWaf: { | |
| regexp.MustCompile(`/aws-waf-captcha/`), | |
| regexp.MustCompile(`awswaf\.com`), | |
| regexp.MustCompile(`captcha\.aws\.amazon\.com`), | |
| }, | |
| } | |
| var scriptPatterns = map[CaptchaType][]*regexp.Regexp{ | |
| CaptchaTypeRecaptcha: { | |
| regexp.MustCompile(`google\.com/recaptcha`), | |
| regexp.MustCompile(`recaptcha.*\.js`), | |
| regexp.MustCompile(`gstatic\.com/.*recaptcha`), | |
| }, | |
| CaptchaTypeRecaptchaV2: { | |
| regexp.MustCompile(`recaptcha.*v2`), | |
| regexp.MustCompile(`recaptcha/api\.js`), | |
| }, | |
| CaptchaTypeRecaptchaInvisible: { | |
| regexp.MustCompile(`recaptcha.*invisible`), | |
| regexp.MustCompile(`grecaptcha\.render.*invisible`), | |
| }, | |
| CaptchaTypeHCaptcha: { | |
| regexp.MustCompile(`js\.hcaptcha\.com`), | |
| regexp.MustCompile(`hcaptcha`), | |
| }, | |
| CaptchaTurnstile: { | |
| regexp.MustCompile(`challenges\.cloudflare\.com`), | |
| regexp.MustCompile(`js\.cloudflare\.com.*turnstile`), | |
| }, | |
| CaptchaTypeGeetest: { | |
| regexp.MustCompile(`geetest`), | |
| regexp.MustCompile(`api\.geetest\.com`), | |
| }, | |
| CaptchaTypeFriendlyCaptcha: { | |
| regexp.MustCompile(`friendlycaptcha`), | |
| regexp.MustCompile(`cdn\.friendlycaptcha\.com`), | |
| }, | |
| CaptchaTypeRotateCaptcha: { | |
| regexp.MustCompile(`api\.rotatecaptcha\.com`), | |
| }, | |
| CaptchaTypeClickCaptcha: { | |
| regexp.MustCompile(`assets\.clickcaptcha\.com`), | |
| }, | |
| CaptchaTypeImageCaptcha: { | |
| regexp.MustCompile(`api\.imagecaptcha\.com`), | |
| }, | |
| CaptchaTypePuzzleCaptcha: { | |
| regexp.MustCompile(`puzzle.*captcha`), | |
| }, | |
| CaptchaTypeSliderCaptcha: { | |
| regexp.MustCompile(`slider.*captcha`), | |
| regexp.MustCompile(`api\.slidercaptcha\.com`), | |
| regexp.MustCompile(`slidercaptcha\.com`), | |
| }, | |
| CaptchaTypeDatadome: { | |
| regexp.MustCompile(`datadome\.co`), | |
| regexp.MustCompile(`cdn\.mxpnl\.com`), | |
| }, | |
| CaptchaTypePerimeterX: { | |
| regexp.MustCompile(`perimeterx\.net`), | |
| }, | |
| CaptchaTypeArgon: { | |
| regexp.MustCompile(`argon.*captcha`), | |
| regexp.MustCompile(`captcha\.argon`), | |
| }, | |
| CaptchaTypeBehaviotech: { | |
| regexp.MustCompile(`behaviotech\.com`), | |
| }, | |
| CaptchaTypeSmartCaptcha: { | |
| regexp.MustCompile(`captcha\.yandex\.com`), | |
| regexp.MustCompile(`smartcaptcha\.yandex`), | |
| }, | |
| CaptchaTypeYandex: { | |
| regexp.MustCompile(`yandex\.com/.*captcha`), | |
| regexp.MustCompile(`captcha\.yandex`), | |
| regexp.MustCompile(`smartcaptcha\.yandex`), | |
| }, | |
| CaptchaTypeFuncaptcha: { | |
| regexp.MustCompile(`funcaptcha\.com`), | |
| regexp.MustCompile(`api\.funcaptcha\.com`), | |
| }, | |
| CaptchaTypeCoingecko: { | |
| regexp.MustCompile(`wsiz\.com`), | |
| }, | |
| CaptchaTypeNovaScape: { | |
| regexp.MustCompile(`novascape\.com`), | |
| }, | |
| CaptchaTypeMCaptcha: { | |
| regexp.MustCompile(`mcaptcha`), | |
| regexp.MustCompile(`app\.mcaptcha\.io`), | |
| }, | |
| CaptchaTypeKasada: { | |
| regexp.MustCompile(`kasada`), | |
| regexp.MustCompile(`kas\.kasadaproducts\.com`), | |
| }, | |
| CaptchaTypeImperva: { | |
| regexp.MustCompile(`/_Incapsula_Resource`), | |
| regexp.MustCompile(`incapsula`), | |
| regexp.MustCompile(`imperva`), | |
| }, | |
| CaptchaTypeAwsWaf: { | |
| regexp.MustCompile(`/aws-waf-captcha/`), | |
| regexp.MustCompile(`awswaf\.com`), | |
| regexp.MustCompile(`captcha\.aws\.amazon\.com`), | |
| }, | |
| } |
🤖 Prompt for AI Agents
In `@classifier/captcha.go` around lines 86 - 184, The map of regexes
(scriptPatterns) is being built with regexp.MustCompile inside
detectByScriptDomain, causing re-compilation on every call; move the map to a
package-level var (e.g., var scriptPatterns =
map[CaptchaType][]*regexp.Regexp{...}) so all regexp.MustCompile calls run once
at init, then update detectByScriptDomain to reference that package-level
scriptPatterns; ensure you keep the same CaptchaType keys (CaptchaTypeRecaptcha,
CaptchaTypeHCaptcha, CaptchaTurnstile, etc.) and no other logic changes.
classifier/captcha.go
Outdated
| CaptchaTypeSmartCaptcha: { | ||
| regexp.MustCompile(`captcha\.yandex\.com`), | ||
| regexp.MustCompile(`smartcaptcha\.yandex`), | ||
| }, | ||
| CaptchaTypeYandex: { | ||
| regexp.MustCompile(`yandex\.com/.*captcha`), | ||
| regexp.MustCompile(`captcha\.yandex`), | ||
| regexp.MustCompile(`smartcaptcha\.yandex`), | ||
| }, |
There was a problem hiding this comment.
Overlapping patterns between CaptchaTypeSmartCaptcha and CaptchaTypeYandex with nondeterministic map iteration.
Both types share the pattern smartcaptcha.yandex (and captcha.yandex prefix). Since Go map iteration order is random, which type is returned for Yandex-related scripts is unpredictable. This applies broadly to all detection functions that iterate map[CaptchaType][]string — when multiple types can match, the result is nondeterministic.
Consider using an ordered slice of (CaptchaType, patterns) pairs instead of a map to enforce priority, or deduplicate the overlapping types.
🤖 Prompt for AI Agents
In `@classifier/captcha.go` around lines 147 - 155, The patterns for
CaptchaTypeSmartCaptcha and CaptchaTypeYandex overlap (e.g.,
`smartcaptcha.yandex`) and because detection uses an iteration over a
map[CaptchaType][]*regexp.Regexp the returned type is nondeterministic; replace
the map with an ordered slice of pairs (e.g., a []struct{ Type CaptchaType;
Patterns []*regexp.Regexp }) or otherwise enforce an explicit priority order and
update all detection functions that currently iterate the map to iterate this
ordered slice so Yandex-related patterns are matched deterministically (or
deduplicate/adjust patterns so they no longer overlap).
classifier/captcha.go
Outdated
| // Check for parent scripts too | ||
| form.Parents().First().Find("script").Each(func(_ int, s *goquery.Selection) { | ||
| if src, ok := s.Attr("src"); ok { | ||
| scriptSrcs = append(scriptSrcs, strings.ToLower(src)) | ||
| } | ||
| }) |
There was a problem hiding this comment.
form.Parents().First().Find("script") may search far too broadly.
This traverses up to the first parent and finds all scripts within it. Depending on document structure, this could be the <body> or even <html>, picking up scripts completely unrelated to the form. This could cause false positives (detecting a CAPTCHA loaded for a different form or purpose).
🤖 Prompt for AI Agents
In `@classifier/captcha.go` around lines 193 - 198, The current traversal using
form.Parents().First().Find("script") is too broad and can capture unrelated
scripts; limit the search to the form itself and its immediate container instead
(e.g., use form.Find("script") plus form.Parent().Find("script")) so you only
collect nearby scripts into scriptSrcs, and keep using s.Attr("src") to append
lowercased sources; this reduces false positives from scripts elsewhere in the
document.
classifier/captcha.go
Outdated
| func detectByClasses(form *goquery.Selection) CaptchaType { | ||
| classPatterns := map[CaptchaType][]string{ | ||
| CaptchaTypeRecaptcha: {"g-recaptcha", "grecaptcha"}, | ||
| CaptchaTypeRecaptchaV2: {"g-recaptcha-v2", "grecaptcha-v2"}, | ||
| CaptchaTypeRecaptchaInvisible: {"g-recaptcha-invisible", "grecaptcha-invisible"}, | ||
| CaptchaTypeHCaptcha: {"h-captcha", "hcaptcha"}, | ||
| CaptchaTurnstile: {"cf-turnstile", "cloudflare-turnstile-challenge", "turnstile"}, | ||
| CaptchaTypeGeetest: {"geetest_", "geetest-box"}, | ||
| CaptchaTypeFriendlyCaptcha: {"frc-captcha", "friendlycaptcha"}, | ||
| CaptchaTypeRotateCaptcha: {"rotate-captcha", "rotatecaptcha"}, | ||
| CaptchaTypeClickCaptcha: {"click-captcha", "clickcaptcha"}, | ||
| CaptchaTypeImageCaptcha: {"image-captcha", "imagecaptcha"}, | ||
| CaptchaTypePuzzleCaptcha: {"puzzle-captcha", "__puzzle_captcha"}, | ||
| CaptchaTypeSliderCaptcha: {"slider-captcha", "slidercaptcha", "slide-verify"}, | ||
| CaptchaTypeDatadome: {"dd-challenge", "dd-top"}, | ||
| CaptchaTypePerimeterX: {"_px3", "px-container"}, | ||
| CaptchaTypeArgon: {"argon-captcha"}, | ||
| CaptchaTypeSmartCaptcha: {"smart-captcha"}, | ||
| CaptchaTypeYandex: {"smartcaptcha", "yandex-captcha"}, | ||
| CaptchaTypeFuncaptcha: {"funcaptcha-container"}, | ||
| CaptchaTypeMCaptcha: {"mcaptcha", "mcaptcha-container"}, | ||
| CaptchaTypeKasada: {"kas", "kasada"}, | ||
| CaptchaTypeImperva: {"_inc", "incapsula", "imperva"}, | ||
| CaptchaTypeAwsWaf: {"aws-waf", "awswaf"}, | ||
| } | ||
|
|
||
| html, _ := form.Html() | ||
| htmlLower := strings.ToLower(html) | ||
|
|
||
| for captchaType, classes := range classPatterns { | ||
| for _, class := range classes { | ||
| if strings.Contains(htmlLower, class) { | ||
| return captchaType | ||
| } | ||
| } | ||
| } | ||
|
|
||
| return CaptchaTypeNone | ||
| } |
There was a problem hiding this comment.
Overly short class patterns risk false positives.
Patterns like "kas" (Line 290), "_inc" (Line 291), and "_px3" (Line 284) are very short substrings searched across the entire HTML. These can easily match unrelated content (e.g., a CSS class like "tasks" contains "kas", or "_include" contains "_inc"). Consider making these patterns more specific or matching against actual class attributes rather than raw HTML.
🤖 Prompt for AI Agents
In `@classifier/captcha.go` around lines 269 - 307, The detectByClasses function
uses overly short substrings in classPatterns and searches the entire HTML
(htmlLower), causing false positives; change detection to read the form's
"class" attribute(s) and match against actual class tokens (split on whitespace)
or use stricter regex/word-boundary checks for each pattern (e.g., replace "kas"
with "kas-" or "^kas$" style matches, change "_inc" and "_px3" to more specific
tokens) inside detectByClasses so you only return a CaptchaType when an actual
class name matches the more specific pattern; update the classPatterns entries
accordingly and adjust the loop to inspect class attributes via goquery (e.g.,
form.Attr("class") or iterating child elements' class attrs) rather than
searching htmlLower.
classifier/captcha.go
Outdated
| // detectByIframe checks for CAPTCHA-specific iframes | ||
| func detectByIframe(form *goquery.Selection) CaptchaType { | ||
| iframePatterns := map[CaptchaType][]string{ | ||
| CaptchaTypeRecaptcha: {"recaptcha"}, | ||
| CaptchaTypeHCaptcha: {"hcaptcha"}, | ||
| CaptchaTurnstile: {"challenges.cloudflare.com"}, | ||
| CaptchaTypeGeetest: {"geetest"}, | ||
| CaptchaTypeSliderCaptcha: {"slidercaptcha", "slide-verify"}, | ||
| CaptchaTypeMCaptcha: {"mcaptcha", "app.mcaptcha.io"}, | ||
| CaptchaTypeYandex: {"yandex", "smartcaptcha"}, | ||
| CaptchaTypeKasada: {"kasada", "kas"}, | ||
| CaptchaTypeImperva: {"incapsula", "imperva"}, | ||
| CaptchaTypeDatadome: {"datadome"}, | ||
| } | ||
|
|
||
| html, _ := form.Html() | ||
| htmlLower := strings.ToLower(html) | ||
|
|
||
| // Check for iframe with CAPTCHA patterns in raw HTML | ||
| if strings.Contains(htmlLower, "iframe") { | ||
| for captchaType, patterns := range iframePatterns { | ||
| for _, pattern := range patterns { | ||
| if strings.Contains(htmlLower, pattern) && strings.Contains(htmlLower, "iframe") { | ||
| return captchaType | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| return CaptchaTypeNone | ||
| } |
There was a problem hiding this comment.
detectByIframe matches patterns anywhere in HTML, not within iframe elements.
Line 331 checks if htmlLower contains both "iframe" and a pattern like "recaptcha", but doesn't verify the pattern appears inside the iframe's src attribute. Any form that happens to contain an unrelated iframe and the word "recaptcha" elsewhere (e.g., in a comment or label) will be falsely detected.
Consider using goquery to select iframe elements and inspect their src attribute specifically, similar to how detectByScriptDomain inspects script[src].
♻️ Proposed fix
func detectByIframe(form *goquery.Selection) CaptchaType {
iframePatterns := map[CaptchaType][]string{
// ...same patterns...
}
- html, _ := form.Html()
- htmlLower := strings.ToLower(html)
-
- if strings.Contains(htmlLower, "iframe") {
- for captchaType, patterns := range iframePatterns {
- for _, pattern := range patterns {
- if strings.Contains(htmlLower, pattern) && strings.Contains(htmlLower, "iframe") {
- return captchaType
- }
+ var iframeSrcs []string
+ form.Find("iframe").Each(func(_ int, s *goquery.Selection) {
+ if src, ok := s.Attr("src"); ok {
+ iframeSrcs = append(iframeSrcs, strings.ToLower(src))
+ }
+ })
+
+ for captchaType, patterns := range iframePatterns {
+ for _, src := range iframeSrcs {
+ for _, pattern := range patterns {
+ if strings.Contains(src, pattern) {
+ return captchaType
+ }
}
}
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // detectByIframe checks for CAPTCHA-specific iframes | |
| func detectByIframe(form *goquery.Selection) CaptchaType { | |
| iframePatterns := map[CaptchaType][]string{ | |
| CaptchaTypeRecaptcha: {"recaptcha"}, | |
| CaptchaTypeHCaptcha: {"hcaptcha"}, | |
| CaptchaTurnstile: {"challenges.cloudflare.com"}, | |
| CaptchaTypeGeetest: {"geetest"}, | |
| CaptchaTypeSliderCaptcha: {"slidercaptcha", "slide-verify"}, | |
| CaptchaTypeMCaptcha: {"mcaptcha", "app.mcaptcha.io"}, | |
| CaptchaTypeYandex: {"yandex", "smartcaptcha"}, | |
| CaptchaTypeKasada: {"kasada", "kas"}, | |
| CaptchaTypeImperva: {"incapsula", "imperva"}, | |
| CaptchaTypeDatadome: {"datadome"}, | |
| } | |
| html, _ := form.Html() | |
| htmlLower := strings.ToLower(html) | |
| // Check for iframe with CAPTCHA patterns in raw HTML | |
| if strings.Contains(htmlLower, "iframe") { | |
| for captchaType, patterns := range iframePatterns { | |
| for _, pattern := range patterns { | |
| if strings.Contains(htmlLower, pattern) && strings.Contains(htmlLower, "iframe") { | |
| return captchaType | |
| } | |
| } | |
| } | |
| } | |
| return CaptchaTypeNone | |
| } | |
| // detectByIframe checks for CAPTCHA-specific iframes | |
| func detectByIframe(form *goquery.Selection) CaptchaType { | |
| iframePatterns := map[CaptchaType][]string{ | |
| CaptchaTypeRecaptcha: {"recaptcha"}, | |
| CaptchaTypeHCaptcha: {"hcaptcha"}, | |
| CaptchaTurnstile: {"challenges.cloudflare.com"}, | |
| CaptchaTypeGeetest: {"geetest"}, | |
| CaptchaTypeSliderCaptcha: {"slidercaptcha", "slide-verify"}, | |
| CaptchaTypeMCaptcha: {"mcaptcha", "app.mcaptcha.io"}, | |
| CaptchaTypeYandex: {"yandex", "smartcaptcha"}, | |
| CaptchaTypeKasada: {"kasada", "kas"}, | |
| CaptchaTypeImperva: {"incapsula", "imperva"}, | |
| CaptchaTypeDatadome: {"datadome"}, | |
| } | |
| var iframeSrcs []string | |
| form.Find("iframe").Each(func(_ int, s *goquery.Selection) { | |
| if src, ok := s.Attr("src"); ok { | |
| iframeSrcs = append(iframeSrcs, strings.ToLower(src)) | |
| } | |
| }) | |
| for captchaType, patterns := range iframePatterns { | |
| for _, src := range iframeSrcs { | |
| for _, pattern := range patterns { | |
| if strings.Contains(src, pattern) { | |
| return captchaType | |
| } | |
| } | |
| } | |
| } | |
| return CaptchaTypeNone | |
| } |
🤖 Prompt for AI Agents
In `@classifier/captcha.go` around lines 309 - 339, detectByIframe currently scans
the entire form HTML for both "iframe" and patterns, causing false positives;
change it to iterate iframe elements via form.Find("iframe") (or Selection.Each)
and inspect each iframe's src (and possibly data attributes) by lowercasing the
src and checking against iframePatterns for a match; on the first match return
the corresponding CaptchaType (otherwise return CaptchaTypeNone). Ensure you
reference detectByIframe, iframePatterns, and use form.Find("iframe")/Each and
strings.Contains(srcLower, pattern) so the detection only triggers when the
pattern is actually in an iframe's src.
classifier/captcha.go
Outdated
| func DetectCaptchaInHTML(html string) CaptchaType { | ||
| htmlLower := strings.ToLower(html) | ||
|
|
||
| // Priority 1: Domain-based detection patterns (most reliable) | ||
| domainPatterns := map[CaptchaType][]string{ | ||
| CaptchaTypeRecaptcha: {"google.com/recaptcha", "gstatic.com", "recaptcha"}, | ||
| CaptchaTypeRecaptchaV2: {"recaptcha/api.js", "recaptcha.*v2"}, | ||
| CaptchaTypeRecaptchaInvisible: {"recaptcha.*invisible"}, | ||
| CaptchaTypeHCaptcha: {"hcaptcha", "js.hcaptcha.com"}, | ||
| CaptchaTurnstile: {"challenges.cloudflare.com", "js.cloudflare.com"}, | ||
| CaptchaTypeGeetest: {"geetest", "api.geetest.com"}, | ||
| CaptchaTypeFriendlyCaptcha: {"friendlycaptcha", "cdn.friendlycaptcha.com"}, | ||
| CaptchaTypeRotateCaptcha: {"rotatecaptcha", "api.rotatecaptcha.com"}, | ||
| CaptchaTypeClickCaptcha: {"clickcaptcha", "assets.clickcaptcha.com"}, | ||
| CaptchaTypeImageCaptcha: {"imagecaptcha", "api.imagecaptcha.com"}, | ||
| CaptchaTypePuzzleCaptcha: {"puzzle-captcha", "__puzzle_captcha"}, | ||
| CaptchaTypeSliderCaptcha: {"slider-captcha", "slidercaptcha"}, | ||
| CaptchaTypeMCaptcha: {"mcaptcha", "app.mcaptcha.io"}, | ||
| CaptchaTypeKasada: {"kasada", "kas.kasadaproducts.com"}, | ||
| CaptchaTypeImperva: {"incapsula", "imperva"}, | ||
| CaptchaTypeAwsWaf: {"awswaf", "captcha.aws.amazon.com"}, | ||
| CaptchaTypeDatadome: {"datadome", "dd-challenge"}, | ||
| CaptchaTypePerimeterX: {"perimeterx", "_pxappid"}, | ||
| CaptchaTypeArgon: {"argon-captcha"}, | ||
| CaptchaTypeBehaviotech: {"behaviotech"}, | ||
| CaptchaTypeSmartCaptcha: {"captcha.yandex.com", "smartcaptcha"}, | ||
| CaptchaTypeYandex: {"yandex.com/.*captcha", "yandex.ru/.*captcha", "smartcaptcha.yandex"}, | ||
| CaptchaTypeFuncaptcha: {"funcaptcha", "arkose"}, | ||
| CaptchaTypeCoingecko: {"wsiz.com"}, | ||
| CaptchaTypeNovaScape: {"novascape"}, | ||
| } | ||
|
|
||
| for captchaType, patterns := range domainPatterns { | ||
| for _, pattern := range patterns { | ||
| if strings.Contains(htmlLower, pattern) { | ||
| return captchaType | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Regex-like patterns used with strings.Contains — they will never match.
Several entries in domainPatterns contain regex syntax (e.g., "recaptcha.*v2", "recaptcha.*invisible", "yandex.com/.*captcha") but are matched using strings.Contains, which treats them as literal strings. These patterns will never match real HTML.
Either use regexp for matching or convert these to plain substrings (e.g., "recaptcha/api.js" is already correct as a literal, but "recaptcha.*v2" is not).
🤖 Prompt for AI Agents
In `@classifier/captcha.go` around lines 370 - 408, DetectCaptchaInHTML uses
domainPatterns with regex-like entries (e.g., "recaptcha.*v2",
"recaptcha.*invisible", "yandex.com/.*captcha") but matches them with
strings.Contains, so those will never match; fix by either (A) replacing
regex-like patterns in the domainPatterns map with actual literal substrings
that exist in the HTML (e.g., "recaptcha/api.js" or "recaptcha v2" as
appropriate) or (B) switch matching to regular expressions: import regexp,
precompile each pattern from domainPatterns and use regexp.MatchString (or
compile once per pattern) when iterating in DetectCaptchaInHTML; update the map
keys/values and matching loop accordingly so regex patterns are evaluated
correctly.
dogancanbakir
left a comment
There was a problem hiding this comment.
Recommending different approach
classifier/captcha.go
Outdated
| @@ -0,0 +1,509 @@ | |||
| package classifier | |||
There was a problem hiding this comment.
This should be outside of the classifier, classifier pkg is for ML classifiers. What about a pkg under the name of captcha, outside of classifier?
There was a problem hiding this comment.
Yes, I think this is a much more professional approach. I will implement this.
classifier/classifier.go
Outdated
| Form string `json:"form"` | ||
| Fields map[string]string `json:"fields,omitempty"` | ||
| Form string `json:"form"` | ||
| Captcha string `json:"captcha,omitempty"` |
… improve form-level detection fallback
There was a problem hiding this comment.
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
dit.go (1)
224-254:⚠️ Potential issue | 🟠 MajorPer-form
Captchafield is not populated inExtractPageTypeandExtractPageTypeProba, creating inconsistency withExtractForms.
ExtractForms(lines 162-166) andExtractFormsProba(lines 201-205) populateCaptchafor each form, butExtractPageType(lines 224-230) andExtractPageTypeProba(lines 272-278) leave theCaptchafield empty in the returned forms. A caller will always receive emptyCaptchavalues when usingPageResult.Forms[i].Captcha, unlike callingExtractFormsdirectly.This should either be documented if intentional, or per-form CAPTCHA detection should be added to both functions to match the behavior of
ExtractForms.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dit.go` around lines 224 - 254, ExtractPageType and ExtractPageTypeProba build PageResult.Forms but never set each FormResult.Captcha, causing inconsistency with ExtractForms/ExtractFormsProba; update these functions to populate per-form Captcha the same way as ExtractForms does: for each form result in the loop that constructs forms (the block filling FormResult{Type, Fields}), run the same captcha detection used in ExtractForms (use captcha.CaptchaDetector{}.DetectInForm on each htmlutil.GetForms form or call the shared helper used by ExtractForms/ExtractFormsProba) and set FormResult.Captcha accordingly before returning PageResult; ensure you reference the same detector/path (captcha.CaptchaDetector, DetectInForm, captcha.DetectCaptchaInHTML) so per-form Captcha values are filled consistently.
🧹 Nitpick comments (3)
dit.go (1)
234-244: Error fromLoadHTMLStringis silently ignored.Line 234-235 swallow
docErr. While this is best-effort, the same HTML was already successfully parsed a few lines earlier insideExtractPage(Line 219). If it now fails, something unexpected happened and a log or warning would be appropriate. At minimum, add a comment explaining the intentional silent fallthrough.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dit.go` around lines 234 - 244, The LoadHTMLString error is being silently ignored in the block that calls htmlutil.LoadHTMLString(html) before extracting forms and detecting captchas; update this by either logging the error via the existing logger (or returning it) when docErr != nil, or if the silent fallback is intentional add a clear comment explaining why it's safe to ignore docErr; specifically modify the code around htmlutil.LoadHTMLString(html) and the subsequent htmlutil.GetForms / captcha.CaptchaDetector{}.DetectInForm loop to either emit a warning that includes docErr or to document the rationale for swallowing the error (referencing ExtractPage behavior if relevant).captcha/captcha.go (1)
469-480: Generic markers include provider-specific strings already handled by earlier layers.
"g-recaptcha","h-captcha","turnstile", and"geetest"inhasGenericCaptchaMarkersare redundant since they're detected by Layers 1-2. Only"captcha"is genuinely generic here. Additionally,"captcha"as a bare substring will match any occurrence in the HTML (e.g., a<a href="/captcha-policy">link), which may be too aggressive for a "Layer 7" fallback.Consider trimming to only the truly generic marker and/or requiring it to appear in a meaningful attribute context.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@captcha/captcha.go` around lines 469 - 480, hasGenericCaptchaMarkers currently treats provider-specific strings ("g-recaptcha","h-captcha","turnstile","geetest") as generic and uses a broad substring match on "captcha"; update it to only consider truly generic markers by removing the provider-specific entries from genericMarkers and tighten the "captcha" check to a meaningful context — for example search relevant attributes (e.g., form.Find for input[name|id|class], div[class|id], labels, or the form's action) or use a word-boundary/attribute-aware regex so stray links like "/captcha-policy" don't trigger; locate the hasGenericCaptchaMarkers function and replace the simple strings.Contains(htmlLower, "captcha") logic with an attribute-scoped or word-boundary check.captcha/captcha_test.go (1)
10-696: Consider table-driven tests to reduce boilerplate.Nearly every test follows the identical pattern: define HTML → load → get forms → detect → assert. A single table-driven test would eliminate ~500 lines of duplication and make adding new providers trivial.
Example sketch
func TestDetectInForm(t *testing.T) { tests := []struct { name string html string expected captcha.CaptchaType }{ {"RecaptchaV2 class", `<form>...<div class="g-recaptcha" data-sitekey="..."></div>...</form>`, captcha.CaptchaTypeRecaptcha}, {"HCaptcha class", `<form>...<div class="h-captcha" data-sitekey="..."></div>...</form>`, captcha.CaptchaTypeHCaptcha}, // ... all other cases } for _, tt := range tests { t.Run(tt.name, func(t *testing.T) { doc, err := htmlutil.LoadHTMLString(tt.html) if err != nil { t.Fatal(err) } forms := htmlutil.GetForms(doc) if len(forms) == 0 { t.Fatal("expected to find form") } detector := &captcha.CaptchaDetector{} if got := detector.DetectInForm(forms[0]); got != tt.expected { t.Errorf("expected %v, got %v", tt.expected, got) } }) } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@captcha/captcha_test.go` around lines 10 - 696, Many tests duplicate the same pattern calling htmlutil.LoadHTMLString → htmlutil.GetForms → (&captcha.CaptchaDetector{}).DetectInForm and asserting the result; consolidate these into a single table-driven test (e.g., replace TestDetectRecaptchaV2, TestDetectRecaptchaScript, TestDetectHCaptcha, etc.) by creating tests := []struct{name,string,expected captcha.CaptchaType} with entries for each provider, iterate with t.Run(tt.name, func(t *testing.T){ doc,err := htmlutil.LoadHTMLString(tt.html); forms := htmlutil.GetForms(doc); detector := &captcha.CaptchaDetector{}; if got := detector.DetectInForm(forms[0]); got != tt.expected { t.Errorf(...) } }), and keep complex or multi-form cases (TestDetectCaptchaInMultipleForms, iframe/multiple captchas, captcha.DetectCaptchaInHTML, TestCaptchaTypeString, TestIsValidCaptchaType) as separate tests. Ensure you reference DetectInForm and captcha.DetectCaptchaInHTML when moving cases so each HTML sample maps to the correct assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@captcha/captcha_test.go`:
- Around line 246-271: The test TestDetectGenericCaptchaIframe currently only
asserts the result is not CaptchaTypeRecaptcha/HCaptcha, but should assert that
CaptchaDetector{}.DetectInForm on the form containing an iframe with "captcha"
returns CaptchaTypeOther (since hasGenericCaptchaMarkers should match). Update
the assertion to compare result == captcha.CaptchaTypeOther (or use
t.Fatalf/t.Errorf with expected vs actual) so the test explicitly verifies the
expected CaptchaTypeOther behavior; reference CaptchaDetector.DetectInForm,
hasGenericCaptchaMarkers, and CaptchaTypeOther when locating the logic to
validate.
In `@captcha/captcha.go`:
- Line 39: The CaptchaTypeCoingecko constant maps to the opaque string "wsiz"
which is confusing; update the declaration for CaptchaTypeCoingecko to either
use a more descriptive literal (e.g., "wsiz.com" or "coingecko_wsiz") or add a
concise comment on the same line explaining the mapping (for example: "wsiz.com
is used by Coingecko for bot protection"), and keep the constant name
CaptchaTypeCoingecko unchanged so callers remain stable; edit the
CaptchaTypeCoingecko symbol in captcha.go accordingly.
- Around line 295-334: The detectByClasses function is producing false positives
because it searches the entire form HTML for short substrings; update detection
for the CaptchaTypeKasada, CaptchaTypeImperva, and CaptchaTypeArgon entries (and
any other overly short patterns in the classPatterns slice) to be more specific
by either (a) replacing short fragments with full/kebab names like "kasada",
"incapsula" or "argon-captcha", or (b) change the check to inspect actual class
attributes (e.g., iterate form.Find("[class]"), split class values and match
whole class tokens or use word-boundary regex) instead of strings.Contains on
the whole HTML to avoid matching inside unrelated words.
- Around line 128-232: The regexps in detectByScriptDomain and the
integrationPatterns used by DetectCaptchaInHTML are being (re)compiled on every
call; move all regexp.MustCompile calls out of those functions into
package-level var slices (e.g., precompile a package-level variable like
scriptDomainPatterns []struct{captchaType CaptchaType; patterns
[]*regexp.Regexp} and integrationPatterns []*regexp.Regexp) and initialize them
once (via var init or an init() function), then update detectByScriptDomain and
DetectCaptchaInHTML to reference those precompiled vars instead of constructing
regexps locally.
- Around line 135-143: The regex list for detectByScriptDomain currently places
generic patterns (CaptchaTypeRecaptcha, CaptchaTypeSmartCaptcha) before more
specific subtypes (CaptchaTypeRecaptchaV2, CaptchaTypeYandex), causing
shadowing; update the mapping so the most-specific detectors run first (e.g.,
move CaptchaTypeRecaptchaV2 entries above CaptchaTypeRecaptcha and place
CaptchaTypeYandex entries before CaptchaTypeSmartCaptcha) or remove
redundant/ambiguous patterns from the generic entries; adjust the slice ordering
in the captcha type-to-regex table used by detectByScriptDomain (and the similar
block at the other location referenced) so specific regexps like
`recaptcha/api\.js` and `captcha\.yandex` are evaluated before the broader
`recaptcha.*\.js` and `smartcaptcha\.yandex` patterns.
- Around line 350-368: The code currently mutates idPatterns entries using the
"__matched__" sentinel inside the form.Find(...).Each callback which is brittle;
replace that sentinel with a closure variable (e.g., idResult or
matchedCaptchaType) initialized outside the Each loop, set it to
entry.captchaType when a pattern match is found inside the Each for loop, and
then return early from the Each. After the Each completes, check the closure
variable and return its value if non-empty; do not modify idPatterns. This
mirrors the existing altResult pattern and keeps matching logic local to the
callback.
- Around line 241-246: Add a clarifying comment above the form-scoped script
search in the code (the block using form.Parents().First().Find(...)) stating
that this only checks scripts in the form and its immediate parent (form-scoped
detection) and that page-level scripts in <head> are intentionally excluded
because DetectCaptchaInHTML() handles page-level CAPTCHA detection as a
fallback; place the comment near the existing block so future readers understand
the layered detection approach.
In `@classifier/classifier.go`:
- Around line 63-78: The proba path currently always initializes result.Fields
to an empty map which will serialize as {} (breaking omitempty); instead, avoid
calling make at the top—only allocate result.Fields when you are about to insert
the first thresholdedProbs (or set it back to nil if no non-"captcha" fields
were added); modify the loop around fieldProba so that you create result.Fields
= make(map[string]map[string]float64) right before result.Fields[name] =
thresholdedProbs (or check len(result.Fields) == 0 and set to nil after the
loop), referencing result.Fields, fieldProba, thresholdMap, threshold, and the
"captcha" check to locate the code to change.
- Around line 34-41: When filtering out "captcha" fields you currently always
allocate result.Fields as a non-nil empty map, which causes encoding/json to
emit "fields":{} instead of omitting the key; instead, after calling
c.FieldModel.Classify(form, formType) and iterating, only allocate and assign
result.Fields if you add at least one non-"captcha" entry (leave result.Fields
nil when none found) so that the json "omitempty" behavior preserves backward
compatibility.
In `@dit.go`:
- Around line 144-149: The code redundantly reparses HTML after calling
c.fc.ExtractForms/ExtractFormsProba/ExtractPageType/ExtractPageTypeProba; fix by
avoiding a second htmlutil.LoadHTMLString/htmlutil.GetForms pass — either add a
lower-level classifier method that accepts a *goquery.Document (or returns the
parsed document/forms) so callers can reuse the parsed doc, or move the CAPTCHA
detection into the classifier's ExtractForms/ExtractPage methods so they run
once on the already-parsed document; update calls to use the new classifier API
(referencing c.fc.ExtractForms, ExtractFormsProba, ExtractPageType,
ExtractPageTypeProba and htmlutil.LoadHTMLString/htmlutil.GetForms) and remove
the duplicate parsing logic.
- Around line 232-249: Duplicate page-level CAPTCHA detection appears in
ExtractPageType and ExtractPageTypeProba; extract the logic into a helper (e.g.,
detectPageCaptcha) that takes the HTML string and returns the cap string, by
reusing htmlutil.LoadHTMLString, htmlutil.GetForms,
captcha.CaptchaDetector.DetectInForm for form-level checks and falling back to
captcha.DetectCaptchaInHTML for full-HTML detection, then replace the duplicated
blocks in both ExtractPageType and ExtractPageTypeProba with a call to this
helper and remove the local capStr duplication.
---
Outside diff comments:
In `@dit.go`:
- Around line 224-254: ExtractPageType and ExtractPageTypeProba build
PageResult.Forms but never set each FormResult.Captcha, causing inconsistency
with ExtractForms/ExtractFormsProba; update these functions to populate per-form
Captcha the same way as ExtractForms does: for each form result in the loop that
constructs forms (the block filling FormResult{Type, Fields}), run the same
captcha detection used in ExtractForms (use
captcha.CaptchaDetector{}.DetectInForm on each htmlutil.GetForms form or call
the shared helper used by ExtractForms/ExtractFormsProba) and set
FormResult.Captcha accordingly before returning PageResult; ensure you reference
the same detector/path (captcha.CaptchaDetector, DetectInForm,
captcha.DetectCaptchaInHTML) so per-form Captcha values are filled consistently.
---
Duplicate comments:
In `@captcha/captcha_test.go`:
- Around line 453-480: The test TestDetectMultipleCaptchasInOneForm is asserting
too loosely—change the assertion to require the detector to return
captcha.CaptchaTypeRecaptcha (not either) to enforce deterministic priority;
locate the call to detector.DetectInForm(forms[0]) and replace the current
permissive check with a strict equality against captcha.CaptchaTypeRecaptcha (or
adjust the test HTML or detector setup if you intend the priority to be hcaptcha
instead), ensuring the test verifies that DetectInForm on
captcha.CaptchaDetector returns the CAPTCHA type that appears first in the form.
In `@classifier/classifier.go`:
- Around line 65-77: The inner bestProb variable inside the loop that computes
the argmax over probs shadows the outer bestProb; rename the inner variable
(e.g., to fieldBestProb) and update all uses in that block (the argmax loop that
sets bestClass and bestProb) so there is no shadowing, leaving
thresholdMap(probs, threshold) and result.Fields[name] logic unchanged.
---
Nitpick comments:
In `@captcha/captcha_test.go`:
- Around line 10-696: Many tests duplicate the same pattern calling
htmlutil.LoadHTMLString → htmlutil.GetForms →
(&captcha.CaptchaDetector{}).DetectInForm and asserting the result; consolidate
these into a single table-driven test (e.g., replace TestDetectRecaptchaV2,
TestDetectRecaptchaScript, TestDetectHCaptcha, etc.) by creating tests :=
[]struct{name,string,expected captcha.CaptchaType} with entries for each
provider, iterate with t.Run(tt.name, func(t *testing.T){ doc,err :=
htmlutil.LoadHTMLString(tt.html); forms := htmlutil.GetForms(doc); detector :=
&captcha.CaptchaDetector{}; if got := detector.DetectInForm(forms[0]); got !=
tt.expected { t.Errorf(...) } }), and keep complex or multi-form cases
(TestDetectCaptchaInMultipleForms, iframe/multiple captchas,
captcha.DetectCaptchaInHTML, TestCaptchaTypeString, TestIsValidCaptchaType) as
separate tests. Ensure you reference DetectInForm and
captcha.DetectCaptchaInHTML when moving cases so each HTML sample maps to the
correct assertion.
In `@captcha/captcha.go`:
- Around line 469-480: hasGenericCaptchaMarkers currently treats
provider-specific strings ("g-recaptcha","h-captcha","turnstile","geetest") as
generic and uses a broad substring match on "captcha"; update it to only
consider truly generic markers by removing the provider-specific entries from
genericMarkers and tighten the "captcha" check to a meaningful context — for
example search relevant attributes (e.g., form.Find for input[name|id|class],
div[class|id], labels, or the form's action) or use a
word-boundary/attribute-aware regex so stray links like "/captcha-policy" don't
trigger; locate the hasGenericCaptchaMarkers function and replace the simple
strings.Contains(htmlLower, "captcha") logic with an attribute-scoped or
word-boundary check.
In `@dit.go`:
- Around line 234-244: The LoadHTMLString error is being silently ignored in the
block that calls htmlutil.LoadHTMLString(html) before extracting forms and
detecting captchas; update this by either logging the error via the existing
logger (or returning it) when docErr != nil, or if the silent fallback is
intentional add a clear comment explaining why it's safe to ignore docErr;
specifically modify the code around htmlutil.LoadHTMLString(html) and the
subsequent htmlutil.GetForms / captcha.CaptchaDetector{}.DetectInForm loop to
either emit a warning that includes docErr or to document the rationale for
swallowing the error (referencing ExtractPage behavior if relevant).
| func TestDetectGenericCaptchaIframe(t *testing.T) { | ||
| html := ` | ||
| <form method="POST" action="/login"> | ||
| <input type="email" name="email" /> | ||
| <iframe src="https://example.com/captcha"></iframe> | ||
| <input type="submit" value="Login" /> | ||
| </form> | ||
| ` | ||
| doc, err := htmlutil.LoadHTMLString(html) | ||
| if err != nil { | ||
| t.Fatal(err) | ||
| } | ||
|
|
||
| forms := htmlutil.GetForms(doc) | ||
| if len(forms) == 0 { | ||
| t.Fatal("expected to find form") | ||
| } | ||
|
|
||
| detector := &captcha.CaptchaDetector{} | ||
| result := detector.DetectInForm(forms[0]) | ||
|
|
||
| // Should detect as generic captcha or none (depending on iframe content) | ||
| if result == captcha.CaptchaTypeRecaptcha || result == captcha.CaptchaTypeHCaptcha { | ||
| t.Errorf("expected generic/none, got %v", result) | ||
| } | ||
| } |
There was a problem hiding this comment.
Assertion doesn't verify the expected CaptchaTypeOther result.
The iframe src contains "captcha", which should trigger hasGenericCaptchaMarkers (Layer 7) and return CaptchaTypeOther. The current assertion only verifies the result is not recaptcha/hcaptcha — it would pass even if the detector returned CaptchaTypeNone. Assert the expected value directly.
Suggested fix
- // Should detect as generic captcha or none (depending on iframe content)
- if result == captcha.CaptchaTypeRecaptcha || result == captcha.CaptchaTypeHCaptcha {
- t.Errorf("expected generic/none, got %v", result)
- }
+ // iframe src contains "captcha" → Layer 7 generic marker should match
+ if result != captcha.CaptchaTypeOther {
+ t.Errorf("expected other (generic captcha), got %v", result)
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func TestDetectGenericCaptchaIframe(t *testing.T) { | |
| html := ` | |
| <form method="POST" action="/login"> | |
| <input type="email" name="email" /> | |
| <iframe src="https://example.com/captcha"></iframe> | |
| <input type="submit" value="Login" /> | |
| </form> | |
| ` | |
| doc, err := htmlutil.LoadHTMLString(html) | |
| if err != nil { | |
| t.Fatal(err) | |
| } | |
| forms := htmlutil.GetForms(doc) | |
| if len(forms) == 0 { | |
| t.Fatal("expected to find form") | |
| } | |
| detector := &captcha.CaptchaDetector{} | |
| result := detector.DetectInForm(forms[0]) | |
| // Should detect as generic captcha or none (depending on iframe content) | |
| if result == captcha.CaptchaTypeRecaptcha || result == captcha.CaptchaTypeHCaptcha { | |
| t.Errorf("expected generic/none, got %v", result) | |
| } | |
| } | |
| func TestDetectGenericCaptchaIframe(t *testing.T) { | |
| html := ` | |
| <form method="POST" action="/login"> | |
| <input type="email" name="email" /> | |
| <iframe src="https://example.com/captcha"></iframe> | |
| <input type="submit" value="Login" /> | |
| </form> | |
| ` | |
| doc, err := htmlutil.LoadHTMLString(html) | |
| if err != nil { | |
| t.Fatal(err) | |
| } | |
| forms := htmlutil.GetForms(doc) | |
| if len(forms) == 0 { | |
| t.Fatal("expected to find form") | |
| } | |
| detector := &captcha.CaptchaDetector{} | |
| result := detector.DetectInForm(forms[0]) | |
| // iframe src contains "captcha" → Layer 7 generic marker should match | |
| if result != captcha.CaptchaTypeOther { | |
| t.Errorf("expected other (generic captcha), got %v", result) | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@captcha/captcha_test.go` around lines 246 - 271, The test
TestDetectGenericCaptchaIframe currently only asserts the result is not
CaptchaTypeRecaptcha/HCaptcha, but should assert that
CaptchaDetector{}.DetectInForm on the form containing an iframe with "captcha"
returns CaptchaTypeOther (since hasGenericCaptchaMarkers should match). Update
the assertion to compare result == captcha.CaptchaTypeOther (or use
t.Fatalf/t.Errorf with expected vs actual) so the test explicitly verifies the
expected CaptchaTypeOther behavior; reference CaptchaDetector.DetectInForm,
hasGenericCaptchaMarkers, and CaptchaTypeOther when locating the logic to
validate.
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
dit.go (1)
226-238:⚠️ Potential issue | 🟡 MinorPer-form CAPTCHA is missing from
ExtractPageType/ExtractPageTypeProbaform results.
ExtractFormspopulatesFormResult.Captchafor each form, butExtractPageType(andExtractPageTypeProba) builds its form results without aCaptchafield (lines 228–231, 257–260). This means an API consumer gets per-form CAPTCHA info viaExtractFormsbut not viaExtractPageType, which is inconsistent.Consider populating per-form CAPTCHA here as well, since the parsed form selections are already available:
Proposed fix for ExtractPageType
+ detector := &captcha.CaptchaDetector{} forms := make([]FormResult, len(formResults)) for i, r := range formResults { + capStr := "" + if i < len(parsedForms) { + if ct := detector.DetectInForm(parsedForms[i]); ct != captcha.CaptchaTypeNone { + capStr = string(ct) + } + } forms[i] = FormResult{ - Type: r.Result.Form, - Fields: r.Result.Fields, + Type: r.Result.Form, + Captcha: capStr, + Fields: r.Result.Fields, } }Apply the same pattern in
ExtractPageTypeProba.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dit.go` around lines 226 - 238, The Page-level extractors (ExtractPageType and ExtractPageTypeProba) build FormResult entries from formResults but omit the per-form Captcha field; update both functions to set FormResult.Captcha for each form by using the parsedForms data (the same way ExtractForms does) — iterate formResults and for each index i set Forms[i].Captcha = parsedForms[i].Captcha (or call the same helper used by ExtractForms) so returned PageResult.Forms contains per-form Captcha info while keeping the existing page-level detectPageCaptcha call.
♻️ Duplicate comments (2)
captcha/captcha.go (2)
39-39:CaptchaTypeCoingeckomaps to"wsiz"— still undocumented.The constant name references "Coingecko" but the string value
"wsiz"remains opaque. Add a brief comment explaining the relationship (e.g.,// wsiz.com is the bot-protection provider used by Coingecko).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@captcha/captcha.go` at line 39, Add a short explanatory comment for the constant CaptchaTypeCoingecko so the string literal "wsiz" is no longer opaque; e.g., next to the declaration of CaptchaTypeCoingecko explain that "wsiz" refers to the wsiz.com bot-protection provider used by Coingecko (or similar brief phrasing) so future readers understand the mapping between the constant name and its string value.
414-453:⚠️ Potential issue | 🟠 MajorOverly broad class-pattern substrings still cause false positives.
The patterns flagged in the prior review remain unchanged:
- Line 430:
"kas"matches "task", "Kansas", etc. → false Kasada detection- Line 431:
"_inc"matches "max_increment", "_include", etc. → false Imperva detection- Line 436:
"argon"alone matches "argon2" (password hashing), etc. → false Argon detectionThese should use more specific strings (e.g.,
"kasada","incapsula","argon-captcha").Additionally,
"smartcaptcha"appears in bothCaptchaTypeSmartCaptcha(line 435) andCaptchaTypeYandex(line 438). Since SmartCaptcha is checked first, the"smartcaptcha"pattern under Yandex is unreachable in this layer.Proposed fix
- {CaptchaTypeKasada, []string{"kas", "kasada"}}, - {CaptchaTypeImperva, []string{"_inc", "incapsula", "imperva"}}, + {CaptchaTypeKasada, []string{"kasada"}}, + {CaptchaTypeImperva, []string{"incapsula", "imperva"}}, {CaptchaTypeAwsWaf, []string{"aws-waf", "awswaf"}}, {CaptchaTypeDatadome, []string{"dd-challenge", "dd-top"}}, {CaptchaTypePerimeterX, []string{"_px3", "px-container"}}, {CaptchaTypeSmartCaptcha, []string{"smart-captcha", "smartcaptcha"}}, - {CaptchaTypeArgon, []string{"argon-captcha", "argon"}}, + {CaptchaTypeArgon, []string{"argon-captcha"}}, {CaptchaTypePuzzleCaptcha, []string{"puzzle-captcha", "__puzzle_captcha"}}, - {CaptchaTypeYandex, []string{"smartcaptcha", "yandex-captcha"}}, + {CaptchaTypeYandex, []string{"yandex-captcha"}},🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@captcha/captcha.go` around lines 414 - 453, The detectByClasses function uses overly broad substrings in classPatterns causing false positives—update the patterns for CaptchaTypeKasada, CaptchaTypeImperva, and CaptchaTypeArgon to use specific tokens (e.g., "kasada" instead of "kas", "incapsula" instead of "_inc", and "argon-captcha" instead of "argon") and remove or replace the duplicate "smartcaptcha" entry under CaptchaTypeYandex (use "yandex-captcha" or another Yandex-specific token) so the entries in classPatterns and their checks correctly and uniquely identify each CaptchaType.
🧹 Nitpick comments (2)
captcha/captcha.go (1)
309-347: Layered detection pipeline is well-structured.Clean short-circuit design. However, note that Layer 1 (
detectByClasses) uses raw HTML substring matching with some broad patterns (see comment on lines 414–453), while Layer 2 (detectByScriptDomain) is more precise. Consider whether swapping their order would reduce false positives.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@captcha/captcha.go` around lines 309 - 347, The current detection pipeline in DetectInForm calls detectByClasses first which uses broad substring matching and can yield false positives; change the order to call detectByScriptDomain before detectByClasses so more precise script-domain detection short-circuits earlier (i.e., swap the calls to detectByScriptDomain and detectByClasses in the DetectInForm function while keeping the rest of the layered checks the same).classifier/classifier.go (1)
67-84: LGTM — argmax-based captcha exclusion and lazy allocation address prior review concerns.Minor note:
bestProbon line 70 shadows the outerbestProbon line 59. Both are in distinct scopes so there's no bug, but a different variable name (e.g.,fieldBestProb) would improve readability.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@classifier/classifier.go` around lines 67 - 84, The inner loop variable bestProb inside the for name, probs := range fieldProba block shadows an outer bestProb; rename the inner variable (e.g., to fieldBestProb) and update its comparisons and assignments (the if p > ... and fieldBestProb = p, bestClass logic) so the outer bestProb remains distinct; ensure any uses of bestProb outside the inner loop still refer to the original variable and that result.Fields logic (thresholdMap call, thresholdedProbs assignment) is unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@captcha/captcha.go`:
- Around line 535-550: The simpleCaptchaPatterns in detectByFieldNames are too
broad: entries "security_code", "verify_code", and "verification_code" cause
false positives on non-CAPTCHA forms (payment or verification flows). Remove
those three patterns from simpleCaptchaPatterns or change the detection to
require an additional CAPTCHA indicator before returning CaptchaTypeSimple (for
example, require the field name match AND either the presence of "captcha" in
the same form HTML, an adjacent <img> tag, or a nearby element/class like
"captcha", "recaptcha", or "g-recaptcha"); update detectByFieldNames and any
callers to use this stricter check so only true CAPTCHA forms return
CaptchaTypeSimple.
- Around line 589-600: The current hasGenericCaptchaMarkers function uses a raw
HTML substring search that causes false positives; update
hasGenericCaptchaMarkers to look for captcha indicators only in element
names/attributes (e.g., check form.Find("input, div, iframe, script,
textarea").Each and inspect attr values like id, name, class, src) instead of
the full inner HTML, and narrow the marker set to attribute/class/id tokens
(remove redundant "g-recaptcha", "h-captcha", "turnstile", "geetest" since they
are handled earlier). Ensure you keep the function name hasGenericCaptchaMarkers
and the genericMarkers concept but apply matches against element
attributes/classes/ids rather than strings.Contains on htmlLower.
- Line 571: The iframe matcher for CaptchaTypeYandex is too broad (currently
uses "yandex") and causes false positives; update the entry in the captcha types
slice (the element referencing CaptchaTypeYandex) to use a more specific domain
token such as "smartcaptcha.yandex" or "captcha.yandex" (or add both patterns)
so only true Yandex CAPTCHA iframes match; adjust any related matching
logic/tests that rely on the old "yandex" token to ensure they still pass.
---
Outside diff comments:
In `@dit.go`:
- Around line 226-238: The Page-level extractors (ExtractPageType and
ExtractPageTypeProba) build FormResult entries from formResults but omit the
per-form Captcha field; update both functions to set FormResult.Captcha for each
form by using the parsedForms data (the same way ExtractForms does) — iterate
formResults and for each index i set Forms[i].Captcha = parsedForms[i].Captcha
(or call the same helper used by ExtractForms) so returned PageResult.Forms
contains per-form Captcha info while keeping the existing page-level
detectPageCaptcha call.
---
Duplicate comments:
In `@captcha/captcha.go`:
- Line 39: Add a short explanatory comment for the constant CaptchaTypeCoingecko
so the string literal "wsiz" is no longer opaque; e.g., next to the declaration
of CaptchaTypeCoingecko explain that "wsiz" refers to the wsiz.com
bot-protection provider used by Coingecko (or similar brief phrasing) so future
readers understand the mapping between the constant name and its string value.
- Around line 414-453: The detectByClasses function uses overly broad substrings
in classPatterns causing false positives—update the patterns for
CaptchaTypeKasada, CaptchaTypeImperva, and CaptchaTypeArgon to use specific
tokens (e.g., "kasada" instead of "kas", "incapsula" instead of "_inc", and
"argon-captcha" instead of "argon") and remove or replace the duplicate
"smartcaptcha" entry under CaptchaTypeYandex (use "yandex-captcha" or another
Yandex-specific token) so the entries in classPatterns and their checks
correctly and uniquely identify each CaptchaType.
---
Nitpick comments:
In `@captcha/captcha.go`:
- Around line 309-347: The current detection pipeline in DetectInForm calls
detectByClasses first which uses broad substring matching and can yield false
positives; change the order to call detectByScriptDomain before detectByClasses
so more precise script-domain detection short-circuits earlier (i.e., swap the
calls to detectByScriptDomain and detectByClasses in the DetectInForm function
while keeping the rest of the layered checks the same).
In `@classifier/classifier.go`:
- Around line 67-84: The inner loop variable bestProb inside the for name, probs
:= range fieldProba block shadows an outer bestProb; rename the inner variable
(e.g., to fieldBestProb) and update its comparisons and assignments (the if p >
... and fieldBestProb = p, bestClass logic) so the outer bestProb remains
distinct; ensure any uses of bestProb outside the inner loop still refer to the
original variable and that result.Fields logic (thresholdMap call,
thresholdedProbs assignment) is unchanged.
|
Could you add some test cases to tests things? |
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (3)
captcha/captcha.go (3)
39-39:⚠️ Potential issue | 🟡 Minor
CaptchaTypeCoingeckostill maps to the opaque string"wsiz"with no explanation.The constant name clearly references Coingecko, but the serialized value (
"wsiz") gives no hint of that relationship. Consumers comparing against"coingecko"in JSON output would get silent mismatches. Add an inline comment documenting thatwsiz.comis Coingecko's bot-protection vendor, or change the value to something self-descriptive like"coingecko"or"wsiz.com".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@captcha/captcha.go` at line 39, The constant CaptchaTypeCoingecko currently serializes to the opaque string "wsiz" which hides that it represents Coingecko's bot-protection vendor; update the declaration for CaptchaTypeCoingecko to either use a self-descriptive value ("coingecko" or "wsiz.com") so JSON consumers can match reliably, or add a clear inline comment on the CaptchaTypeCoingecko constant explaining that "wsiz" refers to wsiz.com (Coingecko's vendor) so future readers and consumers understand the mapping; ensure the chosen change preserves any existing external contract or update callers/tests accordingly.
586-597:⚠️ Potential issue | 🟠 Major
hasGenericCaptchaMarkersraw-HTML search for"captcha"produces false positives, and 4 of 5 markers are redundant.
strings.Contains(htmlLower, "captcha")triggers on<label>Enter the captcha below</label>,<!-- captcha section -->, or any<a href="/captcha-help">— none of which embed a CAPTCHA widget. Additionally,"g-recaptcha","h-captcha","turnstile", and"geetest"are already caught by earlier layers; they will never reach this fallback.Consider narrowing the search to element attributes only (id, name, class, src) rather than the full HTML string, and trimming the marker list to just
"captcha"once the match scope is tightened.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@captcha/captcha.go` around lines 586 - 597, The current hasGenericCaptchaMarkers function searches the entire raw HTML and uses multiple redundant markers causing false positives; change it to inspect only element attributes (id, name, class, src) within the provided goquery.Selection and perform a case-insensitive substring check for the single marker "captcha" (remove "g-recaptcha", "h-captcha", "turnstile", "geetest"), returning true if any of those attributes contain "captcha"; use form.Find("*").Each or equivalent to iterate elements and check Attr("id"), Attr("name"), Attr("class") and Attr("src") for the substring to reliably detect an embedded widget.
430-436:⚠️ Potential issue | 🟠 MajorOverly-broad substring patterns
"kas","_inc", and"argon"remain — false-positive risk.These three entries search the full inner HTML of the form via
strings.Contains, meaning they match inside class names, attribute values, text nodes, and comments:
"kas"triggers on"task","kansas","breakfast", etc."_inc"triggers on"max_increment","_include","font-spacing_inc", etc."argon"triggers on any form that mentions argon2 (a common password-hashing scheme).The safer alternatives are either (a) use the longer, unambiguous tokens already in the same slice (
"kasada","incapsula","argon-captcha"), or (b) perform the match against actualclassattribute tokens instead of raw HTML.🐛 Proposed fix
- {CaptchaTypeKasada, []string{"kas", "kasada"}}, - {CaptchaTypeImperva, []string{"_inc", "incapsula", "imperva"}}, ... - {CaptchaTypeArgon, []string{"argon-captcha", "argon"}}, + {CaptchaTypeKasada, []string{"kasada"}}, + {CaptchaTypeImperva, []string{"incapsula", "imperva"}}, ... + {CaptchaTypeArgon, []string{"argon-captcha"}},🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@captcha/captcha.go` around lines 430 - 436, The substring match list for CaptchaTypeKasada, CaptchaTypeImperva and CaptchaTypeArgon is too broad and causes false positives; update the mapping entries so they use only the longer, unambiguous tokens ("kasada" instead of "kas", "incapsula" or "inc" instead of "_inc", and "argon-captcha" instead of "argon") in the slice associated with CaptchaTypeKasada, CaptchaTypeImperva and CaptchaTypeArgon, and if possible change the HTML detection code that uses strings.Contains to instead check class attribute tokens (split class attribute on whitespace) when identifying matches so you don’t match substrings inside other attributes or text.
🧹 Nitpick comments (2)
dit_captcha_test.go (2)
632-695:deployRegionis populated in every test case but never used in any assertion — dead test data.The field participates in the table-driven test struct but is only read by the loop range binding; it is never referenced in the body of
t.Run. Either remove it from the struct if it adds no value, or add an assertion that exercises it (e.g., verifying that region-specific detection behaviour differs).♻️ Proposed cleanup (remove unused field)
tests := []struct { name string captcha string description string - deployRegion string }{ { name: "yandex_smartcaptcha", captcha: "yandex", description: "Yandex SmartCaptcha - Russian behavioral CAPTCHA", - deployRegion: "russia", }, { name: "yandex_global", captcha: "yandex", description: "Yandex SmartCaptcha - Global deployment", - deployRegion: "global", }, { name: "yandex_enterprise", captcha: "yandex", description: "Yandex SmartCaptcha - Enterprise use", - deployRegion: "enterprise", }, }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dit_captcha_test.go` around lines 632 - 695, The table-driven test TestYandexCaptchaDetection includes a deployRegion field that is never used; remove deployRegion from the test struct and all test cases, or if region behavior should be tested, add assertions inside the t.Run block that reference tt.deployRegion and verify expected region-specific behavior (e.g., adjust expected detection or a region-dependent field on FormResult). Update the test cases and any references to tt.deployRegion accordingly so the struct and assertions are consistent with the intended coverage.
8-737: All tests exercise struct construction only — no detection logic is covered.Every test in this file follows the same pattern: manually construct a
FormResult/PageResult/FormResultProbawith hard-coded values, then assert that those same values are still present. This is equivalent to testing Go's struct field assignment and provides no regression protection for the actual CAPTCHA detection pipeline.For this test file to be meaningful, the tests should call the detection entry points with representative HTML fixtures and verify the produced results. For example:
// Meaningful pattern — calls detection, verifies output func TestDetectInFormRecaptcha(t *testing.T) { html := `<html><body> <form> <div class="g-recaptcha" data-sitekey="abc123"></div> <input type="text" name="email"> </form> </body></html>` doc, _ := goquery.NewDocumentFromReader(strings.NewReader(html)) detector := captcha.CaptchaDetector{} got := detector.DetectInForm(doc.Find("form").First()) if got != captcha.CaptchaTypeRecaptcha { t.Errorf("got %s, want recaptcha", got) } }Integration-level tests via
ExtractPage()/Classify()with real HTML would similarly catch regressions introduced by changes to the detection layers incaptcha/captcha.go.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dit_captcha_test.go` around lines 8 - 737, The tests only construct FormResult/PageResult/FormResultProba/PageResultProba structs and assert their fields, which doesn't exercise your detection logic; update the tests to call the actual detection entry points (e.g., captcha.CaptchaDetector, DetectInForm, ExtractPage, Classify) with representative HTML fixtures parsed via goquery.NewDocumentFromReader (or equivalent) and assert the returned captcha types/structures match expected values; add tests like "TestDetectInFormRecaptcha" that build minimal HTML snippets containing known CAPTCHA markers, parse them into a document, invoke detector.DetectInForm(doc.Find("form").First()) and assert the returned captcha type, and add integration-style tests that call ExtractPage()/Classify() on full HTML fixtures to validate page-level results.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@captcha/captcha.go`:
- Around line 143-146: The Datadome signature list incorrectly includes
Mixpanel's CDN; remove the regexp.MustCompile(`cdn\.mxpnl\.com`) entry from the
CaptchaTypeDatadome regex slice so only Datadome-related hosts remain (leave
regexp.MustCompile(`datadome\.co`) intact and ensure the CaptchaTypeDatadome
slice compiles and tests still pass).
---
Duplicate comments:
In `@captcha/captcha.go`:
- Line 39: The constant CaptchaTypeCoingecko currently serializes to the opaque
string "wsiz" which hides that it represents Coingecko's bot-protection vendor;
update the declaration for CaptchaTypeCoingecko to either use a self-descriptive
value ("coingecko" or "wsiz.com") so JSON consumers can match reliably, or add a
clear inline comment on the CaptchaTypeCoingecko constant explaining that "wsiz"
refers to wsiz.com (Coingecko's vendor) so future readers and consumers
understand the mapping; ensure the chosen change preserves any existing external
contract or update callers/tests accordingly.
- Around line 586-597: The current hasGenericCaptchaMarkers function searches
the entire raw HTML and uses multiple redundant markers causing false positives;
change it to inspect only element attributes (id, name, class, src) within the
provided goquery.Selection and perform a case-insensitive substring check for
the single marker "captcha" (remove "g-recaptcha", "h-captcha", "turnstile",
"geetest"), returning true if any of those attributes contain "captcha"; use
form.Find("*").Each or equivalent to iterate elements and check Attr("id"),
Attr("name"), Attr("class") and Attr("src") for the substring to reliably detect
an embedded widget.
- Around line 430-436: The substring match list for CaptchaTypeKasada,
CaptchaTypeImperva and CaptchaTypeArgon is too broad and causes false positives;
update the mapping entries so they use only the longer, unambiguous tokens
("kasada" instead of "kas", "incapsula" or "inc" instead of "_inc", and
"argon-captcha" instead of "argon") in the slice associated with
CaptchaTypeKasada, CaptchaTypeImperva and CaptchaTypeArgon, and if possible
change the HTML detection code that uses strings.Contains to instead check class
attribute tokens (split class attribute on whitespace) when identifying matches
so you don’t match substrings inside other attributes or text.
---
Nitpick comments:
In `@dit_captcha_test.go`:
- Around line 632-695: The table-driven test TestYandexCaptchaDetection includes
a deployRegion field that is never used; remove deployRegion from the test
struct and all test cases, or if region behavior should be tested, add
assertions inside the t.Run block that reference tt.deployRegion and verify
expected region-specific behavior (e.g., adjust expected detection or a
region-dependent field on FormResult). Update the test cases and any references
to tt.deployRegion accordingly so the struct and assertions are consistent with
the intended coverage.
- Around line 8-737: The tests only construct
FormResult/PageResult/FormResultProba/PageResultProba structs and assert their
fields, which doesn't exercise your detection logic; update the tests to call
the actual detection entry points (e.g., captcha.CaptchaDetector, DetectInForm,
ExtractPage, Classify) with representative HTML fixtures parsed via
goquery.NewDocumentFromReader (or equivalent) and assert the returned captcha
types/structures match expected values; add tests like
"TestDetectInFormRecaptcha" that build minimal HTML snippets containing known
CAPTCHA markers, parse them into a document, invoke
detector.DetectInForm(doc.Find("form").First()) and assert the returned captcha
type, and add integration-style tests that call ExtractPage()/Classify() on full
HTML fixtures to validate page-level results.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
captcha/captcha.go (1)
421-423:⚠️ Potential issue | 🟠 MajorSpecific reCAPTCHA variants are shadowed by the generic class rule.
g-recaptchais checked beforeg-recaptcha-v2andg-recaptcha-invisible, so variant forms get classified as generic reCAPTCHA in class-based detection.🐛 Proposed fix
- {CaptchaTypeRecaptcha, []string{"g-recaptcha", "grecaptcha"}}, - {CaptchaTypeRecaptchaV2, []string{"g-recaptcha-v2", "grecaptcha-v2"}}, - {CaptchaTypeRecaptchaInvisible, []string{"g-recaptcha-invisible", "grecaptcha-invisible"}}, + {CaptchaTypeRecaptchaInvisible, []string{"g-recaptcha-invisible", "grecaptcha-invisible"}}, + {CaptchaTypeRecaptchaV2, []string{"g-recaptcha-v2", "grecaptcha-v2"}}, + {CaptchaTypeRecaptcha, []string{"g-recaptcha", "grecaptcha"}},Also applies to: 441-447
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@captcha/captcha.go` around lines 421 - 423, The class-based detection rules currently list the generic CaptchaTypeRecaptcha entry before its specific variants, causing classMatches for "g-recaptcha-v2" and "g-recaptcha-invisible" to be shadowed; reorder the class rules so more specific variants (CaptchaTypeRecaptchaV2, CaptchaTypeRecaptchaInvisible) appear before the generic CaptchaTypeRecaptcha (and apply the same reordering for the other duplicated block at the later location), ensuring the slices mapping class names to captcha types check variant strings like "g-recaptcha-v2"/"g-recaptcha-invisible" prior to "g-recaptcha".
🧹 Nitpick comments (2)
dit_captcha_test.go (1)
10-737: Most of these tests are tautological and won’t catch integration regressions.They mainly assert values immediately written into structs, so failures in extraction/classification logic can slip through undetected.
Consider shifting these tables to behavior tests that call the real entry points (
ExtractPage, classify/proba flows) and assert populatedCaptchafields from HTML fixtures.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dit_captcha_test.go` around lines 10 - 737, Tests currently only assert values written into structs (e.g., FormResult, PageResult, FormResultProba, PageResultProba) which makes them tautological; replace key table-driven cases with behavior/integration tests that call the real entry points (e.g., ExtractPage) using HTML fixtures and the actual classification/proba flows, then assert the populated Captcha fields and field/type probabilities on the returned PageResult/FormResult or PageResultProba/FormResultProba; keep a small set of pure unit tests for struct shaping but move verification of real detection logic into tests that feed known HTML fixtures and verify outputs from ExtractPage and the classifier/proba pipeline.captcha/captcha.go (1)
49-83: Reduce enum validation duplication.
IsValidCaptchaTypeduplicates the full constant list. This is easy to drift when adding/removing types.Use a single
map[CaptchaType]struct{}(or slice+set init) as the source of truth and validate via lookup.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@captcha/captcha.go` around lines 49 - 83, Replace the long switch in IsValidCaptchaType with a single package-level lookup set: create a var like validCaptchaTypes map[CaptchaType]struct{} (or build from a slice) containing all existing constants (e.g., CaptchaTypeNone, CaptchaTypeRecaptcha, CaptchaTypeRecaptchaV2, CaptchaTypeRecaptchaInvisible, CaptchaTypeHCaptcha, CaptchaTypeTurnstile, CaptchaTypeGeetest, CaptchaTypeFriendlyCaptcha, CaptchaTypeRotateCaptcha, CaptchaTypeClickCaptcha, CaptchaTypeImageCaptcha, CaptchaTypePuzzleCaptcha, CaptchaTypeSliderCaptcha, CaptchaTypeMCaptcha, CaptchaTypeDatadome, CaptchaTypePerimeterX, CaptchaTypeArgon, CaptchaTypeBehaviotech, CaptchaTypeSmartCaptcha, CaptchaTypeYandex, CaptchaTypeFuncaptcha, CaptchaTypeKasada, CaptchaTypeImperva, CaptchaTypeAwsWaf, CaptchaTypeCoingecko, CaptchaTypeNovaScape, CaptchaTypeSimple, CaptchaTypeOther), then implement IsValidCaptchaType by converting the input string to CaptchaType and returning whether it exists in that map (e.g., _, ok := validCaptchaTypes[CaptchaType(s)]; return ok); keep the map as the single source of truth so adding/removing captcha types only updates one place.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@captcha/captcha.go`:
- Around line 199-306: DetectCaptchaInHTML's regex table only matches
double-quoted attributes and misses single-quoted HTML; update each regexp in
the captcha detection slice used by DetectCaptchaInHTML to accept either single
or double quotes (e.g., change patterns like attr="...` to attr=(["']).*?\1 or
use ['"] instead of " where applicable), and apply the same change to the
similar patterns around the other occurrence (lines referenced in the review:
the block shown and the other block at 607-613) so all attribute-based matches
handle both quote styles.
---
Duplicate comments:
In `@captcha/captcha.go`:
- Around line 421-423: The class-based detection rules currently list the
generic CaptchaTypeRecaptcha entry before its specific variants, causing
classMatches for "g-recaptcha-v2" and "g-recaptcha-invisible" to be shadowed;
reorder the class rules so more specific variants (CaptchaTypeRecaptchaV2,
CaptchaTypeRecaptchaInvisible) appear before the generic CaptchaTypeRecaptcha
(and apply the same reordering for the other duplicated block at the later
location), ensuring the slices mapping class names to captcha types check
variant strings like "g-recaptcha-v2"/"g-recaptcha-invisible" prior to
"g-recaptcha".
---
Nitpick comments:
In `@captcha/captcha.go`:
- Around line 49-83: Replace the long switch in IsValidCaptchaType with a single
package-level lookup set: create a var like validCaptchaTypes
map[CaptchaType]struct{} (or build from a slice) containing all existing
constants (e.g., CaptchaTypeNone, CaptchaTypeRecaptcha, CaptchaTypeRecaptchaV2,
CaptchaTypeRecaptchaInvisible, CaptchaTypeHCaptcha, CaptchaTypeTurnstile,
CaptchaTypeGeetest, CaptchaTypeFriendlyCaptcha, CaptchaTypeRotateCaptcha,
CaptchaTypeClickCaptcha, CaptchaTypeImageCaptcha, CaptchaTypePuzzleCaptcha,
CaptchaTypeSliderCaptcha, CaptchaTypeMCaptcha, CaptchaTypeDatadome,
CaptchaTypePerimeterX, CaptchaTypeArgon, CaptchaTypeBehaviotech,
CaptchaTypeSmartCaptcha, CaptchaTypeYandex, CaptchaTypeFuncaptcha,
CaptchaTypeKasada, CaptchaTypeImperva, CaptchaTypeAwsWaf, CaptchaTypeCoingecko,
CaptchaTypeNovaScape, CaptchaTypeSimple, CaptchaTypeOther), then implement
IsValidCaptchaType by converting the input string to CaptchaType and returning
whether it exists in that map (e.g., _, ok := validCaptchaTypes[CaptchaType(s)];
return ok); keep the map as the single source of truth so adding/removing
captcha types only updates one place.
In `@dit_captcha_test.go`:
- Around line 10-737: Tests currently only assert values written into structs
(e.g., FormResult, PageResult, FormResultProba, PageResultProba) which makes
them tautological; replace key table-driven cases with behavior/integration
tests that call the real entry points (e.g., ExtractPage) using HTML fixtures
and the actual classification/proba flows, then assert the populated Captcha
fields and field/type probabilities on the returned PageResult/FormResult or
PageResultProba/FormResultProba; keep a small set of pure unit tests for struct
shaping but move verification of real detection logic into tests that feed known
HTML fixtures and verify outputs from ExtractPage and the classifier/proba
pipeline.
captcha/captcha.go
Outdated
| regexp.MustCompile(`class="[^"]*g-recaptcha-invisible`), | ||
| regexp.MustCompile(`data-size="invisible"`), | ||
| }}, | ||
| {CaptchaTypeRecaptchaV2, []*regexp.Regexp{ | ||
| regexp.MustCompile(`class="[^"]*g-recaptcha-v2`), | ||
| }}, | ||
| {CaptchaTypeRecaptcha, []*regexp.Regexp{ | ||
| regexp.MustCompile(`src="[^"]*google\.com/recaptcha`), | ||
| regexp.MustCompile(`src="[^"]*gstatic\.com/[^"]*recaptcha`), | ||
| regexp.MustCompile(`src="[^"]*recaptcha/api\.js`), | ||
| regexp.MustCompile(`class="[^"]*g-recaptcha`), | ||
| }}, | ||
| {CaptchaTypeHCaptcha, []*regexp.Regexp{ | ||
| regexp.MustCompile(`src="[^"]*js\.hcaptcha\.com`), | ||
| regexp.MustCompile(`class="[^"]*h-captcha`), | ||
| regexp.MustCompile(`data-hcaptcha-widget-id`), | ||
| }}, | ||
| {CaptchaTypeTurnstile, []*regexp.Regexp{ | ||
| regexp.MustCompile(`src="[^"]*challenges\.cloudflare\.com`), | ||
| regexp.MustCompile(`class="[^"]*cf-turnstile`), | ||
| }}, | ||
| {CaptchaTypeGeetest, []*regexp.Regexp{ | ||
| regexp.MustCompile(`src="[^"]*geetest`), | ||
| regexp.MustCompile(`class="[^"]*geetest`), | ||
| }}, | ||
| {CaptchaTypeFriendlyCaptcha, []*regexp.Regexp{ | ||
| regexp.MustCompile(`src="[^"]*friendlycaptcha`), | ||
| regexp.MustCompile(`class="[^"]*frc-captcha`), | ||
| }}, | ||
| {CaptchaTypeRotateCaptcha, []*regexp.Regexp{ | ||
| regexp.MustCompile(`alt="[^"]*rotatecaptcha`), | ||
| regexp.MustCompile(`src="[^"]*rotatecaptcha`), | ||
| }}, | ||
| {CaptchaTypeClickCaptcha, []*regexp.Regexp{ | ||
| regexp.MustCompile(`alt="[^"]*clickcaptcha`), | ||
| regexp.MustCompile(`src="[^"]*clickcaptcha`), | ||
| }}, | ||
| {CaptchaTypeImageCaptcha, []*regexp.Regexp{ | ||
| regexp.MustCompile(`alt="[^"]*imagecaptcha`), | ||
| regexp.MustCompile(`src="[^"]*imagecaptcha`), | ||
| }}, | ||
| {CaptchaTypePuzzleCaptcha, []*regexp.Regexp{ | ||
| regexp.MustCompile(`class="[^"]*__puzzle_captcha`), | ||
| }}, | ||
| {CaptchaTypeSliderCaptcha, []*regexp.Regexp{ | ||
| regexp.MustCompile(`class="[^"]*slider-captcha`), | ||
| regexp.MustCompile(`src="[^"]*slidercaptcha`), | ||
| }}, | ||
| {CaptchaTypeMCaptcha, []*regexp.Regexp{ | ||
| regexp.MustCompile(`src="[^"]*mcaptcha`), | ||
| regexp.MustCompile(`class="[^"]*mcaptcha`), | ||
| regexp.MustCompile(`data-mcaptcha`), | ||
| }}, | ||
| {CaptchaTypeKasada, []*regexp.Regexp{ | ||
| regexp.MustCompile(`src="[^"]*kasadaproducts\.com`), | ||
| regexp.MustCompile(`data-kasada`), | ||
| }}, | ||
| {CaptchaTypeImperva, []*regexp.Regexp{ | ||
| regexp.MustCompile(`src="[^"]*/_incapsula_resource`), | ||
| regexp.MustCompile(`data-incapsula`), | ||
| regexp.MustCompile(`data-imperva`), | ||
| }}, | ||
| {CaptchaTypeAwsWaf, []*regexp.Regexp{ | ||
| regexp.MustCompile(`src="[^"]*aws-waf-captcha`), | ||
| regexp.MustCompile(`src="[^"]*awswaf\.com`), | ||
| }}, | ||
| {CaptchaTypeDatadome, []*regexp.Regexp{ | ||
| regexp.MustCompile(`src="[^"]*datadome`), | ||
| regexp.MustCompile(`data-datadome`), | ||
| regexp.MustCompile(`class="[^"]*dd-challenge`), | ||
| }}, | ||
| {CaptchaTypePerimeterX, []*regexp.Regexp{ | ||
| regexp.MustCompile(`src="[^"]*perimeterx`), | ||
| regexp.MustCompile(`data-px`), | ||
| regexp.MustCompile(`_pxappid`), | ||
| }}, | ||
| {CaptchaTypeArgon, []*regexp.Regexp{ | ||
| regexp.MustCompile(`class="[^"]*argon-captcha`), | ||
| }}, | ||
| {CaptchaTypeBehaviotech, []*regexp.Regexp{ | ||
| regexp.MustCompile(`src="[^"]*behaviotech\.com`), | ||
| }}, | ||
| // SmartCaptcha (specific Yandex product) before generic Yandex to avoid shadowing. | ||
| {CaptchaTypeSmartCaptcha, []*regexp.Regexp{ | ||
| regexp.MustCompile(`src="[^"]*captcha\.yandex`), | ||
| regexp.MustCompile(`class="[^"]*smart-captcha`), | ||
| regexp.MustCompile(`data-smartcaptcha`), | ||
| }}, | ||
| {CaptchaTypeYandex, []*regexp.Regexp{ | ||
| regexp.MustCompile(`class="[^"]*yandex-captcha`), | ||
| }}, | ||
| {CaptchaTypeFuncaptcha, []*regexp.Regexp{ | ||
| regexp.MustCompile(`src="[^"]*funcaptcha\.com`), | ||
| regexp.MustCompile(`src="[^"]*arkoselabs\.com`), | ||
| regexp.MustCompile(`class="[^"]*funcaptcha`), | ||
| }}, | ||
| {CaptchaTypeCoingecko, []*regexp.Regexp{ | ||
| regexp.MustCompile(`src="[^"]*wsiz\.com`), | ||
| }}, | ||
| {CaptchaTypeNovaScape, []*regexp.Regexp{ | ||
| regexp.MustCompile(`src="[^"]*novascape`), | ||
| }}, | ||
| {CaptchaTypeSimple, []*regexp.Regexp{ | ||
| regexp.MustCompile(`name="[^"]*captcha_code`), | ||
| regexp.MustCompile(`name="[^"]*captcha_input`), | ||
| regexp.MustCompile(`id="[^"]*captcha_image`), | ||
| }}, | ||
| } |
There was a problem hiding this comment.
DetectCaptchaInHTML misses single-quoted attributes.
Most regexes only match attr="...". Valid HTML often uses single quotes, which causes avoidable false negatives.
🐛 Proposed fix pattern (apply across this table)
- regexp.MustCompile(`src="[^"]*google\.com/recaptcha`),
+ regexp.MustCompile(`src=['"][^'"]*google\.com/recaptcha`),
- regexp.MustCompile(`class="[^"]*g-recaptcha`),
+ regexp.MustCompile(`class=['"][^'"]*g-recaptcha`),
- regexp.MustCompile(`name="[^"]*captcha_code`),
+ regexp.MustCompile(`name=['"][^'"]*captcha_code`),Also applies to: 607-613
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@captcha/captcha.go` around lines 199 - 306, DetectCaptchaInHTML's regex table
only matches double-quoted attributes and misses single-quoted HTML; update each
regexp in the captcha detection slice used by DetectCaptchaInHTML to accept
either single or double quotes (e.g., change patterns like attr="...` to
attr=(["']).*?\1 or use ['"] instead of " where applicable), and apply the same
change to the similar patterns around the other occurrence (lines referenced in
the review: the block shown and the other block at 607-613) so all
attribute-based matches handle both quote styles.
- Introduced a map of valid CAPTCHA types to streamline validation in IsValidCaptchaType function. - Simplified the logic for checking valid CAPTCHA types by replacing the switch statement with a map lookup. - Updated HTML integration patterns to use consistent regex for attribute matching. - Removed redundant test cases and added new tests for CAPTCHA detection in HTML forms. - Enhanced existing tests to verify the structure of FormResult and PageResult. - Consolidated tests for various CAPTCHA types to ensure comprehensive coverage.


Add CAPTCHA Detection Support
Overview
This PR adds comprehensive CAPTCHA detection capabilities to the dît classifier with support for 27+ CAPTCHA types including modern enterprise and open-source solutions. CAPTCHA detection is now integrated at both form and page levels, providing better insights into form security mechanisms. This resolves issue: #11
Changes
Struct Updates
Captchafield toFormResultandFormResultProbastructs to support form-level CAPTCHA detectionPageResultandPageResultProbawith improvedCaptchafield placementCAPTCHA Detection Implementation
Created new
classifier/captcha.gomodule withCaptchaDetectorfor robust CAPTCHA detection using 6-layer detection strategy:Integrated CAPTCHA detection in
ExtractPage()method to detect CAPTCHAs across all forms on a pageUpdated
Classify()method to filter out captcha-classified fields from results (captcha detection is now separate from field classification)Enhanced
ClassifyProba()to properly handle probability-based classification with CAPTCHA awarenessCAPTCHA Types Supported (28 types)
Google Solutions (3 types)
Enterprise Solutions (4 types)
Alternative CAPTCHA Services (5 types)
Bot Protection & Behavioral Analysis (5 types)
Interaction-Based CAPTCHAs (5 types)
Simple/Legacy (2 types)
Generic Types
Test Coverage Enhancements
Total CAPTCHA Tests: 35+
Added comprehensive test suite in
dit_captcha_test.gowith test cases for:Created form-level captcha test cases in
classifier/captcha_test.goFixed all unused field write warnings with proper assertions
Added dedicated Yandex test function with Russian and global deployment scenarios
Key Test Scenarios
Documentation
CAPTCHA_DETECTION.mdwith detailed documentation on CAPTCHA detection mechanisms and supported typesBackward Compatibility
✅ Fully backward compatible - all changes use
omitemptyJSON tags, ensuring existing code and API consumers are not affected.Key Features
Test Results
✅ All 35+ CAPTCHA-specific tests passing
✅ No unused field write warnings
✅ Comprehensive validation of Type, Captcha, and Fields
✅ Real-world scenario coverage including regional deployments
Summary by CodeRabbit
New Features
Improvements
Tests