C.J. Collier: The WWW::Mechanize::Chrome Saga: A Comprehensive Narrative of PR #104
This document synthesizes the extensive work performed from March
13th to March 20th, 2026, to harden, stabilize, and refactor the
WWW::Mechanize::Chrome library and its test suite. This
effort involved deep dives into asynchronous programming,
platform-specific bug hunting, and strategic architectural
decisions.
Part I: The Quest for Cross-Platform Stability (March 13 – 16)
The initial phase of work focused on achieving a “green” test suite across a variety of Linux distributions and preparing for a new release. This involved significant hardening of the library to account for different browser versions, OS-level security restrictions, and filesystem differences.
Key Milestones & Engineering Decisions:
- Fedora & RHEL-family Success: A major effort was undertaken to achieve a 100% pass rate on modern Fedora 43 and CentOS Stream 10. This required several key engineering decisions to handle modern browser behavior:
  - Decision: Implement Asynchronous DOM Serialization Fallback. Synchronous fallbacks in an async context are dangerous. To prevent “Resource was not cached” errors during saveResources, we implemented a fully asynchronous fallback in _saveResourceTree. By chaining _cached_document with DOM.getOuterHTML messages, we can reconstruct document content without blocking the event loop, even if Chromium has evicted the resource from its cache. This also proved resilient against Fedora’s security policies, which often block file:// access.
  - Decision: Truncate Filenames for Cross-Platform Safety. To avoid “File name too long” errors, especially on Windows where the MAX_PATH limit is 260 characters, filenameFromUrl was hardened. The filename length limit was reduced to a more conservative 150 characters, leaving ample headroom for deeply nested CI temporary directories. Logic was also added to preserve file extensions during truncation and to sanitize backslashes from URI paths.
  - Decision: Expand Browser Discovery Paths. To support RHEL-based systems out of the box, default_executable_names was expanded to include headless_shell, and the search paths were updated to include /usr/lib64/chromium-browser/.
  - Decision: Mitigate Race Conditions with Stabilization Waits and Resilient Fetching. On fast systems, DOM.documentUpdated events could invalidate nodeIds immediately after navigation, causing XPath queries to fail with “Could not find node with given id”. A small stabilization sleep of 0.25 seconds was added after page loads to let the DOM settle. Furthermore, the asynchronous DOM fetching loop was hardened to catch protocol errors and return an empty string for any node that was invalidated during serialization, so the overall process can still complete.
- Windows Hardening:
  - Decision: Adopt Platform-Aware Watchdogs. The test suite’s reliance on ualarm was a blocker for Windows, where it is not implemented. The t::helper::set_watchdog function was refactored to use standard alarm() (seconds) on Windows and ualarm (microseconds) on Unix-like systems, enabling consistent test-level timeout enforcement.
- Version 0.77 Release:
  - Decision: Adopt SOP for Version Synchronization. The project maintains duplicate version strings across 24+ files. A Standard Operating Procedure was adopted to use a batch-replacement tool to update all sub-modules in lib/ and to always run make clean and perl Makefile.PL to ensure META.json and META.yml reflect the new version. After achieving stability on Linux, the project version was bumped to 0.77.
- Infrastructure & Strategic Work:
  - The ad2 Windows Server 2025 instance was restored and optimized, with Active Directory demoted and disk I/O performance improved.
  - A strategic proposal for the Heterogeneous Directory Replication Protocol (HDRP) was drafted and published.
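As a rough illustration of the filename truncation decision above, the following sketch caps a name at 150 characters while keeping its extension and replacing backslashes. The helper name truncate_filename, the underscore replacement, and the extension pattern are assumptions made for this example; it is not the actual filenameFromUrl code.

```perl
use strict;
use warnings;

# Hypothetical helper (not the library's filenameFromUrl): shorten a file
# name to at most $max characters while keeping a short trailing extension.
sub truncate_filename {
    my ($name, $max) = @_;
    $max //= 150;    # the conservative limit mentioned in the text

    # Assumption: backslashes are replaced with underscores.
    $name =~ s{\\}{_}g;

    return $name if length($name) <= $max;

    # Split into stem and an optional extension such as ".html".
    my ($stem, $ext) = $name =~ /\A(.*?)(\.[A-Za-z0-9]{1,10})?\z/s;
    $ext //= '';

    # Cut the stem so that stem plus extension fits within the limit.
    return substr($stem, 0, $max - length($ext)) . $ext;
}

my $long = ('a' x 300) . '.html';
my $short = truncate_filename($long);
print length($short), " characters, ends in ", substr($short, -5), "\n";
```

Truncating the stem rather than the whole string keeps the extension intact, so a saved resource can still be recognised by type after shortening.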
Part II: The Great Async Refactor (March 17 – 18)
Despite success on Linux, tests on the slow ad2 Windows
host were still plagued by intermittent, indefinite hangs. This
triggered a fundamental architectural shift to move the library’s core
from a mix of synchronous and asynchronous code to a fully non-blocking
internal API.
Key Milestones & Engineering Decisions:
Decision: Expose a _future API. Instead of hardcoding timeouts in the library, the core strategy was to refactor all blocking methods (xpath, field, get, etc.) into thin wrappers around new non-blocking ..._future counterparts. This moved timeout management to the test harness, allowing for flexible and explicit handling of stalls.

    # Example library implementation
    sub xpath($self, $query, %options) {
        return $self->xpath_future($query, %options)->get;
    }

    sub xpath_future($self, $query, %options) {
        # Async implementation using $self->target->send_message(...)
    }

Decision: Centralize Test Hardening in a Helper. A dedicated test library, t/lib/t/helper.pm, was created to contain all stabilization logic. “Safe” wrappers (safe_get, safe_xpath) were implemented there, using Future->wait_any to race asynchronous operations against a timeout, preventing tests from hanging.

    # Example test helper implementation
    sub safe_xpath {
        my ($mech, $query, %options) = @_;
        my $timeout = delete $options{timeout} || 5;
        my $call_f = $mech->xpath_future($query, %options);
        my $timeout_f = $mech->sleep_future($timeout)->then(sub { Future->fail("Timeout") });
        return Future->wait_any($call_f, $timeout_f)->get;
    }

Decision: Refactor Node Attribute Cache. Investigations into flaky checkbox tests (t/50-tick.t) revealed that WWW::Mechanize::Chrome::Node was storing attributes as a flat list ([key, val, key, val]), which was inefficient for lookups and individual updates. The cache was refactored to use a HashRef, providing O(1) lookups and enabling atomic dual-updates where both the browser property (via JS) and the internal library attribute are synchronized simultaneously.

Decision: Implement Self-Cancelling Socket Watchdog. On Windows, traditional watchdog processes often failed to detect parent termination, leading to 60-second hangs after successful tests. We implemented a new socket-based watchdog in t::helper that listens on an ephemeral port; the background process terminates immediately when the parent socket closes, eliminating these cumulative delays.

Decision: Deep Recursive Refactoring & Form Selection. To make the API truly non-blocking, the entire internal call stack had to be refactored. For example, making get_set_value_future non-blocking required first making its dependency, _field_by_name, asynchronous. This culminated in refactoring the entire form selection API (form_name, form_id, etc.) to use the new asynchronous _future lookups, which was a key step in mitigating the Windows deadlocks.

Decision: Fix Critical Regressions & Memory Cycles.
- **Evaluation Normalization:** Implemented a _process_eval_result helper to centralize the parsing of results from Runtime.evaluate. This ensures consistent handling of return values and exceptions between synchronous (eval_in_page) and asynchronous (eval_future) calls.
- **Memory Cycle Mitigation:** A significant memory leak was discovered where closures attached to CDP event futures (such as those used for asynchronous body retrieval) captured strong references to $self and the $response object, creating a circular reference. The established rule is now to always call Scalar::Util::weaken on $self and any other relevant objects before they are used inside a ->then block that is stored on an object.
- **Context Propagation (`wantarray`):** A major regression was discovered where Perl’s wantarray context, which distinguishes between scalar and list context, was lost inside asynchronous Future->then blocks. This caused methods like xpath to return incorrect results (e.g., a count instead of a list of nodes). The solution was to adopt the “Async Context Pattern”: capture wantarray in the synchronous wrapper, pass it as an option to the _future method, and then use that captured value inside the future’s final resolution block.

      # Synchronous Wrapper
      sub xpath($self, $query, %options) {
          $options{ wantarray } = wantarray;                  # 1. Capture
          return $self->xpath_future($query, %options)->get;  # 2. Pass
      }

      # Asynchronous Implementation
      sub xpath_future($self, $query, %options) {
          my $wantarray = delete $options{ wantarray };       # 3. Retrieve
          # ... async logic ...
          return $doc->then(sub {
              if ($wantarray) {                               # 4. Respect
                  return Future->done(@results);
              } else {
                  return Future->done($results[0]);
              }
          });
      }

- **Asynchronous Body Retrieval & Robust Content Fallbacks:** Fixed a bug where decoded_content() would return empty strings by ensuring it awaited a __body_future. This was implemented by storing the retrieval future directly on the response object ($response->{__body_future}). To make this more robust, a tiered strategy was implemented: first try to get the content from the network response, but if that fails (e.g., for about:blank or due to cache eviction), fall back to a JavaScript XMLSerializer to get the live DOM content.
- **Signature Hardening:** Fixed “Too few arguments” errors when using modern Perl signatures with Future->then. Callbacks were updated to use optional parameters (sub($result = undef) { ... }) to gracefully handle futures that resolve with no value.
- **XHTML “Split-Brain” Bug:** Worked around a long-standing Chromium bug (40130141) where content provided via setDocumentContent is parsed differently than content loaded from a URL. For XHTML documents, WMC now uses a JavaScript-based XPath evaluation (document.evaluate) against the live DOM, bypassing the broken CDP search mechanism.
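The memory cycle rule above can be demonstrated without any browser. The sketch below is a minimal stand-in that uses plain closures stored on an object instead of Future->then blocks; the Demo::Response class and its method names are hypothetical and exist only for this example.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

# Hypothetical class for illustration; it only mimics "a callback stored on
# the same object that the callback refers to".
package Demo::Response {
    my $destroyed = 0;
    sub new             { my ($class) = @_; return bless { callbacks => [] }, $class }
    sub destroyed_count { return $destroyed }
    sub DESTROY         { $destroyed++ }

    # Leaky: the stored closure holds a strong reference to $self, so the
    # object refers to itself and is never freed by reference counting.
    sub attach_leaky {
        my ($self) = @_;
        push @{ $self->{callbacks} }, sub { return $self->{body} };
        return $self;
    }

    # Safe: the closure only sees a weakened copy of the reference.
    sub attach_weak {
        my ($self) = @_;
        my $weak_self = $self;
        weaken($weak_self);
        push @{ $self->{callbacks} }, sub {
            return unless $weak_self;    # the object may already be gone
            return $weak_self->{body};
        };
        return $self;
    }
}

{ Demo::Response->new->attach_leaky; }
print "destroyed after leaky scope: ", Demo::Response->destroyed_count, "\n";  # prints 0
{ Demo::Response->new->attach_weak; }
print "destroyed after weak scope: ", Demo::Response->destroyed_count, "\n";   # prints 1
```

The leaky object is only reclaimed at interpreter shutdown, which is the kind of lingering cycle a Test::Memory::Cycle check is meant to catch.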
Derived Architectural Rules & SOPs:
- Rule: Always provide _future variants. Every library method that interacts with the browser via CDP must have a non-blocking asynchronous counterpart.
- Rule: Centralize stabilization in the test layer. All timeout and retry logic should reside in the test harness (t/lib/t/helper.pm), not in the core library.
- Rule: Explicitly propagate wantarray context. Synchronous wrappers must capture the caller’s context and pass it down the Future chain to ensure correct scalar/list behavior.
- Rule: The entire call chain must be asynchronous. Non-blocking timeouts only work when every call in the chain is asynchronous; even a single “hidden” blocking call in an otherwise asynchronous method will cause a stall.
- SOP: Reduce Library Noise. Diagnostic messages (warn, note, diag) should be removed from library code before commits. All such messages should be converted to use the internal $self->log('debug', ...) mechanism, ensuring clean TAP output for CI systems.
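To see why the wantarray rule matters, here is a small core-Perl sketch with no Future involved. The function names find_naive and find_captured are made up for this example, and the symptom shown (one item instead of three) is a simplified version of the regression described earlier, not a reproduction of it.

```perl
use strict;
use warnings;

# Naive: wantarray is evaluated inside the callback, so it reports how the
# callback itself was invoked, not how the caller invoked find_naive().
sub find_naive {
    my @results = ('node1', 'node2', 'node3');
    my $on_done = sub { return wantarray ? @results : $results[0] };
    my $value   = $on_done->();    # the callback runs in scalar context here
    return $value;
}

# Captured: record the caller's context first, then consult the saved value.
sub find_captured {
    my $want_list = wantarray;
    my @results   = ('node1', 'node2', 'node3');
    my $on_done   = sub { return $want_list ? [@results] : [ $results[0] ] };
    my $picked    = $on_done->();
    return $want_list ? @$picked : $picked->[0];
}

my @naive    = find_naive();       # list context requested, one item returned
my @captured = find_captured();    # all three items returned
printf "naive: %d item(s), captured: %d item(s)\n", scalar @naive, scalar @captured;
```

In scalar context find_captured() returns just the first node, so both calling conventions behave as the caller expects.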
Part III: The MutationObserver Saga (March 19)
With most of the library refactored to be asynchronous, one stubborn
test, t/65-is_visible.t, continued to fail with timeouts.
This led to an ambitious, but ultimately unsuccessful, attempt to
replace the wait_until_visible polling logic with a more
“modern” MutationObserver.
Key Milestones & Challenges:
- The Theory: The goal was to replace an inefficient repeat { sleep } loop with an event-driven MutationObserver in JavaScript that would notify Perl immediately when an element’s visibility changed.
- Implementation & Cascade Failure: The implementation proved incredibly difficult and introduced a series of new, hard-to-diagnose bugs:
  1. An incorrect function signature for callFunctionOn_future.
  2. A critical unit mismatch: seconds were passed from Perl to JavaScript’s setTimeout, which expects milliseconds.
  3. A fundamental hang where the MutationObserver’s JavaScript Promise would never resolve, even after the underlying DOM element changed.
- Debugging Maze: Multiple attempts to fix the checkVisibility JavaScript logic inside the observer callback, including making it more robust by adding DOM tree traversal and extensive console.log tracing, failed to resolve the hang. This highlighted the opacity and difficulty of debugging complex, cross-language asynchronous interactions, especially when dealing with low-level browser APIs.
Procedural Learning: Granular Edits
The effort was plagued by procedural missteps in using automated file-editing tools. Initial attempts to replace large code blocks in a single operation led to accidental code loss and match failures.
Decision: Adopt “Delete, then Add” Workflow. Following forceful user correction, a new SOP was established for all future modifications:
1. Isolate: Break the file into small, manageable chunks (e.g., 250 lines).
2. Delete: Perform a “delete” operation by replacing the old code block with an empty string.
3. Add: Perform an “add” operation by inserting the new code into the empty space.
4. Verify: Verify each atomic step before proceeding.
This granular process, while slower, ensured surgical precision and regained technical control over the large Chrome.pm module.
The consistent failure of the MutationObserver approach
eventually led to the decision to abandon it in favor of stabilizing the
original, more transparent implementation.
Part IV: Reversion and Final Stabilization (March 20)
After exhausting all reasonable attempts to fix the
MutationObserver, a strategic decision was made to revert
to the simpler, more transparent polling implementation and fix it
correctly. This proved to be the correct path to a stable solution.
Key Milestones & Engineering Decisions:
- Decision: Perform Strategic Reversion. The MutationObserver implementation, when integrated via callFunctionOn_future with awaitPromise, proved fundamentally unstable. Its JavaScript promise would consistently fail to resolve, causing indefinite hangs. A decision was made to remove all MutationObserver code from WWW::Mechanize::Chrome.pm and restore the original repeat { sleep } polling mechanism. A stable, understandable solution was prioritized over an elegant but broken one.
- Decision: Correct Timeout Delegation in the Harness. The root cause of the original timeout failure was identified as a race condition in the t/lib/t/helper.pm test harness. The safe_wait_until_* wrappers were implementing their own timeout (via wait_any and sleep_future) that raced against the underlying polling function’s internal timeout. This led to intermittent failures on slow machines. The helpers were refactored to delegate all timeout management to the library’s polling functions, ensuring a single, authoritative timer controlled the operation.
- Decision: Optimize Polling Performance. At the user’s request, the polling interval was reduced from 300ms to 150ms. This modest performance improvement reduced the test suite’s wallclock execution time by over a second while maintaining stability.
- Decision: Tune Test Watchdogs. The global watchdog timeout was adjusted to 12 seconds, specifically calculated as 1.5x the observed real execution time of the optimized test. This provides a data-driven safety margin for CI.
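The idea of a single, authoritative timer can be sketched as a plain polling loop that owns both the interval and the deadline. This is a synchronous simplification under stated assumptions: poll_until and its timeout and interval options are hypothetical names, and the library’s real implementation is the Future-based repeat { sleep } mechanism, not this code.

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Hypothetical helper: call $check every $interval seconds until it returns
# true or $timeout seconds have passed. Returns 1 on success, 0 on timeout.
# The deadline lives only here, so no outer timer can race against it.
sub poll_until {
    my ($check, %opt) = @_;
    my $timeout  = $opt{timeout}  // 10;
    my $interval = $opt{interval} // 0.15;    # 150ms, as in the text
    my $deadline = time() + $timeout;

    while (1) {
        return 1 if $check->();
        return 0 if time() >= $deadline;
        sleep($interval);
    }
}

# Usage: a condition that becomes true on the third poll.
my $calls = 0;
my $ok = poll_until(sub { ++$calls >= 3 }, timeout => 2);
print "ok=$ok after $calls polls\n";
```

A caller that wraps this in a second timeout of similar length reintroduces the race described above, which is why the harness wrappers were changed to pass the timeout down instead.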
Part V: The Last Bug – A Platform-Specific Memory Leak (March 20)
With all other tests passing, a single memory leak failure in
t/78-memleak.t persisted, but only on the Windows
ad2 environment. This required a different approach than
the timeout fixes.
Key Milestones:
- The Bug: A strong reference cycle involving the on_dialog event listener was not being broken on Windows, despite multiple attempts to fix it. Fixes that worked on Linux (such as calling on_dialog(undef) in DESTROY) were not sufficient on the Windows host.
- The Diagnosis: The issue was determined to be a deep, platform-specific interaction between Perl’s garbage collector, the IO::Async event loop implementation on Windows, and the Test::Memory::Cycle module. The cycle report was identical on both platforms, but the cleanup behavior was different.
- Failed Attempts: A series of increasingly aggressive fixes was attempted to break the cycle, including:
  1. Moving the on_dialog(undef) call from close() to DESTROY().
  2. Explicitly deleting the listener and callback properties from the object hash in DESTROY.
  3. Swapping between $self->remove_listener and $self->target->unlisten in a mistaken attempt to find the correct un-registration method.
- Pragmatic Solution: After exhausting all reasonable code-level fixes without a resolution on Windows, the user opted to mark the failing test as a known issue for that specific platform.
- Final Fix: The single failing test in t/78-memleak.t was wrapped in a conditional TODO block that only executes on Windows (if ($^O =~ /MSWin32/i)), formally acknowledging the bug without blocking the build. This allows the test suite to pass in CI environments while flagging the issue for future, deeper investigation.
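A platform-conditional TODO of the kind described above might look roughly like the following with Test::More. The leak_free() function is a stand-in invented for this sketch, and the test name and TODO message are illustrative; the real check in t/78-memleak.t is not reproduced here.

```perl
use strict;
use warnings;
use Test::More tests => 1;

# Stand-in for the real leak check (hypothetical): pretend the cycle is
# only left behind on Windows.
sub leak_free { return $^O !~ /MSWin32/i }

TODO: {
    # $TODO is set only on Windows, so a failure on any other platform
    # still counts as a real failure.
    local $TODO = 'on_dialog reference cycle is not released on Windows'
        if $^O =~ /MSWin32/i;

    ok( leak_free(), 'no memory cycles after on_dialog cleanup' );
}
```

A failing test inside a TODO block is reported as an expected failure, so the harness still exits successfully while the output keeps the problem visible.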
Part VI: CI Hardening (March
A final failure in the GitHub Actions CI environment revealed one last configuration flaw.
Key Milestones:
- The Bug: The CI was running prove --nocount --jobs 3 -I local/ -bl xt t directly. This command was missing the crucial -It/lib include path, which is necessary for test files to locate the t::helper module. This resulted in nearly all tests failing with “Can't locate t/helper.pm in @INC”.
- The Investigation: An analysis of Makefile.PL revealed a custom MY::test block specifically designed to inject the -It/lib flag into the make test command. This confirmed that make test is the correct, canonical way to run the test suite for this project.
- The Fix: The .github/workflows/linux.yml file was modified to replace the direct prove call with make test in the Run Tests step. This ensures the CI environment runs the tests in the exact same way as a local developer, with all necessary include paths correctly configured by the project’s build system.
Final Outcome
After this long and arduous journey, the
WWW::Mechanize::Chrome test suite is now stable and
passing on all targeted platforms, with known
platform-specific issues clearly documented in the code. The project is
in a vastly more robust and reliable state.