
C.J. Collier: The WWW::Mechanize::Chrome Saga: A Comprehensive Narrative of PR #104

Planet Debian [Unofficial] March 21, 2026

The WWW::Mechanize::Chrome Saga: A Comprehensive Narrative of PR #104

This document synthesizes the extensive work performed from March 13th to March 20th, 2026, to harden, stabilize, and refactor the WWW::Mechanize::Chrome library and its test suite. This effort involved deep dives into asynchronous programming, platform-specific bug hunting, and strategic architectural decisions.


Part I: The Quest for Cross-Platform Stability (March 13 – 16)

The initial phase of work focused on achieving a “green” test suite across a variety of Linux distributions and preparing for a new release. This involved significant hardening of the library to account for different browser versions, OS-level security restrictions, and filesystem differences.

Key Milestones & Engineering Decisions:

  • Fedora & RHEL-family Success: A major effort was undertaken to achieve a 100% pass rate on modern Fedora 43 and CentOS Stream 10. This required several key engineering decisions to handle modern browser behavior:

    • Decision: Implement Asynchronous DOM Serialization Fallback. Synchronous fallbacks in an async context are dangerous. To prevent “Resource was not cached” errors during saveResources, we implemented a fully asynchronous fallback in _saveResourceTree. By chaining _cached_document with DOM.getOuterHTML messages, we can reconstruct document content without blocking the event loop, even if Chromium has evicted the resource from its cache. This also proved resilient against Fedora’s security policies, which often block file:// access.
    • Decision: Truncate Filenames for Cross-Platform Safety. To avoid “File name too long” errors, especially on Windows where the MAX_PATH limit is 260 characters, filenameFromUrl was hardened. The truncation limit was reduced to a more conservative 150 characters, leaving ample headroom for deeply nested CI temporary directories. Logic was also added to preserve file extensions during truncation and to sanitize backslashes from URI paths.
    • Decision: Expand Browser Discovery Paths. To support RHEL-based systems out of the box, default_executable_names was expanded to include headless_shell, and the search paths were updated to include /usr/lib64/chromium-browser/.
    • Decision: Mitigate Race Conditions with Stabilization Waits and Resilient Fetching. On fast systems, DOM.documentUpdated events could invalidate nodeIds immediately after navigation, causing XPath queries to fail with “Could not find node with given id”. A short stabilization sleep of 0.25 seconds was added after page loads to let the DOM settle. Furthermore, the asynchronous DOM fetching loop was hardened to catch these protocol errors and return an empty string for any node that was invalidated during serialization, so that the overall process can still complete.
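The filename truncation decision above can be illustrated with a minimal sketch. This is not the library's filenameFromUrl; truncate_filename is a hypothetical helper, and replacing backslashes with underscores and treating a trailing run of up to 10 alphanumerics as the extension are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not the library's filenameFromUrl): shorten a file
# name to at most $max characters while keeping its extension intact.
sub truncate_filename {
    my ($name, $max) = @_;
    $max //= 150;                 # the conservative limit named in the text

    $name =~ tr{\\}{_};           # assumption: backslashes become underscores
    return $name if length($name) <= $max;

    # Split off a short trailing extension such as ".html", if there is one.
    my ($stem, $ext) = $name =~ /\A(.*?)(\.[A-Za-z0-9]{1,10})?\z/s;
    $ext //= '';

    # Cut the stem so that stem plus extension fits within the limit.
    return substr($stem, 0, $max - length $ext) . $ext;
}

my $long_name  = ('a' x 200) . '.html';
my $short_name = truncate_filename($long_name);
```

Truncating the stem rather than the whole string is the design point: a file cut off in the middle of ".html" would lose the hint that tools use to pick a content type.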

  • Windows Hardening:
    • Decision: Adopt Platform-Aware Watchdogs. The test suite’s reliance on ualarm was a blocker for Windows, where it is not implemented. The t::helper::set_watchdog function was refactored to use standard alarm() (seconds) on Windows and ualarm (microseconds) on Unix-like systems, enabling consistent test-level timeout enforcement.
  • Version 0.77 Release:
    • Decision: Adopt SOP for Version Synchronization. The project maintains duplicate version strings across 24+ files. A Standard Operating Procedure was adopted to use a batch-replacement tool to update all sub-modules in lib/ and to always run make clean and perl Makefile.PL to ensure META.json and META.yml reflect the new version. After achieving stability on Linux, the project version was bumped to 0.77.
  • Infrastructure & Strategic Work:
    • The ad2 Windows Server 2025 instance was restored and optimized, with Active Directory demoted and disk I/O performance improved.
    • A strategic proposal for the Heterogeneous Directory Replication Protocol (HDRP) was drafted and published.
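The platform-aware watchdog from the Windows Hardening item above might look roughly like the following sketch. It is not the actual t::helper::set_watchdog code; watchdog_plan and arm_watchdog are hypothetical names, and rounding fractional seconds up on Windows is an assumption.

```perl
use strict;
use warnings;
use POSIX qw(ceil);
use Time::HiRes ();

# Hypothetical planning helper: pick the timer call for a timeout given in
# (possibly fractional) seconds, depending on the operating system name.
sub watchdog_plan {
    my ($seconds, $os) = @_;
    $os //= $^O;
    return $os eq 'MSWin32'
        ? { call => 'alarm',  arg => ceil($seconds) }             # whole seconds
        : { call => 'ualarm', arg => int($seconds * 1_000_000) }; # microseconds
}

# Arm the timer. The caller is expected to have installed a $SIG{ALRM} handler.
sub arm_watchdog {
    my ($seconds) = @_;
    my $plan = watchdog_plan($seconds);
    if ($plan->{call} eq 'alarm') {
        alarm($plan->{arg});
    } else {
        Time::HiRes::ualarm($plan->{arg});
    }
    return $plan;
}
```

Separating the decision from the side effect keeps the unit conversion testable without actually arming a signal.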

Part II: The Great Async Refactor (March 17 – 18)

Despite success on Linux, tests on the slow ad2 Windows host were still plagued by intermittent, indefinite hangs. This triggered a fundamental architectural shift to move the library’s core from a mix of synchronous and asynchronous code to a fully non-blocking internal API.

Key Milestones & Engineering Decisions:

  • Decision: Expose a _future API. Instead of hardcoding timeouts in the library, the core strategy was to refactor all blocking methods (xpath, field, get, etc.) into thin wrappers around new non-blocking ..._future counterparts. This moved timeout management to the test harness, allowing for flexible and explicit handling of stalls.

    # Example library implementation
    sub xpath($self, $query, %options) {
        return $self->xpath_future($query, %options)->get;
    }
    
    sub xpath_future($self, $query, %options) {
        # Async implementation using $self->target->send_message(...)
    }
    
  • Decision: Centralize Test Hardening in a Helper. A dedicated test library, t/lib/t/helper.pm, was created to contain all stabilization logic. “Safe” wrappers (safe_get, safe_xpath) were implemented there, using Future->wait_any to race asynchronous operations against a timeout, preventing tests from hanging.

    # Example test helper implementation
    sub safe_xpath {
        my ($mech, $query, %options) = @_;
        my $timeout = delete $options{timeout} || 5;
        my $call_f = $mech->xpath_future($query, %options);
        my $timeout_f = $mech->sleep_future($timeout)->then(sub { Future->fail("Timeout") });
        return Future->wait_any($call_f, $timeout_f)->get;
    }
    
  • Decision: Refactor Node Attribute Cache. Investigations into flaky checkbox tests (t/50-tick.t) revealed that WWW::Mechanize::Chrome::Node was storing attributes as a flat list ([key, val, key, val]), which was inefficient for lookups and individual updates. The cache was refactored to always use a HashRef, providing O(1) lookups and enabling atomic dual-updates where both the browser property (via JS) and the internal library attribute are synchronized simultaneously.
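To make the difference between the two cache shapes concrete, here is a small sketch in plain Perl. The attribute names and the flat_get helper are illustrative only and are not taken from WWW::Mechanize::Chrome::Node.

```perl
use strict;
use warnings;

# Attributes as a flat name/value list (the shape described above).
my @flat = (type => 'checkbox', name => 'agree', checked => '');

# Flat list: finding one attribute means scanning the pairs one by one.
sub flat_get {
    my ($list, $key) = @_;
    for (my $i = 0; $i < @$list; $i += 2) {
        return $list->[$i + 1] if $list->[$i] eq $key;
    }
    return;
}

# Hash reference: built once, then each lookup or update is one hash access.
my %pairs      = @flat;
my $attributes = \%pairs;
$attributes->{checked} = 'checked';   # update a single attribute in place
```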

  • Decision: Implement Self-Cancelling Socket Watchdog. On Windows, traditional watchdog processes often failed to detect parent termination, leading to 60-second hangs after successful tests. We implemented a new socket-based watchdog in t::helper that listens on an ephemeral port; the background process terminates immediately when the parent socket closes, eliminating these cumulative delays.

  • Decision: Deep Recursive Refactoring & Form Selection. To make the API truly non-blocking, the entire internal call stack had to be refactored. For example, making get_set_value_future non-blocking required first making its dependency, _field_by_name, asynchronous. This culminated in refactoring the entire form selection API (form_name, form_id, etc.) to use the new asynchronous _future lookups, which was a key step in mitigating the Windows deadlocks.

  • Decision: Fix Critical Regressions & Memory Cycles.

    • Evaluation Normalization: Implemented a _process_eval_result helper to centralize the parsing of results from Runtime.evaluate. This ensures consistent handling of return values and exceptions between synchronous (eval_in_page) and asynchronous (eval_future) calls.

    • Memory Cycle Mitigation: A significant memory leak was discovered where closures attached to CDP event futures (such as those for asynchronous body retrieval) would capture strong references to $self and the $response object, creating a circular reference. The established rule is now to always use Scalar::Util::weaken on both $self and any other relevant objects before they are used inside a ->then block that is stored on an object.
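The weakening rule can be demonstrated without a browser. The sketch below uses a made-up Demo::Response class and a plain stored callback in place of a Future ->then block; only Scalar::Util::weaken itself is the real API.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

my $destroyed = 0;

# Made-up stand-in for a response object; counts how often it is freed.
package Demo::Response {
    sub new     { return bless { body => 'hello' }, shift }
    sub DESTROY { $destroyed++ }
}

# Store a callback on the object that itself needs the object.
sub attach_callback {
    my ($response) = @_;
    my $weak = $response;
    weaken($weak);                  # the closure must not keep $response alive
    $response->{on_body} = sub {
        return unless $weak;        # the object may already be gone
        return $weak->{body};
    };
    return $response;
}

{
    my $r = attach_callback(Demo::Response->new);
    print $r->{on_body}->(), "\n";  # the callback works while $r is alive
}
# $r is out of scope here; because the closure holds only a weak reference,
# the object has been freed and its DESTROY has run.
print "destroyed: $destroyed\n";
```

If the closure captured $response directly, the object would hold a callback that holds the object, and the counter would stay at zero until global destruction.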

    • Context Propagation (wantarray): A major regression was discovered where Perl’s wantarray context, which distinguishes between scalar and list context, was lost inside asynchronous Future->then blocks. This caused methods like xpath to return incorrect results (e.g., a count instead of a list of nodes). The solution was to adopt the “Async Context Pattern”: capture wantarray in the synchronous wrapper, pass it as an option to the _future method, and then use that captured value inside the future’s final resolution block.

      # Synchronous Wrapper
      sub xpath($self, $query, %options) {
          $options{ wantarray } = wantarray; # 1. Capture
          return $self->xpath_future($query, %options)->get; # 2. Pass
      }

      # Asynchronous Implementation
      sub xpath_future($self, $query, %options) {
          my $wantarray = delete $options{ wantarray }; # 3. Retrieve
          # ... async logic ...
          return $doc->then(sub {
              if ($wantarray) { # 4. Respect
                  return Future->done(@results);
              } else {
                  return Future->done($results[0]);
              }
          });
      }
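Why the context is lost can be shown in isolation, without Future. The following sketch uses hypothetical function names and a plain closure as a stand-in for a ->then callback.

```perl
use strict;
use warnings;

# Inside a callback, wantarray reports how the callback itself was called,
# not how the enclosing function was called.
sub context_seen_in_callback {
    my $callback = sub { return wantarray ? 'list' : 'scalar' };
    my $seen = $callback->();      # the callback runs in scalar context here
    return $seen;
}

# Capturing wantarray before entering the callback preserves the caller's
# context for later use.
sub context_captured_first {
    my $wantarray = wantarray;
    my $callback  = sub { return $wantarray ? 'list' : 'scalar' };
    my $seen = $callback->();
    return $seen;
}

# Both functions are called in list context.
my @from_list_call     = context_seen_in_callback();
my @captured_list_call = context_captured_first();
```

The first function reports the callback's own scalar context even though its caller asked for a list, which is the same effect the pattern above guards against.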

    • Asynchronous Body Retrieval & Robust Content Fallbacks: Fixed a bug where decoded_content() would return empty strings by ensuring it awaited a __body_future. This was implemented by storing the retrieval future directly on the response object ($response->{__body_future}). To make this more robust, a tiered strategy was implemented: first try to get the content from the network response, but if that fails (e.g., for about:blank or due to cache eviction), fall back to a JavaScript XMLSerializer to get the live DOM content.

    • Signature Hardening: Fixed “Too few arguments” errors when using modern Perl signatures with Future->then. Callbacks were updated to use optional parameters (sub($result = undef) { ... }) to gracefully handle futures that resolve with no value.
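The signature problem is easy to reproduce with plain code references. This sketch calls two callbacks with an empty argument list, which is an assumed stand-in for a future that resolves with no value. It needs Perl 5.20 or later.

```perl
use strict;
use warnings;
use feature 'signatures';
no warnings 'experimental::signatures';

# A mandatory parameter: calling with no arguments is a runtime error.
my $strict_cb   = sub ($result)         { return defined $result ? $result : 'none' };

# An optional parameter with a default: an empty argument list is accepted.
my $tolerant_cb = sub ($result = undef) { return defined $result ? $result : 'none' };

# Call both with no arguments and record what happens.
my $strict_error = eval { $strict_cb->(); 1 } ? '' : $@;
my $tolerant_out = $tolerant_cb->();

print "strict callback died: $strict_error";
print "tolerant callback returned: $tolerant_out\n";
```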

    • XHTML “Split-Brain” Bug: Worked around a long-standing Chromium bug (40130141) where content provided via setDocumentContent is parsed differently from content loaded from a URL. For XHTML documents, WMC now uses a JavaScript-based XPath evaluation (document.evaluate) against the live DOM, bypassing the broken CDP search mechanism.

Derived Architectural Rules & SOPs:

  • Rule: Always provide _future variants. Every library method that interacts with the browser via CDP must have a non-blocking asynchronous counterpart.
  • Rule: Centralize stabilization in the test layer. All timeout and retry logic should reside in the test harness (t/lib/t/helper.pm), not in the core library.
  • Rule: Explicitly propagate wantarray context. Synchronous wrappers must capture the caller’s context and pass it down the Future chain to ensure correct scalar/list behavior.
  • Rule: The entire call chain must be asynchronous. Non-blocking timeouts only work if nothing in the chain blocks; even a single “hidden” blocking call in an otherwise asynchronous method will cause a stall.
  • SOP: Reduce Library Noise. Diagnostic messages (warn, note, diag) should be removed from library code before commits. All such messages should be converted to use the internal $self->log('debug', ...) mechanism, ensuring a clean TAP output for CI systems.

Part III: The MutationObserver Saga (March 19)

With most of the library refactored to be asynchronous, one stubborn test, t/65-is_visible.t, continued to fail with timeouts. This led to an ambitious, but ultimately unsuccessful, attempt to replace the wait_until_visible polling logic with a more “modern” MutationObserver.

Key Milestones & Challenges:

  • The Theory: The goal was to replace an inefficient repeat { sleep } loop with an event-driven MutationObserver in JavaScript that would notify Perl immediately when an element’s visibility changed.

  • Implementation & Cascade Failure: The implementation proved incredibly difficult and introduced a series of new, hard-to-diagnose bugs:

    1. An incorrect function signature for callFunctionOn_future.
    2. A critical unit mismatch: seconds were passed from Perl to JavaScript’s setTimeout, which expects milliseconds.
    3. A fundamental hang where the MutationObserver’s JavaScript Promise would never resolve, even after the underlying DOM element changed.

  • Debugging Maze: Multiple attempts to fix the checkVisibility JavaScript logic inside the observer callback, including making it more robust by adding DOM tree traversal and extensive console.log tracing, failed to resolve the hang. This highlighted the opacity and difficulty of debugging complex, cross-language asynchronous interactions, especially when dealing with low-level browser APIs.
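The unit mismatch listed above is the kind of bug a small conversion helper can rule out. The sketch below is illustrative only: seconds_to_ms is a hypothetical name and the JavaScript snippet is not the code from the PR.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a Perl-side timeout in seconds to the whole
# number of milliseconds that JavaScript's setTimeout expects.
sub seconds_to_ms {
    my ($seconds) = @_;
    return int($seconds * 1000 + 0.5);   # round to the nearest millisecond
}

my $timeout_s  = 2.5;
my $timeout_ms = seconds_to_ms($timeout_s);

# Interpolate the millisecond value, never the raw seconds, into the script.
my $js = "new Promise(resolve => setTimeout(resolve, $timeout_ms))";
```

Passing 2.5 straight through would make the timer fire after about 2.5 milliseconds instead of 2.5 seconds.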

Procedural Learning: Granular Edits

The effort was plagued by procedural missteps in using automated file-editing tools. Initial attempts to replace large code blocks in a single operation led to accidental code loss and match failures.

  • Decision: Adopt “Delete, then Add” Workflow. Following forceful user correction, a new SOP was established for all future modifications:

    1. Isolate: Break the file into small, manageable chunks (e.g., 250 lines).
    2. Delete: Perform a “delete” operation by replacing the old code block with an empty string.
    3. Add: Perform an “add” operation by inserting the new code into the empty space.
    4. Verify: Verify each atomic step before proceeding.

This granular process, while slower, ensured surgical precision and regained technical control over the large Chrome.pm module.

The consistent failure of the MutationObserver approach eventually led to the decision to abandon it in favor of stabilizing the original, more transparent implementation.


Part IV: Reversion and Final Stabilization (March 20)

After exhausting all reasonable attempts to fix the MutationObserver, a strategic decision was made to revert to the simpler, more transparent polling implementation and fix it correctly. This proved to be the correct path to a stable solution.

Key Milestones & Engineering Decisions:

  • Decision: Perform Strategic Reversion. The MutationObserver implementation, when integrated via callFunctionOn_future with awaitPromise, proved fundamentally unstable. Its JavaScript promise would consistently fail to resolve, causing indefinite hangs. A decision was made to revert all MutationObserver code from WWW::Mechanize::Chrome.pm and restore the original repeat { sleep } polling mechanism. A stable, understandable solution was prioritized over an elegant but broken one.
  • Decision: Correct Timeout Delegation in the Harness. The root cause of the original timeout failure was identified as a race condition in the t/lib/t/helper.pm test harness. The safe_wait_until_* wrappers were implementing their own timeout (via wait_any and sleep_future) that raced against the underlying polling function’s internal timeout. This led to intermittent failures on slow machines. The helpers were refactored to delegate all timeout management to the library’s polling functions, ensuring a single, authoritative timer controlled the operation.
  • Decision: Optimize Polling Performance. At the user’s request, the polling interval was reduced from 300ms to 150ms. This modest performance improvement reduced the test suite’s wallclock execution time by over a second while maintaining stability.
  • Decision: Tune Test Watchdogs. The global watchdog timeout was adjusted to 12 seconds, specifically calculated as 1.5x the observed real execution time of the optimized test. This provides a data-driven safety margin for CI.
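The idea of a single authoritative timer behind the polling loop can be sketched in plain Perl. This is not the library's wait_until_visible implementation, which uses futures; poll_until is a hypothetical blocking helper, and its 5-second default timeout is an assumption.

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Hypothetical polling helper: call $check every $interval seconds until it
# returns true or $timeout seconds have passed. One deadline governs the
# whole wait, so no second timer can race against it.
sub poll_until {
    my ($check, %opt) = @_;
    my $interval = $opt{interval} // 0.15;        # 150 ms, as in the text
    my $deadline = time() + ($opt{timeout} // 5); # assumed default timeout

    while (1) {
        return 1 if $check->();           # condition met
        return 0 if time() >= $deadline;  # single authoritative timeout
        sleep($interval);
    }
}

# Usage: the condition becomes true on the third check.
my $calls   = 0;
my $visible = poll_until(sub { ++$calls >= 3 }, timeout => 2);
```

A caller that wraps this in its own, separately computed timeout reintroduces the race described above, because whichever timer fires first decides the outcome.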

Part V: The Last Bug – A Platform-Specific Memory Leak (March 20)

With all other tests passing, a single memory leak failure in t/78-memleak.t persisted, but only on the Windows ad2 environment. This required a different approach than the timeout fixes.

Key Milestones:

  • The Bug: A strong reference cycle involving the on_dialog event listener was not being broken on Windows, despite multiple attempts to fix it. Fixes that worked on Linux (such as calling on_dialog(undef) in DESTROY) were not sufficient on the Windows host.

  • The Diagnosis: The issue was determined to be a deep, platform-specific interaction between Perl’s garbage collector, the IO::Async event loop implementation on Windows, and the Test::Memory::Cycle module. The cycle report was identical on both platforms, but the cleanup behavior was different.

  • Failed Attempts: A series of increasingly aggressive fixes were attempted to break the cycle, including:

    1. Moving the on_dialog(undef) call from close() to DESTROY().
    2. Explicitly deleting the listener and callback properties from the object hash in DESTROY.
    3. Swapping between $self->remove_listener and $self->target->unlisten in a mistaken attempt to find the correct un-registration method.

  • Pragmatic Solution: After exhausting all reasonable code-level fixes without a resolution on Windows, the user opted to mark the failing test as a known issue for that specific platform.
  • Final Fix: The single failing test in t/78-memleak.t was wrapped in a conditional TODO block that only executes on Windows (if ($^O =~ /MSWin32/i)), formally acknowledging the bug without blocking the build. This allows the test suite to pass in CI environments while flagging the issue for future, deeper investigation.
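A platform-conditional TODO block of the kind described above can be sketched with Test::More. The helper names and the stand-in check are hypothetical; the real test in t/78-memleak.t needs a running browser and is not reproduced here.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical helper: a TODO reason on Windows, undef everywhere else.
sub todo_reason_for {
    my ($os) = @_;
    return $os =~ /MSWin32/i
        ? 'reference cycle is not released on Windows'
        : undef;
}

# Stand-in for the real leak check, which needs a running browser.
sub object_is_cycle_free { return 1 }

TODO: {
    # An undefined $TODO leaves the test as a normal, blocking test.
    local $TODO = todo_reason_for($^O);
    ok(object_is_cycle_free(), 'no memory cycles after close');
}

done_testing();
```

On Windows a failure inside the block is reported as an expected TODO failure and does not fail the run; on other platforms the same assertion stays fully enforced.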

Part VI: CI Hardening (March

A final failure in the GitHub Actions CI environment revealed one last configuration flaw.

Key Milestones:

  • The Bug: The CI was running prove --nocount --jobs 3 -I local/ -bl xt t directly. This command was missing the crucial -It/lib include path, which is necessary for test files to locate the t::helper module. This resulted in nearly all tests failing with Can't locate t/helper.pm in @INC.
  • The Investigation: An analysis of Makefile.PL revealed a custom MY::test block specifically designed to inject the -It/lib flag into the make test command. This confirmed that make test is the correct, canonical way to run the test suite for this project.
  • The Fix: The .github/workflows/linux.yml file was modified to replace the direct prove call with make test in the Run Tests step. This ensures the CI environment runs the tests in the exact same way as a local developer, with all necessary include paths correctly configured by the project’s build system.

Final Outcome

After this long and arduous journey, the WWW::Mechanize::Chrome test suite is now stable and passing on all targeted platforms, with known platform-specific issues clearly documented in the code. The project is in a vastly more robust and reliable state.
