Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiazfynkrth3qug2ffnvb56qpcpq4ltip5phq7exb4qiu5mtcseqyq",
    "uri": "at://did:plc:46ti67tc37qcmwp2vaynk6fq/app.bsky.feed.post/3mhk45h7mjra2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreifqaymphlhaou7ymuxlsdcq46k6c4k27dbs2lrqq2rxp5ioaxmdoy"
    },
    "mimeType": "image/jpeg",
    "size": 8554
  },
  "path": "/cj/?p=2129",
  "publishedAt": "2026-03-21T03:37:53.220Z",
  "site": "https://wp.c9h.org",
  "tags": [
    "Tweet",
    "@_",
    "@results",
    "@INC"
  ],
  "textContent": "# The\nWWW::Mechanize::Chrome Saga: A Comprehensive Narrative of PR #104\n\nThis document synthesizes the extensive work performed from March\n13th to March 20th, 2026, to harden, stabilize, and refactor the\n`WWW::Mechanize::Chrome` library and its test suite. This\neffort involved deep dives into asynchronous programming,\nplatform-specific bug hunting, and strategic architectural\ndecisions.\n\n* * *\n\n## Part I:\nThe Quest for Cross-Platform Stability (March 13 – 16)\n\nThe initial phase of work focused on achieving a “green” test suite\nacross a variety of Linux distributions and preparing for a new release.\nThis involved significant hardening of the library to account for\ndifferent browser versions, OS-level security restrictions, and\nfilesystem differences.\n\n### Key Milestones &\nEngineering Decisions:\n\n  * **Fedora & RHEL-family Success:** A major effort\nwas undertaken to achieve a 100% pass rate on modern Fedora 43 and\nCentOS Stream 10. This required several key engineering decisions to\nhandle modern browser behavior:\n\n    * **Decision: Implement Asynchronous DOM Serialization\nFallback.** Synchronous fallbacks in an async context are\ndangerous. To prevent `Resource was not cached` errors during\n`saveResources`, we implemented a fully asynchronous fallback\nin `_saveResourceTree`. By chaining\n`_cached_document` with `DOM.getOuterHTML`\nmessages, we can reconstruct document content without blocking the event\nloop, even if Chromium has evicted the resource from its cache. This\nalso proved resilient against Fedora’s security policies, which often\nblock `file://` access.\n    * **Decision: Truncate Filenames for Cross-Platform\nSafety.** To avoid `File name too long` errors,\nespecially on Windows where the `MAX_PATH` limit is 260\ncharacters, `filenameFromUrl` was hardened. The filename\ntruncation was reduced to a more conservative **150\ncharacters**, leaving ample headroom for deeply nested CI\ntemporary directories. 
Logic was also added to preserve file extensions\nduring truncation and to sanitize backslashes from URI paths.\n    * **Decision: Expand Browser Discovery Paths.** To\nsupport RHEL-based systems out-of-the-box, the\n`default_executable_names` was expanded to include\n`headless_shell` and search paths were updated to include\n`/usr/lib64/chromium-browser/`.\n    * **Decision: Mitigate Race Conditions with Stabilization Waits\nand Resilient Fetching.** On fast systems,\n`DOM.documentUpdated` events could invalidate\n`nodeId`s immediately after navigation, causing XPath queries\nto fail with “Could not find node with given id”. A small stabilization\n`sleep(0.25s)` was added after page loads to ensure the DOM\nis settled. Furthermore, the asynchronous DOM fetching loop was hardened\nto gracefully handle these errors by catching protocol errors and\nreturning an empty string for any node that was invalidated during\nserialization, ensuring the overall process could complete.\n  * **Windows Hardening:**\n    * **Decision: Adopt Platform-Aware Watchdogs.** The test\nsuite’s reliance on `ualarm` was a blocker for Windows, where\nit is not implemented. The `t::helper::set_watchdog` function\nwas refactored to use standard `alarm()` (seconds) on Windows\nand `ualarm` (microseconds) on Unix-like systems, enabling\nconsistent test-level timeout enforcement.\n  * **Version 0.77 Release:**\n    * **Decision: Adopt SOP for Version Synchronization.**\nThe project maintains duplicate version strings across 24+ files. A\nStandard Operating Procedure was adopted to use a batch-replacement tool\nto update all sub-modules in `lib/` and to always run\n`make clean` and `perl Makefile.PL` to ensure\n`META.json` and `META.yml` reflect the new\nversion. 
After achieving stability on Linux, the project version was\nbumped to 0.77.\n  * **Infrastructure & Strategic Work:**\n    * The `ad2` Windows Server 2025 instance was restored and\noptimized, with Active Directory demoted and disk I/O performance\nimproved.\n    * A strategic proposal for the **Heterogeneous Directory\nReplication Protocol (HDRP)** was drafted and published.\n\n\n\n* * *\n\n## Part II: The Great Async Refactor (March 17 – 18)\n\nDespite success on Linux, tests on the slow `ad2` Windows\nhost were still plagued by intermittent, indefinite hangs. This\ntriggered a fundamental architectural shift to move the library’s core\nfrom a mix of synchronous and asynchronous code to a fully non-blocking\ninternal API.\n\n### Key Milestones & Engineering Decisions:\n\n  * **Decision: Expose a `_future` API.**\nInstead of hardcoding timeouts in the library, the core strategy was to\nrefactor all blocking methods (`xpath`, `field`,\n`get`, etc.) into thin wrappers around new non-blocking\n`..._future` counterparts. This moved timeout management to\nthe test harness, allowing for flexible and explicit handling of\nstalls.\n\n        # Example library implementation\n        sub xpath($self, $query, %options) {\n            return $self->xpath_future($query, %options)->get;\n        }\n\n        sub xpath_future($self, $query, %options) {\n            # Async implementation using $self->target->send_message(...)\n        }\n\n  * **Decision: Centralize Test Hardening in a Helper.**\nA dedicated test library, `t/lib/t/helper.pm`, was created to\ncontain all stabilization logic. 
“Safe” wrappers (`safe_get`,\n`safe_xpath`) were implemented there, using\n`Future->wait_any` to race asynchronous operations against\na timeout, preventing tests from hanging.\n\n        # Example test helper implementation\n        sub safe_xpath {\n            my ($mech, $query, %options) = @_;\n            my $timeout = delete $options{timeout} || 5;\n            my $call_f = $mech->xpath_future($query, %options);\n            my $timeout_f = $mech->sleep_future($timeout)->then(sub { Future->fail(\"Timeout\") });\n            return Future->wait_any($call_f, $timeout_f)->get;\n        }\n\n  * **Decision: Refactor Node Attribute Cache.**\nInvestigations into flaky checkbox tests (`t/50-tick.t`)\nrevealed that `WWW::Mechanize::Chrome::Node` was storing\nattributes as a flat list (`[key, val, key, val]`), which was\ninefficient for lookups and individual updates. The cache was refactored\nto use a **HashRef**, providing O(1) lookups\nand enabling atomic dual-updates where both the browser property (via\nJS) and the internal library attribute are updated\ntogether.\n\n  * **Decision: Implement Self-Cancelling Socket\nWatchdog.** On Windows, traditional watchdog processes often\nfailed to detect parent termination, leading to 60-second hangs after\nsuccessful tests. We implemented a new socket-based watchdog in\n`t::helper` that listens on an ephemeral port; the background\nprocess terminates immediately when the parent socket closes,\neliminating these cumulative delays.\n\n  * **Decision: Deep Recursive Refactoring & Form\nSelection.** To make the API truly non-blocking, the entire\ninternal call stack had to be refactored. For example, making\n`get_set_value_future` non-blocking required first making its\ndependency, `_field_by_name`, asynchronous. This culminated\nin refactoring the entire form selection API (`form_name`,\n`form_id`, etc.) 
to use the new asynchronous\n`_future` lookups, which was a key step in mitigating the\nWindows deadlocks.\n\n  * **Decision: Fix Critical Regressions & Memory\nCycles.**\n\n    * **Evaluation Normalization:** Implemented a\n`_process_eval_result` helper to centralize the parsing of\nresults from `Runtime.evaluate`. This ensures consistent\nhandling of return values and exceptions between synchronous\n(`eval_in_page`) and asynchronous (`eval_future`)\ncalls.\n\n    * **Memory Cycle Mitigation:** A significant memory\nleak was discovered where closures attached to CDP event futures (like\nfor asynchronous body retrieval) would capture strong references to\n`$self` and the `$response` object, creating a\ncircular reference. The established rule is to now always use\n`Scalar::Util::weaken` on both `$self` and any\nother relevant objects before they are used inside a\n`->then` block that is stored on an object.\n\n    * **Context Propagation (`wantarray`):** A\nmajor regression was discovered where Perl’s `wantarray`\ncontext, which distinguishes between scalar and list context, was lost\ninside asynchronous `Future->then` blocks. This caused\nmethods like `xpath` to return incorrect results (e.g., a\ncount instead of a list of nodes). The solution was to adopt the “Async\nContext Pattern”: capture `wantarray` in the synchronous\nwrapper, pass it as an option to the `_future` method, and\nthen use that captured value inside the future’s final resolution\nblock.\n\n          # Synchronous Wrapper\n          sub xpath($self, $query, %options) {\n              $options{ wantarray } = wantarray; # 1. Capture\n              return $self->xpath_future($query, %options)->get; # 2. Pass\n          }\n\n          # Asynchronous Implementation\n          sub xpath_future($self, $query, %options) {\n              my $wantarray = delete $options{ wantarray }; # 3. Retrieve\n              # ... 
async logic ...\n              return $doc->then(sub {\n                  if ($wantarray) { # 4. Respect\n                      return Future->done(@results);\n                  } else {\n                      return Future->done($results[0]);\n                  }\n              });\n          }\n\n    * **Asynchronous Body Retrieval & Robust Content\nFallbacks:** Fixed a bug where `decoded_content()`\nwould return empty strings by ensuring it awaited a\n`__body_future`. This was implemented by storing the\nretrieval future directly on the response object\n(`$response->{__body_future}`). To make this more robust,\na tiered strategy was implemented: first try to get the content from the\nnetwork response, but if that fails (e.g., for `about:blank`\nor due to cache eviction), fall back to a JavaScript\n`XMLSerializer` to get the live DOM content.\n\n    * **Signature Hardening:** Fixed “Too few arguments”\nerrors when using modern Perl signatures with\n`Future->then`. Callbacks were updated to use optional\nparameters (`sub($result = undef) { ... }`) to gracefully\nhandle futures that resolve with no value.\n\n    * **XHTML “Split-Brain” Bug:** Resolved a\nlong-standing Chromium bug (40130141) where content provided via\n`setDocumentContent` is parsed differently than content\nloaded from a URL. 
A workaround was implemented: for XHTML documents,\nWMC now uses a JavaScript-based XPath evaluation\n(`document.evaluate`) against the live DOM, bypassing the\nbroken CDP search mechanism.\n\n\n\n\n### Derived Architectural Rules & SOPs:\n\n  * **Rule: Always provide `_future` variants.**\nEvery library method that interacts with the browser via CDP must have a\nnon-blocking asynchronous counterpart.\n  * **Rule: Centralize stabilization in the test layer.**\nAll timeout and retry logic should reside in the test harness\n(`t/lib/t/helper.pm`), not in the core library.\n  * **Rule: Explicitly propagate `wantarray`\ncontext.** Synchronous wrappers must capture the caller’s context\nand pass it down the `Future` chain to ensure correct\nscalar/list behavior.\n  * **Rule: The entire call chain must be asynchronous.**\nNon-blocking timeouts work only if nothing in the chain blocks: even a\nsingle “hidden” blocking call in an otherwise asynchronous method will\ncause a stall.\n  * **SOP: Reduce Library Noise.** Raw diagnostic calls\n(`warn`, `note`, `diag`) should not remain\nin library code at commit time. Convert them to the internal\n`$self->log('debug', ...)` mechanism so that TAP output stays\nclean for CI systems.\n\n\n\n* * *\n\n## Part III: The `MutationObserver` Saga (March 19)\n\nWith most of the library refactored to be asynchronous, one stubborn\ntest, `t/65-is_visible.t`, continued to fail with timeouts.\nThis led to an ambitious, but ultimately unsuccessful, attempt to\nreplace the `wait_until_visible` polling logic with a more\n“modern” `MutationObserver`.\n\n### Key Milestones & Challenges:\n\n  * **The Theory:** The goal was to replace an inefficient\n`repeat { sleep }` loop with an event-driven\n`MutationObserver` in JavaScript that would notify Perl\nimmediately when an element’s visibility changed.\n  * **Implementation & Cascade Failure:** The\nimplementation proved incredibly difficult and introduced a series of\nnew, hard-to-diagnose bugs:\n\n    1. 
An incorrect function signature for\n`callFunctionOn_future`.\n    2. A critical unit mismatch, passing seconds from Perl to JavaScript’s\n`setTimeout`, which expects milliseconds.\n    3. A fundamental hang where the `MutationObserver`’s\nJavaScript `Promise` would never resolve, even after the\nunderlying DOM element changed.\n  * **Debugging Maze:** Multiple attempts to fix the\n`checkVisibility` JavaScript logic inside the observer\ncallback, including making it more robust by adding DOM tree traversal\nand extensive `console.log` tracing, failed to resolve the\nhang. This highlighted how opaque and difficult it is to debug complex,\ncross-language asynchronous interactions, especially when dealing with\nlow-level browser APIs.\n\n\n\n### Procedural Learning: Granular Edits\n\nThe effort was plagued by procedural missteps in using automated\nfile-editing tools. Initial attempts to replace large code blocks in a\nsingle operation led to accidental code loss and match failures.\n\n  * **Decision: Adopt “Delete, then Add” Workflow.**\nFollowing forceful user correction, a new SOP was established for all\nfuture modifications:\n\n    1. **Isolate:** Break the file into small, manageable\nchunks (e.g., 250 lines).\n    2. **Delete:** Perform a “delete” operation by replacing\nthe old code block with an empty string.\n    3. **Add:** Perform an “add” operation by inserting the\nnew code into the empty space.\n    4. **Verify:** Verify each atomic step before\nproceeding. 
This granular process, while slower, restored precise\ncontrol over edits to the large\n`Chrome.pm` module.\n\n\n\nThe consistent failure of the `MutationObserver` approach\neventually led to the decision to abandon it in favor of stabilizing the\noriginal, more transparent implementation.\n\n* * *\n\n## Part IV: Reversion and Final Stabilization (March 20)\n\nAfter exhausting all reasonable attempts to fix the\n`MutationObserver`, a strategic decision was made to revert\nto the simpler, more transparent polling implementation and fix it\ncorrectly. This proved to be the correct path to a stable solution.\n\n### Key Milestones & Engineering Decisions:\n\n  * **Decision: Perform Strategic Reversion.** The\n`MutationObserver` implementation, when integrated via\n`callFunctionOn_future` with `awaitPromise`,\nproved fundamentally unstable. Its JavaScript promise would consistently\nfail to resolve, causing indefinite hangs. A decision was made to\n**revert all `MutationObserver` code** from\n`WWW::Mechanize::Chrome.pm` and restore the original\n`repeat { sleep }` polling mechanism. A stable,\nunderstandable solution was prioritized over an elegant but broken\none.\n  * **Decision: Correct Timeout Delegation in the\nHarness.** The root cause of the original timeout failure was\nidentified as a race condition in the `t/lib/t/helper.pm`\ntest harness. The `safe_wait_until_*` wrappers were\nimplementing their own timeout (via `wait_any` and\n`sleep_future`) that raced against the underlying polling\nfunction’s internal timeout. This led to intermittent failures on slow\nmachines. The helpers were refactored to **delegate all timeout\nmanagement to the library’s polling functions**, ensuring a\nsingle, authoritative timer controlled the operation.\n  * **Decision: Optimize Polling Performance.** At the\nuser’s request, the polling interval was reduced from 300ms to\n**150ms**. 
This modest performance improvement reduced the\ntest suite’s wallclock execution time by over a second while maintaining\nstability.\n  * **Decision: Tune Test Watchdogs.** The global watchdog\ntimeout was adjusted to 12 seconds, calculated as 1.5x the\nobserved real execution time of the optimized test (about 8 seconds).\nThis provides a data-driven safety margin for CI.\n\n\n\n* * *\n\n## Part V: The Last Bug – A Platform-Specific Memory Leak (March 20)\n\nWith all other tests passing, a single memory leak failure in\n`t/78-memleak.t` persisted, but only on the Windows\n`ad2` environment. This required a different approach than\nthe timeout fixes.\n\n### Key Milestones:\n\n  * **The Bug:** A strong reference cycle involving the\n`on_dialog` event listener was not being broken on Windows,\ndespite multiple attempts to fix it. Fixes that worked on Linux (such as\ncalling `on_dialog(undef)` in `DESTROY`) were not\nsufficient on the Windows host.\n  * **The Diagnosis:** The issue was determined to be a\ndeep, platform-specific interaction between Perl’s garbage collector,\nthe `IO::Async` event loop implementation on Windows, and the\n`Test::Memory::Cycle` module. The cycle report was identical\non both platforms, but the cleanup behavior was different.\n  * **Failed Attempts:** A series of increasingly\naggressive fixes were attempted to break the cycle, including:\n\n    1. Moving the `on_dialog(undef)` call from\n`close()` to `DESTROY()`.\n    2. Explicitly `delete`ing the listener and callback\nproperties from the object hash in `DESTROY`.\n    3. 
Swapping between `$self->remove_listener` and\n`$self->target->unlisten` in a mistaken attempt to find\nthe correct un-registration method.\n  * **Pragmatic Solution:** After exhausting all reasonable\ncode-level fixes without a resolution on Windows, the user opted to mark\nthe failing test as a known issue for that specific platform.\n  * **Final Fix:** The single failing test in\n`t/78-memleak.t` was wrapped in a `TODO` block\nthat is applied only on Windows\n(`if ($^O =~ /MSWin32/i)`), formally acknowledging the bug\nwithout blocking the build. This allows the test suite to pass in CI\nenvironments while flagging the issue for future, deeper\ninvestigation.\n\n\n\n* * *\n\n## Part VI: CI Hardening (March 20)\n\nA final failure in the GitHub Actions CI environment revealed one\nlast configuration flaw.\n\n### Key Milestones:\n\n  * **The Bug:** The CI was running\n`prove --nocount --jobs 3 -I local/ -bl xt t` directly. This\ncommand was missing the crucial `-It/lib` include path, which\nis necessary for test files to locate the `t::helper` module.\nThis resulted in nearly all tests failing with\n`Can't locate t/helper.pm in @INC`.\n  * **The Investigation:** An analysis of\n`Makefile.PL` revealed a custom `MY::test` block\nspecifically designed to inject the `-It/lib` flag into the\n`make test` command. This confirmed that\n`make test` is the correct, canonical way to run the test\nsuite for this project.\n  * **The Fix:** The\n`.github/workflows/linux.yml` file was modified to replace\nthe direct `prove` call with `make test` in the\n`Run Tests` step. 
This ensures the CI environment runs the\ntests in the exact same way as a local developer, with all necessary\ninclude paths correctly configured by the project’s build system.\n\n\n\n## Final Outcome\n\nAfter this long and arduous journey, the\n`WWW::Mechanize::Chrome` test suite is now stable and\n**passing on all targeted platforms**, with known\nplatform-specific issues clearly documented in the code. The project is\nin a vastly more robust and reliable state.",
  "title": "C.J. Collier: The WWW::Mechanize::Chrome Saga: A Comprehensive Narrative of PR #104",
  "updatedAt": "2026-03-21T01:52:46.000Z"
}