Raw Record Source

{
  "$type": "site.standard.document",
  "canonicalUrl": "https://johnnyreilly.com/posts/xml-read-and-write-with-node-js",
  "description": "This post demonstrates reading and writing XML in Node.js using fast-xml-parser. We will use the Docusaurus XML sitemap as an example.",
  "path": "/posts/xml-read-and-write-with-node-js",
  "publishedAt": "2022-11-22T00:00:00.000Z",
  "site": "at://did:plc:yy3apqjlms24kso7ahn7lbmb/site.standard.publication/3mova7c4nho2b",
  "tags": [
    "node.js",
    "docusaurus"
  ],
  "textContent": "This post demonstrates reading and writing XML in Node.js using fast-xml-parser. We'll use the Docusaurus XML sitemap as an example.\n\n\n\nUpdated 03/05/2023\n\nThis post talks about manipulating the Docusaurus sitemap as an example of how to work with XML in Node.js.\n\nIt's worth noting that, as of Docusaurus 3.3, Docusaurus offers a way to configure the sitemap, which I worked on.\n\nHowever, the techniques described here are still useful for working with XML in Node.js.\n\nDocusaurus sitemap\n\nI was prompted to write this post by wanting to edit the sitemap on my Docusaurus blog. I wanted to remove the /page/ and /tags/ routes from the sitemap. They effectively serve as duplicate content and I don't want them to be indexed by search engines. (A little more is required to remove them from search engines - see the section at the end of the post.)\n\nI was able to find the sitemap in the build folder of my Docusaurus site. It's called sitemap.xml and it's in the root of the build folder. It looks like this:\n\nfast-xml-parser\n\nAfter experimenting with a few different XML parsers I settled on fast-xml-parser. It's fast, it's simple and it's well maintained. It also handles XML namespaces and attributes well. (This appears to be rare in XML parsers.)\n\nLet's scaffold an example project alongside our Docusaurus site:\n\nAnd in the package.json file add a start script:\n\nFinally, create an empty index.ts file.\n\nReading XML\n\nOur Docusaurus sitemap is in the build folder of our Docusaurus site. Let's read it in and parse it into a JavaScript object:\n\nWe're using the XMLParser class to parse the XML into a JavaScript object. We're also setting the ignoreAttributes option to false to ensure that attributes are included in the parsed object. When we run this we get the following output:\n\nAs we can see, the fast-xml-parser library has parsed the XML into a JavaScript object. We can see that the urlset element has an array of url elements. 
Each url element has a loc, changefreq and priority element. We can also see that the urlset element has a number of attributes. This matches the XML we saw earlier and the interface we defined.\n\nFiltering and writing XML\n\nNow that we have the XML parsed into a JavaScript object we can filter it just like we would any other JavaScript object. We have all the power of JavaScript at our fingertips!\n\nAs I mentioned earlier, I want to remove all the URLs that represent duplicate content. This includes \"pagination\" URLs. These are URLs that are used to navigate between pages of content. For example, the URL https://johnnyreilly.com/page/10 is a pagination URL. I want to remove these URLs from the sitemap. I also want to get rid of the \"tags\" URLs. These are URLs that are used to navigate between posts that have a particular tag. For example, the URL https://johnnyreilly.com/tags/ajax is a tag URL. I want to remove these URLs from the sitemap too.\n\nThis is simplicity itself now that we're in JavaScript land. We can use the filter method on the url array to remove the URLs we don't want:\n\nWe can then update the url array with the filtered URLs:\n\nFinally, we can write the XML back out to a file:\n\nNote again that we're setting the ignoreAttributes option to false to ensure that attributes are included in the XML.\n\nLet's put it all together into a single file:\n\nWith that we're done. We can run the script and see the result:\n\nConclusion\n\nIn this post we've seen how to use the fast-xml-parser library to parse XML into a JavaScript object, operate upon that object and then write it back out to XML.\n\nIf you'd like to see how I'm using this directly on my blog, it's probably worth looking at this PR.\n\nPS noindex\n\nThis is unrelated to XML processing, but I didn't want to miss this out. Merely editing the sitemap isn't enough to remove those routes from search engines. 
We're also going to serve a noindex response header for those routes by adjusting the staticwebapp.config.json file of our Static Web App:",
  "title": "XML: read and write with Node.js"
}