Raw Record Source

{
  "$type": "site.standard.document",
  "canonicalUrl": "https://numergent.com/2015-04/Visualizing-TVTropes-Part-3-Proper-batching.html",
  "path": "/2015-04/Visualizing-TVTropes-Part-3-Proper-batching.html",
  "publishedAt": "2015-04-08T08:44:33.000Z",
  "site": "at://did:plc:cf6futaebyc2k4wgzsr4v42k/site.standard.publication/3mp2ewx43js2g",
  "tags": [
    "neo4j",
    "clojure",
    "optimization",
    "performance",
    "tropology"
  ],
  "textContent": "The story so far\n\nYesterday we were discussing our (admittedly) somewhat ghetto, not-quite-batched Neo4j implementation.\n\nThe long and short of it is that I was initially attacking the import process as one would on a JDBC client for a relational database. Query for these values here, create those if some don't exist, insert relationships here, etc.\n\nThat seems to be woefully inefficient in Neo4j.\n\n\n\nThe second approach I tried was to reduce calls for a page to just a handful, following this procedure:\n\n Start from a page.\n Create a node for it if it doesn’t exist on the database.\n Parse the page, and get all the references to other tropes. \n Import all the nodes that we don’t have already in the database.\n Create an outgoing relationship from the central node to all the nodes it references.\n\nOn our test case, this turned out to decrease the import cost from the initial implementation from 52 seconds down to 12.  My conclusion was that we're still getting a significant hit from Neo4j's writing to the transaction log, which every call up there requires, plus likely missing some indices from the way I had to write the cypher calls.\n\nIf we want to stay with Neo4j, our option is to use their batching REST end point. Which will require some serious data -pre-processing.\n\nMeet the new batching\n\nIt's not the same as the old batching, it's actually using Neo4j's batch REST end point.\n\nAfter staring at the data for a bit, I decided the approach needed to be the following:\n\n Lump together all URLs we need to get - our import set.\n Handle exceptions on retrieving each one independently, so that we can continue importing 9 out of 10 and just mark that 10th one as having a problem if necessary.\n Find which of the import set pages we need to create and which we just need to update the data for.\n For the pages with get errors, only mark them as having an error and log the error.\n Figure out which of the pages from the import set are redirects. In that case we'll need to create two nodes - one with the original redirector URL, so we don't try to import it again, and one with the target redirected URL. \n Get all pages to link to from the import set - the link set.\n From the pages to link to, get all of those that do not exist.\n Remove out of those that do not exist the pages we're about to create or modify because they're part of the import set (since pages from the import set may link to each other).\n Create all pages from the import and link set.\n For all pages we just created - both import set and link set - add a label.\n Update the existing import set pages.\n\nThis yields a more complex workflow than the original plan, since now we're looking at how pages in our import set relate to each other, but has the potential for being faster.\n\nOur data\n\nLet's see about testing this with the 10 pages with the most outgoing links we have on the db right now.\n\n<code>\nMATCH (n)-[r]->() WITH COUNT(r) as c, n \nRETURN c, n.id, n.url ORDER BY c DESC LIMIT 10;\n</code>\n\nThis yields:\n\nWe'll use them as our test import set.\n\nChanges\n\nAs discussed on the last post, I ran into a bit of a modeling issue where sometimes I needed to pass a label value to a cypher query but couldn't parametrize the labels easily.  For now (and probably moving forward) I've made the decision to just file all imported nodes into an Article label, so that I don't have any index missed when I can't pass the label for an article group.\n\nAlso, since we _will_ be querying by for the node id, I've renamed our internal id to _code_ to avoid confusion. Will likely re-import the whole database anyway.\n\nNow, not being able to handle each page individually and having to batch everything up obviously will make the code less straightforward. The current version of the import functions looks a bit like this (very much a WIP and not handling redirects yet):\n\n<a id=\"import-page-set\"/>\n<code data-gist-id=ea7ebbe273cf6f9bcddb></code>\n\nBear with me here - that's mostly me thinking out loud as I write tests and figure out the best way to collate the data.  However, if you have any improvement suggestions (the to-create-link / to-create-import filters can get expensive) I'd love to hear them on Github.\n\nHaving said that, while that is not as short as the trivial per-page Clojure approach, it's still easy enough to read what's going on, and it expressed the described workflow well.\n\nThe good, the bad, the ugly\n\nNow, this section is going to be an assortment of notes.  Here be randomness.\n\n...\n\nGah. Hair-tearing time. The error you get from Neo4j when you accidentally have a PUT method where you should have POST is not exactly explicit.  If you get an obscure error, check you have the right method on the request first.\n\n...\n\nDamn the batch API took a bit to figure out for some reason - perhaps because I had gotten used to Cypher and neocons, and had to start by looking into what the REST endpoint expectations for each thing were myself.\n\nEventually I end up settling on these for create and relate:\n\n<code data-gist-id=673dc9eb46489456ea36></code>\n\nThat <code>batch-rel-ops</code> implementation is using cypher because while I can create a new relationship using the {id-number}/relationships REST end point, I didn't immediately see a way to tell to create a new one.  I've decided to batch a cypher query while I do my performance tests and look for an alternative after I've got some tests around it.\n\n...\n\nInteresting.  The Neo4j batch API only accepts positive integers for the request reference ID. It would be nice if that were obviously documented, since there is no particular reason why - it's after all just a reference. If it wasn't going to take negative integers, it could at least take longs.\n\n...\n\nFascinating. I ran into this during testing.\n\nSuppose the two equivalent queries:\n\n<code>\nMATCH ()-[r]->() RETURN DISTINCT r\nMATCH (n:Article)-[r]->(m:Article) RETURN DISTINCT r\n</code>\n\nThey're being executed on an almost empty database (4 nodes, 5 relationships) and neither is using an index.  I'm calling both using neocons' cypher querying, profiling with timbre. \n\nThe first one is profiled at 3.1 seconds, the second at 18ms.  I haven't the foggiest as to why.\n\n...\n\nAnd we're done lumping the majority of these changes into a single, batched REST call.\n\nThe results\n\nAfter some massive code surgery, we can now import a page with 1880 links to the same Docker container on about 15 seconds... which was what Cowboy Bebop took after our previous semi-batched changes.\n\nHmpf.\n\nOK, still a wee bit of an improvement, since we get to import about 2x as many links, but not significant enough. This is probably because I am creating the relationships using a Cypher query, since (as far as I know) we can't create unique relationships using the _relationships_ REST end point.  We need to improve this by adding an index, now that we can actually take advantage of it.\n\nSo let's see. If we do a...\n\n<code>CREATE INDEX ON :Article(code)</code>\n\nthen the batch insert of 1880 nodes and their relationships goes down from 15 seconds to 5.8s. Moreover, importing our top 10 pages (a total of about 15k nodes, 15k label assignments and 16k relationships) takes only 30s.\n\n_That's more like it_.\n\nWhile on the surface it would look like we're still down to only 20 pages imported per minute, there's several things to consider:\n\n That's on a test Docker container which is only using two cores (even though increasing the number of cores available to the container doesn't seem to make much difference for a single batch request);\n Only some of that cost comes from Neo4j, with another chunk of it coming from some Clojure filtering and mapping that I can likely improve;\n Querying for existing nodes once we have indices in is _fast_;\n This is likely the heaviest insert that the system will see, since other pages don't have as many outgoing links (95% of the currently imported ones have less than 450);\n The more pages we have imported, the more we'll just need to reference existing nodes;\n\nSure, some queries like the one necessary for creating the relationships will have to deal with many more nodes on the live site, but since we're doing joins on indexes they should be efficient.\n\nGood old Cowboy Bebop*, which initially had a test environment import cost of 52 seconds, is now down to 600ms.\n\nThat's an improvement I can definitely live with.\n\nNext steps\n\nOf course, this was some massive hacking and has only partial test coverage. I now need to fix the tests I have, handle redirects (which I don't on the current version) and clean up for an initial release.\n\nMeanwhile, any Clojurians more experienced than myself with how to optimize and/or improve the <code>filter</code> operations on the gist above?  That's a good chunk of my total time when dealing with a lot of new nodes, in particular filters which end up being a cartesian product.\n\nMore coming soon.",
  "title": "Visualizing TVTropes - Part 3: Proper batching"
}