{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiaxxt74o5bobsp6k3uwz5yrt4tdcqv6mzbypytv62qhdg77tgaxhi",
    "uri": "at://did:plc:wcnuqk5ofjntnj6rke7oplqd/app.bsky.feed.post/3mngrnnqqpd22"
  },
  "description": "This week, in between a bunch of other chaos, I got to do something that only happens once a decade... I bought a new Network Attached Storage (NAS) server (a Synology DS1525+) to replace my older DS1315. Between the DRAM shortage and hard disk shortages, it wasn't a particularly great time to do it, but my older NAS had one of its five drives completely crash out after almost 6 years of continuous operation. The NAS itself was 10 years old so it was about time for an upgrade.\n\nBeing an old comp",
  "path": "/moving-data-around-still-sucks/",
  "publishedAt": "2026-06-04T04:23:13.000Z",
  "site": "https://www.counting-stuff.com",
  "tags": [
    "Windirstat"
  ],
  "textContent": "This week, in between a bunch of other chaos, I got to do something that only happens once a decade... I bought a new Network Attached Storage (NAS) server (a Synology DS1525+) to replace my older DS1315. Between the DRAM shortage and hard disk shortages, it wasn't a particularly great time to do it, but my older NAS had one of its five drives completely crash out after almost 6 years of continuous operation. The NAS itself was 10 years old so it was about time for an upgrade.\n\nBeing an old computer nerd, I've been carrying data with me for the past 25+ years. Somehow I've got 1.7TB worth of raw photos dating back to 2010, and 530GB worth of music collected since the late 90s. Pile on old videos, lots of various project data over the years, and I had to move 5.4 TB worth of data. There wasn't any rush to do everything, but in total, it took close to 36 hours to move the data over the 1-gigabit network I had set up at home.\n\nWhile these days at work, manipulating 5 TB of data is mostly trivial thanks to how massive clusters can process data in parallel, many of us forget about the various issues that pop up when you have to actually _move_ significant amounts of data around. It's one of the more basic bits of data engineering that easily fades into the background amidst all the pipelines.\n\nAs an example, the transfer between the two NAS devices was essentially a big rsync job. One of the many options you can use for such a job is to enable compression of the data before transmitting. Usually, because CPUs are significantly faster than network ports, enabling compression allows you to attain higher overall throughput. But in the case of my 10-year-old NAS box, with compression enabled it could barely move data at an average of about 4-5MB/s down the wire. 
In fact, _disabling_ compression allowed the NAS to push on average 45-60MB/s, at least a 10x speed increase and enough to pretty much max out the 1-gigabit port linking the two devices together.\n\nIn theory, if I spent time setting up my network switches and ran enough cables, I could get up to 2 Gbps of bandwidth between the two boxes, but that would've been a very delicate operation compared to just waiting for the transfer to complete. Either way, with all the advances in the speed of computing itself, it's easy to forget that the speed of networking has not improved nearly as much. Even now, the fastest networking gear that's easily available to consumers is just 10 gigabits a second. Such technology would've reduced the transfer time to under two hours, but neither of the NAS devices has a network interface that supports those speeds.\n\nThis reminds me again of when I was at Bitly and, after a big data center move, we had to decide whether about 80TB of historic log data should be transferred from cold storage into the hot analytics cluster. After doing the math, it would have cost something close to $25,000 to simply transfer the bytes. It would've also taken many, many, many days. Ultimately I decided that we didn't use that data enough to warrant the cost and time. If we actually did need the data, it'd be significantly cheaper to just spin up a temporary analytics cluster and do the analysis remotely.\n\nEven when I worked at Google on various storage-related projects, it was hard to get people who weren't deep in the nitty-gritty of moving data around to fully grasp just how LONG it takes to move significant amounts of data. 
Considering that various enterprise customers could easily have petabytes of data in their various systems, moving that much data around can have all sorts of horrific unexpected side effects, like completely saturating the optical fiber to a chunk of a data center and causing outages, or even worse, saturating a crowded undersea cable, which causes even bigger outages.\n\nMoving data around is a bit of a mini engineering problem, but so is cleaning up the data. My disk array has a 7TB capacity, and it is human nature to be very sloppy about what you leave around when you start with so much storage space. Over the decade, I had been lax about deleting useless data and had kept large, poorly-compressed video around. So a secondary project involved trawling through the array looking for things to clean up.\n\nI still use a trusty old tool called Windirstat to scan through my drives and generate an interactive treemap to help me find the biggest files to delete. I also had to ask myself if I was really going to watch that old series from a decade ago again or not. For large video files, I wound up re-encoding a lot of things into h.265 format because it gives a better compression ratio than h.264 or older codecs.\n\nThe one thing I need to figure out is how to cull the photo archive. There are obviously tens of thousands of photos and a non-trivial number of them aren't exactly \"keepers\". A thorough and very critical culling could probably save me a ton of space... But someone's gotta go in there and sift out the bad shots and no magical AI in the world will have the context of memory needed to make sense of why some shots that are objectively bad (poor focus, motion blur, weird subjects) need to be kept for posterity, while dozens of variations of a normal portrait shot can be safely removed in favor of one \"winning\" shot. Not even the various auto culling tools that Adobe is trying out in Lightroom can fully cope with this problem.\n\nAnyways. 
It was a good reminder that foundational data engineering still isn't just about juggling YAML files.",
  "title": "Moving data around still sucks",
  "updatedAt": "2026-06-04T04:23:13.230Z"
}