External Publication

Some useful SQL(ite) tips I've learned

Jamie Tanna December 5, 2024

I've recently spent some time working towards presenting some of the findings from dependency-management-data at work, which has required writing some fun queries, and then using Google Sheets to better visualise that data.

Not only have I taken advantage of some fairly straightforward queries, written off-the-cuff, or relied upon some pre-written queries in the DMD codebase, I've also found myself needing to branch out and write some much more complex SQL, as I didn't want to add any logic into the Google Sheet and have that only as a presentation layer, which could then be consumed by a Google Slides presentation.

In particular, because I haven't looked at presenting this data directly via Google Sheets visualisations, I ended up searching for things like sqlite buckets or how to make rows columns vice versa, to mixed luck.

Instead of this being a spatter of separate posts, I thought I'd collect them here for lower discoverability, but higher density.

Splitting data into buckets

This is a big one that took me many attempts to find an article / StackOverflow post that would explain quite right for what I wanted to do.

For instance, let's say we have the following data (limited for brevity):

platform	organisation	repo	package_name	version	current_version	package_manager	package_file_path	dep_types	level	advisory_type	description	supported_until	eol_from
github	stretchr	testify	go	1.11	1.23.3	gomod	_codegen/go.mod	["golang"]	ERROR	UNMAINTAINED	go 1.11 has been End-of-Life for 1897 days		2019-09-03
github	wiremock	wiremock	gradle	4.5.1	4.5.1	gradle-wrapper	perf-test/gradle/wrapper/gradle-wrapper.properties	[]	ERROR	UNMAINTAINED	gradle 4 has been End-of-Life for 2178 days	2018-11-26	2018-11-26
github	wiremock	wiremock-resilience-examples	gradle	4.5.1	4.5.1	gradle-wrapper	gradle/wrapper/gradle-wrapper.properties	[]	ERROR	UNMAINTAINED	gradle 4 has been End-of-Life for 2178 days	2018-11-26	2018-11-26
gitlab	gitlab-org	gitlab	node	4.2.2	4.2.2	gitlabci	lib/gitlab/ci/templates/Pages/Metalsmith.gitlab-ci.yml	["image"]	ERROR	UNMAINTAINED	nodejs 4 has been End-of-Life for 2388 days	2017-04-01	2018-04-30

If we wanted to get a high-level view of what packages (via the package_name) were in the state of being UNMAINTAINED or DEPRECATED, how would we do that, so we can i.e. get the following output?

package_name	bucket	count(*)
go	unmaintained	1
gradle	unmaintained	2
node	deprecated	2
node	unmaintained	10
python	deprecated	1
rails	unmaintained	2
ruby	unmaintained	1

To do this, we can use a case / when statement to group each type of data into a named bucket, like so:

select
package_name,
(
    case
    when abs(cast ((julianday(eol_from) - julianday('now')) as integer)) > (365 * 5)
        then 'unmaintained'
    when abs(cast ((julianday(supported_until) - julianday('now')) as integer)) > (365 * 5)
        then 'deprecated'
    else
        'THIS SHOULD NOT BE HIT'
    end
) as bucket,
count(*)
from advisories
where
  (
    abs(cast ((julianday(eol_from) - julianday('now')) as integer)) > (365 * 5)
  )
  or
  (
    abs(cast ((julianday(supported_until) - julianday('now')) as integer)) > (365 * 5)
  )
group by package_name, bucket
order by package_name, bucket

This is very useful when trying to then graph the resulting data, and I've ended up using this bucketing quite a few times after finding out how to do it 🙌🏼

`ORDER BY` doesn't need to reference a column

Something interesting is the fact that it's possible to use an ORDER BY with an arbitrary statement, for instance:

select
  visibility,
  count(*)
from
  repository_metadata
group by
  visibility
order by
  (
    case
      visibility
      when 'PUBLIC' then 0
      when 'INTERNAL' then 1
      when 'PRIVATE' then 2
      else 3
    end
  )

Previously, I had been creating a separate column called ord and then was using this, but this is much simpler.

Querying JSON

I've written about this more in a separate post but one thing to note that I've recently been doing is relying on the shorthand "arrow functions" that are inbuilt to SQLite, and are familiar if you're coming from Postgres.

For instance, let's say we have the following:

select
  id,
  json_extract(data.json, '$.name') name
from
  data

We can actually further simplify this with an arrow operator, ->>:

 select
   id,
-  json_extract(data.json, '$.name') name
+  data.json ->> '$.name' name
 from
   data

Concatenating strings

Something I've done a bit before, but for posterity:

  -- ...
  datasource || ': ' || update_type as label,
  --         ^^ this concatenates between multiple strings
  -- ...

Very useful when trying to add useful labels for graphs.

Providing default values for `NULL`s

In the case that we're retrieving a nullable column from database, but want to provide a default value, we can use coalesce.

For instance:

select
  coalesce(current_version, version) as ver
from
  renovate

Or:

select
  coalesce(metadata, '{}') as m
-- ...

Date calculations

Something I've been doing since very early on with DMD is calculating "days between" dates, i.e. to given an indication of "you've been running Node 12.x for 900 days past its End-of-Life date".

For instance, if we want to take a column like:

supported_until

2024-10-22

And work out how many days it's been since that date, we could write a query using julianday, such as:

select
  supported_until,
  -- it's more useful to show the absolute (aka always positive) value to humans
  abs(cast(julianday(supported_until) - julianday('now') as integer)) as since
from
  advisories
where
  supported_until is not null
  and
  -- this number will be negative if it's in the past
  cast(julianday(supported_until) - julianday('now') as integer) < 0

Using subqueries as a way to produce column-focussed data (for presenting in Google Sheets)

Within the DMD codebase, I'm using the excellent sqlc for my database queries.

One thing it doesn't really support - which is fair - is heavily nested queries, including those with subqueries, so I don't generally lean on them as it's hard to use them with sqlc.

However, when preparing this presentation, I found that as I wasn't tied to sqlc while preparing the data, I could write the most cursed and nested subqueries I wanted 😈

This allowed me to think about how to handle getting the right columns in the resulting output, at the cost of it generally being a less efficient query, but working for what I needed.

For instance, the below query:

select
(
    select
    count(*)
    from
    (
        select
        distinct package_manager, package_name
        from
        renovate
    )
) as num_deps,
(
    select
    count(*)
    from advisories
    where advisory_type = 'DEPRECATED'
) as total_deprecated,
(
    select
    count(*)
    from advisories
    where advisory_type = 'UNMAINTAINED'
) as total_unmaintained

This isn't the most complex example I could share, but out of an abundance of shame I won't go into it too much 🫣