External Publication
Visit Post

Some useful SQL(ite) tips I've learned

Jamie Tanna December 5, 2024
Source

I've recently spent some time working towards presenting some of the findings from dependency-management-data at work, which has required writing some fun queries, and then using Google Sheets to better visualise that data.

Not only have I taken advantage of some fairly straightforward queries, written off-the-cuff, or relied upon some pre-written queries in the DMD codebase, I've also found myself needing to branch out and write some much more complex SQL, as I didn't want to add any logic into the Google Sheet and have that only as a presentation layer, which could then be consumed by a Google Slides presentation.

In particular, because I haven't looked at presenting this data directly via Google Sheets visualisations, I ended up searching for things like sqlite buckets or how to make rows columns vice versa, to mixed luck.

Instead of this being a spatter of separate posts, I thought I'd collect them here for lower discoverability, but higher density.

Splitting data into buckets

This is a big one that took me many attempts to find an article / StackOverflow post that would explain quite right for what I wanted to do.

For instance, let's say we have the following data (limited for brevity):

platform organisation repo package_name version current_version package_manager package_file_path dep_types level advisory_type description supported_until eol_from
github stretchr testify go 1.11 1.23.3 gomod _codegen/go.mod ["golang"] ERROR UNMAINTAINED go 1.11 has been End-of-Life for 1897 days 2019-09-03
github wiremock wiremock gradle 4.5.1 4.5.1 gradle-wrapper perf-test/gradle/wrapper/gradle-wrapper.properties [] ERROR UNMAINTAINED gradle 4 has been End-of-Life for 2178 days 2018-11-26 2018-11-26
github wiremock wiremock-resilience-examples gradle 4.5.1 4.5.1 gradle-wrapper gradle/wrapper/gradle-wrapper.properties [] ERROR UNMAINTAINED gradle 4 has been End-of-Life for 2178 days 2018-11-26 2018-11-26
gitlab gitlab-org gitlab node 4.2.2 4.2.2 gitlabci lib/gitlab/ci/templates/Pages/Metalsmith.gitlab-ci.yml ["image"] ERROR UNMAINTAINED nodejs 4 has been End-of-Life for 2388 days 2017-04-01 2018-04-30

If we wanted to get a high-level view of what packages (via the package_name) were in the state of being UNMAINTAINED or DEPRECATED, how would we do that, so we can i.e. get the following output?

package_name bucket count(*)
go unmaintained 1
gradle unmaintained 2
node deprecated 2
node unmaintained 10
python deprecated 1
rails unmaintained 2
ruby unmaintained 1

To do this, we can use a case / when statement to group each type of data into a named bucket, like so:

select
package_name,
(
    case
    when abs(cast ((julianday(eol_from) - julianday('now')) as integer)) > (365 * 5)
        then 'unmaintained'
    when abs(cast ((julianday(supported_until) - julianday('now')) as integer)) > (365 * 5)
        then 'deprecated'
    else
        'THIS SHOULD NOT BE HIT'
    end
) as bucket,
count(*)
from advisories
where
  (
    abs(cast ((julianday(eol_from) - julianday('now')) as integer)) > (365 * 5)
  )
  or
  (
    abs(cast ((julianday(supported_until) - julianday('now')) as integer)) > (365 * 5)
  )
group by package_name, bucket
order by package_name, bucket

This is very useful when trying to then graph the resulting data, and I've ended up using this bucketing quite a few times after finding out how to do it ๐Ÿ™Œ๐Ÿผ

ORDER BY doesn't need to reference a column

Something interesting is the fact that it's possible to use an ORDER BY with an arbitrary statement, for instance:

select
  visibility,
  count(*)
from
  repository_metadata
group by
  visibility
order by
  (
    case
      visibility
      when 'PUBLIC' then 0
      when 'INTERNAL' then 1
      when 'PRIVATE' then 2
      else 3
    end
  )

Previously, I had been creating a separate column called ord and then was using this, but this is much simpler.

Querying JSON

I've written about this more in a separate post but one thing to note that I've recently been doing is relying on the shorthand "arrow functions" that are inbuilt to SQLite, and are familiar if you're coming from Postgres.

For instance, let's say we have the following:

select
  id,
  json_extract(data.json, '$.name') name
from
  data

We can actually further simplify this with an arrow operator, ->>:

 select
   id,
-  json_extract(data.json, '$.name') name
+  data.json ->> '$.name' name
 from
   data

Concatenating strings

Something I've done a bit before, but for posterity:

  -- ...
  datasource || ': ' || update_type as label,
  --         ^^ this concatenates between multiple strings
  -- ...

Very useful when trying to add useful labels for graphs.

Providing default values for NULLs

In the case that we're retrieving a nullable column from database, but want to provide a default value, we can use coalesce.

For instance:

select
  coalesce(current_version, version) as ver
from
  renovate

Or:

select
  coalesce(metadata, '{}') as m
-- ...

Date calculations

Something I've been doing since very early on with DMD is calculating "days between" dates, i.e. to given an indication of "you've been running Node 12.x for 900 days past its End-of-Life date".

For instance, if we want to take a column like:

supported_until

2024-10-22

And work out how many days it's been since that date, we could write a query using julianday, such as:

select
  supported_until,
  -- it's more useful to show the absolute (aka always positive) value to humans
  abs(cast(julianday(supported_until) - julianday('now') as integer)) as since
from
  advisories
where
  supported_until is not null
  and
  -- this number will be negative if it's in the past
  cast(julianday(supported_until) - julianday('now') as integer) < 0

Using subqueries as a way to produce column-focussed data (for presenting in Google Sheets)

Within the DMD codebase, I'm using the excellent sqlc for my database queries.

One thing it doesn't really support - which is fair - is heavily nested queries, including those with subqueries, so I don't generally lean on them as it's hard to use them with sqlc.

However, when preparing this presentation, I found that as I wasn't tied to sqlc while preparing the data, I could write the most cursed and nested subqueries I wanted ๐Ÿ˜ˆ

This allowed me to think about how to handle getting the right columns in the resulting output, at the cost of it generally being a less efficient query, but working for what I needed.

For instance, the below query:

select
(
    select
    count(*)
    from
    (
        select
        distinct package_manager, package_name
        from
        renovate
    )
) as num_deps,
(
    select
    count(*)
    from advisories
    where advisory_type = 'DEPRECATED'
) as total_deprecated,
(
    select
    count(*)
    from advisories
    where advisory_type = 'UNMAINTAINED'
) as total_unmaintained

This isn't the most complex example I could share, but out of an abundance of shame I won't go into it too much ๐Ÿซฃ

Discussion in the ATmosphere

Loading comments...