Adding Visual Context to the Dagbanli Dictionary: How the University of Ghana HCI Lab Dataset Powered our Sentence Matching System

en.planet.wikimedia.org [Unofficial] May 27, 2026

Colorful handwoven kafuna from Ghana, made with straw and leather handles, arranged on the ground.

The University of Ghana HCI Lab contributed a Dagbanli speech dataset alongside Mozilla Common Voice. Merging two sources with different formats, and matching sentences to lexemes in an agglutinative language, required us to build a custom three‑tier matching system.

Introduction

In the sixth post of this series, we explained how we integrated Mozilla Common Voice sentences and audio into the Dagbanli dictionary. Those recordings gave us thousands of spoken usage examples. But Mozilla Common Voice was not our only external source. The University of Ghana’s Human‑Computer Interaction (HCI) Lab had produced its own Dagbanli speech dataset, and it came with a different structure and metadata.

Every sentence in the HCI Lab dataset is accompanied by an image. More than 600 unique images are used across 18,198 sentences, each describing a real‑world scene. By merging this dataset with Mozilla Common Voice, we added a visual context section to the dictionary. Users see an image, read a sentence that describes it, and hear a native speaker say that sentence.

However, merging two datasets with different formats was only half the challenge. The real work was matching sentences to the correct lexemes in an agglutinative language like Dagbanli. This post explains the structure of the UG HCI Lab dataset, the three‑tier matching engine we built, and the extra steps we took to prevent image link rot.

1. The UG HCI Lab Dataset

The University of Ghana HCI Lab created a speech dataset for Dagbanli as part of broader research into language technology for under‑resourced languages (you can read more about the lab’s work in their 2025 paper). The dataset, last updated in February 2026, was originally designed for automatic speech recognition and linguistic analysis. It was built around an image description task: native speakers were shown a photograph and asked to describe it in Dagbanli, with the audio recorded and transcribed. This means every sentence is grounded in a specific image, a property that neither Wikidata nor Mozilla Common Voice has. We saw another use for this dataset: it could supply authentic, image‑rich example sentences for our dictionary.

What the dataset contains:

  • Sentences: 18,198 Dagbanli sentences, each describing a scene shown in an image.
  • Audio: recordings of native speakers reading each sentence.
  • Images: one image per sentence, provided as a URL (over 600 unique images in total).
  • Metadata: speaker gender, age, and year of recording.

The format differed from Common Voice. Common Voice data is distributed as a compressed .tar archive of MP3 clips with TSV metadata files. The HCI Lab dataset came as a single CSV, with the audio files in a separate folder. Its columns held the sentence, the audio file name (which embeds an identifying string we could use as a reference), the image URL, and demographics, along with extra metadata such as recording environment and device. The column names and audio container also differed from Common Voice. We used only what we needed.

We wrote a script that:

  1. Reads both CSVs (Common Voice and HCI Lab).
  2. Extracts the filename and creates a unique identifier for each row.
  3. Renames columns to a common set: sentence, image_url, source, gender, age_group, environment, and year.
  4. Uploads the audio files to R2 and produces a public audio_url.
  5. Tags each row with a source field: mozilla_validated, mozilla_invalidated, mozilla_other, or ug_transcribed.
  6. Combines both CSVs into a single output file (cv-ug-combined.csv) and adds a tokenised column holding the words of each row, which we use for all matching.
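
As a rough illustration of steps 3, 5, and 6, the sketch below maps one row from each source onto the common columns and adds the tokenised column. The input column names (text, img, age), the helper unifyRow, the sample values, and the whitespace tokeniser are all assumptions made for the example. They are not the real headers or code.

```javascript
// Hypothetical input column names; the real headers of either file may differ.
const COLUMN_MAPS = {
  mozilla: { sentence: "sentence", gender: "gender", age_group: "age" },
  ug: {
    sentence: "text",
    image_url: "img",
    gender: "gender",
    age_group: "age",
    environment: "environment",
    year: "year",
  },
};

const COMMON_COLUMNS = ["sentence", "image_url", "gender", "age_group", "environment", "year"];

// Map one raw row onto the common column set and tag it with its source.
function unifyRow(rawRow, origin, sourceTag) {
  const map = COLUMN_MAPS[origin];
  const row = { source: sourceTag };
  for (const column of COMMON_COLUMNS) {
    const rawName = map[column];
    const value = rawName === undefined ? undefined : rawRow[rawName];
    row[column] = value ?? ""; // columns a source lacks stay empty
  }
  // Simplified tokeniser: split on whitespace only.
  row.tokens = row.sentence.split(/\s+/).filter(Boolean);
  return row;
}

const combined = [
  // Placeholder word forms from this post, not a real Common Voice sentence.
  unifyRow({ sentence: "pieya piemi", gender: "female", age: "twenties" }, "mozilla", "mozilla_validated"),
  // Sentence fragment quoted later in this post; the other values are placeholders.
  unifyRow(
    { text: "ka sala mini kafuni ʒɛ", img: "https://example.org/fan.jpg", gender: "male", age: "thirties", environment: "indoor", year: "2025" },
    "ug",
    "ug_transcribed"
  ),
];

console.log(combined.map((row) => `${row.source}: ${row.tokens.length} tokens`));
```

Keeping the mapping in a plain lookup table means a new dataset version with renamed headers only needs a change in one place.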

This unified CSV became the input for the matching engine. The dictionary now displays examples from both sources, and users see where each example came from (for example, “More usage examples” for Mozilla Common Voice or “Visual Context” for the University of Ghana HCI Lab).

2. Agglutinative Morphology

Dagbanli is an agglutinative language. Words are built by adding suffixes to a base stem, and those suffixes change the word’s form. A simple exact‑match approach would fail for most real‑world sentences.

As described in the previous post, if we only compare each word in a sentence against the lemma, none of the lemma’s derived forms are found. A user who looks up the verb “pie” would not see the forms pieya, piemi, piela, piema, or piemiya in related sentences.

This is common to many African languages, but there are no off‑the‑shelf Dagbanli stemmers or morphological analysers. We had to build our own pragmatic solution.

3. The Three‑Tier Matching Strategy

Our matching script (build-combined-examples-from-csv.mjs) tries three strategies in order, from most precise to most flexible.

Tier 1: Exact match (cheapest, highest confidence).

We normalise each token by:

  • Converting to lowercase.
  • Stripping punctuation.
  • Preserving special characters (ɛ, ɔ, ŋ, ɣ, ʒ). We never collapse them.
  • Applying Unicode normalisation (NFC) to handle combining characters.

If the normalised token equals the normalised lemma of a lexeme, we record an exact match.

Example: Sentence contains “pie”. Lexeme lemma is “pie”. ✅ match.

Tier 2: Suffix stripping (hand‑curated)

If an exact match fails, we try to remove common Dagbanli suffixes. The list of suffixes was hand‑curated from linguistic descriptions and our own manual review:

  • Nominal suffixes: -li, -bu, -gu, -nima (plural), -tali
  • Verbal suffixes: -ya, -ma, -mi, -la
  • Adjectival/adverbial suffixes: -a, -lim, -m

We remove the suffix from the end of the token and check if the remaining stem matches a lemma. Example: “piema” –> strip -ma –> pie –> ✅ match found.
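
A minimal sketch of this tier, assuming the token is already normalised and the lemmas are held in a Set. The function name matchBySuffix and the choice to try longer suffixes first (so that -nima is tested before -ma and -a) are assumptions for illustration. The post does not say in which order the real script tries them.

```javascript
// Suffixes listed above. Sorting longest-first is an assumption of this sketch.
const SUFFIXES = [
  "li", "bu", "gu", "nima", "tali", // nominal
  "ya", "ma", "mi", "la",           // verbal
  "a", "lim", "m",                  // adjectival/adverbial
].sort((a, b) => b.length - a.length);

// Strip a single suffix and return the first stem that is a known lemma.
function matchBySuffix(token, lemmas) {
  for (const suffix of SUFFIXES) {
    if (token.length > suffix.length && token.endsWith(suffix)) {
      const stem = token.slice(0, -suffix.length);
      if (lemmas.has(stem)) {
        return { lemma: stem, suffix };
      }
    }
  }
  return null; // no single suffix leaves a known lemma
}

const lemmas = new Set(["pie"]);
console.log(matchBySuffix("piema", lemmas));   // { lemma: 'pie', suffix: 'ma' }
console.log(matchBySuffix("piemiya", lemmas)); // null
```

Because only one suffix is removed, a stacked form such as piemiya falls through to the next tier.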

Tier 3: Progressive prefix (a “better than nothing” fallback)

If neither Tier 1 nor Tier 2 works, we progressively trim characters from the end of the token (keeping at least three characters) until we either find a match or run out of possibilities. Example: “piemiya”, where stripping a single listed suffix does not leave a lemma –> piemiy (no match) –> piemi (no match) –> piem (no match) –> pie (match with pie meaning “to milk” or “stand in line”).

This tier is a deliberate fallback. It catches some irregular forms, but it can also produce false positives. We accept that because the number of matches it adds is small (at most about 15% of total matches), and the benefit of surfacing genuine examples outweighs the occasional mistake.
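
The trimming loop can be sketched as follows. This is an illustration under assumptions, not the script’s actual code: the name matchByTrimming is hypothetical, and the token is split into code points so that the three‑character minimum counts letters and not UTF‑16 units.

```javascript
const MIN_STEM_LENGTH = 3; // never trim below three characters

// Drop one trailing character at a time until the remainder is a known lemma.
function matchByTrimming(token, lemmas) {
  const chars = Array.from(token); // iterate by code point
  for (let length = chars.length - 1; length >= MIN_STEM_LENGTH; length--) {
    const candidate = chars.slice(0, length).join("");
    if (lemmas.has(candidate)) {
      return candidate;
    }
  }
  return null; // ran out of candidates
}

const lemmas = new Set(["pie"]);
console.log(matchByTrimming("piemiya", lemmas)); // tries piemiy, piemi, piem, then finds pie
console.log(matchByTrimming("paɣa", lemmas));    // null
```

The longest surviving prefix wins, which is also where false positives come from: any unrelated word that happens to begin with a lemma will match.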

Tracking match methods: Each match is recorded with a match_method field (“exact”, “suffix”, or “prefix”). When we run the script, it prints a breakdown that helps us tune the stemmer. For the current dataset, the typical distribution is:

  • ~65% exact matches
  • ~20% suffix matches
  • ~15% prefix matches

This distribution means that about 35% of matches (the suffix and prefix tiers combined) would be lost without the fallback tiers.
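
A breakdown like this can be computed from the match_method field alone. The sketch below is a hypothetical helper (summariseMatchMethods is not a name from our script), run on 20 made‑up records chosen to reproduce the proportions above.

```javascript
// Count matches per method and express each count as a rounded percentage.
function summariseMatchMethods(matches) {
  const counts = { exact: 0, suffix: 0, prefix: 0 };
  for (const match of matches) {
    if (match.match_method in counts) {
      counts[match.match_method] += 1;
    }
  }
  const total = counts.exact + counts.suffix + counts.prefix;
  const percent = {};
  for (const [method, count] of Object.entries(counts)) {
    percent[method] = total === 0 ? 0 : Math.round((count * 100) / total);
  }
  return { total, counts, percent };
}

// Made-up sample: 13 exact, 4 suffix, 3 prefix.
const sample = [
  ...Array.from({ length: 13 }, () => ({ match_method: "exact" })),
  ...Array.from({ length: 4 }, () => ({ match_method: "suffix" })),
  ...Array.from({ length: 3 }, () => ({ match_method: "prefix" })),
];

const summary = summariseMatchMethods(sample);
console.log(summary.percent); // { exact: 65, suffix: 20, prefix: 15 }
console.log(summary.percent.suffix + summary.percent.prefix); // 35
```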

4. Normalisation: Respecting the Alphabet

One of the most important decisions we made from the beginning was not to collapse special characters. Many NLP tools for African languages “helpfully” replace ɛ with e, ɔ with o, ŋ with n, etc. In Dagbanli, that would destroy meaning. For example, paɣa (woman) and paga (place name) are completely different words. Our normalisation does the following:

  • Lowercase the token.
  • Strip punctuation (commas, full stops, question marks, etc.).
  • Keep all Dagbanli special characters exactly as they appear.
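  • Apply Unicode normalisation form NFC, so that a letter typed as a base character plus a combining mark compares equal to the same letter typed as a single precomposed character. NFC never turns ɣ into g. (Dagbanli tone marking is not yet fully represented in the dictionary, so we rely on audio for that.)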

This is a simple, safe normalisation that never invents new characters. It ensures that “Kurugu” and “kurugu” match, but “paɣa” and “paga” do not.
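
A minimal sketch of this normalisation, using the built‑in String.prototype.normalize and a Unicode property escape for punctuation. The function name normaliseToken is illustrative, and the real script may differ in detail.

```javascript
// Normalise a token without collapsing any Dagbanli letters.
function normaliseToken(token) {
  return token
    .normalize("NFC")        // compose base letter + combining mark where possible
    .toLowerCase()           // Unicode-aware: Ɣ becomes ɣ, Ŋ becomes ŋ, and so on
    .replace(/\p{P}/gu, ""); // strip punctuation only; letters and marks are kept
}

console.log(normaliseToken("Kurugu,") === normaliseToken("kurugu")); // true
console.log(normaliseToken("paɣa") === normaliseToken("paga"));      // false
```

One caveat of this sketch: \p{P} also removes hyphens and apostrophes inside a word, so a stricter pattern may be needed if those are meaningful in the data.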

5. The Link Rot Problem and Our Archiving Script

The UG HCI Lab dataset included image URLs that pointed to external websites. At first, we stored those URLs directly in our JSON. But soon we noticed a serious issue: many images were disappearing from the original sources. Every week, some URLs would return 404 errors. We lost nearly 90 of them within a few months.

We could not rely on external hosts. We needed our own copy. We wrote a dedicated script, scripts/archive-ug-images.mjs, that:

  1. Reads the current cv-ug-combined-tokenized.json (the file with all matched examples).
  2. Collects every external image_url.
  3. Downloads each image and uploads it to our R2 bucket under the dict/ug-visual/ prefix.
  4. Replaces the original URL with a new permanent URL pointing to our R2 bucket.
  5. Saves the updated JSON and uploads it back to R2.

The script uses a checkpoint system, so it can be stopped and resumed without re‑downloading already archived images. It runs as part of our maintenance pipeline, ensuring that the dictionary always serves images from our own infrastructure, safe from link rot.

After this change, every image in the visual context section is served from dict/ug-visual/, under a name derived from a hash of the original URL. Even if the external source vanishes, the dictionary retains the image.
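
The checkpoint and URL rewriting logic can be sketched roughly as below. Several things here are assumptions: the FNV‑1a hash is a small stand‑in (the post does not say which hash the script uses, and a cryptographic hash such as SHA‑256 would be a more collision‑safe choice), PUBLIC_BASE is a placeholder domain, and planArchive is a hypothetical helper. The download and upload steps are left out.

```javascript
const R2_PREFIX = "dict/ug-visual/";
const PUBLIC_BASE = "https://example.org/"; // placeholder, not our real domain

// 32-bit FNV-1a over the UTF-8 bytes of the URL (stand-in hash for this sketch).
function fnv1aHex(text) {
  let hash = 0x811c9dc5;
  for (const byte of new TextEncoder().encode(text)) {
    hash ^= byte;
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Derive a stable object key from the original URL.
function archiveKeyFor(url) {
  const match = /\.(jpe?g|png|webp|gif)(?:$|[?#])/i.exec(url);
  const extension = match ? match[1].toLowerCase() : "jpg";
  return `${R2_PREFIX}${fnv1aHex(url)}.${extension}`;
}

// Rewrite URLs whose key is already in the checkpoint; queue the rest once each.
function planArchive(examples, checkpoint) {
  const pending = new Map();
  const rewritten = examples.map((example) => {
    const url = example.image_url;
    if (!url || url.startsWith(PUBLIC_BASE)) return example; // nothing to do
    const key = archiveKeyFor(url);
    if (checkpoint.has(key)) {
      return { ...example, image_url: PUBLIC_BASE + key };
    }
    pending.set(key, url); // still needs download and upload
    return example;        // keep the original URL until it is archived
  });
  return { rewritten, pending: [...pending].map(([key, url]) => ({ key, url })) };
}
```

A run would call planArchive, download and upload each pending item, add its key to the checkpoint, and call planArchive again. Stopping halfway loses nothing, because finished keys are skipped on the next run.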

6. Sorting, Capping, and Uploading

Once we have all matches for a lexeme, we sort them according to a priority order:

  1. Audio first: Sentences that have an associated audio file appear before text‑only examples.
  2. Source priority: Common Voice (validated) comes before Common Voice (invalidated). All UG sentences share the same priority.
  3. Match method quality: exact matches are prioritised over suffix matches, which are prioritised over prefix matches.

After sorting, we cap the number of examples per lexeme at 25. This keeps the JSON file size manageable (roughly 20 MB for all 11,000+ lexemes) and ensures that the dictionary’s UI doesn’t become overwhelming.
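
The sort order and cap can be expressed as a single comparator. This is a sketch under assumptions: the post does not say where mozilla_other ranks or how UG rows rank against Common Voice rows, so the SOURCE_RANK values for those are guesses, and selectExamples is a hypothetical name.

```javascript
// Lower rank sorts first. Ranks for mozilla_other and ug_transcribed are assumptions.
const SOURCE_RANK = { mozilla_validated: 0, mozilla_invalidated: 1, mozilla_other: 2, ug_transcribed: 0 };
const METHOD_RANK = { exact: 0, suffix: 1, prefix: 2 };
const MAX_EXAMPLES = 25;

function compareExamples(a, b) {
  // 1. Examples with audio come first.
  const audio = Number(Boolean(b.audio_url)) - Number(Boolean(a.audio_url));
  if (audio !== 0) return audio;
  // 2. Then by source priority.
  const source = (SOURCE_RANK[a.source] ?? 99) - (SOURCE_RANK[b.source] ?? 99);
  if (source !== 0) return source;
  // 3. Then exact before suffix before prefix.
  return (METHOD_RANK[a.match_method] ?? 99) - (METHOD_RANK[b.match_method] ?? 99);
}

// Sort a copy of the matches and keep at most `limit` of them.
function selectExamples(matches, limit = MAX_EXAMPLES) {
  return [...matches].sort(compareExamples).slice(0, limit);
}

// Placeholder records; the audio URLs are not real.
const matches = [
  { id: "a", audio_url: "", source: "mozilla_validated", match_method: "exact" },
  { id: "b", audio_url: "https://example.org/b.mp3", source: "mozilla_invalidated", match_method: "exact" },
  { id: "c", audio_url: "https://example.org/c.mp3", source: "mozilla_validated", match_method: "prefix" },
  { id: "d", audio_url: "https://example.org/d.mp3", source: "mozilla_validated", match_method: "suffix" },
];

console.log(selectExamples(matches).map((m) => m.id)); // [ 'd', 'c', 'b', 'a' ]
```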

The result is written to cv-ug-combined-tokenized.json and uploaded to our R2 bucket (dict/gbilli/). From there, the dictionary app syncs it to users’ devices, exactly as described in the offline‑first post.

7. Why Not Use an Existing NLP Library?

We asked ourselves the same question. Mainstream NLP toolkits (NLTK, spaCy, Stanza) ship with tokenisers and stemmers for English, French, Spanish, etc., but they are not trained on small languages, and none of them supports Dagbanli. As far as we could find, there is no Dagbanli stemmer, tokeniser, or part‑of‑speech tagger in any open‑source library.

Building our own suffix list and fallback algorithm was the only practical option. We are not trained linguists, but we approached the problem with the knowledge of native speakers and iterated on the suffix list by manually reviewing mismatches. The result is not perfect, but it works well enough to surface thousands of useful examples. Over time, we hope to replace our heuristic stemmer with a proper morphological analyser, perhaps built in collaboration with linguists, master’s students, or NLP experts. We are open to collaboration.

8. The Update Challenge (Similar to Common Voice)

Just like Mozilla Common Voice, the UG HCI Lab dataset does not have a live streaming API. New recordings are made periodically, but they are distributed as a new versioned snapshot (for example, a new CSV file and audio bundle). During a weekly lab showcase, we presented our project and exchanged feedback. We also learned that the HCI Lab plans another recording exercise, which is excellent news for the dictionary. However, to incorporate the new data, we would need to:

  • Download the new version.
  • Re‑run the matching script.
  • Re‑archive any new images to R2.
  • Rebuild the JSON and upload it.

This is a manual process. We aim to set up a scheduled cron job that checks for updates on the lab’s dataset portal (when available) and triggers a rebuild. But it is not as seamless as a live API. We hope that in the future, both Common Voice and the HCI Lab will offer live or incremental update mechanisms.

9. Visual Context in the Dictionary

The UG HCI Lab dataset’s images are displayed in a separate “Visual Context” section of the word detail card. When a sentence from this dataset matches a lexeme, the dictionary shows the image (now served from our R2 bucket) alongside the sentence text and an audio play button.

Kafuni entry in Dagbanli Dictionary showing visual context (full image)

For a word like “kafuni”, you will see:

  • A photograph of a hand fan.
  • The sentence “…ka sala mini kafuni ʒɛ…” (…charcoal and fan nearby…), with the matched word highlighted.
  • A speaker icon to hear the sentence spoken.

This adds a powerful dimension to the dictionary. Users don’t just read a definition; they see a real scene and hear a native speaker describe it. And because we archived the images, that scene will never disappear due to link rot.

Conclusion

Matching sentences to lexemes in an agglutinative language with no existing NLP tools required building a matching engine from scratch. Our three‑tier approach (exact, suffix stripping, progressive prefix) is pragmatic. It is not perfect, but it surfaces thousands of relevant examples that an exact‑match approach would have lost.

The HCI Lab dataset adds a unique visual context layer. Unlike Wikidata images that often show a single object in isolation (for example, a picture of a dog for the word “dog”), these images show a full scene: a person, an environment, or an event being described. Users see real‑world context, read the accompanying Dagbanli sentence, and hear a native speaker. This combination of audio and visual context is rare for an under‑resourced language. By archiving the images to R2, we have protected them from link rot. The update process remains a challenge, but one we manage through periodic snapshots.

Together with Common Voice, the HCI Lab dataset makes the dictionary a powerful multimedia language resource. In the next post, we will share the challenges that surprised us along the way, including Cloudflare Worker CPU limits.
