{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreic3zpqsv3qoh3pfk3tg3ef6dbxx3ewyvp3wdwcbfaeaziaruanzau",
    "uri": "at://did:plc:4tuge3k3comfj4nfvqnwkemn/app.bsky.feed.post/3miczfp7cr372"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreibkj6g4f73mu7dzbd2nssq7hn6pcgmvwqu2tqeqskmbntggx7krte"
    },
    "mimeType": "image/jpeg",
    "size": 83004
  },
  "path": "/user/assanges/diary/408433",
  "publishedAt": "2026-03-30T08:29:49.000Z",
  "site": "https://www.openstreetmap.org",
  "tags": [
    "台灣華語版本",
    "Taiwanese Mandarin",
    "OSM wiki’s Key:phone",
    "pointed out",
    "OSM wiki’s Key:phone#Extensions",
    "libphonenumber",
    "Unexpected formatting of the TW numbers with 3/4-digit area codes",
    "公眾電信網路號碼計畫",
    "TG Group Chat",
    "@M4HCHE3ZY"
  ],
  "textContent": "> _此文本同時提供_ 台灣華語版本 _This article is also available in_ Taiwanese Mandarin\n\n* * *\n\nOpenStreetMap’s collaborative nature is both its biggest strength and a source of persistent data-quality issues. With thousands of contributors independently adding `phone` tags to shops, restaurants, clinics, and government offices, each person tends to follow their own formatting style. For Taiwan, that means a database where the same country code can show up as `+886`, `+886+`, or `+886(2)`, and a single city’s worth of phone numbers might span a dozen different conventions.\n\nThis post catalogues what we found when we scanned OSM elements across all six special municipalities and five additional counties — we are working on a normalizer to fix the issue.\n\n* * *\n\n## The Scale of the Problem\n\nAcross eleven cached regions — all six special municipalities (臺北市, 新北市, 桃園市, 臺中市, 臺南市, 高雄市) plus 苗栗縣, 新竹市, 臺東縣, 連江縣, 金門縣 — we found **49,260 tags** (`phone` or `contact:phone`) on **49,229 elements**. 
After splitting multi-value fields on semicolons, that yields **50,643 individual phone number strings** to classify.\n\nFormat class | Count | Share\n---|---|---\nE.123 space (`+886 2 1234 5678`) | 41,842 | 82.6%\nRFC 3966 dash (`+886-2-1234-5678`) | 6,655 | 13.1%\nNo separator (`+886212345678`) | 1,158 | 2.3%\nLocal format, no country code (`02-1234-5678`) | 854 | 1.7%\nCorrupt/typo country code (`+866 …`, `+886(2)…`) | 92 | 0.2%\nOther (wrong country, junk) | 42 | 0.1%\n\nRoughly **1 in 6 individual values** (17.4%) deviates from the most common contributor convention, creating inconsistency that complicates deduplication, display, and machine parsing.\n\n> Things went from bad to downright ridiculous.\n\nThe format split varies noticeably by region:\n\nRegion | Region (ZH) | Tags | Values | E.123 | RFC 3966 | Other\n---|---|---|---|---|---|---\nTPE | 臺北市 ★ | 9,804 | 10,146 | 85% | 11% | 4%\nNWT | 新北市 ★ | 13,963 | 14,395 | 85% | 11% | 4%\nTAO | 桃園市 ★ | 4,177 | 4,277 | 83% | 12% | 5%\nTXG | 臺中市 ★ | 8,065 | 8,282 | 73% | **24%** | 4%\nTNN | 臺南市 ★ | 4,246 | 4,322 | 84% | 11% | 6%\nKNN | 高雄市 ★ | 5,168 | 5,262 | 84% | 10% | 5%\nMIA | 苗栗縣 | 1,318 | 1,338 | 77% | 18% | 5%\nHSZ | 新竹市 | 1,088 | 1,094 | 82% | 12% | 6%\nTTT | 臺東縣 | 1,101 | 1,178 | 96% | 3% | 1%\nLIE | 連江縣 | 95 | 103 | 98% | 0% | 2%\nKMN | 金門縣 | 235 | 246 | 90% | 8% | 2%\n\n★ Special municipality. Taichung stands out with 24% RFC 3966 usage — roughly double the rate of any other major city — suggesting a dominant local editing pattern or tool default in that contributor community. 連江縣 and 臺東縣 have the highest E.123 consistency (98% and 96%), possibly because their smaller contributor pools converge on informal norms more easily.\n\n* * *\n\n## What “Correct” Means: Format Standards in OSM\n\nBefore diving into the issues, it’s worth clarifying what “correct” actually means in an OSM context.\n\nThe OSM wiki’s Key:phone page does **not** mandate a single format. 
It documents E.123 international notation, RFC 3966 (`tel:` URI dash notation), and NANP-style formatting without expressing a clear preference between them. In practice, **E.123 space notation is the most commonly used by Taiwan contributors** — which is why we use it as the normalisation target — but RFC 3966 dash notation is a legitimate alternative that the wiki explicitly acknowledges.\n\nSo the goal of normalization isn’t strict compliance with any one standard — it’s **internal consistency**: a dataset where everything follows the same convention is much easier to work with than one that mixes three formats at random.\n\n## What Consistent Looks Like\n\nFor Taiwan, the most common contributor convention is E.123, followed by RFC 3966 / NANP (North American `+1`-style, RFC 3966-like):\n\n\n    ITU E.123\n    ----------------------------------------\n    +886 2 1234 5678    ← Taipei landline\n    +886 4 1234 5678    ← Taichung landline\n    +886 37 123 456     ← Miaoli landline (3-digit area code)\n    +886 89 123 456     ← Taitung landline (3-digit area code)\n    +886 9XX XXX XXX    ← Mobile\n    +886 800 XXX XXX    ← Toll-free (0800)\n\n\n\n    NANP\n    ----------------------------------------\n    +886-2-1234-5678    ← Taipei landline\n    +886-4-1234-5678    ← Taichung landline\n    +886-37-123-456     ← Miaoli landline (3-digit area code)\n    +886-89-123-456     ← Taitung landline (3-digit area code)\n    +886-9XX-XXX-XXX    ← Mobile\n    +886-800-XXX-XXX    ← Toll-free (0800)\n\n\nMultiple numbers separated by semicolons, no trailing semicolon:\n\n\n    ITU E.123\n    ----------------------------------------\n    +886 2 8787 8787;+886 2 8787 8765\n\n\n(or)\n\n\n    NANP\n    ----------------------------------------\n    +886-2-8787-8787;+886-2-8787-8765\n\n\nBoth are acceptable normalised formats. 
The open question for the community is agreeing on one and applying it consistently to resolve the current mixing.\n\n> Not your average daily struggle\n\n* * *\n\n## Our findings\n\n### Issue 1: Inconsistent Separators\n\nThe most common deviation is mixing hyphens and spaces. Both of these encode the same number:\n\n\n    +886 2 2181 2345     ← E.123 (space, most common in TW OSM data)\n    +886-2-2181-2345     ← RFC 3966 dash (legitimate, less common)\n\n\nThe real problem is **mixing both within a single value**, which belongs to neither convention:\n\n\n    +886 2 2873-6548     ← space after country code, dashes within\n    +886-2-28358739      ← dashes, then no grouping in subscriber number\n\n\nWe found **1,554 values** that contain both spaces and hyphens in a single phone string — the worst of both worlds, unambiguously wrong under either standard.\n\n* * *\n\n### Issue 2: Missing Country Code\n\nSome contributors enter phone numbers the way they would dial them locally — without the `+886` prefix:\n\n\n    02-2581-7780\n    02 8751 3227\n    0222346763\n    0921067050\n\n\nOSM’s `phone` tag is meant to hold an internationally dialable number. A value like `02-2581-7780` is ambiguous outside Taiwan: consumers have no way to know which country’s area-code conventions apply. We found **854 such values**, including mobile numbers entered as bare `09XXXXXXXX` strings.\n\n* * *\n\n### Issue 3: No Separator After Country Code\n\nA related variant omits any separator between the country code and the rest of the number:\n\n\n    +886288613257\n    +886228839850\n\n\nThese are syntactically valid in E.164 (the all-digits form used by telephony APIs) but fail most display validators and are unreadable as stored OSM data. 
We found **1,158 such values**.\n\n* * *\n\n### Issue 4: Corrupt or Malformed Country Codes\n\nA small but non-trivial number of entries contain clear input errors:\n\n\n    +866 2 29126883      ← mistyped country code (866 instead of 886)\n    +886+2 2311 2940     ← extra plus sign\n    +886(2)28232410      ← parenthesised area code (North American style)\n    +886.2 2322 3477     ← dot as separator\n    +8886 2 8780 6278    ← extra digit in country code\n    +00886-2-23825234    ← international dialling prefix 00 prepended\n\n\nWe found **92 such values**. These will silently fail in any phone-number parsing library that enforces ITU-T E.164 syntax.\n\n* * *\n\n### Issue 5: Duplicate Entries in Multi-Value Fields\n\nOSM supports multiple phone numbers for one element using semicolons. We found **1,320 multi-value tags** across the dataset. Of those, **24 contain duplicate entries** — the same number appearing more than once:\n\n\n    +886 2 2916 0300;+886 2 2916 0300\n    +886 89 862 326;+886 89 862 326;+886 89 862 326\n\n\nThis suggests copy-paste mistakes during editing. While harmless individually, they inflate the apparent number of contact options and can confuse automated consumers.\n\n* * *\n\n### Issue 6: Extension Numbers — a Format Wild West\n\n> You are the one accountable, Raiden!! (via @M4HCHE3ZY on X (formerly Twitter))\n\nBeyond the main number itself, **635+ values** encode an extension, using at least five different conventions found in the data:\n\nConvention | Example | Count\n---|---|---\nHash `#` | `+886 2 2536 3001#8653` | 572\nTilde `~` | `+886 2 2368 0031~2` | 26\n`ext.` / `ext` | `+886 2 2741 5991 ext.21` | 30\nChinese `分機` | `+886 4 2528 5394分機6000` | 7\nComma `,` (iOS) | `+886 2 2938 2300,630` | ~1+\n\n### Detecting extensions is tractable\n\nAs community members pointed out, a simple rule works: any character after the leading `+` that is not a digit, space, or hyphen (`[^\\s\\d-]`) can be treated as the start of the extension suffix. 
This is essentially what our normalizer does — split at the first such character, normalize the base number, then reattach the suffix verbatim.\n\n### Encoding extensions is where it breaks down\n\nThe OSM wiki’s Key:phone#Extensions page currently documents _three_ different conventions without picking one, which is itself a signal of how unresolved this is.\n\n**E.123** specifies `ext` as the separator. It was standardised in the printed-directory era — `ext 8653` is readable on a business card, but apps do not reliably parse it. There is no DTMF interpretation; the extension string is purely informational.\n\n**Apple iOS** (and macOS Contacts) stores extensions using a comma `,` as a pause-and-dial separator: `+886-2-2938-2300,630`. The comma signals the dialler to wait for the call to connect, then send the remaining digits as DTMF tones — so `630` is dialled automatically after the main number picks up. This is practical on-device behaviour, but it creates two distinct problems in OSM data:\n\n  1. **Ambiguity with multi-value separators.** OSM uses `;` to separate multiple phone numbers in a single tag. Comma has no such defined role in OSM, so an iOS-style value like `+886 2 2938 2300,630` is likely to be misread as a single malformed number rather than a number-plus-extension. We found 16 values with commas in the dataset; most are multi-value numbers incorrectly separated by `,` instead of `;`, but at least one appears to be a genuine iOS-exported extension.\n  2. **Non-portability.** A comma-encoded extension is only meaningful to a DTMF-capable dialler. It conveys no human-readable information and is invisible to any parser that does not understand the pause-dial convention.\n\n\n\n**libphonenumber** detects extensions across many separators (`#`, `ext`, `x`, `,`, etc.) but emits no canonical output format for the extension part, leaving it to the caller.\n\n**RFC 3966** (`tel:` URI) is the most formally specified option — it uses `;ext=NNN`. 
But RFC 3966 extensions create a structural conflict with OSM’s data model that is worth spelling out in full.\n\n### The RFC 3966 semicolon conflict\n\nOSM uses the semicolon `;` as the multi-value separator for `phone` tags:\n\n\n    +886 2 1234 5678;+886 2 8765 4321    ← two phone numbers, standard OSM\n\n\nRFC 3966’s extension syntax also uses a semicolon as a parameter delimiter:\n\n\n    tel:+886-2-1234-5678;ext=8653        ← RFC 3966 with extension\n\n\nIf a contributor stores this in an OSM tag, any OSM editor or data consumer that naively splits on `;` will interpret it as two values: `tel:+886-2-1234-5678` and `ext=8653`. The extension becomes a phantom second phone number — one that is not a phone number at all.\n\nThe obvious workaround is to escape the semicolon as `\\;`, a convention some OSM tags use for literal semicolons inside values. But this creates its own problems:\n\n  * **OSM editors** do not consistently honour `\\;` escaping; many will still split on it or display it literally.\n  * **RFC 3966 parsers** expect a raw `;` as the parameter delimiter — a backslash-escaped `\\;ext=8653` is not valid RFC 3966 and will not be parsed correctly by any compliant `tel:` URI parser.\n  * **Machine readability** is not improved: a consumer now needs to know both OSM’s backslash-escaping convention _and_ RFC 3966’s parameter syntax, and reconcile the two. It adds encoding complexity without giving any parser a clean path to the extension digits.\n\n\n\nThe backslash escape is a leaky workaround that satisfies neither standard fully. It is, in effect, a third encoding layered on top of two already-conflicting ones.\n\nThe result is that **RFC 3966 extension notation is structurally incompatible with OSM’s semicolon-as-multi-value convention** , with no clean resolution available today. 
For this reason, our normalizer preserves extension suffixes as-is rather than attempting to rewrite them into any standard form.\n\n* * *\n\n## A Note on E.123 and Machine Readability\n\nHere’s something worth keeping in mind: even a _perfectly_ normalised E.123 phone tag isn’t as machine-friendly as it looks.\n\nE.123 was standardised by the ITU-T in 1988 — when the primary medium for a phone number was a business card, letterhead, or printed directory. The spaces in `+886 2 1234 5678` are visual grouping aids for human readers, not semantic tokens. A parser encountering that string has to strip the spaces, infer the country code, and figure out the area code boundary — all heuristically.\n\nRFC 3966’s `tel:+886-2-1234-5678` is marginally more structured (hyphens as explicit separators, a URI scheme that signals “this is a phone number”), but still requires a real parser to interpret the digit groups. The truly machine-readable form is E.164 — `+886212345678`, all digits, no punctuation — which is what telephony APIs and databases actually want. None of these is what OSM stores by default.\n\nThis tension is fundamental: **OSM’s `phone` tag is human-oriented**. Normalization to E.123 is about making data consistent and editable by contributors, not about producing a format that apps can blindly ingest without parsing. The downstream app still needs a library like libphonenumber to do the real work — which is exactly why that library’s correctness for Taiwan’s edge-case area codes matters as much as it does.\n\n* * *\n\n## A Note on Unexpected Area Code Grouping by google/libphonenumber\n\nThis one is subtle. Google’s libphonenumber, the de facto standard library behind many phone-number parsers, groups some Taiwanese area codes differently than how they appear in local usage.\n\nTaiwan’s NCC assigns **3-digit and 4-digit area codes** to several regions. 
libphonenumber’s metadata appears to represent these as extensions of their 2-digit neighbours, producing a different grouping than what locals would recognise:\n\nDialled | libphonenumber output | Expected output (E.123)\n---|---|---\n`037-123-456` | `+886 3 7123 456` | `+886 37 123 456`\n`049-123-4567` | `+886 4 9123 4567` | `+886 49 123 4567`\n`082-123-456` | `+886 8 2123 456` | `+886 82 123 456`\n`0826-12345` | `+886 8 26123 45` | `+886 826 12345`\n`0836-12345` | `+886 8 36123 45` | `+886 836 12345`\n`089-123-456` | `+886 8 9123 456` | `+886 89 123 456`\n\nAffected regions: **Miaoli (037)**, **Nantou (049)**, **Kinmen (082)**, **Wuqiu (0826)**, **Matsu/Lienchiang (0836)**, and **Taitung (089)**.\n\nThis means that even phone numbers already stored in `+886 X XXXX XXXX` form may carry a different digit grouping if they were entered via a tool backed by libphonenumber. The grouping we use here follows the National Numbering Plan and official government contact listings — though it’s worth noting this may be an intentional design choice in libphonenumber’s metadata.\n\n* * *\n\nSee also:\n\n  * Issue Tracker: Unexpected formatting of the TW numbers with 3/4-digit area codes\n  * 公眾電信網路號碼計畫 (Public Telecommunication Network Numbering Plan, Chinese only) [PDF]\n  * TG Group Chat",
  "title": "Phone Numbers Data for Taiwan in OSM — Opening a Can of Worms"
}