Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreidmz6khn2wq3aijb64jvikrchj5a3m7l4mitosr5p2xtpshuqc2fy",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mica7k2jrep2"
  },
  "path": "/t/dealing-with-escape-characters-when-using-models-built-on-the-bert-architecture/174792#post_2",
  "publishedAt": "2026-03-30T11:48:50.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "GitHub",
    "Hugging Face"
  ],
  "textContent": "Results from actual testing with the (now) latest version of Transformers v5:\n\n* * *\n\nUsually, **no**. For a standard BERT-family model, you generally should **not remove actual newline characters like`\\n` by default**. In the original BERT tokenizer, tab, newline, and carriage return are treated as whitespace, and the text-cleaning step converts whitespace-like characters into ordinary spaces before tokenization. Hugging Face’s `BertNormalizer` describes the same behavior: BERT-style normalization cleans text and standardizes whitespace before subword tokenization. (GitHub)\n\n## The main idea\n\nBERT does not read raw text “as-is.” It reads the output of a tokenizer pipeline. In that pipeline, **normalization and pre-tokenization happen before WordPiece tokenization** , so whitespace handling is already part of how the model was designed to see text. Hugging Face’s tokenizer docs are explicit that pre-tokenization creates an **upper bound** on what final tokens can be. (Hugging Face)\n\nThat leads to the core rule:\n\n  * **Real newlines** are usually safe to keep.\n  * **Literal escaped strings** like `\\\\n` are a different case and may need cleanup.\n  * **Deleting boundaries** can be harmful if it merges words together.\n\n\n\nA practical way to think about it is that the tokenizer is part of the model contract, not a disposable preprocessing detail.\n\n## The first distinction you need to make\n\nThere are two different things people often call “escape characters”:\n\n### 1. A real newline character\n\nThis is an actual line break in the text.\n\nExample:\n\n\n    This movie was not\n    good\n\n\n### 2. A literal backslash-plus-letter sequence\n\nThis is visible text that contains `\\` and `n`.\n\nExample:\n\n\n    This movie was not \\n good\n\n\nFor BERT, these are **not the same**. A real newline is treated like whitespace. A literal `\\` is not whitespace. 
In the original BERT code, whitespace characters are handled by `_is_whitespace`, while non-letter and non-number ASCII symbols are treated separately as punctuation-like characters. (GitHub)\n\nSo the answer depends on **which one your data actually contains**.\n\n## Why removing real `\\n` usually does not help\n\nBecause BERT already normalizes it.\n\nThe original BERT preprocessing logic treats `\\n`, `\\t`, and `\\r` as whitespace, and Hugging Face’s BERT normalizer says the same thing in current docs: BERT-style text normalization includes cleaning text and standardizing whitespace before subword splitting. (GitHub)\n\nThat means if your text contains normal line breaks, manually deleting them often adds little or nothing. In many cases, the tokenizer will effectively turn them into ordinary spacing anyway. If the line break was just formatting, leaving it alone is usually fine.\n\n## When removing them can make things worse\n\n### If you delete a boundary and merge words\n\nThis is the real danger.\n\nIf preprocessing turns something like:\n\n\n    not\n    good\n\n\ninto:\n\n\n    notgood\n\n\nthen the model no longer sees two ordinary words separated by whitespace. It sees a merged string, and WordPiece can split that merged string into very different subword units. Since BERT is a WordPiece model, that changes the input token IDs the model actually receives. The BERT paper and model docs both describe BERT as using WordPiece tokenization. (GitHub)\n\nThis is the important point: **the problem is usually not the existence of the newline itself. The problem is deleting it in a way that destroys a boundary.**\n\n### If layout carries meaning\n\nSometimes line breaks are not just visual formatting. They can carry weak structure:\n\n  * list items\n  * addresses\n  * clinical notes\n  * dialogue turns\n  * legal clauses\n  * titles versus body text\n\n\n\nIf you flatten everything aggressively, you may remove useful structure. 
The tokenizer pipeline docs matter here because pre-tokenization defines the pieces within which later tokenization happens. (Hugging Face)\n\n### If your task depends on alignment\n\nFor token classification, NER, or extractive QA, preprocessing can create alignment problems. Hugging Face’s token-classification docs explicitly show that you need to realign labels to tokens with `word_ids()`, and fast tokenizers keep offset mappings to track character spans from the original text. If you rewrite the text carelessly before tokenization, labels and spans can drift. (Hugging Face)\n\n## When you should clean more aggressively\n\nYou should clean more aggressively when the text contains **artifacts**, not real content.\n\nTypical examples:\n\n  * broken JSON export\n  * scraped text with literal `\\\\n`\n  * malformed control characters\n  * OCR junk\n  * text where visible backslashes are accidental formatting residue\n\n\n\nIf your data literally contains `\\\\n` because of serialization, then replacing it with a space or a real newline often makes sense. If you leave it untouched, BERT will not interpret it as layout. It will treat the backslash and following characters as ordinary visible text. (GitHub)\n\nBut if `\\\\n` is meaningful domain content, such as in code, regexes, logs, or markup, then converting or deleting it would be wrong.\n\n## The safest practical default\n\nFor a first BERT-based research project, this is the best default policy:\n\n### Keep by default\n\n  * ordinary words and punctuation\n  * real newlines, tabs, and carriage returns\n  * natural document structure\n\n\n\n### Clean carefully\n\n  * malformed encodings\n  * invisible junk control characters\n  * literal `\\\\n`, `\\\\t`, `\\\\r` only when they are accidental artifacts\n\n\n\n### Preserve explicitly if structure matters\n\nIf line boundaries are important for your task, do not assume default whitespace normalization is enough. 
Represent them deliberately with a marker such as `[NL]` or `<NEWLINE>` and use that consistently in training and inference. The marker should be registered with the tokenizer as an added token; otherwise the BERT tokenizer will split it into punctuation and subword pieces. That follows directly from how tokenizer pipelines define boundaries before final tokenization. (Hugging Face)\n\n## What this means by task\n\n### Text classification\n\nFor sentiment, topic classification, or similar sentence/document classification, the safest starting point is **minimal cleanup**. Real `\\n` usually does not need special handling. (GitHub)\n\n### Token classification and NER\n\nBe more conservative. Alignment matters, and label mapping can break if you rewrite text too aggressively. (Hugging Face)\n\n### Extractive QA\n\nAlso be conservative. Span boundaries matter, so avoid unnecessary text rewriting. Offset-aware tokenization exists for a reason. (Hugging Face)\n\n### Code-like or log-like text\n\nDo not assume `\\\\n` is noise. In that setting, it may be meaningful content.\n\n## The short recommendation\n\nFor your case, the best default is:\n\n  * **Do not strip actual `\\n` by default.**\n  * **Do clean literal `\\\\n` only when it is clearly a serialization artifact.**\n  * **Avoid any preprocessing that can merge adjacent words.**\n  * **Keep preprocessing identical across training, validation, test, and inference.** (GitHub)\n\n\n\n## Final answer\n\nSo the answer is:\n\n**Removing real newline characters is usually unnecessary and can sometimes be harmful if it destroys useful boundaries or structure.**\n**Cleaning literal escaped artifacts like `\\\\n` is often correct, but only when they are accidental export noise rather than meaningful text.**\n\nThat is the safest, most technically accurate default for BERT-family models.",
  "title": "Dealing with escape characters when using models built on the BERT architecture"
}