Working with file names sounds simple — until you hit the invisible trap of Unicode normalization.
I ran into this problem while comparing strings and escaping file names between UTF-8 and Windows ANSI.
Everything looked fine — except when it wasn’t.
The issue? German umlauts like ä, ö, and ü — visually identical, but encoded differently under the hood.
The Problem
On Windows, file names with umlauts are typically stored in NFC (Normalization Form C).
That means a character like ä (U+00E4) is represented as a single, precomposed code point.
On macOS, however, file names are often stored in NFD (Normalization Form D).
In this form, ä is decomposed into two code points:
- a (U+0061)
- ¨ (U+0308, combining diaeresis)
So while both look the same, "Müller.txt" may differ byte-for-byte between macOS and Windows.
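A minimal sketch using Python's standard unicodedata module makes the difference visible (the file name is just an example):

```python
import unicodedata

name = "Müller.txt"
nfc = unicodedata.normalize("NFC", name)   # Windows-style, precomposed ü (U+00FC)
nfd = unicodedata.normalize("NFD", name)   # macOS-style, u (U+0075) + U+0308

print([f"U+{ord(c):04X}" for c in nfc])
print([f"U+{ord(c):04X}" for c in nfd])
print(nfc.encode("utf-8") == nfd.encode("utf-8"))  # False: not byte-for-byte equal
```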
Umlaut Encodings: NFC vs. NFD
| Character | NFC (precomposed) | Hex Codepoints | NFD (decomposed) | Hex Codepoints |
|---|---|---|---|---|
| ä | ä (U+00E4) | 00E4 | a + ¨ | 0061 0308 |
| ö | ö (U+00F6) | 00F6 | o + ¨ | 006F 0308 |
| ü | ü (U+00FC) | 00FC | u + ¨ | 0075 0308 |
| Ä | Ä (U+00C4) | 00C4 | A + ¨ | 0041 0308 |
| Ö | Ö (U+00D6) | 00D6 | O + ¨ | 004F 0308 |
| Ü | Ü (U+00DC) | 00DC | U + ¨ | 0055 0308 |
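If you want to reproduce this table yourself, a small sketch like the following prints the code points of both forms (the output format is my own choice):

```python
import unicodedata

for ch in "äöüÄÖÜ":
    nfc = unicodedata.normalize("NFC", ch)
    nfd = unicodedata.normalize("NFD", ch)
    print(ch,
          " ".join(f"U+{ord(c):04X}" for c in nfc),
          "->",
          " ".join(f"U+{ord(c):04X}" for c in nfd))
```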
Why It Matters
This mismatch causes subtle but nasty bugs (see the sketch after this list) when:
- Comparing file names across platforms
- Escaping text between encodings (UTF-8 ↔ Windows-1252/ANSI)
- Performing hash-based equality checks
- Feeding text into AI or NLP pipelines
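Here is a minimal sketch of the first three failure modes, using only the Python standard library (the file name is made up):

```python
import hashlib
import unicodedata

nfc_name = unicodedata.normalize("NFC", "Müller.txt")  # Windows-style form
nfd_name = unicodedata.normalize("NFD", "Müller.txt")  # macOS-style form

# Comparing file names across platforms: they look identical but compare unequal
print(nfc_name == nfd_name)  # False

# Converting to Windows-1252/ANSI: the precomposed form round-trips,
# but U+0308 has no cp1252 mapping, so the decomposed form raises an error
print(nfc_name.encode("cp1252"))  # b'M\xfcller.txt'
try:
    nfd_name.encode("cp1252")
except UnicodeEncodeError as err:
    print("NFD form cannot be encoded:", err)

# Hash-based equality checks diverge as well
print(hashlib.sha256(nfc_name.encode("utf-8")).hexdigest())
print(hashlib.sha256(nfd_name.encode("utf-8")).hexdigest())
```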
The AI / NLP Connection: Vocabulary Mismatch
Most modern language models (BPE, WordPiece, SentencePiece, etc.) tokenize text based on learned vocabulary.
If the training data used NFC text, but your runtime input is NFD, you’ll hit an invisible mismatch.
For instance:
- The model’s vocabulary contains Müller (U+00FC).
- Your macOS text input has Müller (U+0075 + U+0308).
- The tokenizer fails to match and splits the word incorrectly.
This leads to degraded model performance and inconsistent embeddings.
🔍 Tokenization Demo (Python)
Let’s demonstrate this using Hugging Face’s tokenizer from a multilingual model such as bert-base-multilingual-cased.
```python
from transformers import AutoTokenizer
import unicodedata

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

word_nfc = "Müller"                                # composed ü (U+00FC)
word_nfd = unicodedata.normalize("NFD", word_nfc)  # decomposed u + U+0308

print("NFC Tokens:", tokenizer.tokenize(word_nfc))
print("NFD Tokens:", tokenizer.tokenize(word_nfd))
```

Possible output:

```
NFC Tokens: ['Müller']
NFD Tokens: ['Mu', '̈', 'ller']
```
Notice how the NFD version breaks into three tokens: the combining diaeresis ends up as a standalone piece that the model rarely, if ever, saw during training.
That can completely change how embeddings or predictions behave.
The Fix: Normalize Before Use
Normalize everything that enters your pipeline — filenames, user input, and especially text data for NLP.
Use NFC unless you have a very specific reason not to.
Python Example
```python
import unicodedata

def normalize_filename(filename: str) -> str:
    # Return the name in NFC so comparisons behave consistently across platforms
    return unicodedata.normalize("NFC", filename)

# Example: a macOS-style name, 'u' (U+0075) followed by combining diaeresis (U+0308)
macos_name = "Mu\u0308ller.txt"
normalized = normalize_filename(macos_name)

print("Original:", [hex(ord(c)) for c in macos_name])
print("Normalized:", [hex(ord(c)) for c in normalized])
```

Output:

```
Original: ['0x4d', '0x75', '0x308', '0x6c', '0x6c', '0x65', '0x72', '0x2e', '0x74', '0x78', '0x74']
Normalized: ['0x4d', '0xfc', '0x6c', '0x6c', '0x65', '0x72', '0x2e', '0x74', '0x78', '0x74']
```
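The same normalization step also repairs the tokenizer mismatch from the demo above. A sketch, assuming the same model as before (the exact tokens depend on its vocabulary):

```python
import unicodedata
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

word_nfd = unicodedata.normalize("NFD", "Müller")  # decomposed input, e.g. from macOS
word_nfc = unicodedata.normalize("NFC", word_nfd)  # normalize before tokenizing

print("Raw NFD:       ", tokenizer.tokenize(word_nfd))
print("Normalized NFC:", tokenizer.tokenize(word_nfc))
```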
Java Example
```java
import java.text.Normalizer;

public class UmlautNormalizer {

    public static String normalizeFilename(String filename) {
        return Normalizer.normalize(filename, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String macosName = "Mu\u0308ller.txt"; // 'u' + combining diaeresis
        String normalized = normalizeFilename(macosName);
        System.out.println("Original: " + macosName);
        System.out.println("Normalized: " + normalized);
    }
}
```
Takeaways
- macOS file systems and APIs often produce NFD, while Windows typically uses NFC.
- Normalize to NFC for consistent comparisons, storage, and AI training.
- Always clean your text before tokenization — invisible Unicode differences can derail your results.
- Normalization saves hours of debugging “ghost bugs” where "ä" ≠ "ä".
Final Thoughts
Unicode normalization sounds esoteric, but it’s critical for real-world engineering — whether you’re synchronizing file names or training a multilingual language model.
If your data or users include German umlauts (or any accented characters), make NFC normalization your default preprocessing step.
You’ll sleep better — and so will your tokenizer.