How Do Search Engines Handle Case-Insensitive Search?


What Does "Same Character" Mean?

When building a search engine, you quickly realize that the concept of "same character" is far from simple.

Should a search for `cafe` also find documents containing `café`? In most cases, yes. Should a search for `ＡＢＣ` (full-width characters) also return documents containing `ABC` (half-width characters)? Absolutely. What about searching for `①` and finding documents with `1`? Or searching for `ﬁ` (the fi ligature) and finding documents with `fi`?

Full-width characters are alphanumeric characters and symbols stretched to occupy the same width as CJK characters like Hangul or Kanji. Half-width characters are the regular alphanumeric characters we use every day (`ABC`, `123`). In Japanese and Korean input environments, typing English in full-width mode produces characters like `Ａ` (U+FF21), which looks nearly identical to `A` (U+0041) but is an entirely different Unicode code point.

To answer these questions, you need to understand Unicode Normalization. This article explains the four normalization forms (NFC, NFD, NFKC, NFKD), why NFKC is a reasonable choice for search engines, and how to supplement NFKC where it falls short.


Unicode Basics

(1) Code Points

Unicode assigns a unique number to every character in the world. These numbers are called code points, written with a `U+` prefix followed by hexadecimal digits.

| Code Point | Character |
|---|---|
| U+0041 | A |
| U+AC00 | 가 |
| U+00E9 | é |
| U+1F600 | 😀 |

So far, straightforward.

(2) Canonical Equivalence: One Character, Two Representations

The problem is that the same character can have different code points. There are two ways to represent the character `é` in Unicode.

| Representation | Code Points | Description |
|---|---|---|
| Precomposed | U+00E9 | A single code point |
| Decomposed | U+0065 + U+0301 | `e` + combining acute accent |

Both render as `é` on screen. However, at the byte level, they are completely different data. A byte-level comparison cannot determine that these two are the same.

This is why normalization is necessary.
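
A quick sketch with Python's standard `unicodedata` module shows both the problem and the fix:

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e + combining acute accent

print(precomposed == decomposed)                    # False: different code points
print(unicodedata.normalize("NFC", precomposed) ==
      unicodedata.normalize("NFC", decomposed))     # True: equal after normalization
```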

(3) Grapheme Clusters: What Users See as "One Character"

Code points and what users perceive as "one character" are not the same thing. Unicode defines the smallest unit of text that a user perceives as a single character as an extended grapheme cluster. The definition from UAX #29 is:

"An extended grapheme cluster is the same as a legacy grapheme cluster, with the addition of some other characters. ... A legacy grapheme cluster is a base (such as A or カ) followed by zero or more continuing characters"

While a single code point usually corresponds to one character, multiple code points frequently compose a single character.

| Visible Character | Code Points | Description |
|---|---|---|
| A | U+0041 | 1 code point = 1 character |
| 가 | U+AC00 | 1 code point = 1 character |
| é | U+0065 + U+0301 | 2 code points = 1 character |
| 🏳️‍🌈 | U+1F3F3 + U+FE0F + U+200D + U+1F308 | 4 code points = 1 character |

Multi-code-point characters are far more common in complex writing systems like Thai and Devanagari (Hindi).

This matters for search engines because of text truncation, highlighting, and character counting. If you split a grapheme cluster in the middle when highlighting a match in search results, you'll display broken characters. Text processing must always operate at the grapheme cluster level to be safe.
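
For illustration, here is one way to split text into grapheme clusters in Python. It assumes the third-party `regex` package, which supports the `\X` grapheme-cluster pattern (the standard library's `re` does not):

```python
import regex  # third-party package: pip install regex

flag = "\U0001F3F3\uFE0F\u200D\U0001F308"   # 🏳️‍🌈 rainbow flag (4 code points)
text = "e\u0301" + flag                      # decomposed é followed by the flag

print(len(text))                   # 6 code points
print(regex.findall(r"\X", text))  # 2 grapheme clusters (recent regex versions
                                   # treat the emoji ZWJ sequence as one cluster)
```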


Unicode Normalization

There are four normalization forms in Unicode. They can be understood as combinations of two axes:

  • Canonical vs. compatibility decomposition
  • Decomposition vs. composition

|  | Decomposed | Composed |
|---|---|---|
| Canonical | NFD | NFC |
| Compatibility | NFKD | NFKC |

NFC/NFD only handle canonical equivalence, so ligatures (`ﬁ`), full-width characters (`Ａ`), and enclosed characters (`②`) remain unchanged. NFKC/NFKD unify these compatibility characters as well.

| Input | NFC | NFD | NFKC | NFKD |
|---|---|---|---|---|
| é (U+00E9) | é (U+00E9) | e + ́ (U+0065 U+0301) | é (U+00E9) | e + ́ (U+0065 U+0301) |
| e + ́ (U+0065 U+0301) | é (U+00E9) | e + ́ (U+0065 U+0301) | é (U+00E9) | e + ́ (U+0065 U+0301) |
| Å (U+212B) | Å (U+00C5) | A + ̊ (U+0041 U+030A) | Å (U+00C5) | A + ̊ (U+0041 U+030A) |
| ﬁ (U+FB01) | ﬁ (U+FB01) | ﬁ (U+FB01) | fi | fi |
| Ａ (U+FF21) | Ａ (U+FF21) | Ａ (U+FF21) | A (U+0041) | A (U+0041) |
| ② (U+2461) | ② (U+2461) | ② (U+2461) | 2 (U+0032) | 2 (U+0032) |
| ½ (U+00BD) | ½ (U+00BD) | ½ (U+00BD) | 1⁄2 | 1⁄2 |
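
For reference, the table above can be reproduced with Python's standard `unicodedata` module:

```python
import unicodedata

samples = ["\u00e9",   # é (precomposed)
           "\u212b",   # Å (angstrom sign)
           "\ufb01",   # ﬁ (ligature)
           "\uff21",   # Ａ (full-width A)
           "\u00bd"]   # ½

for s in samples:
    forms = {f: unicodedata.normalize(f, s) for f in ("NFC", "NFD", "NFKC", "NFKD")}
    print(s, forms)
```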

(1) NFC

Performs canonical decomposition, then recomposes into the most compact canonical form.

e (U+0065) + ́ (U+0301)  →  é (U+00E9)

NFC is the default normalization form recommended by the W3C for web content.

"The W3C Character Model for the World Wide Web 1.0: Normalization and other W3C Specifications recommend using Normalization Form C for all content"

Most text data is likely already stored in NFC. NFC best preserves the meaning of the original text while unifying different canonical representations of the same character.

(2) NFD

Decomposes precomposed characters into base characters and combining characters.

é (U+00E9)  →  e (U+0065) + ́ (U+0301)

NFD is rarely used as a final form, but it's useful as an intermediate step for operations like accent stripping. After decomposing with NFD, you can remove the combining characters to leave only the base characters.
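
For example, Python's `unicodedata.combining()` identifies the combining marks that NFD leaves behind; a full accent-folding routine appears later in this article:

```python
import unicodedata

decomposed = unicodedata.normalize("NFD", "café")
print([hex(ord(c)) for c in decomposed])       # ['0x63', '0x61', '0x66', '0x65', '0x301']
print([c for c in decomposed
       if unicodedata.combining(c)])           # the combining acute accent (U+0301)
```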

(3) NFKC

Performs compatibility decomposition, then applies canonical composition. Expressed mathematically: NFKC(text) = NFC(NFKD(text)).

Compatibility decomposition considers not only canonical equivalence but also compatibility equivalence, unifying characters that are semantically similar but visually different.

ﬁ (ligature)      →  fi
Ａ (full-width A)  →  A (half-width A)
② (circled 2)     →  2

NFKC unifies characters more aggressively than NFC. It is the most commonly used normalization form in search engines.

(4) NFKD

Performs the most complete decomposition. It corresponds to the pre-composition stage of NFKC.

ﬁ (ligature)  →  fi
é (U+00E9)    →  e (U+0065) + ́ (U+0301)

NFKD applies both compatibility and canonical decomposition, so combining characters remain in their separated state.


Caveats When Applying NFKC/NFKD

NFKC/NFKD are powerful, but UAX #15 warns against blind application.

UAX #15

"Normalization Forms KC and KD must not be blindly applied to arbitrary text. Because they erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, they may remove distinctions that are important to the semantics of the text."
"It is best to think of these Normalization Forms as being like uppercase or lowercase mappings: useful in certain contexts for identifying core meanings, but also performing modifications to the text that may not always be appropriate."

The definition of compatibility equivalence itself illustrates this well.

"Compatibility equivalence is a weaker type of equivalence between characters or sequences of characters which represent the same abstract character, but which may have distinct visual appearances or behaviors."

Let's look at what transformations occur, organized by category.

Enclosed/Parenthesized Characters → Numbers/Letters

| Input | NFKC Output |
|---|---|
| ① (U+2460) | 1 |
| ⑴ (U+2474) | (1) |
| ⓐ (U+24D0) | a |

Superscripts/Subscripts → Regular Numbers/Letters

| Input | NFKC Output | Issue |
|---|---|---|
| E=mc² | E=mc2 | Distorts the meaning of physics formulas |
| H₂O | H2O | Loses chemical formula notation |
| x³ | x3 | Changes mathematical expression meaning |

Fractions → Digits + Slash

| Input | NFKC Output | Length Change |
|---|---|---|
| ½ | 1⁄2 | 1 char → 3 chars |
| ¼ | 1⁄4 | 1 char → 3 chars |
| ¾ | 3⁄4 | 1 char → 3 chars |

Since string length changes, you must be careful with position mapping to the original text when using offset-based indexing. `1½` becomes `11⁄2`, which can cause confusion.
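
A quick check of the length change with Python's `unicodedata`:

```python
import unicodedata

print(len("½"), len(unicodedata.normalize("NFKC", "½")))  # 1 3
print(unicodedata.normalize("NFKC", "1½"))                # '11⁄2' (easy to misread as 11/2)
```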

Roman Numerals → Alphabetic Letters

| Input | NFKC Output |
|---|---|
| Ⅸ (U+2168) | IX |
| Ⅻ (U+216B) | XII |
| ⅳ (U+2173) | iv |

Single-code-point Roman numerals are decomposed into multiple alphabetic characters.

Japanese Compound/Unit Characters

| Input | NFKC Output |
|---|---|
| ㌔ (U+3314) | キロ |
| ㍿ (U+337F) | 株式会社 |
| ㎞ (U+339E) | km |
| ㍻ (U+337B) | 平成 |
| ㋿ (U+32FF) | 令和 |

Japanese has a particularly large number of compatibility characters. Units, era names, and corporate designations are encoded as single characters, and NFKC expands them all.

Ligature Decomposition

| Input | NFKC Output |
|---|---|
| ﬁ (U+FB01) | fi |
| ﬆ (U+FB06) | st |
| ﬀ (U+FB00) | ff |

Ligatures are glyphs where two or more characters are joined together. In plain text, it is reasonable to treat them the same as their separated forms.

Full-Width Punctuation → Half-Width Punctuation

| Input | NFKC Output | Security Concern |
|---|---|---|
| ＇ (full-width apostrophe, U+FF07) | ' | Can bypass SQL injection filters |
| ／ (full-width slash, U+FF0F) | / | Can be used for path traversal |
| ＜ (full-width angle bracket, U+FF1C) | < | Can bypass XSS filters |

From a security perspective, normalizing full-width punctuation is important. To prevent attacks that bypass input validation using full-width characters, NFKC normalization should be applied before validation.
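
As a minimal illustration of normalizing before validating (the character blocklist here is deliberately simplistic and only for demonstration):

```python
import unicodedata

def is_suspicious(user_input: str) -> bool:
    # Normalize first so full-width look-alikes cannot slip past the check.
    normalized = unicodedata.normalize("NFKC", user_input)
    return any(ch in normalized for ch in "<>'\"/")

print(is_suspicious("＜script＞"))   # True: ＜ normalizes to <
```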


Why NFKC Is Worth Considering for Search Engines

As we've seen, NFKC is a fairly aggressive normalization. `E=mc²` becomes `E=mc2`, and `½` becomes `1⁄2`. Yet there are good reasons to consider NFKC as the normalization approach for search engines.

(1) For Some Applications, Recall Matters More Than Precision

Search quality can be evaluated by recall and precision. While some domains like universal search require both, there are search domains where recall is far more important.

  • Recall failure: The user can't find the document they want → critical
  • Precision degradation: Unintended documents appear alongside results → unfortunate but acceptable

When a user searches for `cafe` and documents containing `café` are missing, the user feels "search is broken." On the other hand, when searching for `②` and documents containing `2` also appear, users aren't particularly bothered as long as the desired results are included.

This is especially true for domains like email or messaging search, where chronological sorting is the default: as long as the desired result appears somewhere in the list, that's enough. Whether results are missing matters far more than ranking quality.

(2) The Actual Frequency of Inconvenience Is Low

Consider how often NFKC's aggressive normalization actually inconveniences users.

  • How often is it actually a problem when searching for `②` and documents containing `2` also appear?
  • How many users need to search for superscript `²` with exact distinction?

For most users, these distinctions don't matter. Inconvenience caused by special character normalization is a rare occurrence, while inconvenience from search failing due to lack of normalization is far more frequent.

(3) Maintenance Cost

Instead of NFKC, you might consider normalizing only what's needed — for example, converting full-width to half-width but not converting superscripts to regular characters. However, this approach is very difficult to maintain.

  • Writing custom Unicode mappings means adding new character mappings every time the Unicode version is updated
  • As the number of normalization option combinations grows, mapping complexity increases exponentially
  • Standard libraries already implement NFKC according to the Unicode standard

Elasticsearch, one of the most widely used search engines in practice, ships `nfkc_cf` (NFKC + Case Folding) as the default for its ICU normalization filter.

Beyond NFKC: Additional Normalization for Search Engines

NFKC solves many problems, but there are areas where NFKC alone isn't enough for search engines.

(1) Case Folding

When a user searches for `Hello`, documents containing `hello` and `HELLO` should also be found. Since NFKC does not perform case conversion, separate Case Folding is needed.

It's not just about simple alphabetic case conversion — there are language-specific considerations. For example, the German `ß` → `ss` conversion changes string length, and the Greek uppercase sigma `Σ` has two lowercase forms.
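
For example, Python's built-in `str.casefold()` implements Unicode case folding and covers both of these cases:

```python
print("HELLO".casefold())               # hello
print("ß".casefold())                   # ss (length changes from 1 to 2)
print("Σ".casefold(), "ς".casefold())   # σ σ (both sigmas fold to the same letter)
```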

(2) Accent/Diacritic Removal (Accent Folding)

To find `café` when searching for `cafe`, accents must be removed. Since NFKC preserves accents, accent removal must be handled separately.

The typical approach is:

  1. Decompose with NFD (or NFKD): `é` → `e` + `́` (U+0065 + U+0301)
  2. Remove code points in the Combining Diacritical Marks range (U+0300–U+036F): only `e` remains
  3. Recompose with NFC

In some languages, the presence or absence of an accent makes characters entirely distinct, so accent removal should be applied selectively by language or allow configuring which characters to exclude.
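
A sketch of this routine in Python; the optional `keep` set for excluding language-specific characters is an illustrative assumption, not a standard API:

```python
import unicodedata

COMBINING_RANGE = range(0x0300, 0x0370)   # Combining Diacritical Marks

def fold_accents(text: str, keep=frozenset()) -> str:
    folded = []
    for ch in text:
        if ch in keep:                    # e.g. languages where the accent is distinctive
            folded.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        folded.append("".join(c for c in decomposed if ord(c) not in COMBINING_RANGE))
    return unicodedata.normalize("NFC", "".join(folded))

print(fold_accents("café"))           # cafe
print(fold_accents("café", {"é"}))    # café
```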

(3) Katakana ↔ Hiragana Conversion

In Japanese search, `カタカナ` (katakana) and `かたかな` (hiragana) often need to be treated as equivalent. Unicode normalization (including NFKC) cannot convert between katakana and hiragana. Separate conversion logic is required.

Fortunately, for the standard kana blocks the offset between a katakana code point and its hiragana counterpart is a constant 0x60, so conversion can be done with simple arithmetic.

カ (U+30AB) - 0x60 = か (U+304B)

Note that half-width katakana → full-width katakana conversion is handled by NFKC.

ｶ (half-width, U+FF76) → カ (full-width, U+30AB)

Dakuten (voiced sound marks) handling also requires attention. NFKC composes half-width katakana + half-width dakuten combinations into full-width voiced katakana, but non-standard combinations (e.g., full-width katakana + half-width dakuten) may remain separated.
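
A sketch of the conversion in Python, applying NFKC first so that half-width katakana is widened before the 0x60 shift:

```python
import unicodedata

def katakana_to_hiragana(text: str) -> str:
    # NFKC first: half-width katakana (e.g. ｶ) becomes full-width (カ).
    text = unicodedata.normalize("NFKC", text)
    # Shift the main katakana block (ァ U+30A1 .. ヶ U+30F6) down by 0x60.
    return "".join(
        chr(ord(ch) - 0x60) if 0x30A1 <= ord(ch) <= 0x30F6 else ch
        for ch in text
    )

print(katakana_to_hiragana("カタカナ"))   # かたかな
print(katakana_to_hiragana("ｶﾀｶﾅ"))      # かたかな (half-width input)
```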


Normalization Pipeline

The order in which text normalization is applied in a search engine affects the results. Here is a pipeline that can be used in practice.

Raw text
  │
  ▼
① Accent/Diacritic removal (NFKD → remove accents → NFC)
  │
  ▼
② Apply NFKC (only if ① was skipped; NFKC = NFC(NFKD))
  │
  ▼
③ Case Folding (uppercase → lowercase)
  │
  ▼
④ Script-specific normalization (katakana → hiragana, etc.)
  │
  ▼
Normalized text

Accent removal should be done before NFKC. The reason lies in how NFKC works.

NFKC = NFC(NFKD(text)). In the NFKD step, `é` (U+00E9) is decomposed into `e` (U+0065) + `́` (U+0301), but in the NFC step, it is recomposed back into `é` (U+00E9). After NFKC is applied, accents are in their composed state, so removing accents afterward would require decomposing with NFD again.

When accent removal is needed, processing in the order NFKD decomposition → accent removal → NFC composition completes everything in a single decomposition/composition pass. Since NFKD is used, compatibility decomposition is also performed, making a separate NFKC step unnecessary.

When accent removal is not needed, simply apply NFKC in step ②.
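
A sketch of the whole pipeline in Python, under the assumptions above (the Combining Diacritical Marks range for accent stripping, and the 0x60 shift for katakana → hiragana):

```python
import unicodedata

def normalize_for_search(text: str, strip_accents: bool = True) -> str:
    if strip_accents:
        # ① NFKD → drop combining diacritics → NFC (compatibility folding included)
        decomposed = unicodedata.normalize("NFKD", text)
        text = unicodedata.normalize(
            "NFC", "".join(c for c in decomposed if not 0x0300 <= ord(c) < 0x0370))
    else:
        # ② Plain NFKC when accents must be preserved
        text = unicodedata.normalize("NFKC", text)
    # ③ Case folding
    text = text.casefold()
    # ④ Script-specific normalization: shift katakana (U+30A1–U+30F6) to hiragana
    return "".join(
        chr(ord(c) - 0x60) if 0x30A1 <= ord(c) <= 0x30F6 else c for c in text)

print(normalize_for_search("Ｃafé"))      # cafe
print(normalize_for_search("ｶﾀｶﾅ ABC"))  # かたかな abc
```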

Indexing and Query Normalization

Normalization must be applied at both indexing time and search time.

  • Indexing time: Apply normalization when storing documents in the index. Documents are stored in their normalized form.
  • Search time: Apply the same normalization to user queries. The normalized query searches the index.

The same normalization pipeline must be applied at both points. If NFKC is applied at indexing time but only NFC is applied at search time, a query entered in full-width characters won't find documents indexed with half-width characters.
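
A minimal sketch of this symmetry, using a plain dict as a stand-in for an inverted index and NFKC plus case folding as the shared pipeline:

```python
import unicodedata

def norm(text: str) -> str:
    # The same pipeline must run on both the indexing and the query side.
    return unicodedata.normalize("NFKC", text).casefold()

# Indexing time: store documents under their normalized form.
index = {norm(doc): doc for doc in ["Ｃafé au lait", "Hello World"]}

# Search time: normalize the query identically before lookup.
print(index.get(norm("café au lait")))     # 'Ｃafé au lait'
print(index.get(norm("ＨＥＬＬＯ world")))  # 'Hello World'
```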


How Elasticsearch Does It

For practical reference, let's look at the Unicode normalization tools that Elasticsearch provides.

ICU Normalization Token Filter

A token filter provided by the ICU plugin that supports the following normalization forms.

| `name` parameter | Behavior |
|---|---|
| `nfc` | NFC normalization |
| `nfkc` | NFKC normalization |
| `nfkc_cf` (default) | NFKC + Unicode Case Folding + ignorable character removal |

It's notable that the default is `nfkc_cf`. Elasticsearch chose NFKC combined with Case Folding as the default normalization for search engines.

The `unicode_set_filter` parameter can be used to exclude specific characters from normalization. For example, to prevent German `ß` from being converted to `ss`, set `"unicode_set_filter": "[^ß]"`.
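
For illustration, the analyzer settings might look like the following, expressed here as a Python dict rather than raw JSON (the filter and analyzer names are arbitrary placeholders; `icu_normalizer` and `icu_tokenizer` come from the ICU plugin):

```python
# Index settings for an ICU-normalized analyzer, as you would pass to an
# Elasticsearch index-creation request (sketch only).
settings = {
    "analysis": {
        "filter": {
            "german_icu_normalizer": {        # placeholder name
                "type": "icu_normalizer",
                "name": "nfkc_cf",            # the default form shown above
                "unicode_set_filter": "[^ß]", # leave ß untouched
            }
        },
        "analyzer": {
            "my_normalized_analyzer": {       # placeholder name
                "tokenizer": "icu_tokenizer",
                "filter": ["german_icu_normalizer"],
            }
        },
    }
}
```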

ICU Folding Token Filter

Based on UTR#30 (Character Foldings), it handles the following in one step:

  • Unicode normalization (NFKC)
  • Case folding
  • Accent/diacritic removal
  • Additional folding (width folding, kana folding, etc.)

Similarly, `unicode_set_filter` can be used to exclude specific characters from folding.

What Is UTR#30?

A technical report that defines character foldings for search engines.

It defines over 25 folding operations, including case folding, diacritic removal, width folding, kana folding, and superscript/subscript folding. Although it was withdrawn while still a Proposed Draft, Lucene/Elasticsearch based their folding implementation on it, and it serves as a de facto standard.

If you're building a search engine from scratch, referencing UTR#30's folding list is helpful for determining what normalization is needed.

Conclusion

Both precision and recall matter for search engines. However, the precision lost from NFKC's aggressive normalization is far outweighed by the recall lost from not normalizing at all. A slight precision degradation — like getting documents with `2` when searching for `②` — is acceptable, but failing to return any results when searching for `café` with `cafe` is critical.

That said, it's important to recognize NFKC's limitations. NFKC alone cannot handle case unification, accent removal, or hiragana conversion. Combine additional normalization as needed, but pay attention to the order of application.


References