Language-Specific Configurations
To achieve global search functionality, Algolia needs to know the language of both your data and your end users.
Knowing this enables the engine to apply important word-based processing techniques, such as:
- removing common (stop) words like “the” and “a”
- making singulars and plurals equivalent
- detecting word roots
- separating or combining compound words.
This page covers each of these techniques. First, however, you need to tell Algolia which languages are in use.
Setting the Language of the Search
Algolia does not try to detect the language of either the index or the user. For dictionary-based settings such as typo tolerance, stop words, and plurals, you need to tell the engine which languages these settings should use as their basis. If you don’t, the engine falls back to the default, which is to use all dictionaries. This can produce anomalies, such as applying French spelling rules to English text. Therefore, if you want language-based settings to behave precisely and unambiguously, override the default by specifying the language of your data and end users.
You can do this individually for each setting, or more globally with one system-wide setting.
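As a rough illustration of the two approaches, the sketch below builds a settings payload in plain Python. The helper `build_language_settings` is hypothetical; the keys `indexLanguages`, `queryLanguages`, `removeStopWords`, and `ignorePlurals` are assumed here to be the relevant Algolia parameter names, and the payload would still need to be sent through whichever API client you use.

```python
def build_language_settings(languages, per_setting=None):
    """Return a settings payload that pins language-based features.

    `languages` is a list of language codes such as ["en", "fr"].
    `per_setting` optionally overrides the language list for one setting.
    (Hypothetical helper, for illustration only.)
    """
    per_setting = per_setting or {}
    return {
        # Global approach: one list shared by all language-based settings.
        "indexLanguages": list(languages),
        "queryLanguages": list(languages),
        # Individual approach: True defers to the global list above,
        # while an explicit list applies to that setting only.
        "removeStopWords": per_setting.get("removeStopWords", True),
        "ignorePlurals": per_setting.get("ignorePlurals", True),
    }


settings = build_language_settings(["en"], {"ignorePlurals": ["en", "fr"]})
print(settings)
```

Here stop word removal follows the global English setting, while plurals are handled for both English and French.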
Removing stop words
To separate the key terms of a search from common words such as “the”, “on”, and “it”, the engine can be set up to ignore these common words. Stripping them from a search helps the engine focus on the essentials of what people are looking for: nouns and adjectives.
Removing stop words is dictionary-based. We parse several sources (Wiktionary and ranks.nl) to build a list of words commonly used as stop words, not only in English but in roughly 50 languages.
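The following is a minimal sketch of dictionary-based stop word removal, not the engine's actual implementation. The word lists are tiny and invented for illustration, and `remove_stop_words` is a hypothetical name. It also shows why the language matters: applying the French list to an English query strips words that should have been kept.

```python
# Tiny illustrative lists; a real dictionary holds far more entries per language.
STOP_WORDS = {
    "en": {"the", "a", "an", "on", "it", "of"},
    "fr": {"le", "la", "les", "un", "une", "de"},
}


def remove_stop_words(query, languages):
    """Drop every word that appears in the stop word list of any given language."""
    stop = set()
    for lang in languages:
        stop |= STOP_WORDS.get(lang, set())
    return " ".join(word for word in query.split() if word.lower() not in stop)


print(remove_stop_words("the cat on the mat", ["en"]))  # cat mat
print(remove_stop_words("de la soul", ["en"]))          # de la soul
print(remove_stop_words("de la soul", ["en", "fr"]))    # soul
```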
Ignoring plurals (and other alternative forms)
Ignoring plurals, if enabled, tells the engine to consider the plural and singular forms of a word as equivalent.
In English, this is as easy as ignoring the “s” (“cars” = “car”), but what about “es”, or “feet” = “foot”?
To ensure completeness, and to support multiple languages, we rely heavily on Wiktionary templates, which allow Wiktionary contributors to declare alternative forms of a word. For example, the template {en-noun|s} would show up like this on the “car” page of Wiktionary:
car (plural cars)
By using templates found inside the Wiktionary data, we are able to build our dictionary of alternative forms. Note that almost every language has its own template syntax, and many languages have multiple templates.
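To make the idea concrete, here is a small sketch that turns template declarations into a lookup of alternative forms. It assumes a simplified, single-argument version of the template shown above, where `s` and `es` are treated as suffixes and any other argument as the full plural; real templates are richer, and the function names are hypothetical.

```python
import re

# Matches a simplified template such as {en-noun|s} or {en-noun|feet}.
TEMPLATE = re.compile(r"\{en-noun\|([^}|]+)\}")


def plural_from_template(headword, template):
    """Derive the plural declared by a simplified en-noun template."""
    match = TEMPLATE.fullmatch(template.strip())
    if match is None:
        return None
    arg = match.group(1)
    # Assumption: "s" and "es" are suffixes; anything else is the whole plural.
    return headword + arg if arg in ("s", "es") else arg


def build_alternatives(entries):
    """Map every known form (singular or plural) to its singular headword."""
    forms = {}
    for headword, template in entries:
        forms[headword] = headword
        plural = plural_from_template(headword, template)
        if plural:
            forms[plural] = headword
    return forms


forms = build_alternatives([
    ("car", "{en-noun|s}"),
    ("box", "{en-noun|es}"),
    ("foot", "{en-noun|feet}"),
])
print(forms["cars"], forms["boxes"], forms["feet"])  # car box foot
```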
Wiktionary templates also support other alternative forms:
- German declension, where a noun changes form depending on its gender, its number, and its case, that is, the role it plays in a sentence (nominative, accusative, dative, or genitive).
A German noun can therefore have numerous endings: -er, -e, -es, -e (nominative), -en, -e, -es, -e (accusative), -em, -er, -em, -en (dative), -es, -er, -es, -er (genitive).
- Dutch diminutive endings, where a noun's ending changes to mark that something is small, countable, or similarly nuanced. For example, huisje is a small huis, and colaatje is a glass of cola.
Decompounding words
Compound words refer to noun phrases (or nominal groups) which combine, without spaces, a number of words to form a single entity or idea.
For example, “Vaðlaheiðarvegavinnuverkfærageymsluskúraútidyralyklakippuhringur” combines a series of Icelandic words to mean “the key ring of the key chain of the outer door to the storage tool shed of the road workers on the Vaðlaheiði plateau”. Very precise, and probably useful when you need a key ring of the key chain, etc.
Perhaps a simpler example is the German word “Lebensabschnittpartner”, which means “the person I am with today”, or current life partner.
The goal of decompounding is to index the individual words “Leben”, “abschnitt”, and “partner” (in English, roughly: “life”, “section”, “partner”) separately, thereby improving the chance of a match.
As of today, this setting supports only three languages: Dutch (nl), German (de) and Finnish (fi).
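The sketch below shows one simple way decompounding could work: a dictionary-based split that tries the longest known prefix first and allows an optional linking “s” between parts. The three-word vocabulary and the function name are assumptions made for illustration; this is not a description of how the engine itself decompounds.

```python
# Illustrative vocabulary; a real dictionary would be far larger.
VOCABULARY = {"leben", "abschnitt", "partner"}
# Optional linking elements between parts (German often inserts an "s").
LINKERS = ("", "s")


def decompound(word, vocabulary=VOCABULARY):
    """Split `word` into known parts, or return None if no full split exists."""
    word = word.lower()
    if not word:
        return []
    # Try the longest known prefix first, then recurse on what follows it.
    for end in range(len(word), 0, -1):
        prefix = word[:end]
        if prefix not in vocabulary:
            continue
        rest = word[end:]
        for linker in LINKERS:
            if not rest.startswith(linker):
                continue
            tail = rest[len(linker):]
            if linker and not tail:
                continue  # a linking element must be followed by another part
            parts = decompound(tail, vocabulary)
            if parts is not None:
                return [prefix] + parts
    return None


print(decompound("Lebensabschnittpartner"))  # ['leben', 'abschnitt', 'partner']
print(decompound("Haus"))                    # None
```

Indexing the resulting parts alongside the full compound is what lets a query for “partner” match a record containing only the long form.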