Pieter Bruegel the Elder, The Tower of Babel
Editorial

Searching for Information in the Tower of Babel

By Martin White
Enterprise search already poses enough challenges. Add in multilingual search, and the challenges only grow.

Only in the last few years have we started to recognize the impacts and implications of employees in multinational organizations working in more than one language.

The concept of a definitive single corporate language is fast disappearing as these organizations adopt collaboration as the default mode of work and introduce multiple social media channels. And while employees may indicate language skills on their profiles, the level of competence (a term whose definition is open to debate) at speaking, reading, writing and understanding a foreign language may vary considerably. 

So it stands to reason that employees in multilingual organizations are enlisting search applications to locate information in a wide range of languages, but I suspect these organizations lack a strategy that commits the resources and action needed to make this possible.

Accidental Corporate Language Policies

All too often intranet teams define a corporate language policy almost by accident, as requests come in from around the world to publish in local languages. A language policy has to cover not only which languages will be supported but also which pairs of languages will be supported.

Machine translation is adequate in some contexts, but insufficient for contracts and other official documents, and for press releases from publicly listed companies. A slight machine mistranslation of a corporate press release could have unexpected and undesirable effects on the share price.

The Difference Between Multilingual and Cross-Lingual Search

Multilingual and cross-lingual search are two very different processes that are often confused. Multilingual search is where a query in English will search content in English, a query in French will search content in French, and so on. Ideally the index for each language will be of a similar quality, having been created by the appropriate stemming and lemmatization tools and with equally appropriate stop words. Integrating content in English, French and German into the same index is a very poor approach, since each language needs its own stemming rules and stop words.
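As a rough illustration of keeping one index per language, here is a minimal Python sketch. Everything in it is an assumption made for the example: the tiny stop-word lists, the crude suffix stripper standing in for a real stemmer or lemmatizer, and the hypothetical `PerLanguageIndex` class. It is not how any particular search product works.

```python
from collections import defaultdict

# Toy, illustrative resources (assumptions); a real deployment would use
# proper per-language stemmers or lemmatizers and full stop-word lists.
STOP_WORDS = {
    "en": {"the", "a", "of", "and", "in"},
    "fr": {"le", "la", "les", "de", "et", "des"},
    "de": {"der", "die", "das", "und", "von"},
}
SUFFIXES = {
    "en": ("ing", "s"),
    "fr": ("es", "s"),
    "de": ("en", "e"),
}


def analyze(text, lang):
    """Lowercase, drop stop words and strip at most one suffix, per language."""
    tokens = []
    for word in text.lower().split():
        if word in STOP_WORDS[lang]:
            continue
        for suffix in SUFFIXES[lang]:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        tokens.append(word)
    return tokens


class PerLanguageIndex:
    """One inverted index per language, each built with its own analyzer."""

    def __init__(self):
        self.indexes = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, text, lang):
        for token in analyze(text, lang):
            self.indexes[lang][token].add(doc_id)

    def search(self, query, lang):
        # The query is analyzed and matched only against its own language.
        results = None
        for token in analyze(query, lang):
            postings = set(self.indexes[lang].get(token, set()))
            results = postings if results is None else results & postings
        return results or set()


index = PerLanguageIndex()
index.add("d1", "the contracts of the company", "en")
index.add("d2", "les contrats de la société", "fr")
english_hits = index.search("contract", "en")
french_hits = index.search("contrats", "fr")
```

Because each language has its own analyzer and postings, an English query never competes with French or German terms that merely happen to share a spelling.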

Cross-lingual search is where a search term in English is used and the search application uses taxonomies, thesauri and maybe machine translation to match the meaning of the term in multiple languages. This is seriously challenging, not just from the query and index management standpoint but also because it raises questions of how to present the results, for which there are many options.
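The thesaurus-driven part of cross-lingual search can be sketched in a few lines of Python. The `THESAURUS` table, its concept IDs and both function names are hypothetical, and grouping hits by language is only one of the many presentation options.

```python
# Hypothetical multilingual thesaurus: concept ID -> terms per language.
THESAURUS = {
    "C001": {"en": ["contract"], "fr": ["contrat"], "de": ["vertrag"]},
    "C002": {"en": ["invoice", "bill"], "fr": ["facture"], "de": ["rechnung"]},
}


def expand_query(term, source_lang, target_langs):
    """Map a term to its concept, then to equivalent terms in each target language."""
    term = term.lower()
    expanded = {}
    for labels in THESAURUS.values():
        if term in labels.get(source_lang, []):
            for lang in target_langs:
                expanded.setdefault(lang, []).extend(labels.get(lang, []))
    return expanded


def group_results_by_language(hits):
    """One presentation option: show hits grouped by document language."""
    grouped = {}
    for doc_id, lang in hits:
        grouped.setdefault(lang, []).append(doc_id)
    return grouped


per_language_terms = expand_query("Invoice", "en", ["fr", "de"])
grouped = group_results_by_language([("d1", "fr"), ("d2", "de"), ("d3", "fr")])
```

Each expanded term list would then be run against the index for its own language. A term missing from the thesaurus returns an empty expansion, which is where machine translation might be brought in as a fallback.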

The Special Challenges of Multilingual Documents

When in doubt, assume documents contain more than one language. A document in German may quote the text of a local contract in French and have an executive summary in English. Further adding complexity is the practice of adding metadata in English to a German document because English is the ‘corporate’ language. These situations pose a challenge not only to the language identification algorithms in the search application but also to the ranking of the document.

Mixing languages has implications for the weighting of terms in retrieval. A short snippet of French appearing in predominantly German text would give more weight to the French words, since they are relatively uncommon in that collection. But mix that document in with documents written predominantly in French, and the same French snippet now carries much less weight.
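The weighting effect can be illustrated with inverse document frequency (IDF), assuming the ranking uses some IDF-style measure. The smoothed formula and the two toy collections below are assumptions chosen for the example, not a description of any specific search engine.

```python
import math


def idf(term, documents):
    """Smoothed inverse document frequency: rarer terms get a higher weight."""
    containing = sum(1 for doc in documents if term in doc.lower().split())
    return math.log(len(documents) / (1 + containing)) + 1


# The same mixed-language document appears last in both collections.
german_heavy = [
    "der vertrag wurde gestern unterschrieben",
    "die rechnung ist noch offen",
    "das angebot gilt bis freitag",
    "laut klausel le contrat gilt weiter",  # short French snippet
]
french_heavy = [
    "le contrat a été signé hier",
    "le contrat reste valable",
    "le client a reçu le contrat",
    "laut klausel le contrat gilt weiter",
]

weight_in_german = idf("contrat", german_heavy)
weight_in_french = idf("contrat", french_heavy)
```

The French term "contrat" scores a higher weight in the German-heavy collection, where only one document contains it, than in the French-heavy collection, where every document does.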

Metadata presents a special problem. A document in Chinese may end up at the top of a list of relevant documents if it has been helpfully tagged in English, given an English title and the metadata is given substantial weight in the ranking algorithm.
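A small sketch shows how field weights can produce this result. The scoring function, the field names and the weight values are all hypothetical; real ranking algorithms are considerably more elaborate.

```python
def score(doc, query_terms, field_weights):
    """Sum of per-field term matches, scaled by each field's weight."""
    total = 0.0
    for field, weight in field_weights.items():
        tokens = doc.get(field, "").lower().split()
        total += weight * sum(tokens.count(term) for term in query_terms)
    return total


chinese_doc = {
    "title": "Supplier contract summary",  # English title added by hand
    "tags": "contract supplier",           # English metadata
    "body": "供应商 合同 摘要",              # body text is Chinese
}
english_doc = {
    "title": "Quarterly report",
    "tags": "finance",
    "body": "the supplier contract was renewed and the contract terms updated",
}

query = ["supplier", "contract"]
metadata_heavy = {"title": 5.0, "tags": 3.0, "body": 1.0}
body_heavy = {"title": 1.0, "tags": 1.0, "body": 2.0}

chinese_first = score(chinese_doc, query, metadata_heavy) > score(english_doc, query, metadata_heavy)
english_first = score(english_doc, query, body_heavy) > score(chinese_doc, query, body_heavy)
```

With metadata weighted heavily, the Chinese document outranks the English one for an English query even though its body contains no English at all. Shifting weight toward the body reverses the order.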

The Joys of Query Language Identification

Although a number of language detectors are available, most need at least 200 characters to confirm a language. Microblog posts are often shorter than that, though langid does a pretty good job with them.

Identifying a language prior to indexing is not going to create any latency the user will be aware of. That is not the case with a query, where the identification has to be done on the fly. In these instances, it may help to prompt the user to confirm that the language identified for the query term(s) is correct. A common use case is where an employee is working in France, is logged in to the French language pages of the intranet, but then conducts a search in English. A single word may not be a major problem, but a noun phrase could be, so check this point with your search vendor.
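One way to picture the prompt-the-user approach is the sketch below. The hint-word lists, the `min_hits` threshold and the function name are assumptions for illustration; a production system would use a trained language detector rather than word lists.

```python
# Tiny illustrative hint-word lists (assumptions), not a real detector.
HINT_WORDS = {
    "en": {"the", "of", "and", "for", "policy", "with"},
    "fr": {"le", "la", "de", "et", "pour", "politique", "avec"},
}


def guess_query_language(query, ui_language, min_hits=2):
    """Return (language, needs_confirmation) for a short query."""
    tokens = query.lower().split()
    hits = {
        lang: sum(token in words for token in tokens)
        for lang, words in HINT_WORDS.items()
    }
    best = max(hits, key=hits.get)
    if hits[best] == 0:
        # No evidence at all: fall back to the interface language and ask.
        return ui_language, True
    # Weak evidence, or a tie between languages: prompt the user.
    tied = list(hits.values()).count(hits[best]) > 1
    return best, tied or hits[best] < min_hits


# An employee on the French intranet pages searching in English.
confident = guess_query_language("travel policy for the contractors", "fr")
single_word = guess_query_language("policy", "fr")
no_evidence = guess_query_language("budget", "fr")
```

The design choice here is to return a confirmation flag alongside the guess, so the interface can ask the user only when the evidence is thin instead of interrupting every search.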

Plan for Multilingual Search

Carol Peters, Martin Braschler and Paul Clough published the definitive text on multilingual search in 2010. The book runs over 200 pages, which may give you an indication of how complex this is to implement. A corporate language policy is an important umbrella, and so is identifying and assessing the extent of bilingual or even trilingual content.

If multilingual search is important to your organization, make sure you understand how your current application is managing the process before writing a specification for a new search application. As is so often the case with search, the devil is in the details.

About the Author

Martin White

Martin White is Managing Director of Intranet Focus, Ltd. and is based in Horsham, UK. An information scientist by profession, he has been involved in information retrieval and search for nearly four decades as a consultant, author and columnist.

Main image: public domain