By Dr. Anastasio Molano, Denodo Technologies
Wouldn't it be nice if you could receive an exact answer when you query a search engine, instead of a list of related documents? Questions such as "What is an iceberg?" or "What is the distance between Rome and Paris?" would receive a precise answer, rather than a long list of URLs. This will be possible in the near future, thanks to the evolution of Natural Language Processing techniques, or NLP for short.
The volume of online information on the Web and on corporate Intranets has increased to such an extent that there is a growing need for tools that help people locate, filter and manage these resources efficiently.
Recent advances in Human Language Technologies have fostered a new generation of search tools which make use of NLP techniques and resources to improve their search capabilities.
Search tools have traditionally been based on classical information retrieval (IR) techniques, for example, some form of Boolean search or probabilistic retrieval method. In these systems, search does not take into account the underlying linguistic properties of text.
NLP technologies go beyond traditional information retrieval techniques, enabling a system to accomplish a human-like understanding of documents, and thus permitting the extraction of useful meaning from unstructured texts.
Search companies such as Ask Jeeves, Convera, Northern Light, Verity, SmartLogik, Q-Go, and Cognit among others, have incorporated NLP techniques in their commercial search solutions.
Expectations are high, as search tools can have a great impact on the industry. This is especially true for corporate Intranet applications, and more generally, in applications where searching efficiently over extensive document repositories is crucial, e.g. digital libraries, medical databases and legal databases. The emergence of multilingual, cross-lingual and interactive retrieval tools will further increase the demands for this kind of technology.
Given the fast growing volume of online digital information, other complementary information processing techniques are required, such as automatic text categorization, filtering and summarization.
It is worth reviewing these techniques briefly here.
NLP comprises those theories and techniques which enable a system to exploit the linguistic properties of text in order to extract meaning from it. Understanding word meanings, and how words relate to one another within a sentence structure, is key to grasping the true meaning of a text.
Linguistic knowledge includes morphological, syntactic and semantic information that can be applied within the information retrieval process, for example, to expand queries with related terms (e.g. synonyms) and thus to retrieve a wider selection of relevant documents.
NLP techniques rely on the use of lexical resources, the most common ones being machine-readable dictionaries (MRDs) (e.g. Longman's Dictionary of Contemporary English, LDOCE), and Thesauri (e.g. Roget's), which organize words on the basis of their meanings (rather than alphabetically).
An even richer resource than a MRD or a Thesaurus is the Lexical Knowledge Base, a fully structured computational lexicon where word forms are associated according to morpho-syntactic, semantic and other kinds of information.
There has been a huge effort in building comprehensive lexical knowledge databases during the last decade, both in the USA and Europe.
WordNet and EuroWordNet
The Cognitive Science Laboratory at Princeton University developed WordNet in 1995 (currently on version 1.7.1), a large-scale, domain-independent, freely available lexical knowledge base for the English language. Information in WordNet is organized around logical groupings of related terms called 'synsets' (short for synonym sets), each of which consists of a list of synonymous word forms and semantic pointers that describe relationships between the current synset and other synsets. The semantic content and the large coverage of WordNet make it a powerful tool for conceptual text retrieval.
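The synset organization described above can be sketched in a few lines of Python. The class, entries and glosses below are illustrative toy data, not WordNet's actual contents or API:

```python
# Toy model of a WordNet-style lexical knowledge base. A real WordNet has
# ~100,000 synsets and many more relation types than the hypernym pointer
# shown here; everything below is invented for illustration.

class Synset:
    def __init__(self, lemmas, gloss):
        self.lemmas = lemmas        # synonymous word forms
        self.gloss = gloss          # short definition
        self.hypernyms = []         # pointers to more general synsets

car = Synset(["car", "auto", "automobile"], "a motor vehicle with four wheels")
vehicle = Synset(["vehicle"], "a conveyance that transports people or objects")
car.hypernyms.append(vehicle)       # semantic pointer: car IS-A vehicle

def synonyms(word, synsets):
    """All word forms that share a synset with `word`."""
    out = set()
    for s in synsets:
        if word in s.lemmas:
            out.update(s.lemmas)
    out.discard(word)
    return out
```

With these toy entries, `synonyms("car", [car, vehicle])` yields `{"auto", "automobile"}`, which is exactly the information a conceptual retrieval system draws on.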
The European counterpart is EuroWordNet, developed under an EC funded project within the Telematics Applications Programme, which ended in 1999. EuroWordNet is a multilingual database, which includes semantic relations between words for Dutch, Italian, Spanish, German, French, Czech and Estonian. Within EuroWordNet an inter-lingual index was created to interconnect the languages in such a way that it is possible to go from the words in one language to similar words in any other language, an interesting feature that permits conceptual and cross-language information retrieval.
Other similar lexical knowledge bases are currently being developed for Swedish, Norwegian, Danish, Greek, Portuguese, Basque, Catalan, Romanian, Lithuanian, Russian, Bulgarian and Slovenian. These multilingual databases constitute a highly valuable resource for search engines performing multilingual cross-language text retrieval.
Mono-lingual text retrieval
NLP techniques can be used at all the stages of the information retrieval process, e.g. at indexing time or at querying time.
The indexing process starts with the removal of stop words from the original text, followed by stemming, the process of reducing words to some base form. NLP-based stemming, also known as lemmatization, applies morphological analysis to extract the base form of a word, then checks that base form against a machine-readable dictionary to make sure it is a valid word stem. So, for example, a search for "go" would also find the word "went", whereas a classical stemmer would only relate regular variants such as "going" and "goes".
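The contrast between classical suffix-stripping and dictionary-backed lemmatization can be illustrated with a toy sketch. The suffix list, exception table and lexicon below are tiny illustrative assumptions, not a real morphological analyser:

```python
# Toy contrast between a crude suffix-stripping stemmer and a
# dictionary-backed lemmatizer. The suffix list, irregular-form table and
# lexicon are illustrative stand-ins for real morphological resources.

def suffix_stem(word):
    """Classical stemmer: strips common suffixes, no dictionary check."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

IRREGULAR = {"went": "go", "gone": "go", "was": "be", "mice": "mouse"}
LEXICON = {"go", "be", "mouse", "walk", "search"}

def lemmatize(word):
    """NLP-style lemmatizer: handles irregular forms, then validates the
    stripped base form against a machine-readable dictionary."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    base = suffix_stem(word)
    return base if base in LEXICON else word
```

Here `lemmatize("went")` returns `"go"`, while `suffix_stem("went")` leaves the word untouched, which is precisely the gap between the two approaches described above.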
Then, a part-of-speech tagger assigns labels ('tags') to words reflecting their syntactic category, e.g. noun, adjective, verb or adverb. More advanced taggers attempt to recognize proper names, acronyms and complex phrases as single tokens. For example, "New York" would be treated as a single unit rather than as a sequence of two words in the text.
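The multi-word token recognition step might be sketched as follows; the `MULTIWORD` table is a hypothetical stand-in for the proper-name and phrase recognition a real tagger performs:

```python
# Toy tokenizer that recognizes known multi-word proper names as single
# tokens, as described for "New York". The MULTIWORD table is an invented
# illustration of the lexical resource a real tagger would consult.

MULTIWORD = {("New", "York"), ("Northern", "Light")}

def tokenize(text):
    words = text.split()
    tokens, i = [], 0
    while i < len(words):
        # If this word and the next form a known multi-word name,
        # emit them as one token and skip both.
        if i + 1 < len(words) and (words[i], words[i + 1]) in MULTIWORD:
            tokens.append(words[i] + " " + words[i + 1])
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens
```

So `tokenize("I flew to New York")` yields four tokens, with `"New York"` kept as a single unit.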
A syntactic analyser processes each sentence to build a tree structure of phrases comprising nouns, verbs, prepositions and conjunctions. Once this portion of sentence analysis is completed, the semantic analysis can proceed to synthesize these multi-word structures into meaningful concept relationships.
At querying time, NLP techniques can be applied to either expand the query with semantically related terms, e.g. with synonyms of the relevant words in the query, or to compare queries against documents to improve search precision.
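A minimal sketch of synonym-based query expansion, assuming a tiny hand-built thesaurus; the entries and the Boolean matching rule below are illustrative, not any particular engine's behaviour:

```python
# Toy query expansion: each query term is OR-expanded with its synonyms
# from a small hand-built thesaurus (entries invented for illustration),
# then matched against documents Boolean-style.

THESAURUS = {
    "car": {"auto", "automobile"},
    "fast": {"quick", "rapid"},
}

def expand_query(terms):
    """For each query term, the set of that term plus its synonyms."""
    return [{t} | THESAURUS.get(t, set()) for t in terms]

def matches(document_words, expanded):
    """A document matches if every expanded term-set is represented."""
    words = set(document_words)
    return all(group & words for group in expanded)
```

A query for "fast car" now retrieves a document containing "rapid automobile", even though neither original query term appears in it.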
Multilingual cross-language retrieval
Multilingual cross-language retrieval refers to retrieval of documents in different languages regardless of the language in which the query is performed. In this scenario we could, for example, perform a query in English and receive a relevant document written in Spanish. EuroWordNet has been successfully used to implement this kind of multilingual search for several European languages.
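The inter-lingual index idea can be sketched as a mapping from (language, word) pairs to shared concept identifiers; the concept IDs and entries below are invented for illustration:

```python
# Toy inter-lingual index in the spirit of EuroWordNet: words in each
# language map to shared concept IDs, so an English query can retrieve a
# Spanish document. Concept IDs and entries are invented for illustration.

ILI = {  # (language, word) -> concept id
    ("en", "iceberg"): "C001", ("es", "iceberg"): "C001",
    ("en", "distance"): "C002", ("es", "distancia"): "C002",
}

def concepts(words, lang):
    """Map the words of a text into language-independent concept IDs."""
    return {ILI[(lang, w)] for w in words if (lang, w) in ILI}

def cross_lingual_match(query, qlang, doc, dlang):
    """Match query and document in concept space rather than word space."""
    return bool(concepts(query, qlang) & concepts(doc, dlang))
```

An English query containing "distance" matches a Spanish document containing "distancia" because both resolve to the same concept ID.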
Question answering

Question answering is a new breed of search application in which the user simply asks a question, as he or she would of another person. "When did Hawaii become a state?", "How far is Denver from Aspen?" or "What is an iceberg?" are questions that could be answered by such systems.
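A very simple pattern-based question-answering sketch, assuming a small hand-built glossary; real systems combine parsing, named-entity recognition and passage retrieval, none of which is attempted here:

```python
# Toy pattern-based question answering: a "What is X?" question is parsed
# with a regular expression and answered from a small glossary. The
# glossary entry is invented for illustration.

import re

GLOSSARY = {
    "iceberg": "a large mass of ice floating in the sea",
}

def answer(question):
    m = re.match(r"what is (?:a |an |the )?(.+?)\?", question, re.IGNORECASE)
    if m:
        return GLOSSARY.get(m.group(1).lower(), "unknown")
    return "unsupported question type"
```

`answer("What is an iceberg?")` returns the glossary definition directly, rather than a list of documents; questions outside the single supported pattern are declined.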
Market situation - Europe
The European NLP industry is still very young compared to its US counterpart, although there are some important players such as Albert and the Xerox Research Center Europe (XRCE). Other start-up companies are appearing on the market, such as the Norwegian CognIT and the Dutch company Irion.
According to a recent Strategy Partners market report "Content Management Europe 2001-2003", content management (CM) is the fastest growing IT sector, while much of the IT industry is in recession. Information Retrieval tools, and consequently those based on NLP techniques are included within this umbrella, along with electronic document management, enterprise information portals, electronic publishing and collaborative filtering.
The worldwide spend on CM was $4,766 million in 2001, and it is forecast to grow at a rate of 29.5% over the period 2001-2003, reaching a total market value of $10,445 million in 2003.
The spend in Europe, in comparison, was $1,315 million in 2001, with an expected growth rate of 34.5% over 2001-2003 (a higher than average figure), reaching $3,325 million in 2003.
Market studies for Knowledge Management (KM) show a similar trend, with an estimated worldwide spend on KM software and services of $6,000 million in 2002, forecast to reach $14,000 million in 2006 (source: IDC). Search tools, including those based on NLP, are at the heart of this movement, so the prospects for the future are very good.
Among the EC initiatives fuelling the European NLP industry, the Cross-Language Evaluation Forum (CLEF) is worth noting. Funded under the IST Programme, CLEF was recently created to promote monolingual and cross-language retrieval in European languages, as a parallel initiative to the US Text REtrieval Conference (TREC).
The goal of CLEF is to promote the development of European cross-language retrieval systems, to guarantee European competitiveness in the global marketplace.
In Spain, a multilingual, multicultural country with four official languages (Spanish, Basque, Galician and Catalan), there are a substantial number of NLP research groups working in Information Retrieval. These are at the University of Alicante (with a strong presence at TREC with Question-Answering applications), UNED, the Technical University of Catalonia (UPC - TALP research centre) and the University of Barcelona, all of which are involved in the EuroWordNet project. The University of the Basque Country (whose Ixa research group is extending EuroWordNet with the Basque language) and the Technical University of Madrid (UPM) also have active groups in IR.
Additionally there are some up-and-coming companies such as Daedalus, a spin-off of the Technical University of Madrid, and CLiC, a spin-off of the University of Barcelona.
Spain has a prominent position in the Content Management market (source: Strategy Partners), mainly due to the strong presence of Telecom and media companies, who are early adopters of these kinds of solutions. This bodes well for the adoption and expansion of NLP tools.
The growth of Spanish and Portuguese on the Internet, which at present is largely dominated by English, depends on the development of NLP techniques and resources for both languages. This is an important concern that the Iberoamerican CYTED Programme has addressed through its RITOS2 Research Network, which includes NLP research groups from 14 Iberoamerican countries.
Beyond Europe - Internet Search Engines
Traditionally, companies that operate Internet crawlers have been reluctant to include NLP techniques in their products, mainly due to the high computational cost of these techniques. However, as computing power continues to increase (and become cheaper), this should soon cease to be a limiting factor, and we can expect to see crawlers with NLP capabilities as a common feature very soon.
Currently, most existing search engines make use of at least simple NLP techniques, in order to interpret queries written in question form (the so-called natural language queries) and to regularize singular and plural forms of words in the index.
Some Internet search engines already feature advanced NLP techniques. For example, Northern Light employs NLP to categorize documents providing specialised information folders. These are used to improve relevance ranking, and to filter and organize the results. NLP is also at the core of the Ask Jeeves engine, which applies grammar processing, tokenization, stemming, stop-wording, parsing and semantic analysis in its search process.
The Norwegian company FAST Search & Transfer, provider of the search infrastructure for Alltheweb and Lycos, offers an advanced linguistics package. This includes lemmatization, approximate matching, and phrasing (the ability to identify phrases in sentences). These features are available for English and Spanish.
Still, there is a long way to go. Very few search engines make use of already available lexical knowledge bases such as WordNet or, more importantly, its multilingual counterparts such as EuroWordNet.
Question Answering techniques appear to be a very promising approach for the new generation of search engines. At present, no commercial Internet search engine has included this feature in its core technology. However, the demand is there, as witnessed by Google's recently launched Answers service.
Large corporations have huge amounts of electronic documents in dispersed repositories, Intranet Web pages and e-mails, which are critical to the day-to-day operation of the company e.g. product information, client histories, financial data, inventory, and more. NLP-empowered intranet search engines are a competitive alternative to other technologies for this purpose.
Among search companies that focus on the corporate domain and make use of NLP techniques, the most representative are Convera and Applied Psychology Research (formerly SmartLogik).
Convera's RetrievalWare performs natural language processing: morphological analysis, linguistic pattern matching and search-term expansion of queries. The latter uses a proprietary lexicon, the RetrievalWare Semantic Network, a collection of approximately 500,000 English words with semantic relations, along with a powerful categorization tool.
SmartLogik includes stemmers for 10 languages, and it uses a thesaurus for query term expansion. It features both a search engine and a categorizer.
Verity, the main player in the market of Intranet search, has extended its search technology to support NLP. It makes use of stemming and synonym expansion. A thesaurus is included for each of the supported languages. The system administrator can select lexical and grammatical analysis among the available search options.
Inktomi offers natural language queries across XML data repositories, avoiding the more powerful but complex SQL-style queries supported by information mediators such as Denodo.
IDC, the IT market analysis firm, predicts that in the long term search will be embedded in most enterprise applications, as a function of the portal, infrastructure or gateway. The potential demand for these technologies in the corporate domain is therefore very high, which is why traditional Internet search companies such as Google, Northern Light, FAST and Inktomi are now repositioning their business strategies towards this very appealing market.
Other emerging application fields
Medical applications require powerful search tools to explore large repositories of information about diseases (scientific articles, studies), pharmaceutical products, and so on. Simple keyword-based search tools are not suitable for such applications, where high precision is key. Summarization is also a desirable feature here, allowing users to quickly grasp document contents. NLP tools could meet all these needs. Worth mentioning here is the EC-funded LIQUID project, which aims at developing a cross-lingual information retrieval system in the field of gastroenterology.
Similar information retrieval requirements hold also for legal databases and public digital libraries. The HERMES project, funded by the Spanish Ministry of Science and Technology, focuses on applications to facilitate the retrieval of multilingual textual information in the field of Digital Libraries. Automatic categorization, summarization and concept extraction based on linguistic engineering will be implemented in the project.
An emerging application field of NLP tools is Competitive Intelligence, a discipline that is growing in importance in the business world, particularly for technology companies.
Competitive Intelligence tools perform continuous monitoring of large amounts of information from different sources, searching for new advances, patents, products, and relevant legal and administrative information. This intelligence can be crucial to organisations, especially SMEs, for making informed business decisions. NLP tools such as KnowledgeGist from Invention Machine are having a great impact in this field.
Anastasio Molano has a Ph.D. in Telecommunication Engineering from the Technical University of Madrid (UPM). He held a visiting research position at the Real-Time and Multimedia Systems Laboratory, Carnegie Mellon University, between 1996 and 1998. Since then he has been Chief Technology Officer of Denodo Technologies, a Spanish start-up company specializing in search technologies and information management. He is responsible for research into the application of NLP techniques to search engines.
The editors of HLTCentral would welcome any feedback on the article.