subject: Customized News Powered by Semantic Technology
ABOUT ctrl
ctrl, the first research and development (R&D) product of PRAGMATECH (www.pragma-tech.com), is a library for text processing that performs semantic analysis on (news-like) textual documents. The ctrl API can be used for document summarization and extraction of key topics and, most importantly, to index and retrieve documents by concepts and topics (subject matter) as opposed to key words and key phrases.
WHAT DO WE MEAN BY 'SEMANTICS'?
With the advent of the Semantic Web, the term 'semantics' has often been used sloppily. Some have even claimed to have systems that 'understand' natural language. At PRAGMATECH, on the other hand, we are well aware of the multitude of technical and theoretical problems that hinder any real progress in natural language processing (NLP) due to difficulties in the semantic treatment of a number of phenomena in natural language, such as scope and reference resolution, textual entailment, metonymy, nominal compounds, intentionality, and metaphor, just to name a few.
Having said that, it is important to point out that difficulties in NLP research do not necessarily imply that advanced information retrieval systems that go beyond the capabilities of statistical and keyword indexing cannot be developed. This is simply because the retrieval of relevant documents is not equivalent to language understanding, although the latter subsumes the former. Comprehending a piece of text is a much broader function and is a must when the goal is natural language question answering, but an 'expert' need not fully comprehend a piece of text to simply determine its 'aboutness', i.e., to determine what a certain document is about.
Thus, and notwithstanding our conviction that as of yet there are no systems that truly 'understand' natural language text, we do believe that semantic (or topic-based) retrieval can be attained. It is in this limited context of document retrieval that we use the term 'semantics' here: to mean retrieval of documents that are semantically (or topically) related, as opposed to keyword-based retrieval models. To achieve this, the analysis of the content of documents in ctrl is based on the analysis of concepts and topics (compound concepts), as opposed to words and phrases.
FROM WORDS TO CONCEPTS
In ctrl we process and analyze concepts, not words, and this includes names of things (people, countries, organizations, products, etc.). Thus, from words we try to infer the most likely meaning in context by a process known as Word Sense Disambiguation (WSD), which is an essential part of the semantic analysis process.
The accuracy of our WSD algorithm in inferring the most likely meaning of a word in context is above 80%, which, as far as we know, is considerably higher than the WSD results that have been reported in the computational linguistics literature.
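To make the idea of Word Sense Disambiguation concrete, here is a minimal sketch of a simplified gloss-overlap approach (in the spirit of the classic Lesk algorithm). It is not ctrl's algorithm, which is not described here; the sense inventory, sense labels, glosses, and function names are all hypothetical and invented for illustration.

```python
# Toy sense inventory: each sense of a word has a short gloss.
# These entries are invented for illustration only.
SENSES = {
    "bank": {
        "bank/finance": "institution that accepts deposits and lends money",
        "bank/river": "sloping land beside a river or lake",
    },
}

STOPWORDS = {"a", "an", "and", "the", "that", "or", "of", "to", "in", "on", "at", "by"}


def content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def disambiguate(word, context):
    """Return the sense whose gloss shares the most words with the context."""
    context_words = content_words(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(content_words(gloss) & context_words)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(disambiguate("bank", "she opened an account to keep her money at the bank"))
print(disambiguate("bank", "they walked beside the river bank"))
```

The first call picks the financial sense because the context shares "money" with that gloss; the second picks the river sense because it shares "beside" and "river". A real system would need a far richer sense inventory and context model than this word-overlap count.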
FROM CONCEPTS TO TOPICS
Beyond going from words to concepts (meanings), in ctrl we have moved from concepts to topics (which can be thought of as compound concepts that are composed from smaller, more primitive concepts). Thus, while (a specific meaning of) 'system' refers to a simple topic, 'information systems' is a more complex topic, and so is 'information management systems', etc.
Therefore, while it is an essential part of the process, going from words to concepts is not the end goal in and of itself since topics of interest are rarely described by single words, but are usually expressed by complex linguistic objects (mostly by nominal phrases) such as "drug trafficking organizations", "current NBA TV analyst Eric Snow", "the upcoming Garry Marshall flick Valentine's Day", etc.
FROM TOPICS TO KEY TOPICS
Going from words to concepts, and from simple concepts to topics (or more complex concepts), is still not enough to achieve high rates of precision and recall in retrieving relevant documents. Even if we are highly accurate in inferring the correct meaning of words in context, and even if we subsequently combine concepts into topics, what we are ultimately interested in is the set of key topics in a document, or simply what the document is essentially "about", regardless of what other words, concepts or topics are also mentioned in it.
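The distinction between topics that are merely mentioned and key topics can be illustrated with a small ranking sketch. The scoring rule below (mention count plus a bonus for topics that appear in the title) is an assumption chosen for illustration, not the method ctrl uses; the function name and parameters are hypothetical.

```python
from collections import Counter


def key_topics(title_topics, body_topics, top_n=2, title_bonus=2):
    """Rank candidate topics and return the top_n as 'key' topics.

    Assumed heuristic: score = number of mentions in the body,
    plus title_bonus for each topic that also appears in the title.
    """
    scores = Counter(body_topics)
    for topic in title_topics:
        scores[topic] += title_bonus
    # Sort by score (highest first), then alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [topic for topic, _ in ranked[:top_n]]


title = ["drug trafficking"]
body = ["drug trafficking", "border security", "border security",
        "weather", "drug trafficking"]
print(key_topics(title, body))
```

Here "weather" is mentioned in the document but does not make the list, which is the point: a document can mention many topics while being about only a few.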
BASIC REFERENCE RESOLUTION
In ctrl we give words meanings, including names of things. To do so, basic reference resolution and entity identification have to be performed. Names of people, organizations, movies, etc. are therefore not just words and phrases, but are full-fledged concepts that can be related to and matched with other concepts and topics. Thus, while the phrase "popularity of former NY Yankees slugger Babe Ruth" is not at all related to "Dr. Ruth popularity in NY", there is clearly some semantic relation between "US President Barack Obama" and "British Prime Minister Tony Blair" due, among other things, to the semantic relationship between the concept 'Barack Obama' (who is a president) and the concept of a 'prime minister'.
DIFFERENT CONCEPTS, RELATED TOPICS
AND SAME CONCEPTS, YET UNRELATED TOPICS
Combining the process of Word Sense Disambiguation (WSD) and treating topics as the basic building blocks (as opposed to single words or even single concepts) allows ctrl to conceptually (or semantically) relate topics that are expressed in different words, such as "drug trafficking organizations" and "organized crime syndicates".
The flip side of this is that phrases composed of the same concepts can refer to completely different topics. For example, even if we fix the meanings of "company" and "insurance", an "insurance company" is a completely different concept/topic from "company insurance". Similarly, even though the words "technology", "data" and "business" refer to the same concepts in "data about the use of technology in the mining business" and "the use of data mining technology in business", the two phrases refer to completely different topics. In fact, the latter is more related to "the commercial applications of machine learning", although the two phrases are expressed using entirely different sets of words.
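The "insurance company" versus "company insurance" example can be made concrete with a short sketch showing why an unordered bag of words cannot tell the two apart. The helper names are hypothetical, and the rule that the last word of a simple English noun compound is its head is a linguistic rule of thumb used here for illustration, not a description of how ctrl composes topics.

```python
def bag_of_words(phrase):
    """Unordered set of lowercased words: word order is discarded."""
    return frozenset(phrase.lower().split())


def head_noun(compound):
    """Rule of thumb: in a simple English noun compound the last word is the head."""
    return compound.lower().split()[-1]


a = "insurance company"
b = "company insurance"

same_words = bag_of_words(a) == bag_of_words(b)  # identical word sets
same_head = head_noun(a) == head_noun(b)         # different heads

print(same_words, same_head)
print(head_noun(a), "vs", head_noun(b))
```

A keyword index sees these two phrases as the same set of terms, while even this crude structural view separates a kind of company from a kind of insurance.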
APPLICATIONS AND INTEGRATION OF CTRL
ctrl is a library with an API that can be used to extract metadata (key topics and entities) from any textual document, to generate a document summary, and to index and retrieve documents by topics as opposed to key words.
ctrl can be used in several domains for different purposes (besides the obvious application in search). For instance, in the news industry ctrl provides a tool to automatically generate the 'story highlights', categorize and index any article based on its topics, and recommend related stories. This in turn helps both internal and external users retrieve related documents for any required topic; it further allows automatic data push (for example, in the form of RSS feeds) for user-selected topics. In this context ctrl can be used in effective targeted marketing, since users are retrieving highly relevant information that matches their topics of interest.
ctrl can also be the new standard in the business intelligence industry for intelligent topic-based enterprise search. Its ability to provide relevant documents based on topics is a daily need in large corporations. The various solutions that are currently employed in the industry are quite costly, time-consuming and error-prone, since they rely on expert topic (or metadata) engineering for effective performance.
In the intelligence community, ctrl means cutting the time and cost spent on 'manually' processing a huge number of documents in search of relevant information about specific topics, since the identification of topics (and more importantly, the identification of the key topics) is the most important differentiator between ctrl and existing systems.
Other applications can be developed around ctrl's API functionality to serve several other fields, especially when integrated with existing software (e.g., database systems, desktop and document management tools, etc.).
VALIDATION
ctrl has gone through a series of tests for both technical and non-technical reasons. Besides the standard software quality assurance tests, thorough testing has been performed to validate our WSD algorithms (using SemCor and our own test collection), as well as the generation of metadata and document summaries. The precision and recall of document retrieval using ctrl's topic-based indexing and retrieval were also thoroughly tested using our own collections (obtained from various media sources) as well as the widely used Reuters Test Collection.
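For readers unfamiliar with the two retrieval measures mentioned above, the following sketch computes them from their standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The document IDs are made up for illustration and are not results from the tests described above.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 3 documents actually relevant.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d7"]

precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case two of the four retrieved documents are relevant (precision 0.50), and two of the three relevant documents were found (recall about 0.67).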
Ctrl-News, ONE APPLICATION OF THE CTRL SEMANTIC ENGINE
Ctrl-News is a free online news service that enables users to receive customized news using the CTRL semantic engine. Users subscribe to the service by submitting a profile, which consists of one or more "subject(s) of interest". Users can do this by tracking existing stories, by pasting or typing a couple of paragraphs that describe the story they would like to track, or by entering some topics of interest from a certain category (e.g., Information Technology >> Cloud Computing, or Politics >> Terrorism, etc.).
Ctrl-News then fetches daily news stories that are semantically/topically related to a user's subject(s) of interest. Along with every news story retrieved, users can view an automatically generated summary, a list of the key topics for a news article, the key entities identified (people, locations, products, organizations, etc.), as well as a set of topically related stories found on that day.
Sign up for this service for free (http://www.ctrl-news.com/), save the time spent sifting through a multitude of irrelevant stories, and start receiving customized and intelligently filtered news stories. If you like the service, we would like to hear from you. You are also welcome to invite friends who might be interested in using or testing this service.