Debugging Search Application Relevance Issues
Many people focus purely on the speed of search, often neglecting the quality of the results produced by the system. In most cases, people test out a small set of queries, eyeball the top five or ten results and then declare the system good enough. In other cases, they have a suite of test queries to run, but they are at a loss for how to fix any issues that arise. Solving these relevance problems takes a systematic approach, a set of useful tools and a dose of patience. This article will outline several approaches and tools. The patience will come from knowing the problem is being looked at in a pragmatic way that will lead to a solution instead of a dead end.
At some point during the construction, testing or deployment of a search application, every developer will encounter the "relevance problem". Simply put, someone using the system will enter their favorite query (yes, everyone has one) and the results will be, to put it nicely, less than stellar. That person is now demanding that it be fixed.
For example, I was on-site with a customer training their developers and assessing their system when I suggested a query having to do with my hometown of Buffalo, MN (see, I told you we all have favorite queries), knowing that information should be in their system. As you can guess, the desired result did not come up and we then had to work through some of the techniques I'm about to describe. For closure's sake, the problem was traced back to bad data generated during data import and was fixed by the next day.
You might wonder how these bad results can happen, since the system was tested over and over with a lot of queries (likely your "favorites") and the results were always beautiful. You might also wonder if the whole application is bad and how it can be fixed. How do you know if fixing that one query is going to break everything else?
Not to worry: the relevance problem can be addressed head-on and in a systematic way that will answer all of these questions. Specifically, this article will lay out how to test a search application's relevance performance and determine the cause of the problem. In a second article, I'll discuss techniques for fixing relevance issues.
To get started, a few definitions are in order. First and foremost, relevance in the context of search measures how well a set of results meets the need of the user querying the system. Unfortunately, definitions of relevance always have an element of subjectivity because they depend on the user's judgment. Thus, relevance testing should be conducted across as many users as possible so that the subjectivity of any single user is replaced by the objectivity of the group as a whole. Furthermore, relevance testing must always be about the overall net gain for the system rather than any single improvement for one query or one user.
To help further define relevance, two more concepts are helpful:
* Precision is the percentage of documents in the returned results that are relevant.
* Recall is the percentage of relevant results returned out of all relevant results in the system. Obtaining perfect recall is trivial: simply return every document in the collection for every query.
Given these definitions of precision and recall, we can now quantify relevance across users and queries for a collection. In these terms, a perfect system would have 100% precision and 100% recall for every user and every query; in other words, it would retrieve all the relevant documents and nothing else. In practical terms, when talking about precision and recall in real systems, it is common to measure them at a certain number of results, the most common (and useful) cutoff being ten results.
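To make this concrete, here is a minimal sketch in Python that computes precision and recall at a cutoff, assuming you already have a ranked result list and a set of judged-relevant document IDs for a query; the document IDs shown are purely hypothetical.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Compute precision@k and recall@k for a single query.

    retrieved_ids: ranked list of document IDs returned by the engine
    relevant_ids:  set of document IDs judged relevant for the query
    """
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / float(len(top_k)) if top_k else 0.0
    recall = hits / float(len(relevant_ids)) if relevant_ids else 0.0
    return precision, recall

# Hypothetical example: 3 of the top 5 results are judged relevant,
# and 4 relevant documents exist in the collection overall.
p, r = precision_recall_at_k(["d7", "d2", "d9", "d4", "d1"],
                             {"d2", "d4", "d7", "d8"}, k=5)
print(p, r)  # precision@5 = 0.6, recall@5 = 0.75
```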
Practically speaking, there are many factors that go into relevance beyond just the definitions. In designing your system for relevance, consider the following:
* Is it better to be accurate or to return as many feasible matches as possible?
* How important is it to avoid embarrassing results?
* What factors besides pure keyword matches matter? For instance, do users want results for items that are close to them physically (aka "local" search) or do they prefer newer results over older? In the former case, adding spatial search can help, while in the latter, sorting by date will help.
With these factors in mind, you can then tailor the testing and resulting work towards achieving those goals.
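For example, if your engine happens to be Solr, the "local" and recency preferences above might show up as a geospatial filter and a date sort on the request. The sketch below is only illustrative: the core name, the location and pubdate field names, and the coordinates are all assumptions about a hypothetical setup.

```python
import requests

# Hypothetical Solr core and field names (location, pubdate).
SOLR_SELECT = "http://localhost:8983/solr/mycollection/select"

params = {
    "q": "coffee shop",
    # Prefer items physically near the user: restrict to a 10 km radius
    # around a point using Solr's geofilt query parser.
    "fq": "{!geofilt sfield=location pt=45.17,-93.87 d=10}",
    # Prefer newer results: sort by a date field, newest first.
    "sort": "pubdate desc",
    "rows": 10,
    "wt": "json",
}

resp = requests.get(SOLR_SELECT, params=params)
for doc in resp.json()["response"]["docs"]:
    print(doc.get("id"), doc.get("pubdate"))
```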
Now that I've defined relevance and some ways of measuring it, it is time to learn how to determine it in an actual application. After that, I'll discuss how to find out where relevance issues are coming from and I'll walk through some common examples of relevance debugging.
Determining Relevance Quality
There are many ways of determining relevance in a search application, ranging from low cost to expensive. Some are better than others, but none are absolute. To determine relevance, you need at least three things:
* A collection of documents
* A set of queries
* A set of relevance judgments
The first item in the list is usually the easiest to obtain, while the judgments are often the most difficult since producing them is the most time-consuming. Relevance Testing Options discusses some of the ways these three items get combined to produce a relevance assessment.
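Judgments are often kept in a simple whitespace-delimited file in the spirit of the TREC qrels format (query ID, an unused iteration column, document ID, judgment). A minimal loader might look like the following sketch; the file layout and path are assumptions about your own setup rather than a required format.

```python
from collections import defaultdict

def load_judgments(path):
    """Read a qrels-style file with lines of: query_id iteration doc_id judgment.

    Returns {query_id: set of doc_ids judged relevant (judgment > 0)}.
    """
    relevant = defaultdict(set)
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 4:
                continue  # skip blank or malformed lines
            query_id, _iteration, doc_id, judgment = parts[:4]
            if int(judgment) > 0:
                relevant[query_id].add(doc_id)
    return relevant

# judgments = load_judgments("qrels.txt")  # hypothetical path
```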
Debugging Relevance Issues
Debugging relevance issues is much like Edison's genius equation: 1% inspiration and 99% perspiration. Debugging search systems is time consuming, yet necessary. It can swing from frustrating to enjoyable within the space of a few minutes. If done right, it should lead to better results and a deeper understanding of how your system works.
Given one or more queries determined by the methods outlined above, start by doing an assessment of the queries. In this assessment, take note of misspellings, jargon, abbreviations, case sensitivity mismatches and the potential use of more common synonyms. Also track which queries returned no results and set them aside. If you have judgments, run the queries and see whether you agree with the judgments. If you don't have judgments, record your own. This isn't to second-guess the judge so much as to get a better feel for where to spend your time, or to apply any system knowledge you might have that the judge doesn't. Take notes on your first impressions of the queries and their results. Prioritize your work by query volume: while it is good to investigate problems with infrequent queries, spend most of your time on the high-volume ones first. As you gain experience, this preliminary assessment will give you a good idea of where to dig deeper and where to skip ahead.
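One low-tech way to keep this assessment organized is a small script that runs every test query, sets aside the zero-result ones, and sorts the rest by query volume so the high-traffic problems get attention first. In the sketch below, run_query is a hypothetical stand-in for however you call your engine.

```python
def assess_queries(queries, run_query):
    """Build a prioritized worklist from (query_text, query_volume) pairs.

    run_query: hypothetical callable returning a list of result doc IDs.
    """
    worklist, zero_results = [], []
    for text, volume in queries:
        results = run_query(text)
        entry = {"query": text, "volume": volume, "hits": len(results), "notes": ""}
        if not results:
            zero_results.append(entry)  # set these aside for separate digging
        else:
            worklist.append(entry)
    # Spend most of your time on the high-volume queries first.
    worklist.sort(key=lambda e: e["volume"], reverse=True)
    return worklist, zero_results
```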
For all the queries where spelling or synonyms (jargon, abbreviations, etc.) are an issue, try out some alternate queries to see if they produce better results (better yet, ask your testers to do so) and keep a list of what helped and what didn't, as it will be useful later when fixing the problems. You might also systematically work through a set of queries, removing words, adding words, changing operators and changing term boosts.
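A small helper can generate these variants mechanically (dropping one word at a time, substituting synonyms or corrected spellings) so each can be re-run and compared side by side; the synonym table in the example is purely hypothetical.

```python
def query_variants(query, synonyms=None):
    """Yield simple variants of a query: the original, drop-one-word
    versions, and single-word synonym substitutions."""
    synonyms = synonyms or {}
    words = query.split()
    yield query
    # Remove one word at a time (useful when a single term kills recall).
    if len(words) > 1:
        for i in range(len(words)):
            yield " ".join(words[:i] + words[i + 1:])
    # Substitute known synonyms, jargon expansions or corrected spellings.
    for i, word in enumerate(words):
        for alt in synonyms.get(word.lower(), []):
            yield " ".join(words[:i] + [alt] + words[i + 1:])

# Hypothetical jargon/abbreviation table.
for variant in query_variants("cardiac MI treatment",
                              {"mi": ["myocardial infarction"]}):
    print(variant)
```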
Next, take a look at how the queries were parsed. What types of queries were created? Are they searching the expected Fields? Are the right terms being boosted? Do phrase queries have an associated slop? Also, as a sanity check, make sure your query-time analysis matches your indexing analysis. (I'll discuss a technique for doing this later.) For developers who create their own queries programmatically, an analysis mismatch is a subtle error that can be hard to track down. Additionally, look at which operators are used and whether any term groupings are incorrect. Finally, is the query overly complex? At times it may seem beneficial to pile on more and more query "features" to cover every little corner case, but this doesn't always yield the expected results: the effects are hard to predict and the extra complexity makes debugging much harder. Instead, carefully analyze the goal of each new feature and then run it through your tests to see its effect.
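If Solr is in the picture, its debug output is one convenient way to see how the query was actually parsed and how each returned document was scored. The sketch below assumes a hypothetical core name; the exact shape of the debug section can vary by version and response-writer settings.

```python
import requests

# debugQuery=true asks Solr to return the parsed query and per-document
# score explanations alongside the normal results.
resp = requests.get(
    "http://localhost:8983/solr/mycollection/select",  # hypothetical core
    params={"q": "title:buffalo OR body:buffalo",
            "debugQuery": "true", "wt": "json", "rows": 5},
)
debug = resp.json()["debug"]
print(debug["parsedquery"])    # how the query parser interpreted the input
print(debug.get("explain"))    # score breakdowns for the returned documents
```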
Finding Content Issues
Many relevance issues are solved simply by double checking that input documents were indexed and that the indexed documents have the correct Fields and analysis results. Also double check that queries are searching the expected Fields. For existing indexes, you may also wish to iterate over your index and take an inventory of the existing Fields. While there is nothing wrong with a Document not having a Field, it may be a tipoff to something wrong when a lot of Documents are missing a Field that you commonly search.
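A crude field inventory can be taken by paging through stored documents and tallying which fields each one contains. The sketch below does this against Solr's JSON response; the core name is an assumption, and for a large index you would want cursor-based paging (or Solr's Luke request handler) instead of simple start/rows paging.

```python
from collections import Counter
import requests

SOLR_SELECT = "http://localhost:8983/solr/mycollection/select"  # hypothetical core

def field_inventory(page_size=500, max_docs=10000):
    """Count how many stored documents contain each field."""
    counts, start, total = Counter(), 0, 0
    while start < max_docs:
        resp = requests.get(SOLR_SELECT, params={
            "q": "*:*", "fl": "*", "start": start, "rows": page_size, "wt": "json"})
        docs = resp.json()["response"]["docs"]
        if not docs:
            break
        for doc in docs:
            counts.update(doc.keys())  # tally every stored field present
        total += len(docs)
        start += page_size
    return total, counts

total, counts = field_inventory()
for field, n in counts.most_common():
    print(f"{field}: present in {n} of {total} docs")
```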
Other Debugging Challenges
While explanations and other debugging tools all can help figure out relevance challenges, there are many subtle insights that come solely by working through your specific application. In particular, dealing with queries that return zero results can be especially problematic, since it isn't clear whether the issues are due to problems in the engine (analysis, query construction, etc.) or if there truly are no relevant documents available. To work through zero result issues, create alternate forms of the query by adding synonyms, changing spellings or trying other word level modifications. It may also be helpful to look in the logs to see what other queries the user submitted around the time of the bad query and try those, especially if they resulted in clickthroughs. Next, I might try to determine if there are facets or other navigational aids that might come close to finding relevant documents. For instance, if the query is looking for a "flux capacitor" and it isn't finding any results, perhaps the facets contain something like "time machine components" which can then be explored to see if this much smaller set of documents does contain a relevant document. If it does, then the documents can be assessed as to why they don't match, as discussed above.
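When a query comes back empty, a quick facet request on a broader form of the query can show whether any plausibly relevant documents exist at all. The following sketch assumes a Solr core and a category facet field that may not match your schema.

```python
import requests

# Hypothetical: "flux capacitor" returned zero hits, so back off to the
# broader term and look at category facet counts to see what does exist.
resp = requests.get(
    "http://localhost:8983/solr/mycollection/select",  # hypothetical core
    params={
        "q": "capacitor",           # broader than the failing query
        "rows": 0,                  # only the facet counts matter here
        "facet": "true",
        "facet.field": "category",  # assumed facet field
        "facet.mincount": 1,
        "wt": "json",
    },
)
# With the default JSON settings, facets come back as a flat
# [value, count, value, count, ...] list.
facets = resp.json()["facet_counts"]["facet_fields"]["category"]
for value, count in zip(facets[::2], facets[1::2]):
    print(value, count)
```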
Conclusions
While there are a large number of things that can go wrong with relevance, most are fairly easy to identify. In fact, many are directly related to bad data going in, or to not searching the proper Field in the proper way (or not searching the Field at all). Most of these issues can be quickly addressed by challenging the assumption that each piece is correct and actually doing the work to verify that it is, indeed, correct.
Additionally, be sure to take a systematic approach to testing and debugging. Particularly when batch testing a large number of queries, experiments tend to run together. Take good notes on each experiment and resist the temptation to cut and paste names and descriptions between runs. If possible, use a database or a version control system to help track experiments over time.
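One lightweight convention is to append a small record of each run (name, parameters, metrics, notes) to a log file kept under version control; the sketch below is just one possible format, not a prescribed one.

```python
import datetime
import json

def log_experiment(path, name, params, metrics, notes=""):
    """Append one experiment record per line (JSON Lines) to a tracked file."""
    record = {
        "timestamp": datetime.datetime.utcnow().isoformat(),
        "name": name,
        "params": params,    # e.g. boosts, analyzers, query templates used
        "metrics": metrics,  # e.g. {"precision@10": 0.62, "recall@10": 0.41}
        "notes": notes,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# log_experiment("experiments.jsonl", "boost-title-2x",
#                {"title_boost": 2.0}, {"precision@10": 0.62},
#                "helped short queries, hurt long ones slightly")
```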
Finally, stay focused on relevance as a macro problem rather than a micro problem, and a better experience for all involved should ensue.
by: Lucid Imagination