Student Name: ______________________________________
Andrew ID: _______________________________________
Seat Number: _______________________________________

Midterm Exam
Search Engines (11-442 / 11-642)
October 20, 2015

Answer all of the following questions. Each answer should be thorough, complete, and relevant. Points will be deducted for irrelevant details. Use the back of the pages if you need more room for your answer.

Calculators, phones, and other computational devices are not permitted. If a calculation is required, write fractions, and show your work so that it is clear that you know how to do the calculation.

Advice about exam answers...

Sometimes an answer says "I would use <technique> to do <x>". That answer shows that you remember a name, but it does not show that you remember how the technique works, or why it is the right tool for this problem. Give a brief description of how the technique works and why it is the right tool for this job. If the technique needs other information, explain where the information comes from.
1 Evaluation

Suppose that a large health provider has a website with a search engine that allows patients to find information about staying fit, eating well, diseases, treatments, and tests. The search engine receives about 35 queries per week. The company doesn’t have data for evaluating the accuracy of the search engine, and doesn’t know its accuracy.

Describe how you would evaluate the accuracy of the search engine. Be clear about the method you would use, the data it would require, how much data it would require, and how you would get the data. Explain why your method is the right choice for this problem. [15 points]

Answer

The search engine doesn’t receive much traffic, so there isn’t enough click data and there isn’t enough traffic to use interleaved testing. The Cranfield methodology is the best choice in this situation.

Start by randomly sampling 50-100 queries from the query log. Develop written information needs to describe what each query is about.

If possible, index the documents with several open-source search engines, run each query against each search engine, and pool the results for each query to form a pool of documents to be assessed. If it is not possible to use several open-source search engines, then create several (e.g., 5) variants of each query, run each variant against the search engine, and pool the results to form a pool of documents to be assessed. The size of the pools is determined by the available budget, and the nature of the problem; probably top 100 is sufficient for this task because most web-site visitors won’t search very deeply into the results.

Sort the pool of documents for each query into a random order. Have someone assess the results, either on a binary scale (relevant vs.

