Suppose you have joined a search engine development team to design a search algorithm based on both the Vector model and the Boolean model.
You have collected the following unstructured documents and plan to apply an indexing technique to convert them into an inverted index.
Doc 1: data science is field to use scientific method, process, algorithm, system to extract knowledge.
Doc 2: data mining is the process to discover pattern in large data to involve method at the database system.
Doc 3: information system is the study of network of hardware and software that people use to process data.
To answer the questions below, provide the detailed procedure step by step.
Question 1.1: In the process of creating the inverted index, please complete the following steps:
Remove all stop words and punctuation. The list of stop words for this task is provided as follows:
Is, An, That, Use, And, To, From, In, Both, Of, At, The
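The stop word and punctuation removal step can be sketched in Python as follows. This is a minimal sketch, not a required solution: it assumes matching is case-insensitive (the stop word list is capitalized, the documents are lowercase) and that a token is any run of letters or digits. The names STOP_WORDS, DOCS and preprocess are illustrative only.

```python
import re

# Stop words from the task statement, lowercased (assumption: matching
# ignores case).
STOP_WORDS = {"is", "an", "that", "use", "and", "to", "from", "in",
              "both", "of", "at", "the"}

DOCS = {
    1: "data science is field to use scientific method, process, "
       "algorithm, system to extract knowledge.",
    2: "data mining is the process to discover pattern in large data "
       "to involve method at the database system.",
    3: "information system is the study of network of hardware and "
       "software that people use to process data.",
}


def preprocess(text):
    """Lowercase the text, drop punctuation, and remove stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


for doc_id, text in DOCS.items():
    print(doc_id, preprocess(text))
```

Term order is kept so that the output can be checked by eye against the original sentences.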
Question 1.2: Draft a merged inverted list including the within-document frequencies for each term.
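One possible way to build the merged inverted list is sketched below. It assumes the same case-insensitive stop word removal as the task describes and stores each posting as a (document number, within-document frequency) pair. The function name build_inverted_list is an illustrative choice, not part of the assignment.

```python
import re
from collections import Counter, defaultdict

STOP_WORDS = {"is", "an", "that", "use", "and", "to", "from", "in",
              "both", "of", "at", "the"}

DOCS = {
    1: "data science is field to use scientific method, process, "
       "algorithm, system to extract knowledge.",
    2: "data mining is the process to discover pattern in large data "
       "to involve method at the database system.",
    3: "information system is the study of network of hardware and "
       "software that people use to process data.",
}


def preprocess(text):
    """Lowercase the text, drop punctuation, and remove stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_inverted_list(docs):
    """Map each term to a list of (doc_id, term frequency) pairs."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # Counter gives the within-document frequency of every term.
        for term, tf in Counter(preprocess(docs[doc_id])).items():
            index[term].append((doc_id, tf))
    # Sort terms alphabetically, as in a merged inverted list.
    return dict(sorted(index.items()))


index = build_inverted_list(DOCS)
for term, postings in index.items():
    print(f"{term}: {postings}")
```

Because documents are visited in ascending order, each posting list is already sorted by document number.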
Question 1.3: Use the index created above to draft a dictionary and the related posting file.
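A dictionary and posting file can be laid out in several ways. The sketch below assumes one common layout: the dictionary holds each term with its document frequency and an offset into a flat posting file of (document number, term frequency) entries. The sample index contains only three terms taken from the documents above, and the name build_dictionary_and_postings is hypothetical.

```python
def build_dictionary_and_postings(index):
    """Split an inverted list into a dictionary and a flat posting file.

    Dictionary entry: (term, document frequency, offset into postings).
    Posting entry:    (doc_id, within-document frequency).
    """
    dictionary = []
    postings = []
    for term in sorted(index):
        # The offset is the position where this term's postings begin.
        dictionary.append((term, len(index[term]), len(postings)))
        postings.extend(index[term])
    return dictionary, postings


# Three terms from the documents above, for illustration only.
sample_index = {
    "data": [(1, 1), (2, 2), (3, 1)],
    "method": [(1, 1), (2, 1)],
    "mining": [(2, 1)],
}

dictionary, postings = build_dictionary_and_postings(sample_index)
for term, df, offset in dictionary:
    print(term, df, postings[offset:offset + df])
```

Reading a term's postings then only needs its offset and document frequency, which is why both are kept in the dictionary.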
Question 1.4: Please design three Boolean queries (for example, web AND search) and list the relevant documents for each query. Each query must contain at least two keywords, and no keyword may appear in only one document.
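Boolean queries over posting lists reduce to set operations, which can be sketched as follows. The posting sets are derived by hand from the three documents after stop word removal and cover only the terms that occur in more than one document. The three queries are illustrative examples, and the helper names and_, or_ and not_ are hypothetical.

```python
ALL_DOCS = {1, 2, 3}

# Terms that occur in more than one document, with their document sets.
POSTINGS = {
    "data": {1, 2, 3},
    "system": {1, 2, 3},
    "process": {1, 2, 3},
    "method": {1, 2},
}


def and_(a, b):
    """Documents containing both terms (set intersection)."""
    return a & b


def or_(a, b):
    """Documents containing at least one term (set union)."""
    return a | b


def not_(a):
    """Documents that do not contain the term (set complement)."""
    return ALL_DOCS - a


q1 = and_(POSTINGS["data"], POSTINGS["method"])            # data AND method
q2 = or_(POSTINGS["method"], POSTINGS["system"])           # method OR system
q3 = and_(POSTINGS["process"], not_(POSTINGS["method"]))   # process AND NOT method

print(sorted(q1), sorted(q2), sorted(q3))
```

Each query returns an unranked set, which is the main point of contrast with the Vector model.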
Question 1.5: Please use the Vector model to query the inverted index, and compare the result with the Boolean model. (Hint: you can use cosine similarity and set a similarity threshold.)
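A Vector model query with cosine similarity might look like the sketch below. It makes two assumptions that the task leaves open: term weights are raw within-document frequencies (tf-idf would be an equally valid choice), and the similarity threshold is 0.3. The query "data method" and the names term_vector, cosine and vector_search are illustrative.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"is", "an", "that", "use", "and", "to", "from", "in",
              "both", "of", "at", "the"}

DOCS = {
    1: "data science is field to use scientific method, process, "
       "algorithm, system to extract knowledge.",
    2: "data mining is the process to discover pattern in large data "
       "to involve method at the database system.",
    3: "information system is the study of network of hardware and "
       "software that people use to process data.",
}


def term_vector(text):
    """Raw term-frequency vector after stop word removal."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def cosine(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def vector_search(query, docs, threshold):
    """Return (doc_id, score) pairs at or above the threshold, best first."""
    q = term_vector(query)
    scores = {d: cosine(q, term_vector(text)) for d, text in docs.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(d, round(s, 3)) for d, s in ranked if s >= threshold]


THRESHOLD = 0.3  # assumed value; the task does not fix a threshold
results = vector_search("data method", DOCS, THRESHOLD)
print(results)
```

Under these assumptions the query returns documents 2 and 1 in that order, the same set as the Boolean query data AND method, but ranked by score; document 3 contains only "data" and falls below the threshold.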
Question 2 (IR Evaluation) (15 marks)
In this question, you are required to evaluate the performance of different search engines. First, please find two search engines you are familiar with, such as Google, Bing, Yahoo!, etc.
Second, please choose one target from the following list, and design two queries to search in both search engines. So both query 1 and query 2 have to be tested in both search engines.
Target 1: obtain the new features of the new iPad.
Target 2: obtain the manual for installing Tera Term.
Target 3: obtain a tutorial on how to install Oracle SQL.
Target 4: obtain the features of the new Xbox One.
Third, take the first 20 results from each search engine. If a result returns the target, mark it as a relevant document; otherwise, mark it as irrelevant. Assume there are 12 relevant documents in total (retrieved and not retrieved). If you think there are more relevant documents to be found, you can use a higher expected number of relevant documents as the threshold.
The following questions are based on your search results.
Question 2.1: List your target, results and designed search queries (you can use any keywords you think are related to the target). For each result, you can click the link to open the page and take a screenshot if you think the result is relevant. In your report, you are required to provide the screenshots and a detailed explanation of why they are relevant to your queries.
Question 2.2: Compute the precision and recall values for the 20 documents for query 1 in search engine 1. Interpolate them to the 11 standard recall levels, then plot them on a chart. Compute the precision and recall values for the 20 documents for query 1 in search engine 2. Interpolate them to the 11 standard recall levels, then plot them on a chart.
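The precision and recall computation and the interpolation to 11 standard recall levels can be sketched as below. The relevance judgments are made up for illustration and do not come from any real search; the total of 12 relevant documents follows the assumption stated in the task. Interpolated precision at a recall level is taken as the highest precision observed at any recall at or above that level, and 0 where that recall is never reached.

```python
def precision_recall_points(relevance, total_relevant):
    """Return (recall, precision) at each rank where a relevant result occurs."""
    points = []
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points


def interpolate_11(points):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    levels = [i / 10 for i in range(11)]
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]


# Hypothetical judgments for 20 results: relevant at these ranks only.
relevant_ranks = {1, 2, 4, 6, 9, 13, 18}
relevance = [rank in relevant_ranks for rank in range(1, 21)]

points = precision_recall_points(relevance, total_relevant=12)
curve = interpolate_11(points)
for level, precision in zip(range(11), curve):
    print(f"recall {level / 10:.1f}: precision {precision:.3f}")
```

The resulting list of 11 values can be passed to any plotting tool, with the recall levels on the horizontal axis and interpolated precision on the vertical axis.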
Question 2.3: Compute the precision and recall values for the 20 documents for query 2 in search engine 1. Interpolate them to the 11 standard recall levels, then plot them on the same chart as above. Compute the precision and recall values for the 20 documents for query 2 in search engine 2. Interpolate them to the 11 standard recall levels, then plot them on the same chart as above.
Question 2.4: Now find the average interpolated precision of query 1 and query 2 for search engine 1 and plot it on the same chart, so that you have a total of 3 interpolated curves in one single chart. Then find the average interpolated precision of query 1 and query 2 for search engine 2 and plot it on the same chart, so that you again have a total of 3 interpolated curves in one single chart.
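Averaging two interpolated curves is an element-wise mean over the 11 recall levels, as in the sketch below. The two input curves are hypothetical values chosen for illustration, not measured results, and average_curves is an illustrative name.

```python
def average_curves(*curves):
    """Element-wise mean of equally long interpolated precision curves."""
    return [sum(values) / len(values) for values in zip(*curves)]


# Hypothetical 11-point interpolated precision for two queries
# on the same search engine.
query1 = [1.0, 1.0, 0.75, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
query2 = [1.0, 0.5, 0.5, 0.5, 0.5, 0.25, 0.25, 0.0, 0.0, 0.0, 0.0]

average = average_curves(query1, query2)
print(average)
```

The averaged list is the third curve for that search engine and is plotted against the same 11 recall levels as the two per-query curves.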
Question 2.5: Plot the average interpolated values for Search Engine 1 and Search Engine 2 on one single chart, and compare the algorithms in terms of precision and recall. Which search engine do you think is superior? Why?