Information Retrieval
Fall 2012
ETH Zurich
Project II: The Vector Space Model
October 17, 2012
The goal of the second project is to develop a more sophisticated retrieval system which uses the vector
space model to solve information retrieval problems. Additionally, you will implement query expansion
techniques that will extend the accuracy of your retrieved result. This project is again split into two phases.
1 Procedure
The final deadline for this project is Oct. 31st 2012. Additionally, we will provide you with the opportunity to present your implementations at an earlier date, on Oct. 24th 2012, to get feedback. The presentation of your implementations will follow the same format as in the previous project. Every group will be assigned a time slot for each of the deadlines. Please notify the TAs if your group does not need its time slot. Timetables will be published at the beginning of the respective week.
For this project we also expect you to have all results for both phases logged before you come to the final exercise session, ready to be transferred with a USB stick. During the exercise session we will test your implementation with unknown queries to verify that it runs correctly, but because of time constraints we will not be able to verify the whole project during one time slot. Thus, prepare all the required files as described below for each phase and zip them. The naming convention is GroupX-PhaseY.zip. Please structure your submitted files and folders so as not to cause confusion.
If you hand in results at the first deadline, you will receive feedback on your results by Oct. 26th 2012.
2 Phase I
Implement an information retrieval system based on the vector space model as presented in the lecture. Use the following weighting scheme:
$w_{t,d} = (1 + \log \mathrm{tf}_{t,d}) \cdot \log \frac{N}{\mathrm{df}_t}$
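As a rough illustration, this scheme can be computed directly from an inverted index. The following Python sketch assumes the index is a plain dict mapping each term to its postings {docID: raw term frequency}; this data layout, and the name build_weights, are illustrative assumptions, not part of the assignment.

    import math
    from collections import defaultdict

    def build_weights(index, num_docs):
        # Computes w_{t,d} = (1 + log tf_{t,d}) * log(N / df_t) for every
        # term-document pair. The handout does not fix the logarithm base;
        # the natural log is used here as one possible choice.
        weights = defaultdict(dict)              # term -> {doc_id: weight}
        for term, postings in index.items():
            df = len(postings)                   # document frequency df_t
            idf = math.log(num_docs / df)
            for doc_id, tf in postings.items():
                weights[term][doc_id] = (1 + math.log(tf)) * idf
        return weights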
Your system should return a ranked list of relevant documents for each query. To test the correctness of your implementation, use the collection of documents from the previous project.
In order to evaluate the performance of your system, you are provided with 45 queries and a list of relevant documents for each of them. For every query, log the following statistics:
• The overall number of documents that match the query.
• A ranked list of documents that match the query of the form {docID, score}. Exclude documents that
have score=0.
• The final precision and recall values from the precision-recall graph computation explained below.
Please use the same file for logging all queries, but separate the queries visually. All of your results should also be displayed on screen during the demonstration of your implementation. After producing the basic implementation and results, repeat the experiments with each of the following features enabled:
1. Stemming
2. Stopwords
3. Stemming + Stopwords
Store the results of each of the approaches in a separate file. Separately, you should generate a precision-recall graph for all queries. How these graphs can be constructed is shown in Section 4.3. For submission you have to create one .png or .jpeg file with the combined precision-recall graph for each of the four system variations that you will implement in Phase I.
You should design your system as follows: when you start your system, it loads the documents into an online system. You can then enter either a folder that contains queries or a file that contains a single query, which the system then processes automatically to compute the output described above (a minimal command-line skeleton is sketched below). Whether you want to switch between stemming, stopwords, etc. online or offline is up to you. Note that we do not require you to write a GUI, but the TAs should be able to run your programs with little instruction.
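A minimal command-line skeleton for this workflow might look as follows; load_corpus and process_query are hypothetical placeholders for your own indexing and retrieval code, not names required by the handout.

    import os

    def load_corpus(path):
        # Placeholder: build your Phase I index from the document collection.
        raise NotImplementedError

    def process_query(index, query_text):
        # Placeholder: rank documents, log the statistics, print the results.
        raise NotImplementedError

    if __name__ == "__main__":
        index = load_corpus("corpus/")           # load documents at startup
        while True:
            path = input("Query file or folder (empty line to quit): ").strip()
            if not path:
                break
            files = ([os.path.join(path, f) for f in sorted(os.listdir(path))]
                     if os.path.isdir(path) else [path])
            for query_file in files:
                with open(query_file) as fh:
                    process_query(index, fh.read())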
3 Phase II
In this phase you will improve the quality of your system's output by modifying the original query using local and global query expansion methods. The underlying system of Phase I can be reused, as can the logging mechanism and the precision-recall graph output. Please now enable an online option to switch between the basic approach, local query expansion, and global query expansion. You do not need to evaluate this phase with stopwords or a stemmed dictionary; use the basic version of your parser.
Local Query Expansion Use the pseudo-relevance feedback method to automatically expand the query as follows: for each query, first use your current implementation to return a ranked list of relevant documents, then choose the first document (the one with the highest score) from the returned list and use this feedback document to modify the term weights in the query according to the Rocchio algorithm (a sketch of the update is given below). Note that since we are using only positive feedback, γ = 0; α and β can be chosen by the user of your system when the query file (folder) is entered.
Repeat the retrieval process with all queries locally expanded, log your results, and draw the precision-recall graphs for the following schemes: (α, β) = (0.5, 0.5), (0.25, 0.75), (0.75, 0.25).
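A minimal sketch of the Rocchio update under these constraints, assuming the query and the feedback document are represented as sparse {term: weight} dicts (the function name is illustrative): with a single positive feedback document and γ = 0, the update reduces to q' = α·q + β·d.

    from collections import defaultdict

    def rocchio_expand(query_vec, feedback_vec, alpha, beta):
        # q' = alpha * q + beta * d_fb; gamma = 0, so there is no
        # negative-feedback term.
        expanded = defaultdict(float)
        for term, weight in query_vec.items():
            expanded[term] += alpha * weight
        for term, weight in feedback_vec.items():
            expanded[term] += beta * weight
        return dict(expanded)

The expanded vector is then fed back into the Phase I ranking; only α and β change between the three requested runs.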
Global Query Expansion We use WordNet as a thesaurus for expanding the queries. First you need to
download and install WordNet (+ an appropriate programming interface). Then for each query:
1. Find the nouns in the query.
2. For each noun add its first synset to the original query.
We will provide you with an extensive list of words that you can use to determine the nouns (NoC) in the query. We choose nouns here because they generally convey more meaning than any other class of words. Once you have this feature working, repeat the retrieval process with all queries globally expanded, log your results, and draw the new precision-recall graph. A sketch using one possible WordNet interface follows.
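As one possible realization, NLTK ships a WordNet interface (install nltk and run nltk.download('wordnet') once). The sketch below assumes the provided NoC word list has already been loaded into a Python set called nouns; that loading step, and the function name expand_query, are assumptions of this sketch.

    from nltk.corpus import wordnet as wn

    def expand_query(query_terms, nouns):
        # For each noun in the query, append the lemmas of its first
        # (most common) noun synset to the original query.
        expanded = list(query_terms)
        for term in query_terms:
            if term not in nouns:
                continue
            synsets = wn.synsets(term, pos=wn.NOUN)
            if synsets:
                for lemma in synsets[0].lemma_names():
                    expanded.append(lemma.replace("_", " "))
        return expanded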
4 FAQ
4.1 What material do I have for this project and where do I get it?
The course homepage provides the following material:
• A standard text corpus (use the project I text corpus).
• 45 queries that can be used to test your system.
• A relevancy list for the 45 queries to determine the recall values.
• A list of terms for the global query expansion feature (nouns are marked with NoC in the file).
4.2 What is the grading scheme?
For this project, your grade will be based on the following criteria:
• Presentation and usability of your system. (25%)
• Basic results. (20%)
• Stopwords and stemming results. (10%)
• Logging scheme + structure. (5%)
• Precision-recall graphs. (10%)
• Local query expansion. (15%)
• Global query expansion. (15%)
4.3 What is a precision-recall graph?
We will use a simple example to explain the use and purpose of a precision-recall graph here. Assume there
are only two queries in the test collection and their relevancy lists are as follows.
Q1 : D1, D3
Q2 : D2, D3, D4
Now suppose that our IR implementation has returned the following list of documents for each query:
Q1 : D2, D3, D8, D7, D9, D1, ...
Q2 : D3, D2, D5, D7, D4, ...
With this result list, we can compute the precision and recall values at each position in these lists. Remember that precision is the ratio of correct results to retrieved results, and recall is the ratio of correct results retrieved to the total number of expected (relevant) results. Note that we do not care about documents ranked after the last relevant document.
For Q1 and Q2, the development of precision and recall is shown in Tables 1 and 2.
Rank  DocID  Correct  Current Recall  Current Precision
1     D2     No       0.00            0.00
2     D3     Yes      0.50            0.50
3     D8     No       0.50            0.33
4     D7     No       0.50            0.25
5     D9     No       0.50            0.20
6     D1     Yes      1.00            0.33

Table 1: Precision-Recall Table for Q1
Rank  DocID  Correct  Current Recall  Current Precision
1     D3     Yes      0.33            1.00
2     D2     Yes      0.66            1.00
3     D5     No       0.66            0.66
4     D7     No       0.66            0.50
5     D4     Yes      1.00            0.60

Table 2: Precision-Recall Table for Q2
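The per-rank values in Tables 1 and 2 can be computed mechanically. A minimal sketch, assuming the ranked result list and the relevancy list are plain Python lists of document IDs (the function name is illustrative):

    def precision_recall_at_ranks(ranked, relevant):
        # Walk the ranked list; at each rank, precision = hits / rank
        # and recall = hits / |relevant|.
        relevant = set(relevant)
        hits = 0
        rows = []
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                hits += 1
            rows.append((rank, doc_id, hits / len(relevant), hits / rank))
        return rows

    # Reproduces Table 1:
    # precision_recall_at_ranks(["D2", "D3", "D8", "D7", "D9", "D1"],
    #                           ["D1", "D3"])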
The next step is to standardize these tables. A common method to do this is to create an 11-point recall-precision table per query. Each point represents a recall level $r_i \in \{0.0, 0.1, 0.2, \ldots, 1.0\}$. The corresponding precision value $P(r_i)$ for a recall level $r_i$ is computed as the maximum precision value that can be found where the recall value is in $[r_i, r_{i+1}]$. If no recall value exists between the levels $r_i$ and $r_{i+1}$, the next higher level is chosen. We call these precision values interpolated precision values. The formal description of this technique is:

$P(r_i) = \max_{r_i \le r \le r_{i+1}} P(r)$
The interpolated tables for the two example queries are shown in Tables 3 and 4.
Level  Recall  Precision
1      0.00    0.50
2      0.10    0.50
3      0.20    0.50
4      0.30    0.50
5      0.40    0.50
6      0.50    0.50
7      0.60    0.33
8      0.70    0.33
9      0.80    0.33
10     0.90    0.33
11     1.00    0.33

Table 3: Interpolated Precision-Recall Table for Q1
Level  Recall  Precision
1      0.00    1.00
2      0.10    1.00
3      0.20    1.00
4      0.30    1.00
5      0.40    1.00
6      0.50    1.00
7      0.60    1.00
8      0.70    0.60
9      0.80    0.60
10     0.90    0.60
11     1.00    0.60

Table 4: Interpolated Precision-Recall Table for Q2
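A minimal sketch of the interpolation step. It is implemented here as the maximum precision at any recall value at or above the level, which reproduces Tables 3 and 4 exactly; interpolate_11pt is an illustrative name.

    def interpolate_11pt(points):
        # points: list of (recall, precision) pairs from the per-rank table.
        # Returns the interpolated precision at levels 0.0, 0.1, ..., 1.0.
        # default=0.0 covers ranked lists that never reach a recall level.
        levels = [i / 10 for i in range(11)]
        return [max((p for r, p in points if r >= level), default=0.0)
                for level in levels]

    # Q1 (Table 1) yields the precision column of Table 3:
    # interpolate_11pt([(0.00, 0.00), (0.50, 0.50), (0.50, 0.33),
    #                   (0.50, 0.25), (0.50, 0.20), (1.00, 0.33)])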
The last step is to build a single table with the average precision over all queries per recall level. Table 5 shows the averaged results of the running example (a short averaging sketch follows the table). These values can then be used to construct a precision-recall graph as shown in Figure 1.
Level  Recall  Precision
1      0.00    0.75
2      0.10    0.75
3      0.20    0.75
4      0.30    0.75
5      0.40    0.75
6      0.50    0.75
7      0.60    0.66
8      0.70    0.46
9      0.80    0.46
10     0.90    0.46
11     1.00    0.46

Table 5: Average of Interpolated Precision Values for All Queries (Q1, Q2)
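Averaging is a plain elementwise mean over the per-query interpolated lists; as a short sketch (average_tables is an illustrative name):

    def average_tables(tables):
        # Elementwise mean of the per-query interpolated precision lists.
        return [sum(column) / len(tables) for column in zip(*tables)]

    # average_tables([q1_interp, q2_interp]) reproduces Table 5,
    # e.g. (0.50 + 1.00) / 2 = 0.75 at recall level 0.0.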
Figure 1: Precision-Recall Graph
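The graph itself can be produced with any plotting tool; a minimal matplotlib sketch using the Table 5 values (the output filename is an arbitrary choice):

    import matplotlib.pyplot as plt

    levels = [i / 10 for i in range(11)]
    avg_precision = [0.75, 0.75, 0.75, 0.75, 0.75, 0.75,
                     0.66, 0.46, 0.46, 0.46, 0.46]

    plt.plot(levels, avg_precision, marker="o")
    plt.xlabel("Recall")
    plt.ylabel("Interpolated Precision")
    plt.title("Precision-Recall Graph")
    plt.savefig("precision_recall.png")   # submit as .png or .jpeg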