Information Retrieval using a Vector Space Model
Matrix Computations and Applications, Assignment 2
Due: Thursday September 24 at 13.00
Introduction
The purpose of this assignment is to walk you through the development of a rudimentary but
functional information retrieval system based on a vector space model and low-rank approximation.
A user of your Matlab-based program will be able to input a database, construct a low-rank
approximation of the term/document matrix, and perform query matching with relevance feedback
as well as detecting clusters of documents and terms. To help you complete the assignment on
time, you are given a Python script that converts a database into Matlab-readable form.
Task A
Before we begin, we need to create a database. Many of the interesting properties of low-rank
approximations can only be appreciated on really large data sets. Unfortunately, such large data
sets are time-consuming to construct and difficult to grasp. We therefore settle for a smaller,
more manageable database.
Your first task is to construct your own database of documents in a context that interests you.
The Python script takes as input a text file with one document per line. It splits each document
into words. The set of words that are found in the documents is sorted and fed to Matlab. Each
document, as a list of words, is also fed to Matlab.
Since the database is going to be (too) small, a few tricks can be employed to make it more
interesting to use:
• The documents should relate to some common context to ensure that they share terms.
• Put all words in lowercase and consider using stemming to increase the number of shared
terms.
One way to get a reasonable database is to enter the titles of a hundred or so scientific articles.
For instance, use the references or bibliography sections of articles and books. Another alternative
is to enter the abstracts of articles. You need around a hundred documents to make it interesting.
Write a function that builds a sparse term/document matrix in Matlab. Scale the entries
according to the TF-IDF (Term Frequency, Inverse Document Frequency) approach. Be careful
to avoid division by zero.
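A minimal sketch of such a function is given below. The input format (a sorted cell array terms of distinct words and a cell array docs where docs{j} lists the words of document j) and the particular TF-IDF variant (raw term frequency times log(n/df)) are assumptions, not prescribed by the assignment:

    function A = tfidf_matrix(terms, docs)
    % Build a sparse term/document matrix with TF-IDF scaling.
    % terms : sorted cell array of all distinct words
    % docs  : cell array; docs{j} is a cell array of the words of document j
    m = numel(terms);
    n = numel(docs);
    F = sparse(m, n);                      % raw term frequencies
    for j = 1:n
        [found, loc] = ismember(docs{j}, terms);
        loc = loc(found);                  % ignore words outside the dictionary
        F(:, j) = accumarray(loc(:), 1, [m 1], [], [], true);
    end
    df = full(sum(F > 0, 2));              % document frequency of each term
    df(df == 0) = 1;                       % guard against division by zero
    idf = log(n ./ df);                    % inverse document frequency
    A = spdiags(idf, 0, m, m) * F;         % a_ij = f_ij * idf_i
    end

Note that a term occurring in every document gets idf = log(1) = 0 and is thus suppressed entirely, which is the standard behaviour of this weighting.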
Task B
Now it is time to approximate the term/document matrix using the Singular Value Decomposition
(SVD). Use the sparse SVD (svds) to get a truncated SVD of A: A_k = U_k Σ_k V_k^T. The
Frobenius norm of the error E_k = A − A_k is the square root of the sum of squares of the
discarded singular values:

    \|E_k\|_F = \sqrt{\sum_{i=k+1}^{\min(m,n)} \sigma_i^2}.
Note that the sum runs precisely over the singular values that we have not yet computed (and
cannot afford to compute either). Rewrite the formula above so that it only requires the first k
singular values (which you have already computed). Hint: recall that the Frobenius norm of A
is the square root of the sum of squares of its singular values, but can also be computed directly
from the entries of A:

    \|A\|_F = \sqrt{\sum_{i=1}^{\min(m,n)} \sigma_i^2} = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2}.
Plot the singular values of A from largest to smallest. Also plot the relative error

    \|E_k\|_F / \|A\|_F

for k = 1, 2, ..., min(m, n). Find the smallest k that results in a relative error of less than 20%.
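One way to produce these plots is sketched below. Since the database is small, computing the full spectrum with a dense svd is assumed to be affordable; svds is then used only for the truncated factors:

    % A is the sparse term/document matrix from Task A.
    s = svd(full(A));                      % all singular values, descending
    normA = norm(A, 'fro');                % equals sqrt(sum(s.^2))
    % ||E_k||_F^2 = ||A||_F^2 - sum_{i<=k} sigma_i^2, so the relative
    % error needs only the first k singular values:
    relerr = sqrt(max(normA^2 - cumsum(s.^2), 0)) / normA;
    figure, semilogy(s, 'o-'), title('Singular values of A')
    figure, plot(relerr, 'o-'), title('Relative error ||E_k||_F / ||A||_F')
    k = find(relerr < 0.2, 1);             % smallest k with error below 20%
    [Uk, Sk, Vk] = svds(A, k);             % truncated SVD: A_k = Uk*Sk*Vk'

The max(..., 0) merely guards against tiny negative values caused by rounding when k approaches min(m, n).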
Task C
Write a Matlab function that performs query matching using a rank-k approximation. The function
should return the indices of the relevant documents. Let the rank, k, and the cosine cutoff, ε, be
arguments to the function to make experimenting with different parameters easier.
Present a couple of interesting queries and their results. Try different values for k and discuss
your observations.
Allow the user to select one or more relevant documents from the set of returned documents
and redo the query matching with the relevance feedback included.
Even though it does not matter much for such a small database, be sure to work in k-dimensional
space instead of the full m-dimensional space of documents. Specifically, work with the coordinates
of q in the k-dimensional subspace spanned by the basis U_k.
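A minimal sketch of such a function follows; the name query_match, the argument tol (the cutoff ε above), and the use of Sk*Vk' as document coordinates are illustrative choices:

    function idx = query_match(q, Uk, Sk, Vk, tol)
    % Query matching in the k-dimensional subspace spanned by Uk.
    % q   : m-by-1 query vector over the same terms as A
    % tol : cosine cutoff; documents with cosine >= tol are returned
    qk = Uk' * q;                          % coordinates of q in the subspace
    D  = Sk * Vk';                         % k-by-n document coordinates
    cosines = (qk' * D) ./ (norm(qk) * sqrt(sum(D.^2, 1)));
    idx = find(cosines >= tol);
    end

Relevance feedback can then be realized by, for example, adding the columns of A that correspond to the documents the user marked as relevant, q = q + sum(A(:, marked), 2), and repeating the query; this simple additive convention is one choice among several.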
Task D
Clustering can be used to find concepts, related documents, trends, synonyms, and to distinguish
between different meanings of a word, among other things. In this task, you will implement a
rudimentary greedy clustering algorithm that clusters the columns of a matrix B based on the
discussion below.
Construct the symmetric matrix C such that

    c_{ij} = \cos\Theta_{ij} = \frac{b_i^T b_j}{\|b_i\| \|b_j\|},

where b_i is the i-th column of B. An entry c_{ij} close to 1 means that the i-th vector is similar
to the j-th vector.
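In Matlab the whole matrix C can be formed at once by first normalizing the columns of B; a small sketch (the guard against zero columns is an added precaution):

    nrm = sqrt(sum(B.^2, 1));  % column norms of B
    nrm(nrm == 0) = 1;         % guard: avoid division by zero
    Bn = B ./ nrm;             % normalize columns (implicit expansion)
    C  = Bn' * Bn;             % c_ij = cos(Theta_ij)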
While computing the clusters, it is convenient to symmetrically permute the matrix C. This
way, each cluster is associated with a diagonal block in C. Some bookkeeping is required to keep
track of the permutation.
Suppose that you have constructed a cluster of the (permuted) vectors i through j. The cluster
is such that the diagonal block C(i:j, i:j) has no entries smaller than the threshold ε. This
means that every vector in the cluster is similar to every other vector in the cluster. We can try
to enlarge the cluster by searching among the remaining columns (> j) for a column z that
maximizes the quantity

    α_z = min(C(i:j, z)).

If α_z ≥ ε, then by symmetrically swapping row/column z with row/column j + 1 we have enlarged
the cluster so that it also includes the (permuted) vector j + 1. Continue this way until no more
vectors can be added to the cluster. Restart the process with a new vector and continue until the
entire set of vectors has been clustered.
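A sketch of the greedy algorithm is given below. It performs the symmetric swaps on C directly and keeps the bookkeeping in a permutation vector p; the function name and the cell-array output are illustrative:

    function [clusters, p] = greedy_cluster(C, tol)
    % Greedy clustering of the vectors behind the cosine matrix C.
    % tol is the threshold (epsilon in the text). Each cluster is
    % returned as a vector of original (unpermuted) column indices.
    n = size(C, 1);
    p = 1:n;                               % bookkeeping: the permutation
    clusters = {};
    i = 1;                                 % first column of current cluster
    while i <= n
        j = i;                             % cluster occupies columns i..j
        while j < n
            % Weakest similarity of each remaining column to the cluster;
            % pick the column z that maximizes it.
            [alpha, z] = max(min(C(i:j, j+1:n), [], 1));
            if alpha < tol, break; end     % no column is similar enough
            z = z + j;                     % absolute (permuted) index
            C([j+1 z], :) = C([z j+1], :); % symmetric swap of rows ...
            C(:, [j+1 z]) = C(:, [z j+1]); % ... and columns j+1 and z
            p([j+1 z]) = p([z j+1]);
            j = j + 1;                     % cluster now includes vector j+1
        end
        clusters{end+1} = p(i:j);          %#ok<AGROW>
        i = j + 1;                         % restart with a new vector
    end
    end

For the document clusters one would take B = Sk*Vk', and for the term clusters B = Sk*Uk'; this pairing is an assumption consistent with working in k-dimensional coordinates, as described below.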
Experiment with different tolerances and cluster the documents and terms of your (rank-reduced)
term/document matrix. Discuss what you find.
Note that you should work in k-dimensional space when computing the cosines. Therefore,
choose B such that each column holds the coordinates of a document (or term) in the
corresponding k-dimensional subspace.