Information Retrieval using a Vector Space Model
Matrix Computations and Applications, Assignment 2
Due: Thursday September 24 at 13.00

Introduction

The purpose of this assignment is to walk you through the development of a rudimentary but functional information retrieval system based on a vector space model and low-rank approximation. A user of your Matlab-based program will be able to input a database, construct a low-rank approximation of the term/document matrix, and perform query matching with relevance feedback, as well as detect clusters of documents and terms. To help you complete the assignment on time, you are given a Python script that converts a database into Matlab-readable form.

Task A

Before we begin, we need to create a database. Many of the interesting properties of low-rank approximations are only appreciable on very large data sets. Unfortunately, such large data sets are time-consuming to build and difficult to grasp. We therefore settle for a smaller, more manageable database.

Your first task is to construct your own database of documents in a context that interests you. The Python script takes as input a text file with one document per line. It splits each document into words. The set of words found in the documents is sorted and fed to Matlab. Each document, as a list of words, is also fed to Matlab.

Since the database is going to be (too) small, a few tricks can be employed to make it more interesting to use:

• The documents should relate to some common context to ensure that they share terms.
• Put all words in lowercase and consider using stemming to increase the number of shared terms.

One way to get a reasonable database is to enter the titles of a hundred or so scientific articles, for instance from the references or bibliography sections of articles and books. Another alternative is to enter the abstracts of articles. You need around a hundred documents to make the database interesting.
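For reference, the kind of preprocessing the provided script performs can be sketched as follows. This is an illustrative Python sketch, not the actual script you are given; the function name and the choice to keep only alphabetic tokens are assumptions.

```python
import re

def preprocess(lines):
    """Split each document (one per line) into lowercase words and
    collect the sorted set of distinct terms.

    Only alphabetic tokens are kept here; the real script may tokenize
    differently."""
    docs = [re.findall(r"[a-z]+", line.lower()) for line in lines]
    terms = sorted({w for doc in docs for w in doc})
    return terms, docs
```

Stemming (e.g. mapping "matrices" and "matrix" to one term) would be applied to each token before the vocabulary is collected.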
Write a function that builds a sparse term/document matrix in Matlab. Scale the entries according to the TF-IDF (term frequency, inverse document frequency) approach. Be careful to avoid division by zero.

Task B

Now it is time to approximate the term/document matrix using the singular value decomposition (SVD). Use the sparse SVD (svds) to compute a truncated SVD of A:

    A_k = U_k \Sigma_k V_k^T.

The norm of the error E_k = A - A_k is the square root of the sum of squares of the discarded singular values:

    \| E_k \|_F = \sqrt{ \sum_{i=k+1}^{\min(m,n)} \sigma_i^2 }.

Note that the sum runs precisely over the singular values which we have not yet computed (and cannot afford to compute either). Rewrite the formula above so that it only requires the first k singular values (which you have already computed). Hint: recall that the Frobenius norm of A is the square root of the sum of squares of its singular values, but it can also be computed directly from the entries of A:

    \| A \|_F = \sqrt{ \sum_{i=1}^{\min(m,n)} \sigma_i^2 } = \sqrt{ \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 }.

Plot the singular values of A from largest to smallest. Also plot the relative error \| E_k \|_F / \| A \|_F for k = 1, 2, ..., min(m, n). Find the smallest k that results in a relative error less than 20%.

Task C

Write a Matlab function that performs query matching using a rank-k approximation. The function should return the indices of the relevant documents. Let the rank k and the cosine cutoff be arguments to the function to make experimenting with different parameters easier. Present a couple of interesting queries and their results. Try different values of k and discuss your observations.

Allow the user to select one or more relevant documents from the set of returned documents and redo the query matching with the relevance feedback included.

Even though it does not matter much for such a small database, be sure to work in k-dimensional space instead of the full m-dimensional space of documents.
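Returning to Task A's weighting step, the TF-IDF scaling might be sketched as follows. This is an illustrative Python/NumPy analogue of the requested Matlab function; the log-based IDF variant and all names are assumptions. Note that every vocabulary term occurs in at least one document, so df(i) >= 1, which is what rules out division by zero.

```python
import numpy as np
from scipy.sparse import csc_matrix

def tfidf_matrix(terms, docs):
    """Build a sparse m x n term/document matrix with TF-IDF weights.

    Entry (i, j) = tf(i, j) * log(n / df(i)), where tf(i, j) is the raw
    count of term i in document j and df(i) is the number of documents
    containing term i. Since every term in the vocabulary occurs in at
    least one document, df(i) >= 1 and no division by zero occurs."""
    index = {t: i for i, t in enumerate(terms)}
    m, n = len(terms), len(docs)
    counts = np.zeros((m, n))
    for j, doc in enumerate(docs):
        for w in doc:
            counts[index[w], j] += 1.0
    df = np.count_nonzero(counts, axis=1)   # document frequency per term
    idf = np.log(n / df)                    # safe: df >= 1 by construction
    return csc_matrix(counts * idf[:, None])
```

A term that occurs in every document gets IDF weight log(n/n) = 0, which is intended: such a term carries no discriminating power.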
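For Task B, one way to convince yourself of a rewritten error formula is to compare it numerically against the direct sum over the discarded singular values on a small example where the full SVD is affordable. The following Python/SciPy sketch (with scipy's svds in the role of Matlab's svds; the matrix sizes, density, and k are arbitrary choices) uses only \| A \|_F and the first k singular values on one side of the comparison:

```python
import numpy as np
from scipy.sparse import random as sprandom
from scipy.sparse.linalg import svds
from scipy.linalg import svd

# a small random sparse "term/document" matrix for the experiment
A = sprandom(50, 30, density=0.2, random_state=0, format="csc")
k = 5

# truncated SVD: the k largest singular values/vectors
Uk, sk, Vkt = svds(A, k=k)

# relative error from ||A||_F (computed directly from the entries)
# and the first k singular values only
normA = np.sqrt(A.multiply(A).sum())
rel_err = np.sqrt(max(normA**2 - np.sum(sk**2), 0.0)) / normA

# cross-check: the discarded singular values from a full (dense) SVD
s_all = svd(A.toarray(), compute_uv=False)
rel_err_direct = np.sqrt(np.sum(s_all[k:] ** 2)) / normA
```

On your real database the dense cross-check is exactly what you cannot afford, which is the point of the rewrite.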
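A minimal sketch of Task C's query matching in k-dimensional coordinates (a Python analogue of the Matlab function; the cutoff default is arbitrary, and the use of the columns of \Sigma_k V_k^T as document coordinates follows the standard latent semantic indexing setup):

```python
import numpy as np

def query_match(Uk, sk, Vkt, q, cutoff=0.3):
    """Return indices of documents whose cosine with the query q is at
    least `cutoff`, with all cosines computed in k-dimensional space.

    The columns of diag(sk) @ Vkt are the documents' coordinates in the
    subspace spanned by Uk, and Uk.T @ q is the query's coordinates in
    the same basis."""
    qk = Uk.T @ q                 # query in k-dimensional coordinates
    D = sk[:, None] * Vkt         # k x n matrix of document coordinates
    cos = (qk @ D) / (np.linalg.norm(qk) * np.linalg.norm(D, axis=0))
    return np.nonzero(cos >= cutoff)[0]
```

Relevance feedback can reuse the same function: replace qk by a combination of qk and the coordinate vectors of the documents the user marked as relevant, and match again.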
Specifically, work with the coordinates of q in the k-dimensional subspace spanned by the basis U_k.

Task D

Clustering can be used to find concepts, related documents, trends, and synonyms, and to distinguish between different meanings of a word, among other things. In this task, you will implement a rudimentary greedy clustering algorithm that clusters the columns of a matrix B based on the discussion below.

Construct the symmetric matrix C such that

    c_{ij} = \cos \Theta_{ij} = \frac{b_i^T b_j}{\| b_i \| \, \| b_j \|},

where b_i is the i-th column of B. An entry c_{ij} close to 1 means that the i-th vector is similar to the j-th vector.

While computing the clusters, it is convenient to symmetrically permute the matrix C. This way, each cluster is associated with a diagonal block of C. Some bookkeeping is required to keep track of the permutation.

Suppose that you have constructed a cluster of the (permuted) vectors i through j. The cluster is such that the diagonal block C(i:j, i:j) has no entries smaller than the chosen threshold. This means that every vector in the cluster is similar to every other vector in the cluster. We can try to enlarge the cluster by searching among the remaining columns (> j) for a column z that maximizes the quantity

    \alpha_z = \min( C(i:j, z) ).

If \alpha_z is at least the threshold, then by symmetrically swapping row/column z with row/column j+1 we enlarge the cluster so that it also includes the (permuted) vector j+1. Continue this way until no more vectors can be added to the cluster. Restart the process with a new vector and continue until the entire set of vectors has been clustered.

Experiment with different tolerances and cluster the documents and terms of your (rank-reduced) term/document matrix. Discuss what you find. Note that you should work in k-dimensional space when computing the cosines. Therefore, choose B such that each column holds the coordinates of a document (or term) in its corresponding k-dimensional subspace.
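The greedy procedure above can be sketched as follows (a Python analogue; instead of explicitly permuting C, this sketch keeps lists of indices, which amounts to the same bookkeeping, and the tolerance default is arbitrary):

```python
import numpy as np

def greedy_cluster(B, tol=0.5):
    """Greedily cluster the columns of B by cosine similarity.

    Returns a list of clusters, each a list of column indices. A cluster
    grows only while the candidate column's cosine with EVERY column
    already in the cluster is at least `tol` (the diagonal-block
    criterion from the text). Columns of B are assumed nonzero."""
    n = B.shape[1]
    Bn = B / np.linalg.norm(B, axis=0)   # normalize columns
    C = Bn.T @ Bn                        # cosine matrix
    remaining = list(range(n))
    clusters = []
    while remaining:
        cluster = [remaining.pop(0)]     # seed a new cluster
        while remaining:
            # for each remaining column z, its worst cosine vs. the cluster
            alpha, z = max((min(C[i, z] for i in cluster), z)
                           for z in remaining)
            if alpha < tol:
                break                    # no column is similar enough
            cluster.append(z)
            remaining.remove(z)
        clusters.append(cluster)
    return clusters
```

To cluster documents, take B = \Sigma_k V_k^T (one column per document); to cluster terms, take B = \Sigma_k U_k^T (one column per term), so that the cosines are computed in k-dimensional space as required.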
