Algorithms for Information Retrieval and Natural Language Processing Tasks

Pradeep Muthukrishnan, Dragomir R. Radev
School of Information and Department of EECS, University of Michigan
{mpradeep,radev}@umich.edu
Winter 2008 (September 24, 2008)

Contents

1 Introduction
2 Text summarization
   2.1 Introduction
   2.2 Types of summaries
   2.3 Analysis of Different Approaches for Summarization
   2.4 Evaluation
   2.5 Conclusion
3 Semi-supervised learning, random walks and electrical networks
   3.1 Introduction
   3.2 Common methods used in SSL
      3.2.1 Generative Models
      3.2.2 Self Training
      3.2.3 Co-Training
      3.2.4 Graph-Based Methods
      3.2.5 Mincut approach
      3.2.6 Tree-based Bayes
   3.3 Active Learning
   3.4 Random walks and electric networks
   3.5 Conclusion
4 Semi-Supervised Learning and Active Learning
   4.1 Introduction
   4.2 Active Learning with feature feedback
   4.3 Using decision Lists
      4.3.1 Word sense Disambiguation
      4.3.2 Lexical ambiguity resolution
   4.4 Semi-supervised Learning using Co-Training
      4.4.1 Bipartite Graph representation
      4.4.2 Learning in large input spaces
   4.5 Conclusion
5 Network Analysis and Random graph Models
   5.1 Introduction
   5.2 Types of Networks
      5.2.1 Social Networks
      5.2.2 Information Networks
      5.2.3 Technological networks
      5.2.4 Biological Networks
   5.3 Network Properties
      5.3.1 Small-World Effect
      5.3.2 Transitivity or Clustering
      5.3.3 Degree distribution
      5.3.4 Network Resilience
      5.3.5 Mixing Patterns
      5.3.6 Degree Correlations
      5.3.7 Community structure
   5.4 Finding Community Structure
   5.5 Random Graph Models
   5.6 Conclusion
6 Prepositional Phrase Attachment Ambiguity
   6.1 Introduction
   6.2 Problem
   6.3 Methods for PP-Attachment
      6.3.1 Corpus-based approach using t-scores
      6.3.2 Rule-Based Approach
      6.3.3 Back-Off model
      6.3.4 Nearest-Neighbor Method
      6.3.5 Random Walks
      6.3.6 Formal Model
   6.4 Conclusion
7 Dependency Parsing
   7.1 Introduction
   7.2 Formal definition of the problem
   7.3 Machine Learning based methods
      7.3.1 Non-Linear Classifiers
      7.3.2 Kernels
   7.4 Transition system
   7.5 Graph-Based Methods
   7.6 Conclusion
8 Statistical Machine Translation
   8.1 Introduction
   8.2 A simple statistical machine translator
   8.3 IBM's Statistical Machine Translation System
   8.4 Conclusion
9 Decoders and non-source channel approaches for Machine Translation
   9.1 Introduction
   9.2 Direct Modeling of the Posterior probability
   9.3 Syntactical Machine translation
   9.4 Fast Decoder
   9.5 Conclusion
10 Sentiment Analysis
   10.1 Introduction
   10.2 Graph Based Methods
      10.2.1 Using minimum cuts for sentiment analysis
      10.2.2 Graph-Based Semi Supervised Learning for sentiment analysis
   10.3 Other approaches
   10.4 Evaluation of Machine learning algorithms for Sentiment Analysis
   10.5 Conclusion
11 Blog Analysis
   11.1 Introduction
   11.2 Analysis of Evolution of Blogspace
      11.2.1 Identifying Communities in Blogspace
   11.3 Ranking of Blogs
   11.4 Conclusion
12 References

Chapter 1

Introduction

In this report, I have tried to summarize the common algorithms and approaches used for various tasks in Information Retrieval and Natural Language Processing, such as text summarization, machine translation, sentiment detection, dependency parsing, etc. This report is submitted towards satisfying the requirement of the Directed Study course (EECS 599), Computer Science Division, University of Michigan, Ann Arbor, under the supervision of Professor Dragomir R. Radev.

Chapter 2

Text summarization

2.1 Introduction

Text summarization is an active field of research in Information Retrieval and Machine Learning. The major design goal of summarization is to convey the essential information of the original source text in a much shorter document. This chapter discusses the types of summaries, the different approaches to summarization, and the problems associated with them.

2.2 Types of summaries

Summaries are broadly categorized based on purpose, input and output.

1. Abstractive vs. extractive summaries: Abstractive summaries, or abstracts, involve identifying important parts of the source document and reformulating them in novel terms. "Important parts" here refers to the parts that carry significant information content relative to the source document. Extractive summaries, or extracts, involve selecting sentences, words, or phrases and combining them in a cohesive and coherent manner.
Abstracts are a much harder problem because their second phase, generating grammatically correct summaries with a natural language generator, is hard.

2. Single- or multiple-document summaries: This categorization is based on the number of input documents to the summarizer. This parameter is often called the span parameter.

3. Informative or indicative summaries: An indicative summary just lets the reader know about the contents of the article, while an informative summary can be used as a substitute for the original source document. A third category is the critical summary, but no actual summarization system creates critical summaries.

2.3 Analysis of Different Approaches for Summarization

Many different approaches to summarization have been tried by various researchers over the years. One of the graph-based approaches, by Radev et al [2], uses a graph-based stochastic matrix for computing the important sentences. It is a multi-document extractive summarization system which computes a set of sentences identified as central to the topic of the documents. In this approach, an undirected graph is constructed where the sentences are the vertices, and an edge connects two vertices (sentences) whenever the similarity between them is above a specified threshold. The similarity measure used is a modified version of cosine similarity, given below:

\[ \text{idf-modified-cosine}(x, y) = \frac{\sum_{w \in x, y} \mathrm{tf}_{w,x}\, \mathrm{tf}_{w,y}\, (\mathrm{idf}_w)^2}{\sqrt{\sum_{x_i \in x} (\mathrm{tf}_{x_i,x}\, \mathrm{idf}_{x_i})^2}\; \sqrt{\sum_{y_i \in y} (\mathrm{tf}_{y_i,y}\, \mathrm{idf}_{y_i})^2}} \tag{2.1} \]

where tf_{w,x} is the term frequency of word w in document x, and idf_w is the inverse document frequency of word w. To compute the central or salient sentences, a variety of methods can be used. For example, the degree of a vertex can be used as a measure of centrality, where a higher degree implies a more central sentence. A more advanced method that the authors suggest is LexRank, a modified version of PageRank.
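As a concrete illustration, the idf-modified cosine of equation (2.1) and the threshold-based graph construction with degree centrality can be sketched as follows. This is a minimal sketch over pre-tokenized sentences with a toy idf table; the threshold value and the fallback idf of 1.0 are assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def idf_modified_cosine(x, y, idf):
    """Equation (2.1): idf-weighted cosine between token lists x and y."""
    tfx, tfy = Counter(x), Counter(y)
    num = sum(tfx[w] * tfy[w] * idf.get(w, 1.0) ** 2 for w in set(x) & set(y))
    den_x = math.sqrt(sum((tfx[w] * idf.get(w, 1.0)) ** 2 for w in tfx))
    den_y = math.sqrt(sum((tfy[w] * idf.get(w, 1.0)) ** 2 for w in tfy))
    return num / (den_x * den_y) if den_x and den_y else 0.0

def similarity_graph(sentences, idf, threshold=0.1):
    """Undirected graph: edge (i, j) iff similarity exceeds the threshold."""
    n = len(sentences)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if idf_modified_cosine(sentences[i], sentences[j], idf) > threshold:
                edges.add((i, j))
    # Degree centrality: a higher degree marks a more "central" sentence.
    degree = Counter()
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    return edges, degree
```

The degree counts here implement the simple degree-based centrality; LexRank replaces this last step with a PageRank-style eigenvector computation over the same graph.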
Using LexRank offsets the effect of outliers (noise), i.e., a set of documents unrelated to the actual topic of the input that are grouped together by mistake. If the degree-based centrality measure were used, these unrelated documents could cause a set of mutually similar but completely off-topic sentences to be chosen.

The eigenvector computation for the stationary distribution vector is done using the power method. One optimization that, I believe, could be used for computing the stationary distribution vector is the repeated squaring method shown below.

input:  a stochastic, irreducible and aperiodic matrix M,
        matrix size N, error tolerance eps
output: eigenvector p

1. p_0 = (1/N) * 1
2. repeat
3.     M_t^T = (M_{t-1}^T)^2
4.     delta = ||M_t^T - M_{t-1}^T||
5. until delta < eps
6. return M_t^T p_0

An approach to generating more accurate summaries is to use templates for specific domains, with slots for each feature related to the topic of the source documents. With this approach, the more ontological rules are added, the better the quality of the summary. The problem with this approach is that it is highly limited to specific domains. In the area of domain-specific summarizers, Afantenos et al [3] present many different summarizers in the medical domain and the specific features of that domain which can be exploited to produce better summaries.

2.4 Evaluation

Evaluation has always been a very hard problem, mainly because human summarizers tend to agree among themselves only about 40% of the time, where agreement refers to common n-grams between the summaries. The current evaluation methods are of two types.

1. Quantifiable methods: These compare the output summary with summaries written by human judges from the same source documents and compute some quantitative results.
The comparison is mostly done by computing precision and recall using metrics such as the number of common n-grams between the output summary and the human-written summary. For example, ROUGE is a recall-based metric for fixed-length summaries which is based on 1-gram, 2-gram, 3-gram and 4-gram co-occurrence. It reports a separate score for each n-gram order, but Lin and Hovy have shown that the unigram-based ROUGE score agrees the most with human judgments.

2. Human-judged methods: These involve human judges reading the output summary and checking whether they can relate it to the original document. For example, Pyramid is a manual method for summarization evaluation. It deals with a fundamental problem in the evaluation of summarization systems: different humans choose different sentences to express the same facts. The solution is to create a gold-standard summary from many different human summaries, where each fact's importance increases with the frequency with which it is mentioned across the human summaries.

2.5 Conclusion

This chapter presented a broad introduction to summarization systems and a stochastic graph-based algorithm for computing an extractive summary from multiple documents, along with a small optimization for the computation of the stationary distribution vector. The evaluation methods generally used in tandem with summarization systems have also been discussed briefly.

Chapter 3

Semi-supervised learning, random walks and electrical networks

3.1 Introduction

Semi-supervised learning (SSL) is one of the major learning techniques in Machine Learning. It makes use of both labeled and unlabeled data for training. It assumes that the amount of labeled data is very small compared to the unlabeled data, which makes it a very practical alternative to supervised learning, which uses only labeled data, since labeled data is very expensive to obtain.
The most common idea of semi-supervised learning, or indeed any learning algorithm, is to identify the underlying probability distribution of the input data and then use a probability model (such as Bayes) to predict the output for unseen test cases. In semi-supervised learning, the labeled data is used to estimate the probability distribution and the unlabeled data is used to fine-tune the distribution. Figure 3.1 shows how this is achieved.

The chapter is organized in the following way. Commonly used methods in SSL are described in detail in section 2; the possibility of using active learning along with SSL is covered in section 3, followed by a brief discussion of random walks and electric networks.

3.2 Common methods used in SSL

Different approaches/models have been tried in the past; some of them are described below. Before looking at the different approaches, the question of which model to use for a particular problem is very important, because it has been proved that if the model is incorrect, unlabeled data can actually degrade the accuracy of the classifier. The answer depends on the nature of the problem. For example, if the different classes cluster well, then EM with generative mixture models is a good option. If the feature set can be split into conditionally independent sets, then co-training is a good option. If inputs with similar features fall into the same classes, then graph-based methods can be used. In essence, one needs to choose the approach that fits the problem structure well.

3.2.1 Generative Models

This is one of the oldest models in SSL. It assumes a model p(x, y) = p(y)p(x|y), where p(x|y) is an identifiable mixture distribution, for example a Gaussian mixture model.
The mixture components can be identified if we have large amounts of unlabeled data; then, with just one labeled input data point per class, we can accurately find the mixture components. (Figure 3.1: In a binary classification problem, unlabeled data can be used to help parameter estimation. From Xiaojin Zhu, Semi-supervised learning literature survey, Technical Report 1530, Computer Sciences, University of Wisconsin-Madison.) Identifiability refers to the property that the distribution can be uniquely represented by a set of parameters. The reason for requiring an identifiable mixture is that, without identifiability, different distributions could lead to different classifiers which output different labels for the same input. Even if the model is right, the probability distribution found using EM (Expectation Maximization) can sometimes be far from optimal, and once again unlabeled data could degrade learning. Another approach, which completely does away with the probability distribution, is Cluster-and-Label: cluster the data and label each cluster using the labeled data. The problem with this approach is that the classification error is very hard to analyze, because it depends almost completely on the clustering algorithm.

3.2.2 Self Training

In this approach, the classifier is initially trained with just the labeled data; it is then used to classify the unlabeled data, the most reliable labelings are added to the labeled data with their predicted labels, and the classifier is trained again. This happens iteratively. The procedure is also called self-teaching or bootstrapping. The problem with this method is that a wrong labeling can reinforce itself and cause further degradation. A remedy is to integrate a mechanism to unlearn some labelings whenever the predicted probability of a labeling drops below a threshold.
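The self-training loop just described can be sketched as follows. This is a minimal sketch: the `NearestCentroid` stand-in classifier, its crude distance-based confidence score, and the threshold value are all assumptions for illustration, not part of any published method.

```python
import math

class NearestCentroid:
    """Tiny stand-in classifier: predicts the label of the nearest class
    centroid, with a naive confidence based on relative distances."""
    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            pts = [x for x, l in zip(X, y) if l == label]
            self.centroids[label] = tuple(sum(c) / len(pts) for c in zip(*pts))
        return self

    def predict_proba_one(self, x):
        dists = {l: math.dist(x, c) for l, c in self.centroids.items()}
        label = min(dists, key=dists.get)
        total = sum(dists.values()) or 1.0
        conf = 1.0 - dists[label] / total  # crude confidence, an assumption
        return label, conf

def self_train(clf, X_lab, y_lab, X_unlab, threshold=0.6, max_iter=10):
    """Self-training: repeatedly move confident predictions from the
    unlabeled pool into the labeled set, retraining each round."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_iter):
        clf.fit(X_lab, y_lab)
        keep, moved = [], False
        for x in pool:
            label, conf = clf.predict_proba_one(x)
            if conf >= threshold:
                X_lab.append(x)
                y_lab.append(label)
                moved = True
            else:
                keep.append(x)
        pool = keep
        if not moved:  # no reliable labelings left; stop iterating
            break
    return clf.fit(X_lab, y_lab)
```

The unlearning remedy mentioned above would add a step that moves points back into the pool when their predicted probability falls below the threshold after retraining.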
3.2.3 Co-Training

This method can work very well if the feature set can be split into sets that are conditionally independent given the classification. In such a scenario, we train two different classifiers on the two different feature sets, using both the labeled and unlabeled data. Then the most reliable (confident) labelings of each classifier are added to the other classifier's labeled data, and both are re-trained. There have also been attempts to develop two classifiers over the whole feature set as such, using the reliable classifications of one classifier as labeled data for the other.

3.2.4 Graph-Based Methods

In all graph-based methods, the vertices (nodes) are the labeled and unlabeled data points, and the (possibly weighted) edges correspond to the similarity between nodes. All graph-based methods share an inherent assumption of label smoothness over the graph, i.e., the labels do not change much between neighboring nodes. They are also very similar to each other as far as the framework is concerned: there is a loss function and a regularizer. The loss function's role is to make sure that the predicted labeling for the labeled nodes is as close as possible to the actual labeling; if this condition fails, we have lost everything, and hence the loss is infinite. The regularizer is used to make sure the predicted labelings are smooth over the graph. The graph-based methods differ only in their choice of loss function and regularizer.

3.2.5 Mincut approach

The SSL problem can be thought of as a mincut problem in graph theory, by making all positively labeled nodes sources and all negatively labeled nodes sinks. The SSL problem then boils down to finding the minimum set of edges whose removal cuts all paths from the sources to the sinks. In this case, the loss function can be viewed as a quadratic loss with infinite weight; this constraint is required to make sure the labeled data stay fixed at their labels.
The regularizer is

\[ \frac{1}{2} \sum_{i,j} w_{ij}\, |y_i - y_j| \;=\; \frac{1}{2} \sum_{i,j} w_{ij}\, (y_i - y_j)^2 \tag{3.1} \]

The above equality holds only in the binary-label case. Putting both functions together, the mincut problem can be seen as minimizing

\[ \infty \sum_{i \in L} (y_i - y_{i|L})^2 \;+\; \frac{1}{2} \sum_{i,j} w_{ij}\, (y_i - y_j)^2 \tag{3.2} \]

subject to the constraint y_i ∈ {0, 1} for all i. Classification in this approach is performed by assigning to each input data point the label of the class it is closest to. One problem with mincut is that it gives only hard classifications, rather than the marginal probabilities that other approaches provide. This can be overcome by adding some random noise to the edges and creating many classifiers from the different graphs that arise from this perturbation; classification is then done by majority vote. This approach is sometimes called "soft mincut". The solution for the labeling function can be viewed as local averaging, i.e., it satisfies the averaging property

\[ f(j) = \frac{1}{d_j} \sum_{i \sim j} w_{ij}\, f(i) \tag{3.3} \]

In other words, the solution f(x_i) at an unlabeled point x_i is the weighted average of its neighbors' solutions. Since the neighbors are usually unlabeled points too, this is a self-consistent property.

3.2.6 Tree-based Bayes

In this approach, a probability distribution P(Y|T) is defined on discrete labelings Y over an evolutionary tree T. This differs from the graph-based methods in that the labeled and unlabeled nodes are leaves of the tree, and the regularizer is abstracted by constraining nearby leaves to have similar labelings. The label is assumed to mutate along the paths from the root to the leaves, and as a consequence the tree uniquely defines the label distribution given P(Y|T).

3.3 Active Learning

Active learning is a method which complements semi-supervised learning techniques. Active learning is used to minimize the estimated expected classification error.
This is done by asking the user to annotate data points selected by the learning algorithm. The selection algorithm should not pick points naively based on maximum label ambiguity or least confidence; instead, it should query points that lead to the greatest reduction in the estimated expected classification error. The general framework of active learning is captured by the following pseudocode.

input:  L (set of labeled nodes), U (set of unlabeled nodes), weight matrix W
output: L and the classifier f

while more labeled data is required:
1. Compute the labeling (classifier) function f.
2. Find the best query point k, i.e., the one that minimizes the estimated expected error.
3. Query point x_k to get its label y_k.
4. Add (x_k, y_k) to L and remove x_k from U.
end

Zhu et al (2003) show how active learning and semi-supervised learning can be combined using Gaussian fields and harmonic functions. The SSL method used is a graph-based method where the edge weight function is defined as

\[ w_{ij} = \exp\left( -\frac{1}{\sigma^2} \sum_{d=1}^{m} (x_{id} - x_{jd})^2 \right) \tag{3.4} \]

and the energy function is defined as

\[ E(f) = \frac{1}{2} \sum_{i,j} w_{ij}\, (f(i) - f(j))^2 \tag{3.5} \]

where f is the labeling function. The energy function is responsible for making the labelings of nearby nodes similar. The whole problem is then to find the function f which minimizes the energy, i.e.,

\[ f = \arg\min_{y|_L = y_L} E(y) \tag{3.6} \]

The functions which satisfy this property are known to be harmonic functions, i.e., functions which satisfy the averaging property. Because of the averaging property, the labelings of nearby nodes are ensured to be similar. To use the active learning method efficiently, we need to estimate the risk carefully. The risk function that Zhu et al suggest is

\[ R(f) = \sum_{i=1}^{n} \sum_{y_i = 0,1} [\,\mathrm{sgn}(f_i) \neq y_i\,]\, p^*(y_i \mid L) \tag{3.7} \]

Since this depends on p^*(y_i | L), the true label distribution at node i given the labeled data, the above formula cannot be used directly to compute R(f).
Thus, as an approximation, we take p^*(y_i = 1 | L) ≈ f_i. The estimated risk can then be computed as

\[ \hat{R}(f) = \sum_{i=1}^{n} \Big( [\,\mathrm{sgn}(f_i) \neq 0\,]\, (1 - f_i) \;+\; [\,\mathrm{sgn}(f_i) \neq 1\,]\, f_i \Big) \tag{3.8} \]

The above formula is a consequence of the fact that f_i is the probability of reaching a node labeled 1 before reaching a node labeled 0 on a random walk. We can now use this formula to find the best point to query, the one that minimizes the risk. That point is queried, and once its label has been received, it is added to the labeled data set and the whole process is repeated. Thus, by active learning, one can minimize the estimated classification error by means of intelligent querying.

3.4 Random walks and electric networks

As mentioned in the previous section, let f(i) be the probability of reaching a node labeled 1 before reaching a node labeled 0 on a random walk. Then it can be shown, using basic probability laws, that f(i) satisfies the averaging property of harmonic functions; thus f(i) is a harmonic function. One of the interesting properties of harmonic functions is that if two harmonic functions f(x) and g(x) satisfy f(x) = g(x) for all x in B, the set of boundary points, then f(x) = g(x) everywhere. This can be proved with the help of the maximum principle and the uniqueness principle, both defined below.

Maximum Principle: A harmonic function f(x) defined on S takes on its maximum value M and its minimum value m on the boundary.

Uniqueness Principle: If f(x) and g(x) are harmonic functions on S such that f(x) = g(x) on B, then f(x) = g(x) for all x.

There also exists an analogy between electric networks and random walks, as follows. Let the entire graph be converted into an electric network, where every edge is replaced by a resistor, all nodes labeled 1 are connected to a positive unit voltage, and all nodes labeled 0 are grounded. In this electric network, let the voltage at node x be represented by V(x).
Then V(x) is 1 or 0 at the labeled nodes, and by the basic laws of electricity it can be shown that V(x) satisfies the averaging property. Thus it is a harmonic function which takes the same boundary values as our probability function f(x), and by the maximum principle and the uniqueness principle, f(x) = V(x).

3.5 Conclusion

This chapter presented an introduction to semi-supervised learning and the different approaches used in it. Active learning has also been discussed in conjunction with semi-supervised learning, and an example of combining the two has been described.

Chapter 4

Semi-Supervised Learning and Active Learning

4.1 Introduction

In the last chapter, some of the most common methods used in semi-supervised learning were covered. In this chapter, we continue with other semi-supervised learning methods in the context of specific applications.

4.2 Active Learning with feature feedback

4.3 Using decision Lists

In this section, decision lists are used in the context of word sense disambiguation and lexical ambiguity resolution. There are many similarities between the two approaches: both are completely unsupervised, and both solutions rely largely on collocation as the most important feature for splitting the input into disjoint classes. Decision lists are, in simple terms, a list of feature queries posed about the input data points, ranked in some well-defined order, used to obtain the classification.

4.3.1 Word sense Disambiguation

This algorithm makes use of two core properties of human language.

1. One sense per collocation: Nearby words provide strong and consistent clues to the sense of a target word, conditional on relative distance, order and syntactic relationship.

2. One sense per discourse: The sense of a target word is highly consistent within any given document.
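As a rough illustration of the data structure, a decision list over collocation features can be sketched as follows. This is a minimal sketch: the feature encoding, the smoothing constant `alpha`, and the two-sense restriction are assumptions for illustration, not Yarowsky's exact procedure.

```python
import math
from collections import defaultdict

def build_decision_list(tagged_examples, alpha=0.1):
    """tagged_examples: list of (features, sense) pairs, where features is a
    set of collocation features and sense is 'A' or 'B'.
    Returns (feature, sense) rules sorted by |log-likelihood ratio|."""
    counts = defaultdict(lambda: {"A": 0.0, "B": 0.0})
    for features, sense in tagged_examples:
        for f in features:
            counts[f][sense] += 1
    rules = []
    for f, c in counts.items():
        # smoothed log-likelihood ratio log(P(A|f)/P(B|f)); alpha is an assumption
        llr = math.log((c["A"] + alpha) / (c["B"] + alpha))
        rules.append((abs(llr), f, "A" if llr > 0 else "B"))
    rules.sort(reverse=True)  # strongest evidence first
    return [(f, sense) for _, f, sense in rules]

def classify(decision_list, features, default="A"):
    """Apply the first (strongest) rule whose feature is present."""
    for f, sense in decision_list:
        if f in features:
            return sense
    return default
```

Classification walks the sorted list and stops at the first matching feature, which is what makes the ranking by log-likelihood ratio the heart of the method.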
Considering the large amount of redundancy in human language, the sense of the word is overdetermined by the above two properties. Yarowsky, Gale and Church (1992) have provided much empirical evidence to support both properties. The feature vector for this problem is defined to be the set of collocating words, and the features are ranked by the log-likelihood ratio of the probability distribution over all collocations of a word. The log-likelihood ratio is defined as

\[ \log\left( \frac{\Pr(\text{Sense}_A \mid \text{Collocation}_i)}{\Pr(\text{Sense}_B \mid \text{Collocation}_i)} \right) \tag{4.1} \]

1. In the corpus, all examples of the given word are collected, storing their contexts as lines in an untagged training set.

2. For each sense of the word, a few examples are tagged.

4.3.2 Lexical ambiguity resolution

This algorithm is largely similar to the previous one, differing mainly in the feature vectors, which also include the syntactic relationship between the collocating words and the word of interest. The specific problem chosen here is accent restoration, though the algorithm should work on any of the related problems, such as word-sense disambiguation, capitalization restoration, word choice in machine translation, and homograph and homophone disambiguation. All these problems are important because the whole semantics of the word under consideration can change dramatically with a small change in accent or context.

1. Identify ambiguities in accent pattern: Most words in Spanish/French exhibit only one accent pattern. The corpus can be analyzed to find words which have more than one accent pattern, and steps 2-5 are applied to them.

2. Collect training contexts: all collocating words within a word window of ±k.

3. Measure collocational distributions: The core strength of this algorithm lies in the fact that collocations are unevenly distributed with respect to the ambiguous word being classified; certain collocations tend to indicate one accent pattern and others indicate other patterns.
The types of collocations considered are

(a) the word immediately to the right;
(b) the word immediately to the left;
(c) a word found in a ±k word window;
(d) a pair of words at offsets -2 and -1;
(e) a pair of words at offsets -1 and +1;
(f) a pair of words at offsets +1 and +2.

4. Sort by log-likelihood ratios into decision lists: In this step we compute the log-likelihood ratios according to (4.1) and sort them in decreasing order. The collocations most strongly indicative of a particular pattern will have the largest log-likelihood. The optimal value of k depends on the type of ambiguity: semantic or topic-based ambiguities need a large window (k ≈ 20-50), while more local syntactic ambiguities need only a small window (k ≈ 3-4). For the Spanish experiments alone, a morphological analyzer and a basic lexicon with possible parts of speech were used along with the collocating words to generate a richer set of evidence.

4.4 Semi-supervised Learning using Co-Training

This work considers the problem of using a large unlabeled sample to boost the performance of a learner when only a small amount of labeled data is available. It is assumed that each instance can be described using two views, where either view is complete by itself for learning, provided there is enough labeled data. The goal is to use the two views together to reduce the requirement for expensive labeled data by using more inexpensive unlabeled data. Let X = (X1, X2) be an instance space, where X1 and X2 correspond to two different views of an example. Let D be the distribution over X, and let C = (C1, C2) be the combined concept class defined over X, where C1 is defined over X1 and C2 over X2. One significant assumption is that D assigns probability 0 to any example x = (x1, x2) with f1(x1) ≠ f2(x2), where f1 ∈ C1 and f2 ∈ C2 are the target functions. This is called the compatibility assumption.
4.4.1 Bipartite Graph representation Consider a bipartite graph GD (X1 , X2 ) where there is an edge (x1 , x2 ) if P rD (x1 , x2 ) 6= 0. The compatibility assumption implies that any connected component in GD will have the same classification. Given a set of unlabeled examples S, one can similarly define a bipartite graph GS having one edge (x1 , x2 ) for every (x1 , x2 ) ∈ S. Consider that S contains all examples in GD . When a new example comes in, the learner will be confident about its label, if it has previously seen a labeled example in the same connected component of GD . Thus if the connected components in GD are c1 , c2 , . . . and their probability masses are p1 , p2 , . . . respectively, then the probability that given m labeled examples, the label of a new example cannot be deduced is just X Pj (1 − Pj )m (4.2) cj ∈GD One can use the two views to achieve a tradeoff between the number of labeled and unlabeled examples needed. As the number of unlabeled examples increase, the number of connected components in GS decrease until it becomes equal to the number of connected components in GD . Since one needs only one labeled example per connected component, the number of labeled examples required to learn the target function decreases. Let H be a connected component in GD . Let αH be the value of the minimum cut of H. In other words, αH is the probability that a random example crosses this cut. If sample S were to contain all of H as one component, it must include at least one edge in that minimum cut. The expected number of unlabeled examples needed to include this edge is α1H . Karger shows that O( logN αH ) examples are sufficient to ensure that a spanning tree is found with high probability. So, if α = minH αH , then O( logN α ) unlabeled examples are sufficient to ensure that GS has same number of components as GD . 
4.4.2 Learning in large input spaces The theorem in the paper shows that given a conditional independence assumption on the distribution D, if the target class is learnable from random classification noise in the standard PAC model, then any initial weak predictor can be boosted to arbitrarily high accuracy using unlabeled examples only by co-training. A weak predictor h of a function f is defined to be a function such that 16 1. P rD [h(x) = 1] >= 2. P rD [f (x) = 1|h(x) = 1] >= P rD [f (x) = 1] + This way, using a large amount of unlabeled data, we can prune away the incompatible target concepts and thereby reduce the number of labeled concepts needed to learn the concept classes. 4.5 Conclusion This chapter briefly went over the applications of semi-supervised learning methods to the various applications in text processing. 17 Chapter 5 Network Analysis and Random graph Models 5.1 Introduction This chapter is about the different network properties of real-world graphs and techniques to efficiently find them and about random graphs. The chapter is organized in the following way, section 2 introduces the different types of networks followed by Section 3, which presents the different properties of the networks and physical understanding of their implications on the network. In section 4, we go through an algorithm for finding community structure using eigenvectors of matrices. Finally in section 5, we go through the different random graph models and the distributions of real-world networks. 5.2 Types of Networks The need for identifying the different type of networks arise from the fact that different types of networks have different properties and therefore, to study them we need to model the networks based on the type of the network. Most real-world networks can be broadly categorized into one of the following categories. 
5.2.1 Social Networks This network refers to a set of people connected together on the grounds of some underlying pattern and the interactions between them. For example, the network of friendships between people, business relationships between companies are all examples of social networks. There have been lots of experiments carried out in the past about such networks, but the main problem in such experiments is the way in which the data has been collected and the network size. For example, in the ”small world” experiment by Milgram, which was designed to find the path lengths in an acquaintance network by asking participants to try to pass letters to an assigned target individual through their acquaintances, though the data was collected directly from the participants, it is subject to bias from the participants and is also very labor intensive. To get rid of such errors, the trend in this field of research has been shifting towards analyzing collaboration or affiliation networks. An example of this is the graph obtained through the Internet Movie Database, where each actor is a vertex of the graph and there is an edge between two vertices if the two actors have 18 acted together in a movie. Also another such graph is the communication graph, where there is an edge between two vertices if there was any communication between them. 5.2.2 Information Networks There are three major information networks which are well-studied. One of them is the citation graph, where there is a directed edge from vertex A to vertex B, if B is cited in A, where the vertices are academic papers. The second important information network is the World Wide Web, where the vertices are pages on the WWW, and there is an edge from vertex A to vertex B if there is a hyperlink from A to B. The fundamental difference between them is that citation networks, by nature is acyclic, while the WWW is cyclic. 
The third type is the Preference Network, a bipartite network, where the vertices are of two types, the first being an individual and the second is an entity which they prefer, like books or movies. This network can be used for designing recommendation systems by analyzing the likes and dislikes of other people who look similar entities. 5.2.3 Technological networks This class of networks are the man-made networks used for distribution of some commodity or resource. Common examples include the power grid, the mail routes and the Internet, the physical interconnection of systems. The problem with the internet is that it is hard to find the exact network since most systems are owned by companies and hence to get an approximate representation, traceroute programs are used, which give the path the packet traced between two points. 5.2.4 Biological Networks The fundamental biological network is the metabolic pathways network, where vertices are metabolic substrates and products with directed edges joining them if a known metabolic reaction exists that acts on a given substrate and produces a given product. Other networks of this type include the gene regulatory network, neural network and the food web. 5.3 Network Properties The study of network properties is important because they give insight into the possible formation mechanism and ways to exploit any structure about the graph. A few statistical properties which are generally studied are listed below. 5.3.1 Small-World Effect As we have explained in the previous section, the experiment by Milgram resulted in the finding that it took only six steps to reach from one person to another through acquaintances. This is popularly known as the small-world effect. This measure, also known as, mean shortest (or geodesic) path length is computed using the following formula. l= 1 1 2 n(n X + 1) dij (5.1) i≥j where dij is the shortest path between i and j. 
This quantity can be measured by a breadth-first search which takes Θ(nm) time for a graph with n vertices and m edges. The problem with the 19 above definition is if the graph is disconnected the some dij values are infinity which would make l to be infinity. To get around this problem, the following formula is instead used for computation of mean shortest path length. l−1 = 1 1 2 n(n X + 1) dij −1 (5.2) i≥j One obvious implication is that, for graphs which exhibit the small-world effect, it would take very less time for information to spread across the network. This result is not particularly surprising when it is viewed mathematically, consider the number of vertices at a distance r from a distinguished vertex to grow exponentially, then the value of l is upper-bounded by log n. This has been proved for many different networks. 5.3.2 Transitivity or Clustering The metric, called clustering coefficient, is the probability of vertex A being connected to vertex C, given that there exists a vertex B which is connected to both A and C. It can be computed using the following formula. C= 3 × number of triangles in the network number of connected triples of vertices (5.3) An alternative definition of the clustering coefficient, given by Watts and Strogatz, is given below to compute a local clustering value at every vertex. ci = number of triangles connected to vertex i number of triples centered on vertex i (5.4) For vertices with degree 0 or 1, Ci = 0, then the clustering coefficient for the whole network is average of clustering coefficient of all vertices. 5.3.3 Degree distribution One way of representing the distribution of degree is computing the fraction of vertices, pk , in the network that have degree k, which can be obtained by drawing the histogram of degree of vertices. 
For real world networks, the bin size used in the histogram should increase exponentially to get around the problem of noise in the tail, so that there are more samples in the bins in the right. Another way to get around this problem, is to use the cumulative distribution function as shown below. Pk = inf X Pk0 (5.5) k0 =k which is the probability that the degree is greater than or equal to k. This method is preferred over the other because the previous method loses information about the differences of points in the same bin. 5.3.4 Network Resilience Network resilience is defined as the network’s resistance to become disconnected despite removing vertices. This resistance is measured in terms of the mean shortest path distance between two vertices. This measure has been shown to vary depending on the degree of the vertices and the 20 order in which vertices are removed. For example, in the Internet, random removal of vertices affects the mean shortest path distance very slightly whereas when vertices of high degree are attacked, the same metric increases sharply. And this property holds true for most real world graphs. 5.3.5 Mixing Patterns This property, referred to as assortative mixing, is about graphs where there are different ”types” of vertices and we would like to define a metric to compute how much of mixing of different types of vertices exists in graphs. A matrix E is defined as eij = the number of edges between vertices of type i and vertices of type j. This matrix is normalized by dividing each entry by the sum of all entries. The conditional probability P (j|i) that the neighbor of a vertex of type i is of type j can e be computed to be P ije . Gupta et al, have suggested this assortative mixing coefficient can be j ij computed as, −1 (5.6) N −1 Some preferable properties of this formula is that it is 1 for a perfectly assortative network and 0 for completely non-assortative. 
There are two shortcomings of this formula, one is that since E could be asymmetric it depends on what we have along the horizontal axis. The other shortcoming is it measures all vertices with equal weights which is not very desirable because there could be too many vertices of particular type, thus, it should be given more weight. An alternative formula for assortative mixing coefficient which overcomes this shortcomings is, P Q= r= 5.3.6 i P (i|i) T r(e) − ||e||2 1 − ||e||2 (5.7) Degree Correlations This property is to quantify the relevance of degree of a vertex to assortative mixing, in the sense, do high degree vertices connect more often with high degree vertices? In essence, we try to answer the question of, whether vertices connect with other vertices with some preference. 5.3.7 Community structure Most social networks have some sort of community structure within them, where community is defined to be a set of vertices with a higher than average number of edges between them and lesser edges outside this group of vertices. It is useful to discover this information, because the properties of the network at the community level may be very different from the properties at the whole network level. This information may be exploited to discover more properties of the communities. 5.4 Finding Community Structure A new approach to finding community structures using spectral partitioning has been proposed by Newman et al, though the algorithm takes the general approach to finding community structures which is graph partitioning, the novelty lies in the computation of the partition. In graph partitioning, we want to split the vertex set into two disjoint sets such that the number of edges that run between the sets is minimized. 
Considering the usual adjacency matrix representation of a graph, 21 the number of edges between the sets is computed as R= 1 2 i,j X Aij (5.8) in dif f erent groups To convert this into matrix form, we define an index vector S with n elements, such that ( si = +1 if vertex i belongs to group 1 −1 if vertex i belongs to group 2 Then R can be computed as, R= 1X (1 − si sj )Aij 4 ij (5.9) Let the degree of vertex i be ki then, ki = X Aij (5.10) j Then R can be written as, R= 1X si sj (ki δij − aij ), 4 ij (5.11) where δij = 1 if i = j and 0 otherwise. The above equation can be written in matrix form, as 1 R = sT Ls, 4 (5.12) where L is the Laplacian matrix of the graph, with Li j = ki if i = j −1 if i and j are connected 0 otherwise Then if vi ’s are the eigenvectors of the Laplacian matrix, then s can be represented in terms of vi ’s P as s = ni=1 ai vi . Then R can be written as, R= X i ai vi T L X aj vj = X j ai aj λj δij = ij X ai 2 λi , (5.13) i where λi are the eigenvalues of L. Now we have a formula for computing R given an index vector S, which represents the split of the network. Since 0 is an eigenvalue of L, and the corresponding normalized eigenvector is (1, 1, ...) , therefore, the choice of S = (1, 1, 1, ...) gives the minimum possible value of R = 0 but the physical interpretation of this choice is to put all vertices in one group, which is not very useful. The most common fix to this problem is to fix the sizes of the two groups, n1 and n2 , which fixes the value of ai to, (n1 − n2 )2 (5.14) n Now the best option that we are left with to minimize R is to choose the S vector to be as much parallel as possible to the second eigenvector, also called the f iedler vector with a1 ’s value being fixed, i.e, to maximize, X X (2) |V2 T | = | vi (2) si | ≤ |vi |, (5.15) ai 2 = (v1 T s)2 = i i 22 (2) where vi (2) is the i’th element of v2 . In other words, |v2T s| is maximized when vi si ≥ 0. 
This combined with the condition that there should be exactly n1 +1’s or -1’s, we maximize the above equation by assigning vertices to one of the groups in order of the elements in the Fiedler vector, from most positive to most negative, until the groups have the required sizes. For groups of different sizes, we set n1 to be the number of +1’s and calculate R and do the same with n1 -1’s and choose the choice which results in a smaller R. But this whole approach of spectral partitioning does not perform well on most of the real-world graphs. This is mainly because the whole approach forces us to fix the sizes of the groups, which may not be known beforehand in most situations. Therefore, the remedy that Newman suggests is to modify the benefit function to, Q = (number of edges within communities) − (expected number of such edges) (5.16) This benefit function is called modularity, and clearly maximizing this gives us tightly connected communities. The first quantity is easy to compute in the above equation, whereas the second quantity is a little harder to compute. To compute the second quantity, we assume a probability distribution for the edges and based on that we compute the expected number of edges. In the paper by Newman, he assumes that the probability of an edge incident on a vertex depends only on the expected degree of that vertex and derives the expected number of edges. A similar approach as taken in the spectral partitioning method is used again to convert the formula for Q into a matrix based formula and once again the S vector is chosen to be as much parallel as possible to the first eigenvector (when eigenvalues are arranged in non-increasing order) of the modularity matrix. This approach works very well compared to the first approach. A comparison of these two approaches is shown in fig 3.1. 
The graph shown in the fig 3.1 is of the interactions between dolphins, and the graph was split into two subgraphs after the departure of one of they key dolphins. The figure shows the two splits obtained by the two approaches. The second approach wrongly places only three of the 62 dolphins in the wrong group. 5.5 Random Graph Models To analyze the real-world graphs, it is very useful to model the graphs artificially. In order to be able to do that, we need to identify some properties of the real-world graphs. One of the major properties that we need to analyze is to identify the probability distribution of edges. Most of the distributions in real-world graphs have been found to be right-skewed, i.e, a lot of samples occur for the smaller data points. Another important property about such distributions is they follow a distribution called power law as described below. P (x) = Cx(−α) (5.17) The constant C is called the exponent of the power law, which is not very useful in the sense it is introduced in the formula only for the sake that the probabilities should sum to 1 for all x. In the paper, Newman, gives a lot of real-world examples which have been found to follow power law. To figure out if a given data set follows power law or not, and if it does how do we find out the exponent of the power law. One of the common strategies is to plot the given data set along logarithmic horizontal and vertical axes and if it happens to be a straight line, then it is said to follow a power law. But this turns out to be a poor approach. This is because when we plot the data, we bin the given data points into bins of exponentially increasing bin sizes, the noise in the 23 Figure 5.1: Comparison of the two approaches to graph partitioning, M.E.J Newman, Finding community structure in networks using the networks of matrices, The European Physics Journal B, 38:321 - 330, 2004 24 tail increases and therefore we can not identify the straight line form of the power law. 
Instead we can plot the cumulative distribution function, Z inf P (x) = p(x0 )dx0 (5.18) x Thus if the original distribution follows a power law with exponent α, then the cumulative distribution function can be shown to be following a power law with exponent α − 1. An alternative method to extract the exponent is to use the formula below, n X α = 1 + n([ i=1 ln xi xmin ])−1 (5.19) where xi are the measure values and xmin is the minimum value of the x’s. 5.6 Conclusion This chapter went over the basics of the different types of networks and the different properties of real-world graphs. Also a novel method for finding the community structures using spectral partitioning was presented and discussed. 25 Chapter 6 Prepositional Phrase Attachment Ambiguity 6.1 Introduction This chapter discusses the various approaches that have been taken to solve the problem of prepositional phrase attachment ambiguity. The chapter is organized in the following way. Section 2 contains a brief description of the problem and Section 3 goes over the different approaches and section 4 summarizes the conclusions of different approaches. 6.2 Problem The problem of prepositional phrase attachment problem (PP-Attachment) is a structural ambiguity problem which mainly arises as a subproblem of natural language parsing. The problem can be viewed as a binary classification task of deciding which site to attach a preposition to, the options being the verb or the noun. The PP-Attachment problem can be best explained with an example. ”I saw the man with the telescope” In the above example, we need to decide if the prepositional phrase ”with the telescope” attaches to the verb ”saw” or to the noun ”man”. In this case, the correct attachment is mostly ”saw” but definitely depends on context in general. 
The general setting of this problem is converts the input from a sentence to a 4-Tuple (V, N1, P, N2) where V is the sentence verb, N1 is the sentences object, P is the preposition, and N2 is the prepositional object. Then the problem can be restated as finding a function f (V, N1, P, N2) such that the output domain of f is {va, na} where va means verb attachment and na means noun attachment This problem of finding the correct attachment given just the 4-tuples is generally considered to be very hard since even humans have only an accuracy rate of 88.2%. 6.3 Methods for PP-Attachment The earliest simple methods which were used to solve this problem are 1. Right Association: The preposition tends to attach with the noun. This surprisingly has an accuracy rate of 67% 26 2. Minimal Attachment: a constituent tends to attach so as to involve the fewest additional syntactic nodes, which in our problem suggests verb attachment. 6.3.1 Corpus-based approach using t-scores The first corpus-based approach to PP-attachment was done by Hindle and Rooth in 1993. Their approach dwells on the lexical co-occurrences of nouns and verbs with the preposition. They use the 13 million word sample of AP news articles as their data set. They compute the frequency of the (verb, noun, preposition) triples in this data set, and using this data set they come up with a bigram table where the first entry is a noun or a verb, suggesting which constituent should the preposition, the second entry, be attached to. To compute this bigram table they take a 3 step approach as given below. 1. Associate all verbs or nouns which are sure verb-attachment or noun-attachment with prepositions. This is identified with a set of hard-coded rules that they have given. 2. Other triples which are not attached are attached using the help of a t-score, if the t-score is above 2.1 or less than -2.1 then we assign the preposition according to the t-score. 
The remaining ambiguous triples are split between the noun and the verb. 3. The remaining ambiguous are attached to the noun. Using this bigram table, they come up with a simple procedure for correct attachment based on the t-score as given below. t= p P (P rep|noun) − P (P rep|verb) + σ 2 (P (P rep|verb)) σ 2 (P (prep|noun)) (6.1) where the probabilities are estimated based on the frequency counts in the corpus. If the t-score is greater than 2.1 then its assigned to the noun and if it is lesser than -2.1 its attached to the verb. Based on these new attachments the whole process is iterated over again, and further attachments are made. This method has an accuracy of 78.3%, and their confidence-based approach which is to guess only when the confidence is greater than 95% achieved a precision of 84.5%. 6.3.2 Rule-Based Approach This approach is based on a set of rules that are learned in a particular order according to a greedy strategy to reduce training error. The first step is to run the unannotated text through an initial annotator, which could attach the preposition with the verb or noun according to simple rules. And then it follows a common supervised machine learning algorithm, which is transformationbased Error-Driven learning. This approach is intended to reduce the training error by finding transformations which reduce the error by the maximum amount. It can clearly be seen that this approach is greedy. The transformations that the procedure searches for are a simple set of transformations. The procedure then chooses the transformation which best reduces the error. The rules used are of the following form. 1. Change the attachment from X to Y if T is W. 2. Change the attachment from X to Y if T1 is W1 , T2 is W2 . 27 One specific rule mentioned in the paper is, ”Change the attachment location from nl to v if p is ”until”. Initial accuracy after passing through the initial annotator was found to be 64%, which was to associate always with the noun. 
And after the transformations, the accuracy was found to be 80.8% and it is shown that the accuracy increases with the amount of training data, which is intuitive for most supervised machine learning algorithms. An optimization to this approach was suggested using word classes. Using WordNet, words were binned into different classes, and these word classes were chosen during training. For example, all words relating to time, like ”in an hour”, ”month”, ”week” were replaced with the word class ”Time”. This way, they were able to reduce the number of transformations learnt and an improved accuracy of 81.8% 6.3.3 Back-Off model This approach introduce by Collins et al, is a supervised machine learning algorithm which is used when there are data sparsity problems. The notation used here is P 0 (1|v, n1 , p, n2 ) = P (1, v, n1 , p, n2 ) P (v, n1 , p, n2 ) (6.2) where P 0 (1|v, n1 , p, n2 ) is the maximum likelihood estimation (MLE) that the attachment should be noun attachment, and P (v, n1 , p, n2 ) is the frequency of the verb v, nouns n1 , n2 and preposition p are found in one sentence, and the P (1, v, n1 , p, n2 ) is the frequency of noun-attachment in such cases. If this P 0 (1|v, n1 , p, n2 ) ≥ 0.5, then it is decided to be noun-attachment, else verb attachment. The only problem with the equation above is that it is possible that we would have never come across the 4-tuple (v, n1, p, n2) and hence the denominator would be 0 in that case and the above formula becomes useless. Infact the authors have measured this and stated that this happens for 95% of the 4-tuples found in the test set. In such cases, we back-off to 3 tuples to compute the MLE and even if that frequency count happens to be too little then we further back-off to 2 tuples and compute the maximum likelihood estimation. The only constraint that is imposed when backing off is to use make sure the 3 tuples or 2 tuples contains the preposition. 
The description of the algorithm has been shown in figure 5.1. The accuracy chaptered in this paper is 84.1%. An optimization that has been suggested is to use morphological analysis which is kind of similar to Word classes used in the previous approach and this leads to a 0.4% improvement. 6.3.4 Nearest-Neighbor Method This is another approach that has been used to deal with the problem of data sparsity. In this approach the assumption is that similar 4-tuples will have the same attachment site, where there is some notion of similarity between tuples using the word similarity between corresponding nouns, verbs and prepositions. This assumption is justified by Distributional Hypothesis in linguistics, which states that words that tend to appear in the same contexts tend to have similar meanings. Therefore, this leads us into the problem of finding similar words. There is a structure defined to all words in the corpus and this structure is parsed from the corpus. A dependency relationship is an asymmetric binary relationship between a word called head, and another word called modifier. Then the whole sentence forms a dependency tree and the root of the sentence does not modify any sentence and is called the head of the sentence. For finding the nearest-neighbor, we should have some kind of feature set defined. The feature set for a word w, is defined to be (r, w’) where r represents the dependency relationship between w and w’. Since most dependency relationships involve words that are closely situated to one another, such relationships can be approximated by the co-occurrence relationship within a small window. In this case the window is restricted to be 28 Figure 6.1: Description of the algorithm for PP-Attachment using a back-off model, Collins et al, Prepositional Phrase Attachment through a Backed-Off Model, Third Workshop on large corpora 29 of size 1. 
Then using this as a feature vector for a word, similarity can be computed according to different similarity metrics defined previously such as Cosine Measure, Cosine of Pointwise Mutual information and so on. Let the two feature vectors be (u1 , u2 , .., un ) and (v1 , v2 , .., vn ), then the different similarity metric formulae are given below. Pn i=1 ui × vi Pn 2 2 i=1 vi i=1 ui simcos = Pn (6.3) For cosine of Pointwise Mutual Information, the pointwise mutual information between a feature fi and a word u measures the strength association between them. It is defined as follows: pmi(fi , u) = log( P (fi , u) ) P (fi ) × P (u) (6.4) where P (fi , u), is the probability of fi co-occurring with u. P (fi ) is the probability of fi cooccurring with any word, and P (u) is the probability of any feature co-occurring with u. The Cosine of Pointwise Mutual Information is defined as: Pn i=1 pmi(fi , u) × pmi(fi , v) pPn 2 2 i=1 pmi(fi , u) × i=1 pmi(fi , v) simcosP M I (u, v) = pPn (6.5) Now coming to the actual algorithm used to find the attachment for a preposition, the algorithm is executed in steps and each step is taken only if the previous step fails to result in a decision. The decision is taken at each step by computing a weighted majority vote form the nearest k neighbors of the input node, where the weights are the similarity between the different neighbors and the input node. The algorithm is given below. 1. The nearest neighbors are the nodes which are identical to the input node and hence similarity is 1. 2. The nearest k neighbors are found where the neighbors also contain the same preposition, then similarity between two 4-tuples t1, t2 is computed as follows. sim(t1 , t2 ) = ab + bc + ca (6.6) where a, b, and c are the distributional word similarities between the verb, the nouns N1 and N2 . 3. This step is the same as the previous function, with the only difference being the similarity function, which is, sim(t1 , t2 ) = a + b + c (6.7) 4. 
The set of nearest neighbors is all the training data examples with the same preposition and the similarity between the neighbors and the input node is constant and if the decision is still tied between N and V, then a noun-attachment is chosen. This algorithm chapters different accuracy rates based on the similarity function used, and it turns out that simcosP M I is the best similarity function giving an accuracy rate of 86.5%. 30 6.3.5 Random Walks The approach taken by Toutanova et al, makes use of Markov Chains and Random walks. The whole problem of computing P (Att|V, N1 , P, N2 ) is solved by using the joint probability distributions of P (V, N1 , P, N2 , Att). For doing this, they make a few assumptions so that they could build a probability model. The assumptions are given below. 1. Given a verbal attachment, the second noun is independent of the first noun 2. Given a noun attachment, the second noun is independent of the verb. This helps us derive formulas for computing the joint distributions as given below. P (V, N1 , P, N2 , va) = P (P, va)P (V |P, va)P (N1 |P, va, V )P (N2 |P, va, V ) P (V, N1 , P, N2 , na) = P (P, na)P (V |P, na)P (N1 |P, na, V )P (N2 |P, n1 , na) All factors except P (P, Att) are estimated using random walks. As an example, they show how to compute P (N2 |P, V, va), they do so by constructing a Markov chain M, whose transition probabilities depend on p, v, and the fact that Att = va so that its stationary distribution π is a good approximation of P (N2 |P, V, va). The set of vertices in the Markov chain are defined to be the set W × {0, 1}, where W is the set of all words, and the bit denotes whether it is reached as the head or as a dependent. For P (N2 |P, V, va), V is a head, and N2 is a dependent. For the sake of brevity, we represent (w, 1) as dw . The initial distribution of the Markov chain gives a probability of 1 to hV because thats where we start at. 
Consider the example sentence "hang painting with the nail": we are trying to estimate the probability P(N2 = nail|V = hang, P = with, Att = va). If the event of seeing V = hang, P = with, Att = va, and N2 = nail already occurred in the training set, then we could use the empirical distribution itself as a good approximation: if the word "nail" occurs often in the context P = with, V = hang, Att = va, then P(N2 = nail|V = hang, P = with, Att = va) should receive a high probability. Once again we use word classes, where the classes are based not only on similarity of meaning and of the contexts in which words appear, but also on stemming. If we did not see many occurrences of the word "nail" in the context described above but did see many occurrences of the word "nails", we would still like P(N2 = nail|V = hang, P = with, Att = va) to be assigned a large probability. This is achieved by adding an edge with large probability between d_nails and d_nail; since d_nails is surely going to receive a high probability as a result of the large number of occurrences of "nails", d_nail will also receive a high probability. In general, we add an edge between (w1, b1) and (w2, b2) if w1 and w2 belong to the same word class and b1 = b2. Such an edge is called a link, representing a particular inference rule.

6.3.6 Formal Model

In the formal model, the graph remains as described above; the only addition is the inference-rule links, where each link type leads from a state with memory bit b1 to a state with memory bit b2. The final stationary distribution that we compute is then a mixture of the basic transition distributions, where the mixture weights are learned automatically. Let the links l1, l2, . . . , lk be given by transition matrices T^1, T^2, . . . , T^k.
Each matrix T^i has rows for states with memory bit startBit(i), and its rows are distributions over successor states with memory bit endBit(i). The probability of going from (w1, b1) to (w2, b2) is then given by

P(w2, b2|w1, b1) = Σ_{i: startBit(i)=b1, endBit(i)=b2} λ(w1, b1, i) T^i(w2, b2|w1, b1)   (6.8)

where λ(w1, b1, i) is the weight of link type i for the state (w1, b1). The weights λ(w, b, i) sum to 1 over all links l_i having startBit(i) = b. The various link types, or inference rules, that were used are also given. They report an accuracy of 87.54%.

6.4 Conclusion

A brief overview of the algorithms developed in the past for PP-attachment has been given. Many algorithms have almost reached the human accuracy rate when given just the 4-tuple. However, the human accuracy rate is 93.2% when given the whole sentence instead of just the 4-tuple. This suggests that there is much more information in the sentence that none of the current algorithms have exploited, which is a very promising avenue for extending this research.

Chapter 7 Dependency Parsing

7.1 Introduction

Dependency parsing refers to parsing in the context of grammars in which the structure is determined by the relation between a word (the head) and its dependents; such a grammar is called a dependency grammar. Dependency grammars are very well suited for linguistic parsing: the general phrase-structure grammar does not conform well to the requirements of parsing free-word-order languages like Czech. Dependency parsing finds the syntactic structure of a sentence as determined by the relations between words (heads) and their dependents. In dependency syntax, the words are called lexical items and the binary asymmetric relations are called dependencies. The complete structure showing all the dependencies is called a dependency tree.
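Concretely, since every word has at most one head, a dependency tree can be stored simply as an array of head indices. The encoding and the example sentence below are illustrative assumptions, with index 0 reserved for an artificial root:

```python
# A dependency tree stored as a head-index array (hypothetical encoding).
# heads[i] is the index of the head of word i; the root's head is index 0.
sentence = ["<root>", "Economic", "news", "had", "little", "effect"]
heads    = [None,      2,          3,      0,     5,        3]

def dependents(h):
    """All words whose head is h."""
    return [i for i, head in enumerate(heads) if head == h]

# e.g. the dependents of "had" (index 3) are "news" (2) and "effect" (5)
```

This flat representation is all that is needed to recover the full tree, since the single-head condition discussed below guarantees it is well defined.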
An example of a dependency tree is shown in Fig. 7.1. Before constructing the dependency tree, we need to understand the criteria for a dependency. The criteria given by Zwicky [1985] and Hudson [1990] are listed below, where H is the head, D is the dependent, and C is the construction.

Figure 7.1: An example showing the syntactic structure (dependencies) of a sentence

1. H determines the syntactic category of C; H can replace C.
2. H determines the semantic category of C; D specifies H.
3. H is obligatory; D may be optional.
4. H selects D and determines whether D is obligatory.
5. The form of D depends on H (agreement or government).
6. The linear position of D is specified with reference to H.

Similarly, there are also conditions on the dependency graph, given below along with the intuition behind each. The dependency graph is:

1. Weakly connected: the syntactic structure is complete.
2. Acyclic: the syntactic structure is hierarchical.
3. Single-head: every word has at most one head.
4. Projective, that is, the dependencies can be drawn without criss-crossing. In general this is only an assumption we make to relax the problem, because otherwise it becomes very hard; non-projectivity is, however, needed for free-word-order languages and long-distance dependencies.

7.2 Formal definition of the problem

The problem of dependency parsing can now be formally defined as follows.

Input: a sentence x = w0, w1, w2, . . . , wn with w0 = root.
Output: a dependency tree G = (V, A), where V = 0, 1, . . . , n is the vertex set and A is the arc set, i.e., (i, j, k) ∈ A represents a dependency from wi to wj with label lk ∈ L.

There are two main approaches to this problem. The first is grammar-based parsing, which uses a lexicalized context-free grammar together with general CFG parsing algorithms. The second, discussed in detail in this chapter, is data-driven parsing.
This approach is further subdivided into transition-based methods and graph-based methods, which are the topics of the next two sections.

7.3 Machine Learning based methods

A linear classifier is a linear function of the feature vector which computes a score or probability for a classification. Let w ∈ R^m be a high-dimensional weight vector; then the linear function takes the following forms.

For a binary classifier,

y = sign(w · f(x))   (7.1)

For a multiclass classifier,

y = argmax_y (w · f(x, y))   (7.2)

A set of points is said to be separable if there exists a w such that classification is perfect. For a supervised learning algorithm, we model some objective function and find the weight vector which optimizes that objective. The perceptron is one of the most basic supervised learning algorithms; its objective function is the training error, as given below.

w = argmin_w Σ_t [1 − (y_t == sign(w · f(x_t)))]   (7.3)

For multiclass classification,

w = argmin_w Σ_t [1 − (y_t == argmax_y w · f(x_t, y))]   (7.4)

The perceptron algorithm is shown in Fig. 6.2. It iteratively alters the weight vector so that the number of training errors is reduced, and it can be proven that the number of training errors is bounded; hence the algorithm is bound to terminate after a finite number of iterations. The margin of a classifier is defined as the minimum distance between the separating hyperplane and any of the data points. Formally, a training set T is separable with margin γ ≥ 0 if there exists a vector u with ||u|| = 1 such that

u · f(x_t, y_t) − u · f(x_t, y′) ≥ γ   (7.5)

for all y′ ≠ y_t, where ||u|| is the norm of u. It can be seen that to maximize the margin we can equivalently minimize the norm of w, which is the basis for support vector machines.

7.3.1 Non-Linear Classifiers

Many real-life data sets are not linearly separable; in such cases we need to build a non-linear classifier. One of the most common non-linear classifiers is K-nearest neighbors.
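A minimal self-contained sketch of such a K-nearest-neighbor classifier (Euclidean distance, majority vote; the interface is a simplifying assumption, not a particular paper's formulation):

```python
import math
from collections import Counter

def knn_classify(x, train, k=3):
    """Classify point x by majority vote among its k nearest training points.
    train is a list of (point, label) pairs; distance is Euclidean."""
    dist = lambda p, q: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    nearest = sorted(train, key=lambda pl: dist(pl[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Note that no weight vector is learned at all: the training set itself, the distance function, and K jointly define the (non-linear) decision boundary.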
In the simplest form, classification reduces to finding the K nearest neighbors of the input in the training set and letting each of them vote for a classification. Thus a training set T, a distance function d, and K define a non-linear classification boundary. The most commonly used distance function is the Euclidean distance.

7.3.2 Kernels

A kernel is a similarity function between two points that is symmetric and positive semi-definite, generally denoted φ(x_t, x_r) ∈ R. Mercer's theorem states that any such kernel can be written as a dot product of some feature function applied to the two points. Noting that in the perceptron algorithm f(x_t, y_t) is added and f(x_t, y) is subtracted from the weight vector α_{y,t} times, we can write w as

w = Σ_{t,y} α_{y,t} [f(x_t, y_t) − f(x_t, y)]   (7.6)

Using this, the argmax function can be written as

y* = argmax_{y*} Σ_{t,y} α_{y,t} [φ((x_t, y_t), (x, y*)) − φ((x_t, y), (x, y*))]   (7.7)

Using the above equation, the perceptron algorithm can be rewritten using only kernels, as shown in Fig. 6.3. It may still seem unclear how this transformation helps: the point is that computing the kernel is often much more efficient than computing the dot product in the corresponding higher-dimensional feature space, which makes the whole algorithm extremely efficient.

7.4 Transition system

Another way to model the construction of the dependency tree is using finite state machines, or transition systems. A transition system for dependency parsing is a quadruple S = (C, T, c_s, C_t) where

1. C is a set of configurations, each of which contains a buffer β of (remaining) nodes and a set A of dependency arcs,
2. T is a set of transitions, each of which is a partial function t: C → C,
3. c_s is an initialization function, mapping a sentence x = w0, w1, . . . , wn to a configuration with β = [1, . . . , n],
4. C_t ⊆ C is a set of terminal configurations.

A configuration represents a parser state, and a transition sequence represents a parsing action.
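As a concrete illustration, here is a minimal arc-standard (shift-reduce style) transition system. The action names and the interface are a simplified sketch, not the exact formulation shown in the figures:

```python
def arc_standard(words, transitions):
    """Minimal arc-standard sketch: words are indexed 1..n with 0 = root;
    transitions is a sequence of 'SH', 'LA', 'RA' actions (in practice
    chosen at each configuration by the oracle classifier).
    Returns the arc set A as (head, dependent) pairs."""
    stack, buffer, arcs = [0], list(range(1, len(words) + 1)), []
    for t in transitions:
        if t == "SH":        # shift: move the next buffer node onto the stack
            stack.append(buffer.pop(0))
        elif t == "LA":      # left-arc: top of stack heads the item below it
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif t == "RA":      # right-arc: item below heads the top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs
```

For a two-word sentence, the sequence SH, SH, LA, RA produces the arcs (2, 1) and (0, 2), i.e. word 2 heads word 1 and the root heads word 2.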
The transitions taken by the transition system determine the set of arcs added to A. Therefore, the only remaining question is which transition to take in each state. A system which tells us which transition to take at every point is called an oracle, and oracles are approximated by linear classifiers. There are specific models of transition systems using stacks and lists; the stack-based and list-based transition systems are shown in Fig. 7.2 and Fig. 7.5. The stack-based transition system is further subdivided into two transition systems for the purpose of dependency parsing, shift-reduce dependency parsing and arc-eager dependency parsing, which differ only in the way the transitions are modeled. The two are shown in Fig. 7.3 and Fig. 7.4. It can be seen from the way the transitions are modeled that the space complexity of both transition systems is Θ(n), whereas the time complexity is Θ(n) for the stack-based system and Θ(n²) for the list-based system.

Now we turn to the problem of building the oracle, given that the transition systems are in place. As mentioned before, oracles are approximated using classifiers trained on treebank data. The classifier takes a configuration, represented as a feature vector of the transition system, and outputs a classification, i.e., a transition, based on the training transition sequences. The feature vectors are usually defined in terms of the target nodes, neighboring words, and the structural context. Most of the supervised learning algorithms discussed above can be used to solve this problem.

7.5 Graph-Based Methods

It can be seen that a dependency graph is a spanning tree, since it covers all the vertices (words), is connected, and is a tree. We define Gx to be the complete multi-digraph for the sentence x = (w0, w1, . . . , wn) with V = w0, w1, . . .
, wn and (i, j, k) ∈ A for all (i, j) ∈ V × V such that i ≠ j and k ∈ L, where L is the set of all labels (dependency relations). Every arc (i, j, k) in the graph has a weight w_ij^k. The problem of finding a dependency tree is then modeled as finding the maximum spanning tree of Gx. The maximum spanning tree problem is defined as

G′ = argmax_{G′ ∈ T(Gx)} w(G′) = argmax_{G′ ∈ T(Gx)} Π_{(i,j,k) ∈ G′} w_ij^k   (7.8)

where T(Gx) is the set of all spanning trees of Gx. From the above formula it can be seen that we are assuming that the choice of an edge is independent of the other choices. A simple reduction applied to this graph is to replace the |L| arcs between every pair of vertices by the single arc of weight max_k w_ij^k; the complete multi-digraph is thus reduced to a simple digraph.

Figure 7.2: Stack-based transition system
Figure 7.3: Shift-reduce dependency parsing
Figure 7.4: Arc-eager dependency parsing
Figure 7.5: List-based transition system
Figure 7.6: Projective parsing for list-based transition systems

McDonald et al. use the Chu-Liu-Edmonds algorithm to compute the maximum spanning tree. A small complication is that the Chu-Liu-Edmonds algorithm assumes the weight of a spanning tree to be the sum of the weights of its edges. This is easily rectified by taking logarithms:

G = argmax_{G′ ∈ T(Gx)} Π_{(i,j,k) ∈ G′} w_ij^k   (7.9)
  = argmax_{G′ ∈ T(Gx)} log Π_{(i,j,k) ∈ G′} w_ij^k   (7.10)
  = argmax_{G′ ∈ T(Gx)} Σ_{(i,j,k) ∈ G′} log w_ij^k   (7.11)

Therefore we can replace each weight by log w_ij^k and apply the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967), which is given in Fig. 7.7. The algorithm first selects the maximum incoming arc for each vertex; if the result is a tree, it is the maximum spanning tree. Otherwise we contract a cycle, consider it as a single vertex, and recalculate the weights of arcs into and out of the cycle. We then have a smaller graph, on which we recursively apply the algorithm.
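A compact recursive sketch of Chu-Liu-Edmonds over a dense score dictionary (a sketch under simplifying assumptions: no self-loop arcs, a single artificial root; the contraction weight rules it applies are the ones spelled out just below):

```python
def find_cycle(heads):
    """Return the set of nodes on a cycle in {dep: head}, or None."""
    for start in heads:
        path, node = [], start
        while node in heads and node not in path:
            path.append(node)
            node = heads[node]
        if node in path:
            return set(path[path.index(node):])
    return None

def chu_liu_edmonds(scores, root=0):
    """Maximum spanning arborescence for scores {(head, dep): weight},
    rooted at `root`.  Returns the tree as {dep: head}."""
    deps = {d for _, d in scores if d != root}
    # 1. greedily pick the best incoming arc for every non-root node
    best = {d: max((h for h, d2 in scores if d2 == d),
                   key=lambda h: scores[(h, d)])
            for d in deps}
    cycle = find_cycle(best)
    if cycle is None:
        return best
    # 2. contract the cycle into a fresh node c and reweight its arcs
    cycle_score = sum(scores[(best[d], d)] for d in cycle)
    c = max(deps) + 1
    new_scores, trace = {}, {}
    for (h, d), w in scores.items():
        if h in cycle and d in cycle:
            continue                      # arc inside the cycle: drop it
        if d in cycle:                    # entering arc: best tree with this head
            w2 = w + cycle_score - scores[(best[d], d)]
            if (h, c) not in new_scores or w2 > new_scores[(h, c)]:
                new_scores[(h, c)], trace[(h, c)] = w2, d
        elif h in cycle:                  # leaving arc: max over cycle vertices
            if (c, d) not in new_scores or w > new_scores[(c, d)]:
                new_scores[(c, d)], trace[(c, d)] = w, h
        else:
            new_scores[(h, d)] = w
    # 3. solve the smaller graph, then expand the contracted node
    contracted = chu_liu_edmonds(new_scores, root)
    heads = {d: h for d, h in contracted.items() if c not in (d, h)}
    entry = trace[(contracted[c], c)]     # cycle node the chosen arc enters
    for d in cycle:
        heads[d] = contracted[c] if d == entry else best[d]
    for d, h in contracted.items():
        if h == c:
            heads[d] = trace[(c, d)]
    return heads
```

On a toy three-node graph where the greedy incoming arcs form the cycle 1 ↔ 2, the contraction step correctly breaks the cycle at the arc whose replacement loses the least weight.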
The arc weights after contraction are computed as follows:

1. Outgoing arc weight: the maximum weight of that arc over all vertices in the cycle.
2. Incoming arc weight: the weight of the best spanning tree that includes the head of the incoming arc and all the nodes in the cycle.

Figure 7.7: Chu-Liu-Edmonds algorithm. Y. J. Chu and T. H. Liu, On the shortest arborescence of a directed graph, Scientia Sinica, v. 14, 1965, pp. 1396-1400; J. Edmonds, Optimum branchings, Journal of Research of the National Bureau of Standards, 71B, 1967, pp. 233-240.

Now that we have a method for solving the problem given a weighted digraph, the remaining question is how to find the weights. As before, we turn to linear classifiers. The weights are computed as

w_ij^k = e^{w · f(i, j, k)}   (7.12)

The features that can be used for this purpose are discussed in McDonald et al. [2005]. Some of them are listed below.

1. Part-of-speech tags of the words wi and wj and the label lk.
2. Part-of-speech tags of the words surrounding and between wi and wj.
3. Number of words between wi and wj, and their orientation.
4. Combinations of the above.

Thus we can learn the weights using the machine learning algorithms discussed above.

7.6 Conclusion

This chapter explained the problem of dependency parsing and showed how it can be expressed as a classification problem. The required learning algorithms for classification were discussed, and the graph-based and transition-based models were explained in the context of dependency parsing.

Chapter 8 Statistical Machine Translation

8.1 Introduction

Automatic translation from one human language to another using computers is known as machine translation. There have been many different approaches to machine translation, such as rule-based machine translation, where algorithms are developed to understand and generate the syntactic structure of sentences.
The most common approach to machine translation has been the statistical approach, mainly because of the growing availability of bilingual machine-readable text. The statistical approach refers to probabilistic rules learned from statistics observed in the available bilingual corpora.

8.2 A simple statistical machine translator

One of the earliest and simplest statistical machine translators was given by Brown et al. [1]. As with most modern approaches, the underlying model is based on Bayes' theorem: given an observed sentence T, we seek a translation S, and the error is minimized by choosing the sentence S which is most probable given T, where

Pr(S|T) = Pr(S) Pr(T|S) / Pr(T)   (8.1)

Since the denominator does not depend on S, we can drop it when choosing S to maximize Pr(S|T). The above equation has two factors: Pr(S), referred to as the language model, and Pr(T|S), known as the translation model. We also need a method for searching over all sentences S for the one maximizing the product of the two factors; this method is often referred to as the decoder. These three models together constitute a statistical translation system, shown in Fig. 8.1. This is the model considered in all the approaches described in this chapter.

Pr(S) cannot be computed directly from corpus statistics, since it is almost certain that we will not have seen every possible sentence of the source language. For example, to compute Pr(S = s1, s2, . . . , sn) we can use the chain rule

Pr(s1, s2, . . . , sn) = Pr(s1) Pr(s2|s1) . . . Pr(sn|sn−1, . . . , s1)   (8.2)

The problem with the above equation is that there are too many probabilities to be computed.
Because of this problem they settle for an approximate language model, the N-gram model, which is based on counts of N-grams (consecutive sequences of N words) seen before. For example, the formula for calculating Pr(S = s1, s2, . . . , sn) using a bigram model (N = 2) is

Pr(S = s1, s2, . . . , sn) = Pr(s1 as start of sentence) Pr(s2|s1) . . . Pr(sn as end of sentence)   (8.3)

Figure 8.1: Statistical machine translation system, Brown et al., Computational Linguistics, Volume 16, Number 2, June 1990

The translation model is a little more complex. There are three different factors in this model. The first is Pr(t|s), which denotes the probability that the word t is the translation of s. Before discussing the other two factors, we need to introduce the notions of distortion and fertility. Consider the French string "Le chien est battu par Jean", whose corresponding English translation is "John does beat the dog". The word "Le" corresponds to "the" in English and "battu" corresponds to "beat"; the French words are not aligned in position with the corresponding English words, and this is referred to as distortion. Distortions are also modeled using a probability distribution, Pr(i|j, l), where i is a target position, j is a source position, and l is the target sentence length. It can also be seen from the example translation that the word "does" has no corresponding equivalent in the French sentence, whereas the word "beat" produces two words, "est" and "battu". Fertility is defined as the number of French words that an English word produces in a given alignment. An example alignment is shown in Fig. 8.2.
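Returning to the language model, the bigram decomposition of Eq. 8.3 can be estimated directly from counts. A minimal sketch (the explicit `<s>`/`</s>` boundary markers are an implementation convention):

```python
from collections import Counter

def train_bigram(corpus):
    """Count unigram contexts and bigrams over tokenized sentences,
    adding <s>/</s> boundary markers for Eq. 8.3's start/end terms."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        words = ["<s>"] + sent + ["</s>"]
        uni.update(words[:-1])                 # each word as a context
        bi.update(zip(words, words[1:]))
    return uni, bi

def bigram_prob(sentence, uni, bi):
    """P(s1..sn) = P(s1|<s>) P(s2|s1) ... P(</s>|sn), by relative counts.
    Unseen bigrams get probability 0 -- the motivation for smoothing."""
    p = 1.0
    words = ["<s>"] + sentence + ["</s>"]
    for prev, w in zip(words, words[1:]):
        p *= bi[(prev, w)] / uni[prev] if uni[prev] else 0.0
    return p
```

Training on the two sentences "a b" and "a c", for instance, gives P(a b) = 1 × 1/2 × 1 = 0.5, while any sentence starting with an unseen bigram gets probability zero.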
Thus the translation model consists of three probability distributions: the fertility probabilities Pr(n|e) for each English word e and some moderate values of n, the translation probabilities Pr(f|e) for each French word f and English word e, and the distortion probability distribution.

Figure 8.2: An example of an alignment, Brown et al., Computational Linguistics, Volume 16, Number 2, June 1990

The decoder is based on stack search: initially there is a single hypothesis, namely that the target sentence arose in some sense from a sequence of source words that we do not know. The hypothesis is then extended through a series of iterations, with the most promising entries extended in every iteration. For example, if the initial entry in the stack is (Jean aime Marie | *), it could be extended in the following ways:

• (Jean aime Marie | John(1) *)
• (Jean aime Marie | *loves(2) *)
• (Jean aime Marie | *Mary(3))
• (Jean aime Marie | Jeans(1) *)

The search ends when there is a complete alignment on the list that is more promising than any other alignment.

8.3 IBM's Statistical Machine Translation System

IBM's statistical machine translation system was described in "The Mathematics of Statistical Machine Translation" by Brown et al. [1993] [2]. The modeling of the system proceeds in an iterative fashion. The base model remains the same as described in the previous section, with three components: the language model, the translation model, and the decoder. The language model uses the trigram model b(z|xy) = number-of-occurrences("xyz") / number-of-occurrences("xy"). But it is possible even in this model to get a count, and hence a probability, of zero; instead of the above formula they therefore use a smoothed version of it, as given below.
b(z|xy) = 0.95 × number-of-occurrences("xyz") / number-of-occurrences("xy")
        + 0.04 × number-of-occurrences("yz") / number-of-occurrences("y")
        + 0.008 × number-of-occurrences("z") / total-words-seen
        + 0.002

Since 0.002 is always added, we never get a probability of zero. It would also make sense to choose different coefficients for the different trigrams whose probabilities we are calculating.

Before moving on to the translation model, we need a method to compare two different language models. For this reason, the notion of perplexity is introduced:

perplexity = 2^{−log P(e) / N}   (8.4)

where N is the number of words in the test data. A good model will have relatively high P(e) values and hence low perplexity: the lower the perplexity, the better the model.

The translation model is initially a set of three probability distributions: the fertility distribution, the translation distribution, and the distortion distribution. The difference from the translation model described in the previous section lies in a slight change to the distortion distribution, which here is slightly simpler: the distribution is Pr(t_pos|s_pos), the probability that a source word at position s_pos generates a target word at position t_pos. The next model is built on top of this one; its distortion probability distribution is based not on the position of the source word alone but also on the lengths of the source and target sentences. There is one further addition to this model: a mechanism to introduce some French words for which there are no corresponding English words. One can think of the source sentence as having an invisible NULL word at its start, and that NULL word generates some words in the target language. This introduces another factor, the probability p0 that the NULL word generates a target word.
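The smoothed trigram estimate and the perplexity measure of Eq. 8.4 can be sketched as follows; the count-lookup interface `c(...)` is a hypothetical convenience, not part of the original system:

```python
import math

def smoothed_trigram(c, total_words):
    """Build the smoothed estimator with the fixed coefficients quoted in
    the text (0.95 / 0.04 / 0.008 / 0.002); c maps a word string to its
    count in the training data."""
    def b(z, x, y):
        p = 0.002                              # floor: never exactly zero
        if c(f"{x} {y}"):
            p += 0.95 * c(f"{x} {y} {z}") / c(f"{x} {y}")
        if c(y):
            p += 0.04 * c(f"{y} {z}") / c(y)
        p += 0.008 * c(z) / total_words
        return p
    return b

def perplexity(log2_prob_e, n_words):
    """Eq. 8.4: perplexity = 2^(-log2 P(e) / N); lower is better."""
    return 2 ** (-log2_prob_e / n_words)
```

For example, a model that assigns a 3-word test set probability 1/8 has perplexity 2, i.e. it is as uncertain as a fair coin flip per word.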
Now the whole model is shown in Fig. 8.3. This leads to the problem of estimating these probability distributions automatically from large corpora. If we had aligned sentence pairs, computing the distributions would be fairly easy, since we could just count the number of times each event occurs and divide by the size of the sample space. Since we do not have aligned sentence pairs, we instead work with an alignment probability distribution. Suppose we have computed the alignment probabilities Pr(a|e, f); now consider a particular sentence pair and compute the counts required for the distortion distribution, the fertility distribution, and the translation distribution over all possible alignments. The counts are then weighted as follows:

fractional-count = Σ_{all possible alignments} Pr(a|e, f) × count   (8.5)

Figure 8.3: Statistical machine translation system, Kevin Knight, Statistical machine translation tutorial workbook, JHU summer workshop [1999]

We can compute Pr(a|e, f) as Pr(a, f|e) / Pr(f|e). The numerator of this fraction can be computed using the model itself, and the denominator can be computed as

Pr(f|e) = Σ_{all possible alignments} Pr(a, f|e).   (8.6)

Therefore the whole problem reduces to computing Pr(a, f|e). The formula for computing this probability is given below.

Pr(a, f|e) = C(m − φ0, φ0) × p0^{m − 2φ0} × p1^{φ0} × Π_{i=1}^{l} n(φi|ei) × Π_{j=1}^{m} t(fj|e_aj) × Π_{j: aj ≠ 0} d(j|aj, l, m) × Π_{i=0}^{l} φi! × (1/φ0!)

where

1. e = the English sentence
2. f = the French sentence
3. ei = the i-th English word
4. fj = the j-th French word
5. l = the number of words in the English sentence
6. m = the number of words in the French sentence
7. a = the alignment (a vector of integers a1, . . . , am, where each aj ranges from 0 to l)
8. aj = the English position connected to by the j-th French word in alignment a
9. e_aj = the actual English word connected to by the j-th French word in alignment a
10.
φi = the fertility of English word i (where i ranges from 0 to l), given the alignment a.

There is one problem with the above methodology for computing the parameters: we said we can use the parameter values to compute P(a, f|e), but to get the parameter values we need P(a, f|e) itself. This circularity is resolved using the standard Expectation-Maximization (EM) algorithm. The idea is to start with uniform probability values (and a uniformly random value for p1), compute the alignment probabilities, and use these to compute the counts for new parameter values. Repeating this procedure again and again converges to good parameter values.

There is still one problem with this approach: it computes the probability distributions by enumerating all alignments, and the number of alignments is far too large to enumerate even for a single pair of sentences. Fortunately, there is an efficient way of computing the probability distributions without enumerating all the alignments, as explained below.

Pr(a|e, f) = Pr(a, f|e) / Σ_{all possible alignments} Pr(a, f|e)   (8.7)

The denominator can be computed as

Σ_{all possible alignments} Pr(a, f|e) = Σ_{all possible alignments} Π_{j=1}^{m} t(fj|e_aj)   (8.8)

When the right-hand side of the above equation is expanded and factorized by taking common factors out, we get

Σ_{all possible alignments} Π_{j=1}^{m} t(fj|e_aj) = Π_{j=1}^{m} Σ_{i=0}^{l} t(fj|ei)   (8.9)

More methods for estimating the parameter values are given in "The Mathematics of Statistical Machine Translation" by Brown et al. All of these methods rely on using a simpler model to estimate the parameter values; these estimates are passed to the more complex model as its initial parameter values, i.e., the initial values of its EM algorithm.

8.4 Conclusion

This chapter covered the fundamental algorithms in statistical machine translation (SMT).
The source-channel approach, also called the Bayesian approach, is predominantly the standard approach in SMT, and the basics of this approach have been discussed in this chapter.

Chapter 9 Decoders and non-source channel approaches for Machine Translation

9.1 Introduction

The predominant approach to Statistical Machine Translation (SMT) is the source-channel approach, which was covered in the last chapter; the IBM models were also based on this approach. However, this approach has a few shortcomings, given below.

1. The IBM model translation operations are movement (distortion), duplication, and translation of individual words. The model thus does not capture the structural and syntactic aspects of the languages.
2. The combination of the language model and the translation model can be proved to be the optimal translation model only if the probability distributions of the two models are accurate, but it is often the case that the approximations are poor.
3. There is no way to extend the model to incorporate other features and dependencies.
4. The model is trained using the maximum-likelihood approach, but the final evaluation criterion is totally different; we would rather train a model such that the end-to-end performance is optimal.

In the remaining sections, we will discuss approaches that overcome the above shortcomings.

9.2 Direct Modeling of the Posterior probability

This approach directly models the posterior probability Pr(e|f) instead of using Bayes' rule; it is the approach taken in Och et al. [2002][2003]. The framework for this model is the maximum entropy model (Berger et al., 1996). In this framework there are M feature functions hi(e, f), i = 1, 2, . . . , M, each having a model parameter λi.
In this model the translation probability is given by

Pr(e|f) = exp[Σ_{i=1}^{M} λi hi(e, f)] / Σ_{e′} exp[Σ_{i=1}^{M} λi hi(e′, f)]   (9.1)

The decision is then based on the following equation:

ê = argmax_e {Σ_{i=1}^{M} λi hi(e, f)}   (9.2)

The remaining problem is to train the model, in other words, to find the parameter values. The parameter values are set using the equation

λ̂ = argmax_λ {Σ_{s=1}^{S} log p_λ(es|fs)}   (9.3)

This corresponds to maximizing the likelihood of the direct translation model. Some of the features which could be used are:

1. A sentence length feature,

h(f, e) = length(e)   (9.4)

which allows the model to control the length of the generated sentence.

2. Other language models, used in the following way:

h(f, e) = h(e)   (9.5)

3. Lexical features which fire if a certain lexical relationship (f, e) occurs:

h(f, e) = (Σ_{j=1}^{J} δ(f, fj)) (Σ_{i=1}^{I} δ(e, ei))   (9.6)

4. Grammatical dependencies between the source and target language; for example, if k(·) counts the number of verb groups in a sentence, one possible feature function is

h(f, e) = δ(k(f), k(e))   (9.7)

The training is done using the Generalized Iterative Scaling algorithm. There are two problems with this model as stated. The first is that the renormalization step in Eq. 9.1 takes a long time, since it considers all possible sentences; this is overcome by considering not all sentences but only the n most probable ones. The second problem is that in Eq. 9.3 there is not one single reference translation but Rs reference translations; hence that equation is modified to accommodate this:

λ̂ = argmax_λ {Σ_{s=1}^{S} (1/Rs) Σ_{r=1}^{Rs} log p_λ(es,r|fs)}   (9.8)

For the purpose of evaluation, they do not use just a single evaluation criterion, mainly because there is no one single accepted criterion. A few of the evaluation criteria they have used are:

1.
SER (Sentence Error Rate): the fraction of generated sentences that do not match one of the reference translations exactly.

2. WER (Word Error Rate): the minimum number of substitution, insertion, and deletion operations that need to be performed to transform the generated sentence into the reference translation.

3. mWER (multi-reference Word Error Rate): since there are multiple reference translations, it is better to use the minimum edit distance between the generated translation and the reference translations.

4. BLEU: this commonly used measure is the geometric mean of the precisions of n-grams for all n ≤ N between the generated translation and the reference translation, multiplied by a factor BP(·) that penalizes short sentences:

BLEU = BP(·) · exp(Σ_{n=1}^{N} log(pn) / N)   (9.9)

where pn denotes the precision of n-grams in the generated translation. Clearly a larger number of common n-grams is better; hence larger BLEU scores are better.

The experiments were conducted on the VERBMOBIL task, a speech translation task in the domain of hotel reservation, appointment scheduling, etc. The results were such that the addition of each new feature gave a systematic improvement on almost all of the evaluation criteria.

In Och et al. [2003], they show how to use the same evaluation criterion during training and hence obtain optimal end-to-end performance. In this paper they train the model considering different evaluation criteria such as BLEU, mWER, etc. Without loss of generality, assume the function E(r, e) gives the error between r and e; then the total error for S pairs of sentences is given by total-error = Σ_{s=1}^{S} E(rs, es). In such a setting, the training is performed using the equation below.
λ̂ = argmin_λ { Σ_{s=1}^{S} E(r_s, ê(f_s; λ)) } = argmin_λ { Σ_{s=1}^{S} Σ_{k=1}^{K} E(r_s, e_{s,k}) δ(ê(f_s; λ), e_{s,k}) }   (9.10)

with

ê(f_s; λ) = argmax_{e ∈ C_s} { Σ_{m=1}^{M} λ_m h_m(e, f_s) }   (9.11)

The above method has two problems. The first is the argmax operation, because of which one cannot use gradient descent. The second is that there are many local optima. Both problems are overcome by using a slightly smoothed version of the above objective:

λ̂ = argmin_λ { Σ_{s,k} E(r_s, e_{s,k}) · p_λ(e_{s,k} | f_s)^α / Σ_{k'} p_λ(e_{s,k'} | f_s)^α }   (9.12)

To see how this helps solve the two problems mentioned, consider Figure 9.1: the unsmoothed error count clearly has many more optima than the smoothed error count, which makes the optimization easier. They also propose a new optimization algorithm for optimizing the unsmoothed error count of Eq. 9.10. Their experiments were run on the 2002 TIDES Chinese-English small data track task. The results indicate that training based on the final evaluation criterion does give better end-to-end results. This suggests the possibility of using an evaluation criterion that correlates well with human judgement to improve the quality of translation systems and even other NLP tasks.

Figure 9.1: The graph on the left is the error count and the graph on the right is the smoothed error count, Och, Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, 2003

9.3 Syntactic Machine Translation

One of the major criticisms of the IBM models for SMT is that they do not incorporate syntax or the structural dependencies of the languages they model. It is suspected that such models may not work well for a pair of languages with different word order, like English and Japanese. This problem was approached by translating a source parse tree instead of a source sentence.
This approach was suggested in Yamada and Knight [2001] and Charniak et al. [2003]. Since the input is a source parse tree, the operations required differ from those of the IBM model. The operations they suggest are reordering child nodes, inserting extra words at each node, and translating leaf words. The reordering operation bridges the word-order differences between SVO and SOV languages. The word-insertion operation is intended to capture linguistic differences such as syntactic case marking. Figure 9.2 shows how these operations are used to translate an English sentence into Japanese. Each operation is modeled using a probability distribution. The reordering distribution is modeled using the r-table, which gives the probability of each reordering of the child nodes of a non-terminal node. The insertion operation is modeled using two distributions: the first gives the probability of inserting at a particular position, and the second gives the probability of the word to be inserted. The translation distribution, the t-table, specifies the probability t(w_i, w_j) of translating word w_i into word w_j. Formally speaking, there are three random variables N, R and T which model the channel operations. Let the English parse tree be ε, consisting of nodes (ε_1, ε_2, ..., ε_n), and let the target sentence be (f_1, f_2, ..., f_n). Let θ = (v, p, t) be the values associated with the random variables N, R and T. The probability of the translated sentence given the English parse tree is

p(f | ε) = Σ_{θ: str(θ(ε)) = f} p(θ | ε)   (9.13)

where str(θ(ε)) is the sequence of leaf words of the tree obtained by transforming ε by θ. It is assumed that all transform operations are independent and that each random variable is determined only by the node itself.

Figure 9.2: Channel Operations: Reorder, Insert and Translate, Yamada and Knight, Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, 2001
Therefore,

p(f | ε) = Σ_{θ: str(θ(ε)) = f} Π_{i=1}^{n} n(v_i | N(ε_i)) r(p_i | R(ε_i)) t(t_i | T(ε_i))   (9.14)

The model parameters are estimated automatically from the corpus using the algorithm given in Figure 9.3. Since too many combinations are possible in the naive algorithm, they give an EM algorithm which computes the parameters in polynomial time. The evaluation was done by human judgement of the alignments produced by the decoder. The alignments were graded with three quantization levels: 1.0 for Correct, 0.5 for Not sure, and 0.0 for Wrong. The average alignment score was 0.582, whereas IBM Model 5 scored 0.431, justifying the need to model the syntactic aspects of the languages.

9.4 Fast Decoder

In the previous section we saw a syntax-based machine translation system. The performance of such a system depends on the decoder used. One decoder for the syntax-based translation system discussed above was proposed in Yamada and Knight [2002]. Since the system works on the noisy-channel model, the decoder works in the reverse direction of the channel, i.e., given a Chinese sentence it tries to find the most probable English parse tree. To parse in this way, they obtain a Context Free Grammar (CFG) for English from the training corpus (pairs of English parse trees and Chinese sentences). This grammar is extended to include the channel operations (reordering, insertion and translation); these rules are added along with the corresponding probabilities from the r-table, n-table and t-table respectively. Using this information we can build a decoded tree in the foreign-language order. Converting this tree to an English parse tree is simple: we just apply the reorder operations in reverse and translate the Chinese words into English using the t-table.

Figure 9.3: Training Algorithm, Yamada and Knight, Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, 2001
Then the probability can be calculated as the product of the probability of the parse tree and the probability of the translation. The probability of the parse tree is the prior probability of the sentence, computed using an n-gram language model. The probability of the translation is the product of the probabilities of each rule applied while converting the decoded tree. Thus we can choose the tree with the highest probability. The only remaining problem is that the space of decoded trees is very large. This is easy to see, since many different sequences of channel operations can produce the same tree. For example, in their experiment translating Chinese news into English, there are about 4M non-zero entries in the t-table, about 10k CFG rules for parsing English, and about 120k additional rules for accommodating the channel operations. An exact algorithm would therefore take too much time even for small sentences. The solution is to forgo optimality and build a less time-consuming approximate decoder. The idea is a dynamic-programming-based beam search algorithm. The dynamic programming builds the parse tree bottom-up; the recurrence is on the tuple ⟨non-terminal, input substring⟩. If the parser cost can be computed based only on the features of the subtree, then we keep only the best subtree; otherwise we keep the beam-width best subtrees. For faster decoding, they suggest pruning based on the different probability distributions. For example, for a given Chinese sentence, we only consider those English words e for which P(e | f) is high for the Chinese words f. Pruning based on the r-table is done by choosing the top N rules such that Σ_{i=1}^{N} prob(rule_i) × prob(reorder_i) > 0.95. This limits the number of rules to be used.
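The 95%-mass pruning just described can be sketched as follows; the rule table here is a hypothetical toy:

```python
def prune_rules(rules, mass=0.95):
    """Keep the top-N rules whose probabilities together cover `mass` of
    the table, as in the r-table pruning described above. `rules` maps
    rule -> probability; returns the retained rules, most probable first."""
    kept, total = [], 0.0
    for rule, p in sorted(rules.items(), key=lambda kv: -kv[1]):
        kept.append(rule)
        total += p
        if total > mass:
            break
    return kept

# Hypothetical rule table: two rules carry most of the probability mass.
table = {"a": 0.6, "b": 0.3, "c": 0.07, "d": 0.03}
```

Here `prune_rules(table)` keeps three of the four rules, mirroring how only 875 of 138,862 rules survive in the experiment below.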
In their experiment, of the total of 138,862 rules, only 875 rules contribute to 95% of the probability mass. Similar pruning rules are used to further limit the number of choices available at every stage of the algorithm.

9.5 Conclusion

This chapter covered developments in statistical machine translation concerning decoders and approaches beyond the source-channel model.

Chapter 10

Sentiment Analysis

10.1 Introduction

Sentiment analysis refers to the process of determining the attitude or emotion of a speaker with respect to some topic. This relatively new field of research is interesting and challenging, and its appeal is largely based on its potential applications. For example, most search engines currently in use are keyword-based, but there are other features of textual data on which we could search; for example, we could search based on the sentiment, also known as the polarity, of a page. As another use of a sentiment analyzer, consider a company X which has a product P and receives many feedback mails about it. A sentiment analyzer could be used to find which features of product P are appreciated and which have to be improved. Broadly speaking, there are two tasks associated with sentiment analysis: finding the polarity of a sentence, and assigning the sentiment to the target of the sentiment. For example, the sentence "The Canon Powershot camera is great" conveys a positive sentiment towards the Canon Powershot camera, which is the target. This chapter is organized in the following way: Section 10.2 discusses some graph-based approaches for sentiment analysis, Section 10.3 reviews other approaches, Section 10.4 covers the evaluation of machine learning algorithms for the task, and Section 10.5 concludes.

10.2 Graph Based Methods

This section covers two graph-based methods for sentiment analysis.
One approach models the graph using only information from the sentences in the document, whereas the other also uses the results of another classifier.

10.2.1 Using minimum cuts for sentiment analysis

This paper by Pang and Lee deals with the problem of identifying which sentences in a given document are subjective. The idea is that if we reduce the number of objective sentences in a document, then the accuracy of document polarity classifiers should improve. They use existing document polarity classifiers based on Support Vector Machines (SVM) and Naive Bayes (NB). The architecture of their system is shown in Figure 10.1; the contribution of this paper is stage 2 in the diagram, the subjectivity detector. They model the problem as a min-cut graph problem, defining a weighted undirected graph for a document in the following way. The graph is G = (V, E), where the vertex set V contains one vertex per sentence in the document plus two extra vertices, the source and the sink. The source and sink represent the two classes, in our case the subjective node and the objective node. The edge between two sentence nodes x_i and x_j has a weight assoc(x_i, x_j) proportional to the importance of both nodes being in the same class. Each edge between a sentence x and the class node C_i has a weight ind_i(x) corresponding to the importance of x being in class C_i. A cut in the graph G splits the vertex set V into two disjoint sets S and S' such that S ∪ S' = V. Since placing a sentence in C1 cuts its edge to the other class node, and vice versa, the cost of the cut is

Cost = Σ_{x∈C1} ind_2(x) + Σ_{y∈C2} ind_1(y) + Σ_{x∈C1, y∈C2} assoc(x, y)   (10.1)

The min-cut can be computed efficiently using any polynomial-time maximum-flow algorithm.

Figure 10.1: Architecture of the Sentiment Analyzer, B. Pang and L. Lee, A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, Proceedings of ACL, pp. 271-278, 2004
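Given a candidate partition, the cut cost of this formulation can be computed directly. This is a minimal sketch assuming that placing a sentence in one class pays the individual weight of its edge to the other class node, plus the association weight of every sentence pair split across the cut; the numbers used below are hypothetical:

```python
def cut_cost(C1, C2, ind1, ind2, assoc):
    """Cost of assigning sentences in C1 to one class and C2 to the other:
    each sentence pays its individual edge to the opposite class node, and
    each split pair pays its association weight."""
    cost = sum(ind2[x] for x in C1)          # x in C1 cuts its edge to class 2
    cost += sum(ind1[y] for y in C2)         # y in C2 cuts its edge to class 1
    cost += sum(assoc.get((x, y), 0.0) + assoc.get((y, x), 0.0)
                for x in C1 for y in C2)     # split pairs pay assoc(x, y)
    return cost
```

In practice one would not enumerate partitions but run a max-flow algorithm; this sketch only illustrates what the min-cut objective charges for.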
The biggest strength of this formulation is that we can use knowledge about the language to set the costs of the edges between a sentence node and a class node, while simultaneously using knowledge-lean methods to set the assoc(x, y) weights. The only remaining problem is how to compute the weights. The individual weights are computed by running a Naive Bayes classifier or an SVM on the sentences of the document. The feature vector f they use is the bag-of-words vector, i.e., f(x) = 1 if the word x is present. The weight of the edge between a sentence and C1, the subjective class, is set to the probability that the sentence is subjective according to the Naive Bayes classifier. The association weights are set as a function of the proximity between the two sentences:

assoc(s_i, s_j) = f(j − i) if (j − i) ≤ T, and 0 otherwise   (10.2)

The different functions tried for f(d) are 1, e^{1−d}, and 1/d². Using this algorithm, the subjective sentences are extracted and sent to a Naive Bayes document polarity classifier. This improves the accuracy of the document polarity classifier to 86.4% compared to using the whole document, while the subjective sentences contain only 60% of the words in the document. The data was taken from movie reviews on rottentomatoes.com. The size of the subjectivity extract can be controlled by using only the N most subjective sentences, i.e., the sentences with the highest subjectivity probability according to the Naive Bayes detector. Using just the top 15 sentences gives the same accuracy as using the whole review. The accuracies of N-sentence extracts with Naive Bayes and SVM as the default polarity classifiers are shown in Figure 10.2.
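The proximity-based association weight of Eq. 10.2 can be sketched with the three decay functions just mentioned; the threshold T = 3 below is an arbitrary choice for illustration:

```python
import math

def assoc_weight(i, j, T=3, decay="reciprocal_sq"):
    """assoc(s_i, s_j) = f(j - i) for sentence positions i, j with
    |j - i| <= T, else 0 (Eq. 10.2). The three decay choices correspond
    to f(d) = 1, e^(1-d) and 1/d^2."""
    d = abs(j - i)
    if d == 0 or d > T:
        return 0.0
    return {"constant": 1.0,
            "exp": math.exp(1 - d),
            "reciprocal_sq": 1.0 / d ** 2}[decay]
```

All three choices give nearby sentences a strong pull toward the same class and distant sentences none at all.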
This paper shows that removing objective sentences can improve the accuracy of the document polarity classifier.

Figure 10.2: Accuracy using N-sentence extracts with Naive Bayes and SVM, Pang and Lee, A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, Proceedings of ACL, pp. 271-278, 2004

10.2.2 Graph-Based Semi-Supervised Learning for sentiment analysis

The problem addressed in this paper by Goldberg and Zhu is the rating-inference problem: given many reviews with corresponding numerical ratings, predict the rating of an unseen review. Unlike the paper by Pang and Lee, Goldberg and Zhu use not only labeled data but also unlabeled data, yielding a semi-supervised algorithm. The rating-inference problem is formally defined as follows: we are given n review documents x_1, x_2, ..., x_n, each represented by a feature vector. The first l ≤ n documents are labeled with ratings y_1, y_2, ..., y_l ∈ C. The remaining documents are unlabeled in the general setting, but can optionally be labeled with predictions ŷ_{l+1}, ..., ŷ_n obtained from another learner. The unlabeled documents also form the test set; this setting is called transduction. The set of numerical ratings is C = {c_1, c_2, ..., c_m} with c_1 ≤ ... ≤ c_m ∈ R. The problem is to find a function f: x → R that gives a continuous rating; classification is then performed by mapping f(x) to the nearest quantized value in C. The graph is built as follows. G = (V, E), where V consists of 2n nodes with weighted edges among some of them. Each document is a node in the graph, and each labeled document has a dongle node attached to it carrying the label of the document. This edge has a large weight M, which represents the influence of the label. Similarly, each unlabeled document is connected to a dongle node carrying the label obtained from the previous learner.
The weight of the edge between an unlabeled document and its dongle node is set to 1. Each unlabeled node is also connected to its k nearest labeled neighbor nodes, with the distance between two documents measured by a similarity measure w. They use two similarity measures: the positive-sentence percentage, and a mutual-information-modulated word-vector cosine similarity. The edge weight between a labeled document x_j and an unlabeled document x_i is a·w_ij. Similarly, each unlabeled document is connected to its k' nearest unlabeled documents with edge weight b·w_ij. In this model k, k', a and b are parameters. It is assumed that connected nodes tend to have similar labels, with strength proportional to the weight of the edge between them; thus the rating function f(x) should be smooth with respect to the graph. The general approach in graph-based semi-supervised learning is to define an energy function over the graph and minimize it. Let L = {1, ..., l} and U = {l+1, ..., n} be the labeled and unlabeled vertices respectively. Then the energy function is defined as

L(f) = Σ_{i∈L} M (f(x_i) − y_i)² + Σ_{i∈U} (f(x_i) − ŷ_i)² + Σ_{i∈U} Σ_{j∈kNN_L(i)} a·w_ij (f(x_i) − f(x_j))² + Σ_{i∈U} Σ_{j∈k'NN_U(i)} b·w_ij (f(x_i) − f(x_j))²

where kNN_L(i) is the set of labeled neighbors of i and k'NN_U(i) is the set of unlabeled neighbors of i. A low value of this function means that each unlabeled node's rating is close to its dongle prediction, to its labeled neighbors, and to its unlabeled neighbors, which is exactly what we want: a smooth function over the graph. This objective is expressed as a matrix operation and a closed-form minimizer is obtained. They tune the parameters of the algorithm using cross-validation, and compare it to two other algorithms, metric labeling and regression, using the reviews of each author as a dataset.
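A direct sketch of the energy function L(f) with all graph structures passed in explicitly; the neighbor sets, weights and example values below are hypothetical:

```python
def energy(f, y, yhat, L, U, knn_L, knn_U, w, M=100.0, a=1.0, b=1.0):
    """Graph energy L(f) for the rating-inference graph: labeled nodes are
    pinned to their labels with weight M, unlabeled nodes to their dongle
    predictions, and neighboring nodes are pulled toward each other in
    proportion to the similarity weights w[i][j]."""
    E = sum(M * (f[i] - y[i]) ** 2 for i in L)
    E += sum((f[i] - yhat[i]) ** 2 for i in U)
    E += sum(a * w[i][j] * (f[i] - f[j]) ** 2 for i in U for j in knn_L[i])
    E += sum(b * w[i][j] * (f[i] - f[j]) ** 2 for i in U for j in knn_U[i])
    return E
```

Minimizing this energy (which the paper does in closed form via a matrix formulation) trades off fidelity to the labels against smoothness over the graph.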
On evaluation it is found that the positive-sentence-percentage similarity performs better than the word-vector cosine similarity. The SSL method performs well for small amounts of labeled data, but when more labeled data is available the SVM regressor achieves better accuracy, probably because the positive-sentence-percentage similarity function is not good enough.

10.3 Other approaches

In the paper by Hurst and Nigam, the authors argue for the fusion of sentiment detection with the assignment of the sentiment to a target, referred to as topicality. In other words, it is not enough to find the polarity of a sentence; the topic of the sentence should also be identified, and we should check whether the sentiment expressed is directed at that topic. As an example, consider a sentiment analyzer used to identify the pros and cons of a product. The sentence "It has a BrightScreen LCD screen and awesome battery life" does not express anything positive towards the BrightScreen LCD screen, the product feature in question, yet a plain sentiment analyzer would evaluate the sentence as positive. They approach this problem by using a topicality detector and a sentiment detector independently of each other, relying on the assumption that the sentiment expressed is usually towards the topic of the sentence. Their polarity detector works as follows. In the first step, the sentences are tokenized and segmented into discrete chunks; the chunking is done by running a statistical tagger for Part-Of-Speech tagging and tagging each word with a positive or negative orientation using a polarity lexicon. As a final step, these chunks are grouped into higher-order groupings and compared with a small set of syntactic patterns to identify the overall polarity of the sentence.
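A heavily simplified, lexicon-only sketch of such a polarity detector with negation flipping is shown below. The actual system matches chunk groupings against syntactic patterns rather than summing word scores, and the lexicon here is a made-up toy:

```python
NEGATIONS = {"not", "never", "no"}

# Hypothetical polarity lexicon; a real system uses a much larger one.
LEXICON = {"good": 1, "great": 1, "awesome": 1, "bad": -1, "broke": -1}

def polarity(sentence):
    """Sum lexicon scores over tokens, flipping the sign of the next
    polar word after a negation word."""
    score, flip = 0, 1
    for tok in sentence.lower().split():
        if tok in NEGATIONS:
            flip = -1
        elif tok in LEXICON:
            score += flip * LEXICON[tok]
            flip = 1
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```

Even this crude sketch handles the negation cases discussed below, such as "it is not good".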
For example, the sentence "This car is really great" is tokenized and POS-tagged into (this DET), (car NN) BNP, (is VB) BVP, (really RR, great JJ) BADJP. The basic chunk categories are DET, BNP, BADVP, BADJP, BVP and OTHER. The POS-tagged sentence is grouped into higher-order groupings and compared to syntactic patterns. The syntactic patterns used are predicative modification ("it is good"), attributive modification ("a good car"), equality ("it is a good car"), and polar cause ("it broke my car"). The system also captures negation; for example, sentences like "it is not good" and "it is never any good" are correctly identified as negative even though they contain positive words. They evaluate their algorithms on 20,000 messages obtained by crawling Usenet, online message boards, etc. Of these messages they found 982 to be about a particular topic and used only those; tokenized, these messages gave rise to 16,616 sentences. On evaluation of their polarity detector, they found that positive polarity was detected with 82% precision and negative polarity with 80% precision. A topicality detector for sentences is very hard to design, since a single sentence offers very few features. Hence they first design a topicality detector for whole messages and then apply the same algorithm to sentences. The document topicality detector uses the winnow algorithm. Documents are modeled using the bag-of-words model, and winnow learns a linear classifier of the following form from the given labeled data:

h(x) = Σ_{w∈V} f_w c_w(x)   (10.3)

where c_w(x) is 1 if the word w is present in document x and 0 otherwise, and f_w is the weight of feature w. If h(x) ≥ V, the classifier predicts topical, otherwise irrelevant. The learning algorithm is given below.

1. Initialize all f_w to 1.
2. For each labeled document x in the training set, calculate h(x). If the document is topical but winnow predicts irrelevant, update each weight f_w where c_w(x) = 1 by f_w ← 2·f_w. If the document is irrelevant but winnow predicts topical, update each weight f_w where c_w(x) = 1 by f_w ← f_w / 2.

This algorithm is adapted to predict topicality for a sentence. If a document is predicted to be irrelevant, then all its sentences are predicted to be irrelevant; otherwise the topicality detection algorithm is run on each individual sentence, treated as a whole document, and its output is the prediction for the sentence. On evaluation it is found that document-level topicality detection has a precision of 85.4% while sentence-level detection has a precision of 79%. However, there is a significant drop in the recall of sentence-level topicality detection, because the number of features at the sentence level is very small: for a sentence to be predicted topical, its words must be very topical, so for truly topical sentences the word weights may not be large enough to overcome the default feature weight, leading to a loss in recall. Finally, they evaluate their original claim that a sentiment expressed is indeed targeted at the topic of the sentence; a precision of 72% demonstrates that the locality assumption is true in most instances.

10.4 Evaluation of Machine learning algorithms for Sentiment Analysis

In this paper, Pang et al. analyze the effectiveness of different machine learning algorithms when applied to sentiment analysis. The accuracy of these algorithms is compared with human-produced baseline performance. The three learning algorithms they evaluate are Naive Bayes, SVM and maximum entropy classification. First, let us consider the human-produced baseline.
In this experiment, two graduate students in computer science were asked to come up with lists of positive and negative words to classify movie reviews. The classification procedure is very simple: if positive words occur more frequently than negative words, the review is classified as positive, and negative otherwise; if the frequencies are equal, it is considered a tie. For both lists the accuracy rate is quite low and the tie rate is very high. They also created a list of words by looking at the test data and investigating the frequency counts; even this list has a low accuracy rate of 69%, though a better tie rate of 16%. The results of this experiment, shown in Figure 10.3, indicate that corpus-based techniques are required to classify by sentiment.

Figure 10.3: Results of the human-produced baseline, Pang et al., Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, pp. 79-86, July 2002

Before discussing the evaluation of the three machine learning algorithms, they explain the feature vector used: the standard bag-of-features framework. Let {f_1, f_2, ..., f_m} be the set of features for document d; example features are unigram presence and bigram presence. Let n_i(d) be the frequency of feature f_i. The document is represented as the vector (n_1(d), n_2(d), ..., n_m(d)).

Naive Bayes: One common approach to text classification is to assign a document d to the class c* = argmax_c P(c | d). By Bayes' theorem,

P(c | d) = P(c) P(d | c) / P(d)   (10.4)

P(d) can safely be left out since it does not affect the choice of c. To compute P(d | c) they assume the features are independent of each other, and therefore use the following formula for P(c | d):

P_NB(c | d) = P(c) Π_{i=1}^{m} P(f_i | c)^{n_i(d)} / P(d)   (10.5)

The training is done by relative-frequency estimation of P(c) and P(f_i | c), using add-one smoothing.
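The Naive Bayes training and classification of Eqs. 10.4-10.5 can be sketched as follows; the toy documents are hypothetical, and a real system would use the unigram/bigram presence features described above:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Relative-frequency estimates of P(c) and P(f_i | c) with add-one
    smoothing (Eq. 10.5); returns a classifier computing argmax_c P(c | d)."""
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    vocab = set()
    for doc, c in zip(docs, labels):
        counts[c].update(doc)
        vocab.update(doc)

    def logp(doc, c):
        # log P(c) + sum_i n_i(d) * log P(f_i | c), with add-one smoothing
        total = sum(counts[c].values()) + len(vocab)
        return math.log(prior[c]) + sum(
            doc.count(w) * math.log((counts[c][w] + 1) / total)
            for w in set(doc))

    return lambda doc: max(classes, key=lambda c: logp(doc, c))
```

P(d) is omitted from the score, as in the text, since it is the same for every class.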
Maximum Entropy (ME): Nigam et al. (1999) showed that maximum entropy classification sometimes outperforms Naive Bayes. It also uses conditional probability to classify; P(c | d) is computed as

P_ME(c | d) = (1/Z(d)) exp( Σ_i λ_{i,c} F_{i,c}(d, c) )   (10.6)

where Z(d) is a normalization function and F_{i,c}(d, c') is a feature function for feature f_i and class c, defined as

F_{i,c}(d, c') = 1 if n_i(d) > 0 and c' = c, and 0 otherwise   (10.7)

One advantage of maximum entropy over Naive Bayes is that ME makes no assumptions about the independence of the feature functions. λ_{i,c} is the weight of the feature function for class c; if this value is large, then feature f_i is a strong indicator of class c. The improved iterative scaling algorithm of Della Pietra et al. (1997) is used for training: the parameter values are tuned to maximize the entropy of the induced distribution, subject to the constraint that the expected values of the feature functions under the model equal their expectations on the training data.

Support Vector Machines: In SVMs the idea is to find a separating hyperplane with a large margin. This hyperplane is used for classification by finding which side of the hyperplane the input falls on. The separating hyperplane normal w is found as

w = Σ_j α_j c_j d_j, with α_j ≥ 0   (10.8)

where the α_j are obtained by solving a dual optimization problem. Those d_j for which α_j > 0 are called support vectors, because they are the only vectors contributing to w. For the evaluation they use the IMDb movie reviews from the rec.arts.movies.reviews newsgroup. In their evaluation they find that the SVM is the best performing algorithm while Naive Bayes is the worst performing one.
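Classification with the support-vector expansion of Eq. 10.8 can be sketched as below. The support vectors and multipliers here are hypothetical, as if already obtained from the dual optimization:

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def svm_predict(support, x):
    """Classify x by the side of the hyperplane w = sum_j alpha_j c_j d_j:
    `support` is a list of (alpha_j, c_j, d_j) triples with c_j in {+1, -1}."""
    score = sum(a * c * dot(d, x) for a, c, d in support)
    return 1 if score >= 0 else -1

# Hypothetical support vectors: one per class, unit multipliers, no bias term.
support = [(1.0, 1, (1.0, 0.0)), (1.0, -1, (0.0, 1.0))]
```

A full SVM also carries a bias term and, for the linear kernel, would precompute w once; the sketch keeps the expanded form to mirror Eq. 10.8.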
They conclude that although the algorithms beat the human-produced baselines, they are still far from the accuracies obtained in topic categorization experiments, mainly because of the mixture of positive and negative features in many documents, which the bag-of-features vector fails to model.

10.5 Conclusion

This chapter covered the fundamentals of sentiment analysis and the different approaches used for detecting the polarity of a sentence. To summarize, the problem of identifying opinions and sentiments is much harder than a general topic classification task. One reason is that the commonly used features capture statistics of the corpus, whereas more semantic knowledge needs to be integrated to obtain a high-precision, high-recall classifier; it could also simply be that the statistical machine learning algorithms underperform because of the features, the most common feature vector being based on the bag-of-features model. It is also important not just to classify a sentence as polar, but to put it in context: we need to bind the detected sentiment to the target of the sentiment.

Chapter 11

Blog Analysis

11.1 Introduction

A Weblog, commonly known as a blog, is a website (usually personal) where entries are displayed in reverse chronological order. Blogs are generally used for providing commentary or news on a particular subject, or as personal diaries. Blogs are heavily hyperlinked, forming many active micro-communities. Technorati.com (a blog search engine) has announced that it is tracking more than 100 million blogs, which means that Blogspace (the space of all blogs) is a vast source of unstructured information. Recently there has been a lot of work on analyzing Blogspace and retrieving information from it. This chapter surveys some papers on the analysis and ranking of blogs.
The chapter is organized in the following way. The analysis of the evolution of Blogspace is covered in Section 11.2, followed by an algorithm for ranking blogs in Section 11.3, information diffusion analysis in Section 11.4, and the conclusion.

11.2 Analysis of Evolution of Blogspace

Blogspace cannot be analyzed with the standard graph model, since a simple directed graph does not capture link creation as a phenomenon at a point in time. The first paper to analyze the evolution of Blogspace, by Kumar et al., identified the need for modeling Blogspace as an evolving directed graph and introduced a new tool called time graphs. The major contributions of this paper are the discovery of the rapid growth of Blogspace around the end of 2001 and of the growth in the number of active micro-communities, along with showing that the latter is not a direct implication of the growth of Blogspace itself. The notion of a community in the web graph is generally characterized by a dense bipartite subgraph, and almost every dense bipartite subgraph was assumed to denote a community. But this idea does not extend well to blogs, since they are heavily influenced by time: a given topic might start an intense debate leading to heavy hyperlinking and then fade away. Thus in Blogspace, communities are identified not just by subgraphs but also by time intervals, though there can also exist communities which persist throughout because of a common interest. As an example of a temporal community, a burst occurred when one member, Firda (http://www.wannabegirl.org), posted a series of daily poems about other bloggers in the community and included links to their blogs; the burst lasted from March to April 2000.

11.2.1 Identifying Communities in Blogspace

The process of identifying communities is split into two sub-processes. The first task is to extract dense subgraphs from the blog graph and the second task is to perform burst analysis on each of the subgraphs.
For this purpose, they define a novel data structure which takes the time of edge creation into account. A time graph G = (V, E) consists of a set of nodes V, where each node v ∈ V has an associated interval D(v) on the time axis (called the duration of v), and a set of edges E, where each edge e ∈ E is a triple (u, v, t) with u and v nodes in V and t a point in time in the interval D(u) ∩ D(v). The prefix of G at time t is the time graph G_t = (V_t, E_t) where V_t = {v ∈ V | D(v) ∩ [0, t] ≠ ∅} and E_t = {(u, v, t') ∈ E | t' ≤ t}. As a preprocessing step for identifying community subgraphs, the blog graph is treated as an undirected graph and nodes that are linked to by a large number of other nodes are removed; the intuition is that these nodes are too well known for the type of communities they want to identify. The actual algorithm consists of two steps, a pruning step and an expansion step. In the pruning step, vertices of degree zero and degree one are removed, and each vertex of degree two is checked for participation in a complete graph on 3 vertices; if it participates, it is passed to the expansion step. The expansion step is an iterative process: in each iteration, the vertex with the maximum number of edges into the current community is identified and added to the community if that number is greater than a threshold t_k that depends on the number of iterations. The expansion step thus yields a set of densely connected nodes, which is the identified community. The burst analysis is done using Kleinberg's burst analysis algorithm, which is described below. In the case of blog graph evolution, an event is defined to be the arrival of an edge. In the Kleinberg algorithm, the generation of a sequence of events is modeled by an automaton with two states, high and low. The inter-arrival times are independently distributed according to an exponential distribution whose parameter depends on the current state.
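Returning to the time-graph definition at the start of this subsection, the prefix G_t can be computed directly from the node durations and the time-stamped edges. This is a minimal sketch with a hypothetical representation (durations as (start, end) pairs, edges as triples):

```python
def prefix(nodes, edges, t):
    """Prefix G_t of a time graph: keep nodes whose duration D(v)
    intersects [0, t] and edges created no later than t.
    `nodes` maps v -> (start, end); `edges` is a list of (u, v, time)."""
    Vt = {v for v, (start, end) in nodes.items() if start <= t and end >= 0}
    Et = [(u, v, t0) for (u, v, t0) in edges if t0 <= t and u in Vt and v in Vt]
    return Vt, Et
```

This is the structure on which the degree, connectivity and community statistics below are computed at each point in time.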
There is a cost associated with each state transition, and the algorithm finds the low-cost state sequence that is most likely to have generated the given event stream.

The blog graph was obtained by crawling data from seven popular blog sites: http://www.blogger.com, http://www.memepool.com, http://www.globeofblogs.com, http://www.metafilter.com, http://blogs.salon.com, http://www.blogtree.com, and the Web Logs subtree of Yahoo!. Four features of Blogspace are analyzed in this experiment. First, the degree distribution is seen to shift upward along the y-axis with time while maintaining the same shape; the in-degree and out-degree distributions do not vary much in shape over time. Both distributions are shown in Fig. 11.1. The next feature analyzed is connectivity, in particular the strongly connected components (SCCs). The corresponding graph, shown in Fig. 11.2, plots the fraction of nodes involved in the three biggest SCCs of the blog graph. The rise in this fraction shows that an increasing number of nodes are involved in the SCCs; in other words, the whole blog graph is becoming increasingly connected. Another interesting feature analyzed is community structure. They show that more and more nodes (blog posts or pages) participate in at least one community, by plotting both the number and the fraction of nodes participating in some community over time. Both graphs, shown in Fig. 11.3, clearly indicate that the blog graph is being structured into many communities. The last feature analyzed is the burstiness of the communities, measured by plotting the number of communities in the high state over time.
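The two-state automaton can be solved with a standard Viterbi-style dynamic program over the inter-arrival gaps. The sketch below is illustrative: the high-state rate multiplier `s`, the transition cost `gamma`, and taking the base rate to be the mean event rate are assumed parameter choices, not the values used in the paper.

```python
import math

def burst_states(gaps, s=2.0, gamma=1.0):
    """Label each inter-arrival gap with the optimal state (0=low, 1=high)
    of a two-state Kleinberg-style automaton, via dynamic programming.
    Gaps are exponentially distributed; the high state fires s times faster,
    and entering the high state incurs transition cost gamma."""
    rate = len(gaps) / sum(gaps)              # base (low-state) event rate
    rates = [rate, s * rate]                  # [low, high]
    cost = [0.0, gamma]                       # starting in high costs gamma
    back = []
    for g in gaps:
        new, ptr = [], []
        for j in (0, 1):
            # negative log-likelihood of gap g under Exponential(rates[j])
            emit = rates[j] * g - math.log(rates[j])
            # moving low -> high costs gamma; all other moves are free
            cands = [cost[i] + (gamma if j > i else 0.0) for i in (0, 1)]
            i_best = min((0, 1), key=lambda i: cands[i])
            new.append(cands[i_best] + emit)
            ptr.append(i_best)
        cost = new
        back.append(ptr)
    # backtrack the minimum-cost state sequence
    j = 0 if cost[0] <= cost[1] else 1
    states = [j]
    for ptr in reversed(back[1:]):
        j = ptr[j]
        states.append(j)
    return states[::-1]
```

On a stream whose gaps suddenly shrink (events arriving faster), the middle stretch is labeled high, which is exactly how a community's bursty interval is detected.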
This graph, too, is clearly increasing, indicating that more and more communities are becoming increasingly active.

Figure 11.1: Degree distribution evolution in Blogspace. Kumar et al., On the bursty evolution of blogspace, Proceedings of the 12th International Conference on World Wide Web, May 20-24, 2003, Budapest, Hungary.

Figure 11.2: Evolution of connectedness in Blogspace (Kumar et al., 2003).

Figure 11.3: Evolution of community structure in Blogspace (Kumar et al., 2003).

All the features taken together clearly show that there was a significant change in the amount of blogging during 2000-01. To see whether these features arise from the sociology of Blogspace, they plot the same features for a random graph generated using a time-evolving version of the Erdős-Rényi model. All the features show the same pattern for the random graph except the community structure graph, which shows that the structuring of the blog graph into communities is not a coincidence but indeed a result of the sociology of Blogspace.

11.3 Ranking of Blogs

The ranking algorithms used by search engines for web pages are not suited for ranking blog posts. The reason is that ranking algorithms used by search engines such as Google exploit the explicit link structure between pages. Such an algorithm would fail for blog posts because bloggers do not post explicit links to the blogs where they read about an article; instead, they post links to the source URL directly. Thus, what we really need is to track information flow and give credit to the blog that was the first to post about the article.
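Concretely, crediting the first poster amounts to building a graph in which a blog that mentions a URL passes weight to blogs that mentioned it earlier, and then ranking blogs on that graph. The sketch below follows this general idea, but the decay function `w`, the per-URL normalization, and the PageRank parameters are all illustrative assumptions, not the exact choices of the paper.

```python
from collections import defaultdict

def implicit_flow_graph(posts, w=lambda delta: 1.0 / (1 + delta)):
    """Build an implicit information-flow graph from (blog, url, day) posts.
    For each URL, a later (or same-day) poster b_i credits each earlier
    poster b_j with weight w(d_i - d_j), normalized per URL (illustrative)."""
    by_url = defaultdict(list)
    for blog, url, day in posts:
        by_url[url].append((blog, day))
    edges = defaultdict(float)
    for url, mentions in by_url.items():
        for bi, di in mentions:
            raw = {bj: w(di - dj) for bj, dj in mentions
                   if bj != bi and dj <= di}
            if not raw:
                continue                     # bi was the first to post url
            total = sum(raw.values())
            for bj, wt in raw.items():
                edges[(bi, bj)] += wt / total
    return edges

def pagerank(edges, d=0.85, iters=50):
    """Weighted PageRank on the flow graph; edge (i, j) means i credits j."""
    nodes = {n for e in edges for n in e}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    out = defaultdict(float)
    for (i, _), wt in edges.items():
        out[i] += wt
    for _ in range(iters):
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for (i, j), wt in edges.items():
            new[j] += d * rank[i] * wt / out[i]
        # blogs that credit no one spread their rank uniformly
        dangling = sum(rank[n] for n in nodes if out[n] == 0)
        for n in nodes:
            new[n] += d * dangling / len(nodes)
        rank = new
    return rank
```

Running this on a toy log where blog A consistently posts URLs before blogs B and C ranks A highest, since A receives all the inferred credit.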
Such a ranking algorithm would then be able to rank blogs based on how important they are for propagating information. To do this, we should infer links beyond the explicit links present on the page, i.e., find out from which post another post could have learned of a URL before posting it. That does not mean we can ignore the explicit links. Two types of explicit link information are available: the blogroll and via links. The blogroll is the set of links found on the left side of the blog page; it is generally the set of links that the blogger visits often, and those bloggers often visit the blog in return. Via links are links to the blog that posted the article first. Unfortunately, these two types of links account for only 30% of the URLs mentioned. This further motivates the need to infer additional links in order to identify information spread in Blogspace.

Whether there exists a link between two blogs through which information could have spread is predicted using an SVM classifier. The similarity function between two nodes is based on the following features:

1. blog_sim: the number of other blogs the two sites have in common
2. link_sim: the number of non-blog links (URLs) shared by the two
3. text_sim: textual similarity
4. Timing and repeated-history information reflecting how frequently one blog is affected before the other

Both blog_sim and link_sim are calculated as a vector-space cosine similarity measure that ranges between 0 (no overlap in URLs) and 1 (complete overlap in URLs). The formula for computing the similarity is given below, where n_A and n_B are the sets of URLs cited by the two blogs:

link_sim(A, B) = ||n_A ∩ n_B|| / (√||n_A|| × √||n_B||)    (11.1)

Textual similarity is calculated as the cosine similarity of term vectors weighted using the TF-IDF scheme. Timing and history information is obtained by answering the following question: of the URLs mentioned in blog A, how many were mentioned in blog B a day earlier?
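Under this definition, link_sim can be computed directly from the two blogs' sets of cited URLs, as a small sketch (the set-based form of the cosine measure in Equation 11.1):

```python
import math

def link_sim(urls_a, urls_b):
    """Cosine similarity between two blogs' sets of cited non-blog URLs:
    |A ∩ B| / (sqrt(|A|) * sqrt(|B|)); 0 = no overlap, 1 = identical sets."""
    if not urls_a or not urls_b:
        return 0.0
    shared = len(urls_a & urls_b)
    return shared / (math.sqrt(len(urls_a)) * math.sqrt(len(urls_b)))
```

blog_sim is the same computation applied to the sets of blogs linked from each site rather than the shared URLs.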
Two SVM classifiers were constructed using these features. The first classified the input into three classes (reciprocated links, one-way links, unlinked); the second differentiated between linked and unlinked only. The second classifier had much better accuracy and hence was used in the final ranking algorithm.

Now we can describe the complete algorithm. The algorithm builds a graph with all the blogs as nodes, and all the inferred links are weighted as follows. Suppose blog b_i cites a URL u after blog b_j, or on the same day. Then the edge weight is assigned as w^u_ji = w(Δd_ji), where Δd_ji = d_j − d_i, d_i is the day on which URL u is posted in blog b_i, and the weight function w is a decreasing function. This weight is normalized as follows:

w^u_ji ← w^u_ji / Σ_{k ∈ n_j} w^u_jk    (11.2)

After all the URLs are considered, all parallel directed edges are merged and the weights of each edge are summed. This graph is called the implicit information flow graph. PageRank is then run on this graph to rank the blogs. The top 20 results obtained using this algorithm and using PageRank alone are completely different, which indicates that information flow is left out entirely by the Google search engine and other search engines that rely only on explicit link structure.

11.4 Conclusion

This chapter covered an extensive analysis of Blogspace and explained why the explicit link structure alone is not good enough to rank blogs. It also explained how links between blogs through which information might have spread can be inferred, and how this information can be used to rank blogs better.

Chapter 12

References

1. Dragomir Radev, Eduard Hovy, Kathleen McKeown. Introduction to the special issue on summarization. Computational Linguistics, 28(4):399-408, December 2002.

2. Güneş Erkan, Dragomir Radev. LexRank: graph-based lexical centrality as salience in text summarization. JAIR, 2004.

3. Stergos D.
Afantenos, Vangelis Karkaletsis, Panagiotis Stamatopoulos. Summarization from medical documents: a survey. Artificial Intelligence in Medicine, 33(2):157-177, February 2005.

4. Dragomir Radev. Information Retrieval tutorial presented at ACM SIGIR 2004. http://www.summarizatio

5. Peter G. Doyle (Dartmouth), J. Laurie Snell (Dartmouth). Random Walks and Electric Networks. math.PR Probability Theory. http://front.math.ucdavis.edu/0001.5057

6. Xiaojin Zhu. Semi-supervised learning literature survey. Technical Report 1530, Department of Computer Sciences, University of Wisconsin, Madison, 2005. http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf

7. Xiaojin Zhu, John Lafferty, Zoubin Ghahramani. Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. In ICML 2003 Workshop on The Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, 2003. http://pages.cs.wisc.edu/~jerryzhu/pub/zglactive.pdf

8. Güneş Erkan. Graph-based semi-supervised learning using harmonic functions. http://tangra.si.umich.edu/~radev/767/gunes2.pdf

9. Avrim Blum, Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the 11th Annual Conference on Computational Learning Theory, pages 19-26, 2001. Morgan Kaufmann.

10. Thorsten Joachims. Transductive learning via spectral graph partitioning. In Tom Fawcett and Nina Mishra, editors, Proceedings of the 20th International Conference on Machine Learning, pages 290-297, 2003. AAAI Press.

11. David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Meeting of the Association for Computational Linguistics, pages 189-196, 1995. Morgan Kaufmann.

12. David Yarowsky. Decision lists for lexical ambiguity resolution: application to accent restoration in Spanish and French.
In Proceedings of the 32nd Meeting of the Association for Computational Linguistics, pages 88-95, 1994. Morgan Kaufmann.

13. Hema Raghavan, Omid Madani, Rosie Jones. "Active Learning on Both Features and Documents." Journal of Machine Learning Research (JMLR), 7(Aug):1655-1686, 2006.

14. M. E. J. Newman. The structure and function of complex networks. SIAM Review, 45(2):167-256, 2003.

15. M. E. J. Newman. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5):323-351.

16. M. E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74, 2006.

17. Donald Hindle, Mats Rooth. Structural ambiguity and lexical relations. Meeting of the Association for Computational Linguistics, 1993.

18. Adwait Ratnaparkhi. Statistical models for unsupervised prepositional phrase attachment. CoRR cmp-lg/9807011, 1998. Adwait Ratnaparkhi, Jeffrey C. Reynar, Salim Roukos. A maximum entropy model for prepositional phrase attachment. HLT 1994.

19. Kristina Toutanova, Christopher D. Manning, Andrew Y. Ng. Learning random walk models for inducing word dependency distributions. In Proceedings of ICML 2004.

20. Eric Brill, Philip Resnik. A rule-based approach to prepositional phrase attachment disambiguation. COLING 1994: 1198-1204.

21. Michael Collins, James Brooks. Prepositional phrase attachment through a backed-off model. Third Workshop on Very Large Corpora, 1995.

22. Shaojun Zhao, Dekang Lin. A nearest-neighbor method for resolving PP-attachment ambiguity. In Proceedings of IJCNLP-04.

23. http://dp.esslli07.googlepages.com/

24. J. Nivre. Dependency Grammar and Dependency Parsing. MSI report 05133. Växjö University: School of Mathematics and Systems Engineering, 2005.

25. J. Nivre, J. Hall, J. Nilsson, A. Chanev, G. Eryigit, S. Kübler, S. Marinov, E. Marsi. MaltParser: a language-independent system for data-driven dependency parsing.
Natural Language Engineering, 13(2):95-135, 2007.

26. R. McDonald, F. Pereira, K. Crammer, K. Lerman. Global inference and learning algorithms for multi-lingual dependency parsing. Unpublished manuscript, 2007.

27. R. McDonald, J. Nivre. Characterizing the errors of data-driven dependency parsing models. Proceedings of the Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2007.

28. Peter F. Brown, John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Fredrick Jelinek, John D. Lafferty, Robert L. Mercer, Paul S. Roossin. A statistical approach to machine translation. Computational Linguistics, 16(2):79-85, June 1990.

29. Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, Robert L. Mercer. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2), June 1993.

30. Yaser Al-Onaizan, Jan Curin, Michael Jahr, Kevin Knight, John Lafferty, Dan Melamed, Franz-Josef Och, David Purdy, Noah A. Smith, David Yarowsky. Statistical machine translation: final report. Technical report, The Center for Language and Speech Processing, Johns Hopkins University, Baltimore, Maryland, USA, September 1999.

31. Kevin Knight. A Statistical MT Tutorial Workbook, 1999. Available at http://www.isi.edu/natural-language/mt/wkbk.rtf.

32. Franz Josef Och. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Meeting of the Association for Computational Linguistics, pages 160-167, 2003. MIT Press.

33. Franz Josef Och. GIZA++: training of statistical translation models, 2000.

34. Franz Josef Och, Hermann Ney. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Meeting of the Association for Computational Linguistics, pages 295-302, 2002. MIT Press.

35. Franz Josef Och, Hermann Ney. The alignment template approach to statistical machine translation.
Computational Linguistics, 30(4):417-449, December 2004.

36. Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, Kenji Yamada. Fast decoding and optimal decoding for machine translation. In Proceedings of the 39th Meeting of the Association for Computational Linguistics, pages 228-235, 2001. MIT Press.

37. Franz Josef Och, Daniel Gildea, Sanjeev Khudanpur, Anoop Sarkar, Kenji Yamada, Alex Fraser, Shankar Kumar, Libin Shen, David Smith, Katherine Eng, Viren Jain, Zhen Jin, Dragomir Radev. A smorgasbord of features for statistical machine translation. In Proceedings of the Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics Annual Meeting, pages 161-168, 2004. MIT Press.

38. Kenji Yamada, Kevin Knight. A syntax-based statistical translation model. In Proceedings of the 39th Meeting of the Association for Computational Linguistics, pages 523-530, 2001. MIT Press.

39. Philipp Koehn, Franz Josef Och, Daniel Marcu. Statistical phrase-based translation. In Proceedings of the Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics Annual Meeting, 2003. MIT Press.

40. Kenji Yamada, Kevin Knight. A decoder for syntax-based statistical machine translation. In Proceedings of the 40th Meeting of the Association for Computational Linguistics, pages 303-310, 2002. MIT Press.

41. Eugene Charniak, Kevin Knight, Kenji Yamada. Syntax-based language models for statistical machine translation. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT '03), April 23-25, 2003.

42. Bo Pang, Lillian Lee. A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, July 21-26, 2004, Barcelona, Spain.

43.
Andrew Goldberg, Xiaojin "Jerry" Zhu. Seeing stars when there aren't many stars: graph-based semi-supervised learning for sentiment categorization. HLT-NAACL 2006 Workshop TextGraphs: Graph-based Algorithms for Natural Language Processing.

44. M. Hurst, K. Nigam. Retrieving topical sentiments from online document collections. In Document Recognition and Retrieval XI, pages 27-34, 2004.

45. Bo Pang, Lillian Lee, Shivakumar Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP).

46. E. Adar, L. Zhang, L. Adamic, R. Lukose. Implicit structure and the dynamics of blogspace. Presented at the Workshop on the Weblogging Ecosystem at the 13th International World Wide Web Conference, New York, May 18, 2004.

47. Daniel Gruhl, R. Guha, David Liben-Nowell, Andrew Tomkins. Information diffusion through blogspace. Proceedings of the 13th International Conference on World Wide Web, May 17-20, 2004, New York, NY, USA.

48. Ravi Kumar, Jasmine Novak, Prabhakar Raghavan, Andrew Tomkins. On the bursty evolution of blogspace. Proceedings of the 12th International Conference on World Wide Web, May 20-24, 2003, Budapest, Hungary.
