Helmholtz Principle-Based Keyword Extraction

Anima Pradhan
Department of Computer Science and Engineering
National Institute of Technology Rourkela
Rourkela-769 008, Odisha, India
Thesis submitted in
May 2013
to the department of
Computer Science and Engineering
of
National Institute of Technology Rourkela
in partial fulfillment of the requirements
for the degree of
Master of Technology
in
Computer Science and Engineering
Specialization : Computer Science
by
Anima Pradhan
[ Roll No. 211cs1048 ]
under the guidance of
Asst. Prof. Korra Sathya Babu
Department of Computer Science and Engineering
National Institute of Technology Rourkela
Rourkela-769 008, Odisha, India
Department of Computer Science & Engineering
National Institute of Technology Rourkela
Rourkela-769 008, Odisha, India.
www.nitrkl.ac.in
Korra Sathya Babu
Asst. Professor
May, 2013
Certificate
This is to certify that the work in the thesis entitled Helmholtz Principle-Based Keyword
Extraction by Anima Pradhan, bearing Roll No. 211CS1048, is a record of original
research work carried out by her under my supervision and guidance in partial
fulfillment of the requirements for the award of the degree of Master of Technology
in Computer Science and Engineering. Neither this thesis nor any part of it has been
submitted for any degree or academic award elsewhere.
Korra Sathya Babu
Department of Computer Science and Engineering
National Institute of Technology Rourkela
Rourkela-769 008, India.
www.nitrkl.ac.in
Acknowledgment
I would like to express my gratitude to my thesis guide, Asst. Prof. Korra Sathya Babu, for his useful comments, remarks, and engagement throughout the learning process of this master's thesis. The flexibility of work he has offered me has deeply encouraged me in producing this research.

Furthermore, I would like to thank Dr. Pankaj Kumar Sa for being a source of support and motivation for carrying out quality work.

My hearty thanks go to Mr. Sambit Bakshi for consistently showing me innovative research directions throughout the entire period of the research and for helping me shape the thesis. I would also like to thank the participants in my survey, who willingly shared their precious time during the interviews. I would like to thank my loved ones, who have supported me throughout the entire process, both by keeping me harmonious and by helping me put the pieces together. I will be grateful forever for your love.

Last but not least, I would like to thank my family and the one above all of us, the omnipresent God, for answering my prayers and giving me the strength to plod on despite my constitution wanting to give up. Thank you so much, Dear Lord.
Anima Pradhan
Abstract
In today's world of evolving technology, everybody wishes to accomplish tasks in the least time. As the information available online grows every day, it becomes very difficult to summarize more than a hundred documents in acceptable time. Thus, "text summarization" is a challenging problem in the area of Natural Language Processing (NLP), especially in the context of global languages.

This thesis first surveys a taxonomy of text summarization from different aspects. It briefly explains different approaches to summarization and the evaluation parameters. Also presented are thorough details and facts about more than fifty automatic text summarization systems, to ease the job of researchers and serve as a short encyclopedia for the investigated systems.

Keyword extraction methods play a vital role in text mining and document processing. Keywords represent the essential content of a document, and text mining applications take advantage of keywords for processing documents. A quality keyword is a word that represents the exact content of a subset of the text. It is very difficult to process a large number of documents to obtain high-quality keywords in acceptable time.

This thesis compares the most popular keyword extraction method, tf-idf, with a proposed method based on the Helmholtz Principle. The Helmholtz Principle builds on ideas from image processing and is derived from the Gestalt theory of human perception. We also investigate the run time needed to extract keywords by both methods. Experimental results show that the keyword extraction method based on the Helmholtz Principle outperforms tf-idf.

Keywords: Text Mining, Text Summarization, Stemming, Helmholtz Principle, Information Retrieval, Keyword Extraction, Term Frequency - Inverse Document Frequency.
Contents

Certificate  ii
Acknowledgement  iv
Abstract  v
List of Algorithms  viii
List of Figures  ix
List of Tables  x

1 Introduction  1
  1.1 Text Summarization  1
    1.1.1 Input Factor  2
    1.1.2 Purpose Factors  4
    1.1.3 Output Factors  5
  1.2 Keyword Extraction  6
    1.2.1 Application  6
    1.2.2 Motivation  7
    1.2.3 Thesis Outline  8

2 Related Work  9
  2.1 Taxonomy of Text Summarization  9
    2.1.1 Extractive Summarization Method  10
    2.1.2 Approaches of Text Summarization  10
    2.1.3 Abstractive Text Summarization  25
  2.2 Evaluation Measure  26
  2.3 Keyword Extraction  33

3 Comparison between Performances of two Keyword Extraction Methods  35
  3.1 Term Frequency-Inverse Document Frequency (TF-IDF)  35
  3.2 Optimization of Meaningful Keywords Extraction using Helmholtz Principle  36

4 Evaluation and Results  42
  4.1 Text Summarization Systems  42
  4.2 Experimental results on Keyword Extraction Methods  83

5 Conclusion and Future work  87

Bibliography  89

List of Algorithms

3.1 Calculate NFA(N, L, M, k, m)  39

List of Figures

3.1 (untitled)  37
3.2 (untitled)  41
4.1 (untitled)  84
4.2 (untitled)  84
4.3 (untitled)  85
4.4 (untitled)  85

List of Tables

4.1 Overview of Text Summarization Systems  44
Chapter 1
Introduction
In this thesis, we briefly survey automatic text summarizers, which helps us understand what text summarization is and what it can be useful for. We also propose approaches for comparing keyword extraction based on term weighting and on the Helmholtz Principle over multiple documents. We focus on two text mining tasks: text summarization and keyword extraction. We aim to identify and tackle the challenges of multiple documents and compare the performance of the proposed approaches against a wide range of existing methods. Text mining, sometimes referred to as text data mining and roughly equivalent to text analytics, is the process of deriving high-quality information from text. It is a well-researched field; for instance, during the 1990s and early 2000s text summarization received a lot of attention due to its relevance to both information retrieval and machine learning.

There are several approaches to term weighting, of which Term Frequency - Inverse Document Frequency (TF-IDF) is probably the most often used. It is an approach that relies heavily on term frequency (TF), i.e., a statistic of how many times a word appears within a document. In many cases, TF is a good statistic for measuring the importance of a word: if it occurs often, it could be important.
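As a concrete sketch of this weighting, the common tf x log(N/df) formulation can be computed as follows (the exact TF-IDF variant used later in this thesis may differ; the toy documents are invented for illustration):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for each term in each tokenized document.

    docs: list of documents, each a list of lowercased word tokens.
    Returns a list of {term: weight} dicts, one per document.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [["helmholtz", "principle", "keyword"],
        ["keyword", "extraction", "keyword"],
        ["text", "summarization"]]
for w in tf_idf(docs):
    print(sorted(w.items(), key=lambda kv: -kv[1])[:2])
```

Note how a term that occurs in every document gets weight zero (log of 1), which is exactly how IDF damps frequent but uninformative words.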
1.1 Text Summarization
A summary is a reduced transformation of an original text through selection and generalization of its important concepts [2]. A summarization model consists of three stages:

• Interpretation : the original text is converted into a structured representation so that the necessary computation and modification can be performed on it.

• Transformation : the structured representation is converted into a summary representation.

• Generation : the summary representation is converted into summary text.
Effective summarizing requires an explicit and detailed analysis of context factors, as is apparent when we recognize that what summaries should be like is defined by what they are wanted for, as well as by what their sources are like [3]. Three main context factors are distinguished:
1.1.1 Input Factor
The features of the input document can affect the resulting summary according to the following aspects:

Document Structure

Structure is the explicit organization of an input document; examples are headers, chapters, sections, lists, tables, etc. The structure of the document should be well organized so that this information can be used to analyze the document. The summarizer PALSUMM [4] creates summaries by choosing sentences, or parts of sentences, corresponding to nodes at a given level of depth of a tree-structured representation of the text, producing excellent summaries of the original text; [5] shows structural properties of medical articles.
Domain

Domain-sensitive systems are able to obtain summaries of a single or specific topic domain (e.g. all of medicine as a single domain) with varying degrees of probability. For example, [6] applied two independent methods (BIOChain and FreqDist) for identifying salient sentences in biomedical texts. [7] shows how argumentation schemes and story schemes form the most relevant forms of commonsense knowledge in the context of reasoning with evidence. There are other domain-specific summarizers for different kinds of documents.
Specialization level
A text may be broadly characterized as ordinary, specialized, or restricted, in relation to the presumed subject knowledge of the source text's readers. This aspect can be considered the same as the domain aspect.
Language
The language of the input can be general language or restricted to a sublanguage within a domain, purpose, or audience. A summarization algorithm may or may not use language-dependent information. Considering specific form factors, TIDES includes information detection, extraction, summarization, and translation, focusing currently on English, Chinese, and Arabic, with some research on Korean and Spanish. The LDC works on the Chinese and Arabic languages. English has been the main language (see DUC), with substantial effort in Japanese (see NTCIR), work on Chinese and German, and both raw Arabic and automatically translated Arabic news in DUC.
Media
Although our main focus is textual summarization, summaries can also be made of non-textual documents such as audio, video [8], multimedia, and images. Multimedia resources can be summarized with the DREL technology [9]. To achieve consistency of image content representation and high-quality results, image-based summarization needs to be geared toward specific image types [10].
Unit
A different number of documents can be used to create a summary. If only a single document is used, the system is called a Single-Document Summarization system; if more than one document is used, it is called a Multi-Document Summarization system. A multi-document summarization system does not simply shorten the source texts but presents information organized around the key aspects, so as to represent a wider diversity of views on the topic. Summarizers developed by Columbia University for different kinds of documents include SUMMONS, MultiGen, and FociSum; the University of Southern California produced the summarizer Summarist. SUMMONS and MultiGen work on the news domain, whereas FociSum is based on a question-and-answer approach; Summarist produces summaries of Web documents.
Genre
Some systems exploit typical genre-determined characteristics of texts, like the pyramidal organization of newspaper articles, the development of scientific articles, etc. Some summarizers are independent of the type of document, but others are specialized for certain types of documents, like broadcast fragments [11], e-mails [12][13], web pages [14], news, medical articles [15], scientific articles [16], news agency articles [17], etc.
Scale
Scale refers to the length of the input source, which can vary. Longer documents, like reports and books, contain more important informative parts and cover more topics, whereas shorter documents, like news articles, contain less information and cover fewer topics.
1.1.2 Purpose Factors

These describe for what purpose we are summarizing. Purpose factors fall under three categories:
Situation
Situation is the context of the summary: it refers to the environment where the summary is to be used, i.e., by whom, for what purpose, and when it will be used, which may or may not be known. If it is known in advance, the summary can fulfill the requirements of its context. For example, medical literature on the web is an important source for helping clinicians in patient care.
Audience
Audience refers to the readers for whom summarization is to be done. It may be done
according to the interest of the audience.
Use
Use refers to the reason the summarization is done. Summaries can be used for retrieving information, for developing search engines, as information-covering substitutes for their source text, or as devices for refreshing the memory of an already-read source.
1.1.3 Output Factors

There are at least three major output factors:
Material
The summary of a document can contain all the important concepts of the original document or only some aspects of it. Summaries may be designed to contain some specific type of information, such as what was observed in a paper, the plot, etc. Generic summaries cover all important concepts, whereas query-based summaries cover those related to the need of the user.
Format
The created summary can be organized into different sections, with headings, etc., as in some journal papers (e.g. abstracts or test results).
Style
A summary can be:

1. Informative : it covers the concepts of the original document.

2. Indicative : it gives a brief explanation of the original document.

3. Aggregative : it gives partial information that is not covered in the original document.

4. Critical : it reviews the summary, indicating whether it is wrong, right, or requires some modification.
1.2 Keyword Extraction
Keyword extraction is highly related to automated text summarization: in text summarization, the most indicative sentences are extracted to represent the text. In order to utilize the information from short documents, whether we want to categorize the text or extract information from it, we need to identify which words are the most important within the text. This can be achieved by various methods. We focus on comparing the performance of two keyword extraction methods on a very large data set: the very popular term weighting method TF-IDF, and another method based on the Helmholtz Principle.

The Helmholtz Principle builds on ideas from image processing, in particular on the Helmholtz Principle from the Gestalt theory of human perception. According to this basic principle of perception, due to Helmholtz, an observed geometric structure is perceptually meaningful if it has a very low probability of appearing in noise.
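Applied to text, one common formalization of this idea (used, e.g., in Balinsky et al.'s work on Helmholtz-based text mining; taken here as an illustrative assumption, not as the exact Calculate NFA(N, L, M, k, m) procedure of Chapter 3) declares a word meaningful in one part of a document when its "number of false alarms" is below 1:

```python
from math import comb

def nfa(K, m, N):
    """Number of false alarms for a word occurring m times inside one of
    N equal parts of a document, given K occurrences overall:
    NFA = C(K, m) / N^(m - 1)."""
    return comb(K, m) / N ** (m - 1)

def meaningful(K, m, N, epsilon=1.0):
    # A word is epsilon-meaningful when the expected number of such
    # chance concentrations in noise is below epsilon.
    return nfa(K, m, N) < epsilon

# A word seen 10 times in total, 6 of them inside one of 10 parts,
# is far too concentrated to be noise:
print(nfa(10, 6, 10))          # -> 0.0021 (< 1, hence meaningful)
print(meaningful(10, 6, 10))   # -> True
```

The same word spread evenly (say, 2 occurrences per part) yields NFA > 1 and is treated as unremarkable background, which is exactly the "low probability in noise" intuition above.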
1.2.1 Application
Automatic text summarization can be used:
• To summarize news to SMS or WAP format for mobile phones/PDAs.

• To let a computer synthetically read the summarized text; full written texts can be too long and boring to listen to.

• In search engines, to present compressed descriptions of the search results (see the Internet search engine Google).

• To search in foreign languages and obtain an automatically translated summary of the automatically summarized text.
In this chapter, the motivation of the research and the outline of the work are introduced.
1.2.2 Motivation
Due to the growth of online information, it is difficult for human beings to accomplish their natural language processing tasks in the stipulated time. The huge number of documents available in digital media makes it difficult to obtain the necessary information related to the needs of a user. To solve this issue, text summarization systems can be used. A text summarization system extracts brief information from a given document while preserving the important concepts of that document. Using the summary produced, a user can decide whether a document is related to his/her needs without reading the whole document. Other systems, such as search engines and news portals, can also use document summaries to perform their jobs more efficiently.

To extract important information or sentences as per user requirements, high-quality keywords play a crucial role; they help users search for information more efficiently. Extracting high-quality keywords automatically is expensive and time consuming. This shows that keyword extraction in acceptable time is a challenging problem in the area of natural language processing, especially in the context of global languages.

Keyword annotations of a document can be used to build keyword queries. In an electronic magazine, keywords give a clue about the main idea of an article; in a book, they quickly lead the reader to the whereabouts of the information sought. On the Web, tag annotations help to find multimedia and other resources. Moreover, the creation of annotations is time consuming, so automatic ways of extracting keywords from a document are required.

Many algorithms have been proposed for automatic keyword extraction. The Helmholtz Principle has been developed for mining textual, unstructured, or sequential data. Here we define a new measure of meaningful keywords with good performance on different types of documents. TF-IDF is a successful and well-tested technique in Information Retrieval, so we compare this most popular method with the proposed algorithm based on the Helmholtz Principle for a large number of documents.
1.2.3 Thesis Outline
The rest of the thesis is organized as follows:

Chapter 2 presents the related work on document summarization and keyword extraction methods. The taxonomy of text summarization systems, the text summarization approaches in the literature, and the evaluation measures of text summarization systems are explained briefly, and approaches to keyword extraction are presented.

Chapter 3 explains briefly the automatic text summarizer systems with their features. It also explains the TF-IDF method and Helmholtz Principle-based keyword extraction, and presents the proposed algorithm for keyword extraction.

Chapter 4 presents the experimental results.

Chapter 5 presents the concluding remarks.
Chapter 2
Related Work
2.1 Taxonomy of Text Summarization
A summary can be categorized in different ways according to its characteristics.

• Based on number of source documents : if a single document is used for summarization, it is known as Single-Document Summarization; if more than one document is used, it is known as Multi-Document Summarization.

• Based on summary usage :

– Generic Summarization : the whole document is used for creating the summary.

– Query-based Summarization : the summary is restricted to a specific topic related to the query.

• Based on techniques :

– Supervised Summarization : the training data set is known.

– Unsupervised Summarization : the training data set is not known.

• Based on characteristics of the summary as text :

– Extractive : the most important information or sentences are selected from the input document to create a summary.

– Abstractive : the machine needs to understand the concepts of all the input documents and then produce a summary in its own sentences.
[Figure: Taxonomy of Text Summarization — branches: based on number of source documents (single document, multi document); based on summary usage (generic, query based); based on techniques (supervised, unsupervised); based on characteristics of summary as text (extractive, abstractive); based on level of linguistic process (shallow approach, deeper approach).]
• Based on the level in the linguistic space :

– Shallow approaches : related to the syntactic level of representation.

– Deeper approaches : related to the semantic level of representation, allowing linguistic processing at some level.
2.1.1 Extractive Summarization Method

This method finds the most important information or sentences in the input document to create a summary. There are different levels of processing for obtaining the most informative parts or the highest-level concepts from the input document. Based on these levels of processing, text summarization is categorized into different approaches.
2.1.2 Approaches of Text Summarization
Statistical Approaches
In 1958, [18] described that term frequency gives a useful measurement of significance: a sentence is significant if the frequency of a particular term (or word) is high in the article. Term frequency is the number of occurrences of a word. For the implementation, he proposed some key ideas:

• Stemming : in a document, some words occur in different variants: singular versus plural, present versus past tense, written in small or capital letters, etc. For example, "School", "schools", and "SCHOOL" are all the same word. A stemmer is a tool that reduces a word to its root form; for example, "reads", "reading", and "read" are all stemmed to "read", so the frequency of "read" becomes 3. The advantage is that stemming reduces the memory usage for storing words. A well-known stemmer for English documents is the Porter Stemmer (Porter Stemmer, 2000). In 1996, Rao et al. [19] and, in 2012, U. Mishra et al. proposed the stemmer MAULIK [20] for Hindi documents, and in 1999 the Zemberek Morphological Analyzer was proposed for Turkish documents.
• Stop word Removal : stop words are words that do not convey any significant semantics to the text, such as "the", "a", "an", "from", "to", "of", etc. Stop word removal is done using a human-made list of words; this list differs from language to language. The author applied this scheme to a set of 50 articles: the sentences comprising the most significant words were set as the highest-ranking sentences, all sentences were kept in decreasing order of rank, and the sentences whose rank exceeded a predefined threshold value were extracted.

In 1969, Edmundson [21] introduced four basic methods for an automatic extracting system, based on assigning to text sentences numerical weights that are functions of the weights assigned to certain machine-recognizable characteristics. These four basic methods are:

1. Cue Method : the relevance of a sentence is affected by the presence of pragmatic words ("significant", "impossible", "hardly"). In this method, the Cue dictionary comprises three sub-dictionaries: Bonus words (positively relevant), Stigma words (negatively relevant), and Null words (irrelevant).

2. Key Method : according to this method, more frequent words are positively relevant. First, the total number of occurrences of each word in the document is found. The words are sorted in decreasing order of frequency, and the words whose frequencies lie above a threshold are taken as Key words and assigned positive weights equal to their frequency of occurrence. The final Key weight of a sentence is the sum of the Key weights of its constituent words.

3. Title Method : in this method, the machine recognizes certain specific characteristics of the document, like its title, headings, and format. The Title method compiles, for each document, a Title glossary comprising the non-Null words of the title, subtitle, and headings of that document. Words in the Title glossary are assigned positive weights. The final weight of each sentence is the sum of the Title weights of its constituent words.

4. Location Method : in the Location method, the sentences that appear under specific headings are positively relevant. The method selects the headings of the documents appearing in the corpus and stores them in a Heading dictionary. Heading words mostly appear in the "Introduction", "Purpose", and "Conclusion" parts of a document. The final Location weight of each sentence is the sum of its heading weights.

The author applied these methods to a set of 400 documents and found that the Cue-Title-Location combination gives the highest mean co-selection score, while the Key method alone gives less. Edmundson settled on the following features for extracting sentences:
• Sentence Length Cut-off Feature : a sentence longer than a pre-specified threshold value is more important than a shorter sentence.

• Fixed-Phrase Feature : sentences containing a fixed phrase like "this letter" or "in conclusion", or following immediately after a heading containing a keyword like "conclusion", "results", or "summary", are more important.

• Paragraph Feature : if a paragraph contains more than one sentence, the importance of a sentence is based on its position: paragraph-initial, paragraph-final, or paragraph-medial. A paragraph-initial sentence is more important than a paragraph-final one.

• Thematic Word Feature : the most frequent content words are known as thematic words. A sentence is scored as a function of their frequency.

• Uppercase Word Feature : proper names are important, e.g. "ASTM (American Society for Testing Materials)". This feature is computed with the constraints that an uppercase thematic word is not sentence-initial and begins with a capital letter.

Other statistical approaches used for keyword extraction are TF-IDF, entropy, mutual information, and related statistics [22],[23].
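The frequency-based scoring used by the Key method above can be sketched as follows; the stop list, threshold, and example sentences are illustrative choices, not those actually used in [18] or [21]:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "from", "to", "of", "is", "in", "and"}

def key_weights(sentences, threshold=2):
    """Key method sketch: words above a frequency threshold become
    Key words, weighted by their frequency of occurrence."""
    words = [w for s in sentences for w in s.lower().split()
             if w not in STOP_WORDS]
    freq = Counter(words)
    return {w: c for w, c in freq.items() if c >= threshold}

def rank_sentences(sentences, threshold=2):
    """Score each sentence as the sum of the Key weights of its words,
    and return the sentences in decreasing order of score."""
    keys = key_weights(sentences, threshold)
    def score(s):
        return sum(keys.get(w, 0) for w in s.lower().split())
    return sorted(sentences, key=score, reverse=True)

sents = ["Keyword extraction finds keyword candidates",
         "The weather is nice",
         "Extraction of a keyword depends on frequency"]
print(rank_sentences(sents)[0])
```

An extractive summarizer in this style would then keep the top-ranked sentences whose score exceeds the predefined threshold.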
Coherent Based Approach
A coherence-based approach basically deals with the cohesion relations among words. Cohesion relations among elements in a text include reference, ellipsis, substitution, conjunction, and lexical cohesion [24].

• Lexical chain : a lexical chain is a method of identifying a set of words which are semantically related. The semantic relationships among the words can be systematic-semantic or non-systematic-semantic. Semantically related words can be extracted using dictionaries and WordNet.

• WordNet : in NLP, WordNet is used for measuring conceptual similarity and relatedness information from a document. Concepts can be related in many ways beyond being similar to each other; for example, a wheel is a part of a car, night is the opposite of day, and so forth [25],[26].
[27] describes four features based on lexical chains:

• Lexical chain score of a word : a word can be a member of more than one lexical chain, as it can appear in the same text with different senses. The score of a lexical chain depends on the relations appearing in the chain.

• Direct lexical chain score of a word : the score is calculated based on the relations that belong to the word.

• Lexical chain span score of a word : it depends on the portion of the text that is covered by the lexical chain. The covered portion of the text is the distance between the position of the lexical chain member that occurs first in the text and the position of the member that occurs last.

• Direct lexical chain span score of a word : it is computed in the same way as the lexical chain span score, except that only the words directly related to the word in the lexical chain are considered.

The author applied these four features to a corpus of 155 abstracts and obtained 45% precision in the extraction of keywords. [27] also proposes a CRF-based keyword extraction approach. Rhetorical Structure Theory (RST) based methods are another example of coherence-based summarization.

• RST : it organizes texts into tree-like structures to represent the coherence relations among the words [28]. [29] proposes an automatic summarizer, GIST, based on RST processes. [30] proposes an automatic text summarization method based on RST, in which the author assigned weights to the sentences in RS-trees according to their utility and cut out lower-weight nodes. As a result, the system generates complete, cohesive, and readable summaries on the basis of the relations between sentences in the original text.
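The span score described above can be sketched as follows, with a lexical chain represented simply as a set of member words (a hypothetical simplification; [27] scores chains over their semantic relations, not plain word sets):

```python
def chain_span_score(text_tokens, chain_members):
    """Lexical chain span score sketch: the fraction of the text covered
    between the first and last occurrence of any chain member."""
    positions = [i for i, w in enumerate(text_tokens) if w in chain_members]
    if not positions:
        return 0.0
    return (positions[-1] - positions[0] + 1) / len(text_tokens)

tokens = "the car has a wheel and the wheel turns on the road".split()
# The chain {car, wheel, road} spans positions 1..11 of 12 tokens.
print(chain_span_score(tokens, {"car", "wheel", "road"}))
```

A chain whose members are spread across the whole text gets a span score near 1, signalling a topic that runs through the document rather than a local mention.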
Graph Based Approach
Well-known graph-based algorithms are HITS and Google's PageRank [31].
• HITS (Hyperlink-Induced Topic Search) :

It is a ranking algorithm for web pages developed by Jon Kleinberg. It determines two sets of scores: authority scores (pages with a large number of incoming links) and hub scores (pages with a large number of outgoing links) [32].

HITS_A(V_i) = Σ_{V_j ∈ In(V_i)} HITS_H(V_j),    HITS_H(V_i) = Σ_{V_j ∈ Out(V_i)} HITS_A(V_j)    (2.1)
• Google's PageRank Algorithm :

It is a ranking algorithm to determine the quality of web pages, used by Google to improve search results and named after Larry Page [33]. PageRank integrates both incoming and outgoing links into one single model, and therefore produces only one set of scores:

PR(V_i) = (1 − d) + d · Σ_{V_j ∈ In(V_i)} PR(V_j) / |Out(V_j)|    (2.2)
where d is a damping parameter set between 0 and 1. Aardvark is a social search engine based on the village paradigm [34]. Miles Efron proposes a PageRank-style algorithm for microblog (e.g. Twitter) search [35]. Daniel Tunkelang proposed "a Twitter analog to PageRank" [49], which determines two sets of scores, Authority and Influence:

Influence(u) = Σ_{v ∈ Followers(u)} (1 + p · Influence(v)) / |Following(v)|    (2.3)

where Followers(u) is the set of people following a given user, Following(v) is the set of people a given user follows, and p is a real-valued number corresponding to the probability that a given tweet is retweeted.
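Equation (2.2) is typically evaluated by repeated iteration until the scores converge; the three-node graph below is a made-up example, not one from this thesis:

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iteratively evaluate the unnormalized PageRank of Eq. (2.2):
    PR(Vi) = (1 - d) + d * sum over Vj in In(Vi) of PR(Vj) / |Out(Vj)|."""
    nodes = list(out_links)
    pr = {v: 1.0 for v in nodes}  # initial scores
    for _ in range(iterations):
        new = {}
        for v in nodes:
            # In(v): every node u whose outgoing links include v.
            incoming = [u for u in nodes if v in out_links[u]]
            new[v] = (1 - d) + d * sum(pr[u] / len(out_links[u])
                                       for u in incoming)
        pr = new
    return pr

graph = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # C receives the most link mass
```

In graph-based summarization the same iteration is run over a sentence-similarity graph instead of hyperlinks, so the top-ranked "pages" become the top-ranked sentences.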
Machine Learning Approach
Initially, such systems assume that the features are independent; later, some feature-dependent approaches were developed. Machine learning based summarization algorithms use techniques like Naive Bayes, decision trees, and Hidden Markov Models.
• Naive Bayes Methods :

The Naive Bayes classifier has long been a favorite punching bag of new classification techniques [36]. A machine learning approach is based on three steps: learning, development, and test. Bayes' rule treats the features of words and sentences as random events and relates the conditional and marginal probabilities of those events:

P(s ∈ S | F_1, F_2, ..., F_k) = P(F_1, F_2, ..., F_k | s ∈ S) · P(s ∈ S) / P(F_1, F_2, ..., F_k)    (2.4)

where s is a sentence from the document, S is the target summary, and (F_i)_{1≤i≤k} are the features.
Andrew et al. compared the multivariate Bernoulli model and the multinomial model and showed that the multivariate Bernoulli model performs better. Rennie et al. discuss the multinomial Naive Bayes model and the problems associated with it [37]. Mouratis et al. propose a discriminative multinomial Bayesian classifier, which increases accuracy with a feature selection technique that evaluates the worth of an attribute by computing the chi-squared statistic with respect to the class [38].
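As a concrete illustration of how Bayes' rule (Eq. 2.4) can score sentences from word features, the sketch below implements a tiny Naive Bayes classifier with Laplace smoothing; the training pairs and the bag-of-words features are illustrative assumptions, not the setup of the cited papers:

```python
import math
from collections import Counter

def train_nb(labeled_sentences):
    """Count class priors and per-class word frequencies from
    (sentence, in_summary) training pairs."""
    class_counts = Counter()
    word_counts = {True: Counter(), False: Counter()}
    vocab = set()
    for sentence, in_summary in labeled_sentences:
        class_counts[in_summary] += 1
        for w in sentence.lower().split():
            word_counts[in_summary][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def in_summary(sentence, model):
    """Apply Bayes' rule in the log domain: log P(c) + sum_i log P(F_i | c),
    with add-one (Laplace) smoothing; returns True if the summary class wins."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for c in (True, False):
        logp = math.log(class_counts[c] / total)
        denom = sum(word_counts[c].values()) + len(vocab)
        for w in sentence.lower().split():
            logp += math.log((word_counts[c][w] + 1) / denom)
        log_scores[c] = logp
    return log_scores[True] > log_scores[False]

model = train_nb([("results are important", True),
                  ("we present important results", True),
                  ("the meeting was long", False),
                  ("lunch was served", False)])
```

The log domain avoids underflow when many features are multiplied together.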
• Decision Trees : A decision tree is a classifier generated from training data by finding features in a top-down direction, i.e., from the root to the leaf nodes. Each node is generated based on rules corresponding to a feature, and this process is repeated until no further information gain is obtained.
Lin et al. assumed that the features are independent and applied a decision tree algorithm to the sentence extraction problem [39]. The data used for these measurements were provided by TIPSTER-SUMMAC. Independent data for assigning scores to sentences were provided by SUMMARIST, which produced the same texts after applying each combination of functions, features, and parameters. Some specific features are:
Baseline: scoring a sentence by its position.
Query signature: normalized score of each sentence according to the number of query words it contains.
IR signature: the most salient terms ranked by tf-idf.
Average lexical connectivity: the number of words shared with other sentences divided by the total number of sentences in the text.
Numerical data: Boolean value 1 is given if the sentence contains a numerical expression.
Pronoun and adjective: Boolean value 1 is given if the sentence contains a proper noun.
Weekday and month: Boolean value 1 is given if the sentence contains weekdays or months.
Quotation: Boolean value 1 is given to sentences containing a quote.
When the authors applied these features to query topics, they concluded that no single feature suffices for query-based summaries. Kevin et al. proposed a model of a sentence compression function using the decision tree method [40].
• Hidden Markov Model (HMM) :
It is a probabilistic finite-state model for data. The structure of this model consists of a number of states and transitions between the states, selected using a priori knowledge of the domain. An HMM is defined as follows [41]:
λ = (A, B, π)    (2.5)

S = set of states, S_1, S_2, ..., S_M
V = set of output symbols, V_1, V_2, ..., V_N
Q = fixed state sequence of length T
O = corresponding set of observations of length T
A = transition probabilities from state S_i to S_j, denoted a_ij, where

a_ij = P(q_t = S_j | q_{t−1} = S_i)    (2.6)

B = probabilities of output symbol k being produced from state S_i, denoted B = (b_i(k)), where

b_i(k) = P(x_t = V_k | q_t = S_i)    (2.7)

π = initial probability array, denoted π = [π_i], where

π_i = P(q_1 = S_i)    (2.8)
Two assumptions are made in the Markov model: the current state depends only on the previous state, and the output observation at time t depends only on the current state. HMMs are also used for speech and handwriting recognition [42].
Zhou et al. describe a granularity-refined DOM tree to extract detailed information, combined with regular expressions to extract fixed-format information [43]. They applied a training data set consisting of address, room size, rent, area, telephone number, name, etc. to the DOM tree. Experiments showed better extraction results when compared with the RAPIER algorithm on the same data sets.
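Given λ = (A, B, π), the probability of an observation sequence can be computed with the standard forward algorithm; a minimal sketch with an illustrative two-state, two-symbol model:

```python
def forward(A, B, pi, observations):
    """Forward algorithm: returns P(O | lambda) for lambda = (A, B, pi),
    where alpha[i] accumulates P(o_1..o_t, q_t = S_i)."""
    n = len(pi)
    alpha = [pi[i] * B[i][observations[0]] for i in range(n)]
    for o in observations[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

# Illustrative model: rows of A and B are probability distributions.
A = [[0.7, 0.3], [0.4, 0.6]]   # a_ij = P(q_t = S_j | q_{t-1} = S_i)
B = [[0.9, 0.1], [0.2, 0.8]]   # b_i(k) = P(x_t = V_k | q_t = S_i)
pi = [0.5, 0.5]                # pi_i = P(q_1 = S_i)
p = forward(A, B, pi, [0, 1, 0])
```

Summing P(O | λ) over all possible observation sequences of a fixed length gives 1, which is a convenient sanity check.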
• Maximum Entropy Model :
A maximum entropy classifier can be used to extract sentences from documents. Osborne et al. report that a maximum entropy classifier gave better results in sentence extraction than a Naive Bayes classifier when the information is encoded in dependent and independent features [44]. The maximum entropy model is defined as [45]:
P(c|s) = (1/Z(s)) exp(Σ_i λ_{i,c} f_{i,c}(s, c))    (2.9)

Where Z(s) = Σ_c exp(Σ_i λ_{i,c} f_{i,c}(s, c)) is a normalizing function,
and f_{i,c} is a feature function for feature i and class c, defined as:

f_{i,c}(d, c′) = 1 if n_i(d) > 0 and c′ = c; 0 otherwise.
The λ_{i,c} are feature weight parameters, whose values are chosen to maximize the entropy of the induced distribution subject to the constraints.
Chieu et al. present a maximum entropy classification approach to single-slot and multi-slot information extraction [46]. For the single-slot task, they worked on seminar announcements, using several features such as:
Unigram: the string of each word w is used as a feature, as are those of the previous word w−1 and the next word w+1.
Bigram: the pair of word strings (w−2, w−1) of the previous two words is used as a feature, as is that of the next two words (w+1, w+2).
Zone and InitCaps: texts within the pair of tags <sentence> and </sentence> are taken to be one sentence. Words within sentence tags are taken to be in a TXT zone; words outside such tags are taken to be in a FRAG zone. This group consists of 2 features, (InitCaps, TXT) and (InitCaps, FRAG). For words starting with a capital letter (InitCaps), one of the 2 features (InitCaps, TXT) or (InitCaps, FRAG) will be set to 1, depending on the zone the word appears in.
Zone and InitCaps of w−1 and w+1: if the previous word has InitCaps, another feature (InitCaps, TXT)-PREV or (InitCaps, FRAG)-PREV will be set to 1; similarly for the next word.
Heading: the heading is defined to be the word before the last colon ":". The system
will distinguish between words on the first line of the heading (e.g. Who-first-line) and words on other lines (Who-other-lines). At most one feature is set to 1 for this group.
First Word: this group contains only one feature, FIRSTWORD, which is set to 1 if the word is the first word of a sentence.
Time Expressions: if the word string of w matches the regular expression [digit]+:[digit]+, then this feature is set to 1.
Names: if w has InitCaps and is found in the list of first names, the feature FIRSTNAME is set to 1. If w−1 (or w+1) has InitCaps and is found in the list of first names, then FIRSTNAME-PREV (FIRSTNAME-NEXT) is set to 1, and similarly for LASTNAME. For the multi-slot task, they worked on management succession.
The multi-slot IE system is made up of four components: text filtering, candidate selection, relation classification, and template building. The authors applied two benchmark data sets, one per task, and showed better accuracy in information extraction. Robert et al. compare a number of algorithms for estimating the parameters of maximum entropy models, including iterative scaling, gradient ascent, conjugate gradient, and variable metric methods [47]. Another model, Hidden-State Maximum Entropy (HSME), has been proposed based on a fusion method for confidence measures [48]. The concept of the maximum entropy model has also been applied to boundary identification of biological text terms [1].
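Once feature values and weights are fixed, Eq. (2.9) is a direct computation; a minimal sketch in which the feature names, classes, and weight values are all hypothetical:

```python
import math

def maxent_prob(features, weights, classes):
    """P(c|s) = exp(sum_i lambda_{i,c} f_{i,c}(s, c)) / Z(s), where Z(s)
    sums the exponentials over all classes (cf. Eq. 2.9)."""
    unnorm = {c: math.exp(sum(weights.get((f, c), 0.0) * v
                              for f, v in features.items()))
              for c in classes}
    z = sum(unnorm.values())               # normalizing constant Z(s)
    return {c: u / z for c, u in unnorm.items()}

# Hypothetical binary features for one token and hand-set weights.
feats = {"InitCaps": 1, "FIRSTWORD": 1, "TIME_EXPR": 0}
lam = {("InitCaps", "speaker"): 1.2, ("FIRSTWORD", "speaker"): -0.3,
       ("InitCaps", "other"): 0.1}
probs = maxent_prob(feats, lam, ["speaker", "other"])
```

In practice the weights λ_{i,c} would be estimated by one of the training algorithms compared in [47], not set by hand.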
• Neural Networks :
In 1997, Ruiz and Srinivasan [49] modeled the problem of recognizing MeSH terms for a particular document. To solve this problem, they used backpropagation and counterpropagation networks.
Backpropagation networks: A backpropagation network operates in two phases, one to propagate the input pattern and the other to adapt the output by changing the weights in the network. The training procedure of a backpropagation network is iterative, with the weights adjusted after the presentation of each case. The
input (N_j) and the output (O_j) of the network are defined as follows:

N_j = Σ_i ω_ij O_i + Θ_j  and  O_j = 1 / (1 + e^{−N_j})    (2.10)
Counterpropagation networks : The counterpropagation network consists of an input layer, a hidden layer (also called the Kohonen layer), and an output layer (called the Grossberg layer). The training process consists of two steps: first, unsupervised learning is performed by the hidden layer; then, after the hidden layer is stable, supervised learning is performed by the output layer. The update formula of the hidden layer is:

ω_new = ω_old + α(x − ω_old)    (2.11)
Svore et al. proposed a neural-net-based model for summarization, called NetSum, with third-party datasets for features. The authors used datasets from Wikipedia and CNN.com and applied a ranking algorithm, RankNet. The system performed well against the baseline of choosing the first n sentences of the document.
NTC (Neural Network Categorizer) [50] is a neural network model for representing documents as numerical vectors. It solves two problems: first, it can classify documents with huge dimensionality completely, and second, it
provides transparency about its classification. For text categorization, the authors gathered datasets from Newspage.com, 20NewsGroups, and Reuters. They applied four approaches, SVM, NB, KNN, and backpropagation, compared them with NTC for evaluation, and obtained successful results as an approach to text categorization.
The Probabilistic Neural Network (PNN) was first proposed by Donald Specht in 1990. In 2009, Patrick et al. described a modified PNN model to solve the problem of classifying economic activities in Brazil [51].
• Support Vector Machines : The support vector machine is a learning method developed by Vapnik et al. As stated in [52], Hirao et al. introduced a classification learning algorithm, the Support Vector Machine (SVM), to classify sentences as important or unimportant for single-document summarization at the Document Understanding Conference (DUC) [53].
Given a training dataset (x_j, y_j), j = 1, ..., n, with x_j ∈ R^n and y_j ∈ {−1, +1}, x_j is the feature vector of the j-th sample and y_j is its class label (positive or negative). To rank sentences, they used sentence features such as position, length, weight, similarity to the headline, and prepositions and verbs.
They also presented a sentence ranking algorithm using SVM for multi-document summarization. To minimize redundancy, they applied the Maximal Marginal Relevance (MMR) method. The features used for ranking sentences are similar to those used for single-document summarization, except that named entities are used in place of similarity to the headline.
In 2005, Minh et al. proposed a sentence extraction algorithm based on SVM ensemble classification to improve accuracy [54]. To correctly classify regions of the training samples, they trained each SVM independently on randomly chosen training samples and combined the machines using a boosting strategy, implementing the AdaBoost algorithm to select the training data for each individual SVM. The feature set comprised location, length, relevance to the title, term frequency and document frequency, cue phrases, and the distance of a word within a sentence.
Algebraic Approach
• Latent Semantic Analysis : This is an algebraic-statistical method to determine words and sentences that are semantically related. It creates a matrix representation by comparing semantic words, and it is an algebra-based unsupervised approach. LSA produces measures of word-word, word-document, and document-document relations that are well correlated with several human cognitive phenomena involving association or semantic similarity [55].
Latent Semantic Indexing is an information retrieval method that projects queries and documents into a space with "latent" semantic dimensions [56]. Singular Value Decomposition (SVD) is a method for finding the relations among a very large number of words; it can reduce noise and improve accuracy.
SVD : The SVD of a matrix A_{m×n} is defined as follows:

A = U_{m×r} × S_{r×r} × V_{r×n}^T    (2.12)

Where U contains the eigenvectors of AA^T (the term matrix), V contains the eigenvectors of A^T A (the document matrix), and S is the diagonal matrix of nonzero singular values, whose squares are the eigenvalues of both A^T A and AA^T.
Probabilistic Latent Semantic Analysis is a statistical model for word-document co-occurrences generated by the following scheme [57]: select a document d_i with probability P(d_i); pick a latent class z_k with probability P(z_k | d_i); generate a word ω_j with probability P(ω_j | z_k). Here P(d_i) is the probability that a word occurs in a particular document, P(z_k | d_i) denotes the probability distribution over the latent variable space, and P(ω_j | z_k) denotes the class-conditional probability of a specific word conditioned on the unobserved class variable.
Meta Latent Semantic Analysis (MLSA) [58] is an improved-accuracy model of LSA. It can create meta-clusters by taking symbolic ontologies relevant to the analyzed collection of documents. Adaptive PLSA has the incremental learning capability to absorb domain knowledge from newly observed documents; it deals with domain mismatch in language processing applications. To resolve the updating problems, the authors go through folding-in, SVD recomputation, and SVD updating processes [59].
• Non-Negative Matrix Factorization (NMF) : NMF is a linear representation of non-negative data applied to a set of multivariate n-dimensional data vectors. The NMF model is defined as:

A_{n×m} ≈ B_{n×r} C_{r×m}    (2.13)
Li et al. presented a multi-document summarization framework based on sentence-level semantic analysis (SLSS) and symmetric non-negative matrix factorization (SNMF). SNMF can be written as a 3-factor non-negative matrix factorization:

X ≈ F S G^T    (2.14)

Where S provides a low-rank matrix representation, F gives the row clusters, and G gives the column clusters. After creating the clusters, the authors rank the sentences based on a sentence score, measured as:

Score(S_i) = λ F_1(S_i) + (1 − λ) F_2(S_i)    (2.15)

Where F_1(S_i) measures the average similarity between sentence S_i and all the other sentences in the cluster, F_2(S_i) is the similarity between the sentence and the given topic, and λ is a weight parameter [60].
In 2009, Lee et al. [61] proposed an unsupervised NMF method to extract important sentences for automatic generic document summarization. The authors claimed that NMF performs better at identifying the subtopics of a document than methods using LSA, because the semantic feature vectors obtained using NMF have non-negative values, whereas in the LSA method they contain both positive and negative values.
• Semi-Discrete Decomposition (SDD) : SDD can be used in place of the truncated SVD and is defined as [62]:

A_k = Σ_{i=1}^{k} d_i x_i y_i^T    (2.16)
A rank-k SDD requires the storage of k(m+n) values from the set {−1, 0, 1} and k scalars. The scalars need only single precision because the algorithm is self-correcting.
For query-based text summarization, the authors compared SVD- and SDD-based LSI methods on the MEDLINE dataset; the SDD-based method requires only about half the query time and less than one-twentieth the storage, but computing the SDD approximation takes five times as long as computing the SVD approximation. Let A ∈ R^{m×n} be a given matrix and let ω ∈ R^{m×n} be a non-negative weight matrix [62]. The weighted approximation problem is to find a matrix B ∈ R^{m×n} that solves

min ||A − B||²_ω    (2.17)

To overcome the problem of the "curse of dimensionality", Vaclav et al. proposed Wordnet and Wordnet+LSI models for dimension reduction [63]. Here the SDD method is used to identify the most conceptual terms. For topic identification, the SDD concept is used in two ways: to map the terms onto synsets, and to use synsets as input to the SDD for document and term vectors.
2.1.3 Abstractive Text Summarization
In this method, the machine needs to understand the concepts in all the input documents and then produce a summary in its own sentences. To accomplish this task, it goes through these sub-processes: information extraction, ontological information, information fusion, and compression [64]. The machine uses linguistic methods to examine and interpret the text, and then finds new concepts and expressions to best describe it, generating a new, shorter text that conveys the most important information from the original text document [65]. Witbrock et al. proposed a statistical model of the non-extractive summarization process based on sentence compression. The main steps in this system are [66]:
a. Tokenization : Tokens may include not only the words but additional information such as part-of-speech tags, semantic tags applied to words, and even phrases. Long-distance relationships between words or phrases in the document, positions of words or phrases, and markup information obtained from the document, such as the existence of different fonts, could also be used. This preprocessing model is applied to both the input documents and the target documents.
b. A statistical model is built describing the relationship between the source text units in a document and the target text units to be used in the summary of that document. It describes both the order and the likelihood of appearance of tokens in the target documents.
c. The statistical model, together with information about user or task requirements, is used to produce the summary of a document.
2.2 Evaluation Measure
After creating an automatic summary, we need to know how useful it is: whether it fulfills the requirements of a human user and provides quality information. For this, automatic evaluation is done. The TIPSTER Text Summarization Evaluation (SUMMAC) was the first large-scale, developer-independent evaluation of automatic text summarization systems [67]. To evaluate a summary, baseline summaries need to be created: a single baseline summary for single-document summarization, and lead-baseline and coverage-baseline summaries for multi-document summarization, which is a difficult task [68]. Human evaluation is expensive, very difficult, and time-consuming. BLEU is an automatic evaluation method for machine translation that is inexpensive, quick, and language-independent, and it correlates highly with human evaluation [69]. No standard metric is defined for evaluation, which makes it very hard to compare different systems and establish a baseline [70].
NIST did not define any official performance metric in DUC 2001, as stated by Lin (2002). Evaluation measures are categorized into two types: intrinsic and extrinsic. Intrinsic evaluation judges the quality of the summary directly, based on analysis in terms of some set of norms, whereas extrinsic evaluation judges the quality of the summary based on how it affects the completion of some other task. The taxonomy of evaluation measures, as stated in [71], is shown in Figure ??.
[Figure: taxonomy of evaluation measures. Intrinsic measures cover text quality (grammaticality, non-redundancy, referential clarity, structure and coherence), co-selection (precision, recall, F-score, relative utility), and content-based measures (cosine similarity, unit overlap, longest common subsequence, n-gram matching (ROUGE), pyramids, LSA-based measure); extrinsic (task-based) measures cover document categorization, information retrieval, and question answering.]
• Text Quality Measures : Grammaticality: the summary should not contain grammatical errors such as punctuation errors or incorrect words. Non-redundancy: the summary should not contain redundant information. Referential clarity: references in the summary should clearly match known objects. Coherence and structure: the summary should have good structure, with coherently related sentences.
• Co-selection Measures : Here, sentences extracted for the created summary are evaluated against the human selection. The co-selection metrics are precision, recall, and F-score.
Precision : Precision is defined as the proportion of retrieved documents that are relevant [72], or the number of sentences common to the system summary and the human judges' choice divided by the number of sentences extracted in the system summary:

Precision = |SystemSentences ∩ HumanJudgesChoiceSentences| / |SystemSentences|    (2.18)
Recall : Recall is defined as the proportion of relevant documents that are retrieved [89], or the number of sentences common to the system summary and the human judges' choice divided by the number of sentences in the human judges' choice summary:

Recall = |SystemSentences ∩ HumanJudgesChoiceSentences| / |HumanJudgesChoiceSentences|    (2.19)
F-score : The F-score is a statistical measure that combines precision and recall; it is defined as the harmonic mean of the two. Its value lies between 0 and 1, where 1 is the best value:

F-score = (2 × Precision × Recall) / (Precision + Recall)    (2.20)
A more general formula for the F-score is:

F-score = ((β² + 1) × Precision × Recall) / (β² × Precision + Recall)    (2.21)

Where β is a non-zero weight value: β < 1 weights Precision more heavily, and β > 1 weights Recall more heavily.
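The three co-selection metrics above can be computed directly from the extracted sentence sets; a minimal sketch (the sentence IDs are illustrative):

```python
def coselection(system, human, beta=1.0):
    """Precision = |S ∩ H| / |S|, Recall = |S ∩ H| / |H|, and the
    weighted F-score (beta^2 + 1)PR / (beta^2 P + R) of Eq. (2.21)."""
    common = len(set(system) & set(human))
    precision = common / len(set(system))
    recall = common / len(set(human))
    f = ((beta ** 2 + 1) * precision * recall
         / (beta ** 2 * precision + recall)) if common else 0.0
    return precision, recall, f

# System extracted sentences 1, 2, 4; human judges chose 1, 2, 3, 4.
p, r, f = coselection([1, 2, 4], [1, 2, 3, 4])
```

With β = 1 this reduces to the harmonic mean of Eq. (2.20).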
Relative Utility : The relative utility measure overcomes a problem of precision- and recall-based evaluation, as stated in [73]. Suppose a manual summary contains sentences 1, 2, 3, and 4 of a document, and two systems, S1 and S2, create summaries consisting of sentences 1, 2, 4 and 1, 2, 3 respectively. It is possible that two sentences in one document are equally important, yet using precision and recall S1 ranks higher than S2. Rankings of sentences also vary from judge to judge: if a particular sentence is ranked 8 by judge 1 and the same sentence is ranked 10 by judge 2, then the utility score of that sentence is 0.8 (8/10). To calculate relative utility, a number of judges (N ≥ 1) are asked to assign utility scores to all sentences in the document, and the top e sentences are extracted according to utility score. The relative utility of a system is calculated as:

RelativeUtility = (Σ_{j=1}^{N} Σ_{i=1}^{n} δ_i λ_ij) / (Σ_{j=1}^{N} Σ_{i=1}^{n} η_i λ_ij)    (2.22)

Where λ_ij is the utility score assigned to sentence i by judge j, δ_i marks the sentences extracted by the system, and η_i marks the top e sentences by utility.
Co-selection based evaluation focuses on summaries from which sentences are extracted.
• Content-Based Measures : Content-based evaluation mainly focuses on extracted summaries, where the comparison is done among words. Content-based measures include cosine similarity, unit overlap, longest common subsequence, ROUGE scores, and pyramids.
Cosine Similarity :

sim(D_1, D_2) = Σ_i d_1i d_2i / (√(Σ_i d_1i²) × √(Σ_i d_2i²))    (2.23)

Where D_1 and D_2 are two documents represented using a vector space model, and d_1i, d_2i are the term weights for word i in each document.
Unit Overlap : Unit overlap is defined as:

overlap(X, Y) = ||X ∩ Y|| / (||X|| + ||Y|| − ||X ∩ Y||)    (2.24)
Where X and Y are text representations based on sets, and ||S|| is the size of set S.
Longest Common Subsequence : LCS finds the longest common subsequence of X and Y. It can be calculated as [74]:

2 × lcs(X, Y) = length(X) + length(Y) − edit_di(X, Y)    (2.25)

Where length(X) and length(Y) are the lengths of the strings X and Y respectively, and edit_di(X, Y) is the edit distance (deletions and insertions only) between X and Y.
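Rather than going through the edit distance, the LCS length can also be computed directly by dynamic programming; a minimal sketch over word sequences:

```python
def lcs_len(x, y):
    """Length of the longest common subsequence of sequences x and y,
    using the classic O(|x|*|y|) dynamic program (one row kept at a time)."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0]
        for j, yj in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if xi == yj
                       else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

a = "the police killed the gunman".split()
b = "police kill the gunman".split()
n = lcs_len(a, b)   # common in-sequence words: police, the, gunman
```

The same routine serves as the LCS(X, Y) term in the ROUGE-L formulas below.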
ROUGE-N : n-gram co-occurrence statistics. ROUGE-N is an n-gram recall between a candidate summary and a set of reference summaries, computed as follows:

ROUGE-N = (Σ_{S ∈ ReferenceSummaries} Σ_{gram_n ∈ S} Count_match(gram_n)) / (Σ_{S ∈ ReferenceSummaries} Σ_{gram_n ∈ S} Count(gram_n))    (2.26)

Where n stands for the length of the n-gram, gram_n, and Count_match(gram_n) is the maximum number of n-grams co-occurring in the candidate summary and the set of reference summaries.
In the case of multiple references, the pairwise summary-level ROUGE-N between a candidate summary s and every reference r_i in the reference set is computed, and ROUGE-N for multiple references is then:

ROUGE-N_multi = argmax_i ROUGE-N(r_i, s)    (2.27)
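A minimal sketch of ROUGE-N per Eq. (2.26); matched n-gram counts are clipped at the candidate's count, and the example texts are illustrative:

```python
from collections import Counter

def ngrams(text, n):
    """Multiset of word n-grams in a text."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def rouge_n(candidate, references, n=2):
    """Sum of clipped n-gram matches over all references, divided by the
    total number of n-grams in the references (an n-gram recall)."""
    cand = ngrams(candidate, n)
    match = total = 0
    for ref in references:
        ref_grams = ngrams(ref, n)
        total += sum(ref_grams.values())
        match += sum(min(cnt, cand[g]) for g, cnt in ref_grams.items())
    return match / total

score = rouge_n("the cat sat on the mat", ["the cat sat on a mat"], n=2)
```

Here three of the five reference bigrams also appear in the candidate, giving a score of 0.6.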
ROUGE can also be computed based on the longest common subsequence, known as ROUGE-L. The F-measure based on LCS can be computed as:

F_lcs = ((1 + β²) R_lcs P_lcs) / (R_lcs + β² P_lcs)    (2.28)

where R_lcs = LCS(X, Y)/m and P_lcs = LCS(X, Y)/n, X is a reference summary sentence of length m, and Y is a candidate summary sentence of length n. ROUGE-L does not require consecutive matches but in-sequence matches that reflect sentence-level word order, and it automatically includes the longest in-sequence common n-grams; therefore no predefined n-gram length is necessary.
Sentence-level LCS can also be applied at the summary level. In this process, the union LCS matches between each reference summary sentence r_i and every candidate summary sentence c_j are taken. The F-measure can be computed as [75]:

F_lcs = ((1 + β²) R_lcs P_lcs) / (R_lcs + β² P_lcs)    (2.29)

where R_lcs = Σ_{i=1}^{u} LCS_∪(r_i, c) / m and P_lcs = Σ_{i=1}^{u} LCS_∪(r_i, c) / n, with u the number of sentences containing a total of m words in the reference summary, and v the number of sentences containing a total of n words in the candidate summary.
Pyramids : The pyramid method identifies relevant information from a document or a set of documents. It is based on Summary Content Units (SCUs); an SCU is a semantically atomic unit representing a single fact, but it is not tied to its lexical realization [76]. Let D_i be the number of SCUs in the summary that appear in tier T_i, and X the total number of SCUs in the summary. The total SCU weight can be computed as:

D = Σ_{i=1}^{n} i × D_i    (2.30)

This SCU weight is then normalized by the optimal content score for a summary of X SCUs. The optimal content score is computed as:

Max = Σ_{i=j+1}^{n} i × |T_i| + j × (X − Σ_{i=j+1}^{n} |T_i|)    (2.31)

where j = max_i (Σ_{t=i}^{n} |T_t| ≥ X). The pyramid score lies between 0 and 1 due to the normalization.
LSA-Based Measure : The ability of LSA to capture the most important topics is used by two evaluation metrics proposed by Steinberger et al. They evaluate summary quality via content similarity between a reference document and the summary. The quality is measured by the similarity between the matrix U derived from the SVD performed on the reference document and the matrix U derived from the SVD performed on the summary. Two similarity measures are proposed: main topic similarity and term significance similarity.
• Task-Based Measures : Task-based evaluation focuses on the quality of a summary according to how well it serves a user's task; it requires more effort than intrinsic evaluation. Approaches to task-based summarization evaluation include document categorization, information retrieval, and question answering.
Document categorization : This determines whether the summary is effective in capturing whatever information in the document is needed to correctly categorize the document. Categorization can be done by human judges or an automatic classifier. By comparing the upper and lower bounds of the error generated by a classifier with those generated by a summarizer, we can compare system performance. The evaluation metrics for categorization are precision and recall. Precision in this context is the number of correct topics assigned to a document divided by the total number of topics assigned to the document; recall is the number of correct topics assigned to a document divided by the total number of topics that should be assigned to the document.
Information retrieval : This is an appropriate task-based evaluation of summary quality. Relevance correlation is an IR-based measure for assessing the relative decrease in retrieval performance when moving from full documents to summaries; it measures the quality of summaries by comparing how well the summary performs against the full document. There are several methods for measuring the similarity of rankings, such as Kendall's tau and Spearman's rank correlation. The relevance correlation r is defined as the linear correlation of the relevance scores (x and y) assigned by two different IR algorithms to the same set of documents, or by the same IR algorithm to different data sets:
r = Σ_i (x_i − x̄)(y_i − ȳ) / (√(Σ_i (x_i − x̄)²) × √(Σ_i (y_i − ȳ)²))    (2.32)
Here x̄ and ȳ are the means of the relevance scores x and y for the document sequence, respectively.
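Eq. (2.32) is the Pearson correlation of the two score lists; a minimal sketch with illustrative relevance scores:

```python
import math

def relevance_correlation(x, y):
    """Linear (Pearson) correlation of two equally long relevance-score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = (math.sqrt(sum((xi - mx) ** 2 for xi in x))
           * math.sqrt(sum((yi - my) ** 2 for yi in y)))
    return num / den

# Illustrative scores: full documents vs. their summaries on five queries.
r = relevance_correlation([0.9, 0.4, 0.7, 0.1, 0.5],
                          [0.8, 0.5, 0.6, 0.2, 0.5])
```

A value of r near 1 indicates that retrieval over summaries preserves the relevance ranking obtained over full documents.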
Question answering : Here the authors administer a test consisting of multiple-choice questions, with a single answer to be selected from the answers shown alongside each question. They measure how many of the questions the subjects answered correctly under different conditions by comparison with professional answers.
2.3 Keyword Extraction
High-quality keywords play a crucial role in extracting the important information or sentences that a user requires, and they help users search for information more efficiently. Keyword extraction can be used in many applications, such as text summarization, clustering, classification, and topic detection [77]. Due to the growth of online information, it is difficult for human beings to accomplish natural language processing tasks in a stipulated time, and extracting high-quality keywords manually is expensive and time-consuming. This makes keyword extraction a challenging problem in natural language processing, especially in the context of global languages and acceptable processing time.
Frank et al. investigated keyword extraction as a supervised learning problem [78]. They also introduced the KEA algorithm for keyword extraction; the tf-idf method is used for feature calculation [79], and it performed well. In 2000, Turney et al. used a decision-tree algorithm and a genetic algorithm for keyword extraction [80]. Kerner et al. [81] found tf-idf to be very effective in extracting keywords from scientific journals. Keyword extraction has also been treated as an unsupervised task, as shown by Lie et al. [82]. Barker et al. discuss a keyphrase extraction system that scores noun phrases based on frequency and length and also filters some noise from the set of top-scoring keyphrases [83]. Daille et al. applied linguistic knowledge to identify noun phrases for both English and French terms [84], using statistical methods to score good terms.
Keyword extraction methods can be divided into different categories based on approach:
Statistical approach : These methods are simple and do not need training data. Statistical information about the words can be used to identify the keywords in a document. This category includes n-gram, word frequency, tf-idf, and word co-occurrence methods. Burnett et al. used n-grams to identify index terms in documents [85]. Cohen investigated an n-gram count method for extracting highlights from documents [86]. In 1957, Luhn described a statistical approach in which a sentence gives a useful measurement of significance if the frequency of a particular term (or word) is high in the article. If the frequency of a pair of words is high in the documents, then the term co-occurrence value is high [87].
Linguistic approach : These approaches use the linguistic features of the words, sentences, and document, and include lexical analysis, syntactic analysis, etc. A lexical chain is a method of identifying a set of words that are semantically related; WordNet is used to measure conceptual similarity and relatedness information from the document [88], [27]. Hulth used syntactic features for extracting keywords; the patterns occurring most frequently among keywords in the training data are adjective noun (singular or mass), noun noun (both singular or mass), adjective noun (plural), noun (singular or mass) noun (plural), and noun (singular or mass) [89]. Other researchers, such as Barzilay et al. and Angheluta et al., used lexical cohesion methods for keyword extraction [90], [91].
Machine learning approach : This includes methods such as Naive Bayes and support vector machines. Bayesian decision theory is based on trade-offs between classification decisions, using probability and the costs that accompany those decisions [92]; it is considered less favourable because it requires a large training data set. Zhang et al. defined three categories of keywords, 'good keyword', 'indifferent keyword', and 'bad keyword', and applied a support vector machine as a classification model for keywords [93].
Other approaches : These include methods that use heuristic knowledge, such as position, length, HTML tags, etc. The position at which a word appears is defined by its position normalized by the total number of words in the document. Keywords are extracted based on the maximum length and highest salience score of the sentences [94]. Humphreys investigated an HTML keyword extractor based on a phrase rate that includes word rate, docrate, ratephrases, and selector [95]; it is especially suitable as an online keyword aid.
34
Chapter 3
Comparison between Performances of
two Keyword Extraction Methods
Here I describe our work on the comparison between the performance of two keyword extraction methods: the popular TF-IDF method and a method based on the Helmholtz Principle. I also propose an algorithm based on the Helmholtz Principle to extract meaningful words within a stipulated time.
3.1 Term Frequency-Inverse Document Frequency (TF-IDF)
The tf-idf (term frequency-inverse document frequency) weight identifies the importance of words to a document in a collection. Important keywords are those that appear frequently in a document but do not appear frequently in the remainder of the corpus [23]. The tf measures the number of times a word appears in the current document, reflecting the frequency of the word in this article, while the idf reflects the number of documents in which the word occurs. When a word is more frequent in the document but less frequent in the whole collection, its tf-idf value is higher. tf-idf is defined as:
tf-idf = tf × idf    (3.1)

idf(i) = log(n / n(i))    (3.2)
where tf = number of times term i occurs in the document, n = number of documents in the corpus, and n(i) = number of documents in which the word i occurs.
tf-idf assigns to a term t a weight in document d that is
• highest when t occurs many times within a small number of documents;
• lower when the term occurs fewer times in a document, or occurs in many documents;
• lowest when the term occurs in virtually all documents.
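As a small illustrative sketch (not the exact implementation used in this work), equations (3.1)-(3.2) can be computed in Python; the function name tf_idf and the tokenized-document input format are assumptions for this example:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of tokenized documents.  Returns, per document, a dict
    mapping each word i to tf * log(n / n(i))  (Eqs. 3.1-3.2)."""
    n = len(docs)
    df = Counter()                       # n(i): documents containing word i
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)                # tf: occurrences in this document
        weights.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weights
```

A word occurring in every document gets weight 0, matching the third bullet above.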
3.2 Optimization of Meaningful Keywords Extraction using Helmholtz Principle
Jon Kleinberg presents a formal approach for modeling "bursts" so that they can be robustly and efficiently identified [24]. According to a basic principle of perception due to Helmholtz, an observed geometric structure is perceptually meaningful if it has a very low probability of appearing in noise. As a common-sense statement, this means that events that could not happen by chance are immediately perceived. For example, a group of five aligned dots exists in both images in Figure 3.1, but it can hardly be seen in the left-hand image. Indeed, such a configuration is not exceptional in view of the total number of dots. In the right-hand image we immediately perceive the alignment as a large deviation from randomness that would be unlikely to happen by chance.
In the case of textual, sequential or unstructured data, Balinsky et al. derive a quantitative measure for such deviations. Suppose we are given a set of N documents D1, D2, ..., DN (containers) of the same length [26]. Let W be some word inside these N documents. Assume that the word W appears K times in all N documents, and let us collect all of these occurrences into one set S_W = {w1, w2, ..., wK}.

Figure 3.1: A group of five aligned dots among random dots.

Let us denote by C_m a random variable that counts how many times an m-tuple of the elements of S_W appears in the same document. Now we would like to calculate the expected value of the random variable C_m under the assumption that the elements of S_W are randomly placed into the N containers. For m different indices i1, i2, ..., im between 1 and K, i.e. 1 ≤ i1 < i2 < ... < im ≤ K, form a random variable

X_{i1,i2,...,im} = 1 if w_{i1}, ..., w_{im} are in the same document, and 0 otherwise.

The function C_m is then

C_m = Σ_{1 ≤ i1 < i2 < ... < im ≤ K} X_{i1,i2,...,im}    (3.3)

and the expected value E(C_m) is the sum of the expected values of all the X_{i1,i2,...,im}:

E(C_m) = Σ_{1 ≤ i1 < i2 < ... < im ≤ K} E(X_{i1,i2,...,im})    (3.4)

Since X_{i1,i2,...,im} takes only the values zero and one, the expected value E(X_{i1,i2,...,im}) is equal to the probability that all of w_{i1}, w_{i2}, ..., w_{im} belong to the same document, i.e.

E(X_{i1,i2,...,im}) = 1 / N^(m-1)    (3.5)

From the above identities, we can see that

E(C_m) = K!/(m!(K-m)!) · 1/N^(m-1)    (3.6)

We define K!/(m!(K-m)!) · 1/N^(m-1) as the number of false alarms (NFA) of an m-tuple of the word W.
If the word W appears m times in the same document, then we define W to be a meaningful word if and only if its NFA is smaller than 1.
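The meaningfulness test above can be sketched directly in Python; the function names nfa and is_meaningful are mine, not from the thesis, and K, m, N are as defined in the derivation:

```python
from math import comb

def nfa(K, m, N):
    """Number of false alarms (Eq. 3.6): C(K, m) / N^(m-1),
    for a word appearing K times across N equal-length documents."""
    return comb(K, m) / N ** (m - 1)

def is_meaningful(K, m, N):
    # A word appearing m times in one document is meaningful iff NFA < 1.
    return nfa(K, m, N) < 1
```

For instance, a word with K = 4 total occurrences, m = 3 of them inside one of N = 10 documents, has NFA = 4/100 and is meaningful.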
Algorithm 3.1 Calculate NFA(N, L, B)
Input: documents D1 to DN, each stored in an array
Set corpus = [];
Add all the documents D1 to DN into the corpus array;
L ← length(corpus);
W ← UniqueWords(corpus);
K = [];
for i := 1 to length(W) do
    counter = 0;
    for j := 1 to L do
        if W[i] == corpus[j] then
            counter = counter + 1;
        end if
    end for
    K[i] = counter;
end for
B ← window size;
M ← L/B;  {number of windows}
x = [];
for i := 1 to N do
    for each window y of B consecutive words of Di do
        x = x ∪ {words that occur more than once in y};
    end for
end for
m = [];
for i := 1 to N do
    for j := 1 to length(x) do
        counter = 0;
        for k := 1 to length(Di) do
            if x[j] == Di[k] then
                counter = counter + 1;
            end if
        end for
        m[j] = counter;
    end for
end for
Word = [];
for i := 1 to length(x) do
    for j := 1 to length(W) do
        if x[i] == W[j] then
            p = K[j];
            q = m[i];
            if p!/(q!(p − q)!) × 1/M^(q−1) < 1 then
                Word = append(Word, x[i]);
            end if
        end if
    end for
end for
Figure 3.2: An example of a moving window over a document.
In the case of one document or data stream, it can be divided into a sequence of disjoint, equal-size blocks, and the analysis is performed on these equal-size parts as documents. Since such a subdivision can cut topics and is not shift invariant, a better way is to work with a "moving window". An example of a moving window is shown in Figure 3.2.

More precisely, suppose we are given a document D of size L and a block size B. We define N as [L/B]. For any word W from D and any window of B consecutive words, let m count the number of occurrences of W in this window and K count the number of occurrences of W in D. If NFA < 1, where

NFA = K!/(m!(K − m)!) · 1/N^(m−1) < 1    (3.7)

then we add W to the set of keywords and say that W is meaningful in this window. In the case of one big document that has been subdivided into subdocuments or sections, the sizes of such parts are a natural choice for the window size.
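The moving-window variant can be sketched as follows; this is a simplified illustration under the assumptions above (N = [L/B], window slides one word at a time), with the function name window_keywords being my own:

```python
from math import comb
from collections import Counter

def window_keywords(tokens, B):
    """Slide a window of B consecutive words over the document D.
    A word with K occurrences in D that appears m times inside some
    window is kept when C(K, m) / N^(m-1) < 1, with N = [L / B]."""
    N = max(len(tokens) // B, 1)
    total = Counter(tokens)                       # K for every word
    keywords = set()
    for start in range(len(tokens) - B + 1):
        counts = Counter(tokens[start:start + B])  # m inside this window
        for w, m in counts.items():
            if m >= 2 and comb(total[w], m) / N ** (m - 1) < 1:
                keywords.add(w)
    return keywords
```

Words with m = 1 are skipped since a single occurrence is never a burst.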
In real-life examples it is unlikely that a corpus of N documents D1, D2, ..., DN will have documents of the same length. Let l_i denote the length of the document D_i. We followed some strategies for creating a set of keywords:

• Subdivide the set D1, D2, ..., DN into several subsets of approximately equal-size documents, and perform the above analysis for each subset separately.

• 'Scale' each document to the common length l of the smallest document. More precisely, for any word W we calculate K = Σ_{i=1}^{N} [m_i · l / l_i], where [x] denotes the integer part of a number x and m_i counts the number of appearances of the word W in a document D_i. For each document D_i, we calculate the NFA with this K and the new m_i ← [m_i · l / l_i]. All words with NFA < 1 comprise the set of keywords.
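The scaling strategy above can be sketched in Python; the function name keywords_by_scaling and the list-based inputs are assumptions made for this illustration:

```python
from math import comb

def keywords_by_scaling(counts, lengths):
    """counts[i] = m_i, occurrences of a word W in document D_i;
    lengths[i] = l_i.  Scale to the length l of the smallest document:
    m_i <- [m_i * l / l_i], K <- sum of the scaled m_i, then return the
    indices of documents where NFA = C(K, m_i) / N^(m_i - 1) < 1."""
    l = min(lengths)
    N = len(lengths)
    scaled = [int(m * l / li) for m, li in zip(counts, lengths)]
    K = sum(scaled)
    return [i for i, m in enumerate(scaled)
            if m >= 2 and comb(K, m) / N ** (m - 1) < 1]
```

With equal lengths the scaling is the identity and the test reduces to the equal-length NFA of equation (3.6).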
Chapter 4
Evaluation and Results
After the brief description of text summarization systems in earlier chapters, in this chapter I present collected information on automatic text summarization systems and summarize the experimental evaluation of the two keyword extraction methods presented in the previous chapter. First, in Section 4.1, I give short descriptions of more than 50 automatic text summarization systems. In Section 4.2, I show the experimental results of the comparison between the two keyword extraction methods for automatically extracting meaningful keywords, together with their execution times. I also present a model of how the proposed algorithm is implemented.
4.1 Text Summarization Systems
In order to explain what each column means, the following information is provided for Table 4.1. The first column (SYS, [REF], YEAR) gives the name of the system with its reference and year; the second column (INPUTs) distinguishes between single-document and multi-document summarization (both inputs can be possible). The third column (DOMAIN) indicates the genre of the input, that is, whether the system is designed for a specific domain or for a non-restricted domain. The next column (FEATURES) describes the characteristics and techniques used in each system. The fifth column (EVALUATION) describes what the authors evaluate to get the required output. The next column (METRICS) lists the metrics used in each system. The last column (OUTPUT) states whether the generated summary is an extract or an abstract.
Table 4.1: Overview of Text Summarization Systems

ADAM [96], 1975 — Input: single document. Domain: domain specific (chemistry). Features: semantic codes and sentence rejection or selection; syntactic codes and coherence. Evaluation: program speed, abstract size and acceptability of the summaries. Metrics: human judgements. Output: indicative abstracts.

ANES [97], 1995 — Input: multi-document. Domain: domain specific (news). Features: statistical corpus analysis, signature word selection, sentence weighting and sentence selection. Evaluation: summary rejection evaluation; retrieval effectiveness. Metrics: recall and precision. Output: indicative abstracts.
DimSum [98], 1997 — Input: single document. Domain: domain independent. Features: uses an NLP system to automatically extract signature words and multi-word phrases; a knowledge-acquisition tool derives domain knowledge from a large corpus by calculating idf values for selecting signature words, deriving collocations statistically, and creating a word-association index to capture the lexical cohesion of signature words through name aliasing with the NameTag tool, synonyms with WordNet, and morphological variants with morphological pre-processing; experimented with two-stage feature combining: Batch Feature Combiner and Trainable Feature Combiner. Evaluation: whether a generic summary could substitute for the full-text document. Metrics: precision and recall. Output: extracts.
SUMMONs [99], 1998 — Input: multi-document. Domain: domain specific (online news). Features: extracts data from the different sources and combines it into a conceptual representation of the summary; the conceptual representation is passed to the lexical chooser and then through a sentence generator using the FUF/SURGE language generation system. Evaluation: task-based evaluation to determine the quality of the generated summary. Metrics: quality of summary. Output: abstracts.

SUMMARIST [100], 1998 — Input: multi-document. Domain: domain specific (news). Features: combines robust NLP processing with symbolic world knowledge; performs topic identification, interpretation and summary generation. Metrics: compression ratio, retention ratio, recall and precision, coverage. Output: extracts.
Marcu [101], 1999 — Input: single document. Domain: domain specific (news). Features: discourse-based summarizer; uses a rhetorical parsing algorithm to determine the discourse structure of the input text and determines a partial ordering on the elementary and parenthetical units of the text. Evaluation: adequacy of discourse-based methods for summarizing texts. Metrics: precision and recall. Output: extracts.
MultiGen [102], 1999 — Input: multi-document. Domain: domain specific (news). Features: a content planner selects content through the intersection of predicate-argument structures by comparing phrases; it identifies common phrases across multiple sentences, orders the selected phrases, and finds the arguments with the information needed for clarification; it produces fluent sentences that combine these phrases, arranging them in novel contexts; to avoid redundant information in the summary, it intersects theme sentences to identify the common phrases used to generate new sentences; the FUF/SURGE language generator is used. Evaluation: identifying common phrases throughout multiple sentences for content selection. Metrics: precision and recall. Output: abstracts.
Chin & Len [103], 2000 — Input: multi-document. Domain: domain specific (news). Features: a proposed monolingual and multilingual (English and Chinese) summarizer; contains multilingual clustering, finding matchings among clusters in different languages within a multilingual cluster; uses three kinds of linguistic knowledge: punctuation marks, linking elements and topic chains; finds the similarity among meaningful units in the articles; a language generator is used. Evaluation: similarity of meaningful units. Metrics: precision rate, recall rate. Output: extracts.
MEAD [104], 2001 — Input: multi-document. Domain: domain specific (news articles). Features: uses three features: centroid score, position, and overlap with the first sentence; uses the LT-POS software to mark sentence boundaries automatically; discards the most similar sentences and also considers the length of the sentence; experimented with Cross-Document Structure Theory (CST). Evaluation: relationships between article pairs. Metrics: human judgments. Output: extracts.
NewsInEssence [105], 2001 — Input: multi-document. Domain: domain specific (news articles). Features: summarizes a topic-based cluster of articles; finds articles by traversing links from a page and adds them to the cluster of similar articles by going to search engines; NewsTroll determines interesting URLs to fetch new pages; CST is used to find relations between clusters. Evaluation: (blank). Metrics: unknown. Output: extracts.
WebInEssence [106], 2001 — Input: multi-document. Domain: domain-independent. Features: a Web-based multi-document summarization and recommendation system; uses a personalized search engine called MySearch; a centroid-based technique is used; four major modes of operation: generic search, generic search + summarization, generic search + clustering + summarization, and personalized mode. Evaluation: to identify the most relevant clusters and to improve usability, scalability, readability, etc. Metrics: cluster score, keyword in context, catching the document, catching query results. Output: extracts and personalized summaries.
NeATS [107], 2002 — Input: multi-document. Domain: domain specific (news articles). Features: techniques used: sentence position, term frequency, topic signature and term clustering; to improve cohesion and coherence, stigma-word filters and time stamps are used; Webclopedia's ranking algorithm is used to rank sentences. Evaluation: in DUC-01; evaluates the relative importance of sentence positions; the system records the most relevant sentences of the summary and compares them with human judgment. Metrics: precision, recall, F-measure. Output: extracts.
Columbia [108], 2002 — Input: multi-document. Domain: domain specific (news articles). Features: a composite of two systems, MultiGen and DEMS, for generating single-document and multi-document summaries respectively; statistical parameters are used to extract sentences; able to generate extractive and abstractive summaries. Evaluation: quality of summary compared with human judgment. Metrics: precision, recall. Output: extracts & abstracts.
GLEANS [109], 2002 — Input: multi-document. Domain: unknown. Features: maps all the documents into a database-like representation; classifies document sets into four categories: single person, single event, multiple event and natural disaster; generates a short headline using predefined templates; generates summaries by extracting sentences from the database. Evaluation: determine the error of the different categories; evaluated on the DUC-2002 corpus. Metrics: coverage score, grammatical effect, coherence, cohesion. Output: headlines, extracts & abstracts.
GISTexter [110], 2002 — Input: single and multi-document. Domain: domain specific (news articles). Features: for single-document summarization, sentence extraction is performed and unnecessary information is filtered out; for multi-document summarization when the topic is known, CICERO Information Extraction identifies all the necessary information used in the summary; when the topic is not known, the topic is modeled in an ad-hoc manner to generate the summary. Evaluation: evaluated on the DUC-2002 corpus; measures the overlap between system-generated summaries and the human-made gold-standard summary. Metrics: precision and recall. Output: headlines, extracts & abstracts.
NTT [53], 2002 — Input: single document. Domain: unknown. Features: features for sentence extraction are sentence position, length, weight of headlines, prepositions and verbs; to classify sentences, a Support Vector Machine and a machine learning algorithm are used. Evaluation: experimented with DUC-2002 data to evaluate the similarity between the generated summary and the quality of the summary. Metrics: mean coverage, length-adjusted coverage, readability quality, precision and F-score. Output: extracts.

SumUM [111], 2002 — Input: multi-document. Domain: domain specific (technical articles); explores the issues of dynamic summarization. Features: a composite of shallow syntactic and semantic analysis, concept identification and text regeneration processes. Evaluation: made in intrinsic or extrinsic fashions; intrinsic evaluation measures the quality of the summary, and extrinsic evaluation measures how helpful a summary is. Output: abstracts.
Newsblaster [108], 2002 — Input: multi-document. Domain: domain specific (news). Features: an on-line news summarization system; articles are clustered using Topic Detection and Tracking (TDT); summarizes single-event, multi-event and biographical documents; DEMS (Dissimilarity Engine for Multidocument Summarization) is used; uses an agglomerative clustering method and a log-linear statistical model to group similar features; the features are terms, noun phrase heads and proper nouns; thumbnails of images are displayed automatically. Evaluation: used the DUC corpus. Metrics: precision and recall. Output: extracts.
Robust Summarization [112], 2003 — Input: single document. Domain: unknown; generic and query-based. Features: GATE components produced by ANNIE combined with well-established statistical techniques; supports generic and query-based summarization; sentences were extracted using a scoring process. Evaluation: determines the quality of the output compared with human-made query summaries. Metrics: mean coverage, median coverage, mean length-adjusted coverage, mean quality, relevant questions. Output: extracts.

Lethbridge [113], 2003 — Input: single and multi-document. Domain: unknown. Features: can produce very short summaries of documents; used TDT and TREC cluster documents; lexical information is used for TDT topic detection; identified the best feature combination. Evaluation: DUC 2003 evaluates the quality of the output documents. Metrics: precision, recall, F-score. Output: extracts.
Copeck et al. [114], 2003 — Input: single document. Domain: unknown. Features: to detect proper-quality sentences and paragraphs, it removes tables, abstracts, reference lists, page headers and footers, and duplicate matches in a substring database; first extracts key phrases from the document and then, depending on the most pertinent key phrases, picks sentences; the LT-POS chunking parser is used to generate headlines; the TDT technique is used for topic specification. Evaluation: the DUC 2003 data set is used; length of summaries. Metrics: mean coverage, median coverage. Output: extracts.
Erkan et al. [115], 2004 — Input: multi-document. Domain: unknown. Features: an extractive summarization environment; consists of three steps: feature extractor, feature vector and reranker; the features are position, length cutoff, centroid, SimWithFirst, LexPageRank and QueryPhraseMatch. Evaluation: evaluated in DUC 2004; evaluates overall system performance. Metrics: ROUGE-1, ROUGE-W. Output: extracts.
UAM [116], 2004 — Input: single document. Domain: unknown. Features: generates very short summaries (less than 75 bytes) for headline generation; identifies the most relevant sentences; verb phrases were extracted using the χ² weight; identified the best feature combination using a genetic algorithm. Evaluation: used DUC-2003 data for the implementation; unigram and bigram sentences improved recall but failed to identify a threshold using trigram and four-gram results. Metrics: ROUGE-1. Output: extracts.
Filatova et al. [117], 2004 — Input: multi-document. Domain: unknown. Features: an MSR-NLP summarization system; the main goals of the system are to explore an event-centric approach to summarization and to explore a generation approach to summary realization; PageRank algorithms are used to identify highly weighted nodes in a document graph. Evaluation: experiments with the DUC 2003 and DUC 2004 data sets; the quality of the generated summaries is determined by comparison with human judgments. Metrics: ROUGE. Output: extracts.
CRL/NYU [118], 2004 — Input: multi-document. Domain: unknown. Features: based on sentence extraction; TDT and TREC techniques are used for clustering; to estimate the significance of sentences, scoring functions such as position, length, tf*idf and headline are used; to detect redundant information, the similarity between sentences is estimated; a module is used to categorize document sets into two groups corresponding to the distribution of key sentences. Evaluation: to check quality, questions in DUC 2004. Metrics: mean coverage and ROUGE. Output: extracts.
CLASSY [119], 2005 — Input: multi-document. Domain: domain specific (news articles). Features: a query-based summarization system; "shallow parsing" techniques are used; to score each sentence in a document, a Hidden Markov Model is used. Evaluation: experimented with the DUC 2005 data sets; the quality of the output summaries is identified by comparison with human-made summaries. Metrics: ROUGE-1, pyramid score. Output: extracts.

CATS [120], 2005 — Input: multi-document. Domain: domain specific (news articles). Features: a topic-oriented summarization system; consists of 5 steps: question analysis, document analysis (the TextTiling algorithm is used for thematic segmentation of each document), sentence scoring, sentence compression and sentence selection. Evaluation: NIST evaluates the summaries in DUC 2005 to determine the quality and relevance of the summaries compared with human-made summaries. Metrics: ROUGE. Output: extracts.
ERSS [121], 2005 — Input: multi-document. Domain: domain specific (news articles). Features: based on a single strategy, the generation and processing of coreference chains using fuzzy set theory; implemented in the GATE framework; a pipeline of processing components is run in sequence to process the documents; the important components are: POS Tagger, NE Transducer, NP/VP Chunker, Fuzzy Coreferencer. Evaluation: the DUC 2004 and DUC 2005 data sets are used. Metrics: ROUGE-1, ROUGE-2, ROUGE-SU4, Pyramid and Basic Element Score. Output: extracts.
66
accuracy
summarizer.
used.
To improve accuracy of term
applied.
is
performance,
problem, genetic algorithm is
method
To check quality,
To solve NP hard optimization
TFS
used.
articles sentences.
frequency,
and DUC 2005 is
of
DUC 2003
the conjunction of the original
2002,
news articles
document
from set of summaries formed by
Data set from DUC
EVALUATION
specific:
Optimal summary is extracted
FEATURE
2006
DOMAIN
Domain
INPUTs
MSBGA,[122], Multi
YEAR
SYS, [REF],
Table 4.1 – continued from previous page . . .
Extracts
OUTPUT
Continued on next page . . .
ROUGE-W
ROUGE-1,
METRICS
4.1. TEXT SUMMARIZATION SYSTEMS
FEMsum [123], 2007 — Input: single and multi-document. Domain: domain specific (news articles). Features: provides answers to complex questions; based on a query-focused summarization task; uses a graph to represent the relations between candidate sentences; organized in three language-independent components: Relevant Information Detector (RID), Content Extractor (CE) and Summary Composer (SC). Evaluation: uses the DUC-2007 data sets; checks linguistic quality on summary length against human summarizers; two baselines are used. Metrics: mean scores. Output: extracts.
QCS [124], 2007 — Input: single and multi-document. Domain: unknown. Features: portable and modular, and permits experimentation with different instantiations of each of the constituent text analysis components; developed as a client-server application in the languages C and C++ and tested under the SunOS and Linux operating systems; uses signature terms. Evaluation: the DUC 2002-2004 data sets are used to measure the quality and performance of the algorithm. Metrics: ROUGE-1 and ROUGE-2. Output: extracts.
GOFAIsum [125], 2007 — Input: multi-document. Domain: domain specific (news articles). Features: a topic-answering and summarizing system developed for DUC 2007; uses the linguistic knowledge source FIPS; to manipulate data, it is represented in tree structures using XML and XSLT. Evaluation: evaluated by NIST; measures responsiveness and linguistic quality of the summaries. Metrics: ROUGE-2, ROUGE-SU4. Output: extracts.
NetSum [126], 2007 — Input: multi-document. Domain: domain specific (news articles). Features: extracts three sentences from a single document that best match the first three highlights; a neural network (RankNet) is used to rank the sentences; RankNet is implemented in the LambdaRank algorithm; to speed up performance, it chooses n sentences to match highlight n. Evaluation: compared against the baseline of choosing the first three sentences as the block summary; various characteristics of the sentences. Metrics: ROUGE-1, ROUGE-2. Output: extracts.
FastSum [127], 2008 — Input: multi-document. Domain: domain specific (news articles). Features: a machine learning approach is used to rank all sentences in the topic cluster; two sets of features are used: (i) word-based: the probability of words for the different containers is considered; (ii) sentence-based: the length and position of the sentence in the document are considered; a regression SVM is used for learning the feature weights. Evaluation: compared performance with the DUC-2006 and DUC-2007 competitions, evaluating the performance of applying each feature separately; better than the PYTHY system for 2006. Metrics: ROUGE-2. Output: extracts.
PPRSum [128], 2008 — Input: multi-document. Domain: unknown. Features: a query-based summarization system; computes global features of the salience model of sentences with a Naive Bayes model; calculates a personalized view of the importance of the pages, computing a personalized prior probability to get the salience model and relevance model of sentences in the corpus; to reduce redundancy, an MMR model is used. Evaluation: the DUC-2007 dataset is used; compares performance with other systems. Metrics: ROUGE-2 and ROUGE-4. Output: extracts.
Adasum [129], 2008 — Input: multi-document. Domain: unknown. Features: a topic-oriented summarization system; an adaptive model for text summarization in which it is assumed that summary and topic representation can be mutually boosted. Evaluation: the DUC-2007 data set is used; compared with the top performing systems of DUC 2007. Metrics: ROUGE-2 and ROUGE-SU4. Output: extracts.

TEXT2TABLE [130], 2009 — Input: multi-document. Domain: domain specific (medical records). Features: investigates what kind of information is helpful for negative event identification; the CRF toolkit is used to identify negative events; an SVM classifier is used to distinguish negative events from other events; performance with various feature combinations is examined. Evaluation: medical records are used for the experiment. Metrics: precision, recall and F-measure. Output: extracts converted into a table structure.
OHSU [131], 2009 — Input: multi-document. Domain: domain independent. Features: a query-based summarization system; a log-linear model is used to classify each word in a sentence; the sentence ranking methods are query neural ranking and query-focused ranking using various features; entity linking uses internal Wikipedia links. Evaluation: DUC-2005 is used for training data and DUC-2006 for development data; the CSLU-OHSU1 and CSLU-OHSU2 system runs are used for testing; the TAC 2009 KBP query set is used to evaluate the performance of the entity linking system. Metrics: ROUGE. Output: extracts.
Hachey [132], 2009 — Input: multi-document. Domain: domain specific (news articles). Features: based on Generic Relation Extraction; models Information Extraction (IE), which includes relations; captures semantic similarities between connectors with a model based on latent Dirichlet allocation; relies on dependency parsing. Evaluation: the DUC-2001 data sets are used; compared with the human-made summaries. Metrics: ROUGE-1 and ROUGE-SU4. Output: extracts.
Multi
document
2010
INPUTs
TIARA,[133],
YEAR
SYS, [REF],
keyword selection
collections of text.
-used topic analysis techniques
to
room
records.
77
and
its
associated
for different time segments.
-select time-sensitive keywords
topics.
document
-Lucene is used to index each
documents.
derive topics from large
of
users explore and analyze large
toolkit
time
Evaluate
sensitive
quality
LDA topic analysis.
is used to perform
modeling
emergency
and
interactive visualization to help
analytics
e-mail,
text
Evaluation:topic
EVALUATION
combines
-visual analytic system, which
FEATURE
specific:
Domain
DOMAIN
Table 4.1 – continued from previous page . . .
view
visualization
Extracts and
OUTPUT
Continued on next page . . .
distinctiveness
and
completeness
F-Score,
Precision,
METRICS
4.1. TEXT SUMMARIZATION SYSTEMS
78
2010
al.,[135],
Shi
2010
al.,[134],
Anlei
YEAR
et.
et
SYS, [REF],
document
Multi
document
Multi
INPUTs
To
finding
visual
analytical process is done.
-for
interactions of data facet.
pattern
manipulating, and customizing
-navigation methods are used for
structured facet.
associated
presented.
studies
are
case
and
content
facet
of the system, two
check performance
time facet, category facet, text
four categories:
-classifying its data facets into
gain
are used.
Visualization
Text
metrics
evaluation
OUTPUT
Continued on next page . . .
Unknown
(NDCG).
cumulative
discounted
normalized
(DCG) and
gain
cumulative
discounted
METRICS
different categories of features
Evaluation:
are done.
-to rank recency sensitive queries,
Unknown
offline experiments
or not.
quality
of ranking model
Evaluate
for both online and
into
-determine query is time sensitive
freshness
news queries
takes
Evaluation:
EVALUATION
account.
which
-ranking documents by relevance
FEATURE
Breaking
specific:
Domain
DOMAIN
Table 4.1 – continued from previous page . . .
4.1. TEXT SUMMARIZATION SYSTEMS
INPUTs
79
with
geographical
-Automatic Antichain Selection
to specify crowd resolution.
studies
presented.
area
a
of
are
large
users
groups
with
compactly for a given time stamp.
corpus
the
a
goal of identifying
cloud,
microblogging
to
convey crowd size and content
tag
particular
applied
Twiiter users
-multilevel
to
clusters
topic
relating
relevant
collection of
users
-most
of
salient
units, similarities.
overlap
2011
domain
select
into
and
Measures
2006 data sets.
DUC 2005 and DUC
EVALUATION
specific:
Multi
and
split
sentences from document (s).
sentences
-documents
are
redundancy
relevance,
length.
three
-optimize
document
text
properties:
summarization
-unsupervised
FEATURE
Multi
Unknown
DOMAIN
Crowds,[137][], document
Theme
2011
MCMR,[136], Single and
YEAR
SYS, [REF],
Table 4.1 – continued from previous page . . .
summaries.
and tag of
Resolution
Correct
Extracts
OUTPUT
Continued on next page . . .
No
ROUGE-SU4
and
ROUGE-2
METRICS
4.1. TEXT SUMMARIZATION SYSTEMS
2011
al.,[139],
document
Independent
Domain
Multi
Genest
et
NUS1 and NUS2.
Concept
of
Information
80
characteristics of the predicates.
predicates between them, and
properties,
overall
quality
responsiveness.
their
the
text,
and
in
all
-identified
entities
Evaluate
used.
TAC 2010 data set
sentences.
Items (INIT) is used to extract
-
-calculate CSI for each sentence.
category-specific features.
and
generic
features
Comparison against
-consists of two classes of feature:
news articles
is used.
learning approach
specific:
TAC-2011 data set
EVALUATION
document
-based on supervised machine
FEATURE
2011
DOMAIN
Domain
INPUTs
SWING,[138], Multi
YEAR
SYS, [REF],
Table 4.1 – continued from previous page . . .
Abstract
Extract
OUTPUT
Continued on next page . . .
score.
pyramid
quality,
linguistic
ROUGE-SU4
ROUGE-2,
METRICS
4.1. TEXT SUMMARIZATION SYSTEMS
81
document
2012
document
Multi
Multi
et
INPUTs
UWN,[141],
2011
al.,[140],
Tsarve
YEAR
SYS, [REF],
Independent
Domain
Unknown
DOMAIN
already
meanings
defined
to
terms
in
the
in
other extensions.
-introduced MENTA and some
WordNet.
languages
link
different
-automatically
different languages
for
200 languages.
links
term
-sense
of
random
relationships of words in over
Evaluate
samples
base,
is
which describes the meanings
knowledge
evaluated.
and Hierarchical clustering
-lexical
methods
applied such as, Flat clustering
are
summarization
methods
two
clustering
the summary, art of
-In Unsupervised classification,
of
performance
and
determine
multi-label is learned.
classification,
quality
supervised
-In
used.
To
classification
DUC 2002 dataset is
EVALUATION
techniques.
unsupervised
-It used both supervised and
FEATURE
Table 4.1 – continued from previous page . . .
Extract
Extract
OUTPUT
Continued on next page . . .
Precision
ROUGE-W.
and
Rouge-S4
ROUGE-L,
ROUGE-2,
METRICS
4.1. TEXT SUMMARIZATION SYSTEMS
document
et
2012
al.,[142],
Multi
INPUTs
Mirroshandel
YEAR
SYS, [REF],
blank
DOMAIN
for
retrieve
82
and
data
OTC,
,
TimeBank
corpus is used.
TDT
extraction.
EVITA, for event
SVM classification.
LIBSVM, for the
for
software
extracting features.
used
INDRI
related texts.
relations
EMTRL
temporal
and
:
EVALUATION
between events. -SVM is used for
extracting
BCDC
-Introduced two new methods,
FEATURE
Table 4.1 – continued from previous page . . .
relations.
and Overlap
Before, After,
METRICS
Extract
OUTPUT
4.1. TEXT SUMMARIZATION SYSTEMS
4.2. EXPERIMENTAL RESULTS ON KEYWORD EXTRACTION METHODS
4.2
Experimental results on Keyword Extraction Methods
The performance of the proposed algorithm was studied on a relatively large corpus of
documents. To illustrate the results, we selected a set of more than one hundred articles
on globalization, each consisting of more than one hundred words on average. First,
punctuation was removed from the documents; in preprocessing, only stop-word filtering
is performed. To address the problem of variable document length, an adaptive window
size mi was applied for each document, so K and the mi value vary per document. To
implement tf-idf, the idf value is likewise varied for each document. The meaningful
words are extracted using two methods: tf-idf and the Helmholtz principle, and the
numbers of words extracted by the two methods are compared. To extract meaningful
words according to the Helmholtz principle, expression (7) is applied to the corpus of
documents of different lengths. Figure 4.1 shows that, as document size increases, the
number of meaningful words increases in the case of NFA, whereas for tf-idf the number
of meaningful words does not depend on the size of the documents; in each document,
NFA extracts more meaningful words than tf-idf. We followed the same procedure as
for the Helmholtz principle to calculate tf-idf: the adaptive window size is applied in
each document, and the idf and tf values vary per document. To separate easily the
numbers of meaningful words extracted using tf-idf at different threshold values, a log
function is applied. Words with log(tf-idf) greater than -8.5, -7.5 and -6.5 are compared
with the number of words obtained using NFA, as shown in Figure 4.2; more meaningful
words are extracted than with NFA. In Figure 4.3, NFA is compared with log(tf-idf)
values greater than -4.5 and -5.5. Each data set is analyzed, and the number of words in
these documents varies dramatically. For log(tf-idf) > -4.5, the number of meaningful
words is greater than that of NFA up to approximately 1500 words; beyond that, the
number of extracted words decreases. For log(tf-idf) > -5.5, more meaningful words are
extracted up to approximately 13000 words, and then the number decreases.
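The Helmholtz-principle test described above can be illustrated in code. Expression (7) is not reproduced in this excerpt, so the sketch below assumes the standard number-of-false-alarms form from the Helmholtz-principle literature, NFA(w) = C(K, m) / N^(m-1), where m is the count of word w in the current window, K its count in the whole corpus, and N the number of windows; a word is meaningful when NFA < 1. Function and variable names are illustrative.

```python
from collections import Counter
from math import comb, log

def meaningful_words(doc_words, corpus_counts, n_windows):
    """Flag words whose number of false alarms (NFA) is below 1.

    NFA(w) = C(K, m) / N^(m-1), where m is the number of occurrences
    of w in this document (treated as one window), K its number of
    occurrences in the whole corpus, and N the number of windows the
    corpus is split into.  A word is 'meaningful' under the Helmholtz
    principle when NFA < 1, i.e. when -log NFA > 0.
    """
    local = Counter(doc_words)
    result = {}
    for w, m in local.items():
        k = corpus_counts[w]
        if m < 2 or k < m:
            continue  # a single occurrence is never unexpected
        log_nfa = log(comb(k, m)) - (m - 1) * log(n_windows)
        if log_nfa < 0:           # NFA < 1  =>  meaningful
            result[w] = -log_nfa  # larger value = more meaningful
    return result
```

With the adaptive window size used above, N and the per-window counts would be recomputed for each document rather than fixed once for the corpus.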
The six most meaningful words extracted from the set of globalization articles are:
economic, population, government, poor, development and political.

Figure 4.1:

Figure 4.2:

Figure 4.3:

Figure 4.4:

To execute the experiment, we used a Python tool on a highly configured system.
Figure 4.4 shows a comparison of the run time of extracting keywords using NFA and
tf-idf. Extracting meaningful words using the Helmholtz principle is very fast compared
to using tf-idf.
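For comparison, the tf-idf side of the experiment can be sketched in the same style. The adaptive window size and per-document idf variation used above are not modelled here; a plain corpus-level idf is used instead, and the cut-off mirrors the log(tf-idf) > -6.5 threshold from Figure 4.2. Names are illustrative.

```python
import math
from collections import Counter

def tfidf_keywords(docs, threshold=-6.5):
    """Return, per document, the words whose log(tf-idf) exceeds
    `threshold`.  tf is the within-document relative frequency and
    idf the usual log(N / df); words with tf-idf = 0 (present in
    every document) are excluded before taking the logarithm."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency of each word
    keywords = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        kept = {}
        for w, c in counts.items():
            tfidf = (c / total) * math.log(n_docs / df[w])
            if tfidf > 0 and math.log(tfidf) > threshold:
                kept[w] = tfidf
        keywords.append(kept)
    return keywords
```

Unlike the NFA test, this selection depends on the chosen threshold, which is why the experiment above sweeps several log(tf-idf) cut-offs.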
86
Chapter 5
Conclusion and Future work
In this thesis, we have presented an overview of text summarization and a taxonomy
of text summarization approaches. Categories of evaluation methods are also explained,
and a general overview of automatic text summarization systems, with their main
features, is given. Text summarization benefits many other tasks, mainly information
retrieval, information extraction and text categorization. Research on text
summarization started more than five decades ago and is still going on; ever more
developed techniques are applied, but improvement is still required. In future we plan
to study more systems and the applied techniques that improve summary quality.
The keyword extraction method using the Helmholtz principle was compared with the
most popular keyword extraction method, i.e. tf-idf. We compared NFA with different
levels of tf-idf values for extracting the meaningful words, and showed the time
consumed by both methods. When the size of the documents increases, the number of
meaningful words also gradually increases; tf-idf, however, takes the maximum time to
run, and the number of meaningful words it extracts is larger than with NFA.
The meaningful words obtained through the NFA and tf-idf methods will help to
create summaries of the documents. The tf-idf values can be applied in SVD to produce
the output. We will apply evaluation measures to the output summaries from both
keyword extraction methods and compare the quality of the summaries.
87
Bibliography
[1] J. Wang, W. Shao, and F. Zhu. Biological term boundary identification by maximum entropy
model. In Sixth IEEE conference on Industrial Electronics and Applications, pages 2446–2448.
IEEE, June 2011.
[2] I.Mani. Automatic Summarization, volume 3. John Benjamins Publishing Company, 2001.
[3] E.Lloret. Text summarization : An overview. 2006.
[4] G.L.Thione, M.V.D.Berg, L.Polanyi, and C.Culy. Hybrid text summarisation:combining external
relevance measures with structural analysis. In Proceedings of the Association of Computational
Linguistics, Workshop Text Summarization Branches Out, pages 25–26, Barcelona, Spain, July
2004.
[5] M.Y.Kan. Automatic text summarization as applied to information retrieval : Using indicative and
Informative Summaries. PhD thesis, Columbia University, 2003.
[6] L.H.Reeve, H.Han, and A.D.Brooks. The use of domain-specific concepts in biomedical
text summarization. Journal of Information Processing and Management, 43(6):1765–1776,
November 2007.
[7] F.Bex and B.Verheij. Arguments, stories and evidence: critical questions for fact-finding. In
Proceedings of the seventh Conference of the International Society for the Study of Argumentation,
pages 71–84, Sic Sat, Amsterdam, 2010.
[8] L.He, E.Sanocki, A.Gupta, and J.Grudin. Automatic summarization of audio-video presentation.
In Proceeding of Seventh ACM International Conference on Multimedia, pages 489–498, New
York,USA, 1999. ACM.
[9] D.Wu and Y.Bao. A summarization of multimedia resource digital rights expression language.
In Proceedings of second International Conference on Network Security, Wireless Communications
and Trusted Computing, pages 374–377. IEEE, 2010.
[10] E.B.Euripides, E.G.M.Petrakis, and E.Milios.
Automatic website summarization by image
content: A case study with logo and trademark images. IEEE Transaction on Knowledge and
Data Engineering, 20(9):1195–1204, September 2008.
[11] G.Manson and S.A.Berrani. Automatic tv broadcast structuring. International Journal of Digital
Multimedia Broadcasting, pages 153160–153176, January 2010.
88
[12] T.S.Guzella and W.M.Caminhas. A review of machine learning approaches to spam filtering.
Journal of Expert Systems with Applications,Elsevier, 36(7):10206–10222, September 2009.
[13] J.C.Gomez, E.Boiy, and M.F.Moens.
Highly discriminative statistical features for e-mail
classification. Journal of Knowledge and Information Systems,Springer, 31(1):23–53, April 2012.
[14] X.L.Wang, H.Zhao, and B.L.Lu. Automated quality assessment of web pages form textual
content. In Proceedings of the International Conference on Machine Learning and Cybernetics,
pages 2000–2006. IEEE, July 2012.
[15] W.Hersh. Information Retrieval: a Health and Biomedical Perspective. Springer, New York, 2009.
[16] S.Teufel and M.Moens.
Summarizing scientific articles: Experiments with relevance and
rhetorical status. Journal of Computational Linguistics, 28(4):409–445, December 2002.
[17] I.T.Medeni, S.Peker, and M.E.Uyar. A knowledge visulization model for evaluating internet news
agencies on conflicting news. In Proceedings of the thirty fourth International Convention, MIPRO,
pages 850–853. IEEE, 2011.
[18] H.P.Luhn.
The automatic creation of literature abstract.
IBM Journal of Research and
Development, 2(2):159–165, April 1958.
[19] A.Ramanathan and D.D.Rao. A lightweight stemmer for hindi. In Proceedings of the
10th Conference of the European Chapter of the Association for Computational Linguistics,
Computational Linguistics for South Asian Languages Workshop, April 2003.
[20] U.Mishra and C.Prakash. Maulik: An effective stemmer for hindi language. International
Journal on Computer Science and Engineering, 4(5):711–717, May 2012.
[21] H.P.Edmundson. New methods in automatic extracting. Journal of Association of Computational
Linguistics, 16(2):264–285, April 1969.
[22] M.Chandra, V.Gupta, and S.Paul. A statistical approach for automatic text summarization
by extraction. In Proceeding of the International Conference on Communication Systems and
Network Technologies, pages 268–271. IEEE, 2011.
[23] M.R.Murthy, J.V.R.Reddy, P.P.Reddy, and S.C.Satapathy. Statistical approach based keyword
extraction aid dimensionality reduction. In Proceedings of the International Conference on
Information Systems Design and Intelligent Applications. Springer, 2011.
[24] J.Morris and G.Hirst. Lexical cohesion computed by thesaural relations as an indicator of the
structure of text. Journal of Computational Linguistics, ACM, 17(1):21–48, March 1999.
[25] T.Pedersen, S.Patwardhan, and J.Michelizzi. Wordnet::similarity : Measuring the relatedness of
concepts. In Proceedings in Human Language Technology Conference - North American chapter
of the Association for Computational Linguistics Annual Meeting- Demonstrations, pages 38–41.
ACM, 2004.
89
[26] H.G.Silber and K.F.McCoy. Efficient text summarization using lexical chains. In Proceedings
of Fifth International Conference on Intelligent User Interfaces, pages 252–255, New York, USA,
2000. ACM.
[27] G.Ercan and I.Cicekli. Using lexical chains for keyword extraction. International Journal of
Information Processing and Management,ACM, 43(6):1705–1714, November 2007.
[28] W.C.Mann and S.A.Thompson. Rhetorical structure theory: Towards a function theory of text
organization. Text - Interdisciplinary Journal for the Study of Discourse, 8(3):243–281, November
1988.
[29] V.R.Uzeda, T.Pard, and M.Nunes. Evaluation of automatic summarization methods based on
rhetorical structure theory. In Eighth International Conference on Intelligent Systems Design and
Applications, pages 389–394. IEEE, November 2008.
[30] L.Chengcheng.
Automatic text summarization based on rhetorical structure theory.
In
International Conference on Computer Application and System Modeling, pages 595–598,
Piscataway, NJ, USA, October 2010. IEEE.
[31] M.Litvak and M.Last. Graph based keyword extraction for single document summarization.
In Proceedings of the Workshop on Multi-source Multilingual, Information Extraction and
Summarization, pages 17–24. ACM, 2008.
[32] S.Brin and L.Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings
of the Seventh International World Wide Web Conference, Computer Networks and ISDN Systems,
pages 107–117. Elsevier, April 1998.
[33] R.Mihalcea.
Graph-based ranking algorithms for sentence extraction, applied to text
summarization. In Proceedings of the Association for Computational Lingusitics on Interactive
poster and demonstration sessions. ACM, 2004.
[34] D.Horowitz and S.D.Kamvar. Anatomy of a large-scale social search engine. In Nineteenth
International Conference on World Wide Web, pages 431–440, New York, USA, 2010. ACM.
[35] M. Efron. Information search and retrieval in microblogs. Journal of the American Society for
Infromation Science and Technology, 62(6):996–1008, June 2011.
[36] D.D.Lewis. Naive (bayes) at forty: The independence assumption in information retrieval. In
Proceedings of tenth European Conference on Machine Learning, pages 4–15, London, UK, 1998.
Springer- Verlag.
[37] J.D.M.Rennie, L.Shih, J.Teevan, and D.R.Karger. Tackling the poor assumptions of naive bayes
classifiers. In Proceedings of International Conference on Machine Learning, pages 616–623,
2003.
[38] T.Mouratis and S.Kotsiantis. Increasing the accuracy of discriminative of multinomial bayesian
classifier in text classification. In Proceedings of fourth International Conference on Computer
Sciences and Convergence Information Technology, pages 1246–1251. IEEE, November 2009.
90
[39] C.Y.Lin. Training a selection function for extraction. In Proceedings of the eighth International
Conference on Information and Knowledge Management, pages 55–62, NY, USA, 1999. ACM.
[40] K.Knight and D.Marcu. Statistics-based summarization- step one: Sentence compression. In
Proceeding of the Seventeenth National Conference of the American Association for Artificial
Intelligence, pages 703–710, 2000.
[41] L.Rabiner and B.Juang. An introduction to hidden markov models. Acoustics Speech and Signal
Processing Magazine, 3(1):4–16, January 1986.
[42] R.Nag, K.H.Wong, and F.Fallside.
Script recognition using hidden markov models.
In
Proceedings of International Conference on Acoustics Speech and Signal Processing, pages
2071–2074. IEEE, April 1986.
[43] C.Zhou and S.Li. Research of information extraction algorithm based on hidden markov model.
In Proceedings of second International Conference on Information Science and Engineering, pages
1–4. IEEE, December 2010.
[44] M.Osborne. Using maximum entropy for sentence extraction. In Proceedings of the ACL-02
Workshop on Automatic Summarization, pages 1–8, Morristown, NJ, USA, 2002. Association for
Computational Linguistics.
[45] B.Pang, L.Lee, and S.Vaithyanathan. Thumbs up? : Sentiment classification using machine
learning technique. In Proceedings of the Association for Computational Linguistics conference
on Empirical methods in Natural Language Processing, pages 79–86. ACM, 2002.
[46] H.L.Chieu and H.T.Ng.
A maximum entropy approach to information extraction from semi-
structured and free text. In Proceedings of the Eighteenth National Conference on Artificial
Intelligence, American Association for Artificial Intelligence, pages 786–791, Menlo Park, Ca,
USA, 2002.
[47] R.Malouf.
A comparison of algorithms for maximum entropy parameter estimation.
In
Proceedings of sixth conference on Natural Language Learning, pages 1–7, Stroudsburg, PA, USA,
2002. ACM.
[48] P.Yu, J.Xu, G.L.Zhang, Y.C.Chang, and F.Seide. A hidden-state maximum entropy model
forward confidence estimation. In International Conference on Acoustic, Speech and Signal
Processing, pages 785–788. IEEE, 2007.
[49] M.E.Ruiz and P.Srinivasan. Automatic text categorization using neural networks. In Proceedings
of the Eighth American Society for Information Science/ Classification Research, American Society
for Information Science, pages 59–72, Washington, 1997.
[50] T.Jo. Ntc (neural network categorizer) neural network for text categorization. International
Journal of Information Studies, 2(2), April 2010.
[51] P.M.Ciarelli, E.Oliveira, C.Badue, and A.F.De-Souza.
Multi-label text categorization using
a probabilistic neural network. International Journal of Computer Information Systems and
Industrial Management Applications, 1:133–144, 2009.
91
[52] I.Guyon, J.Weston, S.Barnhill, and V.Vapnik. Gene selection for cancer classification using
support vector machines. mach. learn. Machine Learning, 46(1–3):389–422, March 2002.
[53] T.Hirao, Y.Sasaki, H.Isozaki, and E.Maeda. Ntt’s text summarization system for duc-2002. In
Proceedings of the Document Understanding Conference, pages 104–107, 2002.
[54] L.N.Minh, A.Shimazu, H.P.Xuan, B.H.Tu, and S.Horiguchi. Sentence extraction with support
vector machine ensemble.
In Proceedings of the First World Congress of the International
Federation for Systems Research, pages 14–17, Kobe, Japan, November 2005.
[55] T.K.Landauer, P.W.Foltz, and D.Laham. Introduction to latent semantic analysis. Journal of
Discourse Processes,, 25(2–3):259–284, 1998.
[56] B.Rosario. Latent semantic indexing: An overview. Technical report, Technical Report of
INFOSYS 240, University of California, 2000.
[57] T.Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Journal of Machine
Learning, 42(1–2):177–196, January–February 2001.
[58] M.Simina and C.Barbu. Meta latent semantic analysis. In IEEE International conference on
Systems, Man and Cybernetics, pages 3720–3724. IEEE, October 2004.
[59] J.T.Chien. Adaptive bayesian latent semantic analysis. IEEE Transactions on Audio, Speech, and
Language Processing, 16(1):198–207, January 2008.
[60] D.Wang, T.Li, S.Zhu, and C.Ding. Multi-document summarization via sentence-level semantic
analysis and symmetric matrix factorization.
In Proceedings of the Thirty First Annual
International ACM Special Interest Group on Information Retrieval Conference on Research and
Development in Information Retrieval, pages 307–314, New York, USA, 2008. ACM.
[61] J.H.Lee, S.Park, C.M.Ahn, and D.Kim. Automatic generic document summarization based
on non-negative matrix factorization. Journal on Information Processing and Management,
45(1):20–34, January 2009.
[62] T.G.Kolda and D.P.O’Leary. Algorithm 805: Computation and uses of the semidiscrete matrix
decomposition. Transactions on Mathematical Software, 26(3):415–435, September 2000.
[63] V.Snasel, P.Moravec, and J.Pokorny. Using semi-discrete decomposition for topic identification.
In Proceedings of the Eighth International Conference on Intelligent Systems Design and
Applications, pages 415–420, Washington, DC, USA, 2008. IEEE.
[64] D.R.Radev, E.Hovy, and K.McKeown. Introduction to the special issue on summarization.
Journal of Computational Linguistics, 28(4):399–408, December 2002.
[65] J.S.Kallimani, K.G.Srinivasa, and B.E.Reddy. Information retrieval by text summarization
for an indian regional language. In Proceedings of IEEE Natural Language Processing and
Knowledge Engineering, pages 1–4, Beijing, China, August 2010.
92
[66] M.J.Witbrock and V.O.Mittal. Ultra-summarization: A statistical approach to generating highly
condensed non-extractive summaries. In Proceedings of the 22nd annual international ACM
SIGIR conference on Research and development in information retrieval, pages 315–316, New
York, NY, USA, 1999. ACM.
[67] I.Mani, D.House, G.Klein, L.Hirschman, T.Firmin, and B.Sundhein. The tipster summac text
summarization evaluation. In Proceedings of the Ninth Conference on European chapter of the
Association for Computational Linguistics, pages 77–85. ACM, 1999.
[68] C.Y.Lin and E.Hovy. Manual and automatic evaluation of summaries. In Proceedings of the
Association Computational Linguistics-02 Workshop on Automatic Summarization, pages 45–51.
ACM, 2002.
[69] K.Papineni, S.Roukos, T.Ward, and W.J.Zhu. Bleu: a method for automatic evaluation
of machine translation. In Proceedings of the fortieth Annual Meeting on Association for
Computational Lingusitics, pages 311–318. ACM, July 2002.
[70] M.Hassel. Evaluation of Automatic Text Summarization. PhD thesis, School of Computer Science
and Communication,Royal Institute of Technology, Stockholm, Sweden, 2004.
[71] J.Steinberger and K.Jezek.
Evaluation measures for text summarization.
Computing and
Informatics, 28(2):251–275, 2009.
[72] C.Buckley and E.M.Voorhees. Evaluating evaluation measure stability. In Proceedings of the
Twenty Third Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, pages 33–40, NY, USA, 2000. ACM.
[73] D.R.Radev and D.Tam. Single-document and multi-document summary evaluation via relative
utility. In Proceedings of the ACM Conference on Information and Knowledge Management, New
Orleans, LA, 2003. ACM.
[74] H.Saggion, D.Radev, S.Teufel, W.Lam, and S.M.Strassel. Developing infrastructure for the
evaluation of single and multidocument summarization systems in a cross-lingual environment.
In Proceedings of Third International Conference on Language Resources and Evaluation,
pages 747–754, Las Palmas, Gran Canaria , Spain, 2002.
[75] C.Y.Lin. Rouge: A package for automatic evaluation of summaries. In Proceedings of the
Association for Computational Linguistics Workshop on Text Summarization Branches Out, pages
74–81, 2004.
[76] S.Maskey and A.Rosenberg. Power mean pyramid scores for summarization evaluation. In
Thirteenth Annual Conference of the International Speech Communication Association, Portland,
Oregon, USA, September 2002.
[77] J.Kaur and V.Gupta. Effective approaches for extraction of keywords. International Journal of
Computer Science, 7(6), November 2010.
93
[78] E.Frank, G.W. Paynter, I.H.Witten, C.Gutwin, and C.G.Nevill-Manning. Domain specific
keyphrase extraction. In Proceeding of Sixteenth International Conference on Artificial Intelligence, pages
668–673, San Francisco, CA,USA, 1999.
[79] I.H.Witten, G.W. Paynter, E.Frank, C.Gutwin, and C.G.Nevill-Manning. Kea: Practical
automatic keyphrase extraction. In Proceeding of fourth ACM conference on Digital libraries,
pages 254–255, New York, USA, 1999. ACM.
[80] P.D.Turney. Learning algorithm for keyphrase extraction. Journal of Information Retrieval,
2(4):303–336, May 1999.
[81] Y.H. Kerner, Z. Gross, and A. Masa. Automatic extraction and learning of keyphrases from
scientific articles. In Proceedings of sixth International Conference on Computational Linguistics
and Intelligent Text Processing, pages 657–669, Mexico City , Mexico, February 2005. Springer.
[82] F. Liu, D. Pennell, F. Liu, and Y. Liu. Unsupervised approaches for automatic keyword extraction
using meeting transcripts. In Proceedings of Annual Conference of the North American Chapter
of the Association for Computational Linguistics, pages 620–628, Stroudsburg, PA, USA, June
2009.
[83] K. Barker and N. Cornacchia. Using noun phrase heads to extract document keyphrases. In
Proceeding of thirteenth Biennial Conference of the Canadian Society on Computational Studies of
Intelligence, pages 40–52, Berlin, German, May 2000. Springer.
[84] B.Daille, E.Gaussier, and J.M.Lange. Towards automatic extraction of monolingual and
bilingual terminology. In Proceeding of the fifteenth conference on Computational linguistics,
pages 515–521, Stroudsburg, PA, USA, 1994. Springer.
[85] J.E.Burnett, D.Cooper, M.F.Lynch, P.Willett, and M.Wycherley. Document retrieval experiments
using indexing vocabularies of varying size. 1.variety generation symbols assigned to the fronts
of index terms. Journal of Documentation, 35:197–206, 1979.
[86] J.D.Cohen.
Highlights: Language- and domain-independent automatic indexing terms for
abstracting. Journal of the American Society for Information Science, 46(3):162–174, April 1995.
[87] Y.Matsuo and M.Ishizuka.
Keyword extraction from a single document using word
co-occurrence statistical information.
International Journal on Artificial Intelligence Tools,
13(1):157–169, 2004.
[88] H.G.Silber and K.F.McCoy. Efficient text summarization using lexical chains. In Proceedings
of Fifth International Conference on Intelligent User Interfaces, pages 252–255, New York, USA,
January 2000. ACM.
[89] A.Hulth.
Improved automatic keyword extraction given more linguistic knowledge.
In
Proceeding of Conference on Empirical Methods in natural language processing, pages 216–223,
Stroudsburg, PA, USA, 2003.
94
[90] R.Barzilay and M.Elhadad. Using lexical chains for text summarization. In Proceeding of the
ACL'97/EACL workshop on Intelligent Scalable Text Summarization, pages 10–17, Madrid, Spain,
July 1997.
[91] R.Angheluta, R.Busser, and M.Moens.
summarization.
The use of topic segmentation for automatic
In Proceedings of the workshop on automatic summarization, pages 66–70,
Philadelphia,PA, USA, 2002.
[92] Y.Uzun. Keyword extraction using naive bayes. 2005.
[93] K.Zhang, H.Xu, J.Tang, and J.Li. Keyword extraction using support vector machine. In
Proceedings of seventh International Conference on Web-Age Information Management, pages
85–96, Springer-Verlag, Berlin, Germany, June 2006.
[94] F.Liu, F.liu, and Y.Liu. Automatic keyword extraction for the meeting corpus using supervised
approach and bigram expansion. In IEEE Workshop on Spoken Language Technology, pages
181–184, NJ, USA, December 2008. IEEE.
[95] J.B.K. Humphreys. Phraserate: An html keyphrase extractor. Technical report, University of
California, Riverside, California,, 2002.
[96] J.J.Pollock and A.Zamora. Automatic abstracting research at chemical abstracts service. Journal
of Chemical Information and Computer Sciences, 15(4):226–232, 1975.
[97] R.Brandow, K.Mitze, and L.F.Rau.
sentence selection.
Automatic condensation of electronic publications by
Journal of Information Processing and Management, 31(5):675–685,
September 1995.
[98] C.Aone, M.E.Okurowski, and J.Gorlinsky. Trainable, scalable summarization using robust nlp
and machine learning. In Seventeeth International Conference on Computational Linguistics,
pages 62–66, 1998.
[99] D.R.Radev and K.R.McKeown. Generating natural language summaries from multiple on-line
sources.
Journal of Computer Linguistics : Special issue on natural language generation,
24(3):470–500, September 1998.
[100] E.Hovy and C.Y.Lin. Automated text summarization and the summarist system. In Proceedings
of a workshop held at Baltimore, pages 13–15, Baltimore, Maryland, October 1998.
[101] D.Marcu. Discourse trees are good indicators of importance in text. In Advances in Automatic
Text Summarization, pages 123–136. The MIT Press, 1999.
[102] R.Barzilay, K.R.McKeown, and M.Elhadad. Information fusion in the context of
multi-document summarization. In Proceedings of the thirty seventh Annual Meeting of the
Association for Computational Linguistics, pages 550–557. ACM, 1999.
[103] H.H.Chen and C.J.Lin. A multilingual news summarizer. In Proceedings of the Eighteenth
conference on Computational linguistics, pages 159–165. ACL, 2000.
95
[104] D.R.Radev.
Experiments in single and multidocument summarization using mead.
In
Proceedings of First Document Understanding Conference, New Orleans, LA, 2001.
[105] D.R.Radev, S.Blair-Goldensohn, Z.Zhang, and R.S.Raghavan.
NewsInEssence: A system for
domain-independent, real-time news clustering and multi-document summarization.
In
Proceedings of the First International Conference on Human Language Technology Research,
pages 1–4. ACM, 2001.
[106] D.R.Radev, S.Blair-goldensohn, and Z.Zhang.
WebInEssence: a personalized web-based
multi-document summarization and recommendation system. In North American Chapter of
the Association for Computational Linguistics: Human Language Technologies Workshop on
Automatic Summarization, pages 79–88, 2001.
[107] C.Y.Lin and E.Hovy. Automated multi-document summarization in neats. In Proceedings of
the second international conference on Human Language Technology Research, pages 59–62, San
Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc.
[108] K.R.McKeown, R.Barzilay, D.Evans, and V.Hatzivassiloglou. Tracking and summarizing news
on a daily basis with columbia’s newsblaster. In Proceedings of the Second International
Conference on Human Language Technology Research, pages 280–285, 2002.
[109] H.Daume III, A.Echihabi, D.Marcu, D.S.Munteanu, and R.Soricut. Gleans: A generator of
logical extracts and abstracts for nice summaries. In Proceedings of the Workshop on Automatic
Summarization, pages 9–14, Philadelphia, PA, 2002.
[110] S.H.Finely and S.M.Harabagiu. Generating single and multi-document summaries with
gistexter. In Proceedings of the Workshop on Automatic Summarization, pages 30–38,
Gaithersburg, MD, July 2002. NIST.
[111] H.Saggion and G.Lapalme. Generating indicative-informative summaries with sumum. Journal
of Computational Linguistics, 28(4):497–526, December 2002.
[112] H.Saggion, K. Bontcheva, and H. Cunningham. Robust generic and query-based summarization.
In Proceedings of the tenth Conference on European chapter of the Association for Computational
Linguistics, pages 235–238, 2003.
[113] M.Brunn, Y.Chali, and B.Dufour. The university of lethbridge text summarizer at duc 2003. In
Proceedings of the Text Summarization Workshop and 2003 Document Understanding Conference,
pages 148–152, 2003.
[114] T.Copeck and S.Szpakowicz. Picking phrases, picking sentences. In Proceedings of the Workshop
on Automatic Summarization, Human Language Technology/North American Chapter of the
Association for Computational Linguistics, 2003.
[115] G.Erkan and D.R.Radev. The university of michigan at duc 2004. In Proceedings of the Document
Understanding Conferences, pages 120–127, 2004.
[116] E.Alfonseca, A.Moreno-Sandoval, and J.M.Guirao. Description of the uam system for generating
very short summaries at duc 2004. In Proceedings of the Human Language Technology/
North American Chapter of the Association for Computational Linguistics Workshop on Automatic
Summarization/Document Understanding Conference. ACL, 2004.
[117] E.Filatova. Event-based extractive summarization. In Proceedings of the Association for
Computational Linguistics Workshop on Summarization, pages 104–111, 2004.
[118] C.Nobata and S.Sekine. Crl/nyu summarization system at duc-2004. In Proceedings of the
Document Understanding Conference (DUC), 2004.
[119] J.M.Conroy and J.D.Schlesinger. Classy query-based multi-document summarization. In
Proceedings of the Document Understanding Workshop at the Human Language Technology
Conference/Conference on Empirical Methods in Natural Language Processing, Boston, 2005.
[120] A.Farzindar, F.Rozon, and G.Lapalme. Cats: a topic-oriented multi-document summarization
system at duc 2005. In Proceedings of the Document Understanding Workshop, 2005.
[121] R.Witte, R.Krestel, and S.Bergler. Erss 2005: Coreference-based summarization reloaded.
In Proceedings of the Document Understanding Conference Workshop at the Human Language
Technology Conference/Conference on Empirical Methods in Natural Language Processing, pages
9–10, Vancouver, B.C., Canada, October 2005.
[122] Y.X.He, D.X.Liu, D.H.Ji, H.Yang, and C.Teng. Msbga: A multi-document summarization system
based on genetic algorithm. In Proceedings of the Fifth International Conference on Machine
Learning and Cybernetics, pages 2659–2664, August 2006.
[123] M.Fuentes, H.Rodríguez, and D.Ferrés. Femsum at duc 2007. In Proceedings of the Document
Understanding Conference at the Human Language Technology Conference of the North American
Chapter of the Association for Computational Linguistics, Rochester, NY, 2007.
[124] D.M.Dunlavy, D.P.O’Leary, J.M.Conroy, and J.D.Schlesinger. Qcs: A system for querying,
clustering and summarizing documents. Journal of Information Processing and Management,
43(6):1588–1605, 2007.
[125] F.Gotti, G.Lapalme, M.Quebec, L.Nerima, and E.Wehrli. Gofaisum: a symbolic summarizer for
duc. In Proceedings of Document Understanding Conference, 2007.
[126] K.M.Svore, L.Vanderwende, and C.J.C.Burges. Enhancing single-document summarization by
combining ranknet and third-party sources. In Proceedings of the Joint Conference on Empirical
Methods in Natural Language Processing and Computational Natural Language Learning, pages
448–457, 2007.
[127] F.Schilder and R.Kondadadi. Fastsum: Fast and accurate query-based multi-document
summarization. In Proceedings of the Forty-Sixth Annual Meeting of the Association for
Computational Linguistics on Human Language Technologies, pages 205–208. ACL, 2008.
[128] Y.Liu, X.Wang, J.Zhang, and H.Xu. Personalized pagerank based multi-document
summarization. In Proceedings of the IEEE International Workshop on Semantic Computing and
Systems, pages 169–173, Washington, DC, USA, 2008. IEEE.
[129] J.Zhang and X.Cheng. Adasum: An adaptive model for summarization. In Proceedings of the
Conference on Information and Knowledge Management, pages 26–30, California, USA, October
2008. ACM.
[130] E.Aramaki, Y.Miura, M.Tonoike, T.Ohkuma, H.Mashuichi, and K.Ohe. Text2table: medical
text summarization system based on named entity recognition and modality identification. In
Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing,
pages 185–192. ACL, 2009.
[131] S.Fisher, A.Dunlop, B.Roark, Y.Chen, and J.Burmeister. Ohsu summarization and entity linking
systems. In Proceedings of the Second Text Analysis Conference, Gaithersburg, November 2009.
NIST.
[132] B.Hachey. Multi-document summarisation using generic relation extraction. In Proceedings
of the Conference on Empirical Methods in Natural Language Processing, pages 420–429,
Stroudsburg, PA, USA, 2009. ACL.
[133] F.Wei, S.Liu, Y.Song, S.Pan, M.X.Zhou, W.Qian, L.Shi, L.Tan, and Q.Zhang. Tiara: a visual
exploratory text analytic system. In Proceedings of the Sixteenth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pages 153–162, New York, NY, USA, 2010.
ACM.
[134] A.Dong, Y.Chang, Z.Zheng, G.Mishne, J.Bai, R.Zhang, K.Buchner, C.Liao, and F.Diaz. Towards
recency ranking in web search. In Proceedings of the Third ACM international conference on Web
search and data mining, pages 11–20, 2010.
[135] L.Shi, F.Wei, S.Liu, L.Tan, X.Lian, and M.X.Zhou. Understanding text corpora with multiple
facets. In IEEE Symposium on Visual Analytics Science and Technology, pages 99–106, 2010.
[136] R.M.Alguliev, R.M.Aliguliyev, M.S.Hajirahimova, and C.A.Mehdiyev. Mcmr: Maximum
coverage and minimum redundant text summarization model. Journal on Expert Systems with
Applications, 38(12):14514–14522, 2011.
[137] D.Archambault, D.Greene, P.Cunningham, and N.J.Hurley. Themecrowds: multiresolution
summaries of twitter usage. In Proceedings of the 3rd international workshop on Search and
mining user-generated contents, pages 77–84, New York, NY, USA, 2011. ACM.
[138] J.P.Ng, P.Bysani, Z.Lin, M.Y.Kan, and C.L.Tan. Exploiting category-specific information for
multi-document summarization. In Proceedings of the Text Analysis Conference, pages 2093–2108,
Gaithersburg, Maryland, USA, November 2011.
[139] P.E.Genest and G.Lapalme. Framework for abstractive summarization using text-to-text
generation. In Proceedings of the Workshop on Monolingual Text-To-Text Generation. Association
for Computational Linguistics, 2011.
[140] D.Tsarev, M.Petrovskiy, and I.Mashechkin. Using nmf-based text summarization to improve
supervised and unsupervised classification. In Eleventh International Conference on Hybrid
Intelligent Systems, pages 185–189. IEEE, December 2011.
[141] G.de Melo and G.Weikum. Uwn: A large multilingual lexical knowledge base. In Proceedings of
the Association for Computational Linguistics 2012 System Demonstrations, pages 151–156. ACL,
2012.
[142] S.A.Mirroshandel and G.Ghassem-Sani. Towards unsupervised learning of temporal relations
between events. Journal of Artificial Intelligence Research, 45(1):125–163, September 2012.