A Novel Approach of Mining Write-Prints for Authorship Attribution in E

digital investigation 5 (2008) S42–S51
available at www.sciencedirect.com
journal homepage: www.elsevier.com/locate/diin
A novel approach of mining write-prints for authorship
attribution in e-mail forensics
Farkhund Iqbal*, Rachid Hadjidj, Benjamin C.M. Fung, Mourad Debbabi
Concordia Institute for Information Systems Engineering, Faculty of Engineering and Computer Science, Concordia University, Montreal,
Quebec, Canada H3G 1M8
Abstract
There is an alarming increase in the number of cybercrime incidents through anonymous e-mails. The problem of e-mail authorship attribution is to identify the most plausible author of an anonymous e-mail from a group of potential suspects. Most previous contributions employed a traditional classification approach, such as decision tree and Support Vector Machine (SVM), to identify the author and studied the effects of different writing style features on the classification accuracy. However, little attention has been given to ensuring the quality of the evidence. In this paper, we introduce an innovative data mining method to capture the write-print of every suspect and model it as combinations of features that occur frequently in the suspect’s e-mails. This notion is called frequent pattern, which has proven to be effective in many data mining applications, but this is the first time it has been applied to the problem of authorship attribution. Unlike the traditional approach, the write-print extracted by our method is unique among the suspects and, therefore, provides convincing and credible evidence for presentation in a court of law. Experiments on real-life e-mails suggest that the proposed method can effectively identify the author and that the results are supported by strong evidence.
Keywords: E-mail forensic analysis; Authorship identification; Data mining; Write-print; Frequent itemsets
© 2008 Digital Forensic Research Workshop. Published by Elsevier Ltd. All rights reserved.
1. Introduction
E-mail is one of the most widely used means of written communication over the Internet, and its traffic has increased exponentially with the advent of the World Wide Web. Trillions of business letters, financial transactions, governmental orders and friendly messages are exchanged through e-mail systems each year. The increase in e-mail traffic also comes with an increase in the use of e-mails for illegitimate purposes (Teng et al., 2004). Phishing, spamming, e-mail bombing, threatening, cyber bullying, racial vilification, child pornography, and sexual harassment are common examples of e-mail abuse. Terrorist groups and criminal gangs are using e-mail systems as a safe channel for their communication. E-mail is also abused for committing infrastructure crimes by transmitting worms, viruses, Trojan horses, hoaxes and other malicious executables over the Internet. In many misuse cases, the criminals attempt to hide their true identity. For example, in phishing, a person may try to impersonate a manager or a financial adviser to obtain clients’ secret information.
E-mail systems are inherently vulnerable to misuse for three main reasons. First, an e-mail can be spoofed, and the metadata contained in its header about the sender and the path along which the message has travelled can be forged or anonymized. An e-mail can be routed through anonymous e-mail servers to hide the information about its origin. Second, e-mail systems are capable of transporting executables, hyperlinks, Trojan horses, and scripts. Third, the Internet, including e-mail services, is accessible through public places, such as net cafes and libraries, which further aggravates the anonymity problem. Presently, there is no adequate proactive mechanism to prevent e-mail misuse, and merely installing filters and firewalls is insufficient. In this situation, forensic e-mail analysis with a special focus on authorship attribution can help prosecute the offenders of e-mail misuse by means of law (Teng et al., 2004).
* Corresponding author.
E-mail addresses: iqbal_f@encs.concordia.ca (F. Iqbal), hadjidj@encs.concordia.ca (R. Hadjidj), fung@ciise.concordia.ca (B.C.M. Fung), debbabi@ciise.concordia.ca (M. Debbabi).
1742-2876/$ – see front matter © 2008 Digital Forensic Research Workshop. Published by Elsevier Ltd. All rights reserved.
doi:10.1016/j.diin.2008.05.001
The problem of authorship attribution in the context of e-mail forensics can be described as follows: a cyber forensic investigator wants to determine the author of a given malicious e-mail m and has established that the author is likely to be one of the suspects {S1, ..., Sn}. The problem is to identify the most plausible author from the suspects {S1, ..., Sn} and to gather convincing evidence to support the finding in a court of law. In forensic science, an individual can be uniquely identified by his/her fingerprint. Similarly, in cyber forensics, an investigator would like to identify the “write-print” of an individual from his/her e-mails and use it for authorship attribution. The key question is:
What exactly are the patterns that can represent the write-print of an individual?
Our insight is that the write-print of an individual is the
combinations of features that occur frequently in his/her written
e-mails. The commonly used features are lexical, syntactical,
structural and content-specific attributes (see Section 2.1). By
matching the write-print with the malicious e-mail, the true
author can be identified. Most importantly, the matched
write-print should provide credible evidence for supporting
the conclusion. The research community (De Vel, 2000; Teng et al., 2004; Zheng et al., 2006) has devoted considerable effort to studying stylistic and structural features individually, but very few studies have examined the combinations of features that form a write-print or addressed the issue of evidence gathering.
The classification models employed in previous contributions on authorship attribution fall into two broad categories: decision tree (C4.5) (Quinlan, 1986) and Support Vector Machine (SVM) (Cristianini and Shawe-Taylor, 2000). While building a decision tree, a decision node is constructed by simply considering the local information of one attribute; therefore, it fails to capture the combined effect of several features. In contrast, SVM avoids this problem by considering all features when a hyperplane is created. However, SVM is like a black-box function which takes some input (the malicious e-mail) and provides an output (the author). It fails to provide an intuitive explanation of how it arrives at a certain conclusion. Therefore, SVM may not be the best choice in the context of e-mail forensic investigation, where collecting credible evidence is one of the primary objectives.
In this paper, we are introducing a novel approach of authorship attribution in which the unique write-print of every
suspect is extracted. These write-prints are used to identify
the true author of a disputed e-mail, and to gather convincing
and credible evidence to support the finding. To concisely
model the write-print of an individual, we borrow the concept
of frequent pattern (a.k.a. frequent itemset) (Agrawal et al., 1993)
from data mining to capture the combinations of features that
frequently occurred in an individual’s e-mails. Frequent pattern mining has been proven to be a very successful data mining technique for finding hidden patterns in DNA sequences,
customer purchasing habits, security intrusions, and many
other applications of pattern recognition. To the best of our
knowledge, this is the first paper introducing the concept of
frequent pattern to the problem of authorship attribution.
Fig. 1 depicts an overview of our proposed method. We first
extract the set of frequent patterns independently from the
e-mails Ei written by suspect Si. Though the set of frequent
patterns captures the writing style of a suspect Si, it is inappropriate to use all the frequent patterns to form the write-print of a suspect Si because another suspect, say Sj, may share some common writing patterns with Si. Therefore, it is crucial to filter out the common frequent patterns and identify the unique patterns that can differentiate the writing style of a suspect from that of others. These unique patterns form the write-print of a suspect. This approach has the following merits that are not found in most of the existing works.
Justifiable evidence: the write-print, represented as a set of
unique patterns, is extracted from the e-mails of a particular
suspect. Our method guarantees that the identified patterns
are frequent in the e-mails of one suspect only, and are not
frequent in others’ e-mails. It will be difficult for the accused
suspect to deny the validity of the findings. The results
obtained are traceable, justifiable, and can be presented
quantitatively with statistical support.
Flexible writing styles: the frequent pattern mining technique
can adopt all four types of commonly used writing style features (described in Section 2.1). This flexibility is important
for determining the combined effect of different features.
This is much more flexible than the traditional decision
tree, which primarily relies on the nodes at the top of the
tree to differentiate the writing styles of all suspects.
Fig. 1 – Mining write-print WP(E2) of S2. [Diagram: the e-mails E1, E2 and E3 of suspects S1, S2 and S3 are each mined for frequent patterns FP(E1), FP(E2) and FP(E3); filtering FP(E2) against FP(E1) and FP(E3) yields the write-print WP(E2) of S2.]
Features optimization: unlike the traditional approaches, where it is hard to determine the contribution of each feature in the authorship attribution process (De Vel et al., 2001a), the proposed technique is based on distinctive patterns, which are combinations of features. The support associated with each pattern in the write-print set determines the contribution of each pattern.
Generic application: the datasets used in most of the existing techniques are constrained by the number, size and topic of e-mails. Our experiments on real-life data, the Enron e-mail corpus, suggest that the proposed approach is very robust to these factors. This is crucial for application in real-world investigations.
The rest of the paper is organized as follows. Section 2 reviews the previous contributions. Section 3 formally defines the problem and the notion of write-print. Section 4 describes our proposed approach. Section 5 evaluates our proposed method on a real-life e-mail dataset. Section 6 concludes the paper.
2. Related work
Most previous contributions on authorship attribution are applications of text classification analysis (De Vel, 2000). The
process starts by identifying a set of writing style features of
a person that are relatively common in most of his works. A
classifier is trained on the collected writing style features to
build a model, which is then used to classify the disputed
e-mail to the most plausible author among the suspects. In
this section, we review the commonly employed writing style features and summarize the techniques of e-mail authorship attribution found in the literature.
2.1. Writing style features
There is no predefined set of features that can be used to differentiate the writing styles of different suspects. The writing patterns usually contain the characteristics of word usage, word sequence, composition and layout, common spelling and grammatical mistakes, vocabulary richness, hyphenation and punctuation. Abbasi and Chen (2008) presented a comprehensive analysis of stylistic features. Below, we provide a summary of the common writing style features, namely, lexical, syntactical, structural and content-specific attributes.
Lexical features are the characteristics of both characters and words or tokens. In terms of characters, for instance, frequency of letters, frequency of capital letters, total number of characters per token and character count per sentence are the most relevant metrics. Word-based lexical features may include word length distribution, words per sentence, and vocabulary richness. Initially, researchers thought that vocabulary richness (Yule, 1938, 1944) and word usage (Holmes, 1998) were discriminating features to be used for authorship attribution. Syntactic features, which include the distribution of function words (such as “upon”, “thus”, “above”) and punctuation, play a significant role in authorship attribution (Burrows, 1989; Holmes and Forsyth, 1995; Tweedie and Baayen, 1998). Structural features are used to measure the overall layout and organization of text within documents. For instance, average paragraph length, number of paragraphs per document, and the presence of greetings and their position within the e-mail are common structural features. Moreover, the presence of a sender signature, including contact information, is one of the special structural features of e-mail documents. Content-specific features are collections of certain keywords commonly found in a specific domain and may vary from context to context even for the same author. Zheng et al. (2006) used 11 keywords from the cybercrime taxonomy in authorship analysis experiments.
2.2. E-mail authorship analysis
Authorship analysis has been very successful in resolving authorship attribution disputes over literary and conventional writings (Mendenhall, 1887). However, e-mail authorship attribution poses some special challenges due to its special characteristics of size and composition, as compared to literary works (De Vel et al., 2001a). Literary documents are usually large in size, comprising (at least) several paragraphs, and have a definite syntactic and semantic structure. In contrast, e-mails are short and usually do not follow definite syntactic or grammatical rules; therefore, it is hard to learn the writing patterns of their author from them. Ledger and Merriam (1994), for instance, established that authorship analysis results would not be significant for texts containing fewer than 500 words. Moreover, e-mails are more interactive and informal in style, and people are not conscious of spelling and grammatical mistakes, particularly in informal e-mails. Therefore, techniques which are very successful in literary and traditional works are not applicable to e-mail authorship attribution.
Teng et al. (2004) and De Vel (2000) applied the Support Vector Machine (SVM) classification model over a set of stylistic and structural features for e-mail authorship attribution. De Vel et al. (2001b) and Corney et al. (2002) performed extensive experiments and found that the classification accuracy decreases when the size of the training set decreases, the number of authors increases, or the length of documents decreases. Recently, Zheng et al. (2003, 2006) used a comprehensive set of lexical, syntactical and structural features including 10–11 content-specific keywords. Haltern (2007) used the same set of linguistic features for authorship attribution of class essays. De Vel (2000) further found that by increasing the number of function words from 122 to 320, the performance of SVM worsened, which weakens the argument that SVM supports high dimensionality. This result also illustrates that adding more features does not necessarily improve the accuracy. In contrast, in this paper we focus on identifying the combinations of key features that can differentiate the writing styles of different suspects and filtering out the useless features that do not contribute towards the goal of authorship attribution.
In the current literature, each of the four types of feature sets is applied independently of the others, although applying them together may produce different results (De Vel, 2000). For instance, word usage and composition style may vary from one structural pattern to another. In our approach, the write-prints could be a combination of all four types of writing style features. Moreover, the current literature of authorship attribution suffers from the problem of having too many features. It is difficult to determine the set of features that should be used for a given set of e-mails. Previous research (De Vel, 2000) has shown that adding useless features may decrease the accuracy because a classifier may capture the useless features as noise. Using those noisy features for classification diminishes the justification of evidence for supporting the finding. Some studies identify particular useful style markers, but the identified style markers are data dependent and may not be applicable to other data sets. Our approach overcomes this limitation by flexibly extracting the evidence (combinations of frequently occurring features) from the data itself, provided the insignificant noisy features are filtered out.
3. Problem statement
Let {S1, ..., Sn} be the set of suspected authors of a malicious e-mail m. We assume that there is a collection of e-mails, denoted by Ei, for each suspect Si ∈ {S1, ..., Sn}. The problem of authorship attribution is to identify the most plausible author Sa from the suspects {S1, ..., Sn}, whose collection of e-mails Ea has the “best match” with the patterns in the malicious e-mail m. Intuitively, a collection of e-mails Ei matches m if Ei and m share similar patterns of vocabulary usage, structural and/or stylometric features. The primary objective of a cyber forensic investigator is to precisely extract the patterns of each suspect, so he/she can use such patterns to identify the author of the malicious e-mail m and present such patterns as evidence to support his/her findings.
What are the patterns that can represent the “write-print” of a suspect Si? Specifically, we want to extract the patterns that uniquely represent the writing style of a suspect Si, but do not represent the writing style of any other suspect Sj, where i ≠ j. In the rest of this section, we discuss the pre-processing of features and formally define the notions of frequent pattern and write-print.
3.1. Pre-processing
Let Ei be a collection of e-mails written by suspect Si ∈ {S1, ..., Sn}. First, we extract the features from each e-mail in Ei. In the rest of this section, the term “feature” refers to either a stylometric feature described in Section 2.1 or a word appearing in the e-mails. The spaces, punctuation, special characters and blank lines are removed. Next, we discretize each normalized feature frequency into a set of intervals, for example, [0, 0.25], (0.25, 0.5], (0.5, 0.75], (0.75, 1]. Each interval is called a feature item. The normalized feature frequency is then matched with these intervals: we assign value 1 to a feature item if its interval contains the normalized feature frequency; otherwise, we assign value 0. This simplifies the procedure, which then only needs to determine the presence or absence of a pattern. Common discretization techniques are:
Equal-width discretization, where the size of each interval is
the same.
Equal-frequency discretization, where each interval has approximately the same number of records assigned to it.
Clustering-based discretization, where clustering is performed
on the distance of neighboring points.
Example 3.1. Consider Table 1, which contains 10 e-mails. We extracted three features {A, B, C} from the 10 e-mails. We first discretize each feature into feature items. For example, a stylometric feature A having a normalized range of [0, 1] can be discretized into four intervals A1 = [0, 0.25], A2 = (0.25, 0.5], A3 = (0.5, 0.75], A4 = (0.75, 1], representing four feature items. Similarly, features B and C are discretized into B1 = [0, 0.5], B2 = (0.5, 1], C1 = [0, 0.5], and C2 = (0.5, 1]. An e-mail ε1 having features A = 0.3, B = 0.25, and C = 0.25 can be represented as feature vector ⟨0, 1, 0, 0, 1, 0, 1, 0⟩.
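The discretization in Example 3.1 can be sketched in a few lines of Python. This is a minimal illustration, assuming equal-width intervals over a normalized range of [0, 1]; the function names (`equal_width_bins`, `to_feature_items`, `feature_vector`) are hypothetical and do not come from the paper.

```python
def equal_width_bins(k):
    """Return k equal-width (low, high) intervals covering [0, 1]."""
    return [(i / k, (i + 1) / k) for i in range(k)]


def to_feature_items(value, bins):
    """Encode a normalized value as 0/1 flags, one flag per interval.

    The first interval is closed, [low, high]; the others are
    half-open, (low, high], as in Example 3.1.
    """
    flags = []
    for idx, (low, high) in enumerate(bins):
        if idx == 0:
            inside = low <= value <= high
        else:
            inside = low < value <= high
        flags.append(1 if inside else 0)
    return flags


def feature_vector(values, bins_per_feature):
    """Concatenate the feature-item flags of every feature, in order."""
    vector = []
    for name, value in values.items():
        vector.extend(to_feature_items(value, bins_per_feature[name]))
    return vector


# Feature A has four intervals; features B and C have two each.
bins_per_feature = {
    "A": equal_width_bins(4),
    "B": equal_width_bins(2),
    "C": equal_width_bins(2),
}
email_1 = {"A": 0.3, "B": 0.25, "C": 0.25}
vector_1 = feature_vector(email_1, bins_per_feature)
print(vector_1)  # [0, 1, 0, 0, 1, 0, 1, 0]
```

Equal-frequency or clustering-based discretization would only change how the interval boundaries are chosen; the 0/1 encoding step stays the same.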
3.2. Frequent pattern
Intuitively, the “writing pattern” or the “writing style” in an ensemble of e-mails Ei (written by suspect Si) is a combination of feature items that frequently occurs in Ei. We concisely model and capture such frequently occurring patterns by the concept of frequent itemset (Agrawal et al., 1993) described as follows.
Let U = {f1, ..., fm} denote the universe of all feature items. Let Ei be a set of e-mails where each e-mail ε is represented as a set of feature items such that ε ⊆ U. An e-mail ε contains a feature item fi if the numerical feature value of the e-mail ε falls within the interval of fi. For example, e-mail ε1 in Table 1 can be represented as a set of feature items ε1 = {A2, B1, C1}. Table 2 shows the 10 e-mails from Table 1 in this format.
Let F ⊆ U be a set of feature items called a pattern. An e-mail ε contains a pattern F if F ⊆ ε. A pattern that contains k feature items is a k-pattern. For example, the pattern F = {f1, f4, f6} is a 3-pattern. The support of a pattern F is the percentage of e-mails in Ei that contain F. A pattern F is a frequent pattern in a set of e-mails Ei if the support of F is greater than or equal to some user-specified minimum support threshold.
Definition 3.1 (Frequent pattern). Let Ei be the set of e-mails written by suspect Si. Let support(F | Ei) be the percentage of e-mails in Ei that contain the pattern F, where F ⊆ U. A pattern F is a frequent pattern in Ei if support(F | Ei) ≥ min_sup, where the minimum support threshold min_sup is a real number in the interval [0, 1].
Table 1 – Feature vectors
E-mail   Feature A          Feature B   Feature C
         A1  A2  A3  A4     B1  B2      C1  C2
ε1       0   1   0   0      1   0       1   0
ε2       0   1   0   0      1   0       1   0
ε3       0   1   0   0      1   0       1   0
ε4       1   0   0   0      1   0       1   0
ε5       0   0   0   1      1   0       1   0
ε6       0   0   1   0      0   1       0   1
ε7       0   0   0   1      1   0       0   1
ε8       0   0   1   0      0   1       0   1
ε9       0   1   0   0      1   0       0   1
ε10      1   0   0   0      1   0       0   1
Table 2 – Feature items
E-mail
ε1 = {A2, B1, C1}
ε2 = {A2, B1, C1}
ε3 = {A2, B1, C1}
ε4 = {A1, B1, C1}
ε5 = {A4, B1, C1}
ε6 = {A3, B2, C2}
ε7 = {A4, B1, C2}
ε8 = {A3, B2, C2}
ε9 = {A2, B1, C2}
ε10 = {A1, B1, C2}
The writing pattern of a suspect Si is represented as a set of frequent patterns, denoted by FP(Ei) = {F1, ..., Fk}, extracted from his/her e-mails Ei.
Example 3.2. Consider Table 2. Suppose the user-specified threshold min_sup = 0.3, which means that a pattern F = {f1, ..., fk} is frequent if at least 3 out of the 10 e-mails contain all feature items in F. {A1} is not a frequent pattern because it has support 2/10 = 0.2. {A2} is a frequent 1-pattern because it has support 0.4. {A2, B1} is a frequent 2-pattern because it has support 0.4. {A2, B1, C1} is a frequent 3-pattern because it has support 0.3. Example 4.1 will show how to efficiently compute all frequent patterns.
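The support computation of Definition 3.1 and the checks in Example 3.2 can be reproduced with a short sketch. The e-mails below are those of Table 2; the names `EMAILS`, `support` and `is_frequent` are illustrative choices, not part of the paper.

```python
# The 10 e-mails of Table 2, each represented as a set of feature items.
EMAILS = [
    {"A2", "B1", "C1"},
    {"A2", "B1", "C1"},
    {"A2", "B1", "C1"},
    {"A1", "B1", "C1"},
    {"A4", "B1", "C1"},
    {"A3", "B2", "C2"},
    {"A4", "B1", "C2"},
    {"A3", "B2", "C2"},
    {"A2", "B1", "C2"},
    {"A1", "B1", "C2"},
]


def support(pattern, emails):
    """Fraction of e-mails that contain every feature item of the pattern."""
    pattern = set(pattern)
    return sum(1 for email in emails if pattern <= email) / len(emails)


def is_frequent(pattern, emails, min_sup):
    """A pattern is frequent if its support reaches the threshold."""
    return support(pattern, emails) >= min_sup


for pattern in [{"A1"}, {"A2"}, {"A2", "B1"}, {"A2", "B1", "C1"}]:
    print(sorted(pattern), support(pattern, EMAILS),
          is_frequent(pattern, EMAILS, min_sup=0.3))
```

With min_sup = 0.3, only {A1} is rejected, which matches the example.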
3.3. Write-print
In forensic science, an individual can be uniquely identified by his/her fingerprint. In cyber forensics, can we identify the “write-print” of an individual from his/her e-mails? We do not claim that the identified write-print in this paper can uniquely distinguish every individual in the world, but the identified write-print is accurate enough to uniquely identify the writing pattern of an individual among the suspects {S1, ..., Sn} because common patterns among the suspects are filtered out and will not become part of the write-print.
The notion of frequent patterns in Definition 3.1 captures the writing pattern of a suspect. However, two suspects Si and Sj may share some similar writing patterns. Therefore, it is important to filter out the common frequent patterns and retain the frequent patterns that are unique to each suspect. This leads us to the notion of write-print.
Intuitively, a write-print can uniquely represent the writing style of a suspect Si if its pattern is found only in the e-mails written by Si, but not in any other suspect’s e-mails. In other words, the write-print of a suspect Si is a pattern F that is frequent in the e-mails Ei written by Si but not frequent in the e-mails Ej written by any other suspect Sj where i ≠ j.
Definition 3.2 (Write-print). A write-print, denoted by WP(Ei), is a set of patterns where each pattern F has support(F | Ei) ≥ min_sup and support(F | Ej) < min_sup for any Ej where i ≠ j, and min_sup is a user-specified minimum threshold. In other words, WP(Ei) ⊆ FP(Ei), and WP(Ei) ∩ WP(Ej) = ∅ for any 1 ≤ i, j ≤ n and i ≠ j.
Discussion: our notion of write-print has two special properties that make it different from the traditional notion of write-print in the literature.
First, the combination of feature items that composes the write-print of a suspect is dynamically generated based on the embedded pattern in the e-mails. This flexibility allows us to succinctly model the write-print of different suspects by using different combinations of feature items. In contrast, the traditional notion of write-print considers one feature at a time without considering the combinations.
Second, every frequent pattern F in our notion of write-print captures a piece of writing pattern that can be found only in one suspect’s e-mails, but not in any other suspects’ e-mails. The cyber forensic investigator could precisely point out such matched patterns in the malicious e-mail to support his/her conclusion of authorship identification. In contrast, the traditional classifier, e.g., decision tree, attempts to use the same set of features to capture the write-print of different suspects. It is quite possible that the classifier would capture some common writing patterns and the investigator could unintentionally use those common patterns to draw the wrong conclusion of authorship. Our notion of write-print avoids this problem and, therefore, provides more convincing and reliable evidence.
3.4. Refined problem statement
The problem of authorship attribution can be refined into three subproblems: (1) to identify the write-print WP(Ei) from each set of e-mails Ei ∈ {E1, ..., En}; (2) to determine the author of the malicious e-mail m by matching m with each of {WP(E1), ..., WP(En)}; (3) to extract evidence for supporting the conclusion on authorship. The evidence has to be intuitive enough to convince the judge and the jury in a court of law. These three subproblems summarize the challenges in a typical investigation procedure.
To solve subproblems (1) and (2), we can first extract the set of frequent patterns FP(Ei) from Ei and then filter out the common frequent patterns that also appear in any other set of e-mails Ej. For subproblem (3), the write-print WP(Ea) could serve as the evidence for supporting the conclusion, where Ea is the set of e-mails written by the identified author Sa.
4. Our method
Algorithm 1 presents a novel data mining method, called AuthorMiner, for determining the authorship of a malicious e-mail m from a group of suspects {S1, ..., Sn} based on the extracted features of their previously written e-mails {E1, ..., En}. In this section, an e-mail is represented by a set of feature items. Below, we summarize the algorithm in three phases. Sections 4.1–4.3 discuss each phase in detail.
Algorithm 1: AuthorMiner
Require: The malicious e-mail m.
Require: A set of e-mail collections {E1, ..., En}, written by {S1, ..., Sn}.
/* Mining frequent patterns */
1: for each Ei ∈ {E1, ..., En} do
2:   extract frequent patterns FP(Ei) from Ei;
3: end for
/* Filtering out common frequent patterns */
4: for each FP(Ei) ∈ {FP(E1), ..., FP(En)} do
5:   for each FP(Ej) ∈ {FP(Ei+1), ..., FP(En)} do
6:     for each frequent pattern Fx ∈ FP(Ei) do
7:       for each frequent pattern Fy ∈ FP(Ej) do
8:         if Fx == Fy then
9:           FP(Ei) = FP(Ei) − Fx;
10:          FP(Ej) = FP(Ej) − Fy;
11:        end if
12:      end for
13:    end for
14:  end for
15:  WP(Ei) = FP(Ei);
16: end for
/* Identifying author */
17: highest_score = −1.0;
18: for all WP(Ei) ∈ {WP(E1), ..., WP(En)} do
19:   if Score(m ≈ WP(Ei)) > highest_score then
20:     highest_score = Score(m ≈ WP(Ei));
21:     author = Si;
22:   end if
23: end for
24: return author;
Phase 1: Mining frequent patterns (lines 1–3). Extract the frequent patterns FP(Ei) from each collection of e-mails Ei written by suspect Si. The extracted frequent patterns capture the writing pattern of a suspect.
Phase 2: Filtering common frequent patterns (lines 4–16). Though FP(Ei) has captured the writing patterns of suspect Si, FP(Ei) may contain frequent patterns that are common to other suspects. Therefore, Phase 2 removes the common frequent patterns. Specifically, a frequent pattern F in FP(Ei) is removed if any other FP(Ej) also contains F, where i ≠ j. The remaining frequent patterns in FP(Ei) form the write-print WP(Ei) of suspect Si. When this phase completes, we have a set of write-prints {WP(E1), ..., WP(En)} of suspects {S1, ..., Sn}. Fig. 1 illustrates that the write-print WP(E2) comes from FP(E2) after the common patterns are filtered out by comparing with FP(E1) and FP(E3).
Phase 3: Identifying author (lines 17–24). Compare the malicious e-mail m with each write-print WP(Ei) ∈ {WP(E1), ..., WP(En)} and identify the most similar write-print that matches m. Intuitively, a write-print WP(Ei) is similar to the e-mail m if many frequent patterns in WP(Ei) can be found in m. Our insight is that the frequent patterns are not equally important. Their importance is reflected by their support(F | Ei); therefore, we derive a score function Score(m ≈ WP(Ei)) to measure the weighted similarity between the e-mail m and the frequent patterns in WP(Ei). The suspect Sa of write-print WP(Ea), which has the highest Score(m ≈ WP(Ei)), is classified as the author of the malicious e-mail m.
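Phase 3 can be illustrated with the following sketch. The exact score function is not given in this part of the paper, so the `score` function below is an assumption: it sums the supports of the write-print patterns contained in m and divides by the number of patterns in the write-print. The supports listed for S2 follow from Table 2; those for S1 and S3 are hypothetical values chosen only for illustration.

```python
def score(email_items, write_print):
    """Assumed weighted similarity between an e-mail and a write-print.

    write_print maps each pattern (a frozenset of feature items) to its
    support in the suspect's own e-mails. The normalization by the size
    of the write-print is an assumption, not the paper's definition.
    """
    if not write_print:
        return 0.0
    matched = sum(sup for pattern, sup in write_print.items()
                  if pattern <= email_items)
    return matched / len(write_print)


def identify_author(email_items, write_prints):
    """Return the suspect with the highest score (lines 17-24 of Algorithm 1)."""
    highest_score, author = -1.0, None
    for suspect, write_print in write_prints.items():
        current = score(email_items, write_print)
        if current > highest_score:
            highest_score, author = current, suspect
    return author, highest_score


fs = frozenset
WRITE_PRINTS = {
    # Hypothetical supports for S1 and S3.
    "S1": {fs({"A1", "B1"}): 0.4, fs({"A1", "B1", "C2"}): 0.3},
    # Supports of S2's patterns as computed from Table 2.
    "S2": {fs({"A2"}): 0.4, fs({"C1"}): 0.5, fs({"A2", "B1"}): 0.4,
           fs({"A2", "C1"}): 0.3, fs({"B1", "C1"}): 0.5,
           fs({"A2", "B1", "C1"}): 0.3},
    "S3": {fs({"B3"}): 0.5, fs({"A1", "B3"}): 0.4,
           fs({"B3", "C2"}): 0.4, fs({"A1", "B3", "C2"}): 0.3},
}

malicious_email = {"A2", "B1", "C1"}
author, best = identify_author(malicious_email, WRITE_PRINTS)
print(author, round(best, 3))
```

Because each matched pattern is reported together with its support, the investigator can list exactly which patterns in m led to the conclusion.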
4.1. Mining patterns (lines 1–3)
Lines 1–3 mine the frequent patterns FP(Ei) from each collection of e-mails Ei ∈ {E1, ..., En}, for 1 ≤ i ≤ n. There are many data mining algorithms for extracting frequent patterns, for example, Apriori (Agrawal et al., 1993), FP-growth (Han and Pei, 2000), and ECLAT (Zaki, 2000). Below, we provide an overview of the Apriori algorithm, which has been previously applied to various text mining tasks (Fung et al., 2003; Holt and Chung, 1999).
Apriori is a level-wise iterative search algorithm that uses frequent k-patterns to explore the frequent (k + 1)-patterns. First, the set of frequent 1-patterns is found by scanning the e-mails Ei, accumulating the support count of each feature item, and collecting every feature item f that has support({f} | Ei) ≥ min_sup. The resulting frequent 1-patterns are then used to find frequent 2-patterns, which are then used to find frequent 3-patterns, and so on, until no more frequent k-patterns can be found. The generation of frequent (k + 1)-patterns from frequent k-patterns is based on the following Apriori property.
Property 4.1 (Apriori property). All nonempty subsets of a frequent pattern must also be frequent.
By definition, a pattern F′ is not frequent if support(F′ | Ei) < min_sup. The above property implies that adding a feature item f to a non-frequent pattern F′ will never make it more frequent. Thus, if a k-pattern F′ is not frequent, then there is no need to generate the (k + 1)-pattern F′ ∪ {f} because F′ ∪ {f} is also not frequent. The following example shows how the Apriori algorithm exploits this property to efficiently extract all frequent patterns. Refer to Agrawal et al. (1993) for a formal description.
Example 4.1. Consider Table 2 with min_sup = 0.3. First, identify all frequent 1-patterns by scanning the database once to
obtain the support of every item. The items having
support ≥ 0.3 are frequent 1-patterns, denoted by L1 = {{A2},
{B1}, {C1}, {C2}}. Then, join L1 with itself, i.e. L1 ⋈ L1, to generate
the candidate set C2 = {{A2, B1}, {A2, C1}, {A2, C2}, {B1, C1}, {B1,
C2}, {C1, C2}} and scan the database once to obtain the support
of every pattern in C2. Identify the frequent 2-patterns,
denoted by L2 = {{A2, B1}, {A2, C1}, {B1, C1}, {B1, C2}}. Similarly,
perform L2 ⋈ L2 to generate C3, and scan the database once to identify the frequent 3-pattern, which is L3 = {{A2, B1, C1}}. Finding each set of frequent k-patterns requires one full scan of
the e-mail feature items in Table 2.
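The level-wise search of Example 4.1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. The function name `apriori` and the ten e-mails in `EMAILS` are hypothetical: Table 2 is not reproduced here, so the toy data was simply chosen so that it yields the same L1, L2 and L3 as the example at min_sup = 0.3.

```python
from itertools import combinations

def apriori(emails, min_sup):
    """Return {pattern: support} for every frequent pattern in `emails`.

    emails  -- list of sets of feature items (one set per e-mail)
    min_sup -- minimum support as a fraction in [0, 1]
    """
    n = len(emails)

    def support(pattern):
        # Fraction of e-mails that contain every item of the pattern.
        return sum(1 for e in emails if pattern <= e) / n

    def keep_frequent(candidates):
        # One pass over the e-mails per level, then the min_sup filter.
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_sup}

    # Level 1: frequent single feature items.
    items = {item for e in emails for item in e}
    level = keep_frequent({frozenset([item]) for item in items})
    frequent = dict(level)
    k = 1
    while level:
        # Join step: merge frequent k-patterns into (k + 1)-candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step (Apriori property): every k-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        level = keep_frequent(candidates)
        frequent.update(level)
        k += 1
    return frequent

# Hypothetical feature items of ten e-mails (NOT the paper's Table 2).
EMAILS = [
    {"A2", "B1", "C1"}, {"A2", "B1", "C1"}, {"A2", "B1", "C1"},
    {"A1", "B1", "C2"}, {"A1", "B1", "C2"}, {"A3", "B1", "C2"},
    {"A3", "B2", "C2"}, {"A2", "B2", "C3"}, {"A4", "B3", "C3"},
    {"A4", "B3", "C1"},
]

frequent_patterns = apriori(EMAILS, min_sup=0.3)
for pattern in sorted(frequent_patterns, key=lambda p: (len(p), sorted(p))):
    print(sorted(pattern), frequent_patterns[pattern])
```

The prune step is where the Apriori property saves work: a candidate such as {A2, B1, C2} is discarded without a database scan because its subset {A2, C2} is not frequent.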
4.2. Filtering common patterns (lines 4–16)
This phase filters out the common frequent patterns among
{FP(E1), ..., FP(En)}. Lines 4–16 in Algorithm 1 describe the filtering procedure. The general idea is to compare every frequent
pattern Fx in FP(Ei) with every frequent pattern Fy in all other
FP(Ej), and to remove them from FP(Ei) and FP(Ej) if Fx and Fy
are the same. The computational complexity of this step is
O(|FP(E)|n), where |FP(E)| is the number of frequent patterns
in FP(E) and n is the number of suspects. The remaining
frequent patterns in FP(Ei) form the write-print WP(Ei) of
suspect Si.
Fig. 2 – Accuracy (%) vs. min_sup (n = 6, m = 20). (a) Number of discretized intervals = 2; (b) number of discretized intervals = 4; (c) number of discretized intervals = 6.
Example 4.2. Suppose there are three suspects S1, S2, and S3
having three sets of e-mails E1, E2, and E3 respectively, as
depicted in Fig. 1. Let FP(E1) = {{A1}, {B1}, {C2}, {A1, B1}, {A1, C2},
{B1, C2}, {A1, B1, C2}} be the frequent patterns of S1. Let
FP(E2) = {{A2}, {B1}, {C1}, {C2}, {A2, B1}, {A2, C1}, {B1, C1}, {B1, C2},
{A2, B1, C1}} be the set of frequent patterns of S2, taken from Example 4.1.
Let FP(E3) = {{A1}, {B3}, {C2}, {A1, B3}, {A1, C2}, {B3, C2}, {A1,
B3, C2}} be the set of frequent patterns of S3. Then, we discard
{A1}, {B1}, {C2}, {A1, C2}, and {B1, C2} because more than one set of frequent patterns contains them. The remaining frequent patterns form the write-print of each suspect: WP(E1) = {{A1, B1},
{A1, B1, C2}}, WP(E2) = {{A2}, {C1}, {A2, B1}, {A2, C1}, {B1, C1}, {A2,
B1, C1}}, and WP(E3) = {{B3}, {A1, B3}, {B3, C2}, {A1, B3, C2}}.
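The filtering step of Example 4.2 can be sketched in Python as follows. This is an illustrative sketch rather than the paper's Algorithm 1: the names `extract_write_prints` and `as_patterns` are hypothetical, and instead of comparing pattern sets pairwise it counts how many suspects share each pattern, which removes the same common patterns.

```python
from collections import Counter

def extract_write_prints(frequent_patterns):
    """Keep, for each suspect, only the patterns no other suspect has.

    frequent_patterns -- dict mapping suspect -> set of frozenset patterns
    """
    # Number of suspects whose frequent-pattern set contains each pattern.
    owners = Counter(p for fps in frequent_patterns.values() for p in fps)
    return {suspect: {p for p in fps if owners[p] == 1}
            for suspect, fps in frequent_patterns.items()}

def as_patterns(*item_lists):
    # Helper: turn lists of feature items into a set of hashable patterns.
    return {frozenset(items) for items in item_lists}

# Frequent patterns of the three suspects, copied from Example 4.2.
FP = {
    "S1": as_patterns(["A1"], ["B1"], ["C2"], ["A1", "B1"], ["A1", "C2"],
                      ["B1", "C2"], ["A1", "B1", "C2"]),
    "S2": as_patterns(["A2"], ["B1"], ["C1"], ["C2"], ["A2", "B1"],
                      ["A2", "C1"], ["B1", "C1"], ["B1", "C2"],
                      ["A2", "B1", "C1"]),
    "S3": as_patterns(["A1"], ["B3"], ["C2"], ["A1", "B3"], ["A1", "C2"],
                      ["B3", "C2"], ["A1", "B3", "C2"]),
}

WP = extract_write_prints(FP)
for suspect, write_print in WP.items():
    print(suspect, sorted(sorted(p) for p in write_print))
```

Counting owners makes a single pass over all patterns, which is a design choice of this sketch; a pairwise comparison as described in the text would give the same write-prints.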
4.3. Identifying author (lines 17–24)
Lines 17–24 determine the author of the malicious e-mail m by
comparing m with each write-print WP(Ei) ∈ {WP(E1), ..., WP(En)}
and identifying the write-print most similar to m. Intuitively,
a write-print WP(Ei) is similar to m if many frequent patterns
in WP(Ei) match the style in m. Formally, a frequent pattern
F matches m if m contains every feature item in F.
Eq. (1) shows the score function that quantifies the similarity between the malicious e-mail m and a write-print WP(Ei).
The frequent patterns are not equally important, and their importance is reflected by their support in Ei, i.e., the percentage
of e-mails in Ei sharing such a combination of features. Thus,
the score function accumulates the supports of the matched frequent patterns and divides the result by the number of frequent patterns
in WP(Ei) to normalize for write-prints of different sizes.
$$\mathrm{Score}(m \approx WP(E_i)) = \frac{\sum_{j=1}^{p} \mathrm{support}(MP_j \mid E_i)}{|WP(E_i)|} \qquad (1)$$
where MP = {MP1, ..., MPp} is the set of matched patterns between
WP(Ei) and the malicious e-mail m. The score is a real number
within the range of [0, 1]. A higher score means a
higher similarity between the write-print and the malicious
e-mail m. The suspect having the write-print with the highest
score is identified as the most plausible author of the malicious e-mail m.
Example 4.3. Let the patterns found in the malicious e-mail m be
{A2, B1, C1} and {A1, B1, C2}. Comparing them to the write-prints
in Example 4.2, we notice that the first pattern matches a pattern in WP(E2) while the second pattern matches a pattern in
WP(E1). Provided the two matched patterns have comparable support, the score calculated according to Eq. (1) is higher for
WP(E1) because |WP(E1)| < |WP(E2)|. As a result, the malicious e-mail m is most similar to WP(E1), suggesting that S1 is the author.
In the unlikely case where multiple suspects have the same
highest score, we return all of them to the user.
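The score function of Eq. (1) and the tie-handling rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the names `score` and `identify_author` are hypothetical, the write-prints are those of Example 4.2, and the support values and the feature items of the e-mail `m` are invented for the sketch, since the paper does not list them.

```python
def score(m_items, write_print):
    """Eq. (1): summed support of matched patterns over |WP(Ei)|.

    m_items     -- set of feature items found in the malicious e-mail m
    write_print -- dict mapping frozenset pattern -> support in Ei
    """
    if not write_print:
        return 0.0  # assumption of this sketch: an empty write-print scores 0
    # A pattern matches m if m contains every feature item of the pattern.
    matched = sum(sup for pattern, sup in write_print.items()
                  if pattern <= m_items)
    return matched / len(write_print)

def identify_author(m_items, write_prints):
    """Return (list of top-scoring suspects, all scores); ties are kept."""
    scores = {s: score(m_items, wp) for s, wp in write_prints.items()}
    best = max(scores.values())
    return [s for s, v in scores.items() if v == best], scores

fs = frozenset
# Write-prints from Example 4.2 with hypothetical support values.
WRITE_PRINTS = {
    "S1": {fs({"A1", "B1"}): 0.6, fs({"A1", "B1", "C2"}): 0.4},
    "S2": {fs({"A2"}): 0.4, fs({"C1"}): 0.4, fs({"A2", "B1"}): 0.3,
           fs({"A2", "C1"}): 0.3, fs({"B1", "C1"}): 0.3,
           fs({"A2", "B1", "C1"}): 0.3},
    "S3": {fs({"B3"}): 0.5, fs({"A1", "B3"}): 0.4, fs({"B3", "C2"}): 0.4,
           fs({"A1", "B3", "C2"}): 0.3},
}

m = {"A1", "B1", "C2"}  # hypothetical feature items of the malicious e-mail
authors, scores = identify_author(m, WRITE_PRINTS)
print(authors, scores)
```

Returning a list rather than a single name mirrors the rule that all suspects sharing the highest score are reported to the user.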
5. Experimental evaluation
Our goals in this section are to evaluate the proposed method,
AuthorMiner, in terms of authorship identification accuracy
and to verify whether the extracted write-print provides strong evidence for supporting the conclusion on authorship. We
employed the Enron E-mail Dataset (footnote 1), which contains 200,399
real-life e-mails from 158 employees of the Enron corporation
after cleaning. As a pre-processing step, we removed the
empty spaces, special characters, and blank lines and tokenized the e-mails as described in Section 3.1. Unlike ordinary text mining applications, which aim at extracting the
general trends in the text, our goal is to differentiate the writing styles of different suspects. Therefore, we keep all the function words and short words.
To evaluate the authorship identification accuracy of our
method, we randomly select n employees from the Enron E-mail Dataset, representing n suspects {S1, ..., Sn}. For each suspect Si, we choose m of Si’s e-mails, where 2/3 of the m e-mails
are for training and the remaining 1/3 of the m e-mails are for
testing. We then apply our method, AuthorMiner, to extract
the write-prints of {S1, ..., Sn} from the training set and then
determine the author of each e-mail in the testing set. The authorship identification accuracy is measured by the percentage of correctly matched authors in the testing set.
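The evaluation protocol described above can be sketched in Python as follows. This is only an illustration of the 2/3 vs. 1/3 split and of the accuracy measure. The names `split_emails` and `accuracy`, the fixed random seed, and the rounding of the training-set size are assumptions of this sketch; the paper does not state how a non-integer 2/3 share (e.g. for m = 20) is handled.

```python
import random

def split_emails(emails, train_fraction=2 / 3, seed=0):
    """Shuffle one suspect's e-mails and split them into train and test."""
    shuffled = list(emails)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatability
    cut = round(len(shuffled) * train_fraction)  # rounding is an assumption
    return shuffled[:cut], shuffled[cut:]

def accuracy(predicted, actual):
    """Percentage of test e-mails whose predicted author is correct."""
    if not actual:
        raise ValueError("empty test set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

# Usage with placeholder e-mail identifiers for one suspect (m = 20).
train, test = split_emails([f"email_{i}" for i in range(20)])
print(len(train), len(test))
print(accuracy(["S1", "S2", "S1"], ["S1", "S2", "S3"]))
```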
Fig. 2 depicts the authorship identification accuracy for
n = 6 and m = 20 (i.e., a total of 120 e-mails) for different numbers of discretized intervals. The accuracy spans from 86% to
90% at min_sup = 0.1, 0.3 and 0.5, suggesting that our proposed
method can effectively identify the author of an e-mail based
on the extracted write-prints when a reasonable min_sup is
specified. As min_sup increases, the number of extracted frequent patterns, i.e. |FP(Ei)|, decreases, and the extracted frequent patterns tend to capture the general writing style that
is common to other suspects; thus, they are likely to be eliminated
by the filtering process of our method. As a result, the write-print becomes less effective for authorship identification and
the accuracy decreases.
1 http://www.cs.cmu.edu/~enron/.
Fig. 3 – Accuracy (%) vs. number of intervals (n = 6, m = 20), with one curve for each min_sup value (0.1, 0.3, 0.5, 0.7, 0.9).
Fig. 2 illustrates that the accuracy spans from 70% to 90%
for 2 intervals, from 66% to 90% for 4 intervals, and from
73% to 90% for 6 intervals. Though we are testing a broad range
of min_sup, the accuracy is relatively stable. These results suggest that our method is very robust to different user-specified
min_sup values. To study how the number of
discretized intervals affects the accuracy, we measure the
authorship identification accuracy with respect to the number
of intervals. Fig. 3 also suggests that our method is very robust
to different numbers of intervals.
Fig. 4 depicts the authorship identification accuracy for
n = 10 and m = 10 (i.e., a total of 100 e-mails) for different numbers of discretized intervals. The accuracy spans from 80% to
90% at min_sup = 0.1 and 0.3, suggesting that our proposed
method can effectively identify the author of an e-mail based
on the extracted write-prints when a reasonable min_sup is
specified. As min_sup increases, the accuracy decreases as
explained before.
Fig. 4 illustrates that the accuracy spans from 66% to 83%
for 2 intervals, from 63% to 83% for 4 intervals, and from
66% to 90% for 6 intervals. Though we are testing a broad range
of min_sup, the accuracy is relatively stable. These results suggest that our method is very robust to different user-specified
min_sup values. To study how the number of
discretized intervals affects the accuracy, we measure the
authorship identification accuracy with respect to the number
of intervals. Fig. 5 also suggests that our method is very robust
to different numbers of intervals.
Comparing Figs. 2 and 4, we notice that the authorship
identification accuracy drops from an average of 80.5% in
Fig. 2 to an average of 77% in Fig. 4. Though there is a drop
in accuracy, the drop is relatively small compared to the increase in the number of suspects from 6 to 10. Most traditional classifiers
would have a very significant drop as the number of target
classes (suspects) increases.
In addition to measuring the quality of the write-print using
authorship identification accuracy, we also manually examined the extracted write-prints and found that frequent patterns can succinctly capture combinations of features that
occur frequently in the suspect’s e-mails. Many of those hidden patterns are not obvious. Because all the
matched frequent patterns can be found in the anonymous
(malicious) e-mail, the frequent patterns themselves serve
as strong evidence for supporting the conclusion on
authorship.
Fig. 4 – Accuracy (%) vs. min_sup (n = 10, m = 10). (a) Number of discretized intervals = 2; (b) number of discretized intervals = 4; (c) number of discretized intervals = 6.
Fig. 5 – Accuracy (%) vs. number of intervals (n = 10, m = 10), with one curve for each min_sup value (0.1, 0.3, 0.5, 0.7, 0.9).
6. Conclusion
In this paper, we formally define the problem of authorship attribution and refine the problem into three subproblems: (1) identifying
the write-print of each suspect, (2) determining the
author of the malicious e-mail, and (3) extracting evidence for supporting the conclusion on authorship. Generally, the same
three-phase methodology is applied in a court of law for resolving the attribution issue. Most previous contributions focused on improving the classification accuracy of authorship
identification, but only a few of them studied how to gather
strong evidence for a court of law.
We introduce a novel approach to authorship attribution
and formulate a new notion of write-print based on the concept of frequent patterns. Unlike the write-prints in previous
literature, which are sets of predefined features, our notion of
write-print is dynamically extracted from the data as combinations of features that occur frequently in a suspect’s
e-mails, but not frequently in other suspects’ e-mails. The
experimental results on a real-life e-mail dataset suggest that
the identified write-print not only helps identify the
author of an anonymous e-mail, but also presents intuitive
yet strong evidence for supporting the authorship finding.
This novel approach opens up a promising new direction for
authorship attribution. We will further extend our tool to
adopt different types of stylometric features and utilize the
concept of frequent patterns to identify the hidden write-prints of
individuals for the purpose of e-mail forensics. Similarly,
more interesting results can be obtained by using the proposed approach on real e-mail traffic containing malicious e-mails.
Acknowledgments
The research is supported in part by the Discovery Grants
(356065-2008) from the Natural Sciences and Engineering Research Council of Canada (NSERC), and the Faculty Start-up
Funds from Concordia University.
References
Abbasi A, Chen H. Writeprints: a stylometric approach to identity-level identification and similarity detection in cyberspace.
ACM Transactions on Information Systems March 2008;26(2).
Agrawal R, Imielinski T, Swami A. Mining association rules
between sets of items in large databases. ACM SIGMOD Record
June 1993;22(2):207–16.
Burrows J. An ocean where each kind: statistical analysis and
some major determinants of literary style. Computers and the
Humanities August 1989;23(4–5):309–21.
Corney M, De Vel O, Anderson A, Mohay G. Gender-preferential
text mining of e-mail discourse. In: 18th annual computer
security applications conference (ACSAC); 2002.
Cristianini N, Shawe-Taylor J. An introduction to support vector
machines. UK: Cambridge University Press; 2000.
De Vel O. Mining e-mail authorship. Paper presented at the
workshop on text mining. In: ACM international conference
on knowledge discovery and data mining (KDD); 2000.
De Vel O, Anderson A, Corney M, Mohay G. Mining e-mail content
for author identification forensics. SIGMOD Record 2001a;
30(4):55–64.
De Vel O, Anderson A, Corney M, Mohay G. Multi-topic e-mail
authorship attribution forensics. In: ACM conference on
computer security – Workshop on data mining for security
applications; November 2001b.
Fung CM, Wang K, Ester M. Hierarchical document clustering
using frequent itemsets. In: Proceedings of the third SIAM
international conference on data mining (SDM); May 2003.
p. 59–70.
Halteren HV. Author verification by linguistic profiling: an
exploration of the parameter space. ACM Transactions on
Speech and Language Processing January 2007;4(1).
Han J, Pei J. Mining frequent patterns by pattern-growth:
methodology and implications. ACM SIGKDD Explorations
Newsletter 2000;2(2).
Holmes I. The evolution of stylometry in humanities. Literary and
Linguistic Computing 1998;13(3):111–7.
Holmes I, Forsyth RS. The federalist revisited: new directions in
authorship attribution. Literary and Linguistic Computing
1995;10(2):111–27.
Holt JD, Chung SM. Efficient mining of association rules in text
databases. In: Proceedings of the eighth ACM international
conference on information and knowledge management
(CIKM). Kansas City, Missouri, United States; 1999. p. 234–42.
Ledger GR, Merriam TVN. Shakespeare, Fletcher, and The Two Noble
Kinsmen. Literary and Linguistic Computing 1994;9:235–48.
Mendenhall TC. The characteristic curves of composition.
Science 1887;11(11):237–49.
Quinlan JR. Induction of decision trees. Machine Learning 1986;
1(1):81–106.
Teng G-F, Lai M-S, Ma J-B, Li Y. E-mail authorship mining based
on SVM for computer forensic. In: Proceedings of the third
international conference on machine learning and
cybernetics. Shanghai, China; August 2004.
Tweedie J, Baayen RH. How variable may a constant be? measures
of lexical richness in perspective. Computers and the
Humanities 1998;32:323–52.
Yule G. On sentence length as a statistical characteristic of style
in prose. Biometrika 1938;30:363–90.
Yule G. The statistical study of literary vocabulary. Cambridge,
UK: Cambridge University Press; 1944.
Zaki MJ. Scalable algorithms for association mining. IEEE
Transactions on Knowledge and Data Engineering (TKDE) 2000;
12:372–90.
Zheng R, Li J, Chen H, Huang Z. A framework for authorship
identification of online messages: writing-style features and
classification techniques. Journal of the American Society for
Information Science and Technology 2006;57(3):378–93.
Zheng R, Qin Y, Huang Z, Chen H. Authorship analysis in
cybercrime investigation. In: Proceedings of the first
international symposium on intelligence and security
informatics (ISI); 2003.
Farkhund Iqbal is a Ph.D. candidate in the Faculty of Electrical
and Computer Engineering (ENCS), Concordia University,
Montreal. His Ph.D. topic is E-mail Forensics with special
focus on addressing e-mail anonymity issues with the help
of analyzing e-mail social networks. He is an active member
of the Computer Security Laboratory (CSL) and the cyber forensic
team at the Concordia Institute for Information Systems Engineering (CIISE). The forensic team is working on designing
and implementing a state-of-the-art Cyber Forensic Analysis
Framework. His research interests include cyber crime analysis, e-mail crime analysis, social network analysis, authorship analysis, data mining and privacy.
Rachid Hadjidj is a postdoctoral fellow at Concordia University working on software security and computer forensics.
He received his Ph.D. degree in 2006 from l’École Polytechnique de
Montréal, where he worked on the formal validation of real-time
systems. Some of his main fields of interest are real-time
systems, system engineering, hardware and software
verification, and formal methods.
Benjamin Fung is an Assistant Professor of the Concordia Institute for Information Systems Engineering (CIISE) at Concordia University. He received the Ph.D. degree in Computing
Science from Simon Fraser University, Canada. In the past,
he worked at Business Objects Inc. and designed reporting
systems for various Enterprise Resource Planning (ERP) and
Customer Relationship Management (CRM) systems. His current research interests include data mining, database, privacy
preservation, information security, and digital forensics, as
well as their applications in emerging technologies. He has
published in data engineering, data mining, and security conferences, journals, and books, including ACM SIGKDD, IEEE
TKDE, IEEE ICDE, IEEE ICDM, IEEE ISI, EDBT, SIAM SDM, KAIS.
He serves as an editorial board member for IJDATS, and a program committee member for ACM SIGKDD, ACM CIKM, and
SIAM SDM. His research has been supported in part by the
Discovery Grants from Natural Sciences and Engineering Research Council of Canada (NSERC) and Faculty Start-up Funds
from Concordia University.
Dr. Mourad Debbabi is a Full Professor and the Director of the
Concordia Institute for Information Systems Engineering, Concordia University, Montreal, Quebec, Canada. He holds the Concordia Research Chair Tier I in Information Systems Security. He
is the founder and one of the leaders of the Computer Security
Laboratory (CSL) at Concordia University. He is the Specification
Lead of four Standard JAIN (Java Intelligent Networks) Java Specification Requests (JSRs) dedicated to the elaboration of standard
specifications for presence and instant messaging. In the past,
he served as Senior Scientist at the Panasonic Information and
Network Technologies Laboratory, Princeton, New Jersey, USA;
Associate Professor at the Computer Science Department of
Laval University, Quebec, Canada; Senior Scientist at General
Electric Research Center, New York, USA; Research Associate
at the Computer Science Department of Stanford University,
California, USA; and Permanent Researcher at the Bull
Corporate Research Center, Paris, France. Dr. Debbabi holds
Ph.D. and M.Sc. degrees in computer science from Paris-XI Orsay University, France. He has published more than 130 research
papers in journals and conferences on computer security, formal semantics, Java security and acceleration, cryptographic
protocols, malicious code detection, programming languages,
type theory and specification and verification of safety-critical
systems. He supervised to completion more than 50 graduate
students at M.Sc. and Ph.D. levels. He can be reached at
debbabi@ciise.concordia.ca. His webpage is at
http://www.ciise.concordia.ca/~debbabi