Sequential data analysis An introduction to R Outline

Sequential data analysis
Sequential data analysis
Outline
Sequential data analysis
An introduction to R
Gilbert Ritschard
Department of Econometrics and Laboratory of Demography, University of
Geneva
http://mephisto.unige.ch/biomining
APA-ATI Workshop on Exploratory Data Mining
University of Southern California, Los Angeles, CA, July 2009
23/7/2009gr 1/64
1
Introduction
2
Installing and launching R
3
Objects and operators
4
Elements of statistical modeling
5
Growing trees: rpart and party
6
Custom functions and programming
23/7/2009gr 2/64
Sequential data analysis
Introduction
Sequential data analysis
Installing and launching R
R
Installation
R is:
R and the modules can be downloaded from the CRAN
http://cran.r-project.org
Software environment for statistical computing and graphics
Based on the S language (as is S-PLUS)
By default, no GUI is proposed under Linux.
Freely distributed under GPL licence
Under Windows and MacOSX, the basic GUI remains limited.
Available for any platform: Windows/Mac/Linux/Unix
... but try Rcmdr (can be download from the CRAN)
Easily extensible with numerous contributed modules
23/7/2009gr 4/64
23/7/2009gr 6/64
Sequential data analysis
Installing and launching R
Sequential data analysis
Objects and operators
Introduction to R objects
First steps in R
Objects
R works with objects
Four possibilities to send commands to R
1
Type commands in the R Console.
2
The script editor -> File/New script (only Windows/Mac)
3
The Rcmd module
4
Use a text editor with R support (Tinn-R, WinEdt, etc.)
Assigning a value to an object ‘a’
R> a <- 50
Operation on an object
R> a/50
[1] 1
Case-sensitive: a 6= A
In addition, you can also use your preferred text editor and
copy-paste the commands into the R Console,
R> A/50
Error: object "A" not found
23/7/2009gr 7/64
23/7/2009gr 10/64
1
Sequential data analysis
Objects and operators
Introduction to R objects
Sequential data analysis
Objects and operators
Introduction to R objects
Types of objects
Factors I
A factor is defined by “levels” (possible values) and an
indicator of whether it is ordinal or not.
Different types of objects
vector: 4 5 1 or in R c(4,5,1)
”D” ”E” ”A” or in R c("D","E","A")
Vector of “strings”
R> sex <- c("man", "woman", "woman", "man", "woman")
R> sex
factor: categorical variable
[1] "man"
matrix: table of numerical data
"woman" "woman" "man"
"woman"
Creation of a factor
data frame: general data table (columns can be of different
types)
R> sex.fac <- factor(sex)
R> sex.fac
...
[1] man
woman woman man
Levels: man woman
woman
R> attributes(sex.fac)
23/7/2009gr 11/64
23/7/2009gr 12/64
Sequential data analysis
Objects and operators
Introduction to R objects
Sequential data analysis
Objects and operators
Introduction to R objects
Factors II
$levels
[1] "man"
Objects (continued) I
"woman"
$class
[1] "factor"
Results can always be stored in a new object
R> table(sex.fac)
Example:
sex.fac
man woman
2
3
R>
R>
R>
R>
To change the order of the “levels”
R> sex.fac2 <- factor(sex, levels = c("woman", "man"))
R> sex.fac2n <- as.numeric(sex.fac2)
R> table(sex.fac2, sex.fac2n)
library(TraMineR)
data(mvad)
tab.male.gcse <- table(mvad$male, mvad$gcse5eq)
tab.male.gcse
no yes
no 186 156
yes 266 104
sex.fac2n
sex.fac2 1 2
woman 3 0
man
0 2
23/7/2009gr 13/64
23/7/2009gr 14/64
Sequential data analysis
Objects and operators
Introduction to R objects
Sequential data analysis
Objects and operators
Introduction to R objects
Objects (continued)
Row and marginal distributions
Depending of its class, methods can be directly applied to it
Row and column distributions
R> prop.table(tab.male.gcse, 1)
R> plot(tab.male.gcse, cex.axis = 1.5)
no
yes
no 0.5438596 0.4561404
yes 0.7189189 0.2810811
tab.male.gcse
yes
R> prop.table(tab.male.gcse, 2)
no
yes
no 0.4115044 0.6000000
yes 0.5884956 0.4000000
no
no
Margins
yes
R> margin.table(tab.male.gcse, 1)
no yes
342 370
R> margin.table(tab.male.gcse, 2)
no yes
452 260
23/7/2009gr 15/64
23/7/2009gr 16/64
2
Sequential data analysis
Objects and operators
Acting on subsets of objects
Sequential data analysis
Objects and operators
Acting on subsets of objects
Indexes
Crosstable on data subsets
Indexing vectors
x[n]
nth element
x[-n]
all but the nth element
x[1:n]
first n elements
x[-(1:n)]
elements from n+1 to the end
x[c(1,4,2)]
specific elements
x["name"]
element named "name"
x[x > 3]
all elements greater than 3
x[x > 3 & x < 5] all elements between 3 and 5
x[x %in% c("a","and","the")] elements in the given set
Cross tables for catholic and non catholic
R> table(mvad$male[mvad$catholic == "yes"], mvad$gcse5eq[mvad$catholic ==
+
"yes"])
no yes
no
82 77
yes 133 52
Indexing matrices
x[i,j]
x[i,]
x[,j]
x[,c(1,3)]
x["name",]
R> table(mvad$male[mvad$catholic == "no"], mvad$gcse5eq[mvad$catholic ==
+
"no"])
element at row i, column j
row i
column j
columns 1 and 3
row named "name"
no yes
no 104 79
yes 133 52
Indexing data frames (matrix indexing plus the following)
x[["name"]]
x$name
column named "name"
idem
23/7/2009gr 18/64
23/7/2009gr 19/64
Sequential data analysis
Objects and operators
Acting on subsets of objects
Sequential data analysis
Objects and operators
Importation/exportation
3-dimensional crosstables
Opening and closing R
Alternatively
R saves the working environment in the .RData file of the
current directory.
R> table(mvad$male, mvad$gcse5eq, mvad$catholic)
, ,
= no
getwd()
provides the current directory
setwd("C:/introR/")
sets the current directory
save.image()
saves the working directory in .RData
load("example.RData")
loads working directory example.RData
no yes
no 104 79
yes 133 52
, ,
= yes
no yes
no
82 77
yes 133 52
On line help command: help(subject), or ?sujet
23/7/2009gr 20/64
23/7/2009gr 22/64
Sequential data analysis
Objects and operators
Importation/exportation
Sequential data analysis
Objects and operators
Importation/exportation
Object Management
Importing text files
R can import text files (tab-delimited, CSV, ...) with
read.table()
List of objects in the “Workingspace”
read.table(file, header = FALSE, sep = "", quote = "\"'", dec = ".",
row.names, col.names, as.is = FALSE, na.strings = "NA",
colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#")
R> ls()
[1] "a"
[5] "mvad"
[9] "sex.fac2"
"datadir"
"pngdir"
"sex.fac2n"
"filename"
"graphdir"
"sex"
"sex.fac"
"tab.male.gcse"
Removing objects
R> rm(sex, sex.fac2)
R> ls()
[1] "a"
"datadir"
[5] "mvad"
"pngdir"
[9] "tab.male.gcse"
Ex: importing a tab-delimited file with variables names in first row:
"filename"
"sex.fac"
R> example <- read.table(file = "example.dat", header = TRUE,
+
sep = "\t")
R> example
"graphdir"
"sex.fac2n"
1
2
3
23/7/2009gr 23/64
age revenu sexe
25
100 homme
45
200 femme
30
50 homme
23/7/2009gr 24/64
3
Sequential data analysis
Objects and operators
Importation/exportation
Sequential data analysis
Objects and operators
Importation/exportation
Importing data from other formats
Exportation
R can import SPSS, Stata, SAS, minitab, ... files with the
foreign library
Exporting in text file
Loading the library
R> write.table(mydata, file = "export.txt", sep = "\t")
R> library(foreign)
Labels are lost, factors are saved as strings
Reading the SPSS file
Alternatively with foreign
R> mydata <- read.spss("example.sav", to.data.frame = TRUE)
R> write.foreign(mydata, datafile = "export.txt",
+
codefile = "export.sps", package = "SPSS")
Same principle for other formats
See help on the foreign library
R> library(help = "foreign")
23/7/2009gr 25/64
23/7/2009gr 26/64
Sequential data analysis
Elements of statistical modeling
Sequential data analysis
Elements of statistical modeling
Statistical modeling: Regression
Loading the data
R>
R>
R>
+
R>
We use the mvad data of TraMineR
Regression of longitudinal entropies on
mvad.lab <- seqstatl(mvad[, 17:86])
mvad.shortlab <- c("EM", "FE", "HE", "JL", "SC", "TR")
mvad.seq <- seqdef(mvad[, 17:86], labels = mvad.lab,
states = mvad.shortlab)
summary(mvad.seq)
[>]
[>]
[>]
[>]
[>]
[>]
[>]
[>]
[>]
male, catholic, ...
23/7/2009gr 28/64
dimensionality of the sequence space: 350
712 sequences in the data set
490 unique sequences in the data set
min/max sequence length: 70 / 70
alphabet: 1=EM 2=FE 3=HE 4=JL 5=SC 6=TR
colors: 1=#7FC97F 2=#BEAED4 3=#FDC086 4=#FFFF99 5=#386CB0 6=#F0027F
labels: 1=employment 2=FE 3=HE 4=joblessness 5=school 6=training
code for missing statuses: *
code for void state: *
23/7/2009gr 29/64
Sequential data analysis
Elements of statistical modeling
Sequential data analysis
Elements of statistical modeling
Computing longitudinal entropies
Linear regression: lm() I
Computing entropies
R> entrop <- seqient(mvad.seq)
Boxplot of their distribution
R> boxplot(entrop, col = "lightblue", cex.axis = 1.5)
0.8
Creating the regression object
0.0
0.2
0.4
0.6
R> lm.entrop <- lm(entrop ~ male + catholic + gcse5eq, data = mvad)
●
23/7/2009gr 30/64
23/7/2009gr 31/64
4
Sequential data analysis
Elements of statistical modeling
Sequential data analysis
Elements of statistical modeling
Linear regression: results
Plotting a regression object
Displaying the results
R> plot(lm.entrop, which = 2)
R> summary(lm.entrop)
Call:
lm(formula = entrop ~ male + catholic + gcse5eq, data = mvad)
3
Normal Q−Q
●
●
●
●●●●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
310 ●●●
421
193
●
●
−2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.38916
0.01277 30.482 < 2e-16 ***
maleyes
-0.04177
0.01329 -3.143 0.00174 **
catholicyes 0.01768
0.01307
1.353 0.17643
gcse5eqyes
0.06451
0.01378
4.680 3.43e-06 ***
--Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1
2
Max
4.479e-01
0
3Q
1.286e-01
Standardized residuals
Median
2.750e-05
−1
Residuals:
Min
1Q
-4.713e-01 -9.506e-02
Residual standard error: 0.1741 on 708 degrees of freedom
Multiple R-squared: 0.05376, Adjusted R-squared: 0.04975
F-statistic: 13.41 on 3 and 708 DF, p-value: 1.615e-08
−3
−2
−1
0
1
2
3
Theoretical Quantiles
lm(entrop ~ male + catholic + gcse5eq)
23/7/2009gr 32/64
23/7/2009gr 33/64
Sequential data analysis
Elements of statistical modeling
Sequential data analysis
Elements of statistical modeling
Logistic regression I
Logistic regression II
Logistic regression: specific case of the generalized linear
model glm() with family = binomial
R> lg.gr <- glm(gcse5eq ~ male + catholic, family = binomial,
+
data = mvad)
R> summary(lg.gr)
(Dispersion parameter for binomial family taken to be 1)
Call:
glm(formula = gcse5eq ~ male + catholic, family = binomial, data = mvad)
Deviance Residuals:
Min
1Q
Median
-1.1286 -1.0821 -0.7929
3Q
1.2758
Null deviance: 934.62
Residual deviance: 910.51
AIC: 916.51
Max
1.6189
on 711
on 709
degrees of freedom
degrees of freedom
Number of Fisher Scoring iterations: 4
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.2283
0.1315 -1.736
0.0825 .
maleyes
-0.7677
0.1588 -4.833 1.34e-06 ***
catholicyes
0.1124
0.1586
0.709
0.4783
--Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
23/7/2009gr 34/64
23/7/2009gr 35/64
Sequential data analysis
Elements of statistical modeling
Sequential data analysis
Elements of statistical modeling
computing the “odds ratios”
ANOVA
Retrieve coefficients and compute their exp()
ANOVA
R> exp(lg.gr$coefficients)
R> summary(aov(entrop ~ male, data = mvad))
(Intercept)
0.795889
Df Sum Sq Mean Sq F value
Pr(>F)
male
1 0.4888 0.4888 15.645 8.406e-05 ***
Residuals
710 22.1807 0.0312
--Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
maleyes catholicyes
0.464072
1.118991
Completing the table of coefficients, standard errors and
significativity with exp(β)
R> summary(aov(lm.entrop))
R> lg.gr.coeff <- as.data.frame(summary(lg.gr)$coefficients)
R> lg.gr.coeff <- cbind(lg.gr.coeff, `Exp Estim.` = exp(lg.gr.coeff[,
+
"Estimate"]))
R> lg.gr.coeff
Df Sum Sq Mean Sq F value
Pr(>F)
male
1 0.4888 0.4888 16.1322 6.539e-05 ***
catholic
1 0.0662 0.0662 2.1847
0.1398
gcse5eq
1 0.6637 0.6637 21.9046 3.434e-06 ***
Residuals
708 21.4508 0.0303
--Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Estimate Std. Error
z value
Pr(>|z|) Exp Estim.
(Intercept) -0.2282955 0.1314786 -1.7363697 8.249849e-02
0.795889
maleyes
-0.7677155 0.1588396 -4.8332757 1.343046e-06
0.464072
catholicyes 0.1124274 0.1585663 0.7090246 4.783092e-01
1.118991
23/7/2009gr 37/64
23/7/2009gr 36/64
5
Sequential data analysis
Growing trees: rpart and party
Sequential data analysis
Growing trees: rpart and party
rpart and party
rpart: methods
At least two R-packages for growing (binary) trees:
rpart proposes several methods:
rpart (Therneau and Atkinson, 1997): recursive partitioning
CART, Relative risk trees,
party (Hothorn et al., 2006): conditional partitioning
Based on a statistical conditional inference method
(permutation tests)
"poisson", rate tree, for binary responses.
"class", classification tree, when response is a factor.
"anova", regression tree, when response is quantitative.
"exp", risk tree, for a survival response.
We propose here a short introduction to these packages
If method is not specified, rpart tries a guess from the
response type.
rpart Essentially Cart + extension for relative risk trees
party much more powerful and flexible.
better visual rendering (Plots distributions inside the nodes)
23/7/2009gr 40/64
23/7/2009gr 39/64
Sequential data analysis
Growing trees: rpart and party
rpart
Sequential data analysis
Growing trees: rpart and party
rpart
Growing a rate tree rpart
Plotting the tree
R> par(xpd = NA)
R> plot(cart.mvad.gcse)
R> text(cart.mvad.gcse, use.n = T, cex = 0.9, fancy = F)
R> library(rpart)
R> cart.mvad.gcse <- rpart(gcse5eq ~ male + catholic, data = mvad,
+
method = "poisson", control = list(minsplit = 20,
+
minbucket = 10, cp = 1e-04))
R> cart.mvad.gcse
male=b
|
n= 712
node), split, n, deviance, yval
* denotes terminal node
1) root 712 115.74880 1.365169
2) male=yes 370 53.52554 1.281247 *
3) male=no 342 58.23767 1.455946
6) catholic=no 183 30.99274 1.431429 *
7) catholic=yes 159 27.08354 1.483731 *
1.281
474/370
23/7/2009gr 42/64
catholic=a
1.431
262/183
1.484
236/159
23/7/2009gr 43/64
Sequential data analysis
Growing trees: rpart and party
rpart
Sequential data analysis
Growing trees: rpart and party
rpart
The printcp() function
A classification tree
R> printcp(cart.mvad.gcse)
We build a tree for the activity status in October 1996, i.e. 3
years after end of compulsory school.
Rates regression tree:
rpart(formula = gcse5eq ~ male + catholic, data = mvad, method = "poisson",
control = list(minsplit = 20, minbucket = 10, cp = 1e-04))
Variables actually used in tree construction:
[1] catholic male
R> cart.mvad.oct96 <- rpart(Oct.96 ~ male + catholic + gcse5eq,
+
data = mvad, method = "class", control = list(minsplit = 20,
+
minbucket = 10, cp = 0))
R> cart.mvad.oct96
Root node error: 115.75/712 = 0.16257
n= 712
n= 712
node), split, n, loss, yval, (yprob)
* denotes terminal node
CP nsplit rel error xerror
xstd
1 0.0344335
0
1.00000 1.00210 0.016719
2 0.0013943
1
0.96557 0.97028 0.021159
3 0.0001000
2
0.96417 0.97543 0.021656
1) root 712 333 employment (0 0.11 0.53 0.072 0.079 0.21)
2) gcse5eq=no 452 158 employment (0 0.1 0.65 0.082 0.11 0.058) *
3) gcse5eq=yes 260 139 HE (0 0.12 0.33 0.054 0.031 0.47) *
23/7/2009gr 44/64
23/7/2009gr 45/64
6
Sequential data analysis
Growing trees: rpart and party
rpart
Sequential data analysis
Growing trees: rpart and party
rpart
survival risk tree
Creating the time to event variable
R>
R>
R>
R>
R>
+
+
R>
R>
R>
R>
We build a survival data object for time to marriage from the
biofam data set provided with TraMineR
States of interest are:
2:
3:
6:
7:
married without leaving home
married and left home
married with child
divorced
If divorce occurs before any marriage, we assume marriage and
divorce the same year.
data(biofam)
svar <- 10:25
durmax <- length(svar)
biofam.seq <- seqdef(biofam, svar)
fmar <- data.frame(s2 = seqfpos(biofam.seq, state = 2),
s3 = seqfpos(biofam.seq, state = 3), s6 = seqfpos(biofam.seq,
state = 6), s7 = seqfpos(biofam.seq, state = 7))
fmar <- data.frame(fmar, fpos = apply(fmar, 1, min, na.rm = TRUE))
fmar <- data.frame(fmar, mar = (fmar$fpos != Inf))
fmar$fpos[fmar$fpos == "Inf"] <- durmax
head(fmar)
[1]
[2]
[3]
[4]
[5]
[6]
23/7/2009gr 46/64
s2
NA
NA
NA
NA
NA
NA
s3
10
12
13
NA
NA
NA
s6
11
13
14
NA
14
NA
s7 fpos
mar
NA
10 TRUE
NA
12 TRUE
NA
13 TRUE
NA
16 FALSE
NA
14 TRUE
NA
16 FALSE
23/7/2009gr 47/64
Sequential data analysis
Growing trees: rpart and party
rpart
Sequential data analysis
Growing trees: rpart and party
rpart
Survival analysis
KM Survival curves
1.0
Kaplan−Meier Survival Curves, Time to Marriage
men
women
0.4
0.6
0.8
library(survival)
surv.fmar <- Surv(time = fmar$fpos, event = fmar$mar)
sf.fmar.fh <- survfit(surv.fmar ~ biofam$sex, type = "kaplan-meier")
plot(sf.fmar.fh, main = "Kaplan-Meier Survival Curves, Time to Marriage",
xlab = "Time from 15 to marriage", col = c("red",
"blue"))
legend("topright", legend = c("men", "women"), lwd = 2,
col = c("red", "blue"))
0.0
0.2
R>
R>
R>
R>
+
+
R>
+
0
5
10
15
Time from 15 to marriage
23/7/2009gr 48/64
23/7/2009gr 49/64
Sequential data analysis
Growing trees: rpart and party
rpart
Sequential data analysis
Growing trees: rpart and party
rpart
Covariates
Survival risk tree (rpart)
Grow and plot the tree
R> stree.risk <- rpart(surv.fmar ~ sex + coho1 + coho2 +
+
coho3 + lang, data = covariates, method = "exp",
+
control = list(minsplit = 20, minbucket = 10, cp = 0.001))
R> stree.risk
n= 2000
Preparing covariates
R>
R>
+
R>
R>
R>
R>
R>
1
2
3
4
5
6
coho1 <- factor(biofam$birthyr < 1940)
coho2 <- factor(biofam$birthyr >= 1940 & biofam$birthyr <
1950)
coho3 <- factor(biofam$birthyr >= 1950)
lang <- biofam$plingu02
sex <- biofam$sex
covariates <- data.frame(sex, lang, coho1, coho2, coho3)
head(covariates)
sex
lang
man german
man german
woman french
man german
man german
man italian
coho1
FALSE
TRUE
FALSE
TRUE
FALSE
TRUE
coho2
TRUE
FALSE
TRUE
FALSE
TRUE
FALSE
node), split, n, deviance, yval
* denotes terminal node
1) root 2000 2681.7330 1.0000000
2) sex=man 908 978.1932 0.8333197
4) coho3=TRUE 286 315.0478 0.7084977 *
5) coho3=FALSE 622 655.1129 0.8975413 *
3) sex=woman 1092 1656.4480 1.1828020
6) coho2=FALSE 750 1128.3620 1.1009180
12) lang=german,italian 580 861.0025 1.0439180 *
13) lang=french 170 261.3472 1.3270480 *
7) coho2=TRUE 342 517.5828 1.3944170 *
R> par(xpd = NA)
R> plot(stree.risk)
23/7/2009gr 51/64 R> text(stree.risk, use.n = T, cex = 0.9, fancy = F)
coho3
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
23/7/2009gr 50/64
7
Sequential data analysis
Growing trees: rpart and party
rpart
Sequential data analysis
Growing trees: rpart and party
party
Plot of survival risk tree (rpart)
party principle
sex=a
|
party selects each split in two steps (to avoid bias in favor of
predictors with many different values):
First, selects the predictor with strongest association with
target,
Then, selects the best binary split for selected predictor.
coho3=b
0.7085
193/286
coho2=a
0.8975
479/622
lang=bc
1.044
444/580
1.394
285/342
1.327
141/170
23/7/2009gr 52/64
23/7/2009gr 54/64
Sequential data analysis
Growing trees: rpart and party
party
Sequential data analysis
Growing trees: rpart and party
party
Linear statistic and permutation test
A R script for generating a tree
Both steps are based on the conditional distribution of linear
statistics in a permutation test framework.
Linear statistic is:
You grow the tree with the ctree command
n
X
T Tj = vec
wi gj (Xji )h Yi , (Y1 , . . . , Yn )
∈ Rpj q
R> library(party)
R> ctree.mvad.gcse <- ctree(gcse5eq ~ male + catholic, data = mvad,
+
controls = ctree_control(mincriterion = 0.3, minsplit = 0),
+
)
R> plot(ctree.mvad.gcse, drop_terminal = F, inner_panel = node_barplot)
i=1
where gj (Xji ) is a transformation of Xji , and h() an influence
function.
Tj is computed for each permutation of the Y values among
cases, and results characterize its conditional independence
distribution.
the variable and split selection is then based on the p-value of
the observed t under this conditional independence distribution.
23/7/2009gr 55/64
23/7/2009gr 56/64
Sequential data analysis
Growing trees: rpart and party
party
Sequential data analysis
Growing trees: rpart and party
party
Classification tree, party
yes
no
Node 1 (n = 712)
yes
R> ctree.mvad.gcse
Conditional inference tree with 3 terminal nodes
no
1
no
Node 3 (n = 342)
0.8
0.6
no
Node 4 (n = 183)
no
0.2
Response: gcse5eq
Inputs: male, catholic
Number of observations:
1
0.8
0.6
0.4
0.2
0
1
Node 5 (n = 159)
1) male == {yes}; criterion = 1, statistic = 23.462
2)* weights = 370
1) male == {no}
3) catholic == {no}; criterion = 0.448, statistic = 0.945
4)* weights = 183
3) catholic == {yes}
5)* weights = 159
1
0.8
0.8
0.6
0.6
0.4
0.4
0.2
0.2
0
yes
0
yes
712
yes
no
0.4
yes
1
0.8
0.6
0.4
0.2
0
yes
no
Node 2 (n = 370)
Classification tree, text output, party
0
23/7/2009gr 57/64
23/7/2009gr 58/64
8
Sequential data analysis
Growing trees: rpart and party
party
Sequential data analysis
Growing trees: rpart and party
party
survival tree, party
Survival tree, text output, party
R> marrtree
Conditional inference tree with 5 terminal nodes
Response: surv.fmar
Inputs: sex, coho1, coho2, coho3, lang
Number of observations: 2000
Grow the tree with ctree and a survival response object.
Just plot the result to get the tree with survival curves.
R> marrtree <- ctree(surv.fmar ~ sex + coho1 + coho2 + coho3 +
+
lang, data = covariates, controls = ctree_control(mincriterion = 0.5,
+
minsplit = 0), )
R> plot(marrtree)
23/7/2009gr 59/64
1) sex == {woman}; criterion = 1, statistic = 54.008
2) coho2 == {TRUE}; criterion = 0.989, statistic = 9.395
3)* weights = 342
2) coho2 == {FALSE}
4) lang == {french}; criterion = 0.862, statistic = 7.067
5)* weights = 170
4) lang == {german, italian}
6)* weights = 580
1) sex == {man}
7) coho3 == {FALSE}; criterion = 0.996, statistic = 11.284
8)* weights = 622
7) coho3 == {TRUE}
9)* weights = 286
23/7/2009gr 60/64
Sequential data analysis
Growing trees: rpart and party
party
Sequential data analysis
Custom functions and programming
Plotted survival tree, party
Functions
R> discretize <- function(a) {
+
if (a < 0.4) {
+
return(1)
+
}
+
else {
+
if (a < 0.6) {
+
return(2)
+
}
+
else {
+
return(3)
+
}
+
}
+
}
R> discretize(0.33)
1
sex
p < 0.001
woman
man
2
coho2
p = 0.011
7
coho3
p = 0.004
TRUE
FALSE
4
lang
p = 0.138
FALSE
TRUE
french {german, italian}
1
Node 3 (n = 342)
1
Node 5 (n = 170)
1
Node 6 (n = 580)
1
Node 8 (n = 622)
1
0.8
0.8
0.8
0.8
0.8
0.6
0.6
0.6
0.6
0.6
0.4
0.4
0.4
0.4
0.4
0.2
0.2
0.2
0.2
0
0
0
0
0
5
10 15
0
5
10 15
0
5
10 15
Node 9 (n = 286)
[1] 1
0.2
0
0
5
10 15
0
5
R> table(apply(entrop, 1, discretize))
10 15
1
2
385 243
23/7/2009gr 61/64
23/7/2009gr 63/64
Sequential data analysis
Custom functions and programming
References I
Gabadinho, A., G. Ritschard, M. Studer, and N. S. Müller (2008). Mining
sequence data in R with TraMineR: A user’s guide. Technical report,
Department of Econometrics and Laboratory of Demography, University of
Geneva, Geneva. (TraMineR is on CRAN the Comprehensive R Archive
Network).
Hothorn, T., K. Hornik, and A. Zeileis (2006). party: A laboratory for recursive
part(y)itioning. User’s manual.
Maindonald, J. and J. Brown (2006). Data Analysis and Graphics Using R: An
Example-based Approach. Cambridge Series in Statistical and Probabilistic
Mathematics. Cambridge: Cambridge University Press.
Paradis, E. (2006). R for beginners. Manual, Institut des Sciences de l’
Evolution, Université Montpellier II.
R-Development-Core-Team (2008). An introduction to R (v 2.8.0). Manual,
R-project.
Spector, P. (2008). Data Manipulation with R. New York: Springer.
Therneau, T. M. and E. J. Atkinson (1997). An introduction to recursive
partitioning using the rpart routines. Technical Report Series 61, Mayo
Clinic, Section of Statistics, Rochester, Minnesota.
23/7/2009gr 64/64
9
3
84