The Newsletter of the R Project News Volume 8/1, May 2008 Editorial by John Fox Welcome to the first issue of R News for 2008. The publication of this issue follows the recent release of R version 2.7.0, which includes a number of enhancements to R, especially in the area of graphics devices. Along with changes to the new version of R, Kurt Hornik reports on changes to CRAN, which now comprises more than 1400 contributed packages, and the Bioconductor Team reports news from the Bioconductor Project. This issue also includes news from the R Foundation, and an announcement of the 2008 useR! conference to be held in August in Dortmund, Germany. R News depends upon the outstanding service of external reviewers, to whom the editorial board is deeply grateful. Individuals serving as referees during 2007 are acknowledged at the end of the current issue. This issue features a number of articles that should be of interest to users of R: Gregor Gorjanc explains how Sweave, a literate-programming facility that is part of the standard R distribution, can be used with the LYX front-end to LATEX. Jeff Enos and his co-authors introduce the tradeCosts package, which implements computations and graphs for securities transactions. Hormuzd Katki and Steven Mark describe the NestedCohort package for fitting standard survival models, such as the Cox model, Contents of this issue: Editorial . . . . . . . . . . . . . . . . . . . . . . Using Sweave with LyX . . . . . . . . . . . . . Trade Costs . . . . . . . . . . . . . . . . . . . . . Survival Analysis for Cohorts with Missing Covariate Information . . . . . . . . . . . . . segmented: An R Package to Fit Regression Models with Broken-Line Relationships . . . Bayesian Estimation for Parsimonious Threshold Autoregressive Models in R . . . . . . . . 1 2 10 14 20 26 when some information on covariates is missing. Vito Muggeo presents the segmented package for fitting piecewise-linear regression models. Cathy Chen and her colleagues show how the BAYSTAR package is used to fit two-regime threshold autoregressive (TAR) models using Markov-chain Monte-Carlo methods. Vincent Goulet and Matthieu Pigeon introduce the actuar package, which adds actuarial functionality to R. The current issue of R News also marks the revival of two columns: Robin Hankin has contributed a Programmer’s Niche column that describes the implementation of multivariate polynomials in the multipol package; and, in a Help Desk column, Uwe Ligges and I explain the ins and outs of avoiding — and using — iteration in R. Taken as a whole, these articles demonstrate the vitality and diversity of the R Project. The continued vitality of R News, however, depends upon contributions from readers, particularly from package developers. One way to get your package noticed among the 1400 on CRAN is to write about it in R News. Keep those contributions coming! John Fox McMaster University Canada [email protected] Statistical Modeling of Loss Distributions Using actuar . . . . . . . . . . . . . . . . . . . . Programmers’ Niche: Multivariate polynomials in R . . . . . . . . . . . . . . . . . . . . . . R Help Desk . . . . . . . . . . . . . . . . . . . . Changes in R Version 2.7.0 . . . . . . . . . . . . Changes on CRAN . . . . . . . . . . . . . . . . News from the Bioconductor Project . . . . . . Forthcoming Events: useR! 2008 . . . . . . . . . R Foundation News . . . . . . . . . . . . . . . . R News Referees 2007 . . . . . . . . . . . . . . . 34 41 46 51 59 69 70 71 72 Vol. 8/1, May 2008 2 Using Sweave with LyX How to lower the LATEX/Sweave learning curve by Gregor Gorjanc Introduction LATEX (LATEX Project, 2005) is a powerful typesetting language, but some people find that acquiring a knowledge of LATEX presents a steep learning curve in comparison to other “document processors.” Unfortunately this extends also to “tools” that rely on LATEX. Such an example is Sweave (Leisch, 2002), which combines the power of R and LATEX using literate programming as implemented in noweb (Ramsey, 2006). Literate programming is a methodology of combining program code and documentation in one (source) file. In the case of Sweave, the source file can be seen as a LATEX file with parts (chunks) of R code. The primary goal of Sweave is not documenting the R code, but delivering results of a data analysis. LATEX is used to write the text, while R code is replaced with its results during the process of compiling the document. Therefore, Sweave is in fact a literate reporting tool. Sweave is of considerable value, but its use is somewhat hindered by the steep learning curve needed to acquire LATEX. The R package odfWeave (Kuhn, 2006) uses the same principle as Sweave, but instead of LATEX uses an XML-based markup language named Open Document Format (ODF). This format can be easily edited in OpenOffice. Although it seems that odfWeave solves problems for non-LATEX users, LATEX has qualities superior to those of OpenOffice. However, the gap is getting narrower with tools like OOoLATEX (Piroux, 2005), an OpenOffice macro for writing − → LATEX equations in OpenOffice, and Writer 2 LATEX (Just, 2006), which provides the possibility of converting OpenOffice documents to LATEX. LATEX has existed for decades and it appears it will remain in use. Anything that helps us to acquire and/or use LATEX is therefore welcome. LYX (LYX Project, 2006) definitely is such tool. LYX is an open source document preparation system that works with LATEX and other “companion” tools. In short, I see LYX as a “Word”-like WYSIWYM (What You See Is What You Mean) front-end for editing LATEX files, with excellent import and export facilities. Manuals shipped with LYX and posted on the wiki site (http://wiki.lyx.org) give an accessible and detailed description of LYX, as well as pointers to LATEX documentation. I heartily recommend these resources for studying LYX and LATEX. Additionally, LYX runs on Unix-like systems, including MacOSX, as well as on MS Windows. The LYX installer for MS Windows provides a neat way to install all the tools that are needed to work with LATEX in general. R News This is not a problem for GNU/Linux distributions since package management tools take care of the dependencies. TEX Live (TEX Live Project, 2006) is another way to get LATEX and accompanying tools for Unix, MacOSX, and MS Windows. LYX is an ideal tool for those who may struggle with LATEX, and it would be an advantage if it could also be used for Sweave. Johnson (2006) was the first to embark on this initiative. I have followed his idea and extended his work using recent developments in R and LYX. In the following paragraphs I give a short tutorial “LYX & Sweave in action”, where I also show a way to facilitate the learning of LATEX and consequently of Sweave. The section “LYX customisation” shows how to customise LYX to work with Sweave. I close with some discussion. LyX and Sweave in action In this section I give a brief tutorial on using Sweave with LYX. You might also read the “Introduction to LYX” and “The LYX Tutorial” manuals for additional information on the first steps with LYX. In order to actively follow this tutorial you have to customise LYX as described in the section “LYX customisation”. Open LYX, create a new file with the File –> New menu and save it. Start typing some text. You can preview your work in a PDF via the View –> PDF (*) menu, where * indicates one of the tools/routes (latex, pdflatex, etc.) that are used to convert LATEX file to PDF. The availability of different routes of conversion, as well as some other commands, depend on the availability of converters on your computer. The literate document class To enable literate programming with R you need to choose a document class that supports this methodology. Follow the Document –> Settings menu and choose one of the document classes that indicates Sweave, say “article (Sweave noweb)”. That is all. You can continue typing some text. Code chunk To enter R code you have to choose an appropriate style so that LYX will recognise this as program code. R code typed with a standard style will be treated as standard text. Click on the button “Standard” (Figure 1 — top left) and choose a scrap style, which is used for program code (chunks) in literate programming documents. You will notice that now the text you type has a different colour (Figure 1). This is an indicator that you are in a paragraph with a scrap style. There are different implementations of literate ISSN 1609-3631 Vol. 8/1, May 2008 programming. Sweave uses a noweb-like implementation, where the start of a code chunk is indicated with <<>>=, while a line with @ in the first column indicates the end of a code chunk (Figure 1). Try entering: <<myFirstChunkInLyX>>= xObs <- 100; xMean <- 10; xVar <- 9 x <- rnorm(n=xObs, mean=xMean, sd=sqrt(xVar)) mean(x) @ Did you encounter any problems after hitting the ENTER key? LYX tries to be restrictive with spaces and new lines. A new line always starts a new paragraph with a standard style. To keep the code “together” in one paragraph of a scrap style, you have to use CTRL+ENTER to go onto a new line. You will notice a special symbol (Figure 1) at the end of the lines marking unbroken newline. Now write the above chunk of R code, save the file and preview a PDF. If the PDF is not shown, check the customisation part or read further about errors in code chunks. You can use all the code chunk options in the <<>>= markup part. For example <<echo=FALSE, fig=TRUE>>=, will have an effect of hidding output from R functions, while plots will be produced and displayed. Inline code chunks LYX also supports the inclusion of plain LATEX code. Follow the Insert –> TeX Code menu, or just type CTRL+L and you will get a so-called ERT box (Figure 1) where you can type LATEX code directly. This can be used for an inline code chunk. Create a new paragraph, type some text and insert \Sexpr{xObs} into the ERT box. Save the file and check the result in a PDF format. This feature can also be used for \SweaveOpts{} directives anywhere in the document. For example, \SweaveOpts{echo=FALSE} will suppress output from all R functions after that line. ERT boxes are advantageous since you can start using some LATEX directly, but you can still produce whole documents without knowing the rest of the LATEX commands that LYX has used. Equations Typing mathematics is one of the greatest strengths of LATEX. To start an equation in LYX follow the Insert –> Math –> Inline/Display Formula menu or use CTRL+M and you will get an equation box. There is also a maths panel to facilitate the typing of symbols. You can also type standard LATEX commands into the equation box and, say, \alpha will be automatically replaced with α. You can also directly include an inline code chunk in an equation, but note that backslash in front of Sexpr will not be displayed as can be seen in Figure 1. R News 3 Floats A figure float can be filled with a code chunk and Sweave will replace the code chunk “with figures”. How can we do this with LYX? Follow the Insert –> Float –> Figure menu and you will create a new box — a figure float. Type a caption and press the ENTER key. Choose the scrap style, insert the code chunk provided below (do not forget to use CTRL+ENTER), save the file, and preview in PDF format. <<mySecondChunkInLyX, fig=TRUE>>= hist(x) @ If you want to center the figure, point the cursor at the code chunk, follow the Edit –> Paragraph Setting menu and choose alignment. This will center the code and consequently also the resulting figure. Alignment works only in LYX version 1.4.4 and later. You will receive an error with LYX version 1.4.3. If you still have LYX version 1.4.3, you can bypass this problem by retaining the default (left) alignment and by inserting LATEX code for centering within a float, say \begin{center} above and \end{center} below the code chunk. Check the section “LYX customisation” for a file with such an example. Errors in code chunks If there are any errors in code chunks, the compiation will fail. LYX will only report that an error has occurred. This is not optimal as you never know where the error occured. There is a Python script listerrors shipped with LYX for this issue. Unfortunately, I do not know how to write an additional function for collecting errors from the R CMD Sweave process. I will be very pleased if anyone is willing to attempt this. In the meantime you can monitor the weaving process if you start LYX from a terminal. The weaving process will be displayed in a terminal as if R CMD Sweave is used (Figure 1, bottom right) and you can easily spot the problematic chunk. Import/Export You can import Sweave files into LYX via the File –> Import –> Sweave... menu. Export from LYX to Sweave and to other formats also works similarly. If you want to extract the R code from the document — i.e., tangle the document — just export to R/S code. Exported files can be a great source for studying LATEX. However, this can be tedious, and I find that the View menu provides a handy way to examine LATEX source directly. Preview of LATEX and Sweave formats will work only if you set up a viewer/editor in the ‘preferences’ file (Figure 3) as shown in the following section. Do something in LYX and take a look at the produced LATEX file via the View menu. This way you can easily become acquainted with LATEX. ISSN 1609-3631 Vol. 8/1, May 2008 4 Figure 1: Screenshot of LYX with Sweave in action: LYX GUI (top-left), produced PDF (top-right), source code (Sweave) in an editor (bottom-left), and echo from weaving in a terminal (bottom-right) R News ISSN 1609-3631 Vol. 8/1, May 2008 In LYX version 1.5 the user can monitor LATEX code instantly in a separate window. Users of LYX can therefore easily become acquainted with LATEX and there should be even less reason not to use Sweave. 5 Formats LYX formats describe general information about file formats. The default specification for the LATEX file format is shown in Figure 2. This specification consists of the following fields: • format name ("latex"); LyX customisation LYX already supports noweb-like literate programming as described in the “Extended LYX Features” manual. Unfortunately, the default implementation does not work with R. To achieve this, LYX needs to be customised to use R for weaving (replacing R code with its ouput) and tangling (extracting program code), while LYX will take care of the conversion into the chosen output format, for example, PostScript, PDF, etc. LYX can convert to, as well as from, many formats, which is only a matter of having proper converters. For example latex is used to convert a LATEX file to DVI format, dvips is used to convert a DVI file to PostScript, and you can easily deduce what the ps2pdf converter does. Of course, pdflatex can also be used to directly convert LATEX to PDF. So, the idea of providing Sweave support to LYX is to specify a converter (weaver) of a Sweave file that will be used for the evaluation of R code, replacing it with the results in the generated LATEX file. Additionally, a tangler needs to be specified if only the extraction of R code is required. I describe such customisation in this section, which is deliberately detailed so that anyone with interest and C++ experience could work with the LYX team on direct support of Sweave. I also discuss a possible way for this in the subsection “Future work”. Users can customise LYX via the Tools –> Preferences menu or via configuration files. Although menus may be more convenient to use, I find that handling a configuration file is easier, less cluttered and better for the customisation of LYX on different machines. Since the readers of this newsletter already know how to work with R code, the handling of another ASCII file will not present any problems. The use of menus in LYX should be obvious from the given description. Configuration files for LYX can be saved in two places: the so-called library and the user directory. As usual, the settings in the user directory take precedence over those in the library directory and I will show only the customisation for the user. The manual “Customizing LYX: Features for the Advanced User” describes all LYX customisation features as well as system-wide customisation. The configuration file in the user directory is named ‘preferences’. Formats, converters, and document classes need to be customised to enable Sweave support in LYX. I will describe each of these in turn. Skip to the subsection “Install” on page 7, if you are not interested in the details. R News • file extension ("tex"); • format name that is displayed in the LYX GUI ("Latex (Plain)"); • keyboard shortcut ("L"); • viewer name (""); • editor name (""); • type of the document and vector graphics support by the document ("document"). Literate programming in LYX is implemented via the literate file format. The latter needs to be modified to work with R, and a new file format for R code must be introduced. The name literate must be used as this is a special file format name in LYX for literate programming based on the noweb implementation. The entries in the ‘preferences’ file for a modified literate file format and a new r file format are shown in Figure 3. The values in separate fields are more or less obvious — editor stands for your favourite editor such as Emacs, Kate, Notepad, Texmaker, Tinn-R, vi, WinEdt, Wordpad, etc. It is very useful to define your favourite editor for both the viewing and the editing of Sweave, R, latex, and pdflatex file formats. This provides the possibility of viewing the file in these formats from LYX with only two clicks, as noted in the "LYX & Sweave in action" section. Converters I have already mentioned that LYX has a powerful feature of converting between various file formats with the use of external converter tools. For our purpose, only tools to weave and tangle need to be specified, while LYX will take care of all other conversions. To have full support for Sweave in LYX the following conversions are required: • convert (import) the Sweave file into a LYX file with R chunks; • convert (weave) the LYX file with R chunks to a specified output format (LATEX, PostScript, PDF, etc.); • convert (tangle) the LYX file with R chunks to a file with R code only; and • convert (export) LYX file with R chunks to a Sweave file. ISSN 1609-3631 Vol. 8/1, May 2008 6 \format "latex" "tex" "Latex (Plain)" "L" "" "" "document" Figure 2: The default format specification for a LATEX file # # FORMATS SECTION ########################## # \format \format \format \format "literate" "r" "latex" "pdflatex" "Rnw" "R" "tex" "tex" "Sweave" "R/S code" "LaTeX (plain)" "LaTeX (pdflatex)" "" "" "" "" "editor" "editor" "editor" "editor" "editor" "editor" "editor" "editor" "document" "document" "document" "document" # # CONVERTERS SECTION ########################## # \converter "literate" "r" "R CMD Stangle $$i" "" \converter "literate" "latex" "R CMD Sweave $$i" "" \converter "literate" "pdflatex" "R CMD Sweave $$i" "" Figure 3: Format and converter definitions for Sweave support in LYX The first task can be accomplished with LYX’s import utility tool tex2lyx and its option -n to convert a literate programming file, in our case a Sweave file, to the LYX file format. This can be done either in a terminal “by hand” (tex2lyx -n file.Rnw) or via the File –> Import menu within LYX. No customisation is required for this task. tex2lyx converts the literate programming file to the LYX file format with two minor technicalities of which it is prudent to be aware. The first one is that LYX uses the term scrap instead of chunk. This is due to a historical reason and comes from another literate programming tool named nuweb (Briggs et al., 2002). I shall use both terms (scrap and chunk) interchangeably to refer to the part of the document that contains the program code. Another technicality is related to the \documentclass directive in a LATEX/Sweave file. At the time of writing, LYX provides article, report and book LATEX classes for literate programming. These are provided via document classes that will be described later on. When converting a LYX file with R chunks to other formats, the information on how to weave and possibly also tangle the file is needed. The essential part of this task is the specification of R scripts Sweave and Stangle in a ‘preferences’ file as shown in Figure 3. These scripts are part of R from version 2.4.0. Note that two converters are defined for weaving: one for latex and one for the pdflatex file format. This way both routes of LATEX conversion are supported — i.e., LATEX –> PostScript –> PDF for the latex file format, and LATEX –> PDF for the pdflatex file format. The details of weaving and tangling processes are described in the “Extended LYX Features” manual. R News Document classes LYX uses layouts for the definition of environments/styles, for example the standard layout/style for normal text and the scrap layout/style for program code in literate programming. Layout files are also used for the definition of document classes, sometimes also called text classes. Document classes with literate support for the article, report and book LATEX document classes already exist. The definitions for these files can be found in the ‘layout’ subdirectory of the LYX library directory. The files are named ‘literate-article.layout’, ‘literatereport.layout’ and ‘literate-book.layout’. That is the reason for the mandatory use of the literate file format name as described before in the formats subsection. All files include the ‘literate-scrap.inc’ file, where the scrap style is defined. The syntax of these files is simple and new files for other document classes can be created easily. When LYX imports a literate programming file it automatically chooses one of these document classes, based on a LATEX document class. The default document classes for literate programming in LYX were written with noweb in mind. There are two problems associated with this. The default literate document classes are available to the LYX user only if the ‘noweb.sty’ file can be found by LATEX during the configuration of LYX — done during the first start of LYX or via the Tools –> Reconfigure menu within LYX. This is too restrictive for Sweave users, who require the ‘Sweave.sty’ file. Another problem is that the default literate class does not allow aligning the scrap style. This means that the R users cannot center figures. ISSN 1609-3631 Vol. 8/1, May 2008 To avoid the aforementioned problems, I provide modified literate document class files that provide a smoother integration of Sweave and LYX. The files have the same names as their “noweb” originals. The user can insert R code into the Sweave file with noweb- like syntax <<>>= someRCode @ or LATEX-like syntax \begin{Scode} someRCode \end{Scode} or even a mixture of these two (Leisch, 2002). LYX could handle both types, but LYX’s definition of the style of LATEX-like syntax cannot be general enough to fulfil all the options Sweave provides. Therefore, only noweb-like syntax is supported in LYX. Nevertheless, it is possible to use LATEX-like syntax, but one has to resort to the use of plain LATEX markup. LYX has been patched to incorporate the \SweaveSyntax{}, \SweaveOpts{}, \SweaveInput{}, \Sexpr{} and \Scoderef{} commands. These commands will be handled appropriately during the import of the Sweave file into LYX. The same holds for the LATEX environment Scode, but the default layout in LYX used for this environment is not as useful as the noweb-like syntax. “Install” At least LYX version 1.4.4 and R version 2.4.0 are needed. Additionally, a variant of the Unix shell is needed. All files (‘preferences’, ‘literate-article.layout’, ‘literate-report.layout’, ‘literatebook.layout’, and ‘literature-scrap.inc’) that are mentioned in this section are available at http://cran. r-project.org/contrib/extra/lyx. There are also other files (‘test.lyx’, ‘Sweave-test-1.lyx’, and ‘templatevignette.lyx’) that demonstrate the functionality. Finally, the ‘INSTALL’ file summarises this subsection and provides additional information about the Unix shell and troubleshooting for MS Windows users. Follow these steps to enable use of Sweave in LYX: • find the so-called LYX user directory via the Help –> About LYX menu within LYX; • save the ‘preferences’ file in the LYX user directory; 7 • start LYX and update the configuration via the Tools –> Reconfigure menu; and • restart LYX. It is also possible to use LYX version 1.4.3, but there are problems with the alignment of code chunk results in floats. Use corresponding files from the ‘lyx-1.4.3’ subdirectory at http://cran.r-project. org/contrib/extra/lyx. Additionally, save the ‘syntax.sweave’ file in the LYX user directory. TEX path system It is not the purpose of this article to describe LATEX internals. However, R users who do not have experience with LATEX (the intended readership) might encounter problems with the path system that LATEX uses and I shall give a short description to overcome this. So far I have been referring to LATEX, which is just a set of commands at a higher level than “plain” TEX. Both of them use the same path system. When you ask TEX to use a particular package (say Sweave with the command \usepackage{Sweave}), TEX searches for necessary files in TEX paths, also called texmf trees. These trees are huge collections of directories that contain various files: packages, fonts, etc. TEX searches files in these trees in the following order: • the root texmf tree such as ‘/usr/share/texmf’, ‘c:/texmf’ or ‘c:/Program Files/TEX/texmf’; • the local texmf tree such ‘/usr/share/local/texmf’; ‘c:/localtexmf’ ‘c:/Program Files/TEX/texmf-local’; and as or • the personal texmf tree in your home directory, where TEX is a directory of your TEX distribution such as MiKTEX (Schenk, 2006). R ships ‘Sweave.sty’ and other TEX related files within its own texmf tree in the ‘pathToRInstallDirectory/share/texmf’ directory. You have to add R’s texmf tree to the TEX path, and there are various ways to achieve this. I believe that the easiest way is to follow these steps: • create the ‘tex/latex/R’ sub-directory in the local texmf tree; • copy the contents of the R texmf tree to the newly created directory; • save the ‘literate-*.*’ files to the ‘layouts’ subdirectory of the LYX user directory; • rebuild TEX’s filename database with the command texhash (MiKTEX has also a menu option for this task); and • assure that LATEX can find and use the ‘Sweave.sty’ file (read the TEX path system subsection if you have problems with this); • check if TEX can find ‘Sweave.sty’ — use the command kpsewhich Sweave.sty or findtexmf Sweave.sty in a terminal. R News ISSN 1609-3631 Vol. 8/1, May 2008 Users of Unix-like systems can use a link instead of a sub-directory in a local texmf tree to ensure the latest version of R’s texmf tree is used. Debian GNU/Linux and its derivatives, with R installed from official Debian packages, have this setup automatically. Additional details on the TEX path system can be found at http://www.ctan.org/ installationadvice/. Windows useRs might also be interested in notes about using MiKTEX with R for Windows at http://www.murdoch-sutherland. com/Rtools/miktex.html. Future work The customisation described above is not a difficult task (just six steps), but it would be desirable if LYX could support Sweave “out of the box”. LYX has a convenient configuration feature that is conditional on availability of various third party programs and LATEX files. Sweave support for editing could be configured if ‘Sweave.sty’ is found, while R would have to be available for conversions. To achieve this, only minor changes would be needed in the LYX source. I think that the easiest way would be to add another argument, say -ns, to the tex2lyx converter that would drive the conversion of the Sweave file to LYX as it is done for noweb files, except that the Sweavespecific layout of files would be chosen. Additionally, the format name would have to be changed from literate to avoid collision with noweb. Unfortunately, these changes require C++ skills that I do not have. Discussion LYX is not the only “document processor” with the ability to export to LATEX. AbiWord, KWord, and OpenOffice are viable open source alternatives, while I am aware of only one proprietary alternative, Scientific WorkPlace (SWP) (MacKichan Software, Inc., 2005). Karlsson (2006) reviewed and compared SWP with LYX. His main conclusions were that both LYX and SWP are adequate, but could “learn” from each other. One of the advantages of SWP is the computer algebra system MuPAD (SciFace Software, Inc., 2004) that the user gets with SWP. LYX has some support for GNU Octave, Maxima, Mathematica and Maple, but I have not tested it. Now Sweave brings R and its packages to LYX, so the advantage of SWP in this regard is diminishing. Additionally, LYX and R (therefore also Sweave) run on all major platforms, whereas SWP is restricted to Windows. Sweave by default creates PostScript and PDF files for figures. This eases the conversion to either PostScript and/or PDF of a whole document, which LYX can easily handle. The announced support for the PNG format (Leisch, personal communication) in Sweave will add the possibility to create lighter PDF R News 8 files. Additionally, a direct conversion to HTML will be possible. This is a handy alternative to R2HTML (Lecoutre, 2003), if you already have a Sweave source. The current default for R package-vignette files is Sweave, and since Sweave is based on LATEX, some developers might find it hard to write vignettes. With LYX this need not be the case anymore, as vignettes can also be created with LYX. Developers just need to add vignettespecific markup, i.e., %\VignetteIndexEntry{}, %\VignetteDepends{}, %\VignetteKeywords{} and %\VignettePackage{}, to the document preamble via the Document –> Settings –> LaTeX Preamble menu within LYX. A template for a vignette (with vignette specific markup already added) is provided in the file ‘template-vignette.lyx’ at http:// cran.r-project.org/contrib/extra/lyx. A modified layout for Sweave in LYX also defines common LATEX markup often used in vignettes, for example, \Rcode{}, \Robject{}, \Rcommand{}, \Rfunction{}, \Rfunarg{}, \Rpackage{}, \Rmethod{}, and \Rclass{}. Summary I have shown that it is very easy to use LYX for literate programming/reporting and that the LATEX/Sweave learning curve need not be too steep. LYX does not support Sweave out of the box. I describe the needed customisation, which is very simple. I hope that someone with an interest will build upon the current implementation and work with the LYX developers on the direct support of Sweave. Acknowledgements I would like to thank the LYX team for developing such a great program and incorporating patches for smoother integration of LYX and Sweave. Acknowledgements go also to Friedrich Leisch for developing Sweave in the first place, as well as for discussion and comments. Inputs by John Fox have improved the paper. Bibliography P. Briggs, J. D. Ramsdell, and M. W. Mengel. Nuweb: A Simple Literate Programming Tool, 2002. URL http://nuweb.sourceforge.net. Version 1.0b1. P. E. Johnson. How to use LYX with R, 2006. URL http://wiki.lyx.org/LyX/ LyxWithRThroughSweave. − → H. Just. Writer 2 LATEX, 2006. URL http://www. hj-gym.dk/~hj/writer2latex. Version 0.4.1d. ISSN 1609-3631 Vol. 8/1, May 2008 A. Karlsson. Scientific workplace 5.5 and LYX 1.4.2. Journal of Statistical Software, 17(Software Review 1):1–11, 2006. URL http://www.jstatsoft.org/ v17/s01/v17s01.pdf. M. Kuhn. Sweave and the open document format – the odfWeave package. R News, 6(4):2– 8, 2006. URL http://CRAN.R-project.org/doc/ Rnews/Rnews_2006-4.pdf. 9 G. Piroux. OOoLATEX, 2005. URL http://ooolatex. sourceforge.net. Version 2005-10-19. LATEX Project. LATEX - A document preparation system, 2005. URL http://www.latex-project.org/. Version 2005-12-01. LYX Project. LYX - The Document Processor, 2006. URL http://www.lyx.org. Version 1.4.4. E. Lecoutre. The R2HTML package. R News, 3(3):33– 36, 2003. URL http://CRAN.R-project.org/doc/ Rnews/Rnews_2003-3.pdf. N. Ramsey. Noweb - a simple, extensible tool for literate programming, 2006. URL http://www.eecs. harvard.edu/~nr/noweb. Version 2.11b. F. Leisch. Dynamic generation of statistical reports using literate data analysis. In W. Haerdle and B. Roenz, editors, Compstat 2002 - Proceedings in Computational Statistics, pages 575–580, Heidelberg, Germany, 2002. Physika Verlag. ISBN 3-79081517-9. C. Schenk. MikTEX Project, 2006. URL http://www. miktex.org/. Version 2.5. TEX Live Project. A distribution of TeX and friends, 2006. URL http://www.tug.org/texlive/. Version 2005-11-01. MacKichan Software, Inc. Scientific Workplace, 2005. URL http://www.mackichan.com. Version 5.5. R News SciFace Software, Inc. MuPad, 2004. //www.sciface.com. Version 3.1. URL http: Gregor Gorjanc University of Ljubljana Biotechnical faculty Slovenia [email protected] ISSN 1609-3631 Vol. 8/1, May 2008 10 Trade Costs by Jeff Enos, David Kane, Arjun Ravi Narayan, Aaron Schwartz, Daniel Suo and Luyi Zhao > data("trade.mar.2007") > head(trade.mar.2007) Introduction 1 2 3 4 5 6 Trade costs are the costs a trader must pay to implement a decision to buy or sell a security. Consider a single trade of a single equity security. Suppose on the evening of August 1, a trader decides to purchase 10,000 shares of IBM at $10, the decision price of the trade. The next day, the trader’s broker buys 10,000 shares in a rising market and pays $11 per share, the trade’s execution price. How much did it cost to implement this trade? In the most basic ex-post analysis, trade costs are calculated by comparing the execution price of a trade to a benchmark price.1 Suppose we wished to compare the execution price to the price of the security at the time of the decision in the above example. Since the trader’s decision occurred at $10 and the broker paid $11, the cost of the trade relative to the decision price was $11 − $10 = $1 per share, or $10,000 (9.1% of the total value of the execution). Measuring costs relative to a trade’s decision price captures costs associated with the delay in the release of a trade into the market and movements in price after the decision was made but before the order is completed. It does not, however, provide a means to determine whether the broker’s execution reflects a fair price. For example, the price of $11 would be a poor price if most transactions in IBM on August 2 occurred at $10.50. For this purpose a better benchmark would be the day’s volumeweighted average price, or VWAP. If VWAP on August 2 was $10.50 and the trader used this as her benchmark, then the trade cost would be $0.50 per share, or $5,000. The first version of the tradeCosts package provides a simple framework for calculating the cost of trades relative to a benchmark price, such as VWAP or decision price, over multiple periods along with basic reporting and plotting facilities to analyse these costs. Trade costs in a single period Suppose we want to calculate trade costs for a single period. First, the data required to run the analysis must be assembled into three data frames. The first data frame contains all tradespecific information, a sample of which is in the trade.mar.2007 data frame: > library("tradeCosts") 1 For period 2007-03-01 2007-03-01 2007-03-01 2007-03-01 2007-03-01 2007-03-01 id side exec.qty exec.price 03818830 X 60600 1.60 13959410 B 4400 32.21 15976510 X 13600 7.19 22122P10 X 119000 5.69 25383010 X 9200 2.49 32084110 B 3400 22.77 Trading data must include at least the set of columns included in the sample shown above: period is the (arbitrary) time interval during which the trade was executed, in this case a calendar trade day; id is a unique security identifier; side must be one of B (buy), S (sell), C (cover) or X (short sell); exec.qty is the number of shares executed; and exec.price is the price per share of the execution. The create.trade.data function can be used to create a data frame with all of the necessary information. Second, trade cost analysis requires dynamic descriptive data, or data that changes across periods for each security. > data("dynamic.mar.2007") > head(dynamic.mar.2007[c("period", "id", "vwap", + "prior.close")]) 1 2 3 4 5 6 period id vwap prior.close 2007-03-01 00797520 3.88 3.34 2007-03-01 010015 129.35 2.53 2007-03-01 023282 613.57 12.02 2007-03-01 03818830 1.58 1.62 2007-03-01 047628 285.67 5.61 2007-03-01 091139 418.48 8.22 The period and id columns match those in the trading data. The remaining two columns in the sample are benchmark prices: vwap is the volume-weighted average price for the period and prior.close is the security’s price at the end of the prior period. The third data frame contains static data for each security. > data("static.mar.2007") > head(static.mar.2007) 1301 2679 3862 406 3239 325 id symbol name sector 00036020 AAON Aaon Inc IND 00036110 AIR Aar Corp IND 00040010 ABCB Ameris Bancorp FIN 00080S10 ABXA Abx Air Inc IND 00081T10 ABD Acco Brands Corp IND 00083310 ACA Aca Capital Hldgs Inc -redh FIN The id column specifies an identifier that can be linked to the other data frames. Because this data is static, there is no period column. Once assembled, these three data frames can be analysed by the trade.costs function: an in-depth discussion of both ex-ante modeling and ex-post measurement of trade costs, see Kissell and Glantz (2003). R News ISSN 1609-3631 Vol. 8/1, May 2008 11 > result <- trade.costs(trade = trade.mar.2007, + dynamic = dynamic.mar.2007, + static = static.mar.2007, + start.period = as.Date("2007-03-01"), + end.period = as.Date("2007-03-01"), + benchmark.price = "vwap") The trade, dynamic, and static arguments refer to the three data frames discussed above. start.period and end.period specify the period range to analyse. This example analyses only one period, March 1, 2007, and uses the vwap column of the dynamic data frame as the benchmark price. result is an object of class tradeCostsResults. with only one period, each trade falls into its own batch, so this section shows the most and least expensive trades for March 1. The next section displays the best and worst securities by total cost across all periods. Because there is only one trade per security on March 1, these results match the best and worst batches by cost. Calculating the cost of a trade requires a non-NA value for id, period, side, exec.price, exec.qty and the benchmark price. The final section shows a count for each type of NA in the input data. Rows in the input data with NA’s in any of these columns are removed before the analysis is performed and reported here. > summary(result) Trade Cost Analysis Costs over multiple periods Benchmark Price: vwap Summary statistics: Total Market Value: First Period: Last Period: Total Cost: Total Cost (bps): 1,283,963 2007-03-01 2007-03-01 -6,491 -51 Best and worst batches over all periods: batch.name exec.qty cost 1 22122P10 (2007-03-01 - 2007-03-01) 119,000 -3,572 2 03818830 (2007-03-01 - 2007-03-01) 60,600 -1,615 3 88362320 (2007-03-01 - 2007-03-01) 31,400 -1,235 6 25383010 (2007-03-01 - 2007-03-01) 9,200 33 7 13959410 (2007-03-01 - 2007-03-01) 4,400 221 8 32084110 (2007-03-01 - 2007-03-01) 3,400 370 Best and worst securities over all periods: id exec.qty cost 1 22122P10 119,000 -3,572 2 03818830 60,600 -1,615 3 88362320 31,400 -1,235 6 25383010 9,200 33 7 13959410 4,400 221 8 32084110 3,400 370 NA report: id period side exec.price exec.qty vwap count 0 0 2 0 0 1 The first section of the report provides high-level summary information. The total unsigned market value of trades for March 1 was around $1.3 million. Relative to VWAP, these trades cost -$6,491, indicating that overall the trades were executed at a level “better” than VWAP, where better buys/covers (sells/shorts) occur at prices below (above) VWAP. This total cost is the sum of the signed cost of each trade relative to the benchmark price. As a percentage of total executed market value, this set of trades cost -51 bps relative to VWAP. The next section displays the best and worst batches over all periods. We will discuss batches in the next section. For now, note that when dealing R News Calculating trade costs over multiple periods works similarly. Cost can be calculated for each trade relative to a benchmark price which either varies over the period of the trade or is fixed at the decision price. Suppose, for example, that the trader decided to short a stock on a particular day, but he wanted to trade so many shares that it took several days to complete the order. For instance, consider the following sequence of trades in our sample data set for Progressive Gaming, PGIC, which has id 59862K10: > subset(trade.mar.2007, id %in% "59862K10") 166 184 218 259 period 2007-03-13 2007-03-15 2007-03-19 2007-03-20 id side exec.qty exec.price 59862K10 X 31700 5.77 59862K10 X 45100 5.28 59862K10 X 135800 5.05 59862K10 X 22600 5.08 How should we calculate the cost of these trades? We could calculate the cost for each trade separately relative to a benchmark price such as vwap, exactly as in the last example. In this case, the cost of each trade in PGIC would be calculated relative to VWAP in each period and then added together. However, this method would ignore the cost associated with spreading out the sale over several days. If the price of the stock had been falling over the four days of the sale, for example, successive trades appear less attractive when compared to the price at the time of the decision. The trader can capture this cost by grouping the four short sales into a batch and comparing the execution price of each trade to the batch’s original decision price. Performing this type of multi-period analysis using tradeCosts requires several modifications to the previous single period example. Note that since no period range is given, analysis is performed over the entire data set: > result.batched <- trade.costs(trade.mar.2007, + dynamic = dynamic.mar.2007, + static = static.mar.2007, + batch.method = "same.sided", + benchmark.price = "decision.price") ISSN 1609-3631 Vol. 8/1, May 2008 12 > summary(result.batched) The simplest plot is a time series of total trade costs in basis points over each period: > plot(result.batched, "time.series.bps") Trade costs by period 500 400 Basis points First, trade.costs must be instructed how to group trades into batches by setting the batch.method parameter. This version of tradeCosts provides a single multi-period sample batch method, same.sided, which groups all consecutive samesided orders into a single batch. Provided there were no buys in between the four sales in PGIC, all four trades would be grouped into the same batch. Second, setting benchmark.price to decision.price sets the benchmark price to the prior closing price of the first trade in the batch. Running summary on the new result yields the following: Trade Cost Analysis Benchmark Price: decision.price 100 47,928,402 2007-03-01 2007-03-30 587,148 123 Best and worst batches over all periods: batch.name exec.qty 1 04743910 (2007-03-19 - 2007-03-19) 17,800 2 31659U30 (2007-03-09 - 2007-03-13) 39,800 3 45885A30 (2007-03-13 - 2007-03-19) 152,933 274 49330810 (2007-03-13 - 2007-03-30) 83,533 275 15649210 (2007-03-15 - 2007-03-28) 96,900 276 59862K10 (2007-03-13 - 2007-03-20) 235,200 200 0 20 0 20 7−0 0 3− 20 7−0 01 0 3− 20 7−0 02 0 3− 20 7−0 05 0 3− 20 7−0 09 0 3− 20 7−0 13 0 3− 20 7−0 15 0 3− 20 7−0 19 0 3− 20 7−0 20 0 3− 20 7−0 21 0 3− 20 7−0 22 0 3− 20 7−0 23 0 3− 20 7−0 26 0 3− 20 7−0 27 0 3− 20 7−0 28 07 3− −0 29 3− 30 Summary statistics: Total Market Value: First Period: Last Period: Total Cost: Total Cost (bps): 300 cost -82,491 -33,910 -31,904 56,598 71,805 182,707 Best and worst securities over all periods: id exec.qty cost 1 04743910 17,800 -82,491 2 31659U30 51,400 -32,616 3 45885A30 152,933 -31,904 251 49330810 83,533 56,598 252 15649210 118,100 73,559 253 59862K10 235,200 182,707 Figure 1: A time series plot of trade costs. This chart displays the cost for each day in the previous example. According to this chart, all days had positive cost except March 2. The second plot displays trade costs divided into categories defined by a column in the static data frame passed to trade.costs. Since sector was a column of that data frame, we can look at costs by company sector: > plot(result.batched, "sector") Trade costs by sector NA report: 300 200 100 TL M AT U M O C N D D C S E N IN C C EN TE N TH 0 FI This analysis covers almost $50 million of executions from March 1 to March 30, 2007. Relative to decision price, the trades cost $587,148, or 1.23% of the total executed market value. The most expensive batch in the result contained the four sells in PGIC (59862K10) from March 13 to March 20, which cost $182,707. 400 H count 0 0 6 0 0 2 Basis points id period side exec.price exec.qty prior.close Plotting results The tradeCosts package includes a plot method that displays bar charts of trade costs. It requires two arguments, a tradeCostsResults object, and a character string that describes the type of plot to create. R News Figure 2: A plot of trade costs by sector. Over the period of the analysis, trades in CND were especially expensive relative to decision price. ISSN 1609-3631 Vol. 8/1, May 2008 13 The last plot applies only to same.sided batched trade cost analysis as we performed in the multiperiod example. This chart shows cost separated into the different periods of a batch. The cost of the first batch of PGIC, for example, contributes to the first bar, the cost of the second batch to the second bar, and so on. As one might expect, the first and second trades in a batch are the cheapest with respect to decision price because they occur closest to the time of the decision. Conclusion > plot(result.batched, "cumulative") tradeCosts currently provides a simple means of calculating the cost of trades relative to a benchmark price over multiple periods. Costs may be calculated relative to a period-specific benchmark price or, for trades spanning multiple periods, the initial decision price of the trade. We hope that over time and through collaboration the package will be able to tackle more complex issues, such as ex-ante modeling and finer compartmentalization of trade costs. 400 Bibliography 200 Basis points 600 Trade costs by batch period 0 R. Kissell and M. Glantz. Optimal Trading Strategies. American Management Association, 2003. 1 2 3 4 5 6 7 8 9 10 11 Period of batch Figure 3: Costs by batch period, in bps. R News Jeff Enos, David Kane, Arjun Ravi Narayan, Aaron Schwartz, Daniel Suo, Luyi Zhao Kane Capital Management Cambridge, Massachusetts, USA [email protected] ISSN 1609-3631 Vol. 8/1, May 2008 14 Survival Analysis for Cohorts with Missing Covariate Information by Hormuzd A. Katki and Steven D. Mark NestedCohort fits Kaplan-Meier and Cox Models to estimate standardized survival and attributable risk for studies where covariates of interest are observed on only a sample of the cohort. Missingness can be either by happenstance or by design (for example, the case-cohort and case-control within cohort designs). Introduction Most large cohort studies have observations with missing values for one or more exposures of interest. Exposure covariates that are missing by chance (missing by happenstance) present challenges in estimation well-known to statisticians. Perhaps less known is that most large cohort studies now include analyses of studies which deliberately sample only a subset of all subjects for the measurement of some exposures. These “missingness by design” studies are used when an exposure of interest is expensive or difficult to measure. Examples of different sampling schemes that are used in missing by design studies are the case-cohort, nested case-control, and case-control studies nested within cohorts in general (Mark and Katki (2001); Mark (2003); Mark and Katki (2006)). Missingness by design can yield important cost savings with little sacrifice of statistical efficiency (Mark (2003); Mark and Katki (2006)). Although for missingness-by-happenstance, the causes of missingness are not controlled by the investigator, the validity of any analysis of data with missing values depends on the relationship between the observed data and the missing data. Except under the strongest assumption that missing values occur completely at random (MCAR), standard estimators that work for data without missing values are biased when used to analyze data with missing values. Mark (2003); Mark and Katki (2006) propose a class of weighted survival estimators that accounts for either type of missingness. The estimating equations in this class weight the contribution from completely observed subjects by the inverse probability of being completely observed (see below), and subtract an ‘offset’ to gain efficiency (see above references). The probabilities of being completely observed are estimated from a logistic regression. The predictors for this logistic regression are some (possibly improper) subset of the covariates for which there are no missing values; the outcome is an indicator variable denoting whether each observation has measurements for all covariates. The predictors R News may include the outcome variables (time-to-event), exposure variables that are measured on all subjects, and any other variables measured on the entire cohort. We refer to variables that are neither the outcome, nor in the set of exposures of interest (e.g. any variable used in the estimation of the Cox model), as auxiliary variables. The weighted estimators we propose are unbiased when the missing mechanism is missing-atrandom (MAR) and the logistic regression is correctly specified. For missing-by-design, MAR is satisfied and the correct logistic model is known. If there is any missing-by-happenstance, MAR is unverifiable. Given MAR is true, a logistic model saturated in the completely-observed covariates will always be correctly specified. In practice, given that the outcome is continuous (time-to-event), fitting saturated models is not feasible. However, fitting as rich a model as is reasonably possible not only bolsters the user’s case that the model is correctly specified, but also improves efficiency (Mark (2003); Mark and Katki (2006)). Also, auxiliary variables can produce impressive efficiency gains and hence should be included as predictors even when not required for correct model specification (Mark (2003); Mark and Katki (2006)). Our R package NestedCohort implements much of the methodology of Mark (2003); Mark and Katki (2006). The major exception is that it does not currently implement the finely-matched nested casecontrol design as presented in appendix D of Mark (2003); frequency-matching, or no matching, in a case-control design are implemented. In particular, NestedCohort 1. estimates not just relative risks, but also absolute and attributable risks. NestedCohort can estimate both non-parametric (KaplanMeier) and semi-parametric (Cox model) survival curves for each level of the exposures also attributable risks that are standardized for confounders. 2. allows cases to have missing exposures. Standard nested case-control and case-cohort software can produce biased estimates if cases are missing exposures. 3. produces unbiased estimates when the sampling is stratified on any completely observed variable, including failure time. 4. extracts efficiency out of auxiliary variables available on all cohort members. ISSN 1609-3631 Vol. 8/1, May 2008 5. uses weights estimated from a correctlyspecified sampling model to greatly increase the efficency of the risk estimates compared to using the ‘true’ weights (Mark (2003); Mark and Katki (2006)). 6. estimates relative, absolute, and attributable risks for vectors of exposures. For relative risks, any covariate can be continuous or categorical. NestedCohort has three functions that we demonstrate in this article. 1. nested.km: Estimates the Kaplan-Meier survival curve for each level of categorical exposures. 2. nested.coxph: Fits the Cox model to estimate relative risks. All covariates and exposures can be continuous or categorical. 3. nested.stdsurv: Fits the Cox model to estimate standardized survival probabilities, and Population Attributable Risk (PAR). All covariates and exposures must be categorical. Example study nested in a cohort In Mark and Katki (2006), we use our weighted estimators to analyze data on the association of H.Pylori with gastric cancer and provide simulations that demonstrate the increases in efficiency due to using estimated weights and auxiliary variables. In this document, we present a second example. Abnet et al. (2005) observe esophageal cancer survival outcomes and relevant confounders on the entire cohort. We are interested in the effect of concentrations of various metals, especially zinc, on esophageal cancer. However, measuring metal concentrations consumes precious esophageal biopsy tissue and requires a costly measurement technique. Thus we measured concentrations of zinc (as well as iron, nickel, copper, calcium, and sulphur) on a sample of the cohort. This sample oversampled the cases and those with advanced baseline histologies (i.e. those most likely to become cases) since these are the most informative subjects. Due to cost and availability constraints, less than 30% of the cohort was sampled. For this example, NestedCohort will provide adjusted hazard ratios, standardized survival probabilities, and PAR for the effect of zinc on esophageal cancer. Specifying the sampling model Abnet et al. (2005) used a two-phase sampling design to estimate the association of zinc concentration with the development of esophageal cancer. Sampling probabilities were determined by case-control R News 15 status and severity of baseline esophageal histology. The sampling frequencies are given in the table below: Baseline Histology Normal Esophagitis Mild Dysplasia Moderate Dysplasia Severe Dysplasia Carcinoma In Situ Unknown Total Case 14 / 22 19 / 26 12 / 17 3 / 7 5 / 6 2 / 2 1 / 1 56 / 81 Control Total 17 / 221 31 / 243 22 / 82 41 / 108 19 / 35 31 / 52 4 / 6 7 / 13 3 / 4 8 / 10 0 / 0 2 / 2 2 / 2 3 / 3 67 / 350 123 / 431 The column “baseline histology” contains, in order of severity, classification of pre-cancerous lesions. For each cell, the number to the right of the slash is the total cohort members in that cell, the left is the number we sampled to have zinc observed (i.e. in the top left cell, we measured zinc on 14 of the 22 members who became cases and had normal histology at baseline). Note that for each histology, we sampled roughly 1:1 cases to controls (frequency matching), and we oversampled the more severe histologies (who are more informative since they are more likely to become cases). Thirty percent of the cases could not be sampled due to sample availability constraints. Since the sampling depended on case/control status (variable ec01) crossed with the seven baseline histologies (variable basehist), this sampling scheme will be accounted for by each function with the statement ‘samplingmod="ec01*basehist"’. This allows each of the 14 sampling strata its own sampling fraction, thus reproducing the sampling frequencies in the table. NestedCohort requires that each observation have nonzero sampling probability. For this table, each of the 13 non-empty strata must have have someone sampled in it. Also, the estimators require that there are no missing values in any variable in the sampling model. However, if there is missingness, for convenience, NestedCohort will remove from the cohort any observations that have missingness in the sampling variables and will print a warning to the user. There should not be too many such observations. Kaplan-Meier curves To make non-parametric (Kaplan-Meier) survival curves by quartile of zinc level, use nested.km. These Kaplan-Meier curves have the usual interpretation: they do not standardize for other variables, and do not account for competing risks. To use this, provide both a legal formula as per the survfit function and also a sampling model to calculate stratum-specific sampling fractions. Note that the ‘survfitformula’ and ‘samplingmod’ require their arguments to be inside double quotes. The ISSN 1609-3631 Vol. 8/1, May 2008 16 ‘data’ argument is required: the user must provide the data frame within which all variables reside in. This outputs the Kaplan-Meier curves into a survfit object, so all the methods that are already there to manipulate survfit objects can be used1 . Plotting Kaplan-Meier curves To examine survival from cancer within each quartile of zinc, allowing different sampling probabilities for each of the 14 strata above, use nested.km, which prints out a table of risk differences versus the level named in ‘exposureofinterest’; in this case, it’s towards “Q4” which labels the 4th quartile of zinc concentration: > plot(mod,ymin=.6,xlab="time",ylab="survival", + main="Survival by Quartile of Zinc", + legend.text=c("Q1","Q2","Q3","Q4"), + lty=1:4,legend.pos=c(2000,.7)) Make Kaplan-Meier plots with the plot function for survfit objects. All plot options for survfit objects can be used. 1.0 Survival by Quartile of Zinc 308 observations deleted due to missing znquartiles=Q1 time n.risk n.event survival std.err 95% CI 163 125.5 1.37 0.989 0.0108 0.925 0.998 1003 120.4 1.57 0.976 0.0169 0.906 0.994 1036 118.8 1.00 0.968 0.0191 0.899 0.990 [...] znquartiles=Q2 time n.risk n.event survival std.err 95% CI 1038 116.9 1.57 0.987 0.0133 0.909 0.998 1064 115.3 4.51 0.949 0.0260 0.864 0.981 1070 110.8 2.33 0.929 0.0324 0.830 0.971 [...] summary gives the lifetable. Although summary prints how many observations were ‘deleted’ because of missing exposures, the ‘deleted’ observations still contribute to the final estimates via estimation of the sampling probabilities. Note that the lifetable contains the weighted numbers of those at risk and who had the developed cancer. The option ‘outputsamplingmod’ returns the sampling model that the sampling probabilities were calculated from. Examine this model if warned that it didn’t converge. If ‘outputsamplingmod’ is TRUE, then nested.km will output a list with 2 components, the survmod component being the KaplanMeier survfit object, and the other samplingmod component being the sampling model. 0.8 0.7 Q1 Q2 Q3 Q4 0.6 > summary(mod) [...] survival Risk Differences vs. znquartiles=Q4 by time 5893 Risk Difference StdErr 95% CI Q4 - Q1 0.28175 0.10416 0.07760 0.4859 Q4 - Q2 0.05551 0.07566 -0.09278 0.2038 Q4 - Q3 0.10681 0.08074 -0.05143 0.2651 0.9 > library(NestedCohort) > mod <- nested.km(survfitformula = + "Surv(futime01,ec01==1)~znquartiles", + samplingmod = "ec01*basehist", + exposureofinterest = "Q4", data = zinc) 0 1000 2000 3000 4000 5000 6000 time Figure 1: Kaplan-Meier plots by nested.km(). nested.km has some restrictions: 1. All variables are in a dataframe denoted by the ‘data’ argument. 2. No variable in the dataframe can be named o.b.s.e.r.v.e.d. or p.i.h.a.t. 3. ‘survfitformula’ must be a valid formula for survfit objects: All variables must be factors. 4. It does not support staggered entry into the cohort. The survival estimates will be correct, but their standard errors will be wrong. Cox models: relative risks To fit the Cox model, use nested.coxph. This function relies on coxph that is already in the survival package, and imitates its syntax as much as possible. In this example, we are interested in estimating the effect of zinc (as zncent, a continuous variable standardized to 0 median and where a 1 unit change is an 1 nested.km uses the weights option in survfit to estimate the survival curve. However, the standard errors reported by survfit are usually quite different from, and usually much smaller than, the correct ones as reported by nested.km. R News ISSN 1609-3631 Vol. 8/1, May 2008 17 increase of 1 quartile in zinc) on esophageal cancer, while controlling for sex, age (as agepill, a continuous variable), smoking, drinking (both ever/never), baseline histology, and family history (yes/no). We use the same sampling model ec01*basehist as before. The output is edited for space: > coxmod <- nested.coxph(coxformula = + "Surv(futime01,ec01==1)~sex+agepill+basehist+ anyhist+zncent", + samplingmod = "ec01*basehist", data = zinc) > summary(coxmod) [...] exp(coef) lower/upper.95 sexMale 0.83 0.38 1.79 agepill 1.04 0.99 1.10 basehistEsophagitis 2.97 1.41 6.26 basehistMild Dysplasia 4.88 2.19 10.88 basehistModerate Dysplasia 6.95 2.63 18.38 basehistSevere Dysplasia 11.05 3.37 36.19 basehistNOS 3.03 0.29 30.93 basehistCIS 34.43 10.33 114.69 anyhistFamily History 1.32 0.61 2.83 zncent 0.73 0.57 0.93 [...] Wald test = 97.5 on 10 df, p=2.22e-16 This is the exact same coxph output, except that the R2 , overall likelihood ratio and overall score tests are not computed. The overall Wald test is correctly computed. nested.coxph has the following restrictions 1. All variables are in the dataframe in the ‘data’ argument. 2. No variable in the dataframe can be named o.b.s.e.r.v.e.d. or p.i.h.a.t. 3. You must use Breslow tie-breaking. 4. No ‘cluster’ statements are allowed. However, nested.coxph does allow staggered entry into the cohort, stratification of the baselize hazard via ‘strata’, and use of ‘offset’ arguments to coxph (see help for coxph for more information). Standardized survival tributable risk and at- nested.stdsurv first estimates hazard ratios exactly like nested.coxph, and then also estimates survival probabilities for each exposure level as well as Population Attributable Risk (PAR) for a given exposure level, standardizing both to the marginal confounder distribution in the cohort. For example, the standardized survival associated with exposure Q and confounder J is Sstd (t| Q) = R News Z S(t| J, Q)dF ( J ). In contrast, the crude observed survival is Scrude (t| Q) = Z S(t| J, Q)dF ( J | Q). The crude S is the observed survival, so the effects of confounders remain. The standardized S is estimated by using the observed J distribution as the standard, so J is independent of Q. For more on direct standardization, see Breslow and Day (1987) To standardize, the formula for a Cox model must be split in two pieces: the argument ‘exposures’ denotes the part of the formula for the exposures of interest, and ‘confounders’ which denotes the part of the formula for the confounders. All variables in either part of the formula must be factors. In either part, do not use ’*’ to specify interaction, use interaction. In the zinc example, the exposures are ‘exposures="znquartiles"’, a factor variable denoting which quartile of zinc each measurement is in. The confounders are ‘confounders="sex+agestr+basehist+anyhist"’, these are the same confounders in the hazard ratio example, except that we must categorize age as the factor agestr. ‘timeofinterest’ denotes the time at which survival probabilities and PAR are to be calculated at, the default is at the last event time. ‘exposureofinterest’ is the name of the exposure level to which the population is to be set at for computing PAR; ‘exposureofinterest="Q4"’ denotes that we want PAR if we could move the entire population’s zinc levels into the fourth quartile of the current zinc levels. ‘plot’ plots the standardized survivals with 95% confidence bounds at ‘timeofinterest’ and returns the data used to make the plot. The output is edited for space. > mod <- nested.stdsurv(outcome = + "Surv(futime01,ec01==1)", + exposures = "znquartiles", + confounders = "sex+agestr+basehist+anyhist", + samplingmod = "ec01*basehist", + exposureofinterest = "Q4", plot = T, main = + "Time to Esophageal Cancer by Quartiles of Zinc", + data = zinc) Std Survival for znquartiles by time 5893 Survival StdErr 95% CI Left 95% CI Right Q1 0.5054 0.06936 0.3634 0.6312 Q2 0.7298 0.07768 0.5429 0.8501 Q3 0.6743 0.07402 0.5065 0.7959 Q4 0.9025 0.05262 0.7316 0.9669 Crude 0.7783 0.02283 0.7296 0.8194 Std Risk Differences vs. znquartiles = Q4 by time 5893 Risk Difference StdErr 95% CI Q4 - Q1 0.3972 0.09008 0.22060 0.5737 Q4 - Q2 0.1727 0.09603 -0.01557 0.3609 Q4 - Q3 0.2282 0.08940 0.05294 0.4034 ISSN 1609-3631 Vol. 8/1, May 2008 Q4 - Crude 18 0.1242 0.05405 4. It does not support staggered entry into the cohort. 0.01823 0.2301 PAR if everyone had znquartiles = Q4 Estimate StdErr 95% CI Left 95% CI Right PAR 0.5602 0.2347 -0.2519 0.8455 The first table shows the survival for each quartile of zinc, standardized for all the confounders, as well as the ‘crude’ survival, which is the observed survival in the population (so is not standardized). The next table shows the standardized survival differences vs. the exposure of interest. The last table shows the PAR, and the CI for PAR is based on the log(1 − PAR) transformation (this is often very different from, and superior to, the naive CI without transformation). summary(mod) yields the same hazard ratio output as if the model had been run under nested.coxph. The plot is in figure 2. This plots survival curves; to plot cumulative incidence (1survival), use ‘cuminc=TRUE’. The 95% CI bars are plotted at timeofinterest. All plot options are usable: e.g. ‘main’ to title the plot. 0.7 0.8 Q4 Q2 0.6 Q3 0.5 Standardized Survival 0.9 1.0 Time to Esophageal Cancer by Quartiles of Zinc 0.4 Q1 5. It does not support the baseline hazard to be stratified. ‘cluster’ and ‘offset’ arguments are not supported either. 6. It only allows Breslow tie-breaking. Including auxiliary variables In this analysis, we used the smallest correctlyspecified logistic model to predict sampling probabilities. To illustrate the use of an auxiliary variable, let’s pretend we have a categorical surrogate named znauxiliary, a cheaply-available but non-ideal measure of zinc concentration, observed on the full cohort. The user could sample based on znauxiliary to try to improve efficiency. In this case, znauxiliary must be included as a sampling variable in the sampling model with samplingmod="ec01*basehist*znauxiliary". Note that auxiliary variables must be observed on the entire cohort. Even if sampling is not based on znauxiliary, it can still be included in the sampling model as above. This is because, even though znauxiliary was not explicitly sampled on, if znauxiliary has something to do with zinc, and zinc has something to do with either ec01 or basehist, then one is implicitly sampling on znauxiliary. The simulations in (Mark and Katki (2006)) show the efficiency gain from including auxiliary variables in the sampling model. Including auxiliary variables will always reduce the standard errors of the risk estimates. Multiple exposures nested.stdsurv has some restrictions: Multiple exposures (with missing values) are included in the risk regression just like any other variable. For example, if we want to estimate the esophageal cancer risk from zinc and calcium jointly, the Cox model would include cacent as a covariate. Cutting calcium into quartiles into the variable caquartiles, include it as an exposure with nested.stdsurv with ‘exposures="znquartiles+caquartiles"’. 1. All variables are in the dataframe in the ‘data’ argument. Full cohort analysis 0 1000 2000 3000 4000 5000 6000 Time Figure 2: Survival curves for each zinc quantile, standardized for confounders 2. No variable in the dataframe can be named o.b.s.e.r.v.e.d. or p.i.h.a.t. 3. The variables in the ‘exposures’ and ‘confounders’ must be factors, even if they are binary. In these formulas, never use ’*’ to mean interaction, use interaction. R News NestedCohort can be used if all covariates are observed on the full cohort. You can estimate the standardized survival and attributable risks by setting ‘samplingModel="1"’, to force equal weights for all cohort members. nested.km will work exactly as survfit does. The Cox model standard errors will be those obtained from coxph with ‘robust=TRUE’. ISSN 1609-3631 Vol. 8/1, May 2008 Bibliography Abnet, C. C., Lai, B., Qiao, Y.-L., Vogt, S., Luo, X.-M., Taylor, P. R., Dong, Z.-W., Mark, S. D., and Dawsey, S. M. (2005). Zinc concentration in esophageal biopsies measured by x-ray fluorescence and cancer risk. Journal of the National Cancer Institute, 97(4):301–306. Breslow, N. E. and Day, N. E. (1987). Statistical Methods in Cancer Research. Volume II: The Design and Analysis of Cohort Studies. IARC Scientific Publications, Lyon. Mark, S. D. (2003). Nonparametric and semiparametric survival estimators,and their implementation, in two-stage (nested) cohort studies. Proceedings of the Joint Statistical Meetings, 2675–2691. Mark, S. D. and Katki, H. A. (2001). Influence Func- R News 19 tion Based Variance Estimation and Missing Data Issues in Case-Cohort Studies. Lifetime Data Analysis, 7:331–344. Mark, S. D. and Katki, H. A. (2006). Specifying and implementing nonparametric and semiparametric survival estimators in two-stage (sampled) cohort studies with missing case data. Journal of the American Statistical Association, 101(474):460–471. Hormuzd A. Katki Division of Cancer Epidemiology and Genetics National Cancer Institute, NIH, DHHS, USA [email protected] Steven D. Mark Department of Preventive Medicine and Biometrics University of Colorado Health Sciences Center [email protected] ISSN 1609-3631 Vol. 8/1, May 2008 20 segmented: An R Package to Fit Regression Models with Broken-Line Relationships by Vito M. R. Muggeo Introduction Segmented or broken-line models are regression models where the relationships between the response and one or more explanatory variables are piecewise linear, namely represented by two or more straight lines connected at unknown values: these values are usually referred as breakpoints, changepoints or even joinpoints. Hereafter we use such words indistinctly. Broken-line relationships are common in many fields, including epidemiology, occupational medicine, toxicology, and ecology, where sometimes it is of interest to assess threshold value where the effect of the covariate changes (Ulm, 1991; Betts et al., 2007). with the best fit. There are at least two drawbacks in using this procedure: (i) estimation might be quite cumbersome with more than one breakpoint and/or with large datasets and (ii) depending on sample size and configuration of data, estimating the model with fixed changepoint may lead the standard error of the other parameters to be too narrow, since uncertainty in the breakpoint is not taken into account. The package segmented offers facilities to estimate and summarize generalized linear models with segmented relationships; virtually, no limit on the number of segmented variables and on the number of changepoint exists. segmented uses a method that allows the modeler to estimate simultaneously all the model parameters yielding also, at the possible convergence, the approximate full covariance matrix. Estimation Formulating the model, estimation and testing A segmented relationship between the mean response µ = E[Y ] and the variable Z, for observation i = 1, 2, . . . , n is modelled by adding in the linear predictor the following terms Muggeo (2003) shows that the nonlinear term (1) has an approximate intrinsic linear representation which, to some extent, allows us to translate the problem into the standard linear framework: given an initial guess for the breakpoint, ψ̃ say, segmented attempts to estimate model (1) by fitting iteratively the linear model with linear predictor β 1 zi + β 2 ( zi − ψ )+ β1 zi + β2 ( zi − ψ̃)+ + γ I ( zi > ψ̃)− (1) where ( zi − ψ)+ = ( zi − ψ) × I ( zi > ψ) and I (·) is the indicator function equal to one when the statement is true. According to such parameterization, β1 is the left slope, β2 is the difference-in-slopes and ψ is the breakpoint. In this paper we tacitly assume a GLM with a known link function and possible additional covariates, xi , with linear parameters δ, namely link(µi ) = xi0 δ + β1 zi + β2 ( zi − ψ)+ ; however, since the discussed methods only depend on (1), we leave out from our presentation the response, the link function, and the possible linear covariates. Breakpoints and slopes of such segmented relationship are usually of main interest, although parameters relevant to the additional covariates may be of some concern. Difficulties in estimating and testing problems are well-known in such models, see for instance Hinkley (1971). A simple and common approach to estimate the model is via grid-search type algorithms: basically, given a grid of possible candidate values of {ψk }k=1,...,K , one fits K linear models and seeks for the value corresponding to the model R News (2) where I (·)− = − I (·) and γ is the parameter which may be understood as a re-parameterization of ψ and therefore accounts for the breakpoint estimation. At each iteration, a standard linear model is fitted, and the breakpoint value is updated via ψ̂ = ψ̃ + γ̂ /β̂2 ; note that γ̂ measures the gap, at the current estimate of ψ, between the two fitted straight lines coming from model (2). When the algorithm converges, the ‘gap’ should be small, i.e. γ̂ ≈ 0, and the standard error of ψ̂ can be obtained via the Delta method for the ratio β̂γ̂ which reduces to SE(γ̂ )/|β̂2 | if γ̂ = 0. 2 The idea may be used to fit multiple segmented relationships, only by including in the linear predictor the appropriate constructed variables for the additional breakpoints to be estimated: at each step, every breakpoint estimate is updated through the relevant ‘gap’ and ‘difference-in-slope’ coefficients. Due to its computational facility, the algorithm is able to perform multiple breakpoint estimation in a very efficient way. ISSN 1609-3631 Vol. 8/1, May 2008 21 p-value ≈ Φ(− M) + V exp{− M2 /2}(8π )−1/2 (4) where M = max{ S(ψk )}k is the maximum of the K test statistics, Φ(·) is the standard Normal distribution function, and V = ∑k (| S(ψk ) − S(ψk−1 )|) is the total variation of { S(ψk )}k . Formula (4) is an upper bound, hence the reported p-value is somewhat overestimated and the test is slightly conservative. Davies does not provide guidelines for selecting number and location of the fixed values {ψk }k , however a reasonable strategy is to use the quantiles of the distribution of Z ; some simulation experiments have shown that 5 ≤ K ≤ 10 usually suffices. Formula (4) refers to one-sided hypothesis test, the alternative being H1 : β2 (ψ) > 0. The pvalue for the ‘lesser’ alternative is obtained by using M = min{ S(ψk )}k , while for the two-sided case let M = max{| S(ψk )|}k and double the (4) (Davies, 1987). The Davies test is appropriate for testing for a breakpoint, but it does not appear useful for selecting the number of the joinpoints. Following results by Tiwari et al. (2005), we suggest using the BIC for this aim. Examples Black dots in Figure 1 plotted on the logit scale, show the percentages of babies with Down Syndrome (DS) on births for mothers with different age groups (Davison and Hinkley, 1997, p.371). It is wellknown that the risk of DS increases with the mother’s age, but it is important to assess where and how such a risk changes with respect to the mother age. Presumably, at least three questions have to answered: (i) does the mother’s age increase the risk of DS?; R News > > > + > + library("segmented") data("down") fit.glm<-glm(cases/births~age, weight= births, family=binomial, data=down) fit.seg<-segmented(fit.glm, seg.Z=~age, psi=25) segmented takes the original (G)LM object (fit.glm) and fits a new model taking into account the piecewise linear relationship. The argument seg.Z is a formula (without response) which specifies the variable, possibly more than one, supposed to have a piecewise relationship, while in the psi argument the initial guess for the breakpoint must be supplied. ● ● ● −4 Note that here we write β2 (ψ) to stress that the parameter of interest, β2 , depends on a nuisance parameter, ψ, which vanishes under H0 . Conditions for validity of standard statistical tests (Wald, for instance) are not satisfied. More specifically, the pvalue returned by classical tests is heavily underestimated, with an empirical levels about three to five times larger than the nominal levels. segmented employs the Davies (1987) test for performing hypothesis (3). It works as follows: given K fixed ordered values of breakpoints ψ1 < ψ2 < . . . < ψK in the range of Z , and relevant K values of the test statistic { S(ψk )}k=1,...,K having a standard Normal distribution for fixed ψk , Davies provides an upper bound given by ● ● ● ● ● −5 (3) ● ● ● −6 H0 : β2 (ψ) = 0. ● ● ● ● −7 If the breakpoint does not exist the difference-inslopes parameter has to be zero, then a natural test for the existence of ψ is (ii) is the risk constant over the whole range of age? and (iii) if the risk is age-dependent, does a threshold value exist? In a wider context, the problem is to estimate the broken-line model and to provide point estimates and relevant uncertainty measures of all the model parameters. The steps to be followed are straightforward with segmented. First, a standard GLM is estimated and a broken-line relationship is added afterwards by re-fitting the overall model. The code below uses the dataframe down shipped with the package. logit(cases/births) Testing for a breakpoint ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 20 25 30 35 40 45 Mother Age Figure 1: Scatter plot (on the logit scale) of proportion of babies with DS against mother’s age and fits from models fit.seg and fit.seg1. The estimated model can be visualized by the relevant methods print(), summary() and print.summary() of class segmented. The summary shown in Figure 2 is very similar to one of summary.glm(). Additional printed information include the estimated breakpoint and relevant (approximate) standard error (computed via SE(ψ̂) = SE(γ̂ )/|β̂2 |), the t value for the ‘gap’ variable which should be ‘small’ (|t| < 2, say) when the algorithm converges, and the number of iterations employed to fit the model. The variable labeled with U1.age stands for the ‘difference-in-slope parameter ISSN 1609-3631 Vol. 8/1, May 2008 22 > summary(fit.seg) ***Regression Model with Segmented Relationship(s)*** Call: segmented.glm(obj = fit.glm, seg.Z = ~age, psi = 25) Estimated Break-Point(s): Est. St.Err 31.0800 0.7242 t value for the gap-variable(s) V: Meaningful coefficients Estimate (Intercept) -6.78243778 age -0.01341037 U1.age 0.27422124 7.367959e-13 of the linear terms: Std. Error z value Pr(>|z|) 0.43140674 -15.7216777 1.074406e-55 0.01794710 -0.7472162 4.549330e-01 0.02323945 11.7998172 NA (Dispersion parameter for binomial family taken to be 1) Null deviance: 625.210 Residual deviance: 43.939 AIC: 190.82 on 29 on 26 degrees of freedom degrees of freedom Convergence attained in 5 iterations with relative change 1.455411e-14 Figure 2: Output of summary.segmented() of the variable age’ (β2 in equation (1)) and the estimate of the gap parameter γ is omitted since it is just a trick to estimate ψ. Note, however, that the model degrees of freedom are correctly computed and displayed. Also notice that the p-value relevant to U1.age is not reported, and NA is printed. The reason is that, as discussed previously, standard asymptotics do not apply. In order to test for a significant differencein-slope, the Davies’ test can be used. The use of davies.test() is straightforward and requires to specify the regression model (lm or glm), the ‘segmented’ variable whose a broken-line relationship is being tested, and the number of the evaluation points, > davies.test(fit.glm,"age",k=5) Davies’ test for a change in the slope data: Model = binomial , link = logit formula = cases/births ~ age segmented variable = age ‘Best’ at = 32, n.points = 5, p-value < 2.2e-16 alternative hypothesis: two.sided Currently davies.test() only uses the Wald statistic, i.e. S(ψk ) = β̂2 /SE(β̂2 ) for each fixed ψk , although alternative statistics could be used. If the breakpoint exists, the limiting distribution of β̂2 is gaussian, therefore estimates (and standard errors) of the slopes can be easily computed via the function slope() whose argument R News conf.level specifies the confidence level (defaults to conf.level=0.95), > slope(fit.seg) $age Est. St.Err. t value CI(95%).l CI(95%).u slope1 -0.01341 0.01795 -0.7472 -0.04859 0.02177 slope2 0.26080 0.01476 17.6700 0.23190 0.28970 Davison and Hinkley (1997) discuss that it might be of some interest to test for a null left slope, and at this aim they use isotonic regression. On the other hand, the piecewise parameterization allows to face this question in a straightforward way since only a test for H0 : β1 = 0 has be performed; for instance, a Wald test is available directly from the summary (see Figure 2, t = −0.747). Under a null-left-slope constraint, a segmented model may be fitted by omitting from the ‘initial’ model the segmented variable, namely > fit.glm<-update(fit.glm,.~.-age) > fit.seg1<-update(fit.seg) While the fit is substantially unchanged, the (approximate) standard error of the breakpoint is noticeably reduced (compare the output in Figure 2) > fit.seg1$psi Initial Est. St.Err psi1.age 25 31.45333 0.5536572 Instead, as firstly observed in Hinkley (1971) and shown by some simulations, the breakpoint estimator coming from a null left slope model is more efficient as compared to the one coming from a nonnull ISSN 1609-3631 Vol. 8/1, May 2008 left slope fit. Fitted values for both segmented models are displayed in Figure 1 where broken-lines and bars for the breakpoint estimates have been added via the relevant methods plot() and lines() detailed at the end of this section. We continue our illustration of the segmented package by running a further example using the plant dataset in the package. This example may be instructive to describe how to fit multiple segmented relationships with also a zero constraint on the right slope. Data refer to variables, y, time and group which represent measurements of a plant organ over time for three attributes (levels of the factor group). The data have been kindly provided by Dr Zongjian Yang at School of Land, Crop and Food Sciences, The University of Queensland, Brisbane, Australia. Biological reasoning and empirical evidence as emphasized in Figure 3, indicate that non-parallel segmented relationships with multiple breakpoints may allow a well-grounded and reasonable fit. Multiple breakpoints are easily accounted in equation (1) by including additional terms β3 ( zi − ψ2 )+ + . . . segmented allows a such extension in a straightforward manner by supplying multiple starting points in the psi argument. To fit such a broken-line model within segmented, we first need to build the three different explanatory variables, products of the covariate time by the dummies of group 1 , > > > > > > data("plant") attach(plant) X<-model.matrix(~0+group)*time time.KV<-X[,1] time.KW<-X[,2] time.WC<-X[,3] Then we call segmented on a lm fit, by specifying multiple segmented variables in seg.Z and using a list to supply the starting values for the breakpoints in psi. We assume two breakpoints in each series, > olm<-lm(y~0+group+ time.KV + time.KW + time.WC) > os<-segmented(olm, seg.Z= ~ time.KV + time.KW + + time.WC, psi=list(time.KV=c(300,450), + time.KW=c(450,600), time.WC=c(300,450))) Warning message: max number of iterations attained Some points are probably worth mentioning here. First, the starting linear model olm could be fitted via the more intuitive call lm(y~group*time): even if segmented() would have worked providing the same results, a possible use of slope() would have not been allowed. Second, since there are multiple segmented variables, the starting values - obtained by visual inspection of the scatter-plots - have to supplied via a named list whose names have to match with the variables in seg.Z. Last but not least, the 23 printed message suggests to re-fit the model because convergence is suspected. Therefore it could be helpful to trace out the algorithm and/or to increase the maximum number of the iterations, > os<-update(os, control=seg.control(it.max=30, + display=TRUE)) 0 1.433 (No breakpoint(s)) 1 0.108 2 0.109 3 0.108 4 0.109 5 0.108 . . . . 29 0.108 30 0.109 Warning message: max number of iterations attained The optimized objective function (residual sum of squares in this case) alternates among two values and ‘does not converge’, in that differences never reach the (default) tolerance value of 0.0001; the function draw.history() may be used to visualize the values of breakpoints throughout the iterations. Moreover, increasing the number of maximum iterations, typically does not modify the result. This is not necessarily a problem. One could change the tolerance by setting toll=0.001, say, or better, stop the algorithm at the iteration with the best value. Also, one could stabilize the algorithm by shrinking the increments in breakpoint updates through a factor h < 1, say; this is attained via the argument h in the auxiliary function seg.control(), > os<-update(os, control=seg.control(h=.3)) However, when convergence is not straightforward, the fitted model has to be inspected with particular care: if a breakpoint is understood to exist, the corresponding difference-in-slope estimate (and its t value) has to be large and furthermore the ‘gap’ coefficient (and its t value) has to be small (see the summary(..)$gap). If at the estimated breakpoint the coefficient of the gap variable is large (greater than two, say) a broken-line parameterization is somewhat questionable. Finally, a test for the existence of the breakpoint and/or comparing the BIC values would be very helpful in these circumstances. Green diamonds in Figure 3 and output from slope() (not shown) show that the last slope for group "KW" may be set to zero. While a left slope is allowed by fitting only ( z − ψ)+ (i.e. by omitting the main variable z in the initial linear model as in the previous example), similarly a null right slope might be allowed by including only ( z − ψ)− . segmented does not handle such terms explicitly, however by noting that ( z − ψ)− = −(− z + ψ)+ , we can proceed as follows 1 Of course, a corner-point parameterization (i.e. ‘treatment’ contrasts) is required to define the dummies relevant to the grouping variable; this is the default in R. R News ISSN 1609-3631 Vol. 8/1, May 2008 > neg.time.KW<- -time.KW > olm1<-lm(y~0+group+time.KV+time.WC) > os1<-segmented(olm1, seg.Z=~ time.KV + time.WC+ + neg.time.KW, psi=list(time.KV=c(300,450), + neg.time.KW=c(-600,-450), time.WC=c(300,450))) The ‘minus’ of the explanatory variable in group "KW" requires that the corresponding starting guess has to be supplied with reversed sign and, as consequence, the signs of estimates for the corresponding group will be reversed. The method segmented for confint() may be used to display (large sample) interval estimates for the breakpoints; confidence intervals are computed using ψ̂ ∓ zα /2 SE(ψ̂) where SE(ψ̂) comes from the Delta method for the ratio β̂γ̂ and zα /2 is the quantile of the standard Nor2 mal. Optional arguments are parm to specify the segmented variable of interest (default to all variables) and rev.sgn to change the sign of output before printing (this is useful when the sign of the segmented variable has been changed to constrain the last slope as in example at hand). > confint(os1,rev.sgn=c(FALSE,FALSE,TRUE)) $time.KV Est. CI(95%).l CI(95%).u psi1.time.KV 299.9 256.9 342.8 psi2.time.KV 441.9 402.0 481.8 24 have to be set carefully. The role of rev.sgn is intuitive and has been discussed above while const indicates a constant to be added to the fitted values before plotting, > plot(os1, term="neg.time.KW", add=TRUE, col=3, + const=coef(os1)["groupRKW"], rev.sgn=TRUE) const defaults to the model intercept, and for relationships by group the group-specific intercept is appropriate, as in the "KW" group example above. However when a ‘minus’ variable has been considered, simple algebra on the regression equation show that the correct constant for the other groups is given by the current estimate minus a linear combination of difference-in-slope parameters and relevant breakpoints. For the "KV" group we add the fitted lines after computing the ‘adjusted’ constant, > const.KV<-coef(os1)["groupRKV"]+ coef(os1)["U1.neg.time.KW"]* + os1$psi["psi1.neg.time.KW","Est."]+ coef(os1)["U2.neg.time.KW"]* + os1$psi["psi2.neg.time.KW","Est."] > plot(os1, "time.KV", add=TRUE, col=2, const=const.KV) and similarly for group "WC". Finally the estimated join points with relevant confidence intervals are added to the current device via the lines.segmented() method, > lines(os1,term="neg.time.KW",col=3,rev.sgn=TRUE) > lines(os1,term="time.KV",col=2,k=20) > lines(os1,term="time.WC",col=4,k=10) $time.WC $neg.time.KW Est. CI(95%).l CI(95%).u psi1.neg.time.KW 445.4 398.5 492.3 psi2.neg.time.KW 609.9 549.7 670.0 where term selects the segmented variable, rev.sgn says if the sign of the breakpoint values (point estimate and confidence limits) have to be reversed, k regulates the vertical position of the bars, and the remaining arguments refer to options of the drawn segments. Notice that in light of the constrained right slope, standard errors, t-values, and confidence limits are not computed. Figure 3 emphasizes the constrained fit which has been added to the current device via the relevant plot() method. More specifically, plot() allows to draw on the current or new device (depending on the logical value TRUE/FALSE of add) the fitted piecewise relationship for the variable term. To get sensible plots with fitted values to be superimposed to the observed points, the arguments const and rev.sgn R News 0.8 0.4 y RKV RKW RWC 0.2 > slope(os1, parm="neg.time.KW", rev.sgn=TRUE) $neg.time.KW Est. St.Err. t value CI(95%).l CI(95%).u slope1 0.0022640 8.515e-05 26.580 0.0020970 0.002431 slope2 0.0008398 2.871e-04 2.925 0.0002771 0.001403 slope3 0.0000000 NA NA NA NA 1.0 The slope estimates may be obtained using slope(); again, parm and rev.sgn may be specified when requested, 0.6 Est. CI(95%).l CI(95%).u psi1.time.WC 306.0 284.2 327.8 psi2.time.WC 460.1 385.5 534.7 200 300 400 500 600 700 time Figure 3: The plant dataset: data and constrained fit (model os1). Conclusions We illustrated the key-ideas of broken-line regression and how such a class of models may be fitted ISSN 1609-3631 Vol. 8/1, May 2008 in R through the package segmented. Although alternative approaches could be undertaken to model nonlinear relationships, for instance via splines, the main appealing of segmented models lies on interpretability of the parameters. Sometimes a piecewise parameterization may provide a reasonable approximation to the shape of the underlying relationship, and threshold and slopes may be very informative and meaningful. However it is well known that the likelihood in segmented models may not be concave, hence there is no guarantee the algorithm finds the global maximum; moreover it should be recognized that the method works by approximating the ‘true’ model (1) by (2), which could make the estimation problematic. A possible and useful strategy - quite common in the nonlinear optimization field - is to run the algorithm starting with different initial guesses for the breakpoint in order to assess possible differences. This is quite practicable due to computational efficiency of the algorithm. However, the more the clear-cut the relationship, the less important the starting values become. The package is not concerned with estimation of the number of the breakpoints. Although the BIC has been suggested, in general nonstatistical issues related to the understanding of the mechanism of the phenomenon in study could help to discriminate among several competing models with a different number of joinpoints. Currently, only methods for LM and GLM objects are implemented; however, due to the ease of the algorithm which only depends on the linear predictor, methods for other models (Cox regression, say) could be written straightforwardly following the skeleton of segmented.lm or segmented.glm. Finally, for the sake of novices in breakpoint estimation, it is probably worth mentioning the difference existing with the other R package dealing with breakpoints. The strucchange package by Zeileis et al. (2002) substantially is concerned with regression models having a different set of parameters for each ‘interval’ of the segmented variable, typically the time; strucchange performs breakpoint estimation via a dynamic grid search algorithm and allows for testing for parameter instability. Such ‘structural breaks models’, mainly employed in economics and econometrics, are somewhat different from the broken-line models discussed in this paper, since they do not require the fitted lines to join at the es- R News 25 timated breakpoints. Acknowledgements This work was partially supported by grant ‘Fondi di Ateneo (ex 60%) 2004 prot. ORPA044431: ‘Verifica di ipotesi in modelli complessi e non-standard’ (‘Hypothesis testing in complex and nonstandard models’). The author would like to thank the anonymous referee for useful suggestions which improved the paper and the interface of the package itself. Bibliography M. Betts, G. Forbes, and A. Diamond. Thresholds in songbird occurrence in relation to landscape structure. Conservation Biology, 21:1046–1058, 2007. R. B. Davies. Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika, 74:33–43, 1987. A. Davison and D. Hinkley. Bootstrap methods and their application. Cambridge University Press, 1997. D. Hinkley. Inference in two-phase regression. Journal of American Statistical Association, pages 736– 743, 1971. V. Muggeo. Estimating regression models with unknown break-points. Statistics in Medicine, 22: 3055–3071, 2003. R. Tiwari, K. A. Cronin, W. Davis, E. Feuer, B. Yu, and S. Chib. Bayesian model selection for join point regression with application to age-adjusted cancer rates. Applied Statistics, 54:919–939, 2005. K. Ulm. A statistical methods for assessing a threshold in epidemiological studies. Statistics in Medicine, 10:341–349, 1991. A. Zeileis, F. Leisch, K. Hornik, and C. Kleiber. strucchange: An R package for testing for structural change in linear regression models. Journal of Statistical Software, 7(2):1–38, 2002. Vito M. R. Muggeo Dipartimento Scienze Statistiche e Matematiche ‘Vianelli’ Università di Palermo, Italy [email protected] ISSN 1609-3631 Vol. 8/1, May 2008 26 Bayesian Estimation for Parsimonious Threshold Autoregressive Models in R by Cathy W.S. Chen, Edward M.H. Lin, F.C. Liu, and Richard Gerlach test statistic and/or using scatter plots (Tsay, 1989); or by minimizing a conditional least squares formula (Tsay, 1998). Introduction Bayesian methods allow simultaneous inference on all model parameters, in this case allowing uncertainty about the threshold lag d and threshold parameter r to be properly incorporated into estimation and inference. Such uncertainty is not accounted for in the standard two-stage methods. However, in the nonlinear TAR setting, neither the marginal or joint posterior distributions for the parameters can be easily analytically obtained: these usually involve high dimensional integrations and/or non-tractable forms. However, the joint posterior distribution can be evaluated up to a constant, and thus numerical integration techniques can be used to estimate the marginal distributions required. The most successful of these, for TAR models, are Markov chain Monte Carlo (MCMC) methods, specifically those based on the Gibbs sampler. Chen and Lee (1995) proposed such a method, incorporating the MetropolisHastings (MH) algorithm (Metropolis et al., 1953; Hastings, 1970), for inference in TAR models. Utilizing this MH within Gibbs algorithm, the marginal and posterior distributions can be estimated by iterative sampling. To the best of our knowledge, this is the first time a Bayesian approach for TAR models has been offered in a statistical package. The threshold autoregressive (TAR) model proposed by Tong (1978, 1983) and Tong and Lim (1980) is a popular nonlinear time series model that has been widely applied in many areas including ecology, medical research, economics, finance and others (Brockwell, 2007). Some interesting statistical and physical properties of TAR include asymmetry, limit cycles, amplitude dependent frequencies and jump phenomena, all of which linear models are unable to capture. The standard two-regime threshold autoregressive (TAR) model is considered in this paper. Given the autoregressive (AR) orders p1 and p2 , the two-regime TAR(2:p1 ;p2 ) model is specified as: p1 (1) (1) (1) φ0 + ∑ φki yt−ki + at if zt−d ≤ r, i =1 yt = p2 (2) (2) (2) + φ 0 ∑ φli yt−li + at if zt−d > r, i =1 where {ki , i = 1, . . . , p1 } and {li , i = 1, . . . , p2 } are subsets of {1, . . . , p}, where r is the threshold parameter driving the regime-switching behavior; where p is a reasonable maximum lag 1 ; zt is the threshold variable; d is the (1) (2) threshold lag of the model; at and at are two independent Gaussian white noise processes with mean zero and variance σ 2j , j = 1, 2. It is common to choose the threshold variable z as a lagged value of the time series itself, that is zt = yt . In this case the resulting model is called Self-Exciting (SETAR). In general, z could be any exogenous or endogenous variable (Chen 1998). The TAR model consists of a piecewise linear AR model in each regime, defined by the threshold variable zt−d and associated threshold value r. Note that the parameter p is not an input to the R program, but instead should be considered by the user as the largest possible lag the model could accommodate, e.g. in light of the sample size n. i.e. p << n is usually enforced for AR models. Frequentist parameter estimation of the TAR model is usually implemented in two stages; see for example Tong and Lim (1980), Tong (1990) and Tsay (1989, 1998). For fixed and subjectively chosen values of d (usually 1) and r (usually 0), all other model parameters are estimated first. Then, conditional on these parameter estimates, d and r can be estimated by: minimizing the AIC, minimizing a nonlinearity 1 We We propose an R package BAYSTAR that provides functionality for parameter estimation and inference for two-regime TAR models, as well as allowing the monitoring of MCMC convergence by returning all MCMC iterates as output. The main features of the package BAYSTAR are: applying the Bayesian inferential methods to simulated or real data sets; online monitoring of the acceptance rate and tuning parameters of the MH algorithm, to aid the convergence of the Markov chain; returning all MCMC iterates for user manipulation, clearly reporting the relevant MCMC summary statistics and constructing trace plots and auto-correlograms as diagnostic tools to assess convergence of MCMC chains. This allows us to statistically estimate all unknown model parameters simultaneously, including capturing uncertainty about threshold value and delay lag; not accounted for in standard methods that condition upon a particular threshold value and delay lag, see e.g. the SETAR function which is available in the "tsDyn" package at CRAN. We also allow the user to define a parsimonious separate AR order specification in each regime. Using our code it is possible to set some AR parameters in either or both regimes to be have tried up to p=50 successfully. R News ISSN 1609-3631 Vol. 8/1, May 2008 27 zero. That is, we could set p1 = 3 and subsequently estimate any three parameters of our convenience. 2; 2) model specified as: ( (1) 0.1 − 0.4yt−1 + 0.3yt−2 + at if yt−1 ≤ 0.4, yt = (2) 0.2 + 0.3yt−1 + 0.3yt−2 + at if yt−1 > 0.4, Prior settings (1) where at Bayesian inference requires us to specify a prior distribution for the unknown parameters. The parameters of the TAR(2:p1 ;p2 ) model are Θ1 , Θ2 , σ12 , σ22 , r (1) (1) (1) 0 (2) and d, where Θ1 =(φ0 , φ1 , . . ., φ p1 ) and Θ2 =(φ0 , (2) (2) 0 φ1 , . . ., φ p2 ) . We take fairly standard choices: Θ1 , Θ2 as independent N (Θ0i , Vi−1 ), i = 1, 2, and employ conjugate priors for σ12 and σ22 , σi2 ∼ IG (νi /2, νi λi /2) , i = 1, 2, where IG stands for the inverse Gamma distribution. In threshold modeling, it is important to set a minimum sample size in each regime to generate meaningful inference. The prior for the threshold parameter r, follows a uniform distribution on a range (l, u), where l and u are set as relevant percentiles of the observed threshold variable. This prior could be considered to correspond to an empirical Bayes approach, rather than a fully Bayesian one. Finally, the delay d has a discrete uniform prior over the integers: 1,2,. . ., d0 , where d0 is a user-set maximum delay. We assume the hyper-parameters, (Θ0i , Vi , νi , λi , a, b, d0 ) are known and can be specified by the user in our R code. The MCMC sampling scheme successively generates iterates from the posterior distributions for groups of parameters, conditional on the sample data and the remaining parameters. Multiplying the likelihood and the priors, using Bayes’ rule, leads to these conditional posteriors. For details, readers are referred to Chen and Lee (1995). Only the posterior distribution of r is not a standard distributional form, thus requiring us to use the MH method to achieve the desired sample for r. The standard Gaussian proposal random walk MH algorithm is used. To yield good convergence properties for this algorithm, the choice of step size, controlling the proposal variance, is important. A suitable value of the step size, with good convergence properties, can be achieved by tuning to achieve an acceptance rate between 25% to 50%, as suggested by Gelman, Roberts and Gilks (1996). This tuning occurs only in the burnin period. Exemplary applications Simulated data We now illustrate an example with simulated data. The data is generated from a two-regime SETAR(2 : R News (2) ∼ N (0, 0.8) and at ∼ N (0, 0.5). Users can import data from an external file, or use their own simulated data, and directly estimate model parameters via the proposed functions. To implement the MCMC sampling, the scheme was run for N = 10, 000 iterations (the total MCMC sample) and the first M = 2, 000 iterations (the burn-in sample) were discarded. + nIterations<- 10000 + nBurnin<- 2000 The hyper-parameters are set as Θ0i = 0, Vi = e 2 /3 for i = 1,2, diag(0.1, . . ., 0.1), νi = 3 and λi = σ 2 e is the residual mean squared error of fitwhere σ ting an AR(p1 ) model to the data. The motivation to choose the hyper-parameters of νi and λi is that the e 2 . The maximum delay expectation of σi2 is equal to σ lag is set to d0 = 3. We choose a = Q1 and b = Q3 : the 1st and 3rd quartiles of the data respectively, for the prior on r. + + + + mu0<- matrix(0, nrow=p1+1, ncol=1) v0<- diag(0.1, p1+1) ar.mse<- ar(yt,aic=FALSE, order.max=p1) v<- 3; lambda<- ar.mse$var.pred/3 The MCMC sampling steps sequentially draw samples of parameters by using the functions TAR.coeff(), TAR.sigma(), TAR.lagd() and TAR.thres(), iteratively. TAR.coeff() returns the updated values of Θ1 and Θ2 from a multivariate normal distribution for each regime. σ12 and σ22 are sampled separately using the function TAR.sigma() from inverse gamma distributions. TAR.lagd() and TAR.thres() are used to sample d, from a multinomial distribution, and r, by using the MH algorithm, respectively. The required log-likelihood function is computed by the function TAR.lik(). When drawing r, we monitor the acceptance rate of the MH algorithm so as to maximize the chance of achieving a stationary and convergent sample. The BAYSTAR package provides output after every 1,000 MCMC iterations, for monitoring the estimation, and the acceptance rate, of r. If the acceptance rate falls outside 25% to 50%, the step size of the MH algorithm is automatically adjusted during burn-in iterations, without re-running the whole program. Enlarging the step size should reduce the acceptance rate while diminishing the step size should increase this rate. A summary of the MCMC output can be obtained via the function TAR.summary(). TAR.summary() returns the posterior mean, median, standard deviation and the lower and upper bound of the 95% ISSN 1609-3631 Vol. 8/1, May 2008 28 true phi0^1 0.1000 phi1^1 -0.4000 phi2^1 0.3000 phi0^2 0.2000 phi1^2 0.3000 phi2^2 0.3000 sigma1 0.8000 simga2 0.5000 r 0.4000 diff.phi0 -0.1000 diff.phi1 -0.7000 diff.phi2 0.0000 mean1 0.0909 mean2 0.5000 Lag choice : 1 2 3 Freq 10000 0 0 mean 0.0873 -0.3426 0.2865 0.2223 0.2831 0.3244 0.7789 0.5563 0.4161 -0.1350 -0.6257 -0.0379 0.0834 0.5598 median 0.0880 -0.3423 0.2863 0.2222 0.2836 0.3245 0.7773 0.5555 0.4097 -0.1354 -0.6258 -0.0381 0.0829 0.5669 s.d. 0.0395 0.0589 0.0389 0.0533 0.0407 0.0234 0.0385 0.0231 0.0222 0.0654 0.0726 0.0455 0.0390 0.0888 lower upper 0.0096 0.1641 -0.4566 -0.2294 0.2098 0.3639 0.1187 0.3285 0.2040 0.3622 0.2780 0.3701 0.7079 0.8587 0.5132 0.6029 0.3968 0.4791 -0.2631 -0.0039 -0.7657 -0.4841 -0.1256 0.0521 0.0088 0.1622 0.3673 0.7161 Figure 1: The summary output for all parameters is printed as a table. Figure 2: The trace plots of all MCMC iterations for all parameters. R News ISSN 1609-3631 Vol. 8/1, May 2008 Bayes posterior interval for all parameters, all obtained from the sampling period only, after burnin. Output is also displayed for the differences in the mean coefficients and the unconditional mean in each regime. The summary statistics are printed as in Figure 1. To assess MCMC convergence to stationarity, we monitor trace plots and autocorrelation plots of the MCMC iterates for all parameters. Trace plots are obtained via ts.plot() for all MCMC iterations, as shown in Figure 2. The red horizontal line is the true value of the parameter, the yellow line represents the posterior mean and the green lines are the lower and upper bounds of the 95% Bayes credible interval. The density plots, via the function density(), are provided for each parameter and the differences in the mean coefficients as shown in Figure 3. An example of a simulation study is now illustrated. For 100 simulated data sets, the code saves the posterior mean, median, standard deviation and the lower and upper bound of each 95% Bayesian interval for each parameter. The means, over the replicated data sets, of these quantities are reported as a table in Figure 4. For counting the frequencies of each estimated delay lag, we provide the frequency table of d by the function table(), as shown in the bottom of Figure 4. The average posterior probabilities that d = 1 are all very close to 1; the posterior mode of d very accurately estimates the true delay parameter in this case. US unemployment rate data For empirical illustration, we consider the monthly U.S. civilian unemployment rate from January 1948 to March 2004. The data set, which consists of 675 monthly observations, is shown in Figure 5. The data is available in Tsay (2005). We take the first difference of the unemployment rates in order to achieve mean stationarity. A partial autocorrelation plot (PACF) of the change of unemployment rate is given in Figure 5. For illustration, we use the same model orders as in Tsay (2005), except for the addition of a 10th lag in regime one. We obtain the fitted SETAR model: 0.187yt−2 + 0.143yt−3 + 0.127yt−4 (1) −0.106yt−10 − 0.087yt−12 + at if yt−3 ≤ 0.05, yt = 0.312yt−2 + 0.223yt−3 − 0.234yt−12 (2) + at if yt−3 > 0.05, The results are shown in Figure 6. Trace plots and autocorrelograms for after burn-in MCMC iterations are given in Figures 7 and 8. Clearly, MCMC convergence is almost immediate. The parameter estimates are quite reasonable, being similar to the results of Tsay (2005), except the threshold lag, which is set as d = 1 by Tsay. Instead, our results suggest that nonlinearity in the differences in the unemployment rate, responds around a positive 0.05 change in the R News 29 unemployment rate, is at a lag of d = 3 months, for this data. This is quite reasonable. The estimated AR coefficients differ between the two regimes, indicating the dynamics of the US unemployment rate are based on the previous quarter’s change in rate. It is also clear that the regime variances are significantly different to each other, which can be confirmed by finding a 95% credible interval from the MCMC iterates of the differences between these parameters. Summary BAYSTAR provides Bayesian MCMC methods for iterative sampling to provide parameter estimates and inference for the two-regime TAR model. Parsimonious AR specifications between regimes can also be easily employed. A convenient user interface for importing data from a file or specifying true parameter values for simulated data is easy to apply for analysis. Parameter inferences are summarized to an easily readable format. Simultaneously, the checking of convergence can be done by monitoring the MCMC trace plots and autocorrelograms. Simulations illustrated the good performance of the sampling scheme, while a real example illustrated nonlinearity present in the US unemployment rate. In the future we will extend BAYSTAR to more flexible models, such as threshold moving-average (TMA) models and threshold autoregressive moving-average (TARMA) models, which are also frequently used in time series modeling. In addition model and order selection is an important issue for these models. It is interesting to examine the method of the stochastic search variable selection (SSVS) in the R package with BAYSTAR for model order selection in these types of models, e.g. So and Chen (2003). Acknowledgments Cathy Chen thanks Professor Kerrie Mengersen for her invitation to appear as keynote speaker and to present this work at the Spring Bayes 2007 workshop. Cathy Chen is supported by the grant: 962118-M-002-MY3 from the National Science Council (NSC) of Taiwan and grant 06G27022 from Feng Chia University. The authors would like to thank the editors and anonymous referee for reading and assessing the paper, and especially to thank the referee who made such detailed and careful comments that significantly improved the paper. Bibliography P. Brockwell. Beyond Linear Time Series. Statistica Sinica, 17:3-7, 2007. ISSN 1609-3631 Vol. 8/1, May 2008 30 Figure 3: Posterior densities of all parameters and the differences of mean coefficients. true mean median s.d. lower upper phi0^1 0.1000 0.0917 0.0918 0.0401 0.0129 0.1701 phi1^1 -0.4000 -0.4058 -0.4056 0.0616 -0.5268 -0.2852 phi2^1 0.3000 0.3000 0.3000 0.0417 0.2184 0.3817 phi0^2 0.2000 0.2082 0.2082 0.0509 0.1088 0.3082 phi1^2 0.3000 0.2940 0.2940 0.0387 0.2181 0.3697 phi2^2 0.3000 0.2961 0.2961 0.0226 0.2517 0.3404 sigma1 0.8000 0.7979 0.7966 0.0397 0.7239 0.8796 simga2 0.5000 0.5038 0.5033 0.0209 0.4645 0.5464 r 0.4000 0.3944 0.3948 0.0157 0.3657 0.4247 diff.phi0 -0.1000 -0.1165 -0.1166 0.0644 -0.2425 0.0099 diff.phi1 -0.7000 -0.6997 -0.6995 0.0731 -0.8431 -0.5568 diff.phi2 0.0000 0.0039 0.0040 0.0475 -0.0890 0.0972 mean1 0.0909 0.0841 0.0836 0.0379 0.0116 0.1601 mean2 0.5000 0.4958 0.5024 0.0849 0.3109 0.6434 > table(lag.yt) lag.yt 1 100 Figure 4: The summary output for all parameters from 100 replications is printed as a table. R News ISSN 1609-3631 Vol. 8/1, May 2008 31 Unemployment Rate ● ● ●● ● ● ● ● ● ● 10 ● ● ● ● ● 8 ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ●●● ● ● ● ● ●● ● ● ●● ● ● ● ● ●● ● ● ● ●●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ●● ● ● ● 6 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 4 ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● 0 ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ●● ●● ● ●● ● ● ●● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ●● ●● ● ● ● ● ● ●● ● ● ● ● ● ●●● ● ●● ● ● 100 ● ● ●● ● ●● ● ● ● ● ● ●● ● ●● ●● ●● ● ● ● ●● ● ● ● ● ● ●● ●● ●● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● 200 300 ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ●● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ●● ● ● ● ●● ●● ● ● ● ● ● ●● ● 400 ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ●● ● ●● ● ● ●● ● ● ● ● ● ●● ● ● ●● ● ●● ●● ● ● ●● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● 500 600 700 Partial autocorrelation function for differenced unemployment rate 0.2 0.1 0.0 −0.1 0 20 40 60 80 100 Lag Figure 5: Time series plots of the PACF of the changed unemployment rate. mean median s.d. phi1.2 0.1874 0.1877 0.0446 phi1.3 0.1431 0.1435 0.0457 phi1.4 0.1270 0.1273 0.0447 phi1.10 -0.1060 -0.1058 0.0400 phi1.12 -0.0875 -0.0880 0.0398 phi2.2 0.3121 0.3124 0.0613 phi2.3 0.2233 0.2233 0.0594 phi2.12 -0.2340 -0.2341 0.0766 sigma1 0.0299 0.0298 0.0021 simga2 0.0588 0.0585 0.0054 r 0.0503 0.0506 0.0290 Lag choice : 1 2 3 Freq 15 0 9985 -----------The highest posterior prob. of lower upper 0.0993 0.2751 0.0526 0.2338 0.0394 0.2157 -0.1855 -0.0275 -0.1637 -0.0082 0.1932 0.4349 0.1077 0.3387 -0.3837 -0.0839 0.0261 0.0342 0.0492 0.0702 0.0027 0.0978 lag at : 3 Figure 6: The summary output for all parameters of the U.S. unemployment rate is printed as a table. R News ISSN 1609-3631 Vol. 8/1, May 2008 32 Figure 7: The trace plots of after burn-in MCMC iterations for all parameters. Figure 8: Autocorrelation plots of after burn-in MCMC iterations for all parameters. R News ISSN 1609-3631 Vol. 8/1, May 2008 C.W.S. Chen. A Bayesian analysis of generalized threshold autoregressive models. Statistics and Probability Letters, 40:15–22, 1998. C.W.S. Chen and J.C. Lee. Bayesian inference of threshold autoregressive models. J. Time Ser. Anal., 16:483–492, 1995. A. Gelman, G.O. Roberts, and W.R. Gilks. Efficient Metropolis jumping rules. In: Bayesian Statistics 5 (Edited by J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith), 599–607, 1996. Oxford University Press, Oxford. W.K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97-109, 1970. N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth and A.H. Teller. Equations of state calculations by fast computing machines. J. Chem. Phys., 21:10871091, 1953. M.K.P. So and C.W.S. Chen. Subset threshold autoregression. Journal of Forecasting, 22, :49-66, 2003. H. Tong. On a Threshold Model in Pattern Recognition and Signal Processing, ed. C. H. Chen, Sijhoff & Noordhoff: Amsterdam, 1978. H. Tong. Threshold Models in Non-linear Time Series Analysis, Vol. 21 of Lecture Notes in Statistics R News 33 (K. Krickegerg, ed.). Springer-Verlag, New York, 1983. H. Tong and K.S. Lim. Threshold autoregression, limit cycles and cyclical data. (with discussion), J. R. Stat. Soc. Ser. B, 42:245–292, 1980. R.S. Tsay. Testing and modeling threshold autoregressive process. J. Amer. Statist. Assoc., 84:231–240, 1989. R.S. Tsay. Testing and modeling multivariate threshold models. J. Amer. Statist. Assoc., 93:1188–1202, 1998. R.S. Tsay. Analysis of Financial Time Series, 2nd Edition, John Wiley & Sons, 2005. Cathy W. S. Chen Feng Chia University, Taiwan [email protected] Edward M. H. Lin, Feng Chi Liu Feng Chia University, Taiwan Richard Gerlach University of Sydney, Australia [email protected] ISSN 1609-3631 Vol. 8/1, May 2008 34 Statistical Modeling of Loss Distributions Using actuar by Vincent Goulet and Mathieu Pigeon Introduction actuar (Dutang et al., 2008) is a package providing additional Actuarial Science functionality to R. Although various packages on CRAN provide functions useful to actuaries, actuar aims to serve as a central location for more specifically actuarial functions and data sets. The current feature set of the package can be split in four main categories: loss distributions modeling, risk theory (including ruin theory), simulation of compound hierarchical models and credibility theory. This paper reviews the loss distributions modeling features of the package — those most likely to interest R News readers and to have links with other fields of statistical practice. Actuaries need to model claim amounts distributions for ratemaking, loss reserving and other risk evaluation purposes. Typically, claim amounts data are nonnegative and skewed to the right, often heavily. The probability laws used in modeling must match these characteristics. Furthermore, depending on the line of business, data can be truncated from below, censored from above or both. The main actuar features to aid in loss modeling are the following: 1. Introduction of 18 additional probability laws and functions to get raw moments, limited moments and the moment generating function. 2. Fairly extensive support of grouped data. 3. Calculation of the empirical raw and limited moments. 4. Minimum distance estimation using three different measures. 5. Treatment of coverage modifications (deductibles, limits, inflation, coinsurance). Probability laws R already includes functions to compute the probability density function (pdf), the cumulative distribution function (cdf) and the quantile function of a fair number of probability laws, as well as functions to generate variates from these laws. For some root foo , the functions are named dfoo , pfoo , qfoo and rfoo , respectively. R News The actuar package provides d, p, q and r functions for all the probability laws useful for loss severity modeling found in Appendix A of Klugman et al. (2004) and not already present in base R, excluding the inverse Gaussian and log-t but including the loggamma distribution (Hogg and Klugman, 1984). We tried to make these functions as similar as possible to those in the stats package, with respect to the interface, the names of the arguments and the handling of limit cases. Table 1 lists the supported distributions as named in Klugman et al. (2004) along with the root names of the R functions. The name or the parametrization of some distributions may differ in other fields; check with the lossdist package vignette for the pdf and cdf of each distribution. In addition to the d, p, q and r functions, the package provides m, lev and mgf functions to compute, respectively, theoretical raw moments mk = IE [ X k ], (1) theoretical limited moments IE [( X ∧ x)k ] = IE [(min X, x)k ] (2) and the moment generating function MX (t) = IE [etX ], (3) when it exists. Every probability law of Table 1 is supported, plus the following ones: beta, exponential, chi-square, gamma, lognormal, normal (except lev), uniform and Weibull of base R, and the inverse Gaussian distribution of package SuppDists (Wheeler, 2006). The m and lev functions are especially useful with estimation methods based on the matching of raw or limited moments; see below for their empirical counterparts. The mgf functions are introduced in the package mostly for calculation of the adjustment coefficient in ruin theory; see the "risk" package vignette. In addition to the 17 distributions of Table 1, the package provides support for phase-type distributions (Neuts, 1981). These are not so much included in the package for statistical inference, but rather for ruin probability calculations. A phase-type distribution is defined as the distribution of the time until absorption of a continuous time, finite state Markov process with m transient states and one absorbing state. Let T t Q= (4) 0 0 be the transition rates matrix (or intensity matrix) of such a process and let (π , πm+1 ) be the initial probability vector. Here, T is an m × m non-singular matrix with tii < 0 for i = 1, . . . , m and ti j ≥ 0 for i 6= j, ISSN 1609-3631 Vol. 8/1, May 2008 35 Table 1: Probability laws supported by actuar classified by family and root names of the R functions. Family Distribution Root (alias) Transformed beta Transformed beta Burr Loglogistic Paralogistic Generalized Pareto Pareto Inverse Burr Inverse Pareto Inverse paralogistic trbeta (pearson6) burr llogis paralogis genpareto pareto (pareto2) invburr invpareto invparalogis Transformed gamma Transformed gamma Inverse transformed gamma Inverse gamma Inverse Weibull Inverse exponential trgamma invtrgamma invgamma invweibull (lgompertz) invexp Other Loggamma Single parameter Pareto Generalized beta lgamma pareto1 genbeta t = − Te and e is a column vector with all components equal to 1. Then the cdf of the time until absorption random variable with parameters π and T is ( 1 − π e T x e, x > 0 F ( x) = (5) πm+1 , x = 0, where eM = ∞ ∑ n=0 Mn n! (6) is the matrix exponential of matrix M. The exponential, the Erlang (gamma with integer shape parameter) and discrete mixtures thereof are common special cases of phase-type distributions. The package provides d, p, r, m and mgf functions for phase-type distributions. The root is phtype and parameters π and T are named prob and rates, respectively. The core of all the functions presented in this section is written in C for speed. The matrix exponential C routine is based on expm() from the package Matrix (Bates and Maechler, 2007). space in computers has almost become a non-issue, grouped data has somewhat fallen out of fashion. Still, grouped data remains useful in some fields of actuarial practice and for parameter estimation. For these reasons, actuar provides facilities to store, manipulate and summarize grouped data. A standard storage method is needed since there are many ways to represent grouped data in the computer: using a list or a matrix, aligning the n j s with the c j−1 s or with the c j s, omitting c0 or not, etc. Moreover, with appropriate extraction, replacement and summary functions, manipulation of grouped data becomes similar to that of individual data. First, function grouped.data creates a grouped data object similar to — and inheriting from — a data frame. The input of the function is a vector of group boundaries c0 , c1 , . . . , cr and one or more vectors of group frequencies n1 , . . . , nr . Note that there should be one group boundary more than group frequencies. Furthermore, the function assumes that the intervals are contiguous. For example, the following data Group Grouped data What is commonly referred to in Actuarial Science as grouped data is data represented in an intervalfrequency manner. In insurance applications, a grouped data set will typically report that there were n j claims in the interval (c j−1 , c j ], j = 1, . . . , r (with the possibility that cr = ∞). This representation is much more compact than an individual data set (where the value of each claim is known), but it also carries far less information. Now that storage R News (0, 25] (25, 50] (50, 100] (100, 150] (150, 250] (250, 500] Frequency (Line 1) Frequency (Line 2) 30 31 57 42 65 84 26 33 31 19 16 11 is entered and represented in R as > x <- grouped.data(Group = c(0, 25, + 50, 100, 150, 250, 500), Line.1 = c(30, ISSN 1609-3631 Vol. 8/1, May 2008 + + 36 R has a function ecdf to compute the empirical cdf of an individual data set, 31, 57, 42, 65, 84), Line.2 = c(26, 33, 31, 19, 16, 11)) Object x is stored internally as a list with class Fn ( x) = > class(x) 1 n n ∑ I { x j ≤ x}, j=1 [1] "grouped.data" "data.frame" With a suitable print method, these objects can be displayed in an unambiguous manner: > x Group Line.1 Line.2 1 (0, 25] 30 26 2 (25, 50] 31 33 3 (50, 100] 57 31 4 (100, 150] 42 19 5 (150, 250] 65 16 6 (250, 500] 84 11 Second, the package supports the most common extraction and replacement methods for "grouped.data" objects using the usual [ and [<operators; see ?Extract.grouped.data for details. The package defines methods of a few existing summary functions for grouped data objects. Computing the mean . r r c j−1 + c j nj ∑ nj (7) ∑ 2 j=1 j=1 is made simple with a method for the mean function: where I {A} = 1 if A is true and I {A} = 0 otherwise. The function returns a "function" object to compute the value of Fn ( x) in any x. The approximation of the empirical cdf for grouped data is called an ogive (Klugman et al., 1998; Hogg and Klugman, 1984). It is obtained by joining the known values of Fn ( x) at group boundaries with straight line segments: 0, x ≤ c0 (c j − x) Fn (c j−1 ) + ( x − c j−1 ) Fn (c j ) , c j − c j−1 F̃n ( x) = c j−1 < x ≤ c j 1, x > cr . (8) The package includes a function ogive that otherwise behaves exactly like ecdf. In particular, methods for functions knots and plot allow, respectively, to obtain the knots c0 , c1 , . . . , cr of the ogive and a graph. Calculation of empirical moments > mean(x) Line.1 Line.2 179.8 99.9 Higher empirical moments can be computed with emm; see below. A method for function hist draws a histogram for already grouped data. Only the first frequencies column is considered (see Figure 1 for the resulting graph): In the sequel, we frequently use two data sets provided by the package: the individual dental claims (dental) and grouped dental claims (gdental) of Klugman et al. (2004). The package provides two functions useful for estimation based on moments. First, function emm computes the kth empirical moment of a sample, whether in individual or grouped data form: > emm(dental, order = 1:3) > hist(x[, -3]) [1] 3.355e+02 2.931e+05 3.729e+08 Histogram of x[, −3] 0.004 > emm(gdental, order = 1:3) 0.002 Second, in the same spirit as ecdf and ogive, function elev returns a function to compute the empirical limited expected value — or first limited moment — of a sample for any limit. Again, there are methods for individual and grouped data (see Figure 2 for the graphs): 0.001 Density 0.003 [1] 3.533e+02 3.577e+05 6.586e+08 0.000 > lev <- elev(dental) > lev(knots(lev)) 0 100 200 300 400 500 x[, −3] Figure 1: Histogram of a grouped data object R News [1] 16.0 37.6 42.4 85.1 105.5 164.5 [7] 187.7 197.9 241.1 335.5 > plot(lev, type = "o", pch = 19) > lev <- elev(gdental) > lev(knots(lev)) ISSN 1609-3631 Vol. 8/1, May 2008 37 elev(x = gdental) 350 elev(x = dental) ● ● ● ● 150 ● ● ● ● 0 500 1000 1500 250 ● ● ● 0 50 ● ● ● 150 Empirical LEV 250 ● ● ● 50 Empirical LEV ● ● ● ● 0 1000 x 2000 3000 4000 x Figure 2: Empirical limited expected value function of an individual data object (left) and a grouped data object (right) 2. The modified chi-square method (chi-square) applies to grouped data only and minimizes the squared difference between the expected and observed frequency within each group: [1] 0.00 24.01 46.00 84.16 115.77 [6] 164.85 238.26 299.77 324.90 347.39 [11] 353.34 > plot(lev, type = "o", pch = 19) r d(θ ) = Minimum distance estimation j=1 Two methods are widely used by actuaries to fit models to data: maximum likelihood and minimum distance. The first technique applied to individual data is well covered by function fitdistr of the MASS package (Venables and Ripley, 2002). The second technique minimizes a chosen distance function between theoretical and empirical distributions. The actuar package provides function mde, very similar in usage and inner working to fitdistr, to fit models according to any of the following three distance minimization methods. 1. The Cramér-von Mises method (CvM) minimizes the squared difference between the theoretical cdf and the empirical cdf or ogive at their knots: n ∑ w j [ F(x j ; θ) − Fn (x j ; θ)]2 d(θ ) = (9) j=1 for individual data and r d(θ ) = ∑ w j [ F(c j ; θ) − F̃n (c j ; θ)] 2 (10) j=1 for grouped data. Here, F ( x) is the theoretical cdf of a parametric family, Fn ( x) is the empirical cdf, F̃n ( x) is the ogive and w1 ≥ 0, w2 ≥ 0, . . . are arbitrary weights (defaulting to 1). R News ∑ w j [n( F(c j ; θ) − F(c j−1 ; θ)) − n j ]2 , (11) 1 where n = ∑rj=1 n j . By default, w j = n− j . The method is called “modified” because the default denominators are the observed rather than the expected frequencies. 3. The layer average severity method (LAS) applies to grouped data only and minimizes the squared difference between the theoretical and empirical limited expected value within each group: r d(θ ) = ∑ w j [LAS(c j−1 , c j ; θ) j=1 ˜ n (c j−1 , c j ; θ )]2 , (12) − LAS where LAS( x, y) = IE [ X ∧ y] − IE [ X ∧ x], ˜ n ( x, y) = IE ˜ n [ X ∧ y] − IE ˜ n [ X ∧ x], and LAS ˜ n [ X ∧ x] is the empirical limited expected IE value for grouped data. The arguments of mde are a data set, a function to compute F ( x) or IE [ X ∧ x], starting values for the optimization procedure and the name of the method to use. The empirical functions are computed with ecdf, ogive or elev. The expressions below fit an exponential distribution to the grouped dental data set, as per Example 2.21 of Klugman et al. (1998): ISSN 1609-3631 Vol. 8/1, May 2008 > mde(gdental, pexp, + start = list(rate = 1/200), + measure = "CvM") rate 0.003551 distance 0.002842 > mde(gdental, pexp, + start = list(rate = 1/200), + measure = "chi-square") rate 0.00364 distance 13.54 > mde(gdental, levexp, + start = list(rate = 1/200), + measure = "LAS") rate 0.002966 distance 694.5 It should be noted that optimization is not always that simple to achieve. For example, consider the problem of fitting a Pareto distribution to the same data set using the Cramér–von Mises method: > mde(gdental, ppareto, + start = list(shape = 3, + scale = 600), measure = "CvM") Error in mde(gdental, ppareto, start = list(shape = 3, scale = 600), measure = "CvM") : optimization failed 38 Coverage modifications Let X be the random variable of the actual claim amount for an insurance policy, Y L be the random variable of the amount paid per loss and Y P be the random variable of the amount paid per payment. The terminology for the last two random variables refers to whether or not the insurer knows that a loss occurred. Now, the random variables X, Y L and Y P will differ if any of the following coverage modifications are present for the policy: an ordinary or a franchise deductible, a limit, coinsurance or inflation adjustment (see Klugman et al., 2004, Chapter 5 for precise definitions of these terms). Table 2 summarizes the definitions of Y L and Y P . The effect of an ordinary deductible is known as truncation from below, and that of a policy limit as censoring from above. Censored data is very common in survival analysis; see the package survival (Lumley, 2008) for an extensive treatment in R. Yet, actuar provides a different approach. Suppose one wants to use censored data Y1 , . . . , Yn from the random variable Y to fit a model on the unobservable random variable X. This requires expressing the pdf or cdf of Y in terms of the pdf or cdf of X. Function coverage of actuar does just that: given a pdf or cdf and any combination of the coverage modifications mentioned above, coverage returns a function object to compute the pdf or cdf of the modified random variable. The function can then be used in modeling or plotting like any other dfoo or pfoo function. For example, let Y represent the amount paid (per payment) by an insurer for a policy with an ordinary deductible d and a limit u − d (or maximum covered loss of u). Then the definition of Y is ( Working in the log of the parameters often solves the problem since the optimization routine can then flawlessly work with negative parameter values: > f <- function(x, lshape, lscale) ppareto(x, + exp(lshape), exp(lscale)) > (p <- mde(gdental, f, list(lshape = log(3), + lscale = log(600)), measure = "CvM")) lshape 1.581 lscale 7.128 distance 0.0007905 The actual estimators of the parameters are obtained with > exp(p$estimate) lshape lscale 4.861 1246.485 This procedure may introduce additional bias in the estimators, though. R News Y= X − d, u − d, d≤X≤u X≥u (13) and its pdf is 0, f X ( y + d) , 1 − FX (d) fY ( y) = 1 − FX (u) , 1 − FX (d) 0, y=0 0 < y < u−d (14) y = u−d y > u − d. Assume X has a gamma distribution. Then an R function to compute the pdf (14) in any y for a deductible d = 1 and a limit u = 10 is obtained with coverage as follows: > f <- coverage(pdf = dgamma, cdf = pgamma, + deductible = 1, limit = 10) > f(0, shape = 5, rate = 1) ISSN 1609-3631 Vol. 8/1, May 2008 39 Table 2: Coverage modifications for per-loss variable (Y L ) and per-payment variable (Y P ) as defined in Klugman et al. (2004). Per-loss variable (Y L ) ( 0, X≤d X − d, X > d ( 0, X ≤ d X, X > d ( X, X ≤ u u, X > u Per-payment variable (Y P ) Coinsurance (α) αX αX Inflation (r) (1 + r) X (1 + r) X Coverage modification Ordinary deductible (d) Franchise deductible (d) Limit (u) [1] 0 > f(5, shape = 5, rate = 1) [1] 0.1343 > f(9, shape = 5, rate = 1) [1] 0.02936 > f(12, shape = 5, rate = 1) [1] 0 The function f is built specifically for the coverage modifications submitted and contains as little useless code as possible. Let object y contain a sample of claims amounts from policies with the above deductible and limit. Then one can fit a gamma distribution by maximum likelihood to the claim severity process as follows: > library(MASS) > fitdistr(y, f, start = list(shape = 2, + rate = 0.5)) shape rate 4.1204 0.8230 (0.7054) (0.1465) The package vignette "coverage" contains more detailed pdf and cdf formulas under various combinations of coverage modifications. Conclusion This paper reviewed the main loss modeling features of actuar, namely many new probability laws; new utility functions to access the raw moments, the limited moments and the moment generating function of these laws and some of base R; functions to create and manipulate grouped data sets; functions to R News n X − d, n X, ( X, u, X>d X>d X≤u X>u ease calculation of empirical moments, in particular for grouped data; a function to fit models by distance minimization; a function to help work with censored data or data subject to coinsurance or inflation adjustments. We hope some of the tools presented here may be of interest outside the field they were developed for, perhaps provided some adjustments in terminology and nomenclature. Finally, we note that the distrXXX family of packages (Ruckdeschel et al., 2006) provides a general, object oriented approach to some of the features of actuar, most notably the calculation of moments for many distributions (although not necessarily those presented here), minimum distance estimation and censoring. Acknowledgments This research benefited from financial support from the Natural Sciences and Engineering Research Council of Canada and from the Chaire d’actuariat (Actuarial Science Chair) of Université Laval. We also thank one anonymous referee and the R News editors for corrections and improvements to the paper. Bibliography D. Bates and M. Maechler. Matrix: A matrix package for R, 2007. R package version 0.999375-3. C. Dutang, V. Goulet, and M. Pigeon. actuar: An R package for actuarial science. Journal of Statistical Software, 25(7), 2008. URL http://www. actuar-project.org. R. V. Hogg and S. A. Klugman. Loss Distributions. Wiley, New York, 1984. ISBN 0-4718792-9-0. ISSN 1609-3631 Vol. 8/1, May 2008 40 S. A. Klugman, H. H. Panjer, and G. Willmot. Loss Models: From Data to Decisions. Wiley, New York, 1998. ISBN 0-4712388-4-8. phausen. S4 classes for distributions. R News, 6 (2):2–6, May 2006. URL http://distr.r-forge. r-project.org. S. A. Klugman, H. H. Panjer, and G. Willmot. Loss Models: From Data to Decisions. Wiley, New York, 2 edition, 2004. ISBN 0-4712157-7-5. W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Springer, New York, 4 edition, 2002. ISBN 0-3879545-7-0. T. Lumley. survival: Survival analysis, including penalised likelihood, 2008. R package version 2.34. S original by Terry Therneau. B. Wheeler. SuppDists: Supplementary distributions, 2006. URL http://www.bobwheeler.com/stat. R package version 1.1-0. M. F. Neuts. Matrix-Geometric Solutions in Stochastic Models: An Algorithmic Approach. Dover Publications, 1981. ISBN 978-0-4866834-2-3. P. Ruckdeschel, M. Kohl, T. Stabla, and F. Cam- R News Vincent Goulet and Mathieu Pigeon Université Laval, Canada [email protected] ISSN 1609-3631 Vol. 8/1, May 2008 41 Programmers’ Niche: Multivariate polynomials in R The answer is the coefficient of xn in The multipol package n by Robin K. S. Hankin 1 ∏ 1 − xi i =1 Abstract In this short article I introduce the multipol package, which provides some functionality for handling multivariate polynomials; the package is discussed here from a programming perspective. An example from the field of enumerative combinatorics is presented. Univariate polynomials A polynomial is an algebraic expression of the form ∑in=0 ai xi where the ai are real or complex numbers and n (the degree of the polynomial) is a nonnegative integer. A polynomial may be viewed in three distinct ways: • Polynomials are interesting and instructive examples of entire functions: they map (the complex numbers) to . C C • Polynomials are a map from the positive integers to : this is f (n) = an and one demands that ∃n0 with n > n0 −→ f (n) = 0. Relaxation of the final clause results in a generating function which is useful in combinatorics. C • Polynomials with complex coefficients form an algebraic object known as a ring: polynomial multiplication is associative and distributive with respect to addition; ( ab)c = a(bc) and a(b + c) = ab + ac. A multivariate polynomial is a generalization of a polynomial to expressions of the ij form ∑ ai1 i2 ...id ∏dj=1 x j . The three characterizations of polynomials above generalize to the multivariate case, but note that the algebraic structure is more general. In the context of R programming, the first two points typically dominate. Viewing a polynomial as a function is certainly second nature to the current readership and unlikely to yield new insight. But generating functions are also interesting and useful applications of polynomials (Wilf, 1994) which may be less familiar and here I discuss an example from the discipline of integer partitions (Andrews, 1998). A partition of an integer n is a non-increasing sequence of positive integers p1 , p2 , . . . , pr such that n = ∑ri=1 pi (Hankin, 2006b). How many distinct partitions does n have? R News (observe that we may truncate the Taylor expansion of 1/(1 − x j ) to terms not exceeding xn ; thus the problem is within the domain of polynomials as infinite sequences of coefficients are not required). Here, as in many applications of generating functions, one uses the mechanism of polynomial multiplication as a bookkeeping device to keep track of the possibilities. The R idiom used in the polynom package is a spectacularly efficient method for doing so. Multivariate polynomials generalize the concept of generating function, but in this case the functions are from n-tuples of nonnegative integers to . An example is given in the appendix below. C The polynom package The polynom package (Venables et al., 2007) is a consistent and convenient suite of software for manipulating polynomials. This package was originally written in 1993 and is used by Venables and Ripley (2001) as an example of S3 classes. The following R code shows the polynom package in use; the examples are then generalized to the multivariate case using the multipol package. > require(polynom) > (p <- polynomial(c(1, 0, 0, 3, 4))) 1 + 3*x^3 + 4*x^4 > str(p) Class ’polynomial’ num [1:5] 1 0 0 3 4 See how a polynomial is represented as a vector of coefficients with p[i] holding the coefficient of xi−1 ; note the off-by-one issue. Observe the natural print method which suppresses the zero entries—but the internal representation requires all coefficients so a length 5 vector is needed to store the object. Polynomials may be multiplied and added: > p + polynomial(1:2) 2 + 2*x + 3*x^3 + 4*x^4 > p * p 1 + 6*x^3 + 8*x^4 + 9*x^6 + 24*x^7 + 16*x^8 ISSN 1609-3631 Vol. 8/1, May 2008 Note the overloading of ‘+’ and ‘*’: polynomial addition and multiplication are executed using the natural syntax on the command line. Observe that the addition is not entirely straightforward: the shorter polynomial must be padded with zeros. A polynomial may be viewed either as an object, or a function. Coercing a polynomial to a function is straightforward: > f1 <- as.function(p) > f1(pi) [1] 483.6552 > f1(matrix(1:6, 2, 3)) [1,] [2,] [,1] [,2] [,3] 8 406 2876 89 1217 5833 Note the effortless and transparent vectorization of f1(). Multivariate polynomials There exist several methods by which polynomials may be generalized to multipols. To this author, the most natural is to consider an array of coefficients; the dimensionality of the array corresponds to the arity of the multipol. However, other methods suggest themselves and a brief discussion is given at the end. Much of the univariate polynomial functionality presented above is directly applicable to multivariate polynomials. > require(multipol) > (a <- as.multipol(matrix(1:10, nrow = 2))) x^0 x^1 y^0 y^1 y^2 y^3 y^4 1 3 5 7 9 2 4 6 8 10 See how a multipol is actually an array, with one extent per variable present, in this case 2, although the package is capable of manipulating polynomials of arbitrary arity. Multipol addition is a slight generalization of the univariate case: > b <- as.multipol(matrix(1:10, ncol = 2)) > a + b x^0 x^1 x^2 x^3 x^4 y^0 y^1 y^2 y^3 y^4 2 9 5 7 9 4 11 6 8 10 3 8 0 0 0 4 9 0 0 0 5 10 0 0 0 In the multivariate case, the zero padding must be done in each array extent; the natural commandline syntax is achieved by defining an appropriate Ops.multipol() function to overload the arithmetic operators. R News 42 Multivariate polynomial multiplication The heart of the package is multipol multiplication: > a * b x^0 x^1 x^2 x^3 x^4 x^5 y^0 y^1 y^2 y^3 1 9 23 37 4 29 61 93 7 39 79 119 10 49 97 145 13 59 115 171 10 40 70 100 y^4 51 125 159 193 227 130 y^5 54 123 142 161 180 100 Multivariate polynomial multiplication is considerably more involved than in the univariate case. Consider the coefficient of x2 y2 in the product. This is Ca x2 y2 Cb (1) + Ca xy2 Cb ( x) + Ca y2 Cb x2 + Ca x2 y Cb ( y) + Ca ( xy) Cb ( xy) + Ca ( y) Cb x2 y + Ca x2 Cb y2 + Ca ( x) Cb xy2 + Ca (1) Cb x2 y2 = 0·1+6·2+5·3 +0·6+4·7+3·8 +0·0+2·0+1·0 = 79, where “Ca ( xm yn )” means the coefficient of xm yn in polynomial a. It should be clear that large multipols involve more terms and a typical example is given later in the paper. Multivariate polynomial multiplication in multipol The appropriate R idiom is to follow the above prose description in a vectorized manner; the following extract from mprod() is very slightly edited in the interests of clarity. First we define a matrix, index, whose rows are the array indices of the product: outDims <- dim(a)+dim(b)-1 Here outDims is the dimensions of the product. Note again the off-by-one issue: the package uses array indices internally, while the user consistently indexes by variable power. index <- expand.grid(lapply(outDims,seq_len)) Each row of matrix index is thus an array index for the product. The next step is to define a convenience function f(), whose argument u is a row of index, that returns the entry in the multipol product: f <- function(u){ jja <expand.grid(lapply(u,function(i)0:(i-1))) jjb <- -sweep(jja, 2, u)-1 ISSN 1609-3631 Vol. 8/1, May 2008 So jja is the (power) index of a, and the rows of jjb added to those of jja give u, which is the power index of the returned array. Now not all rows of jja and jjb correspond to extant elements of a and b respectively; so define a Boolean variable wanted that selects just the appropriate rows: wanted <apply(jja,1,function(x)all(x < dim(a))) & apply(jjb,1,function(x)all(x < dim(b))) & apply(jjb,1,function(x)all(x >= 0)) Thus element n of wanted is TRUE only if the nth row of both jja and jjb correspond to a legal element of a and b respectively. Now perform the addition by summing the products of the legal elements: sum(a[1+jja[wanted,]] * b[1+jjb[wanted,]]) } Thus function f() returns the coefficient, which is the sum of products of pairs of legal elements of a and b. Again observe the off-by-one issue. Now apply() function f() to the rows of index and reshape: out <- apply(index,1,f) dim(out) <- outDims Thus array out contains the multivariate polynomial product of a and b. The preceding code shows how multivariate polynomials may be multiplied. The implementation makes no assumptions about the entries of a or b and the coefficients of the product are summed over all possibilities; opportunities to streamline the procedure are discussed below. Multipols as functions Polynomials are implicitly functions of one variable; multivariate polynomials are functions too, but of more than one argument. Coercion of a multipol to a function is straightforward: 43 Multipol extraction and replacement One often needs to extract or replace parts of a multipol. The package includes extraction and replacement methods but, partly because of the off-by-one issue, these are not straightforward. Consider the case where one has a multipol and wishes to extract the terms of order zero and one: > a[0:1, 0:1] [1,] [2,] Note how the off-by-one issue is handled: a[i,j] is the coefficient of xi y j (here the constant and firstorder terms); the code is due to Rougier (2007). Replacement is slightly different: > a[0, 0] <- -99 > a y^0 y^1 y^2 y^3 y^4 x^0 -99 3 5 7 9 x^1 2 4 6 8 10 Observe how replacement operators—unlike extraction operators—return a multipol; this allows expeditious modification of multivariate polynomials. The reason that the extraction operator returns an array rather than a multipol is that the extracted object often does not have unambiguous interpretation as a multipol (consider a[-1,-1], for example). It seems to this author that the loss of elegance arising from the asymmetry between extraction and replacement is amply offset by the impossibility of an extracted object’s representation as a multipol being undesired—unless the user explicitly coerces. The elephant in the room Representing a multivariate polynomial by an array is a natural and efficient method, but suffers some disadvantages. Consider Euler’s four-square identity > f2 <- as.function(a * b) > f2(c(x = 1, y = 0+3i)) [,1] [,2] 1 3 2 4 a21 + a22 + a23 + a24 · b21 + b22 + b23 + b24 = ( a1 b1 − a2 b2 − a3 b3 − a4 b4 )2 + ( a1 b2 + a2 b1 + a3 b4 − a4 b3 )2 + [1] 67725+167400i ( a1 b3 − a2 b4 + a3 b1 + a4 b2 )2 + It is worth noting the seamless integration between polynom and multipol in this regard: f1(a) is a multipol [recall that f1() is a function coerced from a univariate polynomial]. ( a1 b4 + a2 b3 − a3 b2 + a4 b1 )2 which was discussed in 1749 in a letter from Euler to Goldbach. The identity is important in number 1 Or indeed more elegantly by observing that both sides of the identity express the absolute value of the product of two quaternions: | a|2 |b|2 = | ab|2 . With the onion package (Hankin, 2006a), one would define f <- function(a,b)Norm(a)*Norm(b) - Norm(a*b) and observe (for example) that f(rquat(rand="norm"),rquat(rand="norm")) is zero to machine precision. R News ISSN 1609-3631 Vol. 8/1, May 2008 theory, and may be proved straightforwardly by direct expansion.1 It may by verified to machine precision using the multipol package; the left hand side is given by: > options("showchars" = TRUE) > lhs <- polyprod(ones(4,2),ones(4,2)) [1] "1*x1^2*x5^2 + 1*x2^2*x5^2 + ... (the right hand side’s idiom is more involved), but this relatively trivial expansion requires about 20 minutes on my 1.5 GHz G4; the product comprises 38 = 6561 elements, of which only 16 are nonzero. Note the options() statement controlling the format of the output which causes the result to be printed in a more appropriate form. Clearly the multipol package as currently implemented is inefficient for multivariate problems of this nature in which the arrays possess few nonzero elements. A challenge The inefficiency discussed above is ultimately due to the storage and manipulation of many zero coefficients that may be omitted from a calculation. Multivariate polynomials for which this is an issue appear to be common: the package includes many functions—such as uni(), single(), and lone()— that define useful multipols in which the number of nonzero elements is very small. In this section, I discuss some ideas for implementations in which zero operations are implicitly excluded. These ideas are presented in the spirit of a request for comments: although they seem to this author to be reasonable methodologies, readers are invited to discuss the ideas presented here and indeed to suggest alternative strategies. The canonical solution would be to employ some form of sparse array class, along the lines of Mathematica’s SparseArray. Unfortunately, no such functionality exists as of 2008, but C++ includes a “map” class (Stroustrup, 1997) that would be ideally suited to this application. There are other paradigms that may be worth exploring. It is possible to consider a multivariate polynomial of arity d (call this an object of class Pd ) as being a univariate polynomial whose coefficients are of class Pd−1 —class P0 would be a real or complex number—but such recursive class definitions appear not to be possible with the current implementation of S3 or S4 (Venables, 2008). Recent experimental work by West (2008) exhibits a proof-of-concept in C++ which might form the back end of an R implementation. Euler’s identity appears to be a particularly favourable example and is proved essentially instantaneously (the proof is a rigorous theoretical result, not just a numerical verification, as the system uses exact integer arithmetic). R News 44 Conclusions This article introduces the multipol package that provides functionality for manipulating multivariate polynomials. The multipol package builds on and generalizes the polynom package of Venables et al., which is restricted to the case of univariate polynomials. The generalization is not straightforward and presents a number of programming issues that were discussed. One overriding issue is that of performance: many multivariate polynomials of interest are “sparse” in the sense that they have many zero entries that unnecessarily consume storage and processing resources. Several possible solutions are suggested, in the form of a request for comments. The canonical method appears to be some form of sparse array, for which the “map” class of the C++ language is ideally suited. Implementation of such functionality in R might well find application in fields other than multivariate polynomials. Appendix: an example This appendix presents a brief technical example of multivariate polynomials in use in the field of enumerative combinatorics (Good, 1976). Suppose one wishes to determine how many contingency tables, with non-negative integer entries, have specified row and column marginal totals. The appropriate generating function is 1 1 − xi y j 16i6nr 16 j6nc ∏ ∏ where the table has nr rows and nc columns (the number of contingency tables is given by the coeffit t s s cient of x11 x22 · · · xrsr · y11 y22 · · · yttc where the si and ti are the row- and column- sums respectively). The R idiom for the generating function gf in the case of nr = nc = n = 3 is: n jj f u gf <<<<<- 3 as.matrix(expand.grid(1:n,n+(1:n))) function(i) ooom(n,lone(2*n,jj[i,]),m=n) c(sapply(1:(n*n),f,simplify=FALSE)) do.call("mprod", c(u,maxorder=n)) [here function ooom() is “one-over-one-minus”; and mprod() is the function name for multipol product]. In this case, it is clear that sparse array functionality would not result in better performance, many elements of the generating function gf are nonzero. Observe that the maximum of gf, 55, is consistent with Sloane (2008). Acknowledgements I would like to acknowledge the many stimulating comments made by the R-help list. In particular, ISSN 1609-3631 Vol. 8/1, May 2008 the insightful comments from Bill Venables and Kurt Hornik were extremely helpful. Bibliography G. E. Andrews. The Theory of Partitions. Cambridge University Press, 1998. L. Euler. Lettre CXXV. Communication to Goldbach; Berlin, 12 April, 1749. I. J. Good. On the application of symmetric Dirichlet distributions and their mixtures to contingency tables. The Annals of Statistics, 4(6):1159–1189, 1976. R. K. S. Hankin. Normed division algebras with R: Introducing the onion package. R News, 6(2): 49–52, May 2006a. URL http://CRAN.R-project. org/doc/Rnews/. R. K. S. Hankin. Additive integer partitions in R. Journal of Statistical Software, Code Snippets, 16(1), May 2006b. J. Rougier. Oarray: Arrays with arbitrary offsets, 2007. R package version 1.4-2. N. J. A. Sloane. The on-line encyclopedia of integer sequences. Published electronically at http://www.research.att.com/~njas/ sequences/A110058, 2008. R News 45 B. Stroustrup. The C++ Programming Language. Addison Wesley, third edition, 1997. W. N. Venables, K. Hornik, and M. Maechler. polynom: A collection of functions to implement a class for univariate polynomial manipulations, 2007. URL http://CRAN.R-project.org/. R package version 1.3-2. S original by Bill Venables, packages for R by Kurt Hornik and Martin Maechler. W. N. Venables. Personal Communication, 2008. W. N. Venables and B. D. Ripley. S Programming. Springer, 2001. L. J. West. An experimental C++ implementation of a recursively defined polynomial class. Personal communication, 2008. H. S. Wilf. generatingfunctionology. Academic Press, 1994. Robin K. S. Hankin National Oceanography Centre, Southampton European Way Southampton United Kingdom SO14 3ZH [email protected] ISSN 1609-3631 Vol. 8/1, May 2008 46 R Help Desk How Can I Avoid This Loop or Make It Faster? Vectorization! by Uwe Ligges and John Fox R is an interpreted language, i.e., code is parsed and evaluated at runtime. Therefore there is a speed issue which can be addressed by writing vectorized code (which is executed vector-wise) rather than using loops, if the problem can be vectorized. Loops are not necessarily bad, however, if they are used in the right way — and if some basic rules are heeded: see the section below on loops. Many vector-wise operations are obvious. Nobody would want to replace the common component-wise operators for vectors or matrices (such as +, -, *, . . . ), matrix multiplication (%*%), and extremely handy vectorized functions such as crossprod() and outer() by loops. Note that there are also very efficient functions available for calculating sums and means for certain dimensions in arrays or matrices: rowSums(), colSums(), rowMeans(), and colMeans(). If vectorization is not as obvious as in the cases mentioned above, the functions in the ‘apply’ family, named [s,l,m,t]apply, are provided to apply another function to the elements/dimensions of objects. These ‘apply’ functions provide a compact syntax for sometimes rather complex tasks that is more readable and faster than poorly written loops. Introduction There are some circumstances in which optimized and efficient code is desirable: in functions that are frequently used, in functions that are made available to the public (e.g., in a package), in simulations taking a considerable amount of time, etc. There are other circumstances in which code should not be optimized with respect to speed — if the performance is already satisfactory. For example, in order to save a few seconds or minutes of CPU time, you do not want to spend a few hours of programming, and you do not want to break code or introduce bugs applying optimization patches to properly working code. A principal rule is: Do not optimize unless you really need optimized code! Some more thoughts about this rule are given by Hyde (2006), for example. A second rule in R is: When you write a function from scratch, do it the vectorized way initially. If you do, then most of the time there will be no need to optimize later on. If you really need to optimize, measure the speed of your code rather than guessing it. How to profile R code in order to detect the bottlenecks is described in Venables (2001), R Development Core Team (2008a), and the help page ?Rprof. The CRAN packages proftools (Tierney, 2007) and profr (Wickham, 2008) provide sets of more extensive profiling tools. The convenient function system.time() (used later in this article) simply measures the time of the command given as its argument.1 The returned value consists of user time (CPU time R needs for calculations), system time (time the system is using for processing requests, e.g., for handling files), total time (how long it really took to process the command) and — depending on the operating system in use — two further elements. Readability and clarity of the code is another topic in the area of optimized code that has to be considered, because readable code is more maintainable, and users (as well as the author) can easily see what is going on in a particular piece of code. In the next section, we focus on vectorization to optimize code both for speed and for readability. We describe the use of the family of *apply functions, which enable us to write condensed but clear code. Some of those functions can even make the code perform faster. How to avoid mistakes when writing loops and how to measure the speed of code is described in a subsequent section. Matrices and arrays: apply() The function apply() is used to work vector-wise on matrices or arrays. Appropriate functions can be applied to the columns or rows of a matrix or array without explicitly writing code for a loop. Before reading further in this article, type ?apply and read the whole help page, particularly the sections ‘Usage’, ‘Arguments’, and ‘Examples’. As an example, let us construct a 5 × 4 matrix X from some random numbers (following a normal distribution with µ = 0, σ = 1) and apply the function max() column-wise to X. The result will be a vector of the maxima of the columns: R> (X <- matrix(rnorm(20), nrow = 5, ncol = 4)) R> apply(X, 2, max) Dataframes, lists and vectors: sapply() lapply() and Using the function lapply() (l because the value returned is a list), another appropriate function can be quickly applied element-wise to other objects, for example, dataframes, lists, or simply vectors. The resulting list has as many elements as the original object to which the function is applied. 1 Timings in this article have been measured on the following platform: AMD Athlon 64 X2 Dual Core 3800+ (2 GHz), 2 Gb RAM, Windows XP Professional SP2 (32-bit), using an optimized ‘Rblas.dll’ linked against ATLAS as available from CRAN. R News ISSN 1609-3631 Vol. 8/1, May 2008 Analogously, the function sapply() (s for simplify) works like lapply() with the exception that it tries to simplify the value it returns. This means, for example, that if the resulting object is a list containing just vectors of length one, the result simplifies to a vector (or a matrix, if the list contains vectors of equal lengths). If sapply() cannot simplify the result, it returns the same list as lapply(). A frequently used R idiom: Suppose that you want to extract the i-th columns of several matrices that are contained in a list L. To set up an example, we construct a list L containing two matrices A and B: R> A <- matrix(1:4, 2, 2) R> B <- matrix(5:10, 2, 3) R> (L <- list(A, B)) [[1]] [,1] [,2] [1,] 1 3 [2,] 2 4 [[2]] [,1] [,2] [,3] [1,] 5 7 9 [2,] 6 8 10 The next call can be read as follows: ‘Apply the function [() to all elements of L as the first argument, omit the second argument, and specify 2 as the third argument. Finally return the result in the form of a list.’ The command returns the second columns of both matrices in the form of a list: R> lapply(L, "[", , 2) [[1]] [1] 3 4 [[2]] [1] 7 8 The same result can be achieved by specifying an anonymous function, as in: R> sapply(L, function(x) x[ , 2]) [,1] [,2] [1,] 3 7 [2,] 4 8 where the elements of L are passed separately as x in the argument of the anonymous function given as the second argument in the lapply() call. Because all matrices in L contain equal numbers of rows, the call returns a matrix consisting of the second columns of all the matrices in L. Vectorization via mapply() and Vectorize() The mapply() function (m for multivariate) can simultaneously vectorize several arguments to a function that does not normally take vector arguments. Consider the integrate() function, which approximates definite integrals by adaptive quadrature, and which is designed to compute a single integral. The following command, for example, integrates the standard-normal density function from −1.96 to 1.96: R News 47 R> integrate(dnorm, lower=-1.96, upper=1.96) 0.9500042 with absolute error < 1.0e-11 integrate() returns an object, the first element of which, named "value", contains the value of the integral. This is an artificial example because normal integrals can be calculated more directly with the vectorized pnorm() function: > pnorm(1.96) - pnorm(-1.96) [1] 0.9500042 mapply() permits us to compute several normal integrals simultaneously: R> (lo <- c(-Inf, -3:3)) [1] -Inf -3 -2 -1 R> (hi <- c(-3:3, Inf)) [1] -3 -2 -1 0 1 0 2 1 2 3 3 Inf R> (P <- mapply(function(lo, hi) + integrate(dnorm, lo, hi)$value, lo, hi)) [1] 0.001349899 0.021400234 0.135905122 [4] 0.341344746 0.341344746 0.135905122 [7] 0.021400234 0.001349899 R> sum(P) [1] 1 vectorize() takes a function as its initial argument and returns a vectorized version of the function. For example, to vectorize integrate(): R> Integrate <- Vectorize( + function(fn, lower, upper) + integrate(fn, lower, upper)$value, + vectorize.args=c("lower", "upper") + ) Then R> Integrate(dnorm, lower=lo, upper=hi) produces the same result as the call to mapply() above. Optimized BLAS for vectorized code If vector and matrix operations (such as multiplication, inversion, decomposition) are applied to very large matrices and vectors, optimized BLAS (Basic Linear Algebra Subprograms) libraries can be used in order to increase the speed of execution dramatically, because such libraries make use of the specific architecture of the CPU (optimally using caches, pipelines, internal commands and units of a CPU). A well known optimized BLAS is ATLAS (Automatically Tuned Linear Algebra Software, http: //math-atlas.sourceforge.net/, Whaley and Petitet, 2005). How to link R against ATLAS, for example, is discussed in R Development Core Team (2008b). Windows users can simply obtain precompiled binary versions of the file ‘Rblas.dll’, linked against ATLAS for various CPUs, from the directory ISSN 1609-3631 Vol. 8/1, May 2008 ‘/bin/windows/contrib/ATLAS/’ on their favourite CRAN mirror. All that is necessary is to replace the standard file ‘Rblas.dll’ in the ‘bin’ folder of the R installation with the file downloaded from CRAN. In particular, it is not necessary to recompile R to use the optimized ‘Rblas.dll’. Loops! Many comments about R state that using loops is a particularly bad idea. This is not necessarily true. In certain cases, it is difficult to write vectorized code, or vectorized code may consume a huge amount of memory. Also note that it is in many instances much better to solve a problem with a loop than to use recursive function calls. Some rules for writing loops should be heeded, however: Initialize new objects to full length before the loop, rather than increasing their size within the loop. If an element is to be assigned into an object in each iteration of a loop, and if the final length of that object is known before the loop starts, then the object should be initialized to full length prior to the loop. Otherwise, memory has to be allocated and data has to be copied in each iteration of the loop, which can take a considerable amount of time. To initialize objects we can use functions such as • logical(), integer(), numeric(), complex(), and character() for vectors of different modes, as well as the more general function vector(); • matrix() and array(). Consider the following example. We write three functions, time1(), time2(), and time3(), each assigning values element-wise into an object: For i = 1, . . . , n, the value i2 will be written into the i-th element of vector a. In function time1(), a will not be initialized to full length (very bad practice, but we see it repeatedly: a <- NULL): R> time1 <- function(n){ + a <- NULL + for(i in 1:n) a <- c(a, i^2) + a + } R> system.time(time1(30000)) user system elapsed 5.11 0.01 5.13 In function time2(), a will be initialized to full length [a <- numeric(n)]: R> time2 <- function(n){ + a <- numeric(n) + for(i in 1:n) a[i] <- i^2 + a + } R> system.time(time2(30000)) R News 48 user 0.22 system elapsed 0.00 0.22 In function time3(), a will be created by a vectorwise operation without a loop. R> time3 <- function(n){ + a <- (1:n)^2 + a + } R> system.time(time3(30000)) user system elapsed 0 0 0 What we see is that • it makes sense to measure and to think about speed; • functions of similar length of code and with the same results can vary in speed — drastically; • the fastest way is to use a vectorized approach [as in time3()]; and • if a vectorized approach does not work, remember to initialize objects to full length as in time2(), which was in our example more than 20 times faster than the approach in time1(). It is always advisable to initialize objects to the right length, if possible. The relative advantage of doing so, however, depends on how much computational time is spent in each loop iteration. We invite readers to try the following code (which pertains to an example that we develop below): R> system.time({ + matrices <- vector(mode="list", length=10000) + for (i in 1:10000) + matrices[[i]] <+ matrix(rnorm(10000), 100, 100) + }) R> system.time({ + matrices <- list() + for (i in 1:10000) + matrices[[i]] <+ matrix(rnorm(10000), 100, 100) + }) Notice, however, that if you deliberately build up the object as you go along, it will slow things down a great deal, as the entire object will be copied at every step. Compare both of the above with the following: R> system.time({ + matrices <- list() + for (i in 1:1000) + matrices <- c(matrices, + list(matrix(rnorm(10000), 100, 100))) + }) Do not do things in a loop that can be done outside the loop. It does not make sense, for example, to check for the validity of objects within a loop if checking can be applied outside, perhaps even vectorized. ISSN 1609-3631 Vol. 8/1, May 2008 It also does not make sense to apply the same calculations several times, particularly not n times within a loop, if they just have to be performed one time. Consider the following example where we want to apply a function [here sin()] to i = 1, . . . , n and multiply the results by 2π. Let us imagine that this function cannot work on vectors [although sin() does work on vectors, of course!], so that we need to use a loop: R> time4 <- function(n){ + a <- numeric(n) + for(i in 1:n) + a[i] <- 2 * pi * sin(i) + a + } R> system.time(time4(100000)) user system elapsed 0.75 0.00 0.75 R> time5 <- function(n){ + a <- numeric(n) + for(i in 1:n) + a[i] <- sin(i) + 2 * pi * a + } R> system.time(time5(100000)) user system elapsed 0.50 0.00 0.50 Again, we can reduce the amount of CPU time by heeding some simple rules. One of the reasons for the performance gain is that 2*pi can be calculated just once [as in time5()]; there is no need to calculate it n = 100000 times [as in the example in time4()]. Do not avoid loops simply for the sake of avoiding loops. Some time ago, a question was posted to the R-help email list asking how to sum a large number of matrices in a list. To simulate this situation, we create a list of 10000 100 × 100 matrices containing randomnormal numbers: R> matrices <- vector(mode="list", length=10000) R> for (i in seq_along(matrices)) + matrices[[i]] <+ matrix(rnorm(10000), 100, 100) One suggestion was to use a loop to sum the matrices, as follows, producing, we claim, simple, straightforward code: R> system.time({ + S <- matrix(0, 100, 100) + for (M in matrices) + S <- S + M + }) user system elapsed 1.22 0.08 1.30 In response, someone else suggested the following ‘cleverer’ solution, which avoids the loop: R News 49 R> system.time(S <- apply(array(unlist(matrices), + dim = c(100, 100, 10000)), 1:2, sum)) Error: cannot allocate vector of size 762.9 Mb Not only does this solution fail for a problem of this magnitude on the system on which we tried it (a 32bit system, hence limited to 2Gb for the process), but it is slower on smaller problems. We invite the reader to redo this problem with 10000 10 × 10 matrices, for example. A final note on this problem: R> S <- rowSums(array(unlist(matrices), + dim = c(10, 10, 10000)), dims = 2) is approximately as fast as the loop for the smaller version of the problem but fails on the larger one. The lesson: Avoid loops to produce clearer and possibly more efficient code, not simply to avoid loops. Summary To answer the frequently asked question, ‘How can I avoid this loop or make it faster?’: Try to use simple vectorized operations; use the family of apply functions if appropriate; initialize objects to full length when using loops; and do not repeat calculations many times if performing them just once is sufficient. Measure execution time before making changes to code, and only make changes if the efficiency gain really matters. It is better to have readable code that is free of bugs than to waste hours optimizing code to gain a fraction of a second. Sometimes, in fact, a loop will provide a clear and efficient solution to a problem (considering both time and memory use). Acknowledgment We would like to thank Bill Venables for his many helpful suggestions. Bibliography R. Hyde. The fallacy of premature optimization. Ubiquity, 7(24), 2006. URL http://www.acm.org/ ubiquity. R Development Core Team. Writing R Extensions. R Foundation for Statistical Computing, Vienna, Austria, 2008a. URL http://www.R-project.org. R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, Austria, 2008b. URL http://www. R-project.org. L. Tierney. proftools: Profile Output Processing Tools for R, 2007. R package version 0.0-2. ISSN 1609-3631 Vol. 8/1, May 2008 W. Venables. Programmer’s Niche. R News, 1(1):27– 30, 2001. URL http://CRAN.R-project.org/doc/ Rnews/. R. C. Whaley and A. Petitet. Minimizing development and maintenance costs in supporting persistently optimized BLAS. Software: Practice and Experience, 35(2):101–121, February 2005. URL http://www.cs.utsa.edu/~whaley/papers/ spercw04.ps. H. Wickham. profr: An alternative display for profiling R News 50 information, 2008. URL http://had.co.nz/profr. R package version 0.1.1. Uwe Ligges Department of Statistics, Technische Universität Dortmund, Germany [email protected] John Fox Department of Sociology, McMaster University, Hamilton, Ontario, Canada [email protected] ISSN 1609-3631 Vol. 8/1, May 2008 51 Changes in R Version 2.7.0 by the R Core Team User-visible changes • The default graphics device in non-interactive use is now pdf() rather than postscript(). [PDF viewers are now more widely available than PostScript viewers.] The default width and height for pdf() and bitmap() have been changed to 7 (inches) to match the screen devices. • Most users of the X11() device will see a new device that has different fonts, anti-aliasing of lines and fonts and supports semi-transparent colours. • Considerable efforts have been made to make the default output from graphics devices as similar as possible (and in particular close to that from postscript/pdf). Many devices were misinterpreting ’pointsize’ in some way, for example as being in device units (pixels) rather than in points. • Packages which include graphics devices need to be re-installed for this version of R, with recently updated versions. New features • The apse code used by agrep() has been updated to version 0.16, with various bug fixes. agrep() now supports multibyte character sets. • any() and all() avoid coercing zero-length arguments (which used a surprising amount of memory) since they cannot affect the answer. Coercion of other than integer arguments now gives a warning as this is often a mistake (e.g. writing all(pr) > 0 instead of all(pr > 0) ). • as.Date(), as.POSIXct() and as.POSIXlt() now convert numeric arguments (days or seconds since some epoch) provided the ’origin’ argument is specified. • New function as.octmode() to create objects such as file permissions. • as.POSIXlt() is now generic, and it and as.POSIXct() gain a ’...’ argument. The character/factor methods now accept a ’format’ argument (analogous to that for as.Date). R News • New function browseVignettes() lists available vignettes in an HTML browser with links to PDF, Rnw, and R files. • There are new capabilities "aqua" (for the AQUA GUI and quartz() device on Mac OS X) and "cairo" (for cairo-based graphics devices). • New function checkNEWS() in package ’tools’ that detects common errors in NEWS file formatting. • deparse() gains a new argument ’nlines’ to limit the number of lines of output, and this is used internally to make several functions more efficient. • deriv() now knows the derivatives of digamma(x), trigamma(x) and psigamma(x, deriv) (wrt to x). • dir.create() has a new argument ’mode’, used on Unix-alikes (only) to set the permissions on the created directory. • Where an array is dropped to a length-one vector by drop() or [, drop = TRUE], the result now has names if exactly one of the dimensions was named. (This is compatible with S.) Previously there were no names. • The ’incomparables’ argument to duplicated(), unique() and match() is now implemented, and passed to match() from merge(). • dyn.load() gains a ’DLLpath’ argument to specify the path for dependent DLLs: currently only used on Windows. • The spreadsheet edit() methods (and used by fix()) for data frames and matrices now warn when classes are discarded. When editing a data frame, columns of unknown type (that is not numeric, logical, character or factor) are now converted to character (instead of numeric). • file.create() has a new argument ’showWarnings’ (default TRUE) to show an informative warning when creation fails, and dir.create() warns under more error conditions. • New higher-order functions Find(), Negate() and Position(). • [dpqr]gamma(*, shape = 0) now work as limits of ’shape -> 0’, corresponding to the point distribution with all mass at 0. ISSN 1609-3631 Vol. 8/1, May 2008 • An informative warning (in addition to the error message) will be given when the basic, extended or perl mode of grep(), strsplit() and friends fails to compile the pattern. • More study is done of perl=TRUE patterns in grep() and friends when length(x) > 10: this should improve performance on long vectors. • grep(), strsplit() and friends with fixed=TRUE or perl=TRUE work in UTF-8 and preserve the UTF-8 encoding for UTF-8 inputs where supported. • help.search() now builds the database about 3x times faster. • iconv() now accepts "UTF8" on all platforms (many did, but not e.g. libiconv as used on Windows). • identity() convenience function to be used for programming. • In addition to warning when ’pkgs’ is not found, install.packages() now reports if it finds a valid package with only a case mismatch in the name. • intToUtf8() now marks the Encoding of its output. • The function is() now works with S3 inheritance; that is, with objects having multiple strings in the class attribute. • Extensions to condition number computation for matrices, notably complex ones are provided, both in kappa() and the new rcond(). • list.files() gains a ’ignore.case’ argument, to allow case-insensitive matching on some Windows/MacOS file systems. • ls.str() and lsf.str() have slightly changed arguments and defaults such that ls.str() no arguments works when debugging. • Under Unix, utils::make.packages.html() can now be used directly to set up linked HTML help pages, optionally without creating the package listing and search database (which can be much faster). • new.packages() now knows about the frontend package gnomeGUI (which does not install into a library). • optim(*, control = list(...)) now warns when ’...’ contains unexpected names, instead of silently ignoring them. R News 52 • The options "browser" and "editor" may now be set to functions, just as "pager" already could. • packageDescription() makes use of installed metadata where available (for speed, e.g. in make.packages.html()). • pairwise.t.test() and pairwise.wilcox.test() now more explicitly allow paired tests. In the former case it is now flagged as an error if both ’paired’ and ’pool.SD’ are set TRUE (formerly, ’paired’ was silently ignored), and onesided tests are generated according to ’alternative’ also if ’pool.SD’ is TRUE. • paste() and file.path() are now completely internal, for speed. (This speeds up make.packages.html(packages=FALSE) severalfold, for example.) • paste() now sets the encoding on the result under some circumstances (see ?paste). • predict.loess() now works when loess() was fitted with transformed explanatory variables, e.g, loess(y ~ log(x)+ log(z)). • print(<data.frame>)’s new argument ’row.names’ allows to suppress printing rownames. • print() and str() now also "work" for ’logLik’ vectors longer than one. • Progress-bar functions txtProgressBar(), tkProgressBar() in package tcltk and winProgressBar() (Windows only). • readChar() gains an argument ’useBytes’ to allow it to read a fixed number of bytes in an MBCS locale. • readNEWS() has been moved to the tools package. • round() and signif() now do internal argument matching if supplied with two arguments and at least one is named. • New function showNonASCII() in package tools to aid detection of non-ASCII characters in .R and .Rd files. • The [dpq]signrank() functions now typically use considerably less memory than previously, thanks to a patch from Ivo Ugrina. • spec.ar() now uses frequency(x) when calculating the frequencies of the estimated spectrum, so that for monthly series the frequencies are now per year (as for spec.pgram) rather than per month as before. • spline() gets an ’xout’ argument, analogously to approx(). ISSN 1609-3631 Vol. 8/1, May 2008 • sprintf() now does all the conversions needed in a first pass if length(fmt) == 1, and so can be many times faster if called with long vector arguments. • [g]sub(useBytes = FALSE) now sets the encoding on changed elements of the result when working on an element of known encoding. (This was previously done only for perl = TRUE.) • New function Sys.chmod(), a wrapper for ’chmod’ on platforms which support it. (On Windows it handles only the read-only bit.) • New function Sys.umask(), a wrapper for ’umask’ on platforms which support it. • New bindings ttk*() in package tcltk for the ’themed widgets’ of Tk 8.5. The tcltk demos make use of these widgets where available. • write.table(d, row.names=FALSE) is faster when ’d’ has millions of rows; in particular for a data frame with automatic row names. (Suggestion from Martin Morgan.) • The parser limit on string size has been removed. • If a NEWS file is present in the root of a source package, it is installed (analogously to LICENSE, LICENCE and COPYING). 53 • x[<zero-length>] <- NULL is always a no-op: previously type-checking was done on the replacement value and so this failed, whereas we now assume NULL can be promoted to any zero-length vector-like object. Other cases of a zero-length index are done more efficiently. • There is a new option in Rd markup of \donttest{} to mark example code that should be run by example() but not tested (e.g. because it might fail in some locales). • The error handler in the parser now reports line numbers for more syntax errors (MBCS and Unicode encoding errors, line length and context stack overflows, and mis-specified argument lists to functions). • The "MethodsList" objects originally used for method selection are being phased out. New utilities provide simpler alternatives (see ?findMethods), and direct use of the mangled names for the objects is now deprecated. • Creating new S4 class and method definitions in an environment that could not be identified (as package, namespace or global) previously generated an error. It now results in creating and using an artificial package name from the current date/time, with a warning. See ?getPackageName. • Rd conversion to ’example’ now quotes aliases which contain spaces. • Unix-alikes now give a warning on startup if locale settings fail. (The Windows port has long done so.) • The handling of DST on dates outside the range 1902-2037 has been improved. Dates after 2037 are assumed to have the same DST rules as currently predicted for the 2030’s (rather than the 1970s), and dates prior to 1902 are assumed to have no DST and the same offset as in 1902 (if known, otherwise as in the 1970s). • Parsing and scanning of numerical constants is now done by R’s own C code. This ensures cross-platform consistency, and mitigates the effects of setting LC_NUMERIC (within base R it only applies to output – packages may differ). • On platforms where we can detect that mktime sets errno (e.g. Solaris and the code used on Windows but not Linux nor Mac OS X), 196912-31 23:59:59 GMT is converted from POSIXlt to POSIXct as -1 and not NA. • The definition of ’whitespace’ used by the parser is slightly wider: it includes Unicode space characters on Windows and in UTF-8 locales on machines which use Unicode wide characters. • The src/extra/intl sources have been updated to those from gettext 0.17. • New flag –interactive on Unix-alikes forces the session to be interactive (as –ess does on Windows). R News The format accepted is more general than before and includes binary exponents in hexadecimal constants: see ?NumericConstants for details. • Dependence specifications for R or packages in the Depends field in a DESCRIPTION file can now make use of operators < > == and != (in addition to <= and >=): such packages will not be installable nor loadable in R < 2.7.0. There can be multiple mentions of R or a package in the Depends field in a DESCRIPTION file: only the first mention will be used in R < 2.7.0. GRAPHICS CHANGES • The default graphics devices in interactive and non-interactive sessions are ISSN 1609-3631 Vol. 8/1, May 2008 now configurable via environment ables R_INTERACTIVE_DEVICE R_DEFAULT_DEVICE respectively. 54 variand • New function dev.new() to launch a new copy of the default graphics device (and taking care if it is "pdf" or "postscript" not to trample on the file of an already running copy). • dev.copy2eps() uses dev.displaylist() to detect screen devices, rather than list them in the function. • New function dev.copy2pdf(), the analogue of dev.copy2eps(). • dev.interactive() no longer treats a graphics device as interactive if it has a display list (but devices can still register themselves on the list of interactive devices). • The X11() and windows() graphics devices have a new argument ’title’ to set the window title. • X11() now has the defaults for all of its arguments set by the new function X11.options(), inter alia replacing options "gamma", "colortype" and "X11fonts". • ps.options() now warns on unused option ’append’. xfig() no longer takes default arguments from ps.options(). (This was not documented prior to 2.6.1 patched.) pdf() now takes defaults from the new function pdf.options() rather that from ps.options() (and the latter was not documented prior to 2.6.1 patched). The defaults for all arguments other than ’file’ in postscript() and pdf() can now be set by ps.options() or pdf.options() • New functions setEPS() and setPS() as wrappers to ps.options() to set appropriate defaults for figures for inclusion in other documents and for spooling to a printer respectively. • The meaning of numeric ’pch’ has been extended where MBCSes are supported. Now negative integer values indicate Unicode points, integer values in 32-127 represent ASCII characters, and 128-255 are valid only in singlebyte locales. (Previously what happened with negative pch values was undocumented: they were replaced by the current setting of par("pch").) • Graphics devices can say if they can rotate text well (e.g. postscript() and pdf() can) and if so the device’s native text becomes the default R News for contour labels rather than using Hershey fonts. • The setting of the line spacing (par("cra")[2]) on the X11() and windows() devices is now comparable with postscript() etc, and roughly 20% smaller than before (it used to depend on the locale for X11). (So is the pictex() device, now 20% larger.) This affects the margin size in plots, and should result in betterlooking plots. • There is a per-device setting for whether new frames need confirmation. This is controlled by either par("ask") or grid.prompt() and affects all subsequent plots on the device using base or grid graphics. • There is a new version of the X11() device based on cairo graphics which is selected by type "cairo" or "nbcairo", and is available on machines with cairo installed and preferably pango (which most machines with gtk+ >= 2.8 will have). This version supports translucent colours and normally does a better job of font selection so it has been possible to display (e.g.) English, Polish, Russian and Japanese text on a single X11() window. It is the default where available. There is a companion function, savePlot(), to save the current plot to a PNG file. On Unix-alikes, devices jpeg() and png() also accept type = "cairo", and with that option do not need a running X server. The meaning of capabilities("jpeg") and capabilities("png") has changed to reflect this. On MacOS X, there is a further type = "quartz". The default type is selected by the new option "bitmapType", and is "quartz" or "cairo" where available. Where cairo 1.2 or later is supported, there is a svg() device to write SVG files, and cairo_pdf() and cairo_ps() devices to write (possibly bitmap) PDF and postscript files via cairo. Some features require cairo >= 1.2, and some which are nominally supported under 1.2 seem to need 1.4 to work well. • There are new bmp() and tiff() devices. • New function devSize() to report the size of the current graphics device surface (in inches or device units). This gives the same information as par("din"), but independent of the graphics subsystem. • New base graphics function clip() to set the clipping region (in user coordinates). ISSN 1609-3631 Vol. 8/1, May 2008 • New functions grconvertX() and grconvertY() to convert between coordinate systems in base graphics. • identify() recycles its ’labels’ argument if necessary. • stripchart() is now a generic function, with default and formula methods defined. Additional graphics parameters may be included in the call. Formula handling is now similar to boxplot(). • strwidth() and strheight() gain ’font’ and ’vfont’ arguments and accept in-line pars such as ’family’ in the same way as text() does. (Longstanding wish of PR#776) 55 • Use of the graphics headers Rgraphics.h and Rdevices.h is deprecated, and these will be unavailable in R 2.8.0. (They are hardly used except in graphics devices, for which there is an updated API in this version of R.) • options("par.ask.default") is deprecated in favour of "device.ask.default". • The ’device-independent’ family "symbol" is deprecated as it was highly locale- and devicedependent (it only did something useful in single-byte locales on most devices) and font=5 (base) or fontface=5 (grid) did the job it was intended to do more reliably. • gammaCody() is now formally deprecated. • example(ask=TRUE) now applies to grid graphics (e.g. from lattice) as well as to base graphics. • Two low-level functions using MethodsList metadata objects (mlistMetaName() and getAllMethods()) are deprecated. • Option "device.ask.default" replaces "par.ask.default" now it applies also to grid.prompt(). • Setting par(gamma=) is now deprecated, and the windows() device (the only known example) no longer allows it. • plot.formula() only prompts between plots for interactive devices (it used to prompt for all devices). • When plot.default() is called with y=NULL it now calls Axis() with the ’y’ it constructs rather than use the default axis. Deprecated & defunct • In package installation, SaveImage: yes is defunct and lazyloading is attempted instead. • $ on an atomic vector or S4 object is now defunct. • Partial matching in [[ is now only performed if explicitly requested (by exact=FALSE or exact=NA). • Command-line completion has been moved from package ’rcompgen’ to package ’utils’: the former no longer exists as a separate package in the R distribution. • The S4 pseudo-classes "single" and double have been removed. (The S4 class for a REALSXP is "numeric": for back-compatibility as(x, "double") coerces to "numeric".) • gpar(gamma=) in the grid package is now defunct. • Several S4 class definition utilities, get*(), have been said to be deprecated since R 1.8.0; these are now formally deprecated. Ditto for removeMethodsObject(). R News • The C macro ’allocString’ will be removed in 2.8.0 – use ’mkChar’, or ’allocVector’ directly if really necessary. Installation • Tcl/Tk >= 8.3 (released in 2000) is now required to build package tcltk. • configure first tries TCL_INCLUDE_SPEC and TK_INCLUDE_SPEC when looking for Tcl/Tk headers. (The existing scheme did not work for the ActiveTcl package on Mac OS X.) • The Windows build only supports Windows 2000 or later (XP, Vista, Server 2003 and Server 2008). • New option –enable-R-static-lib installs libR.a which can be linked to a front-end via ’R CMD config –ldflags’. The tests/Embedding examples now work with a static R library. • Netscape (which was discontinued in Feb 2008) is no longer considered when selecting a browser. • xdg-open (the freedesktop.org interface to kfmclient/gnome-open/...) is considered as a possible browser, after real browsers such as firefox, mozilla and opera. • The search for tclConfig.sh and tkConfig.sh now only looks in directories with names containing $(LIBnn) in the hope of finding the version for the appropriate architecture (e.g. x86_64 or i386). ISSN 1609-3631 Vol. 8/1, May 2008 • libtool has been updated to version 2.2. • Use of –with-system-zlib, –with-system-bzlib or –with-system-pcre now requires version >= 1.2.3, 1.0.5, 7.6 respectively, for security. Utilities • Rdconv now removes empty sections including alias and keyword entries, with a note. • Keyword entries are no longer mandatory in Rd files. • R CMD INSTALL now also installs tangled versions of all vignettes. • R CMD check now warns if spaces or nonASCII characters are used in file paths, since these are not in general portable. • R CMD check (via massage-examples.pl) now checks all examples with a 7 inch square device region on A4 paper, for locale-independence and to be similar to viewing examples on an on-screen device. If a package declares an encoding in the DESCRIPTION file, the examples are assumed to be in that encoding when running the tests. (This avoids errors in running latin1 examples in a UTF-8 locale.) • R CMD check uses pdflatex (if available) to check the typeset version of the manual, producing PDF rather than DVI. (This is a better check since the package reference manuals on CRAN are in PDF.) • R CMD Rd2dvi gains a –encoding argument to be passed to R CMD Rdconv, to set the default encoding for conversions. If this is not supplied and the files are package sources and the DESCRIPTION file contains an Encoding field, that is used for the default encoding. • available.packages() (and hence install.packages() etc.) now supports subdirectories in a repository, and tools::write_PACKAGES() can now produce PACKAGES files including subdirectories. • The default for ’stylepath’ in Sweave’s (default) RweaveLatex driver can be set by the environment variable SWEAVE_STYLEPATH_DEFAULT: see ?RweaveLatex. C-level facilities • Both the Unix and Windows interfaces for embedding now make use of ’const char *’ declarations where appropriate. R News 56 • Rprintf() and REprintf() now use ’const char *’ for their format argument – this should reduce warnings when called from C++. • There is a new description of the interface for graphics devices in the ’R Internals’ manual, and several new entry points. The API has been updated to version R_GE_version = 5, and graphics devices will need to be updated accordingly. • Graphics devices can now select to be sent text in UTF-8, even if the current locale is not UTF-8 (and so enable text entered in UTF-8 to be plotted). This is used by postscript(), pdf() and the windows() family of devices, as well as the new cairo-based devices. • More Lapack routines are available (and declared in R_Ext/Lapack.h), notably for (reciprocal) condition number estimation of complex matrices. • Experimental utility R_has_slot supplementing R_do_slot. • There is a new public interface to the encoding info stored on CHARSXPs, getCharCE and mkCharCE using the enumeration type cetype_t. • A new header ’R_ext/Visibility.h’ contains some definitions for controlling the visibility of entry points, and how to control visibility is now documented in ’Writing R Extensions’. Bug fixes • pt(x, df) is now even more accurate in some cases (e.g. 12 instead of 8 significant digits), when x2 << d f , thanks to a remark from Ian Smith, related to PR#9945. • co[rv](use = "complete.obs") now always gives an error if there are no complete cases: they used to give NA if method = "pearson" but an error for the other two methods. (Note that this is pretty arbitrary, but zero-length vectors always give an error so it is at least consistent.) cor(use="pair") used to give diagonal 1 even if the variable was completely missing for the rank methods but NA for the Pearson method: it now gives NA in all cases. cor(use="pair") for the rank methods gave a matrix result with dimensions > 0 even if one of the inputs had 0 columns. • Supplying edit.row.names = TRUE when editing a matrix without row names is now an error and not a segfault. (PR#10500) ISSN 1609-3631 Vol. 8/1, May 2008 • The error handler in the parser reported unexpected & as && and | as ||. • ps.options(reset = TRUE) had not reset for a long time. • paste() and file.path() no longer allow NA_character_ for their ’sep’ and ’collapse’ arguments. • by() failed for 1-column matrices and dataframes. (PR#10506) However, to preserve the old behaviour, the default method when operating on a vector still passes subsets of the vector to FUN, and this is now documented. • Better behaviour of str.default() for nondefault ’strict.width’ (it was calling str() rather than str.default() internally); also, more useful handling of options("str"). • wilcox.test(exact=FALSE, conf.int=TRUE) could fail in some extreme two-sample problems. (Reported by Wolfgang Huber.) 57 This means that devices can now indicate the ’graphics input’ mode by e.g. a change of cursor. • Locales without encoding specification and non-UTF-8 locales now work properly on Mac OS X. Note that locales without encoding specification always use UTF-8 encoding in Mac OS X (except for specials "POSIX" and "C") - this is different from other operating systems. • iconv() now correctly handles to="" and from="" on Mac OS X. • In diag()’s argument list, drop the explicit default (’ = n’) for ’ncol’ which is ugly when making diag() generic. • S4 classes with the same name from different packages were not recognized because of a bug in caching the new definition. • jpeg() and png() no longer maintain a display list, as they are not interactive devices. • par(pch=) would accept a multi-byte string but only use the first byte. This would lead to incorrect results in an MBCS locale if a nonASCII character was supplied. • Using attr(x, "names") <- value (instead of the correct names<-) with ’value’ a pairlist (instead of the correct character vector) worked incorrectly. (PR#10807) • There are some checks for valid C-style formats in, e.g. png(filename=). (PR#10571) • Using [<- to add a column to a data frame dropped other attributes whereas [[<- and $<- did not: now all preserve attributes. (PR#10873) • vector() was misinterpreting some double ’length’ values, e.g, NaN and NA_real_ were interpreted as zero. Also, invalid types of ’length’ were interpreted as -1 and hence reported as negative. (length<- shared the code and hence the same misinterpretations.) • A basic class "S4" was added to correspond to the "S4" object type, so that objects with this type will print, etc. The class is VIRTUAL, since all actual S4 objects must have a real class. • Classes with no slots that contain only VIRTUAL classes are now VIRTUAL, as was intended but confused by having an empty S4 object as prototype. ## backed out temporarily ## • format.AsIs() discarded dimnames, causing dataframes with matrix variables to be printed without using the column names, unlike what happens in S-PLUS (Tim Hesterberg, PR#10730). • xspline() and grid::grid.xspline() work in device coordinates and now correct for anisotropy in the device coordinate system. • grid.locator() now indicates to the graphics device that it is is in ’graphics input’ mode (as locator() and identify() always have). R News • File access functions such as file.exists(), file.info(), dirname() and unlink() now treat an NA filename as a non-existent file and not the file "NA". • r<foo>(), the random number generators, are now more consistent in warning when NA’s (specifically NaN’s) are generated. • rnorm(n, mu = Inf) now returns rep(Inf, n) instead of NaN; similar changes are applied to rlnorm(), rexp(), etc. • [l]choose() now warns when rounding noninteger ’k’ instead of doing so silently. (May help confused users such as PR#10766.) • gamma() was warning incorrectly for most negative values as being too near a negative integer. This also affected other functions making use of its C-level implementation. • dumpMethod() and dumpMethods() now work again. • package.skeleton() now also works for code_files with only metadata (e.g. S4 setClass) definitions; it handles S4 classes and methods, producing documentation and NAMESPACE exports if requested. ISSN 1609-3631 Vol. 8/1, May 2008 • Some methods package utilities (implicitGeneric(), makeGeneric()) will be more robust in dealing with primitive functions (not a useful idea to call them with primitives, though). • Making a MethodsList from a function with no methods table will return an empty list, rather than cause an error (questionably a bug, but caused some obscure failures). • setAs() now catches 2 arguments in the method definition, if they do not match the arguments of coerce(). • S4 methods with missing arguments in the definition are handled correctly when nonsignature arguments exist, and check for conflicting local names in the method definition. • qgamma() and qchisq() could be inaccurate for small p, e.g. qgamma(1.2e-10, shape = 19) was 2.52 rather than 2.73. • dbeta(.., ncp) is now more accurate for large ncp, and typically no longer underflows for give.log = TRUE. • coerce() is now a proper S4 object and so prints correctly. R News 58 • @ now checks it is being applied to an S4 object, and if not gives a warning (which will become an error in 2.8.0). • dump() and friends now warn that all S4 objects (even those based on vectors) are not source()able, with a stronger wording. • read.dcf(all = TRUE) was leaking connections. • scan() with a non-default separator could skip nul bytes, including those entered as code 0 with allowEscapes=TRUE. This was different from the default separator. • determinant(matrix(,0,0)) now returns a correct "det" result; also value 1 or 0 depending on ’logarithm’, rather than numeric(0). • Name space ’grDevices’ was not unloading its DLL when the name space was unloaded. • getNativeSymbolInfo() was unaware of nonregistered Fortran names, because one of the C support routines ignored them. • load() again reads correctly character strings with embedded nuls. (This was broken in 2.6.x, but worked in earlier versions.) ISSN 1609-3631 Vol. 8/1, May 2008 59 Changes on CRAN by Kurt Hornik CRAN package web The CRAN package web area has substantially been reorganized and enhanced. Most importantly, packages now have persistent package URLs of the form http://CRAN.R-project.org/package=foo which is also the recommended package URL for citations. (The package=foo redirections also work for most official CRAN mirrors.) The corresponding package web page now has its package dependency information hyperlinked. It also points to a package check page with check results and timings, and to an archive directory with the sources of older versions of the package (if applicable), which are conveniently gathered into an ‘Archive/foo’ subdirectory of the CRAN ‘src/contrib’ area. CRAN package checking The CRAN Linux continuing (“daily”) check processes now fully check packages with dependencies on packages in Bioconductor and Omegahat. All check flavors now give timings for installing and checking installed packages. The check results are available in overall summaries sorted by either package name or maintainer name, and in individual package check summary pages. New contributed packages BAYSTAR Bayesian analysis of Threshold autoregressive model (BAYSTAR). By Cathy W. S. Chen, Edward M.H. Lin, F.C. Liu, and Richard Gerlach. by our compression technique that represents a group of original parameters as a single one in MCMC step. By Longhai Li. Bchron Create chronologies based on radiocarbon and non-radiocarbon dated depths, following the work of Parnell and Haslett (2007, JRSSC). The package runs MCMC, predictions and plots for radiocarbon (and non radiocarbon) dated sediment cores. By Andrew Parnell. BootPR Bootstrap Prediction Intervals and BiasCorrected Forecasting for auto-regressive time series. By Jae H. Kim. COZIGAM COnstrained Zero-Inflated Generalized Additive Model (COZIGAM) fitting with associated model plotting and prediction. By Hai Liu and Kung-Sik Chan. CPE Concordance Probability Estimates in survival analysis. By Qianxing Mo, Mithat Gonen and Glenn Heller. ChainLadder Mack- and Munich-chain-ladder methods for insurance claims reserving. By Markus Gesmann. CombMSC Combined Model Selection Criteria: functions for computing optimal convex combinations of model selection criteria based on ranks, along with utility functions for constructing model lists, MSCs, and priors on model lists. By Andrew K. Smith. Containers Object-oriented data structures for R: stack, queue, deque, max-heap, min-heap, binary search tree, and splay tree. By John Hughes. CoxBoost Cox survival models by likelihood based boosting. By Harald Binder. BB Barzilai-Borwein Spectral Methods for solving nonlinear systems of equations, and for optimizing nonlinear objective functions subject to simple constraints. By Ravi Varadhan. DEA Data Envelopment Analysis, performing some basic models in both multiplier and envelopment form. By Zuleyka Diaz-Martinez and Jose Fernandez-Menendez. BPHO Bayesian Prediction with High-Order interactions. This software can be used in two situations. The first is to predict the next outcome based on the previous states of a discrete sequence. The second is to classify a discrete response based on a number of discrete covariates. In both situations, we use Bayesian logistic regression models that consider the highorder interactions. The models are trained with slice sampling method, a variant of Markov chain Monte Carlo. The time arising from using high-order interactions is reduced greatly DierckxSpline R companion to “Curve and Surface Fitting with Splines”, providing a wrapper to the FITPACK routines written by Paul Dierckx. The original Fortran is available from http://www.netlib.org/dierckx. By Sundar Dorai-Raj. R News EMC Evolutionary Monte Carlo (EMC) algorithm. Random walk Metropolis, Metropolis Hasting, parallel tempering, evolutionary Monte Carlo, temperature ladder construction and placement. By Gopi Goswami. ISSN 1609-3631 Vol. 8/1, May 2008 EMCC Evolutionary Monte Carlo (EMC) methods for clustering, temperature ladder construction and placement. By Gopi Goswami. EMD Empirical Mode Decomposition and Hilbert spectral analysis. By Donghoh Kim and HeeSeok Oh. ETC Tests and simultaneous confidence intervals for equivalence to control. The package allows selecting those treatments of a one-way layout being equivalent to a control. Bonferroni adjusted “two one-sided t-tests” (TOST) and related simultaneous confidence intervals are given for both differences or ratios of means of normally distributed data. For the case of equal variances and balanced sample sizes for the treatment groups, the single-step procedure of Bofinger and Bofinger (1995) can be chosen. For non-normal data, the Wilcoxon test is applied. By Mario Hasler. EffectiveDose Estimates the Effective Dose level for quantal bioassay data by nonparametric techniques and gives a bootstrap confidence interval. By Regine Scheder. FAiR Factor Analysis in R. This package estimates factor analysis models using a genetic algorithm, which opens up a number of new ways to pursue old ideas, such as those discussed by Allen Yates in his 1987 book “Multivariate Exploratory Data Analysis”. The major sources of value added in this package are new ways to transform factors in exploratory factor analysis, and perhaps more importantly, a new estimator for the factor analysis model called semiexploratory factor analysis. By Ben Goodrich. FinTS R companion to Tsay (2005), “Analysis of Financial Time Series, 2nd ed.” (Wiley). Includes data sets, functions and script files required to work some of the examples. Version 0.2-x includes R objects for all data files used in the text and script files to recreate most of the analyses in chapters 1 and 2 plus parts of chapters 3 and 11. By Spencer Graves. FitAR Subset AR Model fitting. Complete functions are given for model identification, estimation and diagnostic checking for AR and subset AR models. Two types of subset AR models are supported. One family of subset AR models, denoted by ARp, is formed by taking subsets of the original AR coefficients and in the other, denoted by ARz, subsets of the partial autocorrelations are used. The main advantage of the ARz model is its applicability to very large order models. By A.I. McLeod and Ying Zhang. FrF2 Analyzing Fractional Factorial designs with 2level factors. The package is meant for comR News 60 pletely aliased designs only, i.e., e.g. not for analyzing Plackett-Burman designs with interactions. Enables convenient main effects and interaction plots for all factors simultaneously and offers a cube plot for looking at the simultaneous effects of three factors. An enhanced DanielPlot function (modified from BsMD) is provided. Furthermore, the alias structure for Fractional Factorial 2-level designs is output in a more readable format than with the built-in function alias. By Ulrike Groemping. FunNet Functional analysis of gene co-expression networks from microarray expression data. The analytic model implemented in this package involves two abstraction layers: transcriptional and functional (biological roles). A functional profiling technique using Gene Ontology & KEGG annotations is applied to extract a list of relevant biological themes from microarray expression profiling data. Afterwards, multiple-instance representations are built to relate significant themes to their transcriptional instances (i.e., the two layers of the model). An adapted non-linear dynamical system model is used to quantify the proximity of relevant genomic themes based on the similarity of the expression profiles of their gene instances. Eventually an unsupervised multiple-instance clustering procedure, relying on the two abstraction layers, is used to identify the structure of the co-expression network composed from modules of functionally related transcripts. Functional and transcriptional maps of the co-expression network are provided separately together with detailed information on the network centrality of related transcripts and genomic themes. By Corneliu Henegar. GEOmap Routines for making map projections (forward and inverse), topographic maps, perspective plots, geological maps, geological map symbols, geological databases, interactive plotting and selection of focus regions. By Jonathan M. Lees. IBrokers R API to Interactive Brokers Trader Workstation. By Jeffrey A. Ryan. ISA Insieme di funzioni di supporto al volume “INTRODUZIONE ALLA STATISTICA APPLICATA con esempi in R”, Federico M. Stefanini, PEARSON Education Milano, 2007. By Fabio Frascati and Federico M. Stefanini. ISOcodes ISO language, territory, currency, script and character codes. Provides ISO 639 language codes, ISO 3166 territory codes, ISO 4217 currency codes, ISO 15924 script codes, and the ISO 8859 and ISO 10646 character codes as well ISSN 1609-3631 Vol. 8/1, May 2008 as the Unicode data table. By Christian Buchta and Kurt Hornik. Iso Functions to perform isotonic regression. Does linear order and unimodal order isotonic regression. By Rolf Turner. JM Shared parameter models for the joint modeling of longitudinal and time-to-event data. By Dimitris Rizopoulos. MCPAN Multiple contrast tests and simultaneous confidence intervals based on normal approximation. With implementations for binomial proportions in a 2 × k setting (risk difference and odds ratio), poly-3-adjusted tumor rates, and multiple comparisons of biodiversity indices. Approximative power calculation for multiple contrast tests of binomial proportions. By Frank Schaarschmidt, Daniel Gerhard, and Martin Sill. MCPMod Design and analysis of dose-finding studies. Implements a methodology for doseresponse analysis that combines aspects of multiple comparison procedures and modeling approaches (Bretz, Pinheiro and Branson, 2005, Biometrics 61, 738–748). The package provides tools for the analysis of dose finding trials as well as a variety of tools necessary to plan a trial to be conducted with the MCPMod methodology. By Bjoern Bornkamp, Jose Pinheiro, and Frank Bretz. MultEq Tests and confidence intervals for comparing two treatments when there is more than one primary response variable (endpoint) given. The step-up procedure of Quan et al. (2001) is both applied for differences and extended to ratios of means of normally distributed data. A related single-step procedure is also available. By Mario Hasler. PARccs Estimation of partial attributable risks (PAR) from case-control data, with corresponding percentile or BCa confidence intervals. By Christiane Raemsch. PASWR Data and functions for the book “Probability and Statistics with R” by M. D. Ugarte, A. F. Militino, and A. T. Arnholt (2008, Chapman & Hall/CRC). By Alan T. Arnholt. Peaks Spectrum manipulation: background estimation, Markov smoothing, deconvolution and peaks search functions. Ported from ROOT/TSpectrum class. By Miroslav Morhac. PwrGSD Tools to compute power in a group sequential design. SimPwrGSD C-kernel is a simulation routine that is similar in spirit to ‘dssp2.f’ by Gu and Lai, but with major improvements. AsyPwrGSD has exactly the same R News 61 range of application as SimPwrGSD but uses asymptotic methods and runs much faster. By Grant Izmirlian. QuantPsyc Quantitative psychology tools. Contains functions useful for data screening, testing moderation, mediation and estimating power. By Thomas D. Fletcher. R.methodsS3 Methods that simplify the setup of S3 generic functions and S3 methods. Major effort has been made in making definition of methods as simple as possible with a minimum of maintenance for package developers. For example, generic functions are created automatically, if missing, and name conflict are automatically solved, if possible. The method setMethodS3() is a good start for those who in the future want to migrate to S4. This is a crossplatform package implemented in pure R and is generating standard S3 methods. By Henrik Bengtsson. RExcelInstaller Integration of R and Excel (use R in Excel, read/write XLS files). RExcel, an add-in for MS Excel on MS Windows, allows to transfer data between R and Excel, writing VBA macros using R as a library for Excel, and calling R functions as worksheet function in Excel. RExcel integrates nicely with R Commander (Rcmdr). This R package installs the Excel addin for Excel versions from 2000 to 2007. It only works on MS Windows. By Erich Neuwirth, with contributions by Richard Heiberger and Jurgen Volkering. RFreak An R interface to a modified version of the Free Evolutionary Algorithm Kit FrEAK (http: //sourceforge.net/projects/freak427/), a toolkit written in Java to design and analyze evolutionary algorithms. Both the R interface an extended version of FrEAK are contained in the RFreak package. By Robin Nunkesser. RSEIS Tools for seismic time series analysis via spectrum analysis, wavelet transforms, particle motion, and hodograms. By Jonathan M. Lees. RSeqMeth Package for analysis of Sequenom EpiTYPER Data. By Aaron Statham. RTOMO Visualization for seismic tomography. Plots tomographic images, and allows one to interact and query three-dimensional tomographic models. Vertical cross-sectional cuts can be extracted by mouse click. Geographic information can be added easily. By Jonathan M. Lees. RankAggreg Performs aggregation of ordered lists based on the ranks using three different algorithms: Cross-Entropy Monte Carlo algorithm, ISSN 1609-3631 Vol. 8/1, May 2008 Genetic algorithm, and a brute force algorithm (for small problems). By Vasyl Pihur, Somnath Datta, and Susmita Datta. RcmdrPlugin.Export Graphically export objects to LATEX or HTML. This package provides facilities to graphically export Rcmdr output to LATEX or HTML code. Essentially, at the moment, the plug-in is a graphical front-end to xtable(). It is intended to (1) facilitate exporting Rcmdr output to formats other than ASCII text and (2) provide R novices with an easy to use, easy to access reference on exporting R objects to formats suited for printed output. By Liviu Andronic. RcmdrPlugin.IPSUR Accompanies “Introduction to Probability and Statistics Using R” by G. Andy Chang and G. Jay Kerns (in progress). Contributes functions unique to the book as well as specific configuration and selected functionality to the R Commander by John Fox. By G. Jay Kerns, with contributions by Theophilius Boye and Tyler Drombosky, adapted from the work of John Fox et al. RcmdrPlugin.epack Rcmdr plugin for time series. By Erin Hodgess. Rcplex R interface to CPLEX solvers for linear, quadratic, and (linear and quadratic) mixed integer programs. A working installation of CPLEX is required. Support for Windows platforms is currently not available. By Hector Corrada Bravo. Rglpk R interface to the GNU Linear Programing Kit (GLPK). GLPK is open source software for solving large-scale linear programming (LP), mixed integer linear programming (MILP) and other related problems. By Kurt Hornik and Stefan Theussl. Rsymphony An R interface to the SYMPHONY MILP solver (version 5.1.7). By Reinhard Harter, Kurt Hornik and Stefan Theussl. SMC Sequential Monte Carlo (SMC) Algorithm, and functions for particle filtering and auxiliary particle filtering. By Gopi Goswami. SyNet Inference and analysis of sympatry networks. Infers sympatry matrices from distributional data and analyzes them in order to identify groups of species cohesively sympatric. By Daniel A. Dos Santos. TSA Functions and data sets detailed in the book “Time Series Analysis with Applications in R (second edition)” by Jonathan Cryer and KungSik Chan. By Kung-Sik Chan. R News 62 TSHRC Two-stage procedure for comparing hazard rate functions which may or may not cross each other. By Jun Sheng, Peihua Qiu, and Charles J. Geyer. VIM Visualization and Imputation of Missing values. Can be used for exploring the data and the structure of the missing values. Depending on this structure, the tool can be helpful for identifying the mechanism generating the missings. A graphical user interface allows an easy handling of the implemented plot methods. By Matthias Templ. WINRPACK Reads in WIN pickfile and waveform file, prepares data for RSEIS. By Jonathan M. Lees. XReg Implements extreme regression estimation as described in LeBlanc, Moon and Kooperberg (2006, Biostatistics 7, 71–84). By Michael LeBlanc. adk Anderson-Darling K-sample test and combinations of such tests. By Fritz Scholz. anacor Simple and canonical correspondence analysis. Performs simple correspondence analysis (CA) on a two-way frequency table (with missings) by means of SVD. Different scaling methods (standard, centroid, Benzecri, Goodman) as well as various plots including confidence ellipsoids are provided. By Jan de Leeuw and Patrick Mair. anapuce Functions for normalization, differential analysis of microarray data and others functions implementing recent methods developed by the Statistic and Genome Team from UMR 518 AgroParisTech/INRA Appl. Math. Comput. Sc. By J. Aubert. backfitRichards Computation and plotting of backfitted independent values of Richards curves. By Jens Henrik Badsberg. bentcableAR Bent-Cable regression for independent data or auto-regressive time series. The bent cable (linear-quadratic-linear) generalizes the broken stick (linear-linear), which is also handled by this package. By Grace Chiu. biclust BiCluster Algorithms. The main function biclust() provides several algorithms to find biclusters in two-dimensional data: Cheng and Church, Spectral, Plaid Model, Xmotifs and Bimax. In addition, the package provides methods for data preprocessing (normalization and discretization), visualization, and validation of bicluster solutions. By Sebastian Kaiser, Rodrigo Santamaria, Roberto Theron, Luis Quintales and Friedrich Leisch. ISSN 1609-3631 Vol. 8/1, May 2008 63 bifactorial Global and multiple inferences for given bi- and trifactorial clinical trial designs using bootstrap methods and a classical approach. By Peter Frommolt. be smooth and strictly monotone. Also features percentile estimation for dose-response experiments (e.g., ED50 estimation of a medication) using CIR. By Assaf P. Oron. bipartite Visualizes bipartite networks and calculates some ecological indices. By Carsten F. Dormann and Bernd Gruber, with additional code from Jochen Fruend, based on the C-code developed by Nils Bluethgen. compHclust Performs the complementary hierarchical clustering procedure and returns X 0 (the expected residual matrix), and a vector of the relative gene importance. By Gen Nowak and Robert Tibshirani. birch Dealing with very large data sets using BIRCH. Provides an implementation of the algorithms described in Zhang et al. (1997), and provides functions for creating CF-trees, along with algorithms for dealing with some combinatorial problems, such as covMcd and ltsReg. It is very well suited for dealing with very large data sets, and does not require that the data can fit in physical memory. By Justin Harrington and Matias Salibian-Barrera. contfrac Various utilities for evaluating continued fractions. By Robin K. S. Hankin. brglm Bias-reduction in binomial-response GLMs. Fit binomial-response GLMs using either a modified-score approach to bias-reduction or maximum penalized likelihood where penalization is by Jeffreys invariant prior. Fitting takes place by iteratively fitting a local GLM on a pseudo-data representation. The interface is essentially the same as glm. More flexibility is provided by the fact that custom pseudodata representations can be specified and used for model fitting. Functions are provided for the construction of confidence intervals for the bias-reduced estimates. By Ioannis Kosmidis. bspec Bayesian inference on the (discrete) power spectrum of time series. By Christian Roever. bvls An R interface to the Stark-Parker algorithm for bounded-variable least squares. By Katharine M. Mullen. candisc Functions for computing and graphing canonical discriminant analyses. By Michael Friendly and John Fox. cheb Discrete linear Chebyshev approximation. By Jan de Leeuw. chemometrics An R companion to the book “Introduction to Multivariate Statistical Analysis in Chemometrics” by K. Varmuza and P. Filzmoser (CRC Press). By P. Filzmoser and K. Varmuza. cir Nonparametric estimation of monotone functions via isotonic regression and centered isotonic regression. Provides the well-known isotonic regression (IR) algorithm and an improvement called Centered Isotonic Regression (CIR) for the case the true function is known to R News corrperm Three permutation tests of correlation useful when there are repeated measurements. By Douglas M. Potter. crank Functions for completing and recalculating rankings. By Jim Lemon. degreenet Likelihood-based inference for skewed count distributions used in network modeling. Part of the “statnet” suite of packages for network analysis. By Mark S. Handcock. depmixS4 Fit latent (hidden) Markov models on mixed categorical and continuous (time series) data, otherwise known as dependent mixture models. By Ingmar Visser and Maarten Speekenbrink. diagram Visualization of simple graphs (networks) based on a transition matrix, utilities to plot flow diagrams and visualize webs, and more. Support for the book “A guide to ecological modelling” by Karline Soetaert and Peter Herman (in preparation). By Karline Soetaert. dynamicTreeCut Methods for detection of clusters in hierarchical clustering dendrograms. By Peter Langfelder and Bin Zhang, with contributions from Steve Horvath. emu Provides an interface to the Emu speech database system and many special purpose functions for display and analysis of speech data. By Jonathan Harrington and others. epiR Functions for analyzing epidemiological data. Contains functions for directly and indirectly adjusting measures of disease frequency, quantifying measures of association on the basis of single or multiple strata of count data presented in a contingency table, and computing confidence intervals around incidence risk and incidence rate estimates. Miscellaneous functions for use in meta-analysis, diagnostic test interpretation, and sample size calculations. By Mark Stevenson with contributions from Telmo Nunes, Javier Sanchez, and Ron Thornton. ISSN 1609-3631 Vol. 8/1, May 2008 ergm An integrated set of tools to analyze and simulate networks based on exponential-family random graph models (ERGM). Part of the “statnet” suite of packages for network analysis. By Mark S. Handcock, David R. Hunter, Carter T. Butts, Steven M. Goodreau, and Martina Morris. fpca A geometric approach to MLE for functional principal components. By Jie Peng and Debashis Paul. gRain Probability propagation in graphical independence networks, also known as probabilistic expert systems (which includes Bayesian networks as a special case). By Søren Højsgaard. geozoo Zoo of geometric objects, allowing for display in GGobi through the use of rggobi. By Barret Scloerke, with contributions from Dianne Cook and Hadley Wickham. getopt C-like getopt behavior. Use this with Rscript to write “#!” shebang scripts that accept short and long flags/options. By Allen Day. gibbs.met Naive Gibbs sampling with Metropolis steps. Provides two generic functions for performing Markov chain sampling in a naive way for a user-defined target distribution which involves only continuous variables. gibbs_met() performs Gibbs sampling with each 1-dimensional distribution sampled with Metropolis update using Gaussian proposal distribution centered at the previous state. met_gaussian updates the whole state with Metropolis method using independent Gaussian proposal distribution centered at the previous state. The sampling is carried out without considering any special tricks for improving efficiency. This package is aimed at only routine applications of MCMC in moderatedimensional problems. By Longhai Li. gmaps Extends the functionality of the maps package for the grid graphics system. This enables more advanced plots and more functionality. It also makes use of the grid structure to fix problems encountered with the traditional graphics system, such as resizing of graphs. By Andrew Redd. 64 Normal/Gaussian, Poisson or negative binomial distribution. By Olivier Briet. helloJavaWorld Hello Java World. A dummy package to demonstrate how to interface to a jar file that resides inside an R package. By Tobias Verbeke. hsmm Computation of Hidden Semi Markov Models. By Jan Bulla, Ingo Bulla, Oleg Nenadic. hydrogeo Groundwater data presentation and interpretation. Contains one function for drawing Piper (also called Piper-Henn) digrammes from water analysis for major ions. By Myles English. hypergeo The hypergeometric function for complex numbers. By Robin K. S. Hankin. ivivc A menu-driven package for in vitro-in vivo correlation (IVIVC) model building and model validation. By Hsin Ya Lee and Yung-Jin Lee. jit Enable just-in-time (JIT) compilation. The functions in this package are useful only under Ra and have no effect under R. See http://www. milbo.users.sonic.net/ra/index.html. By Stephen Milborrow. kerfdr Semi-parametric kernel-based approaches to local fdr estimations useful for the testing of multiple hypothesis (in large-scale genetic, genomic and post-genomic studies for instance). By M Guedj and G Nuel, with contributions from S. Robin and A. Celisse. knorm Knorm correlations between genes (or probes) from microarray data obtained across multiple biologically interrelated experiments. The Knorm correlation adjusts for experiment dependencies (correlations) and reduces to the Pearson coefficient when experiment dependencies are absent. The Knorm estimation approach can be generally applicable to obtain between-row correlations from data matrices with two-way dependencies. By Siew-Leng Teng. goalprog Functions to solve weighted and lexicographical goal programming problems as specified by Lee (1972) and Ignizio (1976). By Frederick Novomestky. lago An efficient kernel algorithm for rare target detection and unbalanced classification. LAGO is a kernel method much like the SVM, except that it is constructed without the use of any iterative optimization procedure and hence very efficient (Technometrics 48, 193– 205; The American Statistician 62, 97–109, Section 4.2). By Alexandra Laflamme-Sanders, Wanhua Su, and Mu Zhu. gsarima Functions for Generalized SARIMA time series simulation. Write SARIMA models in (finite) AR representation and simulate generalized multiplicative seasonal autoregressive moving average (time) series with latentnetHRT Latent position and cluster models for statistical networks. This package implements the original specification in Handcock, Raftery and Tantrum (2007) and corresponds to version 0.7 of the original latentnet. The R News ISSN 1609-3631 Vol. 8/1, May 2008 current package latentnet implements the new specification in Krivitsky and Handcock (2008), and represents a substantial rewrite of the original package. Part of the “statnet” suite of packages for network analysis. By Mark S. Handcock, Jeremy Tantrum, Susan Shortreed, and Peter Hoff. limSolve Solving linear inverse models. Functions that (1) Find the minimum/maximum of a linear or quadratic function: min or max( f ( x)), where f(x) = || Ax − b||2 or f ( x) = ∑ ai xi subject to equality constraints Ex = f and/or inequality constraints Gx >= h. (2) Sample an under-determined or over-determined system Ex = f subject to Gx >= h, and if applicable Ax = b. (3) Solve a linear system Ax = B for the unknown x. Includes banded and tridiagonal linear systems. The package calls Fortran functions from LINPACK. By Karline Soetaert, Karel Van den Meersche, and Dick van Oevelen. lnMLE Maximum likelihood estimation of the Logistic-Normal model for clustered binary data. S original by Patrick Heagerty, R port by Bryan Comstock. locpol Local polynomial regression. By Jorge Luis Ojeda Cabrera. logregperm A permutation test for inference in logistic regression. The procedure is useful when parameter estimates in ordinary logistic regression fail to converge or are unreliable due to small sample size, or when the conditioning in exact conditional logistic regression restricts the sample space too severely. By Douglas M. Potter. marginTree Margin trees for high-dimensional classification, useful for more than 2 classes. By R. Tibshirani. maxLik Tools for Maximum Likelihood Estimation. By Ott Toomet and Arne Henningsen. minet Mutual Information NEtwork Inference. Implements various algorithms for inferring mutual information networks from data. All the algorithms compute the mutual information matrix in order to infer a network. Several mutual information estimators are implemented. By P. E. Meyer, F. Lafitte, and G. Bontempi. mixdist Contains functions for fitting finite mixture distribution models to grouped data and conditional data by the method of maximum likelihood using a combination of a Newton-type algorithm and the EM algorithm. By Peter Macdonald, with contributions from Juan Du. R News 65 moduleColor Basic module functions. Methods for color labeling, calculation of eigengenes, merging of closely related modules. By Peter Langfelder and Steve Horvath. mombf This package computes moment and inverse moment Bayes factors for linear models, and approximate Bayes factors for GLM and situations having a statistic which is asymptotically normally distributed and sufficient. Routines to evaluate prior densities, distribution functions, quantiles and modes are included. By David Rossell. moonsun A collection of basic astronomical routines for R based on “Practical astronomy with your calculator” by Peter Duffet-Smith. By Lukasz Komsta. msProcess Tools for protein mass spectra processing including data preparation, denoising, noise estimation, baseline correction, intensity normalization, peak detection, peak alignment, peak quantification, and various functionalities for data ingestion/conversion, mass calibration, data quality assessment, and protein mass spectra simulation. Also provides auxiliary tools for data representation, data visualization, and pipeline processing history recording and retrieval. By Lixin Gong, William Constantine, and Alex Chen. multipol Various utilities to manipulate multivariate polynomials. By Robin K. S. Hankin. mvna Computes the Nelson-Aalen estimator of the cumulative transition hazard for multistate models. By Arthur Allignol. ncf Functions for analyzing spatial (cross-) covariance: the nonparametric (cross-) covariance, the spline correlogram, the nonparametric phase coherence function, and related. By Ottar N. Bjornstad. netmodels Provides a set of functions designed to help in the study of scale free and small world networks. These functions are high level abstractions of the functions provided by the igraph package. By Domingo Vargas. networksis Simulate bipartite graphs with fixed marginals through sequential importance sampling, with the degrees of the nodes fixed and specified. Part of the “statnet” suite of packages for network analysis. By Ryan Admiraal and Mark S. Handcock. neuralnet Training of neural networks using the Resilient Backpropagation with (Riedmiller, 1994) or without Weightbacktracking (Riedmiller, 1993) or the modified globally convergent version by Anastasiadis et. al. (2005). The package ISSN 1609-3631 Vol. 8/1, May 2008 allows flexible settings through custom choice of error and activation functions. Furthermore the calculation of generalized weights (Intrator & Intrator, 1993) is implemented. By Stefan Fritsch and Frauke Guenther, following earlier work by Marc Suling. nlrwr Data sets and functions for non-linear regression, supporting software for the book “Nonlinear regression with R”. By Christian Ritz. nls2 Non-linear regression with brute force. By G. Grothendieck. nlt A nondecimated lifting transform for signal denoising. By Marina Knight. nlts functions for (non)linear time series analysis. A core topic is order estimation through crossvalidation. By Ottar N. Bjornstad. noia Implementation of the Natural and Orthogonal InterAction (NOIA) model. The NOIA model, as described extensively in Alvarez-Castro & Carlborg (2007, Genetics 176(2):1151-1167), is a framework facilitating the estimation of genetic effects and genotype-to-phenotype maps. This package provides the basic tools to perform linear and multilinear regressions from real populations (provided the phenotype and the genotype of every individuals), estimating the genetic effects from different reference points, the genotypic values, and the decomposition of genetic variances in a multi-locus, 2 alleles system. By Arnaud Le Rouzic. normwn.test Normality and white noise testing, including omnibus univariate and multivariate normality tests. One variation allows for the possibility of weak dependence rather than independence in the variable(s). Also included is an univariate white noise test where the null hypothesis is for “white noise” rather than “strict white noise”. The package deals with similar approaches to testing as the nortest, moments, and mvnormtest packages in R. By Peter Wickham. npde Routines to compute normalized prediction distribution errors, a metric designed to evaluate non-linear mixed effect models such as those used in pharmacokinetics and pharmacodynamics. By Emmanuelle Comets, Karl Brendel and France Mentré. nplplot Plotting non-parametric LOD scores from multiple input files. By Nandita Mukhopadhyay and Daniel E. Weeks. obsSens Sensitivity analysis for observational studies. Observational studies are limited in that there could be an unmeasured variable related R News 66 to both the response variable and the primary predictor. If this unmeasured variable were included in the analysis it would change the relationship (possibly changing the conclusions). Sensitivity analysis is a way to see how much of a relationship needs to exist with the unmeasured variable before the conclusions change. This package provides tools for doing a sensitivity analysis for regression (linear, logistic, and Cox) style models. By Greg Snow. ofw Implements the stochastic meta algorithm called Optimal Feature Weighting for two multiclass classifiers, CART and SVM. By Kim-Anh Le Cao and Patrick Chabrier. openNLP An interface to openNLP (http:// opennlp.sourceforge.net/), a collection of natural language processing tools including a sentence detector, tokenizer, pos-tagger, shallow and full syntactic parser, and named-entity detector, using the Maxent Java package for training and using maximum entropy models. By Ingo Feinerer. openNLPmodels English and Spanish models for openNLP. By Ingo Feinerer. pga An ensemble method for variable selection by carrying out Darwinian evolution in parallel universes. PGA is an ensemble algorithm similar in spirit to AdaBoost and random forest. It can “boost up” the performance of “bad” selection criteria such as AIC and GCV. (Technometrics 48, 491–502; The American Statistician 62, 97–109, Section 4.3). By Dandi Qiao and Mu Zhu. phangorn Phylogenetic analysis in R (estimation of phylogenetic trees and networks using maximum likelihood, maximum parsimony, distance methods & Hadamard conjugation). By Klaus Schliep. plotSEMM Graphing nonlinear latent variable interactions in SEMM. Contains functions which generate the diagnostic plots proposed by Bauer (2005) to investigate nonlinear latent variable interactions in SEMM using LISREL output. By Bethany E. Kok, Jolynn Pek, Sonya Sterba and Dan Bauer. poilog Functions for obtaining the density, random deviates and maximum likelihood estimates of the Poisson log-normal distribution and the bivariate Poisson log-normal distribution. By Vidar Grøtan and Steinar Engen. prob Provides a framework for performing elementary probability calculations on finite sample spaces, which may be represented by data frames or lists. Functionality includes setting up sample spaces, counting tools, defining ISSN 1609-3631 Vol. 8/1, May 2008 probability spaces, performing set algebra, calculating probability and conditional probability, tools for simulation and checking the law of large numbers, adding random variables, and finding marginal distributions. By G. Jay Kerns. profileModel Tools that can be used to calculate, evaluate, plot and use for inference the profiles of arbitrary inference functions for arbitrary glmlike fitted models with linear predictors. By Ioannis Kosmidis. profr An alternative data structure and visual rendering for the profiling information generated by Rprof. By Hadley Wickham. qAnalyst Control charts for variables and attributes according to the book “Introduction to Statistical Quality Control” by Douglas C. Montgomery. Moreover, capability analysis for normal and non-normal distributions is implemented. By Andrea Spanó and Giorgio Spedicato. qpcR Model fitting, optimal model selection and calculation of various features that are essential in the analysis of quantitative real-time polymerase chain reaction (qPCR). By AndrejNikolai Spiess and Christian Ritz. r2lUniv R to LATEX Univariate. Performs some basic analysis and generate the corresponding LATEX code. The basic analysis depends of the variable type (nominal, ordinal, discrete, continuous). By Christophe Genolini. 67 scrime Tools for the analysis of high-dimensional data developed/implemented at the group “Statistical Complexity Reduction In Molecular Epidemiology” (SCRIME). Main focus is on SNP data, but most of the functions can also be applied to other types of categorical data. By Holger Schwender and Arno Fritsch. segclust Segmentation and segmentation/clustering. Corresponds to the implementation of the statistical model described in Picard et. al., “A segmentation/clustering model for the analysis of array CGH data” (2007, Biometrics, 63(3)). Segmentation functions are also available (from Picard et al., “A statistical approach for array CGH data analysis” (2005, BMC Bioinformatics 11;6:27)). By Franck Picard. shape Plotting functions for creating graphical shapes such as ellipses, circles, cylinders, arrows, and more. Support for the book “A guide to ecological modelling” by Karline Soetaert and Peter Herman (in preparation). By Karline Soetaert. siar Stable Isotope Analysis in R. This package takes data on organism isotopes and fits a Bayesian model to their dietary habits based upon a Gaussian likelihood with a mixture Dirichletdistributed prior on the mean. By Andrew Parnell. similarityRichards Computing and plotting of values for similarity of backfitted independent values of Richards curves. By Jens Henrik Badsberg. randomLCA Fits random effects latent class models, as well as standard latent class models. By Ken Beath. space Partial correlation estimation with joint sparse regression model. By Jie Peng, Pei Wang, Nengfeng Zhou, and Ji Zhu. richards Fit Richards curves. By Jens Henrik Badsberg. stab A menu-driven package for data analysis of drug stability based on ICH guideline (such as estimation of shelf-life from a 3-batch profile.). By Hsin-ya Lee and Yung-jin Lee. risksetROC Compute time-dependent incident/dynamic accuracy measures (ROC curve, AUC, integrated AUC) from censored survival data under proportional or non-proportional hazard assumption of Heagerty & Zheng (2005, Biometrics 61:1, 92–105). By Patrick J. Heagerty, packaging by Paramita Saha. statnet An integrated set of tools for the representation, visualization, analysis and simulation of network data. By Mark S. Handcock, David R. Hunter, Carter T. Butts, Steven M. Goodreau, Martina Morris. robfilter A set of functions to filter time series based on concepts from robust statistics. By Roland Fried and Karen Schettlinger. subplex The subplex algorithm for unconstrained optimization, developed by Tom Rowan. By Aaron A. King, Rick Reeves. s20x Stats 20x functions. By Andrew Balemi, James Curran, Brant Deppa, Mike Forster, Michael Maia, and Chris Wild. survivalROC Compute time-dependent ROC curve from censored survival data using KaplanMeier (KM) or Nearest Neighbor Estimation (NNE) method of Heagerty, Lumley & Pepe (2000, Biometrics 56:2, 337–344). By Patrick J. Heagerty, packaging by Paramita Saha. sampleSelection Estimation of sample selection models. By Arne Henningsen and Ott Toomet. R News ISSN 1609-3631 Vol. 8/1, May 2008 torus Torus algorithm for quasi random number generation (for Van Der Corput lowdiscrepancy sequences, use fOptions from Rmetrics). Also implements a general linear congruential pseudo random generator (such as Park Miller) to make comparison with Mersenne Twister (default in R) and the Torus algorithm. By Christophe Dutang and Thibault Marchal. tpr Regression models for temporal process responses with time-varying coefficient. By Jun Yan. xts Extensible Time Series. Provide for uniform handling of R’s different time-based data classes by extending zoo, maximizing native format information preservation and allowing for user level customization and extension, while simplifying cross-class interoperability. By Jeffrey A. Ryan and Josh M. Ulrich. yaml Methods to convert R to YAML and back, implementing the Syck YAML parser (http: //www.whytheluckystiff.net/syck) for R. By Jeremy Stephens. R News 68 Other changes • New task views Optimization (packages which offer facilities for solving optimization problems, by Stefan Theussl) and ExperimentalDesign (packages for experimental design and analysis of data from experiments, by Ulrike Groemping). • Packages JLLprod, butler, elasticnet, epsi, gtkDevice, km.ci, ncvar, riv, rpart.permutation, rsbml, taskPR, treeglia, vardiag and zicounts were moved to the Archive. • Package CPGchron was moved to the Archive (replaced by Bchron). • Package IPSUR was moved to the Archive (replaced by RcmdrPlugin.IPSUR). • Package gRcox was renamed to gRc. • Package pwt was re-added to CRAN. Kurt Hornik Wirtschaftsuniversität Wien, Austria [email protected] ISSN 1609-3631 Vol. 8/1, May 2008 69 News from the Bioconductor Project by the Bioconductor Team Program in Computational Biology Fred Hutchinson Cancer Research Center We are pleased to announce Bioconductor 2.2, released on May 1, 2008. Bioconductor 2.2 is compatible with R 2.7.0, and consists of 260 package. The release includes 35 new packages, and many improvements to existing packages. New packages New packages address a diversity of topics in highthroughput genomic analysis. Some highlights include: Advanced statistical methods for analysis ranging from probe-level modeling (e.g., plw) through gene set and other functional profiling (e.g., GSEAlm, goProfiles). New problem domains addressed by packages such as snpMatrix, offering classes and methods to compactly summarize large single nucleotide polymorphism data sets. Integration with third-party software including the GEOmetadb package for accessing GEO metadata and AffyCompatible and additions to affxparser for accessing microarray vendor resources. Graphical tools in packages such as GenomeGraphs and rtracklayer effectively visualize complex data in an appropriate genomic context. New technical approaches in packages such as affyPara and xps explore the significant computational burden of large-scale analysis. The release also includes packages to support two forthcoming books: by Gentleman (2008), about using R for bioinformatics; and by Hahne et al. (2008), presenting Bioconductor case studies. Annotations The ‘annotation’ packages in Bioconductor have experienced significant change. An annotation package contains very useful biological information about microarray probes and the genes they are meant to interrogate. Previously, these packages used an R environment to provide a simple key-value association between the probes and their annotations. This release of Bioconductor sees widened use of SQLite-based annotation packages, and SQLitebased annotations can now be used instead of most environment-based packages. SQLite-based packages offer several attractive features, including more efficient use of memory, R News representation of more complicated data structures, and flexible queries across annotations (SQL tables). Most users access these new annotations using familiar functions such as mget. One useful new function is the revmap function, which has the effect (but not the overhead!) of reversing the direction of the map (e.g., mapping from gene symbol to probe identifier, instead of the other way around). Advanced users can write SQL queries directly. The scope of annotation packages continues to expand, with a more extensive ‘organism’-centric (e.g., org.Hs.eg.db, representing Homo sapiens) annotations. New ’homology’ packages summarize the InParanoid data base, allowing between-species identification of homologous genes. Other developments and directions Bioconductor package authors continue to have access to a very effective package repository and build system. All packages are maintained under subversion version control, with the latest version of the package built each day on a diversity of computer architectures. Developers can access detailed information on the success of their package builds on both release and development platforms (e.g., http: //bioconductor.org/checkResults/). Users access successfully built packages using the biocLite function, which identifies the appropriate package for their version of R. New Bioconductor packages contributed from our active user / developer base now receive both technical and scientific reviews. This helps package authors produce quality packages, and benefits users by providing a more robust software experience. The 2.3 release of Bioconductor is scheduled for October 2008. We expect this to be a vibrant release cycle. High-throughput genomic research is a dynamic and exciting field. It is hard to predict what surprising packages are in store for future Bioconductor releases. We anticipate continued integration with diverse data sources, use of R’s advanced graphics abilities, and implementation of cutting edge research algorithms for the benefit of all Bioconductor users. Short-read DNA resequencing technologies are one area where growth seems almost certain. Bibliography R. Gentleman. Bioinformatics with R. Chapman & Hall/CRC, Boca Raton, FL, 2008. ISBN 1-42006367-7. F. Hahne, W. Huber, R. Gentleman, and S. Falcon. Bioconductor Case Studies. Springer, 2008. ISSN 1609-3631 Vol. 8/1, May 2008 70 Forthcoming Events: useR! 2008 The international R user conference ‘useR! 2008’ will take place at the Technische Universität Dortmund, Dortmund, Germany, August 12-14, 2008. This world-wide meeting of the R user community will focus on • R as the ‘lingua franca’ of data analysis and statistical computing; • providing a platform for R users to discuss and exchange ideas about how R can be used to do statistical computations, data analysis, visualization and exciting applications in various fields; • giving an overview of the new features of the rapidly evolving R project. The program comprises invited lectures, usercontributed sessions and pre-conference tutorials. Invited Lectures R has become the standard computing engine in more and more disciplines, both in academia and the business world. How R is used in different areas will be presented in invited lectures addressing hot topics. Speakers will include • Peter Bühlmann: Computationally Tractable Methods for High-Dimensional Data • John Fox and Kurt Hornik: The Past, Present, and Future of the R Project, a double-feature presentation including Social Organization of the R Project (John Fox), Development in the R Project (Kurt Hornik) • Andrew Gelman: Bayesian Generalized Linear Models and an Appropriate Default Prior • Gary King: The Dataverse Network • Duncan Murdoch: Package Development in Windows • Jean Thioulouse: Multivariate Data Analysis in Microbial Ecology – New Skin for the Old Ceremony • Graham J. Williams: Deploying Data Mining in Government – Experiences With R/Rattle User-contributed Sessions The sessions will be a platform to bring together R users, contributors, package maintainers and developers in the S spirit that ‘users are developers’. People from different fields will show us how they solve problems with R in fascinating applications. The scientific program is organized by members of the program committee, including Micah Altman, Roger Bivand, Peter Dalgaard, Jan de Leeuw, Ramón Díaz-Uriarte, Spencer Graves, Leonhard Held, Torsten Hothorn, François Husson, Christian Kleiber, Friedrich Leisch, Andy Liaw, Martin Mächler, Kate Mullen, Ei-ji R News Nakama, Thomas Petzoldt, Martin Theus, and Heather Turner, and will cover topics such as • • • • • • • • • • • • • • • • • • Applied Statistics & Biostatistics Bayesian Statistics Bioinformatics Chemometrics and Computational Physics Data Mining Econometrics & Finance Environmetrics & Ecological Modeling High Performance Computing Machine Learning Marketing & Business Analytics Psychometrics Robust Statistics Sensometrics Spatial Statistics Statistics in the Social and Political Sciences Teaching Visualization & Graphics and many more Pre-conference Tutorials Before the start of the official program, half-day tutorials will be offered on Monday, August 11. In the morning: • Douglas Bates: Mixed Effects Models • Julie Josse, François Husson, Sébastien Lê: Exploratory Data Analysis • Martin Mächler, Elvezio Ronchetti: Introduction to Robust Statistics with R • Jim Porzak: Using R for Customer Segmentation • Stefan Rüping, Michael Mock, and Dennis Wegener: Distributed Data Analysis Using R • Jing Hua Zhao: Analysis of Complex Traits Using R: Case studies In the afternoon: • Karim Chine: Distributed R and Bioconductor for the Web • Dirk Eddelbuettel: An Introduction to HighPerformance R • Andrea S. Foulkes: Analysis of Complex Traits Using R: Statistical Applications • Virgilio Gómez-Rubio: Small Area Estimation with R • Frank E. Harrell, Jr.: Regression Modelling Strategies • Sébastien Lê, Julie Josse, François Husson: Multiway Data Analysis • Bernhard Pfaff : Analysis of Integrated and Cointegrated Time Series ISSN 1609-3631 Vol. 8/1, May 2008 More Information A web page offering more information on ‘useR! 2008’ as well as the registration form is available at: http://www.R-project.org/useR-2008/. 71 We hope to meet you in Dortmund! The organizing committee: Uwe Ligges, Achim Zeileis, Claus Weihs, Gerd Kopp, Friedrich Leisch, and Torsten Hothorn [email protected] R Foundation News by Kurt Hornik Donations and new members Donations Austrian Association for Statistical Computing Fabian Barth, Germany Dianne Cook, USA Yves DeVille, France Zubin Dowlaty, USA David Freedman, USA Minato Nakazawa, Japan New benefactors Paul von Eikeren, USA InterContinental Hotels Group, USA R News New supporting institutions European Bioinformatics Inst., UK New supporting members Simon Blomberg, Australia Yves DeVille, France Adrian A. Dragulescu, USA Owe Jessen, Germany Luca La Rocca, Italy Sam Lin, New Zealand Chris Moriatity, USA Nathan Pellegrin, USA Peter Ruckdeschel, Germany Jitao David Zhang, Germany Kurt Hornik Wirtschaftsuniversität Wien, Austria [email protected] ISSN 1609-3631 Vol. 8/1, May 2008 72 R News Referees 2007 by John Fox • Thomas Kneib R News articles are peer-reviewed. The editorial board members would like to take the opportunity to thank all referees who read and commented on submitted manuscripts during the previous year. Much of the quality of R News publications is due to their invaluable and timely service. Thank you! • Anthony Lancaster • Murray Aitkin • Doug Bates • Adrian Bowman • Patrick Burns • Peter Dalgaard • Philip Dixon • Dirk Eddelbuettel • Brian Everitt • Thomas Gerds • B.J. Harshfield • Sigbert Klinke R News • Duncan Temple Lang • Thomas Lumley • Martin Maechler • Brian McArdle • Georges Monette • Paul Murrell • Martyn Plummer • Christina Rabe • Alec Stephenson • Carolin Strobl • Simon Urbanek • Keith Worsley John Fox McMaster University, Canada [email protected] ISSN 1609-3631 Vol. 8/1, May 2008 Editor-in-Chief: John Fox Department of Sociology McMaster University 1280 Main Street West Hamilton, Ontario Canada L8S 4M4 Editorial Board: Vincent Carey and Peter Dalgaard. Editor Programmer’s Niche: Bill Venables Editor Help Desk: Uwe Ligges Email of editors and editorial board: firstname.lastname @R-project.org R News 73 R News is a publication of the R Foundation for Statistical Computing. Communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor and all other submissions to the editor-in-chief or another member of the editorial board. More detailed submission instructions can be found on the R homepage. R Project Homepage: http://www.R-project.org/ This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/ ISSN 1609-3631

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Download PDF

advertisement