A thesis submitted to
The University of Birmingham
for the degree of
Department of English
School of Humanities
The University of Birmingham
December 2004
This thesis puts forward a specialized, functional grammar of cause and effect within
the sub-genre of biomedical research articles. Building on research into the local
grammars of dictionary definitions and evaluation, the thesis describes the application
of a corpus-driven methodology to the description of the principal lexical grammatical
patterns which underpin causation in scientific writing. The source of data is the 2
million-word Halmstad Biomedical Corpus constructed from 589 on-line research
articles published since 1997. These articles were sampled in accordance with a
standard library classification system across the broad spectrum of the biomedical
research literature. On the basis of lexical grammatical patterns identified in the
corpus, a total of five functional sub-types of causation are put forward. The local
grammar itself is a description of these sub-types based on the Hallidayian notion of
system along the syntagm coupled with the identification of the paradigmatic contents
of these systems as a closed set of 37 semantic categories specific to the biomedical
domain. A preliminary evaluation of the grammar is then offered in terms of
hand-parsing experiments using a test corpus. Finally, potential NLP applications of the
grammar are described in terms of on-line information extraction, ontology building
and text summary.
To Karin, for putting up with it all
To Berit, for your support
For Dad who missed everything
This PhD thesis would never have been completed without the generous and
unstinting support of my wife Karin and our three children Bill, Abigail and Måns.
Karin has kept the family going through a number of difficult years, cheerfully putting
up with my prolonged absences in addition to accepting the financial strictures which
this self-financed project has involved. My mother-in-law, Berit Johansson, has also
been a great source of support and inspiration. I would also like to thank my mother,
Joy Allen, for her hospitality during the summers in Birmingham.
On the academic side, I would like to acknowledge the help of a number of teachers
and colleagues in the pursuit of my academic career over the years. Firstly, my
supervisor, Dr Geoff Barnbrook, has been a source of helpful comments and friendly
advice during the summers in Birmingham and also via email. I would also like to
thank Dr Geoffrey Williams, Université de Bretagne Sud, who helped me with the
SGML/XML formatting of the corpus files. Dr Gaëtanelle Gilquin, Université
catholique de Louvain, and Gudrun Rawoens, University of Ghent, provided
stimulating exchanges of ideas on causation following on from the 2001 and 2004
ICAME conferences. Here in Sweden, my gratitude is also expressed to Ulla Brodow,
University of Karlstad, who encouraged me greatly in pursuit of my interests in
computers and language.
Table of contents
1. Causation, science, local grammar
1.1 Introduction
1.2 Aims
1.3 Why parse causative sentences in scientific articles?
1.4 Previous research
1.4.1 Overview
1.4.2 Causation and scientific explanations of the natural world
1.4.3 The language of science
1.4.4 The place of causation in linguistics
1.5 Local grammars
1.5.1 Preliminaries
1.5.2 A local grammar of dictionary definition sentences
1.5.3 A local grammar of evaluation
1.5.4 A local grammar of causation
1.6 Objectives and overall format
2. Biomedical sublanguages: from analysis to application
2.1 Preliminaries
2.2 Distributional sublanguages in biomedicine
2.2.1 Dependency relations
2.2.2 Sublanguages and paraphrastic relations
2.2.3 Inequalities of likelihood
2.3 A survey of biomedical sublanguages
2.3.1 Background
2.3.2 Clinical sublanguages
2.3.3 A biomolecular sublanguage
2.3.4 Clinical and biomedical sublanguages compared
2.4 Natural language processing and biomedicine
2.4.1 General
2.4.2 Applications in the biomedical domain
2.5 Information retrieval and information extraction
2.5.1 General
2.5.2 Information retrieval
2.5.3 Information extraction
2.6 Sublanguages and local grammars
2.7 Summary
3. Methodology
3.1 Introduction
3.2 Causation and the specialist corpus
3.2.1 Why a specialist corpus?
3.2.2 The genre approach to small corpus design
3.3 The HBC Pilot Corpus
3.4 From pilot corpus to final corpus
3.4.1 General
3.4.2 The ‘final’ corpus: specification and representativeness
3.4.3 Corpus composition and keywords
    External comparison
    Internal keyword comparisons across the subcorpora
3.5 Identifying causation in the biomedical RA
3.5.1 General
3.5.2 Semantic intuition
3.5.3 Non-factivity and hedging
3.5.4 Other possible ‘borderline’ cases
3.5.5 Summary
3.6 From definition to mark-up
3.7 Concordancing
3.8 Data storage
3.8.1 The pattern grammar notation
3.8.2 Presentational format
    ‘Lexical’ format
    ‘Pattern’ format
3.8.3 Limitations of the pattern grammar notation
3.9 Summary
4. The lexical patterns of cause and effect
4.1 Introduction
4.2 The lexis of causation
4.2.1 General
4.2.2 Frequency measures
4.3 The taxonomy
4.3.1 Outline
4.3.2 The pattern taxonomy
4.4 Verbal patterns
4.4.1 Overview
4.4.2 Simple verbal patterns
    Active patterns
    Passive patterns
4.4.3 Prepositional verb patterns
    Active patterns
    Passive patterns
4.4.4 Clausal complementation patterns
    Active patterns
    Passive patterns
4.5 Delexical patterns
4.5.1 Overview
4.5.2 Patterns with have + nominal group
4.5.3 Patterns with play + nominal group
4.6 Nominal patterns
4.6.1 Overview
4.6.2 Internal patterns within the nominal group
    Pre-modifying patterns
    Post-modifying patterns
4.6.3 External patterns
    v-link patterns
    Patterns with existential there
    Other patterns
4.7 Adjectival patterns
4.7.1 Overview
4.7.2 Meaning groups
4.8 Summary
5. From pattern to function: specifying the local grammar
5.1 Introduction
5.2 Theoretical background
5.2.1 Overview
5.2.2 Defining the scope of the grammar
5.2.3 Function and meaning
5.2.4 Paradigmatic relations in the grammar: system and choice
5.2.5 Syntagmatic relations: constituency and rank
5.3 Functional systems and
5.3.1 General
5.3.2 Top-level/clausal systems
    Cause and effect
    Hinge
    Hedge
    Source
    Appositive
    Instrument
    Circumstance
    Evaluator
5.3.3 Systems within the nominal group
    Pre-modifying systems: delimiter
    Delimiter (evaluative)
    Delimiter (classifier)
    Delimiter (causal)
    Scope
5.4 The semantic categories
5.4.1 Overview
5.4.2 The categories
5.4.3 Occurrence restrictions
5.4.4 Functional roles and grammatical parsimony
5.4.5 Summary
5.5 From categories to grammatical statement
5.5.1 Overview
5.5.2 Productive causation
    Active patterns
    Passive patterns
5.5.3 Parametric causation
5.5.4 Relational causation
    Relational causatives with ‘be’ and other copular verbs
    Delexical relational causatives
    Relational causatives and evaluative adjectives
5.5.5 Inferential causation
5.5.6 Existential causation
6. Evaluating the local grammar
6.1 General
6.2 The parsing process: an overview
6.3 The parsing of productive causatives
6.3.1 Theoretical aspects
6.3.2 An example from the test corpus
6.4 Evaluative criteria
6.5 Evaluative procedure
6.6 Discussion
6.6.1 Lexical coverage and pattern matching
6.6.2 Syntactic considerations
    Word order
    Discontinuous elements
    Verbs in phase
    Head categorization
    Multiple-embedding
6.6.3 Semantic categorization
    Definitions of causation
    Semantic classification and ontological representation
    Finer-grained subdivisions of categories
6.6.4 Textual aspects: the problem of anaphoric resolution
6.7 Summary
7. Applications of the local grammar
7.1 Preliminaries
7.2 Automatic ontology building in the genetic / biochemical domain
7.2.1 Overview
7.2.2 The Gene Ontology
7.3 Clinical domain
7.3.1 Overview
7.3.2 Emergent diseases: SARS
    Background
    Information coverage
    The causal profile for a disease outbreak
    Evaluation
7.3.3 Levodopa: an established therapy / treatment course and its side-effects
    Background
    Information coverage
7.3.4 Drug-resistance: anti-malaria drug
    Background
    Information coverage
7.3.5 Summary
7.4 Pedagogical applications of the grammar
7.5 Future research
7.6 Conclusion
1 Causation, science, local grammar
1.1 Introduction
This thesis describes a local grammar of causation with specific reference to the genre
of biomedical research articles. As specialized functional grammars of a language in
restricted textual domains, local grammars have potential applications in the
automatic parsing of natural languages, serving as a basis for information retrieval
and extraction. Arising partly out of the inadequacies of general or global grammars
as analytical frameworks for the automated parsing of unrestricted text, the concept of
a local grammar is inseparable from its utility in providing a linear representation of
the functional elements within semantically-restricted linguistic domains. Ultimately
this approach is derived from the pioneering contribution of Zellig Harris (1968,
1982) in the grammatical analysis of scientific sublanguages.
The local grammar described in this thesis is located firmly within the tradition of
systemic-functional linguistics and is based closely on the Hallidayian notion of
systemic-functional grammar (SFG) (Halliday 1985a). While fundamental principles
such as the notion of system, paradigmatic choice and category are inherited more or
less directly from this general language framework, the major meaning-based
category labels are essentially specific to the local grammar. The thesis should
therefore be seen as an application of Hallidayian principles to the analysis of
language in specialized domains with potential utility in the field of natural language
processing. Ultimately the thesis examines the extent to which a functional grammar
can be derived from a corpus-driven exploration of lexical grammatical patterns and
evaluates the efficacy of implementing such an approach in biomedical information
extraction.
Causation has been described in the philosophical literature as a fundamental axiom
and postulate of experimental science. The place of causal relations in an evolving
scientific epistemology has been debated by philosophers of science since Aristotle.
While it may be the case that some scholars have gone as far as denying the existence
of a unifying, deterministic theory of cause and effect offering a deeper explanation of
natural processes, causation has nevertheless retained its place as a dominant heuristic
in the post-Enlightenment construction of scientific knowledge.
In linguistics, the study of causation through a narrow focus on periphrastic causative
verbs (i.e. combinations of verbs such as cause, get, have and make with non-finite
clause complementation) initially provided an important (though subsequently
refuted) extension of semantic theory within the generative paradigm. These so-called
‘causative constructions’ have also proven to be a fertile testing ground for the
investigation of universal-typological similarities between languages. There have
been relatively few attempts, however, to describe semantic domains such as causation
and their lexical and grammatical expressions using corpus data, with a marked
dearth of corpus studies focusing on causation in restricted genres.
Descriptions of language emerging from large-scale computer-based corpus studies
since the 1980s have increasingly pointed towards the pervasiveness of phraseological
patterns centred on individual lexical items. Such a perspective blurs the traditional
dichotomy between a rule-based grammar and a separate lexicon. This distinction has
been a central tenet of the Chomskyan (and indeed a pre-Chomskyan structuralist)
orthodoxy which dominated linguistics prior to the advent of the electronic corpus.
Implicit in a phraseological perspective is the notion that meaning as realised through
lexis is communicatively prior to syntax and, as a corollary of this position, that
phraseological patterns centred on lexical items provide a fundamental
psycholinguistic basis for language production and reception. Crucial to the adoption
of this position is a definition of collocation drawing not only on corpus-based
statistical probabilities of lexical co-occurrence but also on lexicographical and more
recently discoursal perspectives. The investigation of lexical grammatical patterns
underpinning causation in a corpus of scientific research articles constitutes a major
part of this thesis and lays the groundwork for the exposition of the local grammar
and its functional elements.
In recent years there has been a trend within corpus linguistics towards the
construction of smaller corpora with more specific research objectives in mind. Small
corpora in the size range of 1-2 million words can be relatively easily assembled from
on-line sources, with Internet-based search engines permitting electronic text
sampling according to very specific search queries. The construction of specialized
corpora on the basis of such narrowly-defined criteria can therefore facilitate the
empirical investigation of lexis and grammar patterns within restricted semantic
domains akin to the sublanguage environments originally envisaged by Harris.
1.2 Aims
The work described in this thesis belongs broadly within the British neo-Firthian
tradition of applied linguistics. This tradition is rooted in the empirical exploration of
language phenomena as the products of everyday social interaction and textual usage
(Widdowson 2000). As the field has expanded from its language teaching origins to
encompass a variety of ‘real world’ linguistic problems, applied linguistics has come
to stress the primacy in linguistic description of attested observational data as opposed
to native-speaker intuition.
As mentioned previously in relation to phraseology, a second perspective which
emerges from the legacy of Firth is the prioritization given to meaning within
linguistic description. The central aim of this thesis is to put forward a specialized
functional grammar of causation specific to the biomedical domain which adheres as
closely as possible to the grammatical patternings of lexis in the text of scientific
research articles. Ultimately such a model should be applicable in turn to the
functional parsing of causative sentences with a view to potential uses in information
extraction. The raw data for the lexical grammar is drawn from a 2 million-word
specialized corpus of scientific research articles downloaded more or less in their
entirety (minus diagram and table caption text) from on-line sources. Texts have been
sampled using an established library
classification scheme to encompass as far as possible what is an extremely diversified
field of scientific research. A second stage in the descriptive process involves the
mapping of the lexical patterns identified onto the semantically-based categories of
the local grammar. Finally, the thesis also explores potential applications of the
grammar, primarily in information extraction from biomedical research articles.
1.3 Why parse causative sentences in scientific articles?
A cursory trawl through the on-line titles and abstracts of a major scientific article
database reveals the striking rhetorical centrality of causation in scientific text. The
example sentences [1-4] below were all retrieved from random on-line searches in the
biomedical domain (Science Direct Database at http://www.sciencedirect.com/),
covering a variety of sub-disciplines. Causative verbs linking
cause and effect nominal groups are underlined.
[1] The lanceolate hair rat phenotype results from a missense mutation in a calcium
coordinating site of the desmoglein 4 gene
Article title: Genomics 83 5 May 2004; 747-756
[2] Progressive liver fibrosis is the main cause of organ failure in chronic liver
diseases of any aetiology
Abstract: Digestive and Liver Disease 36 4:231-242 Apr
[3] A polydipsia screening program could minimize morbidity and mortality
associated with this fairly prevalent condition.
Abstract: Archives of Psychiatric Nursing 18 2:60-87 Apr
[4] The use of topographically guided PRK with the topographically supported
customized ablation method resulted in significant increases of UCVA and BSCVA
and improved corneal clarity in all patients
Abstract: Ophthalmology 111 3 458-462 Mar 2004
Even within the markedly condensed text of an article title or abstract, causal relations
are accorded a salience which points to their potential in the extraction of information
in scientific text. Given the hypothetico-deductive basis of the empirical research
article with its Introduction-Methodology-Results-Discussion rhetorical macrostructure,
causative clauses and clause complexes play a critical role in the distillation
of the explanatory essence of an experimental finding, a diagnostic cause, the effect of
a specific drug therapy or programme of treatment. The importance of causal
relationships in the achievement of rhetorical aims in biomedical articles can be
readily appreciated in the above examples. This importance is evident despite the quite
substantial conceptual, terminological and methodological differences between the
more process-orientated sub-fields of microbiology and genetics (examples [1-2]
above) and the patient-orientated clinical domain (examples [3-4]). Causation is
similarly prominent in the title of example [1]. In example [2] the essential finding of
the paper, that progressive liver fibrosis gives rise to organ failure, is presented
through a causal relationship expressed in the abstract. In the sub-field of psychiatric
nursing (example [3] above) causation encodes a positive assessment (albeit
modalized) of a major treatment of schizophrenic patients.
As these examples show, causal relations within the sub-genre are realised through a
diverse variety of lexical items and their collocationally-defined patterns far in excess
of the narrowly circumscribed and exhaustively studied periphrastic causative verbs.
There is no a priori listing in existence of these lexical items: the lexical reflexes of
causal relations can only be described empirically through extensive study of
causation using a specialized corpus.
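As a purely illustrative aside, the way a causal 'hinge' separates cause from effect in examples [1] and [2] can be sketched in a few lines of Python. This is a toy sketch under stated assumptions, not the local grammar described in this thesis: the hinge list, the regular expressions and the function name split_cause_effect are all hypothetical, and a fixed list of this kind is precisely what the corpus-driven approach sets out to replace with empirically derived patterns.

```python
import re

# Hypothetical hinge list, for illustration only; the thesis derives its
# causal lexis empirically from the corpus rather than from a fixed list.
# Each entry: (regex for the hinge, True if the cause precedes the hinge).
HINGES = [
    (r"\bresults? from\b", False),        # effect HINGE cause
    (r"\bresulted in\b", True),           # cause HINGE effect
    (r"\bis the main cause of\b", True),  # cause HINGE effect
]

def split_cause_effect(sentence):
    """Return (cause, hinge, effect) for the first matching hinge, else None."""
    for pattern, cause_first in HINGES:
        match = re.search(pattern, sentence)
        if match:
            left = sentence[:match.start()].strip()
            right = sentence[match.end():].strip().rstrip(".")
            cause, effect = (left, right) if cause_first else (right, left)
            return cause, match.group(0), effect
    return None

# Example sentences [1] and [2] from the text above.
example_1 = ("The lanceolate hair rat phenotype results from a missense "
             "mutation in a calcium coordinating site of the desmoglein 4 gene")
example_2 = ("Progressive liver fibrosis is the main cause of organ failure "
             "in chronic liver diseases of any aetiology")

for sentence in (example_1, example_2):
    print(split_cause_effect(sentence))
```

Note that the direction of the relation depends on the hinge: 'results from' places the effect first, whereas 'is the main cause of' places the cause first. A sentence such as example [3], where causation is carried by 'could minimize', falls outside this toy list altogether, which illustrates why the lexis has to be established from corpus evidence.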
If the achievement of rhetorical aims in scientific text hinges so critically on the
linguistic expression of cause and effect, the prospect is raised that domain-specific
linguistic/grammatical analysis of these logico-semantic connectors can ultimately
provide the basis for powerful automated tools in information-retrieval and extraction
with direct applications in biomedical informatics. The role of a specialized corpus is
important here both as the source of primary data for the grammatical model and as a
test-bed for applications in information extraction. In order to work on the
naturally-occurring language of scientific text, a grammatical model is needed which emerges
inductively as far as possible out of the data, with the minimal intrusion of
introspective pre-conceptions on the part of the researcher. This is essentially the
methodology of corpus-driven grammatical analysis and is the approach used in the
modelling of data in this project.
1.4 Previous research
1.4.1 Overview
In a series of previous papers, the theoretical background to the compilation of the
local grammar has been set out (Allen 1998; 2001a; 2001b; 2002a; 2002b). Briefly,
this work describes the notion of sublanguage and sublanguage grammars (Allen
2001a:1-9), descriptive overviews of the local grammars of definition, evaluation and
the original pilot study on causation (Allen 2001a:11-21) and the treatment of
causation in linguistics (Allen 2001b:3-15). In later articles (Allen 2002a; b), the
selective focus on causation within the language of science is justified as a prelude to
the construction of a corpus of biomedical research articles (RAs) which constitutes
the source of data for the grammar described in this project.
The place of causation in the history and philosophy of science is reviewed in Allen
(2002a:4-8) as the basis for a wider discussion of the linguistic and rhetorical
properties of the scientific research article both of which have a bearing on the lexical
grammatical encoding of causal relationships (Allen 2002a; b). The methodology of
corpus construction is described in Allen (2002a:19-27). In building a specialized
corpus, the notion of genre arising from the adoption of a discoursal rather than
terminographic perspective on scientific language has been highly influential (Swales
1990; Gledhill 2000). Such a perspective stresses the delineation of textual sub-fields
based on communities of researchers united by the common activity of textual
production and dissemination. In previous articles, one further methodological
consequence of the corpus-driven perspective on description is taken up: that of data
storage. The theoretical basis and practical utility of the lexical pattern notation
system for the storage of causal lexis is set out in Allen (2002b). This article also
describes the functional mapping process of local grammar compilation from the
databases of lexical patterns extracted manually from the corpus.
As a consequence of the discussion presented in previous articles, the theoretical and
methodological review of these areas will receive cursory treatment only in this
thesis, primarily in order to contextualize the entire project.
1.4.2 Causation and scientific explanations of the natural world
Although an in-depth philosophical treatment of causation is largely beyond the scope
of this thesis, a brief consideration of the place of causal explanation in the
philosophy of science serves to justify the scientific research article as both the source
of data in the development of the grammatical model and as the object of potential
parsing applications of the grammar. In Allen (2002a:4-8) it was shown that the
Aristotelian inductive-deductive method (with its subsequent scholastic refinements)
developed out of a need to derive explanatory frameworks in causal terms by
deduction from established axioms.
Aristotelian scientific explanations had to satisfy the requirements for the four causes:
the formal cause, the material cause, the efficient cause and the final or teleological
cause (see Allen 2002a: 6 for exemplification of these terms from the biomedical
domain). Following the rise of mechanical philosophy in the 18th Century, the notion
of teleological cause was increasingly marginalized in experimental science in favour
of the efficient cause, essentially the agent which gives rise to the causative process
(de Angelis 1973 cited in Norton 2003).
For modern philosophers of science, the a priori status of causation within scientific
epistemology has become increasingly problematic. Russell (1917:132) put it in these
terms:
All philosophers, of every school, imagine that causation is one of the
fundamental axioms or postulates of science, yet, oddly enough, in advanced sciences
such as gravitational astronomy, the word 'cause' never occurs…The law of causality,
I believe, like much that passes muster among philosophers, is a relic of a bygone age,
surviving like the monarchy, only because it is erroneously supposed to do no harm.
The problematic status of causation in modern science can be illustrated with regard
to two theories of causation, counterfactual causation and probabilistic causation. In
counterfactual terms, instead of saying that X causes Y, the causal relation is re-stated
in the form of a conditional: If X had not occurred, Y would not have occurred. The
theory of counterfactual causation has its origins in the empiricist philosophy of
Hume:
We may define a cause to be an object followed by another, and where all the
objects, similar to the first, are followed by objects similar to the second. Or, in other
words, where, if the first object had not been, the second never had existed.
Hume (1777, Section VII).
According to Hume, while it might be possible to observe the conjunctions or
associations between different phenomena perceived through the senses, this
conjunctive association was not the same as saying that phenomenon X is necessary or
deterministic for phenomenon Y. The only empirical knowledge of causation which
we can obtain is that of an association between two events. Hume’s theories have
frequently been referred to as regularity theories of causation, according to which
effects invariably follow associated causing events. However, there have been
problems with a counterfactual definition of cause and effect, most notably with the
status of the counterfactuals themselves.
One problem with the regularity theory of Hume is that there are abundant
examples from modern science where there is no deterministic inevitability that cause
X is followed by effect Y. Taking an example from the biomedical domain, it has been
estimated that only 10% of heavy smokers develop lung cancer (Journal of the
National Cancer Institute, March 19th 2003, http://www.cancer.gov). This and similar
observations have led to attempts to subsume causation within probability theory
(Pearl 2000). On this basis it is possible to conceive of a cause X raising the statistical
probability that effect Y will be produced as a result. Such a theory substantially
weakens the traditional Aristotelian notion of causation in removing the deterministic
component of causal relations. In a review of causal theories, Norton (2003:6) notes
the difficulties which 20th Century developments in mathematics and physics such as
quantum mechanics and chaos theory have produced for a deterministic theory of
causation. Poincaré (1913 cited in Losee 2001) showed that the impossibility of
making infinitely accurate measurements of the initial conditions of a system can
produce huge and unpredictable discrepancies at a later point in time, the essence of
what is now popularly known as chaos theory.
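The probabilistic reading of causation sketched above, in which a cause X raises the probability of an effect Y without guaranteeing it, can be made concrete with a small calculation. The following Python fragment is an illustrative sketch only: the function name probability_raising is hypothetical, and the counts are invented. The exposed group mirrors the 10% figure quoted in the text, while the unexposed baseline is a pure assumption introduced to make the comparison possible.

```python
def probability_raising(exposed_with_effect, exposed_total,
                        unexposed_with_effect, unexposed_total):
    """Compare P(effect | cause) with P(effect | no cause) from raw counts."""
    p_given_cause = exposed_with_effect / exposed_total
    p_without_cause = unexposed_with_effect / unexposed_total
    # X counts as a probabilistic cause of Y if it raises Y's probability.
    return p_given_cause, p_without_cause, p_given_cause > p_without_cause

# Counts are invented for illustration: 10 of 100 exposed cases show the
# effect (mirroring the 10% figure in the text); the unexposed baseline of
# 1 in 100 is an assumption, not a reported statistic.
p_cause, p_no_cause, raised = probability_raising(10, 100, 1, 100)
print(p_cause, p_no_cause, raised)
```

On this view the relation holds even though nine in ten exposed cases show no effect, which is exactly the sense in which the deterministic component of causation is removed.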
Norton (2003:5) describes a position which he refers to as ‘causal fundamentalism’
which has prevailed in the deterministic wake of Newtonian mechanics:
Nature is governed by cause and effect; and the burden of individual sciences
is to find the particular expressions of the general notion in the realm of their
specialized subject matter.
In Norton’s view, echoing Russell, the notion of cause and effect as some sort of
deeper unifying force of nature has the status of an anachronistic fallacy. In the light
of 20th Century developments in physics such as Quantum Theory, a definition of
effects as being brought about by causes has been replaced by a form of
indeterminism, somewhat undermining the metaphysical status of causation. Within
each scientific sub-domain, scientists seek to discover the mechanisms of causal
relations specific to the phenomena under observation. In fundamental particle
physics, the production of electron anti-neutrinos is related causally to the decay of
neutrons. By way of contrast, causation in genetics is frequently expressed in
terms of disruption or disturbance in chemical base pairs making up DNA. Thus the
gene AtCPSF73-II in the plant species Arabidopsis thaliana is identified as the trigger
for reduction in female gamete production (Xu et al. 2004).
Such rigidly specified expressions are equated by Norton with the ontologies of
mature sciences. This precision can be contrasted with what Menzies (1996) has
termed the ‘folk’ status of causation. ‘Folk’ causation is the familiar prototypical
notion of cause and effect, the cognitive process by which we ‘organize our
experiences into intelligible coherence’ (Norton 2003:8). For Russell (1917:138-139),
volition is identified as the ‘intelligible nexus between cause and effect’. Under
restricted favourable circumstances which Norton terms ‘hospitable domains’, it is
possible to equate scientific processes with ‘folk’ causation. In a hospitable
domain the causative nexus can be clearly and indisputably isolated: a common
analogy might be a child’s football breaking a window or a car crash resulting in a
whiplash injury.
football → window breakage
car crash → whiplash injury
The ‘arrow’ shorthand is a convenient means of conceptualising ‘folk’ causation as an
asymmetric relation, i.e. a cause can produce an effect but an effect cannot bring about
a cause. Beyond the restricted environments of such hospitable circumstantial
domains however, causal relations can become vastly more complicated, as
exemplified by the complex chains of molecular collisions within a gas, illustrated
conceptually below. Not only can collisions between molecules be seen as chains in
which a produced effect becomes the cause of a subsequent collision but also
individual collision effects can be derived from more than one separate causing agent.
Molecular collisions in a gas as causal chains (adapted from Norton 2003)
Setting aside metaphysical problems raised by the status of causation within the
philosophy of science and the difficulties raised in attempting to apply a blanket
notion of cause and effect within restricted domains, it is argued that a ‘folk’
definition of causation nevertheless serves as an ‘umbrella’ term convenient for the
purposes of information retrieval and extraction. These objectives are seen as the
primary areas of application for the grammar put forward in this thesis. More
specifically the semantic domain of ‘folk’ causation can satisfactorily subsume the
highly diversified range of microbiological interactions, biochemical and
pharmaceutical agencies, practitioner-patient interventions and treatment courses etc.
which are encountered in the biochemical domain. In other words, the use of a ‘folk’
definition of causation is sufficiently all-inclusive to serve the purposes of the local
grammar which can parse sentences from a variety of biomedical sub-domains and
not just be restricted to a single, narrowly defined sub-discipline.
1.4.3 The language of science
This section reviews in more detail the diachronic and synchronic research on the
language of science and more specifically the scientific research article described in
Allen (2002a:8-11). The historical development of the research article from its
Enlightenment origins as epistolary exchanges between scientists is described in Ard
(1983). The appearance in 1665 of the first scientific journal, Philosophical Transactions of the
Royal Society, marked an important watershed in scientific writing, as experimenters
sought the rhetorical apparatus and persuasive means to convince a wider audience
removed in time and place from the immediacy of demonstrated experiments.
Bazerman (1983) charts the subsequent development of Transactions over the period
up until 1800, noting the increasing tendencies to embed observations of scientific
phenomena within an accumulating body of scientific literature representing the
prevailing research consensus. The development of the ‘proto’ Introduction-Methodology-Results-Discussion (IMRD) structure in research articles begins to manifest itself
towards the end of this period as part of a trend towards the increasing
problematisation of complex scientific investigations (Bazerman 1983:16-17).
However this research stops short of more detailed linguistic analysis of research
article text.
The period covering the rise of modern science is described in Bazerman (1984a) who
provides a thorough overview of both linguistic and non-linguistic feature
development in spectroscopy articles from 1893-1980. Among the non-linguistic
tendencies noted are increasing article length, division of articles by section and use
of references. In terms of linguistic features, Bazerman’s work relates increasing
foregrounding of nominalized verbal processes such as ionization and correlation to
the corresponding diminishment of the scientist’s explicit pronominal participation in
the text. This shift is partly paralleled in Myers’s (1990) distinction between a
narrative of science in which nominalized arguments as realizations of scientific
processes are highlighted and a narrative of nature typical of scientific
popularizations, in which the scientist, animal or plant is in focus rather than the
process. While Bazerman’s work constitutes a groundbreaking historical survey of
scientific writing, it suffers slightly from a restricted focus on a narrow area of physics.
It would be interesting for example to see whether these trends are echoed in other
physical sciences as well as biology and medicine.
In contrast to the development of scientific writing traced in diachronic surveys,
applied linguistic research has chiefly concerned itself with the contemporary RA.
Gledhill (2000) identifies two applied linguistic perspectives on the language of
science, one terminographical in orientation, the other discoursal. As Gledhill (2000)
notes, the terminographical and discourse perspectives spring from different linguistic
traditions. The terminological perspective views scientific language as a specialized
language variety essentially postulating a demarcation between scientific language
and the general language. Terminography examines the relationship between the
technical language of scientific sub-fields and general language and is closely related
to the notion of sublanguage described in more detail with regard to biomedicine in
chapter 2.
Representative of this terminological tradition is work which has been done on the
definition of terms within specific scientific domains (Sager et al 1980; Picht and
Draskau; Pearson 1996; 1998 cited in Gledhill 2000:20). The work of Pearson for
example draws on the observation that specialized text contains language and
metalanguage collectively constituting either complete or partial definitions of
technical terms. By identifying a limited number of connective verbs such as is/are,
comprise(s), consist(s) of, define(s), denote(s), describe(s), etc. in specialized corpora,
Pearson shows how it is possible to unite the object language, i.e. the term, with the
metalanguage of the definition (Pearson 1996:822):
[ ] Kinesin is a motor protein that uses energy derived from ATP hydrolysis to move
organelles along microtubules.
This research points the way forward to the automation of term definition which can
be especially useful in areas of rapid terminological change and profusion of terms.
Other broadly terminological work reviewed in Gledhill (2000) describes the
processes of derivation and word formation of technical terms (Huddleston 1971)
along with collocational descriptions of science-specific lexis (Sager 1980). This
work however pre-dates the era of collocational analysis using statistical software,
leaving open the possibility that important collocations could have been missed
during manual trawls of the data.
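The connective-verb approach to term definition can be illustrated with a minimal sketch. It is not Pearson's procedure: the regular expression, the function name extract_definition and the restriction to sentence-initial capitalised terms are assumptions made purely for illustration, and only the connective verbs are taken from the list above.

```python
import re

# Connective verbs listed in the text; the surrounding regex is an assumption.
CONNECTIVES = r"(?:is|are|comprises?|consists? of|defines?|denotes?|describes?)"

# A capitalised term of up to four words, a connective verb, then the definition.
DEFINITION = re.compile(
    rf"^(?P<term>[A-Z][\w-]*(?: [\w-]+){{0,3}}?) "
    rf"(?P<connective>{CONNECTIVES}) "
    r"(?P<definition>.+)$"
)


def extract_definition(sentence):
    """Return (term, connective, definition), or None if nothing matches."""
    match = DEFINITION.match(sentence.strip().rstrip("."))
    if match is None:
        return None
    return (
        match.group("term"),
        match.group("connective"),
        match.group("definition"),
    )


example = (
    "Kinesin is a motor protein that uses energy derived from ATP "
    "hydrolysis to move organelles along microtubules."
)
print(extract_definition(example))
```

Applied to the kinesin sentence, the sketch separates the term from the metalanguage of its definition at the connective verb. A realistic extractor would need a far richer treatment of multi-word terms and of partial definitions.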
The discourse analysis of the scientific RA on the other hand belongs to the
Hallidayian systemic functional tradition. In this perspective scientific language is one
variety of general language; the specialized context of situation of scientific discourse
is reflected in the specific register variables of field, tenor and mode and their impact
on the linguistic features of texts. Thus a football commentary and a science RA are
both varieties of the general language, with their linguistic differences captured by
register variables defining the topic, the relations between the interactants in the
discourse and the role which language is playing in the interaction.
Discoursal perspectives on scientific writing focus on the socio-rhetorical activity of
text production both within scientific research communities and between these
specialist communities and a wider readership through scientific popularization and
apprenticeship. The shared purposes of these communicative events collectively
realise the genre of a text. Sociologists of science such as Latour and Woolgar
(1986), Myers (1990) and Swales (1990; 1998) have pointed towards the pivotal role
which language plays within these discourse communities in the vouchsafing of
scientific claims emerging from experimental enquiry. The negotiation of claim
acceptance through the rhetorical apparatus of the journal research article provides the
mechanism for the social constructivist model of scientific knowledge. Other genre-based work has focused on the linguistic challenges posed by scientific writing which
novice scientists face during their apprenticeship into their respective discourse
communities (Halliday and Martin 1993). The distinction between the externally-defined genre and register will be elaborated upon in chapter 2.
1.4.4 The place of causation in linguistics
In Allen (2001b), the status of causation in theoretical linguistics was reviewed
historically, taking as its point of departure factive definitions of cause and effect
based on introspected sentences such as Shibatani (1972; 1976) and Givón (1975).
This review also describes the pre-occupation in the linguistics literature with
causation identified in terms of a highly limited group of periphrastic causative verbs
cause, have, make and optionally get and let. In the same article, the use of a limited
number of intuited causative constructions such as the semantic equivalence between
lexical causative kill as in X kills Y and the periphrastic causative X causes Y to die is
described as the basis for the theory of generative semantics (McCawley 1968). This
theory utilises transformational rules tied to the semantic component of a grammar
rather than the syntactic component in Chomsky’s Standard Theory (Chomsky 1965).
In recent years as Chomskyan theory has increasingly sought to examine the principal
universal ‘design’ properties of human languages, the focus on causative
constructions has been prominent in the search for linguistic universals (Comrie 1981;
Song 1991).
Closely related to causation is the semantic domain of resultative constructions which
are clausal or other elements expressing the notion of consequence or effect. The
grammar of English contains a number of adverbial, adjectival and conjunctive
devices for expressing resultative, resulting or resultant consequences:
[a] As a result of the strike action, publication has been delayed
[b] Excessive drug use made the patients infertile
[c] Jean left early so that she could do her Christmas shopping
The case of the lexical item make in the adjectival resultative pattern of NP V NP AP
is illustrative of the problems involved in establishing a rigid demarcation between
strict lexical causatives and resultative patterns (Boaz 2000). Goldberg (1995) argues
for the existence of causatives and resultatives as separate categories which are
independent of the lexical items which they contain. Boaz (2000) puts forward an
alternative view based on corpus examination of lexical semantic relations. On the
basis of the British National Corpus (BNC) evidence the lexical causative make can
also be seen as a prototypical resultative occurring with a wide range of adjectives
which describe resultative states. In the BNC for example, make is the only verb
occurring in the NP V NP AP resultative category with the adjectives wet, tender,
sleepy etc. As a result of the overlap between causatives and resultatives illustrated
by make, it would seem sensible for the purposes of this thesis to subsume resultatives
within causation.
These introspective approaches to the data are contrasted in Allen (2001b) with
corpus-based and corpus-driven studies of causation. This methodological distinction,
alluded to in section 1.2 above, is described in more detail in Tognini-Bonelli (1996,
2001) and is covered in a related paper, Allen (2002a). Within the broad spectrum of
empirical approaches identified with the corpus methodology, it is possible to
differentiate a number of alternative stances to the filtering of the data through pre-existing, introspectively-derived categories. Corpus-based approaches are associated
with attempts to verify existing linguistic theories through confrontation with natural
language data. One example of this position is the use of corpus data annotated in
accordance with a particular grammatical model. Gilquin’s (2000; 2002) work on the
extraction of causative patterns from the tagged and parsed ICE and LOB corpora
provides an illustration of the corpus-based approach. This approach is exemplified by
the extraction of causative make using the POS- and syntactic tags:
[word="mak.*|made" & pos="V.*" & genre="LOB[A]"] []{0,4}
[pos="VB|VBN|BE|BEN|DO|HV|HVN"] []* within s;
Using the XKwic query language shown above, it is possible to retrieve causative
instances of make as in I can’t make a club pay (Gilquin 2002:202-203). In the
example above, the query designates a search on the verb lemma make, i.e.
make/makes/making or made, followed by any non-specified lexical items in the 0-4th
position from the search node and finally either by any base form (VB) or past
participle (VBN) of a lexical verb or alternatively any form of the verbs be, been, do,
have or had. In terms of scope however, the approach suffers from the same
restricted focus on a narrowly-defined group of prototypical causative verbs such as
make and have as the generative and typological studies described above.
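The logic of the query can be approximated over a sentence represented as a list of (word, tag) pairs. The sketch below is an illustrative re-implementation rather than XKwic itself: the function name find_causative_make, the data representation and the tags assigned to the example sentence are assumptions, and the genre restriction of the original query is omitted.

```python
import re

# Tags accepted in the final slot of the query, copied from the text.
VERB_TAGS = {"VB", "VBN", "BE", "BEN", "DO", "HV", "HVN"}
# Word forms accepted at the search node, as in the query.
NODE = re.compile(r"mak.*|made")


def find_causative_make(tagged_sentence, max_gap=4):
    """Return (node, verb) pairs where a verbal form of make is followed,
    after at most max_gap intervening tokens, by a token with a verb tag."""
    hits = []
    for i, (word, pos) in enumerate(tagged_sentence):
        if not (NODE.fullmatch(word.lower()) and pos.startswith("V")):
            continue
        # The verb may sit 1 to max_gap + 1 positions after the node.
        window = tagged_sentence[i + 1 : i + 2 + max_gap]
        for next_word, next_pos in window:
            if next_pos in VERB_TAGS:
                hits.append((word, next_word))
                break
    return hits


# Illustrative tagging only; the tags are assumed, not taken from LOB.
sentence = [
    ("I", "PP1A"), ("can't", "MD"), ("make", "VB"),
    ("a", "AT"), ("club", "NN"), ("pay", "VB"),
]
print(find_causative_make(sentence))
```

As with the original query, the search is keyed to a single pre-selected verb, which is precisely the limitation of scope noted above.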
Tognini-Bonelli (1996; 2001) contrasts this stance with the more purely inductive
approach of corpus-driven linguistics (henceforth CDL). In this thesis, the corpus-driven approach has been adopted for two reasons. Firstly, CDL utilises the lexical
item as the least theoretically pre-conceived unit of grammatical analysis. The CDL
approach is more appropriate therefore as the point of departure for an extensive
lexical survey of a semantic domain such as causation, which cannot itself be
extracted automatically unless the corpus has been semantically tagged. Secondly if
the grammatical description emerging from the data is to have currency as the basis
for a sublanguage parser, it must embody an integrity which can only arise from a
close confrontation with the corpus data.
The work presented in this thesis has its immediate origins in a pilot study of
causation submitted as an MA dissertation (Allen 1998). The description of this
study’s methodology as corpus-driven does have to be qualified in the light of
subsequent work however. In particular the study made use of the POS-tagged
COBUILD Bank of English although the final categories of the grammar marked a
partial functional break from pre-existing syntactic description. In Allen (2002a; b),
the corpus-driven approach is discussed with reference to the compilation of an ad
hoc specialist corpus and in particular the desirability of augmenting the SGML/XML
markup of the corpus with automatic POS-tagging. The corpus-driven stance also has
implications for the storage of large numbers of lexical items and their associated
patterns. The adoption of a specific notation scheme to record these patterns is
described below in section 1.5.3 and in more detail in chapter 4.
1.5 Local grammars
1.5.1 Preliminaries
The literature on local grammars has been the focus of a previous paper
(Allen 2001a). This work has explored the relationship between the concept of
sublanguage and local grammar with respect to full sentence dictionary definitions
(Barnbrook 1995; 2002), evaluation (Hunston and Sinclair 2000) and causation (Allen
The term ‘local grammar’ originated in a paper by Gross who first raised the prospect
of devising a specialist grammar to cope with elements of ‘peripheral’ language such
as idiomatic expressions or numerical information (Gross 1993). Gross’s perspective
arises from a very different tradition in linguistics, that of generative grammar, which
stresses the role of transformational rules in the capturing of similarities between
semantically-equivalent sentences. The conceptualization of sentence equivalence
owes a substantial debt to Harris’s distributional theory of sublanguages which will be
explored in more detail in chapter 2.
By way of illustration of Gross’s notion of a local grammar, attention can be drawn to
the status of idiomatic expressions within general language descriptions. Generative
theory has always had difficulty in accounting for idiomatic language in conventional
terms of phrase structures or movement rules. While developments in generative
theory such as X bar syntax cope (albeit using intuited examples) to a certain extent
with the symmetry and regularity of non-idiomatic language, the syntactic restrictions
of certain idiomatic combinations are an acknowledged source of difficulty for such
representations. Gross illustrates the workings of a local grammar with respect to the
idiomatic combinations of the verbs lose and blow:
Bob lost his cool.
Bob lost his temper.
Bob lost his cork.
Bob lost his self-control.
Bob blew a fuse.
Bob blew a gasket.
It can be readily appreciated that these idiomatic combinations share a high degree of
semantic equivalence. A local grammar can be constructed which captures this
equivalence in the form of finite automata (Gross ibid.:30). Such finite automata
represent the parsing operation in computational terms as a series of ‘states’ read by
the computer from left to right. The diagram below is a representation of the
equivalences in [5] above in which a human agency leads to choices between the two
verbs, lose and blow respectively. The choice of these two verbs imposes its own set
of idiomatic restrictions: lose co-selects cool, cork and temper while blow determines
stack, top etc.
Finite automata representation for idiomatic co-selection (adapted from Gross
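The idea of a finite automaton reading tokens from left to right can be sketched as a transition table. The state names, and the choice of his or a before each complement, are assumptions made for illustration; the verbs and complements are those given in the examples and discussion above.

```python
# Hypothetical state names; each key is (current state, token read).
TRANSITIONS = {
    ("start", "Bob"): "subject",
    ("subject", "lost"): "lose",
    ("subject", "blew"): "blow",
    ("lose", "his"): "lose_complement",
    ("blow", "a"): "blow_a_complement",
    ("blow", "his"): "blow_his_complement",
}
# Complements co-selected by each verb.
for noun in ("cool", "temper", "cork", "self-control"):
    TRANSITIONS[("lose_complement", noun)] = "end"
for noun in ("fuse", "gasket"):
    TRANSITIONS[("blow_a_complement", noun)] = "end"
for noun in ("stack", "top"):
    TRANSITIONS[("blow_his_complement", noun)] = "end"


def accepts(sentence):
    """Read tokens left to right; accept only if the final state is reached."""
    state = "start"
    for token in sentence.rstrip(".").split():
        state = TRANSITIONS.get((state, token))
        if state is None:
            return False
    return state == "end"


print(accepts("Bob lost his temper."))
print(accepts("Bob lost a fuse."))
```

Because the choice of verb determines which complement states are reachable, a combination such as lose with fuse is rejected, which is how the automaton encodes idiomatic co-selection.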
In Allen (2001a), the problem of parsing unrestricted text in natural language
processing is highlighted. In the same paper it is suggested, following Barnbrook
and Sinclair (2001), that devising a number of specialized local grammars to work on
stretches of language each encompassing a specific semantic function might be one
way of solving this problem which cannot be adequately covered in a general or
global language grammar. Although these three grammars retain the emphasis on the
grammatical analysis of restricted language which was part of the original suggestion
by Gross, the local grammars of definition, evaluation and causation belong to a
largely separate, neo-Firthian linguistic tradition.
The differences between Gross’s conceptualisation of a restricted focus grammar and
the subsequently published local grammars can now be summarised. Gross’s
perspective on phraseology which sees it as a peripheral area of grammar clearly
belongs to the generative tradition which in the words of Sinclair (1991:103-4) has
treated idiomaticity as a ‘rubbish dump’ for syntactically-deviant language. The
centrality of collocational and colligational patternings enshrined in the idiom
principle (Sinclair ibid.:110) however has been an important perspective to emerge
from the past two decades of computer corpus research. Secondly, the representation of a
grammar in the form of a directed acyclic graph is designed to work on artificial or
intuited sentences such as in the examples above. The definition, evaluative and
causation local grammars on the other hand are intended to serve as the basis for the
parsing of natural language, rather than intuited sentences.
1.5.2 A local grammar of dictionary definition sentences
As remarked previously in Allen (2001a:11-15), the local grammar of dictionary
definitions is the most extensively worked out and tested semantically-based grammar
of a sublanguage to date. In this grammar the sublanguage of full sentence dictionary
definitions is already pre-defined as the Collins Cobuild Students Dictionary
(henceforth CCSD) definition database. The sublanguage consists therefore not only
of the lexicographers’ definition sentences but also the marked-up field codes for the
attaching of additional linguistic information such as grammar and pronunciation
guides etc. In analyzing these sentences into the definiens and definiendum functional
halves of lexicographical definitions rather than phrase structure or clausal components,
the grammar departs radically from general language representations. The insight
which is recognized in this approach is that the dictionary metalanguage requirements
regularize definitions into a small number of patterns. The information which these
patterns contain can be more usefully described in terms of their functional
components as definitions rather than in terms of traditional phrase-structure rules or
clausal constituents. The practical utility of such an approach can be illustrated with
regard to sense disambiguation, as in the example below (Barnbrook 2002:165):
Examples of local grammar analyses for definitions of breast
A woman’s | the two soft, pieces of flesh | on her chest | that can produce milk to feed a baby
A bird’s | the front | part of its body
Here the local grammar has parsed the definition sentences for the headword breast
firstly into a left side and a right side separated by a hinge element and then into the
functional components of definiendum (Dm) and definiens (Ds). These respective
halves are further decomposed into co-text elements (C), the superordinate (S) and
two optional discriminator elements either side (dr). The value of functionally
parsing these elements as discriminators (rather than pre- or post-modifying elements
in line with a PS grammar) should be immediately apparent as these elements provide
the basis for sense disambiguation. Despite this overall functional perspective, the
grammar does not explicitly acknowledge the wider debt to systemic-functional
linguistics (Halliday 1985a) which underpins the local grammar approach as a
functional analysis of a semantically-defined sublanguage.
An important aspect of this work is the application of the grammar in an automatic
parser. The parsing algorithm, implemented in the text-matching language AWK
(Aho et al. 1988) and based on the grammar, relies primarily on regularities in the definition
sentence structure and to a lesser extent on field codes in the CCSD database to create
parses of the definition sentences with a number of NLP applications in lexicography.
Despite the specificity of these codes to the CCSD database, Barnbrook shows how
the grammar / parser could be adapted to other learner dictionary databases, such as
the OALDCE. In future applications it would also be interesting to apply the
grammar/parser to on-line texts with a view to extracting term definitions from
sources outside of a dictionary database. In an era of rapid terminological change,
automatic term definition is highly desirable.
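The functional split of a definition sentence around its hinge can be sketched as follows. This is not Barnbrook's AWK parser: the function name parse_definition, the restriction of the hinge to is or are, and the full example sentence (assembled from the fragments in the table above, with the hinge assumed) are all illustrative assumptions. The sketch stops at the level of co-text, definiendum, hinge and definiens and does not attempt the finer division into superordinate and discriminators.

```python
import re


def parse_definition(sentence, headword):
    """Split a definition sentence into co-text, definiendum, hinge and
    definiens. The hinge is assumed to be 'is' or 'are'."""
    pattern = re.compile(
        rf"^(?P<cotext>.*?)\b(?P<definiendum>{re.escape(headword)}\w*)\s+"
        r"(?P<hinge>is|are)\s+(?P<definiens>.+)$"
    )
    match = pattern.match(sentence.strip().rstrip("."))
    if match is None:
        return None
    return {name: value.strip() for name, value in match.groupdict().items()}


# Assumed full sentence, reconstructed for illustration only.
parsed = parse_definition("A bird's breast is the front part of its body.", "breast")
print(parsed)
```

The co-text element (here the possessor) is what distinguishes one sense of the headword from another, which is why a functional parse of this kind lends itself to sense disambiguation.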
1.5.3 A local grammar of evaluation
The influence of Halliday is made more explicit in the local grammar of evaluation
which is described in Hunston and Sinclair (1998) and Hunston and Francis (1999).
The description of evaluation shows more clearly the critical link between patterns of
lexical co-occurrence and semantic units which has been one of the principal claims
arising from a corpus-driven methodology.
The descriptive basis for the local grammar of evaluation is the notion of pattern
grammar arising originally out of concerns to represent the grammatical behaviour of
dictionary headwords in the COBUILD dictionary. In a series of publications (Francis
et al 1996; Hunston and Francis 1998, 1999), Hunston and Francis provide detailed
corpus-driven descriptions of the lexical patterns of verbs, nouns and adjectives.
The descriptions make use of a shorthand notation
system to represent each lexical item and its associated patterns. A large general
corpus, the COBUILD Bank of English, provides the source data throughout.
The discussion of pattern grammar brings into focus an important if subtle distinction
between the closely-related terms of lexicogrammar used by Halliday (1985a:15) and
the notion of a lexical grammar arising from corpus-driven studies of phraseology.
Lexicogrammar is identified by Halliday within the mainstream of SFL theory as the
traditional meaning of grammar in terms of a recognition of the interdependence
between lexis and structure. One consequence of this perspective is to regard the
lexical item as the most delicate representation of a grammatical system. However as
Hunston and Francis (1999:28) note, this view is at odds with the findings of corpus
linguistics. Results emerging immediately from or as a by-product of the COBUILD
project point to syntagmatic patterns of collocation and colligation centred on
individual words as representing single functional choices in accordance with the
idiom principle. In a lexical grammar, a phraseological pattern defines, in Sinclair’s
(1991:6-9) terms, an extended unit of meaning which represents the most delicate choice of a
system, rather than individual lexical items.
Hunston and Francis exemplify the pattern-function mapping with regard to the
evaluative adjective difficult which is exhaustively listed on the basis of the corpus
evidence in terms of a total of 21 separate lexical patterns, a selection of which are
illustrated here:
Example of local grammar analysis of the evaluative difficult
Evaluative Category: pretty difficult
Evaluated Entity: to see the future; to generalise; reading into a man’s mind
(reproduced from Hunston and Francis 1999:133)
The pattern notation system is itself evaluated in Allen (2002b). In this paper it is
pointed out that the system offers the advantage of being able to represent the
individual patterns of large numbers of lexical items in a convenient database form.
There are however difficulties raised by the fact that the pattern notation records co-occurrency restrictions to the right of the search node only, whereas a full functional
specification also needs to account for linguistic elements to the left of the
concordance search node.
The use of ‘mapping tables’ such as that illustrated above for the adjective difficult
represents the second stage in the compilation of the local grammar. The above
example illustrates how the pattern it v-link ADJ to-inf/ing (see note 14) is identified with the
functional categories Evaluative category and Evaluated Entity of the
local grammar.
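The mapping from a lexical pattern onto functional categories can be sketched for the pattern it v-link ADJ to-inf/ing. The regular expression, the small set of link verbs and the function name map_to_functions are assumptions made for illustration, and the test sentences are assembled from the fragments in the mapping table rather than quoted from the corpus.

```python
import re

# it v-link ADJ to-inf/ing, restricted here to the adjective difficult.
# The list of link verbs is an illustrative assumption.
PATTERN = re.compile(
    r"^It (?P<link>is|was|seems|becomes) "
    r"(?P<category>(?:\w+ )?difficult) "
    r"(?P<entity>(?:to \w+|\w+ing)\b.*)$"
)


def map_to_functions(sentence):
    """Map a sentence matching the pattern onto the two functional labels."""
    match = PATTERN.match(sentence.strip().rstrip("."))
    if match is None:
        return None
    return {
        "Evaluative Category": match.group("category"),
        "Evaluated Entity": match.group("entity"),
    }


print(map_to_functions("It is pretty difficult to see the future."))
print(map_to_functions("It was difficult reading into a man's mind."))
```

Each pattern of each evaluative adjective would require a mapping of this kind, which gives some indication of the scale of an exhaustive treatment.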
In the light of the project described in this thesis, the evaluative grammar is valuable
firstly in terms of putting forward a corpus-driven methodology for the lexical pattern
storage in database format and secondly for the creation of a functional representation
without sacrificing the integrity of the data. The representation of evaluation in the
grammar is however given only partial exemplification; it remains to be seen how a
full coverage of the evaluative patterns of English in terms of an exhaustive listing of
the adjectives and their lexical co-occurrency restrictions could be provided in
functional terms using a non genre-specific corpus. It is desirable therefore that a
specific local grammar should be compiled with initially more modest ambitions in
mind, which brings us back to the notion of language in restricted environments.
1.5.4 A local grammar of causation
The local grammar of causation was originally developed as a pilot project only using
the general language Bank of English as the descriptive source (Allen 1998). The
project focused on a restricted number of ‘prototypical’ periphrastic causative verbs
such as cause, make, get etc. and illustrated how some of the main patterns of co-occurrency involving these lexical items could be mapped onto functional elements of
the local grammar. This preliminary work was also significant in terms of subsuming
the semantic notion of prevention under the heading of cause and effect. However a
restricted focus on what have traditionally been referred to as the periphrastic causatives
represents but a small fraction of the total lexical resources through which causal
relations are realised in a general corpus of English. For example, transitive verbs
such as kill, break, smash and multi-part human agency verbs such as cajole + into
could also be seen to link causal agency with resultative effects.
Note 14: The convention in this thesis will be to represent lexical patterns in bold and functional categories in
In Allen (2001a; 2002a, 2002b) the difficulties involved in developing a lexical
grammatical description of causation using a general language corpus are discussed at
length. The point is made that a genre-specific focus on scientific argumentation,
in which human agency is concealed, significantly scales down the size of
the descriptive problem. In scientific research articles, the concealment of agency and
the representation of hypotheses as chains of nominalisations linked causally reduces
greatly the number of verbs involved in the encoding of cause and effect to a more
tractable subset of English transitive verbs. A specific genre focus makes it possible to
describe the principal lexical patterns through which causation is encoded within the
context of a single project. This reduction in complexity coupled with the utility of a
grammar as the basis for a parser with information extractive applications in
biomedical informatics has been the principal motivation for the compilation of a
specialist corpus of biomedical research papers. The prospect is therefore raised in
terms of providing a more or less exhaustive coverage of the lexis of causation within
a restricted textual environment. Such an enterprise in itself raises substantial
methodological questions relating to the construction of a specialist corpus and the
delimitation of a sublanguage of causation from within the scientific research genre.
In contrast to the work of Barnbrook in which the sublanguage was already pre-defined as lexicographers’ definitions, the raw data for the construction of a scientific
corpus needs to be delimited from the general language from scratch. Upon closer
inspection, scholarly scientific writing turns out to be far from homogeneous. To this
end the notions of genre and discourse community introduced by Swales (1990) can
be usefully applied to scientific text as the basis for textual selection and corpus
construction. In Allen (2002a), the genre of biomedical research articles is defined
with reference to the discourse community (DC) of biomedical researchers. The DC is
seen in Swalesian terms as a ‘socio-rhetorical’ grouping of researchers sharing
common aims in the dissemination of research texts. Examples of DC groupings in
biomedicine include journal readerships and institutional affiliations among
researchers sharing common goals in textual production and reception. If such sub-genres can be defined with reference to specific journal titles, the problem of
sampling across the spectrum of biomedicine can be tackled on a more principled
basis.
The definition of a genre in terms of discourse community has been influential in the
construction of a number of small-scale scientific corpora for the purposes of describing genre-specific phraseological patterns (Gledhill 1995, 2000; Williams 1998). Work on
scientific corpora has involved the active participation of domain experts from within
the discourse community in the selection of representative texts for corpus inclusion.
The methodology of corpus construction and data sampling using an established
library classification scheme outlined in Allen (2002a:26-27) is described in more
detail in Chapter 3 of this thesis.
1.6 Objectives and overall format
The format of the thesis is as follows. Chapter 2 describes the nature of sublanguages
in biomedical research beginning with Harris’s original criteria for sublanguage
identification. A survey of practical applications using the sublanguage approach is
then provided with special reference to clinical narrative analysis and more recently in
the NLP analysis of biomolecular research papers. In particular recent work in NLP
has sought to describe possible semantic relationships in sublanguage environments
which might serve as the basis for information extraction. Given the rhetorical
centrality of causal relations within scientific research text, causation is one such
semantic relation with potential NLP applications in parsing sublanguages. Chapter 3
concerns itself with specific issues relating to corpus representation and design and
the implications of the adoption of a corpus-driven methodology in terms of lexical
pattern data storage. In particular this chapter considers the expansion of the original
130,000 running-word corpus into the final design and construction of a 2 million word
Halmstad Biomedical Corpus. The basis for arriving at a practical definition of
causation allowing a delimitation of the causative sublanguage within the biomedical
RA is also considered.
The empirical results are set out in two separate chapters. Chapter 4 describes the
significant lexical items encoding causal relationships in the text and their principal
lexical grammatical patterns. These patterns of lexical co-occurrence serve as the
basis for the presentation of the local grammar functional components. Copies of the
corpus and the lexical databases are included on the enclosed CD-ROM. The local
grammar itself is described in chapter 5. In chapter 6 the focus is on the evaluation of
the local grammar as the basis for a functional parser of biomedical RAs. Through
making use of the local grammar configurations of functional patterns presented in
chapter 5, a small test corpus comprising POS- and XML-tagged biomedical RAs is
hand-parsed and the results evaluated in information extraction terms. The chapter
also considers the efficacy of creating software based on the local grammar outline.
Finally in chapter 7 the grammar / parser is evaluated in terms of potential
applications in information retrieval / extraction within the domain of biomedical
informatics. Other possible uses of the grammar such as in the teaching of English for
Specific and Academic Purposes will also be considered. Finally this chapter
considers the relationship between different (future) local grammars and the prospects
which they hold for the longer term goal of automatic parsing of unrestricted text.
2. Biomedical sublanguages: from analysis to application
2.1 Preliminaries
The previous chapter has introduced the notion of a local grammar as a grammar of a
functionally-restricted sublanguage. At this point it is instructive to re-evaluate the
relationship between the concepts of sublanguage, register and genre which have been
alluded to in previous work (Allen 2001a; Allen 2002a). The relationship between the
notions of pattern grammar and local grammar described in Allen (2002b) will also be
The concept of sublanguage and the criteria by which sublanguages can be identified
have already been described in Allen (2001b) with reference to the groundbreaking
contribution of Harris (1968; 1982; 1989). In this chapter, the focus is more
specifically on the application of the sublanguage concept to biomedicine. The
selective focus on the biomedical domain is justified from both linguistic and
informatics perspectives. Of most fundamental importance to this thesis is firstly the
constraining influence of the biomedical research article as a sub-genre on the lexical
grammatical expression of causation and secondly the potential parsing applications
of a specialized grammar of cause and effect in biomedical informatics.
The notion of sublanguage has primarily been used within the NLP community to
describe subsets of language representing constrained varieties of natural language
(McEnery and Wilson 2001:166; Barnbrook 2001:73). It is important to understand
these constraints in terms of what Harris termed ‘closure’, the tendency of a
sublanguage towards being finite. As McEnery and Wilson (ibid.:167) note, closure
can be demonstrated by comparing a corpus of computer manual text such as the IBM
Corpus with a corpus assumed to be representative of the general language such as the
Canadian Hansard Corpus. Detailed lexical comparisons such as type/token ratios
between these corpora show that the IBM Corpus is a much more restricted textual
resource ie the IBM lexis is more ‘closed’ or finite than the Canadian parliamentary
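The kind of lexical comparison referred to above can be illustrated with a minimal sketch. The two token lists are invented stand-ins of equal length, not samples from the IBM Corpus or the Canadian Hansard Corpus, and the function name is hypothetical. On the assumption that samples of equal size are compared, a lower type/token ratio is read here as an indication of greater closure.

```python
def type_token_ratio(tokens):
    """Return the number of distinct word forms (types) divided by the
    number of running words (tokens)."""
    tokens = [t.lower() for t in tokens]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Invented toy samples of 12 running words each (assumption: not corpus data).
manual_sample = (
    "press the key then press the key again to save the file".split()
)
general_sample = (
    "the honourable member asked whether ministers would publish "
    "the new figures soon".split()
)

manual_ttr = type_token_ratio(manual_sample)    # 8 types over 12 tokens
general_ttr = type_token_ratio(general_sample)  # 11 types over 12 tokens
```

Because the type/token ratio falls as sample size grows, a comparison of this kind is only meaningful between samples of the same length.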
Sublanguage approaches in NLP have been largely confined to attempts to produce
systems of models of analysis in ‘one off’ highly constrained linguistic environments
(McNaught 1992). As Gledhill (2001:22) notes, sublanguages have come to be associated
with the terminological tradition of language for specific purposes (LSP). For Picht
and Draskau (1985:10-11, cited in Gledhill ibid.:22), LSP examples such as weather
forecasts, biochemistry articles or legal texts are to a large extent completely
divorced from the general language.
The relationship between the notions of sublanguage and local grammar can now be
considered in more detail. Clearly prototypical sublanguages such as the LSP and the
TAUM METEO reviewed in Allen (2001a:24) and the local grammars of definition,
evaluation and causation are not the same phenomena. These differences can be
summarized both in terms of linguistic tradition and scope of application. Prototypical
sublanguages are most clearly identified with computational initiatives based on
formal linguistics in highly restricted and in some cases grammatically ‘deviant’
environments. These domains are exemplified by the ‘telegraphic’ structures of
clinical narratives or weather forecasts.
Work on local grammars stems however from a functional linguistic perspective.
Definitional, evaluative and causative sentences do represent semantic sub-sets of the
general language in Lehrberger’s (1982:102) terms and therefore qualify as
sublanguages. These sub-sets however are not restricted to specialized linguistic
environments; the expression of cause and effect is equally likely to be found in sports
commentaries as it is in biomedical research articles. The scope of sublanguage
embodied by causation, definition or evaluation needs to be significantly widened
beyond the restricted focus of the NLP/terminographical tradition of scientific
sublanguages. A functional perspective thus acknowledges the potential extension
from specialized linguistic environment into general language. The description of this
sublanguage on a functional basis constitutes the local grammar itself.
For the purposes of specialist corpora construction however, the somewhat vague
notion of sublanguage inherited from terminological work has proven to be difficult to
apply in the construction of domain specific corpora (Williams 1998). In contrast
functionally-restricted language is subsumed in Neo-Firthian terms within the general
language, the precise relationship being specified through the concepts of register and
genre. In corpus-based studies of language variation, Biber (1988; 1990; 1993)
conflates genre with register to refer to externally-defined text types such as fiction,
sports broadcasts etc. The term text-type on the other hand is reserved for the internal
identification of texts sharing patterns of linguistic feature co-occurrence (Biber
1993:245). These features include specific verb and pronoun counts as well as type/token
ratio measures.
Systemic functional linguistics has in contrast reserved register and genre for text
delimitation on internal and external criteria respectively. Register is equated with the
impact of the context of situation on text-internal lexicogrammatical choices. For
example there is a regular and predictable relationship between the register variable of
tenor which captures the role relationships of the interactants and its linguistic
manifestation in modality choices. In building specialist corpora, it is claimed that the
notion of genre provides a more practical solution to the problem of text selection
(Williams 1998). Drawing on the notion of the discourse community to provide
external selection criteria to a large extent eschews the problem of constructing a
corpus on the basis of a time-consuming text internal linguistic feature analysis.
The above discussion has focused on the difficulties raised by the clash of linguistic
traditions in terms of equating the terminographically and NLP-derived notion of
sublanguage with the functional and discoursal perspectives on corpus construction.
In this thesis it is argued that genre provides a means of defining a specialized corpus
which contains examples of causation as a functionally-defined sublanguage. The
local grammar
Historically, it was in the field of biomedicine that Harris first demonstrated the
efficacy of his distributional theory of sublanguage grammar analysis (Harris et al
1989). Early biomedical applications of the Harris sublanguage theory have largely
been in the processing of information contained within clinical narratives such as
records of patient diagnosis, treatment and follow-up. However this work pre-dates
the currently dominant paradigm within biology of genetic research driven not only
by theoretical and experimental advances in the sub-field but also by improvements in
computational power and database storage technology. These developments have
begun to facilitate large scale organism-specific description of genetic sequences. The
first phase of the Human Genome Project for example has resulted in the cataloguing
of the 30,000 separate genes specific to the human species as well as some 3 billion
separate DNA base pairs. These and other initiatives demand computer-automated
solutions to the problems of electronic information management and dissemination
via web-based media.
Recent interest in the sublanguage approach has been awakened within the
informatics research community in terms of providing the theoretical basis for the
processing of very large quantities of digital text with web-based IR/IE applications in
mind. The sublanguage approach seeks to describe the use of language in subfields of
science in terms of the structure of its information. This process involves capturing
not only the relevant entities within a given field but also the relations between these
entities in these restricted linguistic environments. As Friedman (2001:223) notes,
automated applications of sublanguage grammars offer real prospects in the fields of
information retrieval and extraction. The basis for this prediction is that the language
so circumscribed is in information-content terms richer than general English because
the identity of the entities involved in each scientific field and the relationships
between them have already been formatted. These relationships are exemplified in
more detail below.
This chapter begins by describing and illustrating in more detail the linguistic and
informational aspects of biomedical sublanguages with reference to Harris’s original
defining properties for sublanguages. Section 2.3 illustrates and contrasts some of the
applications of sublanguage theory specifically in the clinical and molecular biology
sub-domains. In section 2.4 more recent NLP developments in sublanguage theory are
explored specifically in terms of controlled vocabulary and ontology building within
the biomedical domain. These developments are predicated upon the identification of
information-bearing categories as nominalized entities coupled with a specification of
the semantic relations pertaining between the respective nominal groups. The
potential of causation as a semantic dimension in biomedical information
retrieval/extraction lies in the prototypical semantic manifestation of each article’s
central hypothesis as a causal link. A functional analysis of cause and effect linkages
in accordance with the principles of sublanguage delimitation can therefore serve as
the basis for automatic parsing and formatting of information. Such applications can
potentially extract from research articles diseases and their causes, therapy and
treatment courses and their effects together with assessments of pharmaceutical
preparations and their efficacies. Finally section 2.5 explores more specific IR/IE
initiatives in biomedical informatics.
2.2 Distributional sublanguages in biomedicine
Before describing more recent developments in NLP applications of sublanguage
descriptions, it is important to consider in more detail the fundamental properties of
biomedical sublanguages as identified by Harris in his original work on the sub-field
of immunology. Harris categorises these properties in terms of dependency relations,
paraphrastic reductions and inequalities of likelihood.
2.2.1 Dependency relations
The category of dependency relations is a general property of natural languages. In
this descriptive framework nominal entities are referred to in Harris’s terminology as
‘zero’ level words because they are not dependent on other lexical items in the
sentences. For example, in the simple sentence Dogs eat meat, the zero level nouns
dogs and meat do not depend on other lexical items. The same is not true for the
transitive verb eat which is dependent on dogs and meat as verbal arguments. With
reference to the cellular immunology domain Harris isolates a number of zero level
word classes (each of which is labelled by a letter G-C etc) of nominals:
G: eg antigen, bacteria, sheep, blood cells
B: eg ear, rabbit
A: eg antibody, agglutinin, immune globulins
T: eg lymph nodes, serum, adipose tissue
C: eg lymphocytes, plasma cells, reticulum cells
One of the central findings of Harris has been the extent to which the sets G-C above
regularly co-occur with certain verbs and other expressions. These lexical items are
termed in Harris’s theory as ‘first’ level words because they are dependent on the
particular noun subject and verb combinations. These first level combinations are in
turn given letters for shorthand identification purposes:
J: on G-B (injected into, eg Antigen is injected into the ear)
I: on C-B-B (injected into… from, as in cells were injected into rats from non-immunized rats)
U: on G-T (reaches, concentration in )
on G-C (stimulates, uptake by, sensitizes)
V: on A-T (visible in, distributed in; formed in; drain into; pass through)
on A-C (found in, contained in, synthesized by; adsorbed to; secreted by)
W: on T- (react, affected; swollen, inflamed)
on C- (react, change, develop; enlarge; present; multiply, divide, undergo,
on C-T (present in, persist in, transferred from, drain from, pass through)
on S- (in parallel orientation, rough, clustered, basophilic)
Y: on C-C (is same as, has some similarity to, is called; formed from, derived
from, develops into)
on C-C-C (bridges the gap between… and, differentiates through ….to )
on S-S (is in the form of, intermingles with)
Harris also notes the tendency for certain sequences of these first level combinations
such as:
GJB: AVC (Antigen is injected into a rabbit. Thereafter antibody appears in the
GJB: TW (Antigen is injected into the left foot pad. Thereafter the homologous lymph
node is inflamed)
While this is a very much abbreviated representation of Harris’s dependency levels it
serves to illustrate the type of lexical constraints which operate within a biomedical
sublanguage. Harris argues that statements of word class distributions serve to define
a sublanguage as ‘an independent symbolic linguistic system’. In other words on
purely distributional grounds alone the grammar of the sublanguage may to a certain
extent diverge from that of the general language.
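The distributional statements above can be sketched as a small lookup. The word lists and operator signatures are copied from the partial examples given in this section; the dictionary layout and the function name are hypothetical and are offered only as an illustration, not as Harris’s own formalism.

```python
# Zero-level words mapped to their class letters (partial, from the text).
CLASS_OF = {
    "antigen": "G", "bacteria": "G",
    "ear": "B", "rabbit": "B",
    "antibody": "A", "agglutinin": "A",
    "serum": "T",
    "lymphocytes": "C",
}

# First-level operators: phrase -> (letter, allowed (subject, object) classes).
OPERATORS = {
    "injected into": ("J", {("G", "B")}),
    "stimulates": ("U", {("G", "C")}),
    "visible in": ("V", {("A", "T")}),
    "found in": ("V", {("A", "C")}),
}

def formula(subject, operator, obj):
    """Return the class formula (eg 'GJB') if the operator accepts the
    word classes of its two arguments, otherwise None."""
    letter, allowed = OPERATORS[operator]
    pair = (CLASS_OF.get(subject), CLASS_OF.get(obj))
    if pair not in allowed:
        return None
    return f"{pair[0]}{letter}{pair[1]}"

accepted = formula("antigen", "injected into", "ear")
rejected = formula("rabbit", "injected into", "antigen")
```

In this sketch a string that is grammatical in the general language, such as a rabbit being injected into an antigen, receives no formula because its word classes fall outside the operator’s recorded distribution.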
2.2.2 Sublanguages and paraphrastic relations
A further aspect of Harris’s theory is that of paraphrastic reductions. Paraphrasis
describes the relationship between the natural language biomedical text and the
simplified, informationally-equivalent sentences of the sublanguage suitable for
formatting in a database. The decomposition of the text sentence via transformational
operations (eg passivization etc) into a number of semantically equivalent sentences
lies behind Harris’s definition of a sublanguage as a mathematical sub-set of
sentences sharing a semantic common denominator.
An exemplification of Harris’s paraphrastic reductions can be found in studies on the
relationship between research articles and their accompanying abstract summaries
(Kittredge 2002; Chuah 2001). In the examples below, the paraphrases in operation
are shown in the conversion of the full-text sentence [6a] into its abstract
condensation [6b]:
[6a] In the present study, we investigated the capacity of TF to induce inflammation
by injecting human recombinant TF (rTF) into joint cavities of healthy mice.
[6b] In order to assess the proinflammatory capacity of TF itself, the recombinant
extracellular domain of TF was injected intra-articularly into healthy mice. (footnote 15)
Footnote 15: Example taken from Kittredge (2002:269) using an article from Arthritis Res 2002 4:190-191.
Particular attention is drawn to the underlined string containing a causal relationship
in [6a/b] for which the re-structuring of the information content can be stated as:
Ncap of Nagent to Vc Ninflam (the capacity of TF to induce inflammation)
→ pro-A0(Ninflam) Ncap of Nagent (the proinflammatory capacity of TF)
where Ncap = {capacity, ability, potential, tendency,….}, Nagent is a substance, Vc =
{cause, induce, trigger, ….}, Ninflam = {inflammation,… } and A0 is a lexical function
changing noun into adjective etc. Of relevance to the present project is the extent to
which categories such as Nagent can be set up in order to capture significant functional
generalisations in sentences expressing cause and effect relationships. The corpus-driven identification of functional categories making up the local grammar of
causation will be discussed in chapter 5.
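The restructuring of [6a] into [6b] can be sketched as a single rewrite rule. The category sets reproduce the members listed above; the regular expression, the one-entry noun-to-adjective table standing in for the lexical function A0, and the function name are assumptions made for illustration only.

```python
import re

N_CAP = {"capacity", "ability", "potential", "tendency"}
V_C = {"cause", "induce", "trigger"}
# Stand-in for the lexical function A0 (noun -> adjective); the single
# entry is an illustrative assumption.
A0 = {"inflammation": "inflammatory"}

# Matches: the Ncap of Nagent to Vc Ninflam
PATTERN = re.compile(r"the (\w+) of (\w+) to (\w+) (\w+)")

def reduce_phrase(phrase):
    """Rewrite 'the Ncap of Nagent to Vc Ninflam' as
    'the pro-A0(Ninflam) Ncap of Nagent', or return None if the phrase
    does not instantiate the categories."""
    match = PATTERN.fullmatch(phrase)
    if match is None:
        return None
    n_cap, n_agent, v_c, n_effect = match.groups()
    if n_cap not in N_CAP or v_c not in V_C or n_effect not in A0:
        return None
    return f"the pro{A0[n_effect]} {n_cap} of {n_agent}"

condensed = reduce_phrase("the capacity of TF to induce inflammation")
```

The rule applies only when each slot is filled from the relevant category set, which is the sense in which categories such as Nagent capture functional generalisations.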
2.2.3 Inequalities of likelihood
Related to the dependency relations identified in 2.2.1 is the property of inequalities
of likelihood which captures the extent to which certain operator-argument
combinations are more likely than others. For example in the invented causative
sentence Smoking causes heart-disease the argument heart-disease is more likely than
malnutrition for the operator causes etc. Given the semantic restrictions operating in
sublanguages, operator-argument constraints can be described
in more detail in terms of word class sets. Harris makes the important point that in
some biomedical contexts such as clinical reports, certain lexical items such as have
are very likely as in the combination patient has fever but offer little in terms of new
information (Friedman 2002:223). In sublanguage representation therefore, the string
patient has fever can be represented as
Patient | fever with the redundant verb have omitted or ‘zeroed’.
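The zeroing of a highly likely operator can be sketched as follows. Only the verb have and the example patient has fever come from the text; the set of zeroed forms, the restriction to three-word strings and the function name are simplifying assumptions.

```python
# Operators assumed to carry little new information in clinical reports.
ZEROED = {"has", "have"}

def to_sublanguage(sentence):
    """Drop a zeroed operator from a three-word subject-verb-object string
    and return the two arguments separated by a bar."""
    words = sentence.split()
    if len(words) == 3 and words[1].lower() in ZEROED:
        return f"{words[0].capitalize()} | {words[2]}"
    return sentence

formatted = to_sublanguage("patient has fever")
```

Strings whose operator is not in the zeroed set are returned unchanged, since the verb then contributes information of its own.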
Large scale distributional analyses have been at the heart of sublanguage applications
such as the framework developed by Sager (1972; 1982) to automatically parse the
most significant information-bearing sentences in pharmacology articles. For example
in the pharmacology sublanguage many sentences are dominated by what Sager terms
‘science-specific’ verbs (the operator in Harris’s terms). Compared to the general
language, Sager notes that verbs of sublanguage sentences are frequently subject to
very narrow constraints in terms of the ranges of allowable subjects and objects.
Within the domain of pharmacology for example, the verbs operate and exchange
with take as arguments highly restricted semantic subsets such as the names of the ions K+, Na+, Ca++
etc. With the emergence of corpus analysis techniques, it is now possible to
investigate these relationships on a more statistical basis, as statistically-defined
collocational patterns which in turn define semantic prosodies. The identification of
significant patterns of co-occurrence underlying causation in biomedical articles is the
focus of chapter 4.
2.3 A survey of biomedical sublanguages
2.3.1 Background
On the basis of Harris’s original work on sublanguage properties, it is ultimately
possible to describe the relationships between a restricted number of verbal operators
and an equally restricted number of arguments which these operators take in the form
of a sublanguage grammar. Thus in a grammar of a biomolecular sublanguage, the
restricted range of arguments for the operator activate can be described as involving
choices from the subclasses [substance] [activate] [substance] or [process]
[activate] [substance] etc. Thus the combination [person] [activate] [substance]
while well-formed in the general language grammar is ruled out in the sublanguage
grammar. Therefore the sublanguage can be characterized in terms of the dependency
relations and paraphrastic reductions of the general language. However the range of
lexis involved is constrained, forming restricted subsets only of the general language.
It can be observed for example that sub-sets combine with each other to realize
specialist semantic relationships while at the same time satisfying the syntactic rules
of the general language.
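The argument restrictions on activate described above can be sketched as a table of permitted class pairs. The two permitted pairs and the excluded [person] subject are taken from the text; the small lexicon assigning words to classes and the function name are hypothetical.

```python
# Illustrative lexicon (assumption): words mapped to semantic subclasses.
SEMANTIC_CLASS = {
    "Fyn": "substance",
    "Cbl": "substance",
    "phosphorylation": "process",
    "researcher": "person",
}

# Operator -> permitted (subject class, object class) pairs, from the text.
ALLOWED = {
    "activate": {("substance", "substance"), ("process", "substance")},
}

def well_formed(subject, operator, obj):
    """Return True if the sublanguage grammar permits the combination."""
    pair = (SEMANTIC_CLASS.get(subject), SEMANTIC_CLASS.get(obj))
    return pair in ALLOWED.get(operator, set())

in_sublanguage = well_formed("Fyn", "activate", "Cbl")
ruled_out = well_formed("researcher", "activate", "Cbl")
```

The second combination is syntactically well-formed in the general language but is rejected here because its subject class is not among those the operator takes in the sublanguage.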
In order to demonstrate the wider applications of the sublanguage approach in the
biomedical domain it is informative to compare two conceptually rather different sub-fields. These sub-domains comprise the sub-fields of clinical narratives (Sager 1981,
Sager et al 1987) and more recent research-orientated biomolecular fields. Research
into the formatting of clinical narrative has been in progress since the early 1980s
through the pioneering Linguistic String Project (henceforth LSP). The aim of these
sublanguage representations is to provide a principled statement of the entities and
semantic relations which capture the information structure of doctors’ notes from
clinical consultations, hospital discharge summaries and other patient data. The
process of informational formatting changes the free-text entry of the original text into
a semantic format which facilitates the extraction of key information elements. A
computer representation is possible because the sublanguage analysis mirrors the
linguistic regularities of the general language (Sager et al 1987:8). Relying on a
general language description however would restrict the grammatical representation
to the expression of syntactic relationships only. A sublanguage analysis on the other
hand aims at exploiting syntactic rules to capture and structure information within the
2.3.2 Clinical sublanguages
While the sublanguage analysis and parser for the LSP were based on the theories of
Harris, the grammar itself embodies a constituent formalism rather than the operator-argument system which dates from Harris’s own work on the sublanguage of
immunology (Harris et al 1989). The LSP grammar consists of 40 clinical subclasses
such as symptom, medication, body part etc which collectively mark informational
categories for the clinical sub-domain. In addition there are 14 semantic subclasses
which encompass temporal change (change, increase, decrease etc), evidential
information (no, present) and a number of connective devices (and, but, consistent
A similar sublanguage analysis is embodied in the MedLEE system (Friedman et al
1994). As with the LSP, the analysis identifies a series of semantic subsets in addition
to describing a set of linear rules permitting well-formed sequences of these elements.
A selection of the simplified categories of MedLEE is shown in the table below
(adapted from Friedman et al 2002:227). The database makes a basic tripartite
subdivision into primary, modifier and relational operator categories which are then
further divided into a total of 40 sub-categories. The primary category lists for
example patient behaviour, body function assessment, body measurement etc, while
modifier categories capture significant adjectival and circumstantial information such
as body location and intensity.
Primary category (footnote 16)
Clinical examples: user, drinks; breathing, movement; catheter, atrial electronic pacemaker (Device)
Modifier category
Clinical examples: heart, respiratory system; increased, came down to; slight, extensive amount; intravenous, continuous infusion
Relational operator
Clinical examples: and, or, as well as, with; accompanying, including, consistent with
MedLEE has been primarily developed to represent the types of relationships found in
radiological reports. For example one particularly important relationship is that
between radiological data scans and findings eg CT scan revealed a hypodensity
consistent with infarct etc. The relationship between these categories and patterns of
co-occurrence identified in the data can now be illustrated with specific examples
(modified slightly from Friedman et al. 2002: 228).
Pattern elements: Substance +, Behaviour +
Example | Target form
Cigarette smoker | [behaviour, smoke, ...
... walking with ... | [problem, difficult, [bodyfunc, walk]]
The first column above shows the main categories (in this case for patient behaviour
and body function assessment) of information captured in the database. For each
category, the main patterns of information elements are illustrated. Thus Bodyfunc
captures the patient’s walking ability with the element +finding listing the clinical
assessment difficult etc.
Footnote 16: See Friedman et al (1994) for a full specification of these categories and semantic relations.
2.3.3 A biomolecular sublanguage
The sub-field of molecular biology is the focus for an NLP system known as GENIES
also based on the sublanguage concept (Friedman et al 2002). Essentially the subclass
categories are common to both GENIES and MedLEE. However the field of molecular
biology is fundamentally different from that of clinical narrative reports which in turn
necessitates its own sublanguage grammar. This representation makes use of
essentially similar entities but needs to capture the molecular pathway relationships
which are particular to the biomolecular domain. A selection of these entities for the
categories of substance and action is shown in the table below (adapted from
Friedman et al 2002:229):
Substance
Il-2, p53, Caveolin-1; Cbl, Fyn, Let-23, Mycp70s6kE389D3E (this category makes no distinction between genes or proteins)
Amino acid: tyrosine, threonine 229
Small molecule: ZDEVD, guanosine triphosphate, tetracycline
DNA region: origin of replication, codon 249
Further substance examples: Src Homology 2; Jurkat cell, LBL-DR/ cells; adrenal glomerulosa; human, Epstein-Barr virus; polyvinylidene, difluoride
Action
Category labels: Act upon; Create bond
Action examples: activate, induce, mediate; inhibit, suppress, block; bind, join; cleave, demethylate; contain, include; result in, lead to; phosphorylate, polymerize; express, overexpress; catalyze, mediate, enhance; react, interact; up-regulate, control; replace, substitute
This sublanguage analysis is the product not only of domain specialists but also an
empirical process of molecular pathway extraction from a corpus of research articles
downloaded from the MEDLINE reference database. Work is initiated by specialists
who identify genes of interest. An automated system known as GeneWays then
performs a parse of the downloaded texts based on the sublanguage grammar to
extract regulatory pathways associated with the gene in question. Implicit in the
grammar is the recognition that for the purposes of information retrieval/ extraction,
molecular pathway information broken down into a number of substance and action
categories is of greatest relevance. Immediately relevant to the present project on
causation are the actions activate, act upon, cause, generate, modify and promote all
of which can be subsumed under the heading of cause in the sublanguage analysis.
The final stage in this construction of the metabolic pathway knowledge
representation involves once again domain experts in the ‘pruning’ of redundant
information obtained in the automatic parse.
The biomolecular sub-field may be regarded as being broadly typical syntactically for
biomedical fields involving biochemical, molecular and genetic entities because the
information structure of texts is dominated by high informational loading on the
nominal groups. Thus an analytical framework needs to capture the role of the verb
group in conveying interactions between substances as shown in the table above. One
further constraint imposed by the information structure is that nominal group
interactions are also dependent on other interactions, leading in grammatical terms to
embedding. The problem of embedding is exemplified below. Analysis of the
sublanguage for the GENIES system has resulted in the postulation of five top level
categories: substance, action, process, state and relation. As Friedman (2001:230)
states however these categories are too coarse in the sense that they permit non-well-formed string combinations. For example the pattern
Substance+action+substance permits both the strings Fyn activates Cbl and Cbl
activates Fyn, although this biomolecular interaction between two proteins is non-reversible. It is necessary therefore to sub-divide these top-level categories into a
series of finer-grained subdivisions. The table below also adapted from Friedman et
al. (ibid.:230) gives some examples of the sub-language analysis for the molecular
interactions entailed in this scientific sub-field.
Basic patterns | Example
Target form: [action, activate, [protein, Fyn], [protein, Cbl]]
Substance + action + substance | Fyn activates Cbl
Substance + be + actionven + by + substance | Cbl was activated by Fyn
Substance + actionn + of + substance | Fyn activation of Cbl
Actionn + of + substance + by + substance | Activation of Cbl by Fyn
Substance + actionved + by + substance | Cbl activated by Fyn
Substance + actionvor + substance | Cbl activator Fyn
Substance + actionvor + of + substance | Fyn activator of Cbl
Target form: [action, attach, [protein, Cbl], [protein, Fyn]]
Substance + actionn + with + substance | Fyn association with Cbl
Actionn + of + substance and + substance | The association of Cbl and Fyn
Target form: [action, transcribe, [gene, il-2]]
Actionn + of + substance | Transcription of Il-2 gene
Substance + actionn | Il-2 gene transcription
In the case of the example above, the semantic category of action has a number of
basic patterns involving proteins given the abbreviations Fyn and Cbl. Subscripts for
action denote whether the operator is a verb (present tense, past tense, past participle,
progressive etc) or nominal. The target form in the right-hand column is an expression
of the information contained in the pattern. For example in Fyn activates Cbl, the
structure of the information is represented as a sub-division of action and the two
protein (Fyn and Cbl) arguments of this operator.
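The mapping from several surface patterns onto one target form can be sketched with a handful of regular expressions. The four patterns cover only some of the activate variants discussed in this section, and every argument is assumed to be a protein; the pattern list and the function name are hypothetical and do not reproduce the GENIES grammar itself.

```python
import re

# (compiled pattern, (group index of agent, group index of target))
PATTERNS = [
    (re.compile(r"(\w+) activates (\w+)"), (1, 2)),         # active verb
    (re.compile(r"(\w+) was activated by (\w+)"), (2, 1)),  # passive verb
    (re.compile(r"Activation of (\w+) by (\w+)"), (2, 1)),  # nominal + by
    (re.compile(r"(\w+) activation of (\w+)"), (1, 2)),     # nominal + of
]

def target_form(phrase):
    """Return the nested target form for a recognised phrase, else None."""
    for pattern, (agent_group, target_group) in PATTERNS:
        match = pattern.fullmatch(phrase)
        if match:
            return ["action", "activate",
                    ["protein", match.group(agent_group)],
                    ["protein", match.group(target_group)]]
    return None

expected = ["action", "activate", ["protein", "Fyn"], ["protein", "Cbl"]]
active = target_form("Fyn activates Cbl")
passive = target_form("Cbl was activated by Fyn")
```

Active, passive and nominalized variants all reduce to the same structure, so the argument order in the target form reflects agent and target rather than surface position.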
As mentioned above, the high degree of information compaction in the nominal
groups which is a feature of scientific research writing in general (see Halliday 1993 and
Allen 2002a:21-24 for a more comprehensive treatment of nominalization)
necessitates a representation of the complex embedding prevalent in the sublanguage.
This is exemplified in the examples [7a-7b] below (taken from Friedman 2001:231):
[7a] Interleukin-3-induced phosphorylation of BAD through the protein kinase akt
In terms of the sublanguage grammar these relationships would be represented as:
[7b] [action, activate, [protein, interleukin-3], [action, phosphorylate, [protein, kinase
akt], [protein, bad]]]
In this representation the pre- and post-modified nominal group Interleukin-3-induced
phosphorylation of BAD containing a causal link through the transitive verb induced
is represented through the information structure [action, activate, [protein,
interleukin-3] which is in turn portrayed as one participant phosphorylate of a further
activation with the protein kinase akt. In other words the first interaction is nested
within a wider interaction to form a chain of causally-linked entities. The
phenomenon of embedding within nominal groups linked via causative verbs will be
explored from a lexical perspective in chapter 4 and in terms of the functional
representation of the local grammar in chapter 5.
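The nesting in [7b] can be sketched as a recursive data structure. The nested list mirrors the target form given above; the traversal function and its name are hypothetical and serve only to show how the chain of embedded interactions might be read off.

```python
# Target form [7b] written as a nested Python list.
NESTED = ["action", "activate",
          ["protein", "interleukin-3"],
          ["action", "phosphorylate",
           ["protein", "kinase akt"],
           ["protein", "bad"]]]

def action_chain(form):
    """Yield action names from the outermost interaction inwards."""
    if form[0] == "action":
        yield form[1]
        for argument in form[2:]:
            yield from action_chain(argument)

chain = list(action_chain(NESTED))
```

Reading the chain from left to right gives the outer interaction first and the embedded interaction second, which corresponds to the nesting of one interaction within another described above.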
2.3.4 Clinical and biomolecular sublanguages compared
A comparison can now be made between the type of clinical narratives making up the
LSP and MedLEE projects and the biomolecular focus of GENIES. The question may
be asked at this stage as to what extent this basic sub-division is paralleled in the
wider domain of biomedicine. The distinctions made between these sublanguages
while admittedly somewhat crude do at least open up two possible dimensions of
focus in biomedicine, one of which centres on the patient, the other on specific
biochemical and genetic participants in biomedical processes (Friedman 2002:232).
Despite these differences the subject matter of both sublanguages is partially
overlapping - both representations need to capture relationships involving cells,
diseases, conditions etc as well as formatting temporal and other circumstantial
information concisely. In the case of clinical narratives however the relationships
involve diagnosis and treatment administration and their effects on the patient state.
These relationships can be contrasted with the focus in biomolecular research on the
interactional pathways involving biochemical substances which lead to the heavily
pre-modified, nested nominal groups described above which are in turn related
through a restricted sub-class of verbs.
Ultimately Friedman argues that MedLEE and GENIES can make use of the same
categorical entities (which raises the prospect that they can partially share the same
NLP system) but separate sublanguage grammars need to be written to capture the
very different relations at work in these two scientific sub-fields.
2.4 Natural-language processing and biomedicine
2.4.1 General
The recent advances in NLP and language engineering alluded to above have raised
the prospect that sublanguage grammars can provide the basis for the automatic
parsing of texts in restricted domains. Drawing from fields as diverse as cognitive
science, logic, computer science, psychology and of course linguistics, NLP aims at
facilitating the decoding of human language by computers while at the same time
enabling computers to encode via the interface of human language (Blaschke et al
2002). NLP has traditionally concerned itself with machine translation, automatic
information retrieval and serving as a human-machine interface (Grishman 1986). In
linguistics, NLP has chiefly focussed on the computer verification of grammatical
models from theoretical linguistics.
Central to all of these concerns is the extent to which a computer model of a language
can parse a text, ie create a linear representation of the elements making up each
sentence (Sampson 1992). One of the perennial problems of computational linguistics
has been the failure on the part of explicit formal rules based on linguistic intuition to
cope with the complex structure of natural language even within restricted text
environments. With the rise of computer-held corpora of natural language, stochastic
rules based on statistical patterns of co-occurrence between lexical items have become
incorporated in grammatical descriptions with promising results (Black et al 1993,
Bod 1993).
2.4.2 Applications in the biomedical domain
Within biomedical informatics, the central aims of NLP applications have been firstly
directed towards the formatting of patient records and other clinical narrative sources
and secondly towards the structuring of information in biomedical research articles as
knowledge bases. More specifically NLP applications in the clinical domain have
focused on decision and diagnosis support (Fiszman and Haug 2000; Friedman et al
1999), the information structure of admission diagnoses (Gundersen et al 1996,
Blanquet and Zweigenbaum 1999), data-mining (Doddi et al 2001, Wilcox and
Hripcsak 2000) and the representation of medical vocabularies (Aronsson 2001). The
research domain has similarly seen NLP approaches used particularly successfully in
terms of mapping the relations between biomolecular substances (eg Tanabe and
Wilbur 2002, Gaizauskas et al 2002). Several of these applications in research articles
have used statistical methods to improve the representation of the relations rather than
relying on the manual intervention of domain experts.
The problem of naming biological entities has now assumed considerable importance
as the results of the sequencing of the human genome have filtered into the
burgeoning research literature. One aspect of this process is database curation which
is a manual process by which key concepts from a scientific sub-domain are
transferred from the literature to database fields. Similarly the Gene Ontology
Consortium project aims to provide some measure of consensus by creating a
hierarchically-organised controlled vocabulary which can be queried at various levels
(Tanabe and Wilbur 2002). The aim of such initiatives is to begin to tackle the
problems engendered by variations in term names; thus in the literature of bacterial
protein synthesis, it is possible to discuss the function of these molecules in terms of
‘translation’ and ‘protein synthesis’ etc. The rule-based POS tagger which the authors
put forward can be ‘trained’ to identify gene and protein names, serving potentially to
remove one of the principal obstacles to the construction of domain-specific
2.5 Information retrieval and information extraction
2.5.1 General
Ultimately the utility of any biomedical sublanguage representation lies in
information retrieval (IR) and extraction (IE). IR relates primarily to the use of a
search engine such as Google or Altavista to retrieve relevant documents from the
entirety of on-line material (Baeza-Yates and Ribeiro-Neto 1999). As Blaschke et al
(2002) note, IR has a comparatively long history going back to content analysis
(Salton 1968, 1989, 1991). While this has been an extremely successful
implementation of NLP within informatics and has no doubt contributed very
significantly to the rise of the internet as a source of information, the product of the
query search is ultimately a number of documents of varying relevance. There is thus
a need to develop more powerful document access tools which can extract individual
items of information from retrieved document texts.
2.5.2 Information retrieval
Within the biomedical domain, perhaps the most high-profile application of IR has
been in the PubMed system which returns abstracts in response to a specific term
query. The basis for the retrieval of information is a statistically-derived relationship
between query-item and linked document (Hirschman et al 2002). IR systems are
generally rated according to the measures of precision and recall which have been
used in the evaluation of a series of information extraction experiments known as
Message Understanding Conferences (MUCs). According to Salton (1989:248),
retrievals can be rated according to the ratio of the number of relevant records
retrieved to the total number of relevant records in the database (recall) or
alternatively the ratio of the number of relevant records retrieved to the total number
of irrelevant and relevant records retrieved (precision). In the light of the retrieval of
non-relevant material in response to more specific queries, IR can sometimes be seen
as a blunt instrument in the search for information entities.
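The two measures defined by Salton can be computed directly from the set of retrieved records and the set of relevant records. The sketch below is illustrative: the function name precision_recall and the document identifiers are hypothetical and do not come from any of the systems discussed.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Return (precision, recall) for one query.

    precision = relevant records retrieved / all records retrieved
    recall    = relevant records retrieved / all relevant records in the database
    """
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four records returned, three relevant records in the database.
retrieved = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc2", "doc4", "doc5"}
p, r = precision_recall(retrieved, relevant)
```

In this invented case two of the four retrieved records are relevant (precision 0.5) and two of the three relevant records are found (recall of two thirds), which illustrates how a query can return a good deal of non-relevant material while still missing part of what is sought.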
2.5.3 Information extraction
IE on the other hand has the stated aim of ‘going one step further’ in extracting
relevant elements of information from within a document. These elements are then
structured into a database format. IE has been used in experiments in the provision of
text summary and automatic abstraction.
Work in information extraction has benefited greatly from the success of the MUCs
described above, of which there have been a total of seven covering the period
1987-1998. MUCs have been designed to extract information relating to specific news
events and therefore work on the basis of the identification of a number of key
elements (Hirschman et al 2002). The first of these, the named entities, identify
objects (eg business names, manufactured artefacts, time frames, sums of money,
dates etc) which are relevant to the news story. These entities are then normalized – for
example a company name and its acronym occurring later in the text are regarded as
one and the same entity. Template elements collectively collapse all entity mentions
into semantic subtypes which are captured at the document level rather than on a
sentential basis. Template relations on the other hand define the relations between the
named entities mentioned in the document. In the case of a business report, a
relationship could be specified in terms of LOCATION_OF, PRODUCT_OF,
EMPLOYEE_OF etc. For template relations such as LOCATION_OF, the
relationship might be defined between the entities of business organisation and
town/city etc. The structure of a specific event is captured in terms of a scenario
template which adds a timeframe specification to the information captured by the
template relations and thus permits the representation in database format of an event
such as a business take-over or product launch.
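The layering of named entities, template relations and scenario templates can be sketched as simple record types. This is an illustrative sketch rather than the MUC specification: the class names, the normalise helper and all of the data (company, acronym, place and date) are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class TemplateRelation:
    """A typed relation between two normalized entities, e.g. LOCATION_OF."""
    relation: str
    source: str
    target: str


@dataclass
class ScenarioTemplate:
    """An event built from template relations plus a timeframe specification."""
    event: str
    timeframe: str
    relations: List[TemplateRelation] = field(default_factory=list)


def normalise(mention: str, aliases: dict) -> str:
    """Map an acronym or variant onto its canonical entity name."""
    return aliases.get(mention, mention)


# Hypothetical data: a company name and its acronym are treated as one entity.
aliases = {"AWL": "Acme Widgets Ltd"}
mentions = ["Acme Widgets Ltd", "AWL", "AWL"]
entities = {normalise(m, aliases) for m in mentions}

launch = ScenarioTemplate(
    event="product launch",
    timeframe="March 1998",
    relations=[
        TemplateRelation("LOCATION_OF", "Acme Widgets Ltd", "Birmingham"),
        TemplateRelation("PRODUCT_OF", "Widget X", "Acme Widgets Ltd"),
    ],
)
```

The three mentions collapse into a single entity, and the scenario template adds the timeframe to the relations so that the event can be stored as one database record.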
While there are obvious parallels between the relations specified in the media related
MUCs and biology in the sense that both can be described in terms of entities and
relations, it would appear from the literature that the results for biology have been
more disappointing than for news. One possible explanation for this is the difference
in naming conventions between biological and news entities. Biological names for
entities such as genes and proteins are productive in the sense that they are often
assembled from a set of prefixes and suffixes. Ideally terms should be monoreferential indicating a 1:1 relationship between term and concept (Ananiadou et al.
2003). Instead the process of automatic literature mining has to cope with the
problems of ambiguities (one term relating to several concepts) or conversely variants
(many terms related to the same concept). Frequently the appearance of a new named
entity is accompanied by an abbreviation. Proux et al. (1998) give the example of the
abbreviation ‘asp’ which can take on up to 40 different meanings as described in the
Acromed database. This lack of consistency makes it difficult for annotators of
training corpora to agree on naming conventions for the texts.
One of the main difficulties involved in information retrieval and extraction is that the
information is not normalized. As Grishman (2002:2/15) reports, this problem makes it
difficult to retrieve information about specific relationships found in a text. An
example of such a relationship given by Grishman is the performing of a search to
‘list companies with headquarters in Pennsylvania which declared bankruptcy’. This
information would require the construction of a database to capture these relationships
which would presumably rest on the analysis of the specialist domain of business
reports and the construction of a sublanguage grammar. One particular application of
interest from the biomedical domain is Proteus-BIO which uses a sublanguage
representation to normalize the relationships surrounding infectious disease outbreaks
(Grishman 2002).
Proteus-BIO makes use of a sophisticated search engine to scan the WWW on a daily
basis looking for information relating to new outbreaks of infectious disease such as
Ebola etc. In doing so this system builds on a particularly successful application of
NLP in the area of MUCs (Grishman and Sundheim 1996). The sources of
information are the ProMed-mail of the International Society for Infectious Diseases
and the Disease Outbreak News of the World Health Organisation. The application
works by first applying a filter in conjunction with the search engine to identify
specific lexical items or phrases which are relevant to a disease outbreak. In the
example given by Grishman, the software works on the basis of sublanguage patterns
relevant to disease such as ‘outbreak of <disease> killed <victims>’. The spaces
between the < > brackets are filled by the particular name of the disease and the
number of victims / demographic characteristics of victims etc which form the basis
for the database record.
The key to the operation of the information extraction system is to capture the
relationship between an incident (which reports single or isolated occurrences of the
disease) and an outbreak involving potentially multiple geographical locations and
extending over a wider span of time. The relationship is therefore one of hyponym to
superordinate; this is what Grishman has in mind when he talks about normalizing the
relationships in the database. The sublanguage which forms the basis for the system is
organized around the concept of an event pattern exemplified by the string cholera
killed 7 inhabitants etc. In linguistic terms such an event pattern comprises nominal
participants (a disease outbreak and its victims) in combination with a transitive verb
linking the two nominal groups in a causative relationship. In total some 74 patterns
provide matches for individual incidents identified from the news reports, an example
of which is shown below (adapted from Grishman 2002:5/6):
event pattern np (DISEASE) vg (KILL) np (VICTIM)
matches: Cholera killed 23 inhabitants
event pattern np (VICTIM) vg-passive (KILL) by np (DISEASE)
matches: 23 inhabitants were killed by cholera
On this basis it is possible to capture both active and passive clausal variants which
realise the incident. It is also of interest in terms of extracting relevant information to
format circumstantial (particularly locative and temporal) information in connection
with the disease outbreak. The sublanguage grammar for these events also provides a
slot for what Grishman refers to as ‘sentence adjuncts’ which can occur fairly freely
within the clause:
SA* np(VICTIM) SA* vg-passive (KILL) SA* np (DISEASE) SA*
where SA could be three weeks ago, in January 1998 (temporal); in Rwanda,
Northern Thailand (locative) etc.
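The active and passive event patterns can be approximated with regular expressions. The following is a minimal sketch and not the Proteus-BIO implementation: the three-item disease lexicon, the treatment of a victim group as a numeral plus one noun, and the function name extract_incident are all assumptions made for illustration.

```python
import re

# Assumed toy lexicon; a real system would draw on a full disease vocabulary.
DISEASE = r"(?P<disease>cholera|ebola|malaria)"
# Assumed shape of a victim noun group: a numeral followed by one noun.
VICTIM = r"(?P<victims>\d+\s+\w+)"

# np(DISEASE) vg(KILL) np(VICTIM)
ACTIVE = re.compile(rf"{DISEASE}\s+killed\s+{VICTIM}", re.IGNORECASE)
# np(VICTIM) vg-passive(KILL) by np(DISEASE)
PASSIVE = re.compile(
    rf"{VICTIM}\s+(?:was|were)\s+killed\s+by\s+{DISEASE}", re.IGNORECASE
)


def extract_incident(sentence: str):
    """Return the disease and victim fillers for the first pattern that matches."""
    for pattern in (ACTIVE, PASSIVE):
        match = pattern.search(sentence)
        if match:
            return {
                "disease": match.group("disease").lower(),
                "victims": match.group("victims"),
            }
    return None
```

Both clausal variants yield the same record, so that extract_incident("Cholera killed 23 inhabitants") and extract_incident("23 inhabitants were killed by cholera") each return the disease cholera and the victims 23 inhabitants. Because search scans the whole sentence, adjuncts at the edges of the clause (Three weeks ago ... in Rwanda) are tolerated, but adjuncts occurring between the elements of the pattern are not modelled in this sketch.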
Of particular interest in the capture of causative information is the role which the
name of the disease plays as the complement of a preposition in a prepositional phrase
serving to connect cause and effect. This information is included in Proteus-BIO in
the form of the following pattern:
(of | by | from | with | due to | because of ) np (DISEASE)
Quite clearly this type of a pattern will capture the causative linkage between an
incident (eg number of deaths, patients hospitalized) and the presumed cause in the
form of a disease. In Proteus-BIO however this marking of a cause and effect
relationship is limited to the prepositional phrases as a sentence adjunct containing the
name of a disease; the local grammar of causation which is the focus of this project
needs to encompass a much wider linguistic marking of causal relationships through
the transitivity of the verb for example. However for the type of news report which is
being searched by the software, the recognition of an event and its cause in the form
of a known disease is a significant step forward in information retrieval.
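The prepositional pattern lends itself to the same kind of approximation. Again this is a sketch under stated assumptions: the disease lexicon is a toy list and the function name find_cause is hypothetical; only the set of prepositions is taken from the pattern quoted above.

```python
import re

# Prepositions from the pattern above; multi-word items are listed first.
CAUSE_PP = re.compile(
    r"\b(?P<prep>because of|due to|of|by|from|with)\s+"
    r"(?P<disease>cholera|ebola|malaria)\b",  # assumed toy disease lexicon
    re.IGNORECASE,
)


def find_cause(sentence: str):
    """Return (preposition, disease) if a causal prepositional phrase is found."""
    match = CAUSE_PP.search(sentence)
    if match is None:
        return None
    return match.group("prep").lower(), match.group("disease").lower()
```

A sentence such as Five patients were hospitalized because of cholera returns the pair (because of, cholera), whereas Cholera killed 23 inhabitants returns nothing, since there the causal link is carried by the transitive verb rather than by a prepositional phrase.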
As Grishman points out (2002:8) the construction of an IR system such as Proteus-BIO
requires the setting up of a so-called ‘knowledge base’. The work involved in
such a knowledge base includes for example a full specification of the lexical patterns
(examples of which are shown above), a full listing of the lexical items which serve as
the initial query points of entry into the web-based news text together with the
sublanguage grammar rules. One of the difficulties which this process entails is that
the manual analysis of these linguistic patterns is successful only in identifying the
most commonly occurring patterns. In accordance with Zipf’s law (Zipf 1935), it
becomes more difficult to extract the less frequent patterns from manual analysis of
the corpus. To reduce this problem Grishman and his co-workers have produced an
algorithm which enables the computer to ‘learn’ automatically new patterns from an
unannotated corpus. In order to do this, a number of so-called ‘seed’ patterns are
added to the corpus such as ‘disease N killed X people’, ‘victims had symptoms’.
When these patterns are matched, the computer produces a list of relevant documents
from the general corpus which are then added to the list of candidate patterns.
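The general idea of learning patterns from seeds can be sketched as a simple loop. The code below is a loose illustration and should not be read as the algorithm of Grishman and his co-workers: the scoring rule (the proportion of a candidate's matching documents that are already judged relevant), the fixed list of candidate patterns, the function name bootstrap and the four invented documents are all assumptions.

```python
import re


def bootstrap(documents, seeds, candidates, rounds=2):
    """Grow a pattern list from seeds, accepting one new pattern per round."""
    accepted = list(seeds)
    remaining = [c for c in candidates if c not in accepted]
    for _ in range(rounds):
        # Documents matched by any accepted pattern count as relevant.
        relevant = {
            i for i, doc in enumerate(documents)
            if any(re.search(p, doc) for p in accepted)
        }
        best, best_score = None, 0.0
        for cand in remaining:
            hits = {i for i, doc in enumerate(documents) if re.search(cand, doc)}
            if not hits:
                continue
            # Assumed score: how concentrated the candidate is in relevant documents.
            score = len(hits & relevant) / len(hits)
            if score > best_score:
                best, best_score = cand, score
        if best is None:
            break
        accepted.append(best)
        remaining.remove(best)
    return accepted


# Invented documents for illustration only.
documents = [
    "Cholera killed 23 people. Many victims had symptoms of dehydration.",
    "Ebola killed 5 people. The outbreak claimed more lives in the north.",
    "The company declared bankruptcy and closed its headquarters.",
    "A new outbreak claimed 12 lives; victims had symptoms of fever.",
]
seeds = [r"killed \d+ people"]
candidates = [r"victims had symptoms", r"outbreak claimed", r"declared bankruptcy"]
learned = bootstrap(documents, seeds, candidates, rounds=3)
```

In this toy run the seed first admits victims had symptoms, which in turn makes the fourth document relevant and so admits outbreak claimed; the pattern declared bankruptcy never co-occurs with a relevant document and is not accepted.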
2.7 Summary
This chapter has described applications of the sublanguage approach to specific
problems in biomedical informatics. A number of research initiatives based on
sublanguage grammars are presented, including systems for information extraction in
both clinical and research article domains. Ultimately these initiatives analyze a
multitude of semantic relationships in separate highly restricted domains and their
syntactic realisations in the grammar, raising the prospect that extremely fine-grained
‘idiosyncratic’ grammars and ontologies can be written specific to each scientific sub-domain.
This thesis will explore instead a different approach to sublanguage definition based
on a selective semantic focus on one area of meaning: causation. If sentences
encoding cause and effect relationships are treated as a functionally-defined
sublanguage universal throughout scientific textual domains, the analysis of the
sublanguage grammar into semantic components can have significant potential in
3. Methodology
3.1 Introduction
This chapter sets out a methodology for the genre-based identification and
delimitation of causation as a semantic subset of general language.
This delimitation provides the basis for the construction of a specialized corpus which
serves in turn as the principal source data in the compilation of the grammar. As such
the presentation adopted here builds extensively on the methodological pilot studies
presented in Allen (2002a, 2002b). These papers serve to justify the scientific research
article sub-genre as the sublanguage source for the grammar in question. Not only are
the logico-semantic relations of cause and effect in abundance within this sub-genre
but as we have seen in chapter 1 they are also of paramount importance rhetorically at
the nexus of empirical research findings and theoretical consensus.
Of prime importance to the methodological treatment presented in this chapter is the
adoption of a corpus-driven approach to the integrity of the corpus data. However the
distinction which Tognini-Bonelli (1996, 2001) makes between corpus-based and
corpus-driven approaches needs to be further clarified in the light of the systemic-functional
framework of the local grammar described in later chapters. As described
in Allen (2002b:4-9), a corpus-driven methodology is consistent with a direct
confrontation with the corpus evidence and the extraction of phraseological patterns
centred on specific lexical items. The data has not been annotated in accordance with
any linguistic theories. It is argued therefore that the initial stage in the compilation of
the grammar is essentially corpus-driven.
This position needs to be revised with regard to subsequent stages which involve the
mapping of patterns onto the systemic-functional categories of the local grammar. As
will be explained in chapter 5, the latter stages of this process represent the adoption
of a very specific Hallidayian theoretical stance to the categories of the grammar. The
final product of the local grammar is therefore corpus-based in the sense that the
corpus data is sifted through systems representing probabilistically-related
paradigmatic choices. Such a perspective is identified by Tognini-Bonelli (2001:62)
with the corpus-based approach as a system of abstract possibilities. Implicit in this
position is the belief that language choices made from each system are inherently
probabilistic (Halliday 1991).
The discussion moves on to justify the construction of a small, specialized corpus as a
sublanguage source as opposed to a reliance on pre-existing general language corpora
such as the Bank of English or the BNC. Similarly as the design, construction and
sampling representativity of the pilot corpus has already been described in detail, the
focus in this chapter is on the expansion of the original pilot corpus into the 2 million
word ‘final corpus’ which serves as the basis for lexical pattern extraction.
In the light of these and other sampling criteria, the question of corpus
representativeness is once again considered this time with respect to the final corpus
(henceforth referred to as the Halmstad Biomedical Corpus or HBC). The discussion
moves on from consideration of the practical aspects of corpus construction and
sampling to the more problematic issue as to what should be delimited as the
sublanguage of causation. Central to the resolution of this problem is the adoption of
an intuitive basis for the identification of causation.
As has been discussed elsewhere, the discourse of the research article is subject to
very specific constraints which among other aspects are responsible for the very high
degree of nominalized packaging of scientific arguments. Nominalization can be seen
in some respects as substantially reducing the complexity of the task in describing
cause and effect in lexical grammatical terms. This task is accomplished largely by
delimiting verbs encoding causal relationships as a subset of the English transitive verb
system. Similarly the deference shown by researchers to their wider discourse
community readerships has a significant effect on the linguistic expression of cause
and effect within the genre. In epistemic terms these constraints lead to heavily
mitigated claims manifested linguistically by a variety of hedging devices. As will be
discussed in section 3.5.3, hedging makes the task of delimiting ‘true causation’ from
related areas of inferential relationships more problematic, necessitating a looser
definition of cause and effect for the purposes of sublanguage description which is not
tied to factivity. Finally the methodology of concordance analysis is described
together with the format of data storage for the retrieved lexical patterns.
3.2 Causation and the specialist corpus
3.2.1 Why a specialist corpus?
The outline for the local grammar of cause and effect put forward as a pilot project
drew from the general language Bank of English (henceforth B of E) as the source of
its data. However as has been reported in previous papers (Allen 2002a, 2002b), the
overlap between the logico-semantic relation of causation and the grammatical system
of transitivity makes the task of describing a lexical grammar from a general language
corpus an onerous one if a corpus-driven commitment is to be maintained. The
researcher is faced essentially with the overwhelming task of describing at least a very
significant subset of the English verb system.
This problem can be briefly illustrated with the lexical items doom, dragoon,
browbeat and bully included in the table below. These items occur in causative patterns
in a general language corpus such as the B of E but are extremely rare or absent from
the restricted genre corpus described in this thesis.
general corpus example
Jepson's first goal for the club helped secure second
division safety and doomed Southend to a second
successive relegation
Corpus: sunnow/17. Text: N911998042
And I had a history professor and a dean who dragooned
me into taking the exam
Corpus: npr/07. Text: S2000910219.
Before long this pushy hypochondriac has browbeaten
poor Giorgio into dumping the prettily trilling if
admittedly rather anodyne Clara
Corpus: times/10. Text: N2000960328
And she used to bully me into doing my schoolwork
Corpus: usbooks/09. Text: B9000000523.
One plausible explanation for this absence from a corpus of research articles is that
these verbs of manipulation and coercion usually require animate human subjects. As
Dubois (1986) cited in Allen (2002a:3) has shown, this explicit statement of agency is
a feature of popular scientific writing which focuses on the human participant rather
than scientific processes. In contrast the high degree of nominalisation which is a
feature of scholarly scientific writing has the effect that causal relationships are
frequently expressed in nominal groups, leaving a relatively small and well-defined
group of lexical items as the verb relayers of causal relationships between the nominal
groups. The delexicalising effect of nominalization and the role nominal groups play
in the expression of cause and effect underlying the hypothesis statement is explored
in Allen (2002b:20-25).
3.2.2 The genre approach to small corpus design
As described in chapter 2, the genre approach associated with Swales (1990) has been
influential in the construction of small, specialized corpora. One application of the
genre approach to specialized corpus construction is the work of Gledhill (1995, 1997,
2001) who enlisted the assistance of a group of researchers involved in various
aspects of cancer research. Rooted in an ethnographical approach which places the
construction of scientific text firmly within its sociological context, Gledhill makes
use of the specialists themselves to identify representative texts which had been
disseminated within their respective discourse communities.
Gledhill’s genre-based selection of corpus material immediately predates the
emergence of the World Wide Web (WWW) as a repository of electronic text. Since
the late 1990s, the WWW has begun to challenge traditional paper journals as a
medium for the publication of scientific research articles. The great majority of
leading journals now publish electronically in parallel with the paper versions. In
Allen (2002a), it is argued therefore that this wide availability of textual material
coupled with the use of database search engines to specify very precise sampling
criteria can greatly benefit small corpus building initiatives. These developments
essentially provide a statistical alternative to time-consuming ethnographic work as a
precursor to corpus construction. It should not be thought however that the use of the
WWW to create larger specialist corpora in any way invalidates the essential
ethnographic approach to small corpora design which has shed a great deal of light on
the discourse practices of scientific communities. The chief benefit of using the
WWW to gather data is that the size of the resulting corpus can be increased and the
range and coverage of textual material extended. The sampling of material using
keyword measures is described in more detail in section 3.4.3.
3.3 The HBC Pilot Corpus
The construction of the corpus in two stages has been deemed necessary to provide a
means for assessing key issues of corpus construction such as representativeness and
sampling. The overall objective in the design and construction process was therefore
to determine the extent to which the sampled biomedical RAs constitute a reasonably
homogeneous resource as the basis for the extraction of causative patterns. This
position can be contrasted with research on general language corpora such as Biber
(1988; 1990; 1993) which has chiefly examined the heterogeneity of a corpus drawn
from various spoken and written sources. The work of Biber is however still
influential even for work on a specialist corpus, stressing the need for a cyclical
process of corpus building and the compilation of detailed inventories of the textual
contents as a basis for revision and corpus expansion. These stages are here referred
to as the pilot corpus and final corpus respectively.
The construction of the pilot corpus together with the need to provide a textual and
functional inventory of corpus contents have been described in detail in Allen (2002a)
and so again will only briefly be mentioned at this point. Central to this inventory has
been the need to survey the causation sublanguage which constitutes a subset of the
biomedical text making up the corpus. To this end a provisional taxonomy of
causation was set up including categories of intra-clausal, inter-clausal and text
signalling causative devices at the discourse level. This taxonomy is an attempt to
move the analytical focus beyond the previous narrow focus on the clause to embrace
textual perspectives on cause and effect relationships between textual segments. These
categories are reproduced in the table below; for a full listing of the various category
subdivisions, see Allen (2002a:31-37).
1. Clause internal
Recent advances in treatment have
increased the rate of cure of childhood
ALL to 75 percent or better.
Following on from the discussion in Allen (2002a:28), source files for examples quoted from the
corpus are given the designation GRA (general medical journals) or SRA (specialist journals)
2. Inter-clausal causation
If increased longevity is accompanied by
declines in rates of disability, as
suggested by recent studies, then the
effect of increased longevity on health
care expenditures may be
Because of an excess number of
transplantation-related deaths (Table 2),
the group that received marrow from
matched unrelated donors had a risk of
treatment failure that was higher than
(although not significantly higher than)
that in the group treated with
chemotherapy alone (Table 4). This
result supports our opinion that
transplantation of marrow from a
matched unrelated donor should be
undertaken only in centers where the
results of this procedure are similar to
those obtained with matched related donors
3. Meta-sentential causation
Articles in the pilot corpus are drawn from both general medical journals such as the
New England Journal of Medicine (NEJM) and The Lancet and also more specialised
scholarly sources. Statistical analysis attempted to pinpoint whether there were
significant differences in the frequencies of causative encodings between these two
broad groupings and also between the American English adopted by the NEJM and
the British English adopted by the editorial staff of the Lancet. In order to produce the
frequency measures, a preliminary survey of the causation present in the pilot corpus
was conducted. This survey necessitated a working taxonomy of causation to be set
up. The taxonomy in use was broadly functional in its orientation, essentially
modifying a number of categories based on the Hallidayian system of transitivity but
also taking into account hypotactic devices such as because in realising cause and
effect relationships above the level of the clause within the clause complex.
The survey attempted to answer a number of questions which would provide the basis
for expanding the pilot corpus to the final research corpus. Broadly speaking it was
found necessary to determine whether there are significant differences not only
between specialist and generalist journals but also occasioned by the choice of
regional English variety at the editorial level. The results from this preliminary survey
appeared to show that there were no significant differences between the general and
specific journal dimension on the one hand and the American / British English
dimension on the other. The statistical homogeneity was also paralleled by the macro-structure
of the empirical/experimental research article with its relatively uniform
Introduction Methodology Results Discussion (IMRD) structure. This in itself was an
important finding which greatly facilitated the task of corpus construction as it meant
that it was now possible to build the corpus on the basis of freely-available specialist
journals (the on-line archives of the NEJM and Lancet require a subscription to the
paper journal for full on-line access).
The construction of the pilot corpus also enabled an estimation of the standard error to
be made, including the extent to which the standard error could be reduced by
expanding the corpus. The results varied considerably according to the frequency of
the causative category. For the relatively infrequent textual realisations of causation,
such as the conjunctive / text-signalling devices mentioned above, the results pointed
towards the construction of a very large corpus, in the region of 60 million words.
Such a size would begin to rival the large commercially available mega-corpora such
as the British National Corpus and would certainly make the HBC one of the largest if
not the largest single genre corpus in the world.
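The relationship between the frequency of a feature and the corpus size needed to estimate it can be illustrated with a simple calculation. The sketch below is not the procedure used for the pilot corpus: it assumes that each running word is an independent trial with a fixed probability p of realising the feature (the binomial standard error), and the rate of 10 occurrences per million words and the 10 per cent target are hypothetical figures chosen only for illustration.

```python
import math


def standard_error(p: float, n: int) -> float:
    """Binomial standard error of an observed rate p in a corpus of n words."""
    return math.sqrt(p * (1 - p) / n)


def words_needed(p: float, relative_se: float) -> int:
    """Corpus size at which the standard error falls to relative_se * p.

    Derived from SE / p = sqrt((1 - p) / (p * n)), solved for n.
    """
    return math.ceil((1 - p) / (p * relative_se ** 2))


# Hypothetical feature: 10 occurrences per million words, 10% relative error.
rate = 10 / 1_000_000
n_required = words_needed(rate, 0.10)
```

Under these assumptions a feature occurring 10 times per million words calls for a corpus of roughly 10 million words, and halving the rate doubles the requirement, which shows why the least frequent categories drive the estimate of corpus size upwards.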
As with all corpus building projects however, the eventual size of the resource is
limited by financial and time constraints. The construction of a 60 million word
resource is largely beyond the individual researcher even in the current era of
electronic text availability and downloading, largely due to the time-consuming nature
of the SGML/XML file formatting.
3.4 From pilot corpus to final corpus
3.4.1 General
The final corpus contains 589 medical research articles published since 1997, taken
from 2 general medical journals and 64 specialist periodicals. Collectively these
articles make up some 1.93 million words of running text. In terms of lexical
proportions the corpus is approximately double the size of the so-called ‘first
generation’ general language corpora such as The Brown Corpus or Lancaster Oslo
Bergen Corpus at approximately 1 million words each. The corpus is also larger than
several specialist corpora such as the 0.5 million word PSC (Gledhill 1995) and the
1.02 million word Business English Corpus (Nelson 2000). The HBC is still dwarfed
however by major general language corpus initiatives such as the Bank of English at
450 million words and falls a long way short of the 60 million word dimensions
which from the assessment of the standard error would be needed to ensure proper
representation for the less frequent features in the taxonomy.
Following on from the 135,000 word pilot corpus, the Library of Congress
(henceforth L of C) classification system has been similarly applied to the
construction of the final corpus. This classification scheme is outlined in Allen
(2002a:26). In this paper the point was made that library classifications represent a
somewhat arbitrary division of the subject matter of a discipline; indeed a comparison
between any two library classification systems in many cases reveals substantial
differences in the categorization and hierarchical representation of the subject matter.
An example of this can be seen in the table below, which shows the hierarchical subdivisions for the category 610 medical science in the Dewey Decimal System (DDC)
compared with the category R for medicine in the L of C classification.
Comparison between the Dewey Decimal and Library of Congress classification
systems for biomedicine
DDC                                    L of C
610 Medical Sciences                   RA Public aspects of medicine
611 Human anatomy, cytology and        RB Pathology
612 Human physiology                   RC Internal Medicine
614 Incidence and prevention of        RD Surgery
615 Pharmacology and therapeutics      RE Ophthalmology
616 Diseases                           RG Gynecology and obstetrics
617 Surgery and related medical        RJ Pediatrics
618 Gynecology                         RK Dentistry
619 Experimental medicine              RL Dermatology
                                       RM Therapeutics, pharmacology
                                       RS Pharmacy and materia medica
                                       RT Nursing
                                       RV Botanic, Thomsonian and eclectic
                                       RX Homeopathy
                                       RZ Other systems of Medicine
A cursory glance at this table reveals for example that the L of C classification is
broader, embracing ophthalmology, dentistry and nursing as well as including what
might be loosely termed ‘alternative medicine’. An additional difference is that the
DDC categories are generally more all-inclusive; the category DDC 616 Diseases can
be linked at various levels of abstraction with the L of C categories RA, RC, RE, RJ
and RL amongst others. As has been argued previously however, this arbitrariness
does not necessarily negate the library classification scheme as a lexical basis for the
entry into the knowledge domain, given the multidisciplinary nature of biomedical
What this procedure has meant in practice is that each of the 13 subject sub-domains
RA to RT of the L of C system was used as the sampling entry point for the selection
of journals. The basis for the selection of the material was the occurrence of lexical
items in the title of the journal in accordance with the L of C system. Thus articles for
the subcorpus RB (pathology) could be sampled from the American Journal of Clinical
Pathology etc. The objective was then to ensure an approximate balance by collecting
the same amount (some 130,000-140,000 running words) of text from each of the
categories. Articles were downloaded in their entirety (minus captions and statistical
information which is frequently not in the clausally complete form in which causation
is encoded). As research articles vary considerably in length depending on speciality
(with tendencies towards particularly long articles in RE for example), there are small
differences between the sizes of each category. Thus the size of text varies between
132,970 words (RL) and 141,544 words (RG).
3.4.2 The ‘final’ corpus: specification and representativeness
The following section describes the composition of the final corpus, on the basis of
the L of C classificatory scheme. Wordsmith Tools was the software application used
to determine the sizes of these textual sources. Also included in the table is
information relating to key word identification. The problem of representativeness is
no less acute for the constructors of small specialist corpora than for researchers
engaged in the building of general language resources and this point has been
discussed at length in Allen (2002a). Even with very narrow criteria in place defining
the extent of a supposed sub-genre, it is still impossible given the constraints of this
present project to claim that a sample of even 2 million words in some way
satisfactorily represents the entire population of biomedical articles published since
1997. Such an undertaking would involve the downloading and formatting of all
medical articles available on-line - the website Free Medical Journals currently
(October 2003) lists 1013 freely available full-text journals (and obviously this only
represents part of all medical journals). Furthermore the list is continually being
extended with new articles appearing monthly. In short therefore, the L of C scheme
used here can never ensure true representativity. The scheme has been used in this
project firstly as a means of demonstrating the main principles which have been used
to construct the corpus and secondly to provide the basis for an inventory of the
articles making up the corpus.
As has been mentioned before however, the composition of each subcorpus has been
determined to a large extent by what was freely available and downloadable in an
easily manageable format. The internet database Science Direct has been particularly
useful in this respect and a very significant number of specialist journals were
obtained from this source alone. Other databases such as Academic Search Elite
contained links to many other journals although access was restricted to abstracts or
enhanced summaries (so-called summary plus) rather than full-texts.
Furthermore the pdf file format of the Academic Search Elite articles creates problems
especially when copying texts formatted in two adjacent columns on a page. If this
copying is done manually via cut and paste operations there is a risk that portions of
the text can easily be lost or misplaced, which is an important point given the
commitment in the project to the preservation of the original article format and its
IMRD structure. The Science Direct provision of files in HTML format in addition to
pdf format however substantially facilitated the copying and formatting of the data
into SGML/XML orthographically marked-up files.
Textual availability and ease of access to empirical research texts also differed
significantly between the sub-categories. Sub-categories RD (surgery) and RE
(ophthalmology) for example are dominated in the available databases by case reports
rather than experimental research articles. In case studies where the emphasis is on
observation rather than theory-linked hypothesis confirmation, causation is less
rhetorically central and consequently of less importance as a basis for information
extraction. The incorporation of a small number of case study texts within each of
these sub-categories represents therefore a compromise in order to achieve an
approximate textual balance between the sub-categories.
3.4.3 Corpus composition and keywords
The issue of corpus representativeness has previously been set out in Allen (2002a)
with regard to the pilot corpus which was subsequently expanded into the final
version of the HBC. The extent to which any corpus as a finite resource can be
considered as a representative sample of language has been a perennial source of
controversy and division in linguistics and as such will not be discussed further in this
thesis. Nevertheless having assembled a specialized corpus it is instructive to carry
out an inventory of its contents. The statement of an inventory in lexical terms can
serve to show the extent to which the ideals of the ‘representative’ corpus have been
achieved in terms of medical sub-disciplinary coverage.
Given the relatively small size of the pilot corpus, it was possible to manually mark
up and quantify all textual instances of causation in order to provide an inventory
based on the functional taxonomy described above. However the application of a
similar procedure to the greatly expanded final corpus would present a prohibitively
time-consuming undertaking. Corpus representation is primarily an issue relating to
the external validation of the data rather than constituting a research means in itself.
In order therefore to assess the coverage of the biomedical domain in the 1.93 million
word final corpus, an alternative procedure was used based on lexical counts as
opposed to the previous reliance on functional criteria.
Statistical comparisons were made using word frequency and keyword calculations
performed with Wordsmith (Scott 1997). The use of Wordsmith to examine the
statistical significance of lexical items is now well established in a number of applied
linguistic areas (Nelson 2000; Van der Wouden 2002; Williams 1998, 2002; Johnson
et al 2003). The keyword procedure uses a chi-square measure in order to determine
keyness: the extent to which specific lexical items occur in sufficient quantities to be
regarded as having an unusually high frequency in a given text. This procedure
involves a comparison between a word list based on a given text or text body and a
separate word list generated from a reference corpus (Scott ibid.: 236). Two keyword
comparisons were made, one external between the HBC as a whole and a reference
corpus of general English, the other internal between the 13 subcorpora and the HBC
final corpus.
External comparison
The keyword lists contain broadly speaking two separate sets of lexical items in
addition to the highly specialized sub-domain specific technical terms and
abbreviations. The first of these informal groupings could be glossed as general
medical lexis, such as patient, treatment, therapy and diagnosis etc. These are lexical
items which emerge as keywords in a comparison between the HBC word list and a
general language wordlist, such as that obtained from The Guardian newspaper (this
list can be downloaded from the website http://www.lexically.net/wordsmith/).
Clearly these lexical items serve to delineate the biomedical domain from the general
language corpus. In addition there are a number of lexical items which are common to
the domain of empirical research, such as study, data, methods, observed etc together
with grammatical items such as the preposition of. In addition to nouns, the ‘keyness’
of grammatical items such as the preposition of may also be significant in the light of
the nominalization discussion referred to previously. If the importance of
nominalization in scientific research genres is accepted, it may well be the case that
the relative keyness of of is a reflection of the amount of post-modification via
preposition phrases in the corpus compared to general language textual sources. The
top 25 keyword results for this comparison are shown in the table below. For each
lexical item, the table records the total frequency count and percentage in the HBC
and the reference list obtained from the 50 million running word Guardian corpus.
The frequency percentage for words in the Guardian corpus is only listed for ≥0.01%.
External keyword comparison between the HBC and Guardian news corpus
The table is modified from the printout in Wordsmith by omitting the column p, the statistical
significance of the chi-square measurement based on the size of the samples. For each lexical item in the
table, the value of p was calculated to be 0.000000.
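The chi-square comparison underlying the keyword procedure can be sketched as follows. This is a minimal illustration of a Pearson chi-square statistic on a 2 x 2 table (word versus all other words, study corpus versus reference corpus); it makes no claim about the exact variant or any continuity correction implemented in Wordsmith, and the function name and the counts used in the usage lines are hypothetical.

```python
def chi_square_keyness(freq_study, size_study, freq_ref, size_ref):
    """Pearson chi-square for one word in a study corpus vs a reference corpus."""
    # Rows are the two corpora; columns are the word and all other words.
    observed = [
        [freq_study, size_study - freq_study],
        [freq_ref, size_ref - freq_ref],
    ]
    total = size_study + size_ref
    row_totals = [size_study, size_ref]
    col_totals = [freq_study + freq_ref, total - (freq_study + freq_ref)]
    if 0 in col_totals:
        return 0.0  # the word is absent (or is the only word): no contrast
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

# Hypothetical counts: the same relative frequency gives a score of zero,
# a higher relative frequency in the study corpus gives a large score.
print(chi_square_keyness(10, 1_000, 100, 10_000))
print(chi_square_keyness(50, 1_000, 100, 10_000))
```

A word is then ranked as more 'key' the larger the statistic, with the direction of the difference read from the relative frequencies in the two corpora.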
Internal keyword comparisons across the subcorpora
This procedure assessed the extent to which the content of articles, as a function of
statistically-significant or, in Gledhill’s (2000:110) terms, salient lexical items, coincided
with the initial article selection based on the L of C classification of the article title.
Salient lexical items identified statistically within the article text can then be compared
with the initial public health classification. Such a comparison can shed light on the
centrality of the sampled text within each L of C sub-domain and the extent to which texts
realize sub-disciplinary links with other sub-domains. The identification of a web of
inter-disciplinary links might go some way towards assuaging any possible textual bias in
terms of corpus representation.
In the case of the study carried out here, the comparison was made between each
subcorpus (RA-RT) and a word list constructed from the entire HBC. This procedure
produced a list of words appearing in unusually high frequencies (compared to the corpus
as a whole) within each category.
In the table below, a similar comparison is exemplified for the 13 sub-corpora, using the
word list produced for RA (public aspects of health). Three sets of lexical items can be
identified in this list. The first of these groupings is related loosely to the RA domain as a
semantic field and consists of lexical items associated with geographical locations (west,
east, Berlin etc) and other items which can be placed in the context of public health (ie
water, fluoride, household etc). A second group might be glossed as non domain-specific
medical items such as bacteria and virus which might be characteristic of any biomedical
text whether scholarly or popular in orientation. A third group comprises lexical items
associated with other identified biomedical domains (ie RB-RT); in other words this
group provides a crude measure of the multi-disciplinary links extending over the sub-domain boundaries in the corpus.
Keyword comparisons for the RA (public health) subcorpus
See comment relating to frequency percentages in the previous table
For several L of C-labelled subcorpora, there is a reasonably good match between the L of
C scheme which provided the initial means for sampling the text and the keywords
subsequently identified by statistical means. In RL (dermatology), RK (dentistry), RJ
(pediatrics), RE (ophthalmology), RF (otorhinolaryngology), RG (gynecology and
obstetrics), RA (public health) and RT (nursing), the library category / statistical keyword
correspondence is generally strong. For example keywords identified for the RE group
included corneal, optic, visual and eyes which are all strongly related to the sub-discipline of ophthalmology. The relationship between the category and the constituent
keywords might be described as constituting the more conceptually-defined level of a
semantic field rather than a lexical set defined by linguistic relations of synonymy or
antonymy (cf McArthur 1986). In the case of the RF category, the keyword procedure
appeared to confirm one area in which this particular subcorpus was skewed due to the
restricted availability of texts in this particular medical specialism. The appearance of
the keywords children and school reflected the skewed composition of this
subcorpus, derived entirely from the Journal of Pediatric Otorhinolaryngology.
Other subcorpora revealed a more complex pattern of keywords. In particular category
RC (internal medicine) contained the least specific definition of a specialist domain. The
classificatory scheme under RC refers to fields as disparate as psychiatry and oncology
in addition to highly specialised miscellaneous categories such as arctic, submarine and
sports medicine. The inclusion of significant psychiatric material in this subcorpus,
coupled with the absence of these more peripheral areas (journals of which were not
available electronically without a paper subscription) once again may have led to a bias in
the data.
The diagram below plots the lexical coverage of the HBC final corpus in terms of the L of
C classification scheme and the measures of keyword distributions described above. The
procedure adopted is essentially a comparison between the L of C category on the vertical
axis of the grid below with the results from keyword survey using Wordsmith along the
horizontal axis. The grid intersections therefore mark the extent of the agreement
between the initial subcorpora sampling and the keyword distributions.
As would be expected, the lexical specificity of each subcorpus domain is confirmed by
the strong agreement between the journals and the keyword measurements. However the
comparison is still useful in that the symbols for partial agreement provide a crude
indication of the multi-disciplinary nature of the sub-domains. Thus in the category RM
for example there are keywords indicating multi-disciplinary links to the categories RA,
RB, RC and RS.
While significant agreement exists in certain subcorpora there are nevertheless a number
of differences in the multi-disciplinary extent of each sub-category. Perhaps the most
restricted in terms of keywords is the category RK, whose top 100 keywords were almost
entirely restricted to the dentistry domain with the exception of a number of general
medical words such as patient etc.
L of C and keyword intersections for HBC subcorpora
Subcorpus classification by statistical keyword
Library of Congress classification by keyword
Use of symbols
Significant agreement ( >75% of top-ranking keywords)
Partial agreement (1-25% of top-ranking keywords)
3.5 Identifying causation in the biomedical RA
3.5.1 General
The previous section has outlined the procedure for the expansion of the pilot corpus into
the ‘final’ corpus upon which the empirical analysis is based. A sublanguage of causation
however is a subset of this textual source, bringing into focus the critical problem of
identifying cause and effect within the highly nominalized text of the biomedical sub-genre.
It is thus of prime importance to discuss the criteria by which cause and effect can be
separated as a sublanguage from the remainder of the text in the final corpus. As argued
in Allen (2001a:2), the chief difficulty is that causation is a semantic phenomenon which
cannot be retrieved/isolated on an automatic basis alone unless some annotation of
causative lexis has been made prior to the process of retrieval. This identification is made
on the basis of a substitution test which draws on a form of ‘semantic’ intuition, the
cognitive process by which the analyst can interpret a causal relationship as pertaining
either within or between nominal groups.
3.5.2 Semantic intuition
Historically, generative semantic approaches to the problem of delimitation reviewed in
Allen (2001a) have arrived at a definition of causation by intuiting sentences containing a
highly circumscribed number of so-called periphrastic verbs such as cause, affect, make,
get and have. The discussion presented here concentrates purely on the use of periphrastic
causatives as the basis for a simple substitution test for the identification of cause and
effect in the corpus data. This process will firstly be illustrated on simple intuited
sentences before being applied to the corpus data. In the invented examples below, the
verb kill can be regarded as causative due to the fact that it can be paraphrased with the
periphrastic (external) verb cause as cause + NP + to die as in the following:
[8a] The cat killed the mouse.
[8b] The cat caused the mouse to die.
Similarly periphrastic make shows clearly the causation encoded in the transitive verb frighten:
[9a] The cat frightened the mouse.
[9b] The cat made the mouse afraid.
In examples [8a/b] and [9a/b] we have what might be regarded as ‘prototypical’ cases
of causation, where the substitution of a periphrastic verb brings out clearly the factive,
causal relationship pertaining between the nominal groups. Here a single clause is
introduced by a subject agentive nominal which might be a physical entity, a process /
action or, as in this case, an animate being. The causative verb then links the agentive subject
nominal to the object patient which undergoes the change initiated by the agent. The verb
sets up a dependency relationship such that the causing agency is inferred as being
temporally prior to the production of the effect. In the case of kill and frighten above, the
verb takes on a factive meaning in the sense that the agency can either produce or prevent
a change or effect.
This rephrasing using periphrastic causatives therefore provides a test of causation which
can then be applied to the corpus data. On this basis, the application of the substitution
test to the verb alleviate in example [10] below facilitates a causative interpretation of the
verb as cause + the symptoms in MS patients + to be alleviated.
[10] These findings confirm, at least to some extent, the anecdotal reports that marijuana
smoking alleviates the symptoms of MS patients…..
SRA536 [4,784]
This paraphrase using periphrastic cause throws light on the factive meaning of the verb
alleviate (described in more detail in section 3.5.3), in the sense that the copular verb be
relates the patient nominal group the symptoms in MS patients with the adjectival change
of state alleviated. This test can be seen therefore as a simple tool for the introspective
identification of causative verbs as a sub-system of the English transitive verb system.
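The mechanical part of the substitution test can be sketched as a simple lookup that proposes a periphrastic paraphrase for a candidate verb. The three templates below are taken from examples [8] to [10]; the table itself, the function name and the use of inflected verb forms as keys are hypothetical conveniences. The judgement as to whether the paraphrase preserves the meaning of the original clause remains a matter of the analyst's semantic intuition and is not modelled here.

```python
from typing import Optional

# Hypothetical template table: inflected lexical verb -> periphrastic paraphrase.
PERIPHRASTIC_TEMPLATES = {
    "killed": "{agent} caused {patient} to die.",
    "frightened": "{agent} made {patient} afraid.",
    "alleviates": "{agent} causes {patient} to be alleviated.",
}

def paraphrase(agent: str, verb: str, patient: str) -> Optional[str]:
    """Return a periphrastic paraphrase, or None if no template is listed."""
    template = PERIPHRASTIC_TEMPLATES.get(verb)
    if template is None:
        return None  # the verb has not (yet) been judged causative
    return template.format(agent=agent, patient=patient)

print(paraphrase("The cat", "killed", "the mouse"))
print(paraphrase("Marijuana smoking", "alleviates", "the symptoms in MS patients"))
```

A verb for which no acceptable paraphrase can be constructed would simply remain outside the table, which mirrors the use of the test as a filter on the English transitive verb system.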
There are however a number of marginal issues relating to the working definition of
causation above which need further discussion prior to confrontation with the corpus
data. In particular for the purposes of potential applications of the grammar in
information extraction, it is necessary to consider the inclusion of non-factive causation
in addition to the factive definition described in the semantic substitution test above.
Furthermore it is desirable in any such definition to take into consideration the
predominance of hedging in scientific research article text. Hedging can be seen as
stemming from the building of argumentation as inferential chains of nominalizations as
reported in Allen (2002b: 20-24). These issues are considered in sections 3.5.3 and 3.5.4.
3.5.3 Non-factivity and hedging
The epistemic status of scientific propositions expressed through causation has been
discussed in terms of the semantic notion of factivity (Lyons 1977: 794-809). Factivity
constitutes a broad area of semantics overlapping with modality taking up the
commitment on behalf of the speaker / writer to the truth of the proposition expressed.
Factive statements can be verified in terms of the truth or falsity of the proposition which
they contain. There are a number of alternative grammatical realizations of factivity / non-factivity in English. For example in the statement He knows that Edinburgh is the capital
of Scotland, the commitment to the truth is expressed through the verb know as a factive
predicator (Lyons 1977:794). Non-factive statements on the other hand imply no such
commitment and include realizations through epistemic modal verbs (may, might etc),
modal adverbs (perhaps) and modal adjectives (possible). Non-modalised periphrastic
causative verbs such as those in examples [8] and [9] above readily fit into the factive category. However
for reasons explored above, the presentation of findings as non-ratified claims puts the
onus on non-factive expression within scientific discourse. What is very common in the
corpus instead is a group of verbs such as associate (see example [11] below) which infer
causal links between nominal participants. This inference is important in the acceptance
of the claim by the discourse community as knowledge.
[11] Moreover, tonsillectomy and/or adenoidectomy are usually associated with an
improvement in sleep quality and body weight in children
Given the genre-based framework of this thesis, it is proposed at this point to consider
non-factivity under the more functional heading of hedging as such a position takes into
account the interactional nature of the scientific RA in terms of the ratification of
research claims by the discourse community. Of direct relevance to the discussion of
hedging in the expression of causation is the work of Hyland (1996). The focus of
Hyland’s paper is to move the debate away from a concentration on formal properties of
grammatical items such as epistemic modal verbs in favour of a more functional account
of the use of hedges in the science RA. As Hyland puts it:
Where there is uncertainty about the evidential status of the assumptions
between data and hypothesis, claims require varying degrees of hedging. In fact my data
indicates that few knowledge claims are presented in unmitigated form; induction and
inference rather than deduction and causality characterize most arguments in scientific
Hyland (ibid.:435)
Here it would seem that Hyland is equating causation with the expression of absolute
truth. It is argued in this thesis however that a working definition of causation needs to
acknowledge the ‘fuzziness’ in Zadeh’s (1974) terms between genuine ie unhedged
causation and the mitigated, inferential status of claims.
The full functional categorization of scientific hedging presented by Hyland is largely
beyond the scope of this thesis. Instead the impact of hedging on causation in the corpus
can be illustrated with respect to the broad distinction Hyland makes between content-orientated and reader-orientated hedges (Hyland ibid.: 438). Content-orientated hedges
relate the propositional content of the clause to what is thought to be true in the physical
world. In example [12] this type of hedging encoded through the epistemic modal might
is illustrated in a causative sentence.
[12] Furthermore, exposure to these solvent mixtures might cause epileptiform seizures.
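As a rough illustration of how such formal markers of non-factivity might be flagged in a causative sentence, the following sketch looks up the modal items mentioned above (may, might, perhaps, possible) in a tokenised sentence. The marker table and function name are hypothetical, the list is deliberately limited to the items cited in this section, and a lookup of this kind captures only the formal end of hedging, not the functional categories discussed by Hyland.

```python
import re

# Hypothetical marker table, restricted to the items cited in this section.
HEDGE_MARKERS = {
    "modal verb": ("may", "might"),
    "modal adverb": ("perhaps",),
    "modal adjective": ("possible",),
}

def find_hedges(sentence):
    """Return (token, kind) pairs for each listed hedge marker in the sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    found = []
    for token in tokens:
        for kind, markers in HEDGE_MARKERS.items():
            if token in markers:
                found.append((token, kind))
    return found

example_12 = ("Furthermore, exposure to these solvent mixtures "
              "might cause epileptiform seizures.")
print(find_hedges(example_12))
```

Such a lookup is naive in that it ignores word sense (for example May as a month) and would need to be combined with the identification of the causative verb itself.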
The semantic area of non-factivity can be further illustrated with respect to verbs such as
allow, facilitate, necessitate, predict which might also be included in this category. In
epistemic terms these verbs would not strictly speaking be classified as causative.
However they play an important role in pointing ahead to links between causes and their
potential effects.
[13] The incision allows a speedier approach to the uterine lower segment
[14] cellular matrices damaged by tumor and peritumoral edema, or alternatively it may
facilitate tumor cell invasion into the surrounding brain
[15] Abnormalities in pHi might predict gastrointestinal complications in infants
One of the major contributions of Hyland’s article is to extend the notion of hedging from
this narrow focus on modalization in the form of modal auxiliaries and disjuncts to
consider the contribution of the wider scientific readership as 'guarantors of the
negatability of claims’ (Hyland ibid.:436). In this respect the notion of reader-orientated
hedges becomes important in a genre-based study.
[16]…, it can be seen that if they had used the semi-direct method only, this would have
resulted in five false-negative results for trisomy 18
In the example above, the key expression is the passive extraposed clause it can be seen
that….. This type of hedging invites the active participation of the reader (although
through the use of the passive the reader is not explicitly mentioned) in the ratification of
the hypothesised causal relationship, in this case semi-direct method as cause and five
false-negative results for trisomy 18 as the effect. The examples below include further
cases which it is argued in this thesis should be included in the sublanguage of causation.
In example [17] there is a causal link inferred between The variation in proliferative
activity between the different lesions and the time course of restenosis:
[17] The variation in proliferative activity between the different lesions may partly reflect
the time course of restenosis: the peripheral vein grafts studied by Westerb
SRA 558(1958)
The causal link is modalized with the epistemic verb may and further ‘hedged’ with the
adverb partly. In this case the reader is invited to make an inference from one sense of the
verb as a sign of a particular situation (LDOC:1189 sense 2) to an effect-underlying
cause relationship (ie X reflects Y). Similarly the combination of the adjective consistent plus the
preposition with does not in itself encode factive causation but brings together two
potential halves of the causative equation:
[18] were lower than values obtained from wells in the corners of the microtitre plate, an
effect consistent with non-uniform stirring efficiency.
The role of mental processes of cognition in the making of causative inferences is also
illustrated with respect to the verb explain as in [19]:
[19] The absence of neovascularisation in the implants may explain the lack of
Other examples of inferences derive primarily not from the meaning of the verb
concerned but the adjacency between two juxtaposed clausal elements in a sentence. In
example [20] below a causal relationship can be inferred between the main clause with
the passive verb suppress and its hypotactically dependent subordinate clause when 2 was
co-incubated with 3M eserine. Here the production of the suppression effect is non-factive as it is conditionally dependent on the circumstantial element in the hypotactic clause:
[20] Hydrolysis was suppressed when 2 was co-incubated with 3M eserine
A similar temporal inference can be drawn in example [21] in that the effect of blood
eosinophil diminishment is caused by the cyclosporin A treatment.
[21] Based on reports indicating that blood eosinophils diminish after treatment with
cyclosporin A (CyA) [8, 9, 10] we measured…
As will be exemplified in chapter 4 such inferential relationships pose a problem for a
lexical grammar based on idiom principle-determined patterns of co-occurrence. The
importance of the idiom principle to the notion of pattern grammar is described in Allen
(2002b:9). Sinclair (1991) puts forward the notion that while natural language affords
potentially infinite combinatorial possibilities as open choices, in practice many of these
possibilities are not fully exploited with the result that particular combinations of lexical
items repeatedly occur. These repeated sequences of co-occurrences represent in
functional terms single syntagmatic choices as enshrined in the idiom principle. Corpus-driven studies of individual lexical items have led to the observation that the sequences of
co-occurring lexical items centred on individual words share a semantic profile
(Francis 1993; Sinclair and Renouf 1991). For Hunston and Francis (1999:37) the pattern
of a word in accordance with the idiom principle is defined as ‘all the words and
structures which are regularly associated with the word and which contribute to its meaning’.
The difficulty raised in example [21] is that while the combination diminish + after
superficially resembles a pattern, the preposition after is not idiomatically determined by
the verb. In this example, after belongs instead to the prepositional phrase realizing in
Hallidayian terms a circumstance. Furthermore there is no sense in which the preposition
after contributes to the meaning of the verb, or in which the verb +
preposition combination consequently constitutes in Sinclair’s (1991, 1994) terms a unit of meaning. It
may be the case therefore that some of the co-occurrences centred on individual lexical
items outlined in the next chapter are not true patterns in the Hunston and Francis sense.
3.5.4 Other possible 'borderline' cases
The verb mediate deserves special mention as a possible borderline case. Strictly
speaking, as in the case of example [22] below, there is no sense of initiating causative
agency in the subject This receptor:
[22] This receptor mediates a remarkable vasodilating effect after activation by any of
several CCs
In this particular example however it is argued that the explicit statement of an effect in
combination with the adverbial after activation by any of several CCs necessitates the
inclusion of mediate within the boundaries of the causation sublanguage.
3.5.5 Summary
In this section it has been argued that the heavily inferential nature of the scientific RA
necessitates a wider definition of causation to embrace factive, non-factive and
mediative instances. The marking up of cause and effect instances based on this
definition will be considered in section 3.6.
3.6 From definition to mark-up
The question of attaching semantic tags will now be considered in relation to the HBC
corpus. In the pilot corpus, the causation sublanguage was marked up using the TEI-Emacs software to define a set of customised orthographic tags. These XML tags enable
each instance of causation to be delineated as a sublanguage from the rest of the text. In
addition it was possible to attach a number of semantic attributes in accordance with the
provisional functional categories described above. This mark-up can be illustrated using
the corpus example from [10] above. The causative clause could then be marked up with
the semantic tags <cause> and <effect> as in example [23]:
[23] that marijuana smoking <cause> alleviates the symptoms in MS patients <effect>
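The retrieval which such mark-up makes possible can be sketched as follows. This is a minimal illustration only: it assumes a well-formed variant of the mark-up in [23] in which each tag encloses its span, and the wrapper element causation, the sentence element s and the helper extract_causation are hypothetical names which do not reproduce the pilot corpus tag set.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed encoding of example [23]. The wrapper element
# <causation> and the paired open/close tags are assumptions of this sketch.
SAMPLE = (
    "<s>that <causation><cause>marijuana smoking</cause> alleviates "
    "<effect>the symptoms in MS patients</effect></causation></s>"
)


def extract_causation(xml_text):
    """Return a list of (cause, effect) text pairs found in xml_text."""
    root = ET.fromstring(xml_text)
    pairs = []
    for unit in root.iter("causation"):
        # findtext returns the text of the first matching child element
        pairs.append((unit.findtext("cause"), unit.findtext("effect")))
    return pairs


print(extract_causation(SAMPLE))
```

Once every instance has been delimited in this way, retrieval is trivial; the cost lies in attaching the tags in the first place.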
There would however appear to be two objections to this form of semantic annotation of
the corpus, the first of which is theoretical, the other practical. Firstly, while the
annotation of the corpus with ‘customized’ cause and effect tags might provide one means
of retrieving causative clauses and sentences from the remainder of the textual body,
as remarked in Allen (2001a:2) such a practice brings the analyst full-circle back to the
use of introspection in the identification of the sublanguage. This reliance on semantic
tagging is difficult to reconcile with a corpus-driven methodology which as described in
Allen (2002b:7) dispenses with preconceived grammatical categories. Instead the lexical
item as the least theoretically-prejudiced point of entry into the lexical grammatical
system is favoured in corpus-driven description. The second objection is the very time-consuming nature of attaching XML-defined semantic tags to the entire 2 million word
corpus. Such an undertaking was deemed only practical for the very much smaller pilot
corpus.
Instead of marking up the entire corpus with semantic tags in XML, the identification of
the causative sublanguage necessitated an alternative methodology. The delimitation of
the sublanguage from the remainder of the text was carried out manually. This process
involved three separate manual trawls through the 1.93 million words of text making up
the corpus and the recording of all lexical items playing a role in the encoding of factive,
non-factive /inferential causal relationships. The resulting word list, dominated by
transitive verbs but also including significant numbers of nouns and adjectives, provided
the basis for the concordance analysis of the lexical grammatical patterns underpinning
causation in the corpus. The concordancing of corpus data together with the extraction
and storage of lexical patterns is considered in section 3.7 below.
3.7. Concordancing
The analysis of any corpus, whether general or, as in the case of the HBC, genre-specific,
requires concordancing software for semi-automatic lexical search and retrieval purposes.
Since the 1980s a number of software packages have become available with
specific applications in concordance analysis. The relative merits of early concordancing
software are discussed in Hofland (1991) and Higgins (1991) although a general
comparison between these packages is largely beyond the scope of this thesis. For the
purposes of the local grammar project, the basic requirements for corpus search and
retrieval are satisfactorily met by the existing Wordsmith 3.0 package (Scott 1996, 1997).
While it has been claimed that a knowledge of programming languages such as Java can
provide the tool for linguists to tailor corpus retrieval software to individual research
purposes (Mason 2001), it was not found necessary to ‘re-invent the wheel’ in
concordance software terms for this particular research project.
These requirements can be specified as follows. Firstly the software should be able to
process corpus files in XML/SGML format (as opposed to plain text ASCII). As has been
described in Allen (2002a:20), research articles were formatted using the TEI-Emacs
software in accordance with the TEI guidelines ensuring compatibility and portability
between different formats. Secondly the software should be capable of performing
searches not only on lexical items but also on specific POS and other purpose-built tags
such as the semantic mark-up schemes discussed in 3.6. Although the decision was made
following the initial analysis of the pilot corpus not to tag the final HBC the option of
tagging at a later stage is desirable from the perspective of testing an automatic parser
based on the grammar. However the construction and testing of such software will not be
considered in this thesis. A third requirement relates primarily to the manipulation of the
concordance output in the form of sorting in order to render specific patterns more
visible. Wordsmith contains a number of features which enable sorting of concordance
lines to be made with respect to the search query node, greatly facilitating the recognition
of patterns. The ability of the software to sort to the left or right of the node will now be
illustrated with data from the corpus. Left sorts enable pre-modifying adjectives to be
retrieved when the node is a nominal participant in a causative clause such as effect,
cause, source or factor etc. In the concordance lines (footnote 18) below, the node factor is shown in
bold and the pre-modifying adjective is underlined:
maternal age is identified as a risk factor in pregnancy [7]. The
74%, while homosexuality was a risk factor in only 9%. CDC stage II was
dysfunction may be a primary factor in initiating a disease state or
that good probe fit is an important factor in maintaining a low refers
performance status as a prognostic factor in our analysis does suggest
nterconversion is not a complicating factor in the analysis of the assays whet
or not hair color is a critical factor in the binding of
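The left sort illustrated above can be sketched in a few lines of code. The sketch is not a description of how Wordsmith works internally: the functions kwic and sort_left are hypothetical, the tokenisation is deliberately crude, and the sample text is a toy adaptation of three of the concordance lines.

```python
import re


def kwic(text, node, width=4):
    """Return (left, node, right) token lists for each occurrence of node."""
    tokens = re.findall(r"\w+(?:-\w+)*", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines


def sort_left(lines):
    """Sort on the word immediately left of the node (1L), then 2L, etc."""
    return sorted(lines, key=lambda line: list(reversed(line[0])))


# Toy text loosely adapted from the concordance lines (an assumption).
SAMPLE = (
    "Age is identified as a risk factor in pregnancy. "
    "Good probe fit is an important factor in maintaining accuracy. "
    "Hair color is a critical factor in the binding of drugs."
)

for left, node, right in sort_left(kwic(SAMPLE, "factor")):
    print(f"{' '.join(left):>30}  {node}  {' '.join(right)}")
```

Sorting on position 1L groups identical pre-modifiers together, so that recurrent combinations such as risk factor stand out in a long listing.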
Similarly the importance of right searches which bring to light patterns of nominal postmodification may be illustrated with the verb have in the lines below:
To date, however, no in vivo studies have been reported on the role of amin n,
vious immunohistochemical studies have been performed on human tissue to ntral
1-adrenoceptors and appears to have no effect on the central reuptake rk that
this reduced temperature may have deleterious effects on the From t, we may
conclude that dermoscopy does have an impact on the clinical
the adolescent years may have an impact on the future
pe, can be rapidly progressive and have an impact on all speech systems.
(1996) suggests large surveys have often relied on voluntary
s. Although decisions on drug safety have often depended on
high school, have poor earnings, and have increased dependency on the
that is, in 1993. This means that we have no information on the true
olonged periods of time may prove to have negative effects on health.
they had been younger, they would have been operated on with no
be evaluated carefully as they may have substantial effects on the
ICTCL patients, the neoplastic cells have an impact on the immune condition
wether subclinical HEV infection may have adverse effects on
Footnote 18: These concordance lines have been edited for length in order to permit alignment and to
render the patterns more visible.

In the absence of tagging during the initial phase of the corpus construction the ability of
the software to refine searches in the light of context assumes considerable importance.
For example Wordsmith permits searches on the lemma have in combination with the
preposition on within certain specified limits or context horizons; thus the search from 0L
to 3R (from the node to three words to the right of the node) enables the retrieval of the
co-occurring preposition on but at the same time captures the main collocating nominal
heads such as effect, impact and influence. This pattern is exemplified in the corpus
example below:
[24] The reduction in anemia had no effect on perinatal outcome and birth weight.
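The context horizon search described above may be sketched as follows, under the assumption that the lemma have is simply expanded to a list of word forms. The function horizon_search is hypothetical and the three sample sentences are illustrative (the first is taken from [24], the second is adapted from a concordance line above, the third is invented).

```python
import re

# Word forms standing in for the lemma have (an assumption of this sketch).
HAVE_FORMS = {"have", "has", "had", "having"}


def horizon_search(text, node_forms, collocate, right=3):
    """Find node forms with collocate within `right` tokens to the right."""
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok in node_forms:
            window = tokens[i + 1:i + 1 + right]
            if collocate in window:
                j = window.index(collocate)
                # keep the words between the node and the collocate
                hits.append((tok, window[:j], collocate))
    return hits


SAMPLE = (
    "The reduction in anemia had no effect on perinatal outcome. "
    "Previous studies have been performed on human tissue. "
    "Patients have often reported nausea."
)

print(horizon_search(SAMPLE, HAVE_FORMS, "on"))
```

The second hit is an auxiliary use of have, which is exactly the kind of line that has to be removed by hand when the corpus carries no part-of-speech tags.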
Of course had the corpus been tagged as in the B of E, the same pattern could be retrieved
by specifying the search string have + NN + on to give the range of collocates. The use of
the context limit search in Wordsmith provided essentially the same results although there
is a greater need with a non-tagged corpus for some manual post-editing. In the examples
above, several instances of the auxiliary use of have in present perfect active and passive
sentences would have to be edited out.
The use of frequently occurring, grammatical lexical items such as prepositions is
important in terms of defining searches on non-tagged data. To this aim, the work of
Gledhill (1995, 1996) has paved the way towards the description of phraseological
profiles in small corpora using frequently occurring lexical items as the point of entry
into the lexico-grammatical system. An example of this procedure is illustrated below,
using the combination of the conjunction as and the indefinite article a to define a
number of collocational restrictions on co-selected nouns.
cell morphology may be considered as a cause of male infertility. Amongst
developed spiral ganglion cells as a consequence of some noxious
ary bacterial pneumonia may occur as a complication of influenza virus ll
in the osteoblast lineage varied as a function of the surface roughness ively
investigated for many years as a means of depressing gastric toxicity oom
temperature. The urine output as a predictor of compliance implies that
has suffered a kidney loss as a result of the ureteric damage,
From the edited lines above, it can be seen how a number of significant post-modified
abstract nouns emerge from the retrieval of the as + a combination. In the light of the
local grammar, it is important to be able to define on the basis of corpus evidence what
lexical items are occurring as significant collocates. This identification is important
because the final stage of the process will involve 'conflating’ lexical items into a series
of functional categories. In order to define and name these categories, it is necessary to be
able to gain a full semantic perspective over the lexical items occurring in each position
with respect to the node.
An alternative means of seeing this relationship is to compute the collocates as a
statistical measure. Wordsmith provides a plot of the significant collocates based on their
position with respect to the search node. Referring to the as + a combination above, the
collocate computation produces the following results for selected lexical (as opposed to
grammatical) words:
Significant collocates for the as + a combination
Thus it can be seen from the above that significant collocates are recorded in a number of
positions both to the left and to the right of the node as + a combination. Furthermore
collocates can be re-sorted in any position relative to the node, which greatly facilitates
the description of the collocational profiles.
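A positional collocate count of the kind described here can be approximated as below. The sketch records raw frequencies by position only and does not attempt the significance calculation performed by the concordancing software; the function positional_collocates is hypothetical and the sample text is a toy adaptation of the as + a lines above.

```python
import re
from collections import Counter, defaultdict


def positional_collocates(text, node, span=2):
    """Count collocates of a (possibly multi-word) node by position."""
    tokens = re.findall(r"\w+", text.lower())
    node_toks = node.lower().split()
    n = len(node_toks)
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == node_toks:
            for k in range(1, span + 1):
                if i - k >= 0:
                    table[f"L{k}"][tokens[i - k]] += 1
                if i + n - 1 + k < len(tokens):
                    table[f"R{k}"][tokens[i + n - 1 + k]] += 1
    return table


# Toy text adapted from the edited concordance lines (an assumption).
SAMPLE = (
    "may be considered as a cause of male infertility. "
    "may occur as a complication of influenza. "
    "a kidney loss as a result of the ureteric damage. "
    "ganglion cells as a consequence of some noxious agent. "
    "renal failure as a result of infection."
)

table = positional_collocates(SAMPLE, "as a")
print(table["R1"].most_common(3))
print(table["R2"].most_common(1))
```

Position R1 collects the abstract nouns co-selected with as + a, while R2 shows the preposition of recurring as their post-modifier.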
The prevalence of nominalization in scientific text remarked upon previously puts a
greater onus on the software to be able to describe the patterns of pre- and postmodification of the nominal group. The preposition of for example plays a key role in the
postmodification of the nominal group. In addition to the concordancing and
collocational possibilities described above, it is possible to use the software to identify
phraseologies centred on highly frequent lexical items using the cluster facility to identify
significant groupings. With reference to the preposition of, clusters such as the presence
of, the incidence of and the prevalence of indicate statistically important phraseologies
underpinning cause and effect:
[25] To assess the effect of introducing chloroquine prophylaxis during pregnancy on
prevalence of anemia (10.9 g/dl) at childbirth and perinatal outcome.
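The identification of clusters can be sketched as a simple count of recurrent word sequences containing the item of interest. The function clusters, its frequency threshold and the sample text are assumptions made for illustration and do not reproduce the cluster facility of the software.

```python
import re
from collections import Counter


def clusters(text, word, size=3, min_freq=2):
    """Count n-grams of `size` tokens that contain `word`."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(
        " ".join(tokens[i:i + size])
        for i in range(len(tokens) - size + 1)
        if word in tokens[i:i + size]
    )
    return [(c, f) for c, f in counts.most_common() if f >= min_freq]


# Invented sample text (an assumption of this sketch).
SAMPLE = (
    "The presence of anemia was recorded. The prevalence of anemia "
    "rose with the presence of infection, and the incidence of "
    "malaria fell. The prevalence of malaria was low."
)

print(clusters(SAMPLE, "of"))
```

Only sequences which recur at or above the threshold are returned, which is what allows phraseologies such as the presence of to emerge from a highly frequent grammatical word.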
3.8. Data storage
3.8.1 The pattern grammar notation
In order to establish a database of phraseologies centred on the query node, it is necessary
to adopt a formalism which lends itself to the storage of lexical items and their
grammatical patterns. In Allen (2002b) the notion of a lexical grammar and its
justification as the basis for the description of collocational/colligational patterns
encoding causative relations was put forward and described in detail. In the same work
the concept of grammatical pattern emerging from lexicographic research was explained
in terms of regular patterns of co-selection between lexical items such that the co-selections define units of extended meaning in accordance with the idiom principle
(Hornby 1954; Sinclair 1991, 1995; Francis and Hunston 1998; Hunston and Francis
1999). As the theoretical basis for pattern grammar and notation has been covered
previously, the focus of this section will be directed towards the use of the pattern
grammar notation within formats for the storage of patterns retrieved from the final HBC.
3.8.2 Presentational format
The adoption of a corpus-driven methodology which the pattern grammar embodies
essentially determines that the lexical item provides the initial query rather than a
morphosyntactic tag (as mentioned previously the HBC has not been automatically POS-tagged). A corpus-driven presentation of lexical patterns therefore involves a choice
between two possible presentation formats:
(a) an alphabetical dictionary-like presentation of causative lexis and their attendant
lexical patterns
(b) a pattern presentation, in which lexical items are collected together under the same pattern
notation

‘Lexical’ format
The first of these representations involves an alphabetical listing of each lexical item
encoding cause and effect relationships, partially resembling the headword presentation
of a dictionary. Each headword entry therefore would contain a specification of lexical
patterns into which the item enters. An example of this mode of representation is shown
in the table below, for the nominal group effect and the closely related lexical items (side)
effect, the adjective effective and nominal effectiveness.
Dictionary headword format for lexical items

Lexical item: effect (n)
  N of n: Most patients became tolerant to the opioid-related side effects of somnolence and nausea. Two patients who did not become tolerant t
  N on n: practical teaching for parents may bring about very positive effects on the general progress of deaf children.
  have N on: In addition to its antiproliferative effects, methotrexate has a number of effects on the immune system that may
  N against: This is further supported by directly plotting the observed effects against mean arterial and venous serum concentrations ( Fig. 3C). Ass
  N from n: secondly, there is strong support for the assumption that there is at least no effect from the treatment large enough to be clinically interesting.
  N owing to n: sporine has been investigated in mycosis fungoides but can have adverse side effects owing to its immunosuppressive properties. Methotrexate in moderate
  N at n: with the hope that the severity of ischemic stroke may be decreased through effects at the cellular level..14 As a result of these new treatment options,
  N in n: to pain" [15 and 16]. Beyond its antidopaminergic function, CPZ has important effects in other neurotransmissory systems. It is a powerful adrenergic antago

Lexical item: (side) effect (n)
  N arise from n: With the aim of reducing the risk of adverse side-effects arising from widely administered drugs such as indomethacin, 38 indus
  N into n: facilitating the partition of cocaine’s effects into pharmacokinetic and pharmacodynamic components using

Lexical item: effective (ADJ)
  ADJ against n: Topical antibiotics would not be effective against extraocular infections, which could then be a source for reinfection
  ADJ as n: epithelium most similar to normal trachea in the PDGF treated wounds. CTA is effective as a carrier for the direct delivery of a growth factor to injured tracheal
  be ADJ for n: une responses.[63] Intralesional IFN-alpha-2b has been shown to be safe and effective for the treatment of superficial, small, and well-circumscribed BCCs.
  ADJ in n: matrix remodeling in this experimental setting. Ramipril application was highly effective in attenuating hemodynamic changes and the development of left vent

Lexical item: effectiveness (n)
  N of n on n: replacement with a cryopreserved aortic allograft. This study examines the effectiveness of this strategy on hospital mortality and morbidity, recurrent e

‘Pattern’ format
The table below provides an example of an alternative representation based on the
collection of lexical items exhibiting the same pattern behaviour under one pattern
heading. In this case the pattern is centred on an adjective (in pattern grammar terms be
ADJ for n). The representation shown here has been shortened somewhat from the
database, with the precise statement of the pattern underlined in each case. Such a
presentational format is not entirely unproblematic however.
Lexical items under pattern be ADJ for n

Lexical item: Corpus example
beneficial: These observations suggest that non-etching of vital dentine can be beneficial for pulp responses in some respects
critical: Sufficient intensity at the correct wavelength and adequate exposure time are critical for satisfactory polymerization.
essential: These cytokines are also essential for the development of
important: myocardium, carnitine is also important for the transport of short- and medium-chain fatty acids
predictive: Ewing Test and the frequency of acute otitis media, proved to be moderately predictive for the screening test result at school age.
responsible: In a survey from mainland Greece HCV infection was responsible for 25% of chronic liver disease patients
significant: Type of surgery and adjuvant therapy did not seem to be significant for disease-free survival.
sufficient: Very few drugs are able to penetrate the skin by passive diffusion at a rate sufficient for therapeutic viability
supportive: effects on ultrasonic vocalisations, absence of such effects may be supportive for selective anxiolytic effects
The organization of entries by grammar pattern does lead to some undesirable
generalizations being made, as will be made clear in the next chapter.
Causation is primarily though not exclusively encoded through monotransitive verbs,
which means that this particular category is over-represented in terms of lexical items.
Thus under the pattern V n (in conventional terms a monotransitive verb) all verbs
entering into this pattern (some 134 examples) would have to be listed. The solution
therefore would be to list the lexical items under each pattern in their entirety. The full
listing is given in the lexical pattern database described in chapter 4. In addition to
presenting the main lexical patterns and the lexical items which they contain, the major
collocates are indicated where these have been shown to be significant in the corpus.
This representation is the format adopted by Hunston and Francis in their pattern
representation (Hunston and Francis 1999). In addition to representing the lexical
grammar of causation on a more parsimonious basis, the representation as exemplified in
the table above facilitates the abstraction from lexical co-occurrences to the functional
representation of the local grammar. Such a strategy enables an assessment of the
alignment between the pattern formalism and semantic similarity to be made.
Data retrieved from the final corpus was stored in a database created in MS Access,
containing two alternative formats: one lexical, the other pattern-based. Although it might
be objected that such an enterprise results in a certain amount of data redundancy as data
listed under a lexical entry is repeated under a pattern heading, the advantage on the other
hand is one of flexibility. Thus in the lexical database it is possible to retrieve all the
lexical items under a specific pattern while in the pattern database the reverse is possible
i.e. the query is a lexical item returning an output of patterns into which the word in
question enters.
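The two directions of query can be illustrated with a small relational sketch. It uses SQLite rather than MS Access and a single table of item and pattern pairs rather than two separately stored formats, so it is an analogue of the design and not a reproduction of it; the table name entry, the column names and the helper functions are hypothetical, and the rows are a handful of pairings taken from the tables above.

```python
import sqlite3

# Illustrative rows only. Table and column names are assumptions.
ROWS = [
    ("effect", "N on n"),
    ("effect", "N of n"),
    ("effective", "be ADJ for n"),
    ("essential", "be ADJ for n"),
    ("critical", "be ADJ for n"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (item TEXT, pattern TEXT)")
conn.executemany("INSERT INTO entry VALUES (?, ?)", ROWS)


def items_for_pattern(pattern):
    """Pattern as query: return the lexical items listed under it."""
    cur = conn.execute(
        "SELECT item FROM entry WHERE pattern = ? ORDER BY item", (pattern,)
    )
    return [row[0] for row in cur]


def patterns_for_item(item):
    """Lexical item as query: return the patterns it enters into."""
    cur = conn.execute(
        "SELECT pattern FROM entry WHERE item = ? ORDER BY pattern", (item,)
    )
    return [row[0] for row in cur]


print(items_for_pattern("be ADJ for n"))
print(patterns_for_item("effect"))
```

Storing each pairing once and querying it in both directions gives the same flexibility without repeating the data, at the cost of losing the two ready-made presentation formats.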
3.8.3 Limitations of the pattern grammar notation
In Allen (2002b:33) it was pointed out that there are a number of limitations involved in
the adoption of the pattern grammar notation for the type of data storage described above.
For example in the case of verbal causation it is not possible to capture using the notation
significant generalisations to the left of the search node. In active verb patterns for
example, the complete linear representation of causation would require an agentive
subject in this position. This difficulty is less serious than initially might be thought
because the pattern grammar is seen in this project as an interim stage only in the full
development of a specialized functional grammar.
On a more theoretical point, it has become apparent with the expansion of the pilot study
into the final corpus that the strict notion of a pattern grammar based on lexical co-occurrence and restriction in accordance with the idiom principle is in need of some
revision. The problem is exemplified in examples [26/27] below.
[26] …based on reports indicating that blood eosinophils diminish after treatment with
cyclosporin A (CyA) [8, 9, 10]
[27] max is the failure of atomic bonds across an atomic plane. A fracture plane will arise
when the stress at the tip of the pore reaches max.
In both cases a causal relationship can be inferred between an effect in the main clause
and a cause in the circumstantial adverbial. In pattern notation terms these relationships
centred on the verbs diminish and arise would be encoded as V after and V wh respectively. However the occurrence of the PP headed by after in [26] or the wh
conjunctive element in [27] represents an open choice which is not constrained by the verb
in each case. In other words there is no extension of meaning across the phraseological
profile defined by the verb and the immediately adjacent element. It is therefore
necessary to include open choices within the database notation scheme where there is an
inferential causative relationship at work.
3.9. Summary
This chapter has outlined the methodology for the construction of a specialized corpus
composed of recent biomedical research articles. It has been pointed out that the
identification of a causative sublanguage within this corpus is by no means
straightforward, necessitating the adoption of a definition of cause and effect which
embraces the heavily nominalised, inferential nature of causal links presented in the
scientific RA. Bearing in mind these considerations, a working definition of cause and
effect within the genre has been put forward as the basis for the retrieval of lexical items
as concordance queries. In the next chapter these items together with their lexical
grammatical patterns will be described.
4. The lexical patterns of cause and effect
4.1 Introduction
This chapter describes the results which have emerged from the heuristic survey of
lexical grammatical patterns encoding causation identified within the HBC. In sections
4.5-4.8 below the focus is on the nominal, verbal, delexical and adjectival lexical items
identified manually on the basis of the semantic intuition as described in the previous
chapter. The specific focus adopted here is on lexical items at the grammatical level of
group and clause, ignoring therefore the lexis involved in the expression of inter-clausal
hypotactic relationships (eg relationships introduced by the conjunctions since and
because and the relationships of cause / consequence between sentences signalled by
lexical items such as thus). It will be noted that inter-clausal causatives and discourse
marker signals of cause and effect were part of the previous statistical comparison of
causative lexis covering the 140,000 word pilot corpus (Allen 2002a). In the same article
it was pointed out however that there are substantial difficulties involved in terms of
encompassing clause-complex and discourse signalling relations of causation within the
Cobuild Grammar Pattern (henceforth CGP) notation system. The more restrictive focus
adopted in this thesis reflects these problems.
This chapter begins by providing an overall picture of the lexis of cause and effect in
terms of a selective comparison of frequency measures between the HBC and the general
language Guardian newspaper corpus. The bulk of the chapter however is devoted to the
presentation of a taxonomy of lexical patterns adopting the CGP notation which serve as
the basis for mapping on to functional elements to be outlined in chapter 5.
4.2 The lexis of causation
4.2.1 General
The manual analysis of the corpus described in chapter 3 resulted in the identification of
some 208 lexical items for inclusion in a lexical grammar of causation. Subsequent
analysis using Wordsmith Tools 3.0 indicated that these lexical items are distributed
between approximately 103 lexical grammar patterns. As will be discussed below
however a number of these patterns on closer analysis may represent open choices
(adopting the terminology of Sinclair 1991) and thus are not genuine patterns of co-selection in accordance with the idiom principle. Despite this theoretical objection, these
items are included in the lexical database which provides a dictionary-like listing of
causative lexis together with corpus examples illustrating the patterns into which the
words enter.
At the core of the approximately 200 lexical items identified in the corpus is a group
which has previously been referred to as the periphrastic causatives. Included within this
group are lexical items such as cause, effect, affect, make, get and causative have. Based
on the survey of the final corpus, it is argued that a focus purely on the periphrastic verbs
is artificially restrictive if the goal is a complete description of causation within the
biomedical sub-genre. An alternative picture which emerges from the HBC is one of a
large group (some 140 lexical items) of monotransitive verbs which predominate in the
encoding of causal logico-semantic relationships between heavily pre- and post-modified
nominal groups.
One subset of these causative verbs can be identified on the basis of regular
morphological endings which have a causative meaning (-ize/-ise/-ate). In semantic terms
however these morphological causatives do not constitute a separate group within the
monotransitive subset. The great majority of lexis therefore can be found in a general
language corpus.
As mentioned previously the adoption of a corpus-driven methodology resulted in the
decision being made not to attach morphosyntactic tags on an automatic basis to the
corpus at the ‘pre-processing’ stage although it may be the case that future software
applications of the local grammar may involve the use of a POS-tagged corpus. For the
majority of items identified from the manual survey, this strategy did not raise significant
problems for the retrieval of concordance data. For the great majority of verbs with
regular morphological endings, a wildcard search resulted in the retrieval of past tense /
participle, third person –s and –ing form present participles. The retrieval of causative
have and get posed the greatest problems however. Gilquin (2002) outlines a method for
the retrieval of causative have which makes use of the skeletal parsing and
morphosyntactic tagging scheme in the very small ICE GB corpus. As the option of
grammatically parsing the corpus was not feasible, examples of causative have had to be
obtained by manually disambiguating causative uses from the overwhelmingly dominant
possessive and auxiliary verb uses. Get is a similarly polysemous word but fortunately
from the point of view of retrieval its frequency for all uses was very low in research
article text.
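The wildcard retrieval of regular verb forms can be sketched with a regular expression. The function wildcard_to_regex is a hypothetical stand-in for a concordancer query language in which an asterisk matches any run of letters, and the sample sentence is invented.

```python
import re


def wildcard_to_regex(query):
    """Translate a query in which * matches any run of letters into a
    compiled whole-word regular expression."""
    body = re.escape(query).replace(r"\*", r"[a-z]*")
    return re.compile(rf"\b{body}\b", re.IGNORECASE)


# Invented sample text (an assumption of this sketch).
SAMPLE = (
    "Smoking causes disease; the drug caused nausea, causing concern. "
    "The cause is unknown, but causation is hard to prove."
)

pattern = wildcard_to_regex("caus*")
print(pattern.findall(SAMPLE))
```

A single query retrieves the base form together with the -s, -ed and -ing forms, but it also returns the noun causation, so some manual post-editing of the output remains necessary on an untagged corpus.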
4.2.2 Frequency measures
Primarily the presentation of frequency measures for selective causative lexis seeks to
fulfil a number of purposes. Firstly by concentrating on a selection of ‘prototypical’
causative verbs such as cause, affect and prevent marking the stoppage or blockage of the
causative process the frequency counts aim to demonstrate the centrality of causation to
the RA genre compared with a general language corpus (footnote 19). While have and especially
make also have significant causative uses, it is difficult to disambiguate these specific
functions in an automated frequency measure exercise and so these were excluded from
the count. In addition the noun effect which frequently occurs in the HBC corpus in
delexical patterns (these will be further discussed in section 4.6 below) is also compared
across the two corpora.
Footnote 19: For the purposes of the informal comparison presented here, the 95 million word newspaper corpus is
assumed to be representative of a non-register / genre specific variety of English.
Secondly, the comparative focus on selective coercive-manipulative causatives (Givón
1980:333) such as manipulate and compel seeks to justify the concentration on the RA in
terms of reducing the otherwise unwieldy complexity involved in describing causation in
a general language ‘mega’ corpus such as the Bank of English or BNC. Differences
between specialist and generalist corpora are further highlighted by comparing a number
of more technically restricted morphologically-marked causative verbs such as
metabolize and devascularize.
A final dimension explored for comparison, that of formality, is provided by the verb get. The
high degree of polysemy which this verb exhibits makes exact comparisons between
causative uses (with nominal and to-infinitive objects) problematic.
A comparison of frequencies between the HBC and Guardian corpora.
Frequency (Noun N /
Verb V)
cause (N +V)
compel (V)
metabolize (V)
Guardian Corpus
1380 / million words
4041 / million words
833 / million words
681 / million words
87 / million words
250 / million words
4.1 / million words
1.55 / million words
32 / million words
290 / million words
203 / million words
90 / million words
118 / million words
1195 / million words
705 / million words (footnote 20)
9.9 / million words
12.23 / million words
0.18 / million words
devascularize (V)
1.55 / million words
Footnote 20: This figure includes nominal uses of force as in police force etc. and is therefore somewhat inflated.

While this table sets out to provide an informal comparison only, a number of points can
be made relating to lexical distributions in the final corpus. It is clearly the case that
‘prototypical’ causative lexis such as cause, effect, affect and prevent is much more
abundant in the scientific research sub-genre. The statistics for force and get are more
difficult to interpret owing to problems in disambiguating causative meanings in both
corpora. Causative get however occurs only once in the HBC, which can be seen as a
crude measure of the formality of scientific writing. Coercive-manipulative verbs are
similarly under-represented, due presumably to the concealment in scientific research
writing of human agency in the causative process. This sub-set of verbs can be seen
largely in terms of relaying logico-semantic relations between nominalized scientific
processes and arguments. Finally the frequencies for the morphological causatives
metabolize and devascularize confirm the specialized nature of the discourse.
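Because the two corpora differ greatly in size, comparisons of this kind rest on frequencies normalised per million words. The arithmetic can be sketched as follows; the functions are hypothetical, and the raw counts and corpus sizes used in the example are invented round numbers which do not reproduce any figure in the table.

```python
def per_million(raw_count, corpus_size):
    """Normalise a raw frequency to occurrences per million words."""
    return raw_count * 1_000_000 / corpus_size


def ratio(freq_a, freq_b):
    """How many times more frequent an item is in corpus A than in B."""
    return freq_a / freq_b


# Invented counts (assumptions): 500 hits in a 2 million word specialist
# corpus against 950 hits in a 95 million word general corpus.
specialist = per_million(500, 2_000_000)
general = per_million(950, 95_000_000)

print(specialist, general, ratio(specialist, general))
```

In this invented case the item has fewer raw occurrences in the smaller corpus but is twenty-five times more frequent once corpus size is taken into account.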
Causation is traditionally seen in terms of transitive verbs. However it is important to
account for the role of nominal groups and adjectives in the encoding of cause and effect
within the genre. With regard to nominalization, the noun effect (the most frequent
causative item in the corpus at 4041 occurrences within the 1.93 million words) has a
particularly complicated lexical grammar with participation in a number of delexical
patterns. Other similar lexical items include factor, role and result which occur frequently
in expressions of causation.
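The raw count quoted above for effect can be related to the normalised (per million word) frequencies used in the comparison table by simple arithmetic. The following is a minimal sketch only: the function name per_million is an assumption made for this illustration and does not belong to the thesis, and the corpus size is the 1.93 million word figure quoted above.

```python
def per_million(raw_count: int, corpus_size_words: int) -> float:
    """Return the number of occurrences per million words of running text."""
    if corpus_size_words <= 0:
        raise ValueError("corpus size must be positive")
    return raw_count * 1_000_000 / corpus_size_words


# The noun "effect": 4041 occurrences in a corpus of about 1.93 million words.
effect_rate = per_million(4041, 1_930_000)
print(round(effect_rate, 1))  # prints 2093.8
```

On these figures the noun effect occurs roughly 2094 times per million words, which places it on the same scale as the normalised frequencies in the table.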
Adjectival items are perhaps less prototypically associated with causation. However the
adoption of a lexical pattern approach outlined previously necessitates that one lexical
item has to be selected as the point of entry into the lexicogrammatical system. As will be
revealed in section 4.7 below, some adjectives play a central role in the evaluation of the
epistemic status of the causal relationship demonstrated in the experimental data. The
patterns of evaluative adjectives such as significant, important, essential, critical are
examples of adjectives at the centre of patterns of cause and effect which will be
discussed in more detail below.
4.3. The taxonomy
4.3.1 Outline
The pattern grammar of causation outlined in this section essentially amounts to a
taxonomy of lexical grammatical patterns identified in the corpus. It is natural therefore
to make a number of preliminary comparisons between the patterns presented in this
chapter with the other published local grammar taxonomies of Hunston and Sinclair
(2000) and Barnbrook (1995, 2001). With regard to the work of Barnbrook on dictionary
definitions for example, the approach taken from the beginning has been to analyze
definition sentences into functional elements, taking the analysis of dictionary definition
sentences put forward by Sinclair (1991) as its starting point. In Barnbrook’s analysis,
each functional element is firstly described and exemplified before the 17 identified
patterns making up the CCED definition sentences are presented. The definition sentences are then
‘fitted’ into the functional categories put forward. The confrontation between data and
categories which this methodology entails does at times force a revision of the initial
categories (see Barnbrook 2001:141 for a description of the revision process).
While the definition grammar serves as the initial model for a functional division of a
sublanguage, the methodology for the construction of a taxonomy employed by Hunston
and Sinclair (2000) and Hunston and Francis (1999) is closer to the causation grammar
presented in this thesis. It might be possible to see this approach as a ‘bottom up’ rather
than ‘top down’ approach in the sense that the point of departure is the lexical
grammatical pattern centred on the node in a set of concordance lines instead of
functional categories imposed from ‘above’. The process can essentially be seen therefore
as being composed of two stages: a specification of the lexical patterns (stage 1) followed
by a functional mapping (stage 2). The lexical pattern taxonomy described here in this
chapter should be seen as an intermediate, data processing stage as a prelude to the local
grammar itself, which is specified in chapter 5.
4.3.2 The pattern taxonomy
On the basis of the corpus evidence, causation can be found not only as a logico-semantic
relationship between nominal groups encoded at clausal level but also bound up within
the internal structure of the nominal group. However the prevalence of nominalization
within scientific writing which has already been commented upon necessitates the
representation of causal relationships contained within the nominal group in terms of pre-
and post-modifiers. Initial analyses based on the corpus have also revealed the role of
adjectives in encoding an evaluative aspect of the causative process. Indeed Hunston and
Francis (2000: 35) specifically note the potential dual function of sentences
simultaneously encoding both causation and evaluation. The use of evaluative adjectives
is directed towards an expression of the writer’s assessment as to the epistemic
significance of the logico-semantic relationship set up between the nominal groups.
For all four over-arching groups the main criterion for differentiating between the
patterns has been the type of complementation pattern associated with the central pattern
item. In the case of adjectival patterns however, there is an association between the
copular verb (in Collins Cobuild English Dictionary21 terms referred to as a link verb,
notated as v-link in the pattern grammar), the central adjective and the complementation
pattern which the adjective co-selects in accordance with the idiom principle. It is
acknowledged here that there is a partial degree of overlap between this presentation and
the preliminary exemplification of patterns in Allen (2002b:25-30) based on the smaller
pilot corpus. The intention in this chapter however is to provide a more exhaustive
treatment of patterns by supplementing and extending this previous material.
The boxes below set out the main elements of the taxonomy in terms of the verbal,
delexical, nominal and adjectival patterns which encode cause and effect in the corpus.
The accompanying numbers indicate the relevant sections in the commentary below.
Verbal patterns 4.4
Simple 4.4.2
V n, V n n, V n v, V n to-inf, V it ADJ to-inf
be V-ed by n, be V-ed by v-ing
Prepositional 4.4.3
V from n, V in n, V with n, V through n, V to n, V toward n, V for n,
V n to n, V into n, V n to n
be V-ed after n, be V-ed between n and n, be V-ed following n,
be V-ed for by n, be V-ed in n, be V-ed through n,
be V-ed through v-ing, be V-ed to n, be V-ed via n, be V-ed with n
be V-ed to n
Clausal 4.4.4
V that, V wh
Delexical patterns 4.5
Patterns with have 4.5.2
have N on n, have N in n, have N through n, have N v-ing
against n
Patterns with play 4.5.3
play N against n, play N in n, play N during n, play N in n,
play N in v-ing, play N by v-ing
Nominal patterns 4.6
Internal 4.6.2
n V-ed by n, n V-ed from n, n V-ed in n, n V-ed to n
n V-ed with n
External 4.6.3
v-link patterns
v-link N in n, v-link N for n, v-link N of n, v-link N that
Patterns with existential there
there v-link N between n and n, there v-link N with n
Adjectival patterns 4.7
v-link ADJ with n, v-link ADJ adv to-inf,
v-link ADJ against n, v-link ADJ as n, v-link ADJ for n,
v-link ADJ in n, v-link ADJ in v-ing, v-link ADJ of n,
v-link ADJ of v-ing, v-link ADJ on n, v-link ADJ to n,
v-link ADJ to-inf, v-link ADJ upon n
21 Sinclair, J. (1995) Collins Cobuild English Language Dictionary 2nd Edition London & Glasgow: Harper
4.4 Verbal patterns
4.4.1 Overview
Verbal patterns exhibit the greatest variety in terms of complementation pattern with
nominal group objects (n), adjectives (ADJ), more rarely pronoun (it), non-finite clauses
(to-inf), finite clauses (V that, V wh-) etc, in addition to a number of patterns involving
the co-selection of various prepositions. The procedure adopted here has been firstly to
make a broad distinction between simple, prepositional and clausal verb
complementation patterns. Simple patterns essentially group nominal and infinitival
object complements together. Following Hunston and Francis (1999 :71-72), the co-
selection or constraint over the choice of the following preposition by the node verb
provides a further group of prepositional verbs in the narrow sense (Quirk et al
ibid.:1150-1161). Complementation of the verb by that- or wh- subordinate clausal
objects necessitates the setting up of a final verb clausal category.
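The three-way division into simple, prepositional and clausal complementation can be illustrated with a rough sketch that sorts pattern strings by the elements they contain. This is an assumption-laden illustration rather than the procedure used in the thesis: the function name, the preposition list (drawn from the verbal patterns in the taxonomy boxes) and the treatment of agentive by as belonging to the simple passive are all choices made for this example.

```python
# Prepositions co-selected by the verb in the verbal patterns of the taxonomy.
# Agentive "by" is deliberately excluded so that "be V-ed by n" counts as simple.
PREPOSITIONS = {
    "from", "in", "with", "through", "to", "toward", "for", "into",
    "after", "between", "following", "via",
}
# Markers of finite clausal complementation (V that, V wh).
CLAUSAL_MARKERS = {"that", "wh", "wh-"}


def classify_verbal_pattern(pattern: str) -> str:
    """Assign a pattern string to the simple, prepositional or clausal group."""
    tokens = pattern.split()
    if any(tok in CLAUSAL_MARKERS for tok in tokens):
        return "clausal"
    # "to-inf" is a single token, so it is not confused with the preposition "to".
    if any(tok in PREPOSITIONS for tok in tokens):
        return "prepositional"
    return "simple"


for example in ["V n", "V n to-inf", "be V-ed by n", "V from n", "V that"]:
    print(example, "->", classify_verbal_pattern(example))
```

A purely string-based test of this kind would not capture the theoretical point that a preposition must be co-selected by the verb; it only mirrors the notation.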
A rather different dimension in the verb complementation taxonomy is provided by the
distinction made between active and passive voice. Here the pattern notation differs
slightly from that presented in Hunston and Francis (1999). In the original presentation,
the passive pattern is regarded as a variation of the active pattern. Thus for a ditransitive
verb in the pattern V n n, the passive equivalent be V-ed to n is not listed as a separate
pattern in the Hunston and Francis analysis. The point that these authors make is that
syntactic variations caused by passivisation or fronting for example do not disrupt the
underlying patterns of verb valency / prepositional co-occurrence. For the purposes of
this thesis however the decision was made from the outset to list verbs in their active and
passive patterns separately as this representation facilitates the task of functional mapping
outlined in chapter 5.
4.4.2 Simple verbal patterns
This sub-section discusses some of the most important simple verb patterns at the level of
clause to emerge from the preliminary analysis. The presentation is divided up into active
and passive patterns and then further numbered in terms of the most significant individual
patterns.
Active patterns
The V n pattern encoding in traditional terms a monotransitive verb is by far the most
important numerically in the corpus, with some 134 lexical items listed in the pattern
database. For this reason this pattern will receive a more in-depth treatment. Space does
not permit an exhaustive listing of all lexical items in this pattern; the complete list is
stored in the lexical / pattern database referred to in chapter 3.
The self-curing resins (Protemp Garant and Integrity) caused a significantly higher
temperature rise during polymerization than the dual-c SRA104(1419)
immunological mechanisms and ischemic-reperfusion damage of the allograft may
activate apoptosis in cardiomyocytes and endothelium. SRA570(2941)
This disruption of the regularity of the cycle may create psychic tension in these girls.
Following the methodology set out by Hunston and Francis, the focus of the pattern
representation has been on the valency of the verb, in this case the direct object nominal
group. Hunston and Francis (ibid.: 77) note the tendency for the pattern
representation to over-generalize. This phenomenon can be illustrated with the nominal
group n which can often be post-modified with a prepositional phrase as in the examples
in the table above. Prepositional phrases (henceforth PPs) such as in cardiomyocytes in
the example above are not part of the valency / co-occurrency constraint tied to the verb
and so are not represented as separate patterns such as V n in n. However as will be noted
in chapter 5 the functional roles of these PPs need to be accounted for in the final local
grammar if such a grammatical representation is to embody parsing potential.
In order to compensate for the over-generalisation inherent in pattern representation, it is
necessary to record significant collocational restrictions in the nominal group in some
cases. For example, the verbs exert and give occur in V n patterns encoding cause and
effect but the choice of nominal group is restricted in accordance with the idiom
principle. Thus we find the verb exert exhibiting co-occurrency restrictions with the
significant collocates effect, influence and less commonly action underlined:
has been designated as thermotherapy because heat in this range of temperature exerts
an irreversible cytotoxic effect. SRA387(77)
The socioeconomic environment has long been known to exert a powerful influence on
health, SRA79(639)
Conversely, CPA was found to exert exclusive antagonist action when AR and reporter
gene were stably integrate SRA340(4636)
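The collocational restriction illustrated in these lines could be checked mechanically along the following lines. This is a minimal sketch under stated assumptions: the function name, the five-word window and the crude tokenisation are inventions of this example, and only the three collocates named above are searched for.

```python
import re

# Word forms of the node verb and the restricted collocates noted in the text.
EXERT_FORMS = {"exert", "exerts", "exerted", "exerting"}
RESTRICTED_COLLOCATES = {"effect", "influence", "action"}


def exert_collocate(line: str, window: int = 5):
    """Return the first restricted collocate found within `window` words
    to the right of a form of exert, or None if there is none."""
    tokens = re.findall(r"[a-z]+", line.lower())
    for i, tok in enumerate(tokens):
        if tok in EXERT_FORMS:
            for following in tokens[i + 1 : i + 1 + window]:
                if following in RESTRICTED_COLLOCATES:
                    return following
    return None


lines = [
    "heat in this range of temperature exerts an irreversible cytotoxic effect.",
    "has long been known to exert a powerful influence on health,",
    "CPA was found to exert exclusive antagonist action when AR and reporter",
]
for line in lines:
    print(exert_collocate(line))
```

Plural forms such as effects are not matched by this sketch; a fuller treatment would lemmatise the tokens first.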
It would appear that the boundary between these restricted collocates and the patterns
described as delexical in section 4.5 is somewhat fuzzy. The question might be asked at
this point as to the difference between exert an effect and have an effect, which it might
be argued both belong to the same V n pattern. Here it is argued that the verb have is
emptier of lexical meaning than exert and will thus be classified under the heading of
delexical verbs.
The case of exert is also important as it illustrates the role of various lexical items with
general, non-technical reference as heads of nominal groups, such as effect, influence,
action or impact. The semantic reference of these groups is therefore realised through
pre-modifying adjectives (in the examples above irreversible or powerful) and/or
post-modifying prepositional phrases. In the second example above influence is post-modified
by the PP on health, which serves to specify the medical impact of the effect.
Within such a large group of verbs in this pattern it is possible to identify a number of
semantic dimensions. This division lacks the rigidity of a formal semantic analysis given
the difficulties involved in developing ‘water-tight’ categories but nevertheless serves to
illustrate the range of causative meanings encompassed by simple transitive verbs in
scientific text. The table below sets out to describe these ‘meaning groups’ in more detail.
Semantic sub-sets within monotransitive causative verbs
Semantic sub-set
monotransitive verb examples
allow, facilitate, permit, potentiate
create, generate, fabricate, yield, produce
change, adjust, modify, modulate
accelerate, delay, mobilize, reverse
increase, amplify, augment, strengthen, reinforce
complicate, elaborate
weaken, attenuate, diminish, lessen
hamper, impair, constrict, suppress, distort
improve, benefit, enhance, relieve
preserve, restore, stabilize, rigidize, sustain
worsen, aggravate, damage, exacerbate
cause, affect, dictate, determine, influence, induce
explain, imply, clarify, infer
relate, constitute, reflect, underlie, correlate
prevent, stop, block, impede, counteract,
eliminate, eradicate, obviate, obstruct, disrupt
Although partly modelled on a thesaurus representation, the sub-hierarchical divisions of
this table should not be seen as embodying the rigour of a lexical thesaurus, such as the
WordNet initiative (Miller et al. 1993). The ‘allow’ group contains verbs with a non-factive
meaning and often relays the potential or possible effects of a cause rather than
making a commitment to the existence of the causative relationship. A very significant
proportion of transitive verbs have a meaning which can be glossed as ‘creating’ or
‘producing’, which leads to the second meaning group labelled ‘create’ in the table. The
‘mediate’ group is strictly speaking not a group at all as it consists of a single lexical
item. It is arguable here whether mediate is actually a causative in the sense that the
attribution of agency is indirect. For the purposes of the grammar being put forward in
this thesis however the role of indirect agency is subsumed within causation as in the
example below:
This receptor mediates a remarkable vasodilating effect after activation by any
of several CCs SRA502(6355)
In the example above, This receptor is not the causing entity but rather plays a passive
role (in the non-grammatical sense) as a mediator of an effect, the cause of which is
encoded in the sentence-final adverbial. In chapter 5 this functional role of mediator
will be explained more fully.
The control sub-group contains central causative verbs such as affect, cause and influence
which mark causal linkages without the added notion of change in the effect nominal
group. A more complicated group in terms of potential sub-divisions is the change group.
Under this heading it was possible to delineate a total of at least eight strands of meaning
(marked in bold in the table above) outlined in more detail below. These sub-divisions
are all loosely related to the changing or altering of the physical and biomedical
properties of entities or parameters encoded as effects.
The overall heading change can be seen as being neutral in the sense that there is no
explicit mentioning of the parameters or directionality of change ie in terms of increases
or decreases. The choice of designation for the sub-headings was made on the basis of the
identification of a typical member of the sub-group eg increase, decrease etc. The
relationship between the members of each category and the sub-heading is therefore one
of approximate synonymy. The increase group consists of causative verbs linking a
causing event or entity to an effect marking an increase, strengthening or reinforcement
in some observational or experimental parameter. Similarly the decrease sub-group is
broadly antonymous with the increase group encoding causative processes resulting in
lowering, weakening or de-intensifying of effects.
The complicate group recognises an effect which may be glossed as either rendering a
phenomenon difficult to understand or in some cases disrupting or interfering with the
phenomenon. It could be argued that the improve and worsen groups could be seen as
being part of the increase and decrease meaning groups respectively. The decision
adopted here to regard these as separate categories has been made on the basis that
improving or worsening is seen from the perspective of the patient’s recovery from a
medical condition in contrast to a more neutral viewpoint on a biomedical process.
Continuing under the overarching heading of change it is also possible to see a small
group of transitive verbs in terms of encoding the maintenance or preservation of a
particular parameter. Thus the preserve group includes restore, stabilize, rigidize and sustain.
In contrast to these ‘external world’ verbs marking physical processes, there is a small
but not insignificant group of lexical items such as explain, imply and infer. These verbs
encompass ‘internal world’ cognitively perceived causative links. Finally there would
appear to be an important group of verbs in this pattern which serve to relate a less
explicit, more inferential relationship between cause and effect nominal groups. This
group is referred to as the relate group and includes in addition relate, underlie, correlate
and reflect.
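The meaning groups discussed above lend themselves to representation as a simple lookup from verb to group label. The sketch below is illustrative only. The pairing of labels with verb lists is this example's reading of the table and commentary and is restricted to groups named explicitly in the text; the names MEANING_GROUPS and meaning_group are assumptions of the example, not part of the lexical / pattern database described in chapter 3.

```python
# Group labels are those named in the commentary; verb lists follow the table.
# The assignment of each list to a label is an assumption of this sketch.
MEANING_GROUPS = {
    "allow": ["allow", "facilitate", "permit", "potentiate"],
    "create": ["create", "generate", "fabricate", "yield", "produce"],
    "increase": ["increase", "amplify", "augment", "strengthen", "reinforce"],
    "decrease": ["weaken", "attenuate", "diminish", "lessen"],
    "complicate": ["complicate", "elaborate"],
    "improve": ["improve", "benefit", "enhance", "relieve"],
    "worsen": ["worsen", "aggravate", "damage", "exacerbate"],
    "preserve": ["preserve", "restore", "stabilize", "rigidize", "sustain"],
    "control": ["cause", "affect", "dictate", "determine", "influence", "induce"],
    "mediate": ["mediate"],
    "relate": ["relate", "constitute", "reflect", "underlie", "correlate"],
}

# Invert the table so that each verb points to its meaning group.
VERB_TO_GROUP = {
    verb: group for group, verbs in MEANING_GROUPS.items() for verb in verbs
}


def meaning_group(verb: str):
    """Return the meaning group label for a verb lemma, or None if unlisted."""
    return VERB_TO_GROUP.get(verb.lower())


print(meaning_group("amplify"))    # prints increase
print(meaning_group("attenuate"))  # prints decrease
```

A lookup of this kind records the semantic grouping only; it says nothing about pattern, since every verb listed here occurs in V n.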
At this point the question can be raised as to whether there is any correspondence
between these meaning sub-groups and any specific grammar patterns which they
realise. The demonstration of a correspondence between patterns of co-selection and
meaning has been a long-standing research aim of corpus-driven methodologies. In this
case however it would appear that all these identified strands of meaning in the table
above are encoded through the same V n pattern. In order to disambiguate these semantic
sub-divisions it is necessary instead to examine the specific lexis realising the nominal
group n together with the internal structure of n in terms of pre- and postmodifiers to the
head. This argument can be illustrated with reference to the lexical items improve and
increase:
by Henning (1920), however, suggested the possibility that macular pigment
could improve vision in the atmosphere by improving contrast relations (as cited
in Walls SRA395(2893)
It is known that maternal smoking increases the likelihood of children
developing wheezing illnesses and in SRA79(378)
Both these verbs in the examples above are monotransitive ie in pattern grammar terms
they have the same V n designation. The disambiguation of the two meaning groups
however comes from the distinction made between the semantics of the two nominal
groups, vision and likelihood of children developing wheezing illnesses respectively. In
the first example, vision can be seen as an entity which medical science should seek to
improve or restore. In the second example however, the head of the nominal group
likelihood can be measured /quantified.
V n n
This ditransitive pattern is restricted in the corpus to two periphrastic verbs, cause and
make respectively. In the case of cause, the first object marks the participant (patients)
which is the recipient of the causative process while the second object embodies the
effect, in this case the production of medium stress in the patient. The example of make
shown below has a different semantic role for the second nominal group object. This time
there is a broadly synonymous relationship between the first object CT and powerful
diagnostic tool. In other words the second object essentially recasts the first nominal
group in terms of its precise medical role as a diagnostic tool.
other hand nurses didn’t believe as much and answered that lack of time devoted causes
patients medium stress (62,16%), SRA83(2373)
technology in diagnosing appendicitis. Advancement in radiographic imaging has made
the CT scan a powerful diagnostic tool in the evaluation of acute appendicitis
V n v
Liquids of low viscosity make the examiner press the instrument more firmly onto the
tumor SRA469(306)
Only one example of a bare infinitive complementation pattern was retrieved from the
corpus as shown above with the lexical item make. This pattern would appear to be of
only marginal importance in encoding causal relationships within the sub-genre.
V n to-inf
Again this is a pattern strongly associated not only with certain periphrastic uses but also
with a quite well-defined group of coercive-manipulative causatives as evidenced by
work on a general corpus. Work here on the restricted genre of the medical RA provides
a rather different picture. Periphrastic verbs such as get which occurs in this pattern in the
general corpus are subject to stylistic constraints; this verb is very infrequent in the HBC
in a causative sense and is restricted largely to meanings which can be glossed under
acquisition and also in the collocation get + pregnant. On the basis of intuition it might
be expected that force would also occur in a V n to-inf pattern with a causative meaning.
In the HBC however the occurrence of force is restricted to passive patterns, which will be
reviewed below. This tendency to conceal human agency through
passivization in informative texts such as scientific RAs has been noted by a number of
authors (eg Quirk et al. ibid.:166; Bazerman 1984:177 cited in Swales 1990). This
concealment of the author-experimenter’s voice is further revealed in the dearth of
coercive-manipulative verbs, which might be regarded as a subset of causative verbs.
Francis et al. (1996:290) list a number of verbs under the V n to-inf heading; with this
lexical pattern however there is a semantic requirement that a human agency should be
the initiating agent of the causative process. The only examples of this type of causation
found in the corpus are compel and prompt, as shown in the table below. The verb
induce however is not subject to such agency constraints; the causing or triggering event
can be non-animate. It must be acknowledged however that occurrence of the V n to-inf
pattern for induce is overshadowed by V n and its passive equivalent:
. An eosinophil cell component, which may cause a pathologist to suggest a drug cause,
seems to be rare and, when prese SRA42(2835)
These statistics compel physicians to make the utmost effort to detect melanomas at an
early, trea
Tumor cells can also express factors to induce surrounding stromal cells
to express MMPs SRA286(1995)
ding of severely high ICP values and symptoms of intracranial
hypertension prompted us to repeat CSF subtractions till postoperative
day 4. SRA360(1893)
The availability of depression-specific psychotherapies stimulated efforts
to increase uniformity in their delivery in clinical practice.
V it ADJ to-inf
This more complex pattern is similarly restricted to one lexical item in the corpus. As
with the V to-inf pattern described above, the importance of make as a periphrastic
causative necessitates documentation of all the patterns which it enters into encoding
cause and effect. The pattern is also described in Hunston and Francis (2000:131)
drawing from the B of E general corpus. However it would appear to be very much
lexically restricted (patterns occur mainly with the verbs make and the non-causative
find). Unusually the pattern includes complementation of the verb with the pronoun it
which is relatively empty of referential meaning (cf it as an extraposed subject Quirk et al
ibid.:1393), followed by an evaluative adjective which is in turn complemented by a
non-finite clause. As pointed out previously, while it is possible following Hunston and
Francis to see this pattern in evaluative terms, it is argued here that the pattern also
denotes a cause and effect connection, which in the example below marks the
consequence/effect of the lack of experience encoded as a cause.
ecimens in the pathology laboratory. The lack of experience with atrial
appendages makes it difficult to assess what is abnormal and what is the
range of histologic var SRA443(1099)
Passive patterns
The next section describes typical patterns of passivisation involving simple (ie
non-prepositional) verbs found in the corpus. The two predominant patterns here are listed in
pattern grammar terms as be V-ed by n and be V-ed by v-ing respectively.
be V-ed by n
The be V-ed by n pattern is the most important passive pattern found in the corpus in terms of
the number of lexical items which enter into it. The lexical database lists a total of 60
lexical items under this pattern. One source of difficulty here is posed by designating the
past participle V-ed of the verb as either an adjectival pattern or as a so-called agentless
passive. Agentless passives (Quirk et al. ibid:164-165) occur frequently in the corpus. An
example of this dilemma can be found for the verb activate shown in the example below:
Hence, at sites where platelets and/or the coagulation cascade are activated, the
endothelium releases vasodilators and platelet inhibitors, such as N SRA391(946)
On the one hand the past participle activated might be regarded as being adjectival ie it
describes the state of the coagulation cascade. Clearly such a statement of state cannot be
regarded as causative as it stands as there is no attribution of agency. An alternative
perspective might analyze the sentence causatively as a passive without the agentive PP
(ie the sentence merely describes the effect as the state minus the cause). For the purposes
of describing patterns of causation, a decision was made to focus on passives with
explicit agents realised by PPs headed by the preposition by as shown in the table below.
ochemical method is not affected by post mortem autolysis or formalin
fixation [14]. SRA570(2161)
miniferous tubules suggesting that the dilation of rete testis in NRTN
mice might be caused by an obstruction. SRA341(2416)
exacerbate In one case, symptoms were exacerbated by ultrasound, and i..
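The decision to restrict attention to passives with an explicit by-agent can be approximated with a surface search of the following kind. This is a rough sketch, not the retrieval procedure used for the corpus: the regular expression, the function name and the restriction to regular -ed participles are assumptions of the example, and irregular participles such as given would be missed.

```python
import re

# A form of "be", an optional "not", a regular -ed participle, then "by".
BE_FORMS = r"(?:am|is|are|was|were|be|been|being)"
PASSIVE_BY = re.compile(
    rf"\b{BE_FORMS}\s+(?:not\s+)?(?P<verb>\w+ed)\s+by\s+(?P<agent>\w+)",
    re.IGNORECASE,
)


def find_by_passives(text: str):
    """Return (participle, first word of the by-phrase) for each match."""
    return [
        (m.group("verb").lower(), m.group("agent"))
        for m in PASSIVE_BY.finditer(text)
    ]


print(find_by_passives("mice might be caused by an obstruction."))
print(find_by_passives("the coagulation cascade are activated, the endothelium"))
```

The first call finds caused with the by-phrase beginning an, while the second returns an empty list because the agentless passive activated is not followed by by.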
be V-ed by v-ing
This pattern represents the second variation on the simple passive verb group. Here the
past participle form of the verb is complemented by a PP itself headed by the preposition
by and complemented by an –ing form of the verb. The –ing form can be seen as a
nominalization of a verbal process.
In a prospective study, the median survival time after retinal detachment
caused by necrotizing retinitis was approximately 7 months.
In preparation for staining, endogenous peroxidase activity was blocked
by washing with 0.3% hydrogen peroxide for 5 min. Slides were I
4.4.3 Prepositional verb patterns
This section describes the patterns of prepositional and other multi-word verbs identified
as encoding causative relations in the corpus. The pattern presentation follows that of
section 4.4.2. above in separating active from passive patterns, rather than seeing
passivisation as a variant on the active pattern.
Active patterns
A total of ten separate patterns of active verbs with co-selected prepositions were
identified in the corpus: V from n, V in n, V in v-ing, V with n, V through n, V to n, V
toward n, V into n, V for n, and V n to n. These patterns will now be exemplified in turn.
V from n
rends were observed at the two sites (Paris, Dublin). Alternatively,
variations may arise from `differences in the type of healthcare
institution (private, public, private in
geons must decide whether to treat only one condition or both. The
disparity in the results could result from the fact that the IOP is not
stable but fluctuates daily in nor
A total of 7 verbs were identified as occurring in this pattern, one of which (result) was
identified in the original pilot study (Allen 1998:33-34). Other verbs in the pattern
include benefit, originate, secrete and stem. The example of secrete in the V from n
pattern would not appear on reflection to be causative as the verb plus preposition
combination encodes an emanation from a physical source or direction rather than a
genuine cause and effect relationship. This example illustrates a semantic relationship of
key importance especially within the category of prepositional verbs. It would appear that
the causative interpretation of these verbs is essentially metaphoric in the sense that there
is a literal interpretation tied closely to the meaning of the preposition in terms of
displacement to or from the source or direction. The preposition from can be interpreted
as marking a spatial source or a causative source.
V in n
In engineering materials, these internal stresses and strains would result
in structural damages due to micro-cracking and interfacial debonding.
a commercial sample indicates that other compounds present in the
sample do not interfere in the voltammetric determination of FBZ. The
determination of FBZ conc SRA526(1770)
Other verbs sharing this pattern in encoding causation include aid and contribute. Again
the metaphorical interpretation of the preposition in is important in marking the scope of
causation. Here however rather than marking directionality, the semantic relationship is
more one of inclusion in the sense that the action of resulting and interfering is seen as
belonging to structural damages and voltammetric determination respectively. A
variation on this pattern is V in v-ing in which the complement of the preposition is an
–ing participle.
V toward n
Both V toward n and V to n represent patterns where there is a similar metaphorical
extension of meaning from the core directionality of the preposition to the directionality
of the causal process:
reasonable to conclude that actions of locally released cytokines and chemokines
contribute toward BHR in mice, but few studies to date have conclusively demonst
V to n
enetic background or inherited susceptibility may also contribute to the causal
inophil-differentiating factor) are regulated by IL-2, the inhibition of IL-2 by CyA
may lead indirectly to diminished eosinophil counts in the peripheral blood [23]. IL-2
is a
may be a manifestation of tuberculosis. Congenital weakness of the fibrous annuli
predisposes to the development of such aneurysms, particularly when
In the case of the V to n pattern verbs above, the lexical items contribute, lead and
predispose realise somewhat different semantic areas (loosely glossed as addition,
guiding or directing or to cause to be liable to perform a certain action) in isolation. The
common factor uniting these verbs is therefore the directionality embodied by the
preposition to.
V through n
hypothesis," which proposes that cardiovascular disease and type II diabetes originate
through adaptations that the fetus makes when it is undernourished.
Like several patterns involving prepositional verbs, V through n is more frequent in the
passivized form be V-ed through n. The choice of the preposition through is significant
in that it marks a mediatory role for the following nominal group which plays a more
passive role in the process.
V into n
especially considering the severity of this complication. Failure to recognize this may
translate into a high postoperative mortality rate and long-term poor outcome due t
In the active pattern V into n seems to be restricted lexically to translate. The literal
meaning of translating from one language to another seems here to be extended
metaphorically to a causative change of state, marking the causative relationship as a
transition from one entity into another.
V with n
Growth velocities diminished significantly with age in each of these four variables.
These symptoms reversed with cessation of therapy [7]. SRA374(688)
Here we have an example of a preposition, with, which does not strictly speaking co-occur
with the verb according to the idiom principle. In other words the choice of the PP
following the verb represents an open choice. In the case of reverse shown above, the PP
with cessation of therapy could be replaced by any number of similar PPs realising
adverbial functions in the sentence, such as after surgery, before medication started etc.
This pattern therefore demonstrates an important theoretical problem alluded to in
chapter 3; namely that the notion of pattern is based on idiom principle co-selectional
restrictions in operation. Strictly speaking therefore V with n in the examples noted from
the corpus is not a pattern at all. However there is quite clearly a causative relationship
being encoded with the adverbial element marking some sort of circumstantial cause
which is related through the verb to an effect in terms of some sort of parametric change
in quality.
V for n
As with other patterns illustrated above, V for n is restricted to one lexical item in the
corpus, account. In the light of causation, the meaning of the verb might be glossed as
‘being responsible for’ an effect. Unlike the pattern V with n, this pattern would appear
to be a more typical example of a verb + preposition occurrence. The idiomaticity of the
example is shown in terms of the causative meaning being derived from the combination
of the verb and preposition. In common with many idiomatic expressions there is a
metaphorical extension of meaning from the literal derivation from the semantic field of
finance in serving to provide a causal explanation or reason for a phenomenon.
Genital tract sepsis Puerperal infection and septic abortion together accounted for 34
deaths (28%) out of a total of 133 deaths, making the largest con
V n to n
In this pattern the verb is complemented by a direct object nominal (n) and a
prepositional phrase realising adjuncts, playing the role in SFG terms of
Participants or Circumstants. The role of the preposition to in these patterns is
of paramount importance, marking the direction of causation in the clause. In the pattern
example centred on contribute, little can also be seen as an adjunct. The causative sense of
give is restricted to the point of being ‘frozen’ in terms of the collocate rise as in the
example below. This restriction is not seen in the following prepositional phrase.
We attribute the effect of BMI on the OC levels to a dilution of the
initial concentration SRA229(3410)
However, the effect of blood pressure was small and should contribute
little to the POBF SRA25(2079)
They demonstrated that rotations within ±10° of the modeled
angles give rise to angle distortion less than ±0.6°. Thus, projection
errors were insignific SRA112(2874)
ds and a spectrum of coagulation proteins. We conclude that acute
VZV infection predisposes children to a brisk but nonspecific
immunologic response with multipl SRA231(3296)
Passive patterns
A total of 17 passive prepositional verb patterns were found to encode causation in the
corpus: be V-ed after n, be V-ed as n, be V-ed as cl, be V-ed between n and n, be V-ed
following n, be V-ed for by n, be V-ed for n, be V-ed from n, be V-ed in n, be V-ed
into n, be V-ed on n, be V-ed through n, be V-ed through v-ing, be V-ed to n, be V-ed
via n, and be V-ed with n.
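The inventory above lends itself to a simple surface-level search. The following is a minimal sketch in Python and not the procedure used in this thesis: the function name find_passive_patterns, the list of forms of be, and the restriction of V-ed to regular -ed participles are all illustrative assumptions. One optional adverb is allowed between be and the participle, and the clausal variant be V-ed as cl is only approximated by matching the conjunction as.

```python
import re

# Prepositions taken from the passive pattern inventory above.
# "for by" is listed before "for" so that the longer alternative is tried first.
PREPOSITIONS = ["after", "as", "between", "following", "for by", "for", "from",
                "in", "into", "on", "through", "to", "via", "with"]

# Assumed forms of "be" (illustrative, not exhaustive).
BE_FORMS = r"(?:be|is|are|was|were|been|being)"

# Allow flexible whitespace inside the two-word sequence "for by".
PREP_ALTERNATION = "|".join(p.replace(" ", r"\s+") for p in PREPOSITIONS)

PASSIVE_PATTERN = re.compile(
    rf"\b{BE_FORMS}\s+"          # a form of "be"
    r"(?:\w+\s+)?"               # optional adverb, e.g. "partly"
    r"(?P<verb>\w+ed)\s+"        # regular -ed participle only (assumption)
    rf"(?P<prep>{PREP_ALTERNATION})\b",
    re.IGNORECASE,
)


def find_passive_patterns(sentence):
    """Return (participle, preposition) pairs for each surface match."""
    return [
        (m.group("verb").lower(), " ".join(m.group("prep").lower().split()))
        for m in PASSIVE_PATTERN.finditer(sentence)
    ]
```

A search of this kind cannot separate idiom-principle co-selection from open-choice adverbials, and it misses irregular participles; each hit would still need to be inspected by hand.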
be V-ed after n
As the concentration increased to 3.2 ppm, SLMA has found to be reduced after 2-week exposure. SRA177(185)
The be V-ed after n pattern is another example of a verb/ prepositional co-occurrence
which represents an open choice rather than an idiom principle restriction. A preposition,
in this case after, similarly serves to introduce an adverbial or circumstantial element.
Here the sentence is configured as a short passive ie without an agentive phrase headed
by the preposition by. In the absence of such a phrase, the temporal adverbial element
after 2-week exposure is inferred as the cause. It would certainly seem to be the case,
judging by this and other examples that the boundary between true causation and inferred
cause and effect is somewhat fuzzy and that the inference is made by the adjacency of
clausal elements (in this case the verb and the PP as temporal adverbial).
be V-ed following n
It would seem logical to include the be V-ed following n pattern following directly on
from be V-ed after n in the sense that once again we have a causative inference which
can be drawn between the subject expression of IL-10 and the adverbial following antigen
challenge. Strictly speaking the adverbial element introduced by following is not a
prepositional phrase but a non-finite –ing participle clause. With this particular example,
the inference is temporal; the reader makes the sequential connection between the
diminishment as an effect being preceded by the cause.
In contrast, the expression of IL-10 appears to be diminished following antigen
challenge SRA336(2888)
be V-ed through n, be V-ed through v-ing
The essential semantic notion which these closely-related patterns realise is the mediation
of the causative relationship. As with the active pattern, the role of the preposition
through is critical in making a causative inference. The literal-spatial meaning of the
preposition in terms of movement or displacement in ‘going in one side and out of the
other’ is extended metaphorically to cover agency, means or fault in a causative
interpretation. In the case of mediate, the verb + preposition combination suggests the
passive, non-agentive role of the nominal in terms of acting as the channel or conduit for
the causative process. In other words verbs of mediation do not explicitly state the cause;
instead mediatory verbs state the means through which the causative process is realised
minus agency. In the other verbs noted in the pattern, the inference of causative agency is
stronger. In the case of alleviate in the example below, the causative inference is
strengthened by the connection made by the preposition between a problematic situation
(burden) being lightened or lessened through efforts.
This burden can only be alleviated through efforts to decrease the incidence of
stroke. SRA87(387)
Tucker et al., 1999; Jernvall and Thesleff, 2000). The biological
effects of FGFs are mediated through high-affinity tyrosine-kinase
receptors. SRA98(391)
......romoter region of GST contains an AP-1 motif suggesting that
these genes may be regulated through the c-fos and c-jun gene
products (Ainbinder et al., 1997). SRA210(4294)
The verb optimize illustrates the -ing participle variation on the pattern in which instead
of a nominal group in the agentive phrase we find a nominalized verb (in the case below, maximizing).
The sensitivity of the method was mainly optimized through maximizing the amount
of injected Carnitine derivatives SRA520(1704)
be V-ed for by n
The idiomatic nature of the combination verb + preposition for has been described
previously. The close bond between verb and preposition is further illustrated when the
structure is passivized as in the example below. The verb and preposition thus behave as
a unified whole, with the agentive element realised by a PP headed by the preposition by.
This higher energy expenditure can be accounted for by the greater energy expended
by this group for protein synthesis SRA12(4473)
be V-ed via n
Here the causative inference is expressed using the preposition via. There are thus certain
semantic similarities with the preposition through. The meaning of transfer or movement
is shared in both of the example sentences below although the use of mediate suggests a
more passive role for angiotensin II Type I. Causative agency can be inferred more
strongly with regulate which is broadly synonymous with control as a verb determining
the directionality of cause and effect.
This effect of angiotensin II is mediated via angiotensin II Type I (AT1), rather than
AT2, receptors SRA210(2101)
…. whereas the induction is regulated via the ARE/EpRE SRA100(4450)
be V-ed to n
A number of verbs in this pattern, such as ascribe, assign, attribute, link and trace mark
loosely the attribution by a third party of an effect to a specific cause:
Often pulp responses have been ascribed to the activity of particular restorative
materials SRA453(2181)
And the mutations and variations were assigned to a single known haplotype
The contractile response can be partly attributed to an increase of Ca influx through
receptor-operating channel and…. SRA434(520)
….suggest that the risk of valvulopathy was linked to dose and /or duration…
be V-ed in n
The passive equivalent of V in n encodes significant realisations of cause and effect in
the corpus. Lexical items identified in this pattern include elevate, implicate, incriminate,
inhibit, invoke and involve all of which co-select the preposition in. The preposition is
critical in marking the inclusion or connection of the cause nominal preceding the verb
group with the effect nominal in sentence-final position.
SB overdosing has been incriminated in some outbreaks of visceral gout in poultry
We next examined whether the caspase-3 protease was involved in the NaF-induced
cell death response. SRA115(2399)
be V-ed with n
The co-selection of the preposition with and the passivized verb is a very productive
pattern. Some 19 lexical items including alleviate, amplify, assay, associate, attenuate,
improve and restore were recorded in the pattern database. Of these lexical items,
associate is perhaps the most important in this pattern, marking a form of hedging on the
part of the writer. On this basis it is possible to allude to the existence of a causal
relationship prior to the wider confirmation and acceptance of the claim by the discourse
community. Similarly correlate in the be V-ed with n pattern can mark a statistical
relationship from which causation may be inferred. With other lexical items, the PP
introduced by with frequently marks some intervention as a cause such as drug or
treatment administration.
Lower success rates may be associated with true ‘dystocia’…..
The obesity was positively correlated with increased leptin serum levels….
In addition, air removal techniques have been improved with CO2 use,….
Subsequent DVTs can be reduced with low-dose, subcutaneous heparin….
4.4.4 Clausal complementation patterns
In these patterns the active verb is complemented by a clausal element. The main types of
active patterns (V wh- and V that) and passive patterns (be V-ed as cl) are reviewed
below.
Active patterns
V wh-
The complementation by a wh- element again raises questions as to whether the
definition of pattern based on the idiom principle should be adhered to in this description
of the lexical grammar of causation. The corpus revealed three possible verbs illustrating
this pattern: arise, explain and predict, examples of which are given in the table below.
A fracture plane will arise when the stress at the tip of the pore reaches max
The lower age of the children in this study might partly explain why there was a lack
of effect of prophylactic penicillin V SRA648(1745)
and the amount of alveolar dead space can predict which patients will have
pulmonary dysfunction postoperatively. SRA12(347)
In all these verbs, the element immediately adjacent is a wh- subordinate clause. However
in arise, the wh-element is not strictly part of the pattern as it realises the role of an
adverbial ie it reflects an open choice from all possible clausal or prepositional phrasal
possibilities as circumstances. In the cases of explain and predict however there would
appear to be a much clearer case for co-selection as the wh-element realises a clausal
object and would therefore be regarded as being part of the valency of the verb. In
Hallidayian terms, the subject (in this case The lower age of the children in this study) is
the theme which is linked by the verb explain to a form of embedded question. Verbs like
explain and predict provide the means of embedding rhetorical questions into stylistically
more acceptable declaratives.
V that
The verbs implicate, mean and substantiate were all identified in the V that pattern. In
each case the subordinate clause is a direct object which is introduced by that as a
projecting conjunction (Halliday 1985a:288-291). In this particular pattern there would
appear to be some association between pattern and meaning, a relationship which is a
recurrent theme in Hunston and Francis (1999). Here we have come some distance from
the explicit expression of cause and effect through ‘external world’ causative verbs.
Instead relationships are presented as implications, inferences and consequences between
scientific arguments or claims put forward from which readers are invited to draw their
own causal conclusions.
In functional terms elaborated upon in the next chapter, this pattern coincides with mental
processes which project in Halliday’s (1985a: 288-291) terms a subordinate clause from a
main clause, marking a logical or hypotactic relationship between the two clauses in the
clause complex. In the example with implicate below, the choice of the verb might be
seen as an example of hedging. The authors are not committing themselves to an explicit
statement of a causal relationship between spatial confinement to the right of the verb as
a cause and limitation of the exposed nerve length as an effect. Similarly the verb mean
below relates a causing entity or event not necessarily to an effect but to a reinterpretation of its consequences. The final example substantiate is a more genuine
example of a projected clause, where the main projecting clause might be seen as a cause
and the identification improvement as an implied effect in the projected clause.
Spatial confinement of BAB to the epidural space implicates that the length of nerves
exposed to active BAB concentrations is limited to a few millimeters; the distance in
the epidural space. SRA534(576)
and advances in laparoscopic techniques have meant that the surgical management of
the groin hernia has undergone extensive re-evaluation. SRA310(314)
However, the present report substantiates that the new method better identifies type
II vWD plasma compared with the original version. SRA40(3990)
Passive patterns
be V-ed wh-
Only one passive pattern with a clausal complement was identified in the corpus, that of
be V-ed wh-. A total of 5 verbs were identified with a causative interpretation of this
pattern: aggravate, create, increase, substantiate and suppress. With the exception of
substantiate, the occurrence of the wh- element after the verb can be ascribed to open
choices; substantiate on the other hand is a more likely candidate for constraining the wh-element in accordance with the idiom principle. In the example with increase, the reader
is left to infer a possible causative relationship between increased risk as an effect and the
circumstances under which the condition of atrial thrombosis is present.
The risk of systemic emboli is increased when atrial thrombosis is present
Yet it has not been substantiated why exactly 3 years of age should be of particular
interest as to be o SRA584(2190)
be V-ed as cl
This pattern is not strictly speaking an idiom-principle co-selection of elements following
the verb in the sense that there is no restriction with the verb aggravate. Here a coordinating conjunction joins two main clause elements, the skeletal malocclusion may be
aggravated and the patient grows. While it is true that the conjunction as does not in
itself explicitly relate a causative relationship, it nevertheless enables the reader to make a
logical inference based on the adjacency of the two clauses. Once again the pattern
formalism has been extended towards an open-choice configuration which an automatic
parser based on the local grammar representation would need to cope with.
Without treatment, the skeletal malocclusion may be aggravated as the
patient grows SRA255(357).
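The recurring distinction between idiom-principle co-selection and open choice could be carried into a parser's pattern inventory as an explicit flag. The sketch below is purely illustrative: the class PatternEntry, its field names and the helper open_choice are hypothetical and are not part of the local grammar representation, and the entries simply restate classifications made in the discussion above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatternEntry:
    notation: str      # pattern notation as used in the text
    verb: str          # example lexical item discussed for the pattern
    co_selected: bool  # True: idiom-principle co-selection; False: open choice


# Classifications restated from the discussion of each pattern.
ENTRIES = [
    PatternEntry("V for n", "account", True),
    PatternEntry("V with n", "reverse", False),
    PatternEntry("be V-ed after n", "reduce", False),
    PatternEntry("V wh-", "arise", False),
    PatternEntry("V wh-", "explain", True),
    PatternEntry("be V-ed as cl", "aggravate", False),
]


def open_choice(entries):
    """Return the entries whose following element is an open choice."""
    return [e for e in entries if not e.co_selected]
```

Keeping the flag on the entry rather than on the pattern notation allows the same notation, such as V wh-, to be treated differently for different verbs.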
4.5 Delexical patterns
4.5.1 Overview
The phenomenon of delexicalism has received considerable attention in recent years in
corpus-based studies and was reviewed briefly in Allen (2002b:24). This section describes
in more detail the phenomenon of delexical or desemanticized verbs before looking at
two predominant delexical patterns underlying causation in the corpus. The term
‘delexical’ clearly points to a subset of verbs which are to a large extent empty of
meaning, functioning as little more than connectives. The content in semantic terms is
encoded in the accompanying noun. Sinclair (1991:112-113) refers to the phenomenon of
progressive delexicalization in which a very frequent word such as have is very much
diminished in the sense of its core intuitive meaning in a phrase such as have a laugh etc.
This view of semantic emptiness has however been called into question. Stein (1991:15)
has challenged the idea that verbs such as have, give, make and take in delexical patterns
have lost all their core meaning. Building on this analysis, Allan (1994; 1998)
demonstrates that each delexical verb brings an element of its intuitive meaning in these
structures. In describing distinct meanings for delexical verbs, Allan demonstrates that
there is a cline in terms of the degree to which verbs undergo full desemanticization. The
corpus data examined by Allan show for example that give, have and particularly make
undergo less semantic attrition than take which is almost entirely semantically empty
(Allan 1998:16). In this thesis the term delexical pattern will be applied to combinations
of delexical verb with co-occurring nominal group in which the intuitive meaning is not
fully realised.
In the preliminary analysis of causative lexis, delexical patterns although limited in
number figure prominently in the corpus on the basis of frequency measurements. Clearly
this is one particular area where there is an extension of meaning from one central lexical
item over the pattern. The area of delexical patterns is perhaps the clearest demonstration
of a semantic/meaning association to be seen in the corpus. The difficulty which delexical
verbs pose for the taxonomy being outlined here is whether to classify them under the
categories of verbs or nominals. Despite the degrees of lexical meaning loss in the
delexical verb, it might be possible to argue that the main verbs such as have and play
should be classified under the verbal category. Conversely it could also be argued that
much of the causative meaning as a nominalization of the cause or effect is bound up in
the nominal group such as effect in the case of have or role in the case of play. Indeed
these nominals provide the initial query entry in Wordsmith. In view of these perspectives
it was decided to put delexical patterns into a category of their own, distinct
from the over-arching verbal and nominal categories.
Delexical patterns in the corpus would appear to fall into two categories, based on the
delexical verbs have and play respectively. These verbs have very clearly defined
collocational profiles in terms of idiom-principle restrictions on the following nominal
groups. It is argued here that have and play retain some intuitive meaning serving to
signal the possession or performance of a cause or effect role.
4.5.2 Patterns with have + nominal group
have N v-ing against n
that the reactive astrocytes around the tumor have some role acting against the
damage caused by the tumor SRA359(1083)
In the example above the collocational relationship between the delexical verb have and
the nominal group can be seen as a metaphorical extension of have as a verb indicating
the possession of qualities or characteristics. In Hallidayian terms this relationship might
be seen as an example of carrier (reactive astrocytes ) and attribute (the post-modified
nominal headed by role). The above table illustrates the only example found in the corpus
of delexical have collocating with the nominal role; it is therefore difficult to substantiate
the claim that the head is always post-modified by an -ing participle clause as in acting.
have N in n
CTG and tretinoin were shown to have a similar efficacy in the reduction of
open and closed comedones SRA185(1963)
Psychiatric nursing has a unique role in the holistic approach to care SRA69 (276)
The pattern above is a variation on the delexical have + nominal group. Here the nominal
group is a non-specific abstract noun such as efficacy or role as in the examples above.
Owing to this lack of specificity, these nouns are post-modified with a prepositional
phrase headed by the preposition in circumscribing or limiting the scope of the causative
have N on n
The principal collocates of have are effect, influence and impact. The corpus
contained one example of this pattern with affect but this would appear to be either a
genuine error (possibly from a non-native speaker author) or an editorial oversight and was
therefore excluded from the exposition of the patterns. The preposition on would
appear to be co-selected by the preceding nominal serving as in the previous pattern
example to specify the circumstantial extent of the effect. There is a metaphorical
extension of meaning from the preposition on which may be glossed as marking a
connection with a physical surface to a non-literal interpretation.
ine whether the concentration of TF and TFPI in retroplacental haematoma has any
influence on the level of those substance in blood plasma. SRA422(2184)
It is also worth mentioning that there were significant collocates occurring as pre-modifying adjectives of the head noun. Examples included positive, significant, lower,
specific, observed occurring in the N-1 position.
have N through n
This pattern is a variation on the above delexical combination, this time with nominal
group post-modified by a PP headed by the preposition through. Here the preposition
serves to specify/deconstruct the relatively non-specific nominal effect in terms of the
means by which the causative process is achieved.
Ferritin has a very strong cytoprotective effect through its capacity to sequester iron
4.5.3 Patterns with play + nominal group
A second significant delexical pattern involves the use of the verb play collocating
strongly with nominals such as role and part. One question which might be asked is
whether there is a semantic difference between the delexical patterns have a role / play a
role. The choice of play for example might suggest a more active role for the causing
agent; in scientific terms such a wording might represent a stronger claim. Further textual
analysis would be needed in order to confirm a tentative hypothesis that have might
represent a form of hedging in presenting a weaker claim.
play N against n
The mother’s education played a protective role against infant death only in poor
households characterized by an average IMR of 58 per 1000 live births, compared
with an average IMR of 35 per 1000 live births in non-poor households
This variant of the pattern with the co-selection of the preposition against was only found
with the collocate role. The choice of the evaluative adjective protective as a pre-modifier
of role may well be constrained by the preposition against. However it is difficult to
confirm this observation as only one example of this pattern was found in a corpus of
nearly 2 million words. The single occurrence calls into question the desirability of
representing scarce patterns in the grammar.
play N by v-ing
….the non-catalytic domains of the protease may play an important role by
influencing substrate accessibility and alignment into the specificity poc
Similarly the combination of the delexical verb play and the nominal collocate define the
meaning group in the play N by v-ing pattern. In this particular example, the nominal
head role is again pre-modified with an evaluative adjective but there is an absence of
post-modification which would normally be needed to define the scope of a relatively
general noun (in this case role) more specifically. In the example above the reference of
role is made more specific by the circumstantial adverbial element headed by the
preposition by and complemented by an -ing participle. This element serves to encode the
target of the causative process through the verb influence as the instrument or means by
which the overall cause and effect is realised.
play N in n/ v-ing
The delexical pattern with a nominal group post-modified by a preposition phrase headed
by the preposition in is very significant in the encoding of cause and effect in this genre.
The corpus evidence reviewed substantiates Gledhill’s (2000:126) observation that the
preposition phrase frequently contains a nominalized empirical process or a disease
that other cytokines generated during inhalational antigen challenge play a greater
role during this period in mediating BHR. SRA503(8295)
]ature of the girls' relationship with her boyfriend plays a major part in the regular
use of hormonal contraceptive methods. SRA609(3789)
From the corpus evidence it would appear that the collocates of delexical play are
restricted to role and part. The first example above includes an example of a
discontinuous nominal group with the PP during this period intervening between the head
and the following PP in mediating BHR. In pattern grammar terms during this period
would be regarded as an open choice adverbial which would not be part of the co-selectional profile. Consequently the true pattern is completed with the co-selection of the
preposition in which in turn is complemented by either a nominal group or a
nominalization of a verb as the effect. As this pattern is very common in the corpus, only
approximate generalisations can be made about the complement of the preposition in the
specification of the effect. Examples from the corpus suggest that complements of the
preposition include general medical processes (eg pathophysiology,
etiopathogenesis etc), specific medical conditions (recurrent tonsillitis, obstetric
cholestasis), -ing participle nominalisations (modulating, directing, determining etc) of
causative verbs and other nominal groups marking the occurrence of certain entities
derived from verbs (formation, development, transport). These last named less specific
nouns are often post-modified with PPs headed by the preposition of.
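The treatment of the intervening adverbial in play N in n/v-ing can be illustrated with a small matcher. This is a minimal sketch under stated assumptions, not a component of the grammar: the names PLAY_IN, PlayMatch and match_play_in are hypothetical, the determiner is limited to a, an or the, at most one pre-modifier is allowed, and the first free-standing in after the nominal is taken to open the effect.

```python
import re
from typing import NamedTuple, Optional


class PlayMatch(NamedTuple):
    noun: str       # nominal collocate: "role" or "part"
    modifier: str   # pre-modifying adjective, "" if absent
    adverbial: str  # intervening open-choice material, "" if absent
    effect: str     # complement of the preposition "in"


PLAY_IN = re.compile(
    r"\bplay(?:s|ed|ing)?\s+"       # delexical verb
    r"(?:an?|the)\s+"               # determiner (assumed set)
    r"(?:(?P<modifier>\w+)\s+)?"    # optional single pre-modifier
    r"(?P<noun>role|part)\b"        # collocates attested in the corpus
    r"(?P<adverbial>.*?)"           # open-choice material, kept outside the pattern
    r"\bin\s+(?P<effect>.+)",       # co-selected preposition and its complement
    re.IGNORECASE,
)


def match_play_in(sentence: str) -> Optional[PlayMatch]:
    """Return the parts of the first play N in n/v-ing match, or None."""
    m = PLAY_IN.search(sentence)
    if m is None:
        return None
    return PlayMatch(
        noun=m.group("noun").lower(),
        modifier=(m.group("modifier") or "").lower(),
        adverbial=m.group("adverbial").strip(),
        effect=m.group("effect").strip().rstrip("."),
    )
```

Applied to the first example above, the sketch returns during this period as the adverbial and mediating BHR as the effect, which mirrors the analysis of the discontinuous nominal group.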
4.6 Nominal patterns
4.6.1 Overview
The role of nominalization and nominal groups in scientific writing has been reviewed
briefly elsewhere (Allen 2002b:20-24) and will now be described in more detail as a
prelude to the discussion of causation both within the nominal group and between
nominal groups. Nominalization in the scientific research article has received extensive
attention within Hallidayian linguistics particularly with regard to the concept of
grammatical metaphor (Halliday 1985a; 1988; Halliday and Matthiessen 1999; Banks
2001a, 2005). In an historical review, Halliday (1988) traces the role of nominal groups
as nominalizations of scientific processes back to the time of Newton. This nominal style
of writing emerged as Newton sought to develop scientific discourse in terms of
nominalized process verbs linked into logico-semantically related chains of argument.
As Banks (2005:348) describes, the Hallidayian notion of grammatical metaphor is
essential to an understanding of nominalization in scientific writing. In an unmarked ie
most typical sense, nouns encode palpable or abstract entities (marked in Hallidayian
terms by Thing). Similarly the unmarked usage of verbs is to mark processes.
Grammatical metaphor however involves the marked use of a noun to encode a process.
As Banks (2005:248) notes, semantic metaphor keeps in place the grammatical
configuration (cf rosy fingers of the dawn with rays of the sun) with a displacement of
meaning. The use of grammatical metaphor on the other hand retains the meaning but
uses an alternative grammatical form (cf induces labour and the induction of labour).
Banks’ research also raises the important issue of the overlap between grammatical
metaphor as realised through nominalization and the deverbal nouns derived from verbs.
An example is the verb bake, from which the deverbal nouns baking (the process),
baker (the agent of the process) and bakery (the place at which the process is carried out)
are derived. The term grammatical metaphor is reserved by Banks for deverbal encodings
of the process only.
The utility of nominalization in scientific discourse has been remarked upon by a number
of authors (Ventola 1996; Halliday 1988; Banks 2001a). The compaction of information
is greatly facilitated if a process is nominalized, permitting the use of modifiers and
quantifiers. Furthermore encoding a process as a noun opens up a range of possibilities
whereby the nominalized process can function syntactically in any of the ways typical of
nouns such as subjects, objects or complements within the clause (Banks 2005:350). On a
semantic level, Banks notes one further benefit of nominalization which he refers to as
‘reification’. By rendering a process as a nominal group, Banks argues that it is easier to
see the advancing research consensus as a series of forward-propagating nominalized steps.
In terms of the nominal expression of causation, the cause-effect linkage is encoded
either internally within the nominal group or externally using link verbs between nominal
groups. The procedure adopted in the taxonomy has been to firstly divide up internal
nominal causation into pre- and post-modification patterns respectively. Post-modifying
patterns can then be further sub-divided into two groups, one characterized by non-finite
clausal post modification patterns and the other by prepositional post-modification. The
former group includes patterns beginning with past-participles (pattern notation V-ed)
with a broadly passive meaning. A number of prepositions by, to and in are
complemented by a second nominal often with an agentive role. The second group
referred to above simply relates cause and effect nominal groups via a preposition.
Typical patterns here are N of n and N to n respectively although the choice of
preposition realises slightly different semantic relationships. The external category
includes not only patterns which link nominals via copular verbs but also patterns
involving existential there such as there v-link N of n and there v-link N between n and n.
4.6.2 Internal patterns within the nominal group
The tendency towards a high degree of nominalization within the genre of scientific
writing has previously been described in detail (Allen 2002b:21-24). It has become
apparent that in order to provide a more complete picture of causation within the RA,
some reference needs to be made to the encoding of cause and effect within the nominal
group as well as between nominal groups. The principal means by which this is achieved
is via pre- and post-modification using a non-finite –ed participle clause postmodifying
the head nominal.
Pre-modifying patterns
Causative patterns of pre-modification have been studied in the corpus with reference to
several general noun nominal heads which are either synonyms of cause (eg factor,
source, agent) or effect (outcome, impact) etc. In pattern grammar terms the structure of
such pre-modifying causation might be stated as n-V-ed n or n-V-ing n where V-ed
represents the corpus search node. The pre-modifier consists of a nominal group (usually
a pharmaceutical agent, chemical or disease etc) which is related via an adjective derived
from a non-finite clause either as an –ed participle or a –ing participle.
Verbal premodifier
neurofibromin, has been shown to regulate ras: the NF1 protein contains a GTPase
activating protein (GAP) related domain which functions as p21ras SRA487(73)
duce serum levels of the luteinizing hormone, prolactin, growth hormone, and
thyroid-stimulating hormone, and to increase corticotrophin SRA502(6649)
ses (Mos and Olivier, 1987). It could be suggested that the ultrasonic vocalisations-suppressing effects of diazepam, chlordiazepoxide, alprazolam and oxazepam are r
The examples above are intended to show some of these typical patterns identified in the
corpus. Patterns involving –ing participle adjectives predominate (-activating, -stimulating, -suppressing etc). These adjectives are derived from material process verbs
in Hallidayian terms. These nominal groups are prime expressions of the strong
nominalizing tendencies within the scientific RA discussed previously.
Essentially pre-modification enables the researcher to package congruent processes
metaphorically as nominal products. On this basis, it is possible to ‘unpack’ the nominal
group thyroid-stimulating hormone in the second example into the congruent
representation below:
stimulates (the) thyroid
Such a congruent representation enables the assignment of cause and effect to be made,
as in the table above where the hormone is represented as the causing agent. Assigning
cause and effect to the original nominal group thyroid-stimulating hormone provides the
following representation:
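As an illustrative aside, the unpacking of an n-V-ing n group into cause, process and effect can be sketched in a few lines of Python. This is a minimal sketch and not part of the thesis methodology: the function name and the regular expression are hypothetical, and the sketch assumes that the head noun is the causing agent, which holds for thyroid-stimulating hormone but not for every head noun (eg effects).

```python
import re

# Hypothetical pattern for "n-V-ing n" groups such as
# "thyroid-stimulating hormone": pre-modifying noun, -ing participle, head.
PREMOD = re.compile(r"^(?P<effect>[\w ]+?)-(?P<verb>\w+ing)\s+(?P<cause>.+)$")

def unpack_premodifier(nominal_group):
    """Return cause, process and effect for an n-V-ing n group, else None."""
    match = PREMOD.match(nominal_group.strip())
    if match is None:
        return None
    return {
        "cause": match.group("cause"),    # head noun, assumed to be the agent
        "process": match.group("verb"),   # participle kept as written
        "effect": match.group("effect"),  # pre-modifying noun, the affected entity
    }

print(unpack_premodifier("thyroid-stimulating hormone"))
```

The sketch keeps the participle unchanged rather than attempting to restore the finite verb (stimulates), since that step would need morphological knowledge not covered here.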
This type of functional mapping of causative patterns will be explored more fully in
chapter 5.
Post-modifying patterns
A number of prepositional variants on essentially the same pattern could be identified: n
V-ed by n, n V-ed from n, n V-ed in n, n V-ed to n, and n V-ed with n etc. Presentation
of this particular pattern illustrates another of the pattern formalism’s limitations taken up
in Allen (2002b:33). The main problem here is that the pattern presented begins with the
past participle verb which in itself is not a complete statement of causation; ideally there
is a need to include the nominal head within the pattern notation.
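The limitation just described can be made concrete with a small Python sketch. It is a hedged illustration rather than the search procedure used for the corpus: the names are hypothetical, and the expression assumes a regular –ed participle immediately followed by one of the five prepositions listed above. Because the nominal head is not part of the match, it has to be recovered from the left context.

```python
import re

# Prepositions listed in the text for the "n V-ed PREP n" variants.
PREPOSITIONS = ("by", "from", "in", "to", "with")

# Assumption: a regular "-ed" participle directly followed by a preposition.
# Irregular participles (eg "driven") would need a separate lexicon.
POSTMOD = re.compile(
    r"\b(?P<participle>\w+ed)\s+(?P<prep>" + "|".join(PREPOSITIONS) + r")\b"
)

def find_postmodifiers(sentence):
    """Return (left context, participle, preposition, right context) tuples."""
    hits = []
    for m in POSTMOD.finditer(sentence):
        left = sentence[:m.start()].strip()   # contains the nominal head
        right = sentence[m.end():].strip()    # contains the complement
        hits.append((left, m.group("participle"), m.group("prep"), right))
    return hits

example = ("In order to minimize the effects generated from this problem, "
           "we only measure episodes of severe head injury.")
for hit in find_postmodifiers(example):
    print(hit)
```

The match itself says nothing about which noun in the left context is the head, which is the gap in the notation noted above.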
n-V-ed by n
This pattern is very productive, with numerous examples of dynamic monotransitive
verbs occurring in the corpus.
Verbal post- Example
It has also been suggested that ICAM-1 accelerates the AY-cell destruction driven
by cytotoxic T cells SRA110(668)
In the above sentence, there are two separate examples of causation; the first example is
encoded at the clause level through the monotransitive verb accelerate between the two
nominal groups; the second is encoded at the phrase level within the nominal group the AY-cell destruction driven by cytotoxic T cells. The occurrence in the corpus of multiple
levels of cause and effect within the same sentence is by no means uncommon. This
device permits the author to state the existence of a cause and effect relationship between
the separate entities of ICAM-1 and AY-cell destruction. The grammatical possibilities
afforded by nominalization and post-modification permit the author to re-state the
acceleration relationship in terms of the cause, cytotoxic T cells. The non-finite clause can
be seen as a reduced relative clause (with the omission of the relative pronoun which and
the passive auxiliary is). The passive meaning is completed with the agentive PP headed
by the preposition by and complemented with the cause.
n-V-ed from n
Verbal post- Example
In order to minimize the effects generated from this problem, we only measure
episodes of severe head injury. SRA356(2424)
A similar passive meaning can be seen in the example above, the only variation resulting
from the preposition from, which links the effect with the cause. Instead of presenting the
causative relationship as a means or instrument introduced by the preposition by, the
preposition from marks a metaphorical displacement from a source.
n-V-ed in n
Verbal post- Example
estrogen receptors and causes phosphorylation of Shc, an adaptor protein usually
involved in growth factor signaling pathways. SRA336(220)
In the above example, there is a link via the periphrastic causative cause to a
nominalization of a biochemical process, phosphorylation of Shc. A second nominal
group is presented in apposition to Shc, which enables the author to re-cast the role of Shc
as a causative agent linked by the past participle involved and the co-selected preposition
in.
n-V-ed to n
Verbal post- Example
endothelium-dependent relaxation of vasodilatation to bradykinin via B2-receptors
linked to the formation of NO (Meyer and Meyer) ( Fig. 8). 6.3. SRA391(3704)
In this pattern the nominal head B2-receptors, as a non-explicit cause, is related to the
effect formation of NO via the post-modifying non-finite clause, realized by the –ed
participle and the preposition to.
n-V-ed with n
Verbal post- Example
complicated enta is such that intertwin transfusion of blood is not a normal event in pregnancies
complicated with discordant growth, and fetal growth restriction appears to be inde
Past-participle post-modification complemented with a PP headed by with is a very
common causative pattern within the clause. Again it can be questioned whether there is
a true co-selection (and therefore a pattern in the CGP sense) of the preposition with
following the –ed participle or alternatively if the PP is headed by a preposition marking
an open choice.
4.6.3 External patterns
The category of v-link patterns refers to nominal groups which are linked in a causal-inferential relationship by a link or copular verb such as be or more rarely become. V-link
patterns will be illustrated with reference to nominal groups headed by factor. In the
examples below, it is shown how concordance lines based on factor can illustrate
variations on v-link patterns encoding cause and effect.
v-link patterns
v-link N in n
ys, indicating that the disappearance of the PF4-heparin complex is an important
factor in the subsidence of the disease, and not the disappearance of the antibody
The choice of the denotation N in this pattern reflects the fact that the noun provides the
point of entry into the concordance lines as the node in WordSmith. Quite clearly the link
verb be relates two nominal groups in a causative relationship with the nominal group
headed by disappearance as the causal agent of the effect nominal headed by subsidence.
This type of pattern is extremely significant in the corpus, representing in systemic-functional terms a relational process (see chapter 5 for a fuller discussion of this point).
The head of the second nominal, in this case factor, is a general noun which is equated as
a broad synonym with the causing nominal through the link verb. Crucially the effect is
related semantically as an inclusion through the PP in the subsidence of the disease as
post-modification of the head factor.
v-link N for n
Dysfunction of the eustachian tube (ET) is an important etiological factor for
various middle ear diseases such as otitis media with effusion. SRA579(293
This variation on the pattern simply substitutes the preposition for in relating the effect
(in this case various middle ear diseases). It should also be noted that there is a strong
tendency to pre-modify the general head noun with evaluative adjectives (such as
important above) which provide an additional assessment as to the epistemic status
or importance of the causative relationship thus postulated.
v-link N of n
been published, almost all from countries where homosexuality is the main risk
factor of HIV infection and without correlation with CD4 cell counts. No STD data
is SRA184(112)
lear factor-κB (NF-κB) could explain the previous observations. NF-κB is a
transcription factor pivotal for expression of genes encoding inflammatory cytokines
such as IL- SRA243(98)
The first example above is essentially a variation on the previously-described v-link patterns
with the relationship of general noun causal agent to the effect being made this time
through the preposition of. A variation on this particular pattern is included in the second
example in the table above, in which factor is post-modified by the adjective pivotal, again
marking the author’s assessment of the importance of the causative link being made. The
PP headed by for might therefore be considered part of the complement pattern for the
adjective.
v-link N that
Patient population is another factor that likely contributes to the low morbidity in
this group of patients SRA647(2373)
This variation on the v-link pattern relates a cause via the copular link verb to the
nominal head which is then post-modified by an identifying relative clause. This clause
is introduced in this case by the relative pronoun that and includes a finite verb, contributes,
which serves to define the direction of causation.
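The prepositional v-link N variants (in, for, of) can be illustrated with a short Python sketch based on the node factor. This is offered only as a rough approximation under stated assumptions: the function name and the expression are hypothetical, the cause is taken to be everything before the link verb, and the effect everything after the preposition, whereas reliable nominal group boundaries would require a parser.

```python
import re

# Assumption: cause = text before the link verb, effect = complement of the
# preposition. Pre-modifiers of "factor" are kept as evaluative material.
V_LINK_FACTOR = re.compile(
    r"(?P<cause>.+?)\s+(?:is|are|was|were)\s+"
    r"(?:an?|the|another)\s+(?P<modifiers>(?:[\w-]+\s+)*?)factor\s+"
    r"(?P<prep>in|for|of)\s+(?P<effect>.+)",
    re.IGNORECASE,
)

def parse_factor_clause(clause):
    """Map a 'v-link N in/for/of n' clause onto cause and effect, else None."""
    m = V_LINK_FACTOR.search(clause)
    if m is None:
        return None
    return {
        "cause": m.group("cause").strip(),
        "evaluation": m.group("modifiers").split(),
        "preposition": m.group("prep").lower(),
        "effect": m.group("effect").strip(" ."),
    }

print(parse_factor_clause(
    "Dysfunction of the eustachian tube (ET) is an important etiological "
    "factor for various middle ear diseases such as otitis media with effusion."
))
```

The v-link N that variant is deliberately not matched, since there the direction of causation is carried by the finite verb of the relative clause rather than by a preposition.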
Patterns with existential there
A very significant pattern occurs in the corpus with existential there as a ‘dummy’ or
extraposed subject introducing a potential causative relationship between two nominal
groups. This pattern seems to be particularly prevalent in the statement of statistical
relationships; significant collocates include relationship and correlation serving to
nominalize the relationship. It is agreed however that a correlation between two factors
does not necessarily imply a genuinely causative relationship. One important
consequence of this position is that it might not be possible to see the directionality of
cause and effect in a correlation. A relationship presented as a correlation might therefore
be seen as an example of hedging, where an author identifies the existence of a
relationship but lacks the evidential data to postulate causal directionality. However as
has been pointed out previously it can be difficult to disambiguate a possible statistical
relationship from an inferred causative relationship.
there v-link N between n and n
Most authors agree that there usually is a correlation between the level of umbilical
blood saturation and fetal wellbeing SRA03(1741)
The causative inference in this case is strengthened by interpreting the first
nominalization the level of umbilical blood saturation as a diagnostic measurement and
the second nominal group fetal wellbeing as a beneficial effect of a screening programme.
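The point that a correlation names two related terms without fixing the direction of causation can be sketched as follows. The code is a hedged illustration, not a description of any software used in the study: the names are hypothetical, only the collocates correlation and relationship are covered, and the split on and will fail where the first nominal group itself contains a coordination.

```python
import re

# "there v-link N between n and n", with an optional adverb before the verb.
EXISTENTIAL = re.compile(
    r"\bthere\s+(?:\w+\s+)?(?:is|are|was|were)\s+(?:an?\s+)?(?:[\w-]+\s+)*?"
    r"(?P<noun>correlation|relationship)\s+between\s+"
    r"(?P<first>.+?)\s+and\s+(?P<second>.+)",
    re.IGNORECASE,
)

def parse_existential(sentence):
    """Return the two related terms; the direction is left unspecified."""
    m = EXISTENTIAL.search(sentence)
    if m is None:
        return None
    return {
        "relation": m.group("noun").lower(),
        "terms": (m.group("first").strip(), m.group("second").strip(" .")),
        "direction": None,  # cannot be read off the pattern alone
    }

print(parse_existential(
    "Most authors agree that there usually is a correlation between "
    "the level of umbilical blood saturation and fetal wellbeing"
))
```

Leaving the direction empty mirrors the hedging function described above: any causal reading has to be supplied from domain knowledge or from the surrounding co-text.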
there v-link N with n
ls of CRP are not only often increased in BP patients [3] but there was a significant
correlation with the serum IL-6 levels in this study SRA473(1241)
Another existential pattern established from corpus study involves there in subject
position together with a nominal group restricted to correlation. At sentence level
however the relationship is identified with only one of the factors (linked through the PP
with the serum IL-6 levels); the identification of the other half of the causal relationship
involves looking back through the preceding discourse.
Other patterns
The case of the idiomatic expression under the influence of can also be considered in this
section as a variation on the general pattern of v-link and nominalization. Here there is a
degree of co-selection involving the preposition under plus the nominal influence which
is then post-modified with a PP headed by of. Although it might be expected that such an
idiomatic expression would be ‘frozen’, evidence from the corpus suggests
that there is some degree of variation (in the example below with the addition of the pre-modifying adjective main):
e rather than respiratory modulation, and effects LF and HF that are under the main
influence of vagal nerve tone, explaining the decrease without significance in
LF/HF SRA148(2556)
4.7 Adjectival patterns
4.7.1 Overview
The role of adjectives in the encoding of cause and effect relationships has been one of
the major findings from the corpus. Individual lexical items involved in adjectival
patterns include consistent, predictive, potent, effective, critical, responsible, significant,
supportive, influential, important, vital, indispensable, indicative, capable, attributable,
detrimental, due, resistant, susceptible, sensitive, sufficient, and dependent each
occurring with separate clausal or prepositional complementation patterns. Generally
speaking it might be said that adjectives provide an additional evaluative assessment as to
the strength or significance of the causative relationship being identified in scientific
terms. There are however a number of ‘fuzzy’ semantic boundaries in the meaning of the
adjectives concerned which are outlined below.
4.7.2 Meaning groups
The first meaning group identified might be labelled ‘consistent with’. In other words an
effect X is consistent or in accordance with an interpretation of Y as a causative agent.
This interpretation brings into focus the difficulties in defining causation discussed in
chapter 3. With the adjective consistent for example, the majority of the concordance
lines show significant collocation with observations and findings.
his description was consistent with a full thickness macular hole.
Our findings are consistent with those of Bachrach et al, who showed
The observations are consistent with the dense labelling by CB1 receptor
In the overwhelming majority of cases, the v-link + ADJ combination suggests
agreement between the products of the research process (eg description, findings,
observations etc) and the empirical data of other authors working in the same area rather
than implying a genuinely causal relationship. A more likely candidate for causation is
the sentence below, with the nominal complement of the PP increased bone turnover
interpreted as a cause in agreement with the effect osteoporosis in females.
Osteoporosis in females however was more consistent with increased bone turnover
One practical consequence is that automatic disambiguation of genuine causative
relationships might be problematic as so much is dependent on the ‘semantic’ intuitions
of the linguist and/or the role of extralinguistic knowledge of the biomedical domain.
The second of these meaning groups could provisionally be labelled ‘dependent’. These
adjectives are relatively empty of lexical meaning and serve primarily relational purposes
rather than marking an evaluative function. Further examples include due and
attributable marking the source and directionality of the causative relationship.
A third possible strand of meaning might be identified under the general heading of
‘significance’. This group comprises evaluative adjectives serving to highlight the
importance or statistical significance of a hypothesized causal relationship in the light of
the research consensus. Included in this group are a number of synonyms of important
including vital, indispensable and critical etc. In the example below, collagen is
identified as a causative agent in the production of the effects wound healing and return
of tissue integrity respectively:
Collagen is important in all phases of wound healing and is critical for return of
tissue integrity SRA318 (2448)
Further adjectival groups relate cause and effects with respect to the perceived end
product of biomedical intervention: disease cure and recovery of patients. The choice of
evaluative adjective is thus a reflection of a judgement on the relative success or failure
of a particular medical procedure. Thus the evaluative adjectives effective and detrimental
can be seen as representing opposite ends of the spectrum in marking positive and
negative evaluations of the causative process respectively.
Adjectives such as predictive and indicative constitute a different meaning group in the
sense that X is predictive of Y and X is indicative of Y. In other words one of the nominal
groups is seen as providing evidence for the other. Again this relationship does not
constitute factive causation per se; instead the cause and effect relationship can be
inferred on the basis of extralinguistic knowledge. On this basis it is possible to see X as
an effect which provides evidence for the interpretation of Y as a causative agent or
Finally it is possible to see a group of adjectives such as susceptible, sensitive and
resistant as being descriptive of a passive role in the causative process:
Cutaneous lymphoma is susceptible to immunointerventions. SRA180(3420)
In the example above, the sentence could be paraphrased using a passive verb, as in
Cutaneous lymphoma can be affected by immunointerventions. The role of adjectives
such as susceptible in describing potential reaction is therefore crucial in the
interpretation of immunointerventions as a cause. However in sentences such as the
above, there is no explicit mention of an effect. Presumably the effect would be
recoverable from the co-text.
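The meaning groups outlined in this section can be summarized as a simple lookup table. The Python sketch below is only an illustrative summary: the first three group labels are those used in the text, while the labels outcome evaluation, evidence and passive role are hypothetical names introduced here for the unlabelled groups. Membership is restricted to the adjectives explicitly assigned to a group above.

```python
# First three labels come from the text; the last three are hypothetical.
MEANING_GROUPS = {
    "consistent with": {"consistent"},
    "dependent": {"dependent", "due", "attributable"},
    "significance": {"important", "vital", "indispensable", "critical"},
    "outcome evaluation": {"effective", "detrimental"},
    "evidence": {"predictive", "indicative"},
    "passive role": {"susceptible", "sensitive", "resistant"},
}

# Groups described in the text as inferential rather than factive causation.
INFERENTIAL = {"consistent with", "evidence"}

def classify_adjective(adjective):
    """Return (meaning group, is_inferential) or None if unlisted."""
    word = adjective.lower()
    for label, members in MEANING_GROUPS.items():
        if word in members:
            return label, label in INFERENTIAL
    return None

print(classify_adjective("indicative"))
print(classify_adjective("due"))
```

A table of this kind does not resolve the ‘fuzzy’ boundaries noted earlier; an adjective such as critical could reasonably be argued into more than one group.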
In the following sections, the main pattern sub-variations are identified; these involve
differences in clausal and prepositional complementation patterns of the adjective
v-link ADJ to-inf
The sentence below provides an example of a link verb connecting a cause, accurate
measurements, with the effect, which might be nominalized as the diagnosis of critically
ill patients. The causative meaning can be seen if the sentence is paraphrased as accurate
measurements of blood glucose diagnose critically ill patients. Here there is co-selection
of the non-finite clause as a complement of the adjective essential.
Accurate measurements of blood glucose are essential to diagnose critically ill
patients SRA38(2012)
v-link ADJ ADV to-inf
This pattern is essentially a variation on the v-link ADJ to-inf pattern above with the
addition of the adverb enough. The adjective group is therefore headed by potent, with
enough as a post-modifying adverb adding extra emphasis, complemented in turn by a
non-finite clause. In this sense the object of the non-finite clause, antigenic variation, is
the potential effect (as marked by the modal auxiliary may) with the nominal group
headed by vaccines as the cause.
The vaccines in these countries may be potent enough to offset antigenic variation,
or pertussis vaccines SRA184(2864)
There would seem to be some restriction on the adjectives forming this pattern. Examples
include evaluations of strength (powerful, severe etc), centrality, reliability or
typicality (characteristic), and descriptions of the extent, dimensions or proximity of a
causative process (extensive, close, deep, large, short etc).
v-link ADJ against n
From the corpus it can be seen that there is a very significant collocation between the
adjective effective and the preposition against. The example below is typical of the
literature on drug trials in which the efficacy of one particular product is being evaluated
with respect to a specific medical condition. Consequently there is a case to be made for a
negative semantic prosody extending from the preposition against to the complementing
nominal group. Examples of nominal groups in this pattern confirm this suspicion,
including extraocular infections, pain and destructive micro-organisms (Streptococcus
sobrinus etc).
Atovaquone has been shown to be effective against the encysted stage of the
organism in animal models SRA325(2980)
v-link ADJ as n
In this pattern, the evaluative adjective is complemented by a PP headed by the
preposition as. This preposition therefore serves to re-cast the original agentive nominal
These non-specific agents in a more specific biomedical role (in this case giving the
nominal further specification as a therapeutic effect in the causative process). The re-cast
nominal is further post-modified with a PP headed by the preposition for which relates
the therapy to the targeted medical condition melanoma:
These nonspecific agents have not been proven to be effective as adjuvant therapy
for melanoma SRA495(2034)
A further example of this pattern is included below. Here the PP headed by as forms the
complement of the adjective effective. The nominal group appetite suppressants may be
seen as a metaphorical equivalent of a more congruent representation of the causative
process as ‘cannabinoid receptor antagonists suppress appetite’. On this basis the
appetite suppressant nominal group is encoded as the effect:
It is also perceivable that cannabinoid receptor antagonists may be proven effective
as appetite suppressants. A study showing that SR141716A, a selective C
v-link ADJ for n
The co-selection of the preposition for with the adjective is a productive pattern for the
expression of cause and effect in the biomedical RA genre. The pattern database contains
a total of ten lexical items which encode causation through this pattern:
In a survey from mainland Greece HCV infection was responsible for 25% of
chronic liver disease patients SRA288 (2818)
Other examples of lexical items in this pattern include beneficial, critical, essential,
important, predictive, responsible, significant and supportive.
v-link ADJ in n
There would not appear to be any significant lexical restrictions on the adjectives forming
part of this pattern. The choice can therefore be for any subject predicative adjective
describing the importance or beneficial/detrimental influence of the agentive subject. The
preposition in serves to relate via inclusion the extent or limit of the causative process. In
the example below there are two recursive PPs separately defining the scope of the effect.
and that these loci are influential in the Japanese population in determining disease
susceptibility. SRA249(2503)
v-link ADJ in v-ing
The v-link ADJ in v-ing pattern is simply a variation on the pattern immediately above
with the -ing participle as a nominalization of a verb in place of the nominal group.
Several factors may be influential in increasing the prevalence of ocular toxicity
v-link ADJ of n
In this pattern, there is co-selection of the preposition of serving to define the extent of
effect. Causation is inferential in this example; the adjective indicative merely suggests a
cause and effect relationship without committing the author to the making of a causative
statement. Along with the adjective predictive, this pattern appears to define a meaning group
which can be glossed as ‘showing or revealing’ evidence for either a causative
explanation or the identification of a causative source.
trastructurally, the smooth muscle debris in the fibrosed medial lamellar units [1] is
indicative of muscle degeneration rather than atrophy SRA575(1353)
v-link ADJ on / upon n
A causative interpretation for this pattern appears to be restricted in the corpus to the
lexical items based, dependent and effective respectively.
These findings do not support the concept that a leucine-restricted diet is constantly
effective on plasma glucose levels in HHS (9, 20, 21).SRA(2387)
This latter protective mechanism is dependent upon the release of endogenous
CGRP and tachykinins from afferent neurones and reduced expression of T helper
Type 1-type mediators, such as IL-2, interferon-γ, and tumour necrosis factor-α, and
adhesion molecules such as CD44 SRA503(6741)
v-link ADJ to n
The complementation of the adjective with a PP headed by to is a significant lexical
pattern for the expression of cause and effect. The pattern is illustrated below with
reference to due serving to relate an effect (in this case Down’s syndrome) to a cause.
In approximately 90% of cases, Down's syndrome is due to the non-disjunction of
chromosome 21, SRA286(356)
Other lexical items occurring in this pattern include amenable, attributable, detrimental,
resistant, responsive, subject, susceptible, vital and sensitive. Some of these adjectives
describe the non-agentive role of the subject as in the example below:
Some of these defects that are not picked up radiographically may be amenable to
clinical detection. SRA613(1794)
In the example above, the nominal group these defects that are not picked up
radiographically is presented as passively undergoing or being subject to clinical
detection. In other words clinical detection could be re-cast as being the causal agent as in
Clinical detection picks up these defects. The passive role which the adjective selects is
further highlighted in the example below:
This raised the question whether apafant is subject to active drug efflux
Other adjectives in this pattern allow for third party attribution of cause and effect:
id not observe any differences with respect to saturation values that may have been
attributable to MSAF. SRA423(1821)
and stents may be detrimental to mucosal wound healing SRA407(3228)
As can be seen from the second example above, negative effects can be encoded through
the adjective detrimental in this pattern.
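The way in which the adjective determines the role of the subject in the v-link ADJ to n pattern can be sketched in Python. This is a minimal sketch under stated assumptions, not the parser discussed in later chapters: the names are hypothetical, the role table covers only adjectives whose roles are discussed in the examples above, and the optional modal is recorded as a hedge.

```python
import re

# Role of the grammatical subject for adjectives discussed in the examples.
SUBJECT_ROLE = {
    "due": "effect", "attributable": "effect",      # complement is the cause
    "detrimental": "cause",                         # complement is the effect
    "amenable": "affected", "subject": "affected",  # complement is the cause
    "susceptible": "affected",
}

ADJ_TO = re.compile(
    r"(?P<subject>.+?)\s+(?:(?P<hedge>may|might|could)\s+)?"
    r"(?:is|are|was|were|be)\s+"
    r"(?P<adj>" + "|".join(SUBJECT_ROLE) + r")\s+to\s+(?P<complement>.+)",
    re.IGNORECASE,
)

def parse_adj_to(clause):
    """Assign roles to subject and complement of 'v-link ADJ to n'."""
    m = ADJ_TO.search(clause)
    if m is None:
        return None
    role = SUBJECT_ROLE[m.group("adj").lower()]
    other = "effect" if role == "cause" else "cause"
    return {
        role: m.group("subject").strip(),
        other: m.group("complement").strip(" .,"),
        "hedge": m.group("hedge"),  # None when no modal auxiliary is present
    }

print(parse_adj_to("Down's syndrome is due to the non-disjunction of chromosome 21"))
print(parse_adj_to("stents may be detrimental to mucosal wound healing"))
```

The subject is taken to be all the text before the link verb, so fronted adjuncts such as In approximately 90% of cases would need to be stripped beforehand.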
4.8. Summary
This chapter has explored the representation of lexical grammatical patterns through
which causal relationships are encoded in the biomedical RA. The principal lexical items
have been identified manually (with the exclusion of discourse markers and textual signals
of cause and effect which are difficult to encompass in this type of grammatical
representation). A taxonomy has been presented which identifies verbal, nominal,
delexical and adjectival patterns of cause and effect in the corpus. Finally within each of
these overarching categories individual patterns were presented together with comments
on specific collocational restrictions. In the next chapter the analysis will be developed
into a local grammar by mapping these patterns onto functional categories specific to
the biomedical domain.
5. From pattern to function: specifying the local grammar
5.1 Introduction
The previous chapter has described the principal lexical grammatical patterns of
causation as identified in the HBC corpus, as well as pointing towards the limitations of
using this formalism as the basis for the database storage of these patterns. However this
statement of causation in lexical co-occurrence terms does not in itself constitute a local
grammar which is primarily functional in orientation. In this chapter, the development
from pattern grammar database to fully-fledged local grammar is considered in more
detail. This process involves the conceptualization of causation within the sub-genre in
terms of a closed set of functional/semantic categories and a description of their linear
relationships specific to the domain of biomedical informatics. The identified categories
are then mapped onto the underlying lexical patterns to form the basis for a semantically-motivated parse of the causation sublanguage.
The chapter begins by describing the theoretical perspectives and general principles
underlying the representation of the grammar. Naturally, such a discussion invites
comparison between the present project and the previously compiled local grammars of
definition and evaluation. It is important to bear in mind the fact that while each local
grammar has its own unique set of categories particular to its communicative function,
there is a partial degree of overlap in terms of the functional categories and their interrelationships.
Following on from the commitment to the primacy of meaning outlined in chapter 1 the
general language framework which is of immediate relevance here is that of systemic-functional linguistics (Halliday 1985a). As described in chapter 1, the local grammar
should primarily be seen as a specialized systemic-functional grammar with NLP
applications. Section 5.2 therefore explains the relevance of the Hallidayian concepts of
function, system and rank which the local grammar embodies from this general language
grammatical framework. As will be explained in this section however the perspective on
system inherent in the local grammar departs somewhat from Halliday’s use of the term.
The grammar is then outlined as a set of overarching cause and effect functional systems
realized along the syntagm. The commentary strives to emphasize the similarities and
divergences between the local grammar and the general language SFG framework. It will
also be shown that certain systems, such as the element labelled qualifier, reflect
more specifically the nominalization of cause and effect which as described in Allen
(2002b:21-25) is highly prevalent in scientific research genres. Employing a systemic-functional framework, the local grammar conceptualizes each of these functional roles as
a system consisting of a set of paradigmatic choices from a closed set of semantic
categories. A taxonomic listing of these functional / semantic elements specific to the
biomedical domain is then provided which forms the basis for information extraction and
text summary applications to be outlined in chapters 6 and 7.
5.2 Theoretical background
5.2.1 Overview
The representation of the grammar in this thesis borrows considerably from the concepts
of function and system which are central to systemic-functional grammar (henceforth
SFG). It is important to emphasize however that such a commitment to these general
principles does not imply that the local grammar is merely a systemic-functional
grammar on a miniature scale. Significant differences are reflected in the restricted scope
of the local grammar in terms of the sublanguage of causation which it is designed to
work on and the specialized nature of the categories it embodies. Furthermore there are a
number of categories which are specific to the local grammar and not contained in the
general language SFG framework.
A functional perspective for local grammars stands in direct contrast to the parsing
principles outlined in Barnbrook (2001:64) for formal grammars. Barnbrook makes the
point that the basis for the definition grammar should be considered in relation to the
practical utilities of dictionary information formatting rather than attempting to embody
the formal rigour of a grammar based on introspection.
In contrast to formal grammars, the local grammars / parsers for definition and causation
are conceived from the beginning to work on natural languages and by implication be
potentially capable of handling semi-grammatical structures such as ellipted or
discontinuous elements. Parsing operations based on the grammar will be discussed more
fully in chapter 6.
5.2.2 Defining the scope of the grammar
It is necessary firstly to define more closely the overall focus of this local grammar as the
encoding of causal relationships (a) within the nominal group and (b) between nominal
groups within the clause. Discourse level cause and effect relationships between sections
of text encoded by metadiscoursal devices as described in Allen (2002:35) will not be
considered in this grammar. This selective focus is justified on the basis of the use of the
grammar in information extraction applications. In these applications the clause (rather
than textual segments) defines the maximal extent of the extracted text segment centred
on the concordance line node as described in chapter 4.
5.2.3 Function and meaning
It is argued here that if local grammars and their derived parsing software packages are to
extract semantic information from natural language text, a linguistic perspective needs to
be adopted from the outset which prioritizes meaning over form. SFG has not to date
been formalized for computational applications but has been used as the basis for the
manual parsing of a small corpus of child language, the Polytechnic of Wales Corpus
(Souter 1990). Although Barnbrook does not mention a Hallidayian allegiance in his
definition grammar, a commitment to the analysis of functional elements working to
achieve the communicative purpose of definition is implicit in his dictionary
grammar/parser. Similarly the functional roles of cause and effect are dependent on
a semantic interpretation of the clause as a causative.
A central element of a functional grammar is the analysis of meaning in the clause in
terms of the three distinct metafunctions: experiential, interpersonal and textual (Halliday
1985a:33-36). While it is possible to describe functions within causative clauses from all
three perspectives, the nature of the genre of scientific writing and potential applications
of the grammar through automatic parsing combine to justify a more selective focus on
the experiential metafunction. Causation is perceived predominantly either as a dynamic
process involving transitive verbs or as a stative relationship between nominalized causes
and effects. In other words a causal relation expresses the agent-initiated production of an
entity or unfolding of a process, event or existential relationship. On this basis the clause
is seen purely in terms of its propositional content.
The interpersonal metafunction on the other hand focuses on the clause as a medium of
exchange where the linguistic ‘commodity’ in Halliday’s terms is either information or
‘goods and services’ (offers or commands to perform a certain action) passed back and
forth between speaker / listener, writer / reader in a discourse. In the biomedical research
article the vast majority of clauses are encoded as declaratives in terms of information
exchange. The analysis of such clauses in interpersonal terms would therefore appear to
be redundant in terms of shedding light on the propositional structure. The one possible
exception to this statement is the analysis described in the previous chapter of any
hedging elements such as modal auxiliary verbs or modal adjunct adverbs. These devices
signal the degree of commitment to the truth of, or confidence in, the propositional
content of scientific statements. Modalizing elements need therefore to be
fully represented in the grammar. In this respect the local grammar departs significantly
from the general language SFG framework in analyzing an element of interpersonal
meaning as a hedge category within the primarily ideational representation of the
remainder of the clause complex. The incorporation of these elements as hedges will be
explored in greater detail in section
The final element of meaning which is considered within the general framework of
Hallidayian grammar is that of the textual metafunction (Halliday ibid.:37-65). The
retrieval of concordance lines from the corpus focuses attention inevitably on isolated
sentences centred on the search word node. However these sentences are each sampled
from whole text contexts in the form of the original RAs. It is possible therefore to
consider meaning choices from the perspective of the relationship between each sentence
and the wider co-text. Specific subject encodings realizing theme choices in the theme /
rheme are undoubtedly part of the meaning of the clause as they relate the clausal unit to
the wider message. However these choices are conditioned by textuality and not by the
logico-semantic relation of causation. Theme / rheme choices and their impact on the
lexicogrammatical representation of causation will not therefore be taken up in the
5.2.4 Paradigmatic relations in the grammar: system and choice
The concept of system embodied in the local grammar needs to be clarified as there is
some divergence from the SFG use of the term. The systems and choices they contain
provide the framework for the packaging of information constituting the product of a
future automatic parser. The notion of system as a formalism common to functional
grammars will now be developed in more detail; the functional / semantic labels will not
be explained at this point; see section 5.2 below for a full explanation and
exemplification of these terms.
As Eggins (1993:205) explains, the Hallidayian conceptualization of the clause is
essentially as a linear syntagmatic relation between any number of semiotic systems each
of which realizes an individual constituent. In semiotic terms, a system comprises a
discrete number of signs in opposition to one another each of which realizes a separate
choice. An initial choice constitutes therefore an entry condition or point of departure in
reading the clause from left to right along the syntagm.
The complete realization of meaning in SFG terms comprises a network of
paradigmatically related choices as systems with systems related syntagmatically as
structures. One example of an SFG system is that of voice which consists of a
paradigmatic choice between middle and effective options. The difference between these
options hinges on the presence or absence of an agent as in Halliday’s (ibid.:169)
examples the glass broke (absence of agent: middle) or the cat broke the glass (agent
present: effective). The effective system itself contains two more delicate choices
between active in which the agent corresponds in traditional terms with the subject of the
clause (the cat broke the glass) or passive in which the affected medium is the subject
(the glass was broken by the cat). Similarly the system of polarity can be thought of as a
closed system containing the choices of positive and negative.
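By way of illustration only, the closed SFG systems just described can be sketched as a small nested data structure in which each option either terminates or opens a more delicate system. The names VOICE, POLARITY and paths are assumptions introduced for this sketch and form no part of the SFG formalism itself.

```python
# Illustrative sketch: a closed system as a nested mapping.
# An option maps to None (terminal) or to a more delicate sub-system.
VOICE = {
    "middle": None,            # no agent: "the glass broke"
    "effective": {             # agent present
        "active": None,        # "the cat broke the glass"
        "passive": None,       # "the glass was broken by the cat"
    },
}

POLARITY = {"positive": None, "negative": None}


def paths(system, prefix=()):
    """Enumerate every complete selection path through a system."""
    for option, subsystem in system.items():
        chosen = prefix + (option,)
        if subsystem is None:
            yield chosen
        else:
            yield from paths(subsystem, chosen)


for path in paths(VOICE):
    print(" > ".join(path))
```

Because the options are enumerated exhaustively, such a structure models a closed set of grammatical choices; the potentially open lexical systems of the local grammar discussed next would not be listed exhaustively in this way.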
The notion of system embodied in the local grammar departs somewhat from this picture
of closed sets of grammatical choices. In the local grammar, system corresponds more
closely with lexical contrasts as exemplified in Eggins’ (1993:16-18) illustration from the
lexical set of progeny ie kid, child, brat, darling, son, boy etc. This system can be
captured in terms of the contrast between specification of sex (son, boy) and specification
of positive, neutral or negative attitude (darling, boy, brat etc). In the local grammar of
causation, systems characterize meaningful lexical choices specific to the expression of
causation within the biomedical domain. System is used in the local grammar to mark
potentially open sets of paradigmatically-related lexical options, which stands in marked
contrast to the SFG use of the term to mark closed grammatical systems such as voice,
polarity and pronoun usage.
The workings of systems within a local grammar of cause and effect can now be
illustrated with respect to the intuited sentence Smoking causes heart disease. Here the
semiotic choices made in encoding the causative meaning can be conceptualized in terms
of three systems, referred to here as the cause, hinge and effect systems
respectively. The task of the grammar is to specify what choices can make up each
system; these choices are partially illustrated in the diagram below. Thus under cause,
the semiotic choices are provided by oppositions between the semantic categories of
[drug], [disease], [lifestyle], [side-effect], and [process] etc22.
Reading from left to right, each choice provides the entry point into the following system,
in this case what will be referred to as the hinge system comprising the verbal linkage
between cause and effect nominals of smoking and heart disease. This system involves a
further choice between the different functional oppositions of productive, parametric,
relational, referential and existential causation. The selection here provides the entry
condition into the effect system, involving a further choice between the functional
labels in opposition. To return to the intuited example above, the choices made from the
systems in order to realise the causative meaning can be stated as cause[lifestyle]
→ hinge[productive] → effect[disease] etc.
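The sequence of choices for the intuited sentence can be sketched as a simple record type. This is a minimal illustration under stated assumptions: the names Choice, SYSTEM_ORDER, render and follows_syntagm are hypothetical and do not describe any parser output format defined in the thesis.

```python
from dataclasses import dataclass

# Left-to-right order of the three top-level systems along the syntagm.
SYSTEM_ORDER = ("cause", "hinge", "effect")


@dataclass(frozen=True)
class Choice:
    system: str    # "cause", "hinge" or "effect"
    category: str  # label selected from that system
    text: str      # the string realizing the choice


def render(choices):
    """Render a sequence of choices in the system[category] notation."""
    return " → ".join(f"{c.system}[{c.category}]" for c in choices)


def follows_syntagm(choices):
    """True if the choices enter the systems in cause, hinge, effect order."""
    return tuple(c.system for c in choices) == SYSTEM_ORDER


smoking = [
    Choice("cause", "lifestyle", "Smoking"),
    Choice("hinge", "productive", "causes"),
    Choice("effect", "disease", "heart disease"),
]

print(render(smoking))
```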
[Figure: Basic systems and choices (shown in bold) in the local grammar: cause system, hinge system, effect system]
The representation shown above is simplified in the sense that the invented example is
non-modalized and the cause and effect nominal groups show no internal structure in the
form of pre- or post-modification of the nominal head. As has been remarked upon
previously, corpus examples reveal high degrees of nominalization; a great deal of
information bound up in these nominals as prepositional groups is potentially worth
extracting automatically. Consequently the grammar needs to consider more delicate
systems in accounting for the hierarchical relationship between constituents. These
relationships will be discussed in terms of the rank scale in section 5.2.5 below.
[Footnote 22] The listing of semantic categories here is illustrative only; a fuller exposition can be found in section 5.4.
5.2.5 Syntagmatic relations: constituency and rank
Relations along the syntagm are captured using the notion of constituency. At its
simplest, the notion of constituency embraces the structural relationship between larger
and smaller units in the lexico-grammar. In SFG this relationship is specified through the
notion of rank.
As Halliday points out (1985a:21) the semantic point of entry into the lexicogrammar
necessitates the adoption of an alternative mode of representation to the immediate
constituent analysis familiar from form-based structuralist interpretations. Halliday calls
this mode of representation functional bracketing. In the corpus example Intracranial
tumors can also cause swelling of the disk23 below, an immediate constituent analysis
imposing the maximum amount of hierarchical information is compared with functional
bracketing.
In the immediate constituent analysis (a), the structure is stated in terms of phrase
structure (PS) rules. On this basis the determiner + noun string the disk is related
hierarchically to its designation as a noun group. In turn this unit is embedded within the
prepositional group.
Thus this representation imposes the maximum amount of hierarchical structure on the
sentence. This representation can be compared with (b) which embodies Halliday’s
conception of functional bracketing. Here the emphasis is not so much on rigidly
imposing a syntagmatically-motivated hierarchical representation but instead on
describing which strings work holistically as functional units within the clause. It is
possible therefore to describe the sentence in terms of two overarching functional units,
the cause string intracranial tumors and the effect string can also cause the
swelling of the disk.
[Figure: (a) Immediate constituent analysis]
[Footnote 23] Corpus reference SRA381(1823)
Of course it is possible to extend the process by identifying more delicate functional units
within the cause and effect halves of the clause respectively; due to the complex internal
structure of nominal groups25 in scientific research genres it is desirable for example to
describe the functional contribution of pre- and post-modifying elements within the
nominal group as described above. Internal labelling of nominal groups within a
functional grammar is desirable from an information retrieval perspective and will be the
focus of section 5.3.3. Thus the final format of the local grammar incorporates partly the
functional representation in (b) and the imposition of some hierarchical structure in (a).
The next stage in the explanation of constituency is to label the different hierarchical
levels in the grammar. In keeping with the discussion above, the local grammar adopts
the Hallidayian rank scale (Halliday ibid.:24) which serves to identify hierarchical
relationships between units identified on the basis of functional bracketing. Bearing in
mind the nature of the concordance data input (which can be expanded in Wordsmith to
include an orthographic sentence centred on the search node) the highest unit in the rank
scale referred to below is that of clause complex. Using the intracranial tumours example
above, these relationships are illustrated hierarchically in the rank scale diagram below:
clause complex (Intracranial tumors can also cause swelling of the disk)
group (swelling of the disk)
word (intracranial)
[Footnote 25] The designation of NP is preserved here for IC analysis derived from a phrase structure grammar while the term nominal group is a Hallidayian term (Halliday 1985a:180)
Within the complex sentence for example, it is possible for a clause to be embedded
within a larger clause, an example of what Halliday (1985a: 188) calls rank shifting.
Similarly groups can be identified within clauses; these elements can also exhibit rank-shifting. Finally at the most delicate level in the grammar there is the level of lexis. The
original statement of rank also postulates the more delicate level of morpheme. However
for a corpus-driven study where individual lexical items form the sampling query, it
would seem sensible to put forward the word at the lowest level of the hierarchy.
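The rank scale and the notion of rank-shifting can be sketched as a tree of units, each tagged with its rank. The sketch below is illustrative only: the class Unit, the function rank_shifted and the particular segmentation of the intracranial tumors example into groups are assumptions made for the purpose of the example, and a unit is treated as rank-shifted when it is embedded in a unit of the same or a lower rank.

```python
from dataclasses import dataclass, field

# Ranks from highest to lowest; the word is the lowest rank adopted here.
RANKS = ("clause complex", "clause", "group", "word")


@dataclass
class Unit:
    rank: str
    text: str
    parts: list = field(default_factory=list)


def rank_shifted(unit):
    """Yield units embedded in a unit of the same or a lower rank."""
    for part in unit.parts:
        if RANKS.index(part.rank) <= RANKS.index(unit.rank):
            yield part
        yield from rank_shifted(part)


sentence = Unit("clause complex", "Intracranial tumors can also cause swelling of the disk", [
    Unit("clause", "Intracranial tumors can also cause swelling of the disk", [
        Unit("group", "Intracranial tumors", [
            Unit("word", "Intracranial"),
            Unit("word", "tumors"),
        ]),
        Unit("group", "can also cause"),
        Unit("group", "swelling of the disk", [
            Unit("word", "swelling"),
            Unit("group", "of the disk"),  # a group embedded within a group
        ]),
    ]),
])

print([u.text for u in rank_shifted(sentence)])
```

On this segmentation the only rank-shifted unit is the prepositional group of the disk, which functions inside the nominal group rather than directly in the clause.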
The notion of delicacy and rank-shifting in the grammar is illustrated in the example
below with regard to a more complex nominal group. This example is typical of
nominalization in scientific writing, in which the nominal group head effect is
semantically empty and the principal information-bearing elements are in the form of
prepositional phrases each containing a down-ranked nominal group:
[28] IL-1 also potentiates the chemotactic effect of platelet-derived growth factor (PDGF)
on corneal fibroblasts.
In the diagram below the simplified system choice networks for the nominal group the
chemotactic effect of platelet-derived growth factor (PDGF) on corneal fibroblasts are
illustrated and a comparison made with the SFG framework.
[Figure: Systems and nominalization (simplified system networks for the nominal group headed by Effect[ ]; recoverable option label: body part)]
This representation portrays the various sub-systems within the nominal group headed by
the noun effect. In the SFG framework, such a nominal group would be headed by the
element Thing premodified by optional deictic, numerative, epithet and
classifier elements (Halliday ibid.:180). While acknowledging the validity of these
elements in a general language SFG representation, the local grammar with its point of
departure in facilitating information extraction is more selective in its representation of
elements. The local grammar seeks to address the most important elements in
informational-content terms and render these elements in a more semantically-transparent
form. It should be immediately obvious for example that the labels of cause and
effect are not only more specific than the somewhat vacuous thing but also of much
greater value in signaling the causal relations to be extracted from a text.
Of the pre-modifying systems in the SFG framework it is argued that the deictic
category encompassing in traditional terms deictic pronouns and determiners is too much
based on the recovery of anaphoric relations to be of value in the local grammar. Chapter
6 discusses the problem of anaphoric reference resolution. In the case of the SFG
category numerative marking essentially quantitative information in the nominal
group there would seem to be no specific need to categorize numerical information in
semantic terms. The local grammar instead seeks to categorize the adjectival information
coming before the head of the nominal group.
The pre-modifying system is referred to in this grammar as the delimiter system
which contains three choice elements, referred to as delimiter[classifier],
delimiter[epithet] and delimiter[causal] respectively. The elements of
[classifier] and [epithet] are essentially inherited directly from the SFG
framework although the importance of pre-modifying causation is reflected in the
additional local grammar category of [causal]. The distinction between these
elements will be explored in more detail in section below. In the case above the
choice of the pre-modifying adjective chemotactic is labelled
delimiter[classifier]. The effect system can potentially contain a number of
semantic oppositions, such as [disease], [process], [outcome] etc. These
oppositions are listed in full in section 5.4.2. However in the example above headed by
the relatively ‘empty’ lexical item effect, the system is not further sub-divided and is
labelled simply effect[ ].
The post-modification of the nominal effect is taken up by what will be referred to in
this grammar as qualifier systems each of which contains a series of semantic
oppositions (which are only partially represented here). Again qualifier represents an
element inherited directly from the SFG framework (Halliday ibid.:188). In order to
clarify these system choices, the local grammar analysis of the nominal group the
chemotactic effect of platelet-derived growth factor (PDGF) on corneal fibroblasts is set
out below.
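[Figure: local grammar analysis of the nominal group the chemotactic effect of platelet-derived growth factor (PDGF) on corneal fibroblasts, headed by Effect[ ]]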
An important task of the grammar therefore is to define these oppositions in the form of a
closed set of semantic categories.
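The analysis of this nominal group can be sketched as a flat list of labelled elements. This is a hedged illustration: the class Element and the function informative are hypothetical names, and the semantic categories of the two qualifier systems are deliberately left unspecified (None) in the sketch rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Element:
    system: str              # e.g. "delimiter", "effect", "qualifier"
    category: Optional[str]  # None means no category is recorded in this sketch
    text: str

    def label(self) -> str:
        # An unspecified category is rendered as the empty label "[ ]".
        return f"{self.system}[{self.category if self.category else ' '}]"


nominal_group = [
    Element("delimiter", "classifier", "chemotactic"),
    Element("effect", None, "effect"),  # semantically 'empty' head
    Element("qualifier", None, "of platelet-derived growth factor (PDGF)"),
    Element("qualifier", None, "on corneal fibroblasts"),
]


def informative(elements):
    """Texts of the qualifier elements, which carry the information
    when the head noun itself is semantically empty."""
    return [e.text for e in elements if e.system == "qualifier"]


for e in nominal_group:
    print(e.label(), "=", e.text)
```

The deictic the is omitted from the list, in keeping with the argument above that the deictic category is not represented in the local grammar.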
5.3 Functional systems and categories
5.3.1 General
The presentation of the grammar at this point moves on to consider in more detail the
semiotic systems internal to causation and their constituent semantic units or categories.
Such a perspective on the notion of category is essentially Aristotelian in the sense that a
category is defined as the conjunction of necessary and sufficient features, that these
features are binary and that there are clear boundaries between the categories (Lakoff
1987; for a critique of this view see also Wittgenstein 1958 and Langacker 1987). In
adopting a classical perspective on the definition of categories it is possible to regard an
element as belonging to one category only rather than recognizing degrees of overlap
between adjacent categories. There are however substantial problems with this view; as
will be shown below, it is not always possible to define the necessary and sufficient
features for every category. Some elements involved in system configurations for
causation for example could be categorized both as [process] and
[bio_function], as will be made clear in section 5.4.
Based on the syntagmatic principles outlined in section 5.2.4 above, these semantic
categories comprise functionally-bracketed units specific to the causative function of the
clause. This distinction is critical in differentiating a local grammar from a general
grammar. The categories therefore should aim at striking a balance between capturing
significant semantic generalizations and avoiding as far as possible a proliferation of
functional roles which would render the grammar unwieldy. This point will be elaborated
upon in section 5.4.5 below. In addition these categories need to be classified in such a
way that the semantic labels are as transparent as possible ie their utility as a basis for
future information extraction initiatives should be self-evident.
5.3.2 Top-level / clausal systems
The outline of the local grammar systems proceeds firstly by describing the top-level or
clausal systems. In 5.3.3 systems of pre- and post-modification internal to the structure of
the nominal group will also be considered.
Cause and effect
At its least delicate, the grammar makes a basic distinction between the three overarching functional systems of cause, hinge and effect respectively.
Configurations of these systems and semantic differences between the verbs linking these
nominals provide the basis for a further semantic sub-division into productive,
parametric, relational, inferential and existential causation outlined below in section 5.4.
The workings of these systems will first be illustrated below with respect to the
corpus example Intrauterine perfusion failure can cause cerebral malformations.26
[Figure: local grammar analysis of the example; recoverable elements: Intrauterine perfusion failure | can cause | Product [Process]]
At the top level in the representation the grammar pattern for the lexical item cause is
shown mapping onto the functional systems of cause, hinge and effect. In
general grammar terms the cause constitutes the subject of the clause; within a general
language SFG framework this category would be given the functional label actor.
[Footnote 26] For the sake of simplicity this (and subsequent) local grammar analysis does not include the full internal structure of the nominal groups. Nominal group structure will be outlined in section 5.3.3
It could be argued of course that the local grammar analysis of cause is merely the
exact equivalent of Actor / Agent / Initiator and similarly effect as Goal
/ Medium / Range etc in accordance with the SFG analysis. The need to set up
semantically-specific categories is of prime importance however. The choice of specific
categories is directly related to the practical applications of the grammar in information
extraction as the grammar is designed only to serve as the basis for a parse of causation in
text. Each of these systems is particular to the communicative function of the clause in
terms of encoding a causative relationship between the subject and object nominal
groups. At the same time the conflation of the cause and effect systems of the local
grammar with their general language SFG equivalents can only enhance the compatibility
with a global parser. The compatibility of local grammars and SFG will be considered in
chapter 7 in relation to the problem of parsing unrestricted text.
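The conflation of local grammar systems with their general language SFG equivalents can be sketched as a simple lookup. The sketch is illustrative only: the names SFG_CANDIDATES and with_sfg_labels are assumptions, and the mapping records only the candidate equivalents named above for cause and effect without choosing between them.

```python
# Candidate SFG labels for the two systems discussed above.
SFG_CANDIDATES = {
    "cause": ("Actor", "Agent", "Initiator"),
    "effect": ("Goal", "Medium", "Range"),
}


def with_sfg_labels(analysis):
    """Attach candidate SFG labels to each (system, text) pair.
    Systems with no equivalent recorded here receive an empty tuple."""
    return [(system, text, SFG_CANDIDATES.get(system, ()))
            for system, text in analysis]


example = [
    ("cause", "Intrauterine perfusion failure"),
    ("hinge", "can cause"),
    ("effect", "cerebral malformations"),
]

for system, text, candidates in with_sfg_labels(example):
    print(system, "|", text, "|", "/".join(candidates))
```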
The cause and effect functional systems can be seen to encompass a number of
possible configurations. Because causative clauses are being treated as a sublanguage
within the already restricted sub-genre of deductive research articles, each cause and
effect functional label can then be regarded as a system containing a number of
restricted semantic labels or roles specific to the biomedical domain. The cause system
can comprise for example specific medical treatments and care episodes, drugs, genetic,
viral, bacteriological agents or, as in the example above, biomedical processes; this list is
expanded and exemplified in section 5.4.2. It is intended that these categories should
provide the basis for the profiling of relevant information in future parsing applications of
the grammar. Similarly the effect category can comprise specific manifestations of
diseases, in the form of diagnostic symptoms and other pathological effects which need to
be matched with the cause category.
Hinge
Following the definition grammar, the linkage between cause and effect nominal groups
is given the functional label hinge. Broadly this category corresponds in experiential
terms with process in SFG terms (Halliday ibid.:106-109). Again the desirability
of using the local grammar term hinge rather than the SFG process needs to be
explained. In the local grammar, the hinge element is a reflection of the importance
attached to verbs as the central connecting elements between cause and effect. Semantic
differences in the hinge are reflected in further hinge divisions in accordance with the
various sub-types of causation outlined in section 5.5.1 below.
The status of the hinge element within causative clauses is not entirely unproblematic
however. In the case of what will be referred to below as parametric causatives, it might
be argued as in the example below that the effect arises from a combination of the
verb minimize in the hinge and the object nominal group the number of presentations of new-onset generalized seizures.
[29] this suggests that the correct use of bupropion would minimize the number of
presentations of new-onset generalized seizures.
The argument for an analysis which separates the verb as the hinge from the effect
nominal group is essentially one of consistency across the other sub-types of causation.
Furthermore, as will be explained in chapter 6, the adoption of a hinge/effect
distinction applied consistently across the five sub-types of causation avoids a potentially
recursive representation of the effect system ie an analysis in which the effect
system could in turn contain nested effect elements at lower hierarchical levels.
The status of certain verbs making up the hinge needs to be clarified at this point.
Following the original pilot project (Allen 1998) verbs expressing prevention (prevent,
block, inhibit etc) are included within the semantic domain of causation. On account of
their transitivity, ‘prevent’ verbs are included within productive causation. It is
acknowledged however that the analysis of prevent and its approximate synonyms into
the cause, hinge and effect systems of productive causation represents a
potential area of difficulty. As can be seen in the corpus example below, the main
problem with the analysis is that the string low birth weight in mycoplasma-colonized
pregnant women is presented as the effect product when in reality the effect was never
[Figure: local grammar analysis of the corpus example: Erythromycin treatment | was shown to prevent | low birth weight in mycoplasma-colonized pregnant women]
On this basis, prevention is included in the hinge system because it encodes the
consequence of the causal initiator, in this case Erythromycin treatment. A further
argument is that such an analysis is also more consistent with the negation of prevention
as in the example below.
[Figure: local grammar analysis of the corpus example SRA648(008); recoverable elements: did not prevent | acute otitis media in infants]
[Footnote] As explained in section 5.2.4 many of the analyses of the functional units presented here do not display the maximal possible hierarchical degree of bracketing. Thus the hinge element here is not further subdivided into the hedge element was shown to and the causative verb prevent.
Hedge
The hinge verbal element in the analysis can contain the functional system of hedge.
As mentioned previously, the hedge essentially marks the encoding of interpersonal
meaning. At the risk of over-simplifying the complex structure of the verb phrase, the
local grammar system of hedge makes a basic distinction between two semantic
categories: [modal] and [projection]. These differences are exemplified for
productive causation centred on the V n pattern for cause below:
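[Figure: analysis of a productive causative hedged with the modal auxiliary may; recoverable elements: Product [psych] | mood changes]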
The category [modal] relates to all epistemic modal verbs such as the modal auxiliary
may in the example above and modal adjuncts which following Quirk et al (ibid.:219)
mark human judgement as to the likelihood that a particular causative relationship
pertains. The local grammar category of [modal] is broadly compatible with the SFG
notion of modality which covers both modalization (broadly epistemic statements of
probability) and modulation encompassing obligation, volition and inclination (Halliday
ibid.:88-89). Within the biomedical RA with its focus on scientific processes rather than
human agents, modalization is by far the most important constituent of [modal].
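The separation of a hedge from the causative verb within the hinge can be sketched as a small heuristic. This is a sketch under stated assumptions rather than part of the grammar: the word lists MODALS and PROJECTING are short illustrative inventories, modal adjuncts are not handled, and the [projection] branch anticipates the category described in the paragraphs that follow.

```python
import re

# Assumed, non-exhaustive inventories for illustration only.
MODALS = ("may", "might", "can", "could", "would")
PROJECTING = ("known", "shown", "reported", "thought", "believed", "found")

# Passivized projecting frames such as "was shown to" or "has been known to".
PROJECTION = re.compile(
    r"^(?:is|are|was|were|has been|have been)\s+(?:%s)\s+to\b" % "|".join(PROJECTING),
    re.IGNORECASE,
)


def split_hinge(hinge):
    """Return (hedge_category, hedge_text, verb_text) for a hinge string."""
    hinge = hinge.strip()
    match = PROJECTION.match(hinge)
    if match:
        return "projection", match.group(0), hinge[match.end():].strip()
    first, _, rest = hinge.partition(" ")
    if first.lower() in MODALS:
        return "modal", first, rest.strip()
    return None, "", hinge  # non-hedged hinge


for h in ("may cause", "was shown to prevent", "has been known to cause", "causes"):
    print(h, "->", split_hinge(h))
```

A negated hinge such as did not prevent is returned unchanged by this sketch, since negation is not a hedge category.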
The [projection] category is reserved for hedging devices which can be seen as
passivized variants of projected mental and verbal processes. As such this usage is based
closely on the Hallidayian notion of projection (Halliday ibid.:219-220). Projecting
clauses are in SFG terms mental and verbal processes which serve to quote or report
ideas (Eggins 1993:247). The use of the passive serves to mitigate the claim through
attribution to a third unnamed party. In the case of the example below, hedge is the
clause-internalized form of the projected mental process clause It has been known that 5-FU causes blurred vision, circumorbital edema etc. Examples of projection with
productive cause (V n pattern) are shown below:
[Figure: analysis of the corpus example; recoverable elements: Hinge [productive]: has been known to | Product [symptom/disease]: blurred vision, circumorbital edema, ocular pain, photophobia, excessive, conjunctivitis]
Source
The functional label source occupies the same level in the hierarchy as the cause,
hinge and effect elements. In the local grammar the category of source is used
with a fairly narrowly circumscribed group of verbs collectively marking causal
attribution. In SFG terms the agent of the attribution in passivized clauses would be
labelled Actor (Halliday ibid.:110). It is argued here that the local grammar label
source is more specific than Actor marking the authorial source of the identified causal
relationship. It is sometimes the case in biomedical writing that a cause and effect
relationship is not merely stated as existing but is also ascribed to a published source in
the form of a medical authority, documented research findings etc. In the functional terms
of the local grammar, this configuration of functional categories is mapped onto V-ed
to n by n or V-ed by n to n lexical patterns as shown in the example below.
[Figure: analysis of a causal attribution example; recoverable elements: by n: by the ... | to n: ... aged 25 and over]
Appositive
Apposition, a co-referential semantic relation predominantly between nominal groups is
very common in scientific writing. The appositive system label is used for nominal
groups following either the cause or effect systems which have similar or identical
reference with the preceding nominal group. There would appear to be no SFG
convention for marking co-referential relationships between adjacent nominal groups.
Appositive nominal groups in the biomedical research sub-genre provide the syntactic
and semantic means of packaging and reformulating causes and effects. Predominantly
this re-packaging exploits the possibilities for grammatical metaphor in English in terms
of nominalizing class membership and process mechanism relationships. In the local
grammar the systems cause[appositive] and effect[appositive] mark
these co-referential elements. The appositive systems can be further sub-divided with the
semantic categories as shown in the analysis below.
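[Figure: analysis of the corpus example; recoverable elements: effect: mucopolysaccharidosis type VII | effect[appositive]: a storage disorder | V from: resulting from | cause: ... of ß-glucuronidase | cause[appositive]: an enzyme that is involved in the degradation of ….]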
In the above example, the effect and cause systems both contain appositive
labels. The effect[appositive] serves a classificatory function, serving to denote
membership of the identified disease mucopolysaccharidosis type VII in the class of
storage disorders. These appositive groups can thus be seen as nominalizations of the
congruent mucopolysaccharidosis type VII is a storage disorder etc. Similarly the cause
[appositive] label to the right of the hinge expands on the preceding nominal by
combining a superordinate label enzyme with a specification of the biochemical
mechanism which is at work. Related to the system [appositive] is the system
[list] illustrated in section which as the label suggests encapsulates all listings
sharing dual reference introduced by eg, such as etc.
Instrument
The system of instrument is an optional part of the top-level local grammar system
configurations. As in, the local grammar label is preferred as it is more specific
than the general language SFG configuration as actor.
[Figure: analysis of a relational causative; recoverable elements: Vein graft ... | has been ... | to n: to the ... | by smooth muscle cell ...]
The category of instrument is reserved for non-animate third-party means by which
a causal relationship is initiated. In the above relational causative, the cause system is
broken down into the nominal group further categorized as a cause[process], the
general reference for which is made more specific by the post-modifying qualifier
[process]. It is tempting to categorize the PP headed by by as a source following on
from what has been said above about verbs of causal attribution. For the purposes of this
grammar however, source is used to refer to medical researchers, authorities ie animate
sources of causal attribution while instrument encompasses non-animate agency.
Circumstance
[Figure: analysis of a corpus example containing the circ system; recoverable elements: [ ] | on n: no significant effect on ultrasounds | when applied at 0.03, 0.1 and 0.3 mg/kg i.p.]
The above example includes the system circ which is essentially inherited directly
from the transitivity analysis of Halliday. This category includes what would traditionally
be regarded as adverbial elements at the level of clause in a SVOCA analysis. This point
is important to bear in mind in distinguishing qualifier and circ elements;
qualifier refers to elements within the nominal group while circ applies to adjunct
elements modifying the whole clause. Typically the circ system encompasses
experimental (as in the case above), temporal or locative conditions which must be
fulfilled in order for the causal relationship to pertain. The difference between the local
and general grammar analyses is that these circumstantial elements can be filled with
semantic categories such as [experimental] which are related to disease treatment
and experiment.
Evaluator
The evaluator category is found in the effect functional category in relational
causatives where the explicit causal linkage rests on an evaluative adjective. The
grammar makes the distinction between evaluator as a clause-level system and
epithet which as an SFG term is confined to the internal workings of the nominal
group as a pre-modifying adjective. Two corpus examples of evaluative adjectives
serving to realize causal links are shown below:
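[Figure: two corpus examples of evaluative adjectives realizing causal links; recoverable elements: these loci | in the Japanese | Qualifier [process]: for the development of insulitis]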
In the first example, the causal link between the cause[geographical] and the
effect[Evaluator] is made using the link verb are. This link verb unites the cause
with the effect which is headed by what would be descriptively termed a predicative
adjective. Essentially this is one functional category which is borrowed from the
evaluation local grammar (Hunston and Sinclair 2000) in that it marks an assessment by
the writer of the strength, influence or importance of the agent in bringing
about the causative process. The adjective is complemented by the preposition in heading
a PP (in functional terms realizing two qualifier systems). With patterns involving
evaluative categories such as the two examples shown above, there is more or less a
requirement that there is a qualifier element included within the clause in order for it to
be a well-formed causative as the qualifier is necessary to define the effect. It is
possible to regard evaluative patterns as an additional form of hedging in which the
directness of the causal link is reduced. Evaluative patterns will be discussed in section below as a subset of relational causation.
5.3.3 Systems within the nominal group
Pre-modifying systems: Delimiter
The delimiter system, consisting of the three semantic categories [epithet],
[classifier] and [causal], encompasses what would in a descriptive grammar be
classified as adjective and nominal groups pre-modifying the head noun of the nominal
group. The essential difference is between the function of [epithet], in which an
evaluative adjective adds a subjective assessment on the part of the researcher
(significant, important, influential etc), and that of [classifier], which categorizes
the nominal head by delimiting its reference by class membership. Finally causal
relations as attributes are accorded separate status as [causal].
Delimiter [epithet]
The local grammar analysis presented below shows clearly the function of the
evaluative category with reference to the head nominal effect. Adjectives in this
category mark the biomedical importance, benefits and negative impacts of observed
[Analysis table, only partly recoverable: a significant ... | [Presence] | Qualifier: of visceral ... SRA219(1164)]
Delimiter [classifier]
The classifier category on the other hand marks the use of adjectives and nouns as
classifiers of the nominals they pre-modify. In contrast to the subjectivity of evaluative
adjectives, classifier adjectives serve to delimit the reference of the nominal group by
listing classificatory attributes.
[Analysis table, only partly recoverable: an increase | in nasopharyngeal ...]
It is very frequently the case as in the example above that cause and effect processes are
given body part or anatomical attributes (palatine, tonsillar, adenoid etc). Similarly the
classifier category can include naming entities such as genes, DNA strands,
biochemical, viral and bacteriological names.
Delimiter [causal]
In addition to the SFG categories of [epithet] and [classifier], the local grammar includes
the additional category of [causal] to encompass causal relationships which are bound up
as pre-modifying elements prior to the nominal group head, as in the
corpus example phenylethylamine-induced neurotoxicity below.
Delimiter [causal]
It could be argued that [causal] is in reality a sub-category of [classifier] in
the sense that the pre-modifier is a classificatory attribute of the nominal neurotoxicity. Given
the focus of the local grammar on causation however, pre-modifying causal relations are
accorded separate status.
Qualifier
The element Qualifier is again a direct inheritance from SFG here used as a general
functional system to encompass post-modifying elements in the nominal group (Halliday
ibid.:187-193). These post-modifying elements comprise both prepositional phrases and
non-finite clausal elements which convey crucial information regarding the further
specification or referential scope of the cause or effect systems listed above. It is
particularly important to accord these elements a status within the grammar where the
head noun is general in reference.
There are a number of ways in which the qualifier element can narrow down the
reference of a general abstract noun headed cause and effect element. This referential
restriction can be seen in terms of creating relationships between the head and more
specific entities eg cause of illness etc and also describing temporal, spatial and other
circumstances under which a particular causative relationship holds. With regard to
circumstantial elements for example, less nominalized genres might encode these
elements congruently as separate clausal elements such as adverbials etc. This
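relationship will firstly be illustrated with regard to the functionally-labelled example
below:
[Analysis table, only partly recoverable. Clause analysed: Low intrinsic activity of bretazenil at benzodiazepine receptors may therefore not be sufficient to exert anxiolytic activity under more stressful conditions SRA303(4342)]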
At the top level there is a cause[biochemical] category linked via the hinge
element to the effect[biochemical]. It can be readily appreciated how the
qualifier categories (ie the post-modification of the nominal heads) serve to
package and elaborate informational elements. Under the cause heading for example,
there are two qualifier categories: qualifier[biochemical] and
qualifier[location] which will now be described in more detail. The first of
these categories is relational in the sense that it describes the extent to which the process
is limited in terms of the biochemical compound to which it is referring (bretazenil). The
second qualifier category has a circumstantial function in that it further constrains the
reference in terms of spatial location. The effect nominal is similarly post-modified,
using the functional label qualifier[experimental] which subsumes all
experimental conditions under which a cause or effect is seen to be acting.
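The layering described above can be made concrete with a small data structure. The following is a minimal sketch only, assuming hypothetical class names (Qualifier, NominalSystem) which are not part of the grammar itself; the segmentation of the bretazenil clause (SRA303(4342)) follows the description in the text, although the exact extent of each head is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Qualifier:
    """A post-modifying element with its semantic category (hypothetical class)."""
    category: str  # e.g. "biochemical", "location", "experimental"
    text: str


@dataclass
class NominalSystem:
    """A cause or effect system: a head plus optional qualifiers (hypothetical class)."""
    system: str    # "cause" or "effect"
    category: str  # semantic category of the head
    head: str
    qualifiers: List[Qualifier] = field(default_factory=list)

    def linear(self) -> str:
        """Return a flat, left-to-right rendering of the functional labels."""
        parts = [f"{self.system}[{self.category}]: {self.head}"]
        parts += [f"qualifier[{q.category}]: {q.text}" for q in self.qualifiers]
        return " | ".join(parts)


# Segmentation assumed from the description of the bretazenil example.
cause = NominalSystem(
    "cause", "biochemical", "Low intrinsic activity",
    [Qualifier("biochemical", "of bretazenil"),
     Qualifier("location", "at benzodiazepine receptors")],
)
effect = NominalSystem(
    "effect", "biochemical", "anxiolytic activity",
    [Qualifier("experimental", "under more stressful conditions")],
)

print(cause.linear())
print(effect.linear())
```

The point of the sketch is simply that each qualifier carries its own semantic category, so that relational and circumstantial post-modification remain separately retrievable.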
To a large extent, the qualifier system can be filled with the same semantic
categories as the cause and effect systems with some exceptions outlined below in
5.4.3. Where the cause and effect nominal elements are headed by relatively
unspecified abstract nouns (as opposed to specifically named genes, bacterial and viral
agents and patients etc), there is a tendency for the nominal group to be post-modified.
The qualifier[category] provides the functional means for capturing what are
very significant information-bearing elements in the local grammar. These elements
constrain the reference of the abstract noun heads either by stating relational links to
more specific entities or by specifying the circumstances under which the causative
relationship holds. Formally it may be noted that qualifier coincides
with prepositional phrases and also non-finite clausal elements embodying a similar
The parsing of qualifier elements is greatly assisted by the presence of a limited number
of prepositions (on, in, at and of etc) which serve as a boundary for the delimitation of the
element. In the case of on, in and at, the specification is made as a metaphorical
extension of the prototypical spatial meaning of the preposition. Other prepositions such
as during mark a specification of temporal limitations, periods / periodicities under which
a certain cause is acting or an effect is observed. The preposition of on the other hand
marks relational qualifier configurations which, following the work of Gledhill
(2000:142), characterize further specifications of quantities (amount, dose etc) and
experimental actions (evaluation, specification), amongst other areas. The preposition
with is also an important marker of qualifier although the semantics of the
delimitation are not exactly the same. Qualifier elements headed with the
preposition with create specificity not through restriction by metaphorical spatial /
temporal reference but instead by adding attributes to the nominal head, as in the example
below. Here the PP with congenital heart disease adds additional adjectival information
to the nominal head child:
[30] A basal maternal phenylalanine level 1800 µM (30 mg/dL) significantly increased
the risk for bearing a child with congenital heart disease (p = 0.003).
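The use of prepositions as boundary markers for the qualifier element can be sketched as follows. This is an illustrative sketch under simplifying assumptions, not the parser used in the thesis: the function name and the decision to split at the first listed preposition are hypothetical, and the preposition list contains only those named in the discussion above.

```python
import re

# Prepositions named in the text as markers of the qualifier boundary.
QUALIFIER_PREPOSITIONS = ("on", "in", "at", "of", "during", "with")

# Match any listed preposition as a whole word only, so that e.g. the "on"
# inside "congenital" is not treated as a boundary.
_BOUNDARY = re.compile(
    r"\b(" + "|".join(QUALIFIER_PREPOSITIONS) + r")\b", re.IGNORECASE
)


def split_head_and_qualifier(nominal_group):
    """Split a nominal group at the first boundary preposition (hypothetical helper).

    Returns (head, qualifier); qualifier is None when no preposition is found.
    """
    match = _BOUNDARY.search(nominal_group)
    if match is None:
        return nominal_group.strip(), None
    head = nominal_group[:match.start()].strip()
    qualifier = nominal_group[match.start():].strip()
    return head, qualifier


print(split_head_and_qualifier("a child with congenital heart disease"))
print(split_head_and_qualifier("the presence of infection"))
```

Splitting at the first preposition treats any further embedded prepositional phrases as part of a single qualifier string; a fuller treatment would need to recognise successive qualifier elements.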
There are difficulties however with the distinction which the grammar forces between the
head and qualifier elements which can be illustrated with respect to the role of qualifier
elements in post-modifying quantities. Following Gledhill (ibid.:142-149) and in
accordance with the idiom principle it is possible to regard the co-occurrences present in
the string amount / dose / presence + qualifier as a phraseology. In other words the
head (underlined in the table below) plus qualifier element marks one meaning unit.
Head            Qualifier                 Corpus reference
the amount      of leaflet fibrosis
a high dose     of corticosterone
the presence    of an inflammatory ...
the absence     of major vascular ...
It may be objected therefore that the grammar / parser breaks apart what is in pattern
grammar terms an indivisible unit (N of n etc).
5.4 The Semantic categories
5.4.1 Overview
In order to facilitate the process of semantic categorization, Wordsmith was used to
retrieve examples of causation using a number of common causative search words such
as cause (noun/verb), effect (noun), associate+with (verb). By concentrating on a narrow
focus of lexical items in this way it was possible to begin the process of describing a
closed set of semantic categories which can be used to sub-divide the systems outlined in
5.3 above.
This section considers the make-up of the systems in terms of a limited number of
semantic oppositions or categories defined with reference to the need to capture
significant biomedical information. These categories provide the basis for the formatting
of semantic information contained within the sublanguage. Categories are not necessarily
specific to one system; as will be shown in section 5.4.3 below, many semantic categories
are shared between the cause and effect systems at the clausal level. Within the
nominal group, the same set of semantic categories occurs in the qualifier system
where the nominal group is headed by nouns relatively empty of specific meaning eg
cause and effect. In the case of other semantic categories such as patient for example,
there is a restriction to the qualifier element.
5.4.2 The categories
The main semantic categories will now be presented in alphabetical order, followed by an
example from the corpus. These examples are then followed by a selective commentary
intended to shed light on some of the less obvious semantic distinctions which the
grammar makes.
Semantic category | Corpus example within cause, effect or qualifier
[Semantic category labels are not recoverable from the table except where shown; one corpus example per line.]
The chronic effects of the solvents have been attributed to the formation of the biologically active epoxides near the axons. SRA207(3103)
Cardiac dysfunction may be caused by myocardial edema intrinsic to the diastolic state of the arrested heart
MCF7/ADR cells cannot result from modulating the gene expression of TopoIIa. SRA13(4779)
the water-soluble ascorbate can, in fetal livers, fully restore diabetes-induced lipid peroxidation… SRA133(3877)
Prostaglandin E2, synthesized at the same time, may cause damage of the aqueous-blood barrier…. SRA228(1891)
We attributed differences in simulated expenditures between the two cohorts to three demographic factors: the size of the original birth cohort, the proportion of persons surviving to the age of 65 years, and longevity beyond the age of 65.
16.[life stage]
The tooth surface loss, which resulted from herbal tea (mean 0.05 mm2, s.d. 0.02), however was much greater th..
Mediastinitis resulting from esophageal perforation was suspected preoperatively in two case SRA647(847)
Orally administrated levodopa causes variable and unreliable clinical responses SRA377(343)
The emergence of tick-borne infections in the United States has been attributed to reforestation and second-growth forests SRA262(228)
Overall changes in lower surface hardness caused by variations in curing intensity and storage temperatures were analysed by two-way ANOVA
…the fabrication of provisional restorations may reduce the risk of pulp injury because of its lesser temperature rise compared to self-curing resins SRA310(1456)
The LT-+250 polymorphic site has been associated with human disease in only one other instance. SRA150(2540)
Verbal autopsy of 48,000 adult deaths attributable to medical causes in Chennai SRA195(12)
NP interventions do not cause smoking cessation
A higher level of hemoglobin just before childbirth may also be the result of treatment of anemia in pregnancy.
Low intrinsic activity of bretazenil at benzodiazepine receptors may therefore not be sufficient to exert anxiolytic activity under more stressful conditions SRA303(4342)
….but blockage of the catheter may result in an emergency situation….. SRA546(1845)
…matrix deposition may all be mediated through direct effects of insulin resistance SRA279(2507)
Laurie suggested that hypoplasia of the aorta was the cause of death SRA301(403)
This tick is also the vector of Borrelia burgdorferi and Babesia microti SRA169(378)
Some clinicians generated a similar range of forces between cuts SRA620(1537)
Diverse epidemiologic factors have been slightly associated with a high incidence of otitis in children during the first years of life SRA416(1078)
The 4-1BB/4-1BBL pathway has been shown to induce a strong anti-tumor cytotoxic T lymphocyte
Presence of infection may also initiate labor by activating macrophages which…. SRA05(1536)
Vein graft stenosis has been attributed to the process of myointimal thickening by smooth muscle cell proliferation
Therefore, it was not surprising that trauma was believed to be the primary cause for macular hole formation.
Variations may arise from differences in the type of healthcare institution SRA430(2526)
We attributed differences in simulated expenditures between the two cohorts to three demographic factors: the size of the original birth cohort, the proportion of persons surviving to the age of 65 years, and longevity beyond the age of 65.
VZV appears to increase susceptibility of children to streptococcal infection. SRA129(3101)
The development of side effects, especially those that are menstrual-related, seem to cause adolescents and young women to feel that their general and reproductive health is being threatened SRA609(193)
Day and night variation in IOP was significantly related to the state of sleep /wakefulness SRA389(8062)
Liquids of low viscosity make the examiner press the instrument more firmly onto the tumour SRA469(306)
Symptoms of wheezing, especially with exertion, are attributed to asthma. SRA406(916)
Persistent otitis media with effusion (OME), or glue ear, is the most common cause of hearing loss during childhood
The APF gel treatment caused surface damage to all materials,…. SRA31(2027)
Leptospira icterohemorrhagica is transmitted by rats and is found in sewage water SRA268(1235)
LCM virus is an agent of acute central nervous system disease SRA180(318)
While some of these categories such as [disease] may be self-evident it is necessary
in some individual cases to explain more specifically the demarcations made. The
category of [med_appliance] relates to all items of equipment with a specific use in
biomedical experimentation and treatment. [Biochemical] subsumes all chemical and
biochemical substances and their chemical symbol representations (such as carbohydrates,
lipids, proteins, enzymes etc). The causal relationships encoded between metabolic
pathways and their biochemical agents are captured in the category cause [pathway];
here this category is specifically lexicalised as pathway thereby facilitating corpus
retrieval based on the node word. The difference between this category and the closely
related [drug] is made on the basis that the latter is in the form of a proprietary,
commercially registered name with a specific function in therapy/treatment such as
Anadin, Aspirin etc. Many examples of specific [drug] causes were retrieved from the
corpus. The term [drug] is used here in the pharmaceutical rather than narcotic sense to
describe any substance and materia medica administered to relieve a medical condition.
Similarly it is also possible to categorize the unexpected and often negative side-effects
of a drug or other narcotic compound not only as resulting effects but also as causal
agents. The grammar therefore includes a semantic category specifically labelled
A distinction is made between [bio_function] reserved for the physiological
workings of internal organs and other anatomical parts and the more all-encompassing
[process] category. In this grammar [process] is used to refer to a nominalized
series of natural/involuntary stages such as interlinked actions and events. [cell]
marks specific causative agency or resulting effect manifestations which can be attributed
to the macro-structure of cytoplasm and nucleus enclosed in a membrane. The grammar
distinguishes between other causative agents such as [organism] for bacteriological
agents and [viral] for diseases caused by known viruses and / or viral mechanisms
such as mutations. Some higher organisms such as insects or rodents play roles as carriers
of disease; this is reflected in the category [vector]. Recognition of the enormous
importance of genetic mechanisms in disease is reflected in the category of [genetic]
to mark agencies attributed to variation in DNA or RNA strands. The role of anatomical
structure and morphological change in causative processes is reflected in the category
Other categories reflect external influences. The category of [geographical] relates
specifically to named geographical locations such as continents and countries,
geographical regions and cities and towns within the public health domain. Category
[environmental] encapsulates environmental causes in the broadest possible sense,
including causal agencies invoked by changes in landscape, vegetation, impact of human
activities such as pollution on health. Sources of causation which can be traced to broader
population and sociological factors such as the number of births and deaths combined
with disease proportions, probabilities and statistical make-ups within a population are on
the other hand reflected in the [demographic] category.
The general category of [disease] includes listings of all medical scientifically-named
diseases, illnesses, ailments and complaints. Such listings can be obtained from
general databases such as Yahoo! and more specific resources aimed at the specialist
such as Pedlynx listing pediatric conditions. A general distinction between
[disease] and [symptom] is made in the grammar; the latter category headed by the
lexical item symptom is seen as outward evidence of disease. More specifically the
[disease] category encompasses both chronic and short-term conditions such as
seizures, deficiencies alongside genetically-inherited diseases, infectious diseases,
respiratory conditions, diseases of various anatomical parts, sexually-transmitted
conditions, cardiovascular, glandular and metabolic illnesses. Many of these conditions
have regular morphological endings such as –ia, -iasis, -nos and –sis. In future software
applications of the grammar it might be possible to exploit these items in automatic term
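The exploitation of regular morphological endings can be illustrated with a very simple suffix filter. This is a speculative sketch only: the function name and the minimum-length threshold are hypothetical, and the endings are those listed in the text, taken verbatim with the hyphens dropped.

```python
# Endings as listed in the text above (hyphens and dashes dropped).
DISEASE_ENDINGS = ("iasis", "ia", "nos", "sis")


def candidate_disease_terms(tokens):
    """Return tokens whose ending suggests a [disease] term (hypothetical helper).

    The length threshold is an arbitrary assumption intended to skip very
    short function words; it is not taken from the thesis.
    """
    return [t for t in tokens if len(t) > 4 and t.lower().endswith(DISEASE_ENDINGS)]


print(candidate_disease_terms(["hypoplasia", "of", "the", "aorta"]))
print(candidate_disease_terms(["Vein", "graft", "stenosis"]))
# Suffix matching over-generates: non-disease words such as "analysis" also pass.
print(candidate_disease_terms(["analysis"]))
```

As the last call shows, a filter of this kind can only propose candidates; any serious application would need to combine it with a term list or with the surrounding functional context.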
A number of categories capture experimental or treatment interventions. The category
[experimental] is closely related to causative relationships holding between the
ambient conditions (eg temperature, pressure, humidity etc) under which an experiment
/ medical procedure is performed and their effects. [intervention] includes surgical,
clinical and therapeutic treatment actions of intervention including the use of medical
equipment and treatment materials such as catheters, implants, aids for physical
handicaps etc.
The role of quantities, probabilities, measurements and statistics and statistical anomalies
as causes and effects is reflected in the category [quantity]. This category can be
contrasted with that of [quality] which subsumes qualitative characteristics and
attributes such as variations, changes, modifications and differences. The manifestation,
existence or appearance of an entity or phenomenon is labelled [presence] in the
grammar as in the nominal group presence of infection. [presence] also subsumes
somewhat problematically the absence, surplus or deficiency of an entity. Where
nominals contain a [presence] category there is almost always a qualifier
element which narrows down the reference of the head. The inclusion of the category
[mediator] within the semantic domain of causation is admittedly controversial as
mediation implies intermediate rather than direct agency. From an information extraction
perspective however it is argued that the importance of nominal relationships expressed
in mediation especially in the pharmaceutical and biochemical domains justifies the
establishment of this category in the grammar.
The inclusion of significant psychological and psychiatric material in the corpus
necessitates a separate [psych] category for causes and effects involving mental states,
illnesses, behaviour patterns and personality defects. This category is distinguished from
[state] which refers to physical rather than mental states. Related to the behavioural
domain is the category of [lifestyle], reflecting the importance of personal choices
such as exercise, smoking and drug use in disease aetiology. [diet] is a similar category
relating to the impact of nutritional choices.
The responsiveness or reactivity of an entity can also serve as a causative trigger and this
is reflected in the category cause [reactivity]. Such causes are frequently
nominalizations of adjectives (eg responsiveness from responsive etc) and can thus be
seen as qualitative attributes rather than quantities which can be parameterized.
5.4.3 Occurrence restrictions
As has been alluded to in the section above, some semantic categories can occur in any
number of systems while other categories are more restricted in terms of distribution. For
example, the semantic category [disease] can occur as both a cause and an
effect at clausal level and also in a Qualifier element within a nominal group:
Cause: TS may cause hearing disability. SRA581(461)
Effect: Mediastinitis resulting from esophageal perforation was suspected preoperatively in two case SRA647(847)
Qualifier: ... of insulitis SRA219(565)
In the case of [vector] however, the occurrence on the basis of the corpus evidence is
restricted largely to the cause system:
Leptospira icterohemorrhagica is
transmitted by rats and is found in
sewage water SRA268(1235)
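Occurrence restrictions of this kind lend themselves to a simple lookup. The following is a minimal sketch, assuming a hypothetical table and function name; only the two categories discussed above are included, and the corpus tendency for [vector] ("restricted largely to the cause system") is treated here as a hard constraint purely for the purposes of illustration.

```python
# Systems in which each semantic category has been observed, per the text above.
# Treating the [vector] tendency as categorical is a simplifying assumption.
ALLOWED_SYSTEMS = {
    "disease": {"cause", "effect", "qualifier"},
    "vector": {"cause"},
}


def is_permitted(category, system):
    """Check whether a category may fill a system (hypothetical helper).

    Categories with no recorded restriction are permitted everywhere.
    """
    allowed = ALLOWED_SYSTEMS.get(category)
    if allowed is None:
        return True
    return system in allowed


print(is_permitted("disease", "qualifier"))
print(is_permitted("vector", "effect"))
```

In a parsing application a table of this sort could be used to rank competing analyses rather than to exclude them outright, given that the restrictions are tendencies observed in the corpus.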
Other restriction tendencies which have emerged from the corpus are outlined in the table
below.
[Table, contents not recoverable. Columns: Semantic category | Cause system | Effect system | Qualifier system]
5.4.4 Functional roles and grammatical parsimony
One possible objection to the corpus-driven grammar presented in this chapter is the
profligacy of semantic roles despite the constraints of the sublanguage reviewed in
chapter 2. As it stands, the grammar runs counter to the parsimonious presentation which
formal grammars embody. Within formal grammars based on intuited data it is
possible using powerful syntactic rules to achieve a presentational elegance by capturing
the most significant generalizations possible (cf Chomsky 1980a:145;1980b:03 cited in
Radford 1988 for more recent developments in generative syntax which have conflated
movement rules). While this profligacy is acknowledged as a possible weakness (and
indeed some categories could be usefully conflated) it must be emphasized that a
proliferation of roles is necessary if significant information-bearing elements are to be
formatted for the purposes of information extraction. The local grammar represents
therefore an essentially applied perspective which has little in common with the elegance
of theoretical grammatical models. More usefully, the grammar can be seen as a stage in
what Halliday (1966:149) sees as the ultimate aim of grammatical description, ‘to reduce
the very large classes of formal items … into very small sub-classes’.
5.4.5 Summary
This section has introduced the three main functional systems of cause, hinge and
effect and their constituent components and described their relationships to the SFG
framework. For verbs which attribute causal agency to third parties, a further top-level
category of source is also included in the grammar. Circumstance is another
important element relating to the biomedical / clinical circumstances expressed in
causative clauses. Because of the highly nominalized nature of scientific research writing,
it is necessary in the grammar to describe more delicate systems such as delimiter
and qualifier to fully encompass the complex internal structure of nominal groups.
Finally the main semantic categories which can fill these systems have been outlined and
5.5 From categories to grammatical statement
5.5.1 Overview
The next stage in the presentation of the grammar is to provide a linear description of
causative clauses in terms of configurations of the cause, effect and optional
systems together with their semantic category contents. Following on from Allen (2002a,
b) the procedure adopted in this section is to identify a number of functional sub-types of
causation based primarily on the interpretation of the verb element linking the cause and
effect nominal groups and secondarily on a classification of the resultative effect. The
sub-types identified are listed below.
(i) productive causation
(ii) parametric causation
(iii) relational causation
(iv) inferential causation
(v) existential causation
It will be noted that these categories were previously put forward as the basis for making
a quantitative assessment of corpus representativeness (Allen 2002a). In this thesis
however these causative subtypes are further developed in terms of the overarching
systems and semantic categories put forward in section 5.4.2 above.
Each of these sub-types of causation will be presented in turn; the explanation will
describe how the main lexical patterns described in the previous chapter map onto the
functional / semantic categories as well as attempting to circumscribe the specific lexical
items which enter into each sub-type. In SFG terms, the categories of productive and
parametric causation represent sub-divisions of material processes; the distinction being
made here is largely in terms of restrictions in the type of semantic categories which can
enter into the effect system. Relational causation is however very closely modelled on the
Hallidayian category of relational processes although the local grammar superimposes
more specific functional categories onto the nominal groups either side of the hinge. The
status of inferential and existential causation vis-à-vis the other types of causation process
will also be explained.
5.5.2 Productive causation
Productive causation is presented first as it encompasses the periphrastic causatives
which might be considered as prototypical causal linkages. In the traditional terms of a
valency grammar, the verb is monotransitive, linking an agentive nominal group as
cause with an effect which may be glossed as the creation or manifestation of any entity,
event, process, phenomenon or procedure. Productive causation is also the largest category
in terms of the number of lexical items which it contains. The format for the pattern
exposition is shown below with the highest level functional systems on the top row. The
grammar pattern notation is shown mapping onto these functional and semantic
categories, providing a link between the descriptive work of the previous chapter and the
analytical process of describing these linear relationships in functional format.
Functional analyses of productive causation will now be outlined, beginning with the
fundamental distinction between active and passive patterns. In addition to
monotransitive verbs, there are a number of prepositional verbs which on a semantic
basis (ie their linkages to creation or production of entities, events etc) are also included
within productive causation.
Active patterns
The basic active and passive configurations for productive causatives are outlined below:
Active (V n):            Cause [semantic category] | Hinge | Effect: Product [semantic category]
Passive (be V-ed by n):  Effect: Product [semantic category] | Hinge | Cause [semantic category]
In its simplest form, productive causation coincides with the pattern notation V n (active)
or be V-ed by n in the passive. In the active configuration, this pattern maps onto three
top-level systems, cause, hinge and effect respectively, which are reversed in the
passive configuration. The effect system is further labelled as product in productive
causatives. The division of what is in SFG terms the Goal/ Medium / Range into
product for productive causatives and parameter for parametric causation marks a
further distinction between the local grammar and the general language SFG framework.
As has been described previously, more finely-grained subdivision of SFG categories is
justified in terms of increasing the specificity of the analysis for the purposes of
biomedical information extraction. Each cause and effect product system is
then further analyzed into one of the semantic category elements described in section 5.4
above and can also include optional delimiter, qualifier and hedge elements.
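The mapping from pattern notation to the linear order of top-level systems can be sketched as a small lookup. This is an illustrative sketch only, assuming a hypothetical function name and assuming that the clause has already been segmented into three strings; placing the preposition by within the hinge segment of the passive is likewise an assumption made for the example.

```python
# Order of top-level systems for each pattern, as described in the text:
# cause, hinge, effect in the active; reversed in the passive.
SYSTEM_ORDER = {
    "V n": ("cause", "hinge", "effect"),
    "be V-ed by n": ("effect", "hinge", "cause"),
}


def label_clause(pattern, segments):
    """Pair pre-segmented clause strings with system labels (hypothetical helper).

    In productive causatives the effect system is further labelled as product.
    """
    order = SYSTEM_ORDER[pattern]
    if len(segments) != len(order):
        raise ValueError("expected one segment per top-level system")
    labels = ["effect:product" if name == "effect" else name for name in order]
    return list(zip(labels, segments))


active = label_clause(
    "V n",
    ["Orally administrated levodopa", "causes",
     "variable and unreliable clinical responses"],
)
passive = label_clause(
    "be V-ed by n",
    ["Cardiac dysfunction", "may be caused by", "myocardial edema"],
)
print(active)
print(passive)
```

Each labelled segment would then be analysed further into a semantic category with optional delimiter, qualifier and hedge elements, which the sketch does not attempt.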
Given these basic system and semantic category configurations, concrete examples of
causative clauses will now be provided showing how the grammar serves to produce a
linear representation of the various semantic elements contributing to the expression of
[Analysis table, only partly recoverable: ... levodopa | causes | variable ... clinica... | because of its erratic oral absorption and ...]
In the above example, the periphrastic causative verb cause provides the initial
concordance sampling point as node. To the left of the hinge the cause is further
subcategorized into a delimiter [classifier] with the commercially-registered
drug designation levodopa given the semantic category [drug]. The effect contains
evaluative and classifier categories pre-modifying the resultative semantic
category of [reactivity]. This particular example has a circumstance element
which is given the semantic label [explanation] because it serves to shed
explanatory light on the causal link identified.
The example below illustrates an analysis of a causative clause with both pre-modifying
delimiter[evaluator] and qualifier[cell] elements attached to the
nominal heads.
[Analysis table, only partly recoverable: ... of the ... | Qualifier [cell]: in substantia nigra pars ...]
Here ablation is seen as the cause which is given the semantic label of [process].
The reference of this nominal head agent is narrowed down to that of the sub-cell level by
the post-modifying PP qualifier element. The effect is further subdivided into a classifier
element, a head element categorized with the semantic category of [outcome] and a
qualifier element containing [cell].
A variation on this simple pattern of a transitive verb with direct object, in which a to-infinitive clause occurs as a complement of the object, is shown below. The
representation maps the pattern shorthand V n to-inf onto the functional and semantic
[Analysis table, only partly recoverable: their detection | ... | Product [intervention]: [practitioner] the clinician + [intervention] to manage the site or patient ...]
In this case the n + to-inf string is labelled as Product [intervention] which can
subsequently be broken down into an animate participant and [intervention]. In
accordance with the semantic categories outlined in 5.4.2 the string clinician is labelled
[practitioner] as the named medical expert in the treatment intervention episode.
[Analysis table, only partly recoverable: ... of time devoted ... | ... medium stress ...]
A variant on the transitive pattern is shown above, where the hinge element connects a
qualifier-modified cause[quantity] to an effect element. These elements are
then assigned the labels of [patient] and [psych] respectively. In this
representation, the functional system of effect encompasses both patients and the
change of psychological state brought about by the causative agent. Collectively these
two functional units make up constituents of Product[psychological]. A similar
configuration can be observed with the periphrastic causative make as shown in the
example below. The analysis below maps the pattern grammar notation of V n v onto the
functional categories of the local grammar:
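[Analysis table, only partly recoverable. Clause analysed: Liquids of low viscosity make the examiner press the instrument more firmly onto the tumour SRA469(306). Effect labelled Product [intervention]]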
Here the cause system is subcategorized into a [substance] semantic category the
reference of which is limited by the qualifier[quantity]. The effect system is
further categorized as the product [intervention] which itself is made up of two
constituents given the semantic category labels [practitioner] and
[intervention] respectively.
The treatment of productive causation has so far considered mono- and di-transitive
complementation patterns. The discussion moves on at this point to describe prepositional
/ phrasal verbs in which there is a regular co-occurrence of preposition / particle and verb
and secondly clausal complementation patterns in which the effect maps on to V wh- patterns, where wh- represents a wh-clause in pattern grammar notation.
V from
may arise from
in the type of
healthcare institution
The analysis above recognizes the phraseology inherent in the prepositional verb
combination arise + from by including the string within the causal link which then relates
the observed effect [quality] to a similar cause [quality]. The basis for
regarding the preposition as belonging to the hinge element is the co-selectivity
restriction pertaining between arise and from. The role of the preposition in determining
the linear arrangement of the cause and effect systems is therefore crucial:
V in
[classifier] [Process]
would result
due to microcracking and
Functional mappings on to clausal patterns are illustrated here with respect to the verb
predict in its V wh- configuration. Following on from the discussion in section 4.7.2,
it has been argued that the verb predict should be included as a causative on the basis of
an expanded definition of causation which goes beyond uses as factive
statements. This pattern includes a wh-clause as the object of the verb, which necessitates
a partial revision of the previously-established categories. The label [selector] is
used for example to encode the wh- elements which in effect represent a choice from a
restricted group of individuals, in this case labelled [patient] followed by a clausal
element designated by the semantic category of [bio_function].
[body part]
the amount
of alveolar
dead space
will have
SRA12(347)
Passive patterns
For the purposes of this grammar, the focus on passive forms is restricted to examples of
what might be traditionally called long passives (Quirk et al:160). By long passive it is
meant that the clause includes an agentive phrase; only on this basis is it possible to
identify and link cause and effect functional categories as discussed in chapter 4. This
statement therefore excludes a very large number of short passive causative verbs where
the agentive phrase is omitted. On account of its transitivity, the verb degrade could be
considered causative but in the sentence below the short passive configuration does not
permit identification of the cause:
[31]APEs are microbially degraded into alkylphenol diethoxylates and alkylphenol
In its simplest format, the functional configuration of a passive causative clause maps the
following semantic categories onto the be V-ed by n pattern. The passive verb group
constituting the causal link is included with the effect which comes first in the clause.
be V-ed
is caused by
The example above contains two relatively simple nominal groups constituting effect
and cause functional systems respectively. An analysis of a passivized productive
causation with multiple qualifier elements within the nominals is included below for
comparative purposes:
of incident of
in North
be V-ed
by n
in these two
Here, the effect functional system contains a [quantity] semantic category.
Because this element is relatively empty semantically, there are three separate
qualifier systems which specify the reference in terms of the disease and its
geographical distribution. In the cause nominal the head is categorized as a product
[process] semantic category.
The functional mapping of passive clauses will now be examined with respect to
prepositional / multi-word verbs as in the corpus example below.
Product [quantity]
The overall increase
of the OC
be V-ed for
hinge [prod]
is accounted for
by breast-feeding
Here the passive auxiliary is, the past participle form of the verb accounted and the
preposition for are represented as constituents of the hinge system. In the further
examples below, the crucial linking element here is the preposition through which is
analyzed as constituting part of the hinge uniting the product [process] with the
be V-ed through
Product [process]
the sections
were dehydrated through
serially graded
be V-ed through
The biological
of FGFs
are mediated through
The argument for including both the verb phrase and the preposition within the hinge is
based on the significant co-occurrence identified from the corpus. The combination be V-ed + through therefore forms a meaning unit designated as the hinge element.
5.5.3 Parametric causation
In parametric causation, the effect system involves some form of identifiable change
in a quantifiable biomedical parameter. Essentially parametric causatives can be regarded
along with productive causatives as a subset of Hallidayian material processes. It is
possible of course to argue that a productive process X causes an increase in Y expresses
the same semantic relationship as X increases Y but for the purposes of this grammar,
configurations involving a nominalization of parametric verbs such as increase above
will be regarded as productive causatives.
The analysis of parametric causation is similar to that of productive causatives in that the
causative verb is given the functional label hinge; one important difference however is
the sub-categorization of the nominal group labelled n in pattern grammar terms as
Parameter [temporal]
Postoperative AFIB
the duration
The chief semantic categories which the parameter element can contain are
[temporal] for periods or measures of time (usually related to the course of a disease
or treatment episode), [quantity] for other parametric measurements and for
quantifiable assessments of likelihood and probability, and [reactivity]. In the
example above, the string is analyzed into the hinge[parametric], a parameter
[quantity] and finally a qualifier element which is further sub-categorized with the
semantic category of [intervention].
Using the same verb increase it is possible to see how the parameter label can be
further decomposed to include some assessment of the risk, probability, chance or
likelihood of the effect which is here realized in the qualifier slot. With parametric
causatives, the verb in the hinge is often pre-modified by adverbs with an evaluative,
intensifying function as in significantly in the example below. The example of risk is
illustrative of one of the principal problems which the grammar confronts in terms of
semantic categorization. While a separate category of [probability] might more
accurately encompass strings such as risk, chance or likelihood the incorporation of these
elements into one category of [quantity] serves to reduce an otherwise undesirable
proliferation of functional roles.
A basal maternal
level 1800 µM
(30 mg/dL)
Evaluative [param]
the risk
Qualifier [effect]
for bearing a child
with congenital
heart disease
In the parametric causative clause below, a [biochemical] cause is the initiator
leading to an effect in terms of a change in permeability which is categorized as a
parameter[quantity]. The semantic reference of the parameter system is then
narrowed down with the Qualifier[biochemical] semantic category. This
particular clause also contains a projection element which serves to hedge the
causative verb increase:
petroleum ether [26]
or a 2:1 mixture of
chloroformmethanol [27]
has been
the permeability
of topically
applied lidocaine
and ionized
salicylic acid
Parametric causatives also need to be accounted for in their passive configurations. In the
example below, potency is analyzed as an effect parameter which is given specificity by
the qualifier PP headed by peptide. Both this functional part of the sentence and the
causal_link make up collectively the effect nominal group.
the potency
be V-ed
of the PAR1
can also
by n
be increased
by substituting the serine
residue for the amino acid
5.5.4 Relational causation
Relational causation covers the expression of cause and effect as a (stative) relation of
‘being’ which exists between two nominal entities. In Hallidayian terms, these causatives
correspond to relational processes. Most prototypically this relationship is achieved
through linkage involving a link verb. There are however significant idiomatic
combinations involving delexical verbs which will be explored under the heading of
relational causation in section below. Relational causatives therefore stand in
contrast to the productive and parametric causatives which encode causation through a
transitive causative verb.
Relational causatives with ‘be’ and other copular verbs
In relational causatives, the verb is given the functional label hinge[relational]
to mark the lack of transitivity inherent in link verbs. As will become apparent below,
there are strong similarities between the functional role of the hinge in causation and in
definition and evaluative clauses. This representation is exemplified from the corpus
Food allergy
yet another
of acute,
It can be immediately appreciated that this type of causative pattern resembles the
definition sentence sublanguage in that there are two halves which are ‘matched’ by the
link verb occupying the hinge. However the function of the hinge element is not
necessarily equative as in the case of the equivalence between definiens and definendum
(Halliday ibid.:124). The semantic relationship between the two sides separated by the
hinge is one of identification. In the example above food allergy is identified to the left
of the hinge as a token which Halliday (ibid.:115) describes as being a ‘sign, name, form,
holder or occupant’ of the more general Value. The role of the label Value is to
define the token food allergy as a cause. Thus Value plays an important role in serving
to signal the initiator or agent of causation in identifying relational clauses. In the
example above, the effect is essentially contained within the qualifier, which
is labelled qualifier effect accordingly.
While the local grammar being put forward here does not specifically take up questions
relating to theme and rheme choices (Halliday’s textual metafunction), several comments
can be made at this point. It can be seen that the rheme part of the sentence advances the
communicative message. The token element can be seen in these terms as the point of
departure for the message (in informational terms the given part of the message). The
subsequent rheme develops the message by adding new information in the form of the
expression of a causal relationship between the value and the qualifier as the
These relationships are explored in more detail using the example below.
This tick
is also
the vector
Qualifier cause [Organism]
of Borrelia burgdorferi and
Babesia microti
Here the token element is classified using the semantic category [vector], i.e. the
biological means by which a causative agent (represented by the qualifier
comprising a biological organism) is spread. The nominal head categorized as a Value
to the right of the hinge explicitly labels the organism tick in terms of its function and
provides the basis for further linkage to the qualifier element which lists the
bacteriological cause.
The example below further illustrates these categories, this time with the link verb
become constituting the hinge.
of n
Token [vector]
Aedes albopictus
of LAC virus in
enzootic foci
an important
accessory vector
The token-value configuration shown in the previous examples can of course be reversed,
as in the example shown below. Here the semantic role of Value in defining the token
as a cause is further emphasized with the relational verb include. This time the Value
label with its post-modifying Qualifier element as an effect is to the left of the
hinge, with the specification of the cause as the token to the right.
of postoperative
Hinge [rel]
chest infection in 54
(13%), and respiratory
insufficiency in 21 (5%) of
the cases
So far the examples have been retrieved from the corpus using the lemma cause as the
concordance query item. Equally it is possible to use the lemma effect to obtain further
examples of relational causation as in the example shown below:
Delim [causal]
of n
of inwardly
It will be noted here that essentially the same principle applies, where the nominal heads
activation and cellular effect are also linked in terms of a token-value relationship. In this
particular example, the agentive cause cannabinoid is included as a pre-modifying
[classifier] linked to the effect via the non-finite –ed participle clause.
Delexical relational causatives
A concentration on relational patterns with effect leads us naturally into a discussion of
the delexical patterns described in the previous chapter. Following Halliday’s
(ibid.:119) distinction between intensive and attributive relational clauses, it is proposed
for the purposes of this grammar to regard delexical patterns involving effect as
attributive clauses.
no effect
on n
on the
of early
A temptation in analyzing these patterns is to see them in terms of the carrier/attribute
distinction made by Halliday (ibid.:120). Unlike the intensive relational verbs already
mentioned which are really manifestations of equality or inclusiveness, these delexical
verbs employ a different semantic means in order to create the relationship between cause
and effect. For the delexical verb have, the relationship is a kind of metaphorical
extension of attribute possession; thus a multiple effect on BHR listed under the effect
category is seen as an attribute possessed by the cause IL-10. The nature of this pattern of
co-occurrence is such that a paraphrase could be given using the verb affect + object for
the string have + effect + on + object.
In the example below, the delexical verb is analyzed into the hinge category; the
causative meaning is bound up in the nominal group which justifies the analysis of this
element as a separate category. It can be readily appreciated that the relationship between
the nominal groups IL-10 and multiple effects is not one of (approximate) synonymy or
inclusion. The functional analysis shows how this pattern maps onto the semantic
categories of the local grammar. The effect lacks specificity which is in turn rendered
by the co-occurrence of the prepositional phrase as the qualifier, given here the sub-categorization qualifier [reaction].
Hinge Delimiter[ Effect
on n
on BHR
In the previous chapter it was observed that a significant delexical pattern with the
nominal group headed by role co-occurring with the verb play occurs as a marker of
causative relations in the corpus:
Causal range
a role
in the resolution
of an inflammatory
In the example above the link between the cause [genetic] and the nominal role is
created through the material process verb play. In the above analysis, the strongly
collocating play + role unit is mapped onto the semantic category of causal range
which is particular for this idiomatic combination. The nominal role is then given
specificity with the two qualifier elements further sub-categorized with
[intervention] and [disease] semantic categories.
sun-exposure and other
sources of UV-radiation
Causal range
a major
in v-ing
in causing
such as wrinkles,
roughness, laxity,
mottled pigmentation
This category is essentially a modification of the Hallidayian category of range which
reflects the fact that some sequences of verb + nominal do not realize entirely separate
functional participants (Halliday ibid.:146-149). In contrast to previous analyses, the PP
in causing changes is elevated hierarchically to the level of effect rather than being
analyzed as the Qualifier post-modifying the nominal. Finally the effect nominal
includes two functional categories: the first denoting a general parametric change, the
other a more specific listing of dermatological effects.
Relational causatives and evaluative adjectives
To conclude this brief survey of functional configurations which underpin relational
causatives we may consider the role of evaluative adjectives in the creation of causal
links between nominal groups. On the basis of the corpus evidence it is possible to
circumscribe a relatively small group of evaluative adjectives such as essential,
significant, important etc which are included alongside the link verb in the hinge
These cytokines
are also
for the
In the above example the lexical pattern be Adj for n centred on essential is mapped onto
the functional systems of cause, hinge and effect. The co-occurrence constraint on
the adjective and preposition for according to the idiom principle justifies regarding the
string essential + for as a single semantic unit, followed by the nominal group the
development realizing the semantic category of [process].
5.5.5 Inferential causation
So far the discussion of causative patterns has focused on causal links through verbs
marking transitive processes and relations between nominalized entities. All of these
patterns can be modalized using a variety of linguistic elements (most typically modal
auxiliaries and modal disjuncts in the hinge but also using the projection devices outlined
earlier). At this point a fourth sub-category will be introduced, that of inferential
causation. This sub-type of causation is very common in scientific writing and might
also be seen as a form of hedging, more specifically in the avoidance of a commitment to
a particular directionality in the causative process.
While it may be objected that inferential causatives may be seen as belonging together
with the hedging devices of modalization and projection outlined for productive,
parametric and relational causatives, the grammar accords them separate status as a sub-type in their own right. The basis for this distinction is the fact that the hedge is internal
to the semantics of the verb rather than being achieved externally through modal
auxiliaries, disjuncts or projection devices in the verb group, as outlined earlier.
In the case of inferential causatives such as associate + with, this non-commitment can be
seen as a direct result of the absence of transitivity with these verb combinations. The
statement of causation is not made linguistically explicit through the transitivity of the
verb and the reader is frequently left to draw conclusions about the directionality of the
causal process on the basis of background knowledge. Frequently the inference of
causation is made on the basis of the juxtaposition alone of two neighbouring functional
systems as in the example below.
Causal inference
Local tumour
be V-ed with
Causal inference
was associated with a two- to
in the rate
In the analysis of inferential causation, top-level categories are adopted which are neutral
with respect to the assignment of causal directionality. While it may be possible on the
basis of domain-specific knowledge to manually distinguish between cause and effect, it
would almost certainly prove to be difficult to write software to perform this process
automatically. Consequently it would appear to be more plausible at present to categorize
cause and effect as neutral causal_inference systems either side of the hinge each
of which can then be filled with the appropriate semantic category. It is thus possible for
the grammar and parser to identify a causal relationship as pertaining between two
participants even if the directionality of causation is left unstated.
This analysis is possible because the verb associate encodes a juxtaposition of two
participants rather than directionality in the causative process in the same way as
transitive productive and parametric causatives. Such verbs clearly function as hedging
devices because they enable the researcher to avoid making the commitment to an
explicit statement of cause and effect. The reader is left to make the inferences on the
basis of domain-specific, i.e. non-linguistic, knowledge; indeed, as has been commented on
previously, this reader ratification of causal inferences is an important aspect of the
hedging process (Hyland 1997). As the example below shows, it is possible to infer
[organism] as the agent in the causative process with the resultative
[environmental] effect. The literal meaning of associate can
be taken as merely the statement in this particular case that two observational phenomena
have been observed together or are occurring together. In the inferred causative
interpretation of associate there is a metaphorical extension of this physical co-occurrence into a causal dependency relationship. Here it can be appreciated that it is
specialist knowledge which enables the inference of causal agency to be attached to the
nominal group Background bacteria:
Causal inference       be V-ed with           Causal inference
Background bacteria    are associated with    water pollution....
In order to assign cause and effect interpretations automatically to the causal
inference categories above, it would be necessary to build into the grammar extra-linguistic representations of knowledge, such as the fact that bacteria cause water
pollution and not vice versa.
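The point about extra-linguistic knowledge can be illustrated with a minimal sketch. It is not part of the grammar as implemented in this thesis: the class name, the function name and the tiny knowledge table are hypothetical, and only the category labels [organism] and [environmental] are taken from the example above. The clause is stored with two neutral causal inference slots, and a direction is assigned only when the knowledge table licenses one.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class InferentialClause:
    """A parsed inferential causative with direction-neutral slots."""
    left: str    # causal inference system to the left of the hinge
    hinge: str
    right: str   # causal inference system to the right of the hinge


# Hypothetical extra-linguistic knowledge, stored as
# (cause category, effect category) pairs. Assumption for illustration only.
KNOWN_DIRECTIONS = {("organism", "environmental")}


def resolve_direction(
    clause: InferentialClause, left_cat: str, right_cat: str
) -> Optional[Tuple[str, str]]:
    """Return (cause, effect) if the knowledge table licenses a direction."""
    if (left_cat, right_cat) in KNOWN_DIRECTIONS:
        return clause.left, clause.right
    if (right_cat, left_cat) in KNOWN_DIRECTIONS:
        return clause.right, clause.left
    return None  # direction stays unmarked, as in the grammar itself


clause = InferentialClause(
    "Background bacteria", "are associated with", "water pollution"
)
print(resolve_direction(clause, "organism", "environmental"))
```

When no entry applies, the function returns nothing and the clause keeps its neutral analysis, which mirrors the decision to leave directionality unstated in the grammar.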
The corpus throws up a number of other causative inferences, the identification of which
would prove more problematic for an automatic parser based on the grammar as shown in
the example below:
Causal inference
Blood eosinophils
diminish after
Causal inference
after n
treatment with cyclosporin A (CyA)
The use of inference in the expression of cause and effect is often manifested in the
juxtaposition of functional elements in the sentence as in the example above. Here the
string blood eosinophils diminish represents an observed experimental effect which is
dependent upon the [intervention] cause as a circumstantial element. In the
analysis above, the causative interpretation rests crucially on the preposition after which
together with the verb diminish is accorded the status of hinge. The difficulty here lies
to a large extent in the lexical pattern identification which is a pre-condition to the
functional mapping. Firstly, it would need to be shown, for example, that the string
diminish + after forms a meaning unit, i.e. that the choice of the preposition after is constrained
by the idiom principle. However, as has been argued previously, the preposition
represents an open choice in Sinclair’s (1991) terms which is difficult to account for as a
lexical pattern. A second difficulty is that the interpretation of effect to the left of the
hinge is derived from the combination blood eosinophils + diminish which would involve
breaking up the hinge.
One final functional representation of causal inference is listed below with respect to the
intransitive use of the verb arise juxtaposed with a wh-clause marking a circumstantial
Causal inference
A fracture plane
Causal inference
whCirc [quantity]
when the stress at the tip of the
pore reaches max
will arise
Again the selection of the wh element following the verb arise can be inferred as a
causative link but as previously discussed, this identification is difficult to reconcile with
an idiom-principle determined pattern.
5.5.6 Existential causation
Since the completion of the pilot study (Allen 2002a) more extensive corpus work has
revealed the importance of what will be called in this grammar existential causation. As
this term suggests this sub-category is closely based on the Hallidayian process type
category of existential process. Such clauses can be relatively easily retrieved from the
corpus as they all involve the empty subject there plus copular verb be in addition to a
nominalization of a causal relationship or some synonym of the lexical item relationship.
The function of these sentences is to state the existence of a causal relationship rather
than explicitly stating the directionality of the causative process.
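A rough sketch of how such clauses might be retrieved is given here. It is an illustration only, not the concordance query used for this thesis: the function name, the allowance of up to two intervening modifiers and the short noun list (relationship, correlation, relation, link) are assumptions made for the example.

```python
import re
from typing import Optional

# Existential 'there' + a form of 'be' + optional determiner + up to two
# modifiers + a relationship noun. The noun list is a small illustrative subset.
EXISTENTIAL = re.compile(
    r"\bthere\s+(?:is|are|was|were)\s+"
    r"(?:(?:a|an|no|some)\s+)?"
    r"(?:\w+\s+){0,2}?"
    r"(relationship|correlation|relation|link)\b",
    re.IGNORECASE,
)


def find_existential(sentence: str) -> Optional[str]:
    """Return the relationship noun if the sentence matches, else None."""
    match = EXISTENTIAL.search(sentence)
    return match.group(1).lower() if match else None


print(find_existential(
    "There is a relationship between lipid peroxidation and carcinogenesis."
))
```

A surface pattern of this kind would only delimit candidate clauses; the functional segmentation into causal existent, hinge and causal link would still follow from the grammar.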
The functional representation of this sub-category is given below. In information terms
the extraposed subject there and the copular verb are relatively empty of semantic
Causal existent
There v-link N
There is a relationship
Causal link
lipid peroxidation
Causal link
Here the category causal existent includes the empty subject there plus the copular
verb and the nominal relationship or its synonym. For existential causation, the linking
function of the hinge is typically marked by a preposition, in this case between which
serves to mark the link between the nominalization to the left of the hinge and the
nominal entities included in the causative relationship. As with inferential causation
outlined previously, a causal relationship between lipid peroxidation and carcinogenesis
is inferred rather than stated explicitly. Consequently these nominal groups are labelled
with the semantic categories only; the directionality of the causal relationship is similarly
left unmarked. With respect to the nominalized cause and effect entities, the precise order
of cause-effect or effect-cause is largely determined on the basis of background non-linguistic
knowledge. It is possible therefore with reference to the clause above to interpret the
nominalized biochemical process as a cause of the effect in terms of the development of a
This basic functional format is further developed below in terms of the internal structure
of the cause and effect nominal groups to the right of the hinge.
There vlink N
There was no
Causal link
Causal link
[ ]
Qualifi [Presence]
of AC
squamous cell
and the
The example below illustrates the role of existential causation in bibliographical and
experimental citation. Here, in the absence of the empty subject there, there is a nominal
group which needs to be accounted for in functional / semantic terms. The system
source therefore can encompass bibliographic citations to which the identification of
the causative relationship is attributed:
studies have
The first case
reports that
a link
adverse pregnancy
a link
the use of
valvular heart disease
5.6 Summary
This chapter has set out the systemic-functional basis for the local grammar in terms of
the over-arching systems of cause, hinge and effect. Causation is represented
primarily as linear configurations of these systems along the syntagm. Each system can
be thought of as a set of discrete, paradigmatic choices which are listed as a series of
semantic categories specific to the biomedical domain. The notions of system and
semantic category are then used to exemplify a functional sub-division into productive,
parametric, relational, inferential and existential causatives.
6. Evaluating the local grammar
6.1 General
In this chapter the focus is switched towards an evaluation of the grammar, both in the
analysis of the sublanguage and as the basis for a potential automated parser. In any
discussion of a grammar and descriptions of parsing operations based upon it there is
naturally a significant degree of overlap. The accent in this chapter is on the sequential
steps in which causative clauses found in natural language texts are functionally
segmented in accordance with the grammar outlined in chapter 5.
For illustrative purposes, detailed discussion of the parsing operation is restricted to
productive causation. Finally a preliminary assessment is offered of the results of manual
parsing experiments based on the grammar. These experiments were carried out using a
small test corpus of 13 biomedical texts sampled in accordance with the L of C
classification scheme. The extent to which the grammar can successfully isolate the
sublanguage and carry out parsing tasks of varying degrees of delicacy is thus compared
across the 13 subcorpora categories of the original HBC. It is ultimately envisaged that
the test corpus could embody certain aspects of a training corpus in the sense that the
results of the parsing experiments could then be fed back into an enlarged and improved
grammar and accompanying ontological representation. This process of ‘training’ the
grammar/parser is thus envisaged as operating on a cyclical basis to successively improve
the quality of subsequent parses.
6.2 The parsing process: an overview
In contrast to the definition grammar set out by Barnbrook (1995, 2002), which operated
on a far more restricted sublanguage, the parsing experiments forming the basis for the
evaluation of the causation local grammar have been carried out manually and are not
automated in the form of text matching algorithms. This procedure is justified as the
focus of this thesis is on the grammatical analysis of the sublanguage rather than the
development of software. The goal of the evaluation is to assess the extent to which the
grammar can provide a satisfactory parse of the sublanguage for the purposes of
information extraction.
In a similar fashion to the definition grammar, the parser based on the local grammar of
causation is principally concerned with practical utility. The grammar is also more
closely focused in terms of scope of application than the ‘top down’ formal grammars of
the AI tradition in NLP. Thus a purely linguistic perspective on the assignment of
hierarchical structure in accordance with the grammar needs to be extended to include the
formatting of functional elements which are of value to automated IE applications in the
biomedical domain. Of particular relevance to the informatics perspective is the extent to
which the 37 semantic categories outlined in section 5.4.2 can not only be assigned
correctly but also achieve an appropriate level of balance between specificity,
generalizability and coverage across biomedical text.
As a point of departure in the evaluation of the grammar, it is instructive following
Barnbrook (2002:68-71) to compare the workings of a narrowly focused, semantically-driven framework of a local grammar with the phrase structure grammar of formal
linguistics. The discussion of the parsing operations based on the grammar is very much
indebted to Barnbrook who uses symbols to represent the functional elements identified
in the parsing process. The use of symbols is borrowed in turn from phrase structure
grammar which progressively decomposes a string X into a further structural element(s) Y
using re-write rules of the form X→Y etc.
The flow diagram below summarizes the application of the local grammar as a parser of
causative clauses and clause complexes extracted from biomedical research articles. The
process is divided into pre-processing (stages 1-3) and hand-parsing (4-5) stages with
reference to the five sub-types of causation identified in the grammar. Sub-types of
causation are shown abbreviated as Cpr (productive), Cpa (parametric), Cr (relational), Ci
(inferential) and Ce (existential) respectively.
The parsing process
Stage 1: Input file stage: RA text
Stage 2: Mark-up stage: POS tagging of text + orthographic mark-up (XML
Stage 3: Look-up stage: causative look-up (208 lexical items); identification of hinge
Stage 4: Causative sub-type assignment
If match hinge = VERB [cause, affect, produce etc] → Cpr
If match hinge = VERB [increase, decrease, extend etc] → Cpa
If match hinge = VERB [associate etc] → Ci
If match be + hinge = NOUN [cause, origin, agent etc] → Cr
If match be + hinge = NOUN [effect, symptom, impact] → Cr
If match be + hinge = ADJECTIVE [important, essential, significant etc] + for | in → Cr
If match play + hinge = NOUN (role, part) → Cr
If match have + hinge = NOUN (effect, impact) → Cr
If match there + be + NOUN [relationship, correlation, relation etc] + hinge → Ce
Stage 5: Functional parsing stage
Cpr (active) → (Dc) C (Qc) (A) (Hd) (A) Hi (De) Pr (Qe)
Cpr (passive) → (De) Pr (Qe) (A) (Hd) (A) Aux Hi (Dc) C (Qc)
Cpa (active) → (Dc) C (Qc) (A) (Hd) (A) Hi (De) Pa (Qe)
Cpa (passive) → (De) Pa (Qe) (A) (Hd) (A) Aux Hi (Dc) C (Qc)
Ci → (D) Causallink (Q) (Hd) Aux Hi (D) Causallink (Q)
Cr cause → (D) Ctoken (Qc) (Hd) Vlink (A) | (Hd) (D) Cvalue Qe
→ (D) Cvalue Qe (Hd) Vlink (A) | (Hd) Ctoken (Qc)
Cr effect → (D) Etoken (Qe) (Hd) Vlink (A) | (Hd) (D) Evalue Qe
→ (D) Evalue Qe (Hd) Vlink (A) | (Hd) Etoken (Se)
→ (Dc) C (Qc) (A) (Hd) (A) Hi (De) E (Qe)
→ (Dc) C (Qc) (A) (Hd) (A) Crange (Qe)
Ce → Exthere Vlink Det (D) Rel Hi Causallink Causallink
Explanation of symbols
A adverb
Aux passive auxiliary verb
C cause
Dc delimiter (cause)
De delimiter (effect)
Exthere existential there
Hd hedge
Pa effect parameter
Pr effect product
Qc qualifier (cause)
Qe qualifier (effect)
Rel relationship or synonym
Vlink link verb
In the pre-processing stages, the input text (stage 1) is orthographically marked up in
accordance with the TEI standard and POS-annotated automatically (stage 2). The
causative sublanguage is then delimited from the remainder of the text by
consulting the lexical and pattern database described in chapter 4 (stage 3). This
matching operation identifies the hinge element and any accompanying nominal groups
to the right of the hinge as a prelude to the assignment of the productive, parametric,
relational and inferential causative subtypes (stages 4 and 5). For existential causatives,
identification is made on the basis of the string containing the extraposed subject there +
be + relationship or synonym. Finally the parsing operation proceeds to functionally
segment each causative sentence in accordance with the grammar.
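The sub-type assignment of stage 4 can be sketched in Python as follows. This is a minimal illustration only, not the procedure actually used in the hand-parsing experiments. It assumes that each token arrives as a (lemma, POS) pair with tags such as NOUN, VERB and ADJ, and it contains only the lexical items named in the stage 4 rules rather than the full 208-item lexicon. The names assign_subtype and governing_verb are hypothetical. The play + NOUN (role, part) rule is left out because its target sub-type is not recoverable from the figure.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (lemma, POS); this tag set is an assumption

# Excerpt only: the items named in the stage 4 rules, not the full lexicon.
VERB_HINGES = {
    "cause": "Cpr", "affect": "Cpr", "produce": "Cpr",
    "increase": "Cpa", "decrease": "Cpa", "extend": "Cpa",
    "associate": "Ci",
}
BE_NOUNS = {"cause", "origin", "agent", "effect", "symptom", "impact"}
BE_ADJECTIVES = {"important", "essential", "significant"}
HAVE_NOUNS = {"effect", "impact"}
RELATION_NOUNS = {"relationship", "correlation", "relation"}


def governing_verb(tokens: List[Token], i: int) -> Optional[str]:
    """Lemma of the nearest verb to the left of position i, if any."""
    for lemma, pos in reversed(tokens[:i]):
        if pos in ("VERB", "AUX"):
            return lemma
    return None


def assign_subtype(tokens: List[Token]) -> Optional[str]:
    """Return Cpr, Cpa, Ci, Cr or Ce for the first hinge found, else None."""
    lemmas = [lemma for lemma, _ in tokens]
    for i, (lemma, pos) in enumerate(tokens):
        if pos == "VERB" and lemma in VERB_HINGES:
            return VERB_HINGES[lemma]
        if pos == "NOUN":
            # Existential frame: there + be somewhere to the left of the noun.
            after_there_be = any(
                lemmas[k] == "there" and lemmas[k + 1] == "be"
                for k in range(i - 1)
            )
            if lemma in RELATION_NOUNS and after_there_be:
                return "Ce"
            verb = governing_verb(tokens, i)
            if (verb == "be" and lemma in BE_NOUNS) or (
                verb == "have" and lemma in HAVE_NOUNS
            ):
                return "Cr"
        if pos == "ADJ" and lemma in BE_ADJECTIVES:
            next_lemma = lemmas[i + 1] if i + 1 < len(tokens) else None
            if governing_verb(tokens, i) == "be" and next_lemma in ("for", "in"):
                return "Cr"
    return None
```

A clause with no listed hinge returns None, which corresponds to text falling outside the causative sublanguage.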
The parsing operation based on the grammar is illustrated in detail below with respect to
active productive causation.
6.3 The parsing of productive causatives
6.3.1 Theoretical aspects
The functional elements for productive causation in accordance with the grammar have
been set out in section 5.4.2. In active configurations, the linear arrangement of these
elements can be re-stated using the following symbols (optional elements are stated in
brackets):
(Dc) C (Qc) (A) (Hd) (A) Hi (De) Pr (Qe)
In accordance with the representation above, the three obligatory elements are stated as
the cause (C), the hinge and the effect product labelled Pr. The various delimiter (Dc,
De) and qualifier elements (Qc), (Qe) are optional within the C and Pr nominal groups
respectively and are given the appropriate subscript to denote membership on this basis.
Also optional within the hinge element Hi is the hedge, (Hd) and an adverbial element (A)
which typically occurs in the position to the left or right of the hedge. Adverbial elements
frequently cover epistemic / modal and text cohesive semantic areas. Other terminal
elements such as determiners in the respective nominal groups are not represented as
these elements have little informational value for the purposes of information retrieval.
This is clearly a difference between the local grammar representation and that of formal
grammars which would always represent the terminals of a | an | the etc.
It is envisaged that an automation of the parsing process would work on the following
basis. Firstly the computer would pick up the hinge element Hi by consulting its database
of causative lexis (what will be referred to in this thesis as the causative lexicon
consisting of the 208 lexical items). Contained within the database specification would be
the information that the lexical item identified in the hinge slot is associated with the
productive structural configuration. The computer’s identification of the hinge is thus
crucial to the parsing of the whole sentence as a causative: it identifies the
sentence/clause complex as belonging to the causative sublanguage. Following on from
the work of Barnbrook, the parser can then work by assigning the provisional
designations of Part1, Part2 and Part3 to the cause, hinge and effect elements
effectively identified with reference to the hinge:
The mapping of the functional configurations onto the Part1, Part2 and Part3
elements can now be made with reference to the hinge. Critically, this depends on whether
the pattern is active or passive. The assignment of an active or passive pattern can be
made by examining the lexical contents within the hinge to the left of the verb. In
particular, the presence of a terminal symbol corresponding to the passive auxiliary
verb be (ie is | are | was | were etc) provides the basis for the assignment of the passive
pattern, in which case Part1 would be read as the effect, with Part3 being assigned
the functional role of cause.
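The provisional split into Part1, Part2 and Part3, together with the voice test, might be sketched as below. The sketch assumes that the position of the hinge verb has already been found by the lexicon look-up, and the auxiliary and hedge word lists are illustrative rather than drawn from the database. The function name split_on_hinge is hypothetical. Adverbial elements inside the hinge are not handled, and a form of be before a progressive verb would be misread as passive, so this is a deliberately simplified picture.

```python
from typing import Dict, List, Union

PASSIVE_AUX = {"be", "is", "are", "was", "were", "been", "being"}
HEDGES = {"may", "might", "can", "could", "would", "should"}  # illustrative list


def split_on_hinge(words: List[str], verb_index: int) -> Dict[str, Union[str, List[str]]]:
    """Split a clause around the hinge verb and assign cause and effect."""
    # Extend the hinge leftwards over auxiliaries and modal hedges.
    start = verb_index
    while start > 0 and words[start - 1].lower() in PASSIVE_AUX | HEDGES:
        start -= 1
    # A form of 'be' to the left of the verb signals the passive pattern.
    passive = any(w.lower() in PASSIVE_AUX for w in words[start:verb_index])
    end = verb_index + 1
    if passive and end < len(words) and words[end].lower() == "by":
        end += 1  # 'by' belongs to the hinge in the pattern be V-ed by n
    part1, part2, part3 = words[:start], words[start:end], words[end:]
    cause, effect = (part3, part1) if passive else (part1, part3)
    return {
        "Part1": part1, "Part2": part2, "Part3": part3,
        "voice": "passive" if passive else "active",
        "cause": cause, "effect": effect,
    }


active = split_on_hinge(
    "a defect in hexosaminidase A causes accumulation of GM2 ganglioside".split(), 5
)
# Constructed passive counterpart, used purely for illustration.
passive = split_on_hinge("accumulation may be caused by a defect".split(), 3)
```

In the active clause Part1 is read as the cause, while in the constructed passive clause the roles of Part1 and Part3 are reversed.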
Adopting the generative arrow convention from formal linguistics, it can now be shown
how the Part1, Part2 and Part3 decompose into their respective obligatory and
optional sub-elements:
Part1 → (Dc), C, (Qc)
Part2 → (Hd), (A), Hi
Part3 → (De), Pr, (Qe)
The next stage of the parsing process is to further decompose the symbols to the right of
the arrow. Here the parse can be facilitated if the test corpus has been POS-tagged during
the pre-processing stages. Assuming that the nominal elements in Part1 and Part3
can be identified effectively as the heads of nominal groups, this information can lead to
the assignment of the obligatory elements of C and Pr respectively. With the
identification of these obligatory elements in place, the parse can then proceed to identify
the delimiter and qualifier elements in the nominal groups with reference to
the nominal head. In terms of the delimiter category coming before the head, the
assignment can be made on the basis of POS tags for adjectives. More problematic would
be a parse which draws on the semantic distinction made in section between
[classifier], [epithet] and [causal] as finer-grained analyses of the
delimiter category:
Dc,e → Classc,e, Epithetc,e, Causalc,e
The sub-categories of [classifier], [epithet] (evaluative) and [causal] are optional in
both cause and effect nominal groups. The task of assigning the correct functional
interpretation rests on pre-specifying the lexical items which can be allowed in each
category. POS tagging can in some cases resolve the problem: classifier category
pre-modifiers often consist of nouns with an adjectival function such as heart bypass etc.
Potentially however there is a degree of overlap between the classifier and evaluative
categories in the sense that adjectives can occur in both as terminals. For the purposes of
the evaluation of the grammar there is a need to specify the lexical items ahead of the
computer processing.
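The decomposition of a Part1 or Part3 nominal group into delimiter, head and qualifier on the basis of POS tags can be illustrated with a short sketch. Two simplifying assumptions are made which are not claims of the grammar itself: the head is taken to be the last noun before the first preposition, and every adjective or noun before that head is treated as a delimiter. The tag names and the function decompose_nominal_group are hypothetical.

```python
from typing import Dict, List, Optional, Tuple, Union

Tagged = Tuple[str, str]  # (word, POS); tag names are an assumption


def decompose_nominal_group(
    tagged: List[Tagged],
) -> Optional[Dict[str, Union[str, List[str]]]]:
    """Split a nominal group into delimiter words, head and qualifier words."""
    # The first preposition marks the start of the post-modifying qualifier.
    boundary = next(
        (i for i, (_, pos) in enumerate(tagged) if pos == "ADP"), len(tagged)
    )
    pre = tagged[:boundary]
    noun_positions = [i for i, (_, pos) in enumerate(pre) if pos == "NOUN"]
    if not noun_positions:
        return None  # no nominal head could be identified
    head_index = noun_positions[-1]
    # Determiners are skipped: they carry little informational value.
    delimiters = [w for w, pos in pre[:head_index] if pos in ("ADJ", "NOUN")]
    return {
        "delimiter": delimiters,
        "head": pre[head_index][0],
        "qualifier": [w for w, _ in tagged[boundary:]],
    }


cause_group = decompose_nominal_group(
    [("a", "DET"), ("defect", "NOUN"), ("in", "ADP"),
     ("hexosaminidase", "NOUN"), ("A", "NOUN")]
)
noun_premodified = decompose_nominal_group(
    [("heart", "NOUN"), ("bypass", "NOUN"), ("surgery", "NOUN")]
)
```

The second call shows noun pre-modifiers of the heart bypass type being collected in the delimiter slot; the head surgery is added here only to complete the group.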
Following Barnbrook (2001:70) and in agreement with Sager (1981) it is possible to
exploit the fact that we are dealing here with a sublanguage. While the number of
evaluative adjectives available in the general language is potentially very large, in
practice the sublanguage constraints alluded to in chapter 2 apply. It is therefore possible
to specify a number of adjectives as evaluative on the basis of corpus study by collecting
lists of adjectives which encode researcher / writer epistemic evaluations. These would be
listed as terminal symbols in the re-write rules, as shown below.
Epithetc,e → significant | important | essential |etc
Causalc,e → induced | affected | produced etc
The procedure adopted therefore would be to regard all lexical items not circumscribed
by the epithet or causal categories as occupying the classifier category.
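This default-to-classifier procedure amounts to a simple look-up, sketched below. The two word lists contain only the terminals given in the re-write rules above and would in practice be extended from corpus study. Treating the final part of a hyphenated pre-modifier (for example drug-induced) as the look-up key is an added assumption, and the function name classify_delimiters is hypothetical.

```python
from typing import List, Tuple

# Terminals taken from the re-write rules; not an exhaustive listing.
EPITHET = {"significant", "important", "essential"}
CAUSAL = {"induced", "affected", "produced"}


def classify_delimiters(premodifiers: List[str]) -> List[Tuple[str, str]]:
    """Label each pre-modifier as epithet, causal or (by default) classifier."""
    labelled = []
    for word in premodifiers:
        # Assumption: in a hyphenated compound the last part decides the label.
        key = word.lower().split("-")[-1]
        if key in EPITHET:
            label = "epithet"
        elif key in CAUSAL:
            label = "causal"
        else:
            label = "classifier"  # everything not pre-specified
        labelled.append((word, label))
    return labelled
```

For instance, classify_delimiters(["significant", "heart", "bypass"]) labels significant as an epithet and the remaining two items as classifiers.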
As mentioned above, the specification of terminal elements in this manner marks a break
from the conceptualisation of the parser envisaged in the definition sentence local
grammar. This observation can be made despite the fact that a very strong case has been
made for the definition sentences as a functionally-constrained sublanguage subject to
lexical and semantic restrictions. In any case definitions are much more restricted in
terms of structures available. Even allowing for these restrictions, it is still necessary to
specify the lexical elements as far as possible on the basis of the empirical evidence from
the corpus and also additional lexical items from a thesaurus. The specification of lexical
items as a means of identifying functional elements partly constitutes the representation
of biomedical knowledge in the form of an ontology discussed below.
So far in both Part1 and Part3 the delimiter category has been identified in relation
to the central cause or effect nominal heads. It is necessary at this point to turn our
attention to the identification of the qualifier elements which are similarly common
to the cause and effect functional categories. As was mentioned in the previous chapter,
the qualifier function encompasses what would be referred to in traditional terms as
prepositional phrasal and non-finite clausal elements postmodifying nominal heads. In
the case of prepositional phrases, the first stage of the identification process therefore is
for the parser to be able to pick up the prepositional element immediately following the
nominal head element. This process can be shown with reference to the example in 6.3.2
below which has two qualifier elements following the cause and effect nominal heads
respectively. In section the evidence suggests that many qualifier elements can be
identified with respect to the closed class of prepositions: in, of and at, with and for
would seem to be the most common. These prepositions might therefore constitute
boundary markers which would permit delineation of the qualifier functional category
from the nominal head. Having identified each qualifier element on the basis of the
prepositions, the parsing process moves on to the assignment of the semantic category of
each qualifier. Referring back to chapter 5.4.2 for example, this would mean
attaching the semantic labels [genetic], [body_part], [patient] etc under the
respective qualifier functional category. In order to be able to perform this
operation, the parser would have to be equipped with an ontological listing of lexical
items in each of these categories which would permit identification of these elements.
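The use of prepositions as boundary markers, followed by an ontology look-up, might be sketched as follows. The preposition set is the one named above. The ontology is a toy, hypothetical dictionary standing in for the much fuller listing that the parser would require, and the function names are likewise assumptions.

```python
from typing import Dict, List, Optional

BOUNDARY_PREPOSITIONS = {"in", "of", "at", "with", "for"}

# Toy, hypothetical ontology entries for illustration only.
TOY_ONTOLOGY: Dict[str, str] = {"neurons": "cell", "retina": "body_part"}


def split_qualifiers(post_head: List[str]) -> List[List[str]]:
    """Group the words after a nominal head into preposition-led qualifiers."""
    qualifiers: List[List[str]] = []
    for word in post_head:
        if word.lower() in BOUNDARY_PREPOSITIONS:
            qualifiers.append([word])      # a preposition opens a new qualifier
        elif qualifiers:
            qualifiers[-1].append(word)    # otherwise extend the current one
        # words before the first preposition are ignored in this sketch
    return qualifiers


def label_qualifier(chunk: List[str], ontology: Dict[str, str]) -> Optional[str]:
    """Return the semantic label of the first word found in the ontology."""
    for word in chunk[1:]:
        label = ontology.get(word.lower())
        if label is not None:
            return label
    return None


chunks = split_qualifiers("of GM2 ganglioside in RGCs and other neurons".split())
labels = [label_qualifier(chunk, TOY_ONTOLOGY) for chunk in chunks]
```

A qualifier whose words are all missing from the ontology receives no label, which reflects the dependence of this step on the coverage of the ontological listing.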
6.3.2 An example from the test corpus
The parsing process outlined above can now be exemplified with reference to a sample
causative element extracted from the test corpus. The first stage in the process is the
identification of the causative element from the RE (Ophthalmology) test text sample
shown below using the causative lexical item causes which is listed in the database of
causative lexis and lexical pattern described in chapter 4. Assuming that the text has been
POS-tagged, it is envisaged that the initial causative element could be isolated by picking
up the nominal elements to the left and right of the verb. This process is relatively
non-problematic for the noun accumulation to the right of the verb. The matching of the
cause with the element a defect in hexosaminidase A to the left of the hinge assumes the
recognition of the head defect and postmodifying qualifier in hexosaminidase A as
belonging to the same nominal group element. Results from the Bank of English indicate
that a general language parsing framework such as a constraint grammar (Karlsson et al
1995) could serve as the basis for the identification of the interdependency between the
prepositional phrase containing the qualifier and the nominal head.
Storage diseases
Storage diseases are those in which metabolic abnormalities, typically in enzymes associated with cellular
biochemistry result in build-up of intermediate products which deposit in cells. Classic examples of this are
the lipid storage diseases and gangliosidoses (Cairns et al., 1984; Cogan et al., 1984; Goebel et al., 1992;
Palmer et al., 1985; Usui et al., 1991). For example, in Tay-Sachs (GM2 gangliosidosis type 1), where a
defect in hexosaminidase A causes accumulation of GM2
ganglioside in RGCs and other neurons. This results in a cherry red spot in the fovea, which actually is the
normal choriocapillaris surrounded by ganglioside-laden RGCs (which are absent in the fovea). RGCs
progressively die, and in so doing result in loss of the cherry red spot.
3.6. Neoplastic
Neoplastic diseases directly affecting the RGC body are unusual. The most common primary retinal
neoplasm is retinoblastoma, but the cell of origin is controversial (He et al., 1992; He and Inomata, 1993;
Kivela, 1998; Yuge et al., 1995). Other tumors, such as metastatic neoplasms, can affect the RGC
indirectly, but primary RGC neoplasms are unknown.
As described above, pre-processing of the text by inserting POS tags would greatly assist
the process by which causes is identified as a verb (to distinguish from its nominal use
etc). This would enable the clause to be split up into part1, part2 and part3 for the
purposes of the parsing process. The absence of the passive auxiliary verb be in the hinge
leads to the assignment of the active pattern V n which then maps on to the over-arching
top level functional categories of cause, Hinge and effect respectively. The parser
then looks to assign nominal head status again on the basis of POS tags to the heads
defect and accumulation respectively as cause and effect. In the case above, the delimiter
slots are not filled; instead the parser works to identify the two post-modifying qualifier
elements in hexosaminidase A and of GM2 ganglioside respectively. The situation is
complicated in the example above by the presence of a third qualifier element in RGCs
and other neurons which is identified by the boundary marker of the preposition in.
The specific labelling of the semantic categories under the qualifier headings requires
access to an ontology. In the first qualifier element in hexosaminidase A, some
assistance can be drawn from the regular morphology –dase which marks this particular
nominal group out as an enzyme. In this case however two qualifier elements are headed
by abbreviations of nominals. The correct labelling of the qualifier elements would
therefore firstly involve the normalization/anaphoric resolution of the abbreviation with
the nominal group (usually introduced fairly early on in the text). On this basis, RGC is
normalized with retinal ganglion cells. The second abbreviation, GM2, is more difficult to
resolve without expert domain specific knowledge as the full listing is not given in the
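[Figure: functional parse of the example clause a defect in hexosaminidase A causes accumulation of GM2 ganglioside in RGCs and other neurons; surviving labels: [biochemical], [Prod]]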
6.4 Evaluative criteria
Although the sublanguage of causation is characterized to a very large extent by
grammatically well-formed declarative clauses, the grammar needs to be able to cope
with the high degree of information compaction in nominal groups which is prevalent in
the sub-genre. Frequently causation is found at both nominal (ie within the nominal
group) and clausal levels (between nominal groups) within the same clause which creates
added demands on the grammar in terms of identifying functional elements and their
hierarchical relationships. Furthermore the extent to which the identified lexical patterns
map on to the functional-semantic elements needs to be fully assessed.
The evaluation of the grammar can be carried out on a number of different levels. It is
important to define however what is meant by level in this context. In this analysis, the
term is deliberately ambiguous, applying not only to the levels of linguistic analysis (ie
phrasal, clausal etc) involved but also to the demands made on the grammar in
performing a satisfactory parse as a basis for information extraction. These criteria
collectively define a number of successively more exacting evaluative benchmarks
ranging from low to high level. At the lowest level, an evaluation can be made as to the
extent to which the lexical items in the database can delimit the sublanguage of causation
within the sub-genre. The lexical database described in chapter 4 for example contains a
total of 208 lexical items which on the basis of the corpus evidence play a significant role
in causation within the sub-genre. Using these lexical items as search nodes, it is possible
to assess the extent to which the sublanguage thus identified on a lexical basis can be
isolated from the remainder of the text. Such a process constitutes an important check on
the lexical coverage of the grammar but stops some way short of an in-depth parsing
In the progression from lower to progressively greater evaluative demands made on the
grammar, a second evaluative criterion is the extent to which the manual parse can
correctly assign cause and effect functional designations to the strings in relation to
the hinge element. In the case of productive and parametric causation the directionality
of the causal relationship rests on the identification of the verb in its active or passive
form, in other words, in pattern grammar terms, whether the verb exists in the form V n or
be V-ed by n etc. As will be shown below, other cases of assignment especially in what
has been referred to in chapter 5 (section 5.4.5) as inferential and existential causation
can be more problematic as the distinction between cause and effect nominals is based on
extralinguistic knowledge which it is difficult to build into software applications.
On the basis of a successful parse into the basic functional components of cause,
effect and hinge, it is now possible to assess the extent to which finer-grained and
more informationally-rich parses can be achieved either through hand-parsing or,
ultimately, automation of the parsing process. This process essentially involves not only
sub-categorizing the functional components of cause and effect into any one of the
semantic categories identified in section 5.4.2 but also parsing the internal structure of the
nominal groups involved in the expression of cause and effect. The parser needs to
identify for example the adjectival delimiter element coming before the nominal
head in addition to the non-finite clauses and prepositional phrases making up the
qualifier elements. As has previously been remarked upon, an important criterion
against which the grammar and parser may be assessed is the extent to which this
information can be extracted from multiply-embedded nominal groups. This enterprise
involves not only a detailed refinement of the qualifier semantic categories but a
fairly exhaustive ontological listing of the lexical items which signal these semantic
categories. The nature of such an ontology is discussed in section below.
Finally the evaluative process can be considered in terms of corpus homogeneity. The
local grammar is intended to form the basis for parses of causatives right across the broad
spectrum of the biomedical informatics domain. There are for example considerable
differences in terms of rhetorical macro-structure between the more experimental
sub-genres such as pharmacology and clinical domains exemplified by nursing. Experimental
RAs driven by the need to prove a causal hypothesis frequently adopt the IMRD structure
in which causation is rhetorically important. In contrast, the case study tendencies of the
clinical domain show a preference for evaluation. While it might be tempting to regard
the corpus as broadly representative of the biomedical sub-genre, in practice biomedical
texts are far from being homogeneous in terms of the frequency, diversity and rhetorical
importance of causative lexical patterns. Texts sampled from individual subcorpora thus
impose varying lexical grammatical demands on a grammar/parser.
6.5 Evaluative procedure
This section presents a preliminary evaluation of the grammar based on hand-parsing
experiments using the test corpus described in 6.1. The results from the pre-processing
and parsing stages are presented in the table below. The table firstly lists the
frequency of all nominal or clausal causative elements identified within each sub-corpus
category text (RA-RT) of the test corpus. The second column provides an estimate of the
percentage of nominal or clausal causative elements which are matched both lexically
and in terms of their lexical grammatical patterns. Thus a causative element encountered
in the test corpus centred on the verb cause in its V n pattern is recorded as a match as
this lexical /pattern exists in the database. The pattern/lexis matching with causative
elements is further broken down into the five sub-categories of causation Cpr, Cpa, Ci, Cr
and Ce. The table also specifies the number of absent lexical items (the most serious
weakness of a lexical grammar in terms of coverage) and absent patterns. The
designation ‘absent’ is used in this context to encompass lexis and / or patterns in the test
corpus which are semantically causative but missing from the original database listing
compiled from the HBC. The column headed total parseable elements lists the total
frequency for all potentially parseable elements in the causative sublanguage of each text
corpus file. Thus a file containing five active productive causative clauses each
potentially parseable into Dc C Qc Hd Hi De E Qe, would give a frequency of 5 x 8 = 40
potentially parseable elements. Out of this number, which represents a maximum score
in terms of parsing efficiency, the table provides a count of the problematic elements
representing potential parsing failures. As will be described below problematic parses can
be the result of both syntactic complexities and/or difficulties in attaching functional
/semantic labels to the elements in question.
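The arithmetic behind the total of potentially parseable elements and the percentage of problematic parses can be made explicit in a few lines. The clause count and slot string are those of the worked example above, whereas the figure of four problematic elements is purely hypothetical and is not a result from the test corpus. The function names are assumptions.

```python
def potential_elements(n_clauses: int, slots_per_clause: int) -> int:
    """Maximum number of parseable elements for a set of like clauses."""
    return n_clauses * slots_per_clause


def problem_rate(problematic: int, total: int) -> float:
    """Problematic elements as a percentage of all potentially parseable ones."""
    return 100.0 * problematic / total


# Eight functional slots in the active productive configuration.
slots = "Dc C Qc Hd Hi De E Qe".split()
total = potential_elements(5, len(slots))   # 5 x 8 = 40
rate = problem_rate(4, total)               # 4 is a hypothetical count
```

Since optional slots are counted whether or not they are filled, the total represents an upper bound on the parsing task rather than a count of realized elements.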
Hand parsing results from the test corpus
L of C
causative 15
matched 85.7 84.6 60.9 87.5 84.2 93.5 77.3 72.3 64.1 96.4 100
lexis/missed pat 1
Tot. pot. parseable 61 71 210 54 201 116
Tot. est. % parse 18 10.1 10.4 25.9 14.4 9.5 13.5 20.4 11.1
6.6 Discussion
6.6.1 Lexical coverage and pattern matching
The most important result to emerge from the pre-processing and parsing evaluations is
the percentage of matched causative elements. This figure ranges from 60.9% in the case
of the RC (internal medicine) sub-genres to 100% for the RS and RT sub-genres
(pharmacy and nursing respectively). On this basis it would appear that the lexical and
pattern coverage of the databases described in chapter 4 is relatively good although the
small size of the test corpus necessitates the sounding of a note of interpretative caution.
Overall the grammar achieves reasonably satisfactory results in terms of delimiting the
causative sublanguage from the remainder of the text.
The differing percentages of matching are however more difficult to interpret
comparatively across the various sub-genres. The RJ article, while initially sampled on the
basis of the ophthalmological criteria, discusses in considerable detail glioma tumour
development and thus overlaps substantially with sub-genre RC (internal medicine). On
the other hand the parsing of the dentistry article (RK) proved to be more challenging for
the grammar (matching score of 72%), given the more specialised lexical areas which this
sub-domain encompasses, such as specific dental materials and equipment. The
remaining relatively low matching score is for the dermatology article. As remarked upon
above, this article differs from the other RAs in the test corpus owing to the position
paper/review article nature of the text. This article mainly deals with theoretical aspects
of the research consensus and thus causation is very much more prevalent than in a more
experimentally-orientated empirical paper.
The evaluation of the grammar/ parser is continued in more detail below from the point
of view of syntactic, semantic and textual considerations.
6.6.2 Syntactic considerations
Word order
As mentioned previously the sublanguage of causation being identified by the lexical /
pattern database consists to a very large extent of declarative SVO elements which
present relatively few departures from the canonical word order of English. The most
important dimension of syntactic variation is that of voice, which influences the
directionality of causation and the labelling of the cause and effect semantic elements.
However despite these general tendencies it is the case that the sublanguage sometimes
does present the local grammar with significant word order challenges.
As the example below shows, researchers do on occasion exploit marked and unmarked
word order permutations to change the information focus within the sentence. It may be
the case that the causative linkage is topicalised in Hallidayian terms by fronting the
adverbial element containing the Hinge and Cause functional elements. For a parsing
operation based on the identification of elements either side of the hinge, the delimitation
of Cause and Effect elements between turnover and bone formation might prove
As a result of increased bone
Product [process]
bone formation marker levels in serum rise
Corpus file: RG Test
Discontinuous elements
In the example below a directionally straightforward analysis of a productive causative
realized through the hinge result from is complicated by the discontinuous element rather
than which disrupts the pre-modifying adjectives in the nominal group (see Quirk et al
ibid.:1348 for a discussion of this point). Essentially what this would mean is that the
parser would not be able to proceed beyond the first classifier adjective systemic
unless a category comparative signalled by rather than could be set up.
[32] enhanced levels of mucosal immunity may result from systemic rather than mucosal
Corpus file: Test RC
It might be possible on the other hand to regard the task of parsing the element rather
than as being unnecessarily delicate for the local grammar. The question can therefore be
framed in terms of what syntactic level of detail is appropriate for the extraction of
meaningful biomedical information. Following Barnbrook (2002: 195-200), a case could
also be made for the parsing of these elements using a general grammar.
A further example of the parsing problems brought about by discontinuous elements is
provided in the relational causative analysis below:
Token [cell]
….cell type
the critical
determining if
and when a
cell will
Corpus file: Test RE.
In this example the chief difficulty is encountered with the coordinated conjunctions if
and when which occur between the non-finite verb determining and the
nominal group cell.
Verbs in phase
A further syntactic problem occurs in the verb phrase as illustrated in the example below.
Here the two verbs proliferate and yield are in phase (Sinclair et al 1990:184ff, Downing
and Locke 1992:328-331). There is no suggestion here that the idiom principle might be
in operation which would otherwise predict a co-occurrence between the two verbs.
Instead the grammar/parser has to be able to interpret the two verbs as a single functional
unit, only one of which (in this case yield) is part of the lexical database:
[body part]
proliferates to
the clonal column
of n
Qualifier [cell]
of later generated
Corpus file: RE Test
Head categorization
Further difficulties are encountered in terms of the information structure of the noun
phrase. Frequently it is the case that the pre- and more often the post-modifying elements
(referred to in SFG terms as qualifier elements) carry the burden of information
content. The head is therefore relatively empty in informational terms. The question is
therefore whether to classify cause or effect elements by the empty head (factor) or by
the informationally richer evaluative (the primary exacerbating) and qualifier
elements (of inflammatory skin diseases).
may be
the primary
of n
of inflammatory
skin diseases
Corpus file:RL Test
A similar problem is found in abstract noun-headed elements as in the example below:
the formation
of retinal
be V-ed
by n
has been
by the fact
that most markers of
retinal cells display
orderly arrays in
Corpus file:Test RE
In this example, the abstract noun fact is complemented by a finite appositive clause
which carries the main information burden of the effect. It is proposed therefore that the
overall classification of the cause or effect should be made with reference to the element
which is informationally richest whether head or pre-/post-modifying element. On this
basis the effect would be listed using the purely descriptive label [process].
Multiple-embedding
As mentioned previously multiple levels of embedding raise considerable difficulties for
a grammar conceived as the basis for an automatic sublanguage parser.
The example below is fairly typical with a non-specific head effect post-modified with
two qualifier elements. In fact even this representation is a simplification, because there
is a clausal element after her OGC was diagnosed and radiated embedded within the
qualifier [temporal] element which is in turn embedded within the
qualifier [disease] element. This problem raises a similar question to that
discussed above, namely what the ultimate level of analysis for the local
grammar should be. On a maximal level for example this might involve the
representation of three separate qualifier elements embedded within each other.
However the necessity of providing such a detailed analysis for the purposes of
information retrieval can be questioned assuming the syntactic and semantic operations
of the parsing analysis are possible.
V from
Effect [outcome] Qualifier
One patient died after diagnosis
5 years
the effects
of n
of a
developed 3
years after
her OGC
and radiated
Corpus file:Test RJ
6.6.3 Semantic categorization
The results from this preliminary survey indicate that the percentage of problematic
parsed elements ranges from approximately 8% in the case of RL (dentistry) to 25.9% for
RD (surgery). These problems encompass difficulties in disambiguating genuine
causation from the remainder of the text as well as syntactic, semantic and textual problems
which will be illustrated in more detail below. Once again it is difficult to explain these
differences purely in terms of subgenre-based differences between the various texts. As
mentioned above, the application of consistent semantic criteria to the manual
identification of semantic category elements is a problematic undertaking leading to a
wide possible margin of error. The potential problems involved in translating the
grammar into an automatic parser will now be dealt with in more detail below.
Definitions of causation
As outlined in chapter 3, the definition of causation used in this thesis has been
deliberately widened from previous factive definitions of cause and effect stemming from
generative linguistics to encompass the perspective of envisaged IE applications. The
difficulties which this stance entails can be illustrated with reference to one frequently
problematic situation with the verb correlate + with. In this thesis it has been argued that
many causal links are not in fact stated explicitly; instead the researcher uses a
number of linguistic devices such as forms of hedging which ask the reader to make
causative inferences between juxtaposed nominal groups as in the example below:
[33] DNA encoding cytokines known to emphasize components of immune defense that
best correlate with immune protection.
Corpus File:Test RC
The claim made here is that, while there is no explicit causative link contained in the
sentence, the inferences made through the verb correlate + with constitute important
components of information contained in the text given the perceived future use of the
grammar in information retrieval / extraction applications. Similarly the semantic domain
of mediation is also included in the grammar although this might be excluded from
strictly factive definitions of causation:
[34]….a variety of immune components may mediate protection at mucosal sites.
Corpus file: Test RC
Clearly the categories of agentive cause and mediator represent rather different
semantic areas. Again based on practical considerations as to the value of information
retrieved from the text, the decision was made to include mediation within causation but
to mark the semantic category as cause [mediator].
Semantic classification and ontological representation
In a functional grammar, the semantic classification of categories assumes key
importance. As mentioned in section 5.4.5 it is desirable to achieve some sort of balance
between a parsimonious representation (ie reducing as far as possible the number of
categories) at one end of the scale while at the same time maximising the specificity of
the categories in terms of capturing significant elements of information. Categories which
are general can conceivably be applied right across the biomedical domain (thus
alleviating the need to write separate grammars for each biomedical sub-speciality) but
correspondingly lose their value in information retrieval/extraction applications.
Similarly an excessive proliferation of sub-speciality specific semantic roles is difficult to
incorporate into the local grammar. It might be argued that a proliferation of roles is more
consistent with the notion of a ‘micro’ as opposed to a local grammar as discussed in
Allen (2002a:18). This raises the question as to how many specific categories are needed
for the biomedical domain. The goal of the grammar is that it should be applicable to any
text from the biomedical domain.
With regard to categorization, it is desirable of course that elements such as diseases,
treatment episodes, genetic and biochemical participants etc should be incorporated into
the grammar as causes or effects. These very specific elements do not raise particular
problems in the hand parsing operation (apart from lack of domain-specific knowledge in
some cases which made categorization difficult). More problematic are general nouns
such as situation, problem, difficulty, system etc which are relatively lacking in
informational content and may in fact be anaphoric /cataphoric in reference, as illustrated
in the example below:
Product[ ]            be V-ed                      by n
these difficulties    can be largely eliminated    by fast, high-resolution
Corpus file: Test RD
On a related point, it might be objected that the category process is far too wide for
the purposes of providing the basis for useful information retrieval. As mentioned
previously the morphology of process nominalizations such as -age, -al, -ance, -ence, action, -ing, -ion, -ism, -ization, -isation, -ment, -osis, -sis, -th, -ure etc could be
exploited in the automatic parsing process but this leads to a very general category with a
loss of informational focus. It would seem to be difficult to envisage a further subdivision of the category without recourse to domain expertise in classification.
No matter how close the grammar comes to the achievement of a balance between
parsimony and specificity, the test corpus provides examples which seem to elude ‘watertight’ classification, as illustrated by the nominal group-internal causation in the example below:
Minimal distance spacing
Qualifier [Effect]
otherwise random
Corpus file:Test RE
It would seem difficult to incorporate the string Minimal distance spacing rules into any
of the existing semantic categories as listed in 5.4.2. The only solution here would appear
to be the creation of additional ad hoc category rules which leads full circle back to the
problem of the proliferation of categories.
The automatic categorization of these semantic elements would conceivably rest on their
identification by means of what might be referred to as lexical signals. By way of
illustration, possible lexical signals are listed (by no means exhaustively) for the semantic
category of intervention:
Semantic category: [intervention]
Lexical signals: preventive, palliative, treatment, procedure, operation, therapy, medical
care, -therapy (eg cardiotherapy etc), cure, attention, hospitalization, regimen,
regime, transplant, surgery, surgical removal (list: -ectomy: mastectomy,
ureterectomy etc), intervention, administration, screening program
The use of lexical signals would appear at the present stage to be the chief means of
enabling a computer parser based on the grammar to automatically identify and
categorize the semantic elements. There are a number of implicit assumptions associated
with this approach. Firstly and most obviously there is the assumption that an ontology
encompassing the entire biomedical domain is in existence. Such an ontology is similar to
a thesaurus in the sense that information is structured as a set of concepts, axioms and
relationships which describe a specific knowledge domain. The question therefore is to
what extent an ontological representation could be produced to enable these semantic
categories to be picked up on an automatic basis. In the case of diseases, genetic
nomenclature, and DNA sequences etc there exist a number of steadily expanding on-line
databases (footnote 30) which could be used specifically for the purpose of semantic categorization.
For very specific items such as disease nomenclature this does not represent a significant
problem as these items can be consulted in the database.
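By way of illustration, the recognition of a semantic category through lexical signals might be sketched as follows. This is a minimal sketch only and not an implementation of the parser: the [intervention] signals are a subset of those listed above, while the small [disease] list, the function name categorize and the treatment of suffix signals are assumptions made purely for the purposes of the example.

```python
import re

# Hypothetical signal lists. The [intervention] entries are a subset of the
# lexical signals listed in the text; the [disease] entries are invented
# for illustration only.
LEXICAL_SIGNALS = {
    "intervention": {
        "words": {"treatment", "procedure", "operation", "therapy", "surgery",
                  "transplant", "regimen", "hospitalization", "administration"},
        "suffixes": ("therapy", "ectomy"),
    },
    "disease": {
        "words": {"tumor", "pneumonia"},
        "suffixes": ("itis",),
    },
}


def categorize(nominal_group):
    """Return every semantic category signalled by a token of the nominal group."""
    tokens = re.findall(r"[a-z]+", nominal_group.lower())
    found = set()
    for category, signals in LEXICAL_SIGNALS.items():
        for token in tokens:
            # A token signals the category as a whole word or through a suffix.
            if token in signals["words"] or token.endswith(signals["suffixes"]):
                found.add(category)
                break
    return found
```

The function returns a set rather than a single label, so that a nominal group containing signals for more than one category is not forced into a single slot. A nominal group with no recognized signal returns an empty set, which corresponds to the cases the text describes as difficult for automatic categorization.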
For some categories however the listing of all possible lexical signals for the recognition
of the category represents at present at least a seemingly intractable undertaking. In the
example below, the semantic category of [substance] is analyzed for the qualifier
element of scale in the grammar:
[Parsed example, only partly recoverable: V from: results from; qualifier fragments: on the / of scale]
Corpus file: Test RL
Footnote 30: See for example the classification of human diseases by chromosome stored at The Genome Database,
The problem which [substance] demonstrates is the difficulty in defining the lexical
contents of what is a relatively abstract and non-technical semantic domain.
[substance] occupies very similar semantic space to Roget’s overarching category of
Matter. At lower and more terminologically circumscribed sub-divisions within Matter
such as Minerals and Metals (footnote 31) the process of categorization becomes progressively
easier in terms of developing taxonomic listings covering all included items. Lexical
items such as scale which have a semi-technical usage are therefore more problematic for
the ontology constructor.
In section 5.3.1 the question of Aristotelian approaches to categorization on the basis of
binary features was discussed. With regard to the local grammar the adoption of a binary
category perspective makes it difficult to recognize the fact that an element can be
simultaneously part of two or more categories. There are a number of problems entailed
in this adoption of a uni-functional categorization of each element. An illustration is
provided below:
Effect [disease]    Effect [symptom]    V from            Cause [disease]
with hemiparesis    and seizures        resulting from    tumor or its treatment
In particular attention can be drawn to the categorization of the lexical item seizures. The
categorization of hemiparesis as a disease is relatively unproblematic through
consultation with a database of diseases / medical conditions. The question remains
however as to the characterization of seizures as [disease] or [symptom] as
conceivably this item could belong to both categories. In building a grammar based on
semantic categorization, it is necessary to develop an ontological representation which
identifies lexical items as symptoms linked to specific diseases in the form of inclusion
relations etc.
Footnote 31: Roget’s Thesaurus 4th Edition section 383, p283
Finer-grained subdivisions of categories
As in section the category of delimiter is further subdivided into
[classifier], [evaluative] and [causal] respectively. Results from the hand-parsing experiments would appear to confirm the utility of these sub-categories although
the test corpus does provide some examples which suggest that a finer-grained
classification of delimiter would have been preferable:
that has
a weak
[ ]
Corpus file: Test RG
Here the alternative classifier elements androgenic / estrogenic / gestagenic within
the system have high information value in their own right and are central to the causative
interpretation of the clause. While a finer-grained subdivision might be desirable the
problem of category proliferation would once again present itself.
6.6.4 Textual aspects - the problem of anaphoric resolution
Problems caused by the resolution of anaphors have been recognized in corpus research
since the early 1990s (Garside 1993b; McEnery and Wilson 2001:64).
Work at Lancaster has sought to hand-annotate corpora in such a way that pronouns can
be related to their antecedent nominal groups and the computer trained to recognize these
relationships. As McEnery notes, this area of corpus annotation is at present the province
of the human analyst. As such the resolution of anaphoric reference represents a further
difficulty for an automated parser based on the grammar with the specific IE/IR
applications which have been in mind. Examples of this problem abound in the corpus:
Cause[ ]
the former
higher responses
Test RC
In this example former occurs as a nominalized adjective which in information-content
terms is empty. Consequently it would be necessary to attach anaphoric annotations
manually to the corpus data to enable the computer to retrieve the link to the coreferential nominal group. This type of manual linkage is illustrated in the example below:
                         V in               Effect [quantity]
This [←the programme]    would result in    a cost of US $56 per averted
For the human analyst, the resolution of the pronoun this to its anaphoric referent the
programme is relatively unproblematic. However it is difficult to envisage an entirely
automated process of anaphoric resolution, which means that a parser would simply
format an informationally empty cause element (on the basis of its position to the left of
the hinge element result in).
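Although automatic resolution is out of reach, a parser could at least flag those elements which are informationally empty so that they can be passed to the human analyst for manual annotation. The following is a minimal sketch under stated assumptions: the word lists are hypothetical and are drawn only from items discussed in this chapter (this, the former, and general nouns such as problem and difficulty), and the function name is invented for the example.

```python
# Hypothetical word lists, based only on items discussed in this chapter.
ARTICLES = {"the", "a", "an"}
EMPTY_WORDS = {"this", "these", "that", "those", "it", "they", "former", "latter"}
GENERAL_NOUNS = {"situation", "situations", "problem", "problems",
                 "difficulty", "difficulties", "system", "systems"}


def is_informationally_empty(element):
    """Flag a cause/effect element that needs manual anaphoric annotation."""
    tokens = [t.strip(".,;:()").lower() for t in element.split()]
    content = [t for t in tokens if t and t not in ARTICLES]
    if not content:
        return True
    *modifiers, head = content
    # Any modifier with content of its own makes the element informative.
    if not all(m in EMPTY_WORDS for m in modifiers):
        return False
    # The head is empty if it is a pronoun-like item or a general noun.
    return head in EMPTY_WORDS or head in GENERAL_NOUNS
```

An element flagged in this way is not discarded; the flag simply marks the point at which a link such as [←the programme] would have to be supplied by hand.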
6.7 Summary
This chapter has provided an outline and preliminary evaluation of the grammar as a
basis for the parsing of the causative sublanguage. It has been shown from the evaluation
of the grammar on a 49,000 running word test corpus that the lexical and pattern
databases provide the basis for the identification of the causation sublanguage, leading to
a matching result of between 60% and 100% depending on the biomedical sub-genre. At
the same time, the problematic nature of semantic element identification and
categorisation was described, pointing towards the need for the construction of an
extensive ontology for the biomedical domain.
7. Applications of the local grammar
7.1 Preliminaries
The previous chapter has evaluated the local grammar mainly from a linguistic
perspective as the basis for an automated parser of biomedical text. This discussion has
centred chiefly on the problems involved in the functional parsing of the causative
sublanguage. More specifically there are sizeable challenges posed not only by semantic
categorization but also the syntactic complexity of the causative sublanguage. In contrast
to the linguistic aspects of the grammar, the main focus of this chapter is on potential
applications in biomedical informatics. It thus considers the potential of the grammar to
extract biomedically relevant informational content from research article texts sampled
from internet-based search engine queries. Finally the chapter moves on to offer a
critique of the corpus-driven methodology used in this thesis and examines the extent to
which a specialized local grammar framework can offer an analytical advantage over a general
language framework such as SFG as the basis for an automated parser.
The use of search engines to extract biomedically interesting information from on-line
databases is now an essential aspect of the research process. In response to each query it
is possible to retrieve full-text documents which can then serve as the POS- and
orthographically-tagged input files to be formatted using the local grammar as the parsing
basis. The potential areas of applicability are thus very great indeed given the diversity of
NLP initiatives in biomedical informatics. Consequently the applications discussed in this
chapter are selective; following on from the description of biomedical sublanguages in
chapter 2 applications are described with respect to the loosely-defined fields of
biomolecular and clinical subdomains. Within the biomolecular domain where the accent
is on processes at the genetic or molecular level, the application of the local grammar is
illustrated with respect to the automatic construction of an ontology or controlled
vocabulary. Clinical applications on the other hand are chiefly concerned with text-summary operations in which there is a need to rapidly condense the information content
from multiple textual sources. In other words what we are concerned with in both cases is
the challenge posed by the dynamic nature of biomedical research as new diseases
emerge, new treatments and commercial therapies appear and undergo evaluation and
genomic databases proliferate.
7.2 Automatic ontology building in the genetic/biochemical domain
7.2.1 Overview
The first area of application to be considered is that of biomedical ontology construction.
The use of ontologies has been discussed in chapter 5 with reference to the semantic
categorization which is an essential element in the more delicate levels of analysis for
which the local grammar is intended. In this chapter however the focus is on the role
which the identification of causative relations in research texts can play in the
construction of ontological representations within a specialist domain.
Before proceeding further it is necessary to define more closely what is meant by an
ontology. In philosophy the term ontology is reserved for theories of existence which
stress the relationships between entities and processes. Information scientists working in
the AI tradition define ontologies as computer-based lattices or placeholders for
knowledge filled by terminologies, controlled vocabularies, axioms and definitions which
represent consensual knowledge in a given domain. A useful working definition is that of the IEEE Standard Upper Ontology group for whom an ontology is ‘similar to a
dictionary or glossary, but with greater detail and structure that enables computers to
process its content’ (footnote 32). Working within this tradition Gruber (1993) highlights the role of
the discourse community in the achievement of consensus as an ontology is essentially a
shared, consensual agreement on a domain-specific knowledge representation.
The term ontology therefore overlaps substantially with the related terms taxonomy and
thesaurus which also aim at representation of semantic relations on an hierarchical basis.
The relationship between these terms can thus be represented in terms of a cline of
logical explicitness:
[Figure: cline of logical explicitness, from more formal to less formal]
In an ontology, the accent is on logically explicit relationships which define a hierarchy
of terms as exemplified by the knowledge representations of Guarino and Welty (2000).
As Brewster (2002:4) notes, the large majority of ontology building initiatives to date
have been manually-contrived representations which are time-consuming, costly and
potentially inconsistent.
Automated or semi-automated ontology building is therefore high on the AI/NLP agenda.
The question raised at this point is to what extent a grammar of causation can be used in a
narrowly defined biomedical domain to capture and structure on an automated basis the
hierarchical relations making up the lower levels of the ontology. It has been previously
claimed that the existence of an ontology is essential if the grammar is to assign
appropriately delicate semantic labels such as [disease], [cell] and [vector]
etc to the functional elements it parses. Whilst acknowledging the potential ‘chicken and
egg’ circularity inherent here (ie whether the grammar or ontology is prior) it is argued at
this point that a version of the grammar which parses causal relationships and their
directionality only could have significant use in the lower level population of an
ontology. In other words such an application of the local grammar would be directed
solely towards the assignment of cause and effect designations either side of the
hinge element and would not concern itself with further semantic categorization.
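A reduced version of the grammar of this kind might be sketched as follows. The sketch is illustrative only and rests on several assumptions: the hinge patterns are a small hand-picked subset (result in, result from, be caused by, cause) rather than the lexical / pattern database itself, the regular expressions and the function name are invented for the example, and the material on either side of the hinge is taken over unanalysed.

```python
import re

# Each hinge pattern is paired with a direction: "forward" means the cause
# stands to the left of the hinge, "reverse" means the effect stands to the left.
# The list is a hypothetical subset, ordered so that "was caused by" is tried
# before the bare verb "cause".
HINGES = [
    (r"\b(?:is|are|was|were)\s+caused\s+by\b", "reverse"),
    (r"\bresult(?:s|ed|ing)?\s+from\b", "reverse"),
    (r"\b(?:(?:may|might|can|could|would|will)\s+)?result(?:s|ed|ing)?\s+in\b", "forward"),
    (r"\bcauses?\b", "forward"),
]


def parse_causal_link(clause):
    """Assign cause and effect to the strings either side of the first hinge found."""
    for pattern, direction in HINGES:
        match = re.search(pattern, clause)
        if match is None:
            continue
        left = clause[:match.start()].strip(" .,")
        right = clause[match.end():].strip(" .,")
        if direction == "forward":
            cause, effect = left, right
        else:
            cause, effect = right, left
        return {"cause": cause, "hinge": match.group(0), "effect": effect}
    return None  # no hinge recognized: the clause is not formatted
```

Applied to the string hemiparesis and seizures resulting from tumor or its treatment, the sketch would assign tumor or its treatment as cause and hemiparesis and seizures as effect, without attempting any further semantic categorization of either element.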
7.2.2 The Gene Ontology
One existing ontology of interest is the Gene Ontology (GO) particularly as it is an
initiative illustrative of the problems presented by the current proliferation of biological
terms. More specifically, there is a need to represent the knowledge structure of
individual genes, their products and their functions. The GO draws on a number of
separate species-specific genomic databases such as FlyBase, the Saccharomyces
Genome Database and the Mouse Genome Database. Many other databases have
subsequently been added by the GO Consortium. The GO is structured in terms of three
sub-components shown below adopted on the basis that they represent common
denominators for all living organisms:
molecular function
biological process
cellular component
Of particular interest from the point of view of causal relations are the molecular function
and biological process sub-components. The molecular function subsumes primarily the
binding and catalytic activities of gene products such as adenylate cyclase activity and
toll receptor binding at molecular level. The activities of gene products are frequently
encoded transitively through causal verbs with which the lexical / pattern database can be
matched on an automatic basis. The biological process ontology on the other hand is
reserved for specific groupings of molecular functions such as alpha-glucoside transport
or signal transduction; molecular functions therefore constitute separate stages within
biological process functions. Causation is less relevant to the construction of the cellular
component ontology as this representation primarily concerns itself with whole / part
meronymic and locative relations between the entities eg between rough endoplasmic
reticulum and nucleus etc.
It is important therefore to consider the types of hierarchical relations between nodes
which the GO captures. One ontological distinction which is of relevance to causal
relations is that between universals and particulars (Smith et al. 2003). Universals refer to
superordinates, types or classes while particulars relate to specific examples or tokens:
Eukaryotic cell
The GO defines the relationships between particulars and universals in terms of is_a and
part_of. Is_a defines a member-class relationship while part_of
defines a meronymic relationship between part and whole. The claim being made here is
that a hierarchy of molecular functions can be recast in terms of a chain of causal
relations such that entity / process X functions as a cause of entity / process Y. On this
basis is_a relationships encode strong, non-modalized cause and effect linkages while
part_of relationships represent secondary, contributory connectives. These
relationships are summarized in the table below:
GO relation                              Causal equivalent
<Particular> is_a <Universal>            X is a cause of Y; X causes Y; Y is caused by X
<Particular> is a part_of <Universal>    X contributes to Y; X plays a role in Y
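The correspondence proposed in the table can be held as a simple lookup from GO relation to causal paraphrase. The following is a sketch only; the dictionary and function names are hypothetical, and the mapping restates the claim made in this section rather than any property of the GO itself.

```python
# Causal paraphrases for each GO relation, as proposed in the table above.
CAUSAL_EQUIVALENTS = {
    "is_a": ["{x} is a cause of {y}", "{x} causes {y}", "{y} is caused by {x}"],
    "part_of": ["{x} contributes to {y}", "{x} plays a role in {y}"],
}


def causal_paraphrases(x, relation, y):
    """Return the causal paraphrases of '<x> relation <y>'."""
    return [template.format(x=x, y=y) for template in CAUSAL_EQUIVALENTS[relation]]
```

Read in the other direction, the same table suggests which GO relation a retrieved causal link might be filed under: strong, non-modalized hinges under is_a and contributory hinges under part_of.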
A sample of the GO structure is included below for the biological process negative
regulation of mitosis which is defined by the ontology constructors as ‘any process that
stops, prevents or reduces the rate of mitosis’ (cell division). The biological process of
mitosis is thus important in understanding the cellular division mechanisms which
underlie tumour development. The full hierarchical structure is shown strictly speaking in
the form of a directed acyclic graph (footnote 33). Solid arrows show is_a relationships between
particular and universal while broken arrows describe part_of linkages.
Footnote 33: In contrast to a strict hierarchy, directed acyclic graphs allow a particular, ie a child, to have more than one
universal as parent.
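[Figure: directed acyclic graph adapted from the GO for the query item negative regulation of mitosis. Node labels recoverable from the diagram, some of them truncated: negative regulation of ..., regulation of ..., negative regulation of cell cycle, regulation of cell ..., M phase of mitotic cell cycle, nuclear division, mitotic cell cycle, M phase, cell cycle, cell proliferation, cell growth and / or ..., cellular physiological ..., cellular process, biological process, gene ontology]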
The above diagram adapted from the GO shows the hierarchical relationships in place for
the query item. It is assumed here that the higher order slots in the hierarchy are
consensually fixed by domain specialists and are not liable to significant modification
(unless of course there is a radical paradigm shift which might drastically alter the
conceptual relations of the field). The use of automation for ontology construction on the
other hand really comes into play with the population of the lower levels of the hierarchy.
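The graph structure described above, in which a child term may have more than one parent, can be sketched as a small data structure. This is a minimal sketch and not the GO's own representation: the class and method names are hypothetical, and any edges loaded into it for the mitosis example would be illustrative assumptions rather than a transcription of the GO.

```python
from collections import defaultdict


class OntologyDAG:
    """Directed acyclic graph in which a term may have several parents."""

    def __init__(self):
        # child term -> list of (relation, parent term) pairs
        self.parents = defaultdict(list)

    def ancestors(self, term):
        """Return every term reachable by following parent links upwards."""
        seen = set()
        stack = [term]
        while stack:
            node = stack.pop()
            for _relation, parent in self.parents.get(node, []):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def add_edge(self, child, relation, parent):
        """Add 'child relation parent', refusing any edge that would close a cycle."""
        if child == parent or child in self.ancestors(parent):
            raise ValueError("edge would create a cycle")
        self.parents[child].append((relation, parent))
```

Under these assumptions a term such as M phase of mitotic cell cycle could be attached both to M phase and to mitotic cell cycle, which a strict tree would not allow.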
The local grammar can be used as the basis for structuring the lower hierarchical levels
where there is a proliferation of term entities. The links to be described are the genes
which are causally linked to the biological process of mitosis. In the exemplification
below, the local grammar framework was used as the basis for populating the lower
levels of the hierarchy with genes which are linked to mitosis regulation. The query item
mitosis was used in the Science Direct database to retrieve electronic articles from the
entire biomedical domain. As this is a preliminary investigation only, the retrieval and
analysis of gene-mitosis causal links was restricted to article abstracts for illustrative
purposes.
Article reference: Archives of Oral Biology, Volume 49, Issue 11, November 2004, Pages 889-894
Retrieved causal link: Thus, the major part of the sympathetically nerve-evoked
-adrenoceptor-mediated mitotic response was found to depend on the activity of
neuronal type NO-synthase to generate NO

Article reference: Analytical Biochemistry, Volume 333, Issue 1, 1 October 2004, Pages 57-64
Retrieved causal link: Ran is a small GTPase that cycles between a guanosine
diphosphate (GDP)-bound form (RanGDP) and a guanosine triphosphate (GTP)-bound
form (RanGTP) and plays important roles in nuclear transport and

Article reference: Developmental Biology, Volume 273, Issue 2, 15 September 2004, Pages 210-225
Retrieved causal link: Knockdown of Xtrb2 by antisense morpholino
oligonucleotides (MOs) disrupted synchronous cell divisions during blastula stages,
apparently as a result of delayed progression through mitosis and cytokinesis.

Article reference: Cell, Volume 118, Issue 5, 3 September 2004, Pages 567-578
Retrieved causal link: Drosophila MEI-S332 and fungal Sgo1 genes are
essential for sister centromere cohesion in meiosis I. We demonstrate that the related
vertebrate Sgo localizes to kinetochores and is required to prevent premature sister
centromere separation in mitosis
The causal relations retrieved from the abstracts are shown diagrammatically below:
[Figure: causal initiator genes (the recoverable label reads ‘...ila MEIS332’) linked by ‘cause of’ arrows to mitosis regulation]
The use of automatically-retrieved causal links in the construction of ontologies
illustrates one relatively low level application of the local grammar. By low level it is
meant that the accent is on the retrieval of causal links and the assignment of
directionality in the causal process only rather than attempting to apply the full
sophistication of the grammar in more delicate functional parsing of the sublanguage.
This investigation is obviously very limited in scope. A full evaluation of the local
grammar/parser as an automated tool in ontology building would involve greatly
expanding the text trawl to many hundreds of articles.
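The low level population step described in this section might be sketched as follows. The sketch assumes that causal links have already been retrieved and reduced to cause, hinge and effect strings; the two mitosis links are simplified by hand from the abstracts quoted above, the third is included only to show that unrelated links are filtered out, and the function name, the keyword matching and the record layout are all hypothetical. The polarity of the hinge (for example is required to prevent) is not modelled.

```python
from collections import defaultdict

# Hand-simplified links; the record layout is an assumption of this sketch.
LINKS = [
    {"cause": "neuronal type NO-synthase", "hinge": "depend on",
     "effect": "mitotic response", "source": "Archives of Oral Biology 49(11)"},
    {"cause": "vertebrate Sgo", "hinge": "is required to prevent",
     "effect": "premature sister centromere separation in mitosis",
     "source": "Cell 118(5)"},
    {"cause": "a virus in the coronavirus family", "hinge": "is the causal agent of",
     "effect": "SARS", "source": "SARS3"},
]


def populate_term(links, keywords):
    """Collect the causes whose effect mentions the ontology term, with their sources."""
    entries = defaultdict(list)
    for link in links:
        effect = link["effect"].lower()
        if any(keyword in effect for keyword in keywords):
            sources = entries[link["cause"]]
            if link["source"] not in sources:
                sources.append(link["source"])
    return dict(entries)


# Candidate entries for the lower levels of the hierarchy beneath the query item.
mitosis_entries = populate_term(LINKS, keywords=("mitosis", "mitotic"))
```

The output is a list of candidate causal initiators with the articles in which each link was found, which a domain specialist would still need to vet before any term is added to the ontology.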
7.3 Clinical domain
7.3.1 Overview
In this section, the focus is on three dynamic areas of biomedical informatics in the
clinical domain. These areas will be briefly outlined before the information-formatting
potential of the grammar / parser is assessed in more detail. Section 7.3.2 describes the
applications of a grammar / parser in the formatting of an emergent disease database as
exemplified by the recent SARS (severe acute respiratory syndrome) outbreak in Asia. In
a similar vein, an application of the grammar / parser in the collation / formatting of side-effects with regard to a widely used (but controversial) treatment is illustrated in section
7.3.3 with regard to the drug-based therapy for Parkinson’s Disease. Finally the problem
of drug resistance is discussed in 7.3.4 in terms of the potential which the grammar /
parser embodies for extracting information relating to the comparative efficacies of
competing pharmaceutical solutions.
These applications share the common factor of the potential which they offer
for the automatization of text summary operations based on the extraction and parsing of
causal relationships identified in retrieved textual sources. They
focus on alternating sides of the cause and effect linkage as various ‘unknowns’ in
informational terms. In the case of SARS, the effects as seen in symptoms and local
population impact are in the process of being documented while the research community
has yet to arrive at a consensus as to a specific viral cause. For the treatment of
Parkinson’s Disease, the cause (ie the drug levodopa) is known but the effects (or more
specifically the undesirable side-effects) are the subject of on-going research as the
primary information focus. Similarly the discussion of anti-malaria drug resistance
focuses on the constantly shifting and often unpredictable nature of drug impact on
geographically and ethnically isolated local populations.
7.3.2 Emergent diseases: SARS
Background
In the case of an emergent disease outbreak a cluster of symptoms (as the effect) serves
as the basis for query-retrieval. In the case of the 1995 Ebola outbreak in Zaire for
example, symptoms were acute (fever, headache, joint and muscle aches, sore throat, and
weakness followed by diarrhoea, vomiting, and stomach pain etc) as there was no carrier
stage for the virus. Other causally-related unknowns connected with sudden viral
outbreaks include an exact definition of habitat (ie the natural reservoir of the disease)
and the precise manner in which the virus makes its first appearance in humans.
Frequently it is the case that for a newly-emergent disease or sudden viral mutation
resulting in the appearance of a more dangerous strain of an existing virus, a search
engine is used to retrieve a number of separate texts from multiple on-line sources.
However each text might concentrate on selective aspects of the outbreak, therapy or
emergent drug resistance which makes it important to be able to synthesize elements of
each text into a composite profile/template.
SARS is a respiratory condition which came to the attention of the medical world during
the Spring of 2003, with major impact centres in China, Hong Kong and Canada. The
condition received considerable media attention partly as a result of high levels of
mortality but also due to an initial cover-up on the part of the Chinese government and the
draconian measures used to control the outbreak. The SARS outbreak illustrates the need
to process, summarize and disseminate a sudden flurry of text production within a
discourse community in response to the outbreak and its rapid spread throughout the
For evaluative purposes, the grammar was applied to three articles selected at random and
downloaded via a search engine. The search engine of Emerging Infectious Diseases (footnote 34)
provided the initial text sources. The object of the exercise was to assess the potential of
the grammar / parser in the synthesis of key informational elements from a number of
texts to create what will be referred to in this thesis as a ‘causative profile’. The causative
profile serves to structure the causal links in order to capture key informational aspects.
Information coverage
The successfully extracted and parsed sublanguage from three selected articles (referred
to here as SARS1, SARS2 and SARS3 respectively) was synthesized into one single
source document to create the causative profile based on the initial query. The profile
uses the parsed causative links extracted from the grammar to capture and structure
information generated by the query. In many cases this process involves a text
summarizing process where repeated (and therefore informationally redundant) causal
elements are omitted. For example in the article SARS2 the nominally-encoded
inferential causative SARS-associated coronavirus is repeated five times throughout this
text alone, leading to the local grammar analysis as shown below:
[35] SARS-associated coronavirus (SARS-CoV)
coronavirus (SARS-CoV)
The key information element encoded by the causative relationship is the inferred link
between the cause (coronavirus) and the SARS disease itself. In information terms the
same relationship is alluded to in the SARS3 text, where the relationship is encoded as a
relational causative:
[36] a virus in the coronavirus family is the causal agent of SARS
Qualifier [virus ]
a virus
in the coronavirus
Hinge Cause
v-link N
[Rel] Value
the causal agent
of n
The main difference between the two causative elements is the hedging of the causal
relationship presented as an inference in the second example. Otherwise these
sublanguage elements make similar contributions in information content terms. Other
possible causative elements identified by the grammar / parser are informationally empty
(ie they do not contribute additional information) as in the example of inferential
causation below:
[37] lung pathology associated with this disease
Causal inference
V-ed with
Causal inference
associated with
this disease
In this example lung pathology is linked causally to the disease but this causal agent is
not made sufficiently explicit (beyond the pre-modifying element lung that is) in this
example to justify its capture in the informational profile.
The causal profile for a disease outbreak
The structuring of non-informationally-redundant causal links retrieved from the three
articles in question has been made on the basis of formatting the informational elements.
These elements are intended to serve as the focal points for the summarization of the
articles in question and will now be outlined and exemplified in more detail below. To
make the position clearer the representative profile below is presented in condensed form
and does not show all the causal links identified by the grammar in the three test texts.
The commentary below refers to the full profile included on the accompanying CD.
I The disease and its causal agent
[Table fragments, in extracted order: A coronavirus was / isolated from patients with / that might be / the primary agent / of severe acute / in Hong Kong / be V-ed / was caused / associated with disease / by a / belonging to / the family]
II Secondary (contributory) causal factors pre-disposing patients to the disease
[Table fragments, in extracted order: Qualifier [effect] / [Relationa Token / Qualifier Qualifier[mediator] / associated with / acquisition of the / through household / severe disease / [life stage ] / high initial lactate]
III Effects of Treatment
IV Causal mechanisms
[Table fragments for sub-components III and IV, in extracted order: of steroid / did not / V for / of n / of an adverse / clinical outcome / to affect / account at least / the severity / of the clinical / seems to / [life stage] / in the / [Ellipted cause]]
V Prevention
[Table fragments, in extracted order: Cause[quantity/ process] / Minimizing individual / exposure to the virus / V in / results in / proliferation and / [process / / in the lung / Product[viru ss] / the viral / and the risk / for a]
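The synthesis of parsed links from several articles into a single causative profile, with repeated links omitted, might be sketched as follows. This is an illustrative sketch only: it assumes that each parsed link has already been assigned by hand to one of the sub-components I to V, the record layout and function name are hypothetical, and only verbatim repetitions (ignoring case) are treated as informationally redundant.

```python
def build_causative_profile(parsed_links):
    """Group parsed causal links by sub-component, merging verbatim repetitions."""
    profile = {}
    for link in parsed_links:
        section = profile.setdefault(link["component"], {})
        # Links are treated as redundant when cause and effect match, ignoring case.
        key = (link["cause"].lower(), link["effect"].lower())
        entry = section.setdefault(key, {
            "cause": link["cause"],
            "effect": link["effect"],
            "sources": [],
            "count": 0,
        })
        entry["count"] += 1
        if link["source"] not in entry["sources"]:
            entry["sources"].append(link["source"])
    return profile
```

On this approach the five occurrences of SARS-associated coronavirus in SARS2 would collapse into a single entry with a count of five, whereas the relational causative from SARS3 would remain a separate entry, since recognizing the two as informationally equivalent requires more than string comparison.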
Sub-component I of the profile captures the crucial link identified in the RA text between
the SARS condition and its coronavirus origin. Subsequent information elements provide
a further context in terms of relating SARS as a recently identified phenomenon to
similar, more extensively documented conditions such as respiratory diseases, pneumonia
etc which have the common denominator of a viral cause. Other causal links capture the
effects of the coronavirus on animals, indicating their physiological extents. Examples of
other coronaviruses with similar effects are also given. The level of contagion / risk of
infection is also described together with a review of medical outcomes. With respect to
the local grammar, the formatting of scope elements such as qualifier [patient]
as in in immunocompetent adults provides an important means of describing in summary
form the extent of the disease within the Hong Kong population.
The test articles reviewed contain a number of causal elements which provide indications
as to factors which pre-dispose victims to the disease. In other words these factors might
be considered secondary to the primary viral cause and could relate for example
predisposing circumstances such as age and existing immune system compromise to the
primary effects of the virus. Some of these factors are expressed as correlates- ie they are
identified as variables although a causal relationship between them is not stated openly
and may only be inferred. As has been remarked previously one repeated difficulty is the
demarcation between a genuine causal relationship and a list of variables which may be
co-occurring (and which invite the inference of a causal relationship on the part of the
reader).
Sub-component III of the profile is of critical importance in summarizing
approaches towards treatment. This point is especially important with a medical condition
such as SARS, where a rapid and coordinated response to the threat of contagion is
needed. Given the sudden appearance of the condition during the spring of 2003,
consensus has yet to be reached on an effective therapy for SARS.
However, the information-bearing elements captured in the profile do bring out the effects
of the steroid and conventional antibiotic treatments so far used in the frontline defence
against SARS. The effect of delay in treatment is also highlighted in terms of the
complications noted in the condition.
Further causative elements in sub-component IV serve as a bridge between the
formatting of treatments and the discussion of the viral mechanisms which are highlighted in
the section below. In particular, the process of cytokine dysregulation is discussed in
terms of cellular proliferation in the lung. The final sub-component V of the causative
profile for emergent diseases points towards a set of guidelines which list the emergency
measures which have subsequently been used to stem the tide of this condition. The lack
of consensus with regard to specific therapies put forward to combat an emerging disease
such as SARS is not surprising given its sudden appearance. The causal links listed do
however capture an evaluation of the existing safety precautions and their efficacy
especially for medical professionals coming into direct contact with SARS. These links
encode measures which stop some way short of prevention but nevertheless embody
valuable information which can serve as the basis for reducing the hazard to ‘front-line’
medical staff. However, a possible beneficial effect of steroid administration is also captured
in the profile, which points towards a potential treatment.
Evaluation
The question can of course be asked as to how ‘typical’ SARS is as an example of a
suddenly emerging disease and to what extent the causal profile represented by
sub-components I-V above can be applied to other (non-viral) diseases. A further evaluation would
conceivably involve the application of the grammar / parser firstly to other diseases with
a known (isolated) viral origin. Ebola is one relatively recent, high-profile example. The
World Health Organisation35 website provides a useful summary of current outbreaks
which can serve as a portal for information extraction and retrieval via journal articles.
7.3.3 Levodopa: an established therapy / treatment course and its side-effects
Background
Parkinson’s Disease is a condition which is characterized by difficulties in movement
caused by neuron death in the brain. It is believed that this in turn results in a deficiency
in dopamine, a chemical which plays a critical role in brain signalling. One therapy which
has been trialled is the drug Levodopa; the use of this drug has however been called into
question due to disabling side-effects such as dyskinesia (abnormal involuntary
movements). The use of amantadine, an NMDA-receptor antagonist which acts to reduce
dyskinesia brought on by Levodopa, is currently undergoing clinical trials36. This case
study therefore attempts to apply the grammar to the formatting of texts in order to
synthesize information on these potential side-effects. A synthesis on this basis might be
useful for healthcare professionals considering the advantages and disadvantages of
alternative treatment courses for this particular condition.
The procedure adopted for this assessment is to use the query item combination of
Levodopa + Parkinson’s Disease to similarly retrieve three research articles from the
Science Direct on-line journal database. In order to render this suitable for a hand-parsing
experiment, the articles available in html format were selected at random from the list of
relevant articles in response to the initial query. The causative sublanguage was then
delineated from the remainder of the text as outlined in chapter 6 and the respective
causative elements hand-parsed in accordance with the local grammar. The results from
all three documents were then synthesized to create a treatment / side-effect causative
profile.
Information coverage
In the case of a drug therapy and its side-effects, there are a number of items of
information which it would be desirable to capture in the form of an IE profile. The
information elements for the treatment / side-effect causative profile are outlined in
abbreviated form in the table below.
I The condition (effect) and its biomedical cause / causal mechanisms
[Table flattened in extraction; surviving cells in source order]
[Outcome] | Qualifier [cell] | Qualifier [body_part] | Hinge | of dopaminergic | in the substantia nigra | underlie | the pathophysiology | of Parkinson's
Cochrane Methodology Review http://www.update-software.com/abstracts/AB003467.htm
II The condition and the current status of its drug-therapy treatment
[Table flattened in extraction; surviving cells in source order]
Delim[classifier] [intervention] | Hinge[prod] | Delim[classifier] [symptoms] | does adequately control | of entacapone | with levodopa | in n | in the | of Parkinson’s disease (PD)
III The drug-therapy treatment (cause) and mechanisms of side-effects
[Table flattened in extraction; surviving cells in source order]
play N in n | Hedge | Causal | transmissi | in striatal | play a role | in the | of Levodopa | opioidergic | on
IV Reducing side-effects
[Table flattened in extraction; surviving cells in source order]
Delimiter[classifier] [bio] | the selective adenosine A2A antagonist | KW | improves the | Qualifier [drug] | V to | [symptom] Qualifier | symptoms without | the disability and | improving QoL of PD patients with motor
The causal profile for Levodopa therapy contains four sub-components, dealing with the
condition and its cause (I), the current treatment for the condition (II), the mechanisms of the
side-effects of the treatment (III) and the reduction of these side-effects (IV) respectively. The condensed causative profile above captures
the crucial linkage between Parkinson’s Disease and the nature and location of its
neurological cause. Equally importantly the grammar/parser retrieves the linkage between
the condition and the levodopa drug therapy in sub-component II of the profile. The two
representative sample links here bring out the deficiencies of the standard Levodopa
treatment and suggest that a combination of entacapone with levodopa is more effective.
Further tables (III) discuss causal mechanisms for the side-effects in terms of neurochemistry. Finally, in sub-component IV the tables link the empirical results for two
possible treatments designed to limit the effects of dyskinesia: the combination therapy
involving levodopa and entacapone, and the selective adenosine A2A antagonist KW-6002,
respectively. On the basis of the parse it is possible to make a positive evaluation of the
combination therapy over the sole treatment with levodopa.
Despite this relatively rudimentary manual application of the grammar / parser, it is
possible to gain some appreciation of the beneficial effects in IR / IE terms.
These effects would of course be substantially multiplied if the text trawl were extended
beyond the three research articles in the experimental sample.
7.3.4 Drug-resistance: Anti-malaria drug
Background
The third and final area of applicability to be investigated in this thesis is the problem of
drug-resistance, with specific reference to the sub-tropical disease malaria. Drug
resistance represents a major challenge to medicine in the 21st Century both in terms of
bacterial/viral agents and also the role of parasitic hosts as vectors in the spread of
disease. It has been estimated that worldwide there are 300-500 million separate cases of
malaria per annum.37 In the case of malaria, the epidemiology of the disease is well-known though complex. The disease is caused by an intracellular protozoan parasite of
one of four species, Plasmodium falciparum, P. vivax, P. ovale and P. malariae, which is
passed on via the bite of the female mosquito of the genus Anopheles. The current
situation is that malaria is endemic in many parts of the developing world, with resistance
reported to virtually all known drug therapies. At the same time this simple picture of
resistance is complicated by variables such as the species of malaria parasite, specific
susceptibility to anti-malarial drug therapies, climate as well as local patterns of
immunity and behaviour among human populations living in these areas.
In response to this worldwide threat, there are a number of drug therapies currently
available, including quinine and chloroquine in more severe cases and the relatively new
amodiaquine. More recently antifolate drugs have also been used but resistance has
developed quickly. Combinations of these drugs have however resulted in improved cure
rates. Overall there is the problem at the genetic level of spontaneous mutations which confer
reduced sensitivity. Excessive drug pressure in localized areas removes susceptible
parasites while resistant parasites survive. Added to the equation is the likelihood that
there may be factors other than drug resistance which contribute to treatment failure.
The complexity of inter-connected factors coupled with the dynamic nature of malaria
infection and drug resistance gives rise to an important area of applicability in which a
grammar of causation can produce structured profiles for information retrieval processes.
The procedure for evaluating parsing and information coverage is the same as that for
Levodopa in that a web database was searched for relevant RAs matching the query
chloroquine + drug-resistance and a random selection of three articles available in html
format was made. The pattern and lexical databases were used to delineate a sublanguage
of causation from the remainder of the text, with the local grammar categories applied to
create a parse of each causative item into information-bearing elements. The results were
then used to create a causative profile, shown below, for anti-malarial drug resistance.
World Health Organisation, http://www.who.int/topics/malaria/en/
Information coverage
I Disease and causal agency
due to Plasmodium falciparum
II Drug resistance and causal agency
[Table flattened in extraction; surviving cells in source order]
[Classifier] | drug uptake | in the | for n | for resistance | be V-ed to | has led to | the use
III Causal mechanism: chloroquine resistance
[Table flattened in extraction; surviving cells in source order]
be V-ed | to n | [Reactivity] Qualifier Hinge | Plasmodiu resistance | has been | chloroqui linked to | such as | mefloquine and | in the P. falciparum multidrug resistance (pfmdr1) gene and the P. falciparum chloroquine related transporter (pfcrt) gene | in the | ate synthase (dhps) gene | the tolerance | to the drug
VI Drug resistance to mefloquine, quinine, halofantrine and sulphadoxine-pyrimethamine
[Table flattened in extraction; surviving cells in source order]
Classifi Process | Qualifier Hedg Hinge[para Reactivity Qualifier[dr Qualifier | polymorphis in the | may modulate | sensitivity both | in P. | mefloquine falciparum | (MF) and | ne (SP) | be V-ed to | has been | linked to | of mutations in the
The tables above provide an indication of the information coverage obtained by parsing
causal linkages according to the lexical database / grammar and then filtering this
information to create a ‘causal profile’. This causal profile is an attempt to capture key
elements of information derived from the query item chloroquine + drug resistance.
Based on the query, a number of key elements of information are retrieved from the
matching documents. Malaria is firstly linked in sub-component I to its parasitic
organism cause. In the case of the causative element extracted above, the causal link is
contained within a nominal group. In profile II the critical problem of drug resistance and
causal agency is taken up. Application of the local grammar parses the critical link
between drug uptake and drug resistance. Similarly the grammar identifies one
consequence of the decline in chloroquine efficacy: the use of alternative antimalarial
drugs which are then listed.
Profile III establishes the causal mechanism for the increasingly prevalent patterns of
chloroquine drug resistance. The two selected parses listed under this profile bring out
clearly the genetic mechanisms at play in the decline of the drug’s effectiveness in
malaria treatment. More specifically the mechanism is listed in terms of gene mutations
at three separate locations: the genes pfmdr1, pfcrt and dhps respectively. The ability of
the grammar/ parser to extract such specific information relating to precise gene locations
is highly significant bearing in mind the size of existing database initiatives such as the
Human Genome Project.
In profile VI, the use of alternative drugs to chloroquine such as mefloquine, quinine,
halofantrine and sulphadoxine-pyrimethamine is highlighted. The problems of
drug-resistance for these compounds are also related to an underlying genetic mechanism,
in this case identified with the specific location of the pfmdr1 gene.
7.3.5 Summary
These experiments in hand-parsing the causative sublanguage in accordance with the
local grammar have pointed towards the construction of ‘bottom up’ information profiles
specific to each area of applicability in the biomedical informatics domain. The
construction of these profiles as a series of ‘templates’ has been made on a similarly
manual basis, in terms of capturing specific elements of information relevant to medical
professionals while at the same time filtering out repetitive, redundant information
elements. The design and construction of a causative profile therefore involves crucial
decisions on the part of domain specialists relating to the format of information slots
which serve to exclude the ‘noise’ of information repetition. At the same time the
information unknowns either side of the causative hinge need to be targeted in
accordance with specific clinical tasks.
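The template-filling step summarized above can be illustrated with a minimal sketch in Python. It is not the procedure used in the hand-parsing experiments, which were carried out manually. The class CausalLink, its field names and the function build_profile are hypothetical names introduced purely for illustration, and the sample links are invented paraphrases of the kind of element shown in the profiles above. The sketch groups parsed causal links by profile sub-component and drops repeated links, which is one simple reading of what filtering out redundant information elements might involve.

```python
from collections import OrderedDict
from typing import NamedTuple


class CausalLink(NamedTuple):
    """One hand-parsed causative element (field names are illustrative)."""
    slot: str    # profile sub-component, e.g. "I" or "II"
    cause: str
    hinge: str   # the causative hinge, e.g. "has led to"
    effect: str


def build_profile(links):
    """Group links by sub-component, dropping repeats of the same link."""
    profile = OrderedDict()
    seen = set()
    for link in links:
        # Normalise case and spacing so trivially repeated links collapse.
        key = tuple(" ".join(part.lower().split()) for part in link)
        if key in seen:
            continue
        seen.add(key)
        profile.setdefault(link.slot, []).append(link)
    return profile


# Invented sample input: the second link repeats the first.
links = [
    CausalLink("I", "Plasmodium falciparum", "due to", "malaria"),
    CausalLink("I", "plasmodium  falciparum", "due to", "Malaria"),
    CausalLink("II", "decline in chloroquine efficacy", "has led to",
               "the use of alternative antimalarial drugs"),
]
profile = build_profile(links)
```

In this sketch the decision as to which slots exist, and what counts as a repeat, is hard-coded; in practice these are exactly the decisions which the text above assigns to domain specialists.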
7.4 Overall evaluation
In this section an overall evaluation of the local grammar project is offered. The focus of
the evaluation is on two specific aspects of the thesis. Firstly a critique is made of the
corpus-driven methodology which formed the basis of the empirical work on the corpus
and which resulted in the survey of lexical patterns outlined in chapter 4. Secondly the
evaluation takes up the extent to which the local grammar of causation offers any real
analytical advantage over a general language framework such as SFG from which the
local grammar is so closely derived.
The corpus-driven methodology needs to be examined critically with regard to corpus-based approaches, following the distinction made by Tognini-Bonelli (1996, 2001). The
corpus-driven methodology
7.5 Future Research
This thesis has put forward a specialized functional grammar of cause and effect,
primarily with a view to NLP applications in biomedical informatics. In this section directions for
future research will be further explored not only within corpus-driven studies of causation
but also with regard to the theory of local grammars in general.
One of the central concerns of this thesis has been the extent to which a grammar of causal
relations defined from within a specialist domain can extract useful information in an age
of rapid scientific development and terminological proliferation. This information can
serve a number of purposes in NLP, such as the construction and updating of domain-specific ontologies and controlled vocabularies, in addition to on-line research article
summary applications. More work remains to be done in terms of widening the lexical
scope of the grammar, as the identification of causative lexical patterns has been carried
out manually on the basis of a relatively small corpus. ‘Training’ the grammar through
exposure to progressively larger amounts of data will enable the lexical and pattern
database to be extended considerably, leading to greater reliability in the identification of
causal links and the assignment of semantic categories.
Implicit in this exercise is an ongoing commitment to the development of software based
on the grammar which can at least partially automate the sublanguage analysis. During
the initial stage of the project, a number of highly simplified parsing experiments were
carried out on ‘toy’ sentences of the form Smoking causes heart disease using the text-matching language AWK. While the AWK-based experiments provided a useful insight
into the process of automatic parsing, there is still a very wide gulf to be crossed before
‘industrial strength’ software applications can be developed based on the grammar. It is
envisaged for example that sufficiently robust applications would be needed to
satisfactorily cope with the multiple levels of causal relations in the highly compacted
nominal groups which are frequent in scientific writing.
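To give a concrete impression of what parsing a ‘toy’ sentence involves, the following is a minimal sketch written in Python rather than AWK. It is not a reconstruction of the experiments described above. The hinge list HINGES and the function parse_causative are assumptions introduced for illustration only. The sketch splits a simple sentence into cause, hinge and effect around a small set of forward causative hinges.

```python
import re

# A tiny, hypothetical list of forward causative hinges (cause before effect).
HINGES = ["results in", "has led to", "leads to", "causes"]

# Longer hinges are listed first so that multi-word hinges are preferred.
_PATTERN = re.compile(
    r"^(?P<cause>.+?)\s+(?P<hinge>"
    + "|".join(re.escape(h) for h in sorted(HINGES, key=len, reverse=True))
    + r")\s+(?P<effect>.+?)\.?$",
    re.IGNORECASE,
)


def parse_causative(sentence):
    """Return (cause, hinge, effect) for a simple sentence, or None."""
    match = _PATTERN.match(sentence.strip())
    if match is None:
        return None
    return (
        match.group("cause"),
        match.group("hinge").lower(),
        match.group("effect"),
    )
```

Calling parse_causative("Smoking causes heart disease") returns the triple ('Smoking', 'causes', 'heart disease'), while a sentence without a listed hinge returns None. A pattern of this kind splits at the first hinge it finds and assigns no semantic categories, so it cannot deal with nested causal relations inside compacted nominal groups, which is the gulf referred to above.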
The compilation of a suitably robust grammar with the finely granular semantic
categories described in chapter 5 invites consideration as to the status of the grammar vis-à-vis the type of ontological representation discussed in this chapter. There would appear
to be two possible conceptualizations of this relationship. On the one hand a sublanguage
grammar might be regarded as wholly separate from the hierarchical statement of
specialist terms and their relationships making up the ontology/controlled vocabulary.
The conceptualization of the local grammar in this thesis might on the
other hand be characterized as ‘integrationist’ in the sense that a functionally-motivated grammar draws
heavily on the semantic representation of an ontology. As has been discussed previously
this latter position does lead to a ‘chicken and egg’ conundrum with regard to the prior
status of the grammar or ontology.
Ultimately local grammar theory cannot be divorced from a consideration of the
fundamental issues raised by the grammatical parsing of human languages. As pointed
out by Sampson (1992), improving on what have been mediocre levels of parsing
efficiency to date remains the most recalcitrant problem in natural language processing.
As Barnbrook and Sinclair (2001) note, the notion of a local grammar arises out of this
widespread dissatisfaction with the results obtained for global language parsing networks.
The results of this thesis, while not implemented in software terms, give grounds for
optimism with regard to a specialist functional grammar of causation.
On a wider perspective, the question is to what extent a ‘battery’ (in Barnbrook and
Sinclair’s terms) of specialized local grammars can achieve parsing results in excess of
those achieved by global parsers. Based on previous work and the empirical studies
explored in this project it would appear that there are two main dimensions for future
research on local grammars. The first dimension which is embodied in this thesis is to
develop genre-specific ‘utilitarian’ local grammars as ad hoc parsing solutions specific to
sub-domain knowledge representations. In this conceptualisation, local grammars are
subsumed within general language frameworks such as the SFG. Following Barnbrook
and Sinclair (2001), a second, what might be termed ‘globalist’ position points forward to
global language parsing coverage by using a suite of independent local grammars. The
parsing of a text would on this basis proceed in two separate stages, the first of
which involves the prior functional identification of a sentence / text segment as a
causative, definition, etc. This stage is followed by a second phase in which the specialist
local grammar / parser is then selectively brought into play. The global parse is thus not
the product of a single grammatical framework but instead represents the integrated
output from a number of mutually exclusive grammatical frameworks.
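The two-stage procedure just described can be sketched as follows, again in Python and under loudly stated assumptions. The cue-phrase patterns, the two toy ‘grammars’ and the function global_parse are all hypothetical and drastically simplified. Stage one identifies the function of a segment by cue phrase. Stage two hands the segment to the matching local grammar only.

```python
import re

# Hypothetical cue phrases standing in for full sublanguage identification.
CAUSATIVE_CUES = r"\b(causes|results in|leads to)\b"
DEFINITION_CUES = r"\b(is defined as|refers to)\b"


def _split_on_cue(segment, cue_pattern, left_label, right_label):
    """Split a segment at its first cue phrase into two labelled parts."""
    match = re.search(cue_pattern, segment, re.IGNORECASE)
    return {
        left_label: segment[: match.start()].strip(),
        "hinge": match.group(0).lower(),
        right_label: segment[match.end():].strip(" ."),
    }


def parse_causative(segment):
    return _split_on_cue(segment, CAUSATIVE_CUES, "cause", "effect")


def parse_definition(segment):
    return _split_on_cue(segment, DEFINITION_CUES, "term", "definition")


# Each entry: functional label, stage-one cue pattern, stage-two parser.
LOCAL_GRAMMARS = [
    ("causative", CAUSATIVE_CUES, parse_causative),
    ("definition", DEFINITION_CUES, parse_definition),
]


def global_parse(segments):
    """Stage 1 labels each segment; stage 2 applies only that grammar."""
    output = []
    for segment in segments:
        for label, cues, parser in LOCAL_GRAMMARS:
            if re.search(cues, segment, re.IGNORECASE):
                output.append((label, parser(segment)))
                break
        else:
            # No local grammar claims the segment, so it is left unparsed.
            output.append((None, None))
    return output
```

Here the first matching grammar wins, which enforces mutual exclusivity in the crudest possible way. Consistent criteria for sublanguage identification would be needed in place of the cue-phrase test.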
One important question raised by the adoption of the globalist position is the
identification and status of future local grammars. It is suggested that possible candidates
might include local grammars of citation, duration, comparison, quantity etc.; this list is
far from exhaustive. Future research agendas could therefore be directed not only towards
corpus-driven lexical grammatical and functional analyses of these semantic sub-domains
but also the development of consistent criteria for sublanguage identification.
7.6 Conclusion
This thesis has put forward a functional grammar of causation using a corpus-driven
methodology as well as pointing towards the significant potential of such small-scale
grammars in the extraction of scientifically-useful information in the domain of natural
language processing. This study has shown the efficacy of an approach to the grammar of
a semantic domain in which the point of departure is lexis and the notion of the lexical
pattern. A listing of the lexis of cause and effect then serves as the basis for the local
grammar itself: firstly as a description of semantically-driven categories particular to
causation and secondly in terms of the linear relationships between these categories. The
fundamental importance of causation in the rhetoric of scientific discourse is such that a
grammar which identifies cause and effect linkages can serve as the basis for a number of
practical NLP utilities in automatic ontology building, automatic text summary and
information extraction.