MAXWELL INSTITUTE for MATHEMATICAL SCIENCES
Statistics Seminars
Abstracts
Model choice plays an increasingly important role in Statistics. From a
Bayesian perspective a crucial goal is to compute the marginal
likelihood of the data for a given model. This however is typically a
difficult task since it amounts to integrating over all model
parameters. The aim of this paper is to illustrate how this may be
achieved using ideas from thermodynamic integration or path sampling.
We show how the marginal likelihood can be computed via MCMC methods on
modified posterior distributions for each model. This then allows Bayes
factors or posterior model probabilities to be calculated. We show that
this approach requires very little tuning, and is straightforward to
implement. The new method is illustrated in a variety of challenging
statistical settings.
This work is in collaboration with Prof Tony Pettitt, QUT, Brisbane.
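As a rough illustration of the thermodynamic integration identity behind
this abstract, the sketch below uses a toy conjugate normal model (an
assumption, not taken from the talk) in which the "modified" or power
posterior, proportional to likelihood^t times prior, is available in closed
form. The log marginal likelihood equals the integral over t from 0 to 1 of
the expected log-likelihood under the power posterior. In a realistic model
that expectation would be estimated by MCMC at each temperature t; here it
is computed analytically so the result can be checked against the exact
answer. The data values, prior settings and temperature grid
t_i = (i/N)^5 are illustrative choices only.

```python
import math

def exact_log_marginal(y, sigma2, m0, s02):
    """Closed-form log marginal likelihood for y_i ~ N(theta, sigma2), theta ~ N(m0, s02)."""
    n = len(y)
    d = [yi - m0 for yi in y]
    sum_d = sum(d)
    sum_d2 = sum(di * di for di in d)
    # Marginally, y ~ N(m0 * 1, sigma2 * I + s02 * 1 1^T).
    log_det = (n - 1) * math.log(sigma2) + math.log(sigma2 + n * s02)
    quad = (sum_d2 - s02 * sum_d ** 2 / (sigma2 + n * s02)) / sigma2
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)

def expected_log_lik(t, y, sigma2, m0, s02):
    """E[log L(theta)] under the power posterior proportional to L(theta)^t * prior."""
    n = len(y)
    precision = 1.0 / s02 + t * n / sigma2
    var_t = 1.0 / precision
    mean_t = (m0 / s02 + t * sum(y) / sigma2) / precision
    # Under the power posterior, E[(y_i - theta)^2] = (y_i - mean_t)^2 + var_t.
    expected_ss = sum((yi - mean_t) ** 2 for yi in y) + n * var_t
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - expected_ss / (2.0 * sigma2)

def thermodynamic_log_marginal(y, sigma2, m0, s02, n_temps=1000):
    """Trapezoidal rule over a temperature grid that is denser near t = 0."""
    temps = [(i / n_temps) ** 5 for i in range(n_temps + 1)]
    values = [expected_log_lik(t, y, sigma2, m0, s02) for t in temps]
    total = 0.0
    for i in range(n_temps):
        total += 0.5 * (temps[i + 1] - temps[i]) * (values[i] + values[i + 1])
    return total

# Hypothetical data and prior settings, for illustration only.
y_obs = [1.2, 0.4, 2.1, 1.7, 0.9]
ti_estimate = thermodynamic_log_marginal(y_obs, sigma2=1.0, m0=0.0, s02=10.0)
exact_value = exact_log_marginal(y_obs, sigma2=1.0, m0=0.0, s02=10.0)
print(ti_estimate, exact_value)
```

The two printed values should agree closely. Repeating the calculation for
a second model and differencing the log marginal likelihoods would give a
log Bayes factor.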
Recent court challenges have highlighted the need for statistical
research on fingerprint identification. This paper proposes a model for
computing likelihood ratios to assess the evidential value of
comparisons with any number of minutiæ. In contrast with previous
research, the model, in addition to minutia type and direction,
incorporates spatial relationships of minutiæ without introducing
probabilistic independence assumptions. The model has a very promising
discriminating power and the likelihood ratios that it provides are
very indicative of identity of source. Furthermore, the model is able
to provide strong or very strong support for identity of source in a
significant proportion of cases, even when considering configurations
with few minutiæ.
The presentation is given as part of Heriot-Watt's mathematical biology
talks and is based on the joint work of Alex Cook, Glenn Marion and Gavin
Gibson. I'll be talking about the work done during the last year of my
PhD as part of BioSS's contribution to the cross-disciplinary,
international EU-funded ALARM project. This has involved modelling the
spread of alien plants in GB: one such plant is Giant Hogweed
(Heracleum mantegazzianum) from SW Asia, which has been damaging
Britain's biodiversity since it was introduced in the 19th C and which
is dangerous to human health. After constructing a spatio-temporal
stochastic model for its spread that takes account of covariates such
as the heterogeneous land-cover and climate of the island, we then fit
the model directly to observed data. Fitting the model was non-trivial
and involved the use of Markov chain Monte Carlo techniques. The nature
of the approach taken means that temporal predictions of the future
spread of the weed can be made, consistent with the invasion history.
This does not appear to be possible with existing statistical
techniques, which assume the process has reached equilibrium. The
approach we have taken can be generalised to other biological systems
exhibiting stochastic variability.
Catastrophic events often occur when two or more processes
simultaneously become extreme - for example, floods in coastal towns
may only occur when tide and river levels are both unusually high, and
a simple machine may only fail when all of the components within the
machine fail at roughly the same time. Quantification of the risks
associated with such events requires that we understand dependence
between random variables at extreme levels, and the development of
models for multivariate extremes is currently an area of active
research within both statistics and probability.
In this talk we present a new model to describe dependence at extreme
levels within a Markov chain, and discuss potential statistical
applications of this model. The key features of our model are that it
is able to allow for asymptotic independence as well as asymptotic
dependence, and that it provides a parsimonious description of extremal
dependence within a fairly broad class of Markov models. The talk does
not require any prior knowledge of extreme value theory.
This talk is based on recent joint work with Keming Yu (Brunel
University, London). A multivariate random walk model with slowly
changing parameters is introduced and investigated in detail. The
proposed model is not jointly stationary but locally jointly
stationary, where not only the drifts and the cross-covariances but
also the cross-correlations between single series are allowed to change
slowly over time. Hence the proposed model is particularly useful for
modelling and forecasting the value of financial portfolios under very
complex market conditions. Local linear and kernel approaches are
proposed for estimating the drift and the cross-covariance functions,
respectively. The asymptotic biases, variances and covariances of the
proposed estimators are obtained. The properties of the estimated value
of a given portfolio are also studied in detail. Results on the
estimated prediction errors for future values of a single stock or of a
portfolio are derived. Practical relevance of the proposal is
illustrated by application to several foreign exchange rates.
This work is concerned with classifying longitudinal event data for a
set of individuals, specifically identifying homogeneous periods of
activity within such a longitudinal time series, and in classifying
these periods of activity. Such data can arise in many circumstances.
The talk concentrates on offending data but such data can also occur in
education and psychology, for example. While much work has been carried
out related to the classification of offenders, we are interested in
local classification, examining which offences co-occur at the same
point in time. Latent class and other latent models will be presented,
and the results illustrated by a large sample of offenders taken from
the Offenders Index. The benefits and problems of such analyses will be
presented.
Species atlases are, because of their wide geographical and taxonomical
coverage, one of the most important sources of information on the
distribution of species over large spatial scales. These atlases are
essentially databases that consist of (date-stamped) records of
observed presences of plant species in cells of a regular grid that has
been superimposed on the landscape. A pervasive problem in the
interpretation of these data is that it cannot be assumed that an
absence of a record of a species from a grid cell means that the
species does not occur somewhere in that grid cell, unless detection
probabilities equal one. In addition, coverage of the grid is often
uneven so that detection probabilities can be expected to be spatially
varying.
We propose to use Bayesian image restoration techniques to incorporate
spatially varying detection probabilities in the analysis of species
atlas data. These techniques are based upon the parameterisation of
presence/absence models using Markov Chain Monte Carlo techniques,
while treating the occupancy states of grid cells with no recorded
presences as unobserved (latent) variables. I will present several
examples of the application of this methodology in the analysis of the
British and German atlases of vascular plants.
Recent publications have helped to resolve contradictions in past
evidence on the impact of badger culling on TB incidence in cattle. In
this controversial area, mathematical and statistical analyses are
providing valuable new insights into the complex disease system
involving cattle, wildlife and potentially humans.
Donnelly CA et al. Positive and negative effects of widespread badger
culling on tuberculosis in cattle. Nature advance online publication,
14 December 2005 (doi:10.1038/nature04454).
http://www.nature.com/nature/journal/vaop/ncurrent/pdf/nature04454.pdf
Woodroffe R, Donnelly CA et al. Journal of Applied Ecology online
early, 14 December 2005 (doi:10.1111/j.1365-2664.2005.01144.x).
http://www.blackwell-synergy.com/doi/abs/10.1111/j.1365-2664.2005.01144.x
Cox DR, Donnelly CA, et al. Simple model for tuberculosis in cattle and
badgers. PNAS, 102, 17588-17593, 6 December 2005.
http://www.pnas.org/cgi/content/abstract/102/49/17588
There are many existing systems for automatic facial recognition which
select the best available match to a questioned image of a human face
to one or more images selected from a database of known people. These
systems are successful and widely used in areas such as security
surveillance. However, they do not attempt to provide any quantitative
measure of quality of match but only give the best available match. In
this respect they fall short of a facial identification which can give
evidential information of use in a court of law. The work described
here provides a statistically based method which can remedy these
defects. It is based on landmark identification of facial features and
routine techniques of shape analysis to provide measures of inter and
intra variability of measured facial features, thus allowing a more
statistical assessment of facial identification.
In epidemiological research, the effect of a potentially modifiable
phenotype or ``exposure'' on a ``disease'' is often of public health
interest. Inferences on this effect can be distorted in the presence of
confounders affecting both phenotype and disease. Issues of confounding
require causal rather than associational arguments. Mendelian
randomisation (see Davey Smith & Ebrahim 2003, for example) is a
method for deriving unconfounded estimates of such causal relationships
and basically exploits the fact that a gene known to affect the
phenotype can often be reasonably assumed not to be itself associated
with any confounding factors and thus has an indirect effect on the
disease. It is well known in the economics and causal literature (e.g.
Pearl, 2000) that these properties define an instrumental variable but
they are minimal in the sense that they only permit unique
identification of the causal effect of the phenotype on the disease
status in the presence of additional and fairly strong assumptions.
These assumptions relate to the distributions of the variables e.g.
multivariate normality, and the nature of the dependencies between
them, e.g. linear. These assumptions are explored in the context of
standard epidemiological applications and the ideas illustrated using
directed acyclic graphs.
Ecological data sets are often highly structured, and so complex
modelling of random effects is necessary to make valid inferences.
Fortunately, ecologists as a whole have been receptive to the need to
use mixed models so that random variation is properly described,
creating opportunities for fruitful collaborations with
statisticians. After giving a brief overview of random effect models
and their estimation using the method of residual maximum likelihood, I
will describe a number of applications I have developed in my scientific
collaborations that raise interesting methodological issues. These will
include: a generalised linear mixed model for overdispersed counts of
ticks on grouse chicks; a random coefficients model for within-year
growth rate of sand-eels; using random effects to smooth an ordered
sequence of regression coefficients; and a multivariate random effects
model for compositional data on diet selection.
The ideas of measurement are so ubiquitous that we often fail to notice
them: they are simply parts of the conceptual universe in which we
function. However, it has not always been thus and sometimes, even now,
rips in this usually unnoticed background fabric appear, casting doubts
on one's view of the way the world works. Occasionally these tears have
serious, even fatal consequences. This talk looks at the conceptual
infrastructure of quantification, showing how humans have constructed
it, how it can be interpreted, and how it is manipulated to make valid
inferences about the real world. The talk is illustrated with
measurement tools from psychology, medicine, physics, economics and
other areas.
We consider dependence structures in multivariate time series that are
characterized by deterministic trends. Results from spectral analysis
for stationary processes are extended to deterministic trend functions.
A regression cross covariance and spectrum are defined. Estimation of
these quantities is based on wavelet thresholding.
Extreme value theory provides a relatively robust, asymptotically
motivated, basis for drawing statistical inferences about the magnitude
and frequency of events which are extreme and rare. Extreme value
methods are widely used in disciplines such as hydrology, finance and
engineering to quantify the risks associated with catastrophic events
such as the breach of a sea wall, the failure of a component within a
machine, or a collapse in the value of a portfolio of investments.
Primary interest often lies in describing changes in the magnitude and
frequency with which extreme events occur, and we review parametric and
nonparametric approaches to the detection and quantification of trends
in the characteristics of extreme events. We illustrate the different
approaches by analysing changes in the characteristics of North Sea
storm surges during the second half of the 20th century, a substantive
application which requires us to adapt and extend the existing
statistical methodology.
When designing an epidemiological experiment in which we expose plants
or animals to disease, we make certain choices---number of hosts,
timing of observations, duration of the experiment, etc---and typically
would be guided by intuition and resource constraints only. We have
shown previously that this may result in considerable wastage of
resources.
In this talk I introduce the topic of design of experiments for
stochastic systems---such as epidemics---that are highly non-linear,
and have been heretofore neglected by much of the optimal design
literature. We adapt the idea introduced by Müller and colleagues
to sample from the utility space using MCMC, in conjunction with
inferential moment closure techniques developed by Krishnarajah and
colleagues. We show how observation times may be selected to maximise
the expected information from two stochastic systems---a death process
(for which existing results have been found) and an SI epidemic, which
has been used previously to model a plant-pathogen system studied by
our collaborators in Cambridge.
Overdispersion is common when modelling discrete data like counts or
fractions. We propose to introduce and explicitly estimate individual
deviance effects (one for each observation), constrained by a ridge
penalty. This turns out to be an effective way to absorb
overdispersion, to get correct standard errors and to detect systematic
patterns. Large but very sparse systems of penalized likelihood
equations have to be solved. We present fast and compact algorithms for
fitting, estimation of standard errors and computation of the effective
dimension. We will present our methodology with applications to counts,
binomial, survival data as well as smoothing of mortality surfaces.
The familiar logit and probit models provide convenient settings for
many binary response applications, but a larger class of link functions
may be occasionally desirable. Two parametric families of link
functions are suggested: the Gosset link based on the Student t latent
variable model with the degrees of freedom parameter controlling the
tail behavior, and the Pregibon link based on the (generalized) Tukey
$\lambda$ family with two shape parameters controlling skewness and
tail behavior. Both Bayesian and maximum likelihood methods for
estimation and inference are explored. Implementations of the methods
are illustrated in the R environment.
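To make the idea of a parametric family of links concrete, here is a small
sketch of the symmetric one-parameter Tukey lambda quantile function used
as a link. This is a simplification: the Pregibon link described above is
based on a generalised family with two shape parameters, which is not
reproduced here. The function name and the chosen lambda values are
illustrative assumptions. As lambda tends to 0 the link approaches the
logit, and lambda = 1 gives the linear link 2p - 1.

```python
import math

def tukey_lambda_link(p, lam):
    """Symmetric Tukey lambda quantile function, mapping a probability to the real line."""
    if lam == 0.0:
        # Limiting case: the logit link.
        return math.log(p / (1.0 - p))
    return (p ** lam - (1.0 - p) ** lam) / lam

probs = [0.1, 0.3, 0.5, 0.7, 0.9]
for lam in (0.0, 0.5, 1.0):  # illustrative shape values
    print(lam, [round(tukey_lambda_link(p, lam), 3) for p in probs])
```

Smaller values of lambda give heavier tails in the implied latent
distribution, which is the sense in which a shape parameter controls tail
behaviour.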
The class of double hierarchical generalized linear models is based on
GLMs, and allows three extensions:
(1) Joint estimation of models for mean and dispersion
(2) Random effects in the linear predictor for the mean with a
distribution that is any conjugate of a GLM distribution
(3) Random effects similarly for the linear predictor of the dispersion
model
Spline terms and correlations between random effects may be specified.
Fitting is done by maximizing a form of extended likelihood called the
h-likelihood. The algorithm reduces to fitting an interconnected set of
GLMs. The algorithm does not require prior probabilities, quadrature,
or the EM algorithm, and is much faster than many existing methods.
The more major part of this talk is about conceptual and theoretical
aspects of some new parametric families of distributions and the more
minor part is about some of their intriguing practical ramifications
for "nonparametric" quantile estimation and regression. I will first
describe a couple of novel ways of generating three- and/or
four-parameter families of continuous univariate distributions. These
will afford skewness and the first a variety of tail weights, heavy if
required, with obvious consequences for providing alternatives for
practical statistical modelling, circumventing the need for some of the
more ad hoc methods of robust statistics. I will then consider some
aspects of the interaction between the second of these families of
distributions and kernel-based quantile estimation and quantile
regression. While I won't be addressing problems in either Finance or
Actuarial Science directly, heavy-tailed distributions have particular
links with the former and nonparametric regression with the latter!
Likelihood ratios provide a natural way of computing the value of
evidence under competing propositions. Models for likelihood ratios
have been developed for continuous data (e.g. glass refractive index).
Such methods are also desirable for discrete data such as pollen counts
or gun-shot residue particles. Challenges in development of discrete
models include the presence of zeros and the lack of sufficient amounts
of background data. In this talk a number of approaches to obtaining
likelihood ratios for count data will be discussed and illustrated
using pollen data.
Stereo-photogrammetry provides high-resolution data defining the
shape of three-dimensional objects. One example of
its application is in a collaborative study of the growth
of children's faces. The clinical aims of the study
are to describe the facial shape and growth of healthy
children and to contrast this with the shape and growth of
children who have been born with a cleft lip and/or palate
and who have subsequently undergone surgical repair. Information
can be extracted in a variety of forms. Methods of
analysing landmark shape data are well developed but
landmarks alone clearly do not adequately represent the
very much richer information present in each digitised
face. Facial curves with clear anatomical meaning have
also been extracted. In order to exploit the full
extent of the information present in the images,
standardised meshes, whose nodes correspond across
individuals, have also been fitted. Some of the
issues involved in analysing data of these types will be discussed
and illustrated on the facial growth study. These
include graphical exploration, the measurement of
asymmetry and longitudinal modelling.
Data with an array structure are common in statistics. An early
example is the factorial design and Yates (1937) gave an efficient
algorithm for computing the factorial effects in such a design. The
generalized linear model (GLM) of Nelder & Wedderburn (1972) gives a
unified approach to analysing regression problems with non-normal
error structure. However, this analysis ignores any array structure
in the data or the model. We develop an arithmetic of arrays which
generalizes Yates's algorithm and which allows us to define the
expectation of a data array as a sequence of linear operations on a
coefficient array. This arithmetic also leads to low storage, high
speed computation in the scoring algorithm of the GLM. We call such
a model a generalized linear array model or GLAM. We apply the
method to the smoothing of multidimensional arrays. Some examples
are presented.
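The abstract does not spell out the array arithmetic, so the following is
only a sketch of the standard two-dimensional identity that this kind of
computation rests on: for marginal matrices B1 and B2 and a coefficient
array Theta, the linear predictor (B2 (x) B1) vec(Theta) equals
vec(B1 Theta B2^T), so the large Kronecker product never needs to be
formed. The matrices below are small hypothetical examples and the helper
functions are written out in plain Python for self-containment.

```python
def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def kron(a, b):
    """Kronecker product of a and b."""
    return [[a[i][j] * b[k][l]
             for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]

def vec(a):
    """Stack the columns of a matrix into a single list."""
    return [a[i][j] for j in range(len(a[0])) for i in range(len(a))]

# Hypothetical marginal model matrices and coefficient array.
b1 = [[1, 0], [1, 1], [1, 2]]          # 3 x 2
b2 = [[1, 0], [1, 1], [1, 2], [1, 3]]  # 4 x 2
theta = [[2, -1], [3, 5]]              # 2 x 2

# Array form: two small matrix products, result is a 3 x 4 array.
array_form = vec(matmul(matmul(b1, theta), transpose(b2)))

# Kronecker form: one 12 x 4 matrix applied to the vectorised coefficients.
big = kron(b2, b1)
theta_vec = vec(theta)
kron_form = [sum(row[c] * theta_vec[c] for c in range(len(theta_vec)))
             for row in big]

print(array_form == kron_form)
```

The storage saving comes from keeping only b1 and b2 (here 6 + 8 entries)
rather than the full 12 x 4 model matrix; the gap grows quickly with the
size and dimension of the array.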
This seminar will take place at 3.15 pm in Room 6206, James Clerk
Maxwell Building at the King's Buildings site in Mayfield Road. Tea
and coffee will be available after the seminar in the Mathematics
School, Staff Common Room (5212).
Over recent years, there has been increasing concern relating to many
wildlife species leading to surveys being undertaken to study many of
these populations. We will focus on data typically collected on UK bird
populations: survey data and ring-recovery data. Interest lies in both
estimating the change in population size over time, and identifying the
factors that contribute to this changing population.
We consider a state-space approach to take into account that the survey
data are only estimates of the total population size. In addition we
demonstrate the increased precision that can be obtained when jointly
analysing survey data with the ring-recovery data that are often available. Finally,
we wish to discriminate between competing biological hypotheses, in
order to explain the changes in population size over time. We consider
a Bayesian approach and use reversible jump MCMC to simultaneously
explore model and parameter space to obtain both parameter estimates
and posterior model probabilities. The methods are applied to a real
data set relating to the UK lapwing population and a variety of
interesting results are presented.
Most scientific studies concentrate on factors affecting the mean value
of a response variable, described by the fixed effects in a mixed
model. The role of random effects is then to ensure correct inference
about the fixed effects. However, in some situations, the scientific
hypothesis relates not to the fixed effects but to the variances of the
random effects, whilst in other situations it is the random effect
assumption that makes useful modelling possible. I will describe
analyses of some ecological data sets to illustrate these
possibilities, making some observations about random effect models
including the relationship between inference based on REML estimation
and MCMC sampling from the full likelihood.
Suppose a quantity is to be predicted and various models could be used.
The approach in model selection is to use just the prediction of the
model that appears to be best. An alternative is to form a weighted
average of the predictions given by the different models. But what
weights should be given to the different models? Should the weight
given to a model be reduced if it is very similar to another model?
What if two models are virtually identical - should they each be given
half the weight that they would otherwise receive?
This talk considers methods of assigning weights on the basis of the
correlation structure between models. Different weighting strategies
are proposed and desirable properties in a weighting scheme are
suggested. Simulation is used to compare the weighting schemes in
situations where optimal weights can be determined.
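One possible correlation-based scheme, shown here purely as an
illustration and not necessarily among those proposed in the talk, is the
minimum-variance combination of unbiased predictions: weights proportional
to the row sums of the inverse error covariance matrix. The covariance
matrix below is hypothetical, with two nearly identical models
(correlation 0.98) and a third independent one. The linear solver is a
small helper written for this sketch.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def min_variance_weights(cov):
    """Weights proportional to inverse(cov) times a vector of ones, scaled to sum to one."""
    raw = solve(cov, [1.0] * len(cov))
    total = sum(raw)
    return [w / total for w in raw]

rho = 0.98  # hypothetical correlation between models 1 and 2
cov = [[1.0, rho, 0.0],
       [rho, 1.0, 0.0],
       [0.0, 0.0, 1.0]]
weights = min_variance_weights(cov)
print([round(w, 4) for w in weights])
```

Under this particular scheme the two near-duplicate models each receive
roughly a quarter of the weight and the independent model roughly a half,
so the duplicates do approximately share the weight a single model would
receive.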
This talk is concerned with a stochastic model for the spread of an SIR
(susceptible → infective → removed) epidemic among a closed, finite
population that contains several types of individual and is partitioned
into households. A pseudolikelihood framework is presented for making
statistical inference about the parameters governing such epidemics
from final outcome data, when possibly only some of the households in
the population are observed. The framework includes parameter
estimation, hypothesis tests and goodness-of-fit. Asymptotic properties
of the procedures are derived when the number of households in both the
sample and the population are large, which correctly account for
dependencies between households. The methodology is illustrated by
applications to data on a variola minor outbreak in Sao Paulo and to
data on influenza outbreaks in Tecumseh, Michigan. (joint work with Owen
Lyne)
A parametric representation of a statistical model may involve some
redundancy; that is, the mapping from parameter space to family of
distributions may be many-to-one. Such over-parameterized
representations are often very useful conceptually, but can cause
computational and inferential problems (ridges in the likelihood,
non-estimable parameter combinations). For linear and
generalized-linear models, well-known approaches use either a reduced
basis or a generalized matrix inverse. In this talk I will discuss how
to work with over-parameterized nonlinear models. Aspects covered will
include maximum-likelihood computation, detection of
non-identifiability, and presentation of results. Some implications for
Bayesian analysis will also be touched upon. The work is motivated by
the design of an R package to specify and fit general regression models
involving multiplicative interaction terms; these include the (G)AMMI
models that are used for example in crop science to represent
genotype-by-environment effects, as well as various much-used models
for categorical data in social research.
The problems that can arise in the analysis of a large family study
will be discussed and exemplified using a study where various
quantitative variables associated with hypertension were measured.
Possible solutions will be discussed. Issues examined will be:
normalisation; adjustment for covariates; allowance for the effect of
drug treatment; possible biases in assays; multiple testing problems
when looking at genome wide scans and candidate genes; the problem that
some genetic markers (SNPs) are more informative than others and the
impact this has for localisation of a possible gene affecting a
character. (Joint work with Bernard Keavney)
Statistical finite mixtures have been receiving much attention as a
conceptually simple way of relaxing distributional assumptions. Within
the Bayesian approach, MCMC methods, notably Green's reversible jump
methodology, have made it feasible to estimate finite mixtures with an
unknown number of components. Some issues remain open, among them are
the choice of prior distribution for the number k of mixture
components, and the necessity of designing efficient reversible jump
moves for each parametric family of components being mixed.
In this talk I will provide support for the use of a Poi(1) prior
distribution for k. I will also present a new MCMC scheme, the
allocation sampler, which can be applied, with minimal changes, to any
family of component distributions, under the assumption that the
component parameters can be integrated out of the model analytically.
Artificial and real data sets will be used to illustrate the method.
In part, this is joint work with Alastair Fearnside.
Any portfolio credit risk model that is to be used to calculate a loss
distribution associated with defaults and changes in rating must address
the challenge of modeling dependent defaults and dependent rating
migrations. Most industry models (such as KMV, CreditMetrics,
CreditRisk+) incorporate mechanisms for modeling this dependence,
generally by assuming conditional independence of defaults and migrations
given common economic factors. However, the calibration of these
mechanisms is often quite ad hoc, despite the fact that the tail of the
portfolio loss distribution is extremely sensitive to small changes in
the parameters governing dependence.
We consider the problem of making formal statistical inference for such
models based on historical default and rating migration data. In the
solution we propose, portfolio credit models are represented as
generalized linear mixed models (GLMMs) and inference is made using
Markov chain Monte Carlo (MCMC) techniques. This general framework allows
quite complex models where the random effects essentially play the role
of unobserved latent factors influencing default and migration rates; to
capture economic cycle effects the latent factors are allowed to have a
dynamic time-series structure. An empirical study of Standard and Poor's
data shows strong evidence for economic cycles and also reveals
pronounced sectoral heterogeneity in default and migration rates.
I propose a feasible method of ranking realised volatility (RV)
estimators based on actual returns data. In contrast, most rankings of
RV estimators currently in the literature are either graphical in
nature, most notably the "volatility signature plot", or rely on
asymptotic approximations of the mean-squared errors of the estimators,
or on simulations. The proposed method relies on the existence of a
volatility proxy that is unbiased for the variable of interest, and
satisfies a certain "zero correlation" condition. The zero correlation
condition has some similarities with instrumental variables estimation.
The volatility proxy must be unbiased but it does not need to be very
precise; a simple and widely-available proxy for conditional variance is
the daily squared return. In an application to IBM equity return
volatility, I find that a simple realised volatility estimator based on
5-minute returns performs as well as a wide variety of competing
estimators.
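For readers unfamiliar with the two quantities named above, the sketch
below computes them in their usual textbook form: realised variance as the
sum of squared intraday log returns, and the daily squared return as the
squared open-to-close log return. The price path is made up for
illustration, and the ranking method of the talk itself is not reproduced
here.

```python
import math

def log_returns(prices):
    """Log returns between consecutive prices."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]

def realised_variance(prices):
    """Sum of squared intraday log returns over one day."""
    return sum(r * r for r in log_returns(prices))

def squared_daily_return(prices):
    """Squared log return from the first to the last price of the day."""
    return math.log(prices[-1] / prices[0]) ** 2

# Hypothetical intraday prices sampled at a fixed interval (e.g. 5 minutes).
intraday_prices = [100.0, 100.4, 99.9, 100.6, 100.2]
print(realised_variance(intraday_prices), squared_daily_return(intraday_prices))
```

The squared daily return uses only two prices and is therefore much
noisier than the realised variance, which is why it serves as a proxy
rather than as a precise estimator.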
(not available)
Credit risk arises as obligors have a likelihood of defaulting on
pre-arranged payments. The New Basel Capital Accord has set a new
framework for credit risk management for financial institutions, but
its implementation in any financial institution raises a number of
technical modelling challenges. In this talk we focus on a new
methodology for modeling and estimating transition probabilities
between credit classes in a bank's rating system. First, we develop a
new statistical model that describes the typical credit rating process
that most major banks employ. Second, we describe a Bayesian
hierarchical framework for model calibration, using Markov Chain Monte
Carlo techniques implemented through Gibbs sampling. This approach
allows us to address the technical issues related to the estimation of
default probabilities from low default portfolios. Third, we apply this
methodology to the analysis of an extended rating transitions data set
from Standard and Poor's covering 1981--2004, and we examine both the
in-sample and out-of-sample performance of the credit rating process
models. The results of this research provide a framework that banks and
other financial institutions can use to show that their internal rating
systems produce estimates of rating transition probabilities that are
reasonable from a regulatory perspective.
We provide a general approach to goodness-of-fit tests across a wide
variety of time series models, including for example linear models,
nonlinear models, short memory models, long memory models,
constant-variance models and ARCH/GARCH
models. The test statistic is generically based on a linear
transformation of a score-marked empirical process, which converges
to the maximum of the square of a standard Brownian motion under
the null hypothesis; its critical values are easy to obtain. Our
test has nontrivial local power under local alternative hypotheses.
(This is joint work with Howell Tong at the London School of Economics
and Political Science.)
The estimation of a linkage map of molecular markers is a prerequisite
of studies to locate genes affecting important quantitative traits. The
estimation is straightforward if markers can be scored on a population
derived from a cross between two inbred parents, but this is not
possible in many plant species, especially bushy or tree species. This
talk focuses on the analysis of a mapping population in one such
species, blackcurrant, and uses some exploratory statistics and simple
genetic models to uncover some interesting features of the population.
Sources of heterogeneity in rating migration behavior are explored
using a continuous time Markov chain based framework. Working in
continuous time circumvents the embedding problem, mitigates the
censoring effect and facilitates term structure modelling with
arbitrary prediction horizons. Classical estimation provides ample
evidence of heterogeneity. However, adopting a Bayesian estimation
procedure can help mitigate the problems arising from data sparsity
and reduce estimation error. The transition probability matrices
estimated for different issuer profiles can be quite different from
each other. Using the CreditRisk+ framework, and a sample credit
portfolio, it can be shown that ignoring heterogeneity may give
erroneous estimates of VaR and a misleading picture of the risk
capital.
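To make the continuous-time idea concrete, here is a minimal sketch that turns a generator matrix Q into transition probabilities over an arbitrary horizon t via the matrix exponential P(t) = exp(tQ). The three rating states and all intensities are invented for illustration and are not estimates from the study; the truncated-series exponential is a simple stand-in for a library routine.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(m, squarings=10, terms=20):
    """Matrix exponential by scaling and squaring with a truncated Taylor series."""
    n = len(m)
    scaled = [[v / 2 ** squarings for v in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def transition_matrix(generator, horizon):
    """Transition probabilities over `horizon` time units: exp(horizon * Q)."""
    return mat_exp([[q * horizon for q in row] for row in generator])

# Hypothetical yearly intensities for states (A, B, Default).
# Each row sums to zero and the default state is absorbing.
Q = [[-0.10, 0.09, 0.01],
     [0.05, -0.15, 0.10],
     [0.00, 0.00, 0.00]]

P_half_year = transition_matrix(Q, 0.5)
P_three_years = transition_matrix(Q, 3.0)
print([round(p, 4) for p in P_three_years[0]])
```

Because the horizon enters only through the scalar multiplying Q, the same generator yields consistent matrices for any horizon, which is the property that makes term-structure modelling straightforward.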
In a study of young women's start of first conjugal unions in four
selected countries in Central and Eastern Europe (Russia, Romania,
Bulgaria, and Hungary) in 1980 through 2004, based on data from the
national Gender and Generations Surveys, we use an extension of
piecewise-constant hazard regression to analyze jointly the competing
risks of entry into a cohabitational and a marital union. This extension
allows us to compare trends over time and relative risks of covariates
ACROSS the two competing risks, a comparison which is infeasible
otherwise. In this manner we find, among many other things, that
marriage risks have dropped dramatically in all four countries since
well before the fall of communism and thus before the societal
transition to a market economy got underway around 1990. There has also
been a counterpart increase in cohabitation in Russia, Romania, and
Hungary, which is a clear manifestation of the Second Demographic
Transition in these countries. In Bulgaria, entry rates into
cohabitation have been surprisingly stable during the last twenty years
of the twentieth century, and they actually FELL during the last years
of our period of observation. This does not preclude that the Second
Demographic Transition has reached Bulgaria also, as is seen when we
discover that rates of conversion of cohabitations into marriages have
declined radically ever since the 1980s, which means that consensual
unions have lasted progressively longer. Evidently, the Second
Demographic Transition is not a unitary movement that reached all
countries in Central and Eastern Europe roughly at the same time; it is
present but its manifestation depends on national circumstances.
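The basic competing-risks model with piecewise-constant hazards can be sketched as follows. This is not the authors' extension, only the standard building block: two cause-specific hazards (here labelled a and b, for example cohabitation and marriage) that are constant within each time interval. All interval boundaries and hazard values in the example are hypothetical.

```python
import math

def cumulative_incidence(cuts, hazards_a, hazards_b, t):
    """Cumulative incidence of causes a and b by time t, plus the
    probability of no event, under piecewise-constant hazards.

    cuts: increasing interval start points, beginning with 0.0;
    hazards_a[j] and hazards_b[j] apply on [cuts[j], cuts[j+1]),
    with the last interval open-ended.
    """
    surv = 1.0            # probability of no event before the current interval
    inc_a = inc_b = 0.0
    for j, start in enumerate(cuts):
        if t <= start:
            break
        end = cuts[j + 1] if j + 1 < len(cuts) else math.inf
        width = min(t, end) - start
        total = hazards_a[j] + hazards_b[j]
        if total > 0:
            # Probability of leaving the risk set within this interval,
            # split between the causes in proportion to their hazards.
            leave = surv * (1.0 - math.exp(-total * width))
            inc_a += leave * hazards_a[j] / total
            inc_b += leave * hazards_b[j] / total
            surv -= leave
    return inc_a, inc_b, surv

# Hypothetical hazards per year over three age intervals.
cuts = [0.0, 5.0, 10.0]
cohabitation = [0.02, 0.08, 0.04]
marriage = [0.05, 0.06, 0.02]
print(cumulative_incidence(cuts, cohabitation, marriage, 12.0))
```

The three returned quantities always sum to one, which makes the two risks directly comparable on a common probability scale.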
We suppose that we have mortality data arranged in two-way tables of
deaths and exposures classified by age at death and year of
death. It is natural to suppose that there is a smooth underlying
force of mortality, the mortality surface, that varies with age and
year (or period). However, observed mortality is subject to more
than stochastic deviation from this smooth surface; for example, flu
epidemics, hot summers or cold winters can disproportionately affect
the mortality of certain age groups in particular years. We call
such an effect a period shock.
We describe the mortality surface with an additive model with two
components: the underlying smooth surface is modelled with
2-dimensional $P$-splines; the period shocks
are modelled with a 1-dimensional $P$-spline in the age direction for
each year. This large regression model is expressed as an
additive generalized linear array model (Currie, Durban and Eilers,
2006). The method is illustrated with Swedish mortality data
taken from the Human Mortality Database.
Patients can acquire infections from pathogen sources within hospitals
and certain pathogens appear to be found mainly in hospitals.
Methicillin-resistant Staphylococcus aureus (MRSA) is an example of a
hospital acquired pathogen that continues to be of particular concern
to patients and hospital management. Patients infected with MRSA can
develop severe infections which lead to increased patient morbidity and
costs for the hospital. Pathogen transmission to a patient occurs
indirectly via health-care workers who do not regularly perform hand
hygiene. Infection control measures that can be considered include
quarantine for colonised patients and improved hand hygiene for
health-care workers.
The talk develops statistical methods and models in order to assess the
effectiveness of the two control measures (i) isolation and (ii)
improved hand hygiene. For isolation, data from a prospective study
carried out in a London hospital is considered and statistical models
based on detailed patient data are used to determine the effectiveness
of isolation. The approach is Bayesian and involves Monte Carlo
sampling.
For hand hygiene it is not possible, for ethical and practical reasons,
to carry out a prospective study to investigate various levels of hand
hygiene. Instead hand hygiene effects are investigated by simulation
using parameter values estimated from data on health-care worker hand
hygiene and weekly colonisation incidence collected from a hospital
ward in Brisbane. Utilising a deterministic model for vector borne
transmission of diseases, a Markov model is developed and used to
estimate important transmission parameters. Unfortunately for one
transmission parameter there is little information available and an
alternative approach based on the deterministic model eliminates this
parameter so allowing the effects of changing hand hygiene to be
investigated using simulation.
Conclusions about the effectiveness of the two infection control
measures will be discussed and, from a modelling point of view, some
conclusions will be made contrasting simulation models with statistical
studies.
The talk involves collaborative work with Marie Forrester, Emma
McBryde, Ben Cooper, Gavin Gibson and Sean McElwain.
For most households, choosing whether to rent or buy a home is
a difficult, multifaceted problem. Households have to grapple with the
uncertainties of future movements of rents and house prices and with
the substantial cost of changing residence. Housing tenure
decisions are further complicated if households' exposure to labour
income risk varies across occupations, industries and regions. Then,
potential correlations with these background risks may influence the
rent or buy decision. In this study, we present preliminary empirical
evidence, derived from the German Socio-Economic Panel (GSOEP), that
both labour income growth and rent growth vary across industries and
regions. We find that income-rent correlations have a statistically
significant influence on industry-specific average rental shares in
West-German federal states. However, the economic significance of the
relationship between real rent growth and real income growth on the
decision to rent or own is rather low. A one standard deviation increase
in the income-rent correlation implies an increase in rental shares of
about 1.75 percentage points. (Work in collaboration with Martin Wersing
and Axel Werwatz.)
Hospital length of stay data typically show a distribution with a mode
near zero and a long right tail, and can be hard to model adequately.
Traditional models include the gamma and log-normal distributions, both
with a quadratic variance-mean relationship. Phase-type distributions
which describe the length of time to absorption of a Markov chain with
a single absorbing state also have a quadratic variance-mean
relationship. Covariates of interest include an estimate of the length
of stay for an uncomplicated admission, with excess length of stay
modelled relative to this quantity either multiplicatively or
additively. A number of different models can therefore be constructed,
and the results of fitting these models will be discussed in terms of
goodness of fit, significance of covariate effects and estimation of
quantities of interest to health economists.
Problems involving classification of high-dimensional data, and
‘highly multiple’ hypothesis testing, arise frequently
in the analysis of genetic data and complex signals. In this talk we
show that, in the context of multiple hypothesis testing, the
assumption of independence is much less of an issue in high-dimensional
settings than in conventional, low-dimensional ones. This is
particularly true when the null distributions of test statistics are
relatively light-tailed, for instance when they can plausibly be based
on Normal approximations. These issues are related to the `upper tail
independence' property, which is familiar in problems involving risk
analysis. Similar methods and ideas also lead to new insights for
heavy-tailed data.
Real-world time series are often assumed to be stationary even
when they are not. Sometimes this has disastrous consequences. This
talk introduces some new tests for time series stationarity. Given two
time series it is often interesting to ask whether there is any
association between them. Various methods have been devised to address
this question (mostly for stationary series): cross-correlation,
cross-spectral analysis and cointegration. We introduce a new concept,
called costationarity, which looks for linear combinations of locally
stationary time series that are stationary. If two time series are
costationary then there exists a non-trivial, stochastic relationship
that can be exploited. We explain how our costationarity determination
works and apply it to the FTSE100 and SP500 time series and show how
the log-returns of these series are costationary.
Data from insurance portfolios and pension schemes lend themselves
particularly well to the application of survival models. In addition
to the traditional actuarial risk-rating factors of age, gender and
policy size, we find that the use of geodemographic profiles based on
postcode provides a major boost in explaining risk variation. The use
of heterogeneity or frailty models can determine whether there are
further rating factors supported by a data set. The use of
bootstrapping can determine if there might be further financially
important variation not accounted for in a mortality model, while the
use of weights in model-fitting can help limit any mis-statement of
financial risk.
To compare the means between two conditions (such as disease versus
healthy samples) for a large number of variables given a small number
of replicates, we consider two types of Bayesian hierarchical models:
with noninformative ("objective") and mixture priors for the difference
between the means. In the mixture model, we study sensitivity to the
choice of prior on simulated data and choose the best model using mixed
posterior predictive checks. In the model with noninformative prior, we
propose to conduct the inference using adaptive interval hypothesis
testing where the interval depends on variability of each variable.
These approaches will be illustrated on gene expression data sets
produced by the BAIR consortium (www.bair.org.uk).
(Joint work with Alex Lewin and Sylvia Richardson, Imperial College
London)
The authors are engaged in an ambitious research programme whose
objective is the use of statistical models to make inferences concerning
the climate of Europe for the past 15,000 years. We have reported in
detail (Haslett et al, 2006) on the methods used to reconstruct climate
at a single site - Sluggan Moss in Ireland. The modelling approach is that
of Bayesian Space-Time processes. The basic data are multivariate
counts of pollen at different depths in lake sediment. Modern
data on climate and vegetation provide the essential 'training'
information which allows inference on past climate. The essential
hypothesis is that climates at times past in, e.g., Ireland
correspond to climates somewhere on the Earth in modern times.
Considerable uncertainty surrounds the entire exercise:
radiocarbon dating and sediment depth allow inference about the
calendar age of the samples; the multivariate 'response' of
vegetation (and thus pollen) to multivariate climate does not lend
itself to parametric modelling; pollen counts are hugely zero-inflated
and are also subject to sampling variation; several species
comprise sub-types which favour separate climate regimes, but
have pollen that are indistinguishable. The likelihoods for pollen
given palaeo-climate are thus rich with challenge. A joint prior
for palaeoclimate space-time history provides possibilities, when
inferring the palaeoclimate corresponding to a single sample, for
borrowing strength from other samples at other locations in
space-time. The essential feature of the joint prior is a model in
which climate changes 'smoothly' in space and time; however there
is strong evidence of occasional very rapid episodes of past
climate change.
The computational methodology relies on MCMC and, for large models
such as this, convergence is not assured even by very long runs,
of days and weeks. Here we report on progress with (a) avoiding
MCMC and (b) uncertainties in modelling chronologies.
Here we focus exclusively on the time domain and discuss reconstruction
at a single site.
In a surprising range of situations, one encounters samples in which
the observations are combinatorial objects (e.g., in social network
theory, cluster analyses, multiple protein phylogeny, and card-sorting
experiments). This talk shows how a number of standard statistical
goals, such as inference on centre and dispersion, confidence regions,
and a version of the linear model, can all be realized in such
circumstances.
In addition to being crucial to the establishment of archaeological
chronologies, radiocarbon dating is vital to the establishment of time
lines for many Holocene and late Pleistocene palaeoclimatic studies and
palaeoenvironmental reconstructions. The calibration curves necessary
to map radiocarbon to calendar ages were originally estimated using
only measurements on known age tree-rings. More recently, however, the
types of records available for calibration have diversified and a large
group of scientists (known as the IntCal Working Group, IWG) with a
wide range of backgrounds has come together to create
internationally-agreed estimates of the calibration curves. In 2002, I
was recruited to the IWG and asked to offer advice on statistical
methods for curve construction. In collaboration with Paul Blackwell, I
devised a tailor-made Bayesian curve estimation method which was
adopted by the IWG for making all of the 2004 internationally-agreed
radiocarbon calibration curve estimates. In this talk I will report on
that work and on the on-going work that will eventually provide models,
methods and software for rolling updates to the curve estimates.
I will first review well-known differences between odds ratios,
relative risks and risk differences. These results motivate the
development of methods, analogous to logistic regression, for
estimating the latter two quantities. I will then describe simple
parametrizations that facilitate maximum-likelihood estimation of the
relative risk and risk-difference. Further, these parametrizations
allow for doubly-robust g-estimation of the relative risk and risk
difference.
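The well-known differences between the three measures can be seen in a small numerical sketch. The counts below are made up for illustration; the function name and argument names are likewise hypothetical.

```python
def effect_measures(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Risk difference, relative risk and odds ratio from a 2x2 table."""
    p1 = events_exposed / n_exposed        # risk among the exposed
    p0 = events_unexposed / n_unexposed    # risk among the unexposed
    return {
        "risk_difference": p1 - p0,
        "relative_risk": p1 / p0,
        "odds_ratio": (p1 / (1 - p1)) / (p0 / (1 - p0)),
    }

# Rare outcome: the odds ratio is close to the relative risk.
rare = effect_measures(2, 1000, 1, 1000)
# Common outcome: same relative risk, but the odds ratio is much larger.
common = effect_measures(60, 100, 30, 100)

print(rare)
print(common)
```

Both tables have a relative risk of 2, yet the odds ratio is only close to 2 when the outcome is rare, and the risk differences differ by two orders of magnitude.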
The presentation looks at ways to build multivariate distributions
given only a partial specification. The problem of constructing such
distributions arises naturally in risk analysis when using expert
generated data. The basic tool used to build up these distributions is
the copula, a model for bivariate dependency. Vines provide a structure
to extend bivariate models to multivariate models, incorporating
additional information about multivariate dependency. We consider ways
of eliciting information about the copulas and how the minimum
information principle can be used to find specific distributions
meeting the partial specifications. This talk covers joint work of the
presenter together with collaborators Roger Cooke, Hans Meeuwissen, Daniel
Lewandowski and Dorota Kurowicka over a number of years.
Properties of the soil are influenced by processes that occur over a
wide range of scales from the molecule to the globe. As a result the
spatial variation of soil is substantial, and poses a challenge if we
want to predict its behaviour or monitor its condition. In this seminar
I will show how geostatistical methods, based on the assumption that
soil properties can be treated as realizations of spatially correlated
random functions, have been applied to a range of problems in soil
science. The assumptions of stationarity in the variance, which
underpin most standard geostatistical methods, can be relaxed when
estimation and prediction are recast in terms of the linear mixed model.
I shall show how geostatistical approaches have been applied to
problems such as mapping pollutants around a smelter, predicting
greenhouse gas emissions from the soil and using a rich soil database
to provide forensic intelligence.
LULU smoothers are a class of non-linear smoothers based on compositions
of minima and maxima over different window sizes. They have been shown
to possess very attractive mathematical properties compared to other
non-linear smoothers - see e.g. Rohwer [2]. Although they have been
studied fairly extensively for their mathematical properties, little
attention has been given to their distributional and statistical
properties. However, some progress has recently been made with the
latter - see Conradie, de Wet and Jankowitz [1]. In this talk LULU
smoothers will be introduced and a short review given of their
mathematical properties. We will then give some new results on the
distributions of these smoothers, for the most simple ones as well as
for more complex ones in the class. Furthermore, we will derive some
asymptotic results of the smoothers when the window size tends to
infinity. These limiting distributions are given in terms of the class
of extreme value distributions. We also give their limiting behaviour in
terms of that of the second largest order statistic. Finally, some
numerical and simulation results will also be given to illustrate their
behaviour.
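For readers unfamiliar with the basic operators, the following is a minimal sketch of the two building blocks, L_n and U_n, from which the LULU class is composed. The boundary treatment (repeating the end values) is an assumption made here for convenience, and the function names are illustrative.

```python
def _pad(x, n):
    """Extend a non-empty sequence by repeating its end values n times."""
    return [x[0]] * n + list(x) + [x[-1]] * n

def lower(x, n):
    """L_n: maximum, over the n + 1 windows of length n + 1 that contain
    position i, of the window minimum. Removes upward pulses of width <= n."""
    p = _pad(x, n)
    return [max(min(p[j:j + n + 1]) for j in range(i - n, i + 1))
            for i in range(n, n + len(x))]

def upper(x, n):
    """U_n: minimum, over the same windows, of the window maximum.
    Removes downward pulses of width <= n."""
    p = _pad(x, n)
    return [min(max(p[j:j + n + 1]) for j in range(i - n, i + 1))
            for i in range(n, n + len(x))]

signal = [0, 0, 5, 0, 0, -4, 0, 1, 1, 1]
print(lower(signal, 1))            # upward spike removed
print(upper(signal, 1))            # downward spike removed
print(upper(lower(signal, 1), 1))  # the composition U_1 L_1 removes both
```

Compositions such as `upper(lower(x, n), n)` and `lower(upper(x, n), n)` give the two-sided smoothers; larger n removes wider pulses.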
Over the last decade or so, interest in using mixtures of statistical
models for all types of applications has risen rapidly from both the
frequentist and Bayesian points of view. What is so appealing about
such models? Are they always good news? Mainly through examples, I
shall discuss the reasons for this interest and try to point out what
you might gain, what you might lose and the care you should take when
adopting a mixture modelling approach.
Two-filter smoothing is a principled approach for performing optimal
smoothing in non-linear non-Gaussian state-space models where the
smoothing distributions are computed through the combination of
`forward' and `backward' time filters. The `forward' filter is the
standard optimal Bayesian filter but the `backward' filter, generally
referred to as the backward information filter, is not a probability
measure on the space of the hidden Markov process. In cases where the
backward information filter can be computed in closed form, this
technical point is irrelevant. However, for general state-space models
where there is no closed form expression, this prohibits the use of
flexible numerical techniques such as Sequential Monte Carlo (SMC) to
approximate the two-filter smoothing formula. We propose here a
generalised two-filter smoothing formula which only requires
approximating probability distributions and applies to any state-space
model, removing the need to make restrictive assumptions used in
previous approaches to this problem. SMC algorithms are proposed to
implement this recursion and we illustrate their performance on various
problems.
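The standard two-filter formula can be illustrated on a finite-state hidden Markov model, where both filters are available in closed form. This sketch is not the generalised formula or the SMC algorithm proposed in the talk; it only shows the classical combination of a forward predictive filter with the backward information filter, and that the latter does not sum to one over the states. All model values are hypothetical.

```python
def normalise(v):
    s = sum(v)
    return [a / s for a in v]

def two_filter_smoother(pi, A, B, ys):
    """Smoothing marginals p(x_t | y_1..y_T) for a finite-state HMM.

    pi[i]: initial probabilities; A[i][j] = P(x_{t+1} = j | x_t = i);
    B[i][y] = P(y_t = y | x_t = i); ys: observed symbols.
    """
    K, T = len(pi), len(ys)
    # Forward pass: predictive distributions p(x_t | y_1..y_{t-1}).
    preds, pred = [], list(pi)
    for y in ys:
        preds.append(pred)
        filt = normalise([pred[i] * B[i][y] for i in range(K)])
        pred = [sum(filt[i] * A[i][j] for i in range(K)) for j in range(K)]
    # Backward pass: information filter beta_t(i) = p(y_t..y_T | x_t = i).
    betas = [None] * T
    betas[T - 1] = [B[i][ys[T - 1]] for i in range(K)]
    for t in range(T - 2, -1, -1):
        betas[t] = [B[i][ys[t]] * sum(A[i][j] * betas[t + 1][j] for j in range(K))
                    for i in range(K)]
    # Combine: p(x_t | y_1..y_T) is proportional to pred_t(x_t) * beta_t(x_t).
    smoothed = [normalise([preds[t][i] * betas[t][i] for i in range(K)])
                for t in range(T)]
    return smoothed, betas

pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.8, 0.2], [0.3, 0.7]]
ys = [0, 1, 1]
smoothed, betas = two_filter_smoother(pi, A, B, ys)
print(smoothed)
print([sum(b) for b in betas])  # not equal to 1: beta_t is a likelihood in x_t
```

On a finite state space the unnormalised backward quantity causes no difficulty, which is the closed-form case mentioned above; the issue arises when it has to be approximated by particles.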
In the field of quality of health care measurement, one approach to
assessing patient sickness at admission involves a logistic regression
of mortality within 30 days of admission on a fairly large number of
sickness indicators (on the order of 100) to construct a sickness
scale, employing classical variable selection methods to find an
"optimal" subset of 10--20 indicators. Such "benefit-only" methods
ignore the considerable differences among the sickness indicators in
cost of data collection, an issue that is crucial when admission
sickness is used to drive programs (now implemented or under
consideration in several countries, including the U.S. and U.K.) that
attempt to identify substandard hospitals by comparing observed and
expected mortality rates (given admission sickness). When both
data-collection cost and accuracy of prediction of 30-day mortality are
considered, a large variable-selection problem arises in which costly
variables that do not predict well enough should be omitted from the
final scale.
In this work (a) we develop a method for solving this problem based on
posterior model odds, arising from a prior distribution that (1)
accounts for the cost of each variable and (2) results in a set of
posterior model probabilities which corresponds to a generalized
cost-adjusted version of the Bayesian information criterion (BIC), and
(b) we compare this method with a decision-theoretic cost-benefit
approach based on maximizing expected utility. We use reversible-jump
Markov chain Monte Carlo (RJMCMC) methods to search the model
space, and we check the stability of our findings with two variants of
the MCMC model composition (MC3) algorithm. We find substantial
agreement between the decision-theoretic and cost-adjusted-BIC
methods; the latter provides a principled approach to
performing a cost-benefit trade-off that avoids ambiguities in
identification of an appropriate utility structure. Our cost-benefit
approach results in a set of models with a noticeable reduction in cost
and dimensionality, and only a minor decrease in predictive performance,
when compared with models arising from benefit-only analyses.
The measurement and improvement of the quality of health care are
important areas of current research and development. An indirect way to
evaluate the quality of hospital care is to compare the observed
mortality rates at each of a number of hospitals with their expected
rates, given the sickness at admission of their patients. Patient
sickness at admission is often assessed by using logistic regression of
mortality, for example within 30 days of admission, on a fairly large
number of sickness indicators to construct a sickness scale, employing
classical variable selection methods --- which trade off prediction
accuracy against parsimony --- to find an "optimal" subset of 10--20
indicators. When the goal is the creation of a sickness scale that may
be used prospectively to measure quality of care on a new set of
patients in a cost-effective manner, traditional variable selection
methods can produce sub-optimal subsets, since they do not account for
differences in the data collection costs of the available predictors.
In settings of this type, with two desirable criteria that compete ---
here, high predictive accuracy and low cost --- a method must be found
to achieve a joint optimisation. Here we present a computational
strategy to search the model space and select variables under the
restriction of an upper cost bound imposed by the management of the
project. The practical relevance of the selected variable subsets using
the method of this paper is ensured by enforcing an overall limit on
the total data collection cost of each subset: the search is conducted
only among models whose cost does not exceed this budgetary
restriction.
Conventional model search algorithms in our setting will fail if the
best model under no cost restrictions lies outside the imposed cost
limit and when collinear predictors with high predictive ability are
present. The reason for this failure is the existence of multiple modes
with movement paths that are forbidden due to the cost
restriction. To solve this problem, in this paper we develop a
population-based trans-dimensional reversible-jump Markov chain Monte
Carlo (population RJMCMC) algorithm, in which ideas from the
population-based MCMC and simulated tempering algorithms are
combined. Comparing our method with standard RJMCMC, we find that
the population-based RJMCMC algorithm moves successfully and more
efficiently between distant neighbourhoods of "good" models, achieves
convergence faster and has smaller Monte Carlo standard errors for a
given amount of time. In a case study of n = 2,532 pneumonia patients
on whom p = 83 sickness indicators were measured, with marginal costs
varying from smallest to largest across the predictor variables by a
factor of 20, the final model chosen by population RJMCMC, both on the
basis of highest posterior probability and specifying the median
probability model, is clinically sensible for pneumonia patients and
achieves good predictive ability while capping data collection costs.
The behaviour of many systems is complex in the sense that it cannot
be understood by just knowing the properties of simpler elementary
constituents. It is often the case that the large scale behaviour is
mainly dictated by interactions on microscopic scales. Examples of
such systems include human economies, fungal colonies, the Earth's crust
or shape-memory metal alloys used for making devices. The science of
complexity defines a multidisciplinary field which is interesting both
from fundamental and applied viewpoints. In particular, understanding
complex systems is essential to predict catastrophic events such as
epidemic outbreaks or avalanches leading to material rupture.
In this seminar, we will mainly focus on epidemic modelling in systems
with stochasticity and heterogeneity. The presented approach is based
on the SIR (Susceptible/Infected/Removed) model analysed by means of
percolation theory and methods from statistical mechanics of lattice
systems. We will illustrate the interdisciplinary character of the
presented ideas by suggesting some analogies with similar approaches
for the description of complexity in materials. In particular, the
presented formalism will be used to describe how different
types of heterogeneity affect the possibility of an epidemic outbreak
in realistically complex systems including root systems, neural
ensembles, and soil. Finally, a method for estimation of the
probability of an epidemic outbreak and its final size in
realistically complex networks is suggested.
One of the main goals of epidemiological modelling is to assess risks
of large disease outbreaks and to find strategies to prevent and
control them. The decision-making depends, in the first instance, on
understanding the risk and the associated losses, which subsequently
can be traded against the cost of treatment. Although many epidemics
are characterized by large variability among individual outbreaks,
individual epidemics often follow a well-defined trajectory which is
much more predictable in the short term than the ensemble (collection)
of potential epidemics. We introduce a modelling framework that allows
us to deal with individual replicated outbreaks, based upon a Bayesian
hierarchical analysis. Information about ‘similar’ replicate epidemics
can be incorporated into a hierarchical model, allowing both ensemble
and individual parameters to be estimated. We use the modelling
framework to analyse two replicated experiments involving spread of a
common plant pathogen Rhizoctonia solani on radish. In the first
experiment we study the response of the pathogen to a biocontrol agent,
Trichoderma viride. The rate of primary (soil-to-plant) infection is
found to be the most
variable factor determining the final size of epidemics. Breakdown of
biological control in some replicates results in high levels of primary
infection and increased variability. Subsequently we expand the model
to include pre-symptomatic stages and quantify the rates of transition
between unobserved classes. Finally we consider various control
strategies aiming at reducing the risk and take into account trade-offs
with the loss of healthy plants. Although we use a specific system to
illustrate our approach, the modelling framework is generic and can be
applied to any system in which groups of individuals move between
locations and can carry disease agents without symptoms. The results
have important consequences for parameter estimation, inference and
prediction for emerging epidemic outbreaks.
This talk will provide an overview of computationally intensive
methods for stochastic modelling and Bayesian inference for problems in
computational systems biology. Particular emphasis will be placed on
the problem of inferring the rate constants of mechanistic stochastic
biochemical network models using high-resolution time course data, such
as that obtained from single-cell fluorescence microscopy studies. The
computational difficulties associated with "exact" methods make
approximate techniques attractive. There are many possible approaches
to approximation, including methods based on diffusion approximations,
and methods exploiting stochastic model "emulators".
High-profile hospital "superbugs" such as methicillin-resistant
Staphylococcus aureus (MRSA) have a major impact on healthcare within
the UK and elsewhere. Despite enormous research attention, many basic
questions concerning the spread of such pathogens remain unanswered. For
instance, what value do specific control measures such as isolation
have? How is the spread in the ward related to "colonisation pressure"?
What role do antibiotics play? How useful is it to have new molecular
rapid tests instead of conventional culture-based swab tests?
A wide range of biologically-meaningful stochastic transmission models
that overcome unrealistic assumptions of methods which have been
previously used in the literature are constructed, in order to address
specific scientific hypotheses of interest using detailed data from
hospital studies. Efficient Markov Chain Monte Carlo (MCMC) algorithms
are developed to draw Bayesian inference for the parameters which
govern transmission. The extent to which the data support specific
scientific hypotheses is investigated by comparing different models
under a Bayesian framework using a trans-dimensional MCMC algorithm,
and a method of matching the within-model prior distributions is
discussed in order to avoid miscalculation of the Bayes factors.
Finally, the methodology is illustrated by analysing real data
which were obtained from a hospital in Boston.
Dynamic clustering requires the estimation of the evolution of a spatial
dynamic cluster process in time based on a sequence of partial
observation sets. A suitable generalisation of the Bayes filter to this
system would provide us with an optimal estimate of the multi-cluster
multi-object state based on measurements received up to the current time
step, and an analogous forward-backward smoother could refine previous
estimates based on current measurements. Based on the assumption of
independent cluster processes, we describe a generalisation of the
optimal Bayes filter and forward-backward smoother for dynamic
clustering. The full independent-cluster Bayes filter requires the
association of all possible measurements to potential points within
potential clusters. This approach is computationally infeasible in
almost all situations and so we derive a first-moment approximation of
the independent-cluster Bayes filter, inspired by the first-moment
multi-object Bayes filter derived by Mahler in the aerospace community.