Joint Workshop on Statistical Datamining
EURANDOM, Eindhoven, The Netherlands, April 23 and 24, 2003

EURANDOM is organizing this workshop together with PRO-ENBIS, a thematic network that aims to promote statistics in business and industry.


Jerry Friedman, Stanford University
David Hand, Imperial College London
Elsa Jordaan, Dow Chemicals
Alan Karr, National Institute of Statistical Sciences
Jacqueline Meulman, Leiden University
Petra Perner, Institute of Computer Vision and Applied Computer Sciences
Arno Siebes, Utrecht University
Andrea Ahlemeyer-Stubbe, Database Management

Organizing Committee

Andrea Ahlemeyer-Stubbe, Database Management,
Wim Senden (EURANDOM),
Henry Wynn (EURANDOM/London School of Economics).


Wednesday April 23, 2003



Henry Wynn, Scientific Co-Director EURANDOM


David Hand

Pattern discovery





Jerry Friedman

Importance Sampling





Jacqueline Meulman

Prediction and Dimension Reduction with Nonlinear Transformation of Variables





Petra Perner

Concepts and Methods of Case-Based Reasoning - Why is CBR so attractive for Decision Making? 







Thursday April 24, 2003


Alan Karr

Connections of Data Mining to Other Problems 





Arno Siebes

Relational Data Mining





Andrea Ahlemeyer-Stubbe

A practical Data Mining solution in marketing





Elsa Jordaan

Development of Robust Inferential Sensors 








Participants

Abu-Hanna, Ameen


Ahlemeyer-Stubbe, Andrea 

Database Management

Bathoorn, Ronnie 

Universiteit Utrecht

Carmine, Gioia 

Copenhagen Business School

Danilov, Dmitri


Denteneer, Dee

Philips Research

Feelders, Ad 

Universiteit Utrecht

Friedman, Jerry

Stanford University

Goel, Prem 

Ohio State University

Greenfield, Tony 

Greenfield Research Ltd

Grunwald, Peter 


Giudici, Paolo 

University of Pavia

Gunst de, Mathisca


Hand, David

Imperial College London

Herrmann, Daniel 

Robert Bosch GmbH

Jordaan, Elsa 

Dow Chemicals

Karr, Alan 

National Institute of Statistical Sciences

Kenett, Ron


Koloidenko, Alexei


Meulman, Jacqueline 

Leiden University

Mushkudiani, Nino


O, Ying-Lie 

Julius Centre for Health Sciences and Primary Care, UMC Utrecht

Ortega-Azurduy, Shirley 

TU/e and CQM

Peek, Niels 

University of Amsterdam

Potharst, Rob 

Erasmus Universiteit Rotterdam

Peelen, Linda 


Perner, Petra 

Institute of Computer Vision and Applied Computer Sciences

Riggelsen, Carsten 

Universiteit Utrecht

Segers, Johan 

Universiteit Twente

Seabra dos Reis, Marco P. 

University of Coimbra

Senden, Wim


Siebes, Arno 

Utrecht University

Smilde, Age 

University of Amsterdam and TNO Nutrition

Muruzábal, Jorge 

University Rey Juan Carlos

Zempleni, Andras 

Eötvös Loránd University

Verbitsky, Evgeny

Philips Research

Verduijn, Marion

Medical Center Amsterdam

Voorbraak, Frans


Vandev, Dimitar

Bulgarian Statistical Society

Wessel, Jaap


Wezel van, Michiel 

Erasmus Universiteit Rotterdam

Wynn, Henry

London School of Economics/EURANDOM

Yinmei, Zhan 


Zoutenbier, Marnix 


Zwet van, Willem 

Universiteit Leiden



Andrea Ahlemeyer-Stubbe, Database Management, Germany

A practical Data Mining solution in marketing

The first part of the talk gives a short introduction to Database Marketing (DBM) and Customer Relationship Management (CRM). I would like to point out where they relate to Data Mining, and how Data Mining results influence the success of DBM and CRM. The main part of the talk is based on a real Database Marketing problem. Using the Data Mining Process, I would like to show how a Data Mining project is handled in practice, and which problems have to be solved during the Data Mining Process to obtain a useful result for the marketing people.



Jerome H. Friedman, Stanford University, USA

Importance Sampling

(Joint work with Bogdan Popescu)

An Alternative View of Ensemble Learning. Learning a function of many arguments is viewed from the perspective of high-dimensional numerical integration. It is shown that many of the popular ensemble learning methods can be cast in this framework. In particular, bagging, boosting, and Bayesian model averaging are seen to correspond to Monte Carlo integration methods, each based on a different importance sampling strategy. This interpretation explains some of their properties and suggests modifications that can improve their accuracy and especially their computational performance.
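The Monte Carlo view in the abstract can be made concrete for bagging: each bootstrap sample yields one draw of a fitted model, and the ensemble prediction is the Monte Carlo average of those draws. The sketch below (pure Python, with an invented toy data set and a simple regression stump as base learner) illustrates that view; it is not the speaker's implementation.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression stump: mean of y on each side of the best threshold."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - (ml if x <= t else mr)) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    if best is None:  # degenerate bootstrap sample: fall back to a constant
        m = sum(ys) / len(ys)
        return lambda x: m
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def bagged_predict(xs, ys, x_new, n_models=200, seed=0):
    """Bagging as Monte Carlo integration: each bootstrap draw gives one model,
    and the ensemble prediction is the average over the draws."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        idx = [rng.randrange(len(xs)) for _ in range(len(xs))]
        model = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(model(x_new))
    return sum(preds) / n_models

# invented toy data: y is roughly a step function of x
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0.0, 0.1, 0.0, 0.1, 1.0, 0.9, 1.1, 1.0]
print(bagged_predict(xs, ys, 0.75))
```

Under this view, changing how the bootstrap draws are weighted or generated corresponds to changing the importance sampling strategy, which is the modification the abstract alludes to.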



David J. Hand, Imperial College London, UK

Pattern discovery

Modern statistical data analysis is predominantly model-driven, seeking to decompose an observed data distribution in terms of major underlying descriptive features, modified by some stochastic variation. A large part of data mining is also concerned with this exercise. However, another fundamental part of data mining is concerned with detecting anomalies amongst the vast mass of the data: the small deviations, unusual observations, unexpected clusters of observations, or surprising blips in the data, which the model does not explain. We call such anomalies patterns and this talk describes technologies which have been developed for discovering patterns, illustrating with real discoveries from a variety of areas.

For sound reasons, which are outlined in the talk, the data mining community has tended to focus on the algorithmic aspects of pattern discovery, and has not developed any general underlying theoretical base. But such a base is important for any technology: it helps to steer the direction in which the technology develops, as well as serving to provide a basis from which algorithms can be compared, and to indicate which problems are the important ones waiting to be solved.

This talk attempts to provide such a theoretical base, linking the ideas to statistical work in spatial epidemiology, scan statistics, outlier detection, and other areas. One of the striking characteristics of work on pattern discovery is that the ideas have been developed in several theoretical arenas, and also in several application domains, with little apparent awareness of the fundamentally common nature of the problem. Like model building, pattern discovery is fundamentally an inferential activity, and is an area in which statisticians can make very significant contributions.
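As a minimal, concrete instance of the anomaly-detection side of pattern discovery, the sketch below flags observations that a robust summary of the bulk of the data does not explain. The MAD-based rule and the data are illustrative choices, not methods from the talk.

```python
def mad_outliers(values, threshold=3.5):
    """Flag values whose robust z-score (based on the median absolute deviation)
    exceeds the threshold -- candidate 'patterns' a global model leaves unexplained."""
    s = sorted(values)
    n = len(s)
    median = (s[n // 2] + s[(n - 1) // 2]) / 2
    abs_dev = sorted(abs(v - median) for v in values)
    mad = (abs_dev[n // 2] + abs_dev[(n - 1) // 2]) / 2
    if mad == 0:
        return []
    # the factor 0.6745 makes the score comparable to a z-score under normality
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]

# invented data: one aberrant reading among stable measurements
data = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 25.0, 10.1]
print(mad_outliers(data))  # -> [25.0]
```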



Elsa Jordaan, Dow Chemicals, USA

Development of Robust Inferential Sensors

Inferential sensors are models that infer important process variables (called outputs) from available hardware sensors (called inputs). Usually the outputs are measured infrequently by lab analysis, material property tests, expensive gas chromatograph analysis, etc. Very often the output measurement process is performed off-line and then introduced into the on-line process monitoring and control system. It is assumed that the inferential sensors' inputs are available on-line either from cheap hardware sensors or from other inferential sensors.

Typically the development of an inferential sensor involves a series of steps like data analysis, data reduction, development of transforms, model building, model explanation, implementation of the sensor and operation of the inferential sensor within the controls of a plant.

For an inferential sensor to be accepted and effective in the industry it has to satisfy the following design requirements: (1) complexity control; (2) ability to use high-dimensional data; (3) robustness; (4) generalisation ability; (5) data compression and outlier detection; (6) incorporation of prior knowledge; (7) self-diagnostic capability and (8) adaptive behaviour.

The greatest risk of violating a design requirement is in the model-building step. Different inference mechanisms can be used for modelling and each has its advantages and disadvantages with respect to the design requirements. These mechanisms can vary from fundamental modelling and linear or multivariate regression to data-driven modelling approaches like genetic programming, neural networks and support vector machines.

In many ways these methods are complementary, and by combining them better inferential sensors can be developed. Furthermore, it is important to realise that in industrial applications there is no such thing as the perfect model.
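Stripped to its essentials, an inferential sensor is a regression model fitted on the sparse lab measurements and then applied to the fast on-line inputs, with self-diagnostics guarding its validity region. The sketch below uses ordinary least squares with one input in place of the support vector machines discussed in the talk; all variable names and numbers are invented.

```python
def fit_linear_sensor(x_lab, y_lab):
    """Fit y = a + b*x on the sparse lab samples (ordinary least squares)."""
    n = len(x_lab)
    mx = sum(x_lab) / n
    my = sum(y_lab) / n
    sxx = sum((x - mx) ** 2 for x in x_lab)
    sxy = sum((x - mx) * (y - my) for x, y in zip(x_lab, y_lab))
    b = sxy / sxx
    a = my - b * mx
    return a, b

def infer(a, b, x_online, x_lo, x_hi):
    """Self-diagnostic step: refuse to extrapolate outside the training range
    (one of the robustness requirements listed above)."""
    if not (x_lo <= x_online <= x_hi):
        return None  # flag: input outside the model's validity region
    return a + b * x_online

# invented example: on-line temperature -> lab-measured product property
x_lab = [100.0, 110.0, 120.0, 130.0]
y_lab = [5.0, 4.6, 4.1, 3.7]
a, b = fit_linear_sensor(x_lab, y_lab)
print(infer(a, b, 115.0, min(x_lab), max(x_lab)))
print(infer(a, b, 200.0, min(x_lab), max(x_lab)))  # outside range -> None
```

The range check is a toy stand-in for the self-diagnostic and outlier-detection requirements; a production sensor would monitor input distributions and model residuals continuously.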


Cherkassky V. and F. Mulier, Learning from Data: Concepts, Theory, and Methods, John Wiley, New York, 1998.

Decoste D. and K. Wagstaff, Alpha seeding for support vector machines. In International Conference on Knowledge Discovery and Data Mining (KDD-2000), 2000.

Dhar V. and R. Stein, Seven methods for transforming corporate data into business intelligence, Prentice Hall, Upper Saddle River, NJ, 1997.

Guyon I., N. Matic, and V. Vapnik, Discovering informative patterns and data cleaning. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 181-203, 1996.

Jordaan E. M., Development of Robust Inferential Sensors: Industrial Application of Support Vector Machines for Regression, Ph.D. thesis, Technical University Eindhoven, 2002.

Kordon A., G. Smits, E. Jordaan and E. Rightor, Robust Soft Sensors Based on Integration of Genetic Programming, Analytical Neural Networks, and Support Vector Machines. In Proceedings of WCCI 2002, Honolulu, HI: IEEE Press, pp. 896-901, 2002.

Schölkopf B., C. Burges, and V. Vapnik, Extracting support data for a given task. In U. M. Fayyad and R. Uthurusamy, editors, Proceedings, First International Conference on Knowledge Discovery & Data Mining, Menlo Park, 1995. AAAI Press.

Tresp V., Scaling kernel-based systems to large data sets. Data Mining and Knowledge Discovery, 5(3):197-211, 2001.



Alan F. Karr, National Institute of Statistical Sciences, USA

Connections of Data Mining to Other Problems

This talk will attempt to lay out the connections of data mining to several other important classes of problems. Many of these connections are understood poorly if at all, and the talk will focus much more on what we as statistical scientists cannot do yet, rather than on what we have done.

The problems to be treated include several in which NISS is engaged currently: data quality (DQ), data confidentiality, and data integration (DI).

The scientific complexity is intensified by interactions among these three problems. For example, poor DQ protects confidentiality, while DI (in the form of record linkage) is a means of breaking confidentiality. The needs will be discussed at multiple levels: abstractions, theory and methodology, and (scalable) software tools.
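Record linkage, mentioned above as a means of breaking confidentiality, can be illustrated in a few lines: two files that share no key are joined on quasi-identifiers. The deterministic matcher and toy records below are invented for illustration; real linkage methods are probabilistic and far more subtle.

```python
def link(records_a, records_b, keys):
    """Naive deterministic record linkage: pair records that agree
    on all the given quasi-identifier fields."""
    index = {}
    for r in records_b:
        index.setdefault(tuple(r[k] for k in keys), []).append(r)
    matches = []
    for r in records_a:
        for s in index.get(tuple(r[k] for k in keys), []):
            matches.append((r, s))
    return matches

# invented toy files: a 'de-identified' release and an external file
released = [{"zip": "5612", "yob": 1970, "income": 52000},
            {"zip": "5613", "yob": 1985, "income": 31000}]
external = [{"zip": "5612", "yob": 1970, "name": "X"}]
print(link(released, external, ["zip", "yob"]))
```

Even this naive matcher shows the interaction the abstract describes: the cleaner and more complete the released data, the more records can be re-identified this way.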



Jacqueline Meulman, Leiden University, NL

Prediction and Dimension Reduction with Nonlinear Transformation of Variables

In data mining, the data at hand typically consist of ordinal and nominal categorical variables, possibly with many missing values. We will describe feasible analysis methods that assign quantitative values to these qualitative variables that minimize a particular loss function. This loss function is set up, for example, for the purpose of predicting a response variable (that can be continuous or discrete, ordinal or nominal) from a number of predictors (categorical, either ordinal or nominal). Another objective in data mining is to describe a large, complex multivariate data set by visualization, using points and vectors in a low-dimensional representation space (biplot). The quantification of qualitative variables is accomplished by optimal nonlinear transformations. These nonlinear functions include least squares nominal and ordinal transformations, and monotonic and nonmonotonic splines. Because the data are assumed to be categorical, very large data sets can be analyzed by constructing algorithms that are not defined on the number of objects in the analysis, but on the number of categories. Missing data can be handled in an elegant way by using weights in the loss functions that exclude the missing cells from the computation. Although the algorithms are constructed for categorical data, continuous data can be included in the analysis by the use of optimal binning procedures. The methods discussed are developed at the Leiden Department of Data Theory, and have been incorporated in the Categories module in SPSS (from version 10.0 onwards). There are interesting relations with other nonlinear methods, such as support vector machines, and kernel principal components analysis.
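For the simplest case of one nominal predictor and a continuous response, the least-squares quantification has a closed form: each category is scored by the mean response within that category. The sketch below (invented data) illustrates only that special case, not the full optimal scaling algorithms in the Categories module.

```python
def nominal_quantification(categories, y):
    """Least-squares quantification of one nominal variable:
    each category receives the mean response within that category."""
    sums, counts = {}, {}
    for c, v in zip(categories, y):
        sums[c] = sums.get(c, 0.0) + v
        counts[c] = counts.get(c, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

# invented data: a nominal predictor and a continuous response
cats = ["red", "blue", "red", "green", "blue", "red"]
y = [1.0, 4.0, 2.0, 9.0, 6.0, 3.0]
print(nominal_quantification(cats, y))  # -> {'red': 2.0, 'blue': 5.0, 'green': 9.0}
```

Note that the computation runs over categories, not objects, which is why such algorithms scale to very large data sets; missing cells could be skipped here, mirroring the zero-weight device the abstract describes.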


Gifi, A. (1990). Nonlinear Multivariate Analysis (Eds: Heiser, W.J., Meulman, J.J., Van de Berg, G.). Chichester (UK): Wiley.

Meulman, J.J., Heiser, W.J., & SPSS (1999). Categories. Chicago: SPSS Inc.



Petra Perner, Institute of Computer Vision and Applied Computer Sciences, Germany

Concepts and Methods of Case-Based Reasoning - Why is CBR so attractive for Decision Making?

Decision trees or rule-based systems are difficult to utilize in domains where generalized knowledge is lacking. But often there is a need for a prediction system even though there is not enough generalized knowledge. Such a system should a) solve problems using the already stored knowledge and b) capture new knowledge, making it immediately available to solve the next problem. To accomplish these tasks, case-based reasoning is useful. Case-based reasoning explicitly uses past cases from the domain expert's successful or failing experiences. Therefore, case-based reasoning can be seen as a method for problem solving as well as a method to capture new experiences and make them immediately available for problem solving. CBR has become a very popular and powerful method for decision making over the last decade. It has shown outstanding performance in many different application fields such as industrial and medical diagnosis, e-commerce, knowledge management, legal reasoning and image interpretation. In the talk we introduce the case-based reasoning process and the methodology behind CBR. We will describe the basic concepts for creating a CBR system and show how CBR works in practice, based on applications from image interpretation.
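The retrieve-and-retain steps at the heart of the CBR cycle can be sketched in a few lines. The nearest-neighbour similarity, the feature encoding, and the diagnostic cases below are invented illustrations, not the systems from the talk.

```python
def retrieve(case_base, query, k=1):
    """Retrieve: find the k most similar past cases (Euclidean distance on features)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return sorted(case_base, key=lambda case: dist(case["features"], query))[:k]

def retain(case_base, features, solution):
    """Retain: a newly solved problem immediately becomes available as a case."""
    case_base.append({"features": features, "solution": solution})

# invented diagnostic cases: (temperature, pressure) -> fault label
cases = [{"features": (1.0, 0.2), "solution": "ok"},
         {"features": (3.0, 2.5), "solution": "valve fault"}]
best = retrieve(cases, (2.9, 2.4))[0]
print(best["solution"])                       # reuse the most similar case's solution
retain(cases, (2.9, 2.4), best["solution"])   # the case base grows with every problem
```

The `retain` call is what distinguishes CBR from a static classifier: knowledge is captured case by case and is immediately usable, exactly the property the abstract emphasizes.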


Website on Data Mining
Book: Data Mining on Multimedia Data
Conference on Data Mining and Machine Learning
Technical Committee TC17 on Data Mining and Machine Learning of the International Association for Pattern Recognition (IAPR)
Paper on Data Mining



Arno Siebes, CWI, NL

Relational Data Mining

In standard data analysis it is assumed that all data resides in one table, i.e., there is one data matrix. However, for many data collections, both in industry and in science, it is far from straightforward to create such a single table. For example, if we want to analyse personal information together with bank-account data, we have the problem that the number of bank accounts per person varies widely. Another example is in the life sciences, where one wants to analyse widely different data sets jointly: for example, a combination of patient data, treatment data, gene expression data, and annotated gene data to search for causes of multi-factorial diseases. Finally, one can think of a library of XML documents in which the documents share at best part of their structure.

In relational data mining, we aim to analyse collections of related tables directly rather than forcing an arbitrary collapse into one table. In this talk I will introduce this area both by presenting some algorithms and by discussing some examples. Special emphasis will be put on statistical issues that arise in this new area.
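The bank-account example makes the one-table problem concrete: with a varying number of accounts per person, any single table forces a choice of aggregates, and that choice is exactly the arbitrary collapse the talk proposes to avoid. The toy tables and the aggregate choices below are invented for illustration.

```python
# invented toy tables: one person table, one-to-many account table
persons = [{"id": 1, "age": 34}, {"id": 2, "age": 51}]
accounts = [{"person": 1, "balance": 1200.0},
            {"person": 1, "balance": -300.0},
            {"person": 2, "balance": 40.0}]

def collapse(persons, accounts):
    """Force the two tables into one: the one-to-many side must be aggregated,
    and the choice of aggregates (count, total, ...) is arbitrary and lossy."""
    rows = []
    for p in persons:
        owned = [a["balance"] for a in accounts if a["person"] == p["id"]]
        rows.append({"id": p["id"], "age": p["age"],
                     "n_accounts": len(owned), "total_balance": sum(owned)})
    return rows

for row in collapse(persons, accounts):
    print(row)
```

Whatever aggregates are chosen, information is lost (here, the individual balances); relational data mining algorithms instead traverse the linked tables directly.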

