Cause-effect pairs in machine learning

Book draft

Isabelle Guyon, Alexander Statnikov, Berna Bakır Batu, Eds.

Discovering causal relationships from observational data is becoming a hot topic in data science. Does the increasing amount of available data make it easier to detect potential triggers in epidemiology, the social sciences, economics, biology, medicine, and other sciences? The angle we take is that causal discovery algorithms provide putative mechanisms that still need to be challenged by experiments. However, they can help define policies and prioritize experiments in large-scale experimental designs to reduce costs.

In 2013 we conducted a challenge on the problem of cause-effect pairs, which pushed the state of the art considerably, revealing that the joint distribution of two variables can be scrutinized by machine learning algorithms to reveal the possible existence of a "causal mechanism", in the sense that the values of one variable may have been generated from the values of the other (and not the other way around).

The ambition of this book is to provide both tutorial material on the state of the art on cause-effect pairs, put in the context of other research on causal discovery, and a series of advanced readings from articles selected from the proceedings of the NIPS 2013 workshop on causality and the JMLR special topic on large-scale experimental design and the inference of causal mechanisms. Supplemental material includes data, videos, slides, and code, available on the workshop website.

PART I: Fundamentals

Chapter 1

The cause-effect problem: motivation, ideas, and popular misconceptions

Dominik Janzing

Abstract: Telling cause from effect from observations of just two variables has attracted increasing interest for more than a decade. On the one hand, it defines a nice binary classification problem for which it is easy to define a success rate, in contrast to more general causal inference tasks where no straightforward performance criteria exist. On the other hand, it fascinates researchers because solving this elementary task implies statistical asymmetries between cause and effect that were previously unknown. Discussing some real-world and toy examples, I argue that humans seem to have some intuition about these asymmetries, but I also argue that some straightforward ideas to distinguish between cause and effect are flawed. The discussion on the origin of the true asymmetries relates machine learning, philosophy, and physics. For instance, the postulate that P(cause) and P(effect|cause) contain no information about each other (while P(effect) and P(cause|effect) may satisfy some ‘non-generic’ relations) is relevant for semi-supervised learning on the one hand, but is also related to the thermodynamic arrow of time on the other.

Key words: Cause-effect pairs, information geometry, independence of cause and mechanism

Chapter 2

Evaluation methods of cause-effect pairs

Isabelle Guyon, Olivier Goudet, Diviyan Kalainathan

Abstract: This chapter addresses the problem of benchmarking causal models or validating particular putative causal relationships, in the limited setting of cause-effect pairs, when empirical “observational” data are available. We do not address experimental validations, e.g. via randomized controlled trials. Our goal is to compare methods which provide a score C(X,Y), called a causation coefficient, rating a pair of variables (X, Y) for being in a potential causal relationship X -> Y. Causation coefficients may be used for various purposes, including to prioritize experiments, which may be costly or risky, or to guide decision makers in domains in which experiments are infeasible or unethical. We provide a methodology to evaluate their reliability. We take three points of view: (1) that of algorithm developers who must justify the soundness of their method, particularly with respect to identifiability and consistency, (2) that of practitioners who seek to understand on what basis algorithms make their decisions and evaluate their statistical significance, and (3) that of benchmark organizers who desire to make fair evaluations to compare methods. We adopt the framework of pattern recognition in which pairs of variables (X, Y) and their ground truth causal graph are drawn i.i.d. from a “mother distribution”. This leads us to define new notions of probabilistic identifiability, Bayes optimal causation coefficients, and multi-part statistical tests. These new notions are evaluated on the data of the first cause-effect pair challenge. We also compile a list of resources, including datasets of real or synthetic pairs, and data generative models.

Key words: Cause-effect pairs, causal discovery, causation coefficients, identifiability, statistical testing, mother distribution

Chapter 3

Learning Bivariate Functional Causal Models

Olivier Goudet, Diviyan Kalainathan, Michèle Sebag, Isabelle Guyon

Abstract: Finding the causal direction in the cause-effect pair problem has been addressed in the literature by comparing two alternative generative models, X -> Y and Y -> X. In this chapter, we first define what is meant by generative modeling and what are the main assumptions usually invoked in the literature in this bivariate setting. Then we present the theoretical identifiability problem that arises when considering causal graphs with only two variables. This leads us to present the general ideas used in the literature to perform model selection based on the evaluation of a complexity/fit trade-off. Three main families of methods can be identified: methods making restrictive assumptions on the class of admissible causal mechanisms, methods computing a smooth trade-off between fit and complexity, and methods exploiting independence between cause and mechanism.

Key words: Cause-effect pairs, causal discovery, causal modeling, identifiability, causal mechanisms
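To give a flavor of additive-noise reasoning of the kind surveyed in this chapter, the following minimal sketch fits a regression in both directions and prefers the direction whose residuals look independent of the input. The data-generating mechanism, the degree-5 polynomial fit, and the squared-correlation dependence proxy are all illustrative choices, not the actual methods covered in the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(-1, 1, n)
y = x**3 + 0.05 * rng.normal(size=n)   # additive-noise mechanism X -> Y

def dep_score(a, b, deg=5):
    """Fit b on a with a polynomial; return a crude residual-dependence
    score (|corr| between squared residuals and squared input).
    A stand-in for a proper independence test such as HSIC."""
    resid = b - np.polyval(np.polyfit(a, b, deg), a)
    return abs(np.corrcoef(resid**2, a**2)[0, 1])

forward = dep_score(x, y)   # residuals ~ independent of the cause
backward = dep_score(y, x)  # residuals depend on the effect

# Prefer the direction whose residuals are (approximately) independent.
print(forward < backward)
```

In the anti-causal direction no additive-noise model with independent residuals exists for this mechanism, which is the asymmetry such methods exploit.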

Chapter 4

Discriminant Learning Machines

Diviyan Kalainathan, Olivier Goudet, Michèle Sebag, Isabelle Guyon

Abstract: The cause-effect pair challenge has, for the first time, formulated the cause-effect problem as a learning problem in which a causation coefficient is trained from data. This can be thought of as a kind of meta-learning. This chapter presents an overview of the contributions in this domain and states the advantages and limitations of the method, as well as recent theoretical results (learning theory, mother distribution). It also points to code from the winners of the cause-effect pair challenge.

Key words: Cause-effect pairs, causal discovery, discriminant methods, mother distribution

Chapter 5

Cause-Effect Pairs in Time Series with a Focus on Econometrics

Nicolas Doremus, Alessio Moneta, Sebastiano Cattaruzzo

Abstract: This chapter addresses the problem of identifying the causal structure between two time-series processes. We focus on the setting typically encountered in econometrics, namely stationary or difference-stationary multiple autoregressive processes with additive white noise terms. We review different methods and algorithms, distinguishing between methods that filter the series through a vector autoregressive (VAR) model and methods that apply causal search directly to time series data. We also propose an additive noise model search algorithm tailored to the specific task of distinguishing among causal structures on time series pairs, under different assumptions, including causal sufficiency.

Key words: VAR models, Independent component analysis, Graphical models, Impulse response functions, Granger causality, Additive noise models, Local projections
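A minimal numpy illustration of the Granger idea underlying several of the methods reviewed here: lagged values of the cause improve prediction of the effect, but not vice versa. The bivariate VAR(1) process and its coefficients below are invented for illustration and do not reproduce any algorithm from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5000
x = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + rng.normal()                   # x evolves on its own
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.normal()  # y is driven by lagged x

def rss(target, regressors):
    """Residual sum of squares of an OLS fit."""
    beta, *_ = np.linalg.lstsq(regressors, target, rcond=None)
    r = target - regressors @ beta
    return r @ r

# Does adding the other series' lag reduce prediction error?
Y = y[1:]
gain_xy = rss(Y, np.column_stack([np.ones(T - 1), y[:-1]])) / \
          rss(Y, np.column_stack([np.ones(T - 1), y[:-1], x[:-1]]))

X = x[1:]
gain_yx = rss(X, np.column_stack([np.ones(T - 1), x[:-1]])) / \
          rss(X, np.column_stack([np.ones(T - 1), x[:-1], y[:-1]]))

# gain_xy >> 1 (lagged x helps predict y); gain_yx ~ 1 (lagged y adds nothing)
print(gain_xy > gain_yx)
```

A formal Granger test would compare these restricted/unrestricted fits with an F statistic; the ratio above conveys the same asymmetry.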

Chapter 6

Beyond cause-effect pairs

Frederick Eberhardt

Abstract: The cause-effect pair challenges focused on the development of inference methods to determine the causal relation between two variables. It is natural to then ask how such methods could generalize beyond the two variable case to settings that either involve more variables – such as is the case in graph learning – or to settings where the relationship between the candidate variables does not fall into one of the classes defined by the challenges. This chapter explores the extension of the proposed methods to such cases. It comes to the conclusion that such extensions are not likely to naturally evolve from the approaches that won the pair challenge.

Key words: graph learning, structure learning, confounding, feedback cycles, variable construction

PART II: Selected readings

Chapter 7

Results of the Cause-Effect Pair Challenge

Isabelle Guyon, Alexander Statnikov

Abstract: We organized a challenge in causal discovery from observational data with the aim of devising a “causation coefficient” to score pairs of variables. The participants were provided with a large database of thousands of pairs of variables {X, Y} (80% semi-artificial data and 20% real data) from which samples were drawn independently (i.e. ignoring possible time dependencies). The goal was to discover whether the data support the hypothesis that Y = f(X, noise), which for the purpose of this challenge was our definition of causality (X causes Y). The participants adopted a machine learning approach, which contrasts with previously published model-based methods. They extracted numerous features of the joint empirical distribution of X and Y and built a classifier to separate pairs belonging to the class “X causes Y” from other cases (“Y causes X”; “X and Y are related, but not in a causal way”, as when a third variable may be causing both X and Y; “X and Y are independent”). The classifier was trained from examples provided by the organizers and tested on independent test data for which the truth values of the causal relationships were known only to the organizers. The participants achieved an Area under the ROC Curve (AUC) over 0.8 in the first phase, deployed on the Kaggle platform, which ran from March through September 2013 (round 1). The participants were then invited to improve upon code efficiency by submitting fast causation coefficients on the Codalab platform (round 2). The causation coefficients developed by the winners have been made available under open source licenses. We have made all data and code publicly available at

Key words: causal discovery, cause-effect pairs, benchmark, challenge
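The machine learning approach described in this abstract can be caricatured in a few lines: compute distributional features of each labeled pair and train an off-the-shelf classifier to predict the direction. The feature set, generative mechanism, and classifier below are illustrative stand-ins, far simpler than the winning entries.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def moments(v):
    # standardized skewness and excess kurtosis
    v = (v - v.mean()) / v.std()
    return [np.mean(v**3), np.mean(v**4) - 3.0]

def featurize(x, y):
    # a tiny, hypothetical feature set for the pair (x, y)
    return moments(x) + moments(y) + [np.corrcoef(x, y)[0, 1]]

def make_pair():
    # cause -> effect through a nonlinear additive-noise mechanism
    x = rng.uniform(-1, 1, 500)
    y = x**3 + 0.1 * rng.normal(size=500)
    return x, y

features, labels = [], []
for _ in range(400):
    x, y = make_pair()
    if rng.random() < 0.5:
        features.append(featurize(x, y)); labels.append(1)  # class "X causes Y"
    else:
        features.append(featurize(y, x)); labels.append(0)  # swapped: "Y causes X"

X_train, y_train = np.array(features[:300]), np.array(labels[:300])
X_test, y_test = np.array(features[300:]), np.array(labels[300:])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)  # well above chance on this toy mother distribution
print(acc)
```

The challenge entries used hundreds to thousands of such features and richer families of generative mechanisms; the principle of training a causation coefficient from labeled pairs is the same.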

Chapter 8

Non-linear Causal Inference using Gaussianity Measures

Daniel Hernández-Lobato, Pablo Morales-Mombiela, David Lopez-Paz, Alberto Suárez

Abstract: We provide theoretical and empirical evidence for a type of asymmetry between causes and effects that is present when these are related via linear models contaminated with additive non-Gaussian noise. Assuming that the causes and the effects have the same distribution, we show that the distribution of the residuals of a linear fit in the anti-causal direction is closer to a Gaussian than the distribution of the residuals in the causal direction. This Gaussianization effect is characterized by a reduction of the magnitude of the high-order cumulants and by an increase in the differential entropy of the residuals. The problem of non-linear causal inference is addressed by performing an embedding in an expanded feature space, in which the relation between causes and effects can be assumed to be linear. The effectiveness of a method to discriminate between causes and effects based on this type of asymmetry is illustrated in a variety of experiments using different measures of Gaussianity. The proposed method is shown to be competitive with state-of-the-art techniques for causal inference.

Key words: causal inference, Gaussianity of the residuals, cause-effect pairs
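The Gaussianization effect described in the abstract can be demonstrated in a few lines of numpy: with uniform (hence non-Gaussian) cause and noise, the anti-causal residuals have excess kurtosis closer to zero than the causal residuals. The coefficients and the kurtosis-based Gaussianity proxy are illustrative choices, not the measures studied in the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.uniform(-1, 1, n)             # cause, non-Gaussian
y = 0.8 * x + rng.uniform(-1, 1, n)   # effect = linear fn + non-Gaussian noise

def ols_residuals(a, b):
    # residuals of an OLS fit of b on a
    slope = ((a - a.mean()) * (b - b.mean())).mean() / np.var(a)
    return b - slope * a - (b.mean() - slope * a.mean())

def excess_kurtosis(r):
    r = (r - r.mean()) / r.std()
    return np.mean(r**4) - 3.0  # 0 for a Gaussian

k_causal = excess_kurtosis(ols_residuals(x, y))      # fit in causal direction
k_anticausal = excess_kurtosis(ols_residuals(y, x))  # fit in anti-causal direction

# Anti-causal residuals mix two independent non-Gaussian terms and
# are therefore closer to Gaussian (|excess kurtosis| nearer 0).
print(abs(k_anticausal) < abs(k_causal))
```

Intuitively, the anti-causal residual is a weighted sum of the independent cause and noise terms, and sums of independent variables drift toward Gaussianity, which is the asymmetry the chapter formalizes.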

Chapter 9

From Dependency to Causality: A Machine Learning Approach

Gianluca Bontempi, Maxime Flauder

Abstract: The relationship between statistical dependency and causality lies at the heart of all statistical approaches to causal inference. Recent results in the ChaLearn cause-effect pair challenge have shown that causal directionality can be inferred with good accuracy, also in Markov-indistinguishable configurations, thanks to data-driven approaches. This paper proposes a supervised machine learning approach to infer the existence of a directed causal link between two variables in multivariate settings with n > 2 variables. The approach relies on the asymmetry of some conditional (in)dependence relations between the members of the Markov blankets of two causally connected variables. Our results show that supervised learning methods may be successfully used to extract causal information on the basis of asymmetric statistical descriptors also for distributions over n > 2 variables.

Key words: causal inference, information theory, machine learning

Chapter 10

Pattern-based Causal Feature Extraction

Diogo Moitinho de Almeida

Abstract: The cause-effect pairs challenge was motivated by the contrast between the cost of performing controlled experiments to determine causality and the abundance of observational data. Our goal was to provide a value representing our confidence, determined from the observational data, that a causal relationship exists, which would help identify the most promising variables for experimental verification. By identifying patterns in the functions that generate relevant features, a feature extraction pipeline was architected to allow the creation of large amounts of complex features with minimal human intervention. Using this pipeline, we were able to finish second on the public leaderboard and first on the private leaderboard. Furthermore, this process by default generates over 20,000 features. In this paper, we analyze which aspects are most important and create a new pipeline that achieves comparable performance with only 324 features.

Key words: Feature Extraction, Machine Learning, Causality.

Chapter 11

Training Gradient Boosting Machines using Curve-fitting and Information-theoretic Features for Causal Direction Detection

Spyridon Samothrakis, Diego Perez, Simon Lucas

Abstract: Detecting causal relationships between random variables using only matched pairs of noisy observations is a crucial problem in many scientific fields. In this paper the problem is addressed by extracting a number of curve-fitting and information-theoretic features for each matched pair. Using these features, we train a pair of Gradient Boosting Machines whose hyperparameters we optimise using stochastic simultaneous optimistic optimisation. The results show that our method is relatively successful, gaining 3rd place in the 2013 Kaggle Causality Challenge. Our method is sound enough to be used in causality detection (or as part of a more comprehensive toolkit), although we believe it might be possible to considerably improve the quality of the results by adding more features in the same vein.

Key words: Causality Detection, Gradient Boosting Machine, StoSOO.

Chapter 12

Conditional distribution variability measures for causality detection

Josè A. R. Fonollosa

Abstract: In this paper we derive variability measures for the conditional probability distributions of a pair of random variables, and we study their application to the inference of cause-effect relationships. We also study the combination of the proposed measures with standard statistical measures in the framework of the ChaLearn cause-effect pair challenge. The developed model obtains an AUC score of 0.82 on the final test database and ranked second in the challenge.

Key words: causality detection, cause-effect pair challenge

Chapter 13

Feature importance in causal inference for numerical and categorical variables

Bram Minnaert

Abstract: Predicting whether A causes B (written A -> B) or B causes A from samples (X, Y) is a challenging task. Several methods have already been proposed for the case where both A and B are numerical. However, when A and/or B are categorical, few studies have been performed.

This paper aims to learn the causal direction between two variables by fitting the regressions of X on Y and Y on X with a machine learning algorithm and giving preference to the direction that yields a better fit.

This paper investigates which features are most important when A/B is numerical/categorical. Via an ensemble method, it finds that the important features depend heavily on the particular combination of numerical and categorical variables.

Key words: Causal inference, Deterministic causal relations, Random forest regression, Graphical models, Feature selection

Chapter 14

Markov Blanket Ranking using Kernel-based Conditional Dependence Measures

Eric V. Strobl, Shyam Visweswaran

Abstract: Developing feature selection algorithms that move beyond a pure correlational to a more causal analysis of observational data is an important problem in the sciences. Several algorithms attempt to do so by discovering the Markov blanket of a target, but they all contain a forward selection step which variables must pass in order to be included in the conditioning set. As a result, these algorithms may not consider all possible conditional multivariate combinations. We improve on this limitation by proposing a backward elimination method that uses a kernel-based conditional dependence measure to identify the Markov blanket in a fully multivariate fashion. The algorithm is easy to implement and compares favorably to other methods on synthetic and real datasets.

Key words: feature ranking, Markov blanket, machine learning

Acknowledgements: The initial impulse of the Cause-Effect Pair challenge came from the cause-effect pair task proposed in the causality “pot-luck” challenge by Joris Mooij, Dominik Janzing, and Bernhard Schölkopf, from the Max Planck Institute for Intelligent Systems, who contributed an initial dataset and several algorithms. Alexander Statnikov and Mikael Henaff of New York University provided additional data and baseline software. The challenge was organized by ChaLearn and coordinated by Isabelle Guyon. The first round of the challenge was hosted by Kaggle, and we received a lot of help from Ben Hamner. The second round of the challenge (with code submission) was sponsored by Microsoft and hosted on the Codalab platform, with the help of Evelyne Viegas and her team. Many people who reviewed protocols and tested the sample code and challenge website are gratefully acknowledged: Marc Boullé (Orange, France), Léon Bottou (Facebook), Hugo Jair Escalante (INAOE, Mexico), Frederick Eberhardt (WUSL, USA), Seth Flaxman (Carnegie Mellon University, USA), Mikael Henaff (New York University, USA), Patrik Hoyer (University of Helsinki, Finland), Dominik Janzing (Max Planck Institute for Intelligent Systems, Germany), Richard Kennaway (University of East Anglia, UK), Vincent Lemaire (Orange, France), Joris Mooij (Faculty of Science, Nijmegen, The Netherlands), Jonas Peters (ETH Zürich, Switzerland), Florin Popescu (Fraunhofer Institute, Berlin, Germany), Bernhard Schölkopf (Max Planck Institute for Intelligent Systems, Germany), Peter Spirtes (Carnegie Mellon University, USA), Alexander Statnikov (New York University, USA), Ioannis Tsamardinos (University of Crete, Greece), Jianxin Yin (University of Pennsylvania, USA), and Kun Zhang (Max Planck Institute for Intelligent Systems, Germany). We would also like to thank the authors of software made publicly available that was included in the sample code: Povilas Daniusis, Arthur Gretton, Patrik O. Hoyer, Dominik Janzing, Antti Kerminen, Joris Mooij, Jonas Peters, Bernhard Schölkopf, Shohei Shimizu, Oliver Stegle, and Kun Zhang. We also thank the co-organizers of the NIPS 2013 workshop on causality (Large-scale Experiment Design and Inference of Causal Mechanisms): Léon Bottou (Microsoft, USA), Isabelle Guyon (ChaLearn, USA), Bernhard Schölkopf (Max Planck Institute for Intelligent Systems, Germany), Alexander Statnikov (New York University, USA), and Evelyne Viegas (Microsoft, USA).