Markov Wanderer A blog on economics, science, coding and data. Views are my own.

Get organised

Best practice for research project organisation

This post is going to discuss one way to organise your small (as opposed to ‘big’) data science project or research project: data, code and outputs. I’ll cover how to structure the project, version control, data and data storage, analytical tools, coding standards, and what to do when your project is over.


Of course, these are just my opinions, they’re far from exhaustive, and there may well be good reasons to set up your project differently depending on what it is that you’re doing. I’m interested in hearing different perspectives so get in touch if you have them.

Inevitably the post is going to be geared toward Python because it’s my primary language but much of the advice applies equally well to R. Similarly, although most of what I’m saying applies across platforms, in some places it may be more relevant to Mac OS.

I’m not going to discuss topics like unit tests, automatic creation of documentation, or making the project into an installable package in this post and, for most small research projects, these features would probably be overkill.

For a more detailed perspective on best practice research project organisation, see Good enough practices in scientific computing. PLoS computational biology, 13(6), e1005510. A similar post from a more pure data science perspective may be found here, and there’s a machine learning oriented cookiecutter project here.

The example project

There’s a small research project github repository that accompanies this post. To use it as the basis of your own small research project, open up a command line in the directory where you want the project and run git clone followed by the repository’s URL, or download it directly from github.

It is in Python 3 and uses the ONS API to download some macroeconomic time series, process them into tidy data, and then use them within a dynamic factor model† inspired by Chad Fulton’s tutorials/notebooks which you can find here and here.

It is very much a toy example and not intended to be accurate or say anything at all about the real world! It is designed to showcase how the various components of what I’ll say below fit together in practice.

Within the example project, there are Latex templates for both slides and a working paper. These are based on Paul Goldsmith-Pinkham’s excellent templates, the originals of which you can find here for slides and here for the working paper.

Okay, on to the main event…

Project structure

The structure of your project should be a directed acyclic graph with raw data making up the first nodes and research outputs (e.g. paper or slides) the final nodes. Here’s an example for the cookiecutter research project:


Why this directed acyclic graph structure? For reproducibility, you can’t have steps earlier in the project that depend on steps later in the process. This may seem completely obvious but, believe it or not, I have seen projects where later stage outputs are looped back to be inputs into earlier stages.

Another important principle here is to separate out different phases of the analysis. Sometimes this is about avoiding repeated effort - going from raw data to cleaned data might be very expensive in terms of time.

Before you start your project, it’s really worth taking the time to sketch out on paper how everything will fit together and which parts might depend on each other. Putting a lot of effort into this step will save you a lot of time in the long run. Armed with a clear structure, you will write better, more modular code that does not involve repetition. Of course, research work is inherently uncertain and you shouldn’t be afraid to change up the structure if the focus or goals of the project change.

If you haven’t tried putting figures and tables in a separate directory to your Latex code before then the example project implements an efficient way to do so. You set a single path and can then refer to outputs only by their name (not their full path). If you want to be even more fancy you can move files around during Latex compilation.
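For instance, a minimal sketch of the single-path approach (the directory and figure names here are my own placeholders, not necessarily those of the example project):

```latex
\documentclass{article}
\usepackage{graphicx}
% Set the figure directory once, in the preamble; note the double braces
\graphicspath{{output/}}
\begin{document}
% Refer to figures by name only, not by full path
\includegraphics[width=0.8\textwidth]{factor_plot}
\end{document}
```

If the output directory later moves, only the \graphicspath line needs to change.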

Perhaps you need to output your (Latex) writing to Microsoft’s Word format or to markdown as part of your workflow? In that case, I’d suggest using pandoc but be warned that ensuring references, citations, equations, and inputs are included correctly can be fiddly.

One other important principle: friends do not let friends use whitespace in filenames or paths.

Configuration files

You’ll notice that there is a config file, config.yaml, that sits above everything else. The purpose of this is to make adding global settings to your project easier, especially if they are directories. The advantage of this config file is that you can see what settings are being run from one place and, if you do need to change the structure of the project, you only have to do it in one place. Similarly, others on the project can clearly see when and how important settings were changed without trawling through lots of code.

In the example project, I’ve put settings for plots into the config.yaml where they can be conveniently loaded. These start with the - viz: heading in the file.

.yaml is not the only configuration file available and I don’t have a very strong view as to which is best as they all have their pros and cons. I’ve used both .ini and .yaml, and both can work for a simple project. You can find more about the ins and outs of different config file formats here (with handy examples) and here.
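To illustrate with the .ini flavour, which Python can read with the standard library alone (the settings shown are hypothetical, not those of the example project):

```python
import configparser

# A hypothetical config, as it might appear in config.ini
config_text = """
[paths]
raw_dir = raw/
output_dir = output/

[viz]
dpi = 300
"""

config = configparser.ConfigParser()
config.read_string(config_text)

# Settings are then read from one place throughout the project
raw_dir = config["paths"]["raw_dir"]
dpi = config.getint("viz", "dpi")
```

In a real project you’d call config.read("config.ini") rather than parsing a string.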

Version control

There are many articles on why you should use version control if you’re doing any sort of coding and I’m not going to go over the arguments here. I will link to this primer instead. Most people use git for version control (it’s completely free). Git has a painful learning curve but there are just a handful of commands that you’ll use most of the time, especially on smaller projects. And, if you do run into trouble, there’s plenty of help available online. Note that git is the tool to manage version control while github, gitlab, and bitbucket are hosting services for git repositories.

Beyond the software development-type reasons for version control, there are benefits that are particular to research. Journals increasingly require code to be submitted alongside papers; version control encourages good code management that will make submitting your code much easier when the time comes. If you host your code on platforms such as github and gitlab, privately at first, and then publicly when you publish, you can significantly extend the impact of your work. Those same platforms enable collaboration on code, including Latex, with co-authors. Even better, you can use tools like git-blame to understand who changed what and when - useful in all kinds of situations, not just blaming co-authors for that misplaced semi-colon.

The other great use of the various git platforms is to track bugs, to do lists, and even to host wikis.

A few extra tips on the interaction between version control and project structure.

Version control is not meant to track data, only code. For outputs such as figures and tables, though, it’s less clear where to draw the line. As shown above, I’d advise having a scratch-outputs folder, not under version control, that you can spam with hundreds of figures and tables, and a (final) outputs folder that holds the tables and figures that are going to make it into the paper and/or slides.

Latex is code! Put it under version control. This also makes it easy to collaborate with co-authors, and work out who changed what when. Some prefer to use tools like Overleaf, an online Latex editor with some WYSIWYG features, instead.

There are some folders, e.g. raw/, that you’d like to keep even though none of the contents of the folder should be under version control. There is a special file for that, called .gitkeep, which tells git you’d like to keep the folder. The file can be completely empty and, on Unix systems, you can create it with touch raw/.gitkeep in the command line.

Likewise, there is a lot of gunk generated by Latex compilation that you probably don’t want to keep under version control. This is what the magic .gitignore file is for in the project structure. It specifies what types of file to ignore. The .gitignore file in the example project will automatically ignore Latex compilation files, everything in the raw/ and output-scratch/ folders, and anything generated indirectly by running Python code or Latex compilation.
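A .gitignore along those lines might look like the following (the folder names match the project structure above; the exact Latex patterns are illustrative rather than exhaustive):

```
# Latex compilation gunk
*.aux
*.log
*.synctex.gz
*.bbl
*.blg

# Data and scratch outputs
raw/*
!raw/.gitkeep
output-scratch/*

# Python by-products
__pycache__/
*.pyc
```

The `!raw/.gitkeep` line is a negation pattern: it un-ignores the placeholder file so the otherwise-empty folder is kept.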


Data

I find it useful to think about the main possible classes of data in a research project as being raw, intermediate, cleaned, and output.

As the example project is simple, we are going to skip intermediate data and go straight for clean data.

Raw data

Raw data is just that. No matter how horrible a format it comes to you in (a 50 sheet Excel file with different formats on each sheet anyone?), you should preserve that. Don’t mess with it, keep it to one side and derive other, better data from it. You’ll need it later when you try and replicate your own work.

Intermediate data

Intermediate data is the data you get once you’ve made some progress on getting whatever mess you started with into shape. Maybe you had 5 different spreadsheets and you’ve managed to clean each one and dump them into CSVs. Yes, they are still not tidy, or in the format you need for analysis, or merged. But you’ve made some progress, progress worth making into a distinct phase of your analysis.

Intermediate data can be very large, in which case you may want to consider the speed and efficiency of storing it. For the python library pandas, there’s a nice post here looking at file sizes and loading/saving speeds. As noted, intermediate data should not be under version control. Data versioning does exist but I’ve not (yet) seen it used for research projects - see pachyderm for an example.

Cleaned data

Cleaned data is what’s used to do the analysis. It’s data that’s ready to go into a machine learning model or regression. If a colleague were working on a similar project, this is (hopefully) what you’d send them instead of the 50-sheet Excel monstrosity.

Cleaned data should be stored in tidy format, that is data in which each observation is a row, each variable is a column, and each type of observation unit forms a table. This figure shows a visual example of tidy data.
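As a small sketch of tidying with pandas: a wide table with one column per year becomes one row per country-year observation (the numbers are made up):

```python
import pandas as pd

# Wide format: one column per year
wide = pd.DataFrame({
    "country": ["UK", "France"],
    "2018": [2.1, 1.9],
    "2019": [1.4, 1.5],
})

# Tidy format: each row is a single observation,
# each column is a single variable
tidy = wide.melt(id_vars="country", var_name="year", value_name="growth")
```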

From R for Data Science.

If you want to find out more about why it’s good practice to store your data in tidy format then it’s worth reading Hadley Wickham’s paper on it.

In the vast majority of cases, the best data file format for your project’s cleaned data is CSV. Everyone can open it, no matter what analytical tool or operating system they are using. As a storage format, it’s unlikely to change. Without going into the mire of different encodings, save it as UTF-8 (note that this is not the default encoding in Windows). This is especially true of text heavy data.
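Stating the encoding explicitly is a one-argument change with the standard library (the filename is illustrative):

```python
import csv

rows = [["city", "note"], ["Málaga", "café data"]]

# Specify UTF-8 explicitly rather than relying on the OS default
with open("cleaned.csv", "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(rows)

# State the encoding again when reading it back
with open("cleaned.csv", encoding="utf-8", newline="") as f:
    back = list(csv.reader(f))
```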

Of course, CSV is great for tabular data but won’t work for many other kinds. For other cases, Stanford’s library has put together a useful list of preferred file formats for archiving everything from geospatial data to films.

Do not store your data in Excel file formats. Ever. Firstly, it’s not an open format, it’s proprietary, even if you can open it with many open source tools. But, more importantly, Excel can do bad things like changing the underlying values in your dataset (dates and booleans), and it tempts other users to start slotting Excel code around the data. This is bad - best practice is to separate code and data. Code hidden in Excel cells is not very transparent or auditable.


Should you put your tidy, cleaned data under version control? Probably not. But if it’s small and unlikely to change much, it can be quite convenient to do so.

Output data

These are the final figures and tables that tell the story in your analysis. As noted, it’s convenient to put the ones that are going to make it into your paper and any presentations you give under version control, and have a scratch folder for the rest. This is a folder for the many extra figures and tables that you’ll create, and perhaps want to glance at, but won’t hold on to.

For figures, most journals require that you use lossless formats such as PDF and EPS. .eps and .pdf are vector image formats: they represent the shapes and lines of the image and so can be reproduced at any resolution. They are distinct from rasterised formats (.png, .jpg), which use pixels to reproduce the image at one specific resolution. For images made up of smooth shapes and colours, like most charts, vector formats are superior. For complex images, such as photographs, jpg is usually better because there is a natural limit to the resolution you would ever need for such an image. As journals tend to prefer it, my general recommendation is to use .eps wherever possible and, if you do have real images, find out what format the journal prefers. Not only do .eps files look better, but for figures they tend to take up less space on disk than the equivalent png or jpg file. Modern programming languages like R and Python can export to all of these formats.

For reasons that are not at all obvious, Powerpoint does not play nicely with vector images but Keynote (Mac OS) and Beamer/Latex (all operating systems) do.‡

What about tables? My current strategy is to export these directly to Latex as .tex files. It’s not so easy to look at these without compiling them using Latex but it saves a lot of time when (automatically) incorporating them into your paper and presentations. Tables as tex files also take up little space on disk and can happily go under version control.*

Analytical tools

By analytical tools, I really mean the combination of programming language and integrated development environment (IDE) that you use. The best practice here is to use the right tool for the job.

In addition to that, it’s worth thinking about how accessible your code will be to others when you release it. Code written in a proprietary language that requires users to shell out some cash just to run it is inevitably less accessible than code written in open source languages.

Unless you’re running very computationally intensive code that needs C++ or Fortran, you’re likely to be using one of Python, R, Julia, or Matlab. If you’re coming from the social sciences then perhaps you’ll be using Stata or EViews. Some of these languages come bundled with, and are almost inseparable from, their IDEs.

As for which IDE to use, many heavy R users swear by RStudio and I know of many Python users who either prefer Spyder (which comes bundled with the Anaconda distribution of Python) or PyCharm (anecdotally this seems to be preferred by software dev types).

Recently, I’ve mostly been using Visual Studio Code. VS Code is an extensible text editor and IDE that is free and very impressive: I’ve used it to run code in Python, R, markdown, and Latex. I believe it also supports Octave (aka free Matlab) and Julia, but I haven’t tested these. There’s syntax highlighting for both Stata and Matlab and - if you already have Stata installed - you can apparently run Stata code from VS Code! Support for Python is very good; you can switch between environments within the IDE, launch interactive consoles, and remotely connect to an instance of Python running elsewhere. Switching between Python/conda environments with the click of a button is revelatory. See here for a full list of supported languages.

Most additional features require the installation of packages that can be found via the package search. Two essential extensions are git-lens and Markdown preview enhanced.

Coding standards

The validity of your research depends, to a frightening degree, on the quality of your code. There are ways to code better and minimise the risk of mistakes even for small research projects that don’t look anything like a software development project. Most languages have some sort of style guide to help you. Following them will make your code easier to read, more consistent, and more manageable.

For R, there doesn’t seem to be a single agreed upon style, but I’m sure you could do much worse than follow Hadley Wickham’s R style guide, itself based upon the Google R style guide, at least if you’re using the tidyverse ecosystem.

For Python, there is PEP8. Yes, it’s a bit of a monster. Rather than read through it, just install a linter extension in your favourite IDE (see this guide for VS Code) and your code will be automatically checked for most style breaches as you type. It’s a bit daunting to turn this on at first but it encourages you to produce much cleaner code.

For research, it’s worth having the extensions and robustness checks that reviewers might request in mind early on. You don’t want to be faced with a request that’s going to force you to do a huge re-write of your code. Better to try and anticipate reasonable variations on what you’ve done from the start, difficult though that may be.

Make your code as modular as possible, and never re-write the same code twice. If the same code is being re-run, stick it in a function. You will save time in the long run and having functions defined once and only once makes it easy to change in the future too.
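As a trivial sketch of the principle (the rebasing example is made up): if the same transformation appears twice, define it once and call it twice.

```python
def rebase(series, base_value):
    """Rebase a series so that base_value maps to 100."""
    return [100 * x / base_value for x in series]

# Reused for any number of series; the logic lives in one place only
gdp_index = rebase([105.0, 110.3, 112.9], base_value=105.0)
cpi_index = rebase([99.2, 100.0, 101.7], base_value=100.0)
```

If the rebasing convention later changes, there is exactly one function to edit.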

Code comments can be helpful. The best code actually has very few comments because what’s happening is clear without them. When that high bar can’t be reached, add comments to make it easier for a reader to understand what your code is doing. Most likely, that reader will be future you.

Perform code reviews. Give what you’ve done to a colleague and ask them to go through it line-by-line checking it works as intended. If they do this properly and don’t find any mistakes or issues then I’d be very surprised. Return the favour to magically become a better coder yourself.

Choose clarity over optimisation, at least as a starting point. Computation is cheap, brain time is not. If you really need to optimise, do it later when you’ve figured out where it will count.

After the project


Unless you’ve been hiding under a rock, you’ll know about the replicability crisis in research. Much of what I’ve outlined above should help make replication as easy as possible: you can git clone your repository into a new folder, add the raw data to the raw/ directory, and then hit go on the code. If the final outputs match up to what you did before, that’s a good sign.

This is certainly not sufficient for replication in the broadest sense, but it is necessary. If even you can’t reproduce your own results from scratch then you can’t expect anyone else to be able to.

Technically, to make the project as reproducible as possible, you should be including information on how to set up the exact same environment (including package versions and operating system) that was used to generate the results. I do think this is going to be essential in the future but, right now, it’s just not practical for all but the most tech-savvy researchers. If you’re using the same OS then conda’s environment files are a step in the right direction when using Python, albeit an imperfect one.

To create and use the conda environment included in the example project, use

conda env create -f ccenv.yml

on the command line, then activate the environment using conda activate ccenv.

To save an environment file from an existing conda environment, use conda env export > yourenv.yml but also use caution: this environment file will likely only work on your computer. It cannot easily be shared with others for them to recreate the same environment (it’s tied to your OS for a start). One rough way around this that I’ve used in the cookiecutter project is to export the environment and then manually edit it to only retain i) Python and its version, and ii) packages that are explicitly imported in the code but with no version numbers. The idea is to ask for the version of Python that was used to generate the results initially but then let conda worry about the versions of the other imported packages, and any dependencies that those packages may have.
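After that manual edit, the environment file might be stripped down to something like this (the package names are illustrative, not the example project’s actual dependency list):

```yaml
name: ccenv
dependencies:
  - python=3.7
  - pandas
  - statsmodels
  - matplotlib
```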


Once you have finished your analysis, what do you do with the dataset you have painstakingly put together? Hopefully you’ll make it ‘findable, accessible, interoperable and reusable’ (FAIR) so that others can use it, as recommended by the journals Nature and Scientific Data.

Briefly, Findable equates to having meta-data (including a unique and persistent identifier) and being in a searchable index; Accessible means that data (and meta-data) are retrievable via open and free tools (proprietary formats like Stata .dta or Excel .xls files do not count, but open formats like .csv do); Interoperable means that data are in a widely used and machine readable structure such as tidy; and Re-usable means including a data usage license and meta-data on provenance. There’s a more detailed list of criteria here.

Importantly, data should not just be appended to articles as a supplement but lodged in a searchable repository with an identifier that is citable. Use the Stanford library list earlier in the post for information on what file formats to use, and this list from Scientific Data of approved FAIR data repositories.

Incentives to publish data are perhaps not all that they could be currently, but change is afoot and I would say that best practice is to share your data on one of these repositories whenever possible.


When your project is ready to be released, opening it up to the outside world is as easy as clicking a button on github or gitlab. It will be easily searchable. To make life even easier for those finding it, make sure to have an informative readme file (with the citation information) in the main directory, to tag the project appropriately, and to add a user license. If you’re unsure which license is appropriate, there is a useful guide here.


The assignment of due credit for research can cause great distress and disagreement. Among junior researchers, it can be career-making or -breaking. Senior researchers can be apt to believe that they alone are responsible for everything in a piece of work. I’ve heard plenty of anecdotal evidence of senior researchers inappropriately withholding credit, particularly in economics where there are typically very few authors per paper (see Figure 3 of this paper).

I have a couple of recommendations to make assigning research credit fairer, more transparent, and less likely to cause problems or create misunderstandings.

First, if you are managing the project, make sure that everyone’s expectations as to who will be an author are aligned right at the start.

Second, err on the side of being generous with co-authorship. The best outcome is that science progresses more quickly; if bringing aboard an extra person with particular skills helps to achieve that, then go for it. As a recent Nature article put it, “By discouraging specialization, current authorship conventions weaken the scientific enterprise” and “Science is a team sport”. Do not worry that credit will be diluted. For me, the most exciting paper of the 21st century is the Observation of Gravitational Waves from a Binary Black Hole Merger. The author list runs to 3 pages.

To alleviate any concerns about diluting credit, you can always follow the physical sciences model of having authors listed in order of contribution (apart from the last author, who is typically the principal investigator). This is in contrast to the alphabetical ordering common in some other fields.

Finally, once the project is complete, be explicit about who did what by following the Contributor Roles Taxonomy, also known as CRediT. This breaks down scholarly contributions into 14 roles and three levels (lead, equal, and supporting), whether for authors or for those mentioned in the acknowledgements. Examples of roles include conceptualisation, funding acquisition, analysis, writing – original draft, and validation. To their credit, the originators of this system also propose to make the data on contributions machine readable, and a number of journals are adopting it for submissions.


I hope you’ve found this post informative. Disagree with anything or think I’ve missed an important point? Get in touch!

*You may find that because the .eps files used for figures are not in a sub-directory of the main .tex folder, you must add a flag to the Latex compiler. In TexShop, the steps are:

  • Go to Preferences
  • Go to Tab “Engine”
  • Go to the field “pdfTeX”
  • In the Latex Input Field add --shell-escape at the end so that it changes from pdflatex --file-line-error --synctex=1 to pdflatex --file-line-error --synctex=1 --shell-escape

‡ You can use .svg in the latest versions of Microsoft Powerpoint. Microsoft dropped support for .eps in Powerpoint due to concerns about security.

† If you’re interested in the model, it has the following specification:
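(The specification itself appears as an image in the original post. As a rough reconstruction consistent with the surrounding description - my notation, not necessarily the original’s - a one-factor dynamic factor model with an AR(2) factor and AR(2) errors can be written:)

```latex
\begin{aligned}
\vec{y}_t &= \Lambda \vec{f}_t + \vec{e}_t \\
f_t &= \phi_1 f_{t-1} + \phi_2 f_{t-2} + \nu_t \\
\vec{e}_t &= P_1 \vec{e}_{t-1} + P_2 \vec{e}_{t-2} + \vec{\varepsilon}_t,
\qquad \vec{\varepsilon}_t \sim N(\vec{0}, \Sigma)
\end{aligned}
```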

where capital Greek and Latin characters represent matrices, arrows over characters denote vectors, and it is assumed that the different components of the ‘innovations’ in the error updating equation are uncorrelated so that $\Sigma$ is a diagonal matrix. The model has one unobserved factor that follows an AR(2), and the errors similarly follow an AR(2).

Specification curve analysis

Econometrics in Python Series - Part V

When specifying a causal model, modellers have a number of options. These can be informed by field intelligence, priors, and even misguided attempts to find a significant result. Even with the best of intentions, research teams can reach entirely different conclusions using the same, or similar, data because of different choices made in preparing data or in modelling it.

Typically this happens when there isn’t a clear way to do ‘feature engineering’ on the data. For example, you have a high frequency time series which needs to be aggregated to a lower frequency: you could take the maximum, the minimum, or the average over each high frequency time period. A different choice may be appropriate in different settings.

There’s formal evidence that researchers really do make different decisions; this study gave the same research question - whether soccer referees are more likely to give red cards to dark-skin-toned players than to light-skin-toned players - to 29 different teams. From the abstract of that paper:

Analytic approaches varied widely across the teams, and the estimated effect sizes ranged from 0.89 to 2.93 (Mdn = 1.31) in odds-ratio units. Twenty teams (69%) found a statistically significant positive effect, and 9 teams (31%) did not observe a significant relationship. Overall, the 29 different analyses used 21 unique combinations of covariates. Neither analysts’ prior beliefs about the effect of interest nor their level of expertise readily explained the variation in the outcomes of the analyses. Peer ratings of the quality of the analyses also did not account for the variability.

So not only were different decisions made, there seems to be no clearly identifiable reason for them (although, getting a bit meta, perhaps other authors would have analysed this question differently!)

There is usually scope for reasonable alternative model specifications when estimating causal coefficients, and those coefficients will vary with those specifications. Let’s abuse notation and call this property

$$ \frac{\text{d} \beta}{\text{d} \text{ specification}} \neq 0 $$

where $\beta$ is the coefficient of interest.

What can we do to ensure conclusions are robust to model specification change when that change is due to equally valid feature engineering-type choices? The art is all in deciding what is meant by, or what is a valid form for, $\text{d} \text{ specification}$ and showing that, even under different specifications, the estimates of $\beta$ are robust.

It’s standard in economics to include many different model specifications in order to demonstrate robustness to different specifications. For the same target variable in the same context, there might be five or six of these alternative specifications. The picture below, from Autor, Dorn, and Hanson’s paper China Syndrome, gives a flavour.

Table 3 of ‘China Syndrome’

But there may be times when it’s appropriate to show many more specifications than this, for example in a contested area, or one in which the feature engineering choices are very unclear.

Enter specification curve analysis

One way to more comprehensively analyse $ \frac{\text{d} \beta}{\text{d} \text{ specification}}$ is specification curve analysis.

Specification curve analysis, as introduced in this paper, looks for a more exhaustive way of trying out alternative specifications. From the paper, the three steps of specification curve analysis are:

  1. identifying the set of theoretically justified, statistically valid, and non-redundant analytic specifications;
  2. displaying alternative results graphically, allowing the identification of decisions producing different results; and
  3. conducting statistical tests to determine whether as a whole results are inconsistent with the null hypothesis.

For a good example of specification curve analysis in action, see this recent Nature Human Behaviour paper on the association between adolescent well-being and the use of digital technology.

An example in Python

This example is going to use the concrete data I’ve used previously to look at the effect of ‘superplasticizer’ on the compressive strength of concrete. I’m going to skip over step 1 quickly, as it will vary a lot depending on your dataset.

Step 1

The data don’t actually require any feature engineering, so we’ll have to pretend that - beyond those two key variables - we’re not sure whether other features should be included or not.

Let’s make it a bit more interesting and say that ‘coarse’ and ‘fly’ are based on the same raw data; they are just engineered differently in the data for analysis. Therefore we do not include them together in the model at the same time. That covers step 1.

Step 2

For step 2, displaying alternative results graphically, we need the data and the code.

First, let’s set up the environment, then read in the data:

import pandas as pd
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from itertools import combinations
import matplotlib as mpl
import sklearn

# Plot style settings
jsonPlotSettings = {'xtick.labelsize': 16,
                    'ytick.labelsize': 20,
                    'font.size': 22,
                    'figure.figsize': (10, 5),
                    'axes.titlesize': 22,
                    'axes.labelsize': 20,
                    'lines.linewidth': 2,
                    'lines.markersize': 6,
                    'legend.fontsize': 18,
                    'mathtext.fontset': 'stix',
                    'font.family': 'STIXGeneral'}
mpl.rcParams.update(jsonPlotSettings)

# Read in the data and shorten the column names to their first word
df = pd.read_excel('../../ManyRegsPandas/Concrete_Data.xls')
df = df.rename(columns=dict(zip(df.columns, [x.split()[0] for x in df.columns])))
df.head()
   Cement  Blast  Fly  Water  Superplasticizer  Coarse   Fine  Age   Concrete
0   540.0    0.0  0.0  162.0               2.5  1040.0  676.0   28  79.986111
1   540.0    0.0  0.0  162.0               2.5  1055.0  676.0   28  61.887366
2   332.5  142.5  0.0  228.0               0.0   932.0  594.0  270  40.269535
3   332.5  142.5  0.0  228.0               0.0   932.0  594.0  365  41.052780
4   198.6  132.4  0.0  192.0               0.0   978.4  825.5  360  44.296075

This is the pure question - what dependence does concrete strength have on the use of superplasticizer?

results = sm.OLS(df['Concrete'], df['Superplasticizer']).fit()
print(results.summary())
                            OLS Regression Results                            
Dep. Variable:               Concrete   R-squared:                       0.578
Model:                            OLS   Adj. R-squared:                  0.578
Method:                 Least Squares   F-statistic:                     1410.
Date:                Fri, 25 Jan 2019   Prob (F-statistic):          5.29e-195
Time:                        xx:xx:xx   Log-Likelihood:                -4804.2
No. Observations:                1030   AIC:                             9610.
Df Residuals:                    1029   BIC:                             9615.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
Superplasticizer     3.4897      0.093     37.544      0.000       3.307       3.672
Omnibus:                       20.707   Durbin-Watson:                   0.639
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               22.783
Skew:                          -0.298   Prob(JB):                     1.13e-05
Kurtosis:                       3.420   Cond. No.                         1.00

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

That’s the baseline regression, with $\beta = 3.4897$. Now we need to try the alternative specifications.

We have 7 potential control variables. It’s worth bearing in mind, for computational reasons, what the upper limit on the number of specifications you could potentially run is. The number of ways of choosing $k$ controls from $n$ is $n$ choose $k$, or

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

We want to look at all possible values of $k$, which gives

$$\sum_{k=0}^{n} \binom{n}{k} = 2^n$$

combinations in total. So this is not feasible as $n$ gets very large, but should be okay here.
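As a quick numerical check of the arithmetic, with $n = 7$ controls (a sketch using only the standard library; `math.comb` needs Python 3.8+):

```python
from math import comb

n = 7  # number of potential control variables
# Number of specifications with exactly k controls, summed over all k
total = sum(comb(n, k) for k in range(n + 1))
print(total)  # 2**7 = 128
```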

In this case, there are also some mutually exclusive combinations which will reduce the overall number - remember I decided that ‘coarse’ and ‘fly’ are different ways of creating the same variable. Let’s create all possible $2^7 = 128$ combinations first. We can use the Python combinations function to do this.

# A list of the controls
controls = [x for x in df.columns if x not in ['Concrete','Superplasticizer']]
# Generate all combinations in a list of tuples
Allcomb = [combinations(controls, k) for k in range(len(controls)+1)]
# Flatten this into a single list of tuples
Allcomb = [item for sublist in Allcomb for item in sublist]
# Turn all the tuples into lists
Allcomb = [list(x) for x in Allcomb]

Let’s have a look at some of these: the first 5, a random sample of 5, and the last one.

print(Allcomb[:5])
for i in np.random.choice(len(Allcomb), 5):
    print(Allcomb[i])
print(Allcomb[-1])
[[], ['Cement'], ['Blast'], ['Fly'], ['Water']]
['Fly', 'Water', 'Coarse', 'Age']
['Cement', 'Water', 'Fine']
['Blast', 'Fly', 'Coarse', 'Fine', 'Age']
['Cement', 'Blast', 'Coarse', 'Age']
['Blast', 'Water', 'Coarse']
['Cement', 'Blast', 'Fly', 'Water', 'Coarse', 'Fine', 'Age']

Note that the original specification is included here as [], i.e. no control. We now need to remove the mutually exclusive combinations - that is any combination which has both ‘Coarse’ and ‘Fly’ in it. Then we’ll look at the last entry to see if it has worked.

Allcomb = [x for x in Allcomb if not (('Coarse' in x) and ('Fly' in x))]
print(Allcomb[-1])
['Cement', 'Blast', 'Water', 'Coarse', 'Fine', 'Age']
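As a quick sanity check on the count (a standalone sketch which re-derives the combinations from scratch): any combination containing both ‘Coarse’ and ‘Fly’ also contains an arbitrary subset of the other 5 controls, so there are $2^5 = 32$ such combinations, leaving $128 - 32 = 96$ valid specifications.

```python
from itertools import combinations

controls = ['Cement', 'Blast', 'Fly', 'Water', 'Coarse', 'Fine', 'Age']
# All subsets of the controls, of every size from 0 to 7
allcomb = [list(c) for k in range(len(controls) + 1)
           for c in combinations(controls, k)]
# Drop the mutually exclusive combinations
valid = [c for c in allcomb if not ('Coarse' in c and 'Fly' in c)]
print(len(allcomb), len(valid))  # 128 96
```

The 96 valid specifications match the number of columns in the shuffled-results dataframe later on.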

Great - the old last combination, which mixed features, has been dropped. Now we need to iterate over all possible regression specifications and store the coefficient calculated in each one.

AllResults = [sm.OLS(df['Concrete'], 
                      df[['Superplasticizer']+x]).fit() for x in Allcomb]

You can see this has run all of the possible combinations; here are the estimated coefficients (`AllResults[-1].params`) for the last entry:

Superplasticizer    0.840783
Cement              0.085463
Blast               0.064191
Water              -0.119120
Coarse              0.016815
Fine                0.002805
Age                 0.106915
dtype: float64

Great. Let’s store the results in a dataframe. As well as the coefficient on superplasticizer, I’ll store the standard errors, ‘bse’, and the pvalues for the independent variables. I’ll then reorder everything by coefficient value.

# Get coefficient values and specifications
df_r = pd.DataFrame([x.params['Superplasticizer'] for x in AllResults],columns=['Coefficient'])
df_r['Specification'] = Allcomb
# Get std err and pvalues
df_r['bse'] = [x.bse['Superplasticizer'] for x in AllResults]
df_r['pvalues'] = [x.pvalues for x in AllResults]
df_r['pvalues'] = df_r['pvalues'].apply(lambda x: dict(x))
# Re-order by coefficient
df_r = df_r.sort_values('Coefficient')
df_r = df_r.reset_index().drop('index',axis=1)
df_r.index.names = ['Specification No.']
                   Coefficient                 Specification       bse  \
Specification No.                                                        
31                    1.044216  [Cement, Blast, Coarse, Age]  0.059440   
27                    1.034839   [Cement, Blast, Water, Age]  0.058165   
58                    1.290024                   [Fine, Age]  0.079633   
62                    1.336140  [Blast, Water, Coarse, Fine]  0.095310   
45                    1.154499            [Cement, Fly, Age]  0.072391   
19                    0.912858                [Cement, Fine]  0.072651   
55                    1.243370                [Coarse, Fine]  0.086451   
50                    1.196307   [Cement, Coarse, Fine, Age]  0.067479   
25                    1.008358        [Cement, Coarse, Fine]  0.074518   
93                    2.842257                         [Age]  0.073861   

                                                             pvalues
Specification No.
31                 {'Superplasticizer': 1.3490880141286832e-60, '...  
27                 {'Superplasticizer': 6.447248960284443e-62, 'C...  
58                 {'Superplasticizer': 9.824299541334832e-53, 'F...  
62                 {'Superplasticizer': 5.604831921131288e-41, 'B...  
45                 {'Superplasticizer': 2.5456524931721465e-51, '...  
19                 {'Superplasticizer': 8.7290431310275e-34, 'Cem...  
55                 {'Superplasticizer': 7.235976198602693e-43, 'C...  
50                 {'Superplasticizer': 1.5168657130127636e-61, '...  
25                 {'Superplasticizer': 1.6517230301301733e-38, '...  
93                 {'Superplasticizer': 2.233901784516485e-201, '...  

Now I will plot the results for the coefficient as a function of the different specifications, adding the standard errors as a swathe.

fig, ax = plt.subplots()
ax.yaxis.major.formatter._useMathText = True
ax.scatter(df_r.index, df_r['Coefficient'], s=2., color='b', label='Coefficient')
ax.axhline(y=np.median(df_r['Coefficient']),color='k',alpha=0.3,label='Median',dashes=[12, 5])
ax.fill_between(df_r.index, df_r['Coefficient']+df_r['bse'], df_r['Coefficient']-df_r['bse'],color='b', alpha=0.3)
ax.set_xlabel('Specification No.')
ax.set_ylabel('Coefficient')
ax.legend(frameon=False, loc='upper left',ncol=2,handlelength=4)


Let’s now have a matrix which shows, for each specification, whether a particular set of features was included. There are 7 features, so there’ll be 7 rows, and we should expect no column to have both ‘Coarse’ and ‘Fly’ highlighted. There’s going to be some data wrangling to do this: I’ll first sort each row in the specification column alphabetically, then count the occurrences of each control variable in each row (0 or 1).

Then, to go from a column where each cell is a dict of counts of control variables in that row’s specification, I’ll transform to a set of columns, one for each control variable. These cells will have counts in. The counts should all be 0 or 1, so I’ll then map them into boolean values.

With a matrix of 0s and 1s with rows as specifications and columns as variables, I can easily create a heatmap.

df_r['Specification'] = df_r['Specification'].apply(lambda x: sorted(x))
df_r['SpecificationCounts'] = df_r['Specification'].apply(lambda x: Counter(x))
                   Coefficient                           Specification  \
Specification No.                                                        
0                     0.228428  [Age, Blast, Cement, Fine, Fly, Water]   
1                     0.327962       [Blast, Cement, Fine, Fly, Water]   
2                     0.468836             [Blast, Cement, Fly, Water]   
3                     0.522836        [Age, Blast, Cement, Fly, Water]   
4                     0.653542              [Blast, Cement, Fine, Fly]   

                        bse  \
Specification No.             
0                  0.087860   
1                  0.104747   
2                  0.088731   
3                  0.075540   
4                  0.076913   

                                                             pvalues  \
Specification No.                                                      
0                  {'Superplasticizer': 0.009459124471543073, 'Ce...   
1                  {'Superplasticizer': 0.0017915187476705682, 'C...   
2                  {'Superplasticizer': 1.5457095399610106e-07, '...   
3                  {'Superplasticizer': 7.881232377381058e-12, 'C...   
4                  {'Superplasticizer': 6.77195621959008e-17, 'Ce...   

Specification No.                                                     
0                  {'Age': 1, 'Blast': 1, 'Cement': 1, 'Fine': 1,...  
1                  {'Blast': 1, 'Cement': 1, 'Fine': 1, 'Fly': 1,...  
2                    {'Blast': 1, 'Cement': 1, 'Fly': 1, 'Water': 1}  
3                  {'Age': 1, 'Blast': 1, 'Cement': 1, 'Fly': 1, ...  
4                     {'Blast': 1, 'Cement': 1, 'Fine': 1, 'Fly': 1}  
df_spec = df_r['SpecificationCounts'].apply(pd.Series).fillna(0.)
df_spec = df_spec.replace(0.,False).replace(1.,True)
                     Age  Blast  Cement   Fine    Fly  Water  Coarse
Specification No.                                                   
0                   True   True    True   True   True   True   False
1                  False   True    True   True   True   True   False
2                  False   True    True  False   True   True   False
3                   True   True    True  False   True   True   False
4                  False   True    True   True   True  False   False
5                  False   True    True  False   True  False   False
6                  False   True    True  False  False   True    True
7                  False   True    True   True  False   True    True
8                  False   True    True   True  False   True   False
9                   True   True    True   True  False   True    True
fig = plt.figure()
ax = fig.add_subplot(111)
ax.imshow(df_spec.T, aspect='auto', interpolation='None')
ax.yaxis.major.formatter._useMathText = True


Now let’s try colouring these depending on whether they are significant or not. We’ll use the plasma colormap, which here will mean that a blueish colour implies significance.

This will follow a somewhat similar approach but begins with the pvalues. The first step is to convert the dict of pvalues to columns, one for each variable, in a new dataframe. I’ll then sort the columns and set the cell values to 0 for significant, 1 for insignificant (at the 0.05 level), and leave missing entries as NaNs. When it comes to plotting, I’ll set those NaNs to appear white while the valid in/significant entries appear in the colours of the plasma heatmap.

df_params = df_r['pvalues'].apply(pd.Series)
df_params = df_params.reindex(sorted(df_params.columns), axis=1)
df_params[np.abs(df_params)>0.05] = 1 # Insignificant
df_params[df_params<=0.05] = 0. # Significant
df_params['Coefficient'] = df_r['Coefficient']
                   Age  Blast  Cement  Coarse  Fine  Fly  Superplasticizer  \
Specification No.                                                            
0                  0.0    0.0     0.0     NaN   0.0  0.0               0.0   
1                  NaN    0.0     0.0     NaN   0.0  0.0               0.0   
2                  NaN    0.0     0.0     NaN   NaN  0.0               0.0   
3                  0.0    0.0     0.0     NaN   NaN  0.0               0.0   
4                  NaN    0.0     0.0     NaN   0.0  0.0               0.0   

                   Water  Coefficient  
Specification No.                      
0                    0.0     0.228428  
1                    0.0     0.327962  
2                    0.0     0.468836  
3                    0.0     0.522836  
4                    NaN     0.653542  
fig = plt.figure()
ax = fig.add_subplot(111)
cmap = mpl.cm.get_cmap('plasma')
cmap.set_bad('white')  # NaNs (variable not in specification) appear white
ax.imshow(df_params[controls].T, aspect='auto', cmap=cmap, interpolation='None')
ax.yaxis.major.formatter._useMathText = True


Step 3

Considering the full set of reasonable specifications jointly, how inconsistent are the results with the null hypothesis of no effect?

This step uses a permutation test which shuffles up the data and re-runs the regressions. It assumes exchangeability, i.e. that the rows are not related in any way. In the original paper on specification curve analysis by Simonsohn et al., they discuss the example of whether hurricanes with more feminine names are perceived as less threatening and hence lead to fewer precautionary measures by the general public, as examined originally in this paper. If you’re interested, Simonsohn et al. accept the null of there being no difference in precautionary behaviour based on the name of the hurricane using specification curve analysis.

So, to do this, we’re going to shuffle up the randomly assigned variable. In our toy example, that’s going to be superplasticizer. As the authors put it,

The shuffled datasets maintain all the other features of the original one (e.g., collinearity, time trends, skewness, etc.) except we now know there is no link between (shuffled) names and fatalities; the null is true by construction.

Although, in our case, it is the superplasticizer value that will be shuffled. Let’s first make a copy of the dataframe ready to shuffle:

Num_shuffles = 50

def retShuffledResults():
    allResults_shuffle = []
    for i in range(Num_shuffles):
        df_shuffle = df.copy(deep=True)
        df_shuffle['Superplasticizer'] = sklearn.utils.shuffle(df['Superplasticizer'].values)
        Results_shuffle = [sm.OLS(df_shuffle['Concrete'],
                                  df_shuffle[['Superplasticizer']+x]).fit() for x in Allcomb]
        allResults_shuffle.append(Results_shuffle)
    return allResults_shuffle

allResults_shuffle = retShuffledResults()
df_r_shuffle = pd.DataFrame([[x.params['Superplasticizer'] for x in y] for y in allResults_shuffle])
df_r_shufflepval = pd.DataFrame([[x.pvalues['Superplasticizer'] for x in y] for y in allResults_shuffle])

         0         1         2         3         4         5         6   \
0  3.017799  0.348324  2.103696  2.342652  0.238608  0.119278  0.152364   
1  2.939502  0.205683  2.009524  2.243891  0.044811 -0.042069 -0.006277   
2  3.004296  0.255635  2.127853  2.322167  0.218430  0.084593  0.127544   
3  3.031353  0.338988  2.118547  2.364655  0.234529  0.171963  0.182143   
4  2.969443  0.250435  2.034338  2.294939  0.123191  0.026125  0.037847   

         7         8         9     ...           86        87        88  \
0  2.124654  0.152692  0.216249    ...     0.077730  0.052367  0.043836   
1  2.078909  0.014767  0.071263    ...    -0.047398  0.002010 -0.005702   
2  2.148499  0.116719  0.112361    ...     0.040043  0.069590  0.071732   
3  2.168407  0.140604  0.217297    ...     0.102334  0.134740  0.101656   
4  2.098849  0.042894  0.140568    ...    -0.033597 -0.001233 -0.028179   

         89        90        91        92        93        94        95  
0  0.031032  0.087474  0.086622  0.048941  0.016861 -0.011674  0.024902  
1 -0.068846  0.009561  0.009350 -0.017208 -0.034570 -0.035247 -0.016576  
2  0.043392  0.037542  0.044300  0.129716  0.089750  0.015758  0.050699  
3  0.048139  0.145640  0.155569  0.130373  0.135638  0.066984  0.104164  
4 -0.057247  0.045333  0.027806 -0.013531 -0.028678 -0.021878 -0.035316  

[5 rows x 96 columns]

Notice that there are multiple shuffled regressions for each specification number. We take the median over all shuffles for each specification number:

med_shuffle = df_r_shuffle.quantile(0.5).sort_values().reset_index().drop('index',axis=1)

These data can be added onto the main plot, along with everything else:

f, axarr = plt.subplots(2, sharex=True,figsize=(10,10))
for ax in axarr:
    ax.yaxis.major.formatter._useMathText = True
axarr[0].scatter(df_r.index, df_r['Coefficient'], s=2., color='b', label='Coefficient')
axarr[0].scatter(med_shuffle.index, med_shuffle.iloc[:, 0], s=2., color='r',
                 label='Coefficient under null (median over bootstraps)')
# use if you wish to label original specification
#orig_spec = df_r[df_r['Specification'].apply(lambda x: not x)]
#axarr[0].scatter(orig_spec.index,orig_spec['Coefficient'],s=100.,color='k',label='Original specification')
axarr[0].axhline(y=np.median(df_r['Coefficient']), color='k', alpha=0.3,
                 label='Median coefficient',
                 dashes=[12, 5])
axarr[0].legend(frameon=False, loc='upper left',ncol=1,handlelength=4,markerscale=10)
axarr[0].set_title('Specification curve analysis')
cmap = mpl.cm.get_cmap('plasma')
cmap.set_bad('white')
axarr[1].imshow(df_params[controls].T, aspect='auto', cmap=cmap, interpolation='None')
plt.subplots_adjust(wspace=0, hspace=0.05)


The authors of the specification curve analysis paper provide three measures of whether, as a whole, the null should be rejected: (i) the median overall point estimate; (ii) the share of estimates in the specification curve that are of the dominant sign; and (iii) the share that are of the dominant sign and also statistically significant (p < .05).

Step 3 part (i)

(i) is calculated from the share of coefficient estimates with results as extreme or more extreme than the original. We need to divide the number of bootstrapped datasets with larger median effect sizes than the original analysis by the total number of bootstraps; this gives the p-value of the test.

pvalue_i = np.double(sum(med_shuffle>np.median(df_r['Coefficient'])))/np.double(len(med_shuffle))

Step 3 part (ii)

(ii) requires this to be repeated, but only with results of the dominant sign. You can see from the plot that we’re again going to get a very small p-value, but here’s the process anyway. First, we determine the dominant sign, and then calculate the p-value for part (ii).

gtr_than_zero = np.argmax( [len(df_r[df_r['Coefficient']<0.]), len(df_r[df_r['Coefficient']>0.])]) # 0 is <0 and 1 is >0
if gtr_than_zero == 1:
    gtr_than_zero = True
else:
    gtr_than_zero = False
if gtr_than_zero:
    pvalue_ii = np.double(sum(med_shuffle[med_shuffle>0]>np.median(df_r['Coefficient'])))/np.double(len(med_shuffle[med_shuffle>0]))
else:
    pvalue_ii = np.double(sum(med_shuffle[med_shuffle<0]>np.median(df_r['Coefficient'])))/np.double(len(med_shuffle[med_shuffle<0]))

Step 3 part (iii)

For part iii), we repeat the same process but only for those which were statistically significant and of dominant sign.

med_shuffle_signif = df_r_shuffle[df_r_shufflepval < 0.05].quantile(0.5).sort_values().reset_index().drop('index',axis=1).dropna() # keep only significant shuffled estimates
if gtr_than_zero:
    pvalue_iii = np.double(sum(med_shuffle_signif[med_shuffle_signif>0]>np.median(df_r['Coefficient'])))/np.double(len(med_shuffle_signif[med_shuffle_signif>0]))
else:
    pvalue_iii = np.double(sum(med_shuffle_signif[med_shuffle_signif<0]>np.median(df_r['Coefficient'])))/np.double(len(med_shuffle_signif[med_shuffle_signif<0]))

As visual inspection of the figures suggested, the p-values are $ \leq 0.01 $ in each case. We have tested whether, when considering all the possible specifications, the results found are inconsistent with results obtained when the null hypothesis is true (that superplasticizer and strength are unrelated). On the basis of the p-values, we can safely reject that null. The tests as carried out strongly imply that $\beta > 0 $ and that this conclusion is robust to specification change.


Researchers are always going to disagree about how to analyse the same data set. Although which specifications to include or exclude from specification curve analysis inevitably involves choices, I think that this is a useful and more comprehensive way to see how sensitive results are to those choices.

Putting women scientists onto Wikipedia

In a previous post, I shared links about the predictors for not participating in higher education, and about how it is difficult to reach audiences in “remote rural or coastal areas and in former industrial areas, especially in the Midlands” (according to the Social Mobility Commission). In this post, I look at another dimension of participation in higher education: gender.

Women are heavily under-represented in STEM (Science, Technology, Engineering, and Mathematics) subjects. In the UK, they make up just 25% of STEM undergraduates but 57% of the total undergraduate population.

It’s little better for economics, as this article in the Financial Times (£) shows, and the direction of the trend is worse: in the US, while the fraction of women undergraduates taking STEM subjects has increased, the fraction taking economics has declined. In the UK in 2011/12, women made up just 28% of economics undergraduates, and the share was trending downwards. The problems aren’t just widely held misapprehensions of what economics is about, or #WhatEconomistsDo. There is solid analytical work looking at ways in which the culture of economics may be hostile for women too. This work is nicely summarised by Prof. Diane Coyle (£), again in the Financial Times. Although both economics and STEM have a problem, I’ve mused before that economics could perhaps learn from science when it comes to outreach.

A campaign to inspire women to enter STEM subjects

My Imperial College London physics colleague Dr. Jess Wade (@jesswade on twitter) has come up with a novel way to help inspire more women to enter STEM subjects. She has been busily and heroically writing Wikipedia articles on women scientists of note since 2016. As she says,

“Wikipedia is a really great way to engage people in this mission because the more you read about these sensational women, the more you get so motivated and inspired by their personal stories.” - Dr. Jess Wade

Picked at random, here is the page of one of the women whose Wikipedia entry Jess has created: Frances Pleasonton, who worked on neutron decay.

What I think is most powerful about Jess’ approach is that it has huge reach, because Wikipedia has huge reach. Normally, it’s nigh on impossible to measure the impacts of outreach beyond a questionnaire issued at the end of an event. The audiences who attend science outreach events are typically self-selected, and they are rarely, if ever, followed over time to see if their relationship with science changes after the event.

Discussing her approach on BBC Radio 4’s Inside Science, Jess expressed her frustrations at well-meaning but likely ineffective outreach programmes which are costly and may do little to reach, or inspire, their intended audience. As was also noted on the programme, scientists can be endlessly methodical in the lab but - when it comes to outreach - their embrace of the scientific method could be better, and outreach programmes need to be better evaluated. Economists could definitely help here.

What is very cool about Jess’ campaign is that it is possible to get an idea, a rough one at least, of its impact. So just how huge is the reach of this campaign? Let’s find out.

Estimating the reach of Wikipedia pages

Feel free to skip this section if you’re not interested in the details of how the data were collected.

Wikipedia tracks page views, literally the number of times a wiki page has been requested. It’s not a perfect measure of the number of people viewing a webpage (you can find more info on the influences here) as some people are likely to be repeat visitors. Also, if an article is contentious, Wikipedia editors may visit it a lot. The debated page on Stanley Kubrick, for example, has had 396 edits by 203 editors since 2017 (at the time of checking).

So page views aren’t perfect, but they’re likely to be a good order of magnitude indicator of the number of people who have viewed a page.

To get all of the stats for the pages, I found Jess’ editor page, which includes an option to show all newly created pages. With some data wrangling via the beautifulsoup and pandas python packages, I obtained a list of people for whom pages were created. There may be a few extra pages which are not individuals included in error here, and perhaps some missing - but the wrangling should deliver most of them.

With the data on the names of the pages collected, I grabbed the page views using the handy wiki page view API and the requests python package. Here’s a snippet of the page views data table:

article Willetta_Greene-Johnson Xiangqian_Jiang Yewande_Akinola
2017-12-01 0.0 0.0 0.0
2018-01-01 0.0 0.0 0.0
2018-02-01 0.0 0.0 167.0
2018-03-01 0.0 26.0 248.0
2018-04-01 0.0 8.0 282.0
2018-05-01 130.0 15.0 152.0
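For reference, here is a minimal sketch of the kind of request involved. The helper name and the date range are my own inventions; the endpoint is the public Wikimedia per-article pageviews REST API, and the live call (commented out) would use the `requests` package:

```python
def pageviews_url(article, start, end, project='en.wikipedia'):
    """Build a Wikimedia REST API URL for monthly per-article page views.
    Dates are YYYYMMDD strings; `article` uses underscores, not spaces."""
    return ('https://wikimedia.org/api/rest_v1/metrics/pageviews/'
            f'per-article/{project}/all-access/all-agents/'
            f'{article}/monthly/{start}/{end}')

url = pageviews_url('Frances_Pleasonton', '20171201', '20180601')
# import requests
# resp = requests.get(url, headers={'User-Agent': 'pageviews-demo'})
# monthly = {item['timestamp']: item['views'] for item in resp.json()['items']}
```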

I used matplotlib and seaborn to show the results.

Impact of the campaign

So: how many people has Jess helped reach information on women in STEM? Over 200,000. This is simply astonishing.

Number of page views as a function of time

The blue line shows the cumulative total number of page views of all pages. The green lines show just how much hard work this has been - there is one for every individual page created. I’ve put in a few of the scientists’ names. Note that the page views data lag a bit behind the page creations.

To put the total number of views into some sort of context, the Royal Society Summer Science Exhibition, which I ran a stand at in 2014, gets around 12,000 visitors per year. Another comparison is that there were fewer than 100,000 undergraduates studying physical sciences in the UK in 2014-2015. So this is genuinely reaching an amazing number of people.

In the figure below, you can see a few of the most popular pages for 2018 so far:

Most visited articles 2018

It’s hard to know who is looking at these pages, but it’s certain that they wouldn’t have been if Jess hadn’t created them (and inspired others to do the same). As well as Dr. Stuart Higgins’ Science in the Supermarket from my previous post, I think this is a great example of how innovative outreach can be more effective in reaching audiences.

Who is not participating in Higher Education?

Given my work in both economics and Science, Technology, Engineering, and Mathematics (STEM), I’ve become interested in what factors determine groups’ participation in higher education, what groups are being left out, and what might be done about it.

Poverty means low participation

According to a Social Mobility Commission report from 2016, the most important determinant of whether someone goes to university at all or not is poverty, or, more precisely, whether someone receives free school meals. This applies across gender and ethnicity, though as the report notes “Disadvantaged young people from White British backgrounds are the least likely to access Higher Education”.

A lack of diversity in socio-economic background is perhaps less visible than some other troubling aspects of participation. But, if diversity matters at all, all dimensions of diversity matter.

Unfortunately, people from lower income/wealth backgrounds are some of the most difficult to influence with outreach campaigns as they tend to live in “remote rural or coastal areas and in former industrial areas, especially in the Midlands” according to the 2017 Social Mobility Commission’s ‘State of the nation’ report. I’m from one of the parts of the UK specifically identified in this report, the High Peak, and it’s unfortunately not all that surprising. Higher education institutions, and jobs which require advanced qualifications, are physically and psychologically at a distance. Other poorly ranked areas are similar: they include West Somerset (324 of 324), Thanet (274 of 324), and a cluster around West Norfolk.

There are detailed data on participation in higher education amongst young people available from the Office for Students. I’ve made a choropleth of these data below. The geographical areas with low participation are much the same as the problem areas identified in the report on social mobility. If you’re not interested in where the data come from, skip the box below the figure.

Youth higher education participation rate by local authority district. Shown: Manchester and the Peak District.

Data on youth HE participation

The Office for Students provides data on the number of young people who participate in HE by middle layer super output area. These are quite small areas, so I’ve aggregated to local authority districts using a mapping which comes from data on households in poverty. I plotted these data with folium using maps from the ONS Open Geography portal. Minor gripe: no geojson format was available, so I had to make my own from the shapefiles.

Science in the supermarket

Recently, I discussed how to reach those with the least HE participation with outreach superstar and Imperial College London colleague Dr. Stuart Higgins (whose award-winning podcast Scientists Not The Science is worth checking out). As I understand it, the best advice - based on research - is that you need to show young students a path into higher education which could work for them; that it’s feasible, that it’s for people ‘like them’, and that they’re good enough to make it.

I was talking to Stuart because of an amazing recent initiative he’s been involved with called Science in the Supermarket which puts what he’s learned into practice. Stuart and some other volunteers supported by Imperial College went to a supermarket in Somerset to engage young and old alike with science demos, and to tell them about careers in STEM. Although on a small scale, I think the brilliance of this initiative is that it avoids the self-selection problem which some other outreach programmes suffer from. I would love to see Economists in the Supermarket, or even Applied Mathematics in the Supermarket!

Update 25/08/18

Stuart has written up the results of the Science in the Supermarket project he ran so that others can learn from it. Laudably, by setting out everything from the project timetable, to the letters asking for volunteers, to the design of the meta-evaluation, to the costs, Stuart has made this intervention as reproducible as possible. Others can build upon what he has done. It’s a more scientific way to run an outreach programme.

Stuart gave me some informal pointers on ‘what I would think about if starting another project’ which I’ve made some minor tweaks to and reproduced below:

  • Understand your own motivation and define a target; trying to approach a big problem can feel overwhelming and paralysing, starting with a specific, local goal can help
  • Accept that balancing engagement with a day job is challenging
  • Set a realistic scope for the project and accept that ‘good enough’ is good enough
  • If possible, get both bottom-up (to help share the workload), and top-down support (to add legitimacy, open doors to resources, etc)
  • Try and be evidence-based where possible

Another resource he mentioned is this Aspires Report on ‘Young people’s science and career aspirations’. The two key findings I took away from it were that young people aren’t necessarily aware of the careers which science can open up (economics!) and that ‘science capital’ is a strong predictor of aspiring to a career in science but that this capital is unevenly distributed across socio-economic groups.

Processing all of this, it seems like making STEM careers and/or STEM practitioners familiar to young people is one of the most useful outcomes outreach programmes can strive for.

Why the latest, most exciting thing in machine learning is... game theory

And when I say latest, this particular method was invented in 1953.

Machine learning has interpretability issues. New EU legislation, the General Data Protection Regulation, includes a line about “the right … to obtain an explanation of the decision reached”, including by an algorithm.

Of course, there are many other good reasons to want the decisions of algorithms to be understandable and explainable. Interrogating why an algorithm makes the choices it does can highlight whether it’s working as intended, and, in some situations - such as public policy - transparency and interpretability may be essential ingredients of decision making.

But non-linear models are just not that easy to decompose into their fundamental components; they are, to an extent, a ‘black box’. Ideally, we would be able to find the contribution of each input feature to the final prediction. In linear models, this is trivially achieved by the combination of the level of a feature and its regression coefficient. That is, for a linear model $f$ with features $x_{i\nu}$, $\nu \in \{1,\dots,p\}$, at a point $i$ such that

$$ f(x_i) = \beta_0 + \sum_{\nu=1}^{p} x_{i\nu}\beta_\nu \,,$$

the contribution from feature $\nu$ is $x_{i\nu}\cdot\beta_\nu$. In non-linear models, it’s not so simple.
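
As a quick illustration of that linear decomposition, here is a minimal sketch with made-up data (the coefficients and features are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))       # 100 points i, p = 3 features
beta = np.array([0.5, -1.0, 2.0])   # regression coefficients (no intercept)
preds = X @ beta                    # linear model predictions f(x_i)

# Contribution of feature nu at point i: x_{i,nu} * beta_nu
contributions = X * beta

# The per-feature contributions sum exactly to each prediction
assert np.allclose(contributions.sum(axis=1), preds)
```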

Shapley values

Game theory to the rescue. In 1953 Lloyd Shapley introduced values which effectively find, for a co-operative game, each player’s marginal contribution, averaged over every possible sequence in which the players could have been added to the group of players (Alvin Roth talks about it here). These are called Shapley values and, in a nutshell, they are the average expected marginal contribution of one player after all possible combinations of players have been considered.

This is exactly the kind of problem we want to solve to understand how different features contribute to a predicted value in a non-linear model, for instance in a machine learning model. But it’s easier to understand them in the linear case. The Shapley value for the linear model above would be, for feature $\nu$:

$$ \phi_{i\nu}(f) = \beta_\nu x_{i\nu} - \beta_\nu E[x_\nu] \,,$$

where no Einstein summation is implied. Summing over the different features gives back a number which is simply related to the overall prediction given by $f$:

$$ \sum_{\nu=1}^{p} \phi_{i\nu}(f) = f(x_i) - E[f(x)] \,.$$

The general equation for Shapley values looks more complicated, but is described by a function $g$ that assigns a real number to each coalition $S$, that is, to each subset of the combination of features, such that $g(S)$ represents the amount (of money or of utility) that coalition $S$ is able to transfer among its members in any way that they all agree to. Here it is:

$$ \phi_\nu(g) = \sum_{S \subseteq \{1,\dots,p\} \setminus \{\nu\}} \frac{|S|!\,(p - |S| - 1)!}{p!} \left( g(S \cup \{\nu\}) - g(S) \right) $$

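To make the formula concrete, here is a minimal brute-force implementation of it; the toy value function `g` (a weighted sum over members, with a bonus when players 0 and 1 co-operate) is made up purely for illustration:

```python
from itertools import combinations
from math import factorial

def shapley_value(nu, players, g):
    """Exact Shapley value of player `nu`, summing over every coalition S."""
    p = len(players)
    others = [j for j in players if j != nu]
    phi = 0.0
    for size in range(p):
        for S in combinations(others, size):
            # Weight |S|! (p - |S| - 1)! / p! from the general formula
            weight = factorial(len(S)) * factorial(p - len(S) - 1) / factorial(p)
            phi += weight * (g(frozenset(S) | {nu}) - g(frozenset(S)))
    return phi

# Toy cooperative game: each member adds its weight, and players 0 and 1
# generate a bonus of 4 when they are in the same coalition.
def g(S):
    weights = {0: 1.0, 1: 2.0, 2: 3.0}
    return sum(weights[i] for i in S) + (4.0 if {0, 1} <= S else 0.0)

phis = [shapley_value(i, [0, 1, 2], g) for i in range(3)]
# Efficiency: the values sum to the payoff of the grand coalition (g(∅) = 0 here)
```

Note how the bonus is split equally between players 0 and 1 (each gets 2 on top of their own weight), while player 2, who contributes 3 to every coalition, gets exactly 3.
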
Shapley values for machine learning

Shapley values have a number of nice properties which are both familiar from linear decompositions/linear models and highly desirable for machine learning models:

  • the Shapley value contributions sum to the difference between the full prediction and the average prediction (efficiency)

  • two features which contribute equally to any subset to which they’re added have the same Shapley value (substitutability/symmetry)

  • a feature which doesn’t influence the predicted value has a Shapley value of 0 (dummy player)
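
The first and third of these properties are easy to check numerically in the linear case, where the Shapley values reduce to $\phi_{i\nu} = \beta_\nu(x_{i\nu} - E[x_\nu])$; a sketch with arbitrary made-up coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
beta = np.array([1.0, -2.0, 0.5, 3.0])
preds = X @ beta

# Linear-model Shapley values: phi_{i,nu} = beta_nu * (x_{i,nu} - mean(x_nu))
phi = (X - X.mean(axis=0)) * beta

# Efficiency: contributions sum to prediction minus average prediction
assert np.allclose(phi.sum(axis=1), preds - preds.mean())

# Dummy player: a feature with zero coefficient gets Shapley value 0
beta_dummy = np.array([1.0, -2.0, 0.5, 0.0])
phi_dummy = (X - X.mean(axis=0)) * beta_dummy
assert np.allclose(phi_dummy[:, 3], 0.0)
```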

These nice properties are not trivial to achieve for non-linear models, and Shapley values are the only attribution method that achieves them all concurrently. They’re also what suggests to me that Shapley values will become the primary interpretability method used and understood. There must be some catch, right?

There is, which is why other methods, such as local surrogate models like LIME, are not going away anytime soon. If the factorials and the sum over all combinations of input features in the equation didn’t give it away, Shapley values are computationally expensive. As this paper points out, “every exact algorithm for the Shapley value requires an exponential number of operations”. Oh dear.

The good news is that there are good approximations out there. The even better news is that there is a Python library called shap which implements a fast approximation method, is easy to use, and is even optimised for sklearn. The paper behind this is here.
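
To give a flavour of how such approximations work (this is not shap’s algorithm, just the simplest possible approach): since the Shapley value is an average marginal contribution over orderings of the players, sampling random orderings gives an unbiased estimate. A pure-Python sketch, with a made-up toy game:

```python
import random

def shapley_monte_carlo(nu, players, g, n_samples=2000, seed=0):
    """Estimate the Shapley value of `nu` by sampling random player orderings."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        order = players[:]
        rng.shuffle(order)
        # Coalition S = the players that happen to precede `nu` in this ordering
        S = set(order[:order.index(nu)])
        total += g(S | {nu}) - g(S)
    return total / n_samples

# Toy game: weighted sum of members, plus a bonus when 0 and 1 co-operate
def g(S):
    weights = {0: 1.0, 1: 2.0, 2: 3.0}
    return sum(weights[i] for i in S) + (4.0 if {0, 1} <= S else 0.0)

est = shapley_monte_carlo(0, [0, 1, 2], g)  # exact value is 3.0
```

With $p$ features this replaces a sum over $2^p$ coalitions with however many samples you can afford; shap’s methods are considerably cleverer, exploiting model structure to do better still.
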

Not everyone is convinced by Shapley values, but I think they could be particularly important as they have properties which are so clearly and neatly analogous to decompositions of linear models.

If you’d like to find out more about how Shapley values work, see these excellent explainer blog posts which I drew on heavily for this post: