Markov Wanderer: A blog on economics, science, coding and data. Views are my own.

Specification curve analysis

Econometrics in Python Series - Part V

When specifying a causal model, modellers have a number of options. These can be informed by field intelligence, priors, and even misguided attempts to find a significant result. Even with the best of intentions, research teams can reach entirely different conclusions using the same, or similar, data because of different choices made in preparing data or in modelling it.

Typically this happens when there isn’t a clear way to do ‘feature engineering’ on the data. For example, you have a high frequency time series which needs to be aggregated to a lower frequency: you could take the maximum, the minimum, or the average over each high frequency time period. A different choice may be appropriate in different settings.
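
For instance, with pandas each of those choices is a one-line aggregation. Here's a minimal sketch on a made-up daily series (not data from this post):

import numpy as np
import pandas as pd

# A made-up daily series to be aggregated to monthly frequency
s = pd.Series(np.random.default_rng(0).normal(size=365),
              index=pd.date_range('2018-01-01', periods=365, freq='D'))

monthly_mean = s.resample('M').mean()  # average over each month
monthly_max = s.resample('M').max()    # maximum over each month
monthly_min = s.resample('M').min()    # minimum over each month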

There’s formal evidence that researchers really do make different decisions; this study gave the same research question - whether soccer referees are more likely to give red cards to dark-skin-toned players than to light-skin-toned players - to 29 different teams. From the abstract of that paper:

Analytic approaches varied widely across the teams, and the estimated effect sizes ranged from 0.89 to 2.93 (Mdn = 1.31) in odds-ratio units. Twenty teams (69%) found a statistically significant positive effect, and 9 teams (31%) did not observe a significant relationship. Overall, the 29 different analyses used 21 unique combinations of covariates. Neither analysts’ prior beliefs about the effect of interest nor their level of expertise readily explained the variation in the outcomes of the analyses. Peer ratings of the quality of the analyses also did not account for the variability.

So not only were different decisions made, but there also seems to be no clearly identifiable reason for them (although, getting a bit meta, perhaps other authors would have analysed this question differently!)

There is usually scope for reasonable alternative model specifications when estimating causal coefficients, and those coefficients will vary with those specifications. Let's abuse notation and call this property

$$ \frac{\text{d} \beta}{\text{d} \text{ specification}} \neq 0 $$

where $\beta$ is the coefficient of interest.

What can we do to ensure conclusions are robust to model specification change when that change is due to equally valid feature engineering-type choices? The art is all in deciding what is meant by, or what is a valid form for, $\text{d} \text{ specification}$ and showing that, even under different specifications, the estimates of $\beta$ are robust.

It’s standard in economics to include many different model specifications in order to demonstrate robustness to different specifications. For the same target variable in the same context, there might be five or six of these alternative specifications. The picture below, from Autor, Dorn, and Hanson’s paper China Syndrome, gives a flavour.

Table 3 of 'China Syndrome'

But there may be times when it's appropriate to show many more specifications than this, for example in a contested area, or an area in which the feature-engineering choices are very unclear.

Enter specification curve analysis

One way to more comprehensively analyse $ \frac{\text{d} \beta}{\text{d} \text{ specification}}$ is specification curve analysis.

Specification curve analysis, as introduced in this paper, looks for a more exhaustive way of trying out alternative specifications. From the paper, the three steps of specification curve analysis are:

  1. identifying the set of theoretically justified, statistically valid, and non-redundant analytic specifications;
  2. displaying alternative results graphically, allowing the identification of decisions producing different results; and
  3. conducting statistical tests to determine whether as a whole results are inconsistent with the null hypothesis.

For a good example of specification curve analysis in action, see this recent Nature Human Behaviour paper on the association between adolescent well-being and the use of digital technology.

An example in Python

This example is going to use the concrete data I’ve used previously to look at the effect of ‘superplasticizer’ on the compressive strength of concrete. I’m going to skip over step 1 quickly, as it will vary a lot depending on your dataset.

Step 1

The data don’t actually require any feature engineering, so we’ll have to pretend that - beyond those two key variables - we’re not sure whether other features should be included or not.

Let's make it a bit more interesting and say that 'coarse' and 'fly' are based on the same raw data; they are just engineered differently for the analysis. Therefore we will not include them in the model at the same time. That covers step 1.

Step 2

For step 2, displaying alternative results graphically, we need the data and the code.

First, let’s set up the environment, then read in the data:

import pandas as pd
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from itertools import combinations
import matplotlib as mpl
import sklearn.utils
jsonPlotSettings = {'ytick.labelsize': 20,
'xtick.labelsize': 20,
'font.size': 22,
'figure.figsize': (10,5),
'axes.titlesize' : 22,
'axes.labelsize': 20,
'lines.linewidth': 2,
'lines.markersize' : 6,
'legend.fontsize': 18,
'mathtext.fontset': 'stix',
'font.family': 'STIXGeneral'}
plt.style.use(jsonPlotSettings)
df = pd.read_excel('../../ManyRegsPandas/Concrete_Data.xls')
df = df.rename(columns=dict(zip(df.columns,[x.split()[0] for x in df.columns])))
print(df.head())
   Cement  Blast  Fly  Water  Superplasticizer  Coarse   Fine  Age   Concrete
0   540.0    0.0  0.0  162.0               2.5  1040.0  676.0   28  79.986111
1   540.0    0.0  0.0  162.0               2.5  1055.0  676.0   28  61.887366
2   332.5  142.5  0.0  228.0               0.0   932.0  594.0  270  40.269535
3   332.5  142.5  0.0  228.0               0.0   932.0  594.0  365  41.052780
4   198.6  132.4  0.0  192.0               0.0   978.4  825.5  360  44.296075

This is the pure question - what dependence does concrete strength have on the use of superplasticizer?

results = sm.OLS(df['Concrete'], df['Superplasticizer']).fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               Concrete   R-squared:                       0.578
Model:                            OLS   Adj. R-squared:                  0.578
Method:                 Least Squares   F-statistic:                     1410.
Date:                Fri, 25 Jan 2019   Prob (F-statistic):          5.29e-195
Time:                        xx:xx:xx   Log-Likelihood:                -4804.2
No. Observations:                1030   AIC:                             9610.
Df Residuals:                    1029   BIC:                             9615.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Superplasticizer     3.4897      0.093     37.544      0.000       3.307       3.672
==============================================================================
Omnibus:                       20.707   Durbin-Watson:                   0.639
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               22.783
Skew:                          -0.298   Prob(JB):                     1.13e-05
Kurtosis:                       3.420   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

That’s the baseline regression, with $\beta = 3.4897$. Now we need to try the alternative specifications.

We have 7 potential control variables. It's worth bearing in mind, for computational reasons, what the upper limit on the number of specifications you could run is. The number of ways of choosing $k$ controls from $n$ candidates is $n$ choose $k$, or

$$ \binom{n}{k} = \frac{n!}{k!(n-k)!} $$

We want to look at all possible values of $k$, which gives

$$ \sum_{k=0}^{n} \binom{n}{k} = 2^n $$

specifications. So this is not feasible as $n$ gets very large, but it should be okay here.
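
As a quick check (assuming Python 3.8+, which provides math.comb), summing over every $k$ for $n = 7$ controls gives the $2^7 = 128$ specifications used below:

from math import comb

n = 7  # number of candidate control variables
print(sum(comb(n, k) for k in range(n + 1)), 2**n)  # both are 128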

In this case, there are also some mutually exclusive combinations which will reduce the overall number - remember I decided that 'coarse' and 'fly' are different ways of creating the same variable. Let's create all possible $2^7 = 128$ combinations first. We can use the combinations function from Python's itertools to do this.

# A list of the controls
controls = [x for x in df.columns if x not in ['Concrete','Superplasticizer']]
# Generate all combinations in a list of tuples
Allcomb = [combinations(controls, k) for k in range(len(controls)+1)]
# Flatten this into a single list of tuples
Allcomb = [item for sublist in Allcomb for item in sublist]
# Turn all the tuples into lists
Allcomb = [list(x) for x in Allcomb]

Let's have a look at some of these: the first 5, a random sample of 5, the last one, and the total number of combinations.

print(Allcomb[:5])
for i in np.random.choice(len(Allcomb), 5):
    print(Allcomb[i])
print(Allcomb[-1])
print(len(Allcomb))
[[], ['Cement'], ['Blast'], ['Fly'], ['Water']]
['Fly', 'Water', 'Coarse', 'Age']
['Cement', 'Water', 'Fine']
['Blast', 'Fly', 'Coarse', 'Fine', 'Age']
['Cement', 'Blast', 'Coarse', 'Age']
['Blast', 'Water', 'Coarse']
['Cement', 'Blast', 'Fly', 'Water', 'Coarse', 'Fine', 'Age']
128

Note that the original specification is included here as [], i.e. no control. We now need to remove the mutually exclusive combinations - that is any combination which has both ‘Coarse’ and ‘Fly’ in it. Then we’ll look at the last entry to see if it has worked.

Allcomb = [x for x in Allcomb if not (('Coarse' in x) and ('Fly' in x))]
Allcomb[-1]
['Cement', 'Blast', 'Water', 'Coarse', 'Fine', 'Age']

Great - the old last combination, which mixed features, has been dropped. Now we need to iterate over all possible regression specifications and store the coefficient calculated in each one.

AllResults = [sm.OLS(df['Concrete'], 
                      df[['Superplasticizer']+x]).fit() for x in Allcomb]

You can see this has run all of the possible combinations; here are the regression results for the last entry:

AllResults[-1].params
Superplasticizer    0.840783
Cement              0.085463
Blast               0.064191
Water              -0.119120
Coarse              0.016815
Fine                0.002805
Age                 0.106915
dtype: float64

Great. Let’s store the results in a dataframe. As well as the coefficient on superplasticizer, I’ll store the standard errors, ‘bse’, and the pvalues for the independent variables. I’ll then reorder everything by coefficient value.

# Get coefficient values and specifications
df_r = pd.DataFrame([x.params['Superplasticizer'] for x in AllResults],columns=['Coefficient'])
df_r['Specification'] = Allcomb
# Get std err and pvalues
df_r['bse'] = [x.bse['Superplasticizer'] for x in AllResults]
df_r['pvalues'] = [x.pvalues for x in AllResults]
df_r['pvalues'] = df_r['pvalues'].apply(lambda x: dict(x))
# Re-order by coefficient
df_r = df_r.sort_values('Coefficient')
df_r = df_r.reset_index().drop('index',axis=1)
df_r.index.names = ['Specification No.']
print(df_r.sample(10))
                   Coefficient                 Specification       bse  \
Specification No.                                                        
31                    1.044216  [Cement, Blast, Coarse, Age]  0.059440   
27                    1.034839   [Cement, Blast, Water, Age]  0.058165   
58                    1.290024                   [Fine, Age]  0.079633   
62                    1.336140  [Blast, Water, Coarse, Fine]  0.095310   
45                    1.154499            [Cement, Fly, Age]  0.072391   
19                    0.912858                [Cement, Fine]  0.072651   
55                    1.243370                [Coarse, Fine]  0.086451   
50                    1.196307   [Cement, Coarse, Fine, Age]  0.067479   
25                    1.008358        [Cement, Coarse, Fine]  0.074518   
93                    2.842257                         [Age]  0.073861   

                                                             pvalues  
Specification No.                                                     
31                 {'Superplasticizer': 1.3490880141286832e-60, '...  
27                 {'Superplasticizer': 6.447248960284443e-62, 'C...  
58                 {'Superplasticizer': 9.824299541334832e-53, 'F...  
62                 {'Superplasticizer': 5.604831921131288e-41, 'B...  
45                 {'Superplasticizer': 2.5456524931721465e-51, '...  
19                 {'Superplasticizer': 8.7290431310275e-34, 'Cem...  
55                 {'Superplasticizer': 7.235976198602693e-43, 'C...  
50                 {'Superplasticizer': 1.5168657130127636e-61, '...  
25                 {'Superplasticizer': 1.6517230301301733e-38, '...  
93                 {'Superplasticizer': 2.233901784516485e-201, '...  

Now I will plot the results for the coefficient as a function of the different specifications, adding the standard errors as a swathe.

plt.close('all')
fig, ax = plt.subplots()
ax.scatter(df_r.index,df_r['Coefficient'],lw=3.,label='',s=0.4,color='b')
ax.set_xlabel(df_r.index.name)
ax.set_ylabel('Coefficient')
ax.yaxis.major.formatter._useMathText = True
ax.axhline(color='k',lw=0.5)
ax.axhline(y=np.median(df_r['Coefficient']),color='k',alpha=0.3,label='Median',dashes=[12, 5])
ax.fill_between(df_r.index, df_r['Coefficient']+df_r['bse'], df_r['Coefficient']-df_r['bse'],color='b', alpha=0.3)
ax.legend(frameon=False, loc='upper left',ncol=2,handlelength=4)
plt.show()

Estimated coefficient on superplasticizer, ordered by size, with the standard error swathe and median line

Let’s now have a matrix which shows, for each specification, whether a particular set of features was included. There are 7 features, so there’ll be 7 rows, and we should expect no column to have both ‘Coarse’ and ‘Fly’ highlighted. There’s going to be some data wrangling to do this: I’ll first sort each row in the specification column alphabetically, then count the occurrences of each control variable in each row (0 or 1).

Then, to go from a column where each cell is a dict of counts of control variables in that row's specification, I'll transform to a set of columns, one for each control variable; the cells will contain the counts. The counts should all be 0 or 1, so I'll then map them into boolean values.

With a matrix of 0s and 1s with rows as specifications and columns as variables, I can easily create a heatmap.

df_r['Specification'] = df_r['Specification'].apply(lambda x: sorted(x))
df_r['SpecificationCounts'] = df_r['Specification'].apply(lambda x: Counter(x))
print(df_r.head(5))
                   Coefficient                           Specification  \
Specification No.                                                        
0                     0.228428  [Age, Blast, Cement, Fine, Fly, Water]   
1                     0.327962       [Blast, Cement, Fine, Fly, Water]   
2                     0.468836             [Blast, Cement, Fly, Water]   
3                     0.522836        [Age, Blast, Cement, Fly, Water]   
4                     0.653542              [Blast, Cement, Fine, Fly]   

                        bse  \
Specification No.             
0                  0.087860   
1                  0.104747   
2                  0.088731   
3                  0.075540   
4                  0.076913   

                                                             pvalues  \
Specification No.                                                      
0                  {'Superplasticizer': 0.009459124471543073, 'Ce...   
1                  {'Superplasticizer': 0.0017915187476705682, 'C...   
2                  {'Superplasticizer': 1.5457095399610106e-07, '...   
3                  {'Superplasticizer': 7.881232377381058e-12, 'C...   
4                  {'Superplasticizer': 6.77195621959008e-17, 'Ce...   

                                                 SpecificationCounts  
Specification No.                                                     
0                  {'Age': 1, 'Blast': 1, 'Cement': 1, 'Fine': 1,...  
1                  {'Blast': 1, 'Cement': 1, 'Fine': 1, 'Fly': 1,...  
2                    {'Blast': 1, 'Cement': 1, 'Fly': 1, 'Water': 1}  
3                  {'Age': 1, 'Blast': 1, 'Cement': 1, 'Fly': 1, ...  
4                     {'Blast': 1, 'Cement': 1, 'Fine': 1, 'Fly': 1}  
df_spec = df_r['SpecificationCounts'].apply(pd.Series).fillna(0.)
df_spec = df_spec.replace(0.,False).replace(1.,True)
print(df_spec.head(10))
                     Age  Blast  Cement   Fine    Fly  Water  Coarse
Specification No.                                                   
0                   True   True    True   True   True   True   False
1                  False   True    True   True   True   True   False
2                  False   True    True  False   True   True   False
3                   True   True    True  False   True   True   False
4                  False   True    True   True   True  False   False
5                  False   True    True  False   True  False   False
6                  False   True    True  False  False   True    True
7                  False   True    True   True  False   True    True
8                  False   True    True   True  False   True   False
9                   True   True    True   True  False   True    True
fig = plt.figure()
ax = fig.add_subplot(111)
ax.imshow(df_spec.T, aspect='auto', cmap=plt.cm.gray_r, interpolation='None')
ax.set_xlabel(df_r.index.name)
ax.set_ylabel('Control')
plt.yticks(range(len(df_spec.columns)),df_spec.columns)
ax.yaxis.major.formatter._useMathText = True

Matrix showing which controls are included in each specification

Now let’s try colouring these depending on whether they are significant or not. We’ll use the plasma colormap, which here will mean that a blueish colour implies significance.

This will follow a somewhat similar approach but begins with the pvalues. The first step is to convert the dict of pvalues to columns, one for each variable, in a new dataframe. I’ll then sort the columns and set the cell values to 0 for significant, 1 for insignificant (at the 0.05 level), and leave missing entries as NaNs. When it comes to plotting, I’ll set those NaNs to appear white while the valid in/significant entries appear in the colours of the plasma heatmap.

df_params = df_r['pvalues'].apply(pd.Series)
df_params = df_params.reindex(sorted(df_params.columns), axis=1)
df_params[np.abs(df_params)>0.05] = 1 # Insignificant
df_params[df_params<=0.05] = 0. # Significant
df_params['Coefficient'] = df_r['Coefficient']
print(df_params.head(5))
                   Age  Blast  Cement  Coarse  Fine  Fly  Superplasticizer  \
Specification No.                                                            
0                  0.0    0.0     0.0     NaN   0.0  0.0               0.0   
1                  NaN    0.0     0.0     NaN   0.0  0.0               0.0   
2                  NaN    0.0     0.0     NaN   NaN  0.0               0.0   
3                  0.0    0.0     0.0     NaN   NaN  0.0               0.0   
4                  NaN    0.0     0.0     NaN   0.0  0.0               0.0   

                   Water  Coefficient  
Specification No.                      
0                    0.0     0.228428  
1                    0.0     0.327962  
2                    0.0     0.468836  
3                    0.0     0.522836  
4                    NaN     0.653542  
fig = plt.figure()
ax = fig.add_subplot(111)
cmap = plt.cm.plasma
cmap.set_bad('white',1.)
ax.imshow(df_params[controls].T, aspect='auto', cmap=cmap, interpolation='None')
ax.set_xlabel(df_params.index.name)
ax.set_ylabel('Control')
plt.yticks(range(len(controls)),controls)
ax.yaxis.major.formatter._useMathText = True

Significance of each control by specification (white cells mean the control was not included)

Step 3

Considering the full set of reasonable specifications jointly, how inconsistent are the results with the null hypothesis of no effect?

This step uses a permutation test which shuffles up the data and re-runs the regressions. It assumes exchangeability, i.e. that the rows are not related in any way. In the original paper on specification curve analysis by Simonsohn et al., they discuss the example of whether hurricanes with more feminine names are perceived as less threatening and hence lead to fewer precautionary measures by the general public, as examined originally in this paper. If you’re interested, Simonsohn et al. accept the null of there being no difference in precautionary behaviour based on the name of the hurricane using specification curve analysis.

So, to do this, we're going to shuffle the values of the variable of interest. In our toy example, that's superplasticizer. As the authors put it (in the context of the hurricane names example),

The shuffled datasets maintain all the other features of the original one (e.g., collinearity, time trends, skewness, etc.) except we now know there is no link between (shuffled) names and fatalities; the null is true by construction.

In our case, it is the superplasticizer values, rather than hurricane names, that will be shuffled. Let's define a function which makes a copy of the dataframe, shuffles superplasticizer, and re-runs every specification:

Num_shuffles = 50

def retShuffledResults():
    """Re-run every specification Num_shuffles times on data in which
    superplasticizer has been shuffled, so that the null of no effect
    is true by construction."""
    allResults_shuffle = []
    for i in range(Num_shuffles):
        df_shuffle = df.copy(deep=True)
        # Shuffle only the variable of interest
        df_shuffle['Superplasticizer'] = sklearn.utils.shuffle(df['Superplasticizer'].values)
        Results_shuffle = [sm.OLS(df_shuffle['Concrete'], 
                     df_shuffle[['Superplasticizer']+x]).fit() for x in Allcomb]
        allResults_shuffle.append(Results_shuffle)
    return allResults_shuffle
    
allResults_shuffle = retShuffledResults()
df_r_shuffle = pd.DataFrame([[x.params['Superplasticizer'] for x in y] for y in allResults_shuffle])
df_r_shufflepval = pd.DataFrame([[x.pvalues['Superplasticizer'] for x in y] for y in allResults_shuffle])
print(df_r_shuffle.head())

         0         1         2         3         4         5         6   \
0  3.017799  0.348324  2.103696  2.342652  0.238608  0.119278  0.152364   
1  2.939502  0.205683  2.009524  2.243891  0.044811 -0.042069 -0.006277   
2  3.004296  0.255635  2.127853  2.322167  0.218430  0.084593  0.127544   
3  3.031353  0.338988  2.118547  2.364655  0.234529  0.171963  0.182143   
4  2.969443  0.250435  2.034338  2.294939  0.123191  0.026125  0.037847   

         7         8         9     ...           86        87        88  \
0  2.124654  0.152692  0.216249    ...     0.077730  0.052367  0.043836   
1  2.078909  0.014767  0.071263    ...    -0.047398  0.002010 -0.005702   
2  2.148499  0.116719  0.112361    ...     0.040043  0.069590  0.071732   
3  2.168407  0.140604  0.217297    ...     0.102334  0.134740  0.101656   
4  2.098849  0.042894  0.140568    ...    -0.033597 -0.001233 -0.028179   

         89        90        91        92        93        94        95  
0  0.031032  0.087474  0.086622  0.048941  0.016861 -0.011674  0.024902  
1 -0.068846  0.009561  0.009350 -0.017208 -0.034570 -0.035247 -0.016576  
2  0.043392  0.037542  0.044300  0.129716  0.089750  0.015758  0.050699  
3  0.048139  0.145640  0.155569  0.130373  0.135638  0.066984  0.104164  
4 -0.057247  0.045333  0.027806 -0.013531 -0.028678 -0.021878 -0.035316  

[5 rows x 96 columns]

Notice that there are multiple shuffled regressions (one per shuffle) for each specification number. We take the median over all shuffles for each specification number:

med_shuffle = df_r_shuffle.quantile(0.5).sort_values().reset_index().drop('index',axis=1)

These data can be added onto the main plot, along with everything else:

plt.close('all')
f, axarr = plt.subplots(2, sharex=True,figsize=(10,10))
for ax in axarr:
    ax.yaxis.major.formatter._useMathText = True
axarr[0].scatter(df_r.index,df_r['Coefficient'],
                 lw=3.,
                 s=0.6,
                 color='b',
                 label='Coefficient')
axarr[0].scatter(med_shuffle.index,
                 med_shuffle.iloc[:,0],
                 lw=3.,
                 s=0.6,
                 color='r',
                 marker='d',
                 label='Coefficient under null (median over bootstraps)')
axarr[0].axhline(color='k',lw=0.5)
# use if you wish to label the original specification
#orig_spec = df_r[df_r['Specification'].apply(lambda x: not x)]
#axarr[0].scatter(orig_spec.index,orig_spec['Coefficient'],s=100.,color='k',label='Original specification')
axarr[0].axhline(y=np.median(df_r['Coefficient']),
                 color='k',
                 alpha=0.3,
                 label='Median coefficient',
                 dashes=[12, 5])
axarr[0].fill_between(df_r.index, 
                      df_r['Coefficient']+df_r['bse'], 
                      df_r['Coefficient']-df_r['bse'],
                      color='b',
                      alpha=0.3)
axarr[0].legend(frameon=False, loc='upper left',ncol=1,handlelength=4,markerscale=10)
axarr[0].set_ylabel('Coefficient')
axarr[0].set_title('Specification curve analysis')
cmap = plt.cm.plasma
cmap.set_bad('white',1.)
axarr[1].imshow(df_params[controls].T, aspect='auto', cmap=cmap, interpolation='None')
axarr[1].set_ylabel('Controls')
axarr[1].set_xlabel(df_r.index.name)
axarr[1].set_yticks(range(len(controls)))
axarr[1].set_yticklabels(controls)
plt.subplots_adjust(wspace=0, hspace=0.05)
plt.show()

Specification curve analysis: coefficients and shuffled-data medians (top); controls included in each specification (bottom)

The authors of the specification curve analysis paper provide three measures of whether, as a whole, the null should be rejected: (i) the median overall point estimate; (ii) the share of estimates in the specification curve that are of the dominant sign; and (iii) the share that are of the dominant sign and also statistically significant (p < .05).

Step 3 part (i)

Measure (i) is calculated from the share of coefficient estimates with results as extreme as, or more extreme than, the original. We divide the number of shuffled (bootstrapped) datasets with larger median effect sizes than the original analysis by the total number of shuffles; this gives the p-value of the test.

pvalue_i = np.double(sum(med_shuffle>np.median(df_r['Coefficient'])))/np.double(len(med_shuffle))
print('{:.3f}'.format(pvalue_i))
0.005

Step 3 part (ii)

(ii) requires this to be repeated, but only with results of the dominant sign. You can see from the plot that we're again going to get a very small p-value, but here's the process anyway. First, we determine the dominant sign and then calculate the p-value for part (ii).

gtr_than_zero = np.argmax( [len(df_r[df_r['Coefficient']<0.]), len(df_r[df_r['Coefficient']>0.])]) # 0 is <0 and 1 is >0
if(gtr_than_zero==1):
    gtr_than_zero = True
else:
    gtr_than_zero = False
print(gtr_than_zero)
if(gtr_than_zero):
    pvalue_ii = np.double(sum(med_shuffle[med_shuffle>0]>np.median(df_r['Coefficient'])))/np.double(len(med_shuffle[med_shuffle>0]))
else:
    pvalue_ii = np.double(sum(med_shuffle[med_shuffle<0]>np.median(df_r['Coefficient'])))/np.double(len(med_shuffle[med_shuffle<0]))
print('{:.3f}'.format(pvalue_ii))
True
0.005

Step 3 part (iii)

For part (iii), we repeat the same process, but only for those results which were statistically significant and of the dominant sign.

med_shuffle_signif = df_r_shuffle[df_r_shufflepval>0.05].quantile(0.5).sort_values().reset_index().drop('index',axis=1).dropna()
if(gtr_than_zero):
    pvalue_iii = np.double(sum(med_shuffle_signif[med_shuffle_signif>0]>np.median(df_r['Coefficient'])))/np.double(len(med_shuffle_signif[med_shuffle_signif>0]))
else:
    pvalue_iii = np.double(sum(med_shuffle_signif[med_shuffle_signif<0]>np.median(df_r['Coefficient'])))/np.double(len(med_shuffle_signif[med_shuffle_signif<0]))
print('{:.3f}'.format(pvalue_iii))
0.006

As seemed likely from visual inspection of the figures, the p-values are $\leq 0.01$ in each case. We have tested whether, considering all of the possible specifications together, the results are inconsistent with those obtained when the null hypothesis (that superplasticizer and strength are unrelated) is true by construction. On the basis of these p-values, we can safely reject that null. The tests as carried out strongly imply that $\beta > 0$ and that this conclusion is robust to specification change.

Conclusion

Researchers are always going to disagree about how to analyse the same data set. Although which specifications to include or exclude from specification curve analysis inevitably involves choices, I think that this is a useful and more comprehensive way to see how sensitive results are to those choices.

Putting women scientists onto Wikipedia

In a previous post, I shared links about the predictors for not participating in higher education, and about how it is difficult to reach audiences in “remote rural or coastal areas and in former industrial areas, especially in the Midlands” (according to the Social Mobility Commission). In this post, I look at another dimension of participation in higher education: gender.

Women are heavily under-represented in STEM (Science, Technology, Engineering, and Mathematics) subjects. In the UK, they make up just 25% of STEM undergraduates but 57% of the total undergraduate population.

It’s little better for economics, as this article in the Financial Times (£) shows, and the direction of the trend is worse: in the US, while the fraction of women undergraduates taking STEM subjects has increased, the fraction taking economics has declined. In the UK in 2011/12, it was 28% and trending downwards. The problems aren’t just widely held misapprehensions of what economics is about, or #WhatEconomistsDo. There is solid analytical work looking at ways in which the culture of economics may be hostile for women too. This work is nicely summarised by Prof. Diane Coyle (£), again in the Financial Times. Although both economics and STEM have a problem, I’ve mused before that economics could perhaps learn from science when it comes to outreach.

A campaign to inspire women to enter STEM subjects

My Imperial College London physics colleague Dr. Jess Wade (@jesswade on twitter) has come up with a novel way to help inspire more women to enter STEM subjects. She has been busily and heroically writing Wikipedia articles on women scientists of note since 2016. As she says,

“Wikipedia is a really great way to engage people in this mission because the more you read about these sensational women, the more you get so motivated and inspired by their personal stories.” - Dr. Jess Wade

Picked at random, here is one of the women whose Wikipedia page Jess has created: Frances Pleasonton, who worked on neutron decay.

What I think is most powerful about Jess’ approach is that it has huge reach, because Wikipedia has huge reach. Normally, it’s nigh on impossible to measure the impacts of outreach beyond a questionnaire issued at the end of an event. The audiences who attend science outreach events are typically self-selected, and they are rarely, if ever, followed over time to see if their relationship with science changes after the event.

Discussing her approach on BBC Radio 4’s Inside Science, Jess expressed her frustrations at well-meaning but likely ineffective outreach programmes which are costly and may do little to reach, or inspire, their intended audience. As was also noted on the programme, scientists can be endlessly methodical in the lab but - when it comes to outreach - their embrace of the scientific method could be better, and outreach programmes need to be better evaluated. Economists could definitely help here.

What is very cool about Jess’ campaign is that it is possible to get an idea, a rough one at least, of its impact. So just how huge is the reach of this campaign? Let’s find out.


Estimating the reach of Wikipedia pages

Feel free to skip this section if you’re not interested in the details of how the data were collected.

Wikipedia tracks page views, literally the number of times a wiki page has been requested. It’s not a perfect measure of the number of people viewing a webpage (you can find more info on the influences here) as some people are likely to be repeat visitors. Also, if an article is contentious, Wikipedia editors may visit it a lot. The debated page on Stanley Kubrick, for example, has had 396 edits by 203 editors since 2017 (at the time of checking).

So page views aren’t perfect, but they’re likely to be a good order of magnitude indicator of the number of people who have viewed a page.

To get all of the stats for the pages, I found Jess’ editor page, which includes an option to show all newly created pages. With some data wrangling via the beautifulsoup and pandas python packages, I obtained a list of people for whom pages were created. There may be a few extra pages which are not individuals included in error here, and perhaps some missing - but the wrangling should deliver most of them.

With the data on the names of the pages collected, I grabbed the page views using the handy wiki page view API and the requests python package. Here’s a snippet of the page views data table:

article     Willetta_Greene-Johnson  Xiangqian_Jiang  Yewande_Akinola
date
2017-12-01                      0.0              0.0              0.0
2018-01-01                      0.0              0.0              0.0
2018-02-01                      0.0              0.0            167.0
2018-03-01                      0.0             26.0            248.0
2018-04-01                      0.0              8.0            282.0
2018-05-01                    130.0             15.0            152.0
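
For reference, here is a minimal sketch of how one might pull monthly views for a single article with the requests package; the endpoint template and timestamp format are my reading of the public Wikimedia pageviews REST API, not the exact code used for the figures below:

import requests
import pandas as pd

# Assumed endpoint template for the public Wikimedia pageviews REST API;
# timestamps are in YYYYMMDDHH format
BASE = ('https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/'
        'en.wikipedia/all-access/all-agents/{article}/monthly/{start}/{end}')

def monthly_views(article, start='2017120100', end='2018060100'):
    # Return a series of monthly page views for one article
    r = requests.get(BASE.format(article=article, start=start, end=end),
                     headers={'User-Agent': 'pageviews-example'})
    r.raise_for_status()
    items = r.json()['items']
    return pd.Series({i['timestamp']: i['views'] for i in items}, name=article)

print(monthly_views('Frances_Pleasonton').head())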

I used matplotlib and seaborn to show the results.


Impact of the campaign

So: how many people has Jess helped reach information on women in STEM? Over 200,000. This is simply astonishing.

Number of page views as a function of time

The blue line shows the cumulative total number of page views of all pages. The green lines show just how much hard work this has been - there is one for every individual page created. I’ve put in a few of the scientists’ names. Note that the page views data lag a bit behind the page creations.

To put the total number of views into some sort of context, the Royal Society Summer Science Exhibition, which I ran a stand at in 2014, gets around 12,000 visitors per year. Another comparison is that there were fewer than 100,000 undergraduates studying physical sciences in the UK in 2014-2015. So this is genuinely reaching an amazing number of people.

In the figure below, you can see a few of the most popular pages for 2018 so far:

Most visited articles 2018

It's hard to know who is looking at these pages, but it's certain that they wouldn't have been if Jess hadn't created them (and inspired others to do the same). As well as Dr. Stuart Higgins' Science in the Supermarket from my previous post, I think this is a great example of how innovative outreach can be more effective in reaching audiences.

Who is not participating in Higher Education?

Given my work in both economics and Science, Technology, Engineering, and Mathematics (STEM), I’ve become interested in what factors determine groups’ participation in higher education, what groups are being left out, and what might be done about it.

Poverty means low participation

According to a Social Mobility Commission report from 2016, the most important determinant of whether someone goes to university at all or not is poverty, or, more precisely, whether someone receives free school meals. This applies across gender and ethnicity, though as the report notes “Disadvantaged young people from White British backgrounds are the least likely to access Higher Education”.

A lack of diversity in socio-economic background is perhaps less visible than some other troubling aspects of participation. But, if diversity matters at all, all dimensions of diversity matter.

Unfortunately, people from lower income/wealth backgrounds are some of the most difficult to influence with outreach campaigns as they tend to live in “remote rural or coastal areas and in former industrial areas, especially in the Midlands” according to the 2017 Social Mobility Commission’s ‘State of the nation’ report. I’m from one of the parts of the UK specifically identified in this report, the High Peak, and it’s unfortunately not all that surprising. Higher education institutions, and jobs which require advanced qualifications, are physically and psychologically at a distance. Other poorly ranked areas are similar: they include West Somerset (324 of 324), Thanet (274 of 324), and a cluster around West Norfolk.

There are detailed data on participation in higher education amongst young people available from the Office for Students. I’ve made a choropleth of these data below. The geographical areas with low participation are much the same as the problem areas identified in the report on social mobility. If you’re not interested in where the data come from, skip the box below the figure.

Youth higher education participation rate by local authority district. Shown: Manchester and the Peak District.


Data on youth HE participation

The Office for Students provides data on the number of young people who participate in HE by middle layer super output area (MSOA). These are quite small areas, so I've aggregated them to local authority districts using a mapping which comes from data on households in poverty. I plotted these data with folium using maps from the ONS Open Geography portal. Minor gripe: no geojson format was available, so I had to make my own from the shapefiles.
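
For anyone wanting to reproduce something similar, here is a minimal sketch of the folium step; the file names, column names, and the geojson property used in key_on are placeholders for illustration, not the actual files used:

import folium
import pandas as pd

# Placeholder inputs: a geojson of local authority districts and a table
# with one participation rate per district code
geo = 'local_authority_districts.geojson'
df_he = pd.read_csv('he_participation_by_district.csv')  # columns: lad_code, participation_rate

m = folium.Map(location=[53.4, -1.9], zoom_start=8, tiles='cartodbpositron')
folium.Choropleth(
    geo_data=geo,
    data=df_he,
    columns=['lad_code', 'participation_rate'],
    key_on='feature.properties.lad17cd',  # placeholder property name in the geojson
    fill_color='YlGnBu',
    legend_name='Youth HE participation rate',
).add_to(m)
m.save('he_participation_map.html')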


Science in the supermarket

Recently, I discussed how to reach those with the least HE participation with outreach superstar and Imperial College London colleague Dr. Stuart Higgins (whose award-winning podcast Scientists Not The Science is worth checking out). As I understand it, the best advice - based on research - is that you need to show young students a path into higher education which could work for them; that it’s feasible, that it’s for people ‘like them’, and that they’re good enough to make it.

I was talking to Stuart because of an amazing recent initiative he’s been involved with called Science in the Supermarket which puts what he’s learned into practice. Stuart and some other volunteers supported by Imperial College went to a supermarket in Somerset to engage young and old alike with science demos, and to tell them about careers in STEM. Although on a small scale, I think the brilliance of this initiative is that it avoids the self-selection problem which some other outreach programmes suffer from. I would love to see Economists in the Supermarket, or even Applied Mathematics in the Supermarket!

Update 25/08/18

Stuart has written up the results of the Science in the Supermarket project he ran so that others can learn from it. Laudably, by setting out everything from the project timetable, to the letters asking for volunteers, to the design of the meta-evaluation, to the costs, Stuart has made this intervention as reproducible as possible. Others can build upon what he has done. It’s a more scientific way to run an outreach programme.

Stuart gave me some informal pointers on ‘what I would think about if starting another project’ which I’ve made some minor tweaks to and reproduced below:

  • Understand your own motivation and define a target; trying to approach a big problem can feel overwhelming and paralysing, starting with a specific, local goal can help
  • Accept that balancing engagement with a day job is challenging
  • Set a realistic scope for the project and accept that ‘good enough’ is good enough
  • If possible, get both bottom-up (to help share the workload), and top-down support (to add legitimacy, open doors to resources, etc)
  • Try and be evidence-based where possible

Another resource he mentioned is this Aspires Report on ‘Young people’s science and career aspirations’. The two key findings I took away from it were that young people aren’t necessarily aware of the careers which science can open up (economics!) and that ‘science capital’ is a strong predictor of aspiring to a career in science but that this capital is unevenly distributed across socio-economic groups.

Processing all of this, it seems like making STEM careers and/or STEM practitioners familiar to young people is one of the most useful outcomes outreach programmes can strive for.

Why the latest, most exciting thing in machine learning is... game theory

And when I say latest, this particular method was invented in 1953.

Machine learning has interpretability issues. New EU legislation, the General Data Protection Regulation, includes a line about “the right … to obtain an explanation of the decision reached”, including by an algorithm.

Of course, there are many other good reasons to want the decisions of algorithms to be understandable and explainable. Interrogating why an algorithm makes the choices it does can highlight whether it’s working as intended, and, in some situations - such as public policy - transparency and interpretability may be essential ingredients of decision making.

But non-linear models are just not that easy to decompose into their fundamental components; they are - to an extent - a 'black box'. Ideally, we would be able to find the contribution of each input feature to the final prediction. In linear models, this is trivially achieved by the combination of the level of a feature and its regression coefficient. That is, for a linear model $f$ with features $x_{i\nu}$, $\nu \in \{1,\dots,p\}$, at a point $i$ such that

$$ f(x_i) = \beta_0 + \sum_{\nu=1}^{p} \beta_\nu x_{i\nu} $$

the contribution from feature $\nu$ is $x_{i\nu}\cdot\beta_\nu$. In non-linear models, it's not so simple.

Shapley values

Game theory to the rescue. In 1953 Lloyd Shapley introduced values which effectively find, for a co-operative game, each player’s marginal contribution, averaged over every possible sequence in which the players could have been added to the group of players (Alvin Roth talks about it here). These are called Shapley values and, in a nutshell, they are the average expected marginal contribution of one player after all possible combinations of players have been considered.

This is exactly the kind of problem we want to solve to understand how different features contribute to a predicted value in a non-linear model, for instance in machine learning. But it's easier to understand them in the linear case. The Shapley value for the linear model above would be, for feature $\nu$:

$$ \phi_{i\nu}(f) = \beta_\nu x_{i\nu} - \beta_\nu \mathbb{E}\left[x_\nu\right] $$

where no Einstein summation is implied. Summing over the different features gets back a number which is simply related to the overall prediction given by $f$,

$$ \sum_{\nu=1}^{p} \phi_{i\nu}(f) = f(x_i) - \mathbb{E}\left[f(x)\right] $$

The general equation for Shapley values looks more complicated, but is described by a function $g$ that assigns a real number to each coalition $S$, that is, to each subset of the combination of features, such that $g(S)$ represents the amount (of money or of utility) that coalition $S$ is able to transfer among its members in any way that they all agree to. Here it is:

$$ \phi_\nu(g) = \sum_{S \subseteq \{1,\dots,p\} \setminus \{\nu\}} \frac{|S|!\,\left(p - |S| - 1\right)!}{p!} \left[ g\left(S \cup \{\nu\}\right) - g(S) \right] $$

where $|S|$ is the number of features in coalition $S$ and $p$ is the total number of features.
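
To see the general formula in action, here is a minimal sketch (not from the original sources) that computes Shapley values by brute-force enumeration of coalitions for a toy three-feature linear model, using the simple value function 'prediction with excluded features held at their background mean'; for a linear model this recovers $\beta_\nu (x_{i\nu} - \mathbb{E}[x_\nu])$, and the contributions sum to $f(x_i) - \mathbb{E}[f(x)]$:

import itertools
from math import factorial
import numpy as np

# A toy linear model: f(x) = 2*x0 - 1*x1 + 0.5*x2 (no intercept)
beta = np.array([2.0, -1.0, 0.5])
X = np.random.default_rng(0).normal(size=(500, 3))  # 'background' data
x = np.array([1.0, 2.0, 3.0])                       # the point to explain
p = len(x)

def g(S):
    # Value of coalition S: prediction with features outside S held at their mean
    z = X.mean(axis=0).copy()
    z[list(S)] = x[list(S)]
    return z @ beta

phi = np.zeros(p)
for nu in range(p):
    others = [j for j in range(p) if j != nu]
    for r in range(p):
        for S in itertools.combinations(others, r):
            weight = factorial(len(S)) * factorial(p - len(S) - 1) / factorial(p)
            phi[nu] += weight * (g(set(S) | {nu}) - g(S))

print(phi)                                          # equals beta * (x - X.mean(axis=0))
print(phi.sum(), x @ beta - X.mean(axis=0) @ beta)  # sums to f(x) minus the average prediction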

Shapley values for machine learning

Shapley values have a number of nice properties which are both familiar from linear decompositions/linear models and highly desirable for machine learning models:

  • the Shapley value contributions sum to the difference between the full prediction and the average prediction (efficiency)

  • two features which contribute equally to any subset to which they’re added have the same Shapley value (substitutability/symmetry)

  • a feature which doesn’t influence the predicted value has a Shapley value of 0 (dummy player)

These nice properties are not trivial for non-linear models, and Shapley values are the only way to achieve them concurrently. They’re also what suggest to me that Shapley values will become the primary interpretability method used and understood. There must be some catch, right?

There is. Which is why other methods, such as local surrogate models like LIME, are not going away anytime soon. If the factorials and sum over all combinations of input features in the equation didn’t give it away, Shapley values are computationally expensive. As this paper points out, “every exact algorithm for the Shapley value requires an exponential number of operations”. Oh dear.

The good news is that there are good approximations out there. The even better news is that there is a Python library called shap which implements a fast approximation method, is easy to use, and is even optimised for sklearn. The paper behind this is here.
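
To give a flavour of how little code this takes, here is a minimal sketch (not from the original post; the model and data are placeholders) using shap's TreeExplainer with a scikit-learn tree ensemble:

import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

# Placeholder data: 200 samples, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# TreeExplainer implements the fast algorithm for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:10])  # one row of feature contributions per prediction

# The contributions plus the expected value recover each prediction (efficiency)
print(shap_values.shape)
print(shap_values[0].sum() + explainer.expected_value, model.predict(X[:1])[0])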

Not everyone is convinced by Shapley values but I think they could be particularly important as they have properties which are so clearly and neatly analogous to decompositions of linear models.

If you'd like to find out more about how Shapley values work, see the excellent explainer blog posts which I drew on heavily for this post.

Econometrics in Python Part IV - Running many regressions alongside pandas

The fourth in the series of posts covering econometrics in Python. This time: automating the boring business of running multiple regressions on columns in a pandas dataframe.

Data science in Python is the open source package pandas, more or less. It’s an amazing, powerful library and the firms, researchers, and governments who use it are indebted to its maintainers, including Wes McKinney.

When data arrive in your Python code, they’re most likely going to arrive in a pandas dataframe. If you’re doing econometrics, you’re then likely to want to do regressions from the dataframe with the minimum of fuss and the maximum of flexibility. This post sets out a way to do that with a few extra functions.

There are two main ways to run regressions in Python: statsmodels or scikit-learn. The latter is more geared toward machine learning, so I’ll be using the former for regressions. The typical way to do this might be the following (ignoring imports and data importing), with a pandas dataframe df with an x-variable ‘concrete’ and a y-variable ‘age’:

mod = sm.OLS(df['Age'],df['Concrete'])
results = mod.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    Age   R-squared:                       0.414
Model:                            OLS   Adj. R-squared:                  0.414
Method:                 Least Squares   F-statistic:                     728.1
Date:                     05 May 2018   Prob (F-statistic):          1.05e-121
Time:                        00:00:00   Log-Likelihood:                -5672.3
No. Observations:                1030   AIC:                         1.135e+04
Df Residuals:                    1029   BIC:                         1.135e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Concrete       1.2693      0.047     26.984      0.000       1.177       1.362
==============================================================================
Omnibus:                      761.497   Durbin-Watson:                   0.998
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             9916.238
Skew:                           3.411   Prob(JB):                         0.00
Kurtosis:                      16.584   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In the rest of this post I will outline a more flexible and extensible way of doing this, which will allow for multiple models and controls, with code snippets you can copy, paste, and then forget about.

Loading the data

Our data are on the compressive strength of concrete - I know, brilliant, and we could talk more about the fascinating history of concrete and its importance for the economy, but we should get to the stats. The data are from the UC Irvine Machine Learning datasets repository; see here for them.

df = pd.read_excel('concrete_data.xls')
df.head()
   Cement (component 1)(kg in a m^3 mixture)  \
0                                      540.0   
1                                      540.0   
2                                      332.5   
3                                      332.5   
4                                      198.6   

   Blast Furnace Slag (component 2)(kg in a m^3 mixture)  \
0                                                0.0       
1                                                0.0       
2                                              142.5       
3                                              142.5       
4                                              132.4       

   Fly Ash (component 3)(kg in a m^3 mixture)  \
0                                         0.0   
1                                         0.0   
2                                         0.0   
3                                         0.0   
4                                         0.0   

   Water  (component 4)(kg in a m^3 mixture)  \
0                                      162.0   
1                                      162.0   
2                                      228.0   
3                                      228.0   
4                                      192.0   

   Superplasticizer (component 5)(kg in a m^3 mixture)  \
0                                                2.5     
1                                                2.5     
2                                                0.0     
3                                                0.0     
4                                                0.0     

   Coarse Aggregate  (component 6)(kg in a m^3 mixture)  \
0                                             1040.0      
1                                             1055.0      
2                                              932.0      
3                                              932.0      
4                                              978.4      

   Fine Aggregate (component 7)(kg in a m^3 mixture)  Age (day)  \
0                                              676.0         28   
1                                              676.0         28   
2                                              594.0        270   
3                                              594.0        365   
4                                              825.5        360   

   Concrete compressive strength(MPa, megapascals)   
0                                         79.986111  
1                                         61.887366  
2                                         40.269535  
3                                         41.052780  
4                                         44.296075  

Those column names are rather long! I’ll just take the first word of each column name, and take a quick look at the data:

df = df.rename(columns=dict(zip(df.columns,[x.split()[0] for x in df.columns])))
df.describe()
            Cement        Blast          Fly        Water  Superplasticizer  \
count  1030.000000  1030.000000  1030.000000  1030.000000       1030.000000   
mean    281.165631    73.895485    54.187136   181.566359          6.203112   
std     104.507142    86.279104    63.996469    21.355567          5.973492   
min     102.000000     0.000000     0.000000   121.750000          0.000000   
25%     192.375000     0.000000     0.000000   164.900000          0.000000   
50%     272.900000    22.000000     0.000000   185.000000          6.350000   
75%     350.000000   142.950000   118.270000   192.000000         10.160000   
max     540.000000   359.400000   200.100000   247.000000         32.200000   

            Coarse         Fine          Age     Concrete  
count  1030.000000  1030.000000  1030.000000  1030.000000  
mean    972.918592   773.578883    45.662136    35.817836  
std      77.753818    80.175427    63.169912    16.705679  
min     801.000000   594.000000     1.000000     2.331808  
25%     932.000000   730.950000     7.000000    23.707115  
50%     968.000000   779.510000    28.000000    34.442774  
75%    1029.400000   824.000000    56.000000    46.136287  
max    1145.000000   992.600000   365.000000    82.599225  

Defining functions to run regressions

Let’s set up a function which we can pass a dataframe to in order to run regressions on selected columns:

def RegressionOneModel(df,Xindvars,Yvar,summary=True):
    """Regress a single y variable (Yvar) on the columns in Xindvars,
    dropping rows with missing values first. Returns a results summary,
    or the fitted results object if summary=False."""
    if(type(Yvar)==str):
        Yvar=[Yvar]
    if(len(Yvar)!=1):
        print("Error: please enter a single y variable")
        return np.nan
    else:
        xf = df.dropna(subset=Yvar+Xindvars)[Xindvars+Yvar]
        Xexog = xf[Xindvars]
        model = sm.OLS(xf[Yvar].dropna(),Xexog)
        reg = model.fit()
    if(summary):
        return reg.summary2()
    else:
        return reg

How does this work? It's easiest to show with an example.

regResults = RegressionOneModel(df,['Cement','Blast'],'Concrete')
print(regResults)
                 Results: Ordinary least squares
==================================================================
Model:              OLS              Adj. R-squared:     0.878    
Dependent Variable: Concrete         AIC:                8332.8955
Date:               2018-05-05 00:00 BIC:                8342.7701
No. Observations:   1030             Log-Likelihood:     -4164.4  
Df Model:           2                F-statistic:        3705.    
Df Residuals:       1028             Prob (F-statistic): 0.00     
R-squared:          0.878            Scale:              190.64   
---------------------------------------------------------------------
          Coef.     Std.Err.       t       P>|t|     [0.025    0.975]
---------------------------------------------------------------------
Cement    0.1079      0.0017    63.4736    0.0000    0.1046    0.1113
Blast     0.0671      0.0045    14.9486    0.0000    0.0583    0.0760
------------------------------------------------------------------
Omnibus:               7.719        Durbin-Watson:           0.983
Prob(Omnibus):         0.021        Jarque-Bera (JB):        6.461
Skew:                  0.117        Prob(JB):                0.040
Kurtosis:              2.690        Condition No.:           3    
==================================================================

This function takes a variable number of X variables and regresses Y ('Concrete') on them. But what if we want to run many regressions at once? Fortunately, statsmodels has some capability to do this. Unfortunately, it's not all that intuitive and, to use it with ease, we'll need to extend it. I want the extension to be flexible enough so that it:

  • works with X as a string, list, or a list of lists (for multiple models)
  • accepts a number of controls which are the same in every model
  • returns either a multi-model regression results summary or a single model summary as appropriate

To make this all work, we need a couple of extra functions. One just labels different models with Roman numerals and could be jettisoned. The other one is just a quick way of combining the variables to send to the regression.

def write_roman(num):
    # Convert an integer to a Roman numeral string
    # (requires: from collections import OrderedDict)
    roman = OrderedDict()
    roman[1000] = "M"
    roman[900] = "CM"
    roman[500] = "D"
    roman[400] = "CD"
    roman[100] = "C"
    roman[90] = "XC"
    roman[50] = "L"
    roman[40] = "XL"
    roman[10] = "X"
    roman[9] = "IX"
    roman[5] = "V"
    roman[4] = "IV"
    roman[1] = "I"

    def roman_num(num):
        for r in roman.keys():
            x, y = divmod(num, r)
            yield roman[r] * x
            num -= (r * x)
            if num > 0:
                roman_num(num)
            else:
                break

    return "".join([a for a in roman_num(num)])

def combineVarsList(X,Z,combine):
    if(combine):
        return X+Z
    else:
        return X
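
As a quick check of these helpers (the expected output is shown in the comments):

print([write_roman(i) for i in range(1, 6)])        # ['I', 'II', 'III', 'IV', 'V']
print(combineVarsList(['Cement'], ['Age'], True))   # ['Cement', 'Age']
print(combineVarsList(['Cement'], ['Age'], False))  # ['Cement']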

Finally, there is a function which decides how to call the underlying regression code, and which stitches the results from different models together:

def RunRegression(df,XX,y,Z=['']):
    """Run OLS of y on one or more models given by XX (a string, a list,
    or a list of lists for multiple models), optionally appending the
    controls in Z to every model. Requires summary_col, i.e.
    from statsmodels.iolib.summary2 import summary_col."""
    # If XX is not a list of lists, make it one -
    # - first by checking if type is string
    if(type(XX)==str):  # Check if it is one string
        XX = [XX]
     # - second for if it is a list
    if(not(any(isinstance(el, list) for el in XX))):
        XX = [XX]
    if(type(y)!=str): # Check y for string
        print('Error: please enter string for dependent variable')
        return np.nan
    title_string = 'OLS Regressions; dependent variable '+y
    # If Z is not a list, make it one
    if(type(Z)==str):
        Z = [Z]
    #XX = np.array(XX)
    # Check whether there is just a single model to run
    if(len(XX)==1):
        Xpassvars = list(XX[0])
        if(len(Z[0])!=0):
             Xpassvars = list(XX[0])+Z
        regRes = RegressionOneModel(df,Xpassvars,[y],summary=False)
        regResSum2 = regRes.summary2()
        regResSum2.add_title(title_string)
        return regResSum2
    elif(len(XX)>1):
        # Load in Z here if appropriate
        addControls = False
        if(len(Z[0])!=0):
             addControls = True
        # Case with multiple models
        info_dict={'R-squared' : lambda x: "{:.2f}".format(x.rsquared),
               'Adj. R-squared' : lambda x: "{:.2f}".format(x.rsquared_adj),
               'No. observations' : lambda x: "{0:d}".format(int(x.nobs))}
        regsVec = [RegressionOneModel(df,combineVarsList(X,Z,addControls),
                                              [y],summary=False) for X in XX]
        model_names_strList = ['Model '+\
                           write_roman(i) for i in range(1,len(XX)+1)]
        float_format_str = '%0.2f'
        uniqueVars = np.unique([item for sublist in XX for item in sublist])
        uniqueVars = [str(x) for x in uniqueVars]
        results_table = summary_col(results=regsVec,
                                float_format=float_format_str,
                                stars = True,
                                model_names=model_names_strList,
                                info_dict=info_dict,
                                regressor_order=uniqueVars+Z)
        results_table.add_title(title_string)
        return results_table

Putting it all together

Let’s see how it works. First, the simple case of regressing one y on one x.

regResults = RunRegression(df,'Blast','Concrete')
print(regResults)
           OLS Regressions; dependent variable Concrete
==================================================================
Model:              OLS              Adj. R-squared:     0.400    
Dependent Variable: Concrete         AIC:                9971.8287
Date:               2018-05-05 00:00 BIC:                9976.7660
No. Observations:   1030             Log-Likelihood:     -4984.9  
Df Model:           1                F-statistic:        688.0    
Df Residuals:       1029             Prob (F-statistic): 1.57e-116
R-squared:          0.401            Scale:              936.87   
---------------------------------------------------------------------
          Coef.     Std.Err.       t       P>|t|     [0.025    0.975]
---------------------------------------------------------------------
Blast     0.2203      0.0084    26.2293    0.0000    0.2038    0.2367
------------------------------------------------------------------
Omnibus:               39.762       Durbin-Watson:          0.477
Prob(Omnibus):         0.000        Jarque-Bera (JB):       43.578
Skew:                  -0.502       Prob(JB):               0.000
Kurtosis:              3.081        Condition No.:          1     
==================================================================

Or several x variables:

regResults = RunRegression(df,['Cement','Blast'],'Concrete')
print(regResults)
           OLS Regressions; dependent variable Concrete
==================================================================
Model:              OLS              Adj. R-squared:     0.878    
Dependent Variable: Concrete         AIC:                8332.8955
Date:               2018-05-05 00:00 BIC:                8342.7701
No. Observations:   1030             Log-Likelihood:     -4164.4  
Df Model:           2                F-statistic:        3705.    
Df Residuals:       1028             Prob (F-statistic): 0.00     
R-squared:          0.878            Scale:              190.64   
---------------------------------------------------------------------
          Coef.     Std.Err.       t       P>|t|     [0.025    0.975]
---------------------------------------------------------------------
Cement    0.1079      0.0017    63.4736    0.0000    0.1046    0.1113
Blast     0.0671      0.0045    14.9486    0.0000    0.0583    0.0760
------------------------------------------------------------------
Omnibus:               7.719        Durbin-Watson:           0.983
Prob(Omnibus):         0.021        Jarque-Bera (JB):        6.461
Skew:                  0.117        Prob(JB):                0.040
Kurtosis:              2.690        Condition No.:           3    
==================================================================

Here comes the fun part: to run multiple models, we need only pass a list of lists as the X argument of the function:

Model_1_X = ['Cement', 'Blast']
Model_2_X = ['Coarse','Fine']
Model_3_X = ['Fly', 'Water']
ManyModelResults = RunRegression(df,
                                 [Model_1_X,Model_2_X,Model_3_X],
                                 'Concrete')
print(ManyModelResults)
OLS Regressions; dependent variable Concrete
===========================================
                 Model I Model II Model III
-------------------------------------------
Blast            0.07***                   
                 (0.00)                    
Cement           0.11***                   
                 (0.00)                    
Coarse                   0.03***           
                         (0.00)            
Fine                     0.01***           
                         (0.00)            
Fly                               0.00     
                                  (0.01)   
Water                             0.19***  
                                  (0.00)   
R-squared        0.88    0.81     0.78     
Adj. R-squared   0.88    0.81     0.78     
No. observations 1030    1030     1030     
===========================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01

There’s a keyword argument, Z, via which we can pass controls (here just ‘Age’):

ManyModelsWControl = RunRegression(df,
                                 [Model_1_X,Model_2_X,Model_3_X],
                                 'Concrete',
                                 Z = 'Age')
print(ManyModelsWControl)
OLS Regressions; dependent variable Concrete
===========================================
                 Model I Model II Model III
-------------------------------------------
Blast            0.05***                   
                 (0.00)                    
Cement           0.08***                   
                 (0.00)                    
Coarse                   0.03***           
                         (0.00)            
Fine                     -0.01**           
                         (0.00)            
Fly                               -0.06***
                                  (0.01)   
Water                             0.12***  
                                  (0.00)   
Age              0.10*** 0.11***  0.10***  
                 (0.01)  (0.01)   (0.01)   
Superplasticizer 1.04*** 1.44***  1.84***  
                 (0.06)  (0.08)   (0.08)   
R-squared        0.92    0.87     0.88     
Adj. R-squared   0.92    0.87     0.87     
No. observations 1030    1030     1030     
===========================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01

Finally, it’s easy to pass multiple controls:

ManyModelsWControls = RunRegression(df,
                                 [Model_1_X,Model_2_X,Model_3_X],
                                 'Concrete',
                                 Z = ['Age','Superplasticizer'])
print(ManyModelsWControls)
OLS Regressions; dependent variable Concrete
===========================================
                 Model I Model II Model III
-------------------------------------------
Blast            0.05***                   
                 (0.00)                    
Cement           0.08***                   
                 (0.00)                    
Coarse                   0.03***           
                         (0.00)            
Fine                     -0.01**           
                         (0.00)            
Fly                               -0.06***
                                  (0.01)   
Water                             0.12***  
                                  (0.00)   
Age              0.10*** 0.11***  0.10***  
                 (0.01)  (0.01)   (0.01)   
Superplasticizer 1.04*** 1.44***  1.84***  
                 (0.06)  (0.08)   (0.08)   
R-squared        0.92    0.87     0.88     
Adj. R-squared   0.92    0.87     0.87     
No. observations 1030    1030     1030     
===========================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
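
For comparison, here’s a minimal sketch of how the same three models (without the extra controls) could be stitched together using statsmodels’ formula interface directly, skipping the wrapper entirely. Note that the formula API includes an intercept by default, so the results won’t line up exactly with the tables above.

import statsmodels.formula.api as smf

# One patsy formula per candidate model
formulas = ['Concrete ~ Cement + Blast',
            'Concrete ~ Coarse + Fine',
            'Concrete ~ Fly + Water']
fits = [smf.ols(f, data=df).fit() for f in formulas]
print(summary_col(fits, stars=True, float_format='%0.2f',
                  model_names=['Model I', 'Model II', 'Model III']))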

By the way, the statsmodels summary object returned here has an .as_latex() method, which is useful if you want to dump results straight into papers.
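
For example, a minimal sketch of dumping the last set of results to a file (the filename is just illustrative):

# Write the multi-model table out as LaTeX, ready to \input into a paper
with open('many_models_table.tex', 'w') as f:
    f.write(ManyModelsWControls.as_latex())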

Do you have a better way to quickly run many different kinds of OLS regressions from a pandas dataframe? Get in touch!

NB: I had to remove the doc strings in the above code because they slowed down the page a lot.