Markov Wanderer A blog on economics, science, coding and data. Views are my own.

Econometrics in Python Part IV - Running many regressions alongside pandas

The fourth in the series of posts covering econometrics in Python. This time: automating the boring business of running multiple regressions on columns in a pandas dataframe.

Data science in Python is the open source package pandas, more or less. It’s an amazing, powerful library and the firms, researchers, and governments who use it are indebted to its maintainers, including Wes McKinney.

When data arrive in your Python code, they’re most likely going to arrive in a pandas dataframe. If you’re doing econometrics, you’re then likely to want to do regressions from the dataframe with the minimum of fuss and the maximum of flexibility. This post sets out a way to do that with a few extra functions.

There are two main ways to run regressions in Python: statsmodels or scikit-learn. The latter is more geared toward machine learning, so I’ll be using the former for regressions. The typical way to do this might be the following (ignoring imports and data importing), with a pandas dataframe df with an x-variable ‘concrete’ and a y-variable ‘age’:

mod = sm.OLS(df['Age'],df['Concrete'])
results =
                            OLS Regression Results                            
Dep. Variable:                    Age   R-squared:                       0.414
Model:                            OLS   Adj. R-squared:                  0.414
Method:                 Least Squares   F-statistic:                     728.1
Date:                     05 May 2018   Prob (F-statistic):          1.05e-121
Time:                        00:00:00   Log-Likelihood:                -5672.3
No. Observations:                1030   AIC:                         1.135e+04
Df Residuals:                    1029   BIC:                         1.135e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
Concrete       1.2693      0.047     26.984      0.000       1.177       1.362
Omnibus:                      761.497   Durbin-Watson:                   0.998
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             9916.238
Skew:                           3.411   Prob(JB):                         0.00
Kurtosis:                      16.584   Cond. No.                         1.00

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In the rest of this post I will outline a more flexible and extensible way of doing this, which will allow for multiple models and controls, with code snippets you can copy, paste, and then forget about.

Loading the data

Our data are on the compressive strength of concrete - I know, brilliant, and we could talk more about the fascinating history of concrete and its importance for the economy, but we should get to the stats. The data are from the UC Irvine Machine Learning datasets repository; see here for them.

df = pd.read_excel('concrete_data.xls')
   Cement (component 1)(kg in a m^3 mixture)  \
0                                      540.0   
1                                      540.0   
2                                      332.5   
3                                      332.5   
4                                      198.6   

   Blast Furnace Slag (component 2)(kg in a m^3 mixture)  \
0                                                0.0       
1                                                0.0       
2                                              142.5       
3                                              142.5       
4                                              132.4       

   Fly Ash (component 3)(kg in a m^3 mixture)  \
0                                         0.0   
1                                         0.0   
2                                         0.0   
3                                         0.0   
4                                         0.0   

   Water  (component 4)(kg in a m^3 mixture)  \
0                                      162.0   
1                                      162.0   
2                                      228.0   
3                                      228.0   
4                                      192.0   

   Superplasticizer (component 5)(kg in a m^3 mixture)  \
0                                                2.5     
1                                                2.5     
2                                                0.0     
3                                                0.0     
4                                                0.0     

   Coarse Aggregate  (component 6)(kg in a m^3 mixture)  \
0                                             1040.0      
1                                             1055.0      
2                                              932.0      
3                                              932.0      
4                                              978.4      

   Fine Aggregate (component 7)(kg in a m^3 mixture)  Age (day)  \
0                                              676.0         28   
1                                              676.0         28   
2                                              594.0        270   
3                                              594.0        365   
4                                              825.5        360   

   Concrete compressive strength(MPa, megapascals)   
0                                         79.986111  
1                                         61.887366  
2                                         40.269535  
3                                         41.052780  
4                                         44.296075  

Those column names are rather long! I’ll just take the first word of each column name, and take a quick look at the data:

df = df.rename(columns=dict(zip(df.columns,[x.split()[0] for x in df.columns])))
            Cement        Blast          Fly        Water  Superplasticizer  \
count  1030.000000  1030.000000  1030.000000  1030.000000       1030.000000   
mean    281.165631    73.895485    54.187136   181.566359          6.203112   
std     104.507142    86.279104    63.996469    21.355567          5.973492   
min     102.000000     0.000000     0.000000   121.750000          0.000000   
25%     192.375000     0.000000     0.000000   164.900000          0.000000   
50%     272.900000    22.000000     0.000000   185.000000          6.350000   
75%     350.000000   142.950000   118.270000   192.000000         10.160000   
max     540.000000   359.400000   200.100000   247.000000         32.200000   

            Coarse         Fine          Age     Concrete  
count  1030.000000  1030.000000  1030.000000  1030.000000  
mean    972.918592   773.578883    45.662136    35.817836  
std      77.753818    80.175427    63.169912    16.705679  
min     801.000000   594.000000     1.000000     2.331808  
25%     932.000000   730.950000     7.000000    23.707115  
50%     968.000000   779.510000    28.000000    34.442774  
75%    1029.400000   824.000000    56.000000    46.136287  
max    1145.000000   992.600000   365.000000    82.599225  

Defining functions to run regressions

Let’s set up a function which we can pass a dataframe to in order to run regressions on selected columns:

def RegressionOneModel(df,Xindvars,Yvar,summary=True):

        print("Error: please enter a single y variable")
        return np.nan
        xf = df.dropna(subset=Yvar+Xindvars)[Xindvars+Yvar]
        Xexog = xf[Xindvars]
        model = sm.OLS(xf[Yvar].dropna(),Xexog)
        reg =
        return reg.summary2()
        return reg

How this does work? It’s easiest to show with an example.

regResults = RegressionOneModel(df,['Cement','Blast'],'Concrete')
                 Results: Ordinary least squares
Model:              OLS              Adj. R-squared:     0.878    
Dependent Variable: Concrete         AIC:                8332.8955
Date:               2018-05-05 00:00 BIC:                8342.7701
No. Observations:   1030             Log-Likelihood:     -4164.4  
Df Model:           2                F-statistic:        3705.    
Df Residuals:       1028             Prob (F-statistic): 0.00     
R-squared:          0.878            Scale:              190.64   
          Coef.     Std.Err.       t       P>|t|     [0.025    0.975]
Cement    0.1079      0.0017    63.4736    0.0000    0.1046    0.1113
Blast     0.0671      0.0045    14.9486    0.0000    0.0583    0.0760
Omnibus:               7.719        Durbin-Watson:           0.983
Prob(Omnibus):         0.021        Jarque-Bera (JB):        6.461
Skew:                  0.117        Prob(JB):                0.040
Kurtosis:              2.690        Condition No.:           3    

This function takes a variable number of X vectors and regresses Y (‘concrete’) on them. But what if we want to run many regressions at once? Fortunately statsmodels has some capability to do this. Unfortunately, it’s not all that intuitive and, to use it with ease, we’ll need to extend. I want it to be flexible enough so that it:

  • works with X as a string, list, or a list of lists (for multiple models)
  • accepts a number of controls which are the same in every model
  • returns either a multi-model regression results summary or a single model summary as appropriate

To make this all work, we need a couple of extra functions. One just labels different models with Roman numerals and could be jettisoned. The other one is just a quick way of combining the variables to send to the regression.

def write_roman(num):

    roman = OrderedDict()
    roman[1000] = "M"
    roman[900] = "CM"
    roman[500] = "D"
    roman[400] = "CD"
    roman[100] = "C"
    roman[90] = "XC"
    roman[50] = "L"
    roman[40] = "XL"
    roman[10] = "X"
    roman[9] = "IX"
    roman[5] = "V"
    roman[4] = "IV"
    roman[1] = "I"

    def roman_num(num):
        for r in roman.keys():
            x, y = divmod(num, r)
            yield roman[r] * x
            num -= (r * x)
            if num > 0:

    return "".join([a for a in roman_num(num)])

def combineVarsList(X,Z,combine):
        return X+Z
        return X

Finally, there is a function which decides how to call the underlying regression code, and which stitches the results from different models together:

def RunRegression(df,XX,y,Z=['']):

    # If XX is not a list of lists, make it one -
    # - first by checking if type is string
    if(type(XX)==str):  # Check if it is one string
        XX = [XX]
     # - second for if it is a list
    if(not(any(isinstance(el, list) for el in XX))):
        XX = [XX]
    if(type(y)!=str): # Check y for string
        print('Error: please enter string for dependent variable')
        return np.nan
    title_string = 'OLS Regressions; dependent variable '+y
    # If Z is not a list, make it one
        Z = [Z]
    #XX = np.array(XX)
    # Check whether there is just a single model to run
        Xpassvars = list(XX[0])
             Xpassvars = list(XX[0])+Z
        regRes = RegressionOneModel(df,Xpassvars,[y],summary=False)
        regResSum2 = regRes.summary2()
        return regResSum2
        # Load in Z here if appropriate
        addControls = False
             addControls = True
        # Case with multiple models
        info_dict={'R-squared' : lambda x: "{:.2f}".format(x.rsquared),
               'Adj. R-squared' : lambda x: "{:.2f}".format(x.rsquared_adj),
               'No. observations' : lambda x: "{0:d}".format(int(x.nobs))}
        regsVec = [RegressionOneModel(df,combineVarsList(X,Z,addControls),
                                              [y],summary=False) for X in XX]
        model_names_strList = ['Model '+\
                           write_roman(i) for i in range(1,len(XX)+1)]
        float_format_str = '%0.2f'
        uniqueVars = np.unique([item for sublist in XX for item in sublist])
        uniqueVars = [str(x) for x in uniqueVars]
        results_table = summary_col(results=regsVec,
                                stars = True,
        return results_table

Putting it all together

Let’s see how it works. Firstly, the simple case of one y on one x.

regResults = RunRegression(df,'Blast','Concrete')
           OLS Regressions; dependent variable Concrete
Model:              OLS              Adj. R-squared:     0.400    
Dependent Variable: Concrete         AIC:                9971.8287
Date:               2018-05-05 00:00 BIC:                9976.7660
No. Observations:   1030             Log-Likelihood:     -4984.9  
Df Model:           1                F-statistic:        688.0    
Df Residuals:       1029             Prob (F-statistic): 1.57e-116
R-squared:          0.401            Scale:              936.87   
          Coef.     Std.Err.       t       P>|t|     [0.025    0.975]
Blast     0.2203      0.0084    26.2293    0.0000    0.2038    0.2367
Omnibus:               39.762       Durbin-Watson:          0.477
Prob(Omnibus):         0.000        Jarque-Bera (JB):       43.578
Skew:                  -0.502       Prob(JB):               0.000
Kurtosis:              3.081        Condition No.:          1     

Or several x variables:

regResults = RunRegression(df,['Cement','Blast'],'Concrete')
           OLS Regressions; dependent variable Concrete
Model:              OLS              Adj. R-squared:     0.878    
Dependent Variable: Concrete         AIC:                8332.8955
Date:               2018-05-05 00:00 BIC:                8342.7701
No. Observations:   1030             Log-Likelihood:     -4164.4  
Df Model:           2                F-statistic:        3705.    
Df Residuals:       1028             Prob (F-statistic): 0.00     
R-squared:          0.878            Scale:              190.64   
          Coef.     Std.Err.       t       P>|t|     [0.025    0.975]
Cement    0.1079      0.0017    63.4736    0.0000    0.1046    0.1113
Blast     0.0671      0.0045    14.9486    0.0000    0.0583    0.0760
Omnibus:               7.719        Durbin-Watson:           0.983
Prob(Omnibus):         0.021        Jarque-Bera (JB):        6.461
Skew:                  0.117        Prob(JB):                0.040
Kurtosis:              2.690        Condition No.:           3    

Here comes the fun - to run multiple models, we need only pass a list of lists as the X variable in the function:

Model_1_X = ['Cement', 'Blast']
Model_2_X = ['Coarse','Fine']
Model_3_X = ['Fly', 'Water']
ManyModelResults = RunRegression(df,
OLS Regressions; dependent variable Concrete
                 Model I Model II Model III
Blast            0.07***                   
Cement           0.11***                   
Coarse                   0.03***           
Fine                     0.01***           
Fly                               0.00     
Water                             0.19***  
R-squared        0.88    0.81     0.78     
Adj. R-squared   0.88    0.81     0.78     
No. observations 1030    1030     1030     
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01

There’s a keyword argument, Z, which we can pass controls (here just ‘Age’) via:

ManyModelsWControl = RunRegression(df,
                                 Z = 'Age')
OLS Regressions; dependent variable Concrete
                 Model I Model II Model III
Blast            0.05***                   
Cement           0.08***                   
Coarse                   0.03***           
Fine                     -0.01**           
Fly                               -0.06***
Water                             0.12***  
Age              0.10*** 0.11***  0.10***  
                 (0.01)  (0.01)   (0.01)   
Superplasticizer 1.04*** 1.44***  1.84***  
                 (0.06)  (0.08)   (0.08)   
R-squared        0.92    0.87     0.88     
Adj. R-squared   0.92    0.87     0.87     
No. observations 1030    1030     1030     
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01

Finally, it’s easy to pass multiple controls:

ManyModelsWControls = RunRegression(df,
                                 Z = ['Age','Superplasticizer'])
OLS Regressions; dependent variable Concrete
                 Model I Model II Model III
Blast            0.05***                   
Cement           0.08***                   
Coarse                   0.03***           
Fine                     -0.01**           
Fly                               -0.06***
Water                             0.12***  
Age              0.10*** 0.11***  0.10***  
                 (0.01)  (0.01)   (0.01)   
Superplasticizer 1.04*** 1.44***  1.84***  
                 (0.06)  (0.08)   (0.08)   
R-squared        0.92    0.87     0.88     
Adj. R-squared   0.92    0.87     0.87     
No. observations 1030    1030     1030     
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01

By the way, the statsmodels summary object which is returned here has an .as_latex() method - useful if you want to dump results straight into papers.

Do you have a better way to quickly run many different kinds of OLS regressions from a pandas dataframe? Get in touch!

NB: I had to remove the doc strings in the above code because they slowed down the page a lot.

Econometrics in Python part III - Estimating heterogeneous treatment effects using random forests

The third in a series of posts covering econometrics in Python. Here I look at ‘causal forests’.

As I mentioned in a previous post, there are methods at the intersection of machine learning and econometrics which are really exciting. Susan Athey is very active in this space and has written a number of papers, including a review article of where the cross-over between economics and computer science is headed. In this post, I’m going to look at recreating an example from her paper, ‘Estimation and inference of heterogeneous treatment effects using random forests’ (Wager & Athey, 2017).

The paper is a recipe for doing non-parametric causal estimation of heterogeneous treatment effects. Imagine a natural experiment with people who are different according to a set of covariates $X_i$, and who are assigned a treatment $W_i \in {0,1}$. The response is $Y_i \in \mathbb{R}$, giving $(X_i,W_i,Y_i)$ for each observation. Then the treatment effect, given by

can be estimated using a random forest, where the observed response $Y^{(h)}$ is labelled for either the treatment case ($h=1$) or the no treatment case ($h=0$).

What is especially nice about Wager and Athey’s approach is that it employs the power of classification and regression trees but provides point estimates of the treatment effect that are pointwise consistent and satisfy,

that is, the error in the pointwise estimate is asymptotically Gaussian. There are some conditions and assumptions for their method to work. The main one is unconfoundedness, which is defined as

meaning that the response variables are independent of the assignment to the treatment or control group once the covariates are accounted for.

To show how this works, I will reproduce one of the simulation experiments from their paper. There is an R package which implements their method, but I couldn’t find one in Python. There are some more comments on the approach at the end of the post.

Simulation experiments

We will set $X \thicksim \mathcal{U} ([0 , 1]^d )$ for a $d$ dimensional space of covariates, and assume so that the noise in the response variable is homoskedastic. Further assumptions are that the mean effect, $m(x)$, and the treatment propensity, $e(x)$, are

The true data generating process will be

with the number of observations set to $n=5000$. To make this a tough test, we can set $d=6$ so that there are a few noisy covariates which don’t influence the response variable but could confuse the regression trees (as only the first two dimensions are important).

Let’s first generate the data using Python:

import numpy as np
import statsmodels.formula.api as sm
import pandas as pd
import random
import matplotlib.pyplot as plt
from sklearn import tree
import scipy.interpolate
# No. dimensions of covariates
d = 6
# No. observations
n= 5000
# s set size
s = 2500
X = np.random.uniform(0,1,size=(n,d))
def xi(x):
    return (1+1./(1.+np.exp(-20.*(x-1./3.))))
tau = np.multiply(xi(X[:,0]),xi(X[:,1]))

we can take a look at the function $\xi(X)$,

def tauxy(X_0,X_1):
    return np.multiply(xi(X_0),xi(X_1))
# Quick plot of tau as function of X_1, X_2 assuming continuous support
def plotFunc(func):
    X_0 = np.linspace(0, 1, 1000)
    X_1 = np.linspace(0, 1, 1000)
    X, Y = np.meshgrid(X_0, X_1)
    Z = func(X, Y)'seaborn-white')
    plt.imshow(Z, vmin=1., vmax=4., origin='lower',
              extent=[X_0.min(), X_0.max(), X_1.min(), X_1.max()],

The true $\tau(x)$ as a function of $X_0$ and $X_1$

Double-sample causal trees

Now we apply the recipe from their paper:

  1. Draw a random subsample of size $s$ from ${1,\dots,n}$ without replacement and divide into two disjoint sets $\mathcal{I}$ and $\mathcal{J}$ such that $\lvert\mathcal{J}\rvert = \lceil s/2 \rceil$ and $\lvert\mathcal{I}\rvert = \lfloor s/2 \rfloor$.
  2. Grow a tree via recursive partitions and split using the $\mathcal{J}$ data but no $Y$ observations from the $\mathcal{I}$ sample. The splitting criteria to use for double-sample causal trees is the squared-error minimising split.
  3. Estimate the leaf-wise response from the $\mathcal{I}$ sample observations.

The causal tree point estimates are given by

where $X_i \in L$ means that $L(x)$ is the leaf containing $x$. However, because our example is based on $\tau(x)$, we will learn directly on that.

The sampling is done as follows:

# Draw a random subsample of size s.
# Choose s ints between 0 and n randomly
subSampleMask = random.sample(range(0, n), s)
# Create set I
setIMask = random.sample(subSampleMask,
setI = [X[setIMask]]
dfSetI = pd.DataFrame(data=X[setIMask])
# Create set J
setJMask = [i for i in subSampleMask if i not in setIMask]
setJ = [X[setJMask]]

The regression tree is trained on the $\mathcal{J}$ set using the sklearn decision tree with a mean-squared error criterion to determine the quality of the splits,

# Create tree on the J set
clf = tree.DecisionTreeRegressor(criterion='mse')
clf =[0], tau[setJMask])

Now we can produce the predictions for $\hat{\tau}(x)$ using the $\mathcal{I}$ set, and look at the out of sample $R^2$,

tau_hat = clf.predict(dfSetI.iloc[:,:d])
# Out of sample R^2:

A fairly strong out of sample score! Let’s have a look at the out of sample test more closely by plotting it (with some linear interpolation),

def plotResults(dfSetI,tau_hat):
    # Set up a regular grid of interpolation points
    xi, yi = np.linspace(dfSetI.iloc[:,0].min(),
                         dfSetI.iloc[:,0].max(), 100), np.linspace(dfSetI.iloc[:,1].min(), dfSetI.iloc[:,1].max(), 100)
    xi, yi = np.meshgrid(xi, yi)
    # Interpolate
    rbf = scipy.interpolate.Rbf(dfSetI.iloc[:,0], dfSetI.iloc[:,1], tau_hat, function='linear')
    zi = rbf(xi, yi)

    plt.imshow(zi, vmin=1., vmax=4., origin='lower',
               extent=[dfSetI.iloc[:,0].min(), dfSetI.iloc[:,0].max(), dfSetI.iloc[:,1].min(), dfSetI.iloc[:,1].max()],

$\hat{\tau}(x)$ using the double-sample causal tree.

This does seem to capture the treatment well given the two dimensions which matter - despite the ‘noise’ from the other four dimensions. The authors also outline a method for (essentially) bootstrapping which repeats the above process but with different samples. The final estimate is

This is a really nice application of classification and regression trees to causal effects. However, I did find their paper a bit difficult to follow in places, especially on the splitting rules for causal trees versus regression trees. Specifically, in the example, it seems like mean squared prediction error is the right splitting criterion for the causal tree because the tree is being directly trained on the treatment, $\tau$. But in general, $\tau$ is not directly available and the splits of the tree must be chosen to as to maximise the variance of $\hat{\tau}(X_i)$ for $i\in\mathcal{J}$ instead.

Wager, S., & Athey, S. (2017). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association. Link to paper.

Econometrics in Python Part II - Fixed effects

In this second in a series on econometrics in Python, I’ll look at how to implement fixed effects.

For inspiration, I’ll use a recent NBER working paper by Azar, Marinescu, and Steinbaum on Labor Market Concentration. In their paper, they look at the monopsony power of firms to hire staff in over 8,000 geographic-occupational labor markets in the US, finding that “going from the 25th percentile to the 75th percentile in concentration is associated with a 17% decline in posted wages”. I’ll use a vastly simplified version of their model. Their measure of concentration is denoted $\text{HHI}$, and they look at how this affects $\ln(w)$, the log of the real wage. The model has hiring observations which are organised by year-quarter, labelled $t$, and market (commuting zone-occupation), $m$:

where $\alpha_t$ is a fixed year-quarter effect, and $\nu_m$ is a fixed market effect.

The code

The most popular statistics module in Python is statsmodels, but pandas and numpy make data manipulation and creation easy.

import pandas as pd
import statsmodels.formula.api as sm
import numpy as np

As far as I can see the data behind the paper is not available, so the first job is to create some synthetic data for which the answer, the value of $\beta$, is known. I took the rough value for $\beta$ from the paper, but the other numbers are made up.

# Synthetic data settings
commZonesNo = 15
yearQuarterNo = 15
numObsPerCommPerYQ = 1000
beta = -0.04
HHI =np.random.uniform(3.,6.,size=[commZonesNo,

# Different only in first index (market)
cZeffect =np.tile(np.tile(np.random.uniform(high=10.,
cZnames = np.tile(np.tile(['cZ'+str(i) for i in range(commZonesNo)],
# Different only in second index (year-quarter)
yQeffect = np.tile(np.tile(np.random.uniform(high=10.,
yQnames = np.tile(np.tile(['yQ'+str(i) for i in range(yearQuarterNo)],
# commZonesNo x yearQuarterNo x obs error matrix
HomoErrorMat = np.random.normal(size=[commZonesNo,

logrealwage = beta*HHI+cZeffect+yQeffect+HomoErrorMat
df = pd.DataFrame({'logrealwage':logrealwage.flatten(),
    Cz       HHI  logrealwage   yQ
0  cZ0  5.175476     5.683932  yQ0
1  cZ0  4.829876     4.732797  yQ0
2  cZ0  5.284036     5.261500  yQ0
3  cZ0  4.024909     4.027340  yQ0
4  cZ0  3.674694     3.802822  yQ0

Running the regressions is very easy as statsmodels can use the patsy package, which is based on similar equation parsers in R and S. Here’s the normal OLS measure:

normal_ols = sm.ols(formula='logrealwage ~ HHI',
                            OLS Regression Results                            
Dep. Variable:            logrealwage   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     23.39
Date:                Fri, 16 Feb 2018   Prob (F-statistic):           1.32e-06
Time:                        23:20:13   Log-Likelihood:            -6.3063e+05
No. Observations:              225000   AIC:                         1.261e+06
Df Residuals:                  224998   BIC:                         1.261e+06
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
Intercept      9.6828      0.044    217.653      0.000       9.596       9.770
HHI           -0.0470      0.010     -4.837      0.000      -0.066      -0.028
Omnibus:                     5561.458   Durbin-Watson:                   0.127
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4713.381
Skew:                           0.289   Prob(JB):                         0.00
Kurtosis:                       2.590   Cond. No.                         25.3

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

As an aside, the intercept can be suppressed by using ‘logrealwage ~ HHI-1’ rather than ‘logrealwage ~ HHI’. The straight OLS approach does not do a terrible job for the point estimate, but the $R^2$ is terrible. Fixed effects can get us out of the, er, fix…

FE_ols = sm.ols(formula='logrealwage ~ HHI+C(Cz)+C(yQ)-1',
                            OLS Regression Results                            
Dep. Variable:            logrealwage   R-squared:                       0.937
Model:                            OLS   Adj. R-squared:                  0.937
Method:                 Least Squares   F-statistic:                 1.154e+05
Date:                Fri, 16 Feb 2018   Prob (F-statistic):               0.00
Time:                        23:20:31   Log-Likelihood:            -3.1958e+05
No. Observations:              225000   AIC:                         6.392e+05
Df Residuals:                  224970   BIC:                         6.395e+05
Df Model:                          29                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
C(Cz)[cZ0]        4.4477      0.016    281.428      0.000       4.417       4.479
C(Cz)[cZ1]       10.0441      0.016    636.101      0.000      10.013      10.075
C(Cz)[cZ10]      10.4897      0.016    663.407      0.000      10.459      10.521
C(Cz)[cZ11]      12.2364      0.016    773.920      0.000      12.205      12.267
C(Cz)[cZ12]       8.7909      0.016    556.803      0.000       8.760       8.822
C(Cz)[cZ13]       8.6307      0.016    545.917      0.000       8.600       8.662
C(Cz)[cZ14]      12.1590      0.016    768.937      0.000      12.128      12.190
C(Cz)[cZ2]       11.5722      0.016    733.999      0.000      11.541      11.603
C(Cz)[cZ3]        7.4164      0.016    469.160      0.000       7.385       7.447
C(Cz)[cZ4]       10.4830      0.016    663.719      0.000      10.452      10.514
C(Cz)[cZ5]        6.2675      0.016    396.634      0.000       6.237       6.299
C(Cz)[cZ6]        7.1924      0.016    455.045      0.000       7.161       7.223
C(Cz)[cZ7]        5.2567      0.016    333.177      0.000       5.226       5.288
C(Cz)[cZ8]        6.3380      0.016    401.223      0.000       6.307       6.369
C(Cz)[cZ9]        5.8814      0.016    372.246      0.000       5.850       5.912
C(yQ)[T.yQ1]      0.1484      0.012     12.828      0.000       0.126       0.171
C(yQ)[T.yQ10]    -2.2139      0.012   -191.442      0.000      -2.237      -2.191
C(yQ)[T.yQ11]    -0.2461      0.012    -21.280      0.000      -0.269      -0.223
C(yQ)[T.yQ12]     3.0241      0.012    261.504      0.000       3.001       3.047
C(yQ)[T.yQ13]    -2.0663      0.012   -178.679      0.000      -2.089      -2.044
C(yQ)[T.yQ14]     2.9468      0.012    254.817      0.000       2.924       2.969
C(yQ)[T.yQ2]      2.0992      0.012    181.520      0.000       2.076       2.122
C(yQ)[T.yQ3]      5.0328      0.012    435.196      0.000       5.010       5.055
C(yQ)[T.yQ4]      7.4619      0.012    645.253      0.000       7.439       7.485
C(yQ)[T.yQ5]     -0.9819      0.012    -84.907      0.000      -1.005      -0.959
C(yQ)[T.yQ6]     -2.0630      0.012   -178.396      0.000      -2.086      -2.040
C(yQ)[T.yQ7]      5.4874      0.012    474.502      0.000       5.465       5.510
C(yQ)[T.yQ8]     -1.5476      0.012   -133.824      0.000      -1.570      -1.525
C(yQ)[T.yQ9]      0.2312      0.012     19.989      0.000       0.208       0.254
HHI              -0.0363      0.002    -14.874      0.000      -0.041      -0.031
Omnibus:                        0.866   Durbin-Watson:                   1.994
Prob(Omnibus):                  0.648   Jarque-Bera (JB):                0.873
Skew:                           0.003   Prob(JB):                        0.646
Kurtosis:                       2.993   Cond. No.                         124.

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

This is much closer to the right answer of $\beta=-0.04$, has half the standard error, and explains much more of the variation in $\ln(w_{m,t})$.

Econometrics in Python part I - Double machine learning

The idea is that this will be the first in a series of posts covering econometrics in Python.

At a conference a couple of years ago, I saw Victor Chernozhukov present his paper on Double/Debiased Machine Learning for Treatment and Causal Parameters. It really stuck with me because of the way it fruitfully combines econometrics and machine learning. Machine learning is obsessed with prediction, and is getting very good at it. Econometrics is obsessed with causality and identification, and pretty good at it - especially in ‘messy’ real-world situations. Combining the two promises to provide powerful new ways to understand causal relationships.

So, in brief, what does ‘double’ machine learning do? It’s one way to bring the power of machine learning for prediction on non-linear problems into an econometric context in which the asymptotic properties of the estimates of the parameters of interest are known to behave well. The problem is that just applying machine learning to predict outcomes ($Y$) from a treatment or variable ($D$) in the presence of many controls ($X$) will lead to biased estimates of the model parameter ($\theta$). The double machine learning method of Chernozhukov et al. delivers point estimators that have a $\sqrt{N}$ rate of convergence for $N$ observations and are approximately unbiased and normally distributed.

The clearest example, which I reproduce here from the paper, is of partially linear regression. They take it themselves from Robinson (1988). The model is

with $X = (X_1,X_2,\dots,X_p)$ a vector of controls. Here $\eta=(m,g)$ can be non-linear.

The naïve machine learning approach would be to estimate $D\cdot\hat{\theta} + \hat{g}(X)$ using one of the standard algorithms (random forest, support vector regression, etc). The authors of the paper show that doing this means that $\hat{\theta}$ effectively has a slower than root $N$ rate of convergence due to the bias in estimating $\hat{g}$.

They suggest overcoming this bias using orthogonalisation and splitting the sample. They obtain $\hat{V} = D - \hat{m}(X)$ using machine learning on an auxiliary sample; finding the mean of $D$ given $X$. With the remaining observations, they define an estimator for $\theta$, $\check{ \theta}$, which is a function of $\hat{V}$, $D$, $X$, and an estimate of $g$ given by $\hat{g}$. As they say (with a slight change in notation),

By approximately orthogonalizing $D$ with respect to $X$ and approximately removing the direct effect of confounding by subtracting an estimate of $\hat{g}$, $\check{ \theta}$ removes the effect of regularization bias … The formulation of $\check{ \theta}$ also provides direct links to both the classical econometric literature, as the estimator can clearly be interpreted as a linear instrumental variable (IV) estimator, …

The double comes from estimating $\hat{V}$ in the auxiliary problem, as well as $\hat{g}$, before calculating the estimator $\check{\theta}$. In their paper, Chernozhukov et al. also discuss estimating average treatment effects, local average treatment effects, and average treatment effects for the treated using a more general formulation where $g$ is a function of both $X$ and $D$. More on the technical details and other applications can be found in the paper; here we’ll look at an example estimation in the context of a model

Double machine learning in practice

So how does it work in practice? With the sample split into two sets of size $n=N/2$ indexed by $i\in I$ and $i \in I^C$, there are four steps,

  1. Estimate $\hat{V} = D - \hat{m}(X)$ using $I^C$
  2. Estimate $Y = \hat{g}(X) + \hat{u}$ using $I^C$
  3. Estimate
  4. Construct the efficient, cross-fitting estimate:

Simulated example

This example was inspired by this great post by Gabriel Vasconcelos. To make it more exciting, I’ll use a slightly different functional form with $g$ as sine squared and $m$ as the wrapped Cauchy distribution:

Let’s keep it simple and set $\nu=0$ and $\gamma=1$. The wrapped Cauchy looks like this:

The wrapped Cauchy distribution

Our model is

$x_i$ has length $K=10$ and will be generated from a multivariate normal distribution, the true value of the causal parameter will be $\theta=0.5$, and $b_k=1/k$. The errors will be

and I’m going to use the scikit learn implementation of the random forest regressor to do the machine learning. The code, using Python 3, is

import numpy as np
from sklearn.datasets import make_spd_matrix
import math
import statsmodels.api as sm # for OLS
from sklearn.ensemble import RandomForestRegressor # Our ML algorithm
# Set up the environment
randomseednumber = 11022018
N = 500 # No. obs
k=10 # = No. variables in x_i
theta=0.5 # Structural parameter
b= [1/k for k in range(1,11)] # x weights
sigma = make_spd_matrix(k,randomseednumber) #
# NUmber of simulations
MC_no = 500
def g(x):
    return np.power(np.sin(x),2)
def m(x,nu=0.,gamma=1.):
    return 0.5/math.pi*(np.sinh(gamma))/(np.cosh(gamma)-np.cos(x-nu))
# Array of estimated thetas to store results
theta_est = np.zeros(shape=[MC_no,3])

for i in range(MC_no):
    # Generate data: no. obs x no. variables in x_i
    X = np.random.multivariate_normal(np.ones(k),sigma,size=[N,])
    G = g(,b))
    M = m(,b))
    D = M+np.random.standard_normal(size=[500,])
    Y =,D)+G+np.random.standard_normal(size=[500,])
    # Now run the different methods
    # OLS --------------------------------------------------
    OLS = sm.OLS(Y,D)
    results =
    theta_est[i][0] = results.params[0]

    # Naive double machine Learning ------------------------
    naiveDMLg =RandomForestRegressor(max_depth=2)
    # Compute ghat,Y)
    Ghat = naiveDMLg.predict(X)
    naiveDMLm =RandomForestRegressor(max_depth=2),D)
    Mhat = naiveDMLm.predict(X)
    # vhat as residual
    Vhat = D-Mhat
    theta_est[i][1] = np.mean(,Y-Ghat))/np.mean(,D))

    #  Cross-fitting DML -----------------------------------
    # Split the sample
    I = np.random.choice(N,,replace=False)
    I_C = [x for x in np.arange(N) if x not in I]
    # Ghat for both
    Ghat_1 = RandomForestRegressor(max_depth=2).fit(X[I],Y[I]).predict(X[I_C])
    Ghat_2 = RandomForestRegressor(max_depth=2).fit(X[I_C],Y[I_C]).predict(X[I])
    # Mhat and vhat for both
    Mhat_1 = RandomForestRegressor(max_depth=2).fit(X[I],D[I]).predict(X[I_C])
    Mhat_2 = RandomForestRegressor(max_depth=2).fit(X[I_C],D[I_C]).predict(X[I])
    Vhat_1 = D[I_C]-Mhat_1
    Vhat_2 = D[I] - Mhat_2
    theta_1 = np.mean(,(Y[I_C]-Ghat_1)))/np.mean(,D[I_C]))
    theta_2 = np.mean(,(Y[I]-Ghat_2)))/np.mean(,D[I]))
    theta_est[i][2] = 0.5*(theta_1+theta_2)

Below is a plot of the kernel density estimates of $\theta$ using seaborn. The peak of the distributions for OLS and double ML without cross-fitting are off the true value, but the cross-fitted double ML procedure gets much closer.

The estimates of $\theta$

So there it is: double machine learning is a useful technique at the intersection of machine learning and econometrics which can produce approximately unbiased and normally distributed point estimates in semi-parametric settings.

Making a publication quality plot with Python (and latex)

High level languages like Python and R are great partly because entire workflows can be done within them; from data ingestion, to cleaning, to analysis, to producing plots and regression tables. But when I looked around online, I found that there wasn’t a huge amount of information on how to do one of the last stages - producing plots - in a way that is consistent with what is required by journals.

Journals often ask for figures in lossless formats (think pdf, tiff, svg, and eps as opposed to png or jpg), in certain sizes, and at a specific or minimum resolution. What is most important in a journal article or working paper is clearly the content. However, when a paper looks good, and its figures are crisp, clear, and communicate a message, it helps to deliver the content in the way intended. Low resolution, rasterised images just look bad (at best) and distract from the point of the figure (at worst).

If you’re not convinced of the benefits of lossless formats over rasterised ones, try creating a pdf with more than five or six very high resolution but simple (= not too many features) plots embedded as pngs using latex. Yes, it’s big! For simple plots, lossless formats take up far less space on disk and they look better.

As an author, making plots easily digestible and a part of the narrative of the paper can enhance the experience for the reader substantially. Different colours, types of line, and levels of transparency can help here. For consistency, you may want to include the same mathematical symbols in the main text as you do in the legend using latex.

I often want to do all of the above, so I’ve put together an example. While much will need to be changed for other examples, it’s a good starting point.

Let’s begin with some parameter settings. Matplotlib, the Python plotting library, has a style file with defaults in. I’m going to change a few of these. They’re mostly obvious.

xtick.labelsize: 16
ytick.labelsize: 16
font.size: 15
figure.autolayout: True
figure.figsize: 7.2,4.45
axes.titlesize : 16
axes.labelsize : 17
lines.linewidth : 2
lines.markersize : 6
legend.fontsize: 13
mathtext.fontset: stix STIXGeneral

The first real choice is about the relative size of the figure, and the font sizes of the plot title, axes titles, and label sizes. The journal Nature requires that double-column figures be 183 mm wide, which is 7.2 inches using the units which Matplotlib works in. Heights can differ, but I choose an eye-pleasing 1.6 ratio. The only other really important choice here is what fonts to use. I’ve gone for Stix as it can be used for both latex and normal fonts, and it looks professional in plots. To use these settings, they just need to go in a plain text file called ‘PaperDoubleFig.mplstyle’ which we can point Matplotlib at later.

The data

The data for the example are from the Office for National Statistics (ONS) website and are international comparisons of productivity. You can find the raw data here, and I’m using the data behind their Figure 4. Although the page has an interactive feature, a hover which tells you the values in cross-section, the plot is hard to read if (as the article presumes) you’re interested mostly in the UK relative to the other countries. We’ll fix that later. Personally, I’m not a fan of the horizontal guide lines so I’ll be omitting those too.

The code

Let’s see some code! Import the required libraries, and load up the style file for Matplotlib:

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter,AutoMinorLocator
import matplotlib as mpl
# Directory and filename; style file open
# Change to the directory which contains the current script
dirFile = os.path.dirname(os.path.join('YourDirHere',
# Load style file, 'PaperDoubleFig.mplstyle'))
# Make some style choices for plotting
colourWheel =['#329932',
dashesStyles = [[3,1],
            [4, 1, 1, 1, 1, 1]]
# Point to the data
fileName = 'rftxlicp1017unlinked.xls'

You’ll notice that I’ve also defined colourWheel and dashesStyles. These are for plotting, and encode different colours and line dashes respectively. Each line in the time series plot will be differentiated by iterating over both. The colours originally come from colour brewer, with a few additions and changes. There are more colours than are needed, but this set of colours can be used in other plots, or for qualitative choropleths.

Next, read in the data and process it. Here’s one I made earlier:

# Read in and prep the data
df = pd.read_excel(os.path.join(dirFile,fileName),
                  sheetname='Table 4')
df = df.iloc[3:,:]
df = df.rename(columns=dict(zip(df.columns,df.iloc[0,:])))
df = df.iloc[2:,:]
df = df.rename(columns={np.nan:'Year'}).set_index('Year')
df = df.dropna()
# Take a look to make sure this has worked nicely

which produces:

  Canada France Germany Italy Japan UK US G7 G7 exc. UK
1995 87 86.2 86.7 94.6 86.2 80.7 79.8 83 83.2
1996 87.6 87.2 87.7 95.3 88.5 82 81.7 84.6 84.8
1997 89.5 88.8 89.7 96.6 88.4 84 83.5 86.1 86.3
1998 90.7 90.5 90.1 97.2 88 85.8 86 87.7 87.8
1999 93 91.8 92.3 97.6 88.5 87.6 88.7 89.7 89.8

so it looks like everything has been processed correctly. Now onto the plotting:

fig, ax = plt.subplots()
for j,series in enumerate(df.columns[:-2]):
        alphaVal = 1.
        alphaVal = 0.6
        linethick = 3.5
                linestyle = '-',
ax.yaxis.major.formatter._useMathText = True
ax.yaxis.set_minor_locator(  AutoMinorLocator(5))
ax.xaxis.set_minor_locator(  AutoMinorLocator(5))
nameOfPlot = 'GDP per hour (constant prices, indexed to 2007)'
ax.legend(frameon=False, loc='upper left',ncol=2,handlelength=4)

Here’s the plot which comes out, necessarily rendered here as a png but saved as a pdf if you use the code above:

The code looks more complicated than just using df.plot() but we get a lot for that extra complexity, including: the UK productivity time series being emphasised relative to those of the other countries, each country having a unique combination of colour and dash, the number of tick marks being sensible, only individual countries being plotted (df.columns[:-2] omits the two G7 related columns), and the y-axis ticks labels appearing on the right-hand side (which I think looks better for time series plots). Note that I’ve specified dpi=300 to set the resolution to what is often the minimum for journal submission.

I mentioned latex in the post title. Assuming you have the full Miktex distribution installed (for example), then adding in latex is as easy as putting it into the title or label strings so that


gives in the figure. This will render just like the other text in the figure.

No doubt there is a better way of packaging some of this up for use in other examples. As an alternative, Seaborn is a fantastic tool for quick, easy, good-looking data visualisation in Python but for journal articles, a straighter, plainer style like this may be more appropriate.