The fourth in the series of posts covering econometrics in Python. This time: automating the boring business of running multiple regressions on columns in a pandas dataframe.
Data science in Python is the open source package pandas, more or less. It’s an amazing, powerful library and the firms, researchers, and governments who use it are indebted to its maintainers, including Wes McKinney.
When data arrive in your Python code, they’re most likely going to arrive in a pandas dataframe. If you’re doing econometrics, you’re then likely to want to do regressions from the dataframe with the minimum of fuss and the maximum of flexibility. This post sets out a way to do that with a few extra functions.
There are two main ways to run regressions in Python: statsmodels or scikit-learn. The latter is more geared toward machine learning, so I'll be using the former for regressions. The typical way to do this might be the following (ignoring imports and the data loading), with a pandas dataframe df that has an x variable 'Concrete' and a y variable 'Age':
mod = sm.OLS(df['Age'], df['Concrete'])
results = mod.fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Age R-squared: 0.414
Model: OLS Adj. R-squared: 0.414
Method: Least Squares F-statistic: 728.1
Date: 05 May 2018 Prob (F-statistic): 1.05e-121
Time: 00:00:00 Log-Likelihood: -5672.3
No. Observations: 1030 AIC: 1.135e+04
Df Residuals: 1029 BIC: 1.135e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Concrete 1.2693 0.047 26.984 0.000 1.177 1.362
==============================================================================
Omnibus: 761.497 Durbin-Watson: 0.998
Prob(Omnibus): 0.000 Jarque-Bera (JB): 9916.238
Skew: 3.411 Prob(JB): 0.00
Kurtosis: 16.584 Cond. No. 1.00
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
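One thing worth flagging: sm.OLS does not add an intercept automatically, so the regression above passes through the origin. If you want a constant term, wrap the regressors in statsmodels' add_constant helper first, along these lines:

mod = sm.OLS(df['Age'], sm.add_constant(df['Concrete']))
results = mod.fit()
print(results.summary())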
In the rest of this post I will outline a more flexible and extensible way of doing this, which will allow for multiple models and controls, with code snippets you can copy, paste, and then forget about.
Loading the data
Our data are on the compressive strength of concrete - I know, brilliant - and while we could talk more about the fascinating history of concrete and its importance for the economy, we should get to the stats. The data come from the UC Irvine Machine Learning Repository.
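The code below assumes a handful of imports which, as mentioned, are elided from the snippets; a minimal set consistent with what follows is:

from collections import OrderedDict
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col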
df = pd.read_excel('concrete_data.xls')
df.head()
Cement (component 1)(kg in a m^3 mixture) \
0 540.0
1 540.0
2 332.5
3 332.5
4 198.6
Blast Furnace Slag (component 2)(kg in a m^3 mixture) \
0 0.0
1 0.0
2 142.5
3 142.5
4 132.4
Fly Ash (component 3)(kg in a m^3 mixture) \
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
Water (component 4)(kg in a m^3 mixture) \
0 162.0
1 162.0
2 228.0
3 228.0
4 192.0
Superplasticizer (component 5)(kg in a m^3 mixture) \
0 2.5
1 2.5
2 0.0
3 0.0
4 0.0
Coarse Aggregate (component 6)(kg in a m^3 mixture) \
0 1040.0
1 1055.0
2 932.0
3 932.0
4 978.4
Fine Aggregate (component 7)(kg in a m^3 mixture) Age (day) \
0 676.0 28
1 676.0 28
2 594.0 270
3 594.0 365
4 825.5 360
Concrete compressive strength(MPa, megapascals)
0 79.986111
1 61.887366
2 40.269535
3 41.052780
4 44.296075
Those column names are rather long! I’ll just take the first word of each column name, and take a quick look at the data:
df = df.rename(columns=dict(zip(df.columns, [x.split()[0] for x in df.columns])))
df.describe()
Cement Blast Fly Water Superplasticizer \
count 1030.000000 1030.000000 1030.000000 1030.000000 1030.000000
mean 281.165631 73.895485 54.187136 181.566359 6.203112
std 104.507142 86.279104 63.996469 21.355567 5.973492
min 102.000000 0.000000 0.000000 121.750000 0.000000
25% 192.375000 0.000000 0.000000 164.900000 0.000000
50% 272.900000 22.000000 0.000000 185.000000 6.350000
75% 350.000000 142.950000 118.270000 192.000000 10.160000
max 540.000000 359.400000 200.100000 247.000000 32.200000
Coarse Fine Age Concrete
count 1030.000000 1030.000000 1030.000000 1030.000000
mean 972.918592 773.578883 45.662136 35.817836
std 77.753818 80.175427 63.169912 16.705679
min 801.000000 594.000000 1.000000 2.331808
25% 932.000000 730.950000 7.000000 23.707115
50% 968.000000 779.510000 28.000000 34.442774
75% 1029.400000 824.000000 56.000000 46.136287
max 1145.000000 992.600000 365.000000 82.599225
Defining functions to run regressions
Let’s set up a function that takes a dataframe and runs a regression on selected columns:
def RegressionOneModel(df, Xindvars, Yvar, summary=True):
    # Accept the y variable as a string or a single-element list
    if type(Yvar) == str:
        Yvar = [Yvar]
    if len(Yvar) != 1:
        print("Error: please enter a single y variable")
        return np.nan
    else:
        # Drop rows with missing values in any of the variables used
        xf = df.dropna(subset=Yvar+Xindvars)[Xindvars+Yvar]
        Xexog = xf[Xindvars]
        model = sm.OLS(xf[Yvar].dropna(), Xexog)
        reg = model.fit()
        if summary:
            return reg.summary2()
        else:
            return reg
How does this work? It’s easiest to show with an example.
regResults = RegressionOneModel(df, ['Cement', 'Blast'], 'Concrete')
print(regResults)
Results: Ordinary least squares
==================================================================
Model: OLS Adj. R-squared: 0.878
Dependent Variable: Concrete AIC: 8332.8955
Date: 2018-05-05 00:00 BIC: 8342.7701
No. Observations: 1030 Log-Likelihood: -4164.4
Df Model: 2 F-statistic: 3705.
Df Residuals: 1028 Prob (F-statistic): 0.00
R-squared: 0.878 Scale: 190.64
---------------------------------------------------------------------
Coef. Std.Err. t P>|t| [0.025 0.975]
---------------------------------------------------------------------
Cement 0.1079 0.0017 63.4736 0.0000 0.1046 0.1113
Blast 0.0671 0.0045 14.9486 0.0000 0.0583 0.0760
------------------------------------------------------------------
Omnibus: 7.719 Durbin-Watson: 0.983
Prob(Omnibus): 0.021 Jarque-Bera (JB): 6.461
Skew: 0.117 Prob(JB): 0.040
Kurtosis: 2.690 Condition No.: 3
==================================================================
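Because of the summary keyword argument, the same function can also hand back the fitted statsmodels results object rather than a formatted table, which is handy if you want to pull out numbers programmatically. For instance, to grab the coefficients via the standard params attribute:

reg = RegressionOneModel(df, ['Cement', 'Blast'], 'Concrete', summary=False)
print(reg.params)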
This function takes a variable number of X variables and regresses y (‘Concrete’) on them. But what if we want to run many regressions at once? Fortunately, statsmodels has some capability to do this. Unfortunately, it’s not all that intuitive and, to use it with ease, we’ll need to extend it. I want the result to be flexible enough that it:
- works with X as a string, a list, or a list of lists (for multiple models)
- accepts a number of controls which are the same in every model
- returns either a multi-model regression results summary or a single-model summary, as appropriate
To make this all work, we need a couple of extra functions. One just labels different models with Roman numerals and could be jettisoned. The other one is just a quick way of combining the variables to send to the regression.
def write_roman(num):
    roman = OrderedDict()
    roman[1000] = "M"
    roman[900] = "CM"
    roman[500] = "D"
    roman[400] = "CD"
    roman[100] = "C"
    roman[90] = "XC"
    roman[50] = "L"
    roman[40] = "XL"
    roman[10] = "X"
    roman[9] = "IX"
    roman[5] = "V"
    roman[4] = "IV"
    roman[1] = "I"

    def roman_num(num):
        for r in roman.keys():
            x, y = divmod(num, r)
            yield roman[r] * x
            num -= (r * x)
            if num > 0:
                roman_num(num)
            else:
                break

    return "".join([a for a in roman_num(num)])
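A quick sanity check of the labelling helper:

print(write_roman(1), write_roman(4), write_roman(9))
# I IV IX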
def combineVarsList(X, Z, combine):
    if combine:
        return X + Z
    else:
        return X
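So combineVarsList(['Cement'], ['Age'], True) returns ['Cement', 'Age'], while with combine=False the controls are left off and you get ['Cement'] back.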
Finally, there is a function which decides how to call the underlying regression code, and which stitches the results from different models together:
def RunRegression(df, XX, y, Z=['']):
    # If XX is not a list of lists, make it one -
    # - first by checking if type is string
    if type(XX) == str:  # Check if it is one string
        XX = [XX]
    # - second for if it is a flat list
    if not any(isinstance(el, list) for el in XX):
        XX = [XX]
    if type(y) != str:  # Check y for string
        print('Error: please enter string for dependent variable')
        return np.nan
    title_string = 'OLS Regressions; dependent variable '+y
    # If Z is not a list, make it one
    if type(Z) == str:
        Z = [Z]
    #XX = np.array(XX)
    # Check whether there is just a single model to run
    if len(XX) == 1:
        Xpassvars = list(XX[0])
        if len(Z[0]) != 0:
            Xpassvars = list(XX[0]) + Z
        regRes = RegressionOneModel(df, Xpassvars, [y], summary=False)
        regResSum2 = regRes.summary2()
        regResSum2.add_title(title_string)
        return regResSum2
    elif len(XX) > 1:
        # Load in Z here if appropriate
        addControls = False
        if len(Z[0]) != 0:
            addControls = True
        # Case with multiple models
        info_dict = {'R-squared': lambda x: "{:.2f}".format(x.rsquared),
                     'Adj. R-squared': lambda x: "{:.2f}".format(x.rsquared_adj),
                     'No. observations': lambda x: "{0:d}".format(int(x.nobs))}
        regsVec = [RegressionOneModel(df, combineVarsList(X, Z, addControls),
                                      [y], summary=False) for X in XX]
        model_names_strList = ['Model ' + write_roman(i)
                               for i in range(1, len(XX)+1)]
        float_format_str = '%0.2f'
        uniqueVars = np.unique([item for sublist in XX for item in sublist])
        uniqueVars = [str(x) for x in uniqueVars]
        results_table = summary_col(results=regsVec,
                                    float_format=float_format_str,
                                    stars=True,
                                    model_names=model_names_strList,
                                    info_dict=info_dict,
                                    regressor_order=uniqueVars+Z)
        results_table.add_title(title_string)
        return results_table
Putting it all together
Let’s see how it works. Firstly, the simple case of one y on one x.
regResults = RunRegression(df, 'Blast', 'Concrete')
print(regResults)
OLS Regressions; dependent variable Concrete
==================================================================
Model: OLS Adj. R-squared: 0.400
Dependent Variable: Concrete AIC: 9971.8287
Date: 2018-05-05 00:00 BIC: 9976.7660
No. Observations: 1030 Log-Likelihood: -4984.9
Df Model: 1 F-statistic: 688.0
Df Residuals: 1029 Prob (F-statistic): 1.57e-116
R-squared: 0.401 Scale: 936.87
---------------------------------------------------------------------
Coef. Std.Err. t P>|t| [0.025 0.975]
---------------------------------------------------------------------
Blast 0.2203 0.0084 26.2293 0.0000 0.2038 0.2367
------------------------------------------------------------------
Omnibus: 39.762 Durbin-Watson: 0.477
Prob(Omnibus): 0.000 Jarque-Bera (JB): 43.578
Skew: -0.502 Prob(JB): 0.000
Kurtosis: 3.081 Condition No.: 1
==================================================================
Or several x variables:
regResults = RunRegression(df, ['Cement', 'Blast'], 'Concrete')
print(regResults)
OLS Regressions; dependent variable Concrete
==================================================================
Model: OLS Adj. R-squared: 0.878
Dependent Variable: Concrete AIC: 8332.8955
Date: 2018-05-05 00:00 BIC: 8342.7701
No. Observations: 1030 Log-Likelihood: -4164.4
Df Model: 2 F-statistic: 3705.
Df Residuals: 1028 Prob (F-statistic): 0.00
R-squared: 0.878 Scale: 190.64
---------------------------------------------------------------------
Coef. Std.Err. t P>|t| [0.025 0.975]
---------------------------------------------------------------------
Cement 0.1079 0.0017 63.4736 0.0000 0.1046 0.1113
Blast 0.0671 0.0045 14.9486 0.0000 0.0583 0.0760
------------------------------------------------------------------
Omnibus: 7.719 Durbin-Watson: 0.983
Prob(Omnibus): 0.021 Jarque-Bera (JB): 6.461
Skew: 0.117 Prob(JB): 0.040
Kurtosis: 2.690 Condition No.: 3
==================================================================
Here comes the fun - to run multiple models, we need only pass a list of lists as the X variable in the function:
Model_1_X = ['Cement', 'Blast']
Model_2_X = ['Coarse', 'Fine']
Model_3_X = ['Fly', 'Water']
ManyModelResults = RunRegression(df, [Model_1_X, Model_2_X, Model_3_X],
                                 'Concrete')
print(ManyModelResults)
OLS Regressions; dependent variable Concrete
===========================================
Model I Model II Model III
-------------------------------------------
Blast 0.07***
(0.00)
Cement 0.11***
(0.00)
Coarse 0.03***
(0.00)
Fine 0.01***
(0.00)
Fly 0.00
(0.01)
Water 0.19***
(0.00)
R-squared 0.88 0.81 0.78
Adj. R-squared 0.88 0.81 0.78
No. observations 1030 1030 1030
===========================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
There’s a keyword argument, Z, via which we can pass controls (here just ‘Age’):
ManyModelsWControl = RunRegression(df, [Model_1_X, Model_2_X, Model_3_X],
                                   'Concrete', Z='Age')
print(ManyModelsWControl)
Finally, it’s easy to pass multiple controls:
ManyModelsWControls = RunRegression(df, [Model_1_X, Model_2_X, Model_3_X],
                                    'Concrete', Z=['Age', 'Superplasticizer'])
print(ManyModelsWControls)
OLS Regressions; dependent variable Concrete
===========================================
Model I Model II Model III
-------------------------------------------
Blast 0.05***
(0.00)
Cement 0.08***
(0.00)
Coarse 0.03***
(0.00)
Fine -0.01**
(0.00)
Fly -0.06***
(0.01)
Water 0.12***
(0.00)
Age 0.10*** 0.11*** 0.10***
(0.01) (0.01) (0.01)
Superplasticizer 1.04*** 1.44*** 1.84***
(0.06) (0.08) (0.08)
R-squared 0.92 0.87 0.88
Adj. R-squared 0.92 0.87 0.87
No. observations 1030 1030 1030
===========================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
By the way, the statsmodels summary object which is returned here has an .as_latex() method - useful if you want to dump results straight into papers.
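For instance, to write the last table to a .tex file ready to be \input into a paper (the filename here is just illustrative):

with open('regression_table.tex', 'w') as f:
    f.write(ManyModelsWControls.as_latex())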
Do you have a better way to quickly run many different kinds of OLS regressions from a pandas dataframe? Get in touch!
NB: I had to remove the doc strings in the above code because they slowed down the page a lot.