This Python package implements the panel (and single entity) event study models, covering the naive two-way fixed effects implementation, and the interaction-weighted implementation from Sun and Abraham (2021) (derived from https://github.com/lsun20/EventStudyInteract).
The package includes three sets of functions:
- Data cleaning: Functions to prepare data frames for the analytical set of functions, e.g., ensuring that they are in the right format, and have the right columns (with the right content)
- Analytical: Direct implementation of the event study models
- Utilities: Tools to assist the user in setting up input-output flows
pip install paneleventstudy
Refer to the Jupyter notebook example_paneleventstudy.ipynb
paneleventstudy.dropmissing(data, event)
- data: pandas dataframe
- event: String matching the label of the column in data corresponding to the event variable; this should be a dummy variable indicating the pre- (values 0 prior to relative time 0) and post- periods (values 1 from relative time 0 onwards)

Returns a copy of data with rows corresponding to missing event dropped, and displays on the interface the number of rows in data and the number of rows in the output data frame.
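The behaviour described above can be sketched with plain pandas. This is a minimal illustration, not the package's own code: the helper name drop_missing_event and the column names entity, time and treated are hypothetical.

```python
import pandas as pd


def drop_missing_event(data: pd.DataFrame, event: str) -> pd.DataFrame:
    """Hypothetical stand-in for illustration; not the package's own code."""
    out = data.dropna(subset=[event]).copy()
    # Report row counts before and after dropping
    print(f"Rows in input: {len(data)}; rows in output: {len(out)}")
    return out


# Toy data: one row has a missing event indicator
df = pd.DataFrame(
    {
        "entity": ["a", "a", "b", "b"],
        "time": [0, 1, 0, 1],
        "treated": [0, 1, None, 1],
    }
)
clean = drop_missing_event(df, event="treated")
```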
Most panel event study methods in the literature, including all methods currently covered in this package, only work with balanced panel data.
That is to say, all entities in the data set must have the same number of time periods.
This package checks whether the input data is indeed a balanced panel in two steps:
- Check if all entities $i$ have the same number of time periods:
  $$L(\mathbf{t}_{i}) = L_T \ \forall \ i \ \text{for} \ L_T \in \mathcal{N}^{+}$$
- Optionally, if the calendar time variable in the input data frame is numeric, further check if the smallest and largest time values are the same for all entities $i$:
  $$\min(\mathbf{t}_{i}) = L_{T\min} \ \forall \ i \ \text{for} \ L_{T\min} \in \mathcal{N}^{0+}$$
  $$\max(\mathbf{t}_{i}) = L_{T\max} \ \forall \ i \ \text{for} \ L_{T\max} \in \mathcal{N}^{+}$$
paneleventstudy.balancepanel(data, group, event, calendartime, check_minmax=True)
- data: pandas dataframe
- group: String matching the label of the column in data containing the categorical levels of the individual entities
- event: String matching the label of the column in data corresponding to the event variable; this should be a dummy variable indicating the pre- (values 0 prior to relative time 0) and post- periods (values 1 from relative time 0 onwards)
- calendartime: Label of the column in data containing calendar times going from the earliest to the last time period; this can be user-fed or generated from gencalendartime_numerics
- check_minmax: Boolean to trigger a deeper check, which verifies that all entities in group have the same minimum and maximum values in calendartime; default is True, and can be used when the calendartime column is generated from gencalendartime_numerics, or is already preset as integers

Returns a Boolean indicating if data is balanced.
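The two checks above can be sketched in a few lines of pandas. This is an illustrative helper under assumed names (is_balanced, entity, ct), not the package's implementation; unlike balancepanel, it does not take an event argument.

```python
import pandas as pd


def is_balanced(data, group, calendartime, check_minmax=True):
    """Hypothetical helper mirroring the two checks described above."""
    per_entity = data.groupby(group)[calendartime]
    # Check 1: every entity has the same number of time periods
    balanced = per_entity.size().nunique() == 1
    if check_minmax:
        # Check 2: every entity shares the same first and last period
        balanced = (
            balanced
            and per_entity.min().nunique() == 1
            and per_entity.max().nunique() == 1
        )
    return bool(balanced)


# Both panels have three periods per entity, but the second is shifted for "b"
aligned = pd.DataFrame({"entity": list("aaabbb"), "ct": [0, 1, 2, 0, 1, 2]})
shifted = pd.DataFrame({"entity": list("aaabbb"), "ct": [0, 1, 2, 1, 2, 3]})

print(is_balanced(aligned, "entity", "ct"))
print(is_balanced(shifted, "entity", "ct", check_minmax=False))
print(is_balanced(shifted, "entity", "ct", check_minmax=True))
```

The shifted panel passes the count-only check but fails the deeper one, which is the case the check_minmax option is meant to catch.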
In the difference-in-differences (DiD) methodology, of which event studies are a variant, the treatment effect is estimated, if treatment is truly exogenous, by comparing the average outcomes of the treated group (received treatment) against those of the control group (did not receive treatment).
Setting endogeneity aside, in panel event studies, and indeed dynamic DiD, it is possible for no group to be never-treated, e.g., in staggered DiD setups. Choosing the right control group is essential for establishing the right counterfactual, on which unbiased or consistent treatment effect estimates are conditioned. In these cases, we may want to use the last-treated group as the control group. This was argued prominently in recent DiD papers, such as Callaway and Sant'Anna, and Sun and Abraham (2021).
This function identifies which group(s) serve as the control group, whether never-treated or last-treated, which is essential for the analytical functions later.
paneleventstudy.identifycontrols(data, group, event)
- data: pandas dataframe
- group: String matching the label of the column in data containing the categorical levels of the individual entities
- event: String matching the label of the column in data corresponding to the event variable; this should be a dummy variable indicating the pre- (values 0 prior to relative time 0) and post- periods (values 1 from relative time 0 onwards)

Returns a copy of data with a new column labelled control_group indicating if the entity in group is a control group (never-treated or last-treated).
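One way to see how controls can be identified from group and event alone is sketched below. It assumes a balanced panel and an absorbing treatment (once treated, always treated), so that the entity with the fewest treated periods is either never-treated (zero treated periods) or last-treated. The helper flag_controls and the 0/1 coding of control_group are assumptions for illustration, not the package's code.

```python
import pandas as pd


def flag_controls(data, group, event):
    """Hypothetical sketch: flag entities with the fewest treated periods."""
    out = data.copy()
    # Number of treated periods per entity, broadcast back to every row
    treated_periods = out.groupby(group)[event].transform("sum")
    # Minimum of 0 means never-treated; otherwise these are the last-treated
    out["control_group"] = (treated_periods == treated_periods.min()).astype(int)
    return out


# Staggered adoption with no never-treated entity: "b" is treated last
staggered = pd.DataFrame(
    {
        "entity": list("aaabbb"),
        "treated": [0, 1, 1, 0, 0, 1],
    }
)
flagged = flag_controls(staggered, "entity", "treated")
controls = flagged.groupby("entity")["control_group"].first().to_dict()
```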
Event study methodologies essentially estimate the dynamic treatment effect relative to the onset of treatment (i.e., before and after treatment).
This is akin to asking "what is the effect of treatment $l$ periods before or after its onset?"
This function generates a column containing these relative times from two sets of information:
- Calendar time; and
- When the treatment or event happened
paneleventstudy.genreltime(data, group, event, calendartime, reltime='reltime', check_balance=True)
- data: pandas dataframe
- group: String matching the label of the column in data containing the categorical levels of the individual entities
- event: String matching the label of the column in data corresponding to the event variable; this should be a dummy variable indicating the pre- (values 0 prior to relative time 0) and post- periods (values 1 from relative time 0 onwards)
- calendartime: Label of the column in data containing integer calendar times going from 0 (earliest time period) to T (last time period); this can be generated from gencalendartime_numerics
- reltime: String to be used as the label of a new column containing relative times going from -L to +K as integers, with 0 being the timing of treatment onset; default is 'reltime'
- check_balance: Checks if data is a balanced panel; default option is True

Returns a copy of data with a new column labelled reltime containing the relative times for all calendar times in calendartime, by entities in group.
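The arithmetic behind relative time is simple enough to sketch: take each entity's first treated calendar time as its onset, then subtract it from calendar time. The helper add_reltime and the column names below are hypothetical, and the sketch skips the balance check that genreltime offers.

```python
import pandas as pd


def add_reltime(data, group, event, calendartime, reltime="reltime"):
    """Hypothetical sketch: relative time = calendar time minus onset."""
    out = data.copy()
    # Onset is the first calendar time at which the event dummy equals 1
    onset = out.loc[out[event] == 1].groupby(group)[calendartime].min()
    out[reltime] = out[calendartime] - out[group].map(onset)
    return out


panel = pd.DataFrame(
    {
        "entity": list("aaaabbbb"),
        "ct": [0, 1, 2, 3, 0, 1, 2, 3],
        "treated": [0, 0, 1, 1, 0, 1, 1, 1],
    }
)
with_reltime = add_reltime(panel, "entity", "treated", "ct")
print(with_reltime["reltime"].tolist())
```

Entity a is treated at calendar time 2 and entity b at calendar time 1, so the same calendar period maps to different relative times across entities.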
Sun and Abraham (2021)'s interaction-weighted event study methodology requires (1) the estimation of cohort-specific treatment effects, and (2) cohort shares by relative times. To do this, the methodology requires an identifier for groups that were treated in the same calendar time.
paneleventstudy.gencohort(data, group, event, calendartime, cohort='cohort', check_balance=True)
- data: pandas dataframe
- group: String matching the label of the column in data containing the categorical levels of the individual entities
- event: String matching the label of the column in data corresponding to the event variable; this should be a dummy variable indicating the pre- (values 0 prior to relative time 0) and post- periods (values 1 from relative time 0 onwards)
- calendartime: Label of the column in data containing integer calendar times going from 0 (earliest time period) to T (last time period); this can be generated from gencalendartime_numerics
- cohort: String to be used as the label of a new column containing the treatment cohort of respective entities in group; default is 'cohort'
- check_balance: Checks if data is a balanced panel; default option is True

Returns a copy of data with a new column labelled cohort indicating the treatment cohort that the entities in group belong to.
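As a rough illustration of what a cohort identifier looks like, the sketch below labels each entity with the calendar time of its treatment onset, so that entities treated in the same period share a label. The helper add_cohort and the column names are hypothetical; this is not the package's implementation.

```python
import pandas as pd


def add_cohort(data, group, event, calendartime, cohort="cohort"):
    """Hypothetical sketch: cohort = calendar time of treatment onset."""
    out = data.copy()
    onset = out.loc[out[event] == 1].groupby(group)[calendartime].min()
    # Every row of an entity carries that entity's onset time
    out[cohort] = out[group].map(onset)
    return out


panel = pd.DataFrame(
    {
        "entity": list("aaabbbccc"),
        "ct": [0, 1, 2] * 3,
        "treated": [0, 0, 1, 0, 1, 1, 0, 0, 1],
    }
)
with_cohort = add_cohort(panel, "entity", "treated", "ct")
cohorts = with_cohort.groupby("entity")["cohort"].first().to_dict()
```

Here entities a and c form one cohort (treated at calendar time 2) and entity b forms another (treated at calendar time 1).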
To generalise across the many possible formats in which calendar times can be presented (e.g., milliseconds, seconds, days, months, quarters, years, or even custom ones), calendar times can be converted into numerics. This eases computation in the rest of the package, by converting the calendar time column into integers running from 0 (earliest) to T (latest).
paneleventstudy.gencalendartime_numerics(data, group, event, calendartime, calendartime_numerics='ct')
- data: pandas dataframe
- group: String matching the label of the column in data containing the categorical levels of the individual entities
- event: String matching the label of the column in data corresponding to the event variable; this should be a dummy variable indicating the pre- (values 0 prior to relative time 0) and post- periods (values 1 from relative time 0 onwards)
- calendartime: Label of the column in data containing calendar times going from the earliest time period to the last time period
- calendartime_numerics: String to be used as the label of a new column containing the calendar times converted into nonnegative integers with 0 being the earliest, and T being the latest period; default is 'ct'

Returns a copy of data with a new column labelled calendartime_numerics containing the numeric version of calendartime, which can then be passed to the analytical functions.
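The conversion can be pictured as ranking the distinct calendar times and using the rank as the new value. The sketch below does this for quarter labels; it assumes the raw values sort into chronological order, and the helper add_numeric_time and column names are hypothetical rather than taken from the package.

```python
import pandas as pd


def add_numeric_time(data, calendartime, calendartime_numerics="ct"):
    """Hypothetical sketch: map sorted distinct times to 0, 1, ..., T."""
    out = data.copy()
    # Assumes the raw calendar values sort chronologically
    ordered = sorted(out[calendartime].unique())
    mapping = {value: position for position, value in enumerate(ordered)}
    out[calendartime_numerics] = out[calendartime].map(mapping)
    return out


panel = pd.DataFrame(
    {
        "entity": list("aaabbb"),
        "quarter": ["2020Q1", "2020Q2", "2020Q3"] * 2,
    }
)
numeric = add_numeric_time(panel, "quarter")
print(numeric["ct"].tolist())
```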
The basic functional form of estimating equations in the DiD and event study methodology is a linear regression, which requires variables in the RHS of the equation to not be multicollinear, or invariant. This function checks if this is indeed the case.
paneleventstudy.checkcollinear(data, rhs)
- data: pandas dataframe
- rhs: A list containing strings matching the labels of the columns in data to be checked for collinearity and invariance; precedence goes to the rightmost columns in rhs (if two columns are collinear, the one appearing later in rhs is not included in the output)

Returns a list of labels in rhs which should be dropped to avoid multicollinearity or invariance in the rhs columns of data.
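A simple way to approximate this kind of check is to walk through rhs from right to left, dropping constant columns and any column that is perfectly correlated with one already kept. The sketch below does exactly that; the pairwise-correlation approach, the helper name columns_to_drop and the tolerance are assumptions for illustration and may differ from what checkcollinear does internally.

```python
import pandas as pd


def columns_to_drop(data, rhs, tol=1e-10):
    """Hypothetical sketch: flag invariant or pairwise-collinear columns."""
    kept, dropped = [], []
    for col in reversed(rhs):  # rightmost columns are considered first
        if data[col].nunique() <= 1:
            dropped.append(col)  # invariant column
            continue
        if any(abs(data[col].corr(data[k])) > 1 - tol for k in kept):
            dropped.append(col)  # perfectly correlated with a kept column
            continue
        kept.append(col)
    return dropped


df = pd.DataFrame(
    {
        "x1": [1, 2, 3, 4],
        "x2": [2, 4, 6, 8],  # exactly 2 * x1
        "x3": [1, 0, 1, 0],
        "const": [5, 5, 5, 5],
    }
)
to_drop = columns_to_drop(df, ["x1", "x2", "x3", "const"])
print(to_drop)
```

Because x2 appears later in rhs than x1, x2 is retained and x1 is the one reported for dropping.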
The basic functional form of estimating equations in the DiD and event study methodology is a linear regression, which requires the matrix containing the variables in the RHS of the equation to satisfy full column rank. This function checks if this is indeed the case.
paneleventstudy.checkfullrank(data, rhs, intercept='Intercept')
- data: pandas dataframe
- rhs: A list containing strings matching the labels of the columns in data to be checked for full rank; precedence goes to the rightmost columns in rhs
- intercept: String containing the label of the intercept column (a column of 1s), which will be given precedence in the procedure; set as None if no intercept is contained in data; the default is 'Intercept', which is the default label when using patsy.dmatrices()

Returns a list of labels in rhs which should be dropped for the matrix containing the rhs columns of data to satisfy full rank.
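The idea of a full column rank check can be sketched with numpy by adding columns one at a time and keeping only those that raise the rank of the matrix. The incremental approach, the helper name rank_deficient_columns and the use of numpy.linalg.matrix_rank are assumptions for illustration; the package may use a different procedure.

```python
import numpy as np
import pandas as pd


def rank_deficient_columns(data, rhs, intercept="Intercept"):
    """Hypothetical sketch: find columns that do not add to the rank."""
    order = list(reversed(rhs))  # rightmost columns get precedence
    if intercept is not None and intercept in order:
        # The intercept is considered before anything else
        order.remove(intercept)
        order.insert(0, intercept)
    kept, dropped = [], []
    for col in order:
        candidate = data[kept + [col]].to_numpy(dtype=float)
        if np.linalg.matrix_rank(candidate) == len(kept) + 1:
            kept.append(col)
        else:
            dropped.append(col)
    return dropped


# Dummy variable trap: d0 + d1 equals the intercept column
df = pd.DataFrame(
    {
        "Intercept": [1, 1, 1, 1],
        "d0": [1, 1, 0, 0],
        "d1": [0, 0, 1, 1],
        "x": [0.5, 1.2, -0.3, 2.0],
    }
)
redundant = rank_deficient_columns(df, ["Intercept", "d0", "d1", "x"])
print(redundant)
```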
Estimates dynamic treatment effects using a standard TWFE model.
Specifically, we are interested in estimating the dynamic treatment effect at each relative time, with reltime=-1 as the reference period.
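A standard TWFE event study specification, written in the notation this document uses for the interaction-weighted estimator, is sketched below. It is an assumed textbook form rather than an equation taken from the package: $\alpha_i$ and $\alpha_t$ are entity and calendar time fixed effects, $D_{i,t}^{l}$ is a dummy for being $l$ periods from treatment onset, and $l = -1$ is omitted as the reference period.

$$Y_{i,t} = \alpha_i + \alpha_t + \sum_{l=-K}^{-2} \beta_{l} D_{i,t}^{l} + \sum_{l=0}^{M} \beta_{l} D_{i,t}^{l} + \mathbf{X_{i, t} \gamma} + \varepsilon_{i, t}$$

Under this form, each $\beta_l$ is read as the dynamic treatment effect at relative time $l$, measured against the period just before onset.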
paneleventstudy.naivetwfe_eventstudy(data, outcome, event, group, reltime, calendartime, covariates, vcov_type='robust', check_balance=True)
- data: pandas dataframe
- outcome: String matching the label of the column in data corresponding to the outcome variable; this is the LHS variable in the regression
- event: String matching the label of the column in data corresponding to the event variable; this should be a dummy variable indicating the pre- (values 0 prior to relative time 0) and post- periods (values 1 from relative time 0 onwards)
- group: String matching the label of the column in data containing the categorical levels of the individual entities
- reltime: Label of the column in data containing integer relative times going from -L to +K, with 0 being the timing of treatment onset; this can be generated from calendartime using genreltime, and reltime=-1 is automatically chosen as the reference period
- calendartime: Label of the column in data containing integer calendar times going from 0 (earliest time period) to T (last time period); this can be generated from gencalendartime_numerics
- covariates: List of columns corresponding to control variables in data to be included in the RHS of the regression; if no covariates are to be included, set covariates=[]
- vcov_type: String corresponding to the type of variance-covariance estimator in linearmodels.PanelOLS.fit(), which is called during the estimation process; default option is 'robust'
- check_balance: Checks if data is a balanced panel; default option is True

Returns a pandas dataframe with 3 columns, indexed to reltime:
- parameter: The point estimates of the dynamic treatment effects
- lower: The lower confidence bound of parameter
- upper: The upper confidence bound of parameter
Estimates dynamic treatment effects using the interaction-weighted estimator described in Sun and Abraham (2021).
Again, we are interested in estimating the dynamic treatment effect at each relative time $l$, denoted $\beta_l$ below.
This implementation has 3 broad steps.
1. Calculate the cohort shares by relative time, $\mathbb{E} (E_i = e | E_i \in g )$, where $g$ is the set of relative times included in the analysis. This package uses a no-constant linear regression model with an OLS estimator, as per Sun and Abraham (2021)'s original Stata package (https://github.com/lsun20/EventStudyInteract). Using a linear regression approach, instead of simple tabulation, allows for calculation of standard errors of the cohort share estimates.
   $$1\{E_i = e | E_i \in g \} = w_{e,l} D_{i, t}^{l} + e_i$$
2. Estimate the cohort-specific average treatment effects, $CATT_{e, l}$, by interacting the cohort dummy with the treatment / relative time dummy, $1(E_i = e) D_{i,t}^{l}$.
   $$Y_{i,t} = \alpha_i + \alpha_t + \sum_{e} \sum_{l=-K}^{-2} \delta_{e,l} 1(E_i = e) D_{i,t}^{l} + \sum_{e} \sum_{l=0}^{M} \delta_{e,l} 1(E_i = e) D_{i,t}^{l} + \mathbf{X_{i, t} \gamma} + \varepsilon_{i, t}$$
3. Calculate the interaction-weighted average treatment effects using output from steps 1 and 2 for every relative time $l$. In this current version, the estimated confidence bands are scaled the same way.
   $$\hat{\beta}_l = \sum_{e} \hat{\delta}_{e,l} \hat{w}_{e,l} \ \forall \ l$$
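The final aggregation step is a weighted sum over cohorts at each relative time, which can be sketched with pandas. The numbers below are made-up toy values chosen only to make the arithmetic easy to follow, not estimates from any data set, and the column names delta and weight are hypothetical labels for the cohort-specific effects and cohort shares.

```python
import pandas as pd

# Toy cohort-specific effects from step 2, one row per (cohort, relative time)
catt = pd.DataFrame(
    {
        "cohort": [2, 2, 3, 3],
        "reltime": [0, 1, 0, 1],
        "delta": [1.0, 2.0, 3.0, 4.0],
    }
)
# Toy cohort shares from step 1; shares sum to one within each relative time
shares = pd.DataFrame(
    {
        "cohort": [2, 3, 2, 3],
        "reltime": [0, 0, 1, 1],
        "weight": [0.5, 0.5, 0.25, 0.75],
    }
)

merged = catt.merge(shares, on=["cohort", "reltime"])
# Step 3: sum effect times share across cohorts, separately for each reltime
beta = (merged["delta"] * merged["weight"]).groupby(merged["reltime"]).sum()
print(beta)
```

At relative time 0 the two cohorts are weighted equally, giving 2.0; at relative time 1 the later cohort carries three quarters of the weight, giving 3.5.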
paneleventstudy.interactionweighted_eventstudy(data, outcome, event, group, cohort, reltime, calendartime, covariates, vcov_type='robust', check_balance=True)
- data: pandas dataframe
- outcome: String matching the label of the column in data corresponding to the outcome variable; this is the LHS variable in the regression
- event: String matching the label of the column in data corresponding to the event variable; this should be a dummy variable indicating the pre- (values 0 prior to relative time 0) and post- periods (values 1 from relative time 0 onwards)
- group: String matching the label of the column in data containing the categorical levels of the individual entities
- cohort: Label of the column in data containing the integer-coded cohorts of the data set, as generated from gencohort (e.g., all entities treated in calendar time 3 should take the value 3 in this column)
- reltime: Label of the column in data containing integer relative times going from -L to +K, with 0 being the timing of treatment onset; this can be generated from calendartime using genreltime, and reltime=-1 is automatically chosen as the reference period
- calendartime: Label of the column in data containing integer calendar times going from 0 (earliest time period) to T (last time period); this can be generated from gencalendartime_numerics
- covariates: List of columns corresponding to control variables in data to be included in the RHS of the regression; if no covariates are to be included, set covariates=[]
- vcov_type: String corresponding to the type of variance-covariance estimator in linearmodels.PanelOLS.fit(), which is called during the estimation process; default option is 'robust'
- check_balance: Checks if data is a balanced panel; default option is True

Returns a pandas dataframe with 3 columns, indexed to reltime:
- parameter: The point estimates of the interaction-weighted average treatment effects
- lower: The lower confidence bound of parameter
- upper: The upper confidence bound of parameter
Estimates dynamic treatment effects for a single entity (time series setting).
paneleventstudy.timeseries_eventstudy(data, outcome, reltime, covariates, vcov_type='HC3')
- data: pandas dataframe
- outcome: String matching the label of the column in data corresponding to the outcome variable; this is the LHS variable in the regression
- reltime: Label of the column in data containing integer relative times going from -L to +K, with 0 being the timing of treatment onset
- covariates: List of columns corresponding to control variables in data to be included in the RHS of the regression; if no covariates are to be included, set covariates=[]
- vcov_type: String corresponding to the type of variance-covariance estimator in statsmodels.regression.linear_model.RegressionResults.get_robustcov_results, which is called during the estimation process; default option is 'HC3'

Returns a pandas dataframe with 3 columns, indexed to reltime:
- parameter: The point estimates of the dynamic treatment effects
- lower: The lower confidence bound of parameter
- upper: The upper confidence bound of parameter
This function calls plotly's graph_objects module to show the event study estimates (dynamic treatment effects) as a line chart, together with their confidence bands (which can be manually excluded). It also exports an interactive graph as an html file via plotly's plotly.io.write_html(), and a static graph as a png file via plotly's plotly.io.write_image(). Users of this package may, of course, opt for other charting packages, modules, or scripts to plot the event study estimates.
paneleventstudy.eventstudyplot(input, big_title='Event Study Plot (With 95% CIs)', path_output='', name_output='eventstudyplot')
- input: Output from the analytical functions (paneleventstudy.naivetwfe_eventstudy(), paneleventstudy.interactionweighted_eventstudy(), paneleventstudy.timeseries_eventstudy()); manually exclude columns lower and upper if only the point estimates are to be shown
- big_title: String containing the main title of the figure; default is 'Event Study Plot (With 95% CIs)'
- path_output: String containing the directory in which the output files should be saved; default is '', i.e., the present working directory
- name_output: String containing the file name of the image and html files to be generated; default is 'eventstudyplot'
- pandas>=1.4.3
- numpy>=1.23.0
- linearmodels>=4.27
- plotly>=5.9.0
- statsmodels>=0.13.2
- sympy>=1.10.1