[ ]: # Initialize Otter
import otter
grader = otter.Notebook("ps4.ipynb")
1 Econ 140 – Problem Set 4
Before getting started on the assignment, run the cell at the very top that imports otter and the
cell below which will import the packages we need.
Important: As mentioned in problem set 0, if you leave this notebook alone for a while and come
back, to save memory datahub will “forget” which code cells you have run, and you may need to
restart your kernel and run all of the cells from the top. That includes this code cell that imports
packages. If you get “not defined” errors (a NameError in Python), this is because you didn’t run
an earlier code cell that you needed to run. It might be this cell or the otter cell above.
[4]: import numpy as np
import pandas as pd
import statsmodels.api as sm
1.1 Problem 1. Efficient Markets Hypothesis
Does the stock market efficiently use information in valuing stocks? The Efficient Markets Hypothesis (“EMH”), developed by Nobel Prize winner Eugene Fama, maintains that current stock
prices fully reflect all available information. An implication of this hypothesis is that returns in
the current period should not be systematically related to information known in earlier periods.
Otherwise, we could use this information to predict stock returns, thus violating EMH. As an analyst at an investment management company, you have been tasked with examining the validity
of the EMH. You obtained a dataset of 142 randomly-selected firms listed on the New York Stock
Exchange, consisting of the following four variables:
Variable Description
return Total return from holding a firm’s stock over a one-year period, from January 2014 to December 2014. Note that an annual return such as 31.4% is entered in the dataset as 31.4.
dkr A firm’s debt to capital ratio in 2013.
lnetincome Natural log of the net income for a firm in 2013.
lsalary Natural log of the total compensation for a firm’s CEO in 2013.
Using these data, you estimated the following two regressions.
Regression 1
Regression 2
Question 1.a. Based on the results for the two OLS regressions, what is the sign of the correlation
between dkr and lnetincome? Alternatively, is there not enough information to determine the sign
of the correlation?
Type your answer here, replacing this text.
Question 1.b. Interpret the coefficient on lnetincome in Regression 2.
Type your answer here, replacing this text.
Now suppose you added another variable to the regression, and obtained the following regression
results.
Regression 3
Question 1.c. Suppose that you use Regression 3 to examine whether EMH holds. What are the
null and alternative hypotheses?
Type your answer here, replacing this text.
Question 1.d. Carry out the test in part (c) at the 5% level. Do you reject or fail to reject the
null hypothesis?
Type your answer here, replacing this text.
Question 1.e. Interpret the result you obtained in part (d), in light of your task of examining the
validity of EMH.
Type your answer here, replacing this text.
Question 1.f. Provide (at least) two reasons why there might be imperfect multicollinearity
present in Regression 3.
Type your answer here, replacing this text.
Question 1.g. Which of the following statements is true based on a comparison of Regression
2 and Regression 3?
– (i) dkr and lnetincome are highly-correlated.
– (ii) dkr and lsalary are highly-correlated.
– (iii) lnetincome and lsalary are highly-correlated.
– (iv) All of the above.
– (v) None of the above.
Type your answer here, replacing this text.
Question 1.h. The sample of 142 stocks includes only companies that were traded on the NYSE
as of the end of 2013. A company that went out of business, for instance, before the end of that
year could not enter the sample. How would this sampling affect the estimated coefficient relative
to the population regression?
Type your answer here, replacing this text.
1.2 Problem 2. Airlines and Antitrust
Antitrust authorities have long been concerned that airline carriers may exercise their market power
by charging higher fares. The greatest concern arises when one airline runs the vast majority of
flights in and out of an airport. Usually this happens when an airline designates an airport as
a national or regional “hub” of their operations. The dataset airfares.csv consists of average
fares and other characteristics of popular U.S. origin-destination pairs (e.g., Boston-Chicago) for
the year 2000.
Variable | Description | Units
lfare | logarithm of the average fare on the route | log of fare in 2000 dollars
dist | distance of the route | thousands of miles
passen | average number of passengers per day | thousands of passengers
concen | market share of biggest airline carrier on the route, measured in terms of passengers carried | fraction (e.g., 0.55 = 55% market share)
origin | city of origin of flight |
destin | city of destination of flight |
[ ]: af = pd.read_csv("airfares.csv")
af.head()
Question 2.a. Regress lfare on dist, passen and concen, with robust standard errors. Make
sure the cell below (and all regression questions in this assignment) shows your regression results
like you’ve done in previous assignments, otherwise we cannot give credit. This assignment will be
a little less guided. Make sure to use different variable names for each separate coding part to avoid
unexpected errors from reusing variables. Refer to previous assignments if you need a refresher on
how we performed different regressions. Don’t forget to add a constant to your regressions.
Question 2.b. What is the interpretation of the coefficient on passen?
Type your answer here, replacing this text.
Question 2.c. Based on your OLS estimates, and assuming the OLS assumptions hold, what is the partial
effect of the market share of the largest carrier on air fares? Is your answer consistent with the
hypothesis that firms use their market power to charge higher prices?
Type your answer here, replacing this text.
Question 2.d. How would you test whether market power is used the same way on more popular
and less popular routes? Write down the model and the hypothesis, carry out the estimation and
the test.
This question is for your code, the next is for your explanation.
Question 2.e. Explain.
Type your answer here, replacing this text.
Question 2.f. We need to question whether the results of the regression in part (d) reveal
a causal relationship between concentration and airfares. In particular, we are concerned whether
our estimation results on U.S. data are valid for other markets, such as Europe and Asia. Give one
reason why the results would not be “externally valid” if applied to the airline industry in one of
these other two regions.
Type your answer here, replacing this text.
Question 2.g. We are also aware of several potential threats to “internal validity” of the results.
For each one of the five main internal validity threats, describe one possibility that could plausibly
lead to that particular threat.
Type your answer here, replacing this text.
1.3 Problem 3. World Health Organization
The World Health Organization (“WHO”) collects data which assesses the health care outcomes
of the populations in 191 countries across the globe, as well as exploring potential explanations for
those outcomes. These data are published in the annual “World Health Report.” The file who.csv
contains five years (1993-1997) of these data. The variables in the panel of countries include:
Variable Description
comp composite measure of health care attainment
dale disability-adjusted life expectancy
year 1993,1994,1995,1996,1997
hexp per capita health expenditure
hc3 educational attainment (tertiary schooling)
country number assigned to country
oecd dummy indicator for an OECD member country
gini Gini coefficient for income inequality
geff World Bank measure of government effectiveness
voice World Bank measure of democratization of the political process
tropics dummy indicator of tropical location
popden population density (people per square mile)
pubthe proportion of health expenditure paid by public authorities
gdpc normalized per-capita GDP
[5]: who = pd.read_csv("who.csv")
who.head()
Question 3.a. Create a new variable for the dataset that is the square of educational attainment
(hc3). Then regress life expectancy (dale) on health expenditures (hexp), the educational
attainment in the country (hc3), and its square (the variable you created). For now, select rows from
1997 and use only these rows in the regression. Use robust standard errors and don’t forget to
add a constant term. Comment on whether you think the relationship between life expectancy and
education is linear or quadratic and why you came to that conclusion.
This question is for your code, the next is for your explanation.
Question 3.b. Explain.
Type your answer here, replacing this text.
Question 3.c. To the specification in part (a), add the additional control variables: gini, tropics,
popden, pubthe, gdpc, voice, and geff. Test whether these additional regressors are jointly
significant (we do the F-test for you in this part; you just have to interpret it). What effect does
inclusion of these additional controls have on the coefficients of the other included regressors?
This question is for your code, the next is for your explanation.
[7]: # This is the code for your regression.
model_3b = …
results_3b = …
results_3b.summary()
[8]: # Please don’t change this cell, just run it.
results_3b.f_test("gini, tropics, popden, pubthe, gdpc, voice, geff").summary()
Question 3.d. Explain.
Type your answer here, replacing this text.
Question 3.e. Return to the simpler regression specification in part (a). We want to see if the
determinants of life expectancy are different for rich and poor countries. Use membership in the
“Organisation for Economic Co-operation and Development” (oecd) as the indicator of a rich country.
The OECD had 30 member countries during this time period. Perform a test of the hypothesis
that all three of the coefficients in the population regression are equal for OECD and non-OECD
countries.
Hint: You will need to create three new variables.
This question is for your code, the next is for your explanation.
[52]: # This extra code cell may be helpful
…
Question 3.f. Explain.
Type your answer here, replacing this text.
Question 3.g. Give an example of a time-invariant variable that would result in different life
expectancy across countries.
Type your answer here, replacing this text.
Question 3.h. Estimate the regression having a fixed effect for each country in the sample. We
have defined the endogenous and exogenous variables for you, you just have to fill in the rest.
Notice how we converted the country variable to a set of dummy variables for each country. You
can ignore the coefficients for every country variable. What change took place in the coefficients
on the education variables? Explain why you think there was a change in these coefficients.
This question is for your code, the next is for your explanation.
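[49]: # .get_dummies transforms a categorical variable into a dataframe of dummy variables.
# dtype=float keeps the dummies numeric (newer pandas versions default to booleans).
countries = pd.get_dummies(who['country'], prefix='', prefix_sep='', dtype=float)
who_country = who[['dale', 'hexp', 'hc3', 'hc3^2']].join(countries)
y_3h = who_country['dale']
# Drop the outcome and one country dummy ('191') so the dummies are not collinear with the constant.
X_3h = sm.add_constant(who_country.drop(columns=['dale', '191']))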
model_3h = sm.OLS(…, …)
results_3h = model_3h.fit(…)
results_3h.summary()
Question 3.i. Explain.
Type your answer here, replacing this text.
Question 3.j. Give an example of an entity-invariant variable, which is excluded from the estimated regression model in part (a), that would result in variation in life expectancy over time.
Type your answer here, replacing this text.
Question 3.k. Perform regression with time fixed effects. Are the results consistent with your
reasoning about the entity-invariant variables? The procedure for this question will be similar to
3.h. Drop the dummy variable for 1993 for this question.
This question is for your code, the next is for your explanation.
Question 3.l. Explain.
Type your answer here, replacing this text.
Question 3.m. Perform a test that all time fixed effects are jointly equal to zero. Remember that
we excluded 1993. What is the result of your test?
This question is for your code, the next is for your explanation.
Question 3.n. Explain.
Type your answer here, replacing this text.