Part 1: Concepts and Definitions
Each question gives a definition or example. In the blank, write the LETTER of the term that the statement most likely refers to. Use each letter at most once.
1. ____ FiveThirtyEight uses this distribution because it has fatter tails than the normal distribution.
2. ____ As sample size grows, this statistic converges to the population mean.
3. ____ If I set α at a lower value, I reduce this.
4. ____ When population parameters are unknown, I need this to calculate the standard error of the mean.
5. ____ 1.96 is the approximate critical value for the 95% CI in this distribution.
6. ____ Method to test for differences in means across more than two (sub)samples.
7. ____ For a statistical test, when P < α, I reject this.
8. ____ I should use this test if I want to know whether COVID rates differ across age categories.
9. ____ If I set a lower level of confidence, I typically reduce this.
10. ____ This is how I typically refer to outcomes in the statistical/conceptual models that I am testing.
11. ____ I need to test this assumption, whether running a t-test, ANOVA, or regression.
12. ____ These are typically “wider” when n is smaller.
A. Chi-Square tests
B. ANOVA
C. t-tests
D. Correlation
E. Type-I error
F. Type-II error
G. Dependent variable
H. Independent variable
I. P < α
J. μ
K. CI
L. X̄
M. σ
N. σ_X̄
O. s
P. Null hypothesis
Q. Alternate hypothesis
R. z distribution
S. t distribution
T. Normality
U. Homogeneity of variances
Part 2a): Inferential Statistics of ACS Data
Open ACS16PAIncome.dta in STATA (from week 12) and use various commands to answer the following questions or perform the following analyses.
2.1) What is the unit of analysis? Also, in one sentence, describe what the age, gender, race, and education variables say about observation #1.
_____________________________________________________________________________________
2.2) What is the number of observations? _____________
Place the ONE STATA command here that you used to get the answer.
2.3) Use one graphical method and one non-graphical method to assess the normality of the variable “incwbf.” Based on these analyses, is it normally distributed? Be sure to put your answer in terms of kurtosis and skewness. PLEASE, DO NOT ADD YOUR STATA OUTPUT.
_____________________________________________________________________________________
_____________________________________________________________________________________
Place the ONE STATA command here for your graphical method.
Place the ONE STATA command here for your non-graphical method.
2.4) (BONUS, 4 points) If “incwbf” is NOT normally distributed, what is one method, based on information in the class lecture and DM 13.7, that might help fix the problem?
_____________________________________________________________________________________
2.5) Use a simple regression to calculate the raw gender gap in earnings. PLEASE, GIVE ONLY YOUR STATA COMMAND, NOT THE FULL OUTPUT.
Place the ONE STATA command here for your regression.
Based on your results, fill in the blanks to complete this sentence (and select either more or less as appropriate):
“Using data from the ACS, female workers make $_________ (more/less) on average than male workers (SE = ____; P < _____).”
2.6) Finally, use a simple regression to calculate the raw gap in earnings between white and black workers. PLEASE, GIVE ONLY YOUR STATA COMMAND, NOT THE FULL OUTPUT.
Place the ONE STATA command here for your regression.
Based on your results, fill in the blanks to complete this sentence (and select either more or less as appropriate):
“Using data from the ACS, black workers make $_________ (more/less) on average than white workers (SE = ____; P < _____).”
Part 2b): Inferential Statistics of Indiana Waste Data
Open WasteIndiana.dta in STATA. These are data on actual COVID cases in Indiana (zip code 15701) from June to September. The file also contains case predictions from BioBot (a testing company), based on tests of wastewater samples. For a given date in the data, the “biopredict” variable gives the predicted number of cases likely to occur, based on tests for the virus in sewage samples. The “two_week_after” variable gives the actual confirmed cases that occurred in the 15701 zip code two weeks after the date of the wastewater test.
2.7) What is the unit of analysis?
_____________________________________________________________________________________
2.8) What is the number of observations? _____________
2.9) Based on the description of the data above, what should be the dependent variable and what should be the independent variable? (That is, which variable is likely to predict the other?)
_____________________________________________________________________________________
2.10) Create a twoway plot that gives 1) a scatterplot between the DV and IV and 2) an lfit between the DV and IV.
Place the resulting graph here.
Briefly describe the general trend of the relationship between DV and IV.
_____________________________________________________________________________________
2.11) Perform a simple regression of the DV on the IV.
Place the ONE STATA command for the regression here.
Briefly describe the statistical significance of the coefficient and the overall fit of the model.
_____________________________________________________________________________________
2.12) Estimate the new cases that the model predicts based on your regression. Use the “predict varname, xb” command to do this.
Place the ONE STATA command here that you used to estimate linear predictions.
Run the command “sort date” to sort the data by the date on which the BioBot samples were taken. Create a second graph (“graph twoway”) that shows both 1) the “connected” relationship between “two_week_after” cases and date, and 2) the “connected” relationship between your regression-predicted cases (the variable you just created) and date. Use “help”, internet searches, and YouTube to figure things out.
Place the ONE STATA command here that you used to get the answer.
Place the resulting graph here.
Describe what this graph shows and what it might mean for local Public Health leaders in the use of BioBot data.
_____________________________________________________________________________________
_____________________________________________________________________________________
Part 3: Interpreting Statistics and Tests
Answer the following questions in the space provided. Please show your work/calculations.
3.1) Look at the t-test output from STATA. Use the information available in the table to compute the 95% CIs for each group (4 ?s). Use this T-distribution calculator to find the critical value for each group (df = n-1) (https://goodcalculators.com/student-t-value-calculator/).
Two-sample t test with equal variances
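------------------------------------------------------------------------------
Group    |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
0        |     537    238.5102    7.066302    163.7493           ?           ?
1        |     179    623.6313    22.69016    303.5737           ?           ?
---------+--------------------------------------------------------------------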
What do you suspect H0 is in this test? Using only the estimated CIs, would you reject H0 or fail to reject H0?
_____________________________________________________________________________________
_____________________________________________________________________________________
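As a reminder of the calculation that 3.1 relies on, the interval for each group can be sketched as follows. This assumes the "Std. Err." column is the estimated standard error of the mean (σ_X̄ in the Part 1 notation, estimated as s/√n) and that t* is the two-tailed 95% critical value for df = n-1 obtained from the calculator; the symbol t* is notation introduced here, not taken from the table.

```latex
\[
\text{95\% CI} \;=\; \bar{X} \,\pm\, t^{*}_{\,n-1}\,\sigma_{\bar{X}},
\qquad
\sigma_{\bar{X}} \approx \frac{s}{\sqrt{n}}
\]
```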
3.2) Look at the ANOVA output from STATA. Use the information available in the table to compute the F stat (1 ?).
Analysis of Variance
Source                  SS        df        MS           F     Prob > F
------------------------------------------------------------------------
Between groups     175437.942      2    87718.9708       ?
Within groups      692097.895    168    4119.63033
------------------------------------------------------------------------
Total              867535.836    170    5103.15198
What do you suspect H0 is in this test (in general terms)? Comparing the calculated F ratio to a rule-of-thumb threshold of 4, would you reject H0 or fail to reject H0?
_____________________________________________________________________________________
_____________________________________________________________________________________
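For reference, the F ratio in a one-way ANOVA table is the between-groups mean square divided by the within-groups mean square, where each mean square is its sum of squares divided by its degrees of freedom. The subscripted names below are shorthand introduced here for the table's "Between groups" and "Within groups" rows.

```latex
\[
F \;=\; \frac{MS_{\text{between}}}{MS_{\text{within}}}
  \;=\; \frac{SS_{\text{between}} / df_{\text{between}}}{SS_{\text{within}} / df_{\text{within}}}
\]
```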
3.3) Assume that family incomes for a community are normally distributed with μ = $60,000 and σ = $15,000. What is the probability that a family picked at random will have an income in the following ranges? Please answer in the form of a sentence.
1. Within one standard deviation of the mean.
2. Greater than or equal to $60,000.
3. Above $100,000.
4. Below $15,000.
3.3.1)________________________________________________________________________________
_____________________________________________________________________________________
3.3.2)________________________________________________________________________________
_____________________________________________________________________________________
3.3.3)________________________________________________________________________________
_____________________________________________________________________________________
3.3.4)________________________________________________________________________________
_____________________________________________________________________________________
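The kind of calculation that 3.3 asks for can be sketched in Python with the standard library. This is only an illustration under assumed values: the distribution below (μ = 50, σ = 10) and the cutoffs are hypothetical and deliberately differ from the question, and the helper names prob_between and prob_above are made up for this sketch.

```python
from statistics import NormalDist

# Hypothetical normal distribution for illustration only
# (these are NOT the values given in question 3.3).
dist = NormalDist(mu=50, sigma=10)


def prob_between(d, low, high):
    """Return P(low <= X <= high) for a normal distribution d."""
    return d.cdf(high) - d.cdf(low)


def prob_above(d, x):
    """Return P(X >= x) for a normal distribution d."""
    return 1 - d.cdf(x)


# Each cutoff is converted to a probability through the cumulative
# distribution function, which is equivalent to a z-table lookup
# with z = (x - mu) / sigma.
p_between = prob_between(dist, 45, 65)   # between z = -0.5 and z = 1.5
p_above = prob_above(dist, 70)           # above z = 2.0
p_below = dist.cdf(35)                   # below z = -1.5

print(f"P(45 <= X <= 65) = {p_between:.4f}")
print(f"P(X >= 70)       = {p_above:.4f}")
print(f"P(X <= 35)       = {p_below:.4f}")
```

The same three patterns (between two values, above a value, below a value) cover every range listed in the question once the cutoffs are expressed relative to μ and σ.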
3.4) What is the most important thing you learned in this class (2 lines)? What is something you might say at a holiday party describing something surprising you learned (2 lines)?
_____________________________________________________________________________________
_____________________________________________________________________________________
_____________________________________________________________________________________
_____________________________________________________________________________________