Instructions: Enter your answers on this sheet and upload it along with your do file (which should record every step you take). When you paste Stata output into this document, set it in 9-point Courier New font so that the output does not wrap. To avoid problems caused by different software versions or types of computer, please submit a PDF version of this document as well as your do file.
IMPORTANT INFORMATION: There is more than one way to skin a cat: different analysts will take different approaches, and one approach is not necessarily better than another. You should worry less about getting the “right” answer (which here depends on the assumptions you make) and more about documenting your process and logic in the assignment. I fully expect that groups will come up with different combinations of dependent and independent variables and different approaches to answering the questions of interest.
As a reminder, the abbreviated codebook for these data is found below, and there are files for the codebook and dataset in the folder Stata >> Stata project. You have previously worked with this dataset and will want to rely on the work you have already done to clean up or create new variables.
Use the eathealth15.dta dataset to complete the following tasks.
STEP 1 – Create a Table 1
- (5 pt) Generate two new variables, age_cat and primary2, from the original variables age (for age_cat) and primary (for primary2). Label the attributes as noted in the table below. Provide a tabulation of age_cat and of primary2 in this document, showing the attribute labels and including any missing observations.
HINT: Make sure that your code maintains observations that are missing.
REMINDER: save your new dataset (as you have done before), so that you don’t lose the variables age_cat and primary2.
New variable   Original variable used   Attribute (definition)   Label added to attribute
age_cat        age                      0 (age <40)              Young adults
                                        1 (age 40-59)            Middle-aged adults
                                        2 (age >=60)             Elder adults
primary2       primary                  0 (primary <50)          Low
                                        1 (primary 50-99)        Moderate
                                        2 (primary >=100)        High
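The HINT above about missing observations can be illustrated with a minimal sketch for age_cat (primary2 would follow the same pattern with its own cut points). The label name age_cat_lbl and the output filename eathealth15_v2.dta are placeholders chosen for illustration, not names required by the assignment.

```stata
* Sketch only: age_cat_lbl and eathealth15_v2.dta are placeholder names
use eathealth15.dta, clear

* Stata treats missing (.) as larger than any number, so an unguarded
* "age >= 60" would place observations with missing age in the oldest group
generate age_cat = .
replace age_cat = 0 if age < 40
replace age_cat = 1 if age >= 40 & age < 60
replace age_cat = 2 if age >= 60 & !missing(age)

* Attach the attribute labels from the table above
label define age_cat_lbl 0 "Young adults" 1 "Middle-aged adults" 2 "Elder adults"
label values age_cat age_cat_lbl

* The missing option shows missing observations as their own row
tabulate age_cat, missing

* Save under a new name so the original dataset is left untouched
save eathealth15_v2.dta, replace
```

Starting from an all-missing variable and filling in each category explicitly is one way to make sure that observations with missing age stay missing in age_cat.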
- (10 pt) Recall that we have discussed ways to prepare a Table 1 (see the PowerPoint slides “Creating a table 1.ppt” in the assignment description). Prepare a Table 1 for this dataset, using “exercised last week” and “did not exercise last week” (variable: exercise) as the column headings. The variables you should include in this table, each with an appropriate test of statistical significance comparing the exercise and no-exercise samples, are:
• female (sex)
• foodstamp
• bmi or bmi_cat (a derived variable)
• age_cat (a derived variable)
• primary2 (a derived variable)
• soda
• fastfood
- (5 pt) Describe in 1-2 sentences which statistical tests you selected for the variables in Table 1, and why you selected those. This should be written in the style of the statistical analysis section of a peer-reviewed manuscript. (see “methods example.docx” in the assignment description)
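As a rough illustration of how the comparisons for Table 1 might be run, the sketch below pairs each type of variable with one commonly used test. It assumes that exercise has exactly two groups, that bmi is continuous, and that soda is stored as a numeric count or frequency; check the codebook before relying on any of these assumptions, and pick the test that fits how each variable is actually coded.

```stata
* Categorical variable: column percentages with a chi-squared test
tabulate age_cat exercise, column chi2

* Continuous, roughly symmetric variable: two-sample t test
ttest bmi, by(exercise)

* Skewed or count-like variable (assumed here for soda): Wilcoxon rank-sum test
ranksum soda, by(exercise)
```

If a variable such as soda turns out to be binary or categorical in the dataset, the tabulate approach in the first command would be the more natural choice.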
STEP 2 – Develop a base model to predict your selected outcome.
There are some options available to you here. You can model (1) bmi, (2) exercise, or (3) soda.
- (5 pt) Indicate the following:
• The outcome you are modeling: ____
• The appropriate regression type for this outcome: ____
- (5 pt) Write your study question, including the outcome and main exposure in which you are interested.
- (5 pt) Write the formal set of hypotheses for your research question
- (10 pt) In addition to your main exposure variable, identify 2 more variables that you decide must be in your model. The choice of these variables is analyst driven, and should reflect your understanding of your research question, as well as variables that you feel are necessary to include in order to make fair comparisons. Create the base model by including these necessary variables and provide the Stata output below.
HINT: should any of these variables be treated as categorical rather than continuous? Make sure that your model reflects your thinking on this.
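To make the HINT concrete, here is a sketch of one possible base model. The choice of bmi as the outcome, exercise as the main exposure, and age_cat and female as adjustment variables is purely hypothetical; your own model should follow from your study question. The i. prefix tells Stata to treat a variable as categorical, and it assumes the variable holds non-negative integer codes.

```stata
* Hypothetical base model: linear regression of a continuous outcome (bmi)
* on a main exposure plus two adjustment variables, all entered as categorical
regress bmi i.exercise i.age_cat i.female

* For comparison, the same idea with a binary outcome would use a logistic model
* (again with hypothetical variable choices):
* logistic exercise i.foodstamp i.age_cat i.female
```

Entering age_cat without the i. prefix would instead fit a single slope across the codes 0, 1, and 2, which imposes a linear trend that may or may not be what you intend.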
STEP 3 – Add to and refine your base model
- (5 pt) As you have worked with this dataset, there are likely other variables whose impact on your outcome is of interest to you. Select 2 of these variables from the dataset, and assess their association with the outcome variable. If you were following a forward selection approach, which of these variables would be entered first into your model? Provide Stata output that supports your answer of which to enter into the model.
• The next variable to add, according to forward selection, is ___
• The reason for selecting this variable is ___
• Stata output to support the above conclusion goes below
- (5 pt) Add your selected variable (from the previous step) to the base model. Using the criteria we have discussed in class (improved predictive value, preference for a parsimonious model), determine whether the newly added variable should be kept in the model. Provide the output for your new model below and write 1-2 sentences that describe which model is preferred (base or new), citing evidence from the output of the two models.
HINT: should any of these variables be treated as categorical rather than continuous? Make sure that your model reflects your thinking on this.
- (5 pt) Add an interaction term of your creation to your preferred model (this should be an interaction between two variables from your preferred model, identified in the previous question). Should the interaction term be kept in the model? State your justification for whether to keep the interaction term, providing support from numbers in the output that you provide below.
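The comparison of a base model with an extended model, and the check of an interaction term, could be sketched as follows. All variable choices (bmi, exercise, age_cat, female, foodstamp) and the stored-estimate names base and plus_fs are hypothetical, and the criteria used in class may differ from the ones shown here.

```stata
* Base model, restricted to observations with a non-missing candidate variable
* so that both models are fit on the same sample (lrtest requires this)
regress bmi i.exercise i.age_cat i.female if !missing(foodstamp)
estimates store base

* Extended model with one candidate variable added
regress bmi i.exercise i.age_cat i.female i.foodstamp
estimates store plus_fs

* Likelihood-ratio test of the nested models, then AIC and BIC side by side
lrtest base plus_fs
estimates stats base plus_fs

* Interaction between two variables already in the model;
* ## includes both main effects and their product terms
regress bmi i.exercise##i.female i.age_cat
testparm i.exercise#i.female
```

Under a forward selection approach, the first block would be repeated for each candidate variable, and the candidate with the strongest evidence of improving the model would be entered first.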
- (5 pt) After making all of the decisions in step 3, provide the output for your best model below.
STEP 4 – Interpret your findings
- (5 pt) Assess the model fit using the appropriate post-regression diagnostics, using Stata code we have discussed in class. Provide output from this assessment and explain whether you have a satisfactory explanatory model or should consider a new model.
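Which diagnostics are appropriate depends on the regression type. The sketch below lists a few standard Stata post-estimation commands for a linear and for a logistic model; the models themselves use hypothetical variable choices, and the commands covered in class may be a different set.

```stata
* After a linear model
regress bmi i.exercise i.age_cat i.female
* Residuals plotted against fitted values
rvfplot, yline(0)
* Breusch-Pagan test for heteroskedasticity
estat hettest
* Variance inflation factors as a check for collinearity
estat vif

* After a logistic model
logistic exercise i.foodstamp i.age_cat i.female
* Hosmer-Lemeshow goodness-of-fit test using 10 groups
estat gof, group(10)
* Area under the ROC curve
lroc
```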
REMEMBER: I’m interested in your process here, not in having you find the very best explanatory model. So, if your model is not great, that’s OK…you don’t have to do the project over, just tell me why it’s not great.
- (5 pt) Upload your annotated do file.
- (5 pt) Refer back to your study question and hypotheses (questions 5-6). Write 1-2 sentences about the explanatory power of your best model. Then write an additional 1-2 sentences about how the information from this model could inform next steps. As examples of what you might write about next steps, consider (1) whether you were able to answer the study question (how strong is your evidence?), (2) what educational or behavioral intervention might be undertaken based on your evidence, and/or (3) how you might improve on this model in a future study.