BUSINESS DATA MINING (IDS 472)

HOMEWORK 2 DUE DATE: FRIDAY, SEPTEMBER 25 AT 11:59 PM

Problem 1. Explain what each of the following R expressions does. You can run them in R and check the results.
(a) c(1, 17, -6, 3)
(b) seq(1, 5, by=0.5)
(c) seq(0, 10, length=5)
(d) rep(0, 5)
(e) rep(1:3, 4)
(f) rep(4:6, 1:3)
(g) sample(1:3)
(h) sample(1:5, size=3, replace=FALSE)
(i) sample(c(2,5,3), size=4, replace=TRUE)
(j) sample(1:2, size=10, prob=c(1,3), replace=TRUE)
(k) c(1, 2, 3) + c(4, 5, 6)
(l) max(1:10)
(m) min(1:10)
(n) range(1:10)
(o) matrix(1:12, nr=3, nc=4)
(q) Let a <- c(1, 2, 3), b <- c(10, 20, 30), c <- c(100, 200, 300), d <- c(1000, 2000, 3000). What does the function rbind(a, b, c, d) do? What does cbind(a, b, c, d) do?
(r) Let C be the following matrix:

        a   b    c     d
        1  10  100  1000
        2  20  200  2000
        3  30  300  3000

    What is sum(C)? What is apply(C, 1, sum)? What is apply(C, 2, sum)?
(s) Let movies <- c("SPYDERMAN", "BATMAN", "VERTIGO", "CHINATOWN"). What does lapply(movies, tolower) do? Note that tolower converts the characters of a string to lower case.
(t) Let x <- factor(c("alpha", "beta", "gamma", "alpha", "beta")). What does the function levels(x) return?
(u) c <- 35:50
(v) c(1, 2, 3) + c(4, 5, 6)
(w) c(1, 2, 3, 4) + c(10, 20)
(x) sqrt(c(100, 225, 400))

Problem 2. Create the following vectors in R.
    a = (5, 10, 15, 20, ..., 160)
    b = (87, 86, 85, ..., 56)
Use vector arithmetic to multiply these vectors and call the result d. Select subsets of d to identify the following.
(a) What are the 19th, 20th, and 21st elements of d?
(b) What are all of the elements of d which are less than 2000?
(c) How many elements of d are greater than 6000?

Problem 3. This exercise relates to the College data set, which can be found in the file College.csv. It contains a number of variables for 777 different universities and colleges in the US.
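As a general illustration of working with a CSV file of this kind, the sketch below reads a small file into a data frame and inspects its size, column names, and column types. It does not use College.csv itself: the temporary file, the object name toy, and every value in it are made up for the example.

```r
# Hypothetical three-row file standing in for a real CSV;
# the school names and numbers below are invented for illustration only.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Name,Private,Apps,Accept",
  "School A,Yes,1000,800",
  "School B,No,2500,1500",
  "School C,Yes,400,300"
), tmp)

# read.csv() returns a data frame; header = TRUE takes column names from row 1.
toy <- read.csv(tmp, header = TRUE)

dim(toy)    # number of rows and number of columns
names(toy)  # column names
str(toy)    # storage type of each column
```

For a real file, the path passed to read.csv() would be the location of the file on disk, or its name if it sits in the current working directory.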
The variables are
• Private : Public/private indicator
• Apps : Number of applications received
• Accept : Number of applicants accepted
• Enroll : Number of new students enrolled
• Top10perc : New students from top 10% of high school class
• Top25perc : New students from top 25% of high school class
• F.Undergrad : Number of full-time undergraduates
• P.Undergrad : Number of part-time undergraduates
• Outstate : Out-of-state tuition
• Room.Board : Room and board costs
• Books : Estimated book costs
• Personal : Estimated personal spending
• PhD : Percent of faculty with Ph.D.’s
• Terminal : Percent of faculty with terminal degree
• S.F.Ratio : Student/faculty ratio
• perc.alumni : Percent of alumni who donate
• Expend : Instructional expenditure per student
• Grad.Rate : Graduation rate

(a) Read the data into R. Call the loaded data “college”. Explain how you do this.
(b) How many variables are in this data set? What are their measurement types? How do you obtain this information?
(c) Use the function colnames() to change the “Top10perc” and “Top25perc” variable names to “Top10” and “Top25”.
(d) Look at the data. You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following command:
    > rownames(college) <- college[, 1]
    You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Write code to eliminate the first column.
(e) Add a column to indicate the acceptance rate for each university (acceptance rate = number of accepted applications / number of applications received).
(f) Provide summary statistics for the numerical variables in the data set.
(g) Use the pairs() function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix A using A[,1:10]. Can you observe any useful information in the plots?
(h) Use the boxplot() function to produce side-by-side boxplots of Outstate versus Private. Do you observe any useful information in this plot?
(i) Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%. Follow the code below.
    > Elite <- rep("No", nrow(college))
    > Elite[college$Top10perc > 50] = "Yes"
    > Elite = as.factor(Elite)
    > college = data.frame(college, Elite)
    i. Explain each line of the above code.
    ii. Use the summary() function to see how many elite universities there are. Now use the plot() function to produce side-by-side boxplots of Outstate versus Elite.
(j) Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow=c(2,2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.
(k) What is the average room and board cost of private schools?
(l) Create a new binary variable that is 1 if the student/faculty ratio is greater than 0.5 and 0 otherwise.
(m) Compare the distribution of out-of-state tuition for private and public colleges.

Problem 4. This exercise involves the “Auto” data set.
(a) Remove the missing values from this data set.
(b) What is the range of each quantitative predictor? You can answer this using the range() function.
(c) What are the mean and standard deviation of each quantitative predictor?
(d) Remove the 10th through 85th observations. What are the range, mean, and standard deviation of each predictor in the subset of the data that remains?
(e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
(f) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

Problem 5. FiveThirtyEight, a data journalism site devoted to politics, sports, science, economics, and culture, recently published a series of articles on gun deaths in America. Gun violence in the United States is a significant political issue, and while reducing gun deaths is a noble goal, we must first understand the causes and patterns in gun violence in order to craft appropriate policies. As part of the project, FiveThirtyEight collected data from the Centers for Disease Control and Prevention, as well as other governmental agencies and non-profits, on all gun deaths in the United States from 2012-2014. You can find this dataset, called “gun deaths.csv”, on Blackboard.
(a) Generate a data frame that summarizes the number of gun deaths per month.
(b) Generate a bar chart with labels on the x-axis. That is, each month should be labeled “Jan”, “Feb”, “Mar”, etc.
(c) Generate a bar chart that shows the number of gun deaths associated with each type of intent (cause of death). The bars should be sorted from highest to lowest values.
(d) Generate a boxplot visualizing the age of gun death victims, by sex. Print the average age of female gun death victims.

Answer the following questions. Generate appropriate figures/tables to support your conclusions.
(e) How many white males with at least a high school education were killed by guns in 2012?
(f) Which season of the year has the most gun deaths? Assume that
    – Winter = January - March
    – Spring = April - June
    – Summer = July - September
    – Fall = October - December
    – Hint: You need to convert a continuous variable into a categorical variable.
(g) Are whites who are killed by guns more likely to die because of suicide or homicide? How does this compare to blacks and Hispanics?
(h) Are police-involved gun deaths significantly different from other gun deaths? Assess the relationship between police involvement and other variables.
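
The hint in part (f) about converting a continuous variable into a categorical one can be illustrated with the base R function cut(). The sketch below is only one possible approach and uses a made-up vector of month numbers, not the gun deaths data; the names month and season are chosen for the example.

```r
# Made-up month numbers (1 = January, ..., 12 = December); not the real data.
month <- c(1, 3, 4, 6, 7, 9, 10, 12, 2, 8)

# cut() bins a numeric vector into a factor. With the default right = TRUE
# the intervals are (0,3], (3,6], (6,9], (9,12], one per season.
season <- cut(month,
              breaks = c(0, 3, 6, 9, 12),
              labels = c("Winter", "Spring", "Summer", "Fall"))

# Count how many of the toy observations fall in each season.
table(season)
```

Because the result is a factor, it can be passed directly to functions such as table() or used as a grouping variable in plots.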
