Practice Exam 1
Solutions
1. When applying for financial aid, City U students and
their families must report household income (as computed for tax
purposes). Family incomes, in thousands
of dollars, for a group of 34 incoming students are shown below.
22.1 24.5 25.0 29.3 31.2 39.8 40.0 41.0 44.2 45.6
46.7 48.8 49.1 50.2 50.4 51.3 54.1 57.5 59.5 62.1
64.0 64.0 68.3 68.9 70.1 74.4 75.4 80.0 81.5 86.9
98.8 110.3 129.5 191.2
(a) Make a histogram or stemplot
of these data. If you choose a histogram,
be sure to specify your classes. If you
choose a stemplot, be sure to explain what your stems
and leaves represent.
 2 | 2459
 3 | 19
 4 | 0145689
 5 | 001479
 6 | 24488
 7 | 045
 8 | 016
 9 | 8
10 |
11 | 0
12 | 9
13 |
14 |
15 |
16 |
17 |
18 |
19 | 1
Stems are tens of thousands of dollars, and leaves are
thousands of dollars. I truncated
rather than rounding. Answers will vary.
(b) Describe the overall
shape of the distribution. Is it roughly
symmetric, skewed to the right, or skewed to the left? Are there any outliers?
My distribution is skewed
to the right, with a center between 50 and 60 thousand dollars, and an outlier
of 191.2 thousand dollars.
(c) Would the 5-number summary or the mean and standard
deviation give a better brief summary for this distribution? Explain your choice. Calculate the summary you
choose.
The 5-number summary would
be better for this distribution because it is heavily skewed and has a definite
outlier. Since the mean and standard
deviation are sensitive to outliers, and less well-suited for skewed
distributions, we choose the 5-number summary.
Min = 22.1, Q1 = 44.2, M = 55.8, Q3 = 74.4, Max = 191.2
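As a check on the summary above, here is a minimal Python sketch. It assumes the same quartile convention the solution appears to use (Q1 and Q3 are the medians of the lower and upper halves of the ordered data); the function name `five_number_summary` is made up for this illustration, and statistical software that uses a different quartile rule may report slightly different quartiles.

```python
from statistics import median

# Family incomes in thousands of dollars (the 34 values listed in the problem).
incomes = [
    22.1, 24.5, 25.0, 29.3, 31.2, 39.8, 40.0, 41.0, 44.2, 45.6,
    46.7, 48.8, 49.1, 50.2, 50.4, 51.3, 54.1, 57.5, 59.5, 62.1,
    64.0, 64.0, 68.3, 68.9, 70.1, 74.4, 75.4, 80.0, 81.5, 86.9,
    98.8, 110.3, 129.5, 191.2,
]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # observations below the median position
    upper = data[half + n % 2:]    # observations above the median position
    return data[0], median(lower), median(data), median(upper), data[-1]

summary = five_number_summary(incomes)
print(summary)
```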
2. The table below
summarizes the accept/reject decisions which City U has made for a sample of
n=3000 applicants, broken down by the type of high school attended.
| | Public | Private | Parochial |
| Accept | 1254 | 336 | 180 |
| Reject | 1026 | 144 | 60 |
(a) What is the acceptance rate (as a %) among all City U
applicants? 59% (1770/3000)
(b) What proportion of City U applicants are not from a public high school? 0.24 (720/3000)
(c) Find the conditional distribution of acceptance and rejection within each of the high school types. (That is, find the acceptance and rejection rates for students who attended public high schools. Then do the same for private high schools and again for parochial schools.) Summarize the results in a table and with a bar chart.
| | Public | Private | Parochial |
| Accept | 55% | 70% | 75% |
| Reject | 45% | 30% | 25% |
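The conditional distributions in the table can be reproduced with a short Python sketch: each count is divided by the total for its own school type. The dictionary layout and the function name `conditional_distribution` are choices made for this illustration only.

```python
# Counts from the accept/reject table in the problem.
counts = {
    "Public":    {"Accept": 1254, "Reject": 1026},
    "Private":   {"Accept": 336,  "Reject": 144},
    "Parochial": {"Accept": 180,  "Reject": 60},
}

def conditional_distribution(column):
    """Convert one school type's counts into percentages of that column's total."""
    total = sum(column.values())
    return {decision: 100 * n / total for decision, n in column.items()}

rates = {school: conditional_distribution(col) for school, col in counts.items()}
for school, dist in rates.items():
    print(school, {decision: round(pct) for decision, pct in dist.items()})
```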
(d) If there was no relationship between the type of school
and the admissions decision, what would you expect for the count in the cell
describing number accepted from public high schools?
row total * column total
/ table total = 1770*2280/3000 = approximately 1345
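The expected-count arithmetic can be verified directly. This small Python sketch just recomputes the row total, the column total, and their product over the table total; the variable names are illustrative.

```python
# Totals taken from the accept/reject table in the problem.
accept_total = 1254 + 336 + 180   # row total for "Accept"
public_total = 1254 + 1026        # column total for "Public"
table_total = 3000                # all applicants in the sample

# Expected count under "no relationship": row total * column total / table total.
expected_public_accept = accept_total * public_total / table_total
print(expected_public_accept)
```

The observed count (1254) is below this expected count, which matches the lower acceptance rate for public school applicants.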
(e) With a sentence or two, summarize any relationship that
you see in these data between the admission decision and the type of high
school.
Students from private and
parochial schools are accepted at higher rates than students from public
schools. Parochial acceptance rates are
slightly higher than private acceptance rates, but both are substantially
higher than public acceptance rates.
3. City U has a special relationship with an inner city high
school that encourages students to apply for admission. Below are the Verbal SAT scores from a SRS of
10 applicants from that school.
510 430 600 540 420 380 620 520 490 540
(a) Find the sample mean and standard deviation for these
SAT scores.
mean = 505, standard deviation = 77.2
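For readers without a graphing calculator, the same two values can be obtained with Python's standard `statistics` module. Note that `stdev` is the sample standard deviation (it divides by n - 1), which is what the answer above reports.

```python
from statistics import mean, stdev

# Verbal SAT scores for the SRS of 10 applicants.
scores = [510, 430, 600, 540, 420, 380, 620, 520, 490, 540]

sample_mean = mean(scores)   # arithmetic mean
sample_sd = stdev(scores)    # sample standard deviation (divides by n - 1)
print(sample_mean, round(sample_sd, 1))
```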
(b) Find the interquartile range for these data. [Recall that the interquartile range is the
difference between the third and first quartiles.]
Ordering the data, we
have: 380 420 430 490 510 520 540 540 600 620, and the first and third
quartiles are 430 and 540, respectively.
The interquartile range is 540-430 = 110.
(c) Use the 1.5*IQR criterion to decide if the minimum score
of 380 is unusually low, given the other values in this distribution. Carefully justify your decision. [Recall that the 1.5*IQR criterion says that
an observation is an outlier if it falls more than 1.5*IQR above the third
quartile or below the first quartile.]
1.5*IQR = 1.5*110 =
165. Data below Q1-165 = 430-165 = 265
are outliers (as are data above Q3+165 = 705).
Since 380 is clearly above 265, the 380 SAT score is not an outlier by
the 1.5*IQR criterion.
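The quartiles, the IQR, and the 1.5*IQR fences can be checked with the following Python sketch. It assumes the median-of-halves convention for quartiles, which reproduces the values 430 and 540 used above; other quartile rules could shift the fences slightly.

```python
from statistics import median

# Verbal SAT scores, sorted from smallest to largest.
scores = sorted([510, 430, 600, 540, 420, 380, 620, 520, 490, 540])

half = len(scores) // 2
q1 = median(scores[:half])                      # median of the lower half
q3 = median(scores[half + len(scores) % 2:])    # median of the upper half
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Any score outside the fences is flagged by the 1.5*IQR criterion.
outliers = [s for s in scores if s < lower_fence or s > upper_fence]
print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```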
4. Suppose that all
City U applicants are required to submit a high school grade average (on a 100
point scale). Past experience shows that
these averages follow a normal distribution with a mean of 83.0 and a standard
deviation of 6.0 points.
(a) What proportion of City U applicants should have a high
school average below 80? Find the appropriate
z-score and use a standard normal table.
z = (80 - 83)/6 = -0.5, and
P(z < -0.5) = 0.3085, using the table. So about 30.85% of applicants have a high school average below 80.
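The table lookup can be cross-checked in Python with `statistics.NormalDist`, which evaluates the normal cumulative distribution directly. This is only a check on the arithmetic; the exam itself expects the z-score and table method.

```python
from statistics import NormalDist

# High school averages: normal with mean 83.0 and standard deviation 6.0.
grades = NormalDist(mu=83.0, sigma=6.0)

z = (80 - grades.mean) / grades.stdev   # standardized value for an average of 80
p_below_80 = grades.cdf(80)             # area to the left of 80
print(z, round(p_below_80, 4))
```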
(b) The admissions office would like to designate students
in the top 10% of the high school grade distribution for a "fast
track" admissions decision. How
high would a student's high school average need to be in order to make it into
this special decision group? Your work
should include the relevant z-score and the relationship between the z-score
and your answer.
Looking up .90 in the body
of the table, we find a corresponding z-score of 1.28. Since
z = (x - 83)/6, we have
1.28 = (x - 83)/6, and solving for x gives x = 83 + 1.28*6 = 90.68 (or 90.69 using the
more precise z = 1.2816). So students with high school averages at or
above about 90.7 would be in the “fast track” admissions decision group.
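The backward normal calculation can also be sketched in Python. `NormalDist().inv_cdf` returns the z-score with a given area to its left, so it plays the role of reading the table "inside out"; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 83.0, 6.0

# z-score with 90% of the area to its left (top 10% lies above it).
z_90 = NormalDist().inv_cdf(0.90)

# Unstandardize: x = mu + z * sigma.
cutoff_precise = mu + z_90 * sigma
cutoff_table = mu + 1.28 * sigma   # using the two-decimal table value
print(round(z_90, 4), round(cutoff_precise, 2), round(cutoff_table, 2))
```

The two cutoffs differ only in the second decimal place, which is why answers of 90.68 and 90.69 are both reasonable.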
5. City U is noted for having a top-ranked water polo
team. In order to attract the best
quality players, the school is quite generous in awarding scholarships to
students on the team to help defray the $18,000 tuition bill. Suppose that the boxplot below reflects the
size of the scholarships awarded to the 15 current water poloists. All scholarships are in multiples of $1,000.

Determine whether each of the statements below is VALID
(definitely true), INVALID (definitely false), or UNDETERMINED (could be true
or false). Explain your reasoning in
each case.
(a) __________________ At least 4 of the water polo players
are on full scholarships.
VALID. Since the maximum and the third quartile are
the same, and there are four students at or above the third quartile, all of
those 4 must have full scholarships.
(b) __________________ There is at
least one player with a $12,000 scholarship.
VALID. The quartiles fall on actual observations
(with n=15, the median is observation 8, and the quartiles are observations 4
and 12).
(c) __________________ None of the
15 swimmers has a scholarship worth exactly $10,000.
UNDETERMINED. There is at
least one swimmer with no scholarship, and at least one swimmer with a $12,000
scholarship. This still leaves two
swimmers who could have scholarships anywhere between 0 and $12,000.
(d) Circle the value below which is the most reasonable
estimate for the sample mean of the water polo scholarships. Briefly explain your reasoning.
$9,000     $13,500     $16,000     $18,000
$13,500 is the best
choice. Because this distribution is
skewed to the left, we know the mean should be less than the median of
$16,000. $13,500 is reasonable, but to
be sure, we do a quick check using the lowest possible numbers for the unknown
scholarships (in thousands:
0,0,0,12,12,12,12,16,16,16,16,18,18,18,18), which leads to an average
of about $12,270, well above $9,000.
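The quick check can be extended to bound the mean from both sides. The Python sketch below assumes the five-number summary implied by the answers above (minimum 0, Q1 = 12, median = 16, Q3 = maximum = 18, in thousands of dollars, with the quartiles at observations 4, 8, and 12). The `highest` list is an added illustration built under that assumption and is not part of the original solution.

```python
from statistics import mean

# Smallest scholarships consistent with the assumed boxplot (thousands of dollars).
lowest = [0, 0, 0, 12, 12, 12, 12, 16, 16, 16, 16, 18, 18, 18, 18]
# Largest scholarships consistent with the same assumed boxplot.
highest = [0, 12, 12, 12, 16, 16, 16, 16, 18, 18, 18, 18, 18, 18, 18]

low_mean = mean(lowest)
high_mean = mean(highest)

# Answer choices from the question, in thousands of dollars.
candidates = [9.0, 13.5, 16.0, 18.0]
feasible = [c for c in candidates if low_mean <= c <= high_mean]
print(round(low_mean, 2), round(high_mean, 2), feasible)
```

Under these assumptions, $13,500 is the only listed value that falls between the smallest and largest possible means.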
6. Trying to determine the best number of students to accept
is a tricky admissions decision. City U
officials must assume that some students will reject an offer from City U in
order to attend another school. If too
few students are accepted, they may end up with too small an incoming class,
but accepting too many students may jeopardize City U's rating in college
guidebooks. Here are several years' data
on the number of students accepted and the number who later enrolled. We are interested in predicting the number
enrolling from the number accepted.
| Year | Accepted | Enrolled |
| 1996 | 2440 | 611 |
| 1997 | 2800 | 708 |
| 1998 | 2720 | 637 |
| 1999 | 2360 | 584 |
| 2000 | 2660 | 614 |
| 2001 | 2620 | 625 |
(a) Find the correlation between the number of students
accepted and the number that enrolled.
r = 0.8303, using my
graphing calculator with Accepted in L1, Enrolled in L2, and the linear
regression option, which also returns the correlation.
(b) Find the least squares regression line which best fits
these 6 data points.
predicted enrolled = 88.90 + 0.2081*(accepted), again from the linear regression option
(c) Write a sentence that interprets what the value of the
estimated slope of this regression line tells us about accepted and enrolled
students.
An increase of 100
accepted students would correspond to a predicted increase of about 21 enrolled
students.
(d) If City U accepts 2500 students in 2002, how many would
you expect to enroll?
The line gives 88.90 + 0.2081*2500 = approximately 609, so if 2500 students are accepted, we would expect
about 609 to enroll.
(e) What is the
residual for 1998? Write a sentence
interpreting the value of the residual.
The residual for 1998
is: actual enrollment - predicted
enrollment = 637 - 655 = -18. Our
regression line overpredicted the 1998 enrollment by about 18 students.
(f) Find the value of r^2 for this model and
interpret it as a percentage. Your
statement should relate to City U admissions.
r^2 is about
0.6894. About 69% of the variation in
the number of students enrolled from year to year can be accounted for by the
LSR model of enrolled students on accepted students.
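The calculator results in parts (a) through (f) can be reproduced from the standard formulas. The Python sketch below computes the correlation, the least squares slope and intercept, the prediction for 2500 accepted students, the 1998 residual, and r^2 from the six data points; the helper name `predict` is an illustrative choice.

```python
from math import sqrt

accepted = [2440, 2800, 2720, 2360, 2660, 2620]   # explanatory variable x
enrolled = [611, 708, 637, 584, 614, 625]         # response variable y

n = len(accepted)
x_bar = sum(accepted) / n
y_bar = sum(enrolled) / n

# Sums of squares and cross-products about the means.
sxx = sum((x - x_bar) ** 2 for x in accepted)
syy = sum((y - y_bar) ** 2 for y in enrolled)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(accepted, enrolled))

r = sxy / sqrt(sxx * syy)            # correlation
slope = sxy / sxx                    # least squares slope
intercept = y_bar - slope * x_bar    # the line passes through (x_bar, y_bar)

def predict(x):
    """Predicted enrollment for x accepted students."""
    return intercept + slope * x

residual_1998 = 637 - predict(2720)  # actual minus predicted
r_squared = r ** 2
print(round(r, 4), round(slope, 4), round(intercept, 2))
print(round(predict(2500)), round(residual_1998), round(r_squared, 4))
```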
(g) Sketch a time plot of the accepted data and another of the enrolled data. Do your
time plots reveal any strong trend in the number of students accepted or
enrolled from year to year?
There is variation in
the number of accepted students, but no clear
indication of a generally increasing or generally
decreasing trend.
The number enrolled varies, too, with a moderately strong correlation
with the number accepted, but again there is no
clear indication
that enrollment is generally increasing or
generally decreasing over
time.
7. The age distribution of students at City U
is modeled by the
distribution shown to the
right.
(a) Approximate the
median student age, based on the distribution.
The median is the age for which half of the area under the curve is to
the right, and half is to
the left. The median is about age 28.
(b) Do you expect the
mean student age to be higher or lower than the median? Explain briefly. Approximate the mean student age, based on
the distribution.
The mean will be higher
than the median. This distribution is
skewed to the right, so the mean will be pulled to the right. Graphically, the mean is the “balancing
point” for the distribution, which is approximately age 33.
(c) If we took random
samples of size 5 from the student population, computed the average age within
the sample, and looked at the distribution of these averages, would you expect
the mean for the new distribution to be larger than, smaller than, or the same
as, the mean you found in Part (b)? Explain
briefly.
Individual random samples may give averages either above
or below the distribution mean, but if we look at a bunch of them together, the
mean of the distribution of averages should be the same as the mean of
the distribution above.
(d) If we took random
samples of size 5 from the student population, computed the average age within
the sample, and looked at the distribution of these averages, would you expect
the standard deviation for the new distribution to be larger than, smaller than,
or the same as, the standard deviation of the distribution in Part (a)? Explain briefly.
The standard deviation of the distribution of averages
should be smaller than the standard deviation of the distribution
above. The average of 5 numbers will be
closer to the average age of all the students than will the individual ages,
and since standard deviation is sort of an average distance from the mean, it
will be smaller for the averaged values than for the individual values.
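The claims in parts (c) and (d) can be illustrated with a small simulation. Because the actual age distribution is only shown in a figure, the Python sketch below uses a hypothetical right-skewed population (age 18 plus an exponential amount with mean 15 years), chosen only because its median and mean are roughly 28 and 33; the population, the seed, and the number of samples are all assumptions made for this illustration.

```python
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the illustration is repeatable

# Hypothetical right-skewed population of student ages (an assumption, not the exam's data).
population = [18 + random.expovariate(1 / 15) for _ in range(20000)]

# Draw many random samples of size 5 and record each sample's average age.
sample_means = [mean(random.sample(population, 5)) for _ in range(5000)]

pop_mean = mean(population)
pop_sd = pstdev(population)
means_mean = mean(sample_means)
means_sd = pstdev(sample_means)

print(round(pop_mean, 1), round(means_mean, 1))  # centers should be close
print(round(pop_sd, 1), round(means_sd, 1))      # spread of the averages is smaller
```

In a simulation like this, the spread of the averages should come out near the population standard deviation divided by the square root of 5.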

Correlation is a
measure of the strength and direction of a linear relationship between
two quantitative variables. It
takes values between -1 and 1, with values near 1 or -1 signifying a strong
linear relationship (values of exactly 1 or -1 imply all the data points lie on
a line), and values near 0 signifying
little or no linear relationship.
Positive correlations indicate that small values of one variable are
associated with small values of the other and large values are similarly
associated. Negative correlations
indicate that small values of one variable are associated with large values of
the other variable. The value of the
correlation does not depend on the units of measurement of either variable
(because correlation involves the standardized values, or z-scores, for the
individual measurements), and correlation does not distinguish between
explanatory and response variables (the formula is symmetric in x and y). In least squares regression of y on x, the
square of the correlation coefficient gives the proportion of variability in
the y-values that is explained by the least squares regression line.
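Two of the properties described above (no dependence on units, and symmetry in x and y) can be demonstrated numerically. The Python sketch below computes r as the sum of products of z-scores divided by n - 1, using the accepted and enrolled counts from Problem 6 as convenient example data; the function name `correlation` is defined here for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation as the sum of products of z-scores, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

accepted = [2440, 2800, 2720, 2360, 2660, 2620]
enrolled = [611, 708, 637, 584, 614, 625]

r = correlation(accepted, enrolled)
r_swapped = correlation(enrolled, accepted)                       # swap x and y
r_rescaled = correlation([x / 1000 for x in accepted], enrolled)  # change units of x
print(round(r, 4), round(r_swapped, 4), round(r_rescaled, 4))
```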

9. Explain or define
the following terms as they relate to linear regression:
(a) Influential
observations
An influential
observation is one that has a substantial effect on the regression line (so
that removing that one observation changes the regression line a lot). Observations with extremely large or small x-values
have the potential to be extremely influential.
Other outliers may or may not be influential.
(b) Residual
The residual of an
observation is the difference between the actual and predicted y-value for a
given x-value. A positive residual means
the observation lies above the LSR line, while a negative residual means the
observation lies below the LSR line. The
LSR line tries to minimize the sum of the squared residuals.
10. Overweight
parents tend to have overweight children.
The results of a study of Mexican American girls aged 9 to 12 years are
typical. The investigators measured body
mass index (BMI), a measure of weight relative to height, for both the girls
and their mothers. People with high BMI
are overweight. The correlation between
the BMI of daughters and the BMI of their mothers was r = 0.506. The results of this study are
confounded. Explain what the confounding
is and what you may or may not conclude from the study.
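Body type is
determined in part by heredity, so genetics explains part of the correlation,
but environmental influences also contribute to the correlation: mothers who are overweight may also set an
example of little exercise and poor diet, and these behaviors are likely to
influence the behaviors of their daughters. Because the effects of heredity and
environment are mixed together, we cannot conclude from the correlation alone how
much of the association is due to either one.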
11. The table below
shows numbers of flights on time and delayed for two airlines at five airports
in one month.
| | Alaska Airlines | | America West | |
| | On Time | Delayed | On Time | Delayed |
| | 497 | 62 (11%) | 694 | 117 (14%) |
| | 221 | 12 (5%) | 4840 | 415 (8%) |
| | 212 | 20 (9%) | 383 | 65 (15%) |
| | 503 | 102 (17%) | 320 | 129 (29%) |
| | 1841 | 305 (14%) | 201 | 61 (23%) |
(a) What proportion of all Alaska Airlines flights were delayed? What proportion of all America West flights were delayed?
(b) Find the percentage of delayed flights for Alaska
Airlines at each of the five airports.
You may record your percentages in the table, next to the number of delayed
flights. Do the same for America West.
(c) What happens?
What is the name of the phenomenon you observe? Explain why it occurs in this situation. (What’s the lurking variable?)
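Although Alaska Airlines has a lower percentage of delayed flights than America West at each of the five airports, America West has the lower percentage of delayed flights overall (787/7225, about 11%, versus 501/3775, about 13%, for Alaska Airlines). This reversal is Simpson’s paradox. The lurking variable is the airport: most America West flights (5255 of 7225) are at the airport with the lowest delay rates, while most Alaska Airlines flights (2146 of 3775) are at an airport with much higher delay rates.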
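The percentages asked for in parts (a) and (b) can be computed with a short Python sketch. It assumes the first pair of columns in the table belongs to Alaska Airlines and the second pair to America West (the order in which part (a) names them); the airports are identified only by row position, and the function names are illustrative.

```python
# (on time, delayed) counts per airport, in the table's row order.
# Assumption: columns 1-2 are Alaska Airlines, columns 3-4 are America West.
alaska = [(497, 62), (221, 12), (212, 20), (503, 102), (1841, 305)]
america_west = [(694, 117), (4840, 415), (383, 65), (320, 129), (201, 61)]

def delay_rate(on_time, delayed):
    """Proportion of flights that were delayed."""
    return delayed / (on_time + delayed)

def overall_rate(rows):
    """Delay rate after pooling all airports together."""
    total_on_time = sum(on_time for on_time, _ in rows)
    total_delayed = sum(delayed for _, delayed in rows)
    return delay_rate(total_on_time, total_delayed)

alaska_by_airport = [delay_rate(*row) for row in alaska]
west_by_airport = [delay_rate(*row) for row in america_west]
alaska_overall = overall_rate(alaska)
west_overall = overall_rate(america_west)

print([round(100 * p) for p in alaska_by_airport], round(100 * alaska_overall, 1))
print([round(100 * p) for p in west_by_airport], round(100 * west_overall, 1))
```

Comparing the airport-by-airport rates with the pooled rates shows the direction of the comparison reversing once the airports are combined.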
12. In Professor Friedman’s economics course, the correlation between the students’ total scores prior to the final exam and their final exam scores is r = 0.6. The pre-exam totals for all students in the course have mean 280 and standard deviation 30. The final exam scores have mean 75 and standard deviation 8.
(a) Professor Friedman grades on a curve so that he expects to assign A’s to approximately 15% of his students, B’s to approximately 35%, C’s to approximately 40% of his students, and D’s or F’s to the remaining 10%. Assuming the distribution of pre-final totals is approximately normal, what is the minimum number of points a student would need before the final exam to be earning an A? a B? a C?
A: 311     B: 280     C: 242
These come from normal curve calculations using z-scores of 1.04
(85% of area to left), 0 (50% of area to left), and -1.28 (10% of area to left)
to find corresponding x’s.
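The three cutoffs can be checked in Python with `statistics.NormalDist`, whose `inv_cdf` method returns the value with a given area to its left. The dictionary of cumulative shares below restates the grading curve (85% of students below an A, 50% below a B, 10% below a C) and is introduced only for this illustration.

```python
from statistics import NormalDist

# Pre-final totals: approximately normal with mean 280 and standard deviation 30.
totals = NormalDist(mu=280, sigma=30)

# Share of students falling below the minimum score for each grade.
share_below = {"A": 0.85, "B": 0.50, "C": 0.10}

cutoffs = {grade: totals.inv_cdf(p) for grade, p in share_below.items()}
for grade, points in cutoffs.items():
    print(grade, round(points))
```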
(b) Find the least squares regression line of final exam scores on pre-final total scores for this course.
predicted final exam score = 30.2 + 0.16*(pre-final
total score)
(c) Explain the meaning of the vertical intercept of your LSR line in the context of Professor Friedman’s class. Is your interpretation reasonable? Why or why not?
The intercept is
predicted final exam score for a student who had 0 points as a pre-final total. Our
interpretation is reasonable, but we should be careful. The problem is that 0 as a pre-final total is probably far outside the range of the
actual students’ pre-final totals, and we shouldn’t
try to extrapolate far from the range of the data.
(d) Julie’s total before the exam was 300. What does LSR predict for her score on the final exam?
Julie’s predicted
score = 30.2 + 0.16*300 = 78.2
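The line in part (b) and the prediction in part (d) follow from the summary statistics alone: the slope is r times the ratio of the standard deviations, and the line passes through the point of means. A minimal Python sketch of that arithmetic, with illustrative variable names:

```python
# Summary statistics given in the problem.
r = 0.6
x_bar, s_x = 280, 30   # pre-final totals: mean and standard deviation
y_bar, s_y = 75, 8     # final exam scores: mean and standard deviation

slope = r * s_y / s_x                  # b = r * s_y / s_x
intercept = y_bar - slope * x_bar      # a = y_bar - b * x_bar

julie_prediction = intercept + slope * 300   # Julie's pre-final total is 300
r_squared = r ** 2                           # share of variation explained by the line
print(round(slope, 2), round(intercept, 1), round(julie_prediction, 1), round(r_squared, 2))
```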
(e) Should we have great confidence in our ability to predict
Julie’s final exam score accurately?
Explain your answer and justify it statistically.
No, Julie’s score
could be considerably higher or lower than our prediction. The value of r^2 in the regression
is only 0.36, which means the pre-final total only
accounts for 36% of the variability in the final exam scores. The remaining variability (64%) is large and
unaccounted for. We would have greater
predictive ability if r^2 were larger.