Is Old Faithful Faithful?

Solutions


Solution 1) Summary statistics for both height of eruptions and interval between eruptions for the July 1995 data are:

 

              Height (feet)    Interval (minutes)
Minimum       88               47
Average       133.3            77.82
Maximum       180              113

In light of the information from the 1988 data set, one can see that all of the summary statistics seem reasonable, except that the minimum height is somewhat less than the minimum height provided by the brochure and the minimum interval is somewhat longer than that provided by the brochure. Within the July 1995 data set, one can see that the minimum height of 88 feet occurred on July 28, 1995. From the data sheet, one learns that this was a windy day. The brochure tells us that strong winds may blow off the top of the ascending water column, masking the true height of the eruption.

Instructor's Comment: Since the minimum height from the July 1995 data is so much smaller than the minimum height from the 6,900 observations in 1988, one might suspect that the data from 1988, which were used for the summaries found in the brochure, may have excluded windy days. Alternatively, one may suspect an error in the estimated height for July 28, 1995. The latter suspicion might be assuaged by noting that there are other observations in the July 1995 data which are substantially less than 106 feet. In particular, the two next smallest heights from that data set are 95 and 96 feet, with the height of 95 feet also occurring on July 28, 1995.

To see how many eruptions are less than 115 feet or greater than 164 feet, one can examine a histogram for height:

This data set appears to have more than just a few eruptions less than 115 feet, but few greater than 164. Notice also that the histogram appears to be bimodal with modes at about 110 feet and 140 feet.

 

Solution 2) A plot of the predicted intervals vs. the duration of the current eruption yields points that are very closely approximated by a straight line. So the equation used to make predictions is very close to the equation for the line that best fits the plot below.

Therefore the linear equation being used must be something close to

I = 33.2 + 12.4D, where I is the interval between the current eruption and the next eruption and D is the duration (in minutes) of continuous visible water on the current eruption.

One should note that the points in the plot above do not fall exactly on a straight line because of rounding error.

 

Solution 3a) Looking at a scatterplot of the current duration and the interval between eruptions, one finds:

The scatterplot suggests a strong linear relationship between the two variables. Notice, however, that there are two clusters of points. This suggests that some other variable may explain the clustering, perhaps whether one or both reservoirs in Old Faithful fired during the eruption.

 

Solution 3b) The regression of interval between eruptions on the current duration (in minutes) is:

(i) The equation for the least squares line in this setting is

predicted interval = 33.3474 + 13.2854*(current duration).

The likely size of prediction error associated with this equation is 6.493 minutes, which is the square root of the mean squared error.

(ii) The value of R² for this regression is 85.4%. That is, about 85% of the variation in interval between eruptions is explained by the duration of the current eruption.
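For readers who want to reproduce quantities like those in (i) and (ii) from raw data, the following is a minimal sketch of a simple least squares fit in plain Python. The function name fit_line and its argument names are illustrative assumptions, not part of the original solution, and the July 1995 data themselves are not reproduced here.

```python
import math


def fit_line(x, y):
    """Simple least squares regression of y on x.

    Hypothetical helper for illustration. Returns a tuple
    (intercept, slope, rmse, r_squared), where rmse is the square
    root of the mean squared error with n - 2 degrees of freedom.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have equal length of at least 3")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sums of squares and cross-products about the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual sum of squares from the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

    rmse = math.sqrt(sse / (n - 2))
    r_squared = 1 - sse / syy
    return intercept, slope, rmse, r_squared
```

Calling fit_line(durations, intervals) on the recorded durations and intervals would be expected to give values close to those reported above, assuming the same cases are used.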

 

Solution 3c) If the current eruption were to have a duration of 3.0 minutes, we would predict the time until the next eruption as

interval = 33.3474 + 13.2854*(3.0) = 73.20 minutes.

The brochure predicts the time until the next eruption for this scenario as 71 minutes. This does not agree exactly with the prediction above, but it is pretty close. One needs to recognize that the predictions made with the regression line in Part b are based solely on the small data set from July 1995, whereas the predictions in the park brochure are based on a much richer data set. [Note: The prediction of 71 minutes is well within the regression prediction (based on the July 1995 data) plus or minus one standard error.]
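The arithmetic in this part can be checked with a short sketch. The coefficients, the root mean squared error, and the brochure value of 71 minutes are taken from the solutions above; the function names are illustrative assumptions.

```python
# Values reported in Solutions 3b and 3c.
INTERCEPT = 33.3474          # minutes
SLOPE = 13.2854              # minutes of interval per minute of duration
RMSE = 6.493                 # root mean squared error, in minutes
BROCHURE_PREDICTION = 71.0   # brochure's prediction for a 3.0-minute eruption


def predict_interval(duration_minutes):
    """Predicted minutes until the next eruption (hypothetical helper)."""
    return INTERCEPT + SLOPE * duration_minutes


def within_one_rmse(value, duration_minutes):
    """True if value lies within the regression prediction +/- one RMSE."""
    return abs(value - predict_interval(duration_minutes)) <= RMSE


predicted = predict_interval(3.0)
print(f"{predicted:.2f}")                           # 73.20
print(within_one_rmse(BROCHURE_PREDICTION, 3.0))    # True
```

The difference between the two predictions is about 2.2 minutes, which is roughly one-third of the root mean squared error.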

 

Solution 3d) The sample standard deviation of the intervals between eruptions is 16.9 minutes, while the standard deviation of the residuals is 6.5 minutes, less than one-half as large. This decrease is due to the ability of the duration of the current eruption to explain differences in intervals between eruptions.
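As a rough check on this comparison, one can relate the two standard deviations to the proportion of variation explained. The sketch below uses the rounded values 16.9 and 6.493 from the solutions; one minus the ratio of the mean squared error to the sample variance is the usual definition of adjusted R², so the result is only approximate because the inputs are rounded. The variable names are illustrative assumptions.

```python
SD_INTERVAL = 16.9   # sample standard deviation of the intervals (minutes)
RMSE = 6.493         # standard deviation of the residuals (minutes)

# Ratio of residual spread to original spread.
ratio = RMSE / SD_INTERVAL

# Share of the variance left unexplained, and its complement.
unexplained_fraction = ratio ** 2
explained_fraction = 1 - unexplained_fraction

print(f"{ratio:.3f}")               # 0.384
print(f"{explained_fraction:.3f}")  # 0.852
```

The result, about 85%, is in line with the R² reported for this regression.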

 

Solution 3e) A plot of residuals versus predicted values for the regression of interval between eruptions on the duration of the current eruption is as follows:

There is a distinct pattern in this plot: the points fall into two clusters. A similar observation was made in part (a). This pattern may suggest problems with the regression, i.e., we have failed to take into account some additional factor associated with eruptions. For example, it is possible that these clusters are related to whether one or both of Old Faithful's reservoirs fired on a particular eruption.

 

Solution 4a) Recall that our original predictive model, from Solution 3b, was given by:

predicted interval = 33.3474 + 13.2854*(current duration).

 

The predictive model obtained by adding height as an additional independent (predictor) variable is as follows:

 

One can see that height is not useful in the above model. Although adding height as an independent predictor increased the predictive ability of the model slightly (adjusted R² here is 85.9% whereas the adjusted R² for the original model was 85.3%), the increase is so small as to be negligible.

 

Solution 4b) The model with the product of height and duration as a predictor for the time between eruptions is shown below. The model has an R² value of only about 73% and a root mean squared error which is much larger than the root mean squared errors of the other models considered. Thus, one would prefer our original model over either of the alternative models considered.

Instructor's Comment:

Note that including height in the regression models resulted in models based on fewer cases than our original predictive model (since there are observations for which no height is recorded). For this reason, one may prefer not to include height in any model. Moreover, the method by which the height of an eruption is measured is far less precise than the method by which the duration of an eruption is measured. Thus, one may prefer models not including height, based on the reliability of that data. If good data on wind condition were available, then eliminating windy days might have made the height * duration model a better predictor of the time interval between eruptions. However, since it was not mandatory to record the wind condition, there is no guarantee of reliable data.
