I chose to conduct my statistical analysis on wave data from the Fisheries and Oceans Canada website at http://www.meds-sdmm.dfo-mpo.gc.ca/isdm-gdsi/waves-vagues/index-eng.htm
Before I discuss the three tests that I will use to evaluate the data, I want to explain a few of the terms used in the data and some important points regarding the data itself.
Significant wave height is the mean height, measured from trough to crest, of the highest one-third of the waves. The wave peak period is the period, in seconds, of the most energetic waves, that is, the time between successive crests of the waves that carry the most energy. Atmospheric pressure is the force per unit area exerted by the weight of the overlying atmosphere and is measured in millibars. The rest of the data is self-explanatory.
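To make the definition of significant wave height concrete, here is a minimal Python sketch. The function name and the sample heights are made up for illustration. The buoy data already reports this value, so this is not how the figures in my data were produced.

```python
def significant_wave_height(heights):
    """Mean of the highest one-third of wave heights (trough to crest, in metres)."""
    if not heights:
        raise ValueError("at least one wave height is required")
    ordered = sorted(heights, reverse=True)      # tallest waves first
    top_count = max(1, len(ordered) // 3)        # size of the top third
    top_third = ordered[:top_count]
    return sum(top_third) / len(top_third)


# Hypothetical individual wave heights in metres, not real buoy readings.
sample_heights = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(significant_wave_height(sample_heights))   # mean of 6.0 and 5.0 -> 5.5
```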
In terms of the data itself, there are a couple of points I should mention here:
• The buoy periodically malfunctions and reports wave heights upwards of 10,000 m, so I chose to omit any readings above 12 m.
• In presenting the data in a seasonal format, I chose to use the meteorological dates for the seasons in the Northern Hemisphere: spring begins on March 1st, summer on June 1st, fall on September 1st, and winter on December 1st.
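The two data-handling choices above could be sketched in Python as follows. The function names and the sample readings are hypothetical; only the 12 m cutoff and the season start dates come from the points above.

```python
from datetime import date

MAX_PLAUSIBLE_HEIGHT_M = 12.0  # readings above this are treated as buoy malfunctions


def season_of(day):
    """Meteorological season (Northern Hemisphere) for a given date."""
    if 3 <= day.month <= 5:
        return "spring"
    if 6 <= day.month <= 8:
        return "summer"
    if 9 <= day.month <= 11:
        return "fall"
    return "winter"  # December, January, February


def drop_malfunctions(readings):
    """Keep only (date, wave height) pairs at or below the cutoff."""
    return [(day, height) for day, height in readings
            if height <= MAX_PLAUSIBLE_HEIGHT_M]


# Hypothetical readings: (date, significant wave height in metres).
readings = [
    (date(2010, 2, 28), 3.1),
    (date(2010, 3, 1), 2.4),
    (date(2010, 7, 15), 10000.0),  # obvious malfunction, will be dropped
    (date(2010, 12, 1), 4.0),
]
for day, height in drop_malfunctions(readings):
    print(day, season_of(day), height)
```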
Test 1 – Regression and Correlation
Objective: To determine if there is any collinearity between significant wave height, peak period, atmospheric pressure and sea surface air temperature. Furthermore, to attempt to build a reliable regression model to predict future wave heights.
Expectations: I expect that the wave height will have a linear relationship with atmospheric pressure but not with the other two variables.
Assumptions:
• There is a linear relationship.
• The variation in the residuals is the same for both large and small values.
• The residuals follow the normal probability distribution.
• The independent variables should not be correlated.
• The residuals are independent.
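One simple way to check the assumption that the independent variables are not correlated is to compute Pearson's correlation coefficient for each pair of variables, which is what a correlation matrix contains. The sketch below is an illustration only: the function names are hypothetical and the sample numbers are invented, not my buoy data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations for a dict of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Invented values for illustration only.
sample = {
    "peak_period": [8.0, 9.5, 11.0, 10.0, 12.5],
    "pressure": [1018.0, 1012.0, 1005.0, 1009.0, 1001.0],
}
matrix = correlation_matrix(sample)
print(round(matrix["peak_period"]["pressure"], 2))
```

A coefficient near +1 or -1 between two independent variables would suggest that one of them is redundant in the model.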
Results: The formula to predict wave heights is Ŷ = 309.5417842 + 0.343922104 X1 − 0.305893382 X2, where X1 is the peak period and X2 is the atmospheric pressure.
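As a worked example, the regression formula can be wrapped in a small Python function. The function name and the input values are hypothetical, and the sketch assumes the peak period is in seconds, the pressure is in millibars, and the predicted height is in metres.

```python
INTERCEPT = 309.5417842
PERIOD_COEFFICIENT = 0.343922104      # per second of peak period (X1)
PRESSURE_COEFFICIENT = -0.305893382   # per millibar of atmospheric pressure (X2)


def predict_wave_height(peak_period_s, pressure_mb):
    """Predicted significant wave height from the fitted regression formula."""
    return (INTERCEPT
            + PERIOD_COEFFICIENT * peak_period_s
            + PRESSURE_COEFFICIENT * pressure_mb)


# Hypothetical conditions: a 10 s peak period at 1010 mb.
print(round(predict_wave_height(10.0, 1010.0), 2))  # about 4.03
```

The signs of the coefficients show how the model behaves: each additional millibar of pressure lowers the predicted height by about 0.31, while each additional second of peak period raises it by about 0.34.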
Conclusion: Although I did manage to produce a formula for predicting future wave heights, it does not have a very strong R² (0.59). While building the regression model I dropped sea surface air temperature as a variable because the 95% confidence interval for its coefficient included zero and because it had the lowest correlation with wave height in the correlation matrix.
I did, however, confirm that the residuals (the differences between the actual and predicted values) are not correlated.
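One common way to check residuals for correlation is the Durbin-Watson statistic, which is close to 2 when successive residuals are uncorrelated and moves toward 0 or 4 as positive or negative correlation increases. The sketch below illustrates that general technique and is not necessarily the check used here; the residual values are invented.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals."""
    squared_diffs = sum((residuals[i] - residuals[i - 1]) ** 2
                        for i in range(1, len(residuals)))
    squared_total = sum(e ** 2 for e in residuals)
    return squared_diffs / squared_total


# Invented residuals (actual minus predicted), for illustration only.
example_residuals = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.1]
print(round(durbin_watson(example_residuals), 2))
```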
I think that if I had data in the form of daily averages I would have produced more reliable results. The original data set was too large for the data analysis tool to handle, so I chose to use seasonal data instead.
Test 2 – ANOVA Test
Objective: To determine if there is a difference in the mean wave heights between seasons and, if so, which seasons have different means. The null hypothesis in this test is that the mean wave height is the same in every season.
Expectations: I strongly feel that there will be a difference in the mean wave heights between seasons. I expect the waves in the summer months to be significantly smaller than those in the rest of the year.
Assumptions:
• The populations follow the normal distribution.
• The populations have equal standard deviations.
• The populations are independent.
Source of Variation    SS          df    MS          F           P-value     F crit
Between Groups         7.9994      3     2.666467    5.291576    0.010005    3.238872
Within Groups          8.062525    16    0.503908
Total                  16.06193    19
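The arithmetic behind the ANOVA table can be reproduced in a few lines of Python. The variable names are my own; the sums of squares, degrees of freedom and critical value are copied from the table.

```python
# Values copied from the ANOVA table.
ss_between, df_between = 7.9994, 3
ss_within, df_within = 8.062525, 16
f_critical = 3.238872  # reported critical value at the 5% level

# Mean square = sum of squares / degrees of freedom.
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F statistic = variation between seasons relative to variation within seasons.
f_statistic = ms_between / ms_within

print(round(ms_between, 6), round(ms_within, 6), round(f_statistic, 4))
print("reject null hypothesis:", f_statistic > f_critical)
```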
The F value of 5.29 is above the critical value of 3.24 (and the p-value of 0.010 is below 0.05), so I reject the null hypothesis: at least one seasonal mean differs from the others. Further analysis, using t-tests and confidence intervals for the differences between pairs of seasonal means, shows significant differences between fall vs. summer and winter vs. summer.
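A confidence interval for the difference between two seasonal means can be sketched as below. It uses the within-groups mean square from the ANOVA table and the standard two-tailed 5% critical t value for 16 degrees of freedom (2.120). The group size of 5 per season is an assumption inferred from the 20 observations across 4 seasons, and the seasonal means passed in are hypothetical, since the actual means are not listed here.

```python
from math import sqrt

MSE = 0.503908         # within-groups mean square from the ANOVA table
T_CRITICAL = 2.120     # two-tailed 5% critical t for 16 degrees of freedom
N_PER_SEASON = 5       # assumption: 20 observations / 4 seasons


def difference_interval(mean_a, mean_b, n_a=N_PER_SEASON, n_b=N_PER_SEASON):
    """95% confidence interval for the difference between two group means."""
    margin = T_CRITICAL * sqrt(MSE * (1 / n_a + 1 / n_b))
    difference = mean_a - mean_b
    return difference - margin, difference + margin


def significantly_different(mean_a, mean_b):
    """True when the interval for the difference does not include zero."""
    low, high = difference_interval(mean_a, mean_b)
    return not (low <= 0 <= high)


# Hypothetical seasonal means in metres.
print(difference_interval(2.5, 1.2))
print(significantly_different(2.5, 1.2))
```

Under these assumptions the margin is roughly 0.95, so two seasonal means would need to differ by more than about 0.95 m for the interval to exclude zero.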
Conclusion: The results were exactly as I predicted. There is a significant difference, at a 95% confidence level, between fall vs. summer wave heights and winter vs. summer wave heights.
Test 3 – Time Series Forecasting
Objective: To deseasonalize the air temperature data so that future air temperatures can be predicted from a time series.
Expectations: I expect that the air temperature will not change significantly and that there will be a weak correlation between the temperature and time. The data set is not large enough to produce significant results.
Assumptions:
• For each value of X, there are corresponding Y values. These Y values follow a normal distribution.
• The means of these normal distributions lie on the regression line.
• The standard deviations of these normal distributions are all the same.
Results: The regression for this data has a weak R² of 0.49.
Conclusion: I found the graph above interesting. Although the correlation in this time series was weak, the scatter plot of the deseasonalized data shows a negative trend in the air temperatures over the five-year sample. If I were to redo this test, I would be interested to see the results of deseasonalizing the data by month instead. I would also increase my sample size significantly.
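Deseasonalizing a series and then fitting a trend line can be sketched in Python as follows. This sketch uses a simple additive adjustment (subtracting each season's average deviation from the overall mean), which is an assumption and may differ from the method behind the results above. The function names are hypothetical and the temperatures are generated for illustration, not taken from my buoy data.

```python
from statistics import mean


def deseasonalize(values, period=4):
    """Remove each season's average deviation from the overall mean."""
    overall = mean(values)
    seasonal_effect = [mean(values[i::period]) - overall for i in range(period)]
    return [v - seasonal_effect[i % period] for i, v in enumerate(values)]


def linear_trend(series):
    """Least-squares slope and intercept against time steps 0, 1, 2, ..."""
    times = list(range(len(series)))
    t_bar, y_bar = mean(times), mean(series)
    slope = (sum((t - t_bar) * (y - y_bar) for t, y in zip(times, series))
             / sum((t - t_bar) ** 2 for t in times))
    return slope, y_bar - slope * t_bar


# Generated data: 5 years of seasonal values with a built-in downward drift.
seasonal_pattern = [-4.0, 1.0, 5.0, -2.0]   # hypothetical seasonal offsets
temps = [10.0 - 0.1 * t + seasonal_pattern[t % 4] for t in range(20)]

adjusted = deseasonalize(temps)
slope, intercept = linear_trend(adjusted)
next_value = intercept + slope * len(adjusted)  # deseasonalized estimate for the next season
print(slope < 0)  # True: the generated data drifts downward
```

To turn the deseasonalized estimate into an actual forecast, the seasonal effect for the target season would need to be added back. Passing period=12 with monthly values would give the month-by-month version.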