Jillian Roberts
ISYE 3039

Part A Problem 1

When assembling a customer-designed complex system, challenges in predicting the costs will always arise. One particular complex system with various challenges associated with it is a turbine engine for generating electricity. The costs included in this turbine engine system are capital costs, operation and maintenance costs, and fuel costs. Some of the challenges present in calculating these costs are quality, varying efficiencies, the definition of the system boundary, external impacts of the system, and technology changes.

The first challenge in estimating the cost of assembling a customer-designed complex system is quality. The most significant reason for this challenge is that it is difficult to quantify the quality of a product. Quality is almost always directly related to cost: a higher-quality product will generally carry a higher cost, while a less expensive product may use less sophisticated technology and have lower quality. This rule often holds, but the difficulty lies in the quantifiable relationship between quality and cost. How does one determine that a certain part of the turbine is 50% "better in quality" than another part? Quality, a subjective measure, can cause many errors when estimating the costs of assembling the turbine engine.

A second challenge is that operation costs and the cost of assembly of a customer-designed complex system may be difficult to estimate because of varying efficiencies. Turbine engines powered by different sources such as steam, nuclear fission, fossil fuels, geothermal, wind, and water have different efficiencies. In addition, a turbine engine of one particular type will have varying efficiency based on quality, environment, system constraints, and other outside factors. The precise efficiency of a given turbine is difficult to predict, so it is difficult to know how much power to allow for the engine before and during assembly. Since the power is unknown, this presents a challenge in estimating the cost of the system. Also, once the system is in place and running, the varying efficiencies on a day-to-day, week-to-week, or month-to-month basis change the operation costs as well.

A third challenge in predicting the cost of complex systems lies in defining the boundary of the system and the costs involved with it. For example, the turbine engine system can be defined as primarily the turbine, the generating source, and the transmission. Expanding this definition, the system could also include all of the transmission lines and the distribution systems that source the materials and energy. It is critical to define the boundaries of the system in order to have a complete set of information and to define the precise components that enter the total cost of the system.

Furthermore, in order to perform accurate cost prediction, it is necessary to define what external impacts the system will include, if any. Any customer-designed complex system will have an impact on its surroundings and other associated costs that are not included in assembly or operation costs. The turbine engine system for generating electricity will have many external costs: research and development, taxes, and government subsidies. Does the system include these additional costs? Should impacts on the public and environmental damage be included? These questions need to be answered before cost calculation takes place in order to tackle the challenge of estimating the cost.
Another challenge that arises when predicting costs of complex systems such as the turbine engine is the unpredictability of technology changes. A new technology will increase the reliability and capacity of a system. This increase in performance comes at a price, however, as it will also increase the cost of the system. The difficulty in predicting cost lies in the fact that it is impossible to predict future technology changes and when each new technology will become available. When completing a cost forecast, it will be difficult to estimate the precise increase in cost and the duration of this increase. For example, if the system uses a steam turbine and a new steam turbine is known to be coming to market in the next two years, the calculation of the system's future cost is uncertain.

Part A Problem 2

[Flow chart (modified first block): Start simulation; initialize failure variables x, s, w and rates u. Do k = 1, K. Do i = 1, I. Does sensor i fail? If yes, set s(i) = 0 and replace the entire box. If no, does switch i fail? If yes, set w(i) = 0 and replace the entire box. If no, do j = 1, J: does repeater (i, j) fail? If yes and j = J, set x(i, j) = 0 and replace the entire box; if yes and j < J, continue to the next repeater; if no, go to the next cable box. After the last box, end the loop and go to the unchanged 2nd block.]

a) To update this system operation flow chart for the new policy, we need to modify the first block of the chart. We begin the same way as before by initializing the failure variables and looping through time (k = 1, K). Then, we loop through each cable box (i = 1, I). We first ask, does sensor i fail? If yes, set s(i) = 0, fix the entire cable box, and go to the next box. If no, continue and check the switch. Next we ask, does switch i fail? If yes, set w(i) = 0, fix the entire box, and go to the next box. If no, we continue and check the repeaters. The next step is to loop through each repeater (j = 1, J). We ask, does repeater (i, j) fail? If yes and j ≠ J, continue iterating through the rest of the repeaters to see if all of them are broken. If yes and j = J, set x(i, j) = 0, replace the entire cable box, and go to the next box. If any repeater does not fail, go to the next cable box. (A code sketch of this loop follows the first cable-box example below.)

b) To illustrate how to evaluate the impact on system profit made by the changed policy, we must study the impact the system has on cost and revenue, since profit = revenue − cost. The modification of the system should not change revenue, so we focus on the change in cost. Let's use an example and assume that a given system has three cable boxes, with the first cable box having a broken sensor, the second with nothing broken, and the third box with the second repeater broken. With the new modification of the flow chart, we will iterate through each cable box and determine if it needs to be replaced. For the first cable box, the first check is the sensor, which is broken, so we can immediately replace the entire cable box and all of its components without checking whether the other components are broken. This will save time in testing, but we won't know whether or not the other components are broken, so we might have replaced parts that were in good condition. Since there is a higher probability that only one component is broken than that two or more parts are broken at one time, we can say that the cost of this policy change for this cable box will increase. The original policy would replace only the broken sensor, but the new policy replaces the sensor, switch, and all repeaters when the sensor is broken. This increase in cost would decrease the profit of the system.
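As a concrete illustration of the modified first block described in part (a), the following is a minimal Python sketch. The box count, repeater count, and per-component failure probabilities are hypothetical stand-ins, not values given in the problem.

```python
import random

# A minimal sketch of the modified first block from part (a). The counts and
# failure probabilities below are hypothetical illustrations.
I_BOXES, J_REPEATERS = 3, 4
P_SENSOR, P_SWITCH, P_REPEATER = 0.05, 0.05, 0.10

def inspect_boxes():
    """One time step k: check every cable box under the new replacement policy."""
    replaced = []
    for i in range(I_BOXES):
        if random.random() < P_SENSOR:           # does sensor i fail?
            replaced.append(i)                   # set s(i) = 0, replace entire box
            continue                             # go to the next box
        if random.random() < P_SWITCH:           # does switch i fail?
            replaced.append(i)                   # set w(i) = 0, replace entire box
            continue
        # under the new policy the box is replaced for repeaters only if ALL J fail
        if all(random.random() < P_REPEATER for _ in range(J_REPEATERS)):
            replaced.append(i)                   # set x(i, J) = 0, replace entire box
    return replaced                              # then go on to the unchanged 2nd block

print(inspect_boxes())
```

Running the sketch many times and attaching hypothetical part costs to each replacement is one way to carry out the cost comparison discussed next.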
For the second cable box, we check the sensor, the switch, and all repeaters, all of which are working. This process does not change the cost of the system, for the original policy was set up the same way. For both the original and new policies, if none of the components are broken, we do not replace any component.

For the third cable box, we first check the sensor, which is not broken; then we move on to the switch, which is also running fine; and then we move on to the first repeater, which is working as well. The new policy states that all of the repeaters must be broken in order for there to be no signal. Since this first repeater is working, we know that there is a signal and that this cable box is working, so we do not need to check any more repeaters. The flow chart tells us to return to the cable box loop. Since this was the last cable box, we must continue to the unchanged 2nd block of the flow chart. The old policy states that if one of the repeaters is broken, we should replace that repeater. The new policy only replaces the cable box when all repeaters are broken. So for this example, the new policy saves the cost of one repeater. However, if all the repeaters are broken, we have continued through the flow chart and reached the repeaters, so we know that the switch and sensor are working. Since the new policy replaces not only the broken repeaters but also the working switch and sensor, the new policy increases the cost of the system and decreases the profit.

c) The new repair policy for the undersea cable system will overall increase the cost of the system, which will therefore decrease the profit, since profit = revenue − cost. The example in answer (b) illustrates this impact on cost and profit. For example, when a switch or sensor is broken, the new policy states that the entire cable box should be replaced, including the switch, sensor, and all repeaters. This increases the cost of the system because we would need to spend more on all of the new parts, instead of just the broken switch or sensor as in the original policy. Similarly, if all of the repeaters are broken, we have to replace the repeaters as well as the working switch and sensor. If only one of the repeaters is broken, we do not have to replace that repeater until all are broken. This momentarily decreases the cost of the system, but the payment is only delayed until all are broken, at which time the cost of the system increases. Therefore, as cost increases, the profit of the system decreases.

Part A Problem 3

a) A system that needs improvement is the barbecue at GTL's Independence Day celebration. This system is a simple food service line set up as a self-service buffet. Students lined up in the queue, which was first-in-first-out, and went down the "assembly line" picking up plates and cutlery, a drink, and servings of food at each station. There were two main problems in this system: the large quantity of people and the small capacity. The number of students in the queue at one time was much greater than the capacity of the system. The food service was very slow, which decreased the capacity of the system as a whole. The bottleneck was the meat station. Brochettes were prepared and grilled at a slower pace than it takes for the students to walk through the assembly line and pick up food. Therefore, there was a large back-up at the meat station, and the students behind were not able to move through the line. This was clearly noticeable at the beginning of the barbecue.
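To see the capacity problem numerically, here is a minimal sketch of the bottleneck reasoning: the line's throughput is capped by its slowest station. The per-station service rates (students per minute) are hypothetical illustrations; only their ordering matters.

```python
# Hypothetical service rates in students per minute; the meat station is slowest.
rates = {"plates": 10, "appetizers": 8, "drinks": 9, "vegetables": 8, "meat": 3}

throughput = min(rates.values())           # students per minute leaving the line
bottleneck = min(rates, key=rates.get)     # the station that sets that limit

arrival_rate = 6                           # hypothetical arrival rate to the queue
queue_growth = max(0, arrival_rate - throughput)   # students piling up per minute

print(bottleneck, throughput, queue_growth)        # -> meat 3 3
```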
When the buses arrived at GTL and the students got into line, no brochettes were ready or being cooked at the time. Therefore, students were able to move through the line only as far as the meat station, and the line stopped for approximately 20 minutes waiting for the brochettes.

The problem of the large quantity of people is not going to change. You cannot alter the demand or the number of people lining up for the barbecue. Thus, any improvements must be accomplished within the system, increasing its capacity and flow. First, to improve capacity, a simple improvement would be to begin preparing the brochettes before students line up. It is not a question of cook-to-order: all of the brochettes were the same, and the cooks knew how long it took to cook the meat. This is a simple application of mass production and of planning ahead for when to begin cooking, and it would greatly increase the capacity of the system. A second improvement could be made to improve the flow of the system. Instead of having the meat station near the end of the "assembly line," producing a bottleneck that halts the flow of the line before that point, the brochettes should be moved toward the front of the line. The position of the bottleneck, specifically moving the bottleneck to the beginning of a system, is a core principle in system management and significantly increases the flow and efficiency of a system. This is because the line's speed will be determined by the bottleneck, which will then not cause any domino-effect delays for the rest of the line.

b) The key issues involved in building the flow chart for the barbecue system lie in defining the design configuration. We must fix the design configuration and parameters and determine the system attributes in order to complete the flow chart successfully. For the barbecue system, the failure variables will be the individual stations (plates, appetizers, drinks, vegetables, and meats). We will test these variables and determine whether each station is flowing. We will have a "sensor" at each station to signal whether the station is flowing, which determines whether people are walking by. If the sensor signals, the station is flowing; if the sensor does not give a signal, we know that the station needs fixing. The input failure variables that we will initialize are p = plates, a = appetizers, d = drinks, v = vegetables, and m = meats. The time variable is also defined as t. The flow chart will iterate through each time interval and will check all failure variables. Notice that there are no multiple "cable boxes," so the system and all of its components are checked at each time period and there is no need for a second iteration. We also need to define how we are quantifying the results of the system. Unlike the undersea cable system, where we are looking at profit = revenue − cost, here we are looking at the overall system's performance in terms of capacity (the number of people who can enter the system at one time) and flow (the speed of the people in the line).

c) Flow chart of the GTL Barbecue System

[Flow chart: Start simulation; initialize failure variables p, a, d, v, and m and rate u. Do t = 1, T (time variable). Does plate p fail? If yes, set p(t) = 0. Do appetizers a fail? If yes, set a(t) = 0. Do drinks d fail? If yes, set d(t) = 0. Do vegetables v fail? If yes, set v(t) = 0. Does meat m fail? If yes, set m(t) = 0. Then continue to the second block.]

d) We can use this flowchart to perform what-if analysis to explore potential improvement options, as sketched below.
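A minimal sketch of the loop that the part (c) flow chart describes is given here; the failure probabilities are hypothetical stand-ins for the per-station "sensors."

```python
import random

# Minimal sketch of the part (c) flow chart: one pass over all stations at each
# time period t. The probabilities are hypothetical; the meat station is given
# the largest one to mirror the bottleneck discussion above.
T = 5
fail_prob = {"p": 0.05, "a": 0.05, "d": 0.05, "v": 0.05, "m": 0.30}

signal = {name: [1] * T for name in fail_prob}   # 1 = station flowing at time t

for t in range(T):
    for name, prob in fail_prob.items():
        if random.random() < prob:     # does station `name` fail at time t?
            signal[name][t] = 0        # e.g. set m(t) = 0
    # ...continue to the second (unchanged) block of the simulation here

print(signal)
```

Re-running the sketch with a lower meat-station failure probability, or with the stations reordered, is exactly the kind of what-if comparison part (d) refers to.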
For the GTL Barbecue system, we will discuss two improvements. The first improvement is to increase the capacity of the system. We notice that the bottleneck of the system is the meat station. We can increase the capacity by beginning to prepare the brochettes before students line up. This change in procedure would allow more meat to be available at a given time, so students would move quickly through the station, which corresponds directly to an increase in the signal at the meat station. This increase in capacity would not change the structure of the flowchart, but it would significantly improve the results. The signal of the meat station would not "fail" as frequently, which would therefore increase the number of people who could be in the system at one given time. The increase in capacity would thereby increase the overall system performance.

A second potential improvement is to increase the flow of the system. Again, we notice that the bottleneck of the system is the meat station. To remove this bottleneck, we can move the meat station to the beginning of the line, after the plate station. The position of the bottleneck, specifically moving the bottleneck to the beginning of a system, is a core principle in system management and significantly increases the flow and efficiency of a system. This is because the line's speed will be determined by the bottleneck, which will then not cause any domino-effect delays for the rest of the line. As in the previous example, the meat station would not "fail" as frequently, which would this time increase the speed of the people in the system. This increase in flow would therefore increase the overall system performance. The new flow chart for this example follows.

Flow chart for the second potential improvement (increase flow):

[Flow chart: Start simulation; initialize failure variables p, a, d, v, and m and rate u. Do t = 1, T (time variable). Does plate p fail? Does meat m fail? The remaining stations (appetizers, drinks, vegetables) are then checked as before; on any failure the corresponding variable (p(t), m(t), a(t), d(t), or v(t)) is set to 0. Then continue to the second block.]
Part B Problem 1

Final Fitted Model – Coded Variables:
Y = 82.17 − 1.0 x1 − 6.09 x2 + 1.02 (x1)² − 4.48 (x2)² − 3.6 (x1 x2)

Final Fitted Model – Original Variables:
Y = −631.85 + 7.08(Time) + 5.69(Temp) + 0.03(Time²) − 0.01(Temp²) − 0.03(Time×Temp)

ANOVA Table
             df    SS          MS          F        Significance F
Regression    5    503.7284    100.7457    4.0305   0.0761
Residual      5    124.9789     24.9958
Total        10    628.7073

Regression Statistics
R² = SSR/SST = 503.73/628.71 = 0.8012
MSE = SSE/df = 124.98/5 = 24.996

Variable Statistics for Regression – Using Coded Variables
            Coefficient   Standard Error   t Stat     P-value
Intercept    82.1667       2.8865          28.4658    1.00E-06
x1           -1.0121       1.7676          -0.5726    0.5917
x2           -6.0876       1.7676          -3.4440    0.0184
(x1)²         1.0167       2.1039           0.4832    0.6493
(x2)²        -4.4833       2.1039          -2.1310    0.0863
x1 x2        -3.6000       2.4998          -1.4401    0.2094

Part B Problem 2

a) Data:
 x    Mean (2.5 − 1.8x)     e           Y
 1      0.7                -1.1788      -0.4788
 2     -1.1                 1.8571       0.7571
 3     -2.9                -0.6686      -3.5686
 4     -4.7                -0.9611      -5.6611
 5     -6.5                 0.9957      -5.5043
 6     -8.3                 1.2483      -7.0517
 7    -10.1                -1.9980     -12.0980
 8    -11.9                 0.7306     -11.1694
 9    -13.7                 0.3475     -13.3525
10    -15.5                 0.6350     -14.8650

Simple linear regression model fitted to the resulting data (Y, x): Y = 2.2709 − 1.74x, with R² = 0.949.

Regression Statistics
Multiple R           0.9742
R Square             0.9490
Adjusted R Square    0.9427
Standard Error       1.2947
Observations         10

ANOVA Table
             df    SS          MS          F          Significance F
Regression    1    249.7854    249.7854    149.0105   1.88E-06
Residual      8     13.4104      1.6763
Total         9    263.1958

Variable Statistics for Regression
              Coefficient   Standard Error   t Stat      P-value
Intercept      2.2709        0.8845           2.5676     0.0333
X Variable 1  -1.7400        0.1425         -12.2070     1.88E-06

Regression Plot: [Line fit plot of Y versus x, showing the data and the fitted line y = −1.74x + 2.2709, R² = 0.949.]

Comments: The analysis of the model Y = 2.5 − 1.8x + e, where e is i.i.d. normal error with mean zero and standard deviation σ = 1.0, yields the fitted linear regression model Y = 2.2709 − 1.74x with an R² value of 0.949. The large R² value shows that the fitted model explains most of the variation in the simulated data and closely recovers the true model. The estimate of the regression slope, −1.74, has a P-value of 1.88 × 10⁻⁶. Since this p-value is less than α = 0.01, the estimated regression slope is significant at α = 0.01.

Part B Problem 3 – Data Analysis Report

Outline of Report
I. Main Report
   a. Introduction
      i. Problem/Data Background
      ii. Outline of Analysis Steps
      iii. Conclusion from Study
   b. Data Analysis Details by Step
      i. Exploratory Analysis – Box-Cox Transformation
      ii. Data Transformation Plots
      iii. VIF Analysis
      iv. Variable Selection
      v. Model Checking/Diagnostics
      vi. Final Model
   c. Observations from Analysis
   d. Reference
II. Appendix

Main Report

A. Introduction

i. Problem/Data Background

Problem #3 Statement: Analyze the dataset "fram200.xls", which is a real-life data set. Apply the Box-Cox transformation to the output variable, sbp, to make it more normally distributed. Model the transformed sbp against significant input variables selected by a variable selection method.
Note that some of the input variables might require transformation, and there are possible significant combination effects from input variables. Provide regression analyses and model checking plots as illustrated in the lectures. Any missing components in these analyses and plots will cause point deductions.

Data: The dataset "fram200.xls" includes eight input variables and one output variable and deals with the effects on Systolic Blood Pressure, sbp (the output variable). The input variables are sex (where 1 is "men"), dbp (diastolic blood pressure), scl (serum cholesterol), chdfate (coronary heart disease, where 1 is "CHD"), followup (in days), age (in years), bmi (body mass index), and month (study month of baseline exam).

ii. Outline of Analysis Steps

We will begin the data set analysis by applying the Box-Cox transformation to the y-data. This process will make the y-data more normally distributed, make the relationship between the x-variables and SBP more linear, and make the variance more equal. The next step is to explore data transformations of the x-variables to make each relationship more linear. We apply the Box-Cox transformation process to the x-variables to create "possible" transformations; the appropriateness and validity of these transformations will be further evaluated in later steps. The third step of the data analysis is to conduct a VIF (Variance Inflation Factor) analysis on each of the x-variables. This is done to avoid inflation of variance in estimation and to determine which variables would affect the variance. The fourth step is to conduct variable selection, in this case backward regression, to remove the variables that are not necessary for the overall model. The result of such a regression will be a model that includes only those variables that have a significant effect on the response. The fifth step is to examine the model fitting quality and to produce model checking plots. These plots will allow us to see whether the model assumptions are violated. The final step is to provide interpretations and conclusions derived from the analysis steps. The conclusion will also state the purpose of the model and how to use the model in future studies.

iii. Conclusion from Study

The dataset "fram200.xls" shows the relationship between many input variables and systolic blood pressure. The backward regression produced a final model including sqrt(Month), age, sex, DBP, and CHDFate. These variables have the most significant effect on systolic blood pressure. Since the regression removed the other variables, we can assume that serum cholesterol, followup, and body mass index do not have significant effects on systolic blood pressure. This model can be used for further work in predicting the systolic blood pressure of patients. The model can also be further improved in the future if one desires to test the effect of new factors on systolic blood pressure.

B. Data Analysis Details by Step

i. Exploratory Analysis

The first step of the data analysis is to apply the Box-Cox transformation procedure to the output variable, sbp, to make the variable more normally distributed (Reference 1). Details of the transformation are located in Appendix Section 1.
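A minimal sketch of this Box-Cox step outside Minitab is shown below; the file name comes from the problem statement, while the column name "sbp" is an assumption about the dataset's layout.

```python
import pandas as pd
from scipy import stats

# Sketch of the Box-Cox step described above (column name assumed).
df = pd.read_excel("fram200.xls")

# boxcox() estimates the optimal lambda by maximum likelihood and returns the
# transformed values; sbp must be strictly positive, which blood pressure is.
sbp_transformed, lam = stats.boxcox(df["sbp"])
print(f"estimated lambda: {lam:.2f}")   # the report found about -1.41, rounded to -1

# With the rounded value lambda = -1, the working transformation is 1/y.
df["sbp_inv"] = 1.0 / df["sbp"]
```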
[Box-Cox plot of sbp (standard deviation vs. λ, 95% confidence interval): estimated λ = −1.41, lower CL = −1.99, upper CL = −0.76, rounded value = −1.00.]

Since the optimal value for λ is −1.41 and the rounded value is −1.00, the corresponding transformation is 1/y (Reference 2). The lower and upper bounds of the 95% confidence interval are −1.99 and −0.76, respectively.

ii. Data Transformation and Plots

The procedure for transforming the x-variables begins with creating plots of the x-y relationships. We create a plot of every x-variable against 1/y, the transformed output variable. We then examine each plot and note whether the relationship is linear. If the relationship is linear, no transformation is needed. If we notice a pattern, we apply the Box-Cox transformation to that x-variable to obtain a potential transformed x-variable. The process of creating data plots is necessary to explore what transformation might be needed to make the x-y relationships more linear, the variance more equal, and the y-data more normally distributed. For this data set, we notice that dbp, sex, and chdfate are either linear or binary (probit-like), so there is no need to transform these variables. We confirm this by completing Box-Cox transformations on these variables and obtaining λ values of 1. The other variables have potential transformations according to the Box-Cox method: SCL and age both have λ = −0.5, which means they exhibit 1/√x relationships; followup has λ = 1.43 and becomes x^1.43; month has λ = 0.5 and becomes √x; finally, BMI has λ = −1 and becomes 1/x. Details are found in Appendix Section 2.

iii. VIF Analysis

Variance Inflation Factor (VIF) analysis is a method that regresses each x-variable against all other x-variables. The purpose of this method is to remove variables that would inflate the variance in estimation. The VIF of a given x is equal to 1/(1 − R²), where R² comes from regressing that x on all of the other x-variables. If the VIF is greater than 5, we would remove the x-variable from the model. A VIF greater than 5 corresponds to a variable that is highly correlated with the others; if such a variable were included in the model, the estimate of the population standard deviation would be overestimated. For this data set, we regress each variable and each transformed variable against all other variables and calculate the VIF for each x-variable. When we complete this task, we notice that all of the VIF values range from 1.0007 to 1.305. All of these values are below 5, which tells us that none of these variables will inflate the estimate of the population standard deviation, so we do not remove any of them. Details of the VIF analysis are found in Appendix Section 3.

iv. Variable Selection

Since the VIF analysis did not remove any variables, we will use another variable selection method to remove variables that are not necessary in the model. In this data set, we will use backward regression because we have a comprehensive model with many variables and need to reduce that number. The process begins by determining our initial comprehensive model, which includes all of our variables and transformed variables: 1/sqrt(SCL), Followup^1.43, 1/sqrt(Age), sqrt(Month), 1/BMI, SCL, Followup, Age, Month, BMI, Sex, DBP, and CHDfate. The second step is to define an alpha-to-remove, which will be the default of 15% for our purposes.
The last step is to define a stopping criterion, which in this case is reaching the peak R²(adj) value, so that all of the variables left are significant. The final model based on the backward regression from Minitab is 1/SBP = −1.686 − 2.2 sqrt(Month) + 0.59 Age + 4.1 Sex + 1.260 DBP + 2.9 CHDfate. This model has the peak R²(adj) value of 62.03% (0.6203). Details of this variable selection are in Appendix Section 4.

v. Model Checking/Diagnostics

A. Linear Model – [Versus Fits plot: residuals vs. fitted value, response is sbp.] We notice that, with the exception of one high outlier, there is the same spread throughout the range of the predicted y. Because of the uniform spread, we can assume the model is linear.

B. Y-data are normally distributed – [Normal probability plot of the residuals, response is sbp.] From this normal quantile-quantile plot, we notice that when the residuals of the response, sbp, are plotted against the percentage, a straight line is exhibited. This straight line confirms that the y-data are normally distributed.

C. Equal Variance – Since we know that the y-data are normally distributed, we know that they also exhibit equal variance, based on the definition of a normal distribution.

D. Y-data are independent – [Plot of y(t) vs. y(t+1).] Since this graph shows no pattern and the spread is uniform, we can assume that the y-data are independent, in that y(t+1) does not depend on the previous y(t).

E. Conclusion – Based on the regression analysis reports and model checking plots, we can be confident that the model selected by the backward regression fits the data well. The model checking plots also allow us to confirm that the model assumptions are upheld.

vi. Final Model

Regression Analysis Table (S = 14.1, R-Sq = 62.99%, R-Sq(adj) = 62.03%)

Variable    Coefficient   T-Value   P-Value
Constant    -1.686
√Month      -2.2          -1.74     0.083
Age          0.59          4.72     0.000
Sex          4.1           2.03     0.044
DBP          1.26         15.17     0.000
CHDFate      2.9           1.31     0.192

The final regressed model is step nine of the backward regression. The model suggests that the variables that have a significant impact on the Systolic Blood Pressure (SBP) response are the square root of the study month of the baseline exam, age, sex, Diastolic Blood Pressure (DBP), and whether the patient has coronary heart disease (CHDFate). The backward regression began with all of the variables and determined that these five variables have the most significant impact on SBP. We use this regressed model to predict the SBP of a patient given their values of month, age, sex, DBP, and CHDFate.

1/SBP = −1.686 − 2.2 sqrt(Month) + 0.59 Age + 4.1 Sex + 1.26 DBP + 2.9 CHDfate

C. Observations from Analysis

The dataset "fram200.xls" shows the relationship between many input variables and systolic blood pressure. The y-data were first transformed to 1/SBP using the Box-Cox transformation. This made the y-data more normally distributed and the relationship between the x-variables and SBP more linear. Some of the x-variables were transformed to make those relationships more linear as well. The VIF analysis was then conducted to avoid inflation of variance in estimation and to determine which variables would affect the variance. The VIF analysis showed that none of the variables had a significant impact on the variance, so we could not remove any variables at this step.
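For reference, the VIF described above can be computed with a short sketch such as the following; the column names are assumptions about the fram200.xls layout, not confirmed by the source.

```python
import pandas as pd
import statsmodels.api as sm

# Sketch of the VIF calculation used in the report: regress each x-variable on
# all the others and compute VIF = 1 / (1 - R^2).
def vif_table(X: pd.DataFrame) -> pd.Series:
    vifs = {}
    for col in X.columns:
        others = sm.add_constant(X.drop(columns=col))
        r2 = sm.OLS(X[col], others).fit().rsquared
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs)

# Example usage with hypothetical predictor columns (original and transformed):
# X = df[["sex", "dbp", "chdfate", "age", "scl", "bmi", "month", "followup"]]
# print(vif_table(X))   # values above 5 would flag a collinear variable
```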
We then conducted variable selection using backward regression with all of the original and transformed variables. This step removed most of the transformed variables as well as some of the original variables. The regression produced a final model including sqrt(Month), age, sex, DBP, and CHDFate. These variables have the most significant effect on systolic blood pressure. Since the regression removed the other variables, we can assume that serum cholesterol, followup, and body mass index do not have significant effects on systolic blood pressure. This model can be used for further work in predicting the systolic blood pressure of patients. The model can also be further improved in the future if one desires to test the effect of new factors on systolic blood pressure.

D. References

1. Wikipedia contributors. "Power transform." Wikipedia, the Free Encyclopedia. 15 Jul. 2010. Web. 28 Jul. 2010.
2. Minitab contributors. "Box-Cox Transformation – Graphical Output." Minitab StatGuide. Minitab Help. 28 Jul. 2010.

II. Appendix

1. Box-Cox Transformation

The Box-Cox transformation covers the popular "power" transformations such as ln(x), x², √x, and 1/x. The process decides the best power transformation rigorously. For example, the relationship (x, y) can be transformed into (ln(x), ln(y)). The Box-Cox formula is as follows:

Y_i(λ) = (Y_i^λ − 1) / (λ · GM(y)^(λ−1))   if λ ≠ 0
Y_i(λ) = GM(y) · ln(y_i)                   if λ = 0
for i = 1, 2, 3, …, n

where GM(y) is the geometric mean of the data {y1, y2, …, yn} = (y1 · y2 · … · yn)^(1/n).

The benefits of the Box-Cox transformation are 1) to make the x-y relationship more linear, 2) to stabilize the variance and meet the equal-variance assumption, and 3) to make the data more normal.

2. Data plots and possible transformations of x-variables

[Scatter plots of each input variable against 1/SBP: Sex vs 1/SBP, SCL vs 1/SBP, Followup vs 1/SBP, DBP vs 1/SBP, Chdfate vs 1/SBP, Age vs 1/SBP, BMI vs 1/SBP, and Month vs 1/SBP.]

[Box-Cox plot of BMI: estimate λ = −1.04, 95% CI (−1.86, −0.24), rounded value −1.00; scatter plot of 1/BMI vs 1/SBP.]
[Box-Cox plot of scl: estimate λ = −0.63, 95% CI (−1.17, −0.17), rounded value −0.50; scatter plot of 1/sqrt(SCL) vs 1/SBP.]
[Box-Cox plot of followup: estimate λ = 1.43, 95% CI (1.16, 1.74), rounded value 1.43; scatter plot of followup^1.43 vs 1/SBP.]
[Box-Cox plot of age: estimate λ = −0.44, 95% CI (−1.22, 0.33), rounded value −0.50; scatter plot of 1/sqrt(Age) vs 1/SBP.]
[Box-Cox plot of month: estimate λ = 0.48, 95% CI (0.28, 0.67), rounded value 0.50; scatter plot of sqrt(Month) vs 1/SBP.]

3. VIF Analysis Details

Sex regressed on the other predictors:
Predictor   Coef        SE Coef     T       P
Constant     1.8397      0.3941      4.67   0.000
dbp         -0.00275     0.003117   -0.88   0.379
scl          0.000622    0.000815    0.76   0.446
chdfate     -0.12208     0.08537    -1.43   0.154
followup    -1.4E-06     1.21E-05   -0.11   0.909
age         -6.7E-05     0.004754   -0.01   0.989
month       -0.00143     0.009953   -0.14   0.886
BMI         -0.00578     0.01076    -0.54   0.592
S = 0.501997, R-Sq = 2.4%, R-Sq(adj) = 0.0%
The regression equation is sex = 1.84 - 0.00275 dbp + 0.000622 scl - 0.122 chdfate - 0.000001 followup - 0.00007 age - 0.00143 month - 0.0058 BMI. VIF = 1.00007.

DBP regressed on the other predictors:
Predictor   Coef        SE Coef     T       P
Constant    53.654       8.819       6.08   0.000
scl          0.02561     0.01881     1.36   0.175
chdfate      2.4         1.98        1.21   0.227
followup    -0.00038     0.00028    -1.36   0.177
age          0.1258      0.1097      1.15   0.253
month       -0.2929      0.2296     -1.28   0.204
BMI          0.924       0.2403      3.85   0.000
sex         -1.474       1.673      -0.88   0.379
S = 11.6279, R-Sq = 19.3%, R-Sq(adj) = 16.3%
The regression equation is dbp = 53.7 + 0.0256 scl + 2.40 chdfate - 0.000379 followup + 0.126 age - 0.293 month + 0.924 BMI - 1.47 sex. VIF = 1.194743.

Chdfate regressed on the other predictors:
Predictor   Coef        SE Coef     T       P
Constant    -0.2382      0.3503     -0.68   0.497
scl          0.001853    0.000675    2.75   0.007
followup    -3.9E-05     9.82E-06   -4.02   0.000
age         -9E-06       0.004008    0.00   0.998
month        0.006498    0.008379    0.78   0.439
BMI          0.012459    0.009033    1.38   0.169
sex         -0.08677     0.06068    -1.43   0.154
dbp          0.003179    0.002624    1.21   0.227
S = 0.423230, R-Sq = 21.8%, R-Sq(adj) = 19.0%
The regression equation is chdfate = -0.238 + 0.00185 scl - 0.000039 followup - 0.00001 age + 0.00650 month + 0.0125 BMI - 0.0868 sex + 0.00318 dbp. VIF = 1.234568.

Followup^1.43 regressed on the other predictors:
Predictor     Coef        SE Coef    T       P
Constant      -251230      282743    -0.89   0.375
sex                173       28447    0.01   0.995
dbp              -1696        1222   -1.39   0.167
chdfate        -144797       32152   -4.50   0.000
1/sqrt(age)    4249680     1099526    3.87   0.000
sqrt(month)      35852       17334    2.07   0.040
1/BMI           318976     2725837    0.12   0.907
1/sqrt(scl)    2187474     2351461    0.93   0.353
S = 196858, R-Sq = 26.1%, R-Sq(adj) = 23.4%
The regression equation is followup^1.43 = -251230 + 173 sex - 1696 dbp - 144797 chdfate + 4249680 1/sqrt(age) + 35852 sqrt(month) + 318976 1/BMI + 2187474 1/sqrt(scl). VIF = 1.305483.

1/sqrt(age) regressed on the other predictors:
Predictor       Coef        SE Coef     T       P
Constant         0.12253     0.01561     7.85   0.000
sex             -0.00046     0.001803   -0.25   0.799
dbp             -8.7E-05     7.76E-05   -1.13   0.262
chdfate          0.000007    0.002143    0.00   0.997
followup^1.43    2E-08       0.000       3.87   0.000
sqrt(month)     -0.0019      0.001102   -1.73   0.086
1/BMI            0.2539      0.1718      1.48   0.141
1/sqrt(scl)      0.3321      0.1474      2.25   0.025
S = 0.0124761, R-Sq = 19.1%, R-Sq(adj) = 16.1%
The regression equation is 1/sqrt(age) = 0.123 - 0.00046 sex - 0.000087 dbp + 0.00001 chdfate + 0.000000 followup^1.43 - 0.00190 sqrt(month) + 0.254 1/BMI + 0.332 1/sqrt(scl). VIF = 1.191895.

Sqrt(month) regressed on the other predictors:
Predictor       Coef        SE Coef    T       P
Constant         4.097       1.132      3.62   0.000
sex             -0.0183      0.1174    -0.16   0.876
dbp             -0.00453     0.00506   -0.90   0.372
chdfate          0.1151      0.1394     0.83   0.410
followup^1.43    6.1E-07     3E-07      2.07   0.040
1/sqrt(age)     -8.069       4.677     -1.73   0.086
1/BMI           -3.36       11.25      -0.30   0.765
1/sqrt(scl)     -5.286       9.722     -0.54   0.587
S = 0.812706, R-Sq = 3.7%, R-Sq(adj) = 0.2%
The regression equation is sqrt(month) = 4.10 - 0.018 sex - 0.00453 dbp + 0.115 chdfate + 0.000001 followup^1.43 - 8.07 1/sqrt(age) - 3.4 1/BMI - 5.29 1/sqrt(scl). VIF = 1.002004.

1/BMI regressed on the other predictors:
Predictor       Coef        SE Coef     T       P
Constant         0.038868    0.006975    5.57   0.000
sex              0.000833    0.000753    1.11   0.270
dbp             -0.00013     3.13E-05   -4.07   0.000
chdfate         -0.00105     0.000894   -1.18   0.241
followup^1.43    0.000       0.000       0.12   0.907
1/sqrt(age)      0.04454     0.03013     1.48   0.141
sqrt(month)     -0.00014     0.000465   -0.30   0.765
1/sqrt(scl)      0.07164     0.06234     1.15   0.252
S = 0.00522541, R-Sq = 17.2%, R-Sq(adj) = 14.2%
The regression equation is 1/BMI = 0.0389 + 0.000833 sex - 0.000127 dbp - 0.00105 chdfate + 0.000000 followup^1.43 + 0.0445 1/sqrt(age) - 0.000139 sqrt(month) + 0.0716 1/sqrt(scl). VIF = 1.165501.
1/sqrt(SCL) regressed on the other predictors:
Predictor       Coef        SE Coef     T       P
Constant         0.058214    0.007611    7.65   0.000
sex             -0.00068     0.000872   -0.78   0.434
dbp             -5.7E-05     3.75E-05   -1.53   0.127
chdfate         -0.00264     0.001021   -2.58   0.010
followup^1.43    0.000       0.000       0.93   0.353
1/sqrt(age)      0.07794     0.0346      2.25   0.025
sqrt(month)     -0.00029     0.000538   -0.54   0.587
1/BMI            0.09584     0.0834      1.15   0.252
S = 0.00604388, R-Sq = 16.3%, R-Sq(adj) = 13.2%
The regression equation is 1/sqrt(scl) = 0.0582 - 0.000684 sex - 0.000057 dbp - 0.00264 chdfate + 0.000000 followup^1.43 + 0.0779 1/sqrt(age) - 0.000292 sqrt(month) + 0.0958 1/BMI. VIF = 1.152074.

4. Variable Selection

Backward Regression with Alpha-to-Remove = 15%

[Minitab backward elimination output, steps 1–10. At each step the output lists the constant and, for every term still in the model, its coefficient, T-value, and P-value, together with S, R-Sq, and R-Sq(adj). The clearly recoverable summary values are: steps 1–4 have S ≈ 14.2–14.3, R-Sq ≈ 63.5%, and R-Sq(adj) rising from 61.01% to 61.59% as Followup^1.43, Followup, and Month are dropped; by step 9 the model contains sqrt(Month) (coef −2.2, T = −1.74, P = 0.083), Age (0.59, T = 4.72, P = 0.000), Sex (4.1, T = 2.03, P = 0.044), DBP (1.26, T = 15.17, P = 0.000), and CHDfate (2.9, T = 1.31, P = 0.192), with constant −1.686, S = 14.1, R-Sq = 62.99%, and the peak R-Sq(adj) = 62.03%; step 10 drops CHDfate and R-Sq(adj) falls to 61.89%.]

Part B Problem 5 – Data Analysis Report

Main Report

E. Introduction

i. Problem/Data Background

Problem #5 Statement: Analyze the dataset "Data_Forecast.xls", which is a real-life data set. Note that you need to code the dates in the first column to time points 1, 2, … Apply three exponential smoothing methods to forecast the future three data points. Compare the results and pick the best forecasting method for this data set. Can you provide a 90% prediction interval for the forecasting results based on your best forecasting method?

Data: The data set includes Dates and a response variable Y.
The dates, composed of December, March, June, and September for 10 sequential years, are given coded time points 1, 2, 3, …, 40.

ii. Outline of Analysis Steps

To analyze the real-life data set "Data_Forecast.xls", we will first graph the x-y relationship and note the shape of the graph and its characteristics. We will then conduct three different exponential smoothing methods and examine the results of each: simple exponential smoothing, Holt's exponential smoothing, and Winters' exponential smoothing. All three methods will be compared using measures of accuracy and SSE values. These measures of accuracy for the time series analysis are based on the residuals of the smoothing: Mean Absolute Percentage Error (MAPE), Mean Absolute Deviation (MAD), and Mean Squared Deviation (MSD). SSE is the sum of squares of the prediction errors. Once we determine which exponential smoothing method to use, we will forecast three data points into the future and develop a 90% prediction interval for each. This model can be used to forecast more data points in the future and to predict the response in further studies.

iii. Conclusion from Study

The graph of Time vs. Response exhibits a curved, non-linear, wave-like pattern. It also shows that the response variable y has a seasonal trend and cycles every four data points, which corresponds to a year. The measures of accuracy and SSE values will determine that Winters' method is the most effective way to smooth and forecast the dataset.

F. Data Analysis Details by Step

i. Exploratory Analysis

[Plot of Time (t) vs. Response Y for t = 1 to 40.]

From the initial examination of the data and the time vs. response graph, we notice that the data exhibit a curved, non-linear, wave-like pattern.

ii. Simple Exponential Smoothing (EWMA)

We will use the Exponentially Weighted Moving Average (EWMA) for the simple exponential smoothing process. Single exponential smoothing smooths the data by computing exponentially weighted averages and provides short-term forecasts. This procedure works best for random data points without a trend or seasonal component (Reference 1).

[Smoothing plot for Y, single exponential method: actual values, fits, three forecasts, and 95% prediction interval. Smoothing constant α = 0.344237. Accuracy measures: MAPE = 24.56, MAD = 62.33, MSD = 5476.61.]

We notice in the smoothing plot for Y that the fitted values (red data points) from simple exponential smoothing do not correspond well to the actual data provided in the data set. We also notice that the three forecast points are horizontal. Since the actual data are curved and not horizontal, we can determine that simple exponential smoothing is not the best method to smooth and forecast these data. The α value derived from the model is 0.344, which is considered small. Since the α value is small, the weights decay slowly and the smoothing effectively averages over a large window of past data, so the fit responds slowly to recent changes. This further indicates that simple exponential smoothing is not the best method to smooth and forecast these data.

iii. Holt's Exponential Smoothing

We will use the double exponential method (otherwise known as Holt's exponential smoothing) to provide an alternative for smoothing and forecasting the data.
Double exponential smoothing smooths the data by Holt's (and, as a special case, Brown's) double exponential smoothing and provides short-term forecasts. This procedure can work well when a trend is present, but it can also serve as a general smoothing method (Reference 2).

[Smoothing plot for Y, double exponential method: actual values, fits, forecasts, and 95% prediction interval. Smoothing constants: α (level) = 0.534123, γ (trend) = 0.067010. Accuracy measures: MAPE = 24.91, MAD = 59.21, MSD = 5969.92.]

We notice that in the smoothing plot of Y for the double exponential method, the fitted data (red points) match the actual data (black points) reasonably well. The up-and-down motion of the fitted data is shifted to the right of the actual data, but the shape of the curve is quite similar to the actual data. The forecasts (green points) appear to fit the actual data better than those of the simple exponential method, since they are not three horizontal points. This is because the data provided in the data set exhibit a trend, and Holt's method handles trending data more effectively than the simple method. We notice that the α value determined by the method, 0.534, is considered large; this corresponds to a smaller effective window, so the most recent data are weighted more heavily. We also notice that the γ (trend) value, 0.067, is very small, which means the estimated trend changes slowly and the past trend carries over into the forecasts.

iv. Winters' Exponential Smoothing

Winters' exponential smoothing is used for a linear trend with seasonal cycles. We will use the multiplicative version of the method (the seasonal component multiplies the level and trend rather than adding to them). Winters' method smooths the data by Holt-Winters exponential smoothing and provides short- to medium-range forecasts. This procedure can be used when both trend and seasonality are present. Winters' method calculates dynamic estimates for three components: level, trend, and seasonal (Reference 3).

[Winters' method plot for Y, multiplicative method: actual values, fits, forecasts, and 95% prediction interval. Smoothing constants: α (level) = 0.2, γ (trend) = 0.2, δ (seasonal) = 0.2. Accuracy measures: MAPE = 19.49, MAD = 51.97, MSD = 3659.43.]

We notice in the Winters' method plot for Y that the fitted values (red data points) match the actual data from the data set (black data points) very well. This graph shows that Winters' method matches the data best among the three exponential smoothing methods. This is because Winters' method accounts for the seasonality and trend of the dataset. We also notice that the three forecasted points follow the pattern of the rest of the data points and are not linear. Based on the visual aspects of the graphs alone, we can determine that Winters' method is the best method for smoothing and forecasting this dataset.
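A minimal sketch of how the three fits above could be reproduced outside Minitab is shown below, using statsmodels. The file and column names are assumptions about Data_Forecast.xls, and the smoothing constants are left to be optimized rather than fixed to the Minitab values, so the numbers will differ somewhat from the plots above.

```python
import pandas as pd
from statsmodels.tsa.holtwinters import (
    SimpleExpSmoothing, Holt, ExponentialSmoothing)

# Hypothetical file/column names; Y is the quarterly response, t = 1..40.
y = pd.read_excel("Data_Forecast.xls")["Y"]

simple = SimpleExpSmoothing(y).fit()                  # level only
holt = Holt(y).fit()                                  # level + trend
winters = ExponentialSmoothing(                       # level + trend + season
    y, trend="add", seasonal="mul", seasonal_periods=4).fit()

for name, model in [("simple", simple), ("Holt", holt), ("Winters", winters)]:
    sse = (model.resid ** 2).sum()                    # sum of squared errors
    print(name, round(sse, 1), model.forecast(3).round(1).tolist())
```

Comparing the three printed SSE values and forecasts mirrors the accuracy and SSE comparison carried out in the next subsection.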
MSD is a commonly used measure of accuracy of fitted time series values. To determine the best exponential smoothing method, we need to find the method that has the smallest error values (Reference 4). Simple Holt’s Winter’s MAPE 24.56 24.91 19.49 MAD 62.33 59.21 51.97 MSD 5476.61 5969.92 3659.43 We notice from this table that the Winter’s method has the smallest values for all three measures of accuracy. This means that, based on measures of accuracy comparison, Winter’s method is the best smoothing and forecasting method for this particular dataset since the accuracy of its predictions are the highest (and deviations are the lowest). We can also compare the results of the exponential smoothing methods using the SSE values. SSE stands for Sum of Squares of Prediction Errors. This is calculated from the Minitab results by squaring the prediction errors (output) and taking the sum of the squares. Smoothing Method SSE Simple 219064.5 Holt’s 238796.5 Winter’s 146377.3 We notice from this SSE table that the Winter’s method has the smallest value for the Sum of Squares of Prediction Errors. This means that, based on SSE comparison, Winter’s method is the best smoothing and forecasting method for this particular dataset since the error is the lowest of all three methods. Confidence Interval Minitab returns the values of three forecasted data points. We will use these data points and create a 90% prediction interval for each. The formula for the prediction interval is: ŷ ± tα/2 s √ [ (1/m) + (tF – tbar)2 / stt ] The prediction intervals calculated for each of the time points are: Period Forecast Lower Bound (90%) Upper Bound (90%) 41 172.84 52.75567 292.9243 42 250.219 130.1347 370.3033 43 278.4 158.3157 398.4843 The details of calculating the prediction interval is found in Appendix. vi. 28 G. Observations from Analysis The real‐life data set “DataForecast.xls” exhibits a curved pattern, non‐linear, wave‐like pattern. The graph of Time vs. Response shows that the response y variable has a seasonal trend and cycles every four data points, which corresponds to a year. This was our first clue that Winter’s exponential smoothing method would work best for this data set because Winter’s accounts for the seasonal trend. We then conducted three different exponential smoothing methods and examined the results of each. The first method is simple exponential smoothing, which had a small alpha value, the fitted data did not match up well with the actual data, and the forecasting was horizontal. All of these reasons support the notion that the simple exponential smoothing does not work with the data set well. The Holt’s and Winter’s methods fit the data set much cleaner and forecasted data points in the future more accurately that the simple method. All three methods were compared using measure of accuracies and SSE values, all of which determined that Winter’s method was the most effective way to smooth and forecast the dataset. Once we determine to use Winter’s method, we could forecast three data points into the future and develop 90% confidence intervals for each. This model can be used to forecast more data points in the future and to predict the response in further studies. H. References 1. Minitab contributors. “Single Exponential Smoothing.” Minitab Statguide. Minitab Help. 28 Jul. 2010. 2. Minitab contributors. “Double Exponential Smoothing.” Minitab Statguide. Minitab Help. 28 Jul. 2010. 3. Minitab contributors. “Winter’s Method.” Minitab Statguide. Minitab Help. 28 Jul. 2010. 4. 
4. Minitab contributors. "Measures of Accuracy." Minitab StatGuide. Minitab Help. 28 Jul. 2010.
5. MedCalc contributors. "Values of the t-distribution." MedCalc Software. MedCalc. 28 Jul. 2010.

II. Appendix

Prediction Interval Calculations:

Formula: ŷ ± t(α/2) · s · √[ (1/m) + (tF − t̄)² / Stt ]

where
m is the number of data points (m = 40);
t(α/2) has d.f. = m − 3 under Winters' method (d.f. = 40 − 3 = 37) and α/2 = 10%/2 = 5%, so t(α/2) = 1.687 (Reference 5);
s = √(SSE/df), where df = 3 for Winters' method and SSE is the sum of squared errors, so s = √(146377.3/3) = 220.89;
tF is the time point in the future (e.g., 41, 42, 43);
t̄ = avg(1, 2, …, m) = avg(1, 2, …, 40) = 20.5;
Stt = (1 − 20.5)² + (2 − 20.5)² + … + (40 − 20.5)² = 5330.

Prediction interval for tF = 41: 172.84 ± (1.687)(220.89) √[ (1/40) + (41 − 20.5)²/5330 ] = (52.76, 292.92).
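The appendix arithmetic for the first forecast period can be checked with a short sketch such as the following, which uses the quantities exactly as the report defines them.

```python
import math

# Sketch of the appendix prediction-interval calculation for t_F = 41.
m, t_crit = 40, 1.687                  # number of points, t value (Reference 5)
s = math.sqrt(146377.3 / 3)            # = 220.89, as computed in the report
t_bar = sum(range(1, m + 1)) / m       # = 20.5
s_tt = sum((t - t_bar) ** 2 for t in range(1, m + 1))   # = 5330

t_f, y_hat = 41, 172.84                # first forecasted period and its forecast
half_width = t_crit * s * math.sqrt(1 / m + (t_f - t_bar) ** 2 / s_tt)
print(round(y_hat - half_width, 2), round(y_hat + half_width, 2))
# close to the report's (52.76, 292.92); small differences come from rounding
```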