ECONOMICS 306 
STATISTICAL ANALYSIS

 COURSE POLICIES / COURSE OUTLINE & CALENDAR

COURSE FAQ / ASSIGNMENTS: 1, 2, 3, 4, 5, 6, (Answer Example)

SUPPLEMENTAL READING: Foundation/ Linear Probability Models/
Time Series-Forecasting/ Experimental Design/ Stochastic Simulation

PAST TEST QUESTIONS /

STATISTICS WEBSITES: Java Applets / Coin Flipping Page / American Statistical Association / Journal of Statistics Education / Fed Stats / Time Series Data Library / Lexis-Nexis Stats Universe

Dr. Brian Goff/414 Grise Hall
Phone (502)745-3855 / brian.goff@wku.edu
Last Modified: March 18, 2002
Western Kentucky University


COURSE POLICIES (Knowledge of these policies will be tested on the first exam)

Contact Information
Office: Grise 414
Phone: 745-3855
Email: brian.goff@wku.edu
Office Hours for Spring 2002: 1:30-4 M, W; 9:30-11:30 T, TR
(appointments & drop-ins welcome at other times)

Resources
Andrew Siegel, Practical Business Statistics;
Microsoft Excel (with the Data Analysis add-in -- available on WKUnet);
SPSS 10.0 for Windows (available on WKUnet or limited quantities in bookstore)

Objective
To develop skills in data analysis and evaluation with business applications

Grading Policies
Announced Tests (2) = 30% (bonus pts available on each)
Unannounced Quizzes ( 6 to 10) = 30% (lowest dropped; lowest 2 dropped if 8 or more)
Computer Assignments (6) = 40% (lowest dropped)
Total = 100% (Also Adjustments to Grades under Classroom Policies below)

A = 90-100%; B = 80-89%; C = 70-79%; D = 60-69%; F = below 60%
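
As a rough illustration of how the weights and drop rules above combine, here is a minimal Python sketch. It assumes every score is a percentage and that items within a category count equally; it ignores bonus points and the classroom adjustment. The function names and the example student are hypothetical.

```python
def category_average(scores, drop):
    """Average of percent scores after dropping the `drop` lowest ones."""
    kept = sorted(scores)[drop:]
    return sum(kept) / len(kept)


def final_percentage(tests, quizzes, assignments):
    """Weighted course percentage: tests 30%, quizzes 30%, assignments 40%."""
    # Lowest quiz dropped; lowest two dropped if 8 or more quizzes were given.
    quiz_drop = 2 if len(quizzes) >= 8 else 1
    test_avg = category_average(tests, drop=0)
    quiz_avg = category_average(quizzes, drop=quiz_drop)
    assignment_avg = category_average(assignments, drop=1)
    return 0.30 * test_avg + 0.30 * quiz_avg + 0.40 * assignment_avg


def letter_grade(percentage):
    """Map a percentage to the A-F scale listed above."""
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if percentage >= cutoff:
            return letter
    return "F"


# Hypothetical student: 2 tests, 6 quizzes, 6 assignments (all in percent).
example = final_percentage(
    tests=[80, 90],
    quizzes=[100, 60, 80, 80, 80, 80],
    assignments=[90, 90, 90, 90, 90, 0],
)
print(round(example, 1), letter_grade(example))
```

In this example the 60 on a quiz and the 0 on an assignment are both dropped before the averages are taken.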

Planned Test Dates: Test 1 -- March 4; Test 2 (Final Exam) -- 9:05 MWF class @ 8 AM, Monday May 6; 11:15 MWF class @ 10:30 AM, Thursday May 9

Make-up Tests/Quizzes: NO MAKEUP TESTS/QUIZZES GIVEN FOR ANY REASON. If the midterm is missed, the final will be weighted more heavily to compensate. If serious illness or injury, official WKU activities, or some other special circumstances will cause you to miss more than 2 quizzes, please see me.

Announced Tests: These will be 20 to 40 question multiple choice tests. If weather or my absence postpones a test (or an assignment due date), the test will be given at the next scheduled class period.

Unannounced Quizzes: These will be 5 questions in length. They will be drawn from the most recent lecture and/or the current assigned reading. Quizzes will have a 5 minute time limit. No credit is given for quizzes turned in after the time limit expires. We will have 6-10 -- be ready for a quiz on any given day.

Computer Assignments:

Classroom Policies: Adjustments of up to +2% may be made to final grade percentages for those students who have provided consistently positive contributions. Orderly behavior and respect for others who are speaking (including me) are expected. No food or drink permitted. If late, enter with a minimum of disturbance and be seated in the nearest seat. Profanity or rude, disruptive behavior will not be tolerated. Individuals involved in incidents that I regard as significant violations of these policies will receive a warning and then a letter grade reduction per subsequent incident.

Miscellaneous
Last Day to drop with a "W" or change to audit is March 7. Students with disabilities covered by the ADA, please register with the WKU Office for Disabilities and then see me for accommodation.


Course Outline & Calendar (Subject to Announced Changes)

Weeks 1-3 (Jan 14-Feb 1): Review and Extension of Basic Statistical Methods
Using data to understand reality (Ch. 1-2)
Computer software in analyzing and presenting data (Ch. 3; SPSS; Excel)
Foundational measures and methods (See Foundation Supplement)
Some applications of basic concepts: quality control (Ch. 18); EDA--Statistical detective work.
Assignment # 1 - Deadline Feb 1 (Friday - all assignments due 5 minutes after beginning of class)
Other Special Dates: Jan 21 (Monday) -- MLK Holiday;

Weeks 4-5 (Feb 4-15): Looking for Relationships between Variables Part 1
Scatterplots & correlation coefficients (11:396-419)
Regression lines & coefficients (11:419-427)
Extending applications to multiple explanatory variables (12:468-475)
Using regression with qualitative independent variables (12:520-526)
Assignment #2 - Deadline -- Feb 15 (Friday)

Weeks 6-7 (Feb 18-Mar 1): Looking for Relationships between Variables Part 2
Using regression with qualitative dependent variables -- LPM case (Supplement 1)
Evaluating the statistical reliability of regression results (11:428-438)
Regression predictions & errors (12:475-478; 503-507)
Nonlinear relationships in "linear" regression (12:516-518)

Assignment #3 - Deadline Mar 1 (Friday)

Special Date: Test #1 ("Midterm")-- Wednesday Mar 6

****Special Change: 2nd Chance on Midterm -- Wed Mar 13

******NOTE REVISIONS BELOW AS OF MARCH 14
(Week 10: Mar 18-22 Spring Break)

Week 11 (Mar 25 - Mar 29): Condensing & combining data to measure attributes
Factor Analysis & Cluster Analysis (See SPSS Help on these Topics + SPSS Results Coach)
Designing simulations using statistical software (Simulation Reading Supplement 4; SPSS Transform)
******Assignment # 4 ***** Skipping Assignment 4/Will do In-class*******

Weeks 12-13 (Apr 1-Apr 12): Varied Applications of Statistics to Business
Designing and conducting experiments in business settings (Experimental Design Supplement)
Using contingency tables to examine qualitative relationships (Ch. 17)
Using ANOVA to examine qualitative relationships (15:610-618; 629-630)
Assignment # 5 - Deadline April 12 (Friday)

Weeks 14-15 (Apr 15 - Apr 26): Using the Past as a Guide to the Future
Overview of time series/forecasting strategies & components (Time Series Supplement; 14:464-475)
A Primer on estimating trends, cycles, & seasons (Time Series Supplement)
Confusions about trends & randomness (Time Series Supplement; 14:566-567; 586-587; 592-594)
Basic forecasting concepts (Time Series Supplement)
Assignment # 6 - Deadline April 26 (Friday)

Week 16 (Apr 29 - May 3): Review & Prep For Final

Special Date: Test 2 (Final Exam--Apx. 70% new material and 30% from Test 1)


ECON 306 -- FAQ
Q: I missed a quiz because I was sick. When can I make it up?
A: No make-up quizzes are given. For extended illnesses, see me.

Q: I had to work late last night, the computer system was down this morning, ..., will you take my assignment late?
A: I will accept it late, but you will still be assessed the 10% per day penalty. Do the assignments in advance of deadlines to make sure that you turn them in on time.

Q: I will miss the midterm because of forensics, swimming, golf, .... Can I make up the test?
A: No make-ups are given but see me, and we will discuss your options if you will miss more than 1 test.

Q: I can't understand question x on the latest computer assignment. Can you help me?
A: I will explain ambiguities in words and phrases, but you must use your book, notes, and brain to determine answers to the questions. You may also seek assistance from other students (and, of course, your teammate if you have one), but you may not copy the answers and output of others.

Q: I forgot to put my answers on the back of the last page of output, forgot to staple or paper clip it, my handwriting is not very clear .... Will you count off for that?
A: Yes.

Q: I'm doing poorly on quizzes/tests/assignments (plus I need this class for my degree program). Can I do work for extra credit?
A: No. Grades will be determined by the policies stated above.

Q: How can I prepare for the tests?
A: Keep up with the reading. Look back over your notes before class. Look over questions from past ECON 206 and 306 exams, sorted by subject area. Aside from the computer assignments, 20-30 minutes in between each class meeting spent in this way will go a long way and will do a lot more than cramming before an exam.


ASSIGNMENTS

Computer Assignment 1
(You need a floppy disk)
Objective: To gain familiarity with using spreadsheet software (Excel) and dedicated statistical software (SPSS) to perform statistical functions, and to review basic statistical concepts

Accessing SPSS: After you turn on the computer and the "Novell Client" window appears, you must type in SPSS rather than "Student" for your username (no password needed). Then click on the "SPSS" folder in the left window, then click the SPSS 10 icon in the right window. The SPSS menu and data spreadsheet should appear. After exiting SPSS, click on the "Applications" folder to have the Excel icon reappear. Excel can be accessed by just double-clicking the Excel icon on the standard student network desktop.

1. i) In SPSS, retrieve the file H:share\econ\goff\data\bballclass.sav into the SPSS data spreadsheet. This file contains data related to Major League Baseball teams from 1990-1996. Labels for each variable are provided by SPSS; ii) Create two new variables (stadium revenues divided by attendance and gate revenues divided by attendance) using Transform>Compute and give the new variables names of your choosing; iii) Generate histograms for gate revenues per person in attendance and stadium revenues per person in attendance using Graphs>Histogram and print your output; iv) Save the file to your floppy disk as an Excel (.xls) file (make sure the box to include variable names is checked);
2. i) Exit SPSS and get into Excel. Retrieve the Excel version of the file that you saved to floppy; ii) Use Tools>Data Analysis>Descriptive Statistics and fill in the input range to include the two new variables that you created in Step 1 (get both summary statistics and the confidence intervals); iii) Use Format>Columns>Autofit to resize your output columns to fit the output tables and then print these tables; iv) Go back to the data spreadsheet and click the first value under your stadium revenue per person variable, then click Data>Sort>Descending to sort the entire data set by this variable in descending order; v) copy down the top 10 cities and years.

3. On a separate sheet attached to the last sheet of output, provide answers to the following using complete sentences and standard English:
i) Briefly describe the data set and how the variables are measured.
ii) Using the histograms and the descriptive statistics you printed for stadium and gate revenues per person, describe the sizes of "typical" outcomes and describe the sizes of "outliers."
iii) Provide a brief summary of the cities-years with the largest stadium revenues per person;
iv) Based on descriptive statistics from Excel, what would you conclude if you tested the null hypothesis that stadium revenues per person were $3? What is your basis for this decision? If you had used Excel or SPSS to provide a p-value for this test, what is your guess as to the size of the p-value and what information would this p-value be providing?

Staple or clip your output together with the microcomputing cover sheet at the front and your answer sheet at the back with the other pages. Due Date = Feb 1 at beginning of class.

Assignment 1 Answers

3i) The data set contained five variables on Major League Baseball (MLB) teams from 1990-96. These were gate revenues (ticket sales) in millions of dollars, stadium revenues (concessions, ...) in millions of dollars, media revenue in millions of dollars, attendance in millions of people, and a team identifier.

3ii) Gate revenues per person had a mean of $9.8 with a standard deviation of $1.9. (Because revenues are in millions of dollars and attendance is in millions of people, the ratios are in dollars per person.) These revenues appeared to be close to normally distributed with the exception of a few outliers in the $16-$18 range. The mean of stadium revenues per person was $4.4 with a standard deviation of $2.2. The distribution of these revenues did not appear very normal; the values were spread nearly uniformly from about $2 to $7. Outliers appeared on both sides -- down to $0 on the low side and up to $8 to $12 on the high side.

3iii) The following teams exhibited the largest stadium revenues per person: Chicago White Sox (96), Chicago White Sox (95), .... These figures ranged from $8.4 to $12.2.

3iv) A claim that stadium revenues per person averaged $3 could be rejected using the 95% confidence interval around the sample mean. This interval is $4.4 +- $0.32. Since $3 is outside of the range allowing for sampling error, the evidence rejects the claim at this level of confidence. A p-value associated with this test measures the probability of finding sample evidence this far from the claim by random chance if the claim were true. Since we rejected using the 95% interval, the p-value for this test would be at or below 0.05.
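
The confidence-interval reasoning in answer 3iv) can be sketched in a few lines of Python. This is only an illustration: it assumes the +- $0.32 margin equals 1.96 standard errors (a normal approximation, whereas Excel's output is based on the t-distribution), and the variable names are made up for the example.

```python
import math

sample_mean = 4.4      # stadium revenue per person, from the answer above
margin = 0.32          # half-width of the 95% confidence interval
claimed_mean = 3.0     # null hypothesis value

lower, upper = sample_mean - margin, sample_mean + margin
reject_null = not (lower <= claimed_mean <= upper)

# Assumption: the margin equals 1.96 standard errors (normal approximation).
standard_error = margin / 1.96
z_stat = (sample_mean - claimed_mean) / standard_error
# Two-sided p-value under the standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
p_value = math.erfc(abs(z_stat) / math.sqrt(2))

print(f"95% CI: ({lower:.2f}, {upper:.2f}); reject null: {reject_null}")
print(f"z = {z_stat:.2f}, two-sided p-value = {p_value:.2e}")
```

Under these assumptions the claimed value lies far outside the interval, so the p-value is far below 0.05.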


Computer Assignment 2
Objective: To produce regression output with Excel and to evaluate/interpret the output.

1. i) Pick a specific date at least 1 month from now and an approximate time of day (for example, AM on April 1) for roundtrip flights from Nashville (BNA) to at least 20 cities of your choosing (just make sure at least 3 are from the Southwest Airlines list below and at least 3 are not). Using an online travel search service such as Yahoo! Travel, Expedia.com, Travelocity.com, or Orbitz.com, determine the cheapest fare to each destination other than on Southwest and also the distance to each city. For distances, you can use an atlas or look them up at www.symsys.com/~ingram/mileage.html. (Note: approximations are ok; you may need to use an atlas if the information is not listed on the website.) Be sure to keep track of the city that goes with each price. Also, write down the date and time of travel.
ii) Enter the price and distance data as variables into an Excel file along with a text variable that abbreviates the city name. Also, create a 1/0 variable that identifies whether Southwest Airlines flies directly from Nashville to that city or not. Make it equal 1 if Southwest flies the route directly. (These cities are Los Angeles, San Diego, San Jose, Oakland, Phoenix, Houston, Austin, San Antonio, Kansas City, New Orleans, Chicago, Birmingham, Orlando, Jacksonville, Detroit, Hartford, Manchester. If you know some other city is served directly, enter it as a 1 also.)
iii) Use Excel's Data Analysis Tool to estimate a regression where the dependent variable is the fare and the explanatory variables are distance to the city and the variable for whether Southwest flies direct or not. Direct Excel to generate residuals and standardized residuals (but not residual graphs). Before printing the output sheet, use Format, Column, AutoFit, so that the words in the output column are not cropped. Also, give the output table a new title (Table 1: Air Fare Regression).
2. Print both the regression output and your data sheet (put these on separate pages).
3. Answer the following questions on a separate sheet, neatly handwritten or typed, in complete sentences. I'm looking for answers that are direct and precise. (66%)
i) Briefly describe the variables in your data set, their measurement, and how they were collected.
ii) Draw a graph (neatly done by hand is fine) of the regression line implied by the coefficient on distance. Also, plot the fare-distance data combination for at least 10 of your cities on this graph and identify them by an abbreviation for each city.
iii) Explain the specific meaning of the coefficient on the distance variable in your regression.
iv) Explain the meaning of the Southwest Air variable in your regression and draw a simple graph showing how it would change the fare-distance graph that you drew earlier.
v) Explain the meaning of the R-square statistic (not just in general, but the specific value in your regression).

Deadline = February 15 (beginning of class)
Remember to attach the microcomputing cover page; staple or clip pages together, and follow the other assignment guidelines.

Assignment 2 Answers
i) Your description of the variables and sources here.
ii) The graph should have Fare ($) on the y-axis and Distance (miles) on the x-axis. It should begin at the y-intercept on the y-axis and increase at the rate shown by your slope coefficient on Distance. It should also include 10 points showing combinations of Fare-Distance for 10 of your cities.
iii) Suppose your regression equation is Fare = 50 + 0.12*Distance - 75*Southwest.
The coefficient on Distance is 0.12. This estimates the slope between distance and fare and indicates that for each 1 mile increase in the Distance of a flight, the fare increases by about $0.12.
iv) Given the equation above, the coefficient on the Southwest variable of -75 means that if Southwest flies non-stop on a route, the Fare averages $75 less, holding distance constant. A graph showing this would just shift the regression line in part ii) down so that the y-intercept would be $75 lower. The slope would be the same.
v) Suppose the R-squared is 0.65. This means that distance to a destination and whether Southwest flies non-stop or not together account for 65 percent of the variation in fares across the cities in the sample.
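
To make the interpretation of the coefficients concrete, the illustrative equation from part iii) can be evaluated directly. The Python sketch below uses that made-up equation (not estimates from real data); the function names and the 1,000-mile route are hypothetical.

```python
def predicted_fare(distance_miles, southwest_direct):
    """Fare implied by the illustrative equation Fare = 50 + 0.12*Distance - 75*Southwest."""
    return 50 + 0.12 * distance_miles - 75 * southwest_direct


def r_squared(actual, predicted):
    """Share of the variation in `actual` accounted for by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    ss_resid = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - ss_resid / ss_total


# Hypothetical 1,000-mile route, without and with direct Southwest service.
print(predicted_fare(1000, 0))  # no direct Southwest flight
print(predicted_fare(1000, 1))  # direct Southwest flight: $75 lower
```

The two predictions differ by exactly the Southwest coefficient at every distance, which is why the graph in part iv) is a parallel shift of the line.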


Computer Assignment 3
Objective: Using regression with limited dependent variables

1. On this assignment, choose from ONE of two data sets.
H:share\econ\goff\data\tvownership.xls is an Excel file containing a sample of 40 responses to a survey as to whether the household owned a 48-inch TV (or larger) along with information on income (in thousands of $), age (oldest in household in yrs) and whether children live in the household (1 if yes; 0 if no).
H:share\econ\goff\data\madenlf.xls is an Excel file containing a sample of 40 college linebackers and whether they made an NFL team (Madenfl = 1 if yes; 0 if no) along with information on their weight (in lbs.), 40-yard dash time (in seconds), and whether they played Division I-A football or not (1 if yes; 0 if no).
2. i) Use Excel to estimate a regression model with either 48-inch TV ownership or MadeNFL as the dependent variable, depending on the data set that you choose. Use the explanatory variables available in the data set.
ii) Save the predicted and residual values.
iii) Edit the regression output so that it has a title that describes what the regression contains.
3. Provide responses for each of the following:
i) Briefly describe the data set you used and how each variable is measured.
ii) Explain the meaning of the coefficients on the explanatory variables.
iii) Assess the reliability of the estimated coefficients in terms of measures of sampling error.
iv) Briefly summarize the predicted and residual values. Also, using an actual case from the data, demonstrate how the predicted values and residual values were created.
v) What is the interpretation of the predicted values in this regression? What, if any, problems with a linear regression model can you identify from the predicted values?

Assignment 3 Answers (These are answers for TV data; answers for NFL data are similar)
3.i) The data set includes measurements from 40 households concerning TV ownership, ...
ii) The coefficient on household income is 0.005. This means that for every thousand dollars income increases, the probability of owning a large television increases by 0.005 or 0.5 percentage points. If a family has children, the probability of owning a large TV increases by 0.37 or 37 percentage points. The coefficient for age is 0.0018. This means that each year of age for the oldest household member increases the probability of ownership by 0.0018 or 0.18 percentage points.
iii) The standard errors and p-values of the coefficients allow their reliability or accuracy to be assessed. Income's coefficient (0.005) is more than 4 times larger than its standard error (0.001) and its p-value is very low (0.0002). Children's coefficient is more than 3 times larger than its standard error (0.001) and its p-value is also very low (0.001). Therefore, these two coefficients are relatively reliable or accurate estimates of the true coefficients, at least so far as sampling error is concerned. The coefficient on age (0.0018) is smaller than its standard error (0.009) and its p-value is far above 0.05 at 0.85. Therefore, when taking sampling error into account, the coefficient on age is not very reliable. A true value of zero cannot be ruled out.
iv) The predicted values for the probability of TV ownership range from -0.17 to 1.11 with considerable variation within this range. The residual values range from -0.19 to 0.61. Using the values for household #1 in the data set of INCOME = 18, CHILDREN = 1, AGE = 41, the predicted value would be
Own TV = -0.35 + 0.0058(18) + 0.37(1) + 0.0018(41) = 0.19.
The actual value for this household was 0, so that the residual = actual - predicted = 0 - 0.19 = -0.19.
v) In a regression such as this one with a dependent variable measured as either a 0 or 1, the predicted values for the dependent variable measure the probability of the variable being equal to 1.0 . Because these values represent probabilities, their values should be between 0 and 1. However, as noted above, some of the predicted values from the linear regression are above 1.0 and below 0.0. Under these circumstances, the R-square statistic is not an accurate measure of how well the equation fits the data. Also, under these circumstances, logistic regression is a more accurate alternative to use because it forces the predicted values to lie between 0 and 1.
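
The predicted value and residual for household #1 can be reproduced with a short Python sketch. It uses the rounded coefficients printed in part iv), which give about 0.198, close to the 0.19 reported. The `logistic` function is included only to show why a logistic form keeps predictions between 0 and 1; it is not a fitted logistic regression, and all function names are hypothetical.

```python
import math


def lpm_prediction(income, children, age):
    """Linear probability model prediction using the rounded coefficients above."""
    return -0.35 + 0.0058 * income + 0.37 * children + 0.0018 * age


def logistic(index):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1 / (1 + math.exp(-index))


predicted = lpm_prediction(income=18, children=1, age=41)
actual = 0
residual = actual - predicted
print(f"predicted = {predicted:.3f}, residual = {residual:.3f}")

# A linear index can leave [0, 1]; the logistic transform of any index cannot.
for index in (-3.0, 0.0, 3.0):
    print(index, round(logistic(index), 3))
```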


Computer Assignment 4 -- SKIP

 


Computer Assignment 5
Objective: To apply ideas of experimental design and methods of analyzing qualitative data.

1. Choose one of the data options below. Use principles of experimental design (see Reading Supplement) to better isolate the relationship between your factor of main interest and your dependent variable. This will involve identifying other factors influencing the dependent variable and devising a strategy that holds one or more of these other factors constant. Note that your original dependent variable should be quantitative (price, price to sales, or score diff.). In addition, construct a qualitative version of the dependent variable to go along with the quantitative version. (For example: the quantitative price of a book could be converted into low =0, medium =1, and high =2 where you choose where each category breaks). Your factor of main interest should be a qualitative variable with at least 2 categories but no more than 4.

Option 1: Collect at least 30 book prices (dependent variable) to determine whether they are related to a qualitative factor (such as fiction/nonfiction)

Option 2: Collect sales to price ratios (dependent variable) for at least 30 publicly traded companies to determine whether they are related to a qualitative factor (such as sector in which the company operates).

Option 3: Collect NCAA basketball tournament seedings (dependent variable) for all teams for at least one year to determine whether they are related to a qualitative factor (such as major, mid major, or below mid major conference affiliation).

2. Analyze your data in the following ways using SPSS:
i) Generate an Analysis of Variance (ANOVA) using Analyze/Compare Means/One-Way ANOVA. Place the quantitative version of the dependent variable in the "Dependent List" box and the factor of main interest in the "Factor" box.
ii) Generate a contingency table (aka Crosstabulation in SPSS) using Analyze/Descriptive Statistics/Crosstabs and your qualitative version of the dependent variable ("Column Variable") along with your factor of main interest ("Row Variable"). Use the "Statistics" button and choose "Chi-Square."
iii) Edit the ANOVA title and the Crosstabulation title to add brief descriptions of your specific analysis.

3. Answer the following questions on a separate sheet and attach to the back of your output.
i) Briefly describe your data source, the variables, and how they were measured.
ii) Explain your experimental design strategy including your dependent variable, the factor of main interest, other factors likely to influence outcomes of the dependent variable, and the active steps you took to better isolate the effects of the factor of main interest. (This is a key part of the assignment, so be detailed and clear.)
iii) If you had additional time/resources, how might you improve on your design strategy?
iv) Describe the ANOVA results and what they mean for the relationship between your dependent variable and the factor of main interest.
v) Describe the Crosstab results and what they mean for the relationship between the qualitative version of the dependent variable and the factor of main interest.
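
For readers who want to see what the two SPSS procedures compute, here is a minimal Python sketch of the one-way ANOVA F statistic and the Pearson chi-square statistic. The data are invented for illustration, the function names are hypothetical, and the sketch stops at the test statistics (SPSS also reports the p-values).

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA on a list of groups of numbers."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical quantitative outcomes for three categories of the main factor.
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
# Hypothetical 2x2 crosstab: rows = factor categories, columns = low/high outcome.
chi2 = chi_square_statistic([[10, 20], [20, 10]])
print(round(f_stat, 2), round(chi2, 2))
```

A large F or chi-square value relative to its distribution indicates that the outcome differs across the categories of the factor by more than sampling error alone would suggest.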
 


Computer Assignment 6
Objective: To practice identifying time series components and making simple forecasts

PART I (50%)
1. Retrieve the data file into SPSS from: H:share\econ\goff\data\timeseries306.xls (Note: This is an Excel (.xls) format file so be sure to select (*.xls) for "File Type" and to select "Read Variable Names." ) The data are recorded at quarterly intervals from 1980 first quarter to 2001 fourth quarter. Besides variables listing year, quarter, and date, the data set includes

ACCREV = % of account receivables in delinquency for that quarter;

STOCKABC = price of stock for company ABC (in $) for that quarter;

2. Choose either ACCREV or STOCKABC and use methods (graphs, equations, ..) we discussed in class and in the Reading Supplement to identify patterns over time. Your analysis should provide you with a means for forecasting the variable you chose. Save any graphs or equations that you created to include in your report (max 2 pages of such material).

(Note: I have intentionally left the exact methods of your analysis up to you on this assignment).

3. On a separate sheet of paper attached to your output, provide answers to the following:
i) Briefly describe the variable analyzed and its measurement.
ii) Explain the specific time series patterns you found and the basis for your conclusion.
iii) Based on your findings, generate a forecast for the four quarters of 2002, explaining the details of how you arrived at these forecasts.
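
One possible starting point, among the many methods the assignment leaves open, is a simple linear trend extended four quarters ahead. The Python sketch below is only an illustration: the series is invented (it is not ACCREV or STOCKABC), the function names are hypothetical, and it ignores any seasonal or cyclical pattern.

```python
def fit_linear_trend(values):
    """Least-squares intercept and slope for values observed at t = 1, 2, ..., n."""
    n = len(values)
    times = range(1, n + 1)
    mean_t = sum(times) / n
    mean_y = sum(values) / n
    slope = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, values)) / sum(
        (t - mean_t) ** 2 for t in times
    )
    intercept = mean_y - slope * mean_t
    return intercept, slope


def forecast(values, periods_ahead):
    """Extend the fitted trend line `periods_ahead` periods past the sample."""
    intercept, slope = fit_linear_trend(values)
    n = len(values)
    return [intercept + slope * (n + h) for h in range(1, periods_ahead + 1)]


# Hypothetical quarterly series (not the course data): exactly linear, y = 1 + 2t.
history = [3, 5, 7, 9, 11, 13, 15, 17]
print(forecast(history, 4))
```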
 

Part II (50% -- Write answers on a separate sheet from Part I)
 You have the job of investigating one of the major vendors of data-statistical software (SAS -- www.sas.com) to determine its capabilities for your company.
i) Under Software, choose "Success Stories" and then under "More Success Stories" choose either "Industry" or "Solution" and summarize 2 of the stories pertinent to your major.
ii) Go back to the main page (www.sas.com) and select "By Product" under Software.  Under the Product Index, choose "Statview."  Using the information provided by the "Introduction," "Technical Overview," and "How to Order" sections, summarize the features of this product along with information about pricing.
 



READING SUPPLEMENT -- FOUNDATION MATERIAL
(This "Foundation Supplement" is intended to give you a quick outline of the key statistical measures and methods that you should have already studied in a prior statistics class. For further explanation and formulas, see the pages in Siegel listed in parentheses.)

1. Determining What is Typical and Atypical in a Data Set
a. Measures of "Average Outcomes"
-- Mean (78-81): the everyday "average"; simple and useful with relatively symmetric data
-- Weighted Mean (83-85): an average where individual values are multiplied by a weight before they are added together;
-- Median (87-90): important for determining the middle of asymmetric data
-- Mode (93-94): useful in categorical data

b. Measures of Variability of Outcomes: information about averages of a data set alone does not provide very much information about what is typical and atypical. Variability measures are critical for more detailed and useful understanding of data
-- Standard deviation, s, (121-125): a measure of the average distance from the mean for the values in the data set
-- Standardized Units (z-units, z-values, z-scores): the name given to a variable (X) which has been converted so that the mean is zero (0) and the standard deviation is one (1). This conversion is done by the formula $Z_i = (X_i - \bar{X})/s$, where $\bar{X}$ is the mean, s is the standard deviation, and i refers to each individual item in the data set. This conversion eliminates whatever units were used to measure X, and it allows each data point to be easily evaluated in terms of how much it differs from the mean.
-- Empirical Rule (128-129): a summary of variation in outcomes for data that have roughly a bell-shaped (normal) shape and a means to identify outliers. The Empirical Rule states: i) about 68% of the data will be within +/- 1 standard deviation of the mean, ii) about 95% of the data will be within +/- 2 standard deviations of the mean, and iii) about 99.7% of the data will be within +/- 3 standard deviations of the mean.
-- Chebyshev's Rule (128 n. 9): a summary of variation in outcomes for data regardless of their shape and a means to identify outliers. The rule states: i) at least 75% of the data will be within +/- 2 standard deviations of the mean, ii) at least 89% of the data will be within +/- 3 standard deviations of the mean, and iii) at least 94% of the data will be within +/- 4 standard deviations of the mean.
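
The z-score conversion and the Chebyshev percentages can be checked with a short Python sketch. The data values are made up, the function names are hypothetical, and the Chebyshev figures come from the standard bound of 1 - 1/k^2 for k standard deviations.

```python
import statistics


def z_scores(values):
    """Convert values to z-scores using the sample mean and standard deviation."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    return [(x - mean) / s for x in values]


def chebyshev_bound(k):
    """Minimum share of any data set within k standard deviations of the mean."""
    return 1 - 1 / k**2


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
print([round(z, 2) for z in z_scores(data)])
for k in (2, 3, 4):
    print(k, round(chebyshev_bound(k), 4))
```

The z-scores always average to zero, and a value with a z-score beyond about +/- 2 or 3 is a candidate outlier under either rule.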

c. Skew (skewness coefficient): a measure of the symmetry of outcomes

2. Bringing in Probability & Probability Distributions
Basic Definitions:
a. Probability (162-163, 169-170): a branch of mathematics dealing with computing the likelihood of events; widely used in statistics; the probability of an event must fall between 0 and 1;
b. Odds (of an event) = probability of an event divided by one minus the probability;
c. Law of Large Numbers (171-172): relative frequency of an event will become closer to its probability as the number of trials is increased;
d. Probability Distribution: a formula (or graph/table based on the formula) that relates the possible outcomes of a random variable to the likelihood (probability) of those outcomes (this just expands the definition given in the book)
i. KEY PROBABILITY DISTRIBUTIONS
-- Binomial Distribution (214-223): for counts of an event across independent trials that each have two possible outcomes;
-- Normal Distribution (223-227): for events with bell-shaped outcomes;
-- Standard Normal Distribution (224-228): a special case of the normal where the mean of a normal variable is converted to 0 and its standard deviation is converted to 1; a standardized normal variable;
-- t-Distribution (306-308): another bell-shaped probability distribution, but one whose shape changes slightly as the number of sample items (degrees of freedom) changes; with small numbers in the sample, it is a little wider than the normal curve; with large numbers in the sample, it is nearly the same as the normal curve
-- Chi-Squared Distribution (672-673): another probability distribution that changes shape as the size of the sample (degrees of freedom) changes; with a small sample, it is highly skewed to the right; with a large sample, it becomes more similar to the normal curve
-- F-Distribution (620-623): a probability distribution very similar in its properties and behavior to the Chi-Squared Distribution; it is skewed in small samples and
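
The definitions of odds, the binomial distribution, and the Law of Large Numbers can be illustrated with a small Python sketch. The function names, the seed, and the trial counts are arbitrary choices for the example.

```python
import math
import random


def odds(probability):
    """Odds of an event: probability divided by one minus the probability."""
    return probability / (1 - probability)


def binomial_pmf(n, k, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def relative_frequency(p, trials, seed=306):
    """Share of simulated trials in which an event with probability p occurs."""
    rng = random.Random(seed)
    hits = sum(rng.random() < p for _ in range(trials))
    return hits / trials


print(odds(0.75))
print(binomial_pmf(n=10, k=5, p=0.5))
for trials in (10, 1000, 100000):
    print(trials, relative_frequency(0.5, trials))
```

As the number of simulated coin flips grows, the relative frequency of heads tends to settle near the probability of 0.5, which is the Law of Large Numbers at work.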

3. Accounting for Sampling Errors
a. Population parameters & statistical estimators (260-261);
b. Standard Error - an estimate of the "average" sampling error in a sample-based statistical estimator of a population parameter if ideal random samples of the given size were repeatedly used; in other words, for instance, the standard error of the sample mean (272-275) estimates how far the sample mean is likely to differ from the population mean in repeated samples of a given size.
c. Confidence intervals (300-314): statistical estimates that use standard errors along with probability distributions to compute a range (interval) around a statistical estimate that incorporates most (usually 95% or 99%) of the likely sampling error.
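
As a rough sketch of how a standard error feeds into a confidence interval, the Python below computes both for the sample mean. It assumes a critical value of 1.96 (the normal curve at 95%); with a small sample the t-distribution would give a somewhat wider interval. The data and function name are hypothetical.

```python
import math
import statistics


def mean_confidence_interval(values, critical_value=1.96):
    """Sample mean, standard error, and an interval of mean +/- critical_value * SE."""
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    margin = critical_value * standard_error
    return mean, standard_error, (mean - margin, mean + margin)


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
mean, se, (low, high) = mean_confidence_interval(sample)
print(f"mean = {mean:.2f}, SE = {se:.3f}, 95% CI = ({low:.2f}, {high:.2f})")
```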

4. Testing ideas with data (Hypothesis Testing 339-357)
Statistical Hypothesis Testing (in general): Using sample data to evaluate claims which have been made while also taking into account the problem of sampling error. Because of sampling error, a difference between the statistical evidence and a claim may or may not provide convincing evidence against the claim. In scientific research and also in assessing the reliability of statistical estimators of population parameters, several common concepts and methods are used to assess hypotheses; these are useful where hypotheses are very "sharp" (clear-cut or precise); they are not as useful when conducting preliminary or exploratory data analysis

Null Hypothesis: The hypothesis or claim which is accepted unless convincing evidence against it is presented; used extensively in research and statistical tests

Alternative Hypothesis: The opposite claim to the null hypothesis; the hypothesis which is accepted only when substantial evidence against the null hypothesis is presented; sometimes called the research hypothesis; it is typically the claim which an investigator is trying to prove

Type I Error: Deciding to reject the null hypothesis when it is, in fact, true

α ("Alpha"): The probability of committing Type I Error. When conducting statistical tests of hypothesis, this probability is typically set to be a small number such as 10%, 5%, or 1%. This is also referred to as the "Significance Level for a test."

Level of Confidence (in a test): 1 - α, that is, if the probability of Type I error is 5%, then the Level of Confidence in the test is 95%.

Type II Error: Deciding to accept the null hypothesis when it is, in fact, false.

β ("Beta"): The probability of committing Type II Error. For a given statistical test, a given value of α, and a given sample size, the value of β is predetermined. For a given sample size, if α is reduced, then β will increase. α and β can both be decreased for a specific test by using a larger sample size.

Power of a Test: 1 - β, that is, if the probability of Type II error is 10% then the power of the test is 90%. This number indicates how "powerful" the statistical test is in rejecting a null hypothesis when it is, in fact, false

Confidence Interval Approach to Hypothesis Tests: A confidence interval (usually 95% or 99%) is computed around a sample estimate such as the sample mean and compared with the claim made by the null hypothesis about the relevant population parameter (such as the population mean). If the null hypothesis claim is outside of the confidence interval, then the null hypothesis is rejected and the alternative is accepted.

p-value: The probability that sample evidence at least as strong as that actually observed would be found by random chance if the null hypothesis were true. A low p-value (such as smaller than 5%) indicates that the sample evidence would be unlikely to arise by chance if the null hypothesis were true. The lower the p-value, the more convincing the evidence is against the null hypothesis.
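
The definitions above can be tied together in a short Python sketch. It assumes a two-sided test on a population mean using the normal curve (a z-test); the function names and the numbers (a claimed mean of 250, a standard deviation of 30, a sample of 36) are hypothetical choices for illustration, not part of the course materials.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # normal curve with mean 0 and standard deviation 1

def z_test(sample_mean, null_mean, std_dev, n, alpha=0.05):
    """Two-sided z-test of the null hypothesis: population mean = null_mean."""
    std_error = std_dev / n ** 0.5
    z = (sample_mean - null_mean) / std_error
    p_value = 2 * (1 - STD_NORMAL.cdf(abs(z)))
    # Confidence interval approach: reject if the claim lies outside the interval
    margin = STD_NORMAL.inv_cdf(1 - alpha / 2) * std_error
    interval = (sample_mean - margin, sample_mean + margin)
    reject = not (interval[0] <= null_mean <= interval[1])
    return z, p_value, interval, reject

def power(null_mean, true_mean, std_dev, n, alpha=0.05):
    """Power (1 - beta) of the same test when the true mean is true_mean."""
    std_error = std_dev / n ** 0.5
    critical = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = (true_mean - null_mean) / std_error
    # beta = probability the test statistic lands in the non-rejection region
    beta = STD_NORMAL.cdf(critical - shift) - STD_NORMAL.cdf(-critical - shift)
    return 1 - beta

z, p, interval, reject = z_test(sample_mean=262, null_mean=250, std_dev=30, n=36)
print(f"z={z:.2f}, p-value={p:.4f}, reject null at 5%: {reject}")
for n in (36, 100):
    print(f"n={n}: power against a true mean of 255 = {power(250, 255, 30, n):.3f}")
```

Rejecting when the p-value is below α and rejecting when the claim falls outside the (1 - α) confidence interval lead to the same decision in this setup, and the loop shows power rising with the sample size.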


SUPPLEMENTAL READING -- Linear Probability Models

LINEAR PROBABILITY MODELS IN REGRESSION-- Regression with 1/0 Dependent Variable

Background
In many useful applications of regression analysis, the dependent variable is qualitative (describing an attribute or quality) rather than quantitative. In some cases, the qualitative dependent variable has only two values such as home owner/not a home owner, purchased/did not purchase, defaulted/did not default, loan accepted/loan rejected, and similar variables where the data are usually recorded as 1s and 0s (often called binary or dummy variables). Linear regression can be used to show the relationship between one or more explanatory variables and a binary dependent variable, although linear regression is not always the appropriate type of regression analysis in such situations. The same procedures in software such as Excel or SPSS can be used to compute the regression output, but the interpretation of the output differs when a 1/0 dependent variable is used. Regressions of this kind are often called Linear Probability Models. The basis for this name should become apparent below.

Example
The data for this example are drawn from 93 metropolitan areas in the U.S. The two variables used in the regression analysis are

NFL&MLB = 1 if the city has both an NFL and Major League Baseball team and 0 if not;

Population = metropolitan population of the city in millions.

NFL&MLB is used as the dependent variable and Population is the independent variable. The idea of the regression analysis is to determine how differences in population influence the likelihood of a city having both an NFL team and an MLB team. Twenty-three cities had teams in both sports and the remaining sixty-five did not. The populations ranged from 500,000 on the low end to 8,000,000 on the high end.

The results of linear regression are as follows:

Dependent Variable = NFL&MLB
Variable      Coefficient   Standard Error   t-statistic   p-value
Constant      0.01          0.05             0.07          0.99
Population    0.18          0.03             7.60          0.001

In equation form the regression output is

NFL&MLB = 0.01 + 0.18*Population.

If a particular city contains a population of 3 million people, then the predicted (estimated) value for NFL&MLB can be computed the same as in any regression equation.

Predicted NFL&MLB = 0.01 + 0.18*(3) = 0.55 .

However, the interpretation of the predicted value is different if the dependent variable is a 1/0 variable as it is here. The predicted values in these situations estimate the probability that the dependent variable equals 1. In the example here, that means that the probability of a city with 3 million people having both NFL and MLB teams is 0.55 (55 percent).
Because the predicted values for the dependent variable represent probabilities, the slope coefficient for population equals the change in probability for a one unit (1 million) increase in population. The coefficient in the example here is 0.18. That means that for every 1 million person increase in population, the probability of a city containing both NFL and MLB franchises increases by 0.18 (18 percentage points).

The constant (y-intercept) in this case means that if the population were zero, the estimated probability would be 0.01 (1 percent). Obviously, this represents a purely hypothetical value since no city in the data set has a population anywhere near zero.

Special Problems:
One needs to be aware of a couple of special issues that arise when using linear regression with 1/0 dependent variables.

1. Probability Restrictions: True probabilities cannot be lower than zero -- an impossible event -- or higher than one -- a certain event. However, the predicted probabilities using linear regression may sometimes be below zero or above one. In the example above, for instance, a city with a population of 8 million will have a predicted probability of containing both football and baseball teams of 1.45 (0.01 + 0.18*8). The reason for this is that linear regression continues to make predictions along a straight line no matter how high or low these predictions are.

When using linear regression models, one should always check the predicted values to see how many, if any, of those values are outside of the 0 to 1 range and how far outside they are. If many are outside that range or a few are considerably outside that range, another type of regression analysis, such as "logistic regression," is the appropriate method rather than linear regression. Logistic regression fits an "S-shaped" curve to the data rather than a straight line. The top and bottom of the "S" can approach but never go beyond 0 and 1.

2. R-Square Values: In regression with quantitative dependent variables, such as weekly sales, the actual data on sales as well as the predicted values for sales from the regression are both in dollars. With a 1/0 dependent variable, the actual data are all 1s and 0s while the predicted values are quantitative estimates of probability such as 0.55 in the example above. This fact makes the R-square somewhat inaccurate as a measure of how well the model fits the data by biasing it toward lower values.

Typically, to assess how well the model fits the data, other measures are calculated in addition to R-square. The simplest method is to use 0.5 as a cutoff value -- any predicted value above 0.5 is treated as a prediction of a 1, and any predicted value below 0.5 is treated as a prediction of a 0. Then these predicted 1s and 0s are compared with the actual data, and the "percent of cases (observations) correctly predicted" can be computed.
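
Both special problems can be checked with a few lines of Python. The sketch below uses the fitted equation from the example (constant 0.01, slope 0.18); the six cities and their 1/0 values are hypothetical and are not the course data set.

```python
def predict_probability(population_millions, constant=0.01, slope=0.18):
    """Predicted probability from the fitted line NFL&MLB = 0.01 + 0.18*Population."""
    return constant + slope * population_millions

def percent_correctly_predicted(populations, actual, cutoff=0.5):
    """Percent of cases where the cutoff-based prediction matches the actual 1/0 value."""
    hits = 0
    for pop, y in zip(populations, actual):
        predicted_class = 1 if predict_probability(pop) > cutoff else 0
        hits += predicted_class == y
    return 100 * hits / len(actual)

# Hypothetical cities: population in millions and actual NFL&MLB value
populations = [0.5, 1.2, 2.0, 3.0, 4.5, 8.0]
actual = [0, 0, 1, 1, 1, 1]

for pop in populations:
    p = predict_probability(pop)
    flag = "" if 0 <= p <= 1 else "  <-- outside the 0 to 1 range"
    print(f"{pop:4.1f} million: {p:.2f}{flag}")
print(f"{percent_correctly_predicted(populations, actual):.0f}% correctly predicted")
```

In this made-up sample only the 8 million city falls outside the 0 to 1 range, and five of the six cases are predicted correctly with the 0.5 cutoff.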



READING SUPPLEMENT -- TIME SERIES & FORECASTING

Determining Patterns (components) in Time Series Data
The analysis of data collected at regular time intervals represents an important area of business statistics because of two inter-related reasons: 1) it permits patterns from past changes in a variable to be examined; and 2) these past patterns can be used as a tool to forecast future values. Unlike regression analysis where data must be gathered on both dependent and independent variables, time series methods can be used even if data have been collected only on the variable of interest. This provides a major advantage because it reduces the amount of data required. Chapter 14 in Siegel introduces some of the concepts of time series analysis. The discussion in this supplement covers a few of the things that Chapter 14 omits or leaves a bit fuzzy. Some of the key terms are:

Y(t) = values for variable Y (such as sales) measured at regular time intervals (t) such as monthly. Y(t) is also referred to as "the level of Y" or a "time series on Y."

Y(t-1) = value for the variable Y in the prior period. For instance, with monthly sales data, Y(t-1) refers to the prior month's sales. Y(t-1) is also called the "lagged" value of Y or the "lag of Y."

Y(t) - Y(t-1) = difference in Y; it is also known as the "first difference" of Y as well as the "change in Y" from period to period. It measures how much the variable changed from its value in the prior period. For monthly sales, it shows how much sales changed from the prior month.

Structural Models = regression models using time-based data where several other variables (X-variables) are used as explanatory variables in predicting values of Y;

Time Series Models = equations which use past values of Y, past differences in Y, and information drawn from these past values to predict current values;

Time Series Components = patterns in the past movements of Y such as a trend component, a cyclical component, a seasonal component, or a random component.
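
The lagged value and the first difference defined above are easy to compute directly. In this short Python sketch the function names and the sales figures are hypothetical.

```python
def lag(series):
    """Y(t-1): shift the series one period; the first period has no prior value."""
    return [None] + series[:-1]

def first_difference(series):
    """Y(t) - Y(t-1) for each period after the first."""
    return [None] + [cur - prev for prev, cur in zip(series, series[1:])]

sales = [200, 208, 215, 211, 220]  # hypothetical sales, in thousands of $
print("t  Y(t)  Y(t-1)  Y(t)-Y(t-1)")
for t, (y, y_lag, dy) in enumerate(zip(sales, lag(sales), first_difference(sales)), start=1):
    print(t, y, y_lag, dy)
```

The first period shows None for both columns because there is no prior period to compare it with.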

We will use weekly Sales for Company X from 1990 through 2001 to illustrate our points. The simplest (but sometimes misleading) way to determine patterns in the past history of sales is by looking at a graph of weekly sales plotted over time (Sales on the Y-axis and weeks & years on the X-axis) to see if there are any obvious trends, repetitive cycles, big jumps or dips during certain periods, and the like.


 

At first glance, the graph above appears to show that there may be an upward trend (a trend component) over the time frame of about $10 thousand per year. Also, there are some repetitive "ups and downs" (a cyclical component). If you look closely, you can also make out big dips of $20 to $30 thousand around the end of each year and an increase of a few thousand during the summer weeks (seasonal components).

While looking at graphs like the one above is a good first step, it leaves the size of the patterns to a lot of guesswork. To more precisely quantify the patterns in weekly sales, we can estimate an equation from the data that looks similar to a regression equation but uses only information on the variable under study and on time. The data in our file would appear as below where "t" represents a particular week:

Week (t)    Sales (in thousands of $)
1           200
2           208
3           215

In the equation estimated below we will use the following definitions:

Sales(t) = weekly sales in thousands of dollars;
Trend = time variable counting the weeks; it starts at 1 and increases by 1 unit each week of the sample; this looks for a linear "trend" component (pattern) in the data;
Sales(t-1) = Sales in the prior week; this is the simplest means of looking for past "cycles" in the data;
XMAS = a seasonal indicator variable equal to 1 if a week included December 25 and 0 otherwise;
SUMMER = a seasonal indicator variable equal to 1 for weeks from Memorial Day to Labor Day and 0 otherwise.

The results of estimating an equation to predict weekly Sales with these components appear below:
 
 
Variable      Coefficient   Std. Error   t-value   p-value
Constant      100           6.2          15.0      0.001
Trend         0.075         0.006        13.0      0.001
Sales(t-1)    0.50          0.03         15.00     0.001
XMAS          -32.0         3.5          9.00      0.001
SUMMER        6.0           1.3          4.00      0.001

R-square = 0.82
Durbin-Watson = 2.00
Box-Pierce Q(12 lags) = 10.2 (p-value = 0.65)
Mean Weekly Sales = $250 (000); Standard deviation Sales= $30 (000)

In equation form, this would be written

Sales(t) = 100 + 0.075*Trend + 0.50*Sales(t-1) - 32*XMAS + 6*SUMMER + error(t)

The coefficients in this equation are interpreted the same way as regression coefficients. For instance, the Trend coefficient of 0.075 means that for each week, sales increase by about $0.075 thousand ($75). Around Christmas, Sales drop by about $32 thousand. The "Durbin-Watson" statistic (2.00 in this case) is a measure of whether the errors in our model are dependent on each other (a bad thing). Values for the Durbin-Watson between about 1.6 and 2.4 are viewed as indicating independence of the errors (a good thing). The "Box-Pierce Q-Statistic" is another measure of this same thing.

Forecasts from Time Series Models
We can generate forecasts from this equation fairly easily. Suppose that for the last week of the 12 years of data (t=624), actual sales were 300. The equation would forecast sales for the first week of 2002 (t=week 625 in the data) and the second week of 2002 (t=626) to be

Forecast Sales (t+1) = 100 + 0.075*(Trend = week 625) + 0.5*(Sales week t = 300) - 32*(XMAS = 0) + 6*(Summer = 0)
= 296.9

Forecast Sales (t+2) = 100 + 0.075*(Trend = week 626) + 0.5*(Forecast Sales week t+1 = 296.9) - 32*(XMAS = 0) + 6*(Summer = 0)
= 295.4
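
The two forecasts above follow a simple recipe: plug in the next week number and the most recent sales figure, then reuse each forecast as the lagged value for the following week. A minimal Python sketch of that recipe, using the estimated coefficients from the table (the function name and its arguments are hypothetical):

```python
def forecast_sales(last_sales, last_week, steps, xmas=0, summer=0):
    """Forecast `steps` weeks ahead, feeding each forecast back in as Sales(t-1).

    Assumes the same XMAS and SUMMER indicator values apply to every forecast
    week, which is a simplification made for this illustration.
    """
    forecasts = []
    prev_sales = last_sales
    for week in range(last_week + 1, last_week + 1 + steps):
        prev_sales = 100 + 0.075 * week + 0.50 * prev_sales - 32 * xmas + 6 * summer
        forecasts.append(prev_sales)
    return forecasts

# Last observed week is t=624 with actual sales of 300 (thousands of $)
for step, value in enumerate(forecast_sales(300, 624, 2), start=1):
    print(f"Forecast Sales (t+{step}) = {value:.3f}")
```

The printed values agree with the forecasts worked out above up to rounding.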

The goal in forecasting is to make accurate predictions. Errors = Actual Values - Forecasted values. As in regression analysis, R-square is a measure of how well the patterns in the model account for past movements in the series. In addition to R-square, other measures are commonly used to evaluate how good the model is at forecasting (predicting) sales. Some of these are

Root Mean Square Error = a measure of the average size of the errors of the forecasts; it is the square root of (the sum of the squared errors divided by the number of errors);
Mean Absolute Error = a measure of the average size of the errors of the forecasts; it is the sum of the absolute value of the errors divided by the number of the errors;
Mean Absolute Percent Error = the average error size as a percent of the mean of the variable being forecasted; mean absolute error divided by the mean.
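
The three measures can be computed directly from a list of actual values and a list of forecasts. The Python sketch below follows the definitions as given above, including Mean Absolute Percent Error as the mean absolute error divided by the mean of the series; the function name and the sample numbers are hypothetical.

```python
import math

def forecast_diagnostics(actual, forecast):
    """Return (RMSE, MAE, MAPE in percent) using the definitions in the text."""
    errors = [a - f for a, f in zip(actual, forecast)]  # Actual - Forecasted
    n = len(errors)
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mape = 100 * mae / (sum(actual) / n)  # MAE as a percent of the mean
    return rmse, mae, mape

# Hypothetical weekly sales and forecasts, in thousands of $
actual = [250, 260, 240, 255]
forecast = [245, 266, 247, 255]
rmse, mae, mape = forecast_diagnostics(actual, forecast)
print(f"RMSE={rmse:.2f}, MAE={mae:.2f}, MAPE={mape:.2f}%")
```

Because squaring gives large errors extra weight, the Root Mean Square Error is never smaller than the Mean Absolute Error.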

"Static Forecasts" estimate the forecasts and errors using the data for the sample used to compute the time series equation. "Dynamic Forecasts" use data from outside the original sample to compute forecasts and errors.

For the model above, the forecasting diagnostics were as follows:

(Static) Forecast Diagnostics
Root Mean Square Error = $7500; Mean Absolute Error = $7000; Mean Absolute Percent Error = 2.8%
 

In words, the average weekly error in our forecasts was $7000 to $7500, or about 2.8% of mean weekly sales.

More Complex Patterns:
The example above investigated some of the simpler patterns to be found in a time series. More complicated patterns can be investigated in a number of ways. For one, trends are not necessarily linear. Using a squared Trend term can sometimes account for this. Sometimes the trends are much more complex, requiring special methods. Second, cyclical patterns may be much more subtle and complex requiring a second or third "lagged" value or even lagged error terms. Sometimes, the errors from the forecasts can be used to improve future forecasts (another cyclical pattern). Third, seasonal patterns may not be as straightforward as we estimated above.

Also, other variables can be added to a time series equation. In the example above, the existence, type, or amount of advertising done in the prior week might be included as an explanatory variable. Finally, one of the most subtle but most important issues in looking for time series patterns is to realize that sometimes patterns can be misleading. What appears to be a trend or a cycle may be nothing more than a series of random steps. For example, in class we will see a graph of a variable that appears to display a downward trend and possibly repetitive cycles around this downward trend, even though it is really nothing more than a series of random movements.


 
 
 

These kinds of time series that are a series of random steps are called "random walks." Chapter 14 discusses the idea a little. Random walks can masquerade as seeming trends or cycles in data. A random walk contains random movements from one step to the next. Suppose you select a starting point, say 50, then draw a ball at random (say 2) and put your new mark 2 places above your starting mark, so you are now at 52. Then you draw a new ball at random (say -5) and place the next mark 5 places below your prior mark, so you are now at 47. This makes the change in your position random. That is exactly how the graph shown above was generated.

Such a random walk (series of random steps) looks very different from a series where balls are drawn from a hopper where 50 is the mean and balls with numbers ranging from 45 to 55 are in the hopper and drawn at random. A series generated by this procedure is a "random series." It is also called a "white noise" series. Its graph probably appears more like the one people have in mind when they think of randomness.
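
The two procedures can be imitated in Python to see how differently they behave. This is only a sketch: the step sizes for the random walk (whole numbers from -5 to 5), the number of draws, and the seed are assumptions made for the illustration.

```python
import random

def random_walk(start, n_steps, rng):
    """Each value is the prior value plus a randomly drawn step."""
    series = [start]
    for _ in range(n_steps):
        series.append(series[-1] + rng.randint(-5, 5))  # assumed step range
    return series

def white_noise(n_steps, rng):
    """Each value is an independent draw from balls numbered 45 to 55."""
    return [rng.randint(45, 55) for _ in range(n_steps)]

rng = random.Random(0)  # fixed seed so the run can be repeated
walk = random_walk(50, 200, rng)
noise = white_noise(200, rng)
print("random walk range:", min(walk), "to", max(walk))
print("white noise range:", min(noise), "to", max(noise))
```

The white noise series can never leave the 45 to 55 band, while the random walk is free to drift far from its starting point, which is what produces the appearance of trends and cycles.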


 

Distinguishing a random series from one with patterns is not too difficult. The graph above has no obvious pattern. More precisely, an equation for Stock Price(t) that included lagged stock price, trends, or seasons would all have coefficients near zero. Distinguishing random walks is a bit trickier. As the graph for the random walk shows, there seems to be a trend. If an equation for Stock Price(t) were estimated with a Trend, the Trend coefficient would also appear large. The key is the coefficient on the lagged stock price, Stock Price(t-1). A random walk will have a lagged coefficient near 1.0 (usually 0.9 to 1.0). The random walk above has this equation: Stock Price(t) = 0.17 + 0.98*Stock Price(t-1). The coefficient of 0.98 suggests that there is a random walk component in this series. Further analysis should be conducted using changes in the stock prices instead of the original levels.
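
The lagged-coefficient check can be tried on simulated data. The Python sketch below fits Y(t) = a + b*Y(t-1) by ordinary least squares to a simulated random walk and to a simulated random series; the series lengths, step sizes, and seed are assumptions, and the simulated data are not the stock price series discussed above.

```python
import random

def lag_regression(series):
    """Least-squares fit of Y(t) = a + b*Y(t-1); returns (a, b)."""
    x = series[:-1]  # lagged values Y(t-1)
    y = series[1:]   # current values Y(t)
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b

rng = random.Random(1)  # fixed seed so the run can be repeated
walk = [50.0]
for _ in range(500):
    walk.append(walk[-1] + rng.gauss(0, 1))          # random walk
noise = [50 + rng.gauss(0, 3) for _ in range(501)]   # random (white noise) series

_, b_walk = lag_regression(walk)
_, b_noise = lag_regression(noise)
print(f"lag coefficient, random walk:   {b_walk:.2f}")
print(f"lag coefficient, random series: {b_noise:.2f}")
```

The random walk should produce a lag coefficient close to 1, while the random series should produce one close to zero.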
 
 
 


SUPPLEMENTAL READING -- EXPERIMENTAL DESIGN

Introduction: Experimental design refers to actively controlling the process by which data is generated so that the effects of one or more variables can be better isolated and measured. These methods are commonplace in natural and life sciences, where many experiments are conducted within laboratories and most of the variables influencing outcomes can be controlled. However, the methods are also useful in other scientific and business settings where only some of the factors are controllable. In business, these methods have been most widely used in production management settings to test different techniques or machinery. However, the same ideas are adaptable to almost any managerial or personnel setting and can range from very simplistic methods to very complicated designs.

Factor(s): a variable(s) that wholly or partly determines changes in another variable usually designated as the "response variable"; settings or "levels" of these factors refer to the different possible values the factor can take; in many situations, these values are qualitative

Experimental Data: Data that is generated where one or more of the factors influencing the outcomes is actively manipulated so as to better isolate or eliminate its effect or the effect of other factors;

Observational Data: Data that is collected without any active manipulation of the factors that influence the outcomes

Experimental Design: The plan for manipulating factors when generating and observing outcomes; the plan may range from simply a change to a setting (level) of one factor -- a simple "intervention" -- to a very extensive design that holds some factors constant while changing the settings (levels) of other factors

Overview of Steps in an Experimental Design: While the specifics of a design vary based on the details of the industry, company, and particular issue, an organized approach to setting up the experiment should, more or less, follow the items presented below.
(Adapted from Coleman and Montgomery, "A Systematic Approach to Planning for a Designed Industrial Experiment," Technometrics, 1993).

1. Objectives of the experiment: should be specific, measurable, and relevant
2. Background: existing theoretical or statistical knowledge concerning the response variable or factors, if any, as well as how the current experiment fits in with this background
3. Response variable: identify how the variable is measured and the typical operating means and ranges (if known)
4. List factors & determine settings/controls:
a) Factors of main interest -- identify the variables influencing the response variable whose effects or combined effects (interactions) the design is intended to help isolate; determine the desired settings (levels) of these variables during the experiment including desired interactions;
b) Factors to be held constant -- identify these factors and the "allowable" ranges of variation
c) Other controllable variables -- identify the other variables known or likely to influence the response variable that can be actively manipulated; determine the strategy for filtering out their effects (such as randomizing the variables of main interest among different settings of these factors or
d) Non-controllable factors -- identify the other variables that influence the response variable that cannot be actively manipulated; if these can be measured, determine the specific measurement strategy; if these cannot be measured, identify the expected impact, if any, on the experimental outcomes
5. Restrictions: identify and list cost, legal, managerial, or other limitations placed on the ability to manipulate factors;
6. Oversight and setup: identify responsibilities of personnel in the experiment and whether or not trial runs should be conducted
7. Analysis techniques: if possible, identify the most likely statistical methods that will be used to analyze the data from the experiment (such as regression, plots, ANOVA, ...)


SUPPLEMENTAL READING 4 -- STOCHASTIC SIMULATIONS

Introduction:
Simulations are a growing tool used in both academic and business settings due to advances in computing power. Even through most of the 1980s, most simulations of any sophistication were conducted by academics, a few governmental agencies such as the U.S. Department of Defense, and very large businesses such as Bell Labs employing people with significant mathematical/statistical training. With various point-and-click software applications, complex and powerful simulations can now be conducted with relative ease.
Simulations, in general, are studies where a set of assumptions is combined with data to determine what outcomes would be found under those conditions. In many settings, simulations go by the name "what if" analysis because the investigator is considering what will happen if a set of hypothetical conditions or data hold true. In statistics, the predicted values from a regression are a type of simulation where the equation and coefficients are used with the data for the X-variables to generate the predicted (forecasted) values for Y. Forecasted values from a time series model are another example of simulated values. Cost estimates for a project based on identifying costs of similar projects or appraisal values for a house based on average values of similar houses are examples of very simple, simulated outcomes.
Simulations (other than just guesses) all share two parts:

Simulated Value = Model & Data.

Simulated values are generated by assuming different "what if" scenarios for either the model, the data or both.

Simple Example
A simulation might start with a model as simple as the basic accounting definition for net worth which subtracts one thing from another and then generates a what if scenario by merely multiplying assets by 2 times their actual values:

Model: Net Worth = Assets - Liabilities

Data (to be used in the model): Liabilities = actual values
Assets = 2*actual values

In this simple example, the only simulated or hypothetical part is the asset data values we plug in since the actual values for liabilities are used, and the model is a basic definition and not a hypothetical relationship. Spreadsheet software such as Excel makes doing these and slightly more complex simulations relatively easy by permitting various columns of numbers to be combined in formulas determined by the user.
The complexity of a simulation increases as the model becomes more complex and more of the data and model are hypothetical. Still, no matter how sophisticated the simulation, the simulated outcomes are driven by a model (one or more equations) and data (real or hypothetical input provided by the user). Even simulations in the form of computer games -- such as Microsoft Flight Simulator -- that present pictures to the users are really just combinations of equations (model) and data.

Stochastic Simulations:
The example described above is more technically called a "deterministic" simulation because all of the numbers used are fixed at the outset, even the hypothetical values. A different class of simulations is one where the data or parameters can take on different values that, to some extent, are random. These kinds of simulations are called "stochastic" simulations or "Monte Carlo" simulations. They improve simulations by permitting the user to incorporate uncertainty more explicitly into the hypothetical scenarios.
In stochastic simulation, the idea is not to just say we don't know the future, therefore, let's just pick any number from 1 to 1 million at random. Instead, users assume that they can describe the likelihood of different outcomes but with some lingering uncertainty about the specifics. Therefore, the typical procedure is to pick some probability distribution, such as the normal distribution, that the user thinks describes the likelihood of outcomes, and then let the computer software generate hypothetical values at random that fit that probability distribution.

Example -- A Deterministic Simulation:
Suppose we are constructing a new house. To simplify matters, suppose we also know the final expense is driven by the size of the house (sq. ft heated & cooled) and the quality of the house (Premium = 1; Standard = 0). We also know that jumps in lumber prices (% change from current date on 2x4 prices) affect the expense. Based on past experience and analysis, suppose we have the following relationship (MODEL):

Housing Expense = $10,000 + 80*(sq. ft) + 20*(Premium*sq. ft) + 20,000*(% Change Lumber)

Data: we could plug in values for (sq. ft.), (premium), and (lumber) to simulate an outcome. If we use 3000 sq. ft., Premium = 1, and Lumber = 1%, the simulated expense would be
$330,000 = 10,000 + 80*(3000) + 20*(1*3000) + 20,000*(1).

This is just like the predicted values we have generated with regression analysis and is another example of a "deterministic" simulation where the hypothetical parts of the simulation are fixed in advance.
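
Written as a small Python function, the deterministic model looks like the sketch below. The coefficients come from the model above; the function and argument names are hypothetical.

```python
def housing_expense(sq_ft, premium, lumber_pct_change):
    """Housing expense model from the text; lumber_pct_change is in percent (1 = 1%)."""
    return (10_000
            + 80 * sq_ft
            + 20 * premium * sq_ft
            + 20_000 * lumber_pct_change)

# 3000 sq. ft., Premium = 1, lumber prices up 1%
print(housing_expense(3000, 1, 1))  # 330000, matching the $330,000 computed above
```

Changing any one input and rerunning the function is the "what if" step of a deterministic simulation.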

Example-- Simulation with Stochastic Data Values
Now suppose everything else about the housing expense model and the data are the same, but we are not really sure about the changes in lumber prices. Our best guess is that the average change will be zero, but they have often varied 1 or 2 percent up or down and occasionally a lot more. Rather than just plugging in a number such as 1% as we did above, we decide to conduct a stochastic simulation where we generate 100 different housing expenses for a 3000 sq. ft, premium house but assume that lumber price changes for these 100 cases are drawn from a normal distribution with a mean of 0% and a standard deviation of 2%, designated as Normal(0, 2). So now the setup is

Housing Expense = $10,000 + 80*(sq. ft=3000) + 20*(Premium*sq. ft=1*3000) +
20,000*(% change lumber prices: Normal(0, 2))

After the software computes housing expenses for these 100 cases, we can examine them and find out what was the average housing expense, what was the highest and lowest expense, and what was the typical range of expense. This kind of simulation provides much more information on which we could base our decisions.
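
A minimal Python sketch of this stochastic simulation follows. It draws 100 lumber price changes from Normal(0, 2) and plugs each into the housing expense model; the function name, the fixed seed, and the use of the standard deviation as the measure of "typical range" are assumptions made for the illustration.

```python
import random
import statistics

def simulate_expenses(n_cases=100, seed=0, sq_ft=3000, premium=1):
    """Housing expenses with lumber price changes drawn from Normal(0, 2)."""
    rng = random.Random(seed)  # fixed seed so the run can be repeated
    expenses = []
    for _ in range(n_cases):
        lumber_change = rng.gauss(0, 2)  # % change in lumber prices
        expenses.append(10_000 + 80 * sq_ft + 20 * premium * sq_ft
                        + 20_000 * lumber_change)
    return expenses

expenses = simulate_expenses()
print(f"average expense:    ${statistics.mean(expenses):,.0f}")
print(f"lowest / highest:   ${min(expenses):,.0f} / ${max(expenses):,.0f}")
print(f"standard deviation: ${statistics.stdev(expenses):,.0f}")
```

With an average lumber change of zero, the simulated expenses should center near $310,000, and the spread across the 100 cases shows how much lumber prices alone could move the final bill.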

Example -- Simulation with Stochastic Data Values & Model Parameters (coefficients)
To incorporate reality a little better, we could also assume that the coefficients (also called the parameters) in the model (80, 20, 20000) are, themselves, just estimates. The exact relationships are not known with certainty and can change to some extent. Suppose we think that the coefficient for sq. ft. is on average 80 but may differ by a standard deviation of 5, the premium coefficient is 20 on average with a standard deviation of 2, and the coefficient on lumber prices is 20,000 on average with a standard deviation of 1000. Now our simulation setup with a 3000 sq. ft premium house is

Housing Expense = $10,000 + Normal(80, 5)*(Sq. ft. = 3000)
+ Normal(20, 2)*(premium*sq. ft. = 3000*1)
+ Normal(20000, 1000)*(%change lumber prices: Normal (0, 2))

In this case, the parameters (coefficients) in the model are generated by drawing numbers from normal distributions as are the data values for lumber price changes. We could again generate, say, 100 cases and find the average housing expense, the highest and lowest expense, and the typical range of expenses.
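
The same idea in Python, now drawing the three coefficients as well as the lumber price change at random for each case. As before, this is a sketch: the function name and the fixed seed are assumptions, and the draws are treated as independent of one another.

```python
import random
import statistics

def simulate_expenses_random_model(n_cases=100, seed=0, sq_ft=3000, premium=1):
    """Housing expenses with random coefficients and random lumber price changes."""
    rng = random.Random(seed)  # fixed seed so the run can be repeated
    expenses = []
    for _ in range(n_cases):
        per_sq_ft = rng.gauss(80, 5)            # Normal(80, 5)
        premium_per_sq_ft = rng.gauss(20, 2)    # Normal(20, 2)
        lumber_coef = rng.gauss(20_000, 1000)   # Normal(20000, 1000)
        lumber_change = rng.gauss(0, 2)         # Normal(0, 2), % change
        expenses.append(10_000 + per_sq_ft * sq_ft
                        + premium_per_sq_ft * premium * sq_ft
                        + lumber_coef * lumber_change)
    return expenses

expenses = simulate_expenses_random_model()
print(f"average expense:    ${statistics.mean(expenses):,.0f}")
print(f"lowest / highest:   ${min(expenses):,.0f} / ${max(expenses):,.0f}")
print(f"standard deviation: ${statistics.stdev(expenses):,.0f}")
```

Letting the coefficients vary adds another source of uncertainty on top of the lumber price changes, so the range of simulated expenses will generally be wider than when only lumber prices vary.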


Examples of Assignment Answers

Questions:
3a. Briefly describe the data set used in the assignment and measurement of specific variables.
b. Based on your output, what would "typical" kilowatts used per day be? Is the data set symmetric and what are outlying values?
c. Do winter months have higher gas usage than other months? Conduct a test where the null hypothesis is that winter month gas usage is the same as other months.
 

EXAMPLE OF PRETTY GOOD ANSWERS
3. a. The data set consisted of 5 variables related to the monthly electric and gas usage of residential utility customers from July 1990 to June 1998. The variables included were a month identifier, average kilowatt hours used per day, gas thermal units used, average daily temperature, and number of days in the month.

b. Based on the histogram and descriptive statistics, typical kilowatt hours used per day ranged from about 30 to 40 per month. The mean was about 34 and the median was 37 with a standard deviation of 3. The data was skewed to the right with a few large outliers well beyond 50 kilowatt hours per day.

c. The means for winter month gas usage (December, January, February) were about 30 percent higher than for other months. Based on a low p-value (0.005), a hypothesis that winter months and other months have the same mean could be rejected with strong confidence.

EXAMPLE OF PRETTY BAD ANSWERS
3. a. It had some variables about utility customers.
b. 34. A few big outliers.
c. winter had higher