COURSE FAQ / ASSIGNMENTS: 1, 2, 3, 4, 5, 6, (Answer Example)
SUPPLEMENTAL READING: Foundation / Linear Probability Models / Time Series-Forecasting / Experimental Design / Stochastic Simulation
STATISTICS WEBSITES:Java Applets / Coin Flipping Page / American Statistical Association/ Journal of Statistics Education/ Fed Stats/ Time Series Data Library / Lexis-Nexis Stats Universe
Dr. Brian Goff/414 Grise Hall
Phone (502)745-3855 / brian.goff@wku.edu
Last Modified: March 18, 2002
Western Kentucky University
Contact Information
Office: Grise 414
Phone: 745-3855
Email: brian.goff@wku.edu
Office Hours for Spring 2002: 1:30-4 M, W; 9:30-11:30 T, TR
(appointments & drop-ins welcome at other times)
Resources
Andrew Siegel, Practical Business Statistics;
Microsoft Excel (with Data Analyst -- available on WKUnet);
SPSS 10.0 for Windows (available on WKUnet or limited quantities
in bookstore)
Objective
To develop skills in data analysis and evaluation with business
applications
Grading Policies
Announced Tests (2) = 30% (bonus pts available on each)
Unannounced Quizzes ( 6 to 10) = 30% (lowest dropped; lowest 2 dropped
if 8 or more)
Computer Assignments (6) = 40% (lowest dropped)
Total = 100% (Also Adjustments to Grades under Classroom Policies
below)
A = 90-100%; B = 80-89%; C = 70-79%; D = 60-69%; F = below 60%
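As a rough illustration of how the weights above combine, the following sketch computes a course percentage and letter grade. It assumes every score is already a percentage and that items within a category count equally, which the policies above do not state; the function names are hypothetical, and bonus points and classroom-policy adjustments are ignored.

```python
def drop_lowest(scores, n_drop):
    """Return the scores with the n_drop lowest values removed."""
    return sorted(scores)[n_drop:]


def course_percentage(tests, quizzes, assignments):
    """Weighted course percentage under the stated 30/30/40 split.

    Assumption: all scores are percentages (0-100) and items within a
    category are weighted equally.
    """
    quiz_drops = 2 if len(quizzes) >= 8 else 1      # lowest 2 dropped if 8 or more
    kept_quizzes = drop_lowest(quizzes, quiz_drops)
    kept_assignments = drop_lowest(assignments, 1)  # lowest assignment dropped

    def mean(xs):
        return sum(xs) / len(xs)

    return (0.30 * mean(tests)
            + 0.30 * mean(kept_quizzes)
            + 0.40 * mean(kept_assignments))


def letter_grade(pct):
    """Map a percentage to the letter scale given above."""
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if pct >= cutoff:
            return letter
    return "F"


# Hypothetical scores for one student.
pct = course_percentage(
    tests=[80, 90],
    quizzes=[100, 80, 60, 40, 80, 80],
    assignments=[90, 90, 90, 90, 90, 0],
)
print(round(pct, 1), letter_grade(pct))
```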
Planned Test Dates: Test 1 -- March 4; Test 2 (Final Exam) -- 9:05 MWF class @ 8 AM, Monday May 6; 11:15 MWF class @ 10:30 AM, Thursday May 9
Make-up Tests/Quizzes: NO MAKEUP TESTS/QUIZZES GIVEN FOR ANY REASON. If the midterm is missed, the final will be weighted more heavily to compensate. If serious illness or injury, official WKU activities, or some other special circumstances will cause you to miss more than 2 quizzes, please see me.
Announced Tests: These will be 20 to 40 question multiple choice tests. If weather or my absence postpones a test (or an assignment due date), the test will be given at the next scheduled class period.
Unannounced Quizzes: These will be 5 questions in length. They will be drawn from the most recent lecture and/or the current assigned reading. Quizzes will have a 5 minute time limit. No credit given for quizzes turned in after the time limit expires. We will have 6-10 -- be ready for a quiz on any given day.
Computer Assignments:
Miscellaneous
Last Day to drop with a "W" or change to audit is March 7. Students
with disabilities covered by the ADA, please register with the WKU Office
for Disabilities and then see me for accommodation.
Weeks 1-3 (Jan 14-Feb 1): Review and Extension of Basic Statistical
Methods
Using data to understand reality (Ch. 1-2)
Computer software in analyzing and presenting data (Ch. 3; SPSS;
Excel)
Foundational measures and methods (See Foundation Supplement)
Some applications of basic concepts: quality control (Ch. 18); EDA--Statistical
detective work.
Assignment # 1 - Deadline Feb 1 (Friday - all assignments due 5
minutes after beginning of class)
Other Special Dates: Jan 21 (Monday) -- MLK Holiday;
Weeks 4-5 (Feb 4-15): Looking for Relationships between Variables
Part 1
Scatterplots & correlation coefficients (11:396-419)
Regression lines & coefficients (11:419-427)
Extending applications to multiple explanatory variables (12:468-475)
Using regression with qualitative independent variables (12:520-526)
Assignment #2 - Deadline -- Feb 15 (Friday)
Weeks 6-7 (Feb 18-Mar 1): Looking for Relationships between Variables
Part 2
Using regression with qualitative dependent variables -- LPM case
(Supplement 1)
Evaluating the statistical reliability of regression results (11:428-438)
Regression predictions & errors (12:475-478; 503-507)
Nonlinear relationships in "linear" regression (12:516-518)
Assignment #3 - Deadline Mar 1 (Friday)
Special Date: Test #1 ("Midterm")-- Wednesday Mar 6
****Special Change: 2nd Chance on Midterm -- Wed Mar 13
******NOTE REVISIONS BELOW AS OF MARCH 14
(Week 10: Mar 18-22 Spring Break)
Week 11 (Mar 25 - Mar 29): Condensing & combining data to measure
attributes
Factor Analysis & Cluster Analysis (See SPSS Help on these Topics
+ SPSS Results Coach)
Designing simulations using statistical software (Simulation Reading
Supplement 4; SPSS Transform)
******Assignment # 4 ***** Skipping Assignment 4/Will do In-class*******
Weeks 12-13 (Apr 1-Apr 12): Varied Applications of Statistics to
Business
Designing and conducting experiments in business settings (Experimental
Design Supplement)
Using contingency tables to examine qualitative relationships (Ch.
17)
Using ANOVA to examine qualitative relationships (15:610-618; 629-630)
Assignment # 5 - Deadline April 12 (Friday)
Weeks 14-15 (Apr 15 - Apr 26): Using the Past as a Guide to the Future
Overview of time series/forecasting strategies & components
(Time Series Supplement; 14:464-475)
A Primer on estimating trends, cycles, & seasons (Time Series
Supplement)
Confusions about trends & randomness (Time Series Supplement;
14:566-567; 586-587; 592-594)
Basic forecasting concepts (Time Series Supplement)
Assignment # 6 - Deadline April 26 (Friday)
Week 16 (Apr 29 - May 3): Review & Prep For Final
Special Date: Test 2 (Final Exam--Apx. 70% new material and 30% from Test 1)
Q: I had to work late last night, the computer
system was down this morning, ..., will you take my assignment late?
A: I will accept it late, but you will still be assessed the 10%
per day penalty. Do the assignments in advance of deadlines to make sure
that you turn them in on time.
Q: I will miss midterm because of forensics,
swimming, golf, .... Can I make-up these tests?
A: No make-ups are given but see me, and we will discuss your options
if you will miss more than 1 test.
Q: I can't understand question x on the latest
computer assignment. Can you help me?
A: I will explain ambiguities in words and phrases, but you must
use your book, notes, and brain to determine answers to the questions.
You may also seek assistance from other students (and, of course, your
teammate if you have one), but you may not copy the answers and output
of others.
Q: I forgot to put my answers on the back of
the last page of output, forgot to staple or paper clip it, my handwriting
is not very clear .... Will you count off for that?
A: Yes.
Q: I'm doing poorly on quizzes/tests/assignment
(plus I need this class for my degree program). Can I do work for extra
credit?
A: No. Grades will be determined by the policies stated above.
Q: How can I prepare for the tests?
A: Keep up with the reading. Look back over your notes before class.
Look over questions from past ECON 206 and 306 exams, sorted by subject
area. Aside from the computer assignments, 20-30 minutes in between each
class meeting spent in this way will go a long way and will do a lot more
than cramming before an exam.
Computer Assignment 1
(You need a floppy disk)
Objective: To gain familiarity with using spreadsheet software
(Excel) and dedicated statistical software (SPSS) to perform statistical
functions, and to review basic statistical concepts
Accessing SPSS: After you turn on the computer and the "Novell Client" window appears, you must type in SPSS rather than "Student" for your username (no password needed). Then click on the "SPSS" folder in the left window, then click the SPSS 10 icon in the right window. The SPSS menu and data spreadsheet should appear. After exiting SPSS, click on the "Applications" folder to have the Excel icon reappear. Excel can be accessed by just double-clicking the Excel icon on the standard student network desktop.
1. i) In SPSS, retrieve the file H:share\econ\goff\data\bballclass.sav
into the SPSS data spreadsheet. This file contains data related to Major
League Baseball teams from 1990-1996. Labels for each variable are provided
by SPSS; ii) Create two new variables (stadium revenues divided by
attendance and gate revenues divided by attendance) using Transform>Compute
and give the new variables names of your choosing; iii) Generate histograms
for these gate revenues per person in attendance and stadium revenues per
person in attendance using Graphs>Histogram and print your output; iv)
save the file to your floppy disk as an Excel (.xls) file (make sure the
box to include variable names is checked);
2. i) Exit SPSS and get into Excel. Retrieve the Excel version of
the file that you saved to floppy; ii) Use Tools>Data Analysis>Descriptive
Statistics and fill in the input range to include the two new variables
that you created in Step 1 (get both summary statistics and the confidence
intervals); iii) Use Format>Columns>Autofit to resize your output columns
to fit the output tables and then print these tables; iv) Go back to the
data spreadsheet and click the first value under your stadium revenue per
person variable, then click Data>Sort>Descending to sort the entire data
set by this variable in descending order; v) copy down the top 10 cities
and years.
3. On a separate sheet attached to the last sheet of output, provide
answers to the following using complete sentences and standard English:
i) Briefly describe the data set and how the variables are measured.
ii) Using the histograms and the descriptive statistics you printed
for stadium and gate revenues per person, describe the sizes of "typical"
outcomes and describe the sizes of "outliers."
iii) Provide a brief summary of the cities-years with the largest
stadium revenues per person;
iv) Based on descriptive statistics from Excel, what would you conclude
if you tested the null hypothesis that stadium revenues per person were
$3? What is your basis for this decision? If you had used Excel or SPSS
to provide a p-value for this test, what is your guess as to the size of
the p-value and what information would this p-value be providing?
Staple or clip your output together with the microcomputing cover sheet at the front and your answer sheet at the back with the other pages. Due Date = Feb 1 at beginning of class.
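The per-person calculations in Step 1 and the sort in Step 2 can also be sketched outside SPSS and Excel. The snippet below is only an illustration in Python: the rows are made-up placeholders, not values from bballclass.sav, and the field names are hypothetical.

```python
# Hypothetical rows: revenues in millions of dollars, attendance in millions of people.
teams = [
    {"team": "AAA", "year": 1990, "gate": 20.0, "stadium": 8.0, "attendance": 2.0},
    {"team": "BBB", "year": 1991, "gate": 27.0, "stadium": 15.0, "attendance": 3.0},
    {"team": "CCC", "year": 1992, "gate": 12.0, "stadium": 3.0, "attendance": 1.5},
]

# Step 1 ii): create the two per-person variables. Dividing millions of
# dollars by millions of people gives dollars per person.
for row in teams:
    row["gate_pp"] = row["gate"] / row["attendance"]
    row["stadium_pp"] = row["stadium"] / row["attendance"]

# Step 2 iv)-v): sort by stadium revenue per person (descending), keep the top 10.
top10 = sorted(teams, key=lambda r: r["stadium_pp"], reverse=True)[:10]
for row in top10:
    print(row["team"], row["year"], round(row["stadium_pp"], 2))
```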
Assignment 1 Answers
3i) The data set contained five variables on Major League Baseball (MLB) teams from 1990-96. These were gate revenues (ticket sales) in millions of dollars, stadium revenues (concessions, ...) in millions of dollars, media revenue in millions of dollars, attendance in millions of people, and a team identifier.
3ii) Gate revenues per person had a mean of $9.8 with a standard deviation of $1.9 (dollars per person, since revenues and attendance are both measured in millions). These revenues appeared to be close to normally distributed with the exception of a few outliers in the $16-$18 range. The mean of stadium revenues per person was $4.4 with a standard deviation of $2.2. The distribution of these revenues did not appear very normal but was spread nearly uniformly from about $2 to $7. Outliers appeared on both sides -- down to $0 on the low side and up to $8 to $12 on the high side.
3iii) The following teams exhibited the largest stadium revenues per person: Chicago White Sox (96), Chicago White Sox (95), .... These figures ranged from $8.4 to $12.2.
3iv) A claim that stadium revenues per person averaged $3 could be rejected using the 95% confidence interval around the sample mean. This interval is $4.4 +- $0.32. Since $3 is outside of the range allowing for sampling error, the evidence rejects the claim at this level of confidence. A p-value associated with this test measures the probability of finding sample evidence like this by random chance if the claim were true. Since we rejected using the 95% interval, the p-value for this test would be at or below 0.05.
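The confidence-interval reasoning in 3iv) can be checked with a few lines of arithmetic. This sketch uses only the figures quoted in the answer (sample mean 4.4, margin of error 0.32, claimed value 3); the function name is hypothetical.

```python
def claim_rejected(sample_mean, margin, claimed_value):
    """Reject the claim if it falls outside sample_mean +/- margin."""
    lower = sample_mean - margin
    upper = sample_mean + margin
    outside = not (lower <= claimed_value <= upper)
    return outside, (lower, upper)


rejected, interval = claim_rejected(4.4, 0.32, 3.0)
print(interval, rejected)
```

Because the claimed value lies below the lower end of the interval, the check reports a rejection, which matches the conclusion in the answer.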
Computer Assignment 2
1. i) Pick a specific date at least 1 month from now and an approximate
time of day (for example AM on April 1) for roundtrip flights from
Nashville (BNA) to at least 20 cities of your choosing (just make sure
at least 3 are from the Southwest Airlines list below and at least 3 are
not). Using an online travel search service such as Yahoo! Travel, Expedia.com,
Travelocity.com, or Orbitz.com, determine the cheapest fare to each destination
other than on Southwest and also the distance to each city. For distances you
can use an atlas or look them up at www.symsys.com/~ingram/mileage.html.
(Note: approximations are ok; you may need to use an atlas if the information
is not listed on the travel website). Be sure to keep track of the city
that goes with each price. Also, write down the date and time of travel.
ii) Enter the price and distance data as variables into an Excel
file along with a text variable that abbreviates the city name. Also, create
a 1/0 variable that identifies whether Southwest Airlines flies directly
from Nashville to that city or not. Make it equal 1 if Southwest flies
the route directly. (These cities are Los Angeles, San Diego, San Jose,
Oakland, Phoenix, Houston, Austin, San Antonio, Kansas City, New Orleans,
Chicago, Birmingham, Orlando, Jacksonville, Detroit, Hartford,
Manchester. If you know some other city is served directly, enter it as
a 1 also.)
iii) Use Excel's Data Analysis Tool to estimate a regression where
the dependent variable is the fare and explanatory variables are distance
to the city and the variable for whether Southwest flies direct or not.
Direct Excel to generate residuals and standardized residuals (but not
residual graphs). Before printing the output sheet, use Format, Column,
AutoFit, so that the words in the output column are not cropped. Also,
give the output Table a new title (Table 1: Air Fare Regression)
2. Print both the regression output and your data sheet (put
these on separate pages).
3. Answer the following questions on a separate sheet, neatly handwritten
or typed, in complete sentences. I'm looking for answers that are
direct and precise. (66%)
i) Briefly describe the variables in your data set, their measurement,
and how they were collected.
ii) Draw a graph (neatly done by hand is fine) of the regression
line implied by the coefficient on distance. Also, plot the fare-distance
data combination for at least 10 of your cities on this graph and identify
them by an abbreviation for each city.
iii) Explain the specific meaning of the coefficient on the distance
variable in your regression.
iv) Explain the meaning of the Southwest Air variable in your regression
and draw a simple graph showing how it would change the fare-distance graph
that you drew earlier.
v) Explain the meaning of the R-square statistic (not just in general,
but the specific value in your regression).
Deadline = February 15 (beginning of class)
Remember to attach the microcomputing cover page; staple or clip
pages together, and follow the other assignment guidelines.
Assignment 2 Answers
i) Your description of the variables and sources here.
ii) The graph should have Fare ($) on the y-axis and Distance (miles)
on the x-axis. It should begin at the y-intercept on the y-axis and increase
at the rate shown by your slope coefficient on Distance. It should also
include 10 points showing combinations of Fare-Distance for 10 of your
cities.
iii) Suppose your regression equation is Fare = 50 + 0.12*Distance
- 75*Southwest.
The coefficient on Distance is 0.12. This estimates the slope between
distance and fare and indicates that for each 1 mile increase in Distance
of a flight, fare increases by about $0.12.
iv) Given the equation above, the coefficient on the Southwest variable
of -75 means that if Southwest flies non-stop on a route, the Fare averages
$75 less. A graph showing this would just shift the regression line in
part ii) down so that the y-intercept would be $75 lower. The slope would
be the same.
v) Suppose the R-squared is 0.65. This means that distance to a
destination and whether Southwest flies non-stop or not account for 65
percent of the variation in fares across the cities in the sample.
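The illustrative equation from part iii) can be written as a small function to see how the slope and the Southwest shift work together. This is only a sketch of the supposed equation (Fare = 50 + 0.12*Distance - 75*Southwest), not results from any student's data; the function name and the 1,000-mile example are hypothetical.

```python
def predicted_fare(distance_miles, southwest):
    """Predicted fare from the illustrative equation in part iii).

    southwest is 1 if Southwest flies the route directly, else 0.
    """
    return 50 + 0.12 * distance_miles - 75 * southwest


base = predicted_fare(1000, 0)     # route without direct Southwest service
with_sw = predicted_fare(1000, 1)  # same distance, Southwest flies direct
print(base, with_sw, base - with_sw)

# The slope: one extra mile adds about $0.12 regardless of the Southwest value.
print(round(predicted_fare(1001, 0) - predicted_fare(1000, 0), 2))
```

The gap between the two predictions is the same at every distance, which is why the graph in part iv) is a parallel downward shift of the line.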
Computer Assignment 3
1. On this assignment, choose ONE of
two data sets.
H:share\econ\goff\data\tvownership.xls
is an Excel file containing a sample of 40 responses to a survey as to
whether the household owned a 48-inch TV (or larger) along with information
on income (in thousands of $), age (oldest in household in yrs) and whether
children live in the household (1 if yes; 0 if no).
H:share\econ\goff\data\madenlf.xls
is an Excel file containing a sample of 40 college linebackers and whether
they made an NFL team (Madenfl = 1 if yes; 0 if no) along with information
on their weight (in lbs.), 40-yard dash time (in seconds), and whether
they played Division I-A football or not (1 if yes; 0 if no).
2. i) Use Excel to estimate a regression model with either 48-inch
TV ownership or MadeNFL as the dependent variable, depending on the data
set that you choose. Use the explanatory variables available in the data
set.
ii) Save the predicted and residual values.
iii) Edit the regression output so that it has a title that describes
what the regression contains.
3. Provide responses for each of the following:
i) Briefly describe the data set you used and how each variable
is measured.
ii) Explain the meaning of the coefficients on the explanatory variables.
iii) Assess the reliability of the estimated coefficients in terms
of measures of sampling error.
iv) Briefly summarize the predicted and residual values. Also, using
an actual case from the data, demonstrate how the predicted values and
residual values were created.
v) What is the interpretation of the predicted values in this regression?
What, if any, problems with a linear regression model can you identify
from the predicted values?
Assignment 3 Answers (These are answers for TV data; answers for
NFL data are similar)
3.i) The data set includes measurements from 40 households concerning
TV ownership, ...
ii) The coefficient on household income is 0.005. This means that
for every thousand dollars income increases, the probability of owning
a large television increases by 0.005 or 0.5 percent. If a family has children,
the probability of owning a large TV increases by 0.37 or 37 percent.
The coefficient for age is 0.0018. This means that each year of age for
the oldest household member increases the probability of ownership by 0.0018
or 0.18 percent.
iii) The standard errors and p-values of the coefficients allow
their reliability or accuracy to be assessed. Income's coefficient (0.005)
is more than 4 times larger than its standard error (0.001) and its p-value
is very low (0.0002). Children's coefficient is more than 3 times larger
than its standard error (0.001) and its p-value is also very low (0.001).
Therefore, these two coefficients are relatively reliable or accurate estimates
of the true coefficients, at least so far as sampling error is concerned.
The coefficient on age (0.001) is smaller than its standard error (0.009)
and its p-value is far above 0.05 at 0.85. Therefore, when taking sampling
error into account, the coefficient on age is not very reliable. A true
value of zero cannot be ruled out.
iv) The predicted values for the probability of TV ownership range
from -0.17 to 1.11 with considerable variation within this range. The residual
values range from -0.19 to 0.61. Using the values for household #1 in the
data set of INCOME =18, CHILDREN =1, AGE =41, the predicted value would
be
Own TV = -0.35 + 0.0058(18) +0.37*(1) + 0.0018(41) = 0.19.
The actual value for this household was 0, so that the residual
= actual - predicted = 0 - 0.19 = -0.19.
v) In a regression such as this one with a dependent variable measured
as either a 0 or 1, the predicted values for the dependent variable measure
the probability of the variable being equal to 1.0 . Because these values
represent probabilities, their values should be between 0 and 1. However,
as noted above, some of the predicted values from the linear regression
are above 1.0 and below 0.0. Under these circumstances, the R-square statistic
is not an accurate measure of how well the equation fits the data. Also,
under these circumstances, logistic regression is a more accurate alternative
to use because it forces the predicted values to lie between 0 and 1.
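The predicted value and residual worked out in part iv) can be reproduced with a short sketch. It plugs in the rounded coefficients quoted there, so the result (about 0.198) differs slightly from the 0.19 reported, presumably because the software used unrounded coefficients; the function name is hypothetical.

```python
def predicted_ownership(income, children, age):
    """Linear probability model using the rounded coefficients quoted in part iv)."""
    return -0.35 + 0.0058 * income + 0.37 * children + 0.0018 * age


# Household #1: INCOME = 18, CHILDREN = 1, AGE = 41, actual value = 0.
p_hat = predicted_ownership(income=18, children=1, age=41)
residual = 0 - p_hat            # residual = actual - predicted
in_range = 0.0 <= p_hat <= 1.0  # a valid probability must pass this check
print(round(p_hat, 4), round(residual, 4), in_range)
```

Running the same range check on every household is one simple way to spot the problem described in part v), where some predicted values fall below 0 or above 1.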
Computer Assignment 5
Option 1: Collect at least 30 book prices (dependent variable) to determine whether they are related to a qualitative factor (such as fiction/nonfiction)
Option 2: Collect sales to price ratios (dependent variable) for at least 30 publicly traded companies to determine whether they are related to a qualitative factor (such as sector in which the company operates).
Option 3: Collect NCAA basketball tournament seedings (dependent variable) for all teams for at least one year to determine whether they are related to a qualitative factor (such as major, mid major, or below mid major conference affiliation).
2. Analyze your data in the following ways using SPSS:
i) Generate an Analysis of Variance (ANOVA) using Analyze/Compare Means/One-Way ANOVA. Place the quantitative version of the dependent variable in the "Dependent List" box and the factor of main interest in the "Factor" box.
ii) Generate a contingency table (aka Crosstabulation in SPSS) using Analyze/Descriptive Statistics/Crosstabs and your qualitative version of the dependent variable ("Column Variable") along with your factor of main interest ("Row variable"). Use the "Statistics" button and choose "Chi-Square."
iii) Edit the ANOVA title and the Crosstabulation title to add brief descriptions of your specific analysis.
3. Answer the following questions on
a separate sheet and attach to the back of your output.
i) Briefly describe your data source,
the variables, and how they were measured.
ii) Explain your experimental design
strategy including your dependent variable, the factor of main interest,
other factors likely to influence outcomes of the dependent variable, and
the active steps you took to better isolate the effects of the factor of
main interest. (This is a key part of the assignment, so be detailed and
clear.)
iii) If you had additional time/resources,
how might you improve on your design strategy?
iv) Describe the ANOVA results and
what they mean for the relationship between your dependent variable and
the factor of main interest.
v) Describe the Crosstab results and
what they mean for the relationship between the qualitative version of
the dependent variable and the factor of main interest.
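To make the ANOVA output easier to interpret, it may help to see how the F statistic is assembled from between-group and within-group variation. The following is a minimal sketch in plain Python with made-up book prices; it computes only the F statistic, not the p-value that SPSS also reports, and the function name is hypothetical.

```python
def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA on a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical prices for two levels of the factor (fiction vs. nonfiction).
fiction = [12.0, 15.0, 11.0, 14.0]
nonfiction = [18.0, 22.0, 19.0, 21.0]
print(one_way_anova_f([fiction, nonfiction]))
```

A large F means the group means differ by more than the spread within groups would suggest by chance alone.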
Computer Assignment 6
ACCREV = % of account receivables in delinquency for that quarter;
STOCKABC = price of stock for company ABC (in $) for that quarter;
2. Choose either ACCREV or STOCKABC and use methods (graphs, equations, ..) we discussed in class and in the Reading Supplement to identify patterns over time. Your analysis should provide you with a means for forecasting the variable you choose. Save any graphs or equations that you created to include in your report (max 2 pages of such material).
(Note: I have intentionally left the exact methods of your analysis up to you on this assignment).
3. On a separate sheet of paper attached to your output, provide
answers to the following:
i) Briefly describe the variable analyzed and its measurement.
ii) Explain the specific time series patterns you found and the
basis for your conclusion.
iii) Based on your findings, generate a forecast for the four
quarters of 2002, explaining the details of how you arrived at these forecasts.
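The assignment leaves the forecasting method open, so the following is only one possible starting point: a least-squares linear trend extended four quarters ahead. It ignores seasonal and cyclical patterns, the series shown is made up rather than taken from the assignment data, and the function names are hypothetical.

```python
def fit_linear_trend(values):
    """Least-squares line through (1, y1), (2, y2), ...; returns (intercept, slope)."""
    n = len(values)
    t = list(range(1, n + 1))
    t_mean = sum(t) / n
    y_mean = sum(values) / n
    slope = (sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, values))
             / sum((ti - t_mean) ** 2 for ti in t))
    intercept = y_mean - slope * t_mean
    return intercept, slope


def forecast(values, periods_ahead):
    """Extend the fitted trend line periods_ahead steps past the last observation."""
    intercept, slope = fit_linear_trend(values)
    n = len(values)
    return [intercept + slope * (n + k) for k in range(1, periods_ahead + 1)]


series = [3, 5, 7, 9, 11, 13, 15, 17]  # hypothetical quarterly series
print(forecast(series, 4))
```

If a plot of the data shows a seasonal pattern, a trend-only forecast like this one would need to be adjusted for it.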
Part II (50% -- Write answers on a separate
sheet from Part I)
You have the job of investigating one of the major vendors
of data-statistical software (SAS -- www.sas.com) to determine its capabilities
for your company.
i) Under Software, choose "Success Stories" and then under "More
Success Stories" choose either "Industry" or "Solution" and summarize 2
of the stories pertinent to your major.
ii) Go back to the main page (www.sas.com) and select "By Product"
under Software. Under the Product Index, choose "Statview."
Using the information provided by the "Introduction," "Technical Overview,"
and "How to Order" sections, summarize the features of this product along
with information about pricing.
1. Determining What is Typical and Atypical in a Data Set
a. Measures of "Average Outcomes"
-- Mean (78-81): the everyday "average"; simple and useful
with relatively symmetric data
-- Weighted Mean (83-85): an average where individuals values
are multiplied by a weight before they are added together;
-- Median (87-90): important for determining the middle of
asymmetric data
-- Mode (93-94): useful in categorical data
b. Measures of Variability of Outcomes: information about averages
of a data set alone does not provide very much information about what is
typical and atypical. Variability measures are critical for more detailed
and useful understanding of data
-- Standard deviation, s, (121-125): a measure of the average
distance from the mean for the values in the data set
-- Standardized Units (z-units, z-values, z-scores): the name
given to a variable (X) which has been converted so that the mean is zero
(0) and the standard deviation is one (1). This conversion is done by the
formula $Z_i = (X_i - \bar{X})/s$, where $\bar{X}$ is the mean, s is the standard deviation, and i refers to each individual item in the data
set. This conversion eliminates whatever units were used to measure X,
and it allows each data point to be easily evaluated in terms of how much
it differs from the mean.
-- Empirical Rule (128-129): a summary of variation in outcomes
for data that have roughly a bell-shaped (normal) shape and a means to
identify outliers. The Empirical Rule states: i) about 68 percent of the
data will be within +/- 1 standard deviation of the mean, ii) about 95%
of the data will be within +/- 2 standard deviations of the mean, and
iii) about 99.7% of the data will be within +/- 3 standard deviations
of the mean.
-- Chebyshev's Rule (128 n. 9): a summary of variation in
outcomes for data regardless of their shape and a means to identify outliers.
The rule states: i) at least 75% of the data will be within +/- 2 standard
deviations of the mean, ii) at least 88.9% (about 89%) of the data will be within +/-
3 standard deviations of the mean, and iii) at least 93.75% (about 94%) of the data will
be within +/- 4 standard deviations of the mean.
c. Skew (skewness coefficient): a measure of the symmetry of outcomes
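The standardization formula and Chebyshev's Rule described above can be sketched in a few lines of Python. The data values are made up for illustration and the function names are hypothetical; the bound 1 - 1/k^2 is the standard form of Chebyshev's Rule that produces the 75%, 89%, and 94% figures.

```python
import statistics


def z_scores(values):
    """Convert values to z-scores using the sample mean and sample standard deviation."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [(x - mean) / s for x in values]


def chebyshev_lower_bound(k):
    """Minimum share of data within k standard deviations of the mean (k > 1)."""
    return 1 - 1 / k ** 2


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
z = z_scores(data)
print([round(v, 2) for v in z])
print([round(chebyshev_lower_bound(k), 4) for k in (2, 3, 4)])
```

After conversion, the z-scores have mean 0 and standard deviation 1 whatever the original units were, so a value such as z = 2 always means "two standard deviations above the mean."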
2. Bringing in Probability & Probability Distributions
Basic Definitions:
a. Probability (162-163, 169-170): a branch of mathematics
dealing with computing likelihood of events; widely used in statistics;
the probability of an event: must fall between 1 and 0;
b. Odds (of an event) = probability of an event divided by
one minus the probability;
c. Law of Large Numbers (171-172): relative frequency of
an event will become closer to its probability as the number of trials
is increased;
d. Probability Distribution: a formula (or graph/table based
on the formula) that relates the possible outcomes of a random variable
to the likelihood (probability) of those outcomes (this just expands the
definition given in the book)
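The odds formula and the Law of Large Numbers defined above lend themselves to a quick simulation. This is a minimal sketch that assumes a fair coin (probability 0.5) and a fixed random seed so the run is repeatable; the function names are hypothetical.

```python
import random


def odds(p):
    """Odds of an event with probability p (0 <= p < 1)."""
    return p / (1 - p)


def relative_frequency_of_heads(n_flips, seed=0):
    """Simulate n_flips fair coin flips and return the share that are heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips


print(odds(0.75))  # a probability of 0.75 corresponds to odds of 3 (3 to 1)
for n in (10, 1000, 100000):
    print(n, relative_frequency_of_heads(n))
```

As the number of flips grows, the relative frequency should settle closer to 0.5, which is the pattern the Law of Large Numbers describes.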
i. KEY PROBABILITY DISTRIBUTIONS
-- Binomial Distribution (214-223): for counts of events over independent
trials that each have two possible outcomes;
-- Normal Distribution (223-227): for events with bell-shaped
outcomes;
-- Standard Normal Distribution (224-228): a special case
of the normal where the mean of a normal variable is converted to 0 and
its standard deviation is converted to 1; a standardized normal variable;
-- t-Distribution (306-308): another bell-shaped probability
distribution, but one whose shape changes slightly as the number of sample
items (degrees of freedom) changes; with small numbers in the sample, it
is a little wider than the normal curve; with large numbers in the sample,
it is nearly the same as the normal curve
-- Chi-Squared Distribution (672-673): another probability
distribution that changes shape as the size of the sample (degrees of freedom)
changes; with a small sample, it is highly skewed to the right; with a large
sample, it becomes more similar to the normal curve
-- F-Distribution (620-623): a probability distribution very
similar in its properties and behavior to the Chi-Squared Distribution;
it is skewed in small samples and
3. Accounting for Sampling Errors
a. Population parameters & statistical estimators
(260-261);
b. Standard Error - an estimate of the "average" sample error
in a sample-based statistical estimator of a population parameter if ideal
random samples of the given size were repeatedly used; in other words,
for instance, the standard error of the sample mean (272-275) estimates how far
the sample mean is likely to differ from the population mean in repeated
samples of a given size.
c. Confidence intervals (300-314): statistical estimates
that use standard errors along with probability distributions to compute
a range (interval) around a statistical estimate that incorporates most
(usually 95% or 99%) of the likely sampling error.
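The standard error of the sample mean and a 95% confidence interval can be computed directly from a sample, as in the sketch below. It assumes the usual formula s/sqrt(n) for the standard error and uses the normal multiplier 1.96, which is an approximation; for a small sample like this made-up one, the t-distribution multiplier would be somewhat larger. The function names are hypothetical.

```python
import math
import statistics


def standard_error_of_mean(values):
    """Sample standard deviation divided by the square root of the sample size."""
    return statistics.stdev(values) / math.sqrt(len(values))


def confidence_interval_95(values):
    """Approximate 95% interval: mean +/- 1.96 standard errors (normal approximation)."""
    m = statistics.mean(values)
    margin = 1.96 * standard_error_of_mean(values)
    return m - margin, m + margin


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
low, high = confidence_interval_95(sample)
print(round(standard_error_of_mean(sample), 3), (round(low, 2), round(high, 2)))
```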
4. Testing ideas with data (Hypothesis Testing 339-357)
Statistical Hypothesis Testing (in general): Using sample
data to evaluate claims which have been made while also taking into account
the problem of sampling error. Because of sampling error, a difference between statistical evidence
and a claim may or may not provide convincing evidence against the claim.
In scientific research and also in assessing the reliability of statistical
estimators of population parameters, several common concepts and methods
are used to assess hypotheses; these are useful where hypotheses are very
"sharp" (clear-cut or precise); they are not as useful when conducting
preliminary or exploratory data analysis
Null Hypothesis: The hypothesis or claim which is accepted unless convincing evidence against it is presented; used extensively in research and statistical tests
Alternative Hypothesis: The opposite claim to the null hypothesis; the hypothesis which is accepted only when substantial evidence against the null hypothesis is presented; sometimes called the research hypothesis; it is typically the claim which an investigator is trying to prove
Type I Error: Deciding to reject the null hypothesis when it is, in fact, true
α ("Alpha"): The probability of committing Type I Error. When conducting statistical tests of hypothesis, this probability is typically set to be a small number such as 10%, 5%, or 1%. This is also referred to as the "Significance Level for a test."
Level of Confidence (in a test): 1 - α, that is, if the probability of Type I error is 5%, then the Level of Confidence in the test is 95%.
Type II Error: Deciding to accept the null hypothesis when it is, in fact, false.
β ("Beta"): The probability of committing Type II Error. For a given statistical test, a given value of α, and a given sample size, the value of β is predetermined. For a given sample size, if α is reduced, then β will increase. α and β can both be decreased for a specific test by using a larger sample size.
Power of a Test: 1 - β, that is, if the probability of Type II error is 10% then the power of the test is 90%. This number indicates how "powerful" the statistical test is in rejecting a null hypothesis when it is, in fact, false
Confidence Interval Approach to Hypothesis Tests: A confidence interval (usually 95% or 99%) is constructed around a sample estimate such as the sample mean and compared with the claim made by the null hypothesis about the relevant population parameter (such as the population mean). If the null hypothesis claim is outside of the confidence interval, then the null hypothesis is rejected and the alternative is accepted.
p-value: The probability that the sample evidence would be found by random chance if the null hypothesis were true. A low p-value (such as smaller than 5%) is an indication that it is unlikely that the sample evidence would be found by chance if the null hypothesis were true. The lower the p-value, the more convincing the evidence is against the null hypothesis.
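The p-value idea can be illustrated with a simple test of a claimed population mean. The sketch below assumes a normal approximation (a z-test) and a two-sided alternative, which is one common setup rather than the only one; the function names are hypothetical.

```python
from statistics import NormalDist


def two_sided_p_value(sample_mean, claimed_mean, standard_error):
    """Two-sided p-value for H0: population mean = claimed_mean (normal approximation)."""
    z = (sample_mean - claimed_mean) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def reject_null(p_value, alpha=0.05):
    """Reject the null hypothesis when the p-value is below the significance level."""
    return p_value < alpha


# A sample mean 1.96 standard errors from the claim gives a p-value of about 0.05.
p = two_sided_p_value(sample_mean=1.96, claimed_mean=0.0, standard_error=1.0)
print(round(p, 4), reject_null(p))
```

The further the sample mean lies from the claimed value, measured in standard errors, the smaller the p-value and the stronger the evidence against the null hypothesis.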
LINEAR PROBABILITY MODELS IN REGRESSION-- Regression with 1/0 Dependent Variable
Background
In many useful applications of regression analysis, the dependent variable
is qualitative (describing an attribute or quality) rather than quantitative.
In some cases, the qualitative dependent variable has only two values such
as home owner/not a home owner, purchased/did not purchase, defaulted/did
not default, loan accepted/loan rejected, and similar variables where the
data is usually recorded as 1s and 0s (often called binary or dummy variables).
Linear regression can be used to show the relationship between one or more
explanatory variables and a binary dependent variable, although linear regression
is not always the appropriate type of regression analysis in such situations.
The same procedures in software such as Excel or SPSS can be used to compute
the regression output, but the interpretation of the output differs when
a 1/0 dependent variable is used. Regressions of this kind are often called
Linear Probability Models. The basis for this name should become apparent
below.
Example
The data for this example are drawn from 93 metropolitan areas in the
U.S. The two variables used in the regression analysis are
NFL&MLB = 1 if the city has both an NFL and Major League Baseball team and 0 if not;
Population = metropolitan population of the city in millions.
NFL&MLB is used as the dependent variable and Population is the independent variable. The idea of the regression analysis is to determine how differences in population influence the likelihood of a city having both an NFL team and an MLB team. Twenty-three cities contained both sports and the remaining sixty-five did not. The populations ranged from 500,000 on the low end to 8,000,000 on the high end.
The results of linear regression are as follows:
Dependent Variable = NFL&MLB
| Variable | Coefficient | Standard Error | t-statistic | p-value |
| Constant | 0.01 | 0.05 | 0.07 | 0.99 |
| Population | 0.18 | 0.03 | 7.60 | 0.001 |
In equation form the regression output is
NFL&MLB = 0.01 + 0.18*Population.
If a particular city contains a population of 3 million people, then the predicted (estimated) value for NFL&MLB can be computed the same as in any regression equation.
Predicted NFL&MLB = 0.01 + 0.18*(3) = 0.55 .
However, the interpretation of the predicted value is different if the
dependent variable is a 1/0 variable as it is here. The predicted values
in these situations estimate the probability that the dependent variable
equals 1. In the example here, that means that the probability of a city
with 3 million people having both NFL and MLB teams is 0.55 (55 percent).
Because the predicted values for the dependent variable represent probabilities,
the slope coefficient for population equals the change in probability for
a one unit (1 million) increase in population. The coefficient in the example
here is 0.18. That means that for every 1 million person increase in population
the probability of a city containing both NFL and MLB franchises increases
by 0.18 (18 percentage points).
The constant (y-intercept) in this case means that if the population were zero, the estimated probability would be 0.01 (1 percent). Obviously, this represents a purely hypothetical value since no city in the data set has a population anywhere near zero.
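The arithmetic of the example can be written as a short function. This is only a sketch: the function name is made up for illustration, while the constant 0.01 and slope 0.18 are the estimates reported in the regression output above.

```python
def predicted_probability(population_millions, constant=0.01, slope=0.18):
    """Predicted probability that NFL&MLB = 1 from the linear probability model."""
    return constant + slope * population_millions


# A city of 3 million people: 0.01 + 0.18*3
prob_3_million = predicted_probability(3)

# The slope is the change in predicted probability per extra million people.
change_per_million = predicted_probability(4) - predicted_probability(3)
```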
Special Problems:
One needs to be aware of a couple of special issues that arise when
using linear regression with 1/0 dependent variables.
1. Probability Restrictions: True probabilities cannot be lower than zero -- an impossible event -- or higher than one -- a certain event. However, the predicted probabilities using linear regression may sometimes be below zero or above one. In the example above, for instance, a city with a population of 8 million will have a predicted probability of containing both football and baseball teams of 1.45 (0.01 + 0.18*8). The reason for this is that linear regression continues to make predictions along a straight line no matter how high or low these predictions are.
When using linear regression models, one should always check the predicted values to see how many, if any, of those values are outside of the 0 to 1 range and how far outside they are. If many are outside that range or a few are considerably outside that range, another type of regression analysis, such as "logistic regression," is the appropriate method rather than linear regression. Logistic regression fits an "S-shaped" curve to the data rather than a straight line. The top and bottom of the "S" can approach but never go beyond 0 and 1.
2. R-Square Values: In regression with quantitative dependent variables, such as weekly sales, the actual data on sales as well as the predicted values for sales from the regression are both in dollars. With a 1/0 dependent variable, the actual data are all 1s and 0s while the predicted values are quantitative estimates of probability such as 0.55 in the example above. This fact makes the R-square somewhat inaccurate as a measure of how well the model fits the data by biasing it toward lower values.
Typically, to assess how well the model fits the data, other measures are calculated in addition to R-square. The simplest method is to use 0.5 as a cutoff value -- any predicted value above 0.5 is treated as a prediction of a 1, and any predicted value below 0.5 is treated as a prediction of a 0. Then these predicted 1s and 0s are compared with the actual data, and the "percent of cases (observations) correctly predicted" can be computed.
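Both checks described under Special Problems can be sketched in a few lines. The populations and actual 1/0 outcomes below are hypothetical values invented for illustration (they are not the 93-city data set); only the equation 0.01 + 0.18*Population comes from the example.

```python
def predict(population_millions):
    """Predicted probability from the example equation."""
    return 0.01 + 0.18 * population_millions


def out_of_range(predictions):
    """Predicted 'probabilities' that fall below 0 or above 1."""
    return [p for p in predictions if p < 0 or p > 1]


def percent_correct(predictions, actual, cutoff=0.5):
    """Percent of cases where the cutoff-based prediction matches the actual 1/0 value."""
    hits = sum((p > cutoff) == (a == 1) for p, a in zip(predictions, actual))
    return 100 * hits / len(actual)


# Hypothetical cities (population in millions) and whether each has both teams.
populations = [0.5, 1.2, 2.0, 3.0, 4.5, 8.0]
actual = [0, 0, 0, 1, 1, 1]

predictions = [predict(p) for p in populations]
problem_values = out_of_range(predictions)   # only the 8 million city exceeds 1
hit_rate = percent_correct(predictions, actual)
```

If many values land in problem_values, or some are far outside the 0 to 1 range, that is the signal described above for moving to logistic regression.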
Determining Patterns (Components) in Time Series Data
The analysis of data collected at regular time intervals represents
an important area of business statistics because of two inter-related
reasons: 1) it permits patterns from past changes in a variable to be examined;
and 2) these past patterns can be used as a tool to forecast the future
values. Unlike regression analysis where data must be gathered on both
dependent and independent variables, time series methods can be used even
if data have been collected only on the variable of interest. This provides
a major advantage because it reduces the amount of data required. Chapter
14 in Siegel introduces some of the concepts of time series analysis. The
discussion in this supplement covers a few of the things that Chapter 14
omits or leaves a bit fuzzy. Some of the key terms are:
Y(t) = values for variable Y (such as sales) measured at regular time intervals (t) such as monthly. Y(t) is also referred to as "the level of Y" or a "time series on Y."
Y(t-1) = value for the variable Y in the prior period. For instance, with monthly sales data, Y(t-1) refers to the prior month's sales. Y(t-1) is also called the "lagged" value of Y or the "lag of Y."
Y(t) - Y(t-1) = difference in Y; it is also known as the "first difference" of Y as well as the "change in Y" from period to period. It measures how much the variable changed from its value in the prior period. For monthly sales, it shows how much sales changed from the prior month.
Structural Models = regression models using time-based data where several other variables (X-variables) are used as explanatory variables in predicting values of Y;
Time Series Models = equations which use past values of Y, past differences in Y, and information drawn from these past values to predict current values;
Time Series Components = patterns in the past movements of Y such as a trend component, a cyclical component, a seasonal component, or a random component.
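The lagged value and the first difference defined above can be computed directly from a list of observations. The sketch below uses made-up helper names and three illustrative sales values; the first period has no prior value, so it is marked as None.

```python
def lag(series):
    """Y(t-1): each period is paired with the prior period's value."""
    return [None] + series[:-1]


def first_difference(series):
    """Y(t) - Y(t-1): the change from the prior period."""
    return [None] + [series[t] - series[t - 1] for t in range(1, len(series))]


sales = [200, 208, 215]            # illustrative values, in thousands of dollars
lagged_sales = lag(sales)          # [None, 200, 208]
change_in_sales = first_difference(sales)  # [None, 8, 7]
```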
We will use weekly Sales for Company X from 1990 through 2001 to illustrate our points. The simplest (but sometimes misleading) way to determine patterns in the past history of sales is by looking at a graph of weekly sales plotted over time (Sales on the Y-axis and weeks & years on the X-axis) to see if there are any obvious trends, repetitive cycles, big jumps or dips during certain periods, and the like.
At first glance, the graph above appears to show that there may be an upward trend (a trend component) over the time frame of about $10 thousand per year. Also, there are some repetitive "ups and downs" (a cyclical component). If you look closely, you can also make out big dips of $20 to $30 thousand around the end of each year and an increase of a few thousand during the summer weeks (seasonal components).
While looking at graphs like the one above is a good first step, it leaves the size of the patterns to a lot of guesswork. To more precisely quantify the patterns in weekly sales, we can estimate an equation from the data that looks similar to a regression equation but uses only information on the variable under study or time. The data in our file would appear as below where "t" represents a particular week:
| Week (t) | Sales (in thousands of $) |
| 1 | 200 |
| 2 | 208 |
| 3 | 215 |
In the equation estimated below we will use the following definitions:
Sales(t) = weekly sales in thousands of dollars;
Trend = time variable counting the weeks; it starts at 1 and increases
by 1 unit each week of the sample; this looks for a linear "trend" component
(pattern) in the data;
Sales(t-1) = Sales in the prior week; this is the simplest means of
looking for past "cycles" in the data;
XMAS = a seasonal indicator variable equal to 1 if a week included
December 25 and 0 otherwise;
SUMMER = a seasonal indicator variable 1 for weeks from Memorial Day
to Labor Day and 0 otherwise.
The results of estimating an equation to predict weekly Sales with these
components appear below:
| Variable | Coefficient | Std. Error | t-value | p-value |
| Constant | 100 | 6.2 | 15.0 | 0.001 |
| Trend | 0.075 | 0.006 | 13.0 | 0.001 |
| Sales(t-1) | 0.50 | 0.03 | 15.00 | 0.001 |
| Xmas | -32.0 | 3.5 | 9.00 | 0.001 |
| Summer | 6.0 | 1.3 | 4.00 | 0.001 |
R-square = 0.82
Durbin-Watson = 2.00
Box-Pierce Q(12 lags) = 10.2 (p-value = 0.65)
Mean Weekly Sales = $250 (000); Standard deviation
Sales= $30 (000)
In equation form, this would be written
Sales(t) = 100 + 0.075*Trend + 0.50*Sales(t-1) - 32*XMAS + 6*SUMMER + error(t)
The coefficients in this equation are interpreted the same way as regression coefficients. For instance, the Trend coefficient of 0.075 means that for each week, sales increase by about $0.075 thousand ($75). Around Christmas, Sales drop by about $32 thousand. The "Durbin-Watson" statistic (2.00 in this case) is a measure of whether the errors in our model are dependent on each other (a bad thing). Values for the Durbin-Watson between about 1.6 and 2.4 are viewed as indicating independence of the errors (a good thing). The "Box-Pierce Q-Statistic" is another measure of this same thing.
Forecasts from Time Series Models
We can generate forecasts from this equation fairly easily. Suppose
that for the last week of the 12 years of data (t=624), actual sales were
300. The equation would forecast sales for the first week of 2002 (t=week 625
in the data) and the second week of 2002 (t=626) to be
Forecast Sales (t+1) = 100 + 0.075*(Trend = week 625) + 0.50*(Sales week t = 300) - 32*(XMAS = 0) + 6*(SUMMER = 0)
= 296.9
Forecast Sales (t+2) = 100 + 0.075*(Trend = week 626) + 0.50*(Forecast Sales week t+1 = 296.9) - 32*(XMAS = 0) + 6*(SUMMER = 0)
= 295.4
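The two forecasts can be reproduced with a small function. This is a sketch with a made-up function name; the coefficients (100, 0.075, 0.50, -32, 6) are the estimates from the table above. Note that the second forecast feeds the first forecast back in as the prior week's sales, because actual sales for week 625 are not yet known.

```python
def forecast_sales(week, prior_sales, xmas=0, summer=0):
    """Weekly sales forecast (thousands of dollars) from the estimated equation."""
    return 100 + 0.075 * week + 0.50 * prior_sales - 32 * xmas + 6 * summer


# Week 624 actual sales were 300; neither forecast week is a Christmas or summer week.
one_step = forecast_sales(week=625, prior_sales=300)        # 296.875 before rounding
two_step = forecast_sales(week=626, prior_sales=one_step)   # 295.3875 before rounding
```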
The goal in forecasting is to make accurate predictions. Errors = Actual Values - Forecasted Values. As in regression analysis, R-square is a measure of how well the patterns in the model account for past movements in the series. In addition to R-square, other measures are commonly used to evaluate how good the model is at forecasting (predicting) sales. Some of these are
Root Mean Square Error = a measure of the average size of the errors of the forecasts; it is the square root of (the sum of the squared errors divided by the number of errors);
Mean Absolute Error = a measure of the average size of the errors of the forecasts; it is the sum of the absolute values of the errors divided by the number of errors;
Mean Absolute Percent Error = the average error size as a percent of the mean of the variable being forecasted; it is the mean absolute error divided by the mean.
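The three measures can be sketched as follows. The function names and the sample numbers are hypothetical. The percent error here follows the definition given in the text (mean absolute error divided by the mean of the series), which differs from the per-observation version that many software packages report under the same name.

```python
from math import sqrt


def forecast_errors(actual, forecast):
    """Errors = actual values minus forecasted values."""
    return [a - f for a, f in zip(actual, forecast)]


def root_mean_square_error(errors):
    return sqrt(sum(e * e for e in errors) / len(errors))


def mean_absolute_error(errors):
    return sum(abs(e) for e in errors) / len(errors)


def mean_absolute_percent_error(errors, actual):
    """Mean absolute error as a percent of the mean of the actual series."""
    mean_actual = sum(actual) / len(actual)
    return 100 * mean_absolute_error(errors) / mean_actual


# Hypothetical weekly sales and forecasts, in thousands of dollars.
actual = [250, 260, 240, 250]
forecast = [245, 266, 243, 250]
errors = forecast_errors(actual, forecast)   # [5, -6, -3, 0]
```

Under this definition, a mean absolute error of 7 (thousand) against mean weekly sales of 250 (thousand) works out to 100*7/250 = 2.8 percent.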
"Static Forecasts" estimate the forecasts and errors using the data for the sample used to compute the time series equation. "Dynamic Forecasts" use data from outside the original sample to compute forecasts and errors.
For the model above, the forecasting diagnostics were as follows:
(Static) Forecast Diagnostics
Root Mean Square Error = $7500; Mean Absolute Error = $7000; Mean Absolute
Percent Error = 2.8%
In words, the average weekly error in our forecasts was $7000 to $7500, or about 2.8% of mean weekly sales.
More Complex Patterns:
The example above investigated some of the simpler patterns to be found
in a time series. More complicated patterns can be investigated in a number
of ways. For one, trends are not necessarily linear. Using a squared Trend
term can sometimes account for this. Sometimes the trends are much more
complex, requiring special methods. Second, cyclical patterns may be much
more subtle and complex requiring a second or third "lagged" value or even
lagged error terms. Sometimes, the errors from the forecasts can be used
to improve future forecasts (another cyclical pattern). Third, seasonal
patterns may not be as straightforward as we estimated above.
Also, other variables can be added to a time series equation. In the example above, the existence, type, or amount of advertising done in the prior week might be included as an explanatory variable. Finally, one of the most subtle but most important issues in looking for time series patterns is to realize that sometimes patterns can be misleading. What appears to be a trend or a cycle may be nothing more than a series of random steps. For example, in class we will see a graph of a variable that appears to display a downward trend and possibly repetitive cycles around this downward trend, but it is really nothing more than a series of random movements.
These kinds of time series that are a series of random steps are called "random walks." Chapter 14 discusses the idea a little. Random walks can masquerade as seeming trends or cycles in data. A random walk contains random movements from one step to the next. Suppose you selected a starting point, say 50, then drew a ball at random (say 2) and put your new mark 2 places above your starting mark, so you are now at 52. Then you draw a new ball at random (say -5) and place the next mark 5 places below your prior mark, so you are now at 47. This makes the change in your position random. That is exactly how the graph shown above was generated.
Such a random walk (series of random steps) looks very different from a series where balls are drawn from a hopper where 50 is the mean and balls with numbers ranging from 45 to 55 are in the hopper and drawn at random. A series generated by this procedure is a "random series." It is also called a "white noise" series. Its graph probably appears more like the one people have in mind when they think of randomness.
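The two ball-drawing procedures can be imitated with Python's random module. This is only a sketch: the function names are invented, and the range of step sizes for the random walk (-5 to 5) is an assumption, since the text gives only two example draws. The hopper range of 45 to 55 for the random series is taken from the description above.

```python
import random


def random_walk(start, n_steps, rng, low=-5, high=5):
    """Each draw is a step added to the prior position (assumed step range low..high)."""
    position = start
    path = [position]
    for _ in range(n_steps):
        position += rng.randint(low, high)
        path.append(position)
    return path


def white_noise(n_draws, rng, low=45, high=55):
    """Each draw is the value itself, independent of the prior value."""
    return [rng.randint(low, high) for _ in range(n_draws)]


rng = random.Random(1)                 # fixed seed so the series can be regenerated
walk = random_walk(50, 200, rng)       # can drift far from 50 and look like a trend
noise = white_noise(200, rng)          # always stays between 45 and 55
```

Plotting the two lists side by side (for example in Excel) shows the difference the text describes: the random series bounces around 50, while the random walk wanders.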
Distinguishing a random series from one with patterns is not too difficult.
The graph above has no obvious pattern. More precisely, an equation for
Stock Price(t) that included lagged stock price, trends, or seasons would
all have coefficients near zero. Distinguishing random walks is a bit trickier.
As the graph for the random walk shows, there seems to be a trend. If an
equation for stock price (t) were estimated with a Trend, the Trend coefficient
would also appear large. The key is the coefficient on lagged stock prices
(t-1). A random walk will have a lagged coefficient near 1.0 (usually 0.9
to 1.0). The random walk above has this equation: Stock Price(t) = 0.17
+ 0.98*Stock Price(t-1). The coefficient of 0.98 means that there is a
random walk component in this series. Further analysis should be conducted
using changes in the stock prices instead of the original levels.
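The lagged-coefficient check can be sketched with a simple least-squares fit of Y(t) on a constant and Y(t-1). Everything here is illustrative: the function name is made up, and the two series are simulated with normally distributed steps rather than taken from the stock price data behind the graphs.

```python
import random


def lag_regression(series):
    """Least-squares fit of Y(t) = constant + coefficient*Y(t-1); returns (constant, coefficient)."""
    x = series[:-1]            # Y(t-1)
    y = series[1:]             # Y(t)
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    coefficient = sxy / sxx
    return mean_y - coefficient * mean_x, coefficient


rng = random.Random(2002)

# Simulated random walk: each value is the prior value plus a random step.
walk = [50.0]
for _ in range(2000):
    walk.append(walk[-1] + rng.gauss(0, 1))

# Simulated random (white noise) series: each value is drawn independently around 50.
noise = [50 + rng.gauss(0, 3) for _ in range(2000)]

walk_constant, walk_lag_coef = lag_regression(walk)      # lag coefficient close to 1
noise_constant, noise_lag_coef = lag_regression(noise)   # lag coefficient close to 0
```

Running the same fit on the first differences of the walk (the changes rather than the levels) would give a lag coefficient near zero, which is why the text recommends analyzing changes in such cases.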
Introduction: Experimental design refers to actively controlling the process by which data is generated so that the effects of one or more variables can be better isolated and measured. These methods are commonplace in the natural and life sciences, where many experiments are conducted within laboratories and most of the variables influencing outcomes can be controlled. However, the methods are also useful in other scientific and business settings where only some of the factors are controllable. In business, these methods have been most widely used in production management settings to test different techniques or machinery. However, the same ideas are adaptable to almost any managerial or personnel setting and can range from very simple methods to very complicated designs.
Factor(s): a variable(s) that wholly or partly determines changes in another variable usually designated as the "response variable"; settings or "levels" of these factors refer to the different possible values the factor can take; in many situations, these values are qualitative
Experimental Data: Data that is generated where one or more of the factors influencing the outcomes is actively manipulated so as to better isolate or eliminate its effect or the effect of other factors;
Observational Data: Data that is collected without any active manipulation of the factors that influence the outcomes
Experimental Design: The plan for manipulating factors when generating and observing outcomes; the plan may range from simply a change to a setting (level) of one factor -- a simple "intervention" -- to a very extensive design that holds some factors constant while changing the settings (levels) of other factors
Overview of Steps in an Experimental Design:
While the specifics of a design vary based on the details of the industry,
company, and particular issue, an organized approach to setting up the
experiment should, more or less, follow the items presented below.
(Adapted from Coleman and Montgomery, Systematic Approach to Planning
for a Designed Industrial Experiment, Technometrics, 1993).
1. Objectives of the experiment: should be specific,
measurable, and relevant
2. Background: existing theoretical or statistical
knowledge concerning the response variable or factors, if any, as well
as how the current experiment fits in with this background
3. Response variable: identify how the variable
is measured and the typical operating means and ranges (if known)
4. List factors & determine settings/controls:
a) Factors of main interest -- identify the
variables influencing the response variable whose effects or combined effects
(interactions) the design is intended to help isolate;
determine the desired settings (levels) of these variables during the experiment
including desired interactions;
b) Factors to be held constant -- identify
these factors and the "allowable" ranges of
variation
c) Other controllable variables -- identify
the other variables known or likely to influence the response variable
that can be actively manipulated; determine the strategy for filtering
out their effects (such as randomizing the variables of main interest among
different settings of these factors or
d) Non-controllable factors -- identify the
other variables that influence the response variable that cannot be actively
manipulated; if these can be measured, determine the specific measurement
strategy; if these cannot be measured, identify the expected impact, if
any on the experimental outcomes
5. Restrictions: identify and list cost, legal,
managerial, or other limitations placed on the ability to manipulate factors;
6. Oversight and setup: identify responsibilities
of personnel in the experiment and whether or not trial runs should be
conducted
7. Analysis techniques: if possible, identify
the most likely statistical methods that will be used to analyze the data
from the experiment (such as regression, plots, ANOVA, ...)
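One way to carry out the randomizing mentioned in step 4 is to list every combination of the settings for the factors of main interest and then run them in random order. The sketch below is only one possible approach, not the procedure from Coleman and Montgomery; the function name and the factor names and levels ("machine", "shift") are hypothetical.

```python
import itertools
import random


def randomized_run_order(factor_levels, replicates, rng):
    """List every combination of factor settings, repeated, in a shuffled run order."""
    names = list(factor_levels)
    combos = itertools.product(*(factor_levels[name] for name in names))
    runs = [dict(zip(names, combo)) for combo in combos for _ in range(replicates)]
    rng.shuffle(runs)   # random order spreads uncontrolled influences across settings
    return runs


# Hypothetical factors of main interest and their settings (levels).
factors = {"machine": ["A", "B"], "shift": ["day", "night"]}
runs = randomized_run_order(factors, replicates=3, rng=random.Random(0))
```

Each of the four machine/shift combinations appears three times in runs, but the order in which they are carried out is random.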
SUPPLEMENTAL READING 4 -- STOCHASTIC SIMULATIONS
Introduction:
Simulations are a growing tool used in both academic and business
settings due to advances in the computational power of computers. Even through
most of the 1980s, most simulations of any sophistication were conducted
by academics, a few governmental agencies such as the U.S. Department of
Defense, and very large businesses such as Bell Labs employing people with
significant mathematical/statistical training. With various point-and-click
software applications, complex and powerful simulations can now be conducted
with relative ease.
Simulations, in general, are studies where a set of assumptions
is combined with data to determine what outcomes would be found under
those conditions. In many settings, simulations go by the name "what
if" analysis because the investigator is considering what will happen if
a set of hypothetical conditions or data hold true. In statistics, the
predicted values from a regression are a type of simulation where the equation
and coefficients are used with the data for the X-variables to generate
the predicted (forecasted) values for Y. Forecasted values from a time
series model are another example of simulated values. Cost estimates for
a project based on identifying costs of similar projects or appraisal values
for a house based on average values of similar houses are examples of very
simple, simulated outcomes.
Simulations (other than just guesses) all share two parts
Simulated Value = Model & Data.
Simulated values are generated by assuming different "what if" scenarios for either the model, the data or both.
Simple Example
A simulation might start with a model as simple as the basic accounting
definition for net worth which subtracts one thing from another and then
generates a what if scenario by merely multiplying assets by 2 times their
actual values:
Model: Net Worth = Assets - Liabilities
Data (to be used in the model): Liabilities = actual values
Assets = 2*actual values
In this simple example, the only simulated or hypothetical part is
the asset data values we plug in since the actual values for liabilities
are used, and the model is a basic definition and not a hypothetical relationship.
Spreadsheet software such as Excel makes doing these and a little more complex
simulations relatively easy by permitting various columns of numbers to
be combined together in formulas determined by the user as well as permitting
users.
The complexity of a simulation increases as the model becomes more
complex and more of the data and model are hypothetical. Still, no matter
how sophisticated the simulation, the simulated outcomes are driven by
a model (one or more equations) and data (real or hypothetical input provided
by the user). Even simulations in the form of computer games -- such as
Microsoft Flight Simulator -- that present pictures to the users are really
just combinations of equations (model) and data.
Stochastic Simulations:
The example described above is more technically called a "deterministic"
simulation because all of the numbers used are fixed at the outset,
even the hypothetical values. A different class of simulations is one where
the data or parameters can take on different values that, to some extent,
are random. These kinds of simulations are called "stochastic" simulations
or "Monte Carlo" simulations. They improve simulations by permitting the
user to incorporate uncertainty more explicitly into the hypothetical scenarios.
In stochastic simulation, the idea is not to just say we don't know
the future, therefore, let's just pick any number from 1 to 1 million at
random. Instead, users assume that they can describe the likelihood of
different outcomes but with some lingering uncertainty about the specifics.
Therefore, the typical procedure is to pick some probability distribution,
such as the normal distribution, that the user thinks describes the likelihood
of outcomes, and then let the computer software generate hypothetical values
at random that fit that probability distribution.
Example -- A Deterministic Simulation:
Suppose we are constructing a new house. To simplify matters, suppose
we also know the final expense is driven by the size of the house (sq.
ft. heated & cooled) and the quality of the house (Premium = 1; Standard
= 0). We also know that jumps in lumber prices (% change from current date
on 2x4 prices) affect the expense. Based on past experience and analysis, suppose we have
the following relationship (MODEL):
Housing Expense = $10,000 + 80*(sq. ft) + 20*(Premium*sq. ft) + 20,000*(% Change Lumber)
Data: we could plug in values for (sq. ft.), (premium), and (lumber)
to simulate an outcome. If we use 3000 sq. ft., Premium = 1, and Lumber
= 1%, the simulated expense would be
$330,000 = 10,000 + 80*(3000) + 20*(1*3000) + 20,000*(1).
This is just like the predicted values we have generated with regression analysis and is another example of a "deterministic" simulation where the hypothetical parts of the simulation are fixed in advance.
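The deterministic calculation can be written as a function so that other "what if" values are easy to plug in. The function and argument names are made up for illustration; the coefficients are the ones given in the model above.

```python
def housing_expense(sq_ft, premium, lumber_pct_change):
    """Housing expense in dollars from the model in the text."""
    return (10_000
            + 80 * sq_ft
            + 20 * premium * sq_ft
            + 20_000 * lumber_pct_change)


# 3000 sq. ft., premium quality, 1% rise in lumber prices
expense = housing_expense(sq_ft=3000, premium=1, lumber_pct_change=1)   # 330000
```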
Example-- Simulation with Stochastic Data Values
Now suppose everything else about the housing expense model and
the data are the same, but we are not really sure about the changes
in lumber prices. Our best guess is that the average change will be zero,
but they have often varied 1 or 2 percent up or down and occasionally a
lot more. Rather than just plugging in a number such as 1% as we did above,
we decide to conduct a stochastic simulation where we generate 100 different
housing expenses for a 3000 sq. ft, premium house but assume that lumber
price changes for these 100 cases are drawn from a normal distribution
with a mean of 0% and a standard deviation of 2%, designated as Normal(0, 2).
So now the setup is
Housing Expense = $10,000 + 80*(sq. ft = 3000) + 20*(Premium*sq. ft = 1*3000) + 20,000*(% change lumber prices: Normal(0, 2))
After the software computes housing expenses for these 100 cases, we can examine them and find out what was the average housing expense, what was the highest and lowest expense, and what was the typical range of expense. This kind of simulation provides much more information on which we could base our decisions.
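A minimal sketch of this stochastic simulation, using Python's random module in place of spreadsheet software, is shown below. The function name, the seed, and the choice of summary statistics are assumptions made for illustration; the model and the Normal(0, 2) distribution for lumber price changes come from the setup above.

```python
import random
from statistics import mean, stdev


def simulate_expenses(n_cases, rng, sq_ft=3000, premium=1):
    """Housing expenses where the lumber price change is drawn from Normal(0, 2)."""
    expenses = []
    for _ in range(n_cases):
        lumber_pct_change = rng.gauss(0, 2)   # mean 0%, standard deviation 2%
        expenses.append(10_000 + 80 * sq_ft + 20 * premium * sq_ft
                        + 20_000 * lumber_pct_change)
    return expenses


cases = simulate_expenses(100, random.Random(42))   # seed chosen arbitrarily
summary = {
    "average": mean(cases),
    "lowest": min(cases),
    "highest": max(cases),
    "std_dev": stdev(cases),
}
```

Because the average lumber price change is zero, the simulated expenses center on $310,000, and the standard deviation of about 20,000*2 = $40,000 describes the typical range around that figure.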
Example -- Simulation with Stochastic Data Values &
Model Parameters (coefficients)
To incorporate reality a little better, we could also assume that
the coefficients (also called the parameters) in the model (80, 20, 20000)
are, themselves, just estimates. The exact relationships are not known
with certainty and can change to some extent. Suppose we think that the
coefficient for sq. ft. is 80 on average but may differ by a standard
deviation of 5, we think the premium coefficient is 20 on average with
a standard deviation of 2 and the coefficient on lumber prices is 20,000
on average with a standard deviation of 1000. Now our simulation setup
with a 3000 sq. ft premium house is
Housing Expense = $10,000 + Normal(80, 5)*(Sq. ft. = 3000)
+ Normal(20, 2)*(premium*sq. ft. = 3000*1)
+ Normal(20000, 1000)*(%change lumber prices: Normal (0, 2))
In this case, the parameters (coefficients) in the model are generated by drawing numbers from normal distributions, as are the data values for lumber price changes. We could again generate, say, 100 cases and find the average housing expense, the highest and lowest expense, and the typical range of expenses.
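The same idea extends to random coefficients. In the sketch below, each case draws its own coefficients and its own lumber price change from the normal distributions listed in the setup; the function name and seed are hypothetical, and the draws are assumed to be independent of one another.

```python
import random
from statistics import mean, stdev


def simulate_expenses_uncertain(n_cases, rng, sq_ft=3000, premium=1):
    """Housing expenses with random coefficients and a random lumber price change."""
    expenses = []
    for _ in range(n_cases):
        sq_ft_coef = rng.gauss(80, 5)            # Normal(80, 5)
        premium_coef = rng.gauss(20, 2)          # Normal(20, 2)
        lumber_coef = rng.gauss(20_000, 1_000)   # Normal(20000, 1000)
        lumber_pct_change = rng.gauss(0, 2)      # Normal(0, 2)
        expenses.append(10_000
                        + sq_ft_coef * sq_ft
                        + premium_coef * premium * sq_ft
                        + lumber_coef * lumber_pct_change)
    return expenses


cases = simulate_expenses_uncertain(100, random.Random(42))   # seed chosen arbitrarily
average, lowest, highest = mean(cases), min(cases), max(cases)
spread = stdev(cases)
```

The average is still centered near $310,000, but the spread is wider than when only lumber prices were random, because uncertainty about the coefficients now adds to the uncertainty about the data.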
Questions:
3a. Briefly describe the data set used in the assignment and measurement
of specific variables.
b. Based on your output, what would "typical" kilowatt hours used per
day be? Is the data set symmetric and what are outlying values?
c. Do winter months have higher gas usage than other months? Conduct
a test where the null hypothesis is that winter month gas usage is the same
as other months.
EXAMPLE OF PRETTY GOOD ANSWERS
3. a. The data set consisted of 5 variables related to the monthly
electric and gas usage of residential utility customers from July 1990
to June 1998. The variables included were a month identifier, average kilowatt
hours used per day, gas thermal units used, average daily temperature,
and number of days in the month.
b. Based on the histogram and descriptive statistics, typical kilowatt hours used per day ranged from about 30 to 40 per month. The mean was about 34 and the median was 37 with a standard deviation of 3. The data was skewed to the right with a few large outliers well beyond 50 hours per day.
c. The means for winter month gas usage (December, January, February) were about 30 percent higher than for other months. Based on a low p-value (0.005), a hypothesis that winter months and other months have the same mean could be rejected with strong confidence.
EXAMPLE OF PRETTY BAD ANSWERS
3. a. It had some variables about utility customers.
b. 34. A few big outliers.
c. winter had higher