Linear Regression and Finding the Line of Best Fit

Lesson Objectives

• Understand the concept of linear regression
• Use linear regression to find the line of best fit
• Interpret slope and y-intercept of the regression line
• Apply the linear model to make predictions
• Understand the limitations of linear regression models

Common Core Standards

• S.ID.6: Represent data on two quantitative variables on a scatter plot, and describe how the variables are related.
• S.ID.7: Interpret the slope (rate of change) and the intercept (constant term) of a linear model in the context of the data.
• S.ID.8: Compute (using technology) and interpret the correlation coefficient of a linear fit.

Prerequisite Skills

• Understanding linear functions and equations
• Plotting points on the Cartesian plane
• Calculating slope and y-intercept

Key Vocabulary

• Linear regression
• Line of best fit
• Scatter plot
• Correlation coefficient

Warm-up Activity (5 minutes)

Display a scatter plot of real-world data showing the relationship between hours spent studying per week and SAT scores. Use the following dataset (source: College Board, 2019 SAT Suite of Assessments Annual Report):

Hours Studied Per Week | Average SAT Score
0 | 1050
1-5 | 1090
6-10 | 1150
11-15 | 1190
16-20 | 1220
More than 20 | 1240

Note: For the "More than 20" category, we'll use 25 hours as an approximation for graphing purposes.

Ask students to describe the relationship they observe between the variables and sketch a line that best fits the data. Discuss how this line could be used to make predictions about SAT scores based on study time.

Prompt students with questions such as:

• What trend do you notice in the data?
• How does the average SAT score change as study time increases?
• Is the relationship between study time and SAT scores perfectly linear? Why or why not?
• If a student studies for 8 hours per week, what SAT score might you predict based on this data?

Teach (25 minutes)

Introduce the concept of linear regression as a statistical method for finding the line of best fit for a set of data points. Explain that this line minimizes the sum of squared vertical distances from the data points to the line.
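For classes ready for the notation, the least-squares idea can be written out as follows. This is a sketch under stated assumptions: there are n data points (x_i, y_i), the line is y = mx + b, and the symbols S, n, x_i, y_i, and the bars for the means are notation introduced here for illustration rather than taken from the lesson.

```latex
% Quantity being minimized: the sum of squared vertical distances
\[
S(m, b) = \sum_{i=1}^{n} \bigl(y_i - (m x_i + b)\bigr)^2
\]
% Slope and intercept that minimize S
\[
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - m\,\bar{x}
\]
```

The second line shows that the line of best fit always passes through the point of means, since substituting x = x̄ gives y = ȳ.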

Definitions

• Linear regression: A statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.
• Line of best fit: The straight line that best represents the relationship between two variables in a scatter plot.
• Scatter plot: A graph that shows the relationship between two variables as a collection of points.
• Correlation coefficient: A measure of the strength and direction of the linear relationship between two variables, ranging from -1 to 1.
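To make the correlation coefficient concrete, the following minimal Python sketch computes it from the standard Pearson formula. The function name pearson_r and the three small datasets are hypothetical, chosen only for illustration; they are not part of the lesson's data.

```python
import math

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustration data, not taken from the lesson.
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # points lie exactly on a rising line
falling = [10, 8, 6, 4, 2]   # points lie exactly on a falling line
scattered = [2, 5, 3, 8, 7]  # upward trend with some scatter

r_rising = pearson_r(xs, rising)
r_falling = pearson_r(xs, falling)
r_scattered = pearson_r(xs, scattered)
print(r_rising, r_falling, round(r_scattered, 2))
```

The first two datasets give 1 and -1, and the third gives roughly 0.81. The falling dataset is a useful discussion point: a negative slope can still mean a perfectly strong linear relationship.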

Use this slide show to review these definitions:

https://www.media4math.com/library/slideshow/linear-regression-vocabulary

Developing a Linear Regression Model

Use the following dataset for average height vs. average weight for American males (source: Centers for Disease Control and Prevention, National Health and Nutrition Examination Survey, 2015-2018):

Height (inches) | Weight (pounds)
67 | 160
68 | 165
69 | 170
70 | 175
71 | 180
72 | 185
73 | 190
74 | 195

Walk through the process of graphing the data set and performing linear regression using the Desmos graphing calculator:

1. Go to www.desmos.com/calculator
2. Click on the + button in the upper left corner and select "Table"
3. Enter the height data in column 1 and the weight data in column 2
4. Click on the + button again and select "f(x) expression"
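5. Enter y1 ~ mx1 + b (Desmos displays the 1s as subscripts, matching the table's column headers; the tilde ~ tells Desmos to perform a regression instead of graphing an equation)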
6. Desmos will display the scatter plot and the line of best fit, along with other statistics.
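Teachers who want to cross-check the Desmos output can reproduce the fit in a few lines of Python. This is a minimal sketch, not part of the Desmos workflow: the function name least_squares_line is hypothetical, and the code assumes the standard least-squares formulas for slope and intercept.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of products of deviations / sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Height and weight columns from the lesson's table.
heights = [67, 68, 69, 70, 71, 72, 73, 74]          # inches
weights = [160, 165, 170, 175, 180, 185, 190, 195]  # pounds

slope, intercept = least_squares_line(heights, weights)
print(f"y = {slope:g}x + ({intercept:g})")
```

For this dataset the script prints y = 5x + (-175), which agrees with the equation y = 5x - 175 used in this lesson.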

Here is a Desmos activity you can use with this data set:

After performing the linear regression, present the following table with the results:

Aspect | Value
Linear function equation | y = 5x - 175
y-intercept | -175
Slope | 5
Correlation coefficient | 1

Explain how to interpret each of these values:

• The linear function equation represents the line of best fit.
• The y-intercept (-175) represents the theoretical weight when height is 0 (not meaningful in this context).
• The slope (5) indicates that for each inch increase in height, weight increases by 5 pounds on average.
• The correlation coefficient (1) shows a perfect positive linear relationship between height and weight in this dataset.

Using the Linear Model for Predictions

Demonstrate how to use the linear function equation y = 5x - 175 to calculate the weight for someone whose height is between values in the table.

Example: Calculate the predicted weight for someone who is 70.5 inches tall.

y = 5(70.5) - 175
y = 352.5 - 175
y = 177.5

Explain that the model predicts a weight of 177.5 pounds for someone who is 70.5 inches tall.
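The same calculation can be wrapped in a small function so students can try several in-between heights quickly. This is a sketch only; the function name predict_weight is hypothetical, and the extra heights tried are arbitrary values inside the table's range.

```python
def predict_weight(height_in):
    """Predicted weight in pounds from the lesson's model y = 5x - 175."""
    return 5 * height_in - 175

# The worked example from the lesson.
example = predict_weight(70.5)
print(example)  # 177.5

# A few other heights that fall between rows of the table.
for height in (67.5, 69.25, 73.5):
    print(height, "in ->", predict_weight(height), "lb")
```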

Limitations of the Linear Regression Model

Discuss the following limitations:

• Extrapolation: The model may not be accurate for heights outside the range of the data (67-74 inches). Using it to predict weights for very short or very tall individuals could lead to unrealistic results.
• Assumption of linearity: The model assumes a perfectly linear relationship, which may not always be true in real-world scenarios.
• Oversimplification: The model considers only height as a factor in determining weight, ignoring other important variables like age, body composition, and lifestyle factors.
• Data quality: The accuracy of the model depends on the quality and representativeness of the data used to create it.
• Correlation vs. causation: While there's a strong correlation between height and weight, the model doesn't imply that height causes weight or vice versa.
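The extrapolation limitation is easy to demonstrate by plugging in heights far outside the 67-74 inch range. The sketch below assumes the lesson's model y = 5x - 175; the function name predict_weight, the constants DATA_MIN and DATA_MAX, and the test heights are hypothetical choices for illustration.

```python
def predict_weight(height_in):
    """Predicted weight in pounds from the lesson's model y = 5x - 175."""
    return 5 * height_in - 175

# Range of heights actually present in the lesson's dataset.
DATA_MIN, DATA_MAX = 67, 74

results = {}
for height in (20, 35, 70, 90):
    in_range = DATA_MIN <= height <= DATA_MAX
    label = "within data range" if in_range else "extrapolation"
    results[height] = predict_weight(height)
    print(f"{height} in -> {results[height]} lb ({label})")
```

The model predicts -75 pounds at 20 inches and 0 pounds at 35 inches, which are clearly impossible. Only the 70-inch prediction falls inside the range the data supports.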

Emphasize that understanding these limitations is crucial for appropriate use and interpretation of linear regression models.

Review (15 minutes)

Provide students with the following real-world dataset showing the relationship between age and average income for US adults (source: U.S. Bureau of Labor Statistics, Current Population Survey, 2020):

Age Group | Average Annual Income ($)
20-24 | 33,280
25-34 | 52,052
35-44 | 67,340
45-54 | 70,356
55-64 | 70,616
65+ | 56,632

Guide students through the process of creating a scatter plot, finding the regression line, and interpreting the results using Desmos or a graphing calculator. Have students work in pairs to perform linear regression and answer questions about the meaning of the slope and y-intercept.

After completing the linear regression, present the following summary:

Aspect | Value
Linear function equation | y = 1057.8x + 33,957
Slope | 1057.8
y-intercept | 33957
Correlation coefficient | 0.82

Discuss the interpretation of these values:

• The linear function equation represents the line of best fit for the age-income relationship.
• The slope (1057.8) indicates that, on average, annual income increases by $1,057.80 for each year increase in age.
• The y-intercept (33957) represents the theoretical average annual income at age 0 (not meaningful in this context).
• The correlation coefficient (0.82) shows a strong positive linear relationship between age and income, although it's not perfect.

Have students discuss the limitations of this model, such as:

• The use of age groups instead of individual ages may affect the precision of the model.
• The model doesn't account for factors other than age that might influence income.
• The relationship may not be truly linear, especially at the upper end of the age range.
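One quick way to let students see the last limitation is to look at how income changes from each age group to the next. The sketch below uses only the incomes from the table; the variable names are hypothetical. It does not fit a line, so it does not depend on which age is chosen to represent each group.

```python
# Age groups and average annual incomes from the lesson's table.
age_groups = ["20-24", "25-34", "35-44", "45-54", "55-64", "65+"]
incomes = [33280, 52052, 67340, 70356, 70616, 56632]

# Change in income from each age group to the next.
changes = [later - earlier for earlier, later in zip(incomes, incomes[1:])]

for start, end, change in zip(age_groups, age_groups[1:], changes):
    print(f"{start} -> {end}: {change:+,}")
```

The changes shrink from +18,772 to +260 and then turn negative (-13,984) for the 65+ group. A single straight line cannot capture a rise, a plateau, and a decline, which is why a linear model fits this data imperfectly.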

Assess (10 minutes)

Administer a 10-question quiz to assess student understanding of linear regression concepts and skills.

Quiz

1. What is the purpose of linear regression?

2. How is the line of best fit determined in linear regression?

3. What does the slope of a regression line represent?

4. What does the y-intercept of a regression line represent?

5. What does a correlation coefficient of 0.95 indicate about the relationship between variables?

6. Given the regression equation y = 2.5x + 10, what is the slope?

7. In the equation y = 2.5x + 10, what does the 10 represent?

8. True or False: A negative slope in a regression line always indicates a weak correlation.

9. What is the range of possible values for the correlation coefficient?

10. When might linear regression not be an appropriate method for analyzing a dataset?

Answer Key

1. To find the best-fitting straight line through a set of data points
2. By minimizing the sum of squared vertical distances from data points to the line
3. The rate of change in the dependent variable for each unit change in the independent variable
4. The predicted value of the dependent variable when the independent variable is zero
5. A very strong positive linear relationship between the variables
6. 2.5
7. The y-intercept
8. False
9. -1 to 1
10. When the relationship between variables is not linear