# Loading packages
library(ggplot2)

# Setting default theme
theme_set(
  theme_minimal() +
    theme(
      plot.title = element_text(hjust = 0.5, face = "bold", size = rel(1.4)),
      axis.title = element_text(size = rel(1))
    )
)
For the past few years I’ve been a statistics tutor in the psychology department at Maynooth University. We cover everything from the basics of descriptive statistics and graphs to the nitty-gritty of t-tests, correlations, and ANOVAs. However, there’s one topic where I always slow down and dive deep: regression analysis. Why? Because linear regression is basically the Swiss Army knife of statistics—it’s everywhere, it’s versatile, and it’s way cooler than it sounds.
Seriously, linear regression is like that one friend who shows up at every party. Lectures? Check. Assignments? Obviously. Research papers? You bet. Even casual chats about data somehow always circle back to it. It’s the ultimate tool for understanding how variables play together. Whether you’re predicting exam scores based on study hours (or lack thereof) or calculating how much coffee it takes to survive finals week (spoiler: a lot), regression has your back.
One thing I always like doing is having students calculate regression coefficients by hand. It’s a great way to understand the mechanics of the model and appreciate the magic of least squares. But, full disclosure: I always forget the formulas mid-session and end up frantically Googling them. So, I decided to write them down here—partly for fun, partly as a future reference for myself (hey future me!!).
The linear regression model
At its core, linear regression is all about finding relationships between variables. The model is a simple equation that describes how a dependent variable (Y) relates to one or more independent variables (X). In its simplest form, it looks like this:
Y = \beta_0 + \beta_1X + \epsilon
where:
- Y is the dependent variable (the thing you’re trying to predict).
- X is the independent variable (the thing you’re using to make predictions).
- \beta_0 is the intercept (where the line crosses the Y-axis).
- \beta_1 is the slope (how much Y changes for every unit change in X).
- \epsilon is the error term (because let’s face it, life is messy).
Generating some data
To make this more concrete, let’s create a small dataset in R (the packages and plot theme were already set up at the top of the post). We’ll simulate some data for study hours and exam scores, and then use linear regression to model the relationship between them.
set.seed(123)
# Hours of study per week (our x variable)
study_hours <- rnorm(10, mean = 7, sd = 2)
# Exam scores (our y variable)
exam_scores <- 25 + 5 * study_hours + rnorm(10, mean = 0, sd = 2)
test_data <- data.frame(
  study_hours = ceiling(study_hours),
  exam_scores = ceiling(exam_scores)
)
head(test_data)
study_hours exam_scores
1 6 57
2 7 59
3 11 77
4 8 61
5 8 61
6 11 81
If we plot this data we can see the relationship between study hours and exam scores:
exam_plot <- test_data |>
  ggplot(aes(x = study_hours, y = exam_scores)) +
  geom_point(size = 4) +
  labs(
    title = "Exam scores vs. study hours",
    x = "Study hours",
    y = "Exam scores"
  )
exam_plot
From the plot, you can see a clear trend: the more hours a student spends studying, the higher their exam score tends to be. This is the relationship we want to model using linear regression.
Fitting a model in R
In R, fitting a linear regression model is as simple as using the lm() function. We provide the formula for the model and the dataset, and R does the rest. Let’s fit a linear model to our data:
fit <- lm(exam_scores ~ study_hours, data = test_data)
broom::tidy(fit)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 21.3 3.93 5.42 0.000627
2 study_hours 5.23 0.496 10.5 0.00000569
The lm() function fits a linear model where exam_scores is the dependent variable (Y) and study_hours is the independent variable (X). The output gives us the estimated coefficients for the intercept (\beta_0) and slope (\beta_1).
- The intercept (\beta_0) is 21.33, meaning a student who studies 0 hours is predicted to score approximately 21% on the exam.
- The slope (\beta_1) is 5.23, indicating that for every additional hour of study, a student’s predicted exam score increases by 5.23%.
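As a quick example of using these coefficients (the 10 hours below is just an arbitrary value I picked), we can plug a number into the fitted model:
# Predicted score for a student who studies 10 hours per week
predict(fit, newdata = data.frame(study_hours = 10))

# Or by hand: 21.33 + 5.23 * 10 = 73.63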
Visually, this looks like:
exam_plot <- exam_plot +
  geom_smooth(
    method = "lm",
    colour = ggokabeito::palette_okabe_ito(order = 5),
    se = FALSE
  ) +
  # Adding equation of the line
  annotate(
    "text", x = 6, y = 80,
    label = paste0(
      "Y = ", round(coef(fit)[1], 2), " + ", round(coef(fit)[2], 2), "X"),
    color = ggokabeito::palette_okabe_ito(order = 5),
    size = rel(5), fontface = "bold"
  )
exam_plot
But how are these values actually calculated?
Calculating regression coefficients by hand
While R’s lm() function makes fitting a linear regression model incredibly easy, it’s helpful to understand how the coefficients (\beta_0 \: \text{and} \: \beta_1) are derived.
The goal of linear regression is to find the best-fitting line through the data, which is done by minimizing the sum of squared errors. The sum of squared errors is calculated as:
SSE = \sum_{i=1}^{n} (Y_i - \hat{Y_i})^2
where:
- Y_i is the actual value of the dependent variable.
- \hat{Y_i} is the predicted value of the dependent variable.
- n is the number of observations.
# Add predicted values to the dataset
test_data <- test_data |>
  dplyr::mutate(predicted = coef(fit)[1] + coef(fit)[2] * study_hours)

# Visualizing SSE
exam_plot +
  geom_point(
    data = test_data,
    aes(x = study_hours, y = predicted),
    size = 2.5,
    colour = ggokabeito::palette_okabe_ito(order = 1)
  ) +
  geom_segment(
    data = test_data,
    aes(x = study_hours, xend = study_hours,
        y = exam_scores, yend = predicted),
    colour = ggokabeito::palette_okabe_ito(order = 2),
    linewidth = 1.25, linetype = "dashed", alpha = 0.7
  )
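To put an actual number on the SSE, here’s a quick sketch using the predicted column we just added (it should match the sum of the squared residuals stored by lm(), up to rounding):
# Sum of squared errors for the fitted line
sum((test_data$exam_scores - test_data$predicted)^2)

# Equivalent, using the residuals from the model object
sum(residuals(fit)^2)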
To minimize the SSE, we need to find the values of \beta_0 (intercept) and \beta_1 (slope) that make the line fit the data as closely as possible. The formulas for \beta_0 and \beta_1 are:
\begin{align*} \beta_1 &= \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sum(X_i - \bar{X})^2} \\[18pt] \beta_0 &= \bar{Y} - \beta_1\bar{X} \end{align*}
where:
- X_i and Y_i are the individual data points.
- \bar{X} and \bar{Y} are the means of the independent and dependent variables, respectively.
Step 1: Calculate the means
The first step is to calculate the means of X and Y. These are the averages of the independent and dependent variables, respectively.
\begin{align*} \bar{X} &= \frac{\sum X_i}{n} = \frac{77}{10} = 7.7 \\[12pt] \bar{Y} &= \frac{\sum Y_i}{n} = \frac{616}{10} = 61.6 \end{align*}
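As a quick check, the same means in R:
mean(test_data$study_hours) # 7.7
mean(test_data$exam_scores) # 61.6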
Step 2: Calculate the deviations from the mean
Next, we calculate the deviations from the mean for each data point and their products. These are the building blocks for the slope (\beta_1).
| Student | X | Y | (X_i - \bar{X})(Y_i - \bar{Y}) | (X_i - \bar{X})^2 |
|---|---|---|---|---|
| 1 | 6 | 57 | 7.82 | 2.89 |
| 2 | 7 | 59 | 1.82 | 0.49 |
| 3 | 11 | 77 | 50.82 | 10.89 |
| 4 | 8 | 61 | -0.18 | 0.09 |
| 5 | 8 | 61 | -0.18 | 0.09 |
| 6 | 11 | 81 | 64.02 | 10.89 |
| 7 | 8 | 66 | 1.32 | 0.09 |
| 8 | 5 | 44 | 47.52 | 7.29 |
| 9 | 6 | 55 | 11.22 | 2.89 |
| 10 | 7 | 55 | 4.62 | 0.49 |
| \sum | 77 | 616 | 188.8 | 36.1 |
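The deviation columns in that table can be reproduced with a quick dplyr sketch (the column names xy_dev and x_dev_sq are just labels I made up):
# Deviations from the mean and their products, plus the column sums
test_data |>
  dplyr::mutate(
    xy_dev   = (study_hours - mean(study_hours)) * (exam_scores - mean(exam_scores)),
    x_dev_sq = (study_hours - mean(study_hours))^2
  ) |>
  dplyr::summarise(
    sum_xy_dev   = sum(xy_dev),    # 188.8
    sum_x_dev_sq = sum(x_dev_sq)   # 36.1
  )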
Step 3: Calculate the slope (\beta_1)
Using the formula for \beta_1 we now get:
\begin{align*} \beta_1 &= \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sum(X_i - \bar{X})^2} \\[12pt] &= \frac{188.8}{36.1} \\[12pt] &\approx 5.23 \end{align*}
Step 4: Calculate the intercept (\beta_0)
Using the formula for \beta_0 we now get:
\begin{align*} \beta_0 &= \bar{Y} - \beta_1\bar{X} \\[12pt] &= 61.6 - 5.23(7.7) \\[12pt] &\approx 21.33 \end{align*}
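Doing the same arithmetic in R (a quick sketch; b1 and b0 are just my own object names):
# Slope and intercept straight from the formulas
b1 <- with(test_data,
  sum((study_hours - mean(study_hours)) * (exam_scores - mean(exam_scores))) /
    sum((study_hours - mean(study_hours))^2))
b0 <- with(test_data, mean(exam_scores) - b1 * mean(study_hours))
c(intercept = b0, slope = b1)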
Step 5: Write the regression equation
Now that we have values for \beta_0 \: \text{and} \: \beta_1 we can write the regression equation:
\hat{Y} = 21.3 + 5.23X
which, if you remember from before, is the same as our lm() model in R:
broom::tidy(fit)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 21.3 3.93 5.42 0.000627
2 study_hours 5.23 0.496 10.5 0.00000569
Citation
@online{monaghan2025,
  author = {Monaghan, Cormac},
  title = {Doing Linear Regression by Hand},
  date = {2025-03-12},
  url = {https://c-monaghan.github.io/posts/2025/2025-03-12-Linear-Regression/},
  langid = {en}
}