Course log and updates

Welcome to the course SF2930 Regression analysis, Spring 2022. This page presents the latest information about what is covered in the lectures, as well as any schedule changes.

Course organization: Due to covid regulations, both the lectures and the exercise sessions will be given remotely via Zoom. Familiarize yourself with the Canvas page and read through the material and information found below.

 

 

EXAM

A list of approximately 36 questions will be provided about two weeks prior to the exam. The written exam will consist of exactly six (6) of these questions, and each question is worth six points.

Update (9/5): The re-exam will have the same structure as the exam, and will use the same set of questions (see link below).

Update (11/3): The exam questions are now available here: questions2022-bd755af6-a75a-41b3-9a99-6eb9d3b77ade.pdf

Update (3/3): The grading for the exam will be as follows. F: 0-15, Fx: 16-17, E: 18-21, D: 22-25, C: 26-29, B: 30-33, A: 34-36

Update (18/3): We will revert to the "standard" scale for grading the exams. In other words, the grading will be as follows.

Grade   Percent   Points          Freq. on the exam
A       >= 85%    30.6 -- 36      9.6%
B       >= 75%    27 -- 30.5      17%
C       >= 65%    23.4 -- 26.9    20%
D       >= 55%    19.8 -- 23.3    21%
E       >= 45%    16.2 -- 19.7    15%
Fx      >= 40%    14.4 -- 16.1    5.9%
F       < 40%     0 -- 16.0       12%

 

Below, I have linked to a few older exams, as well as to older lists of exam questions.  

exam200606.pdf
exam201603.pdf
exam201606.pdf
exam201703.pdf
examquestions2020.pdf
examquestions2021.pdf
 

 

EXERCISES

We organize online exercise sessions in Zoom starting on the 21st of January. These will take place during the regularly scheduled exercise times. The teaching assistant, Isaac Ren, will be available in Zoom during the regularly scheduled exercise hours. You need to download the Zoom client to be able to participate.

Exercise sessions
Link: 
https://kth-se.zoom.us/j/63151958791
Meeting ID: 631 5195 8791 

 

 

LECTURES

Course coordinator Malin Palö Forsström will be available in Zoom during the regularly scheduled lecture hours. You need to download the Zoom client in order to participate.

Link: https://kth-se.zoom.us/j/63045281655
Meeting ID: 716 472 0901

A detailed introduction will be given at the beginning of each lecture, and important elements will be highlighted according to the Course plan. You can then ask questions related to the theory. You can also ask questions concerning the theory under Discussions or directly by e-mail (malinpf@kth.se).

Observe that not all topics will be covered during the lectures and additional reading is required. Reading instructions will be provided below before each lecture.

In terms of time, we will follow the schedule. Self-study requires a little more discipline and  I strongly encourage you to work in step with the schedule so that you do not fall behind in your studies. 

 

Lecture 1, 20/1

At the beginning of this lecture, I will explain how the course will be organized this semester. Then an introduction to regression analysis will be presented. 

The simple linear regression model will be discussed in detail, including the basic assumptions of equal variance of the error term, linearity, and independence. The least squares (LS) fitting strategy will be discussed along with the properties of the resulting estimators of the regression coefficients. Go through these properties once again, read Sections 2.2.2--2.2.3 of MPV, check the normal equations given by Equation (2.5) on p. 14 and their solutions, show that the LS estimators of both the slope and the intercept are unbiased, and find their variances. Think about the three sources of error in the slope estimation and their effects: the noise level, the sample size, and the spread in x.

Go through Ex 2.1 and Ex 2.2 to see the numerical calculations for the LS fit, read about residual properties, and check the general properties 1.--5. of the LS fit presented on p. 20 of MPV.

Go through Sections 2.3.1--2.3.2 and check which additional assumptions are needed to perform the tests of significance on the slope and intercept. I will discuss this in detail during the next lecture.

Slides: lecture1-2.pdf
The covid dataset: data.csv
R code for Lecture 1: R1.Rmd (note that you have to change the line "/Users/malin/Dropbox/Jobb/Teaching/KTH - SF2930/data.csv" to the location of the dataset on your computer)

I highly encourage you to play around with the R file (requires the dataset), and to simulate datasets yourself in your favorite language to try things out with.
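
For instance, a minimal R sketch along these lines (simulated data only, not part of the official course code) illustrates how the noise level, the sample size, and the spread in x affect the variability of the LS slope estimate:

```r
# Simulation sketch: variability of the LS slope under different noise levels,
# sample sizes, and spreads in x. All numbers are illustrative assumptions.
set.seed(1)

simulate_slope <- function(n, sigma, x_spread, n_rep = 1000) {
  slopes <- replicate(n_rep, {
    x <- runif(n, min = -x_spread, max = x_spread)
    y <- 1 + 2 * x + rnorm(n, sd = sigma)   # true intercept 1, true slope 2
    coef(lm(y ~ x))[2]                      # LS estimate of the slope
  })
  c(mean = mean(slopes), sd = sd(slopes))
}

simulate_slope(n = 20, sigma = 1, x_spread = 1)    # baseline
simulate_slope(n = 20, sigma = 3, x_spread = 1)    # more noise  -> larger sd
simulate_slope(n = 200, sigma = 1, x_spread = 1)   # more data   -> smaller sd
simulate_slope(n = 20, sigma = 1, x_spread = 5)    # wider x     -> smaller sd
```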

 

Lecture 2, 21/1

Tests of significance and confidence intervals for the slope, the intercept, and the variance of the error term will be discussed for the simple linear regression model. Go through the numerical examples and check the graphs in Sections 2.3.1-2.3.2 of MPV. The fundamental analysis-of-variance (ANOVA) identity will be presented along with the test of significance of regression. It is very important to understand how the partition of the total variability in the response variable is obtained and how the ANOVA-based F-test is derived; this strategy will be used throughout the whole course, specifically in the multiple linear regression models which will be presented during the next two lectures. Go through Section 2.3.3 and check why the F-test is equivalent to the t-test when testing the significance of regression in the simple regression model.
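
As a small illustration (simulated data, not a course dataset), the ANOVA table produced by anova() contains the F-test for significance of regression, and in the simple model the F statistic equals the square of the slope's t statistic:

```r
# ANOVA-based F-test vs. the t-test for the slope in simple linear regression.
set.seed(20)
x <- runif(30, 0, 10)
y <- 2 + 0.8 * x + rnorm(30)
fit <- lm(y ~ x)

anova(fit)                                   # SS_R, SS_Res and the F-test
summary(fit)$coefficients["x", "t value"]^2  # equals the F statistic in the table above
```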

The concepts of a confidence interval for the mean response and a prediction interval for a future observation will be presented. Go through Section 2.4.2 and check numerical examples 2.6 and 2.7; it is important to understand the principal difference between these two types of intervals and how they are supposed to be used in regression analysis.
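
A minimal sketch of the two intervals in R (simulated data; the numbers are illustrative only):

```r
# Confidence interval for the mean response vs. prediction interval for a new y.
set.seed(2)
x <- runif(30, 0, 10)
y <- 3 + 0.5 * x + rnorm(30)
fit <- lm(y ~ x)

x0 <- data.frame(x = 5)
predict(fit, x0, interval = "confidence")  # CI for the mean response E[y | x = 5]
predict(fit, x0, interval = "prediction")  # wider: PI for a new observation at x = 5
```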

Read Section 2.10 (2.9) on your own, where abuse of regression modeling is discussed, and Section 2.11 (2.10), where the no-intercept regression model is presented as a special type of modeling (the idea is to force the intercept to be zero). Check the numerical examples of Section 2.11 (2.10) and think about the differences from the previously presented model (which includes an intercept term), focusing specifically on the properties of the coefficient of determination.

Go through Section 2.12 (2.11) and convince yourself that the ML estimators of the slope and intercept are identical to those obtained by the LS approach. This does not hold for the variance estimator; check why (the ML estimator divides SS_Res by n, whereas the unbiased estimator divides by n - 2).

A short discussion of the case of random regressors is presented in Section 2.13 (2.12); check it, and I will briefly discuss it during the next lecture.
Observe that the exercises selected for the first exercise session on the 24th of January are on the home page; see Exercises: https://kth.instructure.com/courses/31832/pages/exercises

Slides: lecture2-4.pdf
R code for Lecture 2: R2.Rmd

 

Lecture 3, 25/1

Some important properties of residuals in the simple linear regression model will be discussed along with the concept of leverage which measures the effect of each observed y on its own fit. Be sure that you understand the difference between the error terms in the model and residuals for the model fit, and can describe the most important properties of both.

Multiple linear regression models will be introduced, starting with matrix notation and then turning to the LS normal equations, their solutions, and the geometrical interpretation of the LS estimators. It is important to remember that, in general, any regression model that is linear in the coefficients (the betas) is a linear regression model, regardless of the shape of the surface it generates. Go through Section 3.2.1, be sure that you understand the structure of the matrix X'X and the structure and role of the hat matrix H, and go through Example 3.1 and the graphical data presentation in Section 3.2.1.
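
A small matrix-form sketch in R (simulated data, assumed only for illustration) may help connect the normal equations and the hat matrix to lm():

```r
# LS estimator and hat matrix in matrix form, compared with lm().
set.seed(3)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

X <- cbind(1, x1, x2)                      # model matrix with an intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # solves the normal equations X'X b = X'y
H <- X %*% solve(t(X) %*% X) %*% t(X)      # hat matrix: y_hat = H y

all.equal(as.numeric(beta_hat), as.numeric(coef(lm(y ~ x1 + x2))))
range(diag(H))                             # leverages h_ii lie between 0 and 1
sum(diag(H))                               # trace of H equals p = 3 (number of coefficients)
```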

Go through Sections 3.2.3-3.2.6 and check the properties of the parameter estimators obtained by both the LS and ML approaches. Check carefully Appendix C.4, where the optimality of the LS estimators is stated in the Gauss-Markov theorem, and be sure that you understand the proof of this result. Check carefully Appendix C.2, which presents important results on matrix calculations and some distributional properties of the regression estimators under normality. We will need these latter results for the test procedures.

Read on your own about the global test of significance of multiple linear regression. Go through Section 3.3.1, check the assumptions for constructing the tests of significance and the computational formulas for the ANOVA representation, and read about checking the model adequacy using the adjusted coefficient of determination. Think about why this adjustment is needed; I will present some details during Lecture 4.

Slides: lecture3-3.pdf
R code: R3.Rmd

 

Lecture 4, 26/1

We continue the discussion of the distributional properties of estimators in the regression model under normality. After a recap of the properties of the multivariate normal distribution, we show that the estimator of the beta coefficients obtained by LS is equal to the one obtained by ML (think about whether the same holds for the estimators of the error variance).

We then turn to the test procedures in multiple linear regression. We start with the global test of model adequacy, then testing a subset of coefficients, and finally tests of general linear hypotheses; see Sections 3.3.1-3.3.4. It is important to understand why the partial F-test, presented in Section 3.3.2, measures the contribution of a subset of regressors to the model given that the other regressors are included in the model. Check Appendix C.3.3-C.3.4 for details and go through Example 3.5, where the partial F-test is illustrated. Go through Examples 3.6 and 3.7 of Section 3.3.4, which demonstrate the unified approach for testing linear hypotheses about regression coefficients.
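
A minimal sketch of the partial F-test in R (simulated data): anova() applied to a reduced and a full model performs the extra-sum-of-squares test for the additional regressors, given those already in the model.

```r
# Partial (extra-sum-of-squares) F-test: contribution of x3 and x4 given x1 and x2.
set.seed(21)
n <- 80
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + 0.5 * dat$x3 + rnorm(n)

reduced <- lm(y ~ x1 + x2, data = dat)
full    <- lm(y ~ x1 + x2 + x3 + x4, data = dat)
anova(reduced, full)    # partial F-test for the two additional regressors
```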

Next, we discuss the confidence intervals for the coefficients (but not for the mean response). Read Sections 3.4.1-3.5 on your own. It is important to understand the difference between a one-at-a-time confidence interval (marginal inference) for a single regression coefficient and a simultaneous (or joint) confidence set for the whole vector of coefficients. Go through Example 3.11 and think about the advantages and disadvantages of the two methods which have been presented: the joint confidence set given by (3.50) (a confidence ellipse, see Fig. 3.8) and the Bonferroni-type correction strategy. I will talk about this during the next lecture.
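
As a rough illustration of the Bonferroni strategy (simulated data; this sketch covers only the marginal versus Bonferroni-adjusted intervals, not the confidence ellipse):

```r
# One-at-a-time 95% intervals vs. Bonferroni-adjusted simultaneous intervals.
set.seed(4)
n <- 40
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + x1 + 0.5 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2 + x3)

alpha <- 0.05
p <- length(coef(fit))              # number of coefficients (including the intercept)
confint(fit, level = 1 - alpha)     # marginal (one-at-a-time) intervals
confint(fit, level = 1 - alpha / p) # Bonferroni correction: each interval is wider
```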

Standardization (centering and scaling) of the regression coefficients is presented in Section 3.9. Check on your own the two approaches to standardization and the interpretation of the standardized regression coefficients. One application of the standardization step is presented further in Section 3.10, where the problem of multicollinearity is introduced. Check why and how the standardization is applied here; we will discuss the problem of multicollinearity in detail during Lectures 8 and 9.

The phenomenon of hidden extrapolation in the prediction of a new observation using the fitted model will be discussed in detail at the beginning of Lecture 5. Go through Section 3.8; it is important to understand the structure of the RVH (regressor variable hull) and the role of the hat matrix H in specifying the location of the new data point in the x-space. Go through Example 3.13 and inspect the related figures.
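
A minimal sketch of the hat-matrix check for hidden extrapolation (simulated data; following the idea of Section 3.8, a new point with h_00 larger than the maximum training leverage h_max lies outside the approximation of the RVH):

```r
# Hidden extrapolation: compare h_00 = x0'(X'X)^{-1}x0 with the maximum leverage.
set.seed(22)
n  <- 50
x1 <- rnorm(n); x2 <- 0.8 * x1 + rnorm(n, sd = 0.3)   # correlated regressors
X  <- cbind(1, x1, x2)
XtX_inv <- solve(t(X) %*% X)
h_max   <- max(diag(X %*% XtX_inv %*% t(X)))

x0  <- c(1, 2, -2)                        # each coordinate is in range, but the
h00 <- drop(t(x0) %*% XtX_inv %*% x0)     # combination is unusual for correlated x's
c(h00 = h00, h_max = h_max)               # h00 > h_max indicates hidden extrapolation
```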

Slides: lecture4-3.pdf
R code: R4.Rmd

 

Lecture 5, 27/1

After discussing the problem of hidden extrapolation in multiple regression, we turn to model evaluation strategies. The main question is whether the assumptions underlying the linear regression model seem reasonable when applied to the data set in question. Since these assumptions are stated about the population (true) regression errors, we check model adequacy through the analysis of the sample-based (estimated) errors, the residuals.

The main ideas of residual analysis are presented in Sections 4.2.1-4.2.3, and in Section 4.3, where the PRESS residuals are used to compute an R²-like statistic for evaluating the capability of the model. Go through Sections 4.2.1-4.2.3, check the difference between internal and external scaling of residuals, and go through numerical examples 4.1 and 4.2 (and the related graphs and tables). Specifically, be sure that you understand why we need to check the assumptions of the model and how we can detect various problems with the model by using residual analysis. Think about which of the formulas and methods we have used are at risk of being incorrect when specific model assumptions are violated.

During the next lecture, I will again discuss the various residual plots that are standard for model diagnostics. Go through Section 4.2.3 and be sure that you understand how to "read" these plots, i.e., how to detect specific problems in practice. For example, how does non-constant error variance show up in a residuals-versus-fitted-values plot? How can residuals-versus-predictor plots be used to identify omitted predictors that could improve the model?
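
For instance, a small R sketch (simulated data with deliberately non-constant variance) producing two of the most common diagnostic plots:

```r
# Residual diagnostics: residuals vs. fitted values and a normal Q-Q plot.
set.seed(5)
x <- runif(100, 1, 10)
y <- 2 + x + rnorm(100, sd = 0.3 * x)      # error variance grows with x
fit <- lm(y ~ x)

plot(fitted(fit), rstandard(fit),          # residuals vs. fitted values:
     xlab = "Fitted values",               # a funnel shape indicates
     ylab = "Standardized residuals")      # non-constant error variance
abline(h = 0, lty = 2)

qqnorm(rstudent(fit)); qqline(rstudent(fit))  # normality check (externally studentized)
```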

Slides: lecture5-3.pdf
R code: R5.Rmd

 

Lecture 6, 31/1

During Lecture 5 we considered methods for detecting problems with a linear regression model. Some examples illustrating the transition from classic to computer-age practice in regression modeling and model validation will be shown to motivate the model diagnostic strategy in linear regression. See Chapter 1 of the book Computer Age Statistical Inference by B. Efron and T. Hastie.

Once the problems with the model have been identified, we have a number of possible remedies, which are discussed during the current lecture. Section 5.2 presents variance-stabilizing transforms and Section 5.3 summarizes a number of transforms for linearizing the model. Go through these sections, and check Examples 5.1 and 5.2 and Figure 5.4. You are expected to understand when (and which) transform of the response variable might help, and the same for transforming the predictor variables. Observe that sometimes both need to be transformed to meet the three conditions of the linear regression model.

Check carefully that you understand how to fit the regression model to the transformed data and how to check the model adequacy. 
Observe that the variable-transformation methods above involve subjective decisions; this means that the model you select as the good one can differ from the one selected by your colleague, and both models can be appropriate! An analytic strategy for selecting the "best" power transform of the response variable is presented in Section 5.4.1: the Box-Cox transform. Go through this section (and the notes from the lecture) and check Example 5.3. It is important to understand when the Box-Cox transform is suitable and how to choose the optimal value of the power parameter. Check the different strategies for maximizing the likelihood function and making inferences about the power parameter. For an overview of the Box-Cox method with a number of examples, see Box-Cox transformations: An Overview, and for the R implementation of Box-Cox transformations for different purposes, graphical assessment of the success of the transform, and inference on the transformation parameter, see Box-Cox power transformations: Package "AID".
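
A minimal Box-Cox sketch in R using the MASS package (simulated positive responses for which the log transform, i.e. lambda near 0, is appropriate); the AID package linked above offers a richer interface:

```r
# Box-Cox: profile log-likelihood over lambda and the maximizing value.
library(MASS)
set.seed(6)
x <- runif(100, 1, 10)
y <- exp(0.2 + 0.3 * x + rnorm(100, sd = 0.2))      # log(y) is linear in x, so lambda ~ 0

fit <- lm(y ~ x)
bc  <- boxcox(fit, lambda = seq(-1, 1, by = 0.05))  # plots the likelihood with a 95% CI
lambda_hat <- bc$x[which.max(bc$y)]                 # value maximizing the likelihood
lambda_hat
```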

The common problem of non-constant error variance can also be addressed using generalized least squares (GLS) and its special case, weighted least squares. We do not discuss the general linear model (GLM) strategy presented in Section 5.3. Read Sections 5.5.1-5.5.3, think about why these methods are suitable for fitting a linear regression model with unequal error variances, go through Example 5.5, and think about practical issues with GLS and weighted LS. I will come back to the problem of weight estimation during the next lecture.
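
A small weighted-least-squares sketch (simulated data in which the error standard deviation is assumed proportional to x, so the weights 1/x^2 are the inverse error variances):

```r
# Weighted least squares via the weights argument of lm().
set.seed(7)
x <- runif(100, 1, 10)
y <- 2 + x + rnorm(100, sd = 0.5 * x)

ols <- lm(y ~ x)                      # ignores the unequal variance
wls <- lm(y ~ x, weights = 1 / x^2)   # downweights the high-variance observations
summary(ols)$coefficients
summary(wls)$coefficients             # typically smaller standard error for the slope
```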

Slides: lecture6-4.pdf
R code: R6-4.Rmd

 

Lecture 7, 2/2

After a recap of the Box-Cox transformation, we turn to Chapter 6, where methods for detecting influential observations are presented. It is important to learn the distinction between an outlier, a data point whose response y does not follow the general trend of the data, and a data point which has high leverage. Both outliers and high-leverage data points can be influential, i.e., they can dramatically change the results of the regression analysis, such as the predicted responses, the coefficient of determination, the estimated coefficients, and the results of the tests of significance. During this lecture, we will discuss various measures used for determining whether a point is an outlier, has high leverage, or both. Once such data points are identified, we then investigate whether they are influential. We first consider a measure of leverage (see Section 6.2), and then discuss two measures of influence, Cook's distance and DFBETAS (the standardized difference in each estimated beta when an observation is deleted). It is important to understand the general idea behind these measures; both are based on deletion diagnostics, i.e., they measure the influence of the i:th observation if it is removed from the data. It is also important to see that both these measures combine the residual magnitude with the location of the point of interest in x-space.
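
A minimal R sketch (simulated data with one planted outlying, high-leverage point) showing the standard influence diagnostics:

```r
# Leverage, Cook's distance and DFBETAS for a fitted lm.
set.seed(8)
x <- c(rnorm(30), 8)                       # one high-leverage x value
y <- c(1 + 2 * x[1:30] + rnorm(30), 30)    # ... whose response is also an outlier
fit <- lm(y ~ x)

lev <- hatvalues(fit)                      # leverages h_ii (diagonal of the hat matrix)
cd  <- cooks.distance(fit)                 # deletion-based overall influence
dfb <- dfbetas(fit)                        # standardized change in each coefficient

which(lev > 2 * mean(lev))                 # common rule of thumb for high leverage
which(cd > 1)                              # points often flagged as influential
round(dfb[31, ], 3)                        # effect of the planted point on each beta
```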

We then turn to the problem of multicollinearity, which is discussed in Chapter 9. Multicollinearity is present when two or more of the predictor variables in the model are moderately or highly correlated (linearly dependent). It is important to understand the impact of multicollinearity on various aspects of the regression analysis. The main focus of the present lecture is on the effects of multicollinearity on the variance of the estimated regression coefficients, the length of the estimated vector of coefficients, and the prediction accuracy. Go through Section 9.3 and Example 9.1. Specifically, this example demonstrates that multicollinearity among the regressors does not prevent accurate prediction of the response within the scope of the model (interpolation), but seriously harms the prediction accuracy when extrapolating.

During the next lecture, we will discuss various methods of dealing with multicollinearity.

Slides: lecture7-2.pdf
R code: R7-2.Rmd

 

Lecture 8, 8/2

We turn to the problem of multicollinearity discussed in Chapter 9. Multicollinearity is present when two or more of the predictor variables in the model are moderately or highly correlated (linearly dependent). It is important to understand the impact of multicollinearity on various aspects of the regression analysis. The main focus of the present lecture was on the effects of multicollinearity on the variance of the estimated regression coefficients, the length of the estimated vector of coefficients, and the prediction accuracy. Go through Section 9.3 and Example 9.1. Specifically, this example demonstrates that multicollinearity among the regressors does not prevent accurate prediction of the response within the scope of the model (interpolation), but seriously harms the prediction accuracy when extrapolating.

Go through the whole of Section 9.4, focusing especially on the example with simulated data on p. 294, which demonstrates the need for measures of multiple correlation (not only pairwise measures, such as examination of the matrix X'X in its correlation form) for detecting multicollinearity. We have also discussed some more general methods of multicollinearity diagnostics, such as the VIFs and the eigensystem analysis of X'X, which explain the nature of the linear dependence; check the example by Webster et al. (the simulated data presented on p. 294) to see how to use the elements of the eigenvectors to catch the linear relationship between the predictors.
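
A small detection sketch in R (simulated, nearly collinear regressors): the VIFs are the diagonal of the inverse of the correlation matrix of the regressors (the car package's vif() computes the same quantity from a fitted model), and the eigensystem of X'X in correlation form reveals the near-dependence:

```r
# Multicollinearity diagnostics: VIFs and the eigensystem of the correlation matrix.
set.seed(9)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)              # nearly a copy of x1
x3 <- rnorm(n)
X  <- cbind(x1, x2, x3)

R   <- cor(X)                               # X'X in correlation form
vif <- diag(solve(R))                       # variance inflation factors
vif                                         # values >> 10 flag serious multicollinearity

e <- eigen(R)
e$values                                    # a near-zero eigenvalue signals near-dependence
max(e$values) / min(e$values)               # condition number of X'X
e$vectors[, which.min(e$values)]            # its elements show which regressors are involved
```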

After presenting common methods for detecting multicollinearity (with a special focus on the eigensystem analysis presented in Section 9.4.3), we turn to methods for overcoming multicollinearity. We start with principal-components regression (PCR), where the principal components are first obtained by transforming the original predictors, and these components are then used as new derived predictors in the regression model (see Section 9.5.4).

The key idea of overcoming multicollinearity using PCR is to exclude those principal components which correspond to the lowest eigenvalues (think about why exactly these components should be dropped). Be sure that you understand how the principal components are constructed from the original data matrix X. Check Example 9.3 and observe that the final PCR estimators of the original beta coefficients can be obtained by back-transforming.

Important! The method assumes that the original data are centered and scaled to unit length (see p. 114 for the scaling step), so that Y and each of the columns of X have zero empirical mean.
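
A rough PCR sketch in R (simulated data; here the regressors are standardized with scale(), i.e. unit-variance scaling, which differs from MPV's unit-length scaling only by a constant factor):

```r
# Principal-components regression: drop the low-eigenvalue components, regress on
# the retained scores, and back-transform to coefficients for the standardized x's.
set.seed(10)
n  <- 100
x1 <- rnorm(n); x2 <- x1 + rnorm(n, sd = 0.05); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + x3 + rnorm(n)

X  <- scale(cbind(x1, x2, x3))      # centered and scaled regressors
yc <- y - mean(y)                   # centered response (no intercept needed below)

pc    <- prcomp(X, center = FALSE, scale. = FALSE)  # X is already standardized
k     <- 2                                          # keep the first k components
Z     <- pc$x[, 1:k]                                # principal-component scores
alpha <- coef(lm(yc ~ Z - 1))                       # regression on the components
beta_pcr <- pc$rotation[, 1:k] %*% alpha            # back-transform to the x-scale
beta_pcr                                            # estimates for the standardized x's
```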

Slides: lecture8-2.pdf
Annotated slides: lecture8_annotated.pdf
R code: R8-1.Rmd

 

Lecture 9, 9/2

After a recap of the effects of multicollinearity on the model fit (with a special focus on the eigensystem analysis presented in Section 9.4.3 of MPV), we turn to the methods for overcoming multicollinearity. We have discussed two strategies: 1) principal-components regression (PCR), where the principal components are first obtained by transforming the original predictors, and these components are then used as new derived predictors in the regression model (see Section 9.5.4, MPV), and 2) Ridge regression, which shrinks the LS regression coefficients by imposing a penalty on their size (see Section 9.5.3, MPV, and Section 5.6.4 in Izenman).

We have considered two formulations of the Ridge estimator, namely as the solution of a penalized residual sum of squares and as a shrinkage estimator. It is important to understand the role of the biasing parameter, also called the penalty, tuning, or shrinkage parameter, and how this parameter can be selected. I will come back to the derivation of the Ridge estimator during the next lecture and show the bias-variance trade-off theoretically.

Observe that instead of the single solution we had in LS, ridge regression generates a path/trace of solutions as a function of the biasing parameter. Check carefully Example 9.2, where the choice of the parameter by inspection of the ridge trace is discussed. A computationally efficient approach for optimizing the biasing/shrinkage parameter is presented in the book by Izenman (see Section 5.7, the algorithm in Table 5.7, p. 138, and the course home page for the e-book). This approach considers a cross-validatory (CV) choice of the ridge parameter and is more suitable if the model is to be used for prediction. I will explain the details of the algorithm at the beginning of Lecture 10.

Important! Both the PCR and Ridge regression methods assume that the original data are already centered and re-scaled, usually to unit length (see p. 114 for the scaling step), so that Y and each of the p columns of X have zero empirical mean. Observe that the centering step is of principal importance for both PCR and Ridge regression. Both types of re-scaling presented on p. 115 of MPV are suitable; check the relationship between Z'Z and W'W to understand why.
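
A minimal ridge sketch using the glmnet package (alpha = 0 selects the ridge penalty; glmnet standardizes the inputs internally, and cv.glmnet performs a cross-validatory choice of the biasing parameter along the lines discussed above):

```r
# Ridge regression with a cross-validated choice of the biasing parameter lambda.
library(glmnet)
set.seed(11)
n  <- 100
x1 <- rnorm(n); x2 <- x1 + rnorm(n, sd = 0.05); x3 <- rnorm(n)
X  <- cbind(x1, x2, x3)
y  <- 1 + 2 * x1 + x3 + rnorm(n)

cv  <- cv.glmnet(X, y, alpha = 0)       # 10-fold CV over a grid of lambda values
cv$lambda.min                           # lambda minimizing the CV error
fit <- glmnet(X, y, alpha = 0)
plot(fit, xvar = "lambda")              # ridge trace: coefficients vs. log(lambda)
coef(fit, s = cv$lambda.min)            # ridge coefficients at the CV-chosen lambda
```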

Slides: lecture9-2.pdf
Annotated slides: lecture9_annotated.pdf
R code: R9-1.Rmd

 

Lecture 10, 11/2

This lecture is devoted to Bayesian Regression modeling.

Slides: VillaniGuestLectureKTH2022.pdf

 

Lecture 11, 15/2

The accuracy of the regression model can be assessed using the bootstrap, which, for example, can be used to evaluate the variability of the coefficients by constructing confidence intervals for the regression coefficients when the error distribution is non-normal. Some bootstrap strategies in regression will be discussed. Specifically, we focus on the parametric bootstrap, show how to create a bootstrap data set by re-sampling the residuals, and how to form bootstrap percentile intervals for the regression coefficients using the empirical quantiles.

We briefly discuss the non-parametric bootstrap, where the idea is to re-sample the data pairs (x, y) directly, without specifying a model. Read Sections 5.2 and 5.3.4 in JWHT and Section 5.4 in Iz on your own, and if you are curious, check Sections 7.11-7.12 in HTF, where the relationship between AIC, cross-validation (CV), and the bootstrap is discussed from the perspective of test and training error.
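
A small sketch of the residual-resampling bootstrap with percentile intervals (simulated data with non-normal errors; illustrative only):

```r
# Residual bootstrap: re-sample residuals, refit, and form a percentile interval.
set.seed(12)
n <- 50
x <- runif(n, 0, 10)
y <- 1 + 0.5 * x + rexp(n) - 1          # non-normal (centered exponential) errors
fit <- lm(y ~ x)
res <- residuals(fit); fv <- fitted(fit)

B <- 2000
boot_slopes <- replicate(B, {
  y_star <- fv + sample(res, n, replace = TRUE)  # bootstrap data set
  coef(lm(y_star ~ x))[2]                        # refitted slope
})
quantile(boot_slopes, c(0.025, 0.975))           # bootstrap percentile interval
```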

Slides: lecture11-2.pdf
Annotated slides: lecture11.pdf (2).pdf
Bonus content on confidence intervals: Bonus Notes for SF2930.pdf
R code: R11.Rmd

 

Lecture 12, 16/2

We further recall the properties of Ridge regression, which shrinks the regression coefficients by imposing a penalty on their size, and derive an equivalent (constrained) formulation of the ridge problem. It is important to understand that there is a one-to-one correspondence between the shrinkage and constraint parameters in the two formulations (see Sections 6.2 and 6.8 in JWHT, and compare (6.5) with (6.9)). It is also important to understand that when there are many correlated variables in the linear model (i.e., a multicollinearity problem), their coefficients can become poorly determined and exhibit high variance. This problem can be alleviated by imposing a size constraint on the coefficients, i.e., by performing ridge regression.

We then discuss Lasso regression, which is also a shrinkage method like the ridge, with a subtle but important difference. Due to the structure of its penalty term, the Lasso performs a kind of continuous variable selection, unlike the ridge, which only shrinks. Computing the Lasso solution is a quadratic programming problem, and efficient algorithms are available for obtaining the entire path of solutions with the same computational cost as for ridge regression, and the optimal value of the penalty parameter can be selected by cross-validation. Go through Section 6.2.2 in JWHT, with a focus on the Lasso's variable selection properties. Think about the constrained Lasso formulation given by (6.9) and its connection to the theoretical statement of the variable selection problem stated in (6.10). Read on your own about the elastic-net penalty, a compromise between ridge and Lasso; see Section 3.4.3 of HTF.
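
A minimal Lasso sketch with glmnet (alpha = 1 selects the Lasso penalty; simulated data in which only three of ten predictors are active):

```r
# Lasso: cross-validated lambda and the resulting sparse coefficient vector.
library(glmnet)
set.seed(13)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, 1, rep(0, p - 3))       # only the first three predictors matter
y <- drop(X %*% beta) + rnorm(n)

cv <- cv.glmnet(X, y, alpha = 1)         # cross-validated choice of lambda
coef(cv, s = "lambda.min")               # sparse: most coefficients are exactly zero
plot(glmnet(X, y, alpha = 1), xvar = "lambda")  # the whole path of solutions
```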

A more detailed presentation of the relationship between subset selection, ridge, and Lasso regression along with Bayesian problem formulation is also provided in Section 3.4.3 of HTF. Further, Sections 3.5 and 3.6 are recommended for those who will work with the Scenario II of Project 1.

For all of the standard regression analyses performed so far, it was assumed that all the regressor variables are relevant, i.e., should be included in the model. This is usually not the case in practical applications; more often there is a large set of candidate regressors from which a set of the most appropriate ones must be identified for inclusion in the final regression model. We start by considering theoretically the consequences of model misspecification (e.g., the effect of deleting a set of variables on the bias and variance of the coefficients of the retained ones). Check, in detail, the whole summary 1.--5. in Section 10.1.2 of MPV and the motivations for variable selection. Two natural strategies for variable selection have been presented: stepwise (backward and forward) regression and best-subsets regression.

The best-subsets regression approach was discussed in detail (also called all-possible-subsets regression; note that the number of candidate models grows very quickly, be sure that you understand why). Objective criteria for selecting the "best" model have been discussed. It is important to understand that different criteria can lead to different "best" models. Go through Section 10.1.3, where R² and its adjusted version, the MSE, and Mallows' Cp statistic are presented and the relationship between these is discussed. Be sure that you understand why these measures are suitable for selecting the optimal model when using the all-possible-subsets strategy. Check Example 10.1 and the related tables and graphs, and be sure that you understand how to choose an optimal model based on the above-mentioned criteria.

Read Section 10.2.2 of MPV on your own about the general idea behind stepwise regression; be sure that you understand how to conduct stepwise regression using the partial F-statistic, and check Examples 10.3 and 10.4 to see the strategy for adding or removing a regressor. Think also about the limitations of best-subsets and stepwise variable selection in regression models (see the general comments on stepwise-type approaches on p. 349). Go through Sections 10.3-10.4, which present the main steps of a good model-building strategy along with a case study.
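
A rough sketch of both strategies in R (simulated data; the leaps package implements all-possible-subsets selection, while base R's step() performs stepwise selection by AIC rather than by the partial F-tests described in MPV):

```r
# Best-subsets selection with leaps and backward stepwise selection with step().
library(leaps)
set.seed(14)
n <- 100
dat <- data.frame(matrix(rnorm(n * 6), n, 6))
names(dat) <- paste0("x", 1:6)
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)

best <- regsubsets(y ~ ., data = dat, nvmax = 6)   # best model of each size
s <- summary(best)
data.frame(size = 1:6, adjR2 = s$adjr2, Cp = s$cp, BIC = s$bic)

full <- lm(y ~ ., data = dat)
step(full, direction = "backward", trace = 0)      # stepwise selection by AIC
```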

Preliminary slides: lecture12-3.pdf
Annotated slides: lecture12.pdf (2).pdf
R code: R12.Rmd

 

Lecture 13, 18/2

The focus of the previous lectures was mainly on linear regression models fitted by least squares. Such linear models are suitable when the response variable is quantitative, and ideally when the error distribution is Gaussian. However, other types of responses arise in practice. For example, binary response variables can be used to indicate the presence or absence of some attribute (e.g., "cancerous" versus "normal" cells in biological data), where the binomial distribution is more appropriate. Sometimes the response occurs as counts (e.g., the number of arrivals in a queue or the number of photons detected); here the Poisson distribution might be called for. This lecture introduces a generalization of simple linear models. We begin with the logistic regression model. Read Section 13.2 of MPV about the general idea behind the logit transform, the parametrization, and link functions.
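
A minimal logistic-regression sketch in R (simulated binary data; glm() with the binomial family and logit link):

```r
# Logistic regression: binary response modeled on the log-odds (logit) scale.
set.seed(15)
n <- 200
x <- rnorm(n)
p <- 1 / (1 + exp(-(-0.5 + 1.5 * x)))      # true success probability via the logit link
y <- rbinom(n, size = 1, prob = p)

fit <- glm(y ~ x, family = binomial(link = "logit"))
summary(fit)$coefficients                  # estimates on the log-odds scale
predict(fit, data.frame(x = 1), type = "response")  # estimated P(y = 1 | x = 1)
```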

Preliminary slides: lecture13-1.pdf
R code: R13.Rmd

 

Lecture 14, 22/2

This lecture is devoted to generalized linear models.  

Slides: Lecture 1.pptx

 

Lecture 15, 3/3

This lecture is devoted to generalized linear models.  

Slides: Lecture 2 2022.pptx

 

Lecture 16, 4/3

This lecture is devoted to kernel regression. 

Slides: kernel_methods.pdf

If you want to learn more about kernel regression, a good reference is Chapter 6 in the book Pattern Recognition and Machine Learning by Bishop.