Lectures content and reading instructions

Lecture 1, 15/1: Introduction to the course and simple linear regression

The lecture covers MPV 2.1-2.2.

The simple linear regression model will be discussed in detail, including the basic assumptions of equal variance of the error term, linearity, and independence. The least squares (LS) fitting strategy will be discussed along with the properties of the resulting estimators of the regression coefficients. Go through these properties once again: read Sections 2.2.2-2.2.3 of MPV, check the normal equations given by Equation (2.5) on p. 14 and their solutions, show that the LS estimators of both the slope and the intercept are unbiased, and find their variances. Think about three sources of error in the slope estimation and their effects: the noise level, the sample size, and the spread in x.
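
To complement this, here is a minimal R sketch (all settings are made-up simulation choices, not taken from MPV) illustrating how the variability of the LS slope estimate depends on the noise level, the sample size, and the spread in x; compare the empirical standard deviation with the theoretical value sqrt(sigma^2/Sxx).

# Simulate simple linear regression and study the variability of the slope estimate.
# beta0 = 1, beta1 = 2, and all other settings are arbitrary illustrative choices.
set.seed(1)
slope_sd <- function(n, sigma, x_spread, n_rep = 2000) {
  x <- seq(-x_spread, x_spread, length.out = n)
  slopes <- replicate(n_rep, {
    y <- 1 + 2 * x + rnorm(n, sd = sigma)   # generate data from the true model
    coef(lm(y ~ x))[2]                      # LS estimate of the slope
  })
  c(empirical = sd(slopes),
    theoretical = sigma / sqrt(sum((x - mean(x))^2)))  # sqrt(sigma^2 / Sxx)
}
slope_sd(n = 20, sigma = 1, x_spread = 1)   # baseline
slope_sd(n = 20, sigma = 2, x_spread = 1)   # more noise    -> larger variance
slope_sd(n = 80, sigma = 1, x_spread = 1)   # more data     -> smaller variance
slope_sd(n = 20, sigma = 1, x_spread = 3)   # wider x-range -> smaller variance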

Go through Examples 2.1 and 2.2 to see the numerical calculations for the LS fit, read about the properties of the residuals, and check the general properties 1-5 of the LS fit presented on p. 20 of MPV.

Go through Sections 2.3.1-2.3.2 and check which additional assumptions are needed to perform the tests of significance on the slope and intercept. This will be discussed in Lecture 2.

You are highly encouraged to play around with the R files provided after the lectures, and to simulate datasets yourself in your favorite language to try things out.

 

Lecture 2, 16/1: Simple linear regression part 2

The lecture covers MPV 2.3-2.6.

Tests of significance and confidence intervals for the slope, the intercept, and the variance of the error term will be discussed for the simple linear regression model. Go through the numerical examples and check the graphs in Sections 2.3.1-2.3.2 of MPV. The fundamental analysis-of-variance (ANOVA) identity will be presented along with the test of significance of regression. It is very important to understand how the partition of the total variability in the response variable is obtained and how the ANOVA-based F-test is derived; this strategy will be used throughout the whole course, specifically in the multiple linear regression models presented during the next two lectures. Go through Section 2.3.3 and check why the F-test is equivalent to the t-test when testing the significance of regression in the simple regression model.
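
As a quick sanity check of that equivalence, the sketch below (on simulated data, so all numbers are arbitrary) compares the squared slope t-statistic from summary() with the F-statistic from anova():

# Simulated example: the F-test for significance of regression equals the squared t-test
set.seed(2)
x <- runif(30); y <- 1 + 0.5 * x + rnorm(30, sd = 0.3)
fit <- lm(y ~ x)
t_slope <- coef(summary(fit))["x", "t value"]
F_reg   <- anova(fit)["x", "F value"]
c(t_squared = t_slope^2, F = F_reg)   # the two values coincide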

The concepts of a confidence interval for the mean response and a prediction interval for a future observation will be presented. Go through Section 2.4.2 and check numerical Examples 2.6 and 2.7; it is important to understand the principal difference between these two types of intervals and how each is used in regression analysis.
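
A minimal R sketch of the two interval types, using predict() on simulated data (the model and numbers are made up):

# Confidence interval for the mean response vs prediction interval for a new observation
set.seed(3)
x <- runif(50, 0, 10); y <- 2 + 0.8 * x + rnorm(50)
fit <- lm(y ~ x)
new <- data.frame(x = 5)
predict(fit, new, interval = "confidence")   # CI for E[y | x = 5]
predict(fit, new, interval = "prediction")   # PI for a single future y at x = 5: always wider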

Read on your own Section 2.9 (5th Ed.) or 2.10 (6th), where the abuse of regression modeling is discussed, and Section 2.10 (5th Ed.) or 2.11 (6th), where the no-intercept regression model is presented as a special type of modeling (the idea is to force the intercept to be zero). Check the numerical examples of this section and think about the differences from the previously presented model (which includes an intercept term), focusing specifically on the properties of the coefficient of determination.
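
The sketch below (simulated data, arbitrary parameter values) fits both versions in R; note that the R² reported for the no-intercept fit is computed around zero rather than around the mean of y, so the two values are not directly comparable:

# No-intercept (regression through the origin) vs the usual model
set.seed(4)
x <- runif(40, 1, 10); y <- 3 * x + rnorm(40)
fit_full <- lm(y ~ x)        # with intercept
fit_zero <- lm(y ~ x - 1)    # intercept forced to zero
summary(fit_full)$r.squared
summary(fit_zero)$r.squared  # computed around zero, not around mean(y); not comparable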

Go through Section 2.11 (5th Ed.) or 2.12 (6th) and convince yourself that the ML estimators of the slope and intercept are identical to those obtained by the LS approach; this does not hold for the variance estimator, so check why.

A short discussion of the case of random regressors is presented in Section 2.12 (5th Ed.) or 2.13 (6th); check it, as it will be discussed briefly during the coming lectures.

 

Lecture 3, 20/1: Multiple linear regression part 1

The lecture covers MPV 3.1-3.2 and Iz 5.2.

Multiple linear regression models will be introduced, starting with matrix notation and then turning to the LS normal equations, their solutions, and the geometrical interpretation of the LS estimators (3.2.2). It is important to remember that, in general, any regression model that is linear in the coefficients (the beta's) is a linear regression model, regardless of the shape of the surface it generates. Go through Section 3.2.1 and be sure that you understand the structure of the matrix X'X and the structure and role of the hat matrix H. Go through Example 3.1 and the graphical data presentation in Section 3.2.1.
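
As an illustration, the R sketch below (two made-up regressors, simulated response) builds the design matrix explicitly, forms X'X and the hat matrix H = X(X'X)^(-1)X', and checks two of its basic properties:

# Build the design matrix explicitly and check the hat matrix
set.seed(5)
n <- 25
x1 <- runif(n); x2 <- runif(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n, sd = 0.5)
fit <- lm(y ~ x1 + x2)
X   <- model.matrix(fit)                 # n x p design matrix (first column of ones)
XtX <- t(X) %*% X                        # the X'X matrix from the normal equations
H   <- X %*% solve(XtX) %*% t(X)         # hat matrix
all.equal(as.vector(H %*% y), unname(fitted(fit)))  # H projects y onto the fitted values
sum(diag(H))                                        # trace(H) = number of parameters p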

Go through Sections 3.2.3-3.2.6 and check the properties of the parameter estimators obtained by both the LS and ML approaches. Check carefully Appendix C.4, where the optimality of the LS estimators is stated in the Gauss-Markov theorem, and be sure that you understand the proof of this result. Check carefully Appendix C.2, which presents important results on matrix calculations and some distributional properties of the regression estimators under normality. We will need these results for the test procedures.

At the end, we will present a discussion motivating the LS estimators when the regressor variables are assumed to be random; see Iz 5.2.1.

 

Lecture 4, 21/1: Multiple linear regression part 2

The lecture mostly covers MPV 3.2.6 and 3.3-3.4.

We continue the discussion of the distributional properties of the estimators in the regression model under normality. After reviewing the properties of the multivariate normal distribution, we show that the estimator of the beta coefficients obtained by LS is equal to the one obtained by ML (while the estimators of the error variance differ slightly).

We then turn to the test procedures in multiple linear regression. We start with the global test of model adequacy, then testing a subset of coefficients, and finally the test of a general linear hypothesis; see Sections 3.3.1-3.3.4. It is important to understand why the partial F-test, presented in Section 3.3.2, measures the contribution of a subset of regressors given that the other regressors are included in the model. Check Appendix C.3.3-C.3.4 for details and go through Example 3.5, where the partial F-test is illustrated. Go through Examples 3.6 and 3.7 of Section 3.3.4, which demonstrate the unified approach for testing linear hypotheses about regression coefficients.
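
In R, the partial F-test can be carried out by comparing a reduced and a full model with anova(); the sketch below uses simulated data with hypothetical regressors x1-x4:

# Partial F-test: do x3 and x4 add anything given that x1 and x2 are in the model?
set.seed(6)
n <- 60
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n); x4 <- rnorm(n)
y  <- 1 + x1 + 0.5 * x2 + 0.3 * x3 + rnorm(n)
full    <- lm(y ~ x1 + x2 + x3 + x4)
reduced <- lm(y ~ x1 + x2)
anova(reduced, full)   # F statistic tests H0: coefficients of x3 and x4 are both zero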

Next, we discuss confidence intervals for the coefficients. Read Sections 3.4-3.5. It is important to understand the difference between a one-at-a-time confidence interval (marginal inference) for a single regression coefficient and a simultaneous (or joint) confidence set for the whole vector of coefficients. Go through Example 3.11 and think about the advantages and disadvantages of the two methods presented: the joint confidence set given by (3.50) (the confidence ellipse, see Fig. 3.8) and the Bonferroni-type correction strategy.
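
A small R sketch of the Bonferroni idea on simulated data (the confidence ellipse itself can be drawn with, e.g., the car package, not shown here); note how the adjusted intervals are wider than the marginal ones:

# One-at-a-time vs Bonferroni-adjusted confidence intervals for the p coefficients
set.seed(7)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
p <- length(coef(fit))              # number of coefficients
confint(fit, level = 0.95)          # marginal 95% intervals
confint(fit, level = 1 - 0.05 / p)  # Bonferroni: joint coverage at least 95%, wider intervals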

Standardization (centering and scaling) of the regression data and the resulting standardized regression coefficients are presented in Section 3.9 (5th) or 3.10 (6th). Check for yourself the two approaches to standardization and the interpretation of the standardized regression coefficients. One application of the standardization step appears further on, in Section 3.10 (5th) or 3.11 (6th), where the problem of multicollinearity is introduced. Check why and how the standardization is applied there; we will discuss the problem of multicollinearity in detail during Lectures 7 and 8.

The phenomenon of hidden extrapolation when predicting a new observation using the fitted model will be discussed in detail at the beginning of Lecture 5.

 

Lecture 5, 22/1: Model diagnostics

The lecture covers MPV 3.8 (5th) or 3.9 (6th), 4.1-4.4

After discussing the problem of hidden extrapolation in multiple regression, we turn to model evaluation strategies. The main question is whether the assumptions underlying the linear regression model seem reasonable when applied to the data set in question. Since these assumptions are stated about the population (true) regression errors, we perform model adequacy checking through the analysis of the sample-based (estimated) errors, the residuals.

The main ideas of residual analysis are presented in Sections 4.2.1-4.2.3 and in Section 4.3, where the PRESS residuals are used to compute an R²-like statistic for evaluating the predictive capability of the model. Go through Sections 4.2.1-4.2.3, check the difference between internal and external scaling of residuals, and go through numerical Examples 4.1 and 4.2 (and the related graphs and tables). Specifically, be sure that you understand why we need to check the assumptions of the model and how we can detect various problems with the model by using residual analysis. Think about which of the formulas and methods we have used are at risk of being incorrect when specific model assumptions are violated.

Go through Section 4.2.3 and be sure that you understand how to "read" these plots, i.e. how to detect specific problems in practice. For example, how does non-constant error variance show up on a residual-versus-fitted-values plot? How can residual-versus-predictor plots be used to identify omitted predictors that could improve the model?
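
The sketch below (simulated data, hypothetical regressors x1 and x2) computes the internally and externally studentized residuals, the PRESS residuals and the corresponding R²-for-prediction, and draws two standard diagnostic plots:

# Internally vs externally studentized residuals, PRESS residuals, and diagnostic plots
set.seed(8)
n <- 40
x1 <- runif(n); x2 <- runif(n)
y  <- 1 + x1 + 2 * x2 + rnorm(n, sd = 0.4)
fit <- lm(y ~ x1 + x2)
r_int <- rstandard(fit)                          # internally studentized residuals
r_ext <- rstudent(fit)                           # externally studentized (deleted) residuals
press <- residuals(fit) / (1 - hatvalues(fit))   # PRESS residuals e_i / (1 - h_ii)
PRESS <- sum(press^2)
R2_pred <- 1 - PRESS / sum((y - mean(y))^2)      # R^2 for prediction based on PRESS
plot(fitted(fit), r_int, xlab = "Fitted values", ylab = "Studentized residuals")
abline(h = 0, lty = 2)          # look for funnels (non-constant variance) or curvature
qqnorm(r_ext); qqline(r_ext)    # check the normality assumption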

Read Section 4.4 for a discussion of outlier detection.

 

Lecture 6, 28/1: Transformations

The lecture covers MPV 5.1-5.5

During Lecture 5 we considered methods for detecting problems with a linear regression model. Once the problems with the model have been identified, we have a number of remedies, which are discussed during the current lecture. Section 5.2 presents variance-stabilizing transforms and Section 5.3 summarizes a number of transforms for linearizing the model. Go through these sections and check Examples 5.1 and 5.2 together with the corresponding figure and table (5.4). You are expected to understand when (and which) transformation of the response variable might help, and the same for the predictor variables. Observe that sometimes both need to be transformed to meet the three conditions of the linear regression model.

Check carefully that you understand how to fit the regression model to the transformed data and how to check the model adequacy. 
Observe that the variable-transformation methods above involve subjective decisions; this means that the model you select as the good one can differ from the one selected by your colleague, and both models can be appropriate! An analytic strategy for selecting the "best" power transform of the response variable is presented in Section 5.4.1: the Box-Cox transformation. Go through this section and check Example 5.3. It is important to understand when the Box-Cox transform is suitable and how to choose the optimal value of the power parameter. Check the different strategies for maximizing the likelihood function and for making inferences about the power parameter. For an overview of the Box-Cox method with a number of examples, see Box-Cox transformations: An Overview (optional); for an R implementation of Box-Cox transformations for different purposes, graphical assessment of the success of the transforms, and inference on the transformation parameter, see Box-Cox power transformations: Package "AID".
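
A minimal sketch of the Box-Cox profile likelihood using MASS::boxcox; the data are simulated so that a log transform (lambda near 0) should be indicated:

# Box-Cox: profile likelihood for the power parameter lambda
library(MASS)
set.seed(9)
x <- runif(60, 1, 10)
y <- exp(0.3 + 0.2 * x + rnorm(60, sd = 0.2))   # positive response; log scale is "right" here
fit <- lm(y ~ x)
bc <- boxcox(fit, lambda = seq(-2, 2, 0.05))    # plots the log-likelihood with a 95% CI for lambda
lambda_hat <- bc$x[which.max(bc$y)]             # value maximizing the profile likelihood
lambda_hat                                      # should be near 0, suggesting a log transform
fit_t <- lm(log(y) ~ x)                         # refit on the transformed scale, recheck residuals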

The common problem of non-constant error variance can also be addressed by using generalized least squares (GLS) and its special case, weighted least squares (WLS). Read Sections 5.5.1-5.5.3, think about why these methods are suitable for fitting a linear regression model with unequal error variance, and go through Example 5.5, thinking about practical issues with GLS and weighted LS.
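
A small WLS sketch in R; the data are simulated with error variance proportional to x, and the weights 1/x are an assumption that in practice would have to be estimated or argued for:

# Weighted least squares when the error variance grows with x
set.seed(10)
x <- runif(80, 1, 10)
y <- 2 + 0.5 * x + rnorm(80, sd = 0.3 * sqrt(x))  # variance proportional to x
fit_ols <- lm(y ~ x)                   # ignores the unequal variance
fit_wls <- lm(y ~ x, weights = 1 / x)  # weights proportional to 1 / Var(e_i)
summary(fit_ols)$coefficients
summary(fit_wls)$coefficients          # typically a smaller standard error for the slope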

 

Lecture 7, 30/1: Leverage/Influence points and Multicollinearity part 1

The lecture covers MPV 6 and MPV 9.1-9.4

We turn to Chapter 6, where methods for detecting influential observations are presented. It is important to learn the distinction between an outlier, a data point whose response y does not follow the general trend of the data, and a data point that has high leverage. Both outliers and high-leverage data points can be influential, i.e. they can dramatically change the results of the regression analysis, such as the predicted responses, the coefficient of determination, the estimated coefficients, and the results of the tests of significance. During this lecture, we will discuss various measures used for determining whether a point is an outlier, has high leverage, or both. Once such data points are identified, we then investigate whether they are influential. We will first consider a measure of leverage (see Section 6.2) and then discuss two measures of influence, Cook's distance and DFBETAS (the standardized change in each estimated coefficient when an observation is deleted). It is important to understand the general idea behind these measures; both are based on deletion diagnostics, i.e., they measure the influence of the i-th observation when it is removed from the data. It is also important to see that both measures combine the residual magnitude with the location of the point of interest in x-space.
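
The R sketch below plants one unusual point in otherwise simulated data and computes the leverage values, Cook's distance, and DFBETAS (the cutoffs shown are common rules of thumb, not hard rules):

# Leverage, Cook's distance and DFBETAS with one planted unusual point
set.seed(11)
x <- c(runif(30, 0, 10), 25)                 # last point is far out in x-space (high leverage)
y <- c(2 + 0.5 * x[1:30] + rnorm(30), 30)    # ... and does not follow the trend
fit <- lm(y ~ x)
h  <- hatvalues(fit)                         # leverage h_ii
cd <- cooks.distance(fit)                    # overall influence on the fitted coefficients
db <- dfbetas(fit)                           # standardized change in each coefficient when i is deleted
which(h > 2 * length(coef(fit)) / length(x)) # 2p/n rule of thumb for high leverage
which(cd > 1)                                # one common cutoff for Cook's distance
round(db[31, ], 2)                           # influence of the planted point on each coefficient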

Then we turn to the multicollinearity problem discussed in Chapter 9. Multicollinearity is present when two or more of the predictor variables in the model are moderately or highly correlated (nearly linearly dependent). It is important to understand the impact of multicollinearity on various aspects of the regression analysis. The main focus of the present lecture is on the effects of multicollinearity on the variance of the estimated regression coefficients, the length of the estimated vector of coefficients, and the prediction accuracy. Go through Section 9.3 and Example 9.1. Specifically, this example demonstrates that multicollinearity among the regressors does not prevent accurate prediction of the response within the scope of the model (interpolation) but seriously harms the prediction accuracy when extrapolating. We will then look at some techniques for detecting multicollinearity in Section 9.4.

 

Lecture 8, 4/2: Multicollinearity part 2

The lecture covers MPV 9.4.3 and 9.5.4, JWHT 6.3 and 10.2, Iz 5.6.3, HTF 3.5

Go through the whole of Section 9.4, focusing especially on the examples with simulated data, which demonstrate the need for measures of multiple correlation (not only pairwise correlation, such as examination of the matrix X'X in its correlation form) for the detection of multicollinearity. We also discuss some more general methods of multicollinearity diagnostics, such as variance inflation factors (VIFs) and the eigensystem analysis of X'X, to explain the nature of the linear dependence; check the example by Webster et al. to see how the elements of the eigenvectors can be used to uncover the linear relationships among the predictors.
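
The sketch below plants a near-exact linear dependence among three simulated regressors, computes the VIFs directly from their definition, and inspects the eigensystem of the correlation matrix (the cutoff 30 for condition indices is a common rule of thumb):

# Multicollinearity diagnostics: VIFs and eigensystem analysis of the correlation matrix
set.seed(12)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 0.05)   # near-exact linear dependence planted on purpose
X  <- cbind(x1, x2, x3)
# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the other regressors
vif <- sapply(1:ncol(X), function(j)
  1 / (1 - summary(lm(X[, j] ~ X[, -j]))$r.squared))
vif
e <- eigen(cor(X))                     # X'X in correlation form
sqrt(max(e$values) / e$values)         # condition indices; values above about 30 signal trouble
round(e$vectors[, which.min(e$values)], 2)  # eigenvector for the smallest eigenvalue points at the near-dependence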

After revising the methods for detecting multicollinearity (with a special focus on the eigensystem analysis presented in Section 9.4.3), we turn to methods for overcoming multicollinearity. We start by presenting Principal Components Analysis, PCA (JWHT 10.2), an unsupervised method for dimensionality reduction. In Principal Components Regression (PCR) the principal components are first obtained by transforming the original predictors, and then these components are used as newly derived predictors in the regression model (see MPV 9.5.4).

The key idea of overcoming multicollinearity using PCR is to exclude the principal components that correspond to the smallest eigenvalues (think about why precisely these components should be dropped). Be sure that you understand how the principal components are constructed from the original data matrix X. Check Example 9.3 and observe that the final PCR estimators of the original beta coefficients can be obtained by back-transforming.
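
A minimal PCR sketch on simulated collinear data, using prcomp() for the components; note that the back-transformed coefficients below are on the standardized scale and would need a further rescaling to return to the original units:

# Principal components regression: regress y on the leading principal components of X
set.seed(13)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- x1 + x2 + rnorm(n, sd = 0.05)   # collinear design
X  <- cbind(x1, x2, x3)
y  <- 1 + x1 - x2 + rnorm(n)
pc   <- prcomp(X, center = TRUE, scale. = TRUE)   # principal components of the standardized X
Z    <- pc$x[, 1:2]                               # drop the component with the smallest eigenvalue
fitZ <- lm(y ~ Z)
beta_std <- pc$rotation[, 1:2] %*% coef(fitZ)[-1] # back-transform to the standardized regressors
beta_std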

PCR does not guarantee that the components that best explain the predictors will also be the best ones for predicting the response. Partial Least Squares (PLS; Iz 5.6.3, HTF 3.5.2, JWHT 6.3.2) instead balances data reduction and prediction by identifying components that are both informative and predictive.

Important! These methods assume that the original data are centered and scaled to unit length (see MPV 3.9 (5th) or 3.10 (6th)), so that y and each of the columns of X have zero empirical mean.

Finally, we briefly discuss indicator variables for categorical data (see MPV 8.1-8.2 for a discussion).

 

Lecture 9, 5/2: Shrinkage methods

The lecture covers MPV 9.5.3, Iz 5.6.4, HTF 3.4.1, JWHT 6.2.1.

Another method for overcoming multicollinearity is Ridge regression, which shrinks the LS regression coefficients by imposing a penalty on their size (see section MPV 9.5.3 and Iz 5.6.4).

We consider two formulations of the ridge estimator, namely as a shrinkage estimator and as the solution of a penalized residual sum of squares problem. It is important to understand the role of the biasing parameter, also called the penalty, tuning, or shrinkage parameter, and how this parameter can be selected.

Observe that instead of the single solution we had in LS, ridge regression generates a path/trace of solutions as a function of the biasing parameter. Check carefully Example 9.2, where the choice of the parameter by inspection of the ridge trace is discussed. A computationally efficient approach to optimizing the biasing/shrinkage parameter is presented in the book by Izenman (Iz 5.7, the algorithm in Table 5.7). This approach considers a cross-validatory (CV) choice of the ridge parameter and is more suitable if the model is to be used for prediction.
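
A quick ridge-trace sketch in R using MASS::lm.ridge on simulated collinear data (the range of the biasing parameter is an arbitrary choice):

# Ridge trace: coefficient paths as a function of the biasing parameter
library(MASS)
set.seed(14)
n  <- 100
x1 <- rnorm(n); x2 <- x1 + rnorm(n, sd = 0.1); x3 <- rnorm(n)   # x1 and x2 nearly collinear
y  <- 1 + x1 + x2 + x3 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)
fit_ridge <- lm.ridge(y ~ x1 + x2 + x3, data = dat, lambda = seq(0, 10, 0.1))
plot(fit_ridge)            # the ridge trace: standardized coefficients vs the biasing parameter
MASS::select(fit_ridge)    # prints HKB, L-W, and GCV-based suggestions for the parameter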

Important again! Both PCR and ridge regression assume that the original data are already centered and rescaled, usually to unit length, although unit normal scaling can also be used. Both types of rescaling (see MPV 3.9 (5th) or 3.10 (6th)) are suitable; check the relationship between Z'Z and W'W to understand why.

 

Lecture 10, 6/2: Model assessment with bootstrap

The lecture covers Iz 5.3-5.4, MPV 15.4

The accuracy of the regression model can be assessed using the bootstrap, which, for example, can be used to evaluate the variability of the coefficients by constructing confidence intervals for them when the error distribution is non-normal. Some bootstrap strategies in regression will be discussed. Specifically, we focus on the parametric bootstrap, show how to create a bootstrap data set by re-sampling the residuals, and how to form bootstrap percentile intervals for the regression coefficients using the empirical quantiles.
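
A minimal sketch of the residual-resampling scheme and the percentile intervals, on simulated data with deliberately skewed errors (B = 2000 replicates is an arbitrary choice):

# Bootstrap by resampling residuals: percentile intervals for the regression coefficients
set.seed(15)
n <- 50
x <- runif(n, 0, 10)
y <- 1 + 0.5 * x + rexp(n) - 1        # skewed (non-normal) errors, centered at zero
fit   <- lm(y ~ x)
e_hat <- residuals(fit)
y_hat <- fitted(fit)
B <- 2000
boot_coef <- replicate(B, {
  y_star <- y_hat + sample(e_hat, n, replace = TRUE)  # new response from resampled residuals
  coef(lm(y_star ~ x))                                # refit on the bootstrap data set
})
apply(boot_coef, 1, quantile, probs = c(0.025, 0.975))  # 95% percentile intervals
confint(fit)                                            # compare with the normal-theory intervals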

 

Lecture 11, 11/2: LASSO and Variable Selection

The lecture covers JWHT 6.2.2, HTF 3.4.2, JWHT 6.1, MPV 10.1-10.2

We begin by recalling the properties of ridge regression, which shrinks the regression coefficients by imposing a penalty on their size, and derive an equivalent (constrained) formulation of the ridge problem. It is important to understand that there is a one-to-one correspondence between the shrinkage and constraint parameters of the two formulations. It is also important to understand that when there are many correlated variables in a linear model (i.e. a multicollinearity problem), their coefficients can become poorly determined and exhibit high variance. This problem can be alleviated by imposing a size constraint on the coefficients, i.e. by performing ridge regression.

We then discuss lasso regression, which is also a shrinkage method like ridge, with a subtle but important difference. Due to the structure of its penalty term, the lasso performs a kind of continuous variable selection, unlike ridge, which only shrinks. Computing the lasso solution is a quadratic programming problem, and efficient algorithms are available for obtaining the entire path of solutions with roughly the same computational cost as for ridge regression; the optimal value of the penalty parameter can be selected by cross-validation. Go through Section 6.2.2 in JWHT, with a focus on the variable selection properties of the lasso.
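
A minimal lasso sketch using the glmnet package on simulated data where only the first three of ten hypothetical regressors matter; the coefficient path and the cross-validated choice of the penalty are shown:

# Lasso path and cross-validated choice of the penalty (alpha = 1 gives the lasso)
library(glmnet)
set.seed(16)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, 1.5, rep(0, p - 3))          # only the first three regressors matter
y <- X %*% beta + rnorm(n)
fit_lasso <- glmnet(X, y, alpha = 1)          # whole path of solutions
plot(fit_lasso, xvar = "lambda")              # coefficients shrink and hit exactly zero
cv_fit <- cv.glmnet(X, y, alpha = 1)          # cross-validation over the penalty parameter
coef(cv_fit, s = "lambda.min")                # sparse coefficient vector at the selected penalty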

For all of the standard regression analyses performed so far, it was assumed that all the regressor variables are relevant, i.e. should be included in the model. This is usually not the case in practical applications; more often there is a large set of candidate regressors, from which the most appropriate ones must be identified and included in the final regression model. We begin by considering, theoretically, the consequences of model misspecification (e.g. the effect of deleting a set of variables on the bias and variance of the coefficients of the retained ones). Check, in detail, the whole summary 1-5 in Section 10.1.2 of MPV and the motivations for variable selection.

We discuss the best subsets regression approach (also called all possible subsets regression); the number of candidate models unfortunately grows very quickly, so be sure that you understand why. Objective criteria for selecting the "best" model are discussed. It is important to understand that different criteria can lead to different "best" models. Go through Section 10.1.3, where the different measures are presented and the relationships between them are discussed. Be sure that you understand why these measures are suitable for selecting the optimal model when using the all-possible-subsets strategy. Check Example 10.1 and the related tables and graphs, and be sure that you understand how to choose an optimal model based on the above-mentioned criteria.
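
A small all-subsets sketch using the leaps package on simulated data (six hypothetical regressors, three of them truly active), tabulating a few of the criteria side by side:

# All possible subsets: different criteria can favor different models
library(leaps)
set.seed(18)
n <- 100
dat <- data.frame(matrix(rnorm(n * 6), n, 6))
names(dat) <- paste0("x", 1:6)
dat$y <- 2 + 1.5 * dat$x1 - dat$x2 + 0.5 * dat$x3 + rnorm(n)
subs <- regsubsets(y ~ ., data = dat, nvmax = 6)    # best model of each size
s <- summary(subs)
data.frame(size = 1:6, adjR2 = s$adjr2, Cp = s$cp, BIC = s$bic)  # compare the criteria
s$which[which.max(s$adjr2), ]                        # regressors in the model chosen by adjusted R^2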

Read Section 10.2.2 of MPV about the general idea behind stepwise regression; be sure that you understand how to conduct stepwise regression using the partial F-statistic, and check Examples 10.3 and 10.4 to see the strategy for adding or removing a regressor. Think also about the limitations of best subsets and stepwise variable selection (see the general comments on stepwise-type approaches on p. 349 (5th) or 365 (6th)) in regression models. Go through Sections 10.3-10.4, which present the main steps of a good model-building strategy along with a case study.
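
For a quick experiment in R, the sketch below uses step(), which adds and drops regressors based on AIC rather than the partial F-statistic used in MPV, on simulated data with six hypothetical regressors:

# Stepwise selection with step(): AIC-based add/drop moves
set.seed(17)
n <- 100
dat <- data.frame(matrix(rnorm(n * 6), n, 6))
names(dat) <- paste0("x", 1:6)
dat$y <- 2 + 1.5 * dat$x1 - dat$x2 + rnorm(n)
full <- lm(y ~ ., data = dat)
null <- lm(y ~ 1, data = dat)
step(null, scope = formula(full), direction = "both", trace = FALSE)  # forward/backward moves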

 

Lecture 12, 12/2: Generalized linear models (GLMs)

The lecture covers MPV 13.1-13.4

The focus of the previous lectures was mainly on linear regression models with a least squares fit. Such linear models are suitable when the response variable is quantitative, and ideally when the error distribution is Gaussian. However, other types of responses arise in practice. For example, binary response variables can be used to indicate the presence or absence of some attribute (e.g., “cancerous” versus “normal” cells in biological data), where the binomial distribution is more appropriate. Sometimes the response occurs as counts (e.g., the number of arrivals in a queue or the number of photons detected); here the Poisson distribution might be involved. This lecture introduces a generalization of simple linear models. We begin with the logistic regression model, proceed to Poisson regression, and finally show how these are special instances of GLMs.
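
Both models are available in R through glm(); the sketch below fits a logistic and a Poisson regression to simulated data (all parameter values are made up):

# Logistic and Poisson regression with glm(): the link connects E[y] to the linear predictor
set.seed(19)
n <- 200
x <- rnorm(n)
# Binary response: logit link
p <- 1 / (1 + exp(-(-0.5 + 1.2 * x)))
y_bin <- rbinom(n, 1, p)
fit_logit <- glm(y_bin ~ x, family = binomial)
# Count response: log link
mu <- exp(0.2 + 0.6 * x)
y_cnt <- rpois(n, mu)
fit_pois <- glm(y_cnt ~ x, family = poisson)
summary(fit_logit)$coefficients
summary(fit_pois)$coefficients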

 

Lecture 13, 13/2: Bayesian Regression

The lecture covers some chapters of the book Bayesian Learning by Mattias Villani (Professor at Stockholm University).

In particular: 

  • Read chapter 1 for an introduction to Bayesian inference and 2.1-2.2 for basic models with i.i.d. data.
  • We discuss some of the distributions covered in chapter 4.
  • The focus is on regression models, covered in chapter 5, and regularization, covered in chapter 12.