================================================================
* COLLINEARITY -- A SPECIAL CASE OF NON-IDENTIFIABILITY
- Interpretation issues:
. NOT a model problem, but causes interpretation issues
. Commonalities of collinearity and leverage/self-influence:
+ both are properties of the predictor distribution
+ neither is a model problem because neither involves y
+ both cause interpretation problems
. Difference between collinearity and leverage/self-influence:
+ Leverage/self-influence has to do with "outliers"
in (p+1)-dimensional predictor space.
+ Collinearity has to do with near-degeneracy of
N-dimensional predictor vectors
=> The interpretation problems they cause are completely different.
. Compare the following design matrices and comment on their
properties and interpretation of coefficients:
            |1 0|        |1 0|        |1 1 0|
            |1 0|        |1 0|        |1 1 0|
        X = |0 1|,       |1 1|,       |1 0 1|
            |0 1|        |1 1|        |1 0 1|
. In the third matrix, what is x0 adjusted for x1 and x2,
or x1 adjusted for x0 and x2, or x2 adjusted for x0 and x1?
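. A quick check in R for the third matrix: x0 = x1 + x2, so each adjusted
  predictor is zero and the coefficients are not identifiable.
  (A minimal sketch; the response y below is made up, for illustration only:)
    x1 <- c(1,1,0,0);  x2 <- c(0,0,1,1);  x0 <- x1 + x2
    y  <- c(1,2,3,4)                 # any y will do
    coef(lm(y ~ 0 + x0 + x1 + x2))   # lm() reports NA for the aliased column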
- EFFECTS OF COLLINEARITY 1: SIMPSON'S PARADOX
. Practical example:
"A marketing project identified a list of affluent customers for
its new PDA. Should it focus on the younger or older members of
this list?"
"To answer this question, the marketing firm obtained a sample
of 75 consumers and asked them to rate their 'likelihood of
purchase' on a scale of 1 to 10."
Data:
site <- "http://stat.wharton.upenn.edu/~buja/STAT-541/"
pda <- read.csv(paste(site,"pda.dat",sep=""))  # read from the course site ...
pda <- read.csv("Data/pda.dat")                # ... or from a local copy
names(pda)
dim(pda)
plot(pda, pch=16)                              # scatterplot matrix
summary(lm(Rating ~ Age, data=pda))
Observations: ???
. For increasing Age we see on average increasing Rating
which agrees with the simple regression of Rating on Age:
summary(lm(Rating ~ Age, data=pda))$coef
. However, adjusted for Income, we have a reversal of the slope for Age:
summary(lm(Rating ~ Income + Age, data=pda))$coef
. How is this possible? Collinearity of Age and Income:
cor(pda[,c("Age","Income")])
==> SIMPSON'S PARADOX -- REVERSAL OF SIGNS OF SLOPES UNDER ADJUSTMENT
    (This effect is possible but not frequent in the presence of collinearity.)
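. A minimal simulated illustration of the mechanism (all coefficients and
  variable names below are made up for this sketch):
    set.seed(1)
    n      <- 200
    Age    <- rnorm(n)
    Income <- 0.9*Age + sqrt(1-0.9^2)*rnorm(n)  # Age and Income highly collinear
    Rating <- 2*Income - 1*Age + rnorm(n)       # true Age slope is negative
    coef(lm(Rating ~ Age))           # marginal Age slope comes out positive ...
    coef(lm(Rating ~ Income + Age))  # ... reversed after adjusting for Income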
- EFFECTS OF COLLINEARITY 2: REDUCED SIGNIFICANCE OF SLOPES / LOSS OF POWER IN t-TESTS
. Recall: "Variance Inflation Factor" VIF and
"Standard Error Inflation Factor" SEIF = sqrt(VIF)
. Critical formula: SE.est(bj) = s / |xj.adj|
  Collinearity of xj with the other predictors makes |xj.adj| small,
  hence SE.est(bj) large.
. Interpretation: SEIF shows by how much the CI width
has been inflated by collinearity.
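. A hedged check on the pda data (Age's only other predictor is Income):
    r2   <- summary(lm(Age ~ Income, data=pda))$r.squared
    VIF  <- 1/(1 - r2)   # variance inflation factor of Age
    SEIF <- sqrt(VIF)    # the CI for the Age slope is wider by this factor
    c(VIF=VIF, SEIF=SEIF)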
----------------------------------------------------------------
* COLLINEARITY ANALYSIS WITH SMALLEST PRINCIPAL COMPONENTS:
- Vectorized effects of collinearity on inference for the coeffs:
. The (stderr-)variance matrix of b is
V[b] = (X^T X)^{-1} sigma^2
. If the eigendecomposition of (X^T X) is:
(X^T X) = lambda1 P1 + lambda2 P2 +... (p+1 terms, Pj = uj uj^T)
then the eigendecomposition of (X^T X)^{-1} is:
(X^T X)^{-1} = 1/lambda1 P1 + 1/lambda2 P2 +...
[Side remark: Recall HW 3, spectral theory
If S is a symmetric matrix with eigendecomp
S = sum lambdaj Pj
and if f(lambda) is a numeric function, then
f(S) = sum f(lambdaj) Pj ]
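[Quick numerical check of the side remark with f(lambda) = 1/lambda,
using a made-up matrix:
    S <- crossprod(matrix(rnorm(30), 10, 3))  # random symmetric PD matrix
    e <- eigen(S)
    all.equal(solve(S), e$vectors %*% diag(1/e$values) %*% t(e$vectors))  # TRUE ]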
Consequence: Linear combinations uj^T b with small lambdaj
are statistically ill-determined (dataset-to-dataset):
V[uj^T b] = uj^T V[b] uj = sigma^2/lambdaj = huge if lambdaj is small
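Numerical check with a made-up design (the name 'Xr' is introduced here;
sigma^2 is taken to be 1):
    set.seed(1)
    Xr <- cbind(1, matrix(rnorm(300), 100, 3))
    e  <- eigen(crossprod(Xr))
    uj <- e$vectors[,4]                  # direction of the smallest eigenvalue
    c(t(uj) %*% solve(crossprod(Xr)) %*% uj, 1/e$values[4])  # the two agree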
- Extreme case: perfect collinearity -- rank deficient design matrix
X b = 0 <==> (X^T X) b = 0 <==> b^T (X^T X) b = 0 <==> |X b|^2 = 0
<==> b is eigenvector of (X^T X) with eigenvalue 0
Determination of rank of X: eigen decomposition of X^T X
Rank deficiency: count the (near) zero eigenvalues
Each eigenvector for a zero eigenvalue is a 'b' with X b = 0
Example:
boston.lm <- lm(MEDV ~ ., data=boston)
X <- model.matrix(boston.lm)
eigen(crossprod(X))$values # full rank
XX <- cbind(X,x12=X[,1]+X[,2],apply(X,1,sum)) # Add two collinear columns
eigen(crossprod(XX))$values # rank deficient
Look for an eigenvalue several orders of magnitude smaller than the one before it (ex: .9 followed by 8.4e-09).
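A small programmatic flag (the threshold 1e-8 is a made-up choice):
    ev <- eigen(crossprod(XX))$values
    which(ev < 1e-8 * max(ev))  # indices of (near) zero eigenvalues => rank deficiency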
- Problem: Scaling of predictors
XX <- cbind(X[,-13]*.000001,X[,13]*1000000) # example: wildly rescaled predictors
eigen(crossprod(XX))$values # negative e'vals?! numerical breakdown of eigen():
                            # X^T X is PSD, so true eigenvalues are never negative
eigen(cor(XX[,-1]))$values  # reasonable scaling of the predictors avoids the problem
Solution: Scale the predictors, use 'cor()' instead of 'crossprod()'
but remove intercept vector, which is not of interest anyway
(centering/adjusting-for-the-mean is implicit in correlations)
eigen(cor(X[,-1]))$values
XX <- cbind(X[,-1],x12=X[,2]+X[,3],apply(X,1,sum)) # Append collinear predictors
eigen(cor(XX))$values
round(eigen(cor(XX))$vectors[,14:15], 4) # eigenvectors of the two (near) zero eigenvalues
- Standardizing the predictors: Often practiced in social sciences
Slope of x = est. ave. difference in y for a 1 sdev difference in x
'at fixed levels of all other predictors'
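A minimal sketch (assumes the boston data frame from the example above;
only the predictors are standardized, not the response MEDV):
    boston.sx <- data.frame(scale(boston[, names(boston) != "MEDV"]),
                            MEDV = boston$MEDV)
    round(summary(lm(MEDV ~ ., data=boston.sx))$coef, 3) # slopes per 1-sdev step in x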
- Collinearity analysis with "smallest principal components":
Use of smallest eigenvalues/vectors to detect near-collinearity
PCA: Usually used for DIMENSION REDUCTION => largest eigenvalues/vectors
but COLLINEARITY ANALYSIS requires: smallest eigenvalues/vectors
Idea: Assume the columns of X are centered (intercept removed)
LS: find b such that |y - X b|^2 = min
PCA: find b such that var(X b) ~ |X b|^2 = max/min
for collinearity analysis we are interested in var(X b) small
But: PCA is ill-posed; b->Inf or b=0 solve the problem trivially.
=> PCA needs a constraint.
Idea: Inspired by collinearity analysis, take the best possible case
for comparison, which is: the predictors are pairwise orthogonal.
IF this were the case:
var(X b) = |x1|^2 b1^2 + ... + |xp|^2 bp^2
w.l.o.g.: |x1|^2 = ... = |xp|^2 = 1
var(X b) = b1^2 + ... + bp^2 = |b|^2
=> var(X b) = max/min subject to |b|^2 = 1
=> Eigenproblem: var(X b) = b^T cor(X) b = max/min subj. to |b|^2=1
Interpretation: Eigenvalues = variances (max/min/stationary)
var(X b) ~=~ 0 => X b ~=~ 0 for small eigenvalues
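A hand-rolled check of this eigenproblem (assumes the Boston model matrix X
from above; the name 'Xs' is introduced here):
    Xs <- scale(X[,-1])                   # drop intercept, standardize columns
    e  <- eigen(cor(Xs))
    b  <- e$vectors[, ncol(Xs)]           # eigenvector of the smallest eigenvalue
    c(e$values[ncol(Xs)], var(Xs %*% b))  # min variance = smallest eigenvalue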
- Canned smallest eigendecomps. of the correlation matrix of X (w/o intercept)
site <- "http://stat.wharton.upenn.edu/~buja/STAT-541/"
source(paste(site, "collinearity-pca.R", sep=""))
X.scaled <- scale(X[,-1]) # standardize the columns, remove intercept
collinearity.pca(X.scaled, nlin=3)
$variances
            PC13   PC12   PC11
           0.064  0.169  0.186

$coefficients
             PC13   PC12   PC11
  CRIM     -0.046  0.087 -0.110
  ZN        0.081 -0.071  0.263
* INDUS     0.251 -0.113 -0.303
  CHAS     -0.036 -0.004  0.014
  NOX      -0.044  0.804  0.111
  RM       -0.046  0.153  0.053
  AGE       0.039 -0.212 -0.459
  DIS       0.018  0.391 -0.696
* RAD       0.633 -0.107  0.037
* TAX      -0.720 -0.215 -0.105
  PTRATIO  -0.023  0.210  0.175
  B         0.004  0.042  0.019
  LSTAT    -0.024  0.055  0.271
Interpretation of the smallest eigenvalue:
The smallest PC has 6.4% of the variance of a single variable (var(xj)=1).
Its standard deviation is sqrt(0.064)=0.253, about a quarter of the sdev of each xj.
Interpretation of its eigenvector: Round the coeffs first.
INDUS: +1/4 RAD: +2/3 TAX: -3/4
Defines a linear combination with small variance, hence examine:
1/4*INDUS + 2/3*RAD - 3/4*TAX ~=~ 0
Equivalently:
3/4*TAX ~=~ 1/4*INDUS + 2/3*RAD
Equivalently:
TAX ~=~ 1/3*INDUS + 8/9*RAD
Check with a plot:
plot(TAX ~ I(1/3*INDUS + 8/9*RAD),
data=as.data.frame(X.scaled), pch=16, cex=1)
(Recall the formula language can also be used for plotting.)
Suspicious plot: These are not 506 points...
windows(width=6, height=12)
par(mfrow=c(2,1), mgp=c(1.8,.5,0), mar=c(3,3,1,1), cex.lab=.8, cex.axis=.8)
plot(TAX ~ I(1/3*INDUS + 8/9*RAD),
data=as.data.frame(X.scaled), pch=16, cex=.5)
Jitter the same:
n <- nrow(X.scaled)
jit <- .1
jit.x <- runif(n,-jit,jit)
jit.y <- runif(n,-jit,jit)
plot(TAX + jit.y ~ I(1/3*INDUS + 8/9*RAD + jit.x),
data=as.data.frame(X.scaled), pch=16, cex=.5)
Alternative: fuzzy scatter
smoothScatter(x=1/3*boston[,"INDUS"] + 8/9*boston[,"RAD"], y=boston[,"TAX"],
              xlim=c(0,30), ylim=c(100,800))  # note: raw (unscaled) boston data
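A final numerical check of the near-collinearity (the R^2 should be high,
reflecting the small eigenvalue):
    summary(lm(TAX ~ INDUS + RAD, data=as.data.frame(X.scaled)))$r.squared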
================================================================