1 Data compression

Reduce multiple variables/dimensions into a single variable/dimension.

Three different approaches: mean, median, and first principal component.
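As a minimal sketch of the three summaries, the toy matrix below is collapsed to one value per row. The matrix `m` and the names `row_mean`, `row_median`, and `pc1_score` are illustrative assumptions and are not part of the dataset built in the next section.

```r
# Toy data: 4 instances (rows) x 3 variables (columns); values are made up
m <- matrix(c(1, 2, 3,
              2, 4, 6,
              3, 6, 9,
              4, 8, 12), nrow = 4, byrow = TRUE)

row_mean   <- rowMeans(m)            # one mean per row
row_median <- apply(m, 1, median)    # one median per row
pc1_score  <- prcomp(m)$x[, 1]       # first principal component score per row
```

Note that the principal component scores are expressed in centered units and their sign is arbitrary, so they are not directly comparable to the mean or median without mapping them back to the original scale.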

1.1 Build the dataset

library(scales)
N <- 10 # number of variables
x1 <- sample(seq(1, 20, by=1), N, replace = T)
x2 <- sample(seq(1, 150, by=1), N, replace = T)
x3 <- rnorm(N, mean=50, sd=10) 
x4 <- rnorm(N, mean=50, sd=30)
x5 <- dlnorm(1:N, 2, 0.35)    # log-normal density evaluated at 1..N
x5 <- rescale(x5  , to=c(0,150))
df <- data.frame(x1,x2,x3,x4,x5)
df <- t(df)
df <- round(df, digits=2)

The dataset contains 5 instances (rows), each with 10 variables (columns).

df
##      [,1]  [,2]  [,3]  [,4]   [,5]   [,6]   [,7]   [,8]   [,9]  [,10]
## x1   7.00  9.00  9.00 12.00  15.00   3.00   5.00  13.00   2.00  16.00
## x2 122.00 80.00 14.00 16.00  90.00 148.00 103.00  41.00 116.00  98.00
## x3  37.54 36.40 45.32 58.18  59.22  60.56  65.07  63.99  40.48  42.29
## x4  59.13  2.39 68.37 86.17  65.71  49.99  65.34  95.23   4.50 -17.33
## x5   0.00  0.50 12.85 57.11 114.03 148.37 150.00 129.45 100.74  73.13

1.2 Plot data with comparison

plot all data points

# plot  
plot(c(1,1), xlim=c(1, nrow(df)), ylim=c(min(df), max(df)), type = 'n')
for(i in 1:nrow(df)){
  points(array(i, ncol(df)), df[i,]) # plot each row
}

Plot the data with the mean, median, and first principal component for comparison

# box and whisker plot
boxplot(t(df), main="blue=mean, black_bar=median, red=1st pc")

# plot average 
row_mean <- apply(df, 1, mean)
lines(x=c(1: nrow(df)), y=row_mean, col="blue")

# plot first principal component in original scale
pca <- prcomp(df)
# reverse the PCA using only the first component
# prcomp centers the variables, so add the subtracted means back
pc1 <- pca$x[, 1, drop = FALSE] %*% t(pca$rotation[, 1, drop = FALSE])
ori <- t(t(pc1) + pca$center)
# first column of ori: values of the first variable rebuilt from the first principal component
lines(x=c(1: nrow(df)), y=ori[, 1], col="red")
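
To illustrate what reversing a PCA does, the sketch below rebuilds a matrix from its first k components. The data `m` is made up, and `reconstruct` is a hypothetical helper written for this example, not a function provided by `prcomp`. With every component kept, the original values come back up to rounding error; with only the first component, the result is an approximation.

```r
# Hypothetical helper: rebuild the data from the first k principal components
reconstruct <- function(p, k) {
  approx <- p$x[, 1:k, drop = FALSE] %*% t(p$rotation[, 1:k, drop = FALSE])
  sweep(approx, 2, p$center, "+")    # add back the column means removed by prcomp
}

set.seed(1)                            # arbitrary seed so the toy example is repeatable
m <- matrix(rnorm(5 * 10), nrow = 5)   # 5 instances x 10 variables, made-up values
p <- prcomp(m)

full <- reconstruct(p, ncol(p$x))      # all components
one  <- reconstruct(p, 1)              # first component only

max(abs(full - m))                     # numerically zero
max(abs(one - m))                      # clearly larger than zero
```

The helper assumes `prcomp` was called with its default `scale. = FALSE`; if the variables were scaled, the columns would also need to be multiplied by `p$scale` before the means are added back.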

For reducing multiple variables into a single variable, the first principal component does not perform as well in this example as the mean or the median. PCA can, however, reduce multiple variables into more than one variable.
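
Keeping more than one component can be sketched as follows. The data `m`, the seed, and the choice `k = 2` are assumptions made for illustration only. The sketch keeps the first k score columns and reports the share of total variance they cover.

```r
set.seed(2)                                       # arbitrary seed for a repeatable example
m <- matrix(rnorm(5 * 10, mean = 50, sd = 10), nrow = 5)   # 5 instances x 10 variables
p <- prcomp(m)

k <- 2                                            # number of components to keep (assumed)
reduced <- p$x[, 1:k, drop = FALSE]               # 5 x 2: each instance is now 2 values

# cumulative proportion of variance covered by the first 1, 2, ... components
var_explained <- cumsum(p$sdev^2) / sum(p$sdev^2)
var_explained[k]
```

A larger k retains more of the variance at the cost of less compression, so the value of `var_explained[k]` is one way to judge whether a given k is enough.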
