Reduce multiple variables/dimensions into a single variable/dimension.
Three different approaches: mean, median, and first principal component.
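As a minimal sketch of the three approaches, the snippet below reduces each row of a small matrix to a single number. The matrix `m` and its values are made up for illustration and are not part of the dataset used in this post.

```r
# Toy matrix (hypothetical values): 3 instances (rows) x 4 variables (columns)
m <- matrix(c(1, 2, 3, 10,
              4, 5, 6, 20,
              7, 8, 9, 90),
            nrow = 3, byrow = TRUE)

row_mean   <- apply(m, 1, mean)    # one value per row
row_median <- apply(m, 1, median)  # one value per row
pc1_score  <- prcomp(m)$x[, 1]     # coordinate of each row along the first PC

print(data.frame(row_mean, row_median, pc1_score))
```

Note that the first principal component scores are centered around zero and their sign is arbitrary, so they are not on the same scale as the mean and median.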
library(scales)
N <- 10 # number of variables
x1 <- sample(seq(1, 20, by=1), N, replace = TRUE)
x2 <- sample(seq(1, 150, by=1), N, replace = TRUE)
x3 <- rnorm(N, mean=50, sd=10)
x4 <- rnorm(N, mean=50, sd=30)
x5 <- dlnorm(1:N, 2, 0.35) # log-normal density evaluated at 1..N (not a random sample)
x5 <- rescale(x5 , to=c(0,150))
df <- data.frame(x1,x2,x3,x4,x5)
df <- t(df)
df <- round(df, digits=2)
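The `rescale()` call above maps `x5` linearly onto the range 0 to 150. A minimal base-R sketch of that min-max idea is shown here; `minmax_rescale` is a hypothetical helper written for illustration, not a function from the scales package.

```r
# Hypothetical helper: linear min-max rescale of x onto [lo, hi]
minmax_rescale <- function(x, lo = 0, hi = 1) {
  rng <- range(x)
  (x - rng[1]) / (rng[2] - rng[1]) * (hi - lo) + lo
}

dens   <- dlnorm(1:10, meanlog = 2, sdlog = 0.35)  # log-normal density at 1..10
scaled <- minmax_rescale(dens, lo = 0, hi = 150)
range(scaled)  # smallest value maps to lo, largest to hi
```

The rescaling only changes the units, not the shape: the position of the peak is the same before and after.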
The dataset contains 5 instances (rows), each with 10 variables (columns).
df
##       [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]  [,10]
## x1   7.00   9.00   9.00  12.00  15.00   3.00   5.00  13.00   2.00  16.00
## x2 122.00  80.00  14.00  16.00  90.00 148.00 103.00  41.00 116.00  98.00
## x3  37.54  36.40  45.32  58.18  59.22  60.56  65.07  63.99  40.48  42.29
## x4  59.13   2.39  68.37  86.17  65.71  49.99  65.34  95.23   4.50 -17.33
## x5   0.00   0.50  12.85  57.11 114.03 148.37 150.00 129.45 100.74  73.13
Plot all data points.
# empty canvas: x axis = row index, y axis = full range of the data
plot(c(1,1), xlim=c(1, nrow(df)), ylim=c(min(df), max(df)), type = 'n')
for(i in 1:nrow(df)){
points(rep(i, ncol(df)), df[i,]) # plot the values of row i at x = i
}
Plot the data with a comparison of the three summaries.
# box and whisker plot
boxplot(t(df), main="blue=mean, black_bar=median, red=1st pc")
# plot average
row_mean <- apply(df, 1, mean)
lines(x=c(1: nrow(df)), y=row_mean, col="blue")
# plot first principal component in original scale
pca <- prcomp(df)
# reverse the PCA from prcomp to get back to the original scale, keeping only the
# first principal component (using every component would simply reproduce df)
# prcomp centers the variables, so the subtracted means must be added back
ori <- t(t(pca$x[, 1, drop = FALSE] %*% t(pca$rotation[, 1, drop = FALSE])) + pca$center)
# first column of ori: values of the first variable rebuilt from the first principal component
lines(x=c(1: nrow(df)), y=ori[, 1], col="red")
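One way to judge how much a single principal component can represent is the share of total variance it carries, which can be computed from the `sdev` element returned by `prcomp`. The sketch below uses a random stand-in matrix `demo` with the same shape as `df`; the seed and the values are assumptions for illustration only.

```r
set.seed(1)  # assumption: fixed seed, only so the sketch is repeatable
demo <- matrix(rnorm(50, mean = 50, sd = 10), nrow = 5)  # stand-in: 5 rows x 10 columns
demo_pca <- prcomp(demo)

# Share of total variance carried by each principal component
var_share <- demo_pca$sdev^2 / sum(demo_pca$sdev^2)
round(var_share, 3)

# Cumulative share when keeping the first 1, 2, ... components
round(cumsum(var_share), 3)
```

If the first entry of `var_share` is well below 1, a single component leaves a large part of the variation unexplained.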
For reducing multiple variables into a single variable, the first principal component does not perform well compared with the mean and median. However, PCA can reduce multiple variables into more than one variable.
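A minimal sketch of reducing to more than one variable: keep the first `k` columns of the score matrix returned by `prcomp`. The matrix `demo`, the seed, and the choice `k <- 2` are assumptions made for illustration.

```r
set.seed(2)  # assumption: fixed seed, only so the sketch is repeatable
demo <- matrix(rnorm(50, mean = 50, sd = 10), nrow = 5)  # 5 instances x 10 variables

k <- 2                                                  # number of components to keep
scores_k <- prcomp(demo)$x[, seq_len(k), drop = FALSE]  # 5 instances x k new variables

dim(scores_k)
colnames(scores_k)
```

Each instance is now described by `k` values instead of 10, and the retained components are uncorrelated with each other.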