Predict Seismic bumps using Logistic Regression in R
Logistic regression applied to the UCI seismic-bumps dataset in R
The dangers associated with coal mining are myriad; black lung, flammable gas pockets, rockbursts, and tunnel collapses are all very real dangers that mining companies must consider when attempting to provide safe working conditions for miners. One class of mining hazard, commonly called 'seismic hazards', is notoriously difficult to protect against and even more difficult to predict with certainty. Therefore, predicting these hazards has become a well-known problem for machine learning and predictive analytics. The UCI Machine Learning Repository (https://archive.ics.uci.edu) provides a 'seismic bumps' data set that contains many records of combined categorical and numeric variables that could be used to predict seismic hazards. This 'seismic bumps' data set can be found at https://archive.ics.uci.edu/ml/datasets/seismic-bumps.
Our analysis attempts to use logistic regression techniques to predict whether a seismic 'bump' is predictive of a notable seismic hazard. We attempt to characterize our prediction accuracy and compare the results against the state-of-the-art results from other statistical and machine learning techniques, which are included within the data set file.
# libraries.R is assumed to load the packages used throughout:
# rio (for import), dplyr, ggplot2, cowplot, corrplot, glmnet, and ROCR
source("libraries.R")
#url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00266/seismic-bumps.arff"
#download.file(url, "seismic-bumps.arff")
seismicData <- import("seismic-bumps.arff")
The data were taken from instruments in the Zabrze-Bielszowice coal mine in Poland. There are 2,584 records, of which only 170 have class = 1, so the data are significantly skewed towards non-hazardous training data. Field descriptions are below, but essentially energy readings and bump counts during one work shift are used to predict a 'hazardous' bump during the next shift. From the data description, a 'hazardous bump' is a seismic event with > 10,000 Joules, and a 'shift' is a period of 8 hours. For the sake of reference, 10,000 Joules is approximately the energy required to lift 10,000 tomatoes 1 m above the ground. A class = 1 result signifies that a hazardous bump did, indeed, occur in the shift following the measured data. Here is an example of the fields in the data set.
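A quick way to view this structure is dplyr's glimpse (a minimal sketch; we assume dplyr is loaded via libraries.R):
glimpse(seismicData)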
## Observations: 2,584
## Variables: 19
## $ seismic <fctr> a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a...
## $ seismoacoustic <fctr> a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, b...
## $ shift <fctr> N, N, N, N, N, W, W, N, N, W, N, N, W, N, N, W...
## $ genergy <dbl> 15180, 14720, 8050, 28820, 12640, 63760, 207930...
## $ gpuls <dbl> 48, 33, 30, 171, 57, 195, 614, 194, 303, 675, 1...
## $ gdenergy <dbl> -72, -70, -81, -23, -63, -73, -6, -27, 54, 4, -...
## $ gdpuls <dbl> -72, -79, -78, 40, -52, -65, 18, -3, 52, 25, -3...
## $ ghazard <fctr> a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a...
## $ nbumps <dbl> 0, 1, 0, 1, 0, 0, 2, 1, 0, 1, 1, 0, 1, 1, 1, 2,...
## $ nbumps2 <dbl> 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 1, 0, 1, 0,...
## $ nbumps3 <dbl> 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 2,...
## $ nbumps4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nbumps5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nbumps6 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nbumps7 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nbumps89 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ energy <dbl> 0, 2000, 0, 3000, 0, 0, 1000, 4000, 0, 500, 500...
## $ maxenergy <dbl> 0, 2000, 0, 3000, 0, 0, 700, 4000, 0, 500, 5000...
## $ class <fctr> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
The UCI Machine Learning Repository provides the field descriptions, summarized here:
There are 18 input variables and one binary output variable ("class"). The data are mostly numeric, with 4 categorical input variables. However, some of the numeric variables contain only a handful of discrete values and can be viewed as coded categorical variables; in particular, maxenergy and the 'nbumps(n)' variables could be treated this way (we treat maxenergy as continuous in the analysis below). In short, we see the following breakdown in variable types:
The categorical variables are seismic, seismoacoustic, shift, ghazard, nbumps, nbumps2, nbumps3, nbumps4, nbumps5, nbumps6, nbumps7, and nbumps89; the continuous variables are genergy, gpuls, gdenergy, gdpuls, energy, and maxenergy. The output variable is 'class'.
A table outlining the variables and some of their attributes is below:
variable | Cardinality | Nulls | Total | Uniqueness | Distinctness |
---|---|---|---|---|---|
class | 2 | 0 | 2584 | 0.0007739938 | 0.0007739938 |
energy | 242 | 0 | 2584 | 0.0936532508 | 0.0936532508 |
gdenergy | 334 | 0 | 2584 | 0.1292569659 | 0.1292569659 |
gdpuls | 292 | 0 | 2584 | 0.1130030960 | 0.1130030960 |
genergy | 2212 | 0 | 2584 | 0.8560371517 | 0.8560371517 |
ghazard | 3 | 0 | 2584 | 0.0011609907 | 0.0011609907 |
gpuls | 1128 | 0 | 2584 | 0.4365325077 | 0.4365325077 |
maxenergy | 33 | 0 | 2584 | 0.0127708978 | 0.0127708978 |
nbumps | 10 | 0 | 2584 | 0.0038699690 | 0.0038699690 |
nbumps2 | 7 | 0 | 2584 | 0.0027089783 | 0.0027089783 |
nbumps3 | 7 | 0 | 2584 | 0.0027089783 | 0.0027089783 |
nbumps4 | 4 | 0 | 2584 | 0.0015479876 | 0.0015479876 |
nbumps5 | 2 | 0 | 2584 | 0.0007739938 | 0.0007739938 |
nbumps6 | 1 | 0 | 2584 | 0.0003869969 | 0.0003869969 |
nbumps7 | 1 | 0 | 2584 | 0.0003869969 | 0.0003869969 |
nbumps89 | 1 | 0 | 2584 | 0.0003869969 | 0.0003869969 |
seismic | 2 | 0 | 2584 | 0.0007739938 | 0.0007739938 |
seismoacoustic | 3 | 0 | 2584 | 0.0011609907 | 0.0011609907 |
shift | 2 | 0 | 2584 | 0.0007739938 | 0.0007739938 |
It is important to understand how many observations are in the hazardous state (class = 1) versus the non-hazardous state (class = 0):
table(seismicData$class)
##
## 0 1
## 2414 170
As mentioned above, the output variable is highly skewed: there are many more non-hazardous observations than hazardous ones.
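Expressed as proportions (a quick sketch using base R's prop.table):
# Fraction of observations in each class
round(prop.table(table(seismicData$class)), 4)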
Distribution of all nbumps: we use cowplot to display the distributions of all the nbumps variables in a grid, as sketched below.
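A minimal sketch, assuming ggplot2 is also loaded via libraries.R alongside cowplot:
# Build one bar chart per nbumps* variable and arrange them in a grid
nbumpVars <- c("nbumps", "nbumps2", "nbumps3", "nbumps4",
               "nbumps5", "nbumps6", "nbumps7", "nbumps89")
plots <- lapply(nbumpVars, function(v) {
  ggplot(seismicData, aes(x = .data[[v]])) +
    geom_bar() +
    ggtitle(v)
})
plot_grid(plotlist = plots, ncol = 4)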
Collect all the numeric variables and check for multi-collinearity:
seismicDataNumeric <- seismicData %>%
  select(genergy, gpuls, gdenergy, gdpuls, energy, maxenergy)
# Create the correlation matrix
M <- round(cor(seismicDataNumeric), 2)
# Create corrplot
corrplot(M, method="pie", type = "lower")
To illustrate the problem with having an unbalanced class variable, we compare the accuracy of the full logistic regression model with that of the null model.
We first built a logistic regression model taking all the observations and variables into account. The summary of this full model is shown below.
# Reimport the dataset and start fresh
sb <- import("seismic-bumps.arff")
# Build a full model
fullModel <- glm(class ~ . , data = sb, family = "binomial")
# Summary full model
summary(fullModel)
##
## Call:
## glm(formula = class ~ ., family = "binomial", data = sb)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1035 -0.3637 -0.2827 -0.1781 3.0300
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.426e+00 2.695e-01 -16.425 < 2e-16 ***
## seismicb 3.079e-01 1.844e-01 1.670 0.09500 .
## seismoacousticb 3.625e-02 1.866e-01 0.194 0.84596
## seismoacousticc 5.049e-01 6.767e-01 0.746 0.45563
## shiftW 8.186e-01 2.926e-01 2.798 0.00515 **
## genergy -7.967e-07 4.561e-07 -1.747 0.08066 .
## gpuls 9.443e-04 2.148e-04 4.397 1.1e-05 ***
## gdenergy -1.106e-03 2.115e-03 -0.523 0.60094
## gdpuls -2.337e-03 2.855e-03 -0.819 0.41302
## ghazardb -9.745e-03 3.448e-01 -0.028 0.97746
## ghazardc -1.390e+01 4.250e+02 -0.033 0.97391
## nbumps 3.862e+00 1.460e+00 2.645 0.00817 **
## nbumps2 -3.478e+00 1.465e+00 -2.374 0.01760 *
## nbumps3 -3.452e+00 1.461e+00 -2.362 0.01819 *
## nbumps4 -3.778e+00 1.522e+00 -2.482 0.01307 *
## nbumps5 -2.076e+00 2.625e+00 -0.791 0.42915
## nbumps6 NA NA NA NA
## nbumps7 NA NA NA NA
## nbumps89 NA NA NA NA
## energy -9.632e-06 3.159e-05 -0.305 0.76042
## maxenergy 1.771e-06 3.121e-05 0.057 0.95475
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1253.8 on 2583 degrees of freedom
## Residual deviance: 1073.2 on 2566 degrees of freedom
## AIC: 1109.2
##
## Number of Fisher Scoring iterations: 15
Next, we use the full model to predict the probability of a hazardous bump for each observation. To convert these probabilities to classes, we define a threshold; a reasonable choice is the mean of the original response variable, i.e., the observed proportion of hazardous bumps. Probabilities greater than this threshold are assigned to the hazardous bump class, and probabilities at or below the threshold to the non-hazardous bump class. Model accuracy is then calculated by comparing the predictions with the actual class variable.
sb$prob <- predict(fullModel, type = "response")
# Find the actual probability
actualProbability <- mean(as.numeric(as.character(sb$class)))
# Use the actual probability to as threshold to generate predictions
sb$pred <- ifelse(sb$prob > actualProbability, 1, 0)
# Now calculate model accuracy
fullModelAccuracy <- mean(sb$pred == sb$class)
# Print it
fullModelAccuracy
## [1] 0.7352941
Our calculations show that the logistic regression model with all the observations and variables made a correct prediction approximately 73.5% of the time.
What if our model always predicted class = 0? What would the accuracy be?
sb$predNull <- 0
nullModelAccuracy <- mean(sb$predNull == sb$class)
nullModelAccuracy
## [1] 0.9342105
The null model accuracy is 93.42%. With a full-model accuracy of only 73.5%, the model actually performs worse than if it simply predicted class 0 for every record.
This illustrates that "rare events" create challenges for classification models. When one outcome is very rare, predicting the opposite can result in very high accuracy. We have demonstrated that accuracy is a very misleading measure of model performance on imbalanced datasets. Graphing the model's performance better illustrates the tradeoff between a model that is overly aggressive and one that is overly passive. In the next section we will create a ROC curve and compute the area under the curve (AUC) to evaluate the logistic regression model.
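Before moving on, a quick confusion matrix makes the failure mode concrete (a sketch reusing the pred and class columns computed above):
# Rows: predicted class; columns: actual class
table(predicted = sb$pred, actual = sb$class)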
When balancing the class variable, a key question to ask ourselves is by how much we should shrink the number of observations of the majority class. To aid in answering this question, we can use cross-validation to find a sensible value. We developed a function in R called cVLogisticRegression(n, k, size), where n is the number of observations of the majority class, k is the number of cross-validation folds, and size is the number of observations drawn for the training set in each fold (the remainder form the test set). The function returns a vector of "Area Under the Curve" (AUC) values. Histograms of these AUC values for n = 170, n = 500, and n = 1000 are shown below.
cVLogisticRegression <- function(n, k, size){
  # Reimport the dataset and start fresh
  sb <- import("seismic-bumps.arff")
  # Extract non-hazardous observations
  nonh <- sb %>%
    filter(class == 0)
  # Randomly choose n of them
  nonh <- nonh[sample(1:nrow(nonh), n), ]
  # Extract hazardous observations
  h <- sb %>%
    filter(class == 1)
  # Combine the nonh and h dataframes
  balancedSB <- rbind(nonh, h)
  # Shuffle the dataframe
  balancedSB <- balancedSB[sample(1:nrow(balancedSB)), ]
  # Extract the class variable into a vector called "dat.train.y"
  dat.train.y <- balancedSB$class
  # Create a model matrix using the entire balancedSB.
  # Earlier we split this into training/test data;
  # now we do that split within the loop.
  dat.train.x <- model.matrix(as.formula("class ~ ."), data = balancedSB)
  # Define the number of cross-validation loops
  nloops <- k
  # Number of observations in the full dataset
  ntrains <- nrow(dat.train.x)
  # Create an empty vector to hold each run's AUC;
  # later we will display these in a histogram
  cv.aucs <- c()
  for (i in 1:nloops) {
    # Randomly sample 'size' row indices
    index <- sample(1:ntrains, size)
    # The sampled rows form the training set ...
    cvtrain.x <- as.matrix(dat.train.x[index, ])
    # ... and the remaining rows form the test set
    cvtest.x <- as.matrix(dat.train.x[-index, ])
    # Get the corresponding class variables
    cvtrain.y <- dat.train.y[index]
    cvtest.y <- dat.train.y[-index]
    ## Model fitting:
    # Fit a penalized logistic regression model on the training set
    cvfit <- cv.glmnet(cvtrain.x, cvtrain.y, family = "binomial", type.measure = "class")
    ## Prediction:
    # Predict on the test set; type = "response" gives prediction probabilities
    fit.pred <- predict(cvfit, newx = cvtest.x, type = "response")
    # prediction() takes the predicted probabilities and the true classes
    pred <- prediction(fit.pred[, 1], cvtest.y)
    ## Prediction performance:
    roc.perf <- performance(pred, measure = "tpr", x.measure = "fpr")
    auc.train <- performance(pred, measure = "auc")
    auc.train <- auc.train@y.values
    # Store this run's AUC
    cv.aucs[i] <- auc.train[[1]]
  }
  return(cv.aucs)
}
For n = 170
# n - number of class 0 obs
# k - number of folds to use in cv
# size - size of each fold
aucVector <- cVLogisticRegression(170, 50, 60)
hist(aucVector, main = paste("Histogram of AUC for n = ", 170))
text(x = 0.75, y = 25, paste("Mean AUC = ", round(mean(aucVector), 2)), col = "blue")
For n = 500
# n - number of class 0 obs
# k - number of folds to use in cv
# size - size of each fold
aucVector <- cVLogisticRegression(500, 50, 60)
hist(aucVector, main = paste("Histogram of AUC for n = ", 500))
text(x = 0.75, y = 25, paste("Mean AUC = ", round(mean(aucVector), 2)), col = "blue")
For n = 1000
# n - number of class 0 obs
# k - number of folds to use in cv
# size - size of each fold
aucVector <- cVLogisticRegression(1000, 50, 60)
hist(aucVector, main = paste("Histogram of AUC for n = ", 1000))
text(x = 0.65, y = 22, paste("Mean AUC = ", round(mean(aucVector), 2)), col = "blue")
As we can see, as the n value increases (i.e., as the imbalance increases), our model performance decreases. This suggests that we should choose the n value carefully while building our model.
The effect of balancing the class variable can be observed in the table below:
n | Mean AUC |
---|---|
170 | close to 80% |
500 | close to 75% |
1000 | close to 50% |
This shows that the logistic regression model's performance depends heavily on having a balanced class variable.
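To make the trend explicit, we could sweep over several n values and plot the mean AUC for each (a sketch reusing cVLogisticRegression; the repeated model fits make this slow):
# Mean AUC for each level of class imbalance
ns <- c(170, 500, 1000)
meanAUCs <- sapply(ns, function(n) mean(cVLogisticRegression(n, 50, 60)))
plot(ns, meanAUCs, type = "b",
     xlab = "n (majority-class observations)", ylab = "Mean AUC")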
We will first balance the class variable so that there are approximately equal numbers of hazardous and non-hazardous cases.
# Reimport the dataset and start fresh
sb <- import("seismic-bumps.arff")
# Extract non-hazardous observations
nonh <- sb %>%
  filter(class == 0)
# Randomly choose 200 observations
nonh <- nonh[sample(1:nrow(nonh), 200), ]
# Extract hazardous observations
h <- sb %>%
  filter(class == 1)
# Combine the nonh and h dataframes
balancedSB <- rbind(nonh, h)
# Shuffle the dataframe
balancedSB <- balancedSB[sample(1:nrow(balancedSB)),]
Next, we will split the dataset into training (75%) and test (25%) datasets.
# Calculate N
N <- nrow(balancedSB)
# Create a random number vector
rvec <- runif(N)
# Select rows from the dataframe
balancedSB.train <- balancedSB[rvec < 0.75,]
balancedSB.test <- balancedSB[rvec >= 0.75,]
# Select rows for the class variable
train_class <- balancedSB.train$class
test_class <- balancedSB.test$class
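As a quick sanity check (a sketch), we can verify that both splits retain a mix of hazardous and non-hazardous cases:
# Class counts in the training and test splits
table(train_class)
table(test_class)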
Next, we build the model. Our goal is to select a model with the smallest number of coefficients that also gives good accuracy.
# Build the formula for model.matrix
formula <- as.formula("class ~ .")
# Build the model matrix object from the training data
balancedSBmatrix <- model.matrix(formula, data = balancedSB.train)
# Pass the matrix and class vector to cv.glmnet
cvfit <- cv.glmnet(balancedSBmatrix, train_class, family = "binomial", type.measure = "class")
# Plot the cross-validated misclassification error against lambda
plot(cvfit)
We are essentially looking for the lambda value which yields the lowest misclassification error; at that lambda, only the coefficients contributing the most remain nonzero.
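As an aside, cv.glmnet exposes two standard choices: lambda.min, the lambda with the lowest cross-validated error, and lambda.1se, the largest lambda within one standard error of that minimum (a sparser model). A minimal sketch to compare them:
# Lambda minimizing cross-validated misclassification error
cvfit$lambda.min
# Largest lambda within 1 SE of the minimum (sparser model)
cvfit$lambda.1se
# Number of nonzero coefficients at each choice
sum(as.vector(coef(cvfit, s = "lambda.min")) != 0)
sum(as.vector(coef(cvfit, s = "lambda.1se")) != 0)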
Check the coefficients at lambda.min:
coef(cvfit, s = "lambda.min")
## 22 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -1.752074e+00
## (Intercept) .
## seismicb 1.428659e-01
## seismoacousticb .
## seismoacousticc .
## shiftW 5.635677e-01
## genergy .
## gpuls 7.108425e-04
## gdenergy -1.933688e-03
## gdpuls .
## ghazardb 2.289537e-02
## ghazardc -2.262767e+00
## nbumps 4.597309e-01
## nbumps2 .
## nbumps3 1.648312e-01
## nbumps4 .
## nbumps5 1.519284e-01
## nbumps6 .
## nbumps7 .
## nbumps89 .
## energy -5.253076e-06
## maxenergy .
Prediction:
We will use the above model to predict on the test dataset.
# The predict function expects a matrix to be passed to newx.
# Note the use of balancedSB.test here instead of train.
balancedSBmatrix <- model.matrix(formula, data = balancedSB.test)
# predict on the test data, use type = "response" to get prediction probabilities
pred <- predict(cvfit, newx = balancedSBmatrix, type = "response")
# create a prediction object
predObj <- prediction(pred, test_class)
Prediction Performance:
# Measure the performance using true positive rate vs false positive rate
myroc.perf <- performance(predObj, measure = "tpr", x.measure = "fpr")
# Measure the performance using AUC
auc.test <- performance(predObj, measure = "auc")
# Get the AUC value to display on the ROC plot
auc.value <- auc.test@y.values
# Plot the ROC with AUC value
plot(myroc.perf)
abline(a = 0, b = 1)  # Reference line indicating chance-level performance
text(x = .40, y = .6,paste("AUC = ", round(auc.value[[1]],3), sep = ""))
Using cross-validation, we can assess how well our model-building process works; the idea is to estimate how well the model will perform on new data not yet collected. We will use AUC as the performance metric.
# Reimport the dataset and start fresh
sb <- import("seismic-bumps.arff")
# Extract non-hazardous observations
nonh <- sb %>%
  filter(class == 0)
# Randomly choose 200 observations
nonh <- nonh[sample(1:nrow(nonh), 200), ]
# Extract hazardous observations
h <- sb %>%
  filter(class == 1)
# Combine the nonh and h dataframes
balancedSB <- rbind(nonh, h)
# Shuffle the dataframe
balancedSB <- balancedSB[sample(1:nrow(balancedSB)),]
# Extract the class variable into a vector call it "dat.train.y"
dat.train.y <- balancedSB$class
# Create a model matrix using the entire balancedSB.
# Earlier we split this into training/test data.
# Now we are going do that within the loop.
dat.train.x <- model.matrix(as.formula("class ~ ."), data = balancedSB)
Now we are ready for the cross-validation loop:
# Define the number of cross-validation loops
nloops <- 50
# Number of observations in the training (i.e full) dataset
ntrains <- nrow(dat.train.x)
# Create an empty vector to hold all the AUC result.
# Later we will display each run's AUC in a histogram
cv.aucs <- c()
Run the cross-validation:
for (i in 1:nloops) {
  # Randomly sample 60 row indices
  index <- sample(1:ntrains, 60)
  # The sampled rows form the training set ...
  cvtrain.x <- as.matrix(dat.train.x[index, ])
  # ... and the remaining rows form the test set
  cvtest.x <- as.matrix(dat.train.x[-index, ])
  # Get the corresponding class variables
  cvtrain.y <- dat.train.y[index]
  cvtest.y <- dat.train.y[-index]
  ## Model fitting:
  # Fit a penalized logistic regression model on the training set
  cvfit <- cv.glmnet(cvtrain.x, cvtrain.y, family = "binomial", type.measure = "class")
  ## Prediction:
  # Predict on the test set; type = "response" gives prediction probabilities
  fit.pred <- predict(cvfit, newx = cvtest.x, type = "response")
  # prediction() takes the predicted probabilities and the true classes
  pred <- prediction(fit.pred[, 1], cvtest.y)
  ## Prediction performance:
  roc.perf <- performance(pred, measure = "tpr", x.measure = "fpr")
  auc.train <- performance(pred, measure = "auc")
  auc.train <- auc.train@y.values
  # Store this run's AUC
  cv.aucs[i] <- auc.train[[1]]
}
Draw a histogram of cv.aucs:
hist(cv.aucs)
text(x = 0.75, y = 22, paste("Mean AUC = ", round(mean(cv.aucs), 2)), col = "blue")
This indicates that the majority of the time our model's prediction performance (AUC) lies between 70% and 75%.
Within the ARFF file is a large table of various classification techniques and their results.
Classification results using stratified 10-fold cross-validation repeated 10 times
Algorithm | Acc. | Balanced Acc. | Class 0 Acc. (specificity) | Class 1 Acc. (sensitivity) | Size |
---|---|---|---|---|---|
q-ModLEM(entropy-RSS) (1) | 80.2(5.1) | 69.1(6.2) | 81.90 | 56.35 | 27.5 |
q-ModLEM(entropy-Corr.) (1) | 82.9(4.5) | 67.9(7.2) | 85.15 | 50.65 | 45.5 |
MODLEM (2) | 92.5(0.8) | 52.6(2.8) | 98.58 | 6.65 | 145.5 |
MLRules(-M 30) (3) | 93.2(0.3) | 50.5(1.3) | 99.69 | 1.29 | 30 |
MLRules(-M 100) (3) | 92.9(0.6) | 52.0(2.2) | 99.10 | 4.88 | 100 |
MLRules(-M 500) (3) | 92.3(0.6) | 52.9(2.8) | 98.27 | 7.59 | 500 |
BRACID (4) | 87.5(0.4) | 62.0(2.6) | 91.38 | 32.71 | - |
Jrip (Weka) | 93.0(0.6) | 51.4(2.4) | 99.35 | 3.47 | 1.8 |
PART (Weka) | 92.1(0.8) | 52.7(3.5) | 98.09 | 7.35 | 34 |
J48 (Weka) | 93.1(0.8) | 50.2(0.9) | 99.64 | 0.82 | 5.6 |
SimpleCart (Weka) | 93.4(0.0) | 50.0(0.0) | 100 | 0.00 | 1.0 |
NaiveBayes (Weka) | 86.7(2.0) | 64.7(5.8) | 90.08 | 39.41 | - |
IB1 (Weka) | 89.4(1.6) | 55.3(4.8) | 94.54 | 16.06 | - |
RandomForest(-I 100) (Weka) | 93.1(0.6) | 52.1(2.5) | 99.31 | 4.88 | 100 |