Create Categorical Variable From Continuous in R
I have the following question: are there any standard methods for converting a continuous response variable into a categorical variable?
To give my question some context, consider the following example (using the R programming language). Suppose you have the following data (I also posted the code for generating this data, to make my example more reproducible):
#generate data (all vectors must have the same length, 1000,
#so that data.frame() below does not silently recycle them)
a = rnorm(1000, 10, 10) + rnorm(1000, 6, 11)
b = rnorm(1000, 10, 5)
c = rnorm(1000, 5, 10)
cat_var <- sample(LETTERS[1:2], 1000, replace = TRUE, prob = c(0.5, 0.5))
old_response_variable <- rnorm(1000, 250, 100)

#put data into a frame
d = data.frame(a, b, c, cat_var, old_response_variable)
d$cat_var = as.factor(d$cat_var)

#view data
head(d)
           a         b         c cat_var old_response_variable
1  -2.153779 15.135098  7.903363       B              233.7632
2  10.529895  5.055633  4.959639       B              372.3922
3  20.600232 10.333690 12.749611       B              349.6630
4  41.885899 17.280700 26.760988       B              164.3122
5  17.174567 11.878346 -3.306771       A              272.9595
6  21.524126 12.449084  6.911237       A              179.7316

In this dataset, the variables a, b, c, and cat_var are the predictor variables (covariates) and "old_response_variable" is the continuous response variable. I am interested in converting "old_response_variable" into a binary categorical response variable, and then training a statistical model (e.g. a decision tree) on this data for the purpose of supervised classification.
Proposed Strategy:
The plot of the "old_response_variable" looks like this:
plot(density(d$old_response_variable), main = "Distribution of the Old Response Variable")
The "old response variable" can take values between 0 and 600. Since I am interested in binary classification, I thought I could:
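Before searching over random thresholds, a simpler, model-free baseline worth noting (my suggestion, not part of the original question) is to split at the median with cut(), which guarantees balanced classes by construction:

```r
# Median split as a model-free baseline (a sketch, not the strategy below).
set.seed(123)
old_response_variable <- rnorm(1000, 250, 100)

# cut() at the median yields a two-level factor with 500 observations per class
new_response_variable <- cut(old_response_variable,
                             breaks = c(-Inf, median(old_response_variable), Inf),
                             labels = c("0", "1"))
table(new_response_variable)
```

A quantile split like this fixes the class balance in advance, whereas the search strategy below lets the balance vary with the threshold.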
1) Make random splits (i.e. threshold) in the "old response variable" (e.g. if old_response_variable < 250 then new_response_variable = "0" else "1")
2) Train a decision tree model on the data from 1)
3) Record performance metrics (e.g. accuracy, sensitivity, specificity) from the model in 2)
4) Repeat steps 1) - 3) many times, and choose the final threshold that has "suitable" values of accuracy, sensitivity, and specificity (e.g. a threshold where the decision tree has high accuracy and high sensitivity but low specificity might be less advantageous than a threshold where it has medium accuracy, medium sensitivity, and medium specificity).
Here is the R code corresponding to this strategy (on a small scale):
library(ggplot2)
library(caret)
library(rpart)

#generate data (same frame as above, with all vectors of length 1000)
a = rnorm(1000, 10, 10) + rnorm(1000, 6, 11)
b = rnorm(1000, 10, 5)
c = rnorm(1000, 5, 10)
cat_var <- sample(LETTERS[1:2], 1000, replace = TRUE, prob = c(0.5, 0.5))
old_response_variable <- rnorm(1000, 250, 100)

#put data into a frame
d = data.frame(a, b, c, cat_var, old_response_variable)
d$cat_var = as.factor(d$cat_var)

#candidate thresholds and a table to hold the results
vec1 <- sample(250:300, 50)
df <- expand.grid(vec1)
df$Accuracy <- NA
df$sens <- NA
df$spec <- NA

for (i in seq_along(vec1)) {

    #step 1: dichotomize the response at the i-th candidate threshold
    d$new_response_variable = as.factor(
        as.integer(ifelse(d$old_response_variable < vec1[i], 0, 1)))

    #step 2: train a decision tree (2-fold CV, repeated once)
    fitControl <- trainControl(method = "repeatedcv",
                               number = 2,
                               repeats = 1)
    TreeFit <- train(new_response_variable ~ .,
                     data = d[, -5],   #drop old_response_variable
                     method = "rpart",
                     trControl = fitControl)

    #step 3: record performance metrics
    #(confusionMatrix takes the predictions first, the reference second;
    # byClass is indexed by name so sensitivity/specificity are not mixed up)
    pred <- predict(TreeFit, d[, -5])
    con <- confusionMatrix(pred, d$new_response_variable)
    df$Accuracy[i] <- con$overall["Accuracy"]
    df$sens[i] <- con$byClass["Sensitivity"]
    df$spec[i] <- con$byClass["Specificity"]
}

#view final results ("Var1" is the threshold)
head(df)
  Var1 Accuracy      sens      spec
1  299    0.682 0.8125000 0.6798780
2  289    0.657 0.7358491 0.6525871
3  271    0.573 0.8125000 0.5691057
4  278    0.622 0.6491228 0.6185102
5  253    0.540 0.5352564 0.6093750
6  258    0.549 0.5305623 0.6318681

We can visualize the results of this strategy:
ggplot(df, aes(Var1)) +
    geom_line(aes(y = Accuracy, colour = "Accuracy")) +
    geom_line(aes(y = sens, colour = "sens")) +
    geom_line(aes(y = spec, colour = "spec")) +
    ggtitle("Results of Threshold Splitting Strategy")
According to the plot above, a splitting threshold of approximately 280 (if old_response_variable < 280 then new_response_variable = "0" else "1") appears to be a suitable choice, with balanced accuracy, specificity, and sensitivity.
Question: Are there any major statistical flaws in this threshold-splitting strategy? The main flaw I can think of is that the "suitability" of the threshold is decided by how well one particular model (the decision tree) performs: it is quite possible that the chosen threshold is not a naturally occurring or ideal threshold, but simply one that happens to suit the (multiple) models we trained (someone could reasonably ask, "why wasn't a KNN or an SVM model used to evaluate the candidate thresholds?"). In essence, we may have projected our own biases onto the data, though to some extent this is inevitable in statistical modelling.
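One cheap sanity check for the model-dependence concern (a hedged sketch of my own, not part of the original question) is to re-score the chosen threshold with a second learner and see whether the metrics hold up. The example below uses kNN from the recommended `class` package that ships with R; the data and the threshold of 280 are regenerated here so the snippet stands alone, and in practice cross-validated rather than resubstitution accuracy should be compared.

```r
# Re-score the threshold 280 with kNN instead of a decision tree.
library(class)
set.seed(1)
a <- rnorm(1000, 16, 15)
b <- rnorm(1000, 10, 5)
c <- rnorm(1000, 5, 10)
old_response_variable <- rnorm(1000, 250, 100)
new_response_variable <- factor(ifelse(old_response_variable < 280, "0", "1"))

X <- scale(cbind(a, b, c))   # kNN is distance-based, so standardize predictors
pred <- knn(train = X, test = X, cl = new_response_variable, k = 15)

# resubstitution accuracy for this threshold under a different model
mean(pred == new_response_variable)
```

If the decision tree and kNN disagree sharply about which thresholds are "suitable", that is evidence the chosen threshold reflects the model rather than the data.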
But in general, have I outlined a "reasonable" strategy for converting a continuous response variable into a categorical response variable?
In general, can someone please comment on the approach I have used?
Thanks!
Note: I know that converting a continuous variable into a categorical variable will inevitably result in a loss of information - but what if the client/your boss specifically requests that this problem be solved as a classification problem? (e.g. similar problems in the industry are treated as classification problems, and the goal is to define new thresholds/classes from a continuous variable).
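If the client later wants more than two classes, the single-threshold idea generalizes naturally to quantile breaks with cut(). This is a hedged sketch of my own (the threshold values and labels are illustrative, not from the question):

```r
# Three roughly equal-sized classes from quantile breaks.
set.seed(42)
old_response_variable <- rnorm(1000, 250, 100)

qs <- quantile(old_response_variable, probs = c(0, 1/3, 2/3, 1))
new_class <- cut(old_response_variable,
                 breaks = qs,
                 include.lowest = TRUE,   # keep the minimum in the first bin
                 labels = c("low", "mid", "high"))
table(new_class)
```

The same threshold-search strategy could then be run over pairs (or triples) of breakpoints instead of a single one.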
Source: https://stats.stackexchange.com/questions/549594/converting-a-continuous-response-variable-into-a-categorical-variable