Create Categorical Variable From Continuous in R
I have the following question: are there any standard methods for converting a continuous response variable into a categorical variable?
To give my question some context, consider the following example (using the R programming language). Suppose you have the following data (I also posted the code for generating this data, to make my example more reproducible):
#generate data (all vectors must have the same length, 1000,
#so that data.frame() below does not silently recycle them)
a = rnorm(1000, 10, 10) + rnorm(1000, 6, 11)
b = rnorm(1000, 10, 5)
c = rnorm(1000, 5, 10)
cat_var <- sample(LETTERS[1:2], 1000, replace = TRUE, prob = c(0.5, 0.5))
old_response_variable <- rnorm(1000, 250, 100)

#put data into a frame
d = data.frame(a, b, c, cat_var, old_response_variable)
d$cat_var = as.factor(d$cat_var)

#view data
head(d)
           a         b         c cat_var old_response_variable
1  -2.153779 15.135098  7.903363       B              233.7632
2  10.529895  5.055633  4.959639       B              372.3922
3  20.600232 10.333690 12.749611       B              349.6630
4  41.885899 17.280700 26.760988       B              164.3122
5  17.174567 11.878346 -3.306771       A              272.9595
6  21.524126 12.449084  6.911237       A              179.7316

In this dataset, the variables a, b, c, and cat_var are the predictor variables (covariates) and "old_response_variable" is the continuous response variable. I am interested in converting "old_response_variable" into a binary categorical response variable, and then training a statistical model (e.g. a decision tree) on this data for the purpose of supervised classification.
Proposed Strategy:
The plot of the "old_response_variable" looks like this:
plot(density(d$old_response_variable), main = "Distribution of the Old Response Variable")
The "old response variable" can take values between 0 and 600. Since I am interested in binary classification, I thought I could:
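Before searching over random thresholds, a simpler, model-free baseline worth noting (my suggestion, not part of the original question) is to split at the median with cut(), which guarantees balanced classes by construction:

```r
# Median split as a model-free baseline (a sketch, not the strategy below).
set.seed(123)
old_response_variable <- rnorm(1000, 250, 100)

# cut() at the median yields a two-level factor with 500 observations per class
new_response_variable <- cut(old_response_variable,
                             breaks = c(-Inf, median(old_response_variable), Inf),
                             labels = c("0", "1"))
table(new_response_variable)
```

A quantile split like this fixes the class balance in advance, whereas the search strategy below lets the balance vary with the threshold.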
1) Make random splits (i.e. threshold) in the "old response variable" (e.g. if old_response_variable < 250 then new_response_variable = "0" else "1")
2) Train a decision tree model on the data from 1)
3) Record performance metrics (e.g. accuracy, sensitivity, specificity) from the model in 2)
4) Repeat steps 1) - 3) many times, and choose the final threshold that has "suitable" values of accuracy, sensitivity, and specificity (e.g. a threshold where the decision tree has high accuracy and high sensitivity but low specificity might be less advantageous than a threshold where it has medium accuracy, medium sensitivity, and medium specificity).
Here is the R code corresponding to this strategy (on a small scale):
library(ggplot2)
library(caret)
library(rpart)

#generate data (same frame as above, with all vectors of length 1000)
a = rnorm(1000, 10, 10) + rnorm(1000, 6, 11)
b = rnorm(1000, 10, 5)
c = rnorm(1000, 5, 10)
cat_var <- sample(LETTERS[1:2], 1000, replace = TRUE, prob = c(0.5, 0.5))
old_response_variable <- rnorm(1000, 250, 100)

#put data into a frame
d = data.frame(a, b, c, cat_var, old_response_variable)
d$cat_var = as.factor(d$cat_var)

#candidate thresholds and a table to hold the results
vec1 <- sample(250:300, 50)
df <- expand.grid(vec1)
df$Accuracy <- NA
df$sens <- NA
df$spec <- NA

for (i in seq_along(vec1)) {

    #step 1: dichotomize the response at the i-th candidate threshold
    d$new_response_variable = as.factor(
        as.integer(ifelse(d$old_response_variable < vec1[i], 0, 1)))

    #step 2: train a decision tree (2-fold CV, repeated once)
    fitControl <- trainControl(method = "repeatedcv",
                               number = 2,
                               repeats = 1)
    TreeFit <- train(new_response_variable ~ .,
                     data = d[, -5],   #drop old_response_variable
                     method = "rpart",
                     trControl = fitControl)

    #step 3: record performance metrics
    #(confusionMatrix takes the predictions first, the reference second;
    # byClass is indexed by name so sensitivity/specificity are not mixed up)
    pred <- predict(TreeFit, d[, -5])
    con <- confusionMatrix(pred, d$new_response_variable)
    df$Accuracy[i] <- con$overall["Accuracy"]
    df$sens[i] <- con$byClass["Sensitivity"]
    df$spec[i] <- con$byClass["Specificity"]
}

#view final results ("Var1" is the threshold)
head(df)
  Var1 Accuracy      sens      spec
1  299    0.682 0.8125000 0.6798780
2  289    0.657 0.7358491 0.6525871
3  271    0.573 0.8125000 0.5691057
4  278    0.622 0.6491228 0.6185102
5  253    0.540 0.5352564 0.6093750
6  258    0.549 0.5305623 0.6318681

We can visualize the results of this strategy:
ggplot(df, aes(Var1)) +
    geom_line(aes(y = Accuracy, colour = "Accuracy")) +
    geom_line(aes(y = sens, colour = "sens")) +
    geom_line(aes(y = spec, colour = "spec")) +
    ggtitle("Results of Threshold Splitting Strategy")
According to the plot above, a splitting threshold of approximately 280 (if old_response_variable < 280 then new_response_variable = "0" else "1") appears to be a suitable choice, with balanced accuracy, specificity, and sensitivity.
Question: Are there any major statistical flaws in this threshold-splitting strategy? The main flaw I can think of is that the "suitability" of the threshold is decided by how well one particular model (the decision tree) performs: it is quite possible that the chosen threshold is not a naturally occurring or ideal threshold, but simply one that happens to suit the (multiple) models we trained (someone could reasonably ask, "why wasn't a KNN or an SVM model used to evaluate the candidate thresholds?"). In essence, we may have projected our own biases onto the data, though to some extent this is inevitable in statistical modelling.
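One cheap sanity check for the model-dependence concern (a hedged sketch of my own, not part of the original question) is to re-score the chosen threshold with a second learner and see whether the metrics hold up. The example below uses kNN from the recommended `class` package that ships with R; the data and the threshold of 280 are regenerated here so the snippet stands alone, and in practice cross-validated rather than resubstitution accuracy should be compared.

```r
# Re-score the threshold 280 with kNN instead of a decision tree.
library(class)
set.seed(1)
a <- rnorm(1000, 16, 15)
b <- rnorm(1000, 10, 5)
c <- rnorm(1000, 5, 10)
old_response_variable <- rnorm(1000, 250, 100)
new_response_variable <- factor(ifelse(old_response_variable < 280, "0", "1"))

X <- scale(cbind(a, b, c))   # kNN is distance-based, so standardize predictors
pred <- knn(train = X, test = X, cl = new_response_variable, k = 15)

# resubstitution accuracy for this threshold under a different model
mean(pred == new_response_variable)
```

If the decision tree and kNN disagree sharply about which thresholds are "suitable", that is evidence the chosen threshold reflects the model rather than the data.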
But in general, have I outlined a "reasonable" strategy for converting a continuous response variable into a categorical response variable?
In general, can someone please comment on the approach I have used?
Thanks!
Note: I know that converting a continuous variable into a categorical variable will inevitably result in a loss of information - but what if the client/your boss specifically requests that this problem be solved as a classification problem? (e.g. similar problems in the industry are treated as classification problems, and the goal is to define new thresholds/classes from a continuous variable).
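If the client later wants more than two classes, the single-threshold idea generalizes naturally to quantile breaks with cut(). This is a hedged sketch of my own (the threshold values and labels are illustrative, not from the question):

```r
# Three roughly equal-sized classes from quantile breaks.
set.seed(42)
old_response_variable <- rnorm(1000, 250, 100)

qs <- quantile(old_response_variable, probs = c(0, 1/3, 2/3, 1))
new_class <- cut(old_response_variable,
                 breaks = qs,
                 include.lowest = TRUE,   # keep the minimum in the first bin
                 labels = c("low", "mid", "high"))
table(new_class)
```

The same threshold-search strategy could then be run over pairs (or triples) of breakpoints instead of a single one.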
Source: https://stats.stackexchange.com/questions/549594/converting-a-continuous-response-variable-into-a-categorical-variable