Logistic regression is a type of generalized linear model used for modeling a binary response as a function of explanatory variables. In the data, we have a binary response for \(n\) observations, \(Y_i, i=1,2,\ldots,n\), where \(Y_i=1\) or \(0\), along with explanatory variables \((X_{i1},X_{i2},\ldots,X_{ik})\).

The logistic model is formulated as follows: \[ P(Y_i=1) = \frac{e^{\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2}+\ldots+\beta_k X_{ik} }}{1+ e^{\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2}+\ldots+\beta_k X_{ik} }},\] or equivalently \[ \log\frac{P(Y_i=1)}{1-P(Y_i=1)}= \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2}+\ldots+\beta_k X_{ik}, \] where \(\beta_0, \beta_1,\ldots, \beta_k\) are parameters to be estimated; estimation is by the method of maximum likelihood. Here, we will implement the approach in R. The data example used here is adapted from the book "Analyzing Multivariate Data" by Lattin, Carroll, and Green.
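
To see the first formula in action, here is a minimal numerical sketch in R; the coefficient and predictor values below are made up purely for illustration:

# hypothetical coefficients and predictor values, for illustration only
beta <- c(-2.2, -0.07, 0.99)       # beta0, beta1, beta2
x    <- c(1, 12, 2)                # 1 (for the intercept), X_i1 = 12, X_i2 = 2
eta  <- sum(beta * x)              # linear predictor: beta0 + beta1*X_i1 + beta2*X_i2
exp(eta) / (1 + exp(eta))          # P(Y_i = 1) from the formula above
plogis(eta)                        # the same value via R's built-in inverse logit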

Problem

A direct marketing company has information on the purchasing history of 50,000 existing customers. The company is interested in offering a new title, "Art History of Florence". It would prefer to do targeted mailing, that is, to send the offer only to those existing customers whose likelihood of buying the book is relatively high. In an attempt to come up with a profile of the target customer, the company sent a direct-mail "pilot" offer for "Art History of Florence" to 1,000 existing customers (chosen at random from the customer base of 50,000). Of these, 83 purchased the book. The company now wants to use this information to differentiate buyers from non-buyers. If that can be done, the company can score the remaining customers and target those who are more likely to buy the book.

For each customer in the database, the company has information on "months since last purchase" (as this variable increases, the recency of the last purchase from the company decreases) and the number of art books purchased (a measure of interest in the art category). The data from this pilot are in the file "Bookdev.csv".

library(MASS)

# read the pilot ("development") sample
X <- read.csv("./Bookdev.csv")

# fit the logistic regression of buyer on months since last purchase and art books purchased
fit <- glm(buyer ~ Months + artbooks, family = binomial(link = "logit"), data = X)

summary(fit)
## 
## Call:
## glm(formula = buyer ~ Months + artbooks, family = binomial(link = "logit"), 
##     data = X)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8836  -0.4232  -0.3221  -0.2617   2.8813  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.22564    0.23892  -9.315  < 2e-16 ***
## Months      -0.07072    0.01923  -3.677 0.000236 ***
## artbooks     0.98905    0.13466   7.345 2.06e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 572.07  on 999  degrees of freedom
## Residual deviance: 502.93  on 997  degrees of freedom
## AIC: 508.93
## 
## Number of Fisher Scoring iterations: 6
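
The coefficient estimates are on the log-odds scale; exponentiating them gives odds ratios, which are often easier to read. A quick sketch using the fitted object from above:

# multiplicative change in the odds of buying per one-unit increase in each predictor
exp(coef(fit))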

We can see that both variables, Months and artbooks, are significant and have signs that are intuitive for the given context: the likelihood of buying falls as the months since the last purchase increase, and rises with the number of art books previously purchased. Of course, one question is how good this model is.

One way to assess goodness of fit is the Hosmer-Lemeshow test. For this test, the predicted probabilities (\(P(Y_i=1)\)) are computed on the training data, the observations are sorted by predicted probability, and the data are divided into 10 equal groups (deciles). The test then compares the observed and expected frequencies of 1's in these groups using a chi-square statistic: if the observed and expected frequencies are close (in the statistical sense of the chi-square test), the model fits well.

# obtain predicted probabilities on the development sample

pred <- fitted.values(fit)
# predicted probabilities for the first 10 observations
pred[1:10]
##          1          2          3          4          5          6          7 
## 0.01940083 0.03366309 0.03604094 0.02228262 0.03604094 0.33808623 0.04726690 
##          8          9         10 
## 0.11769176 0.11044751 0.09738499
# Hosmer-Lemeshow test
# install.packages("ResourceSelection")   # run once if the package is not installed
library(ResourceSelection)
## ResourceSelection 0.3-5   2019-07-22
hl <- hoslem.test(X$buyer, pred, g = 10)

hl$observed
##                   
## cutyhat             y0  y1
##   [0.00901,0.0239] 101   2
##   (0.0239,0.036]   109   4
##   (0.036,0.0413]    83   4
##   (0.0413,0.0473]  102   3
##   (0.0473,0.0541]  100   7
##   (0.0541,0.066]    85   6
##   (0.066,0.0857]    87   7
##   (0.0857,0.104]    97  10
##   (0.104,0.169]     82  11
##   (0.169,0.83]      71  29
hl$expected
##                   
## cutyhat                 yhat0      yhat1
##   [0.00901,0.0239] 101.440169   1.559831
##   (0.0239,0.036]   109.264258   3.735742
##   (0.036,0.0413]    83.524507   3.475493
##   (0.0413,0.0473]  100.243771   4.756229
##   (0.0473,0.0541]  101.387670   5.612330
##   (0.0541,0.066]    85.409749   5.590251
##   (0.066,0.0857]    86.724032   7.275968
##   (0.0857,0.104]    97.114852   9.885148
##   (0.104,0.169]     80.662012  12.337988
##   (0.169,0.83]      71.228981  28.771019
hl$statistic
## X-squared 
##   1.48392
hl$p.value
## [1] 0.9929717
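
To make the mechanics of the test concrete, the statistic can also be recomputed by hand: cut the predicted probabilities into deciles, then compare observed and expected counts of 1's and 0's in each group. A sketch (it assumes the decile break points are distinct, as they are here):

# hand-rolled Hosmer-Lemeshow statistic, for illustration
g   <- 10
brk <- quantile(pred, probs = seq(0, 1, length.out = g + 1))
grp <- cut(pred, breaks = brk, include.lowest = TRUE)   # decile groups
obs1 <- tapply(X$buyer, grp, sum)       # observed 1's in each group
exp1 <- tapply(pred, grp, sum)          # expected 1's in each group
obs0 <- tapply(1 - X$buyer, grp, sum)   # observed 0's in each group
exp0 <- tapply(1 - pred, grp, sum)      # expected 0's in each group
chisq <- sum((obs1 - exp1)^2 / exp1 + (obs0 - exp0)^2 / exp0)
chisq                                   # should match hl$statistic
1 - pchisq(chisq, df = g - 2)           # should match hl$p.value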

In this case, the p-value is very large, indicating a good fit.

Whether to send the target mailer or not?

The model gives a probability of buying, but we need to convert that into an action plan for the target mailing, so one needs to decide on a suitable cutoff. A natural starting point is the overall purchase rate observed in the pilot sample.

Confusion matrix on development sample

# use the overall purchase rate in the pilot sample as the cutoff
cutoff <- mean(X$buyer)
cutoff
## [1] 0.083
pred_class <- (pred > cutoff) * 1   # 1 = predicted buyer, 0 = predicted non-buyer
ct <- table(X$buyer, pred_class)    # rows: actual, columns: predicted

# development: confusion matrix
ct
##    pred_class
##       0   1
##   0 656 261
##   1  32  51
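
From this table the usual classification rates can be computed. A short sketch (rows of ct are actual classes, columns are predicted, as above):

accuracy    <- sum(diag(ct)) / sum(ct)        # overall proportion classified correctly
sensitivity <- ct["1", "1"] / sum(ct["1", ])  # buyers correctly flagged for mailing
specificity <- ct["0", "0"] / sum(ct["0", ])  # non-buyers correctly left out
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
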
Confusion matrix on validation sample

Xv <- read.csv("./Bookval.csv")

# score the hold-out sample with the fitted model and classify with the same cutoff
pred <- predict(fit, type = "response", newdata = Xv)
pred_class <- (pred > cutoff) * 1

ct <- table(Xv$buyer, pred_class)
# validation: confusion matrix
ct
##    pred_class
##       0   1
##   0 686 233
##   1  28  53
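
The validation results are close to the development results, suggesting the model generalizes beyond the pilot sample. The remaining step in the company's plan is to score the rest of the customer base and mail only those above the cutoff. A sketch, assuming the remaining customers sit in a file with the same predictor columns; the file name "bookrest.csv" is hypothetical:

# hypothetical file holding the remaining (unscored) customers
Xrest <- read.csv("./bookrest.csv")
Xrest$phat <- predict(fit, type = "response", newdata = Xrest)

# mail only customers whose predicted probability exceeds the cutoff
target <- Xrest[Xrest$phat > cutoff, ]
nrow(target)   # number of customers to receive the offer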