## The file “multicommodity.csv” contains information about 500 customers who have purchased one or more of the 20 products that a company sells. Through a survey, the company knows the gender of these 500 customers. It wants to learn a model that predicts a customer's gender from the portfolio of products purchased.

### Importing libraries

In [55]:
from sklearn import tree
from sklearn import metrics
import numpy as np
import csv


### Reading data file as list object

In [56]:
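import csv

# The function name readFile is an assumption; only its body appears in this cell
def readFile(filename):
    # read every row of the CSV file into a list of lists
    with open(filename) as csvfile:
        lst = list(csv.reader(csvfile))

    # removing first row (the header) from list
    lst = lst[1:]
    arr = np.array(lst)
    data = arr.astype(float)

    # extract first column which is classification
    c = data[:,0]

    # extract remaining data
    d = data[:,1:]
    return (c, d)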

In [57]:
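import pandas as pd

def readFileThroughPandas(filename):
    # column 0 is the classification; columns 1 to 20 are the products
    c = pd.read_csv(filename, usecols = [0])
    d = pd.read_csv(filename, usecols = np.arange(1,21))
    cnum = c.values
    dnum = d.values

    # You may also use the following
    # cnum = c.to_numpy()
    # dnum = d.to_numpy()

    # flatten the single-column label array from shape (n, 1) to (n,)
    cnum = cnum[:,0]
    return (cnum, dnum)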

In [58]:
(c,d) = readFileThroughPandas("multicommodity.csv")
print(c.shape)
print(d.shape)

(500,)
(500, 20)


### Fitting a decision tree

In [59]:
# Note that only 60% of the dataset (the first 300 of 500 rows) is being used for training
clf = tree.DecisionTreeClassifier()
clf.fit(X=d[0:300,:],y=c[0:300])

Out[59]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
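
The slices above train on the first 300 rows and test on the rest, which is only a fair split if the rows in the file are in no particular order. As a minimal sketch, assuming the row order cannot be trusted, the indices can be shuffled once before splitting. The helper `shuffled_split` and the synthetic `c_demo` / `d_demo` arrays are hypothetical stand-ins, not part of the original notebook or data.

```python
import numpy as np

def shuffled_split(c, d, n_train, seed=0):
    """Shuffle row indices once, then split labels c and features d."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(c))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return d[train_idx], d[test_idx], c[train_idx], c[test_idx]

# Synthetic stand-in with the same shapes as the real data (not the real file)
rng = np.random.default_rng(1)
d_demo = rng.integers(0, 2, size=(500, 20))
c_demo = rng.integers(0, 2, size=500)

d_train, d_test, c_train, c_test = shuffled_split(c_demo, d_demo, n_train=300)
print(d_train.shape, d_test.shape)
```

scikit-learn's `train_test_split` in `sklearn.model_selection` provides the same kind of shuffled split and may be preferable in practice; the NumPy version is shown only to make the mechanics explicit.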

### Printing performance metrics

In [60]:
# returns accuracy
print("Training accuracy",clf.score(X=d[:300,:],y=c[:300]))
print("Testing accuracy",clf.score(X=d[300:,:],y=c[300:]))

indices = range(300,500)
c_predicted = clf.predict(d[indices,:])

Training accuracy 1.0
Testing accuracy 0.68


### Always look at the confusion matrix

In [54]:
m = metrics.confusion_matrix(c[indices],c_predicted)
print(m)

[[91 25]
 [39 45]]
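
The matrix shows why accuracy alone is not enough. In scikit-learn's convention, rows are true classes and columns are predicted classes, so the diagonal holds the correct predictions: (91 + 45) / 200 = 0.68, which matches the testing accuracy. The per-class figures are far less even: 91 of 116 test customers in the first class are recognized, but only 45 of 84 in the second. The sketch below reproduces this arithmetic from the printed matrix; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predicted classes
m = np.array([[91, 25],
              [39, 45]])

# Fraction of all test samples on the diagonal (correct predictions)
accuracy = np.trace(m) / m.sum()

# Recall: correct predictions divided by the true count of each class (row sums)
recall = np.diag(m) / m.sum(axis=1)

# Precision: correct predictions divided by the predicted count of each class (column sums)
precision = np.diag(m) / m.sum(axis=0)

print("Accuracy", accuracy)
print("Recall per class", recall.round(3))
print("Precision per class", precision.round(3))
```

For matrices computed directly from labels, `metrics.classification_report` reports the same per-class precision and recall.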