The file “multicommodity.csv” contains information about 500 customers who have each purchased one or more of the 20 products that a company sells. The company knows the gender of these 500 customers from a survey it conducted, and wants to learn a model that predicts a customer's gender from the portfolio of products purchased.

Loading libraries

In [55]:
from sklearn import svm
from sklearn import metrics
import numpy as np

Reading data file as a list object

In [56]:
import csv
def readFileThroughCSV(filename):
    # the with-block closes the file even if reading fails
    with open(filename, newline='') as csvfile:
        readerobject = csv.reader(csvfile, delimiter=',')
        lst = list(readerobject)

    # drop the header row and convert the remaining rows to floats
    arr = np.array(lst[1:])
    data = arr.astype(float)

    # first column is the classification (class label)
    c = data[:,0]

    # remaining columns are the product features
    d = data[:,1:]
    return (c,d)
In [59]:
import pandas as pd
def readFileThroughPandas(filename):
    c = pd.read_csv(filename, usecols = [0])
    d = pd.read_csv(filename, usecols = np.arange(1,21))
    cnum = c.values
    dnum = d.values
    
    # You may also use the following
    # cnum = c.to_numpy()
    # dnum = d.to_numpy()
    
    cnum = cnum[:,0]
    return(cnum,dnum)
In [60]:
(c,d) = readFileThroughPandas("multicommodity.csv")
#(c,d) = readFileThroughCSV("multicommodity.csv")
print(c.shape)
print(d.shape)
(500,)
(500, 20)

Fitting SVM model

In [61]:
# Create an SVM classification object
# The linear SVM will give a poor confusion matrix. Increasing C will help.
# clf = svm.LinearSVC(C=100)

# clf = svm.SVC(kernel='linear')
# Note that rbf is the default kernel in svm.SVC
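# C is the regularization parameter: regularization strength is inversely
# proportional to C, so a smaller C regularizes more and helps against overfitting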

clf = svm.SVC(kernel='linear',C=1)

# Fitting SVM only on first 300 data points
clf.fit(X=d[0:300,:],y=c[0:300])

# With more than 2 classes, svm.SVC always trains one-vs-one classifiers internally;
# decision_function_shape='ovr' only changes the shape of the decision function output
Out[61]:
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
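
The fit above uses the first 300 rows in file order. If the rows were sorted in some way (an assumption; nothing in the data description says so), a shuffled split would be a safer choice. Here is a minimal sketch using only the standard library; the names order, train_idx and test_idx and the seed value are illustrative and not part of the original notebook.

```python
import random

n_rows = 500     # number of customers in the data set
n_train = 300    # same training size as in the cell above

# Fixed seed so that the split is reproducible; the value 0 is arbitrary
rng = random.Random(0)

# Shuffle the row indices, then cut them into a training and a testing part
order = list(range(n_rows))
rng.shuffle(order)
train_idx = order[:n_train]
test_idx = order[n_train:]

print(len(train_idx), len(test_idx))
```

With the arrays loaded earlier, d[train_idx,:] and c[train_idx] would select the training rows. scikit-learn also provides sklearn.model_selection.train_test_split for the same purpose.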

Printing performance metrics

In [62]:
# score() returns the mean accuracy on the given data and labels
print("Training accuracy",clf.score(X=d[:300,:],y=c[:300]))
print("Testing accuracy",clf.score(X=d[300:,:],y=c[300:]))

# Predicting on the last 200 data points (not used for fitting)
indices = range(300,500)
c_predicted = clf.predict(d[indices,:])
Training accuracy 0.8566666666666667
Testing accuracy 0.815

Always look at the confusion matrix

In [63]:
m = metrics.confusion_matrix(c[indices],c_predicted)
print(m)
[[106  10]
 [ 27  57]]
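
Reading the matrix by hand

The sketch below recomputes a few numbers from the matrix printed above, to show why the matrix says more than the accuracy alone. In the layout used by metrics.confusion_matrix, rows are true classes and columns are predicted classes. The variable names are illustrative, and which gender corresponds to which row depends on how the file encodes the label, which is not shown here.

```python
# Confusion matrix copied from the output above
m = [[106, 10],
     [27, 57]]

total = sum(sum(row) for row in m)

# Accuracy: diagonal entries divided by all test points
accuracy = (m[0][0] + m[1][1]) / total

# Recall per class: correct predictions divided by the row total
recall_first = m[0][0] / sum(m[0])
recall_second = m[1][1] / sum(m[1])

# Precision per class: correct predictions divided by the column total
precision_first = m[0][0] / (m[0][0] + m[1][0])
precision_second = m[1][1] / (m[0][1] + m[1][1])

# Accuracy obtained by always predicting the larger class
baseline = max(sum(m[0]), sum(m[1])) / total

print("accuracy", round(accuracy, 3))                                      # 0.815
print("recall", round(recall_first, 3), round(recall_second, 3))           # 0.914 0.679
print("precision", round(precision_first, 3), round(precision_second, 3))  # 0.797 0.851
print("baseline", round(baseline, 3))                                      # 0.58
```

The accuracy matches the testing accuracy of 0.815, but the two classes are not treated equally: about 91% of the first class is recovered against about 68% of the second. Always predicting the larger class would already give 0.58.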