The file “multicommodity.csv” contains information about 500 customers who have purchased one or more of the 20 products that a company sells. The company knows the gender of these 500 customers from a survey it conducted, and wants to learn a model that predicts a customer's gender from the portfolio of products purchased.

Loading libraries

In [1]:
from sklearn import svm
from sklearn import metrics
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Reading the data file through Pandas

In [2]:
def readFileThroughPandas(filename):
    # Read the file once: column 0 holds the label, columns 1-20 hold the attributes
    data = pd.read_csv(filename)
    att = data.iloc[:, 1:21]
    lab = data.iloc[:, [0]]
    return att, lab
In [3]:
(att,lab) = readFileThroughPandas("multicommodity.csv")
print(att.shape)
print(lab.shape)

# Use the first 300 rows for training and the remaining rows for testing

x_train = att.iloc[0:300]
y_train = lab.iloc[0:300]

x_test = att.iloc[300:]
y_test = lab.iloc[300:]

# Alternatively use the following code to choose the rows randomly
# x_train, x_test, y_train, y_test = train_test_split(att, lab, test_size = 0.30)
(500, 20)
(500, 1)
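
The random split mentioned in the comments can be made reproducible and class-balanced. The following is a minimal sketch, not the notebook's own code: it assumes synthetic stand-in data (binary purchase indicators and a binary label), because the contents of multicommodity.csv are not shown here. The names `att_demo` and `lab_demo`, and the choices `random_state=0` and `stratify`, are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for multicommodity.csv (assumption): 500 customers,
# 20 binary purchase indicators, and one binary label per customer.
rng = np.random.default_rng(0)
att_demo = pd.DataFrame(rng.integers(0, 2, size=(500, 20)))
lab_demo = pd.Series(rng.integers(0, 2, size=500), name="gender")

# random_state fixes the shuffle so the split is repeatable;
# stratify keeps the class ratio similar in the training and test sets.
x_tr, x_te, y_tr, y_te = train_test_split(
    att_demo, lab_demo, test_size=0.30, random_state=0, stratify=lab_demo
)
print(x_tr.shape, x_te.shape)
```

With `test_size = 0.30` and 500 rows, this holds out 150 rows for testing and keeps 350 for training.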

Fitting SVM model

In [4]:
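# Create an SVM classification object
# Note that rbf is the default kernel in svm.SVC; a linear kernel is used here
# C is a regularization parameter that helps avoid overfitting (smaller C means stronger regularization)
# The default value of C is 1

clf = svm.SVC(kernel='linear',C=1)
# ravel() turns the single-column label DataFrame into the 1-D array that fit expects
clf.fit(X=x_train,y=y_train.values.ravel())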

# With more than 2 classes, svm.SVC automatically handles multiclass classification using a one-vs-one scheme
Out[4]:
SVC(C=1, kernel='linear')
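
The kernel and C are fixed by hand above. One common way to choose them is a cross-validated grid search. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` rather than the customer file, and the candidate values in `param_grid` are arbitrary, not tuned for the real data set.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): 300 samples, 20 features, 2 classes.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

# Candidate settings are illustrative only.
param_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}

# 5-fold cross-validation on the training data for every combination.
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the search only ever sees the training data, the held-out test rows remain available for an unbiased final accuracy estimate.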

Printing performance metrics

In [5]:
# score() returns the mean accuracy on the given data
print("Training accuracy",clf.score(X=x_train,y=y_train))
print("Testing accuracy",clf.score(X=x_test,y=y_test))

y_predicted = clf.predict(x_test)
Training accuracy 0.8571428571428571
Testing accuracy 0.815

Always look at the confusion matrix

In [6]:
m = metrics.confusion_matrix(y_test,y_predicted,labels=clf.classes_)
print(m)
[[106  10]
 [ 27  57]]
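
The confusion matrix carries more than the overall accuracy. As a short worked example, the sketch below recomputes accuracy and per-class recall and precision from the matrix printed above, relying on scikit-learn's convention that rows are true classes and columns are predicted classes. The variable name `m_demo` is introduced here for illustration only.

```python
import numpy as np

# Confusion matrix printed above: rows = true classes, columns = predictions.
m_demo = np.array([[106, 10],
                   [27, 57]])

# Correct predictions sit on the diagonal.
accuracy = np.trace(m_demo) / m_demo.sum()

# Recall: fraction of each true class that was predicted correctly (row-wise).
recall = np.diag(m_demo) / m_demo.sum(axis=1)

# Precision: fraction of each predicted class that was correct (column-wise).
precision = np.diag(m_demo) / m_demo.sum(axis=0)

print(accuracy)
print(recall.round(3))
print(precision.round(3))
```

The accuracy works out to 163/200 = 0.815, matching the reported testing accuracy, while the per-class recall (about 0.914 for the first class and 0.679 for the second) shows that the errors are not evenly spread between the two classes.
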
In [7]:
# Better visualization of a confusion matrix
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=m,display_labels=clf.classes_)
disp.plot()
plt.show()