The file "loanacceptance.csv" contains attributes of 500 customers, each labelled with whether the loan was granted or denied. The task is to build a classifier that automatically decides whether to grant a loan.

Importing libraries

In [12]:
from sklearn import ensemble
from sklearn import metrics
import numpy as np
import csv

Reading data file as list object

In [13]:
def readFileThroughCSV(filename):
    # open the file in a context manager so it is closed even if reading fails
    with open(filename, newline='') as csvfile:
        # creating a csv reader object
        readerobject = csv.reader(csvfile, delimiter=',')
        lst = list(readerobject)

    # removing first row (the header) from list
    lst = lst[1:]
    arr = np.array(lst)
    data = arr.astype(float)

    # extract last column which is the classification label
    c = data[:,-1]
    # extract the feature columns (note: the first column is skipped as well)
    d = data[:,1:-1]

    return (c,d)
In [14]:
(c,d) = readFileThroughCSV("loanacceptance.csv")
# shape of the variables
print(c.shape)
print(d.shape)
(500,)
(500, 6)
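
The slicing used in the reader can be checked on a tiny in-memory file. The sketch below is illustrative only: the column names and values are made up, and it assumes (as the shapes printed above suggest) that the file has one leading column that is not used as a feature, six feature columns, and a final label column.

```python
import csv
import io

import numpy as np

# Hypothetical stand-in for loanacceptance.csv: header row, one leading
# non-feature column, six feature columns, and the label in the last column.
sample = io.StringIO(
    "id,f1,f2,f3,f4,f5,f6,label\n"
    "1,0.5,1.0,2.0,3.0,4.0,5.0,1\n"
    "2,1.5,0.0,2.5,3.5,4.5,5.5,0\n"
    "3,2.5,1.0,3.0,4.0,5.0,6.0,1\n"
)

rows = list(csv.reader(sample, delimiter=","))
data = np.array(rows[1:]).astype(float)  # skip the header, convert to float

labels = data[:, -1]      # last column only
features = data[:, 1:-1]  # everything between the first and last column

print(labels.shape)    # (3,)
print(features.shape)  # (3, 6)
```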

Fitting random forests

In [7]:
# Only the first 400 rows (80% of the dataset) are used for training; the last 100 are held out for testing
clf = ensemble.RandomForestClassifier(n_estimators=10)
clf.fit(X=d[0:400,:],y=c[0:400])
Out[7]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
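
The fit above takes the first 400 rows in file order and leaves random_state unset, so the result depends on row order and changes between runs. One possible alternative, sketched here on synthetic data, is a shuffled, stratified 80/20 split with a fixed seed. The arrays `X_demo` and `y_demo`, the name `clf_demo`, and the seed 0 are placeholders and not part of the original notebook.

```python
import numpy as np
from sklearn import ensemble
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (d, c): 500 rows, 6 features, binary label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 6))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(float)

# Shuffled 80/20 split; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Fixing random_state makes the forest reproducible across runs.
clf_demo = ensemble.RandomForestClassifier(n_estimators=10, random_state=0)
clf_demo.fit(X_train, y_train)

print(X_train.shape, X_test.shape)  # (400, 6) (100, 6)
```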

Printing performance metrics

In [8]:
# clf.score returns the mean accuracy on the given data
print("Training accuracy",clf.score(X=d[:400,:],y=c[:400]))
# for classifiers clf.score is the mean accuracy, not R-squared (R-squared is what regressors return)
print("Testing accuracy (clf.score)",clf.score(X=d[400:,:],y=c[400:]))

indices = range(400,500)
c_predicted = clf.predict(d[indices,:])

print("Testing accuracy",metrics.accuracy_score(c[indices],c_predicted))
Training accuracy 1.0
Testing accuracy (clf.score) 0.91
Testing accuracy 0.91
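
For a classifier, score is documented to return the mean accuracy, so the last two figures above are necessarily the same number. The sketch below checks this on synthetic data; the names `X_demo`, `y_demo`, and `clf_demo` are placeholders, not objects from the notebook.

```python
import numpy as np
from sklearn import ensemble, metrics

# Synthetic binary classification data (placeholder for the loan data).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

# Train on the first 150 rows, evaluate on the remaining 50.
clf_demo = ensemble.RandomForestClassifier(n_estimators=10, random_state=1)
clf_demo.fit(X_demo[:150], y_demo[:150])

score_value = clf_demo.score(X_demo[150:], y_demo[150:])
accuracy_value = metrics.accuracy_score(
    y_demo[150:], clf_demo.predict(X_demo[150:])
)

# Both values are the fraction of correctly classified held-out rows.
print(score_value, accuracy_value)
```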

Always look at the confusion matrix

In [9]:
m = metrics.confusion_matrix(c[indices],c_predicted)
print(m)
[[24  6]
 [ 3 67]]
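
In scikit-learn's confusion_matrix, rows are the true classes and columns are the predicted classes, in sorted label order. Assuming the labels are 0 (denied) and 1 (granted), which the notebook does not state explicitly, the matrix above can be unpacked into per-class rates as in this sketch:

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class.
m = np.array([[24, 6],
              [3, 67]])

# Row-major order for a 2x2 matrix: true neg, false pos, false neg, true pos.
tn, fp, fn, tp = m.ravel()

accuracy = (tp + tn) / m.sum()   # fraction of all test rows classified correctly
precision = tp / (tp + fp)       # of rows predicted as class 1, fraction correct
recall = tp / (tp + fn)          # of true class 1 rows, fraction found
specificity = tn / (tn + fp)     # of true class 0 rows, fraction found

print(round(accuracy, 2), round(precision, 3), round(recall, 3), round(specificity, 3))
```

Here 6 of the 30 class 0 cases are predicted as class 1, while only 3 of the 70 class 1 cases are missed, so the errors are concentrated in the smaller class, which a single accuracy figure hides.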