Basic machine learning with python sklearn for classification

A demonstration of how to use python package sklearn for a basic machine learning task : classification.

code

#Lets start with installing pandas from within python interactive shell. This may be needed if you are missing a package.
import pip
#if you have pip installed, you can use this to install any package you need : pip.main(['install',package-name])
pip.main(['install','pandas']) 

output

    Requirement already satisfied: pandas in c:\users\u588401\appdata\local\continuum\anaconda3\lib\site-packages
    Requirement already satisfied: python-dateutil>=2 in c:\users\u588401\appdata\local\continuum\anaconda3\lib\site-packages (from pandas)
    Requirement already satisfied: pytz>=2011k in c:\users\u588401\appdata\local\continuum\anaconda3\lib\site-packages (from pandas)
    Requirement already satisfied: numpy>=1.9.0 in c:\users\u588401\appdata\local\continuum\anaconda3\lib\site-packages (from pandas)
    Requirement already satisfied: six>=1.5 in c:\users\u588401\appdata\local\continuum\anaconda3\lib\site-packages (from python-dateutil>=2->pandas)
    
    You are using pip version 9.0.1, however version 10.0.1 is available.
    You should consider upgrading via the 'python -m pip install --upgrade pip' command.
  
Now lets use pandas to load the dataset from a url.

code

import pandas
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
dataset = pandas.read_csv(url)
print('shape=',dataset.shape)
print(dataset.head())

output

    shape= (149, 5)
       5.1  3.5  1.4  0.2  Iris-setosa
    0  4.9  3.0  1.4  0.2  Iris-setosa
    1  4.7  3.2  1.3  0.2  Iris-setosa
    2  4.6  3.1  1.5  0.2  Iris-setosa
    3  5.0  3.6  1.4  0.2  Iris-setosa
    4  5.4  3.9  1.7  0.4  Iris-setosa

Oh no, the dataset has no column names, lets add them

code

dataset.columns=['A','B','C','D','E']
print(dataset.head())

output

         A    B    C    D            E
    0  4.9  3.0  1.4  0.2  Iris-setosa
    1  4.7  3.2  1.3  0.2  Iris-setosa
    2  4.6  3.1  1.5  0.2  Iris-setosa
    3  5.0  3.6  1.4  0.2  Iris-setosa
    4  5.4  3.9  1.7  0.4  Iris-setosa

Lets look at the class distribution of the column we will use as target for our machine learning tasks, column ā€˜Eā€™.

code

print(dataset.groupby('E').size())

output

    E
    Iris-setosa        49
    Iris-versicolor    50
    Iris-virginica     50
    dtype: int64

Now lets split our dataset randomly into training, testing, and validation datasets. First, shuffle the dataset randomly with replacement.

code

# The frac keyword argument specifies the fraction of rows to return in
# the random sample, so frac=1 means return all rows (in random order).
dataset_shuffled = dataset.sample(frac=1)
print(dataset_shuffled.head())

output

           A    B    C    D                E
    9    5.4  3.7  1.5  0.2      Iris-setosa
    113  5.8  2.8  5.1  2.4   Iris-virginica
    96   6.2  2.9  4.3  1.3  Iris-versicolor
    118  6.0  2.2  5.0  1.5   Iris-virginica
    147  6.2  3.4  5.4  2.3   Iris-virginica

Now split the data into training and validation sets in 80:20 ratio. We will use the validation set to test the performance of machine learning algorithms.

code

n = dataset_shuffled.shape[0]
validation_size = int(0.2*n)

from sklearn import model_selection

data_array = dataset_shuffled.values
X = data_array[:,0:4]  # separate the predictor variables
Y = data_array[:,4]    # from the target variable (last column)
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X,Y,test_size=validation_size,random_state=13)

Check the sizes of the training and validation data sets.

code

print('Train_X = ',X_train.shape, 'Train_Y = ',Y_train.shape)
print('Val_X = ',X_validation.shape, 'Val_Y = ',Y_validation.shape)

output

    Train_X =  (120, 4) Train_Y =  (120,)
    Val_X =  (29, 4) Val_Y =  (29,)

code

import sklearn
from sklearn import linear_model
from sklearn import tree
from sklearn import svm
from sklearn import neighbors

# we will use these classifiers
LR = linear_model.LogisticRegression()
KNN = neighbors.KNeighborsClassifier()
DT = tree.DecisionTreeClassifier()
SVM = svm.SVC()

models_list = [LR,KNN,DT,SVM]
num_folds = 8
kfolds = model_selection.KFold(n_splits=num_folds, shuffle=True, random_state=13)
for model in models_list:
    model_name = type(model).__name__
    cv_results = model_selection.cross_val_score(model,X_train,Y_train,scoring='accuracy',cv=kfolds)
    print(model_name,' : ',cv_results.tolist())
    print(model_name,' : mean=',cv_results.mean(), ' sd=',cv_results.std())
    print('------------------------------------------------------------------------')

output

    LogisticRegression  :  [1.0, 1.0, 1.0, 0.6666666666666666, 0.9333333333333333, 0.9333333333333333, 0.9333333333333333, 1.0]
    LogisticRegression  : mean= 0.9333333333333333  sd= 0.10540925533894599
    ------------------------------------------------------------------------
    KNeighborsClassifier  :  [1.0, 0.9333333333333333, 1.0, 0.8666666666666667, 1.0, 0.9333333333333333, 1.0, 1.0]
    KNeighborsClassifier  : mean= 0.9666666666666667  sd= 0.04714045207910316
    ------------------------------------------------------------------------
    DecisionTreeClassifier  :  [0.9333333333333333, 0.9333333333333333, 0.9333333333333333, 0.9333333333333333, 1.0, 0.9333333333333333, 1.0, 0.9333333333333333]
    DecisionTreeClassifier  : mean= 0.95  sd= 0.02886751345948128
    ------------------------------------------------------------------------
    SVC  :  [0.9333333333333333, 1.0, 0.9333333333333333, 0.9333333333333333, 1.0, 0.9333333333333333, 1.0, 1.0]
    SVC  : mean= 0.9666666666666667  sd= 0.033333333333333326
    ------------------------------------------------------------------------

But this is naive training with each of the classifiers as we have not done anything to optimize the parameters for each classifer. We will use sklearn.model_selection.GridSearchCV for that. Lets do this with each classifier separately because each classifier will have its own set of parametes to tune.

code

kfolds = model_selection.KFold(n_splits=num_folds,shuffle=True,random_state=13)

#define function to be called to do grid search for each of the classifiers.

def print_cv_results(estimator,results):
    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    print(type(estimator).__name__)
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    print('best_parameters=',results.best_params_)
    print('best_accuracy=',results.best_score_)
    print('----------------------------------------------')
    
def do_grid_search(estimator,grid_values,kfolds):
    clf = model_selection.GridSearchCV(estimator=estimator,param_grid=grid_values,scoring='accuracy',cv=kfolds)
    results = clf.fit(X_train,Y_train)
    print_cv_results(estimator,results)
    return clf

code

# First, logistic Regression parameters to tune : C, penalty
grid_values = {'C':[0.01,0.1,1,10,100],'penalty':['l1','l2']}
clf_LR = do_grid_search(LR,grid_values,kfolds) 

# Next we do with KNN
grid_values = {'n_neighbors':list(range(1,10))}
clf_KNN = do_grid_search(KNN,grid_values,kfolds)

# Next Decision Tree
grid_values = {'criterion':['gini','entropy'],'splitter':['best','random'],'min_samples_split':list(range(2,10)),'min_samples_leaf':[1,2,3,4,5]}
clf_DT = do_grid_search(DT,grid_values,kfolds)

output

    LogisticRegression
    0.233 (+/-0.149) for {'C': 0.01, 'penalty': 'l1'}
    0.675 (+/-0.356) for {'C': 0.01, 'penalty': 'l2'}
    0.725 (+/-0.380) for {'C': 0.1, 'penalty': 'l1'}
    0.817 (+/-0.420) for {'C': 0.1, 'penalty': 'l2'}
    0.950 (+/-0.088) for {'C': 1, 'penalty': 'l1'}
    0.933 (+/-0.211) for {'C': 1, 'penalty': 'l2'}
    0.967 (+/-0.094) for {'C': 10, 'penalty': 'l1'}
    0.967 (+/-0.094) for {'C': 10, 'penalty': 'l2'}
    0.958 (+/-0.093) for {'C': 100, 'penalty': 'l1'}
    0.967 (+/-0.094) for {'C': 100, 'penalty': 'l2'}
    best_parameters= {'C': 10, 'penalty': 'l1'}
    best_accuracy= 0.9666666666666667
    ----------------------------------------------
    KNeighborsClassifier
    0.958 (+/-0.065) for {'n_neighbors': 1}
    0.950 (+/-0.058) for {'n_neighbors': 2}
    0.958 (+/-0.065) for {'n_neighbors': 3}
    0.967 (+/-0.067) for {'n_neighbors': 4}
    0.967 (+/-0.094) for {'n_neighbors': 5}
    0.975 (+/-0.065) for {'n_neighbors': 6}
    0.975 (+/-0.065) for {'n_neighbors': 7}
    0.975 (+/-0.065) for {'n_neighbors': 8}
    0.975 (+/-0.065) for {'n_neighbors': 9}
    best_parameters= {'n_neighbors': 6}
    best_accuracy= 0.975
    ----------------------------------------------
    DecisionTreeClassifier
    0.942 (+/-0.080) for {'criterion': 'gini', 'min_samples_leaf': 1, 'min_samples_split': 2, 'splitter': 'best'}
    0.942 (+/-0.104) for {'criterion': 'gini', 'min_samples_leaf': 1, 'min_samples_split': 2, 'splitter': 'random'}
    0.942 (+/-0.080) for {'criterion': 'gini', 'min_samples_leaf': 1, 'min_samples_split': 3, 'splitter': 'best'}
    0.950 (+/-0.088) for {'criterion': 'gini', 'min_samples_leaf': 1, 'min_samples_split': 3, 'splitter': 'random'}
    0.942 (+/-0.080) for {'criterion': 'gini', 'min_samples_leaf': 1, 'min_samples_split': 4, 'splitter': 'best'}
    0.975 (+/-0.065) for {'criterion': 'gini', 'min_samples_leaf': 1, 'min_samples_split': 4, 'splitter': 'random'}
    0.942 (+/-0.080) for {'criterion': 'gini', 'min_samples_leaf': 1, 'min_samples_split': 5, 'splitter': 'best'}
    0.900 (+/-0.115) for {'criterion': 'gini', 'min_samples_leaf': 1, 'min_samples_split': 5, 'splitter': 'random'}
    0.933 (+/-0.094) for {'criterion': 'gini', 'min_samples_leaf': 1, 'min_samples_split': 6, 'splitter': 'best'}
    0.925 (+/-0.124) for {'criterion': 'gini', 'min_samples_leaf': 1, 'min_samples_split': 6, 'splitter': 'random'}
    0.942 (+/-0.080) for {'criterion': 'gini', 'min_samples_leaf': 1, 'min_samples_split': 7, 'splitter': 'best'}
    0.967 (+/-0.067) for {'criterion': 'gini', 'min_samples_leaf': 1, 'min_samples_split': 7, 'splitter': 'random'}
    0.933 (+/-0.115) for {'criterion': 'gini', 'min_samples_leaf': 1, 'min_samples_split': 8, 'splitter': 'best'}
    0.942 (+/-0.080) for {'criterion': 'gini', 'min_samples_leaf': 1, 'min_samples_split': 8, 'splitter': 'random'}
    0.933 (+/-0.115) for {'criterion': 'gini', 'min_samples_leaf': 1, 'min_samples_split': 9, 'splitter': 'best'}
    0.925 (+/-0.124) for {'criterion': 'gini', 'min_samples_leaf': 1, 'min_samples_split': 9, 'splitter': 'random'}
    0.942 (+/-0.080) for {'criterion': 'gini', 'min_samples_leaf': 2, 'min_samples_split': 2, 'splitter': 'best'}
    0.958 (+/-0.093) for {'criterion': 'gini', 'min_samples_leaf': 2, 'min_samples_split': 2, 'splitter': 'random'}
    0.933 (+/-0.094) for {'criterion': 'gini', 'min_samples_leaf': 2, 'min_samples_split': 3, 'splitter': 'best'}
    0.925 (+/-0.124) for {'criterion': 'gini', 'min_samples_leaf': 2, 'min_samples_split': 3, 'splitter': 'random'}
    0.942 (+/-0.080) for {'criterion': 'gini', 'min_samples_leaf': 2, 'min_samples_split': 4, 'splitter': 'best'}
    0.917 (+/-0.111) for {'criterion': 'gini', 'min_samples_leaf': 2, 'min_samples_split': 4, 'splitter': 'random'}
    0.933 (+/-0.094) for {'criterion': 'gini', 'min_samples_leaf': 2, 'min_samples_split': 5, 'splitter': 'best'}
    0.975 (+/-0.065) for {'criterion': 'gini', 'min_samples_leaf': 2, 'min_samples_split': 5, 'splitter': 'random'}
    0.933 (+/-0.094) for {'criterion': 'gini', 'min_samples_leaf': 2, 'min_samples_split': 6, 'splitter': 'best'}
    0.925 (+/-0.155) for {'criterion': 'gini', 'min_samples_leaf': 2, 'min_samples_split': 6, 'splitter': 'random'}
    0.942 (+/-0.080) for {'criterion': 'gini', 'min_samples_leaf': 2, 'min_samples_split': 7, 'splitter': 'best'}
    0.967 (+/-0.067) for {'criterion': 'gini', 'min_samples_leaf': 2, 'min_samples_split': 7, 'splitter': 'random'}
    0.933 (+/-0.115) for {'criterion': 'gini', 'min_samples_leaf': 2, 'min_samples_split': 8, 'splitter': 'best'}
    0.917 (+/-0.111) for {'criterion': 'gini', 'min_samples_leaf': 2, 'min_samples_split': 8, 'splitter': 'random'}
    0.933 (+/-0.115) for {'criterion': 'gini', 'min_samples_leaf': 2, 'min_samples_split': 9, 'splitter': 'best'}
    0.958 (+/-0.093) for {'criterion': 'gini', 'min_samples_leaf': 2, 'min_samples_split': 9, 'splitter': 'random'}
    0.967 (+/-0.067) for {'criterion': 'gini', 'min_samples_leaf': 3, 'min_samples_split': 2, 'splitter': 'best'}
    0.933 (+/-0.149) for {'criterion': 'gini', 'min_samples_leaf': 3, 'min_samples_split': 2, 'splitter': 'random'}
    0.967 (+/-0.067) for {'criterion': 'gini', 'min_samples_leaf': 3, 'min_samples_split': 3, 'splitter': 'best'}
    0.958 (+/-0.093) for {'criterion': 'gini', 'min_samples_leaf': 3, 'min_samples_split': 3, 'splitter': 'random'}
    0.958 (+/-0.065) for {'criterion': 'gini', 'min_samples_leaf': 3, 'min_samples_split': 4, 'splitter': 'best'}
    0.917 (+/-0.173) for {'criterion': 'gini', 'min_samples_leaf': 3, 'min_samples_split': 4, 'splitter': 'random'}
    0.967 (+/-0.067) for {'criterion': 'gini', 'min_samples_leaf': 3, 'min_samples_split': 5, 'splitter': 'best'}
    0.958 (+/-0.093) for {'criterion': 'gini', 'min_samples_leaf': 3, 'min_samples_split': 5, 'splitter': 'random'}
    0.950 (+/-0.088) for {'criterion': 'gini', 'min_samples_leaf': 3, 'min_samples_split': 6, 'splitter': 'best'}
    0.925 (+/-0.104) for {'criterion': 'gini', 'min_samples_leaf': 3, 'min_samples_split': 6, 'splitter': 'random'}
    0.958 (+/-0.065) for {'criterion': 'gini', 'min_samples_leaf': 3, 'min_samples_split': 7, 'splitter': 'best'}
    0.933 (+/-0.094) for {'criterion': 'gini', 'min_samples_leaf': 3, 'min_samples_split': 7, 'splitter': 'random'}
    0.942 (+/-0.124) for {'criterion': 'gini', 'min_samples_leaf': 3, 'min_samples_split': 8, 'splitter': 'best'}
    0.908 (+/-0.132) for {'criterion': 'gini', 'min_samples_leaf': 3, 'min_samples_split': 8, 'splitter': 'random'}
    0.942 (+/-0.124) for {'criterion': 'gini', 'min_samples_leaf': 3, 'min_samples_split': 9, 'splitter': 'best'}
    0.942 (+/-0.124) for {'criterion': 'gini', 'min_samples_leaf': 3, 'min_samples_split': 9, 'splitter': 'random'}
    0.942 (+/-0.124) for {'criterion': 'gini', 'min_samples_leaf': 4, 'min_samples_split': 2, 'splitter': 'best'}
    0.950 (+/-0.088) for {'criterion': 'gini', 'min_samples_leaf': 4, 'min_samples_split': 2, 'splitter': 'random'}
    0.942 (+/-0.124) for {'criterion': 'gini', 'min_samples_leaf': 4, 'min_samples_split': 3, 'splitter': 'best'}
    0.908 (+/-0.114) for {'criterion': 'gini', 'min_samples_leaf': 4, 'min_samples_split': 3, 'splitter': 'random'}
    0.942 (+/-0.124) for {'criterion': 'gini', 'min_samples_leaf': 4, 'min_samples_split': 4, 'splitter': 'best'}
    0.933 (+/-0.094) for {'criterion': 'gini', 'min_samples_leaf': 4, 'min_samples_split': 4, 'splitter': 'random'}
    0.942 (+/-0.124) for {'criterion': 'gini', 'min_samples_leaf': 4, 'min_samples_split': 5, 'splitter': 'best'}
    0.950 (+/-0.058) for {'criterion': 'gini', 'min_samples_leaf': 4, 'min_samples_split': 5, 'splitter': 'random'}
    0.942 (+/-0.124) for {'criterion': 'gini', 'min_samples_leaf': 4, 'min_samples_split': 6, 'splitter': 'best'}
    0.933 (+/-0.067) for {'criterion': 'gini', 'min_samples_leaf': 4, 'min_samples_split': 6, 'splitter': 'random'}
    0.942 (+/-0.124) for {'criterion': 'gini', 'min_samples_leaf': 4, 'min_samples_split': 7, 'splitter': 'best'}
    0.875 (+/-0.294) for {'criterion': 'gini', 'min_samples_leaf': 4, 'min_samples_split': 7, 'splitter': 'random'}
    0.942 (+/-0.124) for {'criterion': 'gini', 'min_samples_leaf': 4, 'min_samples_split': 8, 'splitter': 'best'}
    0.942 (+/-0.104) for {'criterion': 'gini', 'min_samples_leaf': 4, 'min_samples_split': 8, 'splitter': 'random'}
    0.942 (+/-0.124) for {'criterion': 'gini', 'min_samples_leaf': 4, 'min_samples_split': 9, 'splitter': 'best'}
    0.933 (+/-0.163) for {'criterion': 'gini', 'min_samples_leaf': 4, 'min_samples_split': 9, 'splitter': 'random'}
    0.950 (+/-0.129) for {'criterion': 'gini', 'min_samples_leaf': 5, 'min_samples_split': 2, 'splitter': 'best'}
    0.875 (+/-0.194) for {'criterion': 'gini', 'min_samples_leaf': 5, 'min_samples_split': 2, 'splitter': 'random'}
    0.950 (+/-0.129) for {'criterion': 'gini', 'min_samples_leaf': 5, 'min_samples_split': 3, 'splitter': 'best'}
    0.900 (+/-0.231) for {'criterion': 'gini', 'min_samples_leaf': 5, 'min_samples_split': 3, 'splitter': 'random'}
    0.950 (+/-0.129) for {'criterion': 'gini', 'min_samples_leaf': 5, 'min_samples_split': 4, 'splitter': 'best'}
    0.925 (+/-0.104) for {'criterion': 'gini', 'min_samples_leaf': 5, 'min_samples_split': 4, 'splitter': 'random'}
    0.950 (+/-0.129) for {'criterion': 'gini', 'min_samples_leaf': 5, 'min_samples_split': 5, 'splitter': 'best'}
    0.900 (+/-0.149) for {'criterion': 'gini', 'min_samples_leaf': 5, 'min_samples_split': 5, 'splitter': 'random'}
    0.950 (+/-0.129) for {'criterion': 'gini', 'min_samples_leaf': 5, 'min_samples_split': 6, 'splitter': 'best'}
    0.925 (+/-0.155) for {'criterion': 'gini', 'min_samples_leaf': 5, 'min_samples_split': 6, 'splitter': 'random'}
    0.950 (+/-0.129) for {'criterion': 'gini', 'min_samples_leaf': 5, 'min_samples_split': 7, 'splitter': 'best'}
    0.900 (+/-0.094) for {'criterion': 'gini', 'min_samples_leaf': 5, 'min_samples_split': 7, 'splitter': 'random'}
    0.950 (+/-0.129) for {'criterion': 'gini', 'min_samples_leaf': 5, 'min_samples_split': 8, 'splitter': 'best'}
    0.900 (+/-0.163) for {'criterion': 'gini', 'min_samples_leaf': 5, 'min_samples_split': 8, 'splitter': 'random'}
    0.950 (+/-0.129) for {'criterion': 'gini', 'min_samples_leaf': 5, 'min_samples_split': 9, 'splitter': 'best'}
    0.917 (+/-0.160) for {'criterion': 'gini', 'min_samples_leaf': 5, 'min_samples_split': 9, 'splitter': 'random'}
    0.933 (+/-0.094) for {'criterion': 'entropy', 'min_samples_leaf': 1, 'min_samples_split': 2, 'splitter': 'best'}
    0.942 (+/-0.080) for {'criterion': 'entropy', 'min_samples_leaf': 1, 'min_samples_split': 2, 'splitter': 'random'}
    0.942 (+/-0.104) for {'criterion': 'entropy', 'min_samples_leaf': 1, 'min_samples_split': 3, 'splitter': 'best'}
    0.892 (+/-0.114) for {'criterion': 'entropy', 'min_samples_leaf': 1, 'min_samples_split': 3, 'splitter': 'random'}
    0.933 (+/-0.094) for {'criterion': 'entropy', 'min_samples_leaf': 1, 'min_samples_split': 4, 'splitter': 'best'}
    0.933 (+/-0.133) for {'criterion': 'entropy', 'min_samples_leaf': 1, 'min_samples_split': 4, 'splitter': 'random'}
    0.942 (+/-0.104) for {'criterion': 'entropy', 'min_samples_leaf': 1, 'min_samples_split': 5, 'splitter': 'best'}
    0.933 (+/-0.094) for {'criterion': 'entropy', 'min_samples_leaf': 1, 'min_samples_split': 5, 'splitter': 'random'}
    0.942 (+/-0.104) for {'criterion': 'entropy', 'min_samples_leaf': 1, 'min_samples_split': 6, 'splitter': 'best'}
    0.942 (+/-0.104) for {'criterion': 'entropy', 'min_samples_leaf': 1, 'min_samples_split': 6, 'splitter': 'random'}
    0.950 (+/-0.088) for {'criterion': 'entropy', 'min_samples_leaf': 1, 'min_samples_split': 7, 'splitter': 'best'}
    0.942 (+/-0.104) for {'criterion': 'entropy', 'min_samples_leaf': 1, 'min_samples_split': 7, 'splitter': 'random'}
    0.942 (+/-0.124) for {'criterion': 'entropy', 'min_samples_leaf': 1, 'min_samples_split': 8, 'splitter': 'best'}
    0.942 (+/-0.080) for {'criterion': 'entropy', 'min_samples_leaf': 1, 'min_samples_split': 8, 'splitter': 'random'}
    0.942 (+/-0.124) for {'criterion': 'entropy', 'min_samples_leaf': 1, 'min_samples_split': 9, 'splitter': 'best'}
    0.917 (+/-0.160) for {'criterion': 'entropy', 'min_samples_leaf': 1, 'min_samples_split': 9, 'splitter': 'random'}
    0.942 (+/-0.080) for {'criterion': 'entropy', 'min_samples_leaf': 2, 'min_samples_split': 2, 'splitter': 'best'}
    0.958 (+/-0.065) for {'criterion': 'entropy', 'min_samples_leaf': 2, 'min_samples_split': 2, 'splitter': 'random'}
    0.942 (+/-0.080) for {'criterion': 'entropy', 'min_samples_leaf': 2, 'min_samples_split': 3, 'splitter': 'best'}
    0.958 (+/-0.065) for {'criterion': 'entropy', 'min_samples_leaf': 2, 'min_samples_split': 3, 'splitter': 'random'}
    0.942 (+/-0.080) for {'criterion': 'entropy', 'min_samples_leaf': 2, 'min_samples_split': 4, 'splitter': 'best'}
    0.925 (+/-0.104) for {'criterion': 'entropy', 'min_samples_leaf': 2, 'min_samples_split': 4, 'splitter': 'random'}
    0.933 (+/-0.094) for {'criterion': 'entropy', 'min_samples_leaf': 2, 'min_samples_split': 5, 'splitter': 'best'}
    0.958 (+/-0.093) for {'criterion': 'entropy', 'min_samples_leaf': 2, 'min_samples_split': 5, 'splitter': 'random'}
    0.942 (+/-0.104) for {'criterion': 'entropy', 'min_samples_leaf': 2, 'min_samples_split': 6, 'splitter': 'best'}
    0.942 (+/-0.169) for {'criterion': 'entropy', 'min_samples_leaf': 2, 'min_samples_split': 6, 'splitter': 'random'}
    0.950 (+/-0.088) for {'criterion': 'entropy', 'min_samples_leaf': 2, 'min_samples_split': 7, 'splitter': 'best'}
    0.942 (+/-0.104) for {'criterion': 'entropy', 'min_samples_leaf': 2, 'min_samples_split': 7, 'splitter': 'random'}
    0.942 (+/-0.124) for {'criterion': 'entropy', 'min_samples_leaf': 2, 'min_samples_split': 8, 'splitter': 'best'}
    0.950 (+/-0.058) for {'criterion': 'entropy', 'min_samples_leaf': 2, 'min_samples_split': 8, 'splitter': 'random'}
    0.942 (+/-0.124) for {'criterion': 'entropy', 'min_samples_leaf': 2, 'min_samples_split': 9, 'splitter': 'best'}
    0.950 (+/-0.111) for {'criterion': 'entropy', 'min_samples_leaf': 2, 'min_samples_split': 9, 'splitter': 'random'}
    0.950 (+/-0.088) for {'criterion': 'entropy', 'min_samples_leaf': 3, 'min_samples_split': 2, 'splitter': 'best'}
    0.950 (+/-0.111) for {'criterion': 'entropy', 'min_samples_leaf': 3, 'min_samples_split': 2, 'splitter': 'random'}
    0.950 (+/-0.088) for {'criterion': 'entropy', 'min_samples_leaf': 3, 'min_samples_split': 3, 'splitter': 'best'}
    0.900 (+/-0.189) for {'criterion': 'entropy', 'min_samples_leaf': 3, 'min_samples_split': 3, 'splitter': 'random'}
    0.958 (+/-0.065) for {'criterion': 'entropy', 'min_samples_leaf': 3, 'min_samples_split': 4, 'splitter': 'best'}
    0.933 (+/-0.094) for {'criterion': 'entropy', 'min_samples_leaf': 3, 'min_samples_split': 4, 'splitter': 'random'}
    0.950 (+/-0.088) for {'criterion': 'entropy', 'min_samples_leaf': 3, 'min_samples_split': 5, 'splitter': 'best'}
    0.925 (+/-0.124) for {'criterion': 'entropy', 'min_samples_leaf': 3, 'min_samples_split': 5, 'splitter': 'random'}
    0.958 (+/-0.065) for {'criterion': 'entropy', 'min_samples_leaf': 3, 'min_samples_split': 6, 'splitter': 'best'}
    0.908 (+/-0.210) for {'criterion': 'entropy', 'min_samples_leaf': 3, 'min_samples_split': 6, 'splitter': 'random'}
    0.958 (+/-0.065) for {'criterion': 'entropy', 'min_samples_leaf': 3, 'min_samples_split': 7, 'splitter': 'best'}
    0.908 (+/-0.132) for {'criterion': 'entropy', 'min_samples_leaf': 3, 'min_samples_split': 7, 'splitter': 'random'}
    0.942 (+/-0.124) for {'criterion': 'entropy', 'min_samples_leaf': 3, 'min_samples_split': 8, 'splitter': 'best'}
    0.925 (+/-0.104) for {'criterion': 'entropy', 'min_samples_leaf': 3, 'min_samples_split': 8, 'splitter': 'random'}
    0.942 (+/-0.124) for {'criterion': 'entropy', 'min_samples_leaf': 3, 'min_samples_split': 9, 'splitter': 'best'}
    0.908 (+/-0.162) for {'criterion': 'entropy', 'min_samples_leaf': 3, 'min_samples_split': 9, 'splitter': 'random'}
    0.942 (+/-0.124) for {'criterion': 'entropy', 'min_samples_leaf': 4, 'min_samples_split': 2, 'splitter': 'best'}
    0.950 (+/-0.088) for {'criterion': 'entropy', 'min_samples_leaf': 4, 'min_samples_split': 2, 'splitter': 'random'}
    0.942 (+/-0.124) for {'criterion': 'entropy', 'min_samples_leaf': 4, 'min_samples_split': 3, 'splitter': 'best'}
    0.950 (+/-0.058) for {'criterion': 'entropy', 'min_samples_leaf': 4, 'min_samples_split': 3, 'splitter': 'random'}
    0.942 (+/-0.124) for {'criterion': 'entropy', 'min_samples_leaf': 4, 'min_samples_split': 4, 'splitter': 'best'}
    0.950 (+/-0.088) for {'criterion': 'entropy', 'min_samples_leaf': 4, 'min_samples_split': 4, 'splitter': 'random'}
    0.942 (+/-0.124) for {'criterion': 'entropy', 'min_samples_leaf': 4, 'min_samples_split': 5, 'splitter': 'best'}
    0.942 (+/-0.104) for {'criterion': 'entropy', 'min_samples_leaf': 4, 'min_samples_split': 5, 'splitter': 'random'}
    0.942 (+/-0.124) for {'criterion': 'entropy', 'min_samples_leaf': 4, 'min_samples_split': 6, 'splitter': 'best'}
    0.933 (+/-0.115) for {'criterion': 'entropy', 'min_samples_leaf': 4, 'min_samples_split': 6, 'splitter': 'random'}
    0.942 (+/-0.124) for {'criterion': 'entropy', 'min_samples_leaf': 4, 'min_samples_split': 7, 'splitter': 'best'}
    0.933 (+/-0.133) for {'criterion': 'entropy', 'min_samples_leaf': 4, 'min_samples_split': 7, 'splitter': 'random'}
    0.942 (+/-0.124) for {'criterion': 'entropy', 'min_samples_leaf': 4, 'min_samples_split': 8, 'splitter': 'best'}
    0.942 (+/-0.080) for {'criterion': 'entropy', 'min_samples_leaf': 4, 'min_samples_split': 8, 'splitter': 'random'}
    0.942 (+/-0.124) for {'criterion': 'entropy', 'min_samples_leaf': 4, 'min_samples_split': 9, 'splitter': 'best'}
    0.908 (+/-0.148) for {'criterion': 'entropy', 'min_samples_leaf': 4, 'min_samples_split': 9, 'splitter': 'random'}
    0.950 (+/-0.129) for {'criterion': 'entropy', 'min_samples_leaf': 5, 'min_samples_split': 2, 'splitter': 'best'}
    0.958 (+/-0.065) for {'criterion': 'entropy', 'min_samples_leaf': 5, 'min_samples_split': 2, 'splitter': 'random'}
    0.950 (+/-0.129) for {'criterion': 'entropy', 'min_samples_leaf': 5, 'min_samples_split': 3, 'splitter': 'best'}
    0.917 (+/-0.088) for {'criterion': 'entropy', 'min_samples_leaf': 5, 'min_samples_split': 3, 'splitter': 'random'}
    0.950 (+/-0.129) for {'criterion': 'entropy', 'min_samples_leaf': 5, 'min_samples_split': 4, 'splitter': 'best'}
    0.950 (+/-0.088) for {'criterion': 'entropy', 'min_samples_leaf': 5, 'min_samples_split': 4, 'splitter': 'random'}
    0.950 (+/-0.129) for {'criterion': 'entropy', 'min_samples_leaf': 5, 'min_samples_split': 5, 'splitter': 'best'}
    0.950 (+/-0.088) for {'criterion': 'entropy', 'min_samples_leaf': 5, 'min_samples_split': 5, 'splitter': 'random'}
    0.950 (+/-0.129) for {'criterion': 'entropy', 'min_samples_leaf': 5, 'min_samples_split': 6, 'splitter': 'best'}
    0.900 (+/-0.115) for {'criterion': 'entropy', 'min_samples_leaf': 5, 'min_samples_split': 6, 'splitter': 'random'}
    0.950 (+/-0.129) for {'criterion': 'entropy', 'min_samples_leaf': 5, 'min_samples_split': 7, 'splitter': 'best'}
    0.908 (+/-0.114) for {'criterion': 'entropy', 'min_samples_leaf': 5, 'min_samples_split': 7, 'splitter': 'random'}
    0.950 (+/-0.129) for {'criterion': 'entropy', 'min_samples_leaf': 5, 'min_samples_split': 8, 'splitter': 'best'}
    0.925 (+/-0.080) for {'criterion': 'entropy', 'min_samples_leaf': 5, 'min_samples_split': 8, 'splitter': 'random'}
    0.950 (+/-0.129) for {'criterion': 'entropy', 'min_samples_leaf': 5, 'min_samples_split': 9, 'splitter': 'best'}
    0.925 (+/-0.104) for {'criterion': 'entropy', 'min_samples_leaf': 5, 'min_samples_split': 9, 'splitter': 'random'}
    best_parameters= {'criterion': 'gini', 'min_samples_leaf': 1, 'min_samples_split': 4, 'splitter': 'random'}
    best_accuracy= 0.975
    ----------------------------------------------

code

# And finally SVM
grid_values = {'C':[1,10,100],'kernel':['linear', 'poly', 'rbf', 'sigmoid'] }
clf_SVM = do_grid_search(SVM,grid_values,kfolds)
    SVC
    0.975 (+/-0.065) for {'C': 1, 'kernel': 'linear'}
    0.958 (+/-0.093) for {'C': 1, 'kernel': 'poly'}
    0.967 (+/-0.067) for {'C': 1, 'kernel': 'rbf'}
    0.192 (+/-0.140) for {'C': 1, 'kernel': 'sigmoid'}
    0.967 (+/-0.094) for {'C': 10, 'kernel': 'linear'}
    0.950 (+/-0.088) for {'C': 10, 'kernel': 'poly'}
    0.967 (+/-0.094) for {'C': 10, 'kernel': 'rbf'}
    0.192 (+/-0.140) for {'C': 10, 'kernel': 'sigmoid'}
    0.958 (+/-0.093) for {'C': 100, 'kernel': 'linear'}
    0.942 (+/-0.080) for {'C': 100, 'kernel': 'poly'}
    0.942 (+/-0.104) for {'C': 100, 'kernel': 'rbf'}
    0.192 (+/-0.140) for {'C': 100, 'kernel': 'sigmoid'}
    best_parameters= {'C': 1, 'kernel': 'linear'}
    best_accuracy= 0.975
    ----------------------------------------------

code

from sklearn.metrics import classification_report

# predictions for LR
Y_true, Y_pred = Y_validation, clf_LR.predict(X_validation)
print(classification_report(Y_true, Y_pred))

output

                     precision    recall  f1-score   support
    
        Iris-setosa       1.00      1.00      1.00        10
    Iris-versicolor       1.00      1.00      1.00         9
     Iris-virginica       1.00      1.00      1.00        10
    
        avg / total       1.00      1.00      1.00        29
# predictions for KNN
Y_true, Y_pred = Y_validation, clf_KNN.predict(X_validation)
print(classification_report(Y_true, Y_pred))
                     precision    recall  f1-score   support
    
        Iris-setosa       1.00      1.00      1.00        10
    Iris-versicolor       0.90      1.00      0.95         9
     Iris-virginica       1.00      0.90      0.95        10
    
        avg / total       0.97      0.97      0.97        29

code

# predictions for DT
Y_true, Y_pred = Y_validation, clf_DT.predict(X_validation)
print(classification_report(Y_true, Y_pred))
                     precision    recall  f1-score   support
    
        Iris-setosa       1.00      1.00      1.00        10
    Iris-versicolor       0.82      1.00      0.90         9
     Iris-virginica       1.00      0.80      0.89        10
    
        avg / total       0.94      0.93      0.93        29

code

# predictions for SVM
Y_true, Y_pred = Y_validation, clf_SVM.predict(X_validation)
print(classification_report(Y_true, Y_pred))
                     precision    recall  f1-score   support
    
        Iris-setosa       1.00      1.00      1.00        10
    Iris-versicolor       0.90      1.00      0.95         9
     Iris-virginica       1.00      0.90      0.95        10
    
        avg / total       0.97      0.97      0.97        29

Download notebook file for code and data at Classification

Written on March 16, 2014