I want to use StackingClassifier and VotingClassifier with StratifiedKFold and cross_val_score. I am getting nan values from cross_val_score whenever I use StackingClassifier or VotingClassifier. If I use any other algorithm instead of StackingClassifier or VotingClassifier, cross_val_score works fine. I am using Python 3.8.5 and scikit-learn 0.23.2.
Jupyter notebook attached: StackingVotingClassifierIssue.ipynb.
Dataset attached: parkinsons.csv. I moved the status column (the target feature) to the rightmost position in parkinsons.csv.
The dataset can also be found at this Kaggle link.
Below is the full code and the output.
StackingVotingClassifierIssue.zip
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import preprocessing
from sklearn import metrics
from sklearn import model_selection
from sklearn import feature_selection
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')
dataset = pd.read_csv('parkinsons.csv')
FS_X=dataset.iloc[:,:-1]
FS_y=dataset.iloc[:,-1:]
FS_X.drop(['name'],axis=1,inplace=True)
select_k_best = feature_selection.SelectKBest(score_func=feature_selection.f_classif,k=15)
X_k_best = select_k_best.fit_transform(FS_X,FS_y)
supportList = select_k_best.get_support().tolist()
p_valuesList = select_k_best.pvalues_.tolist()
toDrop=[]
for i in np.arange(len(FS_X.columns)):
    bool = supportList[i]
    if(bool == False):
        toDrop.append(FS_X.columns[i])
FS_X.drop(toDrop,axis=1,inplace=True)
smote = SMOTE(random_state=7)
Balanced_X,Balanced_y = smote.fit_sample(FS_X,FS_y)
before = pd.merge(FS_X,FS_y,right_index=True, left_index=True)
after = pd.merge(Balanced_X,Balanced_y,right_index=True, left_index=True)
b=before['status'].value_counts()
a=after['status'].value_counts()
print('Before')
print(b)
print('After')
print(a)
SkFold = model_selection.StratifiedKFold(n_splits=10, random_state=7, shuffle=False)
estimators_list = list()
KNN = KNeighborsClassifier()
RF = RandomForestClassifier(criterion='entropy',random_state = 1)
DT = DecisionTreeClassifier(criterion='entropy',random_state = 1)
GNB = GaussianNB()
LR = LogisticRegression(random_state = 1)
estimators_list.append(LR)
estimators_list.append(RF)
estimators_list.append(DT)
estimators_list.append(GNB)
SCLF = StackingClassifier(estimators = estimators_list,final_estimator = KNN,stack_method = 'predict_proba',cv=SkFold,n_jobs = -1)
VCLF = VotingClassifier(estimators = estimators_list,voting = 'soft',n_jobs = -1)
scores1 = model_selection.cross_val_score(estimator = SCLF,X=Balanced_X.values,y=Balanced_y.values,scoring='accuracy',cv=SkFold)
print('StackingClassifier Scores',scores1)
scores2 = model_selection.cross_val_score(estimator = VCLF,X=Balanced_X.values,y=Balanced_y.values,scoring='accuracy',cv=SkFold)
print('VotingClassifier Scores',scores2)
scores3 = model_selection.cross_val_score(estimator = DT,X=Balanced_X.values,y=Balanced_y.values,scoring='accuracy',cv=SkFold)
print('DecisionTreeClassifier Scores',scores3)
Output
Before
1 147
0 48
Name: status, dtype: int64
After
1 147
0 147
Name: status, dtype: int64
StackingClassifier Scores [nan nan nan nan nan nan nan nan nan nan]
VotingClassifier Scores [nan nan nan nan nan nan nan nan nan nan]
DecisionTreeClassifier Scores [0.86666667 0.9 0.93333333 0.86666667 0.96551724 0.82758621
0.75862069 0.86206897 0.86206897 0.93103448]
This is because your cross_val_score is raising an internal error. By default we are permissive and replace the score by a nan. To get the traceback, you need to pass error_score="raise".
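A minimal sketch of that change, reusing the SCLF, Balanced_X, Balanced_y and SkFold objects from the snippet above (the only difference from the original call is the added error_score argument):
# Identical to the earlier call, except that per-fold failures are
# re-raised instead of being silently converted to nan.
scores1 = model_selection.cross_val_score(
    estimator=SCLF,
    X=Balanced_X.values,
    y=Balanced_y.values,
    scoring='accuracy',
    cv=SkFold,
    error_score='raise',
)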
With that change, in your case I am getting: