Heart Disease Decision Tree Classifier Model

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The "target" field refers to the presence of heart disease in the patient.
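For quick reference, the 14 retained columns correspond to the following measurements. These are the standard UCI attribute descriptions, written out as a dict purely for readability (not code the notebook runs); it is worth double-checking them against the official dataset documentation.

# standard UCI heart-disease attribute meanings (for reference)
COLUMN_DESCRIPTIONS = {
    'age':      'age in years',
    'sex':      '1 = male, 0 = female',
    'cp':       'chest pain type (0-3)',
    'trestbps': 'resting blood pressure (mm Hg)',
    'chol':     'serum cholesterol (mg/dl)',
    'fbs':      'fasting blood sugar > 120 mg/dl (1 = true)',
    'restecg':  'resting electrocardiographic results (0-2)',
    'thalach':  'maximum heart rate achieved',
    'exang':    'exercise-induced angina (1 = yes)',
    'oldpeak':  'ST depression induced by exercise relative to rest',
    'slope':    'slope of the peak exercise ST segment',
    'ca':       'number of major vessels colored by fluoroscopy',
    'thal':     'thalassemia test result',
    'target':   '1 = heart disease present, 0 = absent',
}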

1. Get the Data Ready

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
%matplotlib inline
      
# the heart.csv file can be found in the GitHub repository along with this notebook
df = pd.read_csv('heart.csv')
df.head()
      
   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1       1
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
3   56    1   1       120   236    0        1      178      0      0.8      2   0     2       1
4   57    0   0       120   354    0        1      163      1      0.6      2   0     2       1
# double-check that the number of 1s and 0s in the target is reasonably balanced
df['target'].value_counts()

1    165
0    138
Name: target, dtype: int64
      

I converted the 1s and 0s into "yes" and "no" and renamed the column so the output is easier for viewers to read later on.

df['target'] = df['target'].replace({0: 'no', 1: 'yes'})
df = df.rename(columns={'target': 'heart_disease'})
      
# make sure the values are in the correct form for the conversion to a NumPy array
df.dtypes
      
age                int64
      sex                int64
      cp                 int64
      trestbps           int64
      chol               int64
      fbs                int64
      restecg            int64
      thalach            int64
      exang              int64
      oldpeak          float64
      slope              int64
      ca                 int64
      thal               int64
      heart_disease     object
      dtype: object
    

2. Pre-Processing

Get the arrays ready for the decision tree model: one array holding the independent variables (X) and one holding the target (y) that the model will learn to predict.

# the categorical features are already encoded as integers in this data set,
# so no extra encoding step is needed here
X = df[['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']].values
X[0:5]
      
array([[ 63. ,   1. ,   3. , 145. , 233. ,   1. ,   0. , 150. ,   0. ,
                2.3,   0. ,   0. ,   1. ],
            [ 37. ,   1. ,   2. , 130. , 250. ,   0. ,   1. , 187. ,   0. ,
                3.5,   0. ,   0. ,   2. ],
            [ 41. ,   0. ,   1. , 130. , 204. ,   0. ,   0. , 172. ,   0. ,
                1.4,   2. ,   0. ,   2. ],
            [ 56. ,   1. ,   1. , 120. , 236. ,   0. ,   1. , 178. ,   0. ,
                0.8,   2. ,   0. ,   2. ],
            [ 57. ,   0. ,   0. , 120. , 354. ,   0. ,   1. , 163. ,   1. ,
                0.6,   2. ,   0. ,   2. ]])
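Since X is simply every column except the label, an equivalent and less error-prone way to build it (a sketch, not what this notebook runs) is to drop the label column:

# equivalent construction: keep every feature column by dropping the label
X = df.drop(columns=['heart_disease']).values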
      
y = df[['heart_disease']]
y[0:5]

  heart_disease
0           yes
1           yes
2           yes
3           yes
4           yes
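Note that the double brackets return a single-column DataFrame rather than a 1-D array. scikit-learn accepts this but emits a DataConversionWarning when fitting; if you would rather avoid the warning, a flattened alternative (not what this notebook uses) would be:

# 1-D alternative: sidesteps scikit-learn's column-vector warning
y = df['heart_disease'].values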
Now use train_test_split to get training and testing splits for the X and y arrays.
      
from sklearn.model_selection import train_test_split
      
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

# check the size of each set; a notebook cell only echoes its last
# expression, so print each shape explicitly
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(212, 13) (212, 1)
(91, 13) (91, 1)
      

3. Prediction Time

We will be using a decision tree classifier with the "entropy" criterion, so that the tree chooses its splits by maximum information gain.
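To make "information gain" concrete, here is a minimal sketch of the entropy the criterion computes, using the class counts from value_counts() above (165 yes / 138 no); the helper function is my own illustration, not part of scikit-learn:

def entropy(counts):
    """Shannon entropy (in bits) of a label distribution."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return -np.sum(p * np.log2(p))

# entropy of the full data set's labels: 165 'yes' vs 138 'no'
print(entropy([165, 138]))  # ~0.994 bits, close to the 1.0 of a 50/50 split
# each candidate split is scored by how much it reduces this value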

heartTree = DecisionTreeClassifier(criterion="entropy")
      heartTree
      
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
                            max_features=None, max_leaf_nodes=None,
                            min_impurity_decrease=0.0, min_impurity_split=None,
                            min_samples_leaf=1, min_samples_split=2,
                            min_weight_fraction_leaf=0.0, presort=False,
                            random_state=None, splitter='best')
      
heartTree.fit(X_train,y_train)
      
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
                            max_features=None, max_leaf_nodes=None,
                            min_impurity_decrease=0.0, min_impurity_split=None,
                            min_samples_leaf=1, min_samples_split=2,
                            min_weight_fraction_leaf=0.0, presort=False,
                            random_state=None, splitter='best')
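One detail worth noting in the output above: random_state was left at None, so repeated runs can grow slightly different trees (scikit-learn uses the random state to break ties between equally good splits). A reproducible variant, with the seed being an arbitrary choice of mine, would be:

# pin the seed so reruns grow the identical tree
heartTree = DecisionTreeClassifier(criterion="entropy", random_state=3)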
      

Using the heartTree model we just fit, we now generate predictions on the test set to check how accurate the model really is.

predTree = heartTree.predict(X_test)
      
print(predTree[0:5])
print(y_test[0:5])
      
['yes' 'yes' 'yes' 'yes' 'yes']
          heart_disease
      245            no
      162           yes
      10            yes
      161           yes
      73            yes
      

The model is clearly not 100% accurate, but let's check the accuracy score to get a picture across all the predictions.

from sklearn import metrics
from sklearn.metrics import mean_squared_error  # used for the RMSE check below

print("Decision Tree's Accuracy: ", metrics.accuracy_score(y_test, predTree))
      
Decision Tree's Accuracy:  0.8131868131868132
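Accuracy alone hides where the remaining ~19% of errors fall. A confusion matrix would break them down into false positives and false negatives; a quick sketch using sklearn.metrics (not something the original notebook ran):

from sklearn.metrics import confusion_matrix

# rows are true labels, columns are predictions (label order: 'no', 'yes')
print(confusion_matrix(y_test.values.ravel(), predTree, labels=['no', 'yes']))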
      
from sklearn.preprocessing import OneHotEncoder

# one-hot encode the 'no'/'yes' labels so a numeric error can be computed;
# fit once and reuse the same encoder so both arrays share the same column order
enc = OneHotEncoder()

onehotlabels = enc.fit_transform(y_test.values).toarray()
onehotlabels2 = enc.transform(predTree.reshape(-1, 1)).toarray()

print(onehotlabels.shape, onehotlabels2.shape)

(91, 2) (91, 2)
      
def rmse(y_actual, y_pred):
    return np.sqrt(mean_squared_error(y_actual, y_pred))

print('RMSE score on test data:')
print(rmse(onehotlabels, onehotlabels2))

RMSE score on test data:
0.4322189107537832
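As a sanity check on that number: with two one-hot columns, every misclassified row contributes a squared error of 1 in each column, so the mean squared error equals the misclassification rate and the RMSE is just the square root of (1 - accuracy):

# each wrong prediction flips both one-hot columns, so
# MSE = (2 * n_errors) / (2 * n_samples) = error rate
print(np.sqrt(1 - 0.8131868131868132))  # ~0.43222, matching the RMSE above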
      

That is a fairly high root mean squared error (consistent with the ~81% accuracy), so I'll be coming back to improve this model!
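One likely first improvement, since an unpruned entropy tree can memorize the training set: limit the tree's depth and leaf size and pick the values by cross-validation. A sketch of that idea; the parameter grid and scoring choice are my assumptions, not part of this notebook:

from sklearn.model_selection import GridSearchCV

# hypothetical tuning grid; values chosen purely for illustration
params = {'max_depth': [2, 3, 4, 5, None], 'min_samples_leaf': [1, 5, 10]}
search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=3),
    params,
    cv=5,
    scoring='accuracy',
)
search.fit(X_train, y_train.values.ravel())
print(search.best_params_, search.best_score_)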