Heart Disease Decision Tree Classifier Model
This database contains 76 attributes, but all published experiments use a subset of 14 of them. In particular, the Cleveland database is the only one that ML researchers have used to date. The "target" field refers to the presence of heart disease in the patient.
1. Get the Data Ready
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
%matplotlib inline
# the heart.csv file can be found in the GitHub repo along with this notebook
df = pd.read_csv('heart.csv')
df.head()
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
# double-check that the target's 1s and 0s are reasonably balanced
df['target'].value_counts()
1 165
0 138
Name: target, dtype: int64
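Since seaborn and matplotlib are already imported, a quick countplot (a minimal sketch) makes the balance visible at a glance:

# visualize the class balance of the raw target column
sns.countplot(x='target', data=df)
plt.show()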
I converted the 1s and 0s into "yes" and "no" and renamed the column so it would be easier for viewers to read later on.
df['target'] = df['target'].replace({0: 'no', 1: 'yes'})
df = df.rename(columns={'target': 'heart_disease'})
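A quick sanity check (a sketch) confirms the counts carried over to the new labels:

# expected to mirror the counts above: yes 165, no 138
df['heart_disease'].value_counts()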
# making sure that our values are in the correct form for our shift to a numpy array
df.dtypes
age int64
sex int64
cp int64
trestbps int64
chol int64
fbs int64
restecg int64
thalach int64
exang int64
oldpeak float64
slope int64
ca int64
thal int64
heart_disease object
dtype: object
2. Pre-Processing
Get the arrays ready for the decision tree model: one array with the independent variables (the features) and one with the target labels that the model will learn to predict.
# the categorical columns are already numerically encoded in the data set, so no conversion was needed here
X = df[['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']].values
X[0:5]
array([[ 63. , 1. , 3. , 145. , 233. , 1. , 0. , 150. , 0. ,
2.3, 0. , 0. , 1. ],
[ 37. , 1. , 2. , 130. , 250. , 0. , 1. , 187. , 0. ,
3.5, 0. , 0. , 2. ],
[ 41. , 0. , 1. , 130. , 204. , 0. , 0. , 172. , 0. ,
1.4, 2. , 0. , 2. ],
[ 56. , 1. , 1. , 120. , 236. , 0. , 1. , 178. , 0. ,
0.8, 2. , 0. , 2. ],
[ 57. , 0. , 0. , 120. , 354. , 0. , 1. , 163. , 1. ,
0.6, 2. , 0. , 2. ]])
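For reference, columns like cp, slope, and thal are integer-coded categories rather than true one-hot columns; if explicit one-hot encoding were ever wanted, pandas.get_dummies would be one way to get it (a sketch only, not used for the model below):

# hypothetical alternative: expand the integer-coded categorical columns
# into one-hot indicator columns before building the feature matrix
X_onehot = pd.get_dummies(df.drop(columns=['heart_disease']),
                          columns=['cp', 'restecg', 'slope', 'ca', 'thal'])
X_onehot.head()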
y = df[['heart_disease']]
y[0:5]
| | heart_disease |
|---|---|
| 0 | yes |
| 1 | yes |
| 2 | yes |
| 3 | yes |
| 4 | yes |
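One side note: y here is a one-column DataFrame, which some sklearn versions warn about when fitting; selecting the column as a 1-D array is an alternative (not used below, so the outputs stay as shown):

# alternative: a 1-D target avoids sklearn's column-vector warning
y_1d = df['heart_disease'].values
y_1d[:5]  # array(['yes', 'yes', 'yes', 'yes', 'yes'], dtype=object)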
Now use train_test_split to get training and testing splits for the X and y arrays
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 3)
#check size of each set
X_train.shape
y_train.shape
X_test.shape
y_test.shape
(91, 1)
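A notebook cell only displays its last expression, which is why just y_test.shape shows above; printing all four makes the split explicit. With 303 rows and a 0.3 test size, the expected shapes are (212, 13), (212, 1), (91, 13), and (91, 1):

# print every shape instead of relying on the last-expression display
print(X_train.shape, y_train.shape)  # training features and labels
print(X_test.shape, y_test.shape)    # testing features and labels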
3. Prediction Time
We will be using a decision tree classifier with the "entropy" criterion, so the tree chooses the feature to split on at each node by information gain.
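To make the criterion concrete: entropy measures label impurity as -Σ p·log2(p) over the class proportions, and the tree greedily picks the split that reduces it the most. The sketch below computes the entropy of the training labels as an illustration (not part of the model code):

# entropy of a label array: -sum(p * log2(p)) over class proportions
def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy(y_train))  # close to 1.0, since the classes are near a 50/50 split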
heartTree = DecisionTreeClassifier(criterion="entropy")
heartTree
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=None, splitter='best')
heartTree.fit(X_train,y_train)
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=None, splitter='best')
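Before scoring the model, it can help to look at what the tree actually learned; sklearn.tree.export_text renders the splits as plain text (a sketch; export_text requires scikit-learn 0.21+ and assumes the same column order used to build X):

# render the learned splits as text, using the column order from X
from sklearn.tree import export_text
feature_cols = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
                'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
print(export_text(heartTree, feature_names=feature_cols))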
Using the fitted heartTree model, we now predict on the test set to check how accurate the model really is.
predTree = heartTree.predict(X_test)
print (predTree[0:5])
print (y_test[0:5])
['yes' 'yes' 'yes' 'yes' 'yes']
heart_disease
245 no
162 yes
10 yes
161 yes
73 yes
It seems the model is not 100% accurate, but let's check the accuracy score to get a picture across all the predictions.
from sklearn import metrics
from sklearn.metrics import mean_squared_error
# keep the predictions in a DataFrame for the one-hot encoding step below
x = pd.DataFrame(predTree)
print("DecisionTree's Accuracy: ", metrics.accuracy_score(y_test, predTree))
DecisionTree's Accuracy:  0.8131868131868132
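Accuracy alone hides which class is being misclassified; a confusion matrix breaks the 81% down by class (a minimal sketch):

# rows are the actual class, columns the predicted class, ordered no/yes
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, predTree, labels=['no', 'yes']))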
from sklearn.preprocessing import OneHotEncoder
# one-hot encode the true labels and the predictions so an RMSE can be computed;
# OneHotEncoder sorts its categories, so both arrays share the no/yes column order
enc = OneHotEncoder()
enc.fit(y_test)
onehotlabels = enc.transform(y_test).toarray()
onehotlabels.shape
enc.fit(x)
onehotlabels2 = enc.transform(x).toarray()
onehotlabels2.shape
(91, 2)
# root mean squared error between two identically shaped arrays
def rmse(y_actual, y_pred):
    return np.sqrt(mean_squared_error(y_actual, y_pred))
print('RMSE score on test data:')
print(rmse(onehotlabels, onehotlabels2))
RMSE score on test data:
0.4322189107537832
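Worth noting: with binary one-hot labels every wrong prediction contributes the same squared error, so this RMSE is exactly sqrt(1 - accuracy), which a one-liner confirms:

# sqrt(1 - 0.8131...) reproduces the 0.4322... above
print(np.sqrt(1 - metrics.accuracy_score(y_test, predTree)))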
That is a fairly high root mean squared error; I'll be coming back to improve this model!