Multiple Datums form Data; mindless Data can do wonders.
Mindful you and me together can do magic!
How often are you in a situation where you have only two alternatives: a yes or a no, black or white, and so on? These are instances where you 'classify' your scenario into one of two outcomes; the number of outcomes may vary, but usually there are two. This is what we call 'classification': we sort outcomes into a set number of categories, usually two. This week at The Datum we look at how we can use neural networks as a classification model. Once we have the model in hand we will make predictions with it, and lastly we will evaluate our model and its predictions for correctness.
Today we will build a very interesting model, one that can classify breast cancer. Cancer is the uncontrolled multiplication of cells, and breast cancer is caused by cells in the mammary gland transforming into malignant cells. As they multiply, these cells can detach themselves from the tissue in which they formed and invade the surroundings. The breast is formed from multiple types of cells, but the most common breast cancers arise from glandular cells or from the cells forming the walls of the ducts.
The objective of today's model is to classify tumors as benign or malignant, the two classes covering the most common types of breast cancer. For this model we will use the BreastCancer data set, which is readily available in R through the mlbench package. Using this data we will classify benign and malignant breast cancer with a neural network as the classifier.
As mentioned, the data is readily available in R's mlbench library. The data set has 11 attributes; apart from the Id and the Class columns, each variable is a numerical attribute with values ranging from 1 through 10, and there are some missing values as well. Following are the data fields of this data set.
$Id: Sample code number
$Cl.thickness: Clump thickness
$Cell.size: Uniformity of cell size
$Cell.shape: Uniformity of cell shape
$Marg.adhesion: Marginal adhesion
$Epith.c.size: Single epithelial cell size
$Bare.nuclei: Bare nuclei
$Bl.cromatin: Bland chromatin
$Normal.nucleoli: Normal nucleoli
$Mitoses: Mitoses
$Class: Class (benign or malignant)
With a good amount of information on the model brief and the data, we are ready to implement the neural network classifier model in R. For an in-depth understanding of neural network concepts and functioning, the blog post Prediction Analysis with Neural Networks and Linear Regression is recommended; it has all the information needed to understand the working of neural networks. It can be found by hitting the following link:
Building Model in R
Loading, Cleaning and Preparing Data
Building the algorithm begins with loading the relevant R libraries. We first load the library mlbench, which has our required data, along with the library neuralnet, which will be needed for building our neural network. If you do not have these libraries, you can first install them using the R function install.packages, for example install.packages("mlbench"); just make sure you have an active internet connection. Now you can load the libraries as mentioned above. The next steps involve loading the data, which is easily done with the data() function, and viewing the summary of our data as shown below:
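The listing for this step might look like the following sketch; the package and data set names are real, while the comments are added here for clarity:

```r
# Load the required libraries; install them first if they are missing,
# e.g. install.packages("mlbench") and install.packages("neuralnet")
library(mlbench)    # provides the BreastCancer data set
library(neuralnet)  # provides the neuralnet() function for modeling

# Load the data into the session and view its summary
data(BreastCancer)
summary(BreastCancer)
```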
The above is the summary of the BreastCancer data, summarizing all the data fields in the data set. As highlighted above, the data set contains missing instances, or NA's. We need to fix those so that we can obtain good results in modeling; this is what we call data preparation, preparing the data to get the best out of modeling. To fix the missing instances we use the function na.omit(), which omits the rows with missing terms. Since the missing terms are few compared to the total data, omitting them will not affect our modeling. If the missing terms were very large in number, we would need methods like median imputation to fill in the missing data. Below is the listing for using na.omit() and then viewing the summary of the cleaned data.
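A sketch of the cleaning step; the name Cancer_data for the cleaned data frame is an assumption:

```r
# Drop the rows containing missing values (NA's)
Cancer_data <- na.omit(BreastCancer)

# Verify that no missing values remain
summary(Cancer_data)
```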
As we can see in the summary of the cleaned data, we no longer have any missing terms. Thus, we have fixed the only issue with the data and have now prepared it for modeling.
We can build the best models out of data when we know our data well. It is always a good idea to plot some visuals so that we better understand how our data behaves; these visuals let us see relationships in the data that are not apparent just by glancing at it. Visual plots give us a lot of information; a box plot, for example, summarizes an entire data field by representing the data with respect to its quartiles. Below is a simple listing for plotting a box plot of all the data fields except the first, followed by the box plot obtained.
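One way the box-plot listing could look; since the predictor columns in BreastCancer are stored as factors, they are converted to numeric before plotting (the helper name plot_data is an assumption):

```r
# Columns 2 through 10 are the predictors; convert from factor to numeric
plot_data <- data.frame(lapply(Cancer_data[, 2:10],
                               function(x) as.numeric(as.character(x))))

# One box plot per data field, with axis labels drawn perpendicular
boxplot(plot_data, las = 2)
```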
What we can infer is that most of the data fields have a low median, with the highest median belonging to Cl.thickness, equal to 4. Mitoses does not have much to say, as we can see visually. In this way we get a general idea of the data set, which is good practice.
Now we will plot histograms of all nine data fields visualized above; this will help us understand how frequently values arise in each field. We first divide the plot space into nine areas, one per field, so that we can see and compare all the histograms at once, and then plot each histogram. For partitioning the plot space we use R's par() function, dividing it into a 3 x 3 grid and giving us nine plot partitions. Below is the listing for the partition, followed by plotting all nine histograms.
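A sketch of that listing; the factor-to-numeric conversion is repeated here so the snippet stands on its own:

```r
# Partition the plotting area into a 3 x 3 grid
par(mfrow = c(3, 3))

# Convert the factor predictors to numeric, then draw one histogram per field
plot_data <- data.frame(lapply(Cancer_data[, 2:10],
                               function(x) as.numeric(as.character(x))))
for (field in names(plot_data)) {
  hist(plot_data[[field]], main = field, xlab = field)
}
```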
As we can see above, most data fields have a much higher frequency at the low end of their range, which is also what produced such box plots. Done with a good amount of visualization, we now move forward to build our neural network model.
Another very important step before building the model is scaling the data. We get the best out of a model when the data is scaled, because every unit change in an input then corresponds to a comparable change in the output. We first separate out the required columns, 2 through 10, which are stored as factors, and convert them to numeric via character. In short, we convert the data in columns 2 through 10 into numeric form.
Once we have all the data as numeric, we compute the min and max of each column and then form the final scaled data, as listed below.
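The conversion and min-max scaling might be sketched as follows, assuming the cleaned data frame is named Cancer_data:

```r
# Convert columns 2 through 10 from factor to numeric via character
Cancer_data[, 2:10] <- lapply(Cancer_data[, 2:10],
                              function(x) as.numeric(as.character(x)))

# Compute the per-column min and max, then min-max scale to the 0-1 range
maxs <- apply(Cancer_data[, 2:10], 2, max)
mins <- apply(Cancer_data[, 2:10], 2, min)
Cancer_data[, 2:10] <- scale(Cancer_data[, 2:10],
                             center = mins, scale = maxs - mins)
```

With this scaling, every predictor contributes on the same 0-to-1 footing during training.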
We will now add two more columns to the data set, 'benign' and 'malignant', each a binary indicator taking the value 1 for its own class and 0 otherwise. Below is the listing to add the two new columns using R's model.matrix function, giving us the final prepared data for modeling.
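A sketch of that step; model.matrix with the intercept removed produces one indicator column per level of Class:

```r
# Build one 0/1 indicator column per class (no intercept), then attach them
class_ind <- model.matrix(~ Class - 1, data = Cancer_data)
Cancer_data$benign    <- class_ind[, "Classbenign"]
Cancer_data$malignant <- class_ind[, "Classmalignant"]
```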
This processing is done because once we have our targets in binary format, classification modeling becomes simpler and model performance improves.
Neural Network Modelling
As always, before modeling we split the data into two parts, the training and the testing data sets. Below is the listing for splitting the data 70:30 into training and testing sets respectively.
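One common way to write that split (the seed value and the names train_data and test_data are arbitrary choices here):

```r
# Reproducible 70:30 split into training and testing sets
set.seed(123)
train_idx  <- sample(nrow(Cancer_data), round(0.7 * nrow(Cancer_data)))
train_data <- Cancer_data[train_idx, ]
test_data  <- Cancer_data[-train_idx, ]
```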
We will now form the formula required for modeling and finally use the neuralnet function to build our model, as listed below. The formula formed is simply as follows:
benign + malignant ~ Cl.thickness + Cell.size + Cell.shape + Marg.adhesion + Epith.c.size + Bare.nuclei + Bl.cromatin + Normal.nucleoli + Mitoses
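The model-building listing might then look like this; hidden = 5 is an arbitrary choice for illustration, not a value prescribed by the original post:

```r
# The classification formula: two binary outputs, nine scaled predictors
nn_formula <- benign + malignant ~ Cl.thickness + Cell.size + Cell.shape +
  Marg.adhesion + Epith.c.size + Bare.nuclei + Bl.cromatin +
  Normal.nucleoli + Mitoses

# Train the network; linear.output = FALSE applies the activation function
# to the output neurons, which is what we want for classification
nn_model <- neuralnet(nn_formula, data = train_data,
                      hidden = 5, linear.output = FALSE)
```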
Since we now have our model ready, we can make predictions with it. Below is the listing for making predictions using the model.
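A sketch of the prediction step using neuralnet's compute() function; the names nn_model, test_data, and pred_class are assumptions carried over from the sketches above:

```r
# Feed the test predictors (columns 2 through 10) through the trained network
nn_pred <- compute(nn_model, test_data[, 2:10])

# The two output columns follow the formula order: benign first, malignant second
pred_class <- ifelse(nn_pred$net.result[, 1] > nn_pred$net.result[, 2],
                     "benign", "malignant")
```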
A confusion matrix is the most reliable tool for evaluating predictions. It is a table of the classifier's predictions against the actual known categories: a count of how often each known outcome (the truth) occurred in combination with each type of prediction.
For our model we can compute the confusion matrix as listed below.
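In R this is a one-liner with table(), assuming the predicted labels are in a vector pred_class and the true labels in test_data$Class:

```r
# Cross-tabulate predicted labels against the true Class labels
conf_mat <- table(predicted = pred_class, actual = test_data$Class)
conf_mat
```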
Accuracy is the number of items categorized correctly divided by the total number of items; simply, the fraction of the time the classifier is correct. Accuracy is given as (TP + TN) / (TP + FP + TN + FN) = (66 + 128) / (66 + 5 + 128 + 6) ≈ 0.95. We can see that the predictions are about 95% accurate.
Precision and Recall
Precision is the fraction of the items the classifier flags as being in the class that actually are in the class: TP / (TP + FP) = 66 / (66 + 5) ≈ 0.93, or 93% precise.
Recall is the fraction of the items in the class that are detected by the classifier: TP / (TP + FN) = 66 / (66 + 6) ≈ 0.92, or 92% recall.
If either precision or recall is small, F1 is also small; the idea is that a classifier which improves precision or recall by sacrificing a lot of the complementary measure will have a lower F1. F1 is 2 * precision * recall / (precision + recall) = 2 * 0.93 * 0.92 / (0.93 + 0.92) ≈ 0.92, or 92%.
Sensitivity and Specificity
Sensitivity, also called the true positive rate, is exactly equal to recall, here 0.92. Specificity, also called the true negative rate, is equal to TN / (TN + FP) = 128 / (128 + 5) ≈ 0.96, or 96% specific.
All the evaluation parameters computed give us very good numbers, every one above 90%, so we can state that we have a very good model making excellent predictions. Below is the summarized evaluation table of our predictions.
| Evaluation Term | Value (%) |
| --- | --- |
| Accuracy | 95% |
| Precision | 93% |
| Sensitivity = Recall | 92% |
| F1 | 92% |
| Specificity | 96% |
Neural Network Classification Takeaways
- Neural network classifiers can give better classification results than many other statistical classifiers
- Modeling is very simple once data is pre-processed and scaled
- Predictions can be very accurate, as seen above
Neural network modeling is a reliable way to get accurate results, and its application to classifying breast cancer shows promising results. In this way Data Science, Machine Learning, and Neural Networks can contribute to better human living in many possible ways; there are numerous applications where they can be deployed in this data-driven world.
Follow, like and share THE DATUM blogs for such exciting practical data science and machine learning algorithms.