Prediction Analysis with Neural Networks and Linear Regression

Preface

A new week and a new algorithm to learn. That is how diversified Data Science is, spread widely across multiple tangents. The tangent we are discussing today is Neural Networks. Because the field is spread so widely, it can be confusing at times to decide which algorithm to deploy under particular circumstances and on a given data set. It is therefore important to have the right information about the algorithm being deployed and what you can get out of it. Do not worry, ‘The Datum’ is here for you! In brief, we are going to make predictions using two distinct algorithms, a Neural Network and a Linear Regression model. Then, from the predicted results, we will analyze how well each model performed on the given data set and which one predicts better.

To begin, let us have some theoretical background on both the models we are deploying. A good amount of theory on Linear Regression was discussed in the previous blog ‘Linear Regression and Predictions’: https://thedatum.data.blog/2019/06/16/linear-regression-and-predictions/ Please refer to the link to get a better idea of Linear Regression, in terms of both its theory and its practical working. Here, we will discuss and learn some theory about Neural Networks.

Neural Networks

We humans have the power to see things, learn them, retain them in our memory and deploy those learnings and experiences in future situations at their best. We are able to do this with the help of brain cells called neurons. Neurons are cells in our brains which have this power of learning things, storing them in memory and deploying them whenever necessary. With ever-growing studies, research and experiments, we are now able to embed this mechanism in computers in the form of what we call ‘Artificial Neural Networks (ANN)’. An ANN, as the name suggests, is also a neural network made up of neurons, but in this case artificial ones. We will not go much deeper into the similarities between actual neurons and artificial neurons here, but we will understand well how a basic ANN works, which is very similar to a biological neuron.

An Artificial Neural Network

Neural Networks work in layers. There are mainly three types of layers: one Input Layer, a varying number of Hidden Layers (depending on the application for which the Neural Network is being deployed) and lastly one Output Layer. The hidden layers do the jaw-dropping work; they are what make all the magic in neural networks, and all their applications, possible. Below is a notional representation of an Artificial Neural Network.

Notional representation of an Artificial Neural Network

To better understand how all of this is performed in the hidden layers, we need to understand weights, biases and activation functions.

Weights, Biases and Activation Functions

Weights are what help convert any input to its desired output. This is very similar to our regressions, where we have a fitted slope which acts as a weight and is multiplied by an input, giving us the predicted/expected output. As in regression, the weights here are numerical parameters which determine the ANN’s output and tell us how strongly each neuron affects the others. For example, if we have three inputs x1, x2 and x3, then the synaptic weights applied to them are w1, w2 and w3, giving the output as follows:

Output = y = sum(xi * wi) = x1*w1 + x2*w2 + x3*w3

One more thing: in the Linear Regression output we have an intercept, which adjusts the fitted slope. In an ANN we have the bias, which is very similar to the intercept of Linear Regression. It is an additional parameter used to adjust the output along with the weighted sum of the inputs to the neuron. The actual processing done by an ANN is therefore as follows:

Output = sum(weights*inputs) + bias

A function is always applied to the output of an ANN, which is what we call an Activation Function. The processing in neural networks is mainly achieved through activation functions. An activation function is any mathematical function which converts the input to the output; without an activation function, a Neural Network would simply behave as a basic linear model. The activation function is what separates Neural Networks from other models. There are many activation functions which can be used with a neural network, depending on the application. Common neural network activation functions are the Unit Step Function, Sigmoid, Hyperbolic Tangent, Rectified Linear Unit (ReLU), etc. Neural Networks are expected to model non-linear and complex behaviour, so the activation function deployed needs to be robust and should possess the following properties:

  1. It should be differentiable
  2. It should be simple, so that processing is fast
  3. It should preferably be zero-centered

Explaining all the activation functions is out of the scope of this work, but they are really pretty easy to understand as they are simple mathematical functions!
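To make this concrete, here is a minimal sketch in R of a single artificial neuron; the function names and input values are illustrative, not taken from this post’s listings.

    # Sigmoid activation: squashes any real number into (0, 1)
    sigmoid <- function(z) 1 / (1 + exp(-z))

    # A single artificial neuron: weighted sum of inputs plus bias,
    # passed through the activation function
    neuron <- function(inputs, weights, bias) {
      sigmoid(sum(inputs * weights) + bias)
    }

    # Example: three inputs x1, x2, x3 with synaptic weights w1, w2, w3
    x <- c(0.5, 0.1, 0.9)
    w <- c(0.4, -0.2, 0.7)
    b <- 0.1
    neuron(x, w, b)  # a single activated output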

With a good background on the functioning of an Artificial Neural Network and Linear Regression in our hands, let us begin building the models and the analysis in R.

Model

Data

We will be using a data set provided in an R library itself. The data set contains instances of houses in the suburbs of Boston. There are 14 variables in all (13 predictors and one response), listed below:

  1. $crim – per capita crime rate by town
  2. $zn – proportion of residential land zoned for lots over 25,000 square feet
  3. $indus – proportion of non-retail business acres per town
  4. $chas – Charles River dummy variable (1 if tract bounds river, 0 otherwise)
  5. $nox – nitrogen oxides concentration (parts per 10 million)
  6. $rm – average number of rooms per dwelling
  7. $age – proportion of owner-occupied units built prior to 1940
  8. $dis – weighted distances to 5 Boston employment centers
  9. $rad – index of accessibility to radial highways
  10. $tax – full value property tax rate per $10,000
  11. $ptratio – pupil-teacher ratio by town
  12. $black – 1000(Bk-0.63)^2 where Bk is the proportion of blacks by town
  13. $lstat – percent lower status of the population
  14. $medv – median value of owner occupied homes in $1000’s

From the above, our modelling considers $medv as the dependent variable and the other attributes as independent variables, for which we will build a Linear Regression model and an ANN model for predictions.

Loading and Scaling Data

The Boston data is available in the ‘MASS’ library of R, so we include the library MASS to load the data. If you do not have the library MASS, you can simply install it using install.packages(“MASS”); you need an internet connection for installing packages. Further, we call set.seed so that the results are reproducible. Then we load the data into an object called ‘data’ and view it.

Loading Data into R from R library
View of Data
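The original listing was shown as an image; below is a minimal sketch of what it likely contained. The object name ‘data’ follows the text above, while the seed value 500 is an arbitrary choice.

    # Load the MASS library, which ships the Boston housing data
    library(MASS)

    # Set a seed so that results are reproducible (value is arbitrary)
    set.seed(500)

    # Load the Boston data into an object called 'data' and view it
    data <- Boston
    View(data)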

Now we will scale our data. Scaling is important so that all variables are on a comparable footing; otherwise, variables with large ranges can dominate the model and the results suffer. We will first compute the mean and standard deviation of our data, then use the data.frame function of R to get the scaled data. Below are the respective listings.

Computes the ‘mean’ of the data and stores in object ‘mean’
Computes standard deviation of data and stores in object ‘sd’
Forming scaled data frame using mean and sd,
later viewing the scaled data
View of Head of scaled data (n=10)
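A sketch of what the scaling listings likely contained; the object names mean, sd and scaled_data follow the captions above.

    # Compute the column-wise mean of the data and store it in 'mean'
    mean <- apply(data, 2, mean)

    # Compute the column-wise standard deviation and store it in 'sd'
    sd <- apply(data, 2, sd)

    # Form the scaled data frame using mean and sd, then view its head
    scaled_data <- as.data.frame(scale(data, center = mean, scale = sd))
    head(scaled_data, n = 10)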

Training and Testing Sets

In data science, this is a really common practice. We divide the data into a training set and a testing set, the ratio usually being 70:30 respectively. This is because we build our models on the training set and validate them there. Once the model is validated and we are sure it is a good model, we use the testing data set to test its effectiveness: we make predictions on the testing data set. We will now see how we can split the data.

Splitting can be done using the sample() function of R. The first line of the listing below shows how we apply the sample function to the number of rows of our data with 0.70, which indicates the fraction of the split required. This could also be 0.80 for 80% or 0.90 for 90%, depending on the split required. The next two lines of the listing show how we use the as.data.frame() function with the object (here, split) created by the sample() function. We store the training data from scaled_data indexed with the split object, which is 70% of our data since we used 0.70. The testing data is taken from scaled_data indexed with the negative of the split object, which selects the remaining 30% of the data. Below is the entire listing for performing the split on the scaled data.

Listing for splitting the data
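A sketch of the splitting listing, consistent with the description above; split and data_test are named in the post, while data_train is an assumed name.

    # Sample 70% of the row indices of the scaled data
    split <- sample(nrow(scaled_data), round(0.70 * nrow(scaled_data)))

    # Training set: the sampled 70%; testing set: the remaining 30%
    data_train <- as.data.frame(scaled_data[split, ])
    data_test <- as.data.frame(scaled_data[-split, ])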

We will now create a ‘formula’ that we will use directly in the neural-network-creating function of R. First we save the names of all the data fields (14 columns) into an object called variables. Then we create a formula using the as.formula function. This step can be skipped if you write the formula manually: medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + black + lstat. This is all stored in the ‘formula’ object in the listing below.

Listing for creating formula
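A sketch of what the formula listing likely contained; the objects variables and formula are named in the text above.

    # Save the names of all 14 data fields
    variables <- names(scaled_data)

    # Build "medv ~ crim + zn + ... + lstat" programmatically
    formula <- as.formula(paste("medv ~",
                                paste(variables[variables != "medv"],
                                      collapse = " + ")))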

You will now have to load the library neuralnet. This can be done using library(‘neuralnet’). If the package is not installed you can use install.packages(“neuralnet”) followed by the library function mentioned above. This will load the neuralnet library which will be required to build the neural network model.

Building the Neural Network

We will now build the network using the neuralnet() function from the library we just loaded. Below is the listing to build the neural network.

Listing for building Neural Network
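A sketch of the network-building call, assuming the object name network that is used later in the post; the arguments are explained just below.

    # Build the neural network on the training data:
    # one hidden layer with 3 neurons, linear output for regression
    network <- neuralnet(formula,
                         data = data_train,
                         hidden = 3,
                         linear.output = TRUE)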

The above listing creates our neural network. We pass in the formula we created earlier as an argument, along with the training data. Here, hidden = 3 indicates a single hidden layer with 3 neurons (a vector such as hidden = c(5, 3) would give multiple hidden layers). Lastly, since we will be comparing our results with the Linear Regression model, we need to indicate that the neural network should give us linear output, thus linear.output = TRUE; FALSE would treat the problem as classification. Below is the summary of the Neural Network.

Summary of the Neural Network

We cannot infer much about the model from this summary, but briefly, the following can be interpreted:

  1. Length – Shows how many elements of this type are contained
  2. Class – Indicates specific class type
  3. Mode – Type of component (numeric, logical, list, etc.)

We can have a visual look at the Neural Network that has been formed by using the function plot() as plot(network), and this is what you get:

Plot of Neural Network

The black lines starting at the left are the input nodes; since we set hidden = 3 in the argument, we can see the 3 neurons of the hidden layer; and the single node at the end indicates the output layer for $medv. The numbers on the black arrows are what we call weights. The blue nodes, lines and numbers represent the bias. The bias is added at each step; as previously stated, it can be considered the ‘intercept’, similar to linear regression.

The neuralnet function we used builds a matrix to store all the results of the network formed. We can retrieve the matrix using network$result.matrix, and this is what we get:

Result Matrix of the Network

This contains all the information about the network in numeric form. The entries are easily interpretable; for example, crim.to.1layhid1 is the weight for the connection between the input $crim and the first node of the hidden layer.

With enough eyeballing of the network formed, we can now make some predictions on the test data set. The compute() function of the library neuralnet can be deployed to make the predictions on the testing data. Once predicted, we will compute the Mean Square Error of the predictions to test how well the neural network predicted. Below are the listings:

Listing for predictions using Neural Network
Computing Mean Square Error
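A sketch of the prediction and MSE listings; predictions_nn and mse_nn are illustrative names, while compute() and its net.result element come from the neuralnet package.

    # Predict on the test set using only the 13 predictor columns
    predictions_nn <- compute(network, data_test[, 1:13])

    # Mean Square Error of the neural network predictions
    mse_nn <- mean((data_test$medv - predictions_nn$net.result)^2)
    mse_nn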

In the argument of compute above, data_test[, 1:13] passes only columns 1 to 13 (the predictors) of the test data set and excludes the 14th column, $medv, which is the value we are predicting. Below is the Mean Square Error of the Neural Network predictions:

Mean Square Error of predictions made by Neural Network

Building Linear Regression Model

We will keep this result in mind for now, and build the linear regression model, followed by its predictions and the computation of its Mean Square Error for comparison. For a detailed walk-through of building a Linear Regression model and its predictions, visit my previous blog:

https://thedatum.data.blog/2019/06/16/linear-regression-and-predictions/

Below is the listing to build the linear regression model using the function lm(), make predictions with the function predict(), and compute the Mean Square Error of these predictions:

Listing for building, predicting and computing MSE of predictions for Linear Regression
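A sketch of the linear regression listings; model_lm, predictions_lm and mse_lm are illustrative names.

    # Fit the linear regression model on the training set
    model_lm <- lm(formula, data = data_train)
    summary(model_lm)

    # Predict on the test set and compute the Mean Square Error
    predictions_lm <- predict(model_lm, data_test)
    mse_lm <- mean((data_test$medv - predictions_lm)^2)
    mse_lm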

Below is the summary of the linear regression model:

Summary: Linear Regression model

For detailed understanding of the summary of linear regression visit:

https://thedatum.data.blog/2019/06/16/linear-regression-and-predictions/

Following is the Mean Square Error (MSE) of the predictions made by the linear regression.

Mean Square Error of predictions made by Linear Regression

From our knowledge and understanding of statistics, the Mean Square Error is a parameter that should be as small as possible, especially for predictions. And we can clearly see that the Mean Square Error of the Neural Network is smaller than the Mean Square Error of the Linear Regression. That is: 0.26104 (MSE, Neural Network) < 0.322887 (MSE, Linear Regression). Thus, after building the models, making predictions and validating them using the Mean Square Error, we can say that on this data set, ‘the Neural Network is the better predictor than the Linear Regression model’. So if you are looking for more accurate predictions here, it is wise to use the Neural Network.

Visual Analysis

Now we will make a visual comparison by plotting the actual values against the predicted values, for the Neural Network followed by the Linear Regression. Below is the listing to compute the graphs:

Listing for computing actual vs predicted graph for Neural Network and Linear Regressions
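A sketch of the plotting listing; the side-by-side layout and colours are illustrative.

    # Side-by-side actual vs predicted plots for both models
    par(mfrow = c(1, 2))

    plot(data_test$medv, predictions_nn$net.result,
         col = "blue", xlab = "Actual medv", ylab = "Predicted medv",
         main = "Neural Network")
    abline(0, 1)  # reference line: perfect predictions fall on it

    plot(data_test$medv, predictions_lm,
         col = "red", xlab = "Actual medv", ylab = "Predicted medv",
         main = "Linear Regression")
    abline(0, 1)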

This is what we get from the above lines of graph computation:

Actual vs Predicted plots of Neural Networks and Linear Regression

As we can make out visually, the predictions of the neural network are more concentrated around the line than those of the linear regression model. The difference is not very big, but if observed carefully it stands true. The difference is minimal because there is not much difference in the Mean Square Errors either; but since we will always opt for the better option, we should go with the Neural Network predictions, which are better than those of the Linear Regression here.

Conclusion and Takeaways:

  1. On this data set, the Neural Network model predicted better than the Linear Regression model
  2. This is shown by the Mean Square Error metric computed on the predicted outcomes of both models, where the MSE of the Neural Network was lower than that of the Linear Regression
  3. This happens because Neural Networks, through their hidden layers and activation functions, can capture non-linear relationships and so can fit the data better than Linear Regression
  4. The actual vs predicted plots for both models show that the predictions of the Neural Network are more concentrated around the line than those of the linear regression

In data science, no one model, algorithm or application is better than another. Comparisons like this one will hold true most of the time, but that does not write off Linear Regression predictions. It is always wise to test against multiple models so that we can be more confident in the models we build. Which model to deploy, and what to expect from its outcomes, depends on the type of data and the application for which the model is being built.

‘The Datum’ is growing and reaching out to many across the world, and it will continue to do so and give you the best of practical Data Science and Machine Learning algorithms. Now reaching readers across 40+ countries, we expect the same continued support from you. If you find this article helpful, please share it with those who could be helped by it, because together we can do a lot more. And do not forget to like and subscribe to The Datum for more such stuff.

Multiple Datums form Data; mindless Data can do wonders.

Mindful you and me together can do magic.

Cheers!..The Datum
