Linear Regression and Predictions


Multiple Datums can form Data, Mindless Data can do wonders,

Mindful you and me can do magic, because together we can get better!!

Introduction

Whenever we make a crucial decision, it usually depends on multiple factors that affect the final outcome. For example, suppose we need to sell a used car. The price at which the car should be sold depends on the car's make, miles travelled, years used, mileage, etc. In other words, the selling price of a used car is a scalar measure of the factors we just listed. Linear regression is one such model that can help us estimate this value with simple statistical computations. To build a linear regression model, we can use historical data on similar used cars. The model can then be used to predict a car's selling price once we feed it the car's attributes. That is exactly what we will walk through today: building a model on data, analysing the model, and finally predicting with our regression model.

Understanding Linear Regression

As interpreted earlier, the single value we want to predict, i.e. the selling price of a car (the dependent variable), depends on multiple factors such as the car's make, miles travelled, years used, mileage, etc. (the independent variables). Linear regression models this by fitting a linear function to the independent variables to produce a scalar dependent output. Since the function fitted to the data points is linear, the model is called a linear regression. Linear regression gives a continuous scalar output, for example the selling price of a car, which is a continuous value. Below is an image showing data points in a space and a linear line fitted to them; this fitted line can then be used for prediction, as shown in the figure:

Notional Representation of Linear Regression modelling and predicting
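In equation form (a standard formulation, added here for reference rather than taken from the figure), the fitted line assumes the dependent variable is a weighted sum of the independent variables plus an intercept and an error term:

    y = b0 + b1*x1 + b2*x2 + ... + bk*xk + e

where b0 is the intercept, b1..bk are the coefficients the model estimates for the k independent variables x1..xk, and e is the residual error the line cannot explain.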

Implementing Linear Regression in R

Data for Modelling

For this linear regression modelling we will be using a data set provided in an R library itself. The data set contains instances of houses in the suburbs of Boston. There are in all 14 attributes in this data, listed below:

  1. $crim – per capita crime rate by town
  2. $zn – proportion of residential land zoned for lots over 25,000 square feet
  3. $indus – proportion of non-retail business acres per town
  4. $chas – Charles River dummy variable (1 if tract bounds river, 0 otherwise)
  5. $nox – nitric oxides concentration (parts per 10 million)
  6. $rm – average number of rooms per dwelling
  7. $age – proportion of owner-occupied units built prior to 1940
  8. $dis – weighted distances to 5 Boston employment centers
  9. $rad – index of accessibility to radial highways
  10. $tax – full value property tax rate per $10,000
  11. $ptratio – pupil-teacher ratio by town
  12. $black – 1000(Bk-0.63)^2 where Bk is the proportion of blacks by town
  13. $lstat – percent lower status of the population
  14. $medv – median value of owner occupied homes in $1000’s

From the above, our model treats $medv as the dependent variable and the other attributes as independent variables, to which we need to fit a linear regression so that we can make predictions.

Building Linear Regression Model

Loading and Scaling Data

First, we start our modelling by loading the data into R. The Boston data set lives in the 'MASS' library of R. We first load the MASS library using R's library() function. Once loaded, the data is available to us as 'Boston', which can be assigned to an object, here named 'data'. The listings that follow show the structure and summary of the data.

Listings for loading data and viewing structure and summary
Structure and Summary of the Boston data
Summary indicates there are no missing values ‘NAs’ in any of the data instances
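For reference, a minimal sketch of what the pictured listing contains (the object name 'data' is the post's; the rest is standard base R):

    # Load the MASS library, which ships with the Boston data set
    library(MASS)

    # Assign the data set to an object for convenience
    data <- Boston

    # Inspect the structure: column types and first few values
    str(data)

    # Per-column summaries; no NA counts appear,
    # confirming there are no missing values
    summary(data)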

When building linear regression models it is important to split the data into a training set and a testing set. We build the model on the training data without exposing the testing data to it: the model learns from the training data, and predictions from the built model are then made on the testing set. Here we split the data 70% for training and 30% for testing. We use R's sample() function to draw the split and then assign the two parts to the train_data and test_data objects. Below is the listing for the data split.

Listing for splitting the data and assigning to objects
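A minimal sketch of such a 70/30 split, assuming a simple random partition (the original listing is an image, so the exact call may differ; set.seed() is an addition here for reproducibility):

    # Make the random split reproducible (an addition, not from the post)
    set.seed(42)

    # Sample 70% of the row indices for training
    idx <- sample(1:nrow(data), size = floor(0.7 * nrow(data)))

    # Assign the split to the objects named in the post
    train_data <- data[idx, ]
    test_data  <- data[-idx, ]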

Building Linear Regression using lm()

We will now build our linear regression model. R has a well-known function, lm(), used to build linear regression models. In its formula, the dependent variable is separated from the independent variables by '~', and the 'data=' argument tells the function which data to use (in our case train_data). Below is the listing for building the linear regression model using lm(), where the dependent variable medv is separated from the independent variables by ~. The independent variables are indicated with a dot '.', which means all remaining variables are used.

Regression model listing
Summary of reg
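The pictured listing presumably boils down to the following (the model object name reg comes from the post's own caption):

    # Fit a linear regression of medv on all other columns
    reg <- lm(medv ~ ., data = train_data)

    # Print the full model summary: call, residuals,
    # coefficients, and quality statistics
    summary(reg)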

Understanding Model Summary

Let us understand the summary above in detail. The very first section is the Call, which echoes the lm() model we invoked. The second section is Residuals, and in regression, residuals are everything; we will elaborate on them once we make predictions on the test set. The next section is Coefficients: the model's computed values for each independent variable, showing the Estimate, Std. Error, t value, and Pr(>|t|), the last of which is also known as the p-value.

What do the coefficients tell you?

The estimate indicates the impact of that independent variable on the final computed value. The very first coefficient, (Intercept), is the constant the model adds for the right fit. The other coefficients belong to the independent variables used to build the model. For example, consider the Estimate of $crim, which is -0.13967, and that of $rad, which is 0.15523. Since we are estimating $medv, i.e. the median value of owner-occupied homes in $1000's, a unit increase in $crim (the crime rate) decreases $medv by 0.13967, while a unit increase in $rad raises $medv by 0.15523.

Our model turns out to be quite logical and makes sensible estimates: if the crime rate in an area is high, the median housing value is likely to go down, so the estimate for crime is negative. Conversely, where there is access to a radial highway ($rad), the median price $medv is likely to rise, so $rad carries a positive estimate. Thus the +/- sign indicates whether a variable pushes the dependent estimate up or down, and the magnitude of the estimate indicates how large that impact is in the respective direction.

Std. Error measures the uncertainty in each estimate, and the t value is the coefficient divided by its Std. Error. The other most frequently consulted quantity in linear regression is the p-value, which helps us determine how significant the results are; a p-value should be very small for a result to be interpreted as highly significant. The symbols next to the p-values are R's significance codes: '***' means p < 0.001, '**' means p < 0.01, '*' means p < 0.05, and '.' means p < 0.1. If there is no symbol next to a p-value, that variable is not significant at any of these conventional levels.
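If you would rather read the coefficient table programmatically than off the printed summary, base R exposes it directly (a small convenience, not shown in the post):

    # Extract the Estimate / Std. Error / t value / Pr(>|t|) table
    coefs <- coef(summary(reg))
    print(coefs)

    # The estimates for the two variables discussed above
    coefs[c("crim", "rad"), "Estimate"]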

What do the model stats tell you?

Moving on, we have the model quality indicators:

  1. Residual standard error is the standard deviation of the residuals, i.e. a typical gap between the observed and predicted values.
  2. R-squared measures how close the data lie to the fitted regression line; the closer the value is to 1, the better.
  3. Adjusted R-squared measures goodness of fit while accounting for the number of independent variables in the model; this too should be as close to 1 as possible, indicating good results.
  4. The F-statistic captures the combined effect of all the variables; if its value is high, something in our model is significant.
  5. Lastly, the p-value is the probability that the apparent significance arose just by chance, so for a good model the p-value should be very small, as it rightly is here.

Thus the regression model summary indicates that the model is statistically sound and can be deployed to make predictions on the test data.
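These quality indicators can also be extracted programmatically from the summary object (standard summary.lm fields, assumed here rather than shown in the post):

    s <- summary(reg)

    # Residual standard error, R-squared, and adjusted R-squared
    s$sigma
    s$r.squared
    s$adj.r.squared

    # F-statistic (value, numerator df, denominator df)
    s$fstatistic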

How to predict with our fitted model?

We are going to use R's predict() function to predict values with our fitted linear regression model. As we all know, the test data already has a $medv column, so we will introduce a new column in the test data called 'predicted' and then compare how closely (or not) our model's predictions match the existing medv values. Below is the listing that creates the new 'predicted' column in the test data using the predict() function with our model and the argument newdata=test_data. We then view the data using head(test_data), which shows the first six rows of the test data. Finally, we use the cor() function on the $medv and the newly formed $predicted columns to see how strongly correlated our predicted values are with the real values.

Listing for Predicting and forming new column in Test Data
Output of head() on test data and correlation value between original and predicted values
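Assuming the object names from the post, the pictured listing likely reads:

    # Predict medv for the held-out rows and store alongside the truth
    test_data$predicted <- predict(reg, newdata = test_data)

    # First six rows: real $medv and $predicted side by side
    head(test_data)

    # Correlation between actual and predicted values
    cor(test_data$medv, test_data$predicted)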

We can clearly see in the above output that the last two columns, $medv and $predicted, contain values that are close to each other. The predicted values hold true on the basis of the model and hence are reliable too. The correlation between the two comes out to be 0.7876, i.e. 78.76%, which is a sign of good correlation between the two columns. Thus, we can say that the predicted values can be trusted, and the model can be trusted and deployed for further similar predictions.

Model Evaluation 1: Mean Squared Error (MSE) evaluation

To be very sure about our findings, we will now go a step further in the verification process. Mean Squared Error, which measures the average of the squares of the errors, will be computed on our test data: the mean of the squared errors incurred in prediction. This value should come out as small as possible, indicating good results. To calculate it, we first use the predict() function to once again predict with our model and store the values in an object. That object is then used to compute the mean squared error, MSE = mean((observed - predicted)^2), as done in the listing below and stored in an object.

Listing for Mean Squared Error Computation
MSE Output
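A sketch of the computation. One caveat: an MSE of 0.013 on $medv, which is in $1000's, would be implausibly small on raw data, so the pictured workflow presumably scaled the variables first (the earlier 'Loading and Scaling Data' heading hints at this); on unscaled data this sketch yields a much larger number:

    # Predict on the test set and store in an object
    pred <- predict(reg, newdata = test_data)

    # Mean Squared Error: average squared prediction error
    mse <- mean((test_data$medv - pred)^2)
    mse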

As we can see above, the Mean Squared Error comes out to be 0.013, which is a good, small value, indicating good model performance.

Model Evaluation 2: Summarizing Residuals

In linear regression, the coefficients are estimated so as to minimise the sum of the squares of the residuals, and for a good model we expect the residuals to be small. We will summarise the residuals and then discuss their significance and interpretation. Below is the console view showing the listing for summarising the residuals and its output.

Summarizing Residual listing and output
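The residual summary in the screenshot can be reproduced with base R's residuals() accessor:

    # Five-number summary (plus mean) of the model's residuals
    summary(residuals(reg))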

In the residual summary we expect the median to be near 0, which is true in our case, and the 1st Qu. value to be roughly the negative of the 3rd Qu. value (not much difference in magnitude), which is also true here; together these suggest the residuals are roughly symmetric around zero. Had the values been too large, which is not the case here, the model would not be usable.

Conclusion

Linear regression is the bread-and-butter, go-to model in Data Science, because it tells us a lot about the data's behaviour and very often gives us the right answers. And when it does not, linear regression still gives us enough information about the data and the model to decide on and deploy another model based on its results. It is therefore a very important modelling algorithm, and here are linear regression's takeaways for you:

  1. A go-to statistical model, as mentioned above
  2. Always deploy it as the first model; try the others later based on the output of the linear regression model
  3. May be troublesome to compute if there are many variables or large categorical variables
  4. When thinking linear regression, think in terms of residuals, and make sure they are very small
  5. Linear regression can make very good predictions once a good model is formed and verified

End Note

So that's all I have for you this time. What are you using linear regression for, and how is it helping you? That is a good question to ask before deploying it. Let me know your views about the models and how we can get better, because there is always room for improvement, and it can be done with your genuine inputs and feedback. Also, I feel great seeing the support getting bigger each day from all around the world; please do not forget to like, share and subscribe to 'The Datum' blog if you find it worthy! Have a good rest of the week until we come back with something for you next weekend. Cheers!

