Explained: Linear Regression with real life scenarios in R
Machine learning is one of the most trending topics at present and is expected to grow exponentially over the coming years. Before we drill down into one of the most common techniques in machine learning, linear regression, let's understand what exactly regression is.
Regression analysis is a form of predictive modelling technique that establishes a relationship between two variables, namely a dependent variable and an independent variable.
In simpler words, regression analysis involves fitting a line over a set of data points that most closely matches the overall shape of the data. Put differently, a regression shows how the dependent variable on the y-axis changes in response to changes in the explanatory variable on the x-axis.
Linear Regression
Linear regression aims to establish a linear relationship between two variables, one independent and one dependent, by fitting a linear equation to the data under observation. For example, if we consider the weight and height of a person, weight tends to increase as height increases, so a linear relationship can be established between the height and weight of a person.
A similar scenario would be trying to predict the price of an apartment based on its size: the cost depends directly on the size of the apartment. So, based on certain independent variables in the data, a linear regression model can be created that analyzes the data and predicts the final value of the apartment.
Linear Regression Equation
Linear regression, as mentioned earlier, is a way to formulate the relationship between two variables. You may find the equation similar to the slope formula.
The equation is of the form
Y = a+bX
Where Y is the dependent variable, plotted along the y-axis,
X is the independent variable, plotted along the x-axis,
b signifies the slope of the line, and a is the y-intercept (the value of Y when X = 0).
The values of a and b are calculated from the data as

b = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²
a = Ȳ − bX̄

where X̄ and Ȳ are the means of the observed X and Y values.
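As a quick check, these least-squares formulas can be applied directly in R and compared against the built-in lm() function. The height and weight values below are made-up illustrative numbers, not data from this article:

```r
# Hypothetical height (cm) and weight (kg) observations for illustration
height <- c(150, 155, 160, 165, 170, 175, 180)
weight <- c(52, 56, 59, 63, 66, 70, 74)

# Slope: b = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b <- sum((height - mean(height)) * (weight - mean(weight))) /
     sum((height - mean(height))^2)
# Intercept: a = mean(y) - b * mean(x)
a <- mean(weight) - b * mean(height)

# The same estimates via R's built-in lm() function
fit <- lm(weight ~ height)
coef(fit)  # intercept and slope should match a and b above
```

Computing a and b by hand once is a useful sanity check; in practice lm() is always used, since it also reports standard errors and diagnostics.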
Multiple Linear Regression
In contrast to the simple linear regression discussed above, multiple linear regression uses multiple independent variables to predict the outcome of the dependent variable.
A multiple linear regression model is of the form

Y = b0 + b1X1 + b2X2 + … + bnXn

Here Y is the variable being predicted, the X's are the independent variables on which Y depends, and the b's are the regression coefficients, each of which measures how much Y changes for a unit change in the corresponding X.
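A multiple regression of this form is fitted in R simply by adding more terms to the lm() formula. The salary, experience, and GPA values below are invented for illustration:

```r
# Invented data: salary (in thousands) depending on experience and GPA
experience <- c(1, 3, 5, 7, 9, 11)
gpa        <- c(3.2, 3.5, 3.0, 3.8, 3.6, 3.4)
salary     <- c(30, 45, 55, 75, 85, 100)

# Y ~ X1 + X2 fits Y = b0 + b1*X1 + b2*X2
fit <- lm(salary ~ experience + gpa)
summary(fit)  # shows b0 (intercept), b1, b2 with their p-values
```

The coefficients reported by summary() are exactly the b's of the equation above.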
Real life scenarios using Linear Regression
A theory with no practical significance is of little use and ultimately fades away with time. On the other hand, the formulae and theories that can be applied to real-life scenarios are the ones that continue to be studied and researched.
Let's see a few real-life cases where linear regression is used:
Evaluating trends and sales estimation
Linear regression can be used in businesses to evaluate trends and make estimates or forecasts.
For example, if a company's sales have increased steadily every month for the past few years, then conducting a linear regression analysis on the sales data, with monthly sales on the y-axis and time on the x-axis, will give a line that captures the upward trend. The company can then use the slope of the line to forecast sales in future months.
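This forecasting idea can be sketched in a few lines of R. The monthly figures here are simulated with an assumed trend, not real sales data:

```r
set.seed(42)  # make the simulated data reproducible
month <- 1:24
# Simulated sales: a steady upward trend plus random noise
sales <- 100 + 5 * month + rnorm(24, sd = 8)

trend <- lm(sales ~ month)
coef(trend)  # the slope estimates average month-on-month growth

# Forecast sales for the next three months
predict(trend, newdata = data.frame(month = 25:27))
```

predict() with a newdata argument is the standard way to extrapolate a fitted lm() model to future time points.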
Assessment of risk in Financial services and Insurance domain
Linear regression can be used to analyze risk in various industries. For example, a health insurance company might run a linear regression by plotting the number of claims per customer against the customer's age, and might discover that older customers tend to make more health insurance claims. The insurance company can then set its insurance claim amounts accordingly. The results of such analysis often guide important business decisions.
Predicting the salary of a person based on experience
The HR department of a company can use linear regression to predict the salary of a person based on experience and other factors such as the type of university, GPA, school rank, etc. Historical data can be used to fit a linear regression line with salary as the dependent variable and experience and the other factors as the independent variables.
Building a Linear Regression model in R
This section will provide you with a brief introduction to a linear regression model that I created in R to predict the salary of an employee based on a few independent variables such as experience, GPA, and school ranking.
The dataset used to build the model contains five variables, of which 'Salary' is the dependent variable whose value is to be predicted. 'School_Ranking', 'GPA', and 'Experience' are the independent variables on which the predicted salary depends.
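A minimal sketch of fitting such a model follows, assuming the column names described above. The rows here are invented for illustration; in practice the real dataset would be loaded, for example with read.csv():

```r
# Invented sample rows using the column names described above
emp <- data.frame(
  School_Ranking = c(1, 2, 1, 3, 2, 3, 1, 2),
  GPA            = c(3.8, 3.2, 3.6, 2.9, 3.4, 3.1, 3.9, 3.3),
  Experience     = c(5, 3, 8, 2, 6, 4, 10, 1),
  Salary         = c(90, 60, 110, 45, 85, 65, 130, 50)  # e.g. in thousands
)

# Salary as the dependent variable, the other columns as predictors
model <- lm(Salary ~ School_Ranking + GPA + Experience, data = emp)
summary(model)  # coefficients, p-values, R-squared
```

The formula interface mirrors the multiple regression equation: one term on the right of `~` per independent variable.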
Data in raw form is very difficult to analyze and understand, so visualizations based on the data prove to be really efficient. One such visualization is a correlation plot, which depicts how strongly the variables are related to each other. Variables that have little or no relation to the target can be dropped from the data, as they will not have much impact on the machine learning model being created.
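Pairwise correlations can be computed with base R's cor(), and a graphical correlation plot can then be drawn, for instance with the corrplot package if it is installed. The data frame below is the same kind of invented sample used throughout this section:

```r
# Invented sample data for illustration
emp <- data.frame(
  School_Ranking = c(1, 2, 1, 3, 2, 3),
  GPA            = c(3.8, 3.2, 3.6, 2.9, 3.4, 3.1),
  Experience     = c(5, 3, 8, 2, 6, 4),
  Salary         = c(90, 60, 110, 45, 85, 65)
)

cor(emp)  # matrix of pairwise Pearson correlations

# If the corrplot package is available, the matrix can be visualized:
# corrplot::corrplot(cor(emp), method = "circle")
```

Predictors whose correlation with Salary is near zero are candidates for removal from the model.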
A correlation plot is one of many visualizations that can be created while analyzing and exploring the data. After studying and analyzing the data, one can proceed to build machine learning models using the required libraries and functions in R. R is one of the best tools for statistical analysis and for creating various models with minimal error.
After creating the model, statistical measures such as the p-value, the R-squared value, and the RMSE (root mean squared error) indicate how accurate the model is. Plots based on the model can then be created to assess how efficient and accurate it is.
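These accuracy measures are easy to extract in R: summary() reports p-values and R-squared, while RMSE is a one-line calculation on the residuals. Again, the sample data below is invented:

```r
# Invented sample data for illustration
emp <- data.frame(
  Experience = c(5, 3, 8, 2, 6, 4),
  Salary     = c(90, 60, 110, 45, 85, 65)
)
model <- lm(Salary ~ Experience, data = emp)

summary(model)$r.squared               # R-squared: share of variance explained
rmse <- sqrt(mean(residuals(model)^2)) # root mean squared error
rmse
```

A lower RMSE and a higher R-squared generally indicate a better fit, though both should be judged on held-out data rather than the training set alone.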
In the plot of actual versus predicted values, the red line depicts the actual values of the observed data and the green line shows the predicted values. Since both lines trace almost the same path, we can infer that the model fits the data well, and the gap between the two lines at each point shows the size of the error.
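A plot of this kind can be recreated with base R graphics; the data and model below are the same invented sample used earlier:

```r
# Invented sample data and a simple fitted model
emp <- data.frame(
  Experience = c(5, 3, 8, 2, 6, 4),
  Salary     = c(90, 60, 110, 45, 85, 65)
)
model <- lm(Salary ~ Experience, data = emp)

# Actual values in red, predicted values in green
plot(emp$Salary, type = "l", col = "red",
     xlab = "Observation", ylab = "Salary",
     main = "Actual vs predicted salary")
lines(predict(model), col = "green")
legend("topright", legend = c("Actual", "Predicted"),
       col = c("red", "green"), lty = 1)
```

The closer the green line tracks the red one, the smaller the model's errors.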
Multiple models can be created in a similar way, and the one that best fits the data can be used to predict the required value.
Shortcomings of Linear Regression models
As discussed, linear regression captures only a linear relationship between the independent and dependent variables, i.e. a straight-line relationship. Linear regression fails in situations where the relationship between the variables in the data being analyzed is not linear.
Linear regression models only the mean of the dependent variable given the independent variables. For example, if we look at the birth weights of infants and the ages of their mothers, linear regression will estimate the average weight of babies born to mothers of different ages. But sometimes the extremes are what matter: babies with very low birth weight are at risk and require extra care, and since linear regression looks at the mean, this inference cannot be drawn from it.
Linear regression is very sensitive to outliers, values that are very different from the rest of the data. For example, if a dataset has salaries of people in different age groups and most people aged 25–30 earn between 30,000 and 40,000, then a few people in the same age group earning far more, say above 2,00,000, or far less, say below 10,000, form outliers. Such outliers can have a very large impact on the accuracy of a linear regression model, so it is necessary to identify and handle them using statistical analysis before fitting the model.
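One common way to flag such outliers is the interquartile-range (IQR) rule; the salary values below are invented to mirror the example above:

```r
# Invented salaries for one age group, including two extreme values
salaries <- c(30000, 32000, 35000, 38000, 40000, 210000, 9000)

q     <- quantile(salaries, c(0.25, 0.75))  # first and third quartiles
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Values outside the fences are flagged as outliers
salaries[salaries < lower | salaries > upper]  # flags 9000 and 210000 here
```

Flagged values can then be inspected and, if justified, removed or winsorized before the regression is fitted.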
Summary
Linear regression is definitely one of the most widely used techniques in machine learning, but it has its own advantages and disadvantages. So it is very important to know the type of data under observation and the final output required from it, and to use the technique that best fits the situation. Machine learning is vast and has countless techniques; once you know what fits your data best, the rest of the journey will seem a cakewalk.