Linear Regression
- Use Regression Analysis
- Assumptions of Linear Regression
- Types of Linear Regression
- Evaluation Metrics for Linear Regression
- Advantages of Linear Regression
- Disadvantages of Linear Regression
This section focuses on linear regression, a fundamental supervised machine learning technique that falls under the umbrella of regression models and regression analysis. Linear regression is a basic yet crucial algorithm within this domain. Regression algorithms in supervised learning handle continuous data, aiming to forecast outcomes in scenarios such as predicting stock prices or estimating car values. In linear regression analysis, one or more independent variables x are used to predict an outcome y.
Supervised learning encompasses two primary branches: classification and regression. Classification involves predicting the category or class of an observation based on independent variables. It yields discrete outcomes, typically binary choices like 'yes' or 'no', '1' or '0', or specific categories such as dog breeds or car models.
In contrast, regression, the other form of supervised learning, focuses on predicting continuous output variables from independent input variables. This makes it instrumental in forecasting scenarios like housing prices or stock values. Essentially, regression examines correlations between variables in order to forecast continuous outcomes, and it is heavily used in applications like forecasting and time series modeling. Put simply, regression is the act of fitting a line or curve through the data points on a graph of the target versus the predictor variables so that the vertical distance between the points and that line is as small as possible.
Overall, linear regression plays a pivotal role in predictive modeling, particularly in scenarios involving continuous data prediction. Its simplicity and effectiveness make it a staple in various predictive analytics tasks, offering valuable insights into variable relationships and enabling accurate forecasts. In this blog, we walk through linear regression and regression analysis in Python.
Terms related to regression analysis
Dependent variable – the dependent variable is the variable we want to predict or guess; it is also called the target variable.
Independent variables – these are the features that mainly affect the dependent variable. The model is trained on these independent variables; they are also called predictors.
Outliers – an outlier is a very low or very high value that does not match the other values. Removing it from the dataset does not greatly change the overall result, but leaving it in the training data can reduce the performance of the model, so it should be removed before training.
Multicollinearity – multicollinearity describes a situation in which the independent variables are highly associated with one another. It is a bad thing for the dataset because it creates problems when we try to rank which variables affect the outcome the most; a quick way to spot it is sketched below.
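As a quick, hypothetical illustration (the dataset and column names below are made up), a pairwise correlation matrix is one simple way to flag highly associated features before training; variance inflation factors are another common check.

```python
import pandas as pd

# Hypothetical housing-style data: "area_sqft" and "rooms" are deliberately
# chosen so that they move together, i.e. they are highly correlated
df = pd.DataFrame({
    "area_sqft": [800, 950, 1100, 1500, 2000, 2400],
    "rooms":     [2,   2,   3,    4,    5,    6],
    "age_years": [30,  12,  45,   8,    3,    20],
})

# Correlation values close to +1 or -1 between two features hint at multicollinearity
print(df.corr())
```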
Overfitting and underfitting also need to be kept in mind in machine learning regression, as in any other machine learning model.
Linear Regression
A supervised machine learning method called linear regression can be used to determine the relationship between one or more independent features and a dependent variable. When there is just one independent variable, it is simple (univariate) linear regression; when there are several independent variables, it is multiple linear regression.
Why should we use regression analysis?
As we know, regression analysis mainly deals with continuous variables, and there are many real-world situations where we need continuous results or predictions to make a good decision. For such problems we need a technique that can predict continuous values, and regression is well suited to this. Some other reasons for using regression analysis are given below:
- It estimates the relationship between the dependent and the independent variables.
- It can show how data points are related.
- It can produce predicted real/continuous values.
- Using linear regression we can identify the most important and least important features, and also check how each feature affects the outcome.
Real-world Example of Linear Regression
Let's suppose we have to bake cookies, but we don't know how much dough each batch needs. To work this out, we need data such as the cups of flour used (independent variable) and the number of cookies made from them (dependent variable).
Let's suppose we have the data in the table below:
| Flour (cups) | Cookies |
|---|---|
| 1.5 | 12 |
| 2 | 16 |
| 2.5 | 20 |
From the above data, we need some method to understand how many cups of flour we need to bake the desired number of cookies. In a problem like this, linear regression works like magic, or in this case like a magic recipe. Now let's create an equation that fits this data; the equation describes a straight line that shows the relationship. Let's suppose a computer solves it for us and gives the equation:
Cookies = 8 * Flour + 0
The above formula reveals that we get 8 more cookies for every extra cup of flour. The 0 is the intercept, the starting number of cookies (with no flour!), which for this data happens to be zero.
Now let's suppose someone asks us how many cookies we can bake using 3.5 cups of flour. We can confidently answer using the above equation:
Cookies = (8 * 3.5) + 0 = 28 cookies.
Linear regression gives us the ability to answer a question like how many cookies we can make with a certain number of cups of flour, simply by plugging the value into the equation above.
However, this will not work for every cookie recipe; if we bake chocolate chip cookies, a different formula may fit the data better.
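As a minimal sketch of how such a fit could be reproduced in Python (one possible approach, using NumPy's polyfit on the three rows from the table above):

```python
import numpy as np

# Flour (cups) and cookies baked, taken from the table above
flour = np.array([1.5, 2.0, 2.5])
cookies = np.array([12, 16, 20])

# Fit a degree-1 polynomial (a straight line): cookies ≈ slope * flour + intercept
slope, intercept = np.polyfit(flour, cookies, 1)
print(slope, intercept)          # roughly 8 and 0 for this data

# Predict how many cookies 3.5 cups of flour would yield
print(slope * 3.5 + intercept)   # roughly 28
```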
Assumptions of the linear regression model
Linearity – it means that the independent and dependent variables are related linearly: a change in the independent variable(s) affects the dependent variable in a linear way, so we can draw a straight line through the data points.
Independence – the observations in the dataset are assumed to be unrelated to one another. This means that one observation's dependent variable value does not depend on another observation's dependent variable value. Dependencies between observations may jeopardize the linear model's correctness.
Homoscedasticity refers
to the consistent variance of errors across all levels of the independent
variable(s). This means that regardless of the values that independent
variable(s) have, the variability of the errors remains uniform. Maintaining
constant variance in the residuals is essential because any deviation from
homoscedasticity can lead to inaccuracies in the model's predictions.
Normality – this means that the residuals should follow a bell-shaped (normal) distribution; otherwise, the linear model's results may not be reliable.
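Below is a rough, hypothetical sketch (with synthetic data) of how the residual-based assumptions might be spot-checked in Python; SciPy's Shapiro–Wilk test is just one of several possible normality checks.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on x with constant-variance noise
x = rng.uniform(0, 10, 200)
y = 3.0 * x + 5.0 + rng.normal(0, 1.0, 200)

# Fit a straight line and compute residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Normality: Shapiro-Wilk test (a large p-value gives no evidence against normality)
print(stats.shapiro(residuals))

# Homoscedasticity (rough check): residual spread in the lower vs upper half of x
low, high = residuals[x < 5], residuals[x >= 5]
print(low.std(), high.std())   # similar values suggest constant variance
```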
Types of Linear Regression
There are many types of linear regression, but two of them are the most prominent:
Simple Linear Regression – it is the most fundamental and most commonly used version of linear regression. For this regression, all we need is one dependent variable and one independent variable. Its formula is written below:
y = β_0 + β_1X
Here:
y is the dependent
variable
X is the independent
variable
β_0 is the intercept
β_1 is the slope
Multiple linear regression – in this regression there is more than one independent variable and one dependent variable. The equation for this regression method is:
y = β_0 + β_1X_1 + β_2X_2 + ⋯ + β_nX_n
Here:
y is the dependent variable
X_1, X_2, …, X_n are the independent variables
β_0 is the intercept
β_1, β_2, …, β_n are the slopes
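As an illustrative sketch (the numbers are made up, not from the text), both the simple and the multiple form can be fitted with scikit-learn's LinearRegression:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: two independent variables X_1, X_2 and one dependent variable y
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([7.0, 8.0, 15.0, 16.0, 21.0])

model = LinearRegression()
model.fit(X, y)

# Fitted slopes (beta_1, beta_2) and intercept (beta_0)
print(model.coef_, model.intercept_)

# Predict y for a new observation
print(model.predict([[6.0, 2.0]]))
```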
Some other types of regression are also available, and they are:
Polynomial regression improves upon linear regression by adding higher-order polynomial terms (i.e., powers of the independent variables) to the model. This allows more flexible and complex relationships between the variables to be captured.
Ridge regression is a regularization technique for linear regression models that helps avoid overfitting; it performs best when there are several independent variables to take into account. Ridge regression drives the model toward solutions with smaller coefficients by adding an L2 penalty term on the coefficients to the least squares objective function, improving model stability and reducing the impact of multicollinearity.
Lasso regression is an additional regularization method that employs an L1 penalty term to drive the coefficients of non-significant independent variables to zero. This effectively performs feature selection, enabling the model to focus on the most relevant predictors and disregard irrelevant ones.
Elastic Net regression merges
the regularization penalties of both ridge and lasso regression techniques. By
striking a balance between their strengths, elastic net regression offers
enhanced flexibility and robustness in handling multicollinearity and feature
selection challenges commonly encountered in regression analysis.
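A brief, hypothetical sketch of these regularized variants with scikit-learn (the alpha values are arbitrary and would normally be tuned):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Synthetic regression data in which only a few features are truly informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=5.0, random_state=42)

# alpha controls the strength of the regularization penalty in each model
ridge = Ridge(alpha=1.0).fit(X, y)                      # L2 penalty: shrinks coefficients
lasso = Lasso(alpha=1.0).fit(X, y)                      # L1 penalty: can zero out coefficients
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)    # mix of L1 and L2

print("ridge:      ", ridge.coef_.round(2))
print("lasso:      ", lasso.coef_.round(2))   # note the exact zeros on weak features
print("elastic net:", enet.coef_.round(2))
```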
Best fit line
Linear regression algorithms aim to determine the optimal equation for a best-fit line, one that can accurately predict values based on the independent variables. The major objective is to minimize the error margin between the values predicted by the model and the actual values observed. The best-fit line, which is usually a straight line, displays the relationship between the independent and dependent variables in the dataset.
Within this line, the slope plays a crucial role, indicating the rate of change of the dependent variable in response to a unit change in the independent variable(s). Quantifying the impact of the independent factors on the outcome variable in this way gives an understanding of the kind and intensity of the link between the two sets of variables.
In a typical regression plot, the variable Y denotes the dependent variable, while X denotes the independent variable(s), also known as features or predictors of Y. Making predictions about the dependent variable Y based on the values of the independent variable or variables X is one of the primary goals of linear regression. This predictive relationship is represented by a straight line, hence the term "linear" regression.
Linear regression models use optimization techniques like gradient descent to lower the mean squared error (MSE) on a training dataset by iteratively changing the model's parameters. The goal is to find the values of the parameters, often denoted θ_1 (theta subscript 1) and θ_2 (theta subscript 2), that minimize the cost function, maximizing the model's performance and yielding the best-fit line. Gradient descent facilitates this process by iteratively updating the parameters so that they gradually converge toward the optimal values, ultimately leading to the creation of an accurate linear regression model.
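The following is a minimal from-scratch sketch of batch gradient descent for a single-feature model, using synthetic data; the parameter names echo the θ_1, θ_2 notation above, and a real project would typically rely on a library implementation instead.

```python
import numpy as np

# Synthetic training data: y is roughly 2*x + 1 plus noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 5, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.3, 100)

theta1, theta2 = 0.0, 0.0     # intercept and slope, echoing the text's θ_1 and θ_2
lr = 0.05                     # learning rate
n = len(x)

for _ in range(2000):
    y_pred = theta1 + theta2 * x
    error = y_pred - y
    # Gradients of the MSE cost with respect to each parameter
    grad_theta1 = (2.0 / n) * error.sum()
    grad_theta2 = (2.0 / n) * (error * x).sum()
    # Step each parameter a small amount against its gradient
    theta1 -= lr * grad_theta1
    theta2 -= lr * grad_theta2

print(theta1, theta2)   # should approach the true intercept (~1) and slope (~2)
```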
Evaluation Metrics for Linear Regression
Evaluation metrics are used to check how well our linear regression model performs. They help us understand how closely the model's predictions match the observed outputs. The most common measurements are:
Coefficient of determination (R-squared): it is a statistic that primarily indicates how much of the variation in the data the fitted model can account for or describe. It always lies between 0 and 1; the closer it is to 1, the better the model, and vice versa. Its mathematical expression is as follows:
R^2 = 1 - (RSS / TSS)
Residual Sum of Squares (RSS) – the sum of the squared residuals over every data point in the dataset. This metric measures the deviation between the predicted and observed outputs: RSS = Σ (y_i - ŷ_i)^2.
Total Sum of Squares (TSS) – the sum of the squared deviations of the observed values of the response variable from their mean: TSS = Σ (y_i - ȳ)^2.
In this respect, R-squared is a better metric than RMSE: the value of the Root Mean Squared Error depends on the units of the variables and may change when the variables' units change, whereas R-squared is unitless.
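As a small illustration with made-up numbers, RSS, TSS, and R-squared can be computed directly from the formulas above:

```python
import numpy as np

# Observed values and the model's predictions (illustrative numbers only)
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4, 10.6])

rss = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
tss = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r_squared = 1 - rss / tss

print(rss, tss, r_squared)
```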
Linear Regression Line
The linear regression line serves as a powerful tool for understanding the relationship between two variables. It shows the line that best describes how the dependent variable (Y) changes in response to fluctuations in the independent variable (X). The general pattern is captured in this line, which shows how changes in the independent variable affect the dependent variable. Important information about the direction and strength of the link between the variables may be obtained by studying the slope and intercept of the regression line. All things considered, the underlying dynamics between the two variables under examination are concisely and clearly shown by the linear regression line.
- When the independent variable (X) and the dependent variable (Y) are positively correlated, the linear regression line is positive. This means that as X's value increases, Y grows as well, and Y falls when X drops. A positive linear regression line slopes upward from left to right, showing the positive correlation between the variables visually.
- When the dependent variable (Y) decreases as the independent variable (X) increases, the two variables are inversely related: as X increases, Y decreases, and vice versa. A negative linear regression line slopes downward from left to right, showing the negative correlation between the variables.
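A hypothetical matplotlib sketch of both cases, fitting a line to synthetic data with a positive and a negative relationship:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 50)

# Positive relationship: y rises with x; negative relationship: y falls with x
y_pos = 1.5 * x + 2 + rng.normal(0, 1, 50)
y_neg = -1.5 * x + 20 + rng.normal(0, 1, 50)

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, y, title in zip(axes, (y_pos, y_neg), ("Positive slope", "Negative slope")):
    slope, intercept = np.polyfit(x, y, 1)   # best-fit regression line
    ax.scatter(x, y, s=10)
    ax.plot(x, slope * x + intercept, color="red")
    ax.set_title(title)
plt.show()
```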
Advantages of Linear Regression
- Compared to its more complex counterparts, linear regression is a straightforward and widely used method in regression analysis. The coefficients of the model indicate how much the dependent variable changes for a one-unit change in an independent variable; they are highly interpretable and provide significant information about the relationships between variables.
- Scalability and computing efficiency of linear regression are two of its main advantages; these allow it to handle big datasets with ease. Real-time applications, where rapid model deployment is critical, are especially well-suited for its capacity to be quickly trained on large datasets.
- Furthermore, provided that extreme outliers are handled during preprocessing (as noted earlier), linear regression behaves predictably and the remaining anomalies have little effect on the model's overall performance. This helps linear regression stay stable and reliable under different conditions.
- Moreover, linear regression functions as a fundamental model and is frequently used as a benchmark to assess the effectiveness of increasingly sophisticated machine learning algorithms. Its accessibility and usefulness in a wide range of applications are further enhanced by its simplicity and well-established character, which make it a widely available choice across many machine learning libraries and software packages.
Disadvantages of Linear Regression
- Despite its simplicity and efficiency, linear regression exhibits certain limitations that can affect its performance in certain scenarios. One significant drawback is its reliance on assuming a linear relationship between independent and dependent variables. When such a linear relationship doesn't exist within the dataset, linear regression tends to perform poorly, leading to inaccurate predictions and inadequate model fitting.
- Another challenge is its sensitivity to multicollinearity, a situation where independent variables display a high correlation with each other. This can lead to instability in the model estimates and affect the interpretation of individual coefficients.
- Furthermore, linear regression assumes that the features are already formatted correctly for the model. Thus, to convert the features into a format that the model can use efficiently, feature engineering is frequently required, which complicates the modeling process.
- Furthermore, linear regression has limitations in providing explanatory relationships between variables, particularly in cases where the relationships are complex or nonlinear. In such instances, more advanced machine-learning techniques may be required to uncover deeper insights and nuances within the data.
Conclusion
Linear regression is a basic and fundamental machine learning algorithm that is widely used on simple datasets and as a benchmark for other models' performance. It is popular because of its simplicity, interpretability, and efficiency. It is a very useful tool, especially when it comes to understanding the relationship between variables and making predictions in a variety of applications. However, we must also know its limitations: it cannot work very well when there is no linear relationship between the independent and dependent variable(s), and it is also sensitive to multicollinearity.