
Monday, February 19, 2024

LOGISTIC REGRESSION IN MACHINE LEARNING/PYTHON/ARTIFICIAL INTELLIGENCE

Logistic Regression
  • Types and terminologies of Logistic Regression
  • Working of Logistic Regression
  • Steps in Logistic Regression Modelling
  • Logistic Regression Model Thresholding

Logistic regression is one of the most widely used supervised machine learning algorithms and is typically applied to classification tasks. Unlike regression models with continuous outcomes, classification problems aim to find the likelihood that an observation belongs to a certain class. Logistic regression gets its name from the logistic (sigmoid) function, which transforms the output of a linear regression function. The probabilities that emerge from this transformation indicate how likely an observation is to fall into a particular class. In this blog, we learn logistic regression in Python.

Using a set of independent variables, logistic regression aims to predict a categorical dependent variable. Rather than returning an exact category directly, it outputs probabilities between 0 and 1 that indicate the likelihood that an observation belongs to a particular class.

Logistic regression and linear regression are technically similar, but they are not the same, and each is useful in its own setting. Linear regression handles problems with continuous value predictions, whereas logistic regression is better suited for classification tasks. Logistic regression fits the dataset with a logistic function, producing a curved 'S' shape whose output stays between 0 and 1. Identifying cancerous cells or detecting a dog in an image are two examples of the kinds of problems this curve can be used to model.

Logistic regression holds significant value in machine learning because it can effectively classify fresh data from both continuous and discrete datasets and generate probabilistic results. Observation categorization using several data sources is facilitated by this flexible and significant tool, which also helps identify the most influential variables for classification tasks. It is particularly useful in predictive modeling and decision-making processes. 

Image source original 

Logistic Function (Sigmoid Function)

In logistic regression, whether an observation is assigned to a certain class depends on a threshold applied to the output of the sigmoid function. For binary classification, a threshold of 0.5 is typically used: a predicted probability below 0.5 is placed in class 0, and a value of 0.5 or above is placed in class 1. The sigmoid function itself is a mathematical tool that maps predicted values onto probabilities, converting any real value into a number between 0 and 1.

This transformation is crucial in logistic regression, as the output must stay within the bounds of 0 and 1. The sigmoid function, often visualized as an 'S' curve, accomplishes this task by ensuring that predictions are constrained within these limits.

In logistic regression, certain prerequisites must be met. Firstly, the dependent variable must be categorical in nature, as logistic regression is specifically designed for classification tasks. Additionally, multicollinearity among independent variables should be avoided, as it can lead to instability in model estimates and complicate the interpretation of coefficients. By adhering to these requirements, logistic regression can effectively classify observations and provide valuable insights into categorical outcomes.

Types of Logistic Regression

Three different types of logistic regression can be distinguished:

Binomial: Binomial (or binary) logistic regression has only two possible outcomes, such as true or false, pass or fail.

Multinomial: Multinomial logistic regression handles three or more possible outcomes that have no natural order.

Ordinal: Ordinal logistic regression also covers three or more outcomes, but here the categories of the dependent variable are ordered, for example ranked from low to high.

Logistic Regression Example

We all receive spam emails, and sometimes we are bombarded by them, yet mixed in with the spam are important messages. To deal with this, we want a solution that can filter spam from important emails, so we train a model that does this automatically.

In such situations, logistic regression can help. The algorithm can analyze the features of an email and predict the probability that the email is spam (the dependent variable).

Some features commonly found in spam emails include:

  • Presence of spammy words (e.g., “FREE”, “URGENT”, “4U”)
  • Sender’s email address (unknown or known contact)
  • Inclusion of attachments or links

To train our logistic regression model, we collect a set of emails that includes both spam and non-spam messages and label them accordingly. We then feed this labeled data to the logistic regression model, which learns the patterns that differentiate spam from normal emails.

After learning from the data, logistic regression does not simply answer yes or no; it provides a probability between 0 (definitely not spam) and 1 (definitely spam). This probability allows us to set a threshold value. For example, if we set 0.6 as the spam threshold, any email with a probability above 0.6 is flagged as spam.

The beauty of this approach is that we can tune the threshold to our needs. If we do not want to lose any important email that might be wrongly flagged as spam, we can set the threshold higher; conversely, if we are less worried about missing important emails, we can set it lower.
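As a rough illustration, the sketch below trains a logistic regression classifier on a handful of made-up emails described by three hypothetical numeric features (count of spammy words, whether the sender is known, whether the email contains a link) and then applies a 0.6 spam threshold. The data and feature choices are purely illustrative, assuming scikit-learn is available.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features: [spammy-word count, known sender (1/0), contains link (1/0)]
X = np.array([
    [3, 0, 1],   # many spammy words, unknown sender, has a link
    [0, 1, 0],
    [2, 0, 1],
    [0, 1, 1],
    [4, 0, 1],
    [1, 1, 0],
])
y = np.array([1, 0, 1, 0, 1, 0])   # 1 = spam, 0 = not spam

model = LogisticRegression()
model.fit(X, y)

new_email = np.array([[2, 0, 1]])               # an unseen email
p_spam = model.predict_proba(new_email)[0, 1]   # probability of the spam class

threshold = 0.6                                 # stricter than the default 0.5
label = "spam" if p_spam >= threshold else "not spam"
print(f"P(spam) = {p_spam:.2f} -> {label}")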

This illustrates how logistic regression is applied to a real-world machine learning problem.

Terminologies involved in Logistic Regression

The following terminology must be grasped to completely understand how logistic regression operates:

Independent variables – the input attributes, features, or qualities used to predict the outcome of the dependent variable.

Dependent variable – also known as the target variable, this is the variable whose values the logistic regression model predicts or estimates.

Logistic function – the formula that relates the independent variables to the dependent variable. By transforming the input variables, the logistic function generates probability values between zero and one.

Odds – the ratio of the probability that an event occurs to the probability that it does not occur. It presents an alternative perspective on probability.

Log-odds – the log-odds, also referred to as the logit function, is the natural logarithm of the odds. A logistic regression model expresses the log-odds of the dependent variable as a linear combination of the independent variables and an intercept (a short numeric sketch appears after these definitions).

Coefficient – the estimated parameters of a logistic regression model, which reveal the relationship between each independent variable and the outcome. They quantify the impact of the independent variables on the log-odds, and hence on the likelihood, of the result.

Intercept – the constant term of a logistic regression model, representing the log-odds of the outcome when every independent variable is set to zero.

Maximum likelihood estimation (MLE) – the method used to estimate the coefficients of a logistic regression model. The aim is to find the coefficients that maximize the likelihood of observing the data, given the structure and assumptions of the model.
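To make the odds and log-odds concrete, here is a small numeric sketch (the probability value is arbitrary):

import math

p = 0.8                       # probability that the event occurs
odds = p / (1 - p)            # odds of 4: the event is 4 times more likely to occur than not
log_odds = math.log(odds)     # logit(p), roughly 1.386

# Applying the sigmoid to the log-odds recovers the original probability
p_back = 1 / (1 + math.exp(-log_odds))
print(round(odds, 3), round(log_odds, 3), round(p_back, 3))   # 4.0 1.386 0.8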

How does Logistic Regression work

Logistic regression modeling converts the continuous output of the linear regression function into values that may be categorized by using the sigmoid function. This mathematical function, commonly known as the logistic function, is crucial in transforming the set of real-valued independent variables into a binary range of 0 and 1.

The continuous output of the linear regression model is transformed into probabilities using logistic regression and the sigmoid function. These probabilities specifically indicate the likelihood that an observation falls into a particular category. This process aids in the classification of data into discrete categories by efficiently employing the input variables to anticipate categorical outputs.

Let the independent input features be

X = [x_1, x_2, …, x_m]

In this, the dependent variable Y only takes the binary values 0 and 1.

We can then apply a multi-linear function to the input features X:

z = w_1 x_1 + w_2 x_2 + ⋯ + w_m x_m + b

Here the weights or coefficients are denoted by w = [w_1, w_2, w_3, …, w_m], x_i represents the i-th input feature of an observation, and b is the bias factor, also known as the intercept. The preceding equation can equivalently be expressed as the dot product of the weight vector and the feature vector plus the bias: z = w·x + b.

Sigmoid Function

Next, let us discuss the sigmoid function. Its input is z, and it returns the predicted probability y, which lies between 0 and 1:

σ(z) = 1 / (1 + e^(-z))


Image Source original

As the sigmoid curve shows, the function converts continuous values into a probability between zero and one. As z approaches infinity, σ(z) approaches 1, and as z approaches negative infinity, σ(z) approaches 0, so the output is always bounded between 0 and 1.

The probability of belonging to a class can then be measured as:

P(y = 1 | X) = σ(z)   and   P(y = 0 | X) = 1 − σ(z)

Logistic Regression Equation

In logistic regression we work with the odds: the chance that something occurs relative to the chance that it does not occur. This is different from probability, which is the ratio of favorable outcomes to all possible outcomes. The odds can be written as

p / (1 − p)

Taking the logarithm of the odds gives the log-odds, which logistic regression models as a linear function of the inputs:

log(p / (1 − p)) = w·x + b

Solving for p, the final equation or formula of logistic regression becomes:

p = 1 / (1 + e^(-(w·x + b))) = σ(w·x + b)
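Putting the pieces together, the following sketch evaluates this final equation for a single observation with made-up weights and bias (all values are hypothetical):

import numpy as np

w = np.array([0.5, -0.25, 1.0])   # hypothetical coefficients
b = -0.3                          # hypothetical intercept (bias)
x = np.array([2.0, 1.0, 0.5])     # one observation's feature values

z = np.dot(w, x) + b              # linear combination w·x + b
p = 1 / (1 + np.exp(-z))          # sigmoid maps z to a probability in (0, 1)
print(f"z = {z:.2f}, P(y=1|x) = {p:.3f}")   # z = 0.95, P roughly 0.721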

Assumptions for Logistic Regression

The assumptions for logistic regression are as follows:

Independent observations – each observation in the dataset should be independent of the others, so that there is no correlation between observations (for example, no repeated measurements of the same subject).

In logistic regression, certain conditions must be met for the model to perform effectively:

Binary dependent variables: The dependent variable in logistic regression should only be able to take on binary or dichotomous values. If there are multiple categories, the softmax function is utilized instead. This binary nature allows logistic regression to classify observations into two distinct groups based on the given input variables.

Linearity of log-odds in the independent variables: the log-odds of the dependent variable should be a linear function of the independent variables. This linear association ensures that changes in the independent variables correspond to proportional changes in the log-odds of the outcome, facilitating accurate predictions.

No outliers: extreme values can distort the estimated coefficients and the relationship between the log-odds and the independent variables. Therefore, it is important to address outliers or ensure they are not present in the dataset to maintain the model's accuracy and reliability.

Large sample size – logistic regression, like most machine learning algorithms, needs a reasonably large dataset so that its coefficients can be estimated reliably.

Applying the steps in logistic regression modeling

If we want to utilize logistic regression properly for prediction, we need to follow a structured approach:

Define the problem: At the start, we need to define the dependent variable and independent variables. Determining whether the problem involves binary classification or not is crucial.

Data preparation: preprocessing and cleansing of the data is necessary after the problem has been defined. This includes managing missing values, encoding categorical variables, and ensuring the dataset is suitable for logistic regression analysis.

Exploratory data analysis (EDA): By doing EDA, one may learn more about the dataset. While spotting unusual events guarantees data integrity, visualizing connections between dependent and independent variables helps to better comprehend the data.

Feature selection: selecting relevant independent variables is vital for effective modeling. Features with significant relationships to the dependent variable are retained, while redundant or irrelevant features are removed to improve model performance.

Model building: build the logistic regression model using the chosen independent variables and estimate its coefficients from the training data.

Model evaluation: the performance of the model is evaluated using measures such as accuracy, precision, recall, F1-score, and AUC-ROC. This shows how well the model is likely to predict on new data.

Model improvement: Fine-tuning the model may be necessary to enhance its performance. Adding new features, utilizing regularization techniques, or adjusting independent variables can all improve the quality of your model and lower the risk of overfitting.

Model deployment: once we are happy with the model's results, it is ready for deployment in real-world scenarios. The trained model is then used to make predictions on new, unseen data, offering valuable insights for decision-making. A compact sketch of these steps is given below.
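A sketch of these steps with scikit-learn might look as follows; the built-in breast cancer dataset simply stands in for whatever data the actual problem provides:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

# Define the problem: a binary target with numeric features
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build the model: scale the features, then fit logistic regression on the training data
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluate on held-out data
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred))        # precision, recall, F1, accuracy
print("ROC-AUC:", round(roc_auc_score(y_test, y_prob), 3))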

Logistic Regression Model Thresholding

Logistic regression only becomes a classification technique once a threshold is applied to its probability output. Setting this threshold value is therefore very important, and the quality of the resulting classification depends heavily on it.

The choice of threshold strongly affects the precision and recall of the model. Ideally we would like both precision and recall to equal one, but this is very difficult or even impossible in practice, so we aim to get both as close to one as possible.

To manage the precision-recall tradeoff (when one increases, the other tends to decrease), the following considerations can help in choosing the threshold; a small sketch of the tradeoff follows them:

To attain a scenario of low precision and high recall, the focus lies on minimizing the number of false negatives while allowing a potentially higher number of false positives. This means selecting a decision threshold that prioritizes recall over precision. By doing so, the aim is to capture as many true positives as possible, even if it results in a relatively higher number of false positives.

Conversely, achieving high precision and low recall involves reducing the number of false positives while potentially accepting a higher number of false negatives. This entails choosing a decision threshold that emphasizes precision over recall. The goal is to ensure that the majority of the predicted positives are indeed true positives, even if it means sacrificing the ability to capture all actual positives, leading to a lower recall value.
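The sketch below makes the tradeoff visible by sweeping a few thresholds over some made-up predicted probabilities; with real data, y_true and y_prob would come from a fitted model:

import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                      # made-up true labels
y_prob = np.array([0.2, 0.4, 0.35, 0.8, 0.65, 0.55, 0.9, 0.1])   # made-up predicted probabilities

for t in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= t).astype(int)    # classify as positive above threshold t
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")

# Lower thresholds favor recall; higher thresholds favor precision.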

Summary

Like linear regression, logistic regression is a simple and easy-to-implement machine learning algorithm. It is used for classification problems, where we need to decide whether an observation belongs to one class or another. It also has limitations similar to those of linear regression: it requires a linear relationship between the independent variables and the log-odds of the outcome, and it needs labeled data suitable for classification. In this chapter we also touched on more advanced aspects of logistic regression.

Python Code

Below is the logistic regression Python example code: -
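The original listing is not reproduced here, but a representative logistic regression example in Python, assuming scikit-learn and a synthetic two-class dataset, could look like this:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Generate a simple synthetic two-class dataset
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("Coefficients:", clf.coef_, "Intercept:", clf.intercept_)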

LINEAR REGRESSION IN MACHINE LEARNING/PYTHON/ARTIFICIAL INTELLIGENCE

Linear Regression

  • Use Regression Analysis
  • Assumptions of Linear Regression
  • Types of Linear Regression
  • Evaluation Metrics for Linear Regression
  • Advantages of Linear Regression
  • Disadvantages of Linear Regression

This section focuses on linear regression, a fundamental component of supervised machine learning falling under the umbrella of ML regression models or analysis. Linear regression machine learning serves as a basic yet crucial algorithm within this domain. Most algorithms within supervised learning are geared towards handling continuous data, aiming to forecast outcomes in scenarios such as predicting stock prices or estimating car values. In linear regression analysis, we have one or more independent variables x that try to predict an outcome y.

Supervised learning encompasses two primary branches: classification and regression. Classification involves predicting the category or class of a dataset based on independent variables. It yields discrete outcomes, typically binary choices like 'yes' or 'no', '1' or '0', or specific categories such as dog breeds or car models.

In contrast, regression, another form of supervised learning, focuses on predicting continuous output variables based on independent input variables. This methodology is instrumental in forecasting scenarios like housing prices or stock values. Essentially, regression involves examining correlations between variables to forecast continuous outcomes. It is heavily utilized in applications like forecasting and time series modeling. Put simply, regression fits a line or curve through the data points on a graph of the target versus the predictor variables, minimizing the vertical distance between the points and that line or curve.

Overall, linear regression plays a pivotal role in predictive modeling, particularly in scenarios involving continuous data prediction. Its simplicity and effectiveness make it a staple in various predictive analytics tasks, offering valuable insights into variable relationships and enabling accurate forecasts. In this blog, we learn linear regression in Python regression analysis.

Terms related to regression analysis

Dependent variable – the variable we want to predict or estimate; it is also called the target variable.

Independent variables – the features that mainly affect the dependent variable. The model is trained on these variables; they are also called predictors.

Outliers – an outlier is a very low or very high value that does not match the rest of the data. Removing it from the dataset barely changes the overall result, but leaving it in the training data can reduce the model's performance, so outliers should be handled before training.

Multicollinearity – a situation in which the independent variables are highly correlated with one another. It is undesirable because it makes it difficult to rank the most influential variables.

Overfitting and underfitting – these are also a concern in regression, as in any other machine learning model.

Linear Regression

A supervised machine learning method called linear regression can be used to determine the relationship between one or more independent features and a dependent variable. When there is just one independent variable, univariate linear regression analysis is utilized; however, multivariate linear regression is employed when there are several independent factors.

Why should we use regression analysis?

Regression analysis deals with continuous variables, and there are many real-world situations where we need continuous predictions to make good decisions. For such problems we need a technique that can predict continuous values, and regression does exactly that. Some other reasons for using regression analysis are given below: -

It predicts the relationship between the dependent and the independent variable.

It can find how data points are related.

It can also help to predict real/continuous values.

Using linear regression we can identify the most important and the least important features, and we can also see how one feature affects another.

Real-world Example of Linear Regression

Let's suppose we bake cookies but don't know how many cookies a given amount of flour will yield. To figure this out, we collect data: the cups of flour used (independent variable) and the cookies produced from them (dependent variable).

Let's suppose we have the data in the table below:

Flour (cups)    Cookies
1.5             12
2               16
2.5             20

From the above data, we need a method to work out how many cookies a given number of cups of flour will produce. In a problem like this, linear regression works like magic, or in this case like a magic recipe. We want an equation that fits this data; the equation describes a straight line that captures the relationship. Suppose a computer solves it for us and returns:

Cookies = 8 * Flour

The above formula reveals that we get 8 more cookies for every extra cup of flour. The intercept here happens to be 0: with no flour, we bake no cookies.

Now suppose someone asks how many cookies we can bake using 3.5 cups of flour. We can confidently answer using the above equation:

Cookies = 8 * 3.5 = 28 cookies.

Thanks to linear regression, whenever someone asks how many cookies we can make with a certain number of cups of flour, we can answer using the above equation.

However, this will not work with all the cookie recipes, if we made chocolate chip cookies then there might be different formulas applied to it for better results.
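For this toy dataset, the fit can be checked with a few lines of numpy; the numbers below simply mirror the table above:

import numpy as np

flour = np.array([1.5, 2.0, 2.5])   # independent variable (cups of flour)
cookies = np.array([12, 16, 20])    # dependent variable (cookies baked)

slope, intercept = np.polyfit(flour, cookies, deg=1)        # least-squares straight-line fit
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")  # roughly 8 and roughly 0

print("Prediction for 3.5 cups:", round(slope * 3.5 + intercept))   # about 28 cookies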

The assumption for the linear regression model

Linearity – the independent and dependent variables are related linearly, meaning that a change in the independent variable(s) affects the dependent variable proportionally, so a straight line can be drawn through the data points.


Independence – the observations in the dataset should be independent of one another; the dependent variable value of one observation should not depend on that of another. Dependencies between observations can jeopardize the correctness of the linear model.

Homoscedasticity refers to the consistent variance of errors across all levels of the independent variable(s). This means that regardless of the values that independent variable(s) have, the variability of the errors remains uniform. Maintaining constant variance in the residuals is essential because any deviation from homoscedasticity can lead to inaccuracies in the model's predictions.

Normality – the residuals should follow a bell-shaped (normal) distribution; strong departures from normality undermine the reliability of the linear model.

Types of Linear Regression

There are many types of linear regression but two of them are the most prominent they are: -

Simple Linear Regression – It is the most fundamental and often used version of linear regression available. For this regression, all we need is one dependent variable and one independent variable. Its formula is written below:

y=β_0+β_1 X

Here:

y is the dependent variable

X is the independent variable

β_0 is the intercept

β_1 is the slope

Multiple linear regression – in this regression there is more than one independent variable and one dependent variable. The equation for this regression method is:

y=β_0+β_1 X_1+β_2 X_2+⋯+β_n X_n

Here:

y is the dependent variable

X_1, X_2, …, X_n are the independent variables

β_0 is the intercept, and β_1, β_2, …, β_n are the slopes

Some other types are regression also available and they are: -

Polynomial regression extends linear regression by adding higher-order polynomial terms of the independent variables to the model. This allows more flexible and complex relationships between the variables to be captured.

Ridge regression is a regularization technique for linear regression models that helps avoid overfitting; it performs best when there are several independent variables to take into account. By adding an L2 penalty term to the least squares objective function, ridge regression drives the model toward solutions with smaller coefficients, improving model stability and reducing the impact of multicollinearity.

Lasso regression is another regularization method; it employs an L1 penalty term that can shrink the coefficients of non-significant independent variables exactly to zero. This effectively performs feature selection, enabling the model to focus on the most relevant predictors and disregard irrelevant ones.

Elastic Net regression merges the regularization penalties of both ridge and lasso regression techniques. By striking a balance between their strengths, elastic net regression offers enhanced flexibility and robustness in handling multicollinearity and feature selection challenges commonly encountered in regression analysis.
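The sketch below contrasts ordinary least squares with these regularized variants on synthetic data where two features are almost identical (deliberate multicollinearity); the alpha values are arbitrary:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)     # nearly a copy of x1 -> multicollinearity
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=200)

for model in (LinearRegression(), Ridge(alpha=1.0),
              Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    # Regularization keeps the coefficients small and stable despite the correlated inputs
    print(type(model).__name__, np.round(model.coef_, 2))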

Best fit line

Linear regression algorithms aim to determine the equation of a best-fit line that can accurately predict values of the dependent variable from the independent variables. The major objective is to minimize the error margin between the model's predicted values and the actual values. The best-fit line, usually a straight line, displays the relationship between the predictor variables and the predicted variable in the dataset.

Within this line, the slope plays a crucial role, indicating the rate of change of the predicted variable in response to a unit change in the predictor variable(s). Quantifying the impact of the independent variables on the outcome variable in this way gives insight into the kind and intensity of the link between the variables.



In the diagram provided, the variable Y represents the dependent variable, while X represents the independent variable(s), also known as the features or predictors of Y. One of the primary goals of linear regression is to predict the dependent variable Y from the values of the independent variable(s) X. This predictive relationship is represented by a straight line, hence the term "linear" regression.

Linear regression models use optimization techniques like gradient descent to lower the mean squared error (MSE) on a training dataset by iteratively changing the model's parameters. The parameters, often denoted θ_1 (theta subscript 1) and θ_2 (theta subscript 2), are adjusted so that the cost function is minimized and the best-fit line is obtained. Gradient descent facilitates this by iteratively updating the parameters, gradually converging toward the optimal values that minimize the cost function and ultimately producing an accurate linear regression model.
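As a rough illustration of that process, here is a bare-bones gradient descent for simple linear regression on made-up data (the learning rate and iteration count are arbitrary):

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * X + 1.0                 # data generated from slope 2 and intercept 1

theta0, theta1 = 0.0, 0.0         # intercept and slope, both started at zero
lr = 0.01                         # learning rate
n = len(X)

for _ in range(5000):
    y_pred = theta0 + theta1 * X
    error = y_pred - y
    grad0 = (2 / n) * error.sum()         # dMSE/dtheta0
    grad1 = (2 / n) * (error * X).sum()   # dMSE/dtheta1
    theta0 -= lr * grad0                  # step against the gradient
    theta1 -= lr * grad1

print(round(theta0, 3), round(theta1, 3))   # converges towards 1 and 2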

Evaluation Metrics for linear regression

Evaluation metrics are used to check how well our linear regression model performs. They help us understand how closely the model's predictions match the observed outputs.

The most common measurements are: -

Coefficient of determination (R-squared): a statistic that indicates how much of the variation in the data the fitted model can explain or account for. It always lies between 0 and 1; the closer it is to 1, the better the model, and vice versa. Its mathematical expression is as follows: -

R^2=1-(RSS/TSS)

Residual sum of Squares (RSS) – the sum of the squared residuals over every data point in the dataset. It measures the deviation between the predicted and observed outputs:

RSS = Σ (y_i − ŷ_i)^2

Total Sum of Squares (TSS) – the sum of the squared deviations of each data value from the mean of the response variable:

TSS = Σ (y_i − ȳ)^2

Root Mean Squared Error (RMSE): the square root of the average of the squared residuals. It characterizes how closely the actual data points agree with the predicted values, i.e. the absolute fit of the model to the data. It can be expressed as

RMSE = √(RSS / n) = √(Σ (y_i − ŷ_i)^2 / n)

To produce an unbiased estimate, we can divide the sum of the squared residuals by the residual degrees of freedom instead of by the total number of data points. The result is then referred to as the Residual Standard Error (RSE):

RSE = √(RSS / (n − p − 1))   (for a model with p predictors)

R-squared is generally a better comparison measure than RMSE, because the value of the Root Mean Squared Error depends on the units of the variables and may therefore change when the variables' units change.
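These quantities are easy to compute by hand; the sketch below uses a few made-up actual values y and model predictions y_hat:

import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])        # observed values
y_hat = np.array([2.8, 5.3, 6.9, 9.4])    # hypothetical model predictions

rss = np.sum((y - y_hat) ** 2)            # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)         # total sum of squares
r2 = 1 - rss / tss                        # coefficient of determination
rmse = np.sqrt(rss / len(y))              # root mean squared error

print(f"RSS={rss:.3f}  TSS={tss:.3f}  R^2={r2:.4f}  RMSE={rmse:.3f}")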

Linear Regression Line

The linear regression line serves as a powerful tool for understanding the relationship between two variables. It shows the line that best describes how the dependent variable (Y) responds to changes in the independent variable (X). This line captures the overall trend, showing how changes in the independent variable affect the dependent variable. Studying the slope and intercept of the regression line gives important information about the direction and strength of the link between the variables. All things considered, the linear regression line gives a concise and clear picture of the underlying dynamics between the two variables under examination.

  • A positive linear regression line indicates a positive relationship between the independent variable (X) and the dependent variable (Y). This means that as the value of X increases, the value of Y grows as well, and as X drops, Y drops too. Visually, the positive linear regression line slopes upward from left to right.
  • A negative linear regression line indicates an inverse relationship between the independent variable (X) and the dependent variable (Y): as X increases, Y decreases, and vice versa. Visually, the negative linear regression line slopes downward from left to right.

Advantages of Linear Regression

  • Compared to its more complex counterparts, linear regression is a straightforward and widely used method in regression analysis. Its coefficients are highly interpretable: each coefficient indicates how much the dependent variable changes for a one-unit change in the corresponding independent variable, providing significant information about the relationships between variables.
  • Scalability and computing efficiency of linear regression are two of its main advantages; these allow it to handle big datasets with ease. Real-time applications, where rapid model deployment is critical, are especially well-suited for its capacity to be quickly trained on large datasets.
  • Furthermore, compared to some more complex machine learning techniques, linear regression can be relatively stable, and mild anomalies typically have little effect on the model's overall performance, although extreme outliers should still be handled during preprocessing. This stability helps linear regression remain reliable under different conditions.
  • Moreover, linear regression functions as a fundamental model and is frequently used as a benchmark to assess the effectiveness of increasingly sophisticated machine learning algorithms. Its accessibility and usefulness in a wide range of applications are further enhanced by its simplicity and well-established character, which make it a widely available choice across many machine learning libraries and software packages.

Disadvantages of Linear Regression

  • Despite its simplicity and efficiency, linear regression exhibits certain limitations that can affect its performance in certain scenarios. One significant drawback is its reliance on assuming a linear relationship between independent and dependent variables. When such a linear relationship doesn't exist within the dataset, linear regression tends to perform poorly, leading to inaccurate predictions and inadequate model fitting.
  • Another challenge is its sensitivity to multicollinearity, a situation where independent variables display a high correlation with each other. This can lead to instability in the model estimates and affect the interpretation of individual coefficients.
  • Furthermore, linear regression assumes that the features are already in a form the model can use. Feature engineering is therefore frequently required to convert raw features into a suitable format, which adds complexity to the modeling process.
  • Furthermore, linear regression has limitations in providing explanatory relationships between variables, particularly in cases where the relationships are complex or nonlinear. In such instances, more advanced machine-learning techniques may be required to uncover deeper insights and nuances within the data.

Conclusion

Linear regression is a very basic and fundamental machine learning algorithm, widely used on simple datasets and as a benchmark for other models' performance. It is popular because of its simplicity, interpretability, and efficiency, and it is a very useful tool for understanding the relationship between variables and making predictions in a variety of applications. However, we must also know its limitations: it does not work well when there is no linear relationship between the independent and dependent variable(s), and it is sensitive to multicollinearity.

Python code

Here is the linear regression model Python code: -
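The original listing is not reproduced here, but a representative linear regression example, assuming scikit-learn and its built-in diabetes dataset, could look like this:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

reg = LinearRegression()
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)
print("Coefficients:", np.round(reg.coef_, 2))
print("Intercept:", round(float(reg.intercept_), 2))
print("RMSE:", round(float(np.sqrt(mean_squared_error(y_test, y_pred))), 2))
print("R^2:", round(r2_score(y_test, y_pred), 3))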

