Monday, February 19, 2024

LOGISTIC REGRESSION IN MACHINE LEARNING/PYTHON/ARTIFICIAL INTELLIGENCE

Logistic Regression
  • Types and terminologies of Logistic Regression
  • Working of Logistic Regression
  • Steps in Logistic Regression Modelling
  • Logistic Regression Model Thresholding

Logistic regression, one of the most significant supervised machine learning algorithms, is typically applied to classification tasks. In contrast to regression models with continuous outcomes, the aim of a classification problem is to find the likelihood that an observation belongs to a certain class. Logistic regression gets its name from the mathematical procedure at its core: a sigmoid (logistic) function is applied to the output of a linear regression function. The probabilities that emerge from this transformation indicate the likelihood that an observation falls into a certain class. In this blog, we learn logistic regression in Python.

Using a set of independent variables, logistic regression aims to forecast a categorical dependent variable. Rather than exact categorical outputs, it provides probabilities between 0 and 1 that indicate the likelihood that an observation belongs to a particular class.

Logistic regression and linear regression are technically similar, though they are not the same, and each is useful in its own setting. Linear regression handles problems with continuous value predictions, whereas logistic regression is better suited for classification tasks. Logistic regression fits the dataset with a logistic function, producing a curved 'S' shape whose output stays between 0 and 1. Identifying cancerous cells or detecting a dog in an image are two instances of the kinds of problems this curve can model.

Logistic regression holds significant value in machine learning because it can effectively classify fresh data from both continuous and discrete datasets and generate probabilistic results. Observation categorization using several data sources is facilitated by this flexible and significant tool, which also helps identify the most influential variables for classification tasks. It is particularly useful in predictive modeling and decision-making processes. 


Logistic Function (Sigmoid Function)

In logistic regression, whether an observation belongs to a certain class is determined by applying a threshold to the output of the sigmoid function. Typically, a threshold of 0.5 is used for binary classification: a predicted value below 0.5 is placed in class 0, and a value of 0.5 or above in class 1. As a mathematical instrument, the sigmoid function maps predicted values onto probabilities, converting real values into a range between 0 and 1.

This transformation is crucial in logistic regression, as the output must stay within the bounds of 0 and 1. The sigmoid function, often visualized as an 'S' curve, accomplishes this task by ensuring that predictions are constrained within these limits.

In logistic regression, certain prerequisites must be met. Firstly, the dependent variable must be categorical in nature, as logistic regression is specifically designed for classification tasks. Additionally, multicollinearity among independent variables should be avoided, as it can lead to instability in model estimates and complicate the interpretation of coefficients. By adhering to these requirements, logistic regression can effectively classify observations and provide valuable insights into categorical outcomes.

Types of Logistic Regression

Three different types of logistic regression can be distinguished:

Binomial: binomial logistic regression has only two possible outcomes, such as true or false, or pass or fail.

Multinomial: multinomial logistic regression allows three or more outcomes, and these categories have no natural order.

Ordinal: ordinal logistic regression also allows three or more outcomes, but in this case the categories of the dependent variable are ranked, for example from low to high.

Logistic Regression Example

We all receive spam emails, and sometimes we are bombarded by them, but scattered among these spam emails are important ones. To overcome this problem, we want a solution that can filter spam from important email. To achieve this, we train a model that does the filtering automatically.

In such situations, logistic regression can help. The logistic regression algorithm can analyze the features of an email and predict the probability that the email is spam (the dependent variable).

Some features that can indicate a spam email include:

  • Presence of spammy words (e.g., “FREE”, “URGENT”, “4U”)
  • Sender’s email address (unknown or known contact)
  • Inclusion of attachments or links

To train our logistic regression model, we collect a batch of emails containing both spam and non-spam messages, label them accordingly, and feed the labeled data to the model. The model then learns patterns from the data and learns to differentiate spam emails from normal ones.

After learning from the data, logistic regression does not just answer yes or no; it also provides a probability between 0 (definitely not spam) and 1 (definitely spam). This probability allows us to set a threshold value. For example, if we set 0.6 as the spam threshold, then any email with a probability over 0.6 is flagged as spam.

The beauty of this approach is that we can tune the threshold to our needs. If we don't want to lose any important email to the spam folder, we can set the threshold higher; conversely, if we are less worried about misclassifying important emails, we can set it lower.
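The workflow above can be sketched with scikit-learn's LogisticRegression; the three features and the tiny labeled dataset below are invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [count of spammy words, sender known (1) or unknown (0), has links (1/0)]
X = np.array([[5, 0, 1], [0, 1, 0], [3, 0, 1], [1, 1, 0],
              [4, 0, 1], [0, 1, 1], [6, 0, 0], [0, 1, 0]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])  # 1 = spam, 0 = not spam

model = LogisticRegression().fit(X, y)

# predict_proba returns [P(not spam), P(spam)] per row; apply our own 0.6 threshold
new_email = np.array([[4, 0, 1]])  # many spammy words, unknown sender, has a link
p_spam = model.predict_proba(new_email)[0, 1]
is_spam = p_spam >= 0.6
print("P(spam) =", p_spam, "flagged:", is_spam)
```

Changing the `0.6` in the comparison is all it takes to make the filter stricter or more lenient.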

This illustrates how logistic regression applies to a real-world machine learning example.

Terminologies involved in Logistic Regression

The following terminology must be understood to fully grasp how logistic regression operates:

Independent variables: the attributes, features, or qualities of the input that are used to predict the outcomes of the dependent variable.

Dependent variable: also known as the target variable, this is the variable whose values the logistic regression model predicts or estimates.

Logistic function: a formula that maps the relationship between the independent variables and the dependent variable. Through the conversion of input variables, the logistic function generates probability values between zero and one.

Odds: the ratio of the probability that an event occurs to the probability that it does not occur. It presents an alternative perspective on probability.

Log-odds: sometimes referred to as the logit, this is the natural logarithm of the odds. In a logistic regression model, the log-odds of the dependent variable are expressed as a linear combination of the intercept and the independent variables.

Coefficients: the estimated parameters of a logistic regression model. They quantify the impact of each independent variable on the log-odds of the outcome.

Intercept: the constant term of a logistic regression model, representing the log-odds of the outcome when every independent variable is set to zero.

Maximum likelihood estimation: the method used to estimate or predict the coefficients of a logistic regression model. The aim is to find the coefficients that maximize the likelihood of observing the data, given the structure and assumptions of the model.
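The last point can be made concrete: maximum likelihood estimation searches for the coefficients that maximize the log-likelihood of the observed labels. A minimal sketch, in which the toy dataset and the two candidate coefficient settings are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, b, X, y):
    """Sum of y*log(p) + (1-y)*log(1-p) over all observations."""
    p = sigmoid(X @ w + b)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy one-feature dataset: small x values are class 0, large ones class 1
X = np.array([[0.5], [1.5], [2.5], [3.5]])
y = np.array([0, 0, 1, 1])

# A candidate near the true boundary scores higher than a poor one
good = log_likelihood(np.array([2.0]), -4.0, X, y)
bad = log_likelihood(np.array([-1.0]), 0.0, X, y)
print(good, bad)
```

The fitting procedure is simply a numerical search for the coefficients that make this quantity as large as possible.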

How does Logistic Regression work

Logistic regression modeling converts the continuous output of the linear regression function into values that may be categorized by using the sigmoid function. This mathematical function, commonly known as the logistic function, is crucial in transforming the set of real-valued independent variables into a value between 0 and 1.

The continuous output of the linear regression model is transformed into probabilities using logistic regression and the sigmoid function. These probabilities specifically indicate the likelihood that an observation falls into a particular category. This process aids in the classification of data into discrete categories by efficiently employing the input variables to anticipate categorical outputs.

Suppose we have independent input features

X = [x_1, x_2, …, x_m]

and a dependent variable Y that only takes the binary values 0 and 1.

We can then apply a multi-linear function to the input features X:

z = w_1·x_1 + w_2·x_2 + … + w_m·x_m + b

Here w = [w_1, w_2, …, w_m] denotes the weights or coefficients, x = [x_1, x_2, …, x_m] is an observation's feature vector, and b is the bias term, also known as the intercept. The equation can therefore be expressed compactly as the dot product of the weights and features plus the bias: z = w·x + b.

Sigmoid Function

Next we apply the sigmoid function: its input is z, and its output is the predicted probability ŷ between 0 and 1:

σ(z) = 1 / (1 + e^(−z))


(Figure: the sigmoid 'S' curve)

The sigmoid function converts the continuous variable z into a probability between zero and one: as z approaches infinity, σ(z) approaches 1, and as z approaches negative infinity, σ(z) approaches 0, so the output always lies between 0 and 1.

The probability of belonging to a class can then be written as:

P(y = 1 | x) = σ(z)
P(y = 0 | x) = 1 − σ(z)
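These properties are easy to verify numerically with plain NumPy (the input values are chosen arbitrarily):

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # exactly 0.5, the usual decision boundary
print(sigmoid(10))   # very close to 1
print(sigmoid(-10))  # very close to 0
```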

Logistic Regression Equation

In logistic regression we work with the odds: the chance that something occurs relative to the chance that it does not. This differs from probability, which compares the chance that something occurs with all possible outcomes. Writing p for the probability of the positive class, the odds are

p / (1 − p)

Applying the log to the odds gives the log-odds, which the model treats as linear in the inputs:

log(p / (1 − p)) = w·x + b

Solving for p, the final equation or formula of logistic regression is:

p = 1 / (1 + e^(−(w·x + b)))
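This derivation can be traced numerically; the probability p = 0.8 below is purely illustrative:

```python
import math

p = 0.8
odds = p / (1 - p)         # probability -> odds (here 4 to 1)
log_odds = math.log(odds)  # the logit, which the model equates with w.x + b

# Inverting the logit via the logistic function recovers the probability
p_back = 1 / (1 + math.exp(-log_odds))
print(odds, log_odds, p_back)
```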

Assumptions for Logistic Regression

The assumptions for logistic regression are as follows:

Independent observations: each observation in the dataset should be independent of the others, with no correlation between repeated measurements.


Binary dependent variables: The dependent variable in logistic regression should only be able to take on binary or dichotomous values. If there are multiple categories, the softmax function is utilized instead. This binary nature allows logistic regression to classify observations into two distinct groups based on the given input variables.

Linear log odds in the independent variables: the log odds of the dependent variable should be a linear function of the independent variables. This linear association ensures that changes in the independent variables correspond to proportional changes in the log odds of the outcome, facilitating accurate predictions.

No outliers: extreme values can distort the linear relationship between the log odds and the independent variables and destabilize the estimated coefficients. Therefore, it is important to address outliers or ensure they are not present in the dataset to maintain the model's accuracy and reliability.

Large sample size: training a logistic regression model reliably requires a reasonably large dataset, since the coefficient estimates improve as more data becomes available.

Applying the steps in logistic regression modeling

If we want to utilize logistic regression properly for prediction, we need to follow the following structured approach:

Define the problem: At the start, we need to define the dependent variable and independent variables. Determining whether the problem involves binary classification or not is crucial.

Data preprocessing: after the problem has been defined, the data must be cleansed and preprocessed. This includes managing missing values, encoding categorical variables, and ensuring the dataset works well with logistic regression analysis.

Exploratory data analysis (EDA): by doing EDA, one may learn more about the dataset. Visualizing relationships between dependent and independent variables helps to better comprehend the data, while spotting unusual observations safeguards data integrity.

Feature selection: selecting relevant independent variables is vital for effective modeling. Features with significant relationships to the dependent variable are retained, while redundant or irrelevant features are removed to improve model performance.

Model building: build a logistic regression model using the chosen variables and estimate its coefficients from the training data.

Model evaluation: the performance of the model is evaluated using metrics such as accuracy, precision, recall, F1-score, and AUC-ROC. This makes it possible to assess the model's prediction accuracy on new data.

Model improvement: Fine-tuning the model may be necessary to enhance its performance. Adding new features, utilizing regularization techniques, or adjusting independent variables can all improve the quality of your model and lower the risk of overfitting.

Model deployment: once we are happy with the model's results, it is ready for deployment in real-world scenarios. The trained model is then used to make predictions on new, unseen data, offering valuable insights for decision-making.

Logistic Regression Model Thresholding

Logistic regression only becomes a classification technique once a threshold is applied to its output probabilities. Setting the threshold value is therefore very important, since the outcome of the classification depends heavily on it.

The threshold value directly affects precision and recall. Ideally we would like both precision and recall to equal 1, but that is very difficult or even impossible for a real model, so instead we aim to bring both values as close to 1 as possible.

Because precision and recall trade off against each other (when one increases, the other decreases), the following considerations can help in deciding the threshold:

To attain a scenario of low precision and high recall, the focus lies on minimizing the number of false negatives while allowing a potentially higher number of false positives. This means selecting a decision threshold that prioritizes recall over precision. By doing so, the aim is to capture as many true positives as possible, even if it results in a relatively higher number of false positives.

Conversely, achieving high precision and low recall involves reducing the number of false positives while potentially accepting a higher number of false negatives. This entails choosing a decision threshold that emphasizes precision over recall. The goal is to ensure that the majority of the predicted positives are indeed true positives, even if it means sacrificing the ability to capture all actual positives, leading to a lower recall value.
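This trade-off can be sketched with scikit-learn's metrics; the predicted probabilities below stand in for made-up model outputs:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.55, 0.9, 0.2, 0.6, 0.3])

# Sweep the decision threshold and watch precision and recall move in
# opposite directions
for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)
    print(threshold,
          precision_score(y_true, y_pred),
          recall_score(y_true, y_pred))
```

On this toy data, the lowest threshold yields perfect recall but the worst precision, and the highest threshold does the reverse.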

Summary

Like linear regression, logistic regression is a very simple and easy-to-implement machine learning algorithm, used for classification problems where we need to decide whether an observation belongs to a given class or not. It also has a limitation similar to that of linear regression: a linear relationship needs to exist between the independent variables and the log odds of the dependent variable, and the dependent variable must be categorical.

Python Code

Below is a logistic regression Python example:
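A minimal, self-contained sketch using scikit-learn's LogisticRegression on its built-in breast-cancer dataset (echoing the cancer-cell example mentioned earlier; the dataset choice and settings are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the data and split it into training and test sets
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit the model (max_iter raised so the solver converges on this dataset)
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# Evaluate on unseen data: predicted classes and class-1 probability
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("P(class 1) for first test sample:", model.predict_proba(X_test[:1])[0, 1])
```

The same pattern (split, fit, predict, evaluate) carries over to any binary classification dataset.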
