Logistic Regression
- Types and terminologies of Logistic Regression
- Working of Logistic Regression
- Steps in Logistic Regression Modelling
- Logistic Regression Model Thresholding
Logistic regression machine learning, one
of the most significant supervised machine learning algorithms, is typically
applied to classification tasks. Finding the likelihood that an observation
belongs to a certain class is the aim of classification problems, as opposed to
continuous-outcome regression models. Logistic regression gets its name from
the mathematical procedure that uses a sigmoid function to analyze the output
of a linear regression function. The likelihood that an observation falls into
a certain class is indicated by the probabilities that emerge from this transformation. In this blog, we learn logistic regression in Python.
Using a set of
independent factors, logistic regression aims to forecast the categorical
dependent variable. Logistic regression provides probabilistic numbers between
0 and 1, which indicate the likelihood that an observation belongs to a
particular class, as opposed to exact categorical findings.
Although they are not the
same, logistic regression and linear regression are comparable technically.
They are both useful, nonetheless, despite this. Linear regression is used to
handle issues with continuous value predictions, whereas logistic regression is
better suited for classification tasks. Logistic regression fits the dataset
with a logistic function, resulting in a curved 'S' form, to estimate the
maximum output between 0 and 1. Identifying cancerous cells or a dog in an
image are two instances of the kinds of data that this curve could be used to
represent.
Logistic regression holds significant value in machine learning because it can effectively classify fresh data from both continuous and discrete datasets and generate probabilistic results. Observation categorization using several data sources is facilitated by this flexible and significant tool, which also helps identify the most influential variables for classification tasks. It is particularly useful in predictive modeling and decision-making processes.
Image source original
Logistic Function (Sigmoid Function)
In logistic regression,
the determination of whether an observation belongs to a certain class relies
on a threshold value, often set using the sigmoid function. Typically, a
threshold of 0.5 is used for binary classification. Predicted value is placed in class 0 if it is less than 0.5; class 1 if it is more than or equal to 0.5. As a mathematical instrument, the sigmoid function maps expected values onto probabilities, hence converting real values into a range between 0 and 1.
This transformation is
crucial in logistic regression, as the output must stay within the bounds of 0
and 1. The sigmoid function, often visualized as an 'S' curve, accomplishes
this task by ensuring that predictions are constrained within these limits.
In logistic regression,
certain prerequisites must be met. Firstly, the dependent variable must be
categorical in nature, as logistic regression is specifically designed for
classification tasks. Additionally, multicollinearity among independent
variables should be avoided, as it can lead to instability in model estimates
and complicate the interpretation of coefficients. By adhering to these
requirements, logistic regression can effectively classify observations and
provide valuable insights into categorical outcomes.
Types of Logistic Regression
Three different types of
logistic regression can be distinguished:
There is a unique type of logistic
regression that is known as binomial logistic regression. It has only two possible
outcomes: true or false, pass or fail, etc.
Multinomial: In multinomial logistic regression it can have 3 or more than 3 outcomes it can also be in unordered manner.
Three or more outcomes
can also be included in an ordinal category logistic regression; however, in
this instance, the dependent variables are ranked from low to high.
Logistic Regression Example
We all get spammy emails even sometimes we get bombarded by them, but between these spammy emails, we get some important emails. To overcome such problems, we want a solution that can filter spammy emails from important emails. To achieve this, we want to train a model that automatically does this.
In such situations, logistic regression can help. The logistic regression algorithm can analyze the features of an email and predict the probability of whether the mail is spammy (dependent variable) or not.
Some features can be in spam emails like
- Presence of spammy words (e.g., “FREE”, “URGENT” “4U”)
- Sender’s email address (unknown or known contact)
- Inclusion of attachments or links
To train our logistic regression model we collect a bunch of emails these emails include spammy and non-spammy emails. Then we labeled them accordingly after that we fed these labeled data to the logistic regression model. After receiving the data, the model tries to learn the pattern from it and tries to learn to differentiate spammy emails from normal emails.
After learning from the data, the logistic regression does not just only answer yes or no, it also provides the probability between 0 (definitely not spam) and 1 (definitely spam). This probability allows us to set a threshold value. For example, if we set a threshold value of 0.6 as a spam threshold, then any email that has a probability over 0.6 is flagged as spam.
The beauty of this algorithm is that we can set the threshold value to our needs. If we want then we can even set the threshold value higher if we don’t want to miss any important email which might be flagged as spam. Inversely if we are less worried about missing important emails then we can set the threshold value lower.
Here we learn about logistic regression machine learning example in real-world situations.
Terminologies involved in Logistic Regression
The following terminology must be
understood to completely understand how logistic regression operates:
The independent variables are the
attributes, features, or qualities of the input that are used to predict the
outcomes of the dependent variable.
The dependent variable, also known as the
target variable, has values that the logistic regression model predicts or
estimates.
A formula that illustrates the
relationship between two variables—one of which is independent and the other
dependent—is called a logistic function. Through the conversion of input
variables, the logistic function generates probability values between zero and
one.
The likelihood that an event will occur
is directly correlated with its likelihood of not occurring. It presents an
alternative perspective on probability.
The log-odds, sometimes referred to as
the logit function, is the natural logarithm of the chances. A linear
combination of intercepts and independent variables that describe the log
probability of the dependent variable can be used to create a logistic regression
model.
The estimated parameters of a logistic
regression model, like a coefficient, reveal the type of relationship that
exists between the two sets of data. They quantify the impact of independent
influences on the likelihood of the result.
The log chances of setting each
independent variable to zero are the intercept which is a constant feature of
logistic regression models.
The logistic regression model's coefficients
can be predicted or estimated by using maximum likelihood estimation. Finding
the coefficients that maximize the chance of witnessing the data given the
assumptions and structure of the model is the aim.
How does Logistic Regression
work
Logistic regression
modeling converts the continuous output of the linear regression function into
values that may be categorized by using the sigmoid function. This mathematical
function, commonly known as the logistic function, is crucial in transforming
the set of real-valued independent variables into a binary range of 0 and 1.
The continuous output of
the linear regression model is transformed into probabilities using logistic
regression and the sigmoid function. These probabilities specifically indicate
the likelihood that an observation falls into a particular category. This
process aids in the classification of data into discrete categories by
efficiently employing the input variables to anticipate categorical outputs.
Let us have
independent input features in the below format cap
In this, the dependent variable Y only has binary values that are 0 and 1.
After that, we can apply the multi-linear function to the input features or variables X
The weight or coefficient in this case is denoted by w_i=[w_1,w_2,w_3,…,w_m] (w subscript 1, 2,3 and m), the X-observation is represented by x_i, the bias factor—also known as the intercept—is represented by b. The preceding equation can instead be expressed as the bias and weight dot product.
Sigmoid Function
After that let's discuss the use of the sigmoid function in this the input will be z and we need to find the probability between 0 and 1 which is predicted by y.
Logistic Regression Equation
In logistic regression, there is a chance that something occurs or something does not occur, which is
different from probability as probability is the ratio of something that possibly occurs with everything. So, it can be written as
In this, we apply the log on the odd which makes it
Assumptions for Logistic
Regression
The assumptions for
logistic regression are as follows:
Independent observations
– it means that every independent feature is independent from other features which leads to no correlation between them.
In logistic regression,
certain conditions must be met for the model to perform effectively:
Binary dependent
variables: The dependent variable in logistic regression should only be able to
take on binary or dichotomous values. If there are multiple categories, the
softmax function is utilized instead. This binary nature allows logistic
regression to classify observations into two distinct groups based on the given
input variables.
Log odds about independent factors: A
dependent variable's log odds ought to be squared about the
independent variables.
This linear association ensures that changes in the independent variables
correspond to proportional changes in the log odds of the outcome, facilitating
accurate predictions.
No outliers: This relationship can be expressed correctly using
the square of the log odds of a dependent variable about the
independent factors. Therefore, it is important to address
outliers or ensure they are not present in the dataset to maintain the model's
accuracy and reliability.
Large sample size – for training machine learning algorithms we need to have a large dataset because it takes a lot of data to learn machine learning properly.
Applying steps in logistics
regression modeling
If we want to utilize logistic regression properly for prediction, we need to follow the following structured approach:
Define the problem: At the start, we need to define the dependent variable and independent variables. Determining whether the problem involves binary classification or
not is crucial.
Preprocessing and cleansing of the data is necessary after the problem has been determined. This includes ensuring the dataset functions well with logistic regression analysis, encoding categorical variables, and managing missing values.
Exploratory data analysis (EDA): By doing EDA, one may learn more about the dataset. While spotting unusual events guarantees data integrity, visualizing connections between dependent and independent variables helps to better comprehend the data.
Feature selection:
Selecting relevant actual variables is vital for effective modeling. Features
with significant relationships to the dependent variable are retained, while
redundant or irrelevant features are removed to improve model performance.
Building a logistic regression model using the chosen real variables and estimating its coefficients using the training data.
Performance of the model is evaluated using AUC-ROC, F1-score, recall, accuracy, precision, and other measures. This makes the prediction accuracy of the model on new data evaluable.
Model improvement:
Fine-tuning the model may be necessary to enhance its performance. Adding new
features, utilizing regularization techniques, or adjusting independent
variables can all improve the quality of your model and lower the risk of
overfitting.
Model deployment: after we’re happy with the models’ result, it is ready for deployment in real-world scenarios. The model we get after training is then used to make predictions on new, unseen data, offering valuable insights for decision-making purposes.
Logistic Regression Model
Thresholding
The logistics regression
only becomes a classification technique because the threshold is brought to it. In
logistics regression setting the threshold value is very important. The
classification problem somehow depends on it quite much.
The recall and precision
values very much affect the threshold value, we want both precision and
recall to be one, but it is very difficult or even impossible for a good machine
learning model so we want the value to come as much as possible close to one.
For deciding the
precision-recall tradeoff (because when one increases another decreases), we can use
the following arguments which can help to decide the threshold:
To attain a scenario of
low precision and high recall, the focus lies on minimizing the number of false
negatives while allowing a potentially higher number of false positives. This
means selecting a decision threshold that prioritizes recall over precision. By
doing so, the aim is to capture as many true positives as possible, even if it
results in a relatively higher number of false positives.
Conversely, achieving high precision and low recall involves reducing the number of false positives while potentially accepting a higher number of false negatives. This entails choosing a decision threshold that emphasizes precision over recall. The goal is to ensure that the majority of the predicted positives are indeed true positives, even if it means sacrificing the ability to capture all actual positives, leading to a lower recall value.
Summary
Like linear regression logistic regression is also a very simple and easy-to-implement machine
learning algorithm, it is used for classification problems where we need to
check or select whether a thing belongs to one class or not. It has also a similar type of limitation as linear regression it also needs to be that there
should be a linear relationship to exist in the independent and dependent
variables and also it should have dependent data that can be used for
classification purposes. In this chapter, we also learn about advanced logistic regression.