Machine Learning Logistic Regression with Python and Scikit-learn

Feb 25, 2021

In this coding tutorial I am going to show you how to do logistic regression using Python and scikit-learn.

Logistic regression is a supervised learning classification algorithm used to predict whether an outcome occurs or not (a binary classifier). For example, it can predict whether someone is pregnant or not, or whether a tumor is cancerous or not. The logistic regression model computes the weighted sum of the input features, but instead of outputting that sum directly as linear regression does, it passes the sum through the logistic (sigmoid) function. The result is the estimated probability that y (the dependent variable) equals 1 given x (the independent variables). The sigmoid function produces an S-shaped curve that maps any real number to a value between 0 and 1, without ever reaching 0 or 1. If the predicted probability is greater than 0.5, the observation is classified as 1; otherwise it is classified as 0.
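
To make the idea concrete, here is a minimal sketch of the sigmoid step in plain NumPy. The weights, bias, and input row are made-up numbers chosen only for illustration; they do not come from the churn dataset or from a trained model.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and one input row (illustrative values only)
weights = np.array([0.8, -0.5])
bias = 0.1
x = np.array([2.0, 1.0])

z = np.dot(weights, x) + bias              # weighted sum of the input features
probability = sigmoid(z)                   # estimated probability of class 1
predicted_class = int(probability > 0.5)   # apply the 0.5 threshold

print(z, probability, predicted_class)
```

A weighted sum of exactly 0 maps to a probability of 0.5, which is why the 0.5 threshold corresponds to the sign of the weighted sum.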

Code

For this project I am using Jupyter Notebook (from Anaconda) to run my files, and I am using the churn dataset from Kaggle. You can either follow along or look at the code on my GitHub account.

Import Libraries

The first step is to import the libraries I need for the project.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Import Data

The steps below import the data and allow us to read the head of the dataset.

dataset = pd.read_csv('Churn_Modelling.csv')
dataset.head()

Exploratory Data Analysis

Some data analysis below to allow me to explore further relationships in the data.

dataset.describe()

dataset.isnull().sum()

sns.boxplot(x="Geography", y='Age', hue="Exited", data=dataset, palette="Set3")

sns.set_style('whitegrid')
sns.countplot(x='Exited', hue="Gender", data=dataset, palette='RdBu_r')

Data Encoding

In order to perform logistic regression we need to convert the gender and the country columns into numbers. Label encoding should only be used when the categorical variable has a natural order (such as age group) or only two categories (such as gender in this dataset), so for gender we will use a label encoder. In the country column there is no such relationship between the values, so in this instance we use a one-hot encoder.

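# Features: every column from index 3 up to (not including) the last; target: the last column
X = dataset.iloc[:, 3:-1].values
y = dataset.iloc[:, -1].values
print(X)

# Label-encode the gender column (index 2 of X)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X[:, 2] = le.fit_transform(X[:, 2])

# One-hot encode the country column (index 1 of X) and pass the other columns through
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])], remainder='passthrough')
X = np.array(ct.fit_transform(X))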
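
To see what the two encoders actually produce, here is a small sketch on a toy table. The four rows are invented for illustration and only borrow the column names Geography and Gender; this is not the churn data itself.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy data, invented for illustration
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
})

# Label encoding: each category becomes one integer (classes are sorted alphabetically)
gender_codes = LabelEncoder().fit_transform(toy["Gender"])

# One-hot encoding: each category becomes its own 0/1 column
geo_encoder = OneHotEncoder()
geo_onehot = geo_encoder.fit_transform(toy[["Geography"]]).toarray()

print(gender_codes)
print(geo_encoder.categories_[0])
print(geo_onehot)
```

One-hot encoding avoids implying an order between countries: every row has exactly one 1, in the column that matches its country.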

Splitting the dataset into the Training set and Test set

The training dataset is used to fit the machine learning model. The testing dataset is used to evaluate the fitted model. We want to make sure we don’t test our algorithm on the same data we trained it on, so that we get a clear picture of how well it works on data it has not seen.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)
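
As a quick check of what the split does, here is a sketch on a tiny made-up array. The stratify argument is an optional addition that is not used in the tutorial's code; it asks train_test_split to keep the class proportions the same in both sets, which may help when one class is much rarer than the other.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 features, 4 positives out of 20 (illustrative only)
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 16 + [1] * 4)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.25,      # 25% of rows go to the test set
    random_state=42,     # fixed seed so the split is reproducible
    stratify=y_demo,     # optional: keep the class ratio equal in both sets
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```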

Standard Scaler

Many machine learning algorithms perform better when numerical data is scaled to a standard range. Data may have different units (such as years, hours, months, US dollars, etc.), which may mean the variables have different scales. Differences in the scales across our data may increase the difficulty of the problem being modeled. Standardizing a dataset involves rescaling the distribution of each feature so that the mean of the observed values is 0 and the standard deviation is 1.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
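
The sketch below shows the same fit_transform / transform pattern on made-up numbers, so the effect is easy to verify by hand. The scaler learns the mean and standard deviation from the training rows only and then reuses them for the test row, which is why the test set is transformed but never fitted.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values on very different scales (for example, an age and a balance)
train = np.array([[20.0, 1000.0],
                  [30.0, 2000.0],
                  [40.0, 3000.0]])
test = np.array([[30.0, 4000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from train, then scale it
test_scaled = scaler.transform(test)        # reuse the training mean/std

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
print(test_scaled)
```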

Training the Logistic Regression model

Here we create a new instance of the LogisticRegression class and call its fit method, passing in our X_train and y_train data.

from sklearn.linear_model import LogisticRegression
logistic_regression = LogisticRegression()
logistic_regression.fit(X_train,y_train)
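
To connect the fitted model back to the sigmoid function, here is a sketch that recomputes the predicted probabilities by hand from the learned coefficients. It uses synthetic data from make_classification rather than the churn dataset, and the variable names are my own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

model = LogisticRegression()
model.fit(X_demo, y_demo)

# Weighted sum of the features plus the intercept, then the sigmoid
z = X_demo @ model.coef_.ravel() + model.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# scikit-learn's probability of class 1, for comparison
sk_proba = model.predict_proba(X_demo)[:, 1]

# Thresholding at 0.5 reproduces the class labels
manual_labels = (manual_proba > 0.5).astype(int)

print(np.allclose(manual_proba, sk_proba))
```

For a binary problem with the default settings, predict corresponds to applying the 0.5 threshold to these probabilities.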

Predictions and Evaluations

Here we look at the precision, recall, and F1 score metrics for our predictions on the test set. We also create a confusion matrix and compute the accuracy.

predictions = logistic_regression.predict(X_test)

from sklearn.metrics import classification_report
print(classification_report(y_test,predictions))

from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = logistic_regression.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)
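
To show how the numbers in the classification report relate to the confusion matrix, here is a sketch that derives precision, recall, F1, and accuracy by hand. The ten labels and predictions are made up for illustration and are not results from the churn model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions (illustrative only)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)                  # of the predicted 1s, how many were right
recall = tp / (tp + fn)                     # of the actual 1s, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that were right

print(tn, fp, fn, tp)
print(precision, recall, f1, accuracy)
```

Accuracy alone can look good when one class dominates, which is why it helps to read precision and recall for the positive class alongside it.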

Conclusion

Thank you for checking out my tutorial on logistic regression.

My favorite machine learning books:

★Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems: https://amzn.to/3kk1nbH

★Practical Statistics for Data Scientists: https://amzn.to/3uu1Igp

★Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps: https://amzn.to/3aP61eu

★Machine Learning For Absolute Beginners: A Plain English Introduction (Machine Learning from Scratch): https://amzn.to/3bwxKjm

★AI and Machine Learning for Coders: A Programmer’s Guide to Artificial Intelligence: https://amzn.to/3pPs88V