📊 Logistic Regression in Python




Viviana Márquez
http://vivianamarquez.com

✅ Today's Goals:

• Learn what Logistic Regression is.

• Learn when to use Logistic Regression.

• Build a machine learning model on a real-world application in Python.

Logistic Regression

• 📸 Most famous machine learning algorithm after linear regression.


🥊 Linear Regression vs Logistic Regression



Linear regression is used to predict/forecast continuous values.

Logistic regression is used for classification tasks (discrete values: yes/no, dead/alive, pass/fail, ham/spam).

[Recap] Linear Regression



$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n$,

where $y$ is the dependent variable and $x_1,x_2,...,x_n$ are the explanatory variables.



• In Linear Regression, the predicted value can be anywhere between $-\infty$ to $\infty$.

• For Logistic Regression, we need the values to be between 0 and 1.

Our friend: The Sigmoid Function


$y = \dfrac{1}{1+e^{-x}}$

• Applying the sigmoid function to the linear regression equation, we obtain logistic regression:


$y = \dfrac{1}{1+e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n)}}$
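
A minimal sketch (assuming NumPy is installed) to see the sigmoid in action: no matter how large or small the linear combination is, the output always lands strictly between 0 and 1, which is what lets us read it as a probability.

import numpy as np

def sigmoid(z):
    # Squash any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

# Very negative inputs approach 0, very positive inputs approach 1
for z in [-10, -1, 0, 1, 10]:
    print(z, round(sigmoid(z), 4))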

🛂 Learning check-point:

What should I use for this problem?



• Determining the price of houses.

• Determining whether a song is rock or jazz.

• Predicting the weight of a person.

• Predicting if a customer is going to make a purchase or not.

Machine Learning Model time!


Goal



Predict whether a Pokémon is a Water or a Fire type
In [67]:
import pandas as pd

# Load data
data = pd.read_csv("Pokemon.csv")

# Clean data
filter_pokemon = ["Water", "Fire"]
data = data[data['Type 1'].isin(filter_pokemon)]
data = data.reset_index()
data = data.drop(['Type 2', 'Total','Generation','Legendary', "#", "index"], axis=1)

# Preview data
data.head()
Out[67]:
Name Type 1 HP Attack Defense Sp. Atk Sp. Def Speed
0 Charmander Fire 39 52 43 60 50 65
1 Charmeleon Fire 58 64 58 80 65 80
2 Charizard Fire 78 84 78 109 85 100
3 CharizardMega Charizard X Fire 78 130 111 130 85 100
4 CharizardMega Charizard Y Fire 78 104 78 159 115 100
In [68]:
X = data[data.columns[2:]]
y = data['Type 1']
In [69]:
X.head()
Out[69]:
HP Attack Defense Sp. Atk Sp. Def Speed
0 39 52 43 60 50 65
1 58 64 58 80 65 80
2 78 84 78 109 85 100
3 78 130 111 130 85 100
4 78 104 78 159 115 100
In [70]:
y.head()
Out[70]:
0    Fire
1    Fire
2    Fire
3    Fire
4    Fire
Name: Type 1, dtype: object
In [71]:
# Split data

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
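
Note that the split is random, so the accuracy and confusion matrix later in this notebook can change from run to run. One possible variation (the random_state value below is arbitrary): random_state makes the split reproducible, and stratify=y keeps the Water/Fire ratio roughly the same in both halves.

# Optional: reproducible, class-balanced split (random_state value is arbitrary)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)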

🛂 What is the shape of X_train, X_test, y_train, y_test?

In [72]:
print(X.shape)
print(X_train.shape)
print(X_test.shape)
(164, 6)
(131, 6)
(33, 6)
In [73]:
print(y.shape)
print(y_train.shape)
print(y_test.shape)
(164,)
(131,)
(33,)
In [74]:
# Model

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()

logreg.fit(X_train,y_train)
/anaconda3/envs/ml/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
Out[74]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
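
The FutureWarning above just means no solver was specified: in this scikit-learn version the default solver is 'liblinear', and it will change to 'lbfgs' in 0.22. One way to silence it, keeping behaviour equivalent to the default used here, is to name the solver explicitly:

# Explicit solver silences the FutureWarning; 'liblinear' suits small binary problems
logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train, y_train)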
In [75]:
y_pred = logreg.predict(X_test)

Is our model working?

In [76]:
from sklearn import metrics

metrics.accuracy_score(y_test, y_pred)
Out[76]:
0.7878787878787878


Charmeleon

In [77]:
data[data['Name']=="Charmeleon"][data.columns[2:]]
Out[77]:
HP Attack Defense Sp. Atk Sp. Def Speed
1 58 64 58 80 65 80
In [78]:
# Predict Charmeleon
logreg.predict(data[data['Name']=="Charmeleon"][data.columns[2:]])
Out[78]:
array(['Water'], dtype=object)
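
Charmeleon is actually a Fire Pokémon (see the row above), so this prediction is wrong. A quick way to see how close the call was is predict_proba, which returns one probability per class, ordered as in logreg.classes_:

# Probability assigned to each class ('Fire', 'Water') for Charmeleon
charmeleon = data[data['Name']=="Charmeleon"][data.columns[2:]]
print(logreg.classes_)
print(logreg.predict_proba(charmeleon))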


Wartortle

In [79]:
# Predict Wartortle
logreg.predict(data[data['Name']=="Wartortle"][data.columns[2:]])
Out[79]:
array(['Water'], dtype=object)

How to improve our model?

In [80]:
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
Out[80]:
array([[ 3,  5],
       [ 2, 23]])
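
With the labels sorted alphabetically (Fire, Water), the first row says that of the 8 Fire Pokémon in this test split only 3 were classified correctly, while Water is caught almost every time. Per-class metrics make that weakness explicit; a short sketch (exact numbers will vary with the random split):

# Precision/recall per class; recall for 'Fire' will be low on this split
print(metrics.classification_report(y_test, y_pred))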
In [81]:
data['Type 1'].value_counts()
Out[81]:
Water    112
Fire      52
Name: Type 1, dtype: int64
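
The counts above show why: Water Pokémon outnumber Fire roughly two to one, so the model leans towards predicting Water. As a small preview of next time's topic, one knob worth trying (an illustrative sketch, not the only remedy) is class_weight='balanced', which re-weights training samples inversely to class frequency:

# Re-weight classes so the minority 'Fire' class counts more during training
logreg_bal = LogisticRegression(solver='liblinear', class_weight='balanced')
logreg_bal.fit(X_train, y_train)
print(metrics.accuracy_score(y_test, logreg_bal.predict(X_test)))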

Next time: Dealing with unbalanced data sets.