📊 Logistic Regression in Python




Viviana Márquez
http://vivianamarquez.com

✅ Today's Goals:

• Learn what Logistic Regression is.

• Learn when to use Logistic Regression.

• Build a machine learning model on a real-world application in Python.

Logistic Regression

• 📸 Most famous machine learning algorithm after linear regression.


🥊 Linear Regression vs Logistic Regression



Linear regression is used to predict/forecast continuous values.

Logistic regression is used for classification tasks (discrete values: yes/no, dead/alive, pass/fail, ham/spam).

[Recap] Linear Regression



$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n$,

where $y$ is the dependent variable and $x_1,x_2,...,x_n$ are the explanatory variables.



• In Linear Regression, the predicted value can be anywhere between $-\infty$ to $\infty$.

• For Logistic Regression, we need the values to be between 0 and 1.

Our friend: The Sigmoid Function


$y = \dfrac{1}{1+e^{-x}}$

• Applying the sigmoid function to the linear regression equation, we obtain logistic regression:


$y = \dfrac{1}{1+e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n)}}$
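
A minimal sketch (assuming NumPy is installed) to see the sigmoid in action: no matter how large or small the linear combination is, the output always lands strictly between 0 and 1, which is what lets us read it as a probability.

import numpy as np

def sigmoid(z):
    # Squash any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

# Very negative inputs approach 0, very positive inputs approach 1
for z in [-10, -1, 0, 1, 10]:
    print(z, round(sigmoid(z), 4))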

🛂 Learning check-point:

What should I use for this problem?



• Determining the price of houses.

• Determining whether a song is rock or jazz.

• Predicting the weight of a person.

• Predicting if a customer is going to make a purchase or not.

Machine Learning Model time!


Goal



Predict whether a Pokémon is a Water or a Fire type
In [67]:
import pandas as pd

# Load data
data = pd.read_csv("Pokemon.csv")

# Clean data
filter_pokemon = ["Water", "Fire"]
data = data[data['Type 1'].isin(filter_pokemon)]
data = data.reset_index()
data = data.drop(['Type 2', 'Total','Generation','Legendary', "#", "index"], axis=1)

# Preview data
data.head()
Out[67]:
Name Type 1 HP Attack Defense Sp. Atk Sp. Def Speed
0 Charmander Fire 39 52 43 60 50 65
1 Charmeleon Fire 58 64 58 80 65 80
2 Charizard Fire 78 84 78 109 85 100
3 CharizardMega Charizard X Fire 78 130 111 130 85 100
4 CharizardMega Charizard Y Fire 78 104 78 159 115 100
In [68]:
X = data[data.columns[2:]]
y = data['Type 1']
In [69]:
X.head()
Out[69]:
HP Attack Defense Sp. Atk Sp. Def Speed
0 39 52 43 60 50 65
1 58 64 58 80 65 80
2 78 84 78 109 85 100
3 78 130 111 130 85 100
4 78 104 78 159 115 100
In [70]:
y.head()
Out[70]:
0    Fire
1    Fire
2    Fire
3    Fire
4    Fire
Name: Type 1, dtype: object
In [71]:
# Split data

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
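
Note that the split is random, so the accuracy and confusion matrix later in this notebook can change from run to run. One possible variation (the random_state value below is arbitrary): random_state makes the split reproducible, and stratify=y keeps the Water/Fire ratio roughly the same in both halves.

# Optional: reproducible, class-balanced split (random_state value is arbitrary)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)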

🛂 What is the shape of X_train, X_test, y_train, y_test?

In [72]:
print(X.shape)
print(X_train.shape)
print(X_test.shape)
(164, 6)
(131, 6)
(33, 6)
In [73]:
print(y.shape)
print(y_train.shape)
print(y_test.shape)
(164,)
(131,)
(33,)
In [74]:
# Model

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()

logreg.fit(X_train,y_train)
/anaconda3/envs/ml/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
Out[74]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
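
The FutureWarning above just means no solver was specified: in this scikit-learn version the default solver is 'liblinear', and it will change to 'lbfgs' in 0.22. One way to silence it, keeping behaviour equivalent to the default used here, is to name the solver explicitly:

# Explicit solver silences the FutureWarning; 'liblinear' suits small binary problems
logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train, y_train)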
In [75]:
y_pred = logreg.predict(X_test)

Is our model working?

In [76]:
from sklearn import metrics

metrics.accuracy_score(y_test, y_pred)
Out[76]:
0.7878787878787878


Charmeleon

In [77]:
data[data['Name']=="Charmeleon"][data.columns[2:]]
Out[77]:
HP Attack Defense Sp. Atk Sp. Def Speed
1 58 64 58 80 65 80
In [78]:
# Predict Charmeleon
logreg.predict(data[data['Name']=="Charmeleon"][data.columns[2:]])
Out[78]:
array(['Water'], dtype=object)
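
Charmeleon is actually a Fire Pokémon (see the row above), so this prediction is wrong. A quick way to see how close the call was is predict_proba, which returns one probability per class, ordered as in logreg.classes_:

# Probability assigned to each class ('Fire', 'Water') for Charmeleon
charmeleon = data[data['Name']=="Charmeleon"][data.columns[2:]]
print(logreg.classes_)
print(logreg.predict_proba(charmeleon))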


Wartortle

In [79]:
# Predict Wartortle
logreg.predict(data[data['Name']=="Wartortle"][data.columns[2:]])
Out[79]:
array(['Water'], dtype=object)

How to improve our model?

In [80]:
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
Out[80]:
array([[ 3,  5],
       [ 2, 23]])
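
With the labels sorted alphabetically (Fire, Water), the first row says that of the 8 Fire Pokémon in this test split only 3 were classified correctly, while Water is caught almost every time. Per-class metrics make that weakness explicit; a short sketch (exact numbers will vary with the random split):

# Precision/recall per class; recall for 'Fire' will be low on this split
print(metrics.classification_report(y_test, y_pred))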
In [81]:
data['Type 1'].value_counts()
Out[81]:
Water    112
Fire      52
Name: Type 1, dtype: int64
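
The counts above show why: Water Pokémon outnumber Fire roughly two to one, so the model leans towards predicting Water. As a small preview of next time's topic, one knob worth trying (an illustrative sketch, not the only remedy) is class_weight='balanced', which re-weights training samples inversely to class frequency:

# Re-weight classes so the minority 'Fire' class counts more during training
logreg_bal = LogisticRegression(solver='liblinear', class_weight='balanced')
logreg_bal.fit(X_train, y_train)
print(metrics.accuracy_score(y_test, logreg_bal.predict(X_test)))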

Next time: Dealing with unbalanced data sets.