Introduction

In traditional programming, we take a problem, break it down into specific steps and figure out how to instruct the computer to perform those steps. It requires us, as the programmer, to figure out how to solve the problem, and we explicitly tell the computer how to do things. Machine learning is a whole new paradigm - rather than telling the computer what to do, we show it examples of what we'd like done, and it learns how to perform the task. It's magical - provided we can get enough examples, and assuming we have a good way for the computer to 'learn'!

There are many different learning algorithms, but for this course we're not going to go deep into the inner workings. Instead, we'll look at a few that illustrate the concepts and show you how to use these in a practical setting - you can always explore more if you're interested.

Open in Colab and follow along:


Meet the Data

We're in a hurry to get to the machine learning part, but before we do, it's time to put some of Lesson 2 into practice. Let's load the data, do a minimal inspection and see if there are any issues that need attention before we can get to modelling.

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

The dataset contains information on different car models. Specifically, it shows the fuel efficiency, which is what we'll be trying to predict later in this lesson. The data description has more information if you're curious. Let's dive in:

mtcars = pd.read_csv('https://gist.githubusercontent.com/wmeints/80c1ba22ceeb7a29a0e5e979f0b0afba/raw/8629fe51f0e7642fc5e05567130807b02a93af5e/auto-mpg.csv')
print(mtcars.shape)
mtcars.head()
(398, 9)
mpg cylinders displacement horsepower weight acceleration model year origin car name
0 18.0 8 307.0 130 3504 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165 3693 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150 3436 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150 3433 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140 3449 10.5 70 1 ford torino

Looking good so far. Let's check for missing data:

mtcars.isna().sum()
mpg             0
cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
model year      0
origin          0
car name        0
dtype: int64

Apparently no missing data. Next let's look at some summary statistics:

mtcars.describe()
mpg cylinders displacement weight acceleration model year origin
count 398.000000 398.000000 398.000000 398.000000 398.000000 398.000000 398.000000
mean 23.514573 5.454774 193.425879 2970.424623 15.568090 76.010050 1.572864
std 7.815984 1.701004 104.269838 846.841774 2.757689 3.697627 0.802055
min 9.000000 3.000000 68.000000 1613.000000 8.000000 70.000000 1.000000
25% 17.500000 4.000000 104.250000 2223.750000 13.825000 73.000000 1.000000
50% 23.000000 4.000000 148.500000 2803.500000 15.500000 76.000000 1.000000
75% 29.000000 8.000000 262.000000 3608.000000 17.175000 79.000000 2.000000
max 46.600000 8.000000 455.000000 5140.000000 24.800000 82.000000 3.000000

Where is horsepower? If a column doesn't show up in the output of describe, it's a good bet that it's been interpreted as a string column or some other non-numeric type. Use .info() to verify this:

mtcars.info() # Can you see the problem?

Looking into it further, we discover that horsepower has some '?'s in it. Let's fix this by first marking the problematic values as missing, then converting the column to a numeric type, and finally replacing the missing values with the median value.

# Can you explain each line?
mtcars['horsepower'] = mtcars['horsepower'].replace('?', np.nan) # Mark the '?' entries as missing (NaN)
mtcars['horsepower'] = mtcars['horsepower'].astype(float) # Would int be OK here?
mtcars['horsepower'] = mtcars['horsepower'].fillna(mtcars['horsepower'].median())
mtcars.describe() # Now we're talking
mpg cylinders displacement horsepower weight acceleration model year origin
count 398.000000 398.000000 398.000000 398.000000 398.000000 398.000000 398.000000 398.000000
mean 23.514573 5.454774 193.425879 104.304020 2970.424623 15.568090 76.010050 1.572864
std 7.815984 1.701004 104.269838 38.222625 846.841774 2.757689 3.697627 0.802055
min 9.000000 3.000000 68.000000 46.000000 1613.000000 8.000000 70.000000 1.000000
25% 17.500000 4.000000 104.250000 76.000000 2223.750000 13.825000 73.000000 1.000000
50% 23.000000 4.000000 148.500000 93.500000 2803.500000 15.500000 76.000000 1.000000
75% 29.000000 8.000000 262.000000 125.000000 3608.000000 17.175000 79.000000 2.000000
max 46.600000 8.000000 455.000000 230.000000 5140.000000 24.800000 82.000000 3.000000

Now, we could do all sorts of exploring: looking for correlations, checking distributions for outliers or suspicious values, grouping by year or origin to see how those affect things... but for now let's pick one plot that shows some relationships and gives a nice overview: the scatter matrix. Note: this takes a while, so be careful using it on large datasets.

pd.plotting.scatter_matrix(mtcars, figsize=(16, 16))
plt.show()

Starting Simple: Linear Regression

We're going to start with possibly THE simplest model - a straight line. If you've done high-school maths, you'll remember that we can specify a straight line on a graph of x vs y with the equation $y = mx + c$ (or $y = \beta_0 + \beta_1 x$, or ...). We have a gradient ($m$) and an intercept ($c$) which together tell us what the y value will be for a given input x. We can think of these two numbers as the model parameters. If we know the parameters, we can calculate the outputs given a set of inputs. Since we're doing machine learning, we won't know the parameters - instead, we'll start with a list of inputs and outputs and try to figure out how to make the computer find the best model parameters. Let's give it a go.

Let's begin by looking at a single input column: the weight. You would expect the fuel efficiency to go down for heavier cars, and indeed this is generally the case. Here, we plot fuel efficiency vs weight and, over the top, a straight line with values for the intercept and gradient guessed by trial and error. It's not a very good fit - can you improve it by changing the values?

ax = mtcars.plot(x='weight', y='mpg', kind='scatter') # Keep track of the plotting axis

# Plot a straight line over the top
x = mtcars['weight'].values
intercept = 53
gradient = -0.008
y_pred = intercept + gradient*mtcars['weight']
ax.plot(x, y_pred, c='red')

Rather than using trial and error, we can instead simply ask the computer to find the 'line of best fit' - the line which minimises the overall error, i.e. the distance between the line's predictions and the actual points. There are many libraries which can fit a straight line to some data, but we'll use scikit-learn's linear regression model, as it mimics the syntax we'll use for more complex models later on.

from sklearn.linear_model import LinearRegression
model = LinearRegression() # Create the model
x = mtcars['weight'].values # Our inputs
y = mtcars['mpg'] # Desired outputs
model.fit(x.reshape(-1, 1), y) # sklearn expects a 2D array of inputs (rows x features) - try leaving out .reshape and see what happens

# Print the model parameters
print('Intercept: ', model.intercept_)
print('Gradient: ', model.coef_)
Intercept:  46.31736442026565
Gradient:  [-0.00767661]
 

You can see that the model has 'learnt' the intercept and gradient that best describe this relationship. We could use these to calculate the predicted fuel efficiency for a 3000 pound car like so:

46.31736442026565 + (-0.00767661)*3000
23.287534420265647

But that's the hard way! We can instead simply call model.predict():

model.predict([[3000]])
array([23.28753423])

Multiple Regression

Looking at weight is a good start, but what if we have multiple inputs? We know heavier cars are less efficient in general, but it also matters how old the car is, how fast it goes and so on. The nice thing with our linear model is that we can simply add these factors together - we have one intercept and then a different gradient for each input. For N inputs, we'll have N+1 parameters to learn (one for each input and one for the intercept). Here that is in practice:

X = mtcars[['weight', 'acceleration']].values # Our inputs - capital X since we have more than one
y = mtcars['mpg'] # Desired outputs

model = LinearRegression() # Create the model
model.fit(X, y) # No need to reshape any more

# Print the model parameters
print('Intercept: ', model.intercept_)
print('Gradients: ', model.coef_) # Note: one for each input
Intercept:  41.399828302000174
Gradients:  [-0.00733564  0.25081589]
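
Just as before, we can ask the model for a prediction - we just need to supply a value for every input, in the same order as the columns used for training. For example, for a (made-up) 3000 pound car with an acceleration figure of 15:

model.predict([[3000, 15]]) # weight, acceleration - the values here are just an illustration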

Metrics for Regression

We're creating different models, so we should figure out how to tell which is best. For a classification task (where the output is one of two or more classes) we can use something like accuracy: how many answers did our model get right out of the total? For regression (predicting a continuous variable) we need a different measure, or metric, and there are several to choose from.

Let's split our data into a training set and a test set (so that we can measure performance on data the model hasn't seen before) and investigate the different options.

from sklearn.model_selection import train_test_split

# Define our inputs (X) and our output (y)
X = mtcars.drop(['mpg', 'car name'], axis=1)
y = mtcars['mpg']

# We split our data, so we see how well any models we make do on the 'test' set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
LinearRegression()

1) R-Squared

$R^2$ or 'R-Squared' is a measure of how well a model explains the variance in the outputs - that is, how closely do the predictions follow the actual values. Higher is better, and a value of 1 is a perfect score. For sklearn regression models, $R^2$ is built in as the default score() function:

model.score(X_test, y_test) # .score() calculates the predictions and compares them to the true values
0.8442527203494318
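
Under the hood, score() for sklearn regressors is the same $R^2$ calculation that sklearn.metrics exposes as r2_score, so computing it explicitly should give the same number:

from sklearn.metrics import r2_score
r2_score(y_test, model.predict(X_test)) # Should match model.score(X_test, y_test) above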

You can get a better intuition for how well a model is doing by looking at a scatter plot of predicted vs actual - a perfect model would get all answers right, so this would look like a straight line. Our model does pretty well:

plt.scatter(model.predict(X), y, alpha=0.5) # Plot preds vs actual
plt.plot(y, y, c='red') # All predictions should lie on this line for a perfect model
# Try the same plot using only the test-set predictions (model.predict(X_test) vs y_test)

2) Mean Absolute Error

Another way to quantify performance is to look at how wrong the model is on average. One popular choice for this is the mean absolute error - on average, how far off is the model's prediction from the actual value? We could calculate the size of the error (the 'absolute error') for each prediction and take the average of those, or we could use sklearn's mean_absolute_error function:

from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, model.predict(X_test))
2.3327229952681265

You can interpret this as follows: if the model predicts the fuel efficiency for a car, it is off by ~2.3 mpg on average. Not bad!
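
If you'd like to convince yourself there's nothing magic happening, the same number falls out of a one-line manual calculation:

np.abs(y_test - model.predict(X_test)).mean() # Average size of the errors - should match mean_absolute_error above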

3) Mean Squared Error and Root Mean Squared Error

Sometimes we especially care about large errors. Imagine a situation where one model is off by about 1 mpg all the time (on average) and another gets within 0.5 mpg most of the time but is occasionally wrong by 10 mpg or more. The second might have the lower mean absolute error, but we may not want something that can make such a huge mistake. By squaring the errors, we assign a much larger penalty to relatively large errors and pay less attention to smaller ones. This is the idea behind Mean Squared Error (MSE) and Root Mean Squared Error (RMSE). Again, sklearn does this all for us:

from sklearn.metrics import mean_squared_error
rmse = mean_squared_error(y_test, model.predict(X_test), squared=False) # squared=True -> MSE
rmse
2.994051533251487
# Use the above metrics to see which models do best.
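
To see why squaring the errors matters, here's a tiny made-up example along the lines of the scenario above: one model that is always off by about 1 mpg, and one that is usually very close but occasionally makes a big mistake:

steady_errors = np.full(20, 1.0) # Always off by about 1 mpg
spiky_errors = np.array([0.1]*19 + [10.0]) # Usually very close, but with one 10 mpg miss
print('MAE: ', steady_errors.mean(), spiky_errors.mean()) # The 'spiky' model has the lower MAE...
print('RMSE:', np.sqrt((steady_errors**2).mean()), np.sqrt((spiky_errors**2).mean())) # ...but the big miss blows up its RMSE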

Decision Trees and Random Forests

The next model we'll look at is a little more complicated than linear regression, but also more intuitive: the decision tree. Rather than fitting an equation, a decision tree learns a series of simple yes/no questions about the inputs (for example, 'is the weight greater than 3,000?'). Each question splits the data into smaller groups; to predict the mpg of a new car, we follow the questions down to a leaf and return the average mpg of the training cars that landed there. The maximum depth of the tree controls how many questions it can chain together.

Decision Trees

from sklearn.tree import DecisionTreeRegressor, plot_tree

# We reuse the train/test split (X_train, X_test, y_train, y_test) from the metrics section
dtree = DecisionTreeRegressor(max_depth=3)
dtree.fit(X_train, y_train)
print('Train score:', dtree.score(X_train, y_train))
print('Test score:', dtree.score(X_test, y_test))

plt.scatter(dtree.predict(X_test), y_test, alpha=0.5) # Plot the tree's predictions vs the actual values
plt.plot(y, y, c='red') # All predictions should lie on this line for a perfect model
plt.show()

Try a deeper tree (max_depth=12) or leave out the max_depth parameter - what happens to the train score and the test score?
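
If you'd like to see the effect systematically, here's a quick sketch that loops over a few depths (None means no limit) and prints the scores - typically the train score climbs towards 1 while the test score stalls or drops, which is overfitting:

for depth in [1, 3, 6, 12, None]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f'max_depth={depth}: train={tree.score(X_train, y_train):.3f}, test={tree.score(X_test, y_test):.3f}')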

fig, ax = plt.subplots(figsize=(12, 8)) # Optional way to make the plot bigger
plot_tree(dtree)
plt.show()
 

Ensembles: Teamwork for the win!

Decision trees are powerful, but they have a weakness: left unchecked, a single deep tree will happily memorise the training data, scoring almost perfectly on the training set while doing much worse on data it hasn't seen. This is overfitting, and you'll have spotted it if you tried the deeper tree above.

One very effective remedy is an ensemble: instead of relying on one model, we train many different models and combine (for regression, average) their predictions. As long as the models are reasonably diverse, their individual mistakes tend to cancel out.

A random forest is exactly this: an ensemble of decision trees, each trained on a random sample of the rows (and considering a random subset of the features at each split), with the final prediction being the average over all the trees.

from sklearn.ensemble import RandomForestRegressor
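
As a minimal sketch (the settings here are just illustrative), we can fit a random forest in exactly the same way as the single tree, using the same train/test split, and compare the scores:

rf = RandomForestRegressor(n_estimators=100, random_state=42) # An ensemble of 100 trees, predictions averaged
rf.fit(X_train, y_train)
print('Train score:', rf.score(X_train, y_train))
print('Test score:', rf.score(X_test, y_test))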
 

Feature Engineering

We've already done one piece of feature engineering: filling the missing horsepower values with the median. That's a quick fix; depending on the problem you might prefer something smarter, such as imputing a value based on related columns (cars with similar displacement and cylinders probably have similar horsepower).

Another obvious candidate is the origin column. It looks numeric (1, 2 or 3), but it's really a categorical code for where the car was made (see the data description), so treating it as a number implies an ordering and spacing that doesn't really exist. A common fix is to one-hot encode it - one 0/1 column per category - as sketched below.
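
Here's a sketch of what that encoding might look like (pd.get_dummies is one simple option; sklearn's OneHotEncoder is another):

origin_dummies = pd.get_dummies(mtcars['origin'], prefix='origin') # One 0/1 column per origin code
mtcars_encoded = pd.concat([mtcars.drop('origin', axis=1), origin_dummies], axis=1)
mtcars_encoded.head()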

Cheating with PyCaret
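
Libraries like PyCaret can automate much of what we did by hand above: point them at a dataframe and a target column and they will train and compare a whole collection of models for you. A minimal sketch, assuming you have pycaret installed (the exact API and output depend on your PyCaret version):

from pycaret.regression import setup, compare_models

setup(data=mtcars.drop('car name', axis=1), target='mpg', session_id=42) # session_id makes the run reproducible
best_model = compare_models() # Trains and cross-validates many regressors, returning the best one
print(best_model)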

 

Exercises

import sklearn.datasets
data = sklearn.datasets.load_boston() # Loading the data - it's built in to sklearn! (Note: load_boston was removed in scikit-learn 1.2; on newer versions try sklearn.datasets.fetch_california_housing() instead)
print('Data dictionary keys:', data.keys()) # This is a dictionary containing our features, target, feature names, a description etc
boston = pd.DataFrame(data['data'], columns=data['feature_names']) # Convert to a dataframe
boston['target'] = data['target'] # Add the target column
print(boston.shape)
boston.head()
# Do a brief explore of the data
# Which columns have the highest correlation with the target?
# Fit a linear model
# Plot predicted vs actual
 

If you'd like a lovely introduction to the idea of machine learning and deep learning, the first lesson in the fastai course is excellent - you could also save that for a follow-on to lesson 5. And lesson 5 goes much deeper into Random Forests and is also highly recommended.