Introduction
In traditional programming, we take a problem, break it down into specific steps and figure out how to instruct the computer to perform those steps. It requires us as the programmer to figure out how to solve the problem, and we explicitly tell the computer how to do things. Machine learning is a whole new paradigm - rather than telling the computer what to do, we show it examples of what we'd like done, and it learns how to perform the task. It's magical - provided we can get enough examples, and assuming we have a good way for the computer to 'learn'!
There are many different learning algorithms, but for this course we're not going to go deep into the inner workings. Instead, we'll look at a few that illustrate the concepts and show you how to use these in a practical setting - you can always explore more if you're interested.
Open in Colab and follow along:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
The dataset contains information on different car models. Specifically, it shows the fuel efficiency, which is what we'll be trying to predict later in this lesson. The data description has more information if you're curious. Let's dive in:
mtcars = pd.read_csv('https://gist.githubusercontent.com/wmeints/80c1ba22ceeb7a29a0e5e979f0b0afba/raw/8629fe51f0e7642fc5e05567130807b02a93af5e/auto-mpg.csv')
print(mtcars.shape)
mtcars.head()
Looking good so far. Let's check for missing data:
mtcars.isna().sum()
Apparently no missing data. Next let's look at some summary statistics:
mtcars.describe()
Where is horsepower? If a column doesn't show up in the output of describe, it's a good bet that it's been interpreted as a string column or some other non-numeric type. Use .info() to verify this:
# Can you see the problem?
mtcars.info()
Looking into it further, we discover that horsepower has some '?'s in it. Let's fix this by first replacing the problematic '?' values with NaN, then converting the column to a numeric type, and finally filling the missing values with the median value.
# Can you explain each line?
mtcars['horsepower'] = mtcars['horsepower'].replace('?', np.nan) # Replacing with NaN (use np.nan - the np.NaN alias was removed in NumPy 2.0)
mtcars['horsepower'] = mtcars['horsepower'].astype(float) # Would int be OK here?
mtcars['horsepower'] = mtcars['horsepower'].fillna(mtcars['horsepower'].median())
mtcars.describe() # Now we're talking
Now, we could do all sorts of exploring. Looking for correlation, checking distributions for outliers or suspicious values, grouping by year or origin to see how those affect things... but for now let's pick one plot that will show some relationships and give a nice overview: the scatter matrix. Note: this takes a while, so be careful using it for large datasets...
pd.plotting.scatter_matrix(mtcars, figsize=(16, 16))
plt.show()
Starting Simple: Linear Regression
We're going to start with possibly THE simplest model - a straight line. If you've done high-school maths, you'll remember that we can specify a straight line on a graph of x vs y with the equation $y = mx + c$ (or $y = \beta_0 + \beta_1 x$, or ...). We have a gradient ($m$) and an intercept ($c$) which together tell us what the y value will be for a given input x. We can think of these two numbers as the model parameters. If we know the parameters, we can calculate the outputs given a set of inputs. Since we're doing machine learning, we won't know the parameters - instead, we'll start with a list of inputs and outputs and try to figure out how to make the computer find the best model parameters. Let's load some data and give it a go.
Let's begin by looking at a single input column: the weight. You would expect the fuel efficiency to go down for heavier cars, and indeed this is generally the case. Here, we plot fuel efficiency vs weight, and over the top we plot a straight line with values for the intercept and gradient that were guessed using trial and error. It's not a very good fit - can you improve it by changing the values?
ax = mtcars.plot(x='weight', y='mpg', kind='scatter') # Keep track of the plotting axis
# Plot a straight line over the top
x = mtcars['weight'].values
intercept = 53
gradient = -0.008
y_pred = intercept + gradient*mtcars['weight']
ax.plot(x, y_pred, c='red')
Rather than using trial and error, we can instead simply ask the computer to find the 'line of best fit' - the line which minimises the distance between the line and all the points (the 'error'). There are many libraries which can be used to fit a straight line to some data, but we'll use scikit-learn's linear regression model as it mimics the syntax we'll use for more complex models later on.
from sklearn.linear_model import LinearRegression
model = LinearRegression() # Create the model
x = mtcars['weight'].values # Our inputs
y = mtcars['mpg'] # Desired outputs
model.fit(x.reshape(-1, 1), y) # Try leaving out .reshape - sklearn expects a 2-D input (rows = samples, columns = features), even with just one feature
# Print the model parameters
print('Intercept: ', model.intercept_)
print('Gradient: ', model.coef_)
You can see that the model has 'learnt' the intercept and gradient that best describe this relationship. We could use these to calculate the predicted fuel efficiency for a 3000 pound car like so:
46.31736442026565 + (-0.00767661)*3000
But that's the hard way! We can instead simply call model.predict():
model.predict([[3000]])
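The double brackets are there because predict() expects a 2-D array: one row per car, one column per input feature. That also means we can predict for several cars in a single call - a quick sketch with some made-up weights:
model.predict([[2000], [3000], [4000]]) # One prediction per row - heavier cars should get lower mpg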
Multiple Regression
Looking at weight is a good start, but what if we have multiple inputs? We know heavier cars are less efficient in general, but it also matters how old the car is, how fast it goes and so on. The nice thing with our linear model is that we can simply add these factors together - we have one intercept and then a different gradient for each input. For N inputs, we'll have N+1 parameters to learn (one for each input and one for the intercept). Here that is in practice:
X = mtcars[['weight', 'acceleration']].values # Our inputs - capital X since we have more than one
y = mtcars['mpg'] # Desired outputs
model = LinearRegression() # Create the model
model.fit(X, y) # No need to reshape any more
# Print the model parameters
print('Intercept: ', model.intercept_)
print('Gradients: ', model.coef_) # Note: one for each input
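As before, we can sanity-check the model by combining the intercept with each gradient multiplied by the matching input. The weight and acceleration values below are just made-up examples, so don't read too much into the exact numbers:
example = np.array([3000, 15]) # A hypothetical weight and acceleration
manual_pred = model.intercept_ + np.dot(model.coef_, example) # intercept + gradient1*weight + gradient2*acceleration
print(manual_pred)
print(model.predict([[3000, 15]])) # Should give the same answer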
Metrics for Regression
We're creating different models - we should figure out how to tell which is best. For a classification task (where the output is one of two or more classes) we can use something like accuracy - how many answers did our model get right out of the total. For regression (predicting a continuous variable) we need a different measure, or metric, and there are several to choose from.
Let's split our data into a training set and a test set (so that we can measure performance on data the model hasn't seen before) and investigate the different options.
from sklearn.model_selection import train_test_split
# Define our inputs (X) and our output (y)
X = mtcars.drop(['mpg', 'car name'], axis=1)
y = mtcars['mpg']
# We split our data, so we see how well any models we make do on the 'test' set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
1) R-Squared
$R^2$ or 'R-Squared' is a measure of how well a model explains the variance in the outputs - that is, how closely do the predictions follow the actual values. Higher is better, and a value of 1 is a perfect score. For sklearn regression models, $R^2$ is built in as the default score() function:
model.score(X_test, y_test) # .score() calculates the predictions and compares them to the true values
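If you'd rather calculate the metric yourself, it's also available as a standalone function in sklearn.metrics - this should give the same value as the score() call above:
from sklearn.metrics import r2_score
r2_score(y_test, model.predict(X_test)) # Same as model.score(X_test, y_test)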
You can get a better intuition for how well a model is doing by looking at a scatter plot of predicted vs actual - a perfect model would get all answers right, so this would look like a straight line. Our model does pretty well:
plt.scatter(model.predict(X), y, alpha=0.5) # Plot preds vs actual
plt.plot(y, y, c='red') # All predictions should lie on this line for a perfect model
# Plot the above for its predictions
2) Mean Absolute Error
Another way to quantify performance is to look at a measure of how wrong the model is on average. One popular choice for this is the mean absolute error - on average, how far off is the model's prediction from the actual value? We could calculate the size of the error (the 'absolute error') for each prediction and take the average of those, or we could use sklearn's mean_absolute_error function:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, model.predict(X_test))
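To see that there's nothing mysterious going on, here's the same calculation done by hand - find each prediction's error, take the absolute value, and average:
errors = y_test - model.predict(X_test) # How far off is each prediction?
np.abs(errors).mean() # Should match mean_absolute_error above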
You can interpret this as follows: if the model predicts the fuel efficiency for a car, it is off by ~2.3 mpg on average. Not bad!
3) Mean Squared Error and Root Mean Squared Error
Sometimes, we especially care about large errors. Imagine a situation where one model is off by about 1 mpg all the time (on average) and another gets within 0.5 mpg most of the time but is occasionally wrong by 10 mpg or more. The second might have a lower mean error, but we don't want something that could make such a huge mistake. By squaring the errors, we assign a much larger penalty to relatively large errors and pay less attention to smaller errors. This is the thought behind Mean Squared Error (MSE) and Root Mean Squared Error (RMSE). Again, sklearn does this all for us:
from sklearn.metrics import mean_squared_error
rmse = mean_squared_error(y_test, model.predict(X_test), squared=False) # squared=True -> MSE
rmse
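We can do this one by hand too - square the errors, average them (that's the MSE), then take the square root. This also sidesteps the squared= argument, which is deprecated in newer scikit-learn versions (they provide a separate root_mean_squared_error function instead):
errors = y_test - model.predict(X_test)
mse = np.mean(errors**2) # Mean Squared Error
np.sqrt(mse) # Root Mean Squared Error - should match the rmse value above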
# Use the above metrics to see which do best.
from sklearn.tree import DecisionTreeRegressor, plot_tree
# We'll reuse the train/test split (X_train, X_test, y_train, y_test) from above, which uses the full set of input features
dtree = DecisionTreeRegressor(max_depth=3)
dtree.fit(X_train, y_train)
print('Train score:', dtree.score(X_train, y_train))
print('Test score:', dtree.score(X_test, y_test))
plt.scatter(y_test, dtree.predict(X_test), alpha=0.5) # The tree's predictions vs the actual values
plt.plot(y, y, c='red') # All predictions should lie on this line for a perfect model
Try a deeper tree (max_depth=12) or leave out the max_depth parameter - what happens to the train score and the test score?
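Here's a small sketch that does this systematically - it fits trees of increasing depth and prints the train and test scores. Watch the train score climb towards 1 while the test score stops improving (a classic sign of overfitting):
for depth in [2, 3, 5, 8, 12, None]: # None means no depth limit
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(depth, 'train:', round(tree.score(X_train, y_train), 3), 'test:', round(tree.score(X_test, y_test), 3))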
fig, ax = plt.subplots(figsize=(12, 8)) # Optional way to make the plot bigger
plot_tree(dtree)
plt.show()
from sklearn.ensemble import RandomForestRegressor
import sklearn.datasets
data = sklearn.datasets.load_boston() # Loading the data - it's built in to sklearn! (Note: load_boston was removed in scikit-learn 1.2; on newer versions you could use fetch_california_housing() instead)
print('Data dictionary keys:', data.keys()) # This is a dictionary containing our features, target, feature names, a description etc
boston = pd.DataFrame(data['data'], columns=data['feature_names']) # Convert to a dataframe
boston['target'] = data['target'] # Add the target column
print(boston.shape)
boston.head()
# Do a brief explore of the data
# Which columns have the highest correlation with the target?
# Fit a linear model
# Plot predicted vs actual
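If you get stuck, here's one possible way to approach the exercise - a sketch rather than the only answer, reusing the imports from earlier in the lesson:
# Which columns correlate most strongly with the target?
print(boston.corr()['target'].sort_values())

# Split the data, then fit both a linear model and a random forest
Xb = boston.drop('target', axis=1)
yb = boston['target']
Xb_train, Xb_test, yb_train, yb_test = train_test_split(Xb, yb, test_size=0.33, random_state=42)
lin = LinearRegression().fit(Xb_train, yb_train)
rf = RandomForestRegressor(random_state=42).fit(Xb_train, yb_train)
print('Linear model test score:', lin.score(Xb_test, yb_test))
print('Random forest test score:', rf.score(Xb_test, yb_test))

# Plot predicted vs actual for the random forest
plt.scatter(rf.predict(Xb_test), yb_test, alpha=0.5)
plt.plot(yb, yb, c='red')
plt.show()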
If you'd like a lovely introduction to the idea of machine learning and deep learning, the first lesson in the fastai course is excellent - you could also save that for a follow-on to lesson 5. And lesson 5 goes much deeper into Random Forests and is also highly recommended.