Introduction
For this first lesson, we're just going to briefly touch on the tools we'll use in future lessons. Don't feel that you need to memorise anything, or understand everything that's going on - all we want to do here is get a feel for the space we'll be working in, and meet our data-crunching companions so things aren't awkward when we start diving deeper into Data Science :)
You'll see little exercises scattered throughout - use these as excuses to play around and break things! Running through without exploring is no fun ;)
A video run-through is available at https://youtu.be/B6bbOHvzx24
Getting used to our environment
Setting up everything you need to do data science on your computer isn't very hard, but it's easy to hit errors that suck up your time. So, we're going to leave it as a challenge for another day and use an environment that is already all set up for us: Google Colab.
For each notebook in this course, you can use the 'Open in Colab' button to open a copy of the notebook which you can then run, edit and explore. Try it out with this one, and make sure to save a copy to your Drive so you don't lose your changes and notes.
You can think of Google Colab as a temporary cloud computer which we can use to run our code. In the top right, you'll see an option to 'Connect' (it will connect by default if you try to run some code). On the left, you can access the files pane to view what files are available, upload your own or connect to Google Drive (which will appear as another folder).
So how do we actually run code on this 'computer'? And what is this Jupyter thing we keep mentioning?
Jupyter Notebooks
If you've done some coding before, you might be used to editing a code file ('my_code.py') in a fancy IDE or text editor. You make changes, save, run somewhere and see the final output. This is fine for some use-cases, but we want to interact with our data, making tweaks and figuring things out as we go. For this kind of exploratory programming, Jupyter notebooks are an amazing tool. Each 'cell' can contain either text (like this) or code (like the next cell). You can run a cell with the shortcut Shift-Enter, and the output of that cell is shown immediately after the code that generated it:
print('Hello World')
This is super useful! Our code can live together with explanations, images, equations and output, all in a single place. Some people call this 'literate programming' - we're not just writing instructions for the computer (code) but also explanations for humans.
Cells don't run in isolation - we can create variables in one cell and modify them in another:
a = 17
b = 43 + a
print(b)
Cells can have multiple lines, and you can do anything you would do in a normal python program inside a cell - create classes, define functions, etc.
def add_7(x):
    return x + 7
c = 12
add_7(c)
Exercise: play around a little! Refresh your python knowledge and use a loop to print 'hi' 12 times:
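If you get stuck, one possible solution looks like this (do try your own version first!):

for _ in range(12):
    print('hi')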
As you'll see later in this notebook, the output from a cell can be much more than just text!
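As a tiny taste, here's a cell producing rendered output (this uses IPython's display tools, which come bundled with Jupyter and Colab - just an aside, not something we'll rely on):

from IPython.display import Markdown # IPython powers Jupyter notebooks
Markdown(r'Cells can render **bold**, *italics* and even equations like $e^{i\pi} = -1$')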
Numpy
Numpy is a python library that makes it easier and faster to do certain kinds of numerical operations. Many of the other libraries we'll be using are built on top of numpy. By convention, we import it with the alias 'np' - useful when you need to refer to it a lot. We won't learn too much about it until we need to, but here are some quick examples of a few ways it comes in useful:
import numpy as np
a = np.array([2, 5, 7, 8, 9])
a+6
np.mean(a)
b = np.array([2, 2, 3, 2, 2])
a+b
b**2
Numpy lets us do complex operations on arrays or matrices of numbers, and these operations are extremely fast and efficient. While you can replicate this with standard python, using numpy makes things simpler and faster in many cases. These don't always have to be 1-D arrays either - it's perfectly comfortable with arrays of any shape:
array_3d = np.zeros([3, 5, 5]) # Create a 3x5x5 array
array_3d += 5 # Add 5 to everything. '+=' is basically a shortcut for 'array_3d = array_3d + 5'
array_3d[1] += 2 # Add 2 to a subset of the array
array_3d # View the result
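Earlier we claimed numpy is faster than plain python for this kind of work. Here's a rough illustration (just a sketch - exact timings depend on your machine):

import time

nums = list(range(1_000_000))
arr = np.array(nums)

start = time.perf_counter()
doubled = [n * 2 for n in nums] # plain python: loop over every element
print('python list:', time.perf_counter() - start)

start = time.perf_counter()
doubled = arr * 2 # numpy: one vectorised operation
print('numpy array:', time.perf_counter() - start)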
Exercise: How do you access a particular element? Try setting a single value to 17:
Pandas
Pandas will be the primary way we interact with and manipulate data. It's designed for 'tabular' data - think an Excel spreadsheet or a data table.
There is a great introduction to pandas here: https://lewtun.github.io/dslectures/lesson01_intro-to-pandas/ - refer to this when you don't know how to do one of these exercises.
import pandas as pd # Importing as 'pd' is something of a standard
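To make 'tabular' concrete before we load real data, here's a tiny table built by hand (purely illustrative - the real power comes with bigger datasets):

pd.DataFrame({'Column 1': [12, 15], 'Column 2': [7, 9]})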
Loading Data
Data comes in many different formats, all of which can have their quirks. But one of the most popular formats for tabular data is a CSV (Comma-Separated Values) file. At their most basic, they look something like this:
Column 1, Column 2
12, 7
15, 9
...
We can read them in with the read_csv function. Usually you'll pass a file as the argument, but in this case let's load some data directly from a URL:
df = pd.read_csv('https://raw.githubusercontent.com/surabhim/Diabetes/master/Diabetes.csv', # URL
                 skiprows=9, # What's this? Look at the file - it breaks the convention
                 names=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
                        'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']) # Most CSVs include column names; this one doesn't, so we supply them ourselves
This df is a DataFrame - a special data structure made available by pandas.
type(df)
It has all sorts of functions we can use. For example, to preview the dataframe, we can use .head() to see the first 5 rows:
df.head()
Exercise: What if you want to see the bottom 3 rows?
Exercise: Can you print out some summary statistics for this dataframe?
Exercise: How many rows in this dataframe?
We can select a subset of the columns by passing a list of column names:
df[['Age', 'Outcome']].head()
We can also filter to only show specific rows. There are several ways to do this, but for example if we wanted only rows matching a certain condition we could do:
df.loc[df['Outcome'] == 1].head() # Only rows where Outcome == 1
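Conditions can also be combined - wrap each one in parentheses and join them with & (and) or | (or). For example:

df.loc[(df['Outcome'] == 1) & (df['Age'] > 50)].head() # both conditions must hold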
Exercise: Filter to show only rows with BMI > 40
Exercise: Show the average age and BMI of people who have had 8 pregnancies
Pandas also has some built-in plotting functions for quick visualizations:
df.plot(x='Glucose', y='Age', kind='scatter', alpha=0.3)
df['Pregnancies'].hist()
df.boxplot(column='BMI', by='Outcome', figsize=(12, 8))
Exercise: what other kinds of plots can you make? Try a few:
Matplotlib
Pandas' built-in plots are nice for quick visuals, but if you want more control you'll likely end up running into the big name in python plotting: matplotlib.
It gives us control over everything we could want, but it can also be a little bit confusing. Don't worry - the secret is that someone else has already tried to make the same kind of plot as you, and you can find TONS of examples online.
We'll show a few demos here, but if you ever want to try a complicated matplotlib thing yourself I suggest taking a look at the Ultimate matplotlib notebook for copy-paste-able code for pretty much any advanced plotting option you'll ever need :)
from matplotlib import pyplot as plt # another common import statement
X = df['Glucose']
Y = df['BMI']
plt.scatter(X, Y, c='red', alpha=0.3)
# Set an overall figure title - plt.suptitle sits above all subplots (plt.title only titles the current axes)
plt.suptitle("Glucose vs BMI and Glucose vs Insulin")
# Subplots: 2 rows, 1 column, and we want to plot on the first subplot
plt.subplot(211)
plt.scatter(X, Y)
# Now we want to plot on the second subplot
plt.subplot(212)
plt.scatter(X, df['Insulin'], c='green')
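If the two panels crowd each other, asking matplotlib to space them out automatically usually helps (add this at the end of the plotting cell above):

plt.tight_layout() # adjusts subplot spacing to avoid overlaps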
Exercise: Google and figure out how to make a histogram from the following data:
data = [1, 1, 2, 3, 2, 3, 4, 6, 6, 5, 6, 3, 2, 3, 2, 1, 2, 1]
# Your Code Here
Scikit-learn
Scikit-learn is a great resource for all things machine learning. Their docs (https://scikit-learn.org/stable/) are excellent and there are tons of examples for every method or algorithm.
For demo purposes, here's fitting a classification model on the dataset we've been playing with.
NOTE: We'll explain these concepts later - this is JUST a demo to show a tiny snippet of what's possible :)
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
# Define our inputs (X) and our output (y)
X = df.drop('Outcome', axis=1)
y = df['Outcome']
# We split our data so we can see how well any models we make do on the 'test' set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# We create a model:
model = RandomForestClassifier()
# Training is as simple as
model.fit(X_train, y_train)
# Making predictions is as simple as
predictions = model.predict(X_test)
Doesn't look very hard, does it? We can make it more complicated, but the truth is that these powerful libraries give us access to some very useful techniques in an easy-to-use format.
predictions[:5] # What do the predictions look like?
What is our accuracy?
accuracy_score(y_test, predictions)
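Accuracy is simply the fraction of predictions that match the true labels, so we can sanity-check the score by computing it directly:

(predictions == y_test).mean() # fraction of correct predictions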
Accuracy score is the default for classification models, so we can do this as an alternative to the above:
model.score(X_test, y_test)
We can pass a set of inputs through .predict_proba to see the predicted probabilities. What happens when you change the different values?
X_test.head(1) # Remind ourselves what the columns are
model.predict_proba([[6, 98, 58, 33, 190, 34, 0.43, 43]]) ### Exercise: play with these values and observe
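The probabilities line up with the order of model.classes_, so you can check which column corresponds to each outcome:

model.classes_ # column order of the probabilities above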
Exercises
This section has some questions and exercises to help you get a little more familiar with these tools. Play around, add cells for your own experiments and see what you can do!
1) Load in the data from the following classic dataset: https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv
2) Look up the 'boston housing dataset' to find out more about where this comes from.
3) Explore the data. Is price (MEDV) correlated with anything interesting?
4) What is the average price?
5) What is the average price for places with a NOX concentration above average?
6) Find another dataset in CSV format from the UCI machine learning repository. Load it in and create some visualizations to try to understand what the data is about.