Introduction

For this first lesson, we're just going to briefly touch on the tools we'll use in future lessons. Don't feel that you need to memorise anything, or understand everything that's going on - all we want to do here is get a feel for the space we'll be working in, and meet our data-crunching companions so things aren't awkward when we start diving deeper into Data Science :)

You'll see little exercises scattered throughout - use these as excuses to play around and break things! Running through without exploring is no fun ;)

A video run-through is available at https://youtu.be/B6bbOHvzx24

Getting used to our environment

Setting up everything you need to do data science on your computer isn't very hard, but it's easy to hit errors that suck up your time. So, we're going to leave it as a challenge for another day and use an environment that is already all set up for us: Google Colab.

For each notebook in this course, you can use the 'Open in Colab' button to open a copy of the notebook which you can then run, edit and explore. Try it out with this one, and make sure to save a copy to your Drive so you don't lose your changes and notes.

Colab

You can think of Google Colab as a temporary cloud computer which we can use to run our code. In the top right, you'll see an option to 'Connect' (it will connect by default if you try to run some code). On the left, you can access the files pane to view what files are available, upload your own or connect to Google Drive (which will appear as another folder).
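
If you'd rather do that last step in code, Colab lets you mount your Drive from a cell as well - a minimal sketch (it will prompt you to authorise access):

from google.colab import drive
drive.mount('/content/drive') # Your Drive files then appear under /content/drive in the files pane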

So how do we actually run code on this 'computer'? And what is this Jupyter thing we keep mentioning?

Jupyter Notebooks

If you've done some coding before, you might be used to editing a code file ('my_code.py') in a fancy IDE or text editor. You make changes, save, run it somewhere and see the final output. This is fine for some use-cases, but we want to interact with our data - making tweaks, trying things out and figuring things out as we go. For this kind of exploratory programming, Jupyter notebooks are an amazing tool. Each 'cell' can contain either text (like this) or code (like the next cell). You can run a cell with the shortcut Shift-Enter, and the output of that cell is shown immediately after the code that generated it:

print('Hello World')
Hello World

This is super useful! Our code can live together with explanations, images, equations and output, all in a single place. Some people call this 'literate programming' - we're not just writing instructions for the computer (code) but also instructions and explanations for humans.

Cells don't run in isolation - we can create variables in one cell and modify them in another:

a = 17
b = 43 + a
print(b)
60

Cells can have multiple lines, and you can do anything you would do in a normal python program inside a cell - create classes, define functions, etc.

def add_7(x):
    return x+7

c = 12

add_7(c)
19
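
For example, defining a class works just the same way - here's a tiny throwaway class, made up purely to show that anything you'd put in a .py file can live in a cell:

class Greeter:
    def __init__(self, name):
        self.name = name

    def greet(self):
        return f'Hello, {self.name}!'

Greeter('Ada').greet()
'Hello, Ada!'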

Exercise: play around a little! Refresh your python knowledge and use a loop to print 'hi' 12 times:

 

As you'll see later in this notebook, the output from a cell can be much more than just text!

Libraries

A lot of useful functionality isn't included in the standard python setup. Instead, we use libraries to extend the base python functionality in different ways. Python has a library for pretty much everything.
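
Colab comes with everything we need for this course pre-installed, but if you ever want a library that isn't there, notebook cells let you run shell commands by starting a line with '!' - for example (the 'emoji' package here is just an arbitrary illustration):

!pip install emoji # Installs the 'emoji' package from PyPI onto the Colab machine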

Let's meet the ones we'll be using in this course...

Numpy

Numpy is a python library that makes it easier and faster to do certain kinds of numerical operations. Many of the other libraries we'll be using are built on top of numpy. By convention, we import it with the alias 'np' - handy when you need to type it a lot. We won't dig into it properly until we need to, but here are a few quick examples of where it comes in useful:

import numpy as np
a = np.array([2, 5, 7, 8, 9])
a+6
array([ 8, 11, 13, 14, 15])
np.mean(a)
6.2
b = np.array([2, 2, 3, 2, 2])
a+b
array([ 4,  7, 10, 10, 11])
b**2
array([4, 4, 9, 4, 4])

Numpy lets us do complex operations on arrays or matrices of numbers, and these operations are extremely fast and efficient. While you can replicate all of this with standard python, numpy makes things simpler and faster in many cases (there's a quick side-by-side comparison after the next example). These don't have to be 1-D arrays either - numpy is perfectly comfortable with arrays of any shape:

array_3d = np.zeros([3, 5, 5]) # Create a 3x5x5 array
array_3d += 5 # Add 5 to everything. '+=' is basically a shortcut for 'array_3d = array_3d + 5'
array_3d[1] += 2 # Add 2 to a subset of the array
array_3d # View the result
array([[[5., 5., 5., 5., 5.],
        [5., 5., 5., 5., 5.],
        [5., 5., 5., 5., 5.],
        [5., 5., 5., 5., 5.],
        [5., 5., 5., 5., 5.]],

       [[7., 7., 7., 7., 7.],
        [7., 7., 7., 7., 7.],
        [7., 7., 7., 7., 7.],
        [7., 7., 7., 7., 7.],
        [7., 7., 7., 7., 7.]],

       [[5., 5., 5., 5., 5.],
        [5., 5., 5., 5., 5.],
        [5., 5., 5., 5., 5.],
        [5., 5., 5., 5., 5.],
        [5., 5., 5., 5., 5.]]])
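
To make that 'simpler and faster than standard python' claim concrete, here's the same 'add 6 to everything' operation done both ways - just a quick side-by-side, nothing to memorise:

plain_list = [2, 5, 7, 8, 9]
[x + 6 for x in plain_list] # Plain python: we loop over each element ourselves
[8, 11, 13, 14, 15]
np.array(plain_list) + 6 # Numpy: one vectorised operation, much faster on large arrays
array([ 8, 11, 13, 14, 15])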

Exercise: How do you access a particular element? Try setting a single value to 17:

 

Pandas

Pandas will be the primary way we interact with and manipulate data. It's designed for 'tabular' data - think an Excel spreadsheet or a data table.

There is a great introduction to pandas here: https://lewtun.github.io/dslectures/lesson01_intro-to-pandas/ - refer to this when you don't know how to do one of these exercises.

import pandas as pd # Importing as 'pd' is something of a standard
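
Before we load a real dataset, here's a tiny DataFrame built by hand from a dictionary - made-up values, just to show the row-and-column structure pandas works with:

pd.DataFrame({'Name': ['Ada', 'Grace'], 'Score': [91, 88]})
Name Score
0 Ada 91
1 Grace 88
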
Loading Data

Data comes in many different formats, all of which can have their quirks. But one of the most popular formats for tabular data is a CSV (Comma-Separated Values) file. At their most basic, CSV files look something like this:

Column 1, Column 2
12, 7
15, 9
...

We can read them in with the read_csv function. Usually you'll pass a file path as the argument, but in this case let's load some data directly from a URL:

df = pd.read_csv('https://raw.githubusercontent.com/surabhim/Diabetes/master/Diabetes.csv', # URL
                 skiprows=9, # What's this? Look at the file - it breaks the convention
                 names=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
                        'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']) # This file has no header row, so we supply column names
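
To show what the more usual file-path version looks like - a quick hypothetical round trip, with the filename and variable name made up purely for illustration:

df.to_csv('my_copy.csv', index=False) # Writes a local copy onto the Colab machine
df_from_file = pd.read_csv('my_copy.csv') # Same function, but with a local path instead of a URL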

This df is a DataFrame - a special data structure made available by pandas.

type(df)
pandas.core.frame.DataFrame

It has all sorts of functions we can use. For example, to preview the dataframe, we can use .head() to see the first 5 rows:

df.head()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1

Exercise: What if you want to see the bottom 3 rows?

 

Exercise: Can you print out some summary statistics for this dataframe?

 

Exercise: How many rows are in this dataframe?

 

Filtering

Pandas gives us lots of ways to control which parts of the data we're looking at. For example, to see only a few columns, we can provide them as a list:

df[['Age', 'Outcome']].head()
Age Outcome
0 50 1
1 31 0
2 32 1
3 21 0
4 33 1

We can also filter to only show specific rows. There are several ways to do this; for example, if we wanted only rows matching a certain condition we could do:

df.loc[df['Outcome'] == 1].head() # Only rows where Outcome == 1
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
2 8 183 64 0 0 23.3 0.672 32 1
4 0 137 40 35 168 43.1 2.288 33 1
6 3 78 50 32 88 31.0 0.248 26 1
8 2 197 70 45 543 30.5 0.158 53 1
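
As one example of the 'several ways' mentioned above, the same filter can also be written with .query(), which some people find easier to read:

df.query('Outcome == 1').head() # Gives the same rows as the .loc version above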

Exercise: Filter to show only rows with BMI > 40

 

Exercise: Show the average age and BMI of people who have had 8 pregnancies

 

Visualization

Pandas has some nice visualization options built in as well. For example:

df.plot(x='Glucose', y='Age', kind='scatter', alpha=0.3)
[scatter plot of Age against Glucose]
df['Pregnancies'].hist()
[histogram of the Pregnancies column]
df.boxplot(column='BMI', by='Outcome', figsize=(12, 8))
[box plot of BMI, grouped by Outcome]

Exercise: what other kinds of plots can you make? Try a few:

 

Matplotlib

Pandas' built-in plots are nice for quick visuals, but if you want more control you'll likely end up running into the big name in python plotting: matplotlib.

It gives us control over everything we could want, but it can also be a little bit confusing. Don't worry - the secret is that someone else has already tried to make the same kind of plot as you, and you can find TONS of examples online.

We'll show a few demos here, but if you ever want to try a complicated matplotlib thing yourself I suggest taking a look at the Ultimate matplotlib notebook for copy-paste-able code for pretty much any advanced plotting option you'll ever need :)

from matplotlib import pyplot as plt # another common import statement
X = df['Glucose']
Y = df['BMI']
plt.scatter(X, Y, c='red', alpha=0.3)
[scatter plot of BMI against Glucose, in red]
# Set an overall title for the figure (suptitle, so it sits above both subplots)
plt.suptitle("Glucose vs BMI and Glucose vs Insulin")

# Subplots: 2 rows, 1 column, and we want to plot on the first subplot
plt.subplot(211)
plt.scatter(X, Y)

# Now we want to plot on the second subplot
plt.subplot(212)
plt.scatter(X, df['Insulin'], c='green')
[two stacked scatter plots with an overall title: Glucose vs BMI on top, Glucose vs Insulin below]
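
As a small taste of that extra control, here's the first scatter plot again with a title and axis labels added - a minimal sketch:

plt.scatter(X, Y, alpha=0.3) # Same Glucose vs BMI data as before
plt.title('Glucose vs BMI')
plt.xlabel('Glucose')
plt.ylabel('BMI')
plt.show() # Displays the labelled plot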

Exercise: Google and figure out how to make a histogram from the following data:

data = [1, 1, 2, 3, 2, 3, 4, 6, 6, 5, 6, 3, 2, 3, 2,1, 2, 1]
# Your Code Here

Scikit-learn

Scikit-learn is a great resource for all things machine learning. Their docs (https://scikit-learn.org/stable/) are excellent and there are tons of examples for every method or algorithm.

For demo purposes, here's fitting a classification model on the dataset we've been playing with.

NOTE: We'll explain these concepts later - this is JUST a demo to show a tiny snippet of what's possible :)

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

# Define our inputs (X) and our output (y)
X = df.drop('Outcome', axis=1)
y = df['Outcome']

# We split our data so we can see how well any models we make perform on the 'test' set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# We create a model:
model = RandomForestClassifier()

# Training is as simple as
model.fit(X_train, y_train)

# Making predictions is as simple as
predictions = model.predict(X_test)

Doesn't look very hard, does it? We can make things more complicated, but the truth is that these powerful libraries give us access to some very useful techniques in a very easy-to-use format.

predictions[:5] # What do the predictions look like?
array([0, 0, 0, 0, 0])

What is our accuracy?

accuracy_score(y_test, predictions)
0.7598425196850394

Accuracy is the default score for classification models, so we can call .score() as an alternative to the above:

model.score(X_test, y_test)
0.7598425196850394

We can pass a set of inputs through .predict_proba to see the predicted probabilities. What happens when you change different values?

X_test.head(1) # Remind ourselves what the columns are
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
668 6 98 58 33 190 34.0 0.43 43
model.predict_proba([[6, 98, 58, 33, 190, 34, 0.43, 43]]) ### Exercise: play with these values and observe
array([[0.61, 0.39]])
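
To underline how interchangeable these models are, here's what swapping in a different classifier might look like - a sketch using logistic regression, not something we'll rely on later:

from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(max_iter=1000) # max_iter raised so the solver has time to converge
lr_model.fit(X_train, y_train) # Same .fit interface as the random forest
lr_model.score(X_test, y_test) # ...and the same .score interface too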

Exercises

This section has some questions and exercises to help you get a little more familiar with these tools. Play around, add cells for your own experiments and see what you can do!

1) Load in the data from the following classic dataset: https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv

2) Look up the 'boston housing dataset' to find out more about where this comes from.

3) Explore the data. Is price (MEDV) correlated with anything interesting?

4) What is the average price?

5) What is the average price for places with a NOX concentration above average?

6) Find another dataset in CSV format from the UCI machine learning repository. Load it in and create some visualizations to try and understand what the data is about.