Introduction
Last lesson finished with a demo of creating and training a model with scikit-learn, and we saw that it only requires a couple of lines of code. We'll look at some more complex examples in lesson 3, but the key takeaway from that example is that the process is pretty easy IF your data is ready. This lesson is all about getting to that point.
We'll look at some example datasets and in the process we'll talk about:
- How to quickly get a feel for a dataset
- Things to look out for
- Useful visualizations
- Dealing with missing data
- 'Feature Engineering'
- Tips for enriching your datasets
This is a very quick intro to the concepts - for more examples see https://lewtun.github.io/dslectures/lesson02_exploratory-data-analysis/
Let's get started.
Encountering data in courses like this is always a little weird - in real-world scenarios you'll often know a lot about the data you're about to be working with (but not always!). But for a course like this we'll be pulling up data from all over the place for illustrative purposes. Don't worry too much about the nitty-gritty details - we'll try to give a bit of introduction to each dataset, and you're welcome to go deeper, but feel free to just take them as they come and figure things out if you need them :)
For example, this dataset comes from the UCI machine learning repository, which has lots of useful example datasets. It's called the 'adult' dataset, and focuses on predicting the income class of people in America given some background information on them. You can read about the data here - that page links to the raw data as well as some metadata such as the 'names' file which I used to generate the list of column headings used below.
The first thing I do after loading a dataset is to print the shape and view the first few rows to check that there aren't any glaring errors and that we have about as much data as we were expecting.
import pandas as pd
from matplotlib import pyplot as plt
names = ['age','workclass','fnlwgt','education','education-num',
'marital-status','occupation','relationship','race','sex',
'capital-gain','capital-loss','hours-per-week','native-country', 'income_class']
adults = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
names=names, )
print(adults.shape)
adults.head()
Take a minute to look at the different columns. Do the values seem to match the column headings? What kind of data does each column contain? Does anything look weird? Looking at the output above, I have no clue what fnlwgt is, but besides that it all looks fairly sensible. We'll talk about the different kinds of data shortly, but for now let's look at four other useful functions worth bringing up at this early inspection stage:
1) .describe()
The describe function gives us useful summaries of any numerical columns. Looking at the average (mean), min and max can be a useful check to make sure the variable behaves as you'd expect. For example, here we can see that age ranges from 17 to 90, with an average age of 38.6. If there was a max age of 150, you'd want to investigate that. Check the other columns - does anything look weird?
adults.describe()
2) .info()
Different columns in a dataframe can be different data types. For example, numbers might be represented as integers (int64) or floating point numbers (float) while strings are stored as a special 'object' type. Pandas guesses these types automatically, but it can sometimes have confusing results. For example, if there is one row with a string (e.g. 'missing') in a column of numbers then pandas will treat the whole column as type object, and you won't be able to see stats with describe() or use functions like .mean().
adults.info()
In this case the data looks fine. If you end up with columns treated as the wrong type, try to figure out why and then fix it by replacing the troublesome values (the replace() function should do the trick) or force conversion to your desired type with the astype() function.
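As a rough sketch (the dataframe and column name here are hypothetical), that clean-up might look something like this:
import numpy as np

# Hypothetical: a column pandas read as 'object' because a few rows contain the string 'missing'
df['some_column'] = df['some_column'].replace('missing', np.nan)  # swap the bad values for NaN
df['some_column'] = df['some_column'].astype(float)               # now the conversion to float works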
3) .sample()
Using .head() can hide errors if the problems aren't present in the first 5 rows. For example, you may have data ordered by the time it was collected, and later rows could have missing values. You could use .tail() to see the final few rows, but again the errors might lurk in the middle of the dataframe. Looking at a random sample from the data can often be instructive. Here, let's look at 5 random rows to check everything looks OK:
adults.sample(5)
4) .isna()
I've mentioned missing values a few times. Pandas represents missing data with a special NaN object (np.NaN). This particular dataset doesn't have any missing data*, but in general we can check if there is any by combining the .isna() function (which returns True or False for each value) with the sum() function, which lets us tally the number of 'Trues' in each column to get the total number of missing values.
adults.isna().sum()
WAIT A MINUTE! *Did you catch anything looking at .sample() above? We might have some missing data after all - something we'll look at later in the lesson... For now let's move on and do some more exploring.
adults['age'].hist()  # a histogram shows the distribution of values in a numerical column
Correlation
[Image: illustration of correlation. Source: www.mathsisfun.com]
Correlation is a measure of the relationship between two variables. A high positive correlation means that when one variable is larger, the other is likely to be larger as well. Correlation doesn't tell us about the cause of a relationship, only whether one exists.
We can calculate the correlation between numerical columns with .corr(). This is especially useful when our 'target' column is numerical - it might show which columns are most strongly related to the target.
adults.corr(numeric_only=True)  # numeric_only skips the text columns (needed on recent pandas versions)
There are ways of making this colourful and pretty, which can help pick out unusual values in dataframes with more columns, but I'll leave that for you to research :) We're usually looking for large (positive or negative) correlations. In this case there isn't much to write home about, but if there were some high correlation values we might want to take a closer look with a scatter plot:
adults.plot(x='hours-per-week', y='age', kind='scatter', alpha=0.3)
Not very obvious there!
For the lazy, there is a useful function to plot all the scatter-plots at once. Warning: this can take a while with larger dataframes!
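One such function is pandas' built-in scatter_matrix - here's a sketch limited to a few columns to keep it quick:
# Pairwise scatter plots for a few numeric columns (with histograms on the diagonal)
pd.plotting.scatter_matrix(adults[['age', 'education-num', 'hours-per-week']],
                           alpha=0.2, figsize=(8, 8))
plt.show()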
It might feel like a waste of time to be creating all these visualizations - don't we want to get straight to the modelling? But we are visual creatures, and these plots can reveal issues that we might miss trying to look through simple tables of values. To illustrate this, I'm going to re-visit a plot we made in Lesson 1:
diabetes = pd.read_csv('https://raw.githubusercontent.com/surabhim/Diabetes/master/Diabetes.csv', # URL
skiprows=9, # What's this? Look at the file - it breaks the convention
names=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']) # Must give the column names ourselves
diabetes.plot(x='Glucose', y='BMI', kind='scatter')
Notice anything? Along both axes there is a smattering of points at zero. I'm no doctor, but I don't think a BMI of 0 sounds healthy - this may be missing data we should investigate further.
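One quick check (a sketch) is to count how many rows sit at exactly zero in the suspicious columns:
# Count the rows with a value of exactly 0 in each of these columns
(diabetes[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']] == 0).sum()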
Categorical Variables
So far we've looked at techniques for examining numerical columns - let's remedy that. Categorical data is split into discrete classes/groups/options. For example, a 'Gender' column might contain 'Male', 'Female', 'Other' and 'PreferNotToSay'. When it comes time to model, we'll need to think about how we turn these columns into numerical values the model can use as inputs.
In some cases, categories do have an obvious ordering. For example, education completed might be 'high school', 'bachelors', 'masters', 'phd' - this is called an ordinal variable, and we can get away with encoding these as 1, 2, 3 and 4 respectively. But in the gender example above, there is no implied ordering that would make sense.
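For example, a minimal sketch of encoding an ordinal column with a simple mapping (the dataframe and column here are made up for illustration):
# Hypothetical ordinal column: map each category to a number that preserves the ordering
education_order = {'high school': 1, 'bachelors': 2, 'masters': 3, 'phd': 4}
df['education_encoded'] = df['education_completed'].map(education_order)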
For both modelling and visualization, it often makes sense to look at the different categories individually.
Let's quickly run through some useful functions for working with this kind of data:
1) .unique()
The unique function lists the distinct values that appear in a column:
adults['education'].unique()
2) .value_counts()
How many rows have 'Masters' in the education column? value_counts gives an easy answer. In cases where your target is a categorical variable, it can be useful to see the relative distribution of the different classes. Here, for example, it's very clear that only a few people in this dataset stopped their education after preschool.
adults['education'].value_counts()
3) .groupby()
This is a more complex one, but very powerful. groupby allows you to apply a function to all rows with a specific value in the column you're grouping by. In this case we use mean(), but you can swap in .count() or .sum() or your own more complex function. You'll need groupby one day - now that you know what it does, you can search for examples to help with whatever specific challenge you're facing.
adults.groupby('education').mean(numeric_only=True)  # numeric_only averages just the numeric columns (needed on recent pandas versions)
Exercise Time
Phew, we've covered quite a bit already! The above should give you everything you need to take a nice hard look at any dataset you'll be working with. There are some extra topics covered in the next few sections, but for now let's put what we've learnt so far into practice on a new dataset. As a bonus, this will save us time when we use this dataset in lesson 4.
Take a look at the Sendy Logistics Challenge on Zindi. You'll need to register on Zindi to join the competition and access the data, which comes as 5 separate files:
- Train.csv - is the dataset that you will use to train your model.
- Test.csv - is the dataset you will apply your model to.
- Riders.csv - contains unique rider Ids, number of orders, age, rating and number of ratings
- VariableDefinitions.csv - Definitions of variables in the Train, Test and Riders files
- SampleSubmission.csv - Shows the submission format (we'll look at this later, in lesson 4)
1) Load and inspect the data from Train.csv. Any missing data? What is the target column?
2) How common are the different vehicle types? Is there a difference in the trip time or distance depending on vehicle type?
3) What columns are highly correlated (positively or negatively) with the target? Plot some scatter plots to visualize the relationships
Dates
Dates deserve a special mention here, as they can be particularly annoying to deal with. Date formats, timezones, etc all have the ability to ruin your day. Fortunately, pandas helps us out in a number of ways. Check out their user guide for complex examples, or follow along here as we load some data with dates and mess about to see what we can use them for.
The dataset we'll be using contains historical bitcoin price data. Let's load it in and take a look.
url = 'https://raw.githubusercontent.com/Yrzxiong/Bitcoin-Dataset/master/bitcoin_dataset.csv'
bitcoin = pd.read_csv(url, parse_dates=['Date']) # Note the new addition here
bitcoin.head()
With a date column, we can filter the dataset in ways that make intuitive sense:
bitcoin.loc[bitcoin['Date']>'2014-02-07'].head()
We can also set the date column as the index, and do all sorts of fun things:
bitcoin.set_index('Date',inplace=True)
bitcoin.head(2) # Note the change
bitcoin['btc_market_price'].plot()  # with a date index, pandas labels the x-axis with dates for us
bitcoin_2016 = bitcoin['2016-01-01':'2017-01-01']
bitcoin_2016['btc_market_price'].resample('2w').mean().plot() # Q: What does .resample('w').mean() do?
Now, if you don't use parse_dates when loading the data, or if you've created a date string by combining multiple columns (like delivery_year, delivery_month...), we can still convert to a datetime column with pd.to_datetime:
b = pd.read_csv(url)
b['New Date'] = pd.to_datetime(b['Date'])
b[['Date', 'New Date']].info() # See the types
Dates are useful because we can extract so many things from them. If we know the date, we can tell
- If it's a weekend
- If it's near the end of the month
- What holidays are happening
- What the traffic might be like (if we know time as well)
- ...
We'll use the date methods, accessed through .dt, to show just one example in the lambda function section.
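As a small taster, here's a sketch using the 'New Date' column we just created - .dt exposes things like the day of the week and whether a date falls at the end of a month:
# A couple of the properties available through the .dt accessor
b[['New Date']].assign(
    day_of_week=b['New Date'].dt.dayofweek,      # 0 = Monday, ..., 6 = Sunday
    is_month_end=b['New Date'].dt.is_month_end   # True on the last day of the month
).head()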
Missing Data
It is rare in the real world to find a dataset with no missing data. Forms are filled incorrectly, sensors break, connections are lost... it is just something we have to accept. The ways we deal with this missing data can vary. In some cases, we can simply ignore rows with missing data. But in others, we might not have the luxury of being able to just discard data. Other options include:
- Replace missing values with 0. If you need numerical inputs so that you can fit a model, and you don't really care what the value is, df.fillna(0) can get rid of the NaNs which might otherwise cause issues. This can obviously be problematic if those numbers are important!
- Replace with the median or mean. A better solution if you must assume a value. fillna can replace with the mean or, if you're really fancy, you could use groupby to replace missing values in specific categories with the mean for rows in that category.
- Replace with the most common value. Especially when there is a categorical column, you could assign the most common class rather than creating a new 'unknown' class. Again, this could be dangerous, so think before using blindly.
- Infer with machine learning. In some cases, we can predict the missing values by creating a model that takes the other columns as inputs and fills in its best guess!
- Interpolate (timeseries). If the value of column A was 5 at 7pm, 6 at 8pm, NaN (missing) at 9pm and 8 at 10pm, it seems reasonable to guess that it was around 7 at 9pm. This is called interpolation - a useful technique if you have timeseries data (data arriving at regular intervals over time). You could also just 'forward fill' and assume it was the same as the last known reading. (A quick sketch of a few of these options in pandas follows this list.)
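To make a few of these options concrete, here's a rough sketch in pandas. The dataframe df, the numeric column 'A' and the category column 'group' are all hypothetical, and each line is an alternative rather than a sequence:
# Each expression returns a new column - assign it back (df['A'] = ...) to keep the result
df['A'].fillna(0)                                                   # replace with 0
df['A'].fillna(df['A'].mean())                                      # replace with the column mean
df.groupby('group')['A'].transform(lambda x: x.fillna(x.mean()))    # mean within each category
df['A'].fillna(df['A'].mode()[0])                                   # most common value
df['A'].interpolate()                                               # interpolate between known values
df['A'].ffill()                                                     # forward fill the last known value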
Deciding which method to use requires giving some thought to the context, and the implications. Filling with the mean might be fine, but if a column is missing 80% of its rows it might be better to exclude it rather than adding lots of synthetic data.
We'll discuss these options each time we encounter missing data. For now, let's look at two final data-wrangling techniques to keep in our toolbox.
Pandas Merge and Concat
We often end up with useful information scattered across multiple dataframes which we'd like to join together somehow. If we have dataframes with some consistency (the same columns or rows) we can simply 'stack' them together with concat. But in many cases we need something more nuanced, able to link information based on some shared key or bring in only specific rows - this is where merge shines. We'll look at a few examples here, but if you're looking for more there is a nice long example here.
Let's start by making some demo dataframes so that we can see what's going on.
df_a = pd.DataFrame({
'A':[1, 2, 3],
'B':['Q', 'E', 'W'],
'C':[12.3, 17, 24.2]
})
df_a
df_b = pd.DataFrame({
'A':[4, 5],
'B':['T', 'R'],
'C':[22.1, 10.3]
})
df_b
df_c = pd.DataFrame({
'Letter':['Q', 'W', 'E', 'R', 'T', 'Y'],
'D':[9, 8, 7, 6, 5, 4]
})
df_c
Concat
First, let's stack df_a and df_b together with concat:
df_ab = pd.concat([df_a, df_b])
df_ab
By default, concat stacks the dataframes vertically. But if we want to keep the rows consistent and add extra columns instead, we can change this with the 'axis' parameter:
pd.concat([df_a, df_b], axis=1)
Note that the above example doesn't make much sense - we now have duplicated columns and the rows in the two dataframes don't really correspond to each other. You can also see that since the shapes don't match we have some NaN values.
Merge
Take a look at df_c - the 'Letter' column looks like it corresponds to the 'B' column in our other two dataframes. And it has another column, 'D', which might contain extra information we'd like to include in our main dataframe. What we'd really like is a way to join the two together based on this shared 'key'.
When doing a merge, we specify a 'left' dataframe and a 'right' dataframe, as well as the type of join. There are different merge types, but the one we're most often interested in is called a 'left join' - for each key in the left dataframe, we find rows in the right one with matching keys and add that data into the left one.
We specify the 'key' column with the 'on' parameter (if it's the same in both) or the 'left_on' and 'right_on' parameters if the column names are different.
pd.merge(df_a, # Left dataframe
df_c, # Right dataframe
left_on='B', # Column with the key
right_on='Letter', # Column with the key
how='left') # Type of merge/join
What happens when some rows don't have matches? Let's flip things around and try to add info from df_a to df_c:
pd.merge(df_c, # Switched - df_c now the left one
df_a, # Right dataframe
left_on='Letter', right_on='B', how='left')
We've got the extra data where there were matches and missing data where there weren't. Now:
- What happens if you change the above code to have how='inner'?
- Try some other merges or combinations. Don't be afraid to google - maaaaany people find these confusing, but you can almost always find an example of what you're trying to do.
This little section is pretty short, but it opens a world of possibilities - we can now source data from many different places and combine it in creative ways! Data on poverty and data on climate? Link by country to plot the relationships! Insurance data spread across 'Policies.csv', 'Customers.csv' and 'Payments.csv'? Merge to the rescue! The power is yours :)