Introduction

Last lesson finished with a demo of creating and training a model with scikit-learn, and we saw that it only requires a couple of lines of code. We'll look at some more complex examples in lesson 3, but the key takeaway from that example is that the process is pretty easy IF your data is ready. This lesson is all about getting to that point.


We'll look at some example datasets and in the process we'll talk about:

  • How to quickly get a feel for a dataset
  • Things to look out for
  • Useful visualizations
  • Dealing with missing data
  • 'Feature Engineering'
  • Tips for enriching your datasets

This is a very quick intro to the concepts - for more examples see https://lewtun.github.io/dslectures/lesson02_exploratory-data-analysis/

Let's get started.

Initial Inspection

Encountering data in courses like this is always a little weird - in real-world scenarios you'll often know a lot about the data you're about to be working with (but not always!). But for a course like this we'll be pulling up data from all over the place for illustrative purposes. Don't worry too much about the nitty-gritty details - we'll try to give a bit of introduction to each dataset, and you're welcome to go deeper, but feel free to just take them as they come and figure things out if you need them :)

For example, this dataset comes from the UCI machine learning repository, which has lots of useful example datasets. It's called the 'adult' dataset, and focuses on predicting the income class of people in America given some background information on them. You can read about the data here - that page links to the raw data as well as some metadata such as the 'names' file which I used to generate the list of column headings used below.

The first thing I do after loading a dataset is to print the shape and view the first few rows to check that there aren't any glaring errors and that we have about as much data as we were expecting.

import pandas as pd
from matplotlib import pyplot as plt
names = ['age','workclass','fnlwgt','education','education-num',
         'marital-status','occupation','relationship','race','sex',
         'capital-gain','capital-loss','hours-per-week','native-country', 'income_class']

adults = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', 
                     names=names, )
print(adults.shape)
adults.head()
(32561, 15)
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country income_class
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K

Take a minute to look at the different columns. Do the values seem to match the column headings? What kind of data does each column contain? Does anything look weird? Looking at the output above, I have no clue what fnlwgt is but besides that it all looks fairly sensible. We'll talk about the different kinds of data shortly, but for now let's look at four other useful functions worth bringing up at this early inspection stage:

1) .describe()

The describe function gives us useful summaries of any numerical columns. Looking at the average (mean), min and max can be a useful check to make sure the variable behaves as you'd expect. For example, here we can see that age ranges from 17 to 90, with an average age of 38.6. If there was a max age of 150, you'd want to investigate that. Check the other columns - does anything look weird?

adults.describe()
age fnlwgt education-num capital-gain capital-loss hours-per-week
count 32561.000000 3.256100e+04 32561.000000 32561.000000 32561.000000 32561.000000
mean 38.581647 1.897784e+05 10.080679 1077.648844 87.303830 40.437456
std 13.640433 1.055500e+05 2.572720 7385.292085 402.960219 12.347429
min 17.000000 1.228500e+04 1.000000 0.000000 0.000000 1.000000
25% 28.000000 1.178270e+05 9.000000 0.000000 0.000000 40.000000
50% 37.000000 1.783560e+05 10.000000 0.000000 0.000000 40.000000
75% 48.000000 2.370510e+05 12.000000 0.000000 0.000000 45.000000
max 90.000000 1.484705e+06 16.000000 99999.000000 4356.000000 99.000000

2) .info()

Different columns in a dataframe can be different data types. For example, numbers might be represented as integers (int64) or floating point numbers (float) while strings are stored as a special 'object' type. Pandas guesses these types automatically, but it can sometimes have confusing results. For example, if there is one row with a string (eg 'missing') in a column of numbers then pandas will treat the whole column as type object, and you won't be able to see stats with describe() or use functions like .mean().

adults.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  income_class    32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB

In this case the data looks fine. If you end up with columns treated as the wrong type, try to figure out why and then fix it by replacing the troublesome values (the replace() function should do the trick) or force conversion to your desired type with the astype() function.
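For example (a minimal sketch on a toy Series rather than the adults data), a single stray string turns a numeric column into type object, and replace() plus astype() gets things back to numbers:

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 'missing', 4])
print(s.dtype)  # object - the single string forces the whole column to object

s = s.replace('missing', np.nan).astype(float)  # swap out the troublesome value, then force the type we want
print(s.dtype)  # float64 - describe(), mean() etc. work again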

3) .sample()

Using .head() can hide errors if the problems aren't present in the first 5 rows. For example, you may have data ordered by the time it was collected, and later rows could have missing values. You could use .tail() to see the final few rows, but again the errors might lurk in the center of the dataframe. Looking at a random sample from the data can often be instructive. Here, let's look at 5 random rows to check everything looks OK:

adults.sample(5)
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country income_class
14577 43 Local-gov 144778 HS-grad 9 Married-civ-spouse Exec-managerial Husband White Male 0 0 40 ? <=50K
9546 60 Private 25141 Some-college 10 Married-civ-spouse Transport-moving Husband White Male 0 0 60 United-States <=50K
3583 43 Local-gov 96102 Some-college 10 Married-civ-spouse Protective-serv Husband White Male 0 1887 40 United-States >50K
10072 18 Private 201554 Some-college 10 Never-married Machine-op-inspct Not-in-family White Female 0 0 15 United-States <=50K
12331 36 Private 102729 Some-college 10 Never-married Craft-repair Not-in-family White Male 0 0 40 United-States <=50K

4) .isna()

I've mentioned missing values a few times. Pandas represents missing data with a special NaN object (np.nan). This particular dataset doesn't have any missing data*, but in general we can check if there is any by combining the .isna() function (which returns True or False for each value) with the sum() function, which lets us tally the number of 'Trues' in each column to get the total number of missing values.

adults.isna().sum()
age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income_class      0
dtype: int64

WAIT A MINUTE! *Did you catch anything looking at .sample() above? We might have some missing data after all - something we'll look at later in the lesson... For now let's move on and do some more exploring.

Going Deeper

With initial inspection done, we have some options that might give us more insight.

Plotting Distributions

For a numeric column such as age, it can be useful to see the distribution of the data. For this, the hist function gives us a nice visual representation.

adults['age'].hist()
[Plot: histogram of the age column]

Correlation

[Image: illustration of correlation. Source: www.mathsisfun.com]

Correlation is a measure of the relationship between two variables. A high positive correlation means that when one variable is larger, the other is likely to be larger as well. Correlation doesn't tell us about the cause of a relationship, only whether one exists.

We can calculate the correlation between numerical columns with .corr(). This is especially useful when our 'target' column is numerical - it might show which columns are most strongly related to the target.

adults.corr(numeric_only=True) # numeric_only avoids errors from the text columns in newer pandas
age fnlwgt education-num capital-gain capital-loss hours-per-week
age 1.000000 -0.076646 0.036527 0.077674 0.057775 0.068756
fnlwgt -0.076646 1.000000 -0.043195 0.000432 -0.010252 -0.018768
education-num 0.036527 -0.043195 1.000000 0.122630 0.079923 0.148123
capital-gain 0.077674 0.000432 0.122630 1.000000 -0.031615 0.078409
capital-loss 0.057775 -0.010252 0.079923 -0.031615 1.000000 0.054256
hours-per-week 0.068756 -0.018768 0.148123 0.078409 0.054256 1.000000

There are ways of making this colourful and pretty, which can help pick out unusual values in dataframes with more columns, but I'll leave that for you to research :) We're usually looking for large (positive or negative) correlations. There isn't much to write home about here, but if there were some high correlation values we might want to take a closer look with a scatter plot:

adults.plot(x='hours-per-week', y='age', kind='scatter', alpha=0.3)
[Plot: scatter of age against hours-per-week]

Not very obvious there!

For the lazy, there is a useful function to plot all the scatter-plots at once. Warning: this can take a while with larger dataframes!

 
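One such function is pandas' scatter_matrix. A minimal sketch (limited to a few columns to keep it quick) might look like this:

pd.plotting.scatter_matrix(adults[['age', 'education-num', 'hours-per-week']],
                           figsize=(8, 8), alpha=0.2)  # a scatter plot for every pair of columns, histograms on the diagonal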

It might feel like a waste of time to be creating all these visualizations - don't we want to get straight to the modelling? But we are visual creatures, and these plots can reveal issues that we might miss trying to look through simple tables of values. To illustrate this, I'm going to re-visit a plot we made in Lesson 1:

diabetes = pd.read_csv('https://raw.githubusercontent.com/surabhim/Diabetes/master/Diabetes.csv', # URL
                 skiprows=9, # Why? Look at the raw file - the first 9 lines are a description, not data
                 names=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
                        'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']) # Must supply the column names ourselves
diabetes.plot(x='Glucose', y='BMI', kind='scatter')
[Plot: scatter of BMI against Glucose for the diabetes data]

Notice anything? Along both axes there's a smattering of points at zero. I'm no doctor, but I don't think a BMI of 0 sounds healthy - this may be missing data we should investigate further.
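One quick way to confirm the suspicion (a sketch that counts zeros per column, since here the 'missing' values are stored as 0 rather than NaN):

(diabetes[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']] == 0).sum()  # how many impossible zeros in each column?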

 

Categorical Variables

So far we've looked at techniques for examining numerical columns - let's remedy that. Categorical data is split into discrete classes/groups/options. For example, a 'Gender' column might contain 'Male', 'Female', 'Other' and 'PreferNotToSay'. When it comes time to model, we'll need to think about how we turn these columns into numerical values the model can use as inputs.

In some cases, categories do have an obvious ordering. For example, education completed might be 'high school', 'bachelors', 'masters', 'phd' - this is called an ordinal variable, and we can get away with encoding these as 1, 2, 3 and 4 respectively. But in the gender example above, there is no implied ordering that would make sense.
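For example (a minimal sketch on a made-up column, not a real dataset), the map() function handles this kind of ordinal encoding nicely:

edu = pd.Series(['high school', 'masters', 'bachelors', 'phd'])
edu_order = {'high school': 1, 'bachelors': 2, 'masters': 3, 'phd': 4}
edu.map(edu_order)  # -> 1, 3, 2, 4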

For both modelling and visualization, it often makes sense to look at the different categories individually.

Let's quickly run through some useful functions for working with this kind of data:

1) .unique()

Need to see what unique values occur in a column? unique to the rescue. In some cases, you might want to see how many unique values there are - in that case use df['column'].unique().shape (or the handy df['column'].nunique()).

adults['education'].unique()
array([' Bachelors', ' HS-grad', ' 11th', ' Masters', ' 9th',
       ' Some-college', ' Assoc-acdm', ' Assoc-voc', ' 7th-8th',
       ' Doctorate', ' Prof-school', ' 5th-6th', ' 10th', ' 1st-4th',
       ' Preschool', ' 12th'], dtype=object)

2) .value_counts()

How many rows have 'Masters' in the education column? value_counts gives an easy answer. In cases where your target is a categorical variable, it can be useful to see the relative distribution of the different classes. Here, for example, it's very clear that only a few people in this dataset stopped their education after preschool.

adults['education'].value_counts()
 HS-grad         10501
 Some-college     7291
 Bachelors        5355
 Masters          1723
 Assoc-voc        1382
 11th             1175
 Assoc-acdm       1067
 10th              933
 7th-8th           646
 Prof-school       576
 9th               514
 12th              433
 Doctorate         413
 5th-6th           333
 1st-4th           168
 Preschool          51
Name: education, dtype: int64

3) .groupby()

This is a more complex one, but very powerful. groupby allows you to apply a function to all rows sharing a specific value in the column you're grouping by. In this case we use mean(), but you can swap in .count() or .sum() or your own more complex function. You'll need groupby one day - now that you know what it does, you can search for examples to help with whatever specific challenge you're facing.

adults.groupby('education').mean(numeric_only=True) # numeric_only avoids errors from the text columns in newer pandas
age fnlwgt education-num capital-gain capital-loss hours-per-week
education
10th 37.429796 196832.465166 6.0 404.574491 56.845659 37.052519
11th 32.355745 194928.077447 7.0 215.097872 50.079149 33.925957
12th 32.000000 199097.508083 8.0 284.087760 32.337182 35.780600
1st-4th 46.142857 239303.000000 2.0 125.875000 48.327381 38.255952
5th-6th 42.885886 232448.333333 3.0 176.021021 68.252252 38.897898
7th-8th 48.445820 188079.171827 4.0 233.939628 65.668731 39.366873
9th 41.060311 202485.066148 5.0 342.089494 28.998054 38.044747
Assoc-acdm 37.381443 193424.093721 12.0 640.399250 93.418932 40.504217
Assoc-voc 38.553546 181936.016643 11.0 715.051375 72.754703 41.610709
Bachelors 38.904949 188055.914846 13.0 1756.299533 118.350327 42.614006
Doctorate 47.702179 186698.760291 16.0 4770.145278 262.845036 46.973366
HS-grad 38.974479 189538.739739 9.0 576.800114 70.466622 40.575374
Masters 44.049913 179852.362739 14.0 2562.563552 166.719675 43.836332
Preschool 42.764706 235889.372549 1.0 898.392157 66.490196 36.647059
Prof-school 44.746528 185663.706597 15.0 10414.416667 231.203125 47.425347
Some-college 35.756275 188742.922370 10.0 598.824167 71.637087 38.852284
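For example (a quick sketch of the 'swap in your own function' idea), we could look at just the hours-per-week column with two aggregations at once:

adults.groupby('education')['hours-per-week'].agg(['mean', 'count'])  # average hours and group size per education level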
 

Exercise Time

Phew, we've covered quite a bit already! The above should give you everything you need to take a nice hard look at any dataset you'll be working with. There are some extra topics covered in the next few sections, but for now let's try and apply the things we've learnt so far by putting them into practice on a new dataset. As a bonus, this will save us time when we use this dataset in lesson 4.

Take a look at the Sendy Logistics Challenge on Zindi. You'll need to register on Zindi to join the competition and access the data, which comes as 5 separate files:

  • Train.csv - is the dataset that you will use to train your model.
  • Test.csv - is the dataset that you will apply your model to.
  • Riders.csv - contains unique rider Ids, number of orders, age, rating and number of ratings
  • VariableDefinitions.csv - Definitions of variables in the Train, Test and Riders files
  • SampleSubmission.csv - Shows the submission format (we'll look at this later, in lesson 4)

1) Load and inspect the data from Train.csv. Any missing data? What is the target column?

2) How common are the different vehicle types? Is there a difference in the trip time or distance depending on vehicle type?

3) What columns are highly correlated (positively or negatively) with the target? Plot some scatter plots to visualize the relationships
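If you want a scaffold to start from, a sketch like this might help (it assumes you've downloaded Train.csv into your working directory; check VariableDefinitions.csv for the exact column names before filling in the placeholder):

train = pd.read_csv('Train.csv')   # path assumes the file sits next to your notebook
print(train.shape)
train.isna().sum()                 # 1) any missing data?
# train['...'].value_counts()      # 2) fill in the vehicle type column name here
train.corr(numeric_only=True)      # 3) which numeric columns relate to the target?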

Dates

Dates deserve a special mention here, as they can be particularly annoying to deal with. Date formats, timezones, etc all have the ability to ruin your day. Fortunately, pandas helps us out in a number of ways. Check out their user guide for complex examples, or follow along here as we load some data with dates and mess about to see what we can use them for.

The dataset we'll be using contains historical bitcoin price data. Let's load it in and take a look.

url = 'https://raw.githubusercontent.com/Yrzxiong/Bitcoin-Dataset/master/bitcoin_dataset.csv'
bitcoin = pd.read_csv(url, parse_dates=['Date']) # Note the new addition here
bitcoin.head()
Date btc_market_price btc_total_bitcoins btc_market_cap btc_trade_volume btc_blocks_size btc_avg_block_size btc_n_orphaned_blocks btc_n_transactions_per_block btc_median_confirmation_time ... btc_cost_per_transaction_percent btc_cost_per_transaction btc_n_unique_addresses btc_n_transactions btc_n_transactions_total btc_n_transactions_excluding_popular btc_n_transactions_excluding_chains_longer_than_100 btc_output_volume btc_estimated_transaction_volume btc_estimated_transaction_volume_usd
0 2010-02-17 0.0 2043200.0 0.0 0.0 0.0 0.000235 0 1.0 0.0 ... 31.781022 0.0 241 244 41240 244 244 65173.13 36500.0 0.0
1 2010-02-18 0.0 2054650.0 0.0 0.0 0.0 0.000241 0 1.0 0.0 ... 154.463801 0.0 234 235 41475 235 235 18911.74 7413.0 0.0
2 2010-02-19 0.0 2063600.0 0.0 0.0 0.0 0.000228 0 1.0 0.0 ... 1278.516635 0.0 185 183 41658 183 183 9749.98 700.0 0.0
3 2010-02-20 0.0 2074700.0 0.0 0.0 0.0 0.000218 0 1.0 0.0 ... 22186.687990 0.0 224 224 41882 224 224 11150.03 50.0 0.0
4 2010-02-21 0.0 2085400.0 0.0 0.0 0.0 0.000234 0 1.0 0.0 ... 689.179876 0.0 218 218 42100 218 218 12266.83 1553.0 0.0

5 rows × 24 columns

With a date column, we can filter the dataset in ways that make intuitive sense:

bitcoin.loc[bitcoin['Date']>'2014-02-07'].head()
Date btc_market_price btc_total_bitcoins btc_market_cap btc_trade_volume btc_blocks_size btc_avg_block_size btc_n_orphaned_blocks btc_n_transactions_per_block btc_median_confirmation_time ... btc_cost_per_transaction_percent btc_cost_per_transaction btc_n_unique_addresses btc_n_transactions btc_n_transactions_total btc_n_transactions_excluding_popular btc_n_transactions_excluding_chains_longer_than_100 btc_output_volume btc_estimated_transaction_volume btc_estimated_transaction_volume_usd
1452 2014-02-08 707.00 12370925.0 8.746244e+09 3.722707e+07 14650.0 0.195672 0 371.0 8.350000 ... 5.918574 48.410448 131014 64069 32523065 57867 33335 602439.9845 74123.0 52404670.0
1453 2014-02-09 710.00 12374700.0 8.786037e+09 3.919396e+07 14679.0 0.195116 0 401.0 7.933333 ... 4.462111 45.342082 112334 59284 32582349 52019 33916 549603.5616 84848.0 60241888.0
1454 2014-02-10 665.00 12378525.0 8.231719e+09 1.124874e+08 14712.0 0.221251 0 356.0 8.166667 ... 3.179425 37.743190 155591 67622 32649971 61132 40595 640746.2296 120714.0 80274582.0
1455 2014-02-11 670.00 12382700.0 8.296409e+09 7.191955e+07 14742.0 0.183303 0 294.0 7.233333 ... 3.042742 44.608411 121449 62872 32712843 57955 40090 760757.1299 137573.0 92174104.0
1456 2014-02-12 667.01 12386875.0 8.262169e+09 2.538242e+07 14784.0 0.255354 0 344.0 7.366667 ... 4.162464 44.499194 121999 62805 32775648 56760 36459 655177.7497 100662.0 67142256.0

5 rows × 24 columns

We can also set the date column as the index, and do all sorts of fun things:

bitcoin.set_index('Date',inplace=True)
bitcoin.head(2) # Note the change
btc_market_price btc_total_bitcoins btc_market_cap btc_trade_volume btc_blocks_size btc_avg_block_size btc_n_orphaned_blocks btc_n_transactions_per_block btc_median_confirmation_time btc_hash_rate ... btc_cost_per_transaction_percent btc_cost_per_transaction btc_n_unique_addresses btc_n_transactions btc_n_transactions_total btc_n_transactions_excluding_popular btc_n_transactions_excluding_chains_longer_than_100 btc_output_volume btc_estimated_transaction_volume btc_estimated_transaction_volume_usd
Date
2010-02-17 0.0 2043200.0 0.0 0.0 0.0 0.000235 0 1.0 0.0 0.000029 ... 31.781022 0.0 241 244 41240 244 244 65173.13 36500.0 0.0
2010-02-18 0.0 2054650.0 0.0 0.0 0.0 0.000241 0 1.0 0.0 0.000029 ... 154.463801 0.0 234 235 41475 235 235 18911.74 7413.0 0.0

2 rows × 23 columns

bitcoin['btc_market_price'].plot() # With a DatetimeIndex, pandas gives us a sensible time axis automatically
[Plot: bitcoin market price over the full date range]
bitcoin_2016 = bitcoin['2016-01-01':'2017-01-01']
bitcoin_2016['btc_market_price'].resample('2w').mean().plot() # Q: What does .resample('w').mean() do?
[Plot: 2016 bitcoin market price, resampled to two-week averages]

Now, if you don't use parse_dates when loading the data, or if you've created a date string by combining multiple columns (like delivery_year, delivery_month...), you can still convert to a datetime column with pd.to_datetime:

b = pd.read_csv(url)
b['New Date'] = pd.to_datetime(b['Date'])
b[['Date', 'New Date']].info() # See the types
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2906 entries, 0 to 2905
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Date      2906 non-null   object        
 1   New Date  2906 non-null   datetime64[ns]
dtypes: datetime64[ns](1), object(1)
memory usage: 45.5+ KB

Dates are useful because we can extract so many things from them. If we know the date, we can tell

  • If it's a weekend
  • If it's near the end of the month
  • What holidays are happening
  • What the traffic might be like (if we know time as well)
  • ...

We'll use the date methods, accessed through .dt, to show just one example in the lambda function section.

Missing Data

It is rare in the real world to find a dataset with no missing data. Forms are filled incorrectly, sensors break, connections are lost... it is just something we have to accept. The ways we deal with this missing data can vary. In some cases, we can simply ignore rows with missing data. But in others, we might not have the luxury of being able to just discard data. Other options include:

  • Replace missing values with 0. If you need to have numerical inputs so that you can fit a model, and you don't really care what the value is, df.fillna(0) can get rid of the NaNs which might otherwise cause issues. This can obviously be problematic if those numbers are important!
  • Replace with the median or mean. A better solution if you must assume a value. fillna can replace with the mean or, if you're really fancy, you could use groupby to replace missing values in specific categories with the mean for rows in that category (see the sketch after this list).
  • Replace with the most common value. Especially with a categorical column, you could assign the most common class rather than creating a new 'unknown' class. Again, this could be dangerous, so think before using it blindly.
  • Infer with machine learning. In some cases, we can predict the missing values by creating a model that takes the other columns as inputs and fills in its best guess!
  • Interpolate (timeseries). If the value of column A was 5 at 7pm, 6 at 8pm, NaN (missing) at 9pm and 8 at 10pm, it seems reasonable to guess that it was around 7 at 9pm. This is called interpolation - a useful technique if you have timeseries data (data arriving at regular intervals over time). You could also just 'forward fill' and assume it was the same as the last known reading.
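Here's a minimal sketch of a few of these options on a tiny made-up dataframe (the column names are purely illustrative):

import numpy as np
readings = pd.DataFrame({'temp': [20.0, np.nan, 24.0, np.nan, 30.0],
                         'room': ['A', 'A', 'B', 'B', 'B']})

readings.isna().mean()                              # fraction missing per column - a useful first check
readings['temp'].fillna(0)                          # the blunt option: fill with 0
readings['temp'].fillna(readings['temp'].median())  # fill with the overall median
readings['temp'].fillna(readings.groupby('room')['temp'].transform('mean'))  # fill with the per-room mean
readings['temp'].interpolate()                      # timeseries-style interpolation
readings['temp'].ffill()                            # forward fill the last known reading

None of these modify readings in place - assign the result back to a column if you want to keep it.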

Deciding which method to use requires giving some thought to the context and the implications. Filling with the mean might be fine, but if a column is missing 80% of its rows it might be better to exclude it rather than adding lots of synthetic data.

We'll discuss these options each time we encounter missing data. For now, let's look at two final data-wrangling techniques to keep in our toolbox.

Pandas Merge and Concat

We often end up with useful information scattered across multiple dataframes which we'd like to join together somehow. If we have dataframes with some consistency (the same columns or rows) we can simply 'stack' them together with concat. But in many cases we need something more nuanced, able to link information based on some shared key or bring in only specific rows - this is where merge shines. We'll look at a few examples here, but if you're looking for more there is a nice long example here.

Let's start by making some demo dataframes so that we can see what's going on.

df_a = pd.DataFrame({
    'A':[1, 2, 3],
    'B':['Q', 'E', 'W'],
    'C':[12.3, 17, 24.2]
})
df_a
A B C
0 1 Q 12.3
1 2 E 17.0
2 3 W 24.2
df_b = pd.DataFrame({
    'A':[4, 5],
    'B':['T', 'R'],
    'C':[22.1, 10.3]
})
df_b
A B C
0 4 T 22.1
1 5 R 10.3
df_c = pd.DataFrame({
    'Letter':['Q', 'W', 'E', 'R', 'T', 'Y'],
    'D':[9, 8, 7, 6, 5, 4]
})
df_c
Letter D
0 Q 9
1 W 8
2 E 7
3 R 6
4 T 5
5 Y 4
Concat

Look at df_a and df_b. They both have the same columns - perhaps these come from two different customers collecting the same kind of data. If we want one 'master' dataframe and need to know the average value of column C across all customers, we can simply stack them together with:

df_ab = pd.concat([df_a, df_b])
df_ab
A B C
0 1 Q 12.3
1 2 E 17.0
2 3 W 24.2
0 4 T 22.1
1 5 R 10.3

By default, concat stacks them vertically. But if we want to keep the rows consistent and add extra columns instead, we can change this with the 'axis' parameter:

pd.concat([df_a, df_b], axis=1)
A B C A B C
0 1 Q 12.3 4.0 T 22.1
1 2 E 17.0 5.0 R 10.3
2 3 W 24.2 NaN NaN NaN

Note that the above example doesn't make much sense - we now have duplicated columns, and the rows in the two dataframes don't really correspond to each other. You can also see that since the shapes don't match we have some NaN values.

Merge

Take a look at df_c - the 'Letter' column looks like it corresponds to the 'B' column in our other two dataframes. And it has another column, 'D', which might contain extra information we'd like to include in our main dataframe. What we'd really like is a way to join the two together based on this shared 'key'.

When doing a merge, we specify a 'left' dataframe and a 'right' dataframe, as well as the type of join. There are different merge types, but the one we're most often interested in is called a 'left join' - for each key in the left dataframe, we find rows in the right one with matching keys and add that data into the left one.

We specify the 'key' column with the 'on' parameter (if it's the same in both) or the 'left_on' and 'right_on' parameters if the column names are different.

pd.merge(df_a, # Left dataframe
         df_c, # Right dataframe
         left_on='B', # Column with the key 
         right_on='Letter', # Column with the key 
         how='left') # Type of merge/join
A B C Letter D
0 1 Q 12.3 Q 9
1 2 E 17.0 E 7
2 3 W 24.2 W 8

What happens when some rows don't have matches? Let's flip things around and try to add info from df_a to df_c:

pd.merge(df_c, # Switched - df_c now the left one
         df_a, # Right dataframe
         left_on='Letter', right_on='B', how='left') 
Letter D A B C
0 Q 9 1.0 Q 12.3
1 W 8 3.0 W 24.2
2 E 7 2.0 E 17.0
3 R 6 NaN NaN NaN
4 T 5 NaN NaN NaN
5 Y 4 NaN NaN NaN

We've got the extra data where there were matches and missing data where there weren't. Now:

  • What happens if you change the above code to have how='inner'?
  • Try some other merges or combinations. Don't be afraid to google - maaaaany people find these confusing, but you can almost always find an example of what you're trying to do.

This little section is pretty short, but it opens up a world of possibilities - we can now source data from many different places and combine it in creative ways! Data on poverty and data on climate? Link by country to plot the relationships! Insurance data spread across 'Policies.csv', 'Customers.csv' and 'Payments.csv'? Merge to the rescue! The power is yours :)

Getting Funky with Lambda Functions

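Here's one example of the kind of thing lambda functions let us do, combining them with the .dt date methods from earlier (a quick sketch using the b dataframe from the dates section; the new column names are just illustrative):

b['day_of_week'] = b['New Date'].dt.dayofweek               # Monday=0 ... Sunday=6
b['is_weekend'] = b['day_of_week'].apply(lambda d: d >= 5)  # a lambda is just a small, unnamed function
b[['New Date', 'day_of_week', 'is_weekend']].head()

The lambda here is trivial (d >= 5 on its own would do the same job), but the same pattern lets you apply arbitrarily complex logic to every value in a column.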
 
 
 

Visualization Exercise:

Find some data, fill in any missing values, and make some plots.
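If you're stuck for a starting point, one option (a sketch reusing the diabetes data from earlier) is to treat those suspicious zeros as missing values, fill them with the median, and re-plot:

import numpy as np
cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diabetes[cols] = diabetes[cols].replace(0, np.nan)               # zeros in these columns are really missing values
diabetes[cols] = diabetes[cols].fillna(diabetes[cols].median())  # fill each column with its median
diabetes.plot(x='Glucose', y='BMI', kind='scatter', alpha=0.3)   # compare with the earlier scatter plot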