Introduction
Last lesson finished with a demo of creating and training a model with scikit-learn, and we saw that it only requires a couple of lines of code. We'll look at some more complex examples in lesson 3, but the key takeaway from that example is that the process is pretty easy IF your data is ready. This lesson is all about getting to that point.
We'll look at some example datasets and in the process we'll talk about:
- How to quickly get a feel for a dataset
- Things to look out for
- Useful visualizations
- Dealing with missing data
- 'Feature Engineering'
- Tips for enriching your datasets
This is a very quick intro to the concepts - for more examples see https://lewtun.github.io/dslectures/lesson02_exploratory-data-analysis/
Let's get started.
Encountering data in courses like this is always a little weird - in real-world scenarios you'll often know a lot about the data you're about to be working with (but not always!). But for a course like this we'll be pulling up data from all over the place for illustrative purposes. Don't worry too much about the nitty-gritty details - we'll try to give a bit of introduction to each dataset, and you're welcome to go deeper, but feel free to just take them as they come and figure things out if you need them :)
For example, this dataset comes from the UCI machine learning repository, which has lots of useful example datasets. It's called the 'adult' dataset, and focuses on predicting the income class of people in America given some background information on them. You can read about the data here - that page links to the raw data as well as some metadata such as the 'names' file which I used to generate the list of column headings used below.
The first thing I do after loading a dataset is to print the shape and view the first few rows to check that there aren't any glaring errors and that we have about as much data as we were expecting.
import pandas as pd
from matplotlib import pyplot as plt
names = ['age','workclass','fnlwgt','education','education-num',
'marital-status','occupation','relationship','race','sex',
'capital-gain','capital-loss','hours-per-week','native-country', 'income_class']
adults = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
names=names, )
print(adults.shape)
adults.head()
Take a minute to look at the different columns. Do the values seem to match the column headings? What kind of data does each column contain? Does anything look weird? Looking at the output above, I have no clue what fnlwgt is, but besides that it all looks fairly sensible. We'll talk about the different kinds of data shortly, but for now let's look at four other useful functions worth bringing up at this early inspection stage:
1) .describe()
The describe function gives us useful summaries of any numerical columns. Looking at the average (mean), min and max can be a useful check to make sure the variable behaves as you'd expect. For example, here we can see that age ranges from 17 to 90, with an average age of 38.6. If there was a max age of 150, you'd want to investigate that. Check the other columns - does anything look weird?
adults.describe()
2) .info()
Different columns in a dataframe can be different data types. For example, numbers might be represented as integers (int64) or floating point numbers (float) while strings are stored as a special 'object' type. Pandas guesses these types automatically, but it can sometimes have confusing results. For example, if there is one row with a string (e.g. 'missing') in a column of numbers then pandas will treat the whole column as type object, and you won't be able to see stats with describe() or use functions like .mean().
adults.info()
In this case the data looks fine. If you end up with columns treated as the wrong type, try to figure out why and then fix it by replacing the troublesome values (the replace() function should do the trick) or force conversion to your desired type with the astype() function.
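As a rough sketch (the dataframe and column name here are hypothetical), that clean-up might look something like this:
import numpy as np

# Hypothetical: a column pandas read as 'object' because a few rows contain the string 'missing'
df['some_column'] = df['some_column'].replace('missing', np.nan)  # swap the bad values for NaN
df['some_column'] = df['some_column'].astype(float)               # now the conversion to float works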
3) .sample()
Using .head() can hide errors if the problems aren't present in the first 5 rows. For example, you may have data ordered by the time it was collected, and later rows could have missing values. You could use .tail() to see the final few rows, but again the errors might lurk in the middle of the dataframe. Looking at a random sample from the data can often be instructive. Here, let's look at 5 random rows to check everything looks OK:
adults.sample(5)
4) .isna()
I've mentioned missing values a few times. Pandas represents missing data with a special NaN object (np.NaN). This particular dataset doesn't have any missing data*, but in general we can check if there is any by combining the .isna() function (which returns True or False for each value) with the sum() function, which lets us tally the number of 'Trues' in each column to get the total number of missing values.
adults.isna().sum()
WAIT A MINUTE! *Did you catch anything looking at .sample() above? We might have some missing data after all - something we'll look at later in the lesson... For now let's move on and do some more exploring.
adults['age'].hist()  # a histogram shows the distribution of values in a numerical column
Correlation
[Image: illustration of correlation. Source: www.mathsisfun.com]
Correlation is a measure of the relationship between two variables. A high positive correlation means that when one variable is larger, the other is likely to be larger as well. Correlation doesn't tell us about the cause of a relationship, only whether one exists.
We can calculate the correlation between numerical columns with .corr(). This is especially useful when our 'target' column is numerical - it might show which columns are most strongly related to the target.
adults.corr(numeric_only=True)  # numeric_only skips the text columns (needed on recent pandas versions)
There are ways of making this colourful and pretty, which can help pick out unusual values in dataframes with more columns, but I'll leave that for you to research :) We're usually looking for large (positive or negative) correlations. In this case there isn't much to write home about, but if there were some high correlation values we might want to take a closer look with a scatter plot:
adults.plot(x='hours-per-week', y='age', kind='scatter', alpha=0.3)
Not very obvious there!
For the lazy, there is a useful function to plot all the scatter-plots at once. Warning: this can take a while with larger dataframes!
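One such function is pandas' built-in scatter_matrix - here's a sketch limited to a few columns to keep it quick:
# Pairwise scatter plots for a few numeric columns (with histograms on the diagonal)
pd.plotting.scatter_matrix(adults[['age', 'education-num', 'hours-per-week']],
                           alpha=0.2, figsize=(8, 8))
plt.show()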
It might feel like a waste of time to be creating all these visualizations - don't we want to get straight to the modelling? But we are visual creatures, and these plots can reveal issues that we might miss trying to look through simple tables of values. To illustrate this, I'm going to re-visit a plot we made in Lesson 1:
diabetes = pd.read_csv('https://raw.githubusercontent.com/surabhim/Diabetes/master/Diabetes.csv', # URL
skiprows=9, # What's this? Look at the file - it breaks the convention
names=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']) # Must give the column names ourselves
diabetes.plot(x='Glucose', y='BMI', kind='scatter')
Notice anything? Along both axes there is a smattering of points at zero. I'm no doctor, but I don't think a BMI of 0 sounds healthy - this may be missing data we should investigate further.
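One quick check (a sketch) is to count how many rows sit at exactly zero in the suspicious columns:
# Count the rows with a value of exactly 0 in each of these columns
(diabetes[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']] == 0).sum()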
Categorical Variables
So far we've looked at techniques for examining numerical columns - let's remedy that. Categorical data is split into discrete classes/groups/options. For example, a 'Gender' column might contain 'Male', 'Female', 'Other' and 'PreferNotToSay'. When it comes time to model, we'll need to think about how we turn these columns into numerical values the model can use as inputs.
In some cases, categories do have an obvious ordering. For example, education completed might be 'high school', 'bachelors', 'masters', 'phd' - this is called an ordinal variable, and we can get away with encoding these as 1, 2, 3 and 4 respectively. But in the gender example above, there is no implied ordering that would make sense.
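For example, a minimal sketch of encoding an ordinal column with a simple mapping (the dataframe and column here are made up for illustration):
# Hypothetical ordinal column: map each category to a number that preserves the ordering
education_order = {'high school': 1, 'bachelors': 2, 'masters': 3, 'phd': 4}
df['education_encoded'] = df['education_completed'].map(education_order)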
For both modelling and visualization, it often makes sense to look at the different categories individually.
Let's quickly run through some useful functions for working with this kind of data:
1) .unique()
The unique function lists the distinct values that appear in a column:
adults['education'].unique()
2) .value_counts()
How many rows have 'Masters' in the education column? value_counts gives an easy answer. In cases where your target is a categorical variable, it can be useful to see the relative distribution of the different classes. Here, for example, it's very clear that only a few people in this dataset stopped their education after preschool.
adults['education'].value_counts()
3) .groupby()
This is a more complex one, but very powerful. groupby allows you to apply a function to all rows with a specific value in the column you're grouping by. In this case we use mean(), but you can swap in .count() or .sum() or your own more complex function. You'll need groupby one day - now that you know what it does, you can search for examples to help with whatever specific challenge you're facing.
adults.groupby('education').mean(numeric_only=True)  # numeric_only averages just the numeric columns (needed on recent pandas versions)
Exercise Time
Phew, we've covered quite a bit already! The above should give you everything you need to take a nice hard look at any dataset you'll be working with. There are some extra topics covered in the next few sections, but for now let's put what we've learnt so far into practice on a new dataset. As a bonus, this will save us time when we use this dataset in lesson 4.
Take a look at the Sendy Logistics Challenge on Zindi. You'll need to register on Zindi to join the competition and access the data, which comes as 5 separate files:
- Train.csv - is the dataset that you will use to train your model.
- Test.csv - is the dataset you will apply your model to.
- Riders.csv - contains unique rider Ids, number of orders, age, rating and number of ratings
- VariableDefinitions.csv - Definitions of variables in the Train, Test and Riders files
- SampleSubmission.csv - Shows the submission format (we'll look at this later, in lesson 4)
1) Load and inspect the data from Train.csv. Any missing data? What is the target column?
2) How common are the different vehicle types? Is there a difference in the trip time or distance depending on vehicle type?
3) What columns are highly correlated (positively or negatively) with the target? Plot some scatter plots to visualize the relationships
Dates
Dates deserve a special mention here, as they can be particularly annoying to deal with. Date formats, timezones, etc all have the ability to ruin your day. Fortunately, pandas helps us out in a number of ways. Check out their user guide for complex examples, or follow along here as we load some data with dates and mess about to see what we can use them for.
The dataset we'll be using contains historical bitcoin price data. Let's load it in and take a look.
url = 'https://raw.githubusercontent.com/Yrzxiong/Bitcoin-Dataset/master/bitcoin_dataset.csv'
bitcoin = pd.read_csv(url, parse_dates=['Date']) # Note the new addition here
bitcoin.head()
With a date column, we can filter the dataset in ways that make intuitive sense:
bitcoin.loc[bitcoin['Date']>'2014-02-07'].head()
We can also set the date column as the index, and do all sorts of fun things:
bitcoin.set_index('Date',inplace=True)
bitcoin.head(2) # Note the change
bitcoin['btc_market_price'].plot()  # with a date index, pandas labels the x-axis with dates for us
bitcoin_2016 = bitcoin['2016-01-01':'2017-01-01']
bitcoin_2016['btc_market_price'].resample('2w').mean().plot() # Q: What does .resample('w').mean() do?
Now, if you don't use parse_dates when loading the data, or if you've created a date string by combining multiple columns (like delivery_year, delivery_month...), we can still convert to a datetime column with pd.to_datetime:
b = pd.read_csv(url)
b['New Date'] = pd.to_datetime(b['Date'])
b[['Date', 'New Date']].info() # See the types
Dates are useful because we can extract so many things from them. If we know the date, we can tell
- If it's a weekend
- If it's near the end of the month
- What holidays are happening
- What the traffic might be like (if we know time as well)
- ...
We'll use the date methods, accessed through .dt, to show just one example in the lambda function section.
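As a small taster, here's a sketch using the 'New Date' column we just created - .dt exposes things like the day of the week and whether a date falls at the end of a month:
# A couple of the properties available through the .dt accessor
b[['New Date']].assign(
    day_of_week=b['New Date'].dt.dayofweek,      # 0 = Monday, ..., 6 = Sunday
    is_month_end=b['New Date'].dt.is_month_end   # True on the last day of the month
).head()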
Missing Data
It is rare in the real world to find a dataset with no missing data. Forms are filled incorrectly, sensors break, connections are lost... it is just something we have to accept. The ways we deal with this missing data can vary. In some cases, we can simply ignore rows with missing data. But in others, we might not have the luxury of being able to just discard data. Other options include:
- Replace missing values with 0. If you need numerical inputs so that you can fit a model, and you don't really care what the value is, df.fillna(0) can get rid of the NaNs which might otherwise cause issues. This can obviously be problematic if those numbers are important!
- Replace with the median or mean. A better solution if you must assume a value. fillna can replace with the mean or, if you're really fancy, you could use groupby to replace missing values in specific categories with the mean for rows in that category.
- Replace with the most common value. Especially when there is a categorical column, you could assign the most common class rather than creating a new 'unknown' class. Again, this could be dangerous, so think before using blindly.
- Infer with machine learning. In some cases, we can predict the missing values by creating a model that takes the other columns as inputs and fills in its best guess!
- Interpolate (timeseries). If the value of column A was 5 at 7pm, 6 at 8pm, NaN (missing) at 9pm and 8 at 10pm, it seems reasonable to guess that it was around 7 at 9pm. This is called interpolation - a useful technique if you have timeseries data (data arriving at regular intervals over time). You could also just 'forward fill' and assume it was the same as the last known reading. (A quick sketch of a few of these options in pandas follows this list.)
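To make a few of these options concrete, here's a rough sketch in pandas. The dataframe df, the numeric column 'A' and the category column 'group' are all hypothetical, and each line is an alternative rather than a sequence:
# Each expression returns a new column - assign it back (df['A'] = ...) to keep the result
df['A'].fillna(0)                                                   # replace with 0
df['A'].fillna(df['A'].mean())                                      # replace with the column mean
df.groupby('group')['A'].transform(lambda x: x.fillna(x.mean()))    # mean within each category
df['A'].fillna(df['A'].mode()[0])                                   # most common value
df['A'].interpolate()                                               # interpolate between known values
df['A'].ffill()                                                     # forward fill the last known value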
Deciding which method to use requires giving some thought to the context, and the implications. Filling with the mean might be fine, but if a column is missing 80% of its rows it might be better to exclude it rather than adding lots of synthetic data.
We'll discuss these options each time we encounter missing data. For now, let's look at two final data-wrangling techniques to keep in our toolbox.
Pandas Merge and Concat
We often end up with useful information scattered across multiple dataframes which we'd like to join together somehow. If we have dataframes with some consistency (the same columns or rows) we can simply 'stack' them together with concat. But in many cases we need something more nuanced, able to link information based on some shared key or bring in only specific rows - this is where merge shines. We'll look at a few examples here, but if you're looking for more there is a nice long example here.
Let's start by making some demo dataframes so that we can see what's going on.
df_a = pd.DataFrame({
'A':[1, 2, 3],
'B':['Q', 'E', 'W'],
'C':[12.3, 17, 24.2]
})
df_a
df_b = pd.DataFrame({
'A':[4, 5],
'B':['T', 'R'],
'C':[22.1, 10.3]
})
df_b
df_c = pd.DataFrame({
'Letter':['Q', 'W', 'E', 'R', 'T', 'Y'],
'D':[9, 8, 7, 6, 5, 4]
})
df_c
Concat
First, let's stack df_a and df_b together with concat:
df_ab = pd.concat([df_a, df_b])
df_ab
By default, concat stacks the dataframes vertically. But if we want to keep the rows consistent and add extra columns instead, we can change this with the 'axis' parameter:
pd.concat([df_a, df_b], axis=1)
Note that the above example doesn't make much sense - we now have duplicated columns and the rows in the two dataframes don't really correspond to each other. You can also see that since the shapes don't match we have some NaN values.
Merge
Take a look at df_c - the 'Letter' column looks like it corresponds to the 'B' column in our other two dataframes. And it has another column, 'D', which might contain extra information we'd like to include in our main dataframe. What we'd really like is a way to join the two together based on this shared 'key'.
When doing a merge, we specify a 'left' dataframe and a 'right' dataframe, as well as the type of join. There are different merge types, but the one we're most often interested in is called a 'left join' - for each key in the left dataframe, we find rows in the right one with matching keys and add that data into the left one.
We specify the 'key' column with the 'on' parameter (if it's the same in both) or the 'left_on' and 'right_on' parameters if the column names are different.
pd.merge(df_a, # Left dataframe
df_c, # Right dataframe
left_on='B', # Column with the key
right_on='Letter', # Column with the key
how='left') # Type of merge/join
What happens when some rows don't have matches? Let's flip things around and try to add info from df_a to df_c:
pd.merge(df_c, # Switched - df_c now the left one
df_a, # Right dataframe
left_on='Letter', right_on='B', how='left')
We've got the extra data where there were matches and missing data where there weren't. Now:
- What happens if you change the above code to have how='inner'?
- Try some other merges or combinations. Don't be afraid to google - maaaaany people find these confusing, but you can almost always find an example of what you're trying to do.
This little section is pretty short, but it opens a world of possibilities - we can now source data from many different places and combine it in creative ways! Data on poverty and data on climate? Link by country to plot the relationships! Insurance data spread across 'Policies.csv', 'Customers.csv' and 'Payments.csv'? Merge to the rescue! The power is yours :)