Pulling Irish music for future experiments

Getting Tune URLs with BeautifulSoup

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://thesession.org/tunes/search?type=&mode=Dmajor&q=&page=1'
page = urlopen(url) # We would wrap this in a try/except block if we were being careful
soup = BeautifulSoup(page, 'html.parser') # BeautifulSoup kindly parses the HTML for us

Each page of results lists ten songs. Each one is a list item (li in HTML) with the class "manifest-item". We can use BeautifulSoup's findAll method to get all of these list items:

list_items = soup.findAll('li', {"class": "manifest-item"})
urls = []
for li in list_items:
    link = li.find('a') # Get the link to the tune
    urls.append(link.get('href')) # Store it for later 

print(urls) # A list of URLs for us to scrape
['/tunes/27', '/tunes/55', '/tunes/182', '/tunes/64', '/tunes/9', '/tunes/116', '/tunes/73', '/tunes/19', '/tunes/49', '/tunes/5']
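Since the search URL ends in a page query parameter, later pages are easy to reach by incrementing it. A small sketch of this idea (search_url is a hypothetical helper name, not from the original code):

```python
# Hypothetical helper: build the search URL for a given results page.
def search_url(page, mode='Dmajor'):
    return f'https://thesession.org/tunes/search?type=&mode={mode}&q=&page={page}'

# Later pages could then be fetched and parsed with the same findAll logic:
# for page_num in range(1, 51):
#     soup = BeautifulSoup(urlopen(search_url(page_num)), 'html.parser')
```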

Extracting the tune from the tune page

There are often multiple 'arrangements' of a tune, and each gets its own 'notes' div on the tune page - we use find instead of findAll to get only the first one:

import requests # the requests library handles the HTTP fetch for us

tune_page = requests.get('https://thesession.org'+urls[0])
tune_soup = BeautifulSoup(tune_page.text, 'html5lib')
tune_content = tune_soup.find('div', {"class": "notes"})
tune_content.text
'\nX: 1\nT: Drowsy Maggie\nR: reel\n\nM: 4/4\nL: 1/8\nK: Edor\n|:E2BE dEBE|E2BE AFDF|E2BE dEBE|BABc dAFD:|\nd2fd c2ec|defg afge|d2fd c2ec|BABc dAFA|\nd2fd c2ec|defg afge|afge fdec|BABc dAFD|\n\n'

I found it easiest at this stage to just parse the text above myself, using a function to pull out the tune title, info and notes:

parse_tune[source]

parse_tune(tune_content)

parse_tune(tune_content)
{'Title': 'Drowsy Maggie',
 'Type': 'reel',
 'Meter': '4/4',
 'Length': '1/8',
 'Key': 'Edor',
 'Notes': '|:E2BE dEBE|E2BE AFDF|E2BE dEBE|BABc dAFD:|d2fd c2ec|defg afge|d2fd c2ec|BABc dAFA|d2fd c2ec|defg afge|afge fdec|BABc dAFD|'}
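The source of parse_tune is linked above rather than shown. A minimal sketch that reproduces the output dict, assuming the ABC header lines seen in the raw text (T:, R:, M:, L:, K:), might look like this - the original may differ:

```python
def parse_tune(tune_content):
    """Parse the text of a tune's 'notes' div into a dict of fields.
    Accepts a BeautifulSoup tag or a plain string (an assumption made
    for this sketch)."""
    text = tune_content if isinstance(tune_content, str) else tune_content.text
    header_fields = {'T': 'Title', 'R': 'Type', 'M': 'Meter', 'L': 'Length', 'K': 'Key'}
    tune, notes = {}, []
    for line in text.strip().split('\n'):
        line = line.strip()
        if not line or line.startswith('X:'):
            continue  # skip blank lines and the X: reference-number line
        if len(line) > 1 and line[1] == ':' and line[0] in header_fields:
            tune[header_fields[line[0]]] = line[2:].strip()
        else:
            notes.append(line)  # anything else is part of the notation
    tune['Notes'] = ''.join(notes)
    return tune
```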

Scraping in parallel

I'm impatient, so I use some parallel processing with ThreadPoolExecutor to fetch up to ten tunes at once. A thread lock stops concurrent writes to the CSV file that get_tune stores the tunes in.

import threading

csv_writer_lock = threading.Lock()

get_tune[source]

get_tune(tune_url, savefile='data/all_tunes.csv')
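get_tune's source is also just linked, but the locked CSV write it relies on is worth sketching. Here write_row is a hypothetical helper name, and the column order is an assumption based on the fields parse_tune returns:

```python
import csv
import threading

csv_writer_lock = threading.Lock()

# Hypothetical helper: append one parsed tune to the CSV while holding
# the lock, so threads never interleave partial rows.
def write_row(tune, savefile='data/all_tunes.csv'):
    fields = ('Title', 'Type', 'Meter', 'Length', 'Key', 'Notes')
    with csv_writer_lock:
        with open(savefile, 'a', newline='') as f:
            csv.writer(f).writerow([tune.get(k, '') for k in fields])
```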

set_up_threads[source]

set_up_threads(urls)

We can use this to grab all our tunes in parallel:

set_up_threads(urls)
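set_up_threads' source isn't shown either. A sketch of the ThreadPoolExecutor fan-out, parameterised over the worker function (get_tune in the original) so it can be exercised with anything that takes a single URL:

```python
from concurrent.futures import ThreadPoolExecutor

# A sketch: map the worker over the URLs with up to ten threads.
# The with-block waits for all fetches to finish before returning.
def set_up_threads(urls, worker, max_workers=10):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        list(executor.map(worker, urls))  # list() surfaces any worker exceptions
```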

I ran this for multiple pages of search results to get 500 songs. This is what that ends up looking like:

df = pd.read_csv('data/all_tunes.csv', names = ['Title', 'Type', 'TS', 'Key', 'Notes'])
print(df.shape)
df.head()
(500, 5)
Title Type TS Key Notes
The Maid Behind The Bar reel 4/4 1/8 Dmaj |:FAAB AFED|FAAB ABde|fBBA Bcde|fBBA BcdA|FAAB...
The Musical Priest reel 4/4 1/8 Bmin |:BA|FBBA B2Bd|cBAf ecBA|FBBA B2Bd|cBAc B2:||:...
Banish Misfortune jig 6/8 1/8 Dmix fed cAG| A2d cAG| F2D DED| FEF GFG|AGA cAG| AG...
The Silver Spear reel 4/4 1/8 Dmaj A|:FA (3AAA BAFA|dfed BddA|FA (3AAA BAFA|dfed ...
Drowsy Maggie reel 4/4 1/8 Edor |:E2BE dEBE|E2BE AFDF|E2BE dEBE|BABc dAFD:|d2f...

This dataset will be useful for a project idea I've had brewing - but my hour is up, so that's where we'll end for today.