from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://thesession.org/tunes/search?type=&mode=Dmajor&q=&page=1'
page = urlopen(url) # We would put this in a try/except block if we were being careful
soup = BeautifulSoup(page, 'html.parser') # BeautifulSoup kindly parses the html for us
For each page of results there is a list of ten songs. Each one is a list item (li in html) that starts with <li class="manifest-item">. We can use BeautifulSoup's find_all method to get all of these list items:
list_items = soup.find_all('li', {"class": "manifest-item"})
urls = []
for li in list_items:
link = li.find('a') # Get the link to the tune
urls.append(link.get('href')) # Store it for later
print(urls) # A list of URLs for us to scrape
import requests

tune_page = requests.get('https://thesession.org' + urls[0]) # Fetch the first tune's page
tune_soup = BeautifulSoup(tune_page.text, 'html5lib')
tune_content = tune_soup.find('div', {"class": "notes"}) # The div that holds the tune itself
tune_content.text
I found it easiest at this stage to just parse the text above myself, using a function to pull out the tune title, info and notes:
parse_tune(tune_content)
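parse_tune isn't reproduced in full here, but a minimal sketch might look something like the following, assuming the notes div holds the tune in ABC notation and using the standard ABC header fields (T: title, R: rhythm/type, M: meter, K: key) to fill the same five fields as the csv columns used later:

def parse_tune(tune_content):
    # Sketch only: assumes the text of the notes div is ABC notation, where the
    # header lines give the title, rhythm, meter and key, and everything after
    # the K: line is the tune body itself.
    title = tune_type = time_sig = key = ''
    body = []
    in_body = False
    for line in tune_content.text.split('\n'):
        line = line.strip()
        if not line:
            continue
        if in_body:
            body.append(line)
        elif line.startswith('T:'):
            title = line[2:].strip()
        elif line.startswith('R:'):
            tune_type = line[2:].strip()
        elif line.startswith('M:'):
            time_sig = line[2:].strip()
        elif line.startswith('K:'):
            key = line[2:].strip()
            in_body = True # the notation follows the key field
    return title, tune_type, time_sig, key, ' '.join(body)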
Scraping in parallel
I'm impatient, so I use some parallel processing with ThreadPoolExecutor to fetch up to ten tunes at once. A thread lock stops concurrent writes to the csv file that I've set up get_tune to store the tunes in.
import threading
csv_writer_lock = threading.Lock()
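get_tune and set_up_threads aren't shown above either; roughly, they look like the sketch below, which reuses the imports and parse_tune from earlier. The csv path argument and the exact arguments are my own guesses:

import csv
from concurrent.futures import ThreadPoolExecutor

def get_tune(url, csv_path='data/all_tunes.csv'):
    # Fetch one tune page, parse it, and append a row to the csv,
    # holding the lock so threads don't interleave their writes.
    page = requests.get('https://thesession.org' + url)
    soup = BeautifulSoup(page.text, 'html5lib')
    content = soup.find('div', {"class": "notes"})
    row = parse_tune(content)
    with csv_writer_lock:
        with open(csv_path, 'a', newline='') as f:
            csv.writer(f).writerow(row)

def set_up_threads(urls):
    # Fetch up to ten tunes at once
    with ThreadPoolExecutor(max_workers=10) as executor:
        executor.map(get_tune, urls)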
We can use this to grab all our tunes in parallel:
set_up_threads(urls)
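The single search page above only gives ten urls; to collect more, the same listing scrape just loops over the page parameter. A sketch of that (the helper name and the 50-page count are mine, chosen to match the 500 tunes mentioned below):

def get_tune_urls(num_pages=50):
    # 50 pages of 10 tunes each gives roughly the 500 tunes used below
    urls = []
    for page_num in range(1, num_pages + 1):
        url = f'https://thesession.org/tunes/search?type=&mode=Dmajor&q=&page={page_num}'
        soup = BeautifulSoup(urlopen(url), 'html.parser')
        for li in soup.find_all('li', {"class": "manifest-item"}):
            urls.append(li.find('a').get('href'))
    return urls

Calling set_up_threads(get_tune_urls()) would then pull everything down in one go.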
I ran this for multiple pages of search results to get 500 songs. This is what that ends up looking like:
import pandas as pd

df = pd.read_csv('data/all_tunes.csv', names=['Title', 'Type', 'TS', 'Key', 'Notes'])
print(df.shape)
df.head()
This dataset will be useful for a project idea I have had brewing - but my hour is up so that's where we are ending for today.