from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://thesession.org/tunes/search?type=&mode=Dmajor&q=&page=1'
page = urlopen(url) # We would put this in a try/except block if we were being careful
soup = BeautifulSoup(page, 'html.parser') # BeautifulSoup kindly parses the html for us
For each page of results there is a list of ten songs. Each one is a list item (li in html) that starts with <li class="manifest-item">. We can use BeautifulSoup's find_all method to get all of these list items:
list_items = soup.find_all('li', {"class": "manifest-item"})
urls = []
for li in list_items:
link = li.find('a') # Get the link to the tune
urls.append(link.get('href')) # Store it for later
print(urls) # A list of URLs for us to scrape
import requests

tune_page = requests.get('https://thesession.org' + urls[0]) # Fetch the first tune's page
tune_soup = BeautifulSoup(tune_page.text, 'html5lib')
tune_content = tune_soup.find('div', {"class": "notes"}) # The div that holds the tune itself
tune_content.text
I found it easiest at this stage to just parse the text above myself, using a function to pull out the tune title, info and notes:
parse_tune(tune_content)
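parse_tune isn't reproduced in full here, but a minimal sketch might look something like the following, assuming the notes div holds the tune in ABC notation and using the standard ABC header fields (T: title, R: rhythm/type, M: meter, K: key) to fill the same five fields as the csv columns used later:

def parse_tune(tune_content):
    # Sketch only: assumes the text of the notes div is ABC notation, where the
    # header lines give the title, rhythm, meter and key, and everything after
    # the K: line is the tune body itself.
    title = tune_type = time_sig = key = ''
    body = []
    in_body = False
    for line in tune_content.text.split('\n'):
        line = line.strip()
        if not line:
            continue
        if in_body:
            body.append(line)
        elif line.startswith('T:'):
            title = line[2:].strip()
        elif line.startswith('R:'):
            tune_type = line[2:].strip()
        elif line.startswith('M:'):
            time_sig = line[2:].strip()
        elif line.startswith('K:'):
            key = line[2:].strip()
            in_body = True # the notation follows the key field
    return title, tune_type, time_sig, key, ' '.join(body)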
Scraping in parallel
I'm impatient, so I use some parallel processing with ThreadPoolExecutor to fetch up to ten tunes at once. A thread lock stops concurrent writes to the csv file that I've set up get_tune to store the tunes in.
import threading
csv_writer_lock = threading.Lock()
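get_tune and set_up_threads aren't shown above either; roughly, they look like the sketch below, which reuses the imports and parse_tune from earlier. The csv path argument and the exact arguments are my own guesses:

import csv
from concurrent.futures import ThreadPoolExecutor

def get_tune(url, csv_path='data/all_tunes.csv'):
    # Fetch one tune page, parse it, and append a row to the csv,
    # holding the lock so threads don't interleave their writes.
    page = requests.get('https://thesession.org' + url)
    soup = BeautifulSoup(page.text, 'html5lib')
    content = soup.find('div', {"class": "notes"})
    row = parse_tune(content)
    with csv_writer_lock:
        with open(csv_path, 'a', newline='') as f:
            csv.writer(f).writerow(row)

def set_up_threads(urls):
    # Fetch up to ten tunes at once
    with ThreadPoolExecutor(max_workers=10) as executor:
        executor.map(get_tune, urls)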
We can use this to grab all our tunes in parallel:
set_up_threads(urls)
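The single search page above only gives ten urls; to collect more, the same listing scrape just loops over the page parameter. A sketch of that (the helper name and the 50-page count are mine, chosen to match the 500 tunes mentioned below):

def get_tune_urls(num_pages=50):
    # 50 pages of 10 tunes each gives roughly the 500 tunes used below
    urls = []
    for page_num in range(1, num_pages + 1):
        url = f'https://thesession.org/tunes/search?type=&mode=Dmajor&q=&page={page_num}'
        soup = BeautifulSoup(urlopen(url), 'html.parser')
        for li in soup.find_all('li', {"class": "manifest-item"}):
            urls.append(li.find('a').get('href'))
    return urls

Calling set_up_threads(get_tune_urls()) would then pull everything down in one go.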
I ran this for multiple pages of search results to get 500 songs. This is what that ends up looking like:
import pandas as pd

df = pd.read_csv('data/all_tunes.csv', names=['Title', 'Type', 'TS', 'Key', 'Notes'])
print(df.shape)
df.head()
This dataset will be useful for a project idea I have had brewing - but my hour is up so that's where we are ending for today.