If you’ve ever…
felt like you’re playing Simon Says with mouse clicks when repeatedly extracting data in chunks from a front-end interface to a database on the web, well, you probably are. There’s probably a better solution – Selenium.
used the XML or httr packages in R or urllib2 in Python, you’ve probably encountered the situation where the source code you’ve scraped for a website doesn’t contain all the information you see in your browser. Selenium can probably help.
How it works
Selenium is a web automation tool. While not developed specifically for web scraping, Selenium does it pretty dang well. Selenium literally “drives” your browser, so it can see anything you see when you right-click and choose Inspect Element in Chrome or Firefox. This vastly widens the universe of content that can be extracted through automation, but it can be slow because all content must be rendered in the browser.
There are headless browsers (invisible, with no GUI) such as PhantomJS that speed some of this up. That said, I’ve found that Selenium works best for targeted extraction where the user knows exactly what they want.
I set out to collect tickers for all mutual funds in the asset allocation fund type. Fidelity provides a list of all these funds here. There are 1,586 funds as of today, spread across 80 conveniently paginated URLs. Each URL ends in &pgNo=5 to indicate you want page 5 (or whatever number between 1 and 80).
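Since only the page number changes from one URL to the next, the full list of pages can be generated up front. Here is a minimal Python sketch of that idea (not the code used for this post). The base URL is a hypothetical placeholder, and `page_urls` is a helper name made up for illustration; only the `&pgNo=` suffix and the page count of 80 come from the description above.

```python
# Hypothetical placeholder: substitute the real listing URL (without the pgNo part).
BASE_URL = "https://example.com/funds?type=asset-allocation"
N_PAGES = 80  # number of result pages quoted in the post


def page_urls(base_url, n_pages):
    """Return one URL per results page by appending &pgNo=<k> for k = 1..n_pages."""
    return [f"{base_url}&pgNo={k}" for k in range(1, n_pages + 1)]


urls = page_urls(BASE_URL, N_PAGES)
print(len(urls))   # prints 80
print(urls[4])     # the URL for page 5, ending in &pgNo=5
```

Building the list first keeps the scraping loop simple: each page is fetched and parsed the same way, whichever tool does the fetching.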
In my browser, when I hover my mouse over one of the fund names in the table, I see the 5-character ticker I’m looking for. I also see the tickers directly on the webpage when I click the link to each fund; on one fund’s page, for example, it says PSLDX in the top left. However, if possible I’d like to scrape the tickers from the table rather than from the individual fund pages. This would mean 80 pages to scrape rather than 1,586.
Take 1: traditional http request
When possible, it makes sense to use the simple traditional methods. So I first tried to extract these tickers with the popular httr R package by making standard http requests.
But did our http request return the information we want?
Nope – can’t find the tickers (one of them anyway).
Nope – can’t even find the fund name that I see in the table from the webpage in my browser.
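The check itself is nothing more than a substring search over the raw page source. For readers working in Python, a rough sketch of the same test might look like this. It uses `urllib.request` (the Python 3 successor to urllib2) instead of httr; `fetch_html` and `found_in_source` are illustrative names, and the sample source string is made up so the check can run offline.

```python
import urllib.request


def fetch_html(url, timeout=30):
    """Download the raw page source. No JavaScript is executed."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def found_in_source(html, needles):
    """Map each search string to True/False depending on whether it appears."""
    return {needle: needle in html for needle in needles}


# Made-up stand-in for a response whose table is filled in later by JavaScript.
sample_source = (
    "<html><body><div id='results'></div>"
    "<script src='app.js'></script></body></html>"
)
print(found_in_source(sample_source, ["PSLDX", "results"]))
# prints {'PSLDX': False, 'results': True}
```

When a string that is clearly visible in the browser comes back `False` here, the content is most likely being added to the page after load, which a plain http request never sees.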
My plan B was to grab the URL for each fund from the table, navigate to that fund’s page, and extract the ticker from there.
However, these links weren’t in our http response either. I noticed that the URLs for each fund followed a simple, consistent structure: https://fundresearch.fidelity.com/mutual-funds/summary/72201F433, for example. I thought maybe I could find 72201F433, which looks like some sort of fund ID, in a list of all fund IDs in the http response. No dice. Plan C – Selenium.
Take 2: Selenium
Step 1: Fire up Selenium
Step 2: Start scraping
To figure out which DOM elements I wanted Selenium to extract, I used the Chrome Developer Tools, which can be invoked by right-clicking a fund in the table and selecting Inspect Element. The HTML displayed here contains exactly what we want, and exactly what we didn’t see with our http request.
Since I want to grab all the funds at once, I tell Selenium to select the whole table. Going a few levels up from the individual cell in the table I’ve selected, I see that <tbody id="tbody"> is the HTML tag that contains the entire table, so I tell Selenium to find this element. I use the nifty highlightElement function to confirm graphically in the browser that this is what I think it is.
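For comparison, here is how the same two steps (start a browser, grab the table element) might look with Python’s selenium package. Treat it as a sketch under assumptions rather than the code used for this post: it assumes Selenium 4 with Firefox and its driver installed, `fetch_table_html` is a made-up helper name, and the visual highlight check is left out. The element id `tbody` is the one found in the inspector above.

```python
def fetch_table_html(url, element_id="tbody", wait_seconds=15):
    """Open url in a real browser and return the rendered table's outer HTML."""
    # Imported inside the function so this file loads even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumption: Firefox and its driver are available
    try:
        driver.get(url)
        # Wait until the page's JavaScript has inserted the table into the DOM.
        table = WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_element_located((By.ID, element_id))
        )
        return table.get_attribute("outerHTML")
    finally:
        driver.quit()  # always close the browser, even if something fails
```

The explicit wait matters: the table is filled in after the initial page load, so asking for the element immediately can fail even though it shows up a moment later.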
Then it’s business as usual. I parse the string output from Selenium into an HTML tree and use XPath to parse the table for just the fund name and ticker.
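To make the parsing step concrete, here is a small Python sketch using the standard library’s `xml.etree.ElementTree`, which supports a limited subset of XPath. The sample markup, fund names, and tickers are entirely made up; the real table’s columns and attributes may differ, and real-world HTML is often not well-formed XML, in which case a more forgiving HTML parser would be needed. `fund_cells` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Made-up markup standing in for the table HTML returned by the browser.
sample_table = """
<tbody id="tbody">
  <tr><td><a href="#">Example Fund A (AAAAX)</a></td><td>1.0</td></tr>
  <tr><td><a href="#">Example Fund B (BBBBX)</a></td><td>2.0</td></tr>
</tbody>
"""


def fund_cells(table_html):
    """Return the text of the first cell in every row of the table."""
    root = ET.fromstring(table_html.strip())
    cells = root.findall("./tr/td[1]")  # XPath: first <td> of each <tr>
    # itertext() gathers text from nested tags such as the <a> link.
    return ["".join(cell.itertext()).strip() for cell in cells]


print(fund_cells(sample_table))
# prints ['Example Fund A (AAAAX)', 'Example Fund B (BBBBX)']
```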
Step 3: Extract ticker
Nothing fancy here – just separating the ticker from the fund name.
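As an illustration of that last step, a regular expression is enough once the cell format is known. The sketch below is in Python and assumes each cell reads like "Fund Name (TICKR)", with a 5-letter uppercase ticker in parentheses at the end. That format is an assumption made for the example, as are the made-up fund name and the helper name `split_name_and_ticker`.

```python
import re

# Assumed cell format: "<fund name> (<TICKER>)" with a 5-letter uppercase ticker.
CELL_PATTERN = re.compile(r"^(?P<name>.+?)\s*\((?P<ticker>[A-Z]{5})\)\s*$")


def split_name_and_ticker(cell_text):
    """Return (name, ticker), or (cell_text, None) if the pattern does not match."""
    text = cell_text.strip()
    match = CELL_PATTERN.match(text)
    if match is None:
        return text, None
    return match.group("name"), match.group("ticker")


print(split_name_and_ticker("Example Fund A (AAAAX)"))
# prints ('Example Fund A', 'AAAAX')
```

Returning `None` for the ticker instead of raising makes it easy to spot rows that don’t follow the expected format after a full run.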
What else can Selenium do?
On the several computers I use, I’ve found setup ranging from seamless to frustrating.
The most frustrating issue I encountered while setting up on my Mac was this error message:
I was able to resolve it by killing all processes running on port 4444 and trying again.
At the terminal:
Kill PIDs of any processes listed. For example: