Scraping Web Pages in LabVIEW
I made a video showing how! Well, I made it several weeks ago...
The goal of the video is to convey that in many cases you can pull data from web resources without needing to automate a web browser control.
Watching the video, you will see how to use the built-in browser developer tools, discover interesting HTTP requests, and recreate those HTTP requests with the LabVIEW HTTP Client VIs.
For example, let's say you want to get the current weather and decide to parse the front page of Weather Underground. The first thing you might try is right clicking on the weather shown on the page, choosing Inspect, and examining the DOM in the Elements pane.
Problems Relying on the Elements Pane
It is possible the structure of the page may change. Web pages change their look and structure all the time, and we probably don't want to fix our web scraping VI over and over again.
The page might not be very easy to parse. Many pages at best contain lots of extraneous information and at worst have invalid or poorly structured HTML.
Use an API When Available
The best way to prevent issues with relying on DOM structure is to use an API provided by the web service. A programmatic API is usually the intended way to access data.
For example, Weather Underground has the Weather Underground API, which has online documentation and returns easy-to-parse JSON.
After signing up for the API to get our key and using the docs to understand the URL structure, we can create a VI that uses the Weather Underground API:
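In text form, the same idea looks roughly like this. Python stands in for the LabVIEW HTTP Client VIs here, and the URL pattern and JSON field names are assumptions based on typical weather APIs, not a verified copy of the Weather Underground schema; a canned response is used so the sketch runs without a live request:

```python
import json
from urllib.parse import quote

API_KEY = "your_api_key_here"  # assumption: the key obtained by signing up

def conditions_url(api_key, state, city):
    """Build a current-conditions request URL (pattern is illustrative)."""
    return (f"https://api.wunderground.com/api/{api_key}"
            f"/conditions/q/{quote(state)}/{quote(city)}.json")

def parse_temperature(body):
    """Pull the temperature out of a JSON response body (field names assumed)."""
    data = json.loads(body)
    return data["current_observation"]["temp_f"]

# A canned response stands in for the live HTTP call so the example is offline.
sample_body = '{"current_observation": {"weather": "Clear", "temp_f": 72.1}}'
print(conditions_url(API_KEY, "MN", "Minneapolis"))
print(parse_temperature(sample_body))
```

In LabVIEW the equivalent is a GET on the built URL followed by the JSON parsing VIs on the response body.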
When An API is Not Available
If the web service does not provide an API, it is time to get crafty. We have to fall back on relying on HTML structure or finding out where the data is dynamically loaded from. This is the approach used in the video and can be summarized roughly as follows:
Find a web page you want to scrape in your web browser.
Open the developer tools in your browser, record the network requests, and inspect the network requests for useful data.
I am partial to the Chrome Developer Tools, but Firefox and Edge provide similar tools.
Look through the network requests to find one that includes the data you are trying to find.
The best responses will be JSON or XML because LabVIEW has VIs to parse those data types.
If you can only find the data embedded in HTML, you can use LabVIEW's string parsing VIs to search through the HTML and extract your data.
Use the LabVIEW HTTP Client VIs to re-create a similar network request and query the data inside of LabVIEW.
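The last step above can be sketched in code. Python's stdlib stands in for the HTTP Client VIs, and the URL and header values below are placeholders for whatever you copy out of the network tab; the live fetch is commented out so the sketch runs offline:

```python
from urllib.request import Request, urlopen

# Placeholders; copy the real values from the request you found in the
# browser's network tab.
url = "https://example.com/data/endpoint?format=json"
headers = {
    "User-Agent": "Mozilla/5.0",   # some services reject requests without one
    "Accept": "application/json",
}

req = Request(url, headers=headers)

# To actually fetch (requires network access):
# with urlopen(req) as resp:
#     body = resp.read().decode("utf-8")

print(req.get_full_url())
print(req.get_header("Accept"))
```

In LabVIEW the same thing is an Open Handle, one Add Header call per copied header, then a GET.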
In the video we were lucky to find the data as XML, which LabVIEW can parse natively, although we did not go through those steps. We were also lucky that we could make the request without any additional complexities like cookies or authentication.
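For completeness, here is what parsing an XML response looks like outside of LabVIEW, using Python's xml.etree; the payload and element names are invented for illustration, not the actual response from the video:

```python
import xml.etree.ElementTree as ET

# A made-up XML payload shaped like a weather response.
xml_body = """
<response>
  <location>Minneapolis, MN</location>
  <temp_f>72.1</temp_f>
  <weather>Clear</weather>
</response>
"""

root = ET.fromstring(xml_body)
print(root.findtext("location"))        # text of the <location> child
print(float(root.findtext("temp_f")))   # numeric field parsed from text
```

In LabVIEW the equivalent is the XML parsing palette (or the EasyXML toolkit) applied to the response body string.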
Complexities of Scraping Web Services
For many LabVIEW applications the approach described in this document probably works pretty well. Local network devices frequently expose an open web interface and many public web services provide a well-documented programmatic API.
However, there are some hurdles that you may run into:
The web service requires user login, i.e. Authentication and Authorization.
This is a fairly involved topic. However, some forms of login such as Client-Side SSL, OAuth 1.0a, Basic Authentication, and some approaches to OAuth 2.0 are very doable in LabVIEW. In addition, using .NET libraries for accessing web services is an approach that works well on certain platforms.
The web service requires Cookies / Custom Headers. These are retrievable from HTTP Responses and can be passed into HTTP Requests from LabVIEW.
The web service requires reasonably accurate clock times from LabVIEW. While not generally a problem on desktops, some devices running LabVIEW may not have a built-in persistent clock and require time configuration to communicate effectively with some web services (for example, HTTPS certificate validation can fail if the clock is far off).
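The cookie round trip in particular is simple enough to sketch. Python's stdlib again stands in for the HTTP Client VIs, and the cookie value and URLs are invented; the point is only the pattern of reading a Set-Cookie header from one response and passing it back on the next request:

```python
from urllib.request import Request

# Suppose a first response carried: Set-Cookie: session=abc123; Path=/
# (in LabVIEW you would read this from the response headers; with urllib
# it would come from resp.getheader("Set-Cookie")).
set_cookie = "session=abc123; Path=/"
cookie_value = set_cookie.split(";")[0]  # keep just "session=abc123"

# Pass the cookie (and any custom headers) back on the next request.
follow_up = Request(
    "https://example.com/data",
    headers={"Cookie": cookie_value, "X-Requested-With": "XMLHttpRequest"},
)
print(follow_up.get_header("Cookie"))
```

In LabVIEW this is a Get Header on the response followed by an Add Header on the same HTTP handle before the next request.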
Have any use cases you ran into, questions, or comments? Ask on the LabVIEW Web Development Community.