In part 1 of this blog post series we mentioned the most common approach to web scraping and its issues. We also made a small example on how to start web scraping with C#, Selenium and QueryStorm in Excel. Now we'll expand on the example from part 1 and create a more useful web scraper.
Navigating to and scraping paginated items
It's time to kick the web scraping up a notch. For instance, let's scrape the names and prices of the top items on the home page, navigate to the laptops category and scrape all of the laptops as well.
Preparing the table
We should delete the current table rows as they are irrelevant. We can use ResultsTable.Clear() to delete all current table entries instead of deleting them by hand.
In addition, we should also edit the ResultsTable by renaming the Results column to Product Name and by adding a new column named Price.
Getting the price
To get the price along with the title of the top items, our script needs only minor modifications. First, we find all of the items by their CSS selector (div.thumbnail). Then we find the name and the price of the items by finding their respective elements (name CSS selector – h4 > a, price CSS selector – div.caption > h4.pull-right.price) inside of the parent item element.
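As a rough sketch of the lookup described above, the following method collects name and price pairs using the three CSS selectors from the text. The class name ItemScraper, the method name ScrapeItems and the tuple return type are illustrative assumptions, not the original script's API; writing the results into the ResultsTable is left out.

```csharp
using System.Collections.Generic;
using OpenQA.Selenium;

public static class ItemScraper
{
    // Hypothetical helper: returns (name, price) pairs for every item on the current page.
    public static List<(string Name, string Price)> ScrapeItems(IWebDriver driver)
    {
        var results = new List<(string Name, string Price)>();

        // Each product card on the page is a div.thumbnail element.
        foreach (IWebElement item in driver.FindElements(By.CssSelector("div.thumbnail")))
        {
            // Search inside the parent item element, not the whole page.
            string name = item.FindElement(By.CssSelector("h4 > a")).Text;
            string price = item.FindElement(By.CssSelector("div.caption > h4.pull-right.price")).Text;
            results.Add((name, price));
        }

        return results;
    }
}
```

Searching from the item element instead of the driver keeps each name paired with the price from the same product card.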
Preventing CSS selector issues
Just a heads up – without any changes to the default driver initializer, the browser will open as a small window. That means that there's a chance that the page will have a mobile/tablet layout so your CSS selectors (that are copied from the DevTools of a maximized browser window) will be invalid. To prevent this issue, we start the driver with some options where we specify that the browser should start maximized.
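A minimal sketch of starting the driver maximized, assuming Chrome is the browser in use (the text does not name one here). DriverFactory and CreateMaximizedChromeDriver are hypothetical names; ChromeOptions.AddArgument and the start-maximized switch are the standard way to pass this setting to Chrome.

```csharp
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

public static class DriverFactory
{
    // Hypothetical helper: creates a Chrome driver whose window starts maximized,
    // so the page renders with its desktop layout.
    public static IWebDriver CreateMaximizedChromeDriver()
    {
        var options = new ChromeOptions();
        options.AddArgument("start-maximized");
        return new ChromeDriver(options);
    }
}
```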
Page navigation
The next step is navigating – first to the Computers page and then to the Laptops page.
We can navigate by clicking on the Computers menu item and waiting for the Computers page to load. Subsequently, we should click on the Laptops menu item and wait for the Laptops page to load.
Note: We could just navigate to the URL https://webscraper.io/test-sites/e-commerce/ajax/computers/laptops instead of clicking on the side menu items, but I feel it's better to demonstrate how to click and wait for the page to load as it is a pretty common problem in web scraping.
Clicking is easy – we find the element and call its Click method.
Waiting itself is not an issue as we can use the WebDriverWait class that provides us with a way to wait a certain amount of time until an arbitrary condition happens. However, this condition can prove to be a problem.
In our case, the condition is to wait until the new page has loaded. To do that, we need to determine exactly when the old page has unloaded and the new page has loaded. The most robust way to achieve this is to wait for an element on the old page to go 'stale' (no longer attached to the DOM) and then for an element on the new page to be displayed.
As a sort of a helping hand, we could install and use the DotNetSeleniumExtras.WaitHelpers NuGet package to check if a new page has loaded. However, the project is no longer maintained and the relevant code isn't complicated, so we can write the code for the conditions ourselves.
The Until method of WebDriverWait has a parameter of type Func<IWebDriver, TResult>. Therefore, we have to create a NewPageLoaded method that returns a delegate of that type to the Until method. The code can look something like this…
To complete the NewPageLoaded method, we need to replace the dots with concrete staleness and visibility checks. These checks can also return a delegate so they can be used as regular methods and by the Until method. So, let's define the methods to check for staleness and visibility.
An element is stale if any of these conditions are met:
- The element is disabled
- The element is missing (null)
- Accessing the element throws a StaleElementReferenceException
Also, an element is visible if:
- The driver can find the element
- The element is displayed
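The two sets of conditions above can be sketched as follows. This is one possible shape, not necessarily the original code: the class name PageConditions and the method names ElementIsStale and ElementIsVisible are assumptions. Both methods return a Func<IWebDriver, bool>, so they can be passed to Until or invoked directly.

```csharp
using System;
using OpenQA.Selenium;

public static class PageConditions
{
    // Stale: the element is missing (null), disabled, or touching it throws
    // a StaleElementReferenceException.
    public static Func<IWebDriver, bool> ElementIsStale(IWebElement element)
    {
        return driver =>
        {
            try
            {
                return element == null || !element.Enabled;
            }
            catch (StaleElementReferenceException)
            {
                return true;
            }
        };
    }

    // Visible: the driver can find the element and it is displayed.
    public static Func<IWebDriver, bool> ElementIsVisible(By locator)
    {
        return driver =>
        {
            try
            {
                return driver.FindElement(locator).Displayed;
            }
            catch (NoSuchElementException)
            {
                // Not in the DOM yet.
                return false;
            }
            catch (StaleElementReferenceException)
            {
                // Found, but replaced before we could inspect it.
                return false;
            }
        };
    }
}
```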
Finally, the NewPageLoaded method looks like this:
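A hedged sketch of how such a method might combine the two checks. The signature is an assumption: here NewPageLoaded takes the staleness and visibility checks as delegates, and the original may instead take the element and locator directly. The class name PageLoadCondition is hypothetical.

```csharp
using System;
using OpenQA.Selenium;

public static class PageLoadCondition
{
    // Returns a delegate suitable for WebDriverWait.Until: it yields true only once
    // the old page's element is stale AND the new page's element is visible.
    public static Func<IWebDriver, bool> NewPageLoaded(
        Func<IWebDriver, bool> oldElementIsStale,
        Func<IWebDriver, bool> newElementIsVisible)
    {
        return driver => oldElementIsStale(driver) && newElementIsVisible(driver);
    }
}
```

Because the result is a bool-returning delegate, Until keeps polling it until it returns true or the timeout expires.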
And once we decide what elements on the pages we're going to use to identify if a new page has loaded, we're ready to navigate to the Computers page and the Laptops page. I chose the following:
Now we can finally perform navigation:
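As a minimal sketch of the click-and-wait step, assuming the page-loaded condition has been built before the click (so that it holds a reference to an element of the old page). PageNavigator, ClickAndWait and the parameter names are illustrative; the menu item locators are passed in by the caller and are not specified here.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

public static class PageNavigator
{
    // Hypothetical helper: clicks a menu item, then blocks until the supplied
    // condition reports that the new page has loaded (or the timeout expires).
    public static void ClickAndWait(
        IWebDriver driver,
        By menuItemLocator,
        Func<IWebDriver, bool> newPageLoaded,
        TimeSpan timeout)
    {
        driver.FindElement(menuItemLocator).Click();

        // Until throws a WebDriverTimeoutException if the condition never becomes true.
        var wait = new WebDriverWait(driver, timeout);
        wait.Until(newPageLoaded);
    }
}
```

Calling this once for the Computers menu item and once for the Laptops menu item, each with its own locators, performs the two navigation steps.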
Scraping paginated laptop items
Since we've navigated to the Laptops page, we can now scrape the laptop items. We need a couple of things to do that.
First of all, we need a reference to the 'Next' button element – by clicking it we can load the items, page by page (button CSS selector – button.btn.btn-default.next).
The second thing to have in mind is that we have to wait until the next page of items is loaded. Luckily, we've made a method to check the staleness of elements, so we can infer that a new page of items has loaded when the laptop items from the current page go stale.
And lastly, we should check whether the 'Next' button is enabled or disabled, so we know if we've reached the last items page or not.
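The three points above can be sketched as a loop. This is an illustration under assumptions, not the original script: LaptopScraper, ScrapeAllPages and IsStale are hypothetical names, the results are returned as tuples rather than written to the ResultsTable, and the item selectors are assumed to be the same as on the home page.

```csharp
using System;
using System.Collections.Generic;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

public static class LaptopScraper
{
    public static List<(string Name, string Price)> ScrapeAllPages(IWebDriver driver, TimeSpan timeout)
    {
        var results = new List<(string Name, string Price)>();
        var wait = new WebDriverWait(driver, timeout);

        while (true)
        {
            // Scrape the items on the current page.
            var items = driver.FindElements(By.CssSelector("div.thumbnail"));
            foreach (IWebElement item in items)
            {
                string name = item.FindElement(By.CssSelector("h4 > a")).Text;
                string price = item.FindElement(By.CssSelector("div.caption > h4.pull-right.price")).Text;
                results.Add((name, price));
            }

            // A disabled 'Next' button means we are on the last page.
            IWebElement next = driver.FindElement(By.CssSelector("button.btn.btn-default.next"));
            if (items.Count == 0 || !next.Enabled)
            {
                break;
            }

            // Keep a reference to a current item, click 'Next', and wait for that
            // item to go stale as the signal that the next page of items has loaded.
            IWebElement firstItem = items[0];
            next.Click();
            wait.Until(d => IsStale(firstItem));
        }

        return results;
    }

    // Same staleness rule as in the text: null, disabled, or throws on access.
    private static bool IsStale(IWebElement element)
    {
        try
        {
            return element == null || !element.Enabled;
        }
        catch (StaleElementReferenceException)
        {
            return true;
        }
    }
}
```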
We are almost done with our scraper! Let's run the script with F5 and wait a couple of seconds. As a result of running the script, we can see 120 scraped products in our table. However, we should do one more thing – refactor the code a bit.
Finishing steps
First of all, the code for saving home items and laptop items is the same. Therefore, we can extract a method for saving items.
We can also extract a method for page navigation. We just need to pass different CSS selectors when calling the method.
And lastly, to keep the main part of the script nice and readable, we can do two things. We can create a new method just for scraping laptop items. Also, we can create a new method for direct navigation to the Laptops page.
We're done!
Finally, here's the full code for the tutorial:
In the next and final part of this web scraping tutorial, we'll turn our script into a shareable workbook-application that any user with the QueryStorm Runtime can execute.