Web Scraping

Back to the data … thanks to web scraping!

Focus on the first fundamental step of data mining: data collection

Back to the future thanks to trustworthy, reliable data? “Great Scott!”

We all agree that we’re data-dependent. Every decision we make, and every action that follows, starts with data: first its analysis and interpretation, and finally forecasting.

That sounds simple, but … what data can we get? Where is it? How do we get it?

The purpose of this short article is to answer these questions!

Data

Any data mining process starts with the data. So the data is the Gordian knot of the matter.

In practice, one of these cases can occur:

  • a database of the data already exists
  • we create a database ourselves
  • we scrape the data from reliable sources

We’re going to focus on getting the data from a source…

Extract data from the web

Very often the data we are interested in is not available as files but only as content embedded in web pages.

In that case, we need to get the data out of the web page.

We can do this manually, by copying the data into a spreadsheet, or automatically, using dedicated tools.

The automatic approach is called web scraping or web data extraction.

Web Scraping

It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
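To make the idea concrete, here is a minimal sketch in Python that uses only the standard library. It assumes the page contains a simple HTML table; the `SAMPLE_HTML` string, the `TableExtractor` class and the two helper functions are hypothetical names made up for this illustration, and a real page would first have to be downloaded from a site that allows it.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content, used here instead of a real download.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th>Ticker</th><th>Price</th></tr>
  <tr><td>AAA</td><td>10.5</td></tr>
  <tr><td>BBB</td><td>20.0</td></tr>
</table>
</body></html>
"""


class TableExtractor(HTMLParser):
    """Collects the text of every <th>/<td> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row being built, or None outside a <tr>
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while we are inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table cells found in `html` as a list of rows."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text, ready to open in a spreadsheet."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(scrape_table(SAMPLE_HTML)))
```

The sketch treats all rows on the page as one table, which is enough for a simple page; pages with several or nested tables would need more careful handling.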

It is also a much-debated technique today, because it can be used for illegal purposes, including undercutting prices and stealing copyrighted content.

In reality, the problem is not the technique, which simply automates what a person could do by reading and copying the data manually, but the data itself.

In fact, the data controller or the site owner may not allow the data to be used.

It is good practice to always check the policy of the site you want to scrape before using its data.
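One machine-readable part of a site's policy is its robots.txt file. The sketch below, based on Python's standard `urllib.robotparser` module, shows how a scraper could check it before fetching a page. The `SAMPLE_ROBOTS_TXT` content, the `my-bot` user agent, the example.com URLs and the `is_allowed` helper are assumptions made for illustration. Keep in mind that robots.txt is only one signal: the terms of use of the site still have to be read by a person.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is published by the site
# at the path /robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `robots_txt` lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/public/table.html",
        "https://example.com/private/data.html",
    ):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, "my-bot", url))
```

In a real scraper the text would come from the site itself, for instance through the parser's `set_url` and `read` methods, rather than from a string in the script.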

Excel and Web Scraping

Excel itself supports web scraping, which shows that the technique is not illegal in itself.

It does so through two procedures:

  • from the “Data/Web” menu … although this does not always work correctly
  • from “Data/New Query”, available from Excel 2016 onwards
Figure 1 – Source: “Excel e Intelligenza Artificiale per il Trading” by Donata Petrelli & Fabrizio Cesarini

Data Extraction methods

An overview of the possible methods for data extraction:

  • Via Files (XML, CSV, JSON)
  • Via Excel / Power BI
  • Via API from source
  • Via Source Code (VBA, Python and its libraries, or directly from the DOM)
  • Browser Plugin
  • Via Dedicated Software
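When the source publishes its data as files, the first method in the list, extraction is mostly a matter of parsing. As a rough sketch, the Python snippet below reads the same hypothetical records from CSV text and from JSON text using only the standard library. The sample strings, the `ticker` and `price` field names and both helper functions are invented for this example.

```python
import csv
import io
import json

# Hypothetical file contents, inlined so the example is self-contained.
SAMPLE_CSV = "ticker,price\nAAA,10.5\nBBB,20.0\n"
SAMPLE_JSON = '[{"ticker": "AAA", "price": 10.5}, {"ticker": "BBB", "price": 20.0}]'


def records_from_csv(text):
    """Parse CSV text with a header row into a list of dictionaries."""
    reader = csv.DictReader(io.StringIO(text))
    # CSV values are always strings, so the price is converted explicitly.
    return [{"ticker": row["ticker"], "price": float(row["price"])} for row in reader]


def records_from_json(text):
    """Parse a JSON array of objects into the same record layout."""
    return [
        {"ticker": item["ticker"], "price": float(item["price"])}
        for item in json.loads(text)
    ]


if __name__ == "__main__":
    print(records_from_csv(SAMPLE_CSV))
    print(records_from_json(SAMPLE_JSON))
```

Both helpers return the same list of records, so the rest of an analysis does not need to know which file format the source offered.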

Web Scraping software

Here is our list of the best web scraping tools on the market right now:

  • Octoparse
  • ParseHub
  • Scrapy
  • DataMiner
  • Dexi.io

Last but not least … nostopitWebTableExtractor, our free software for extracting data and tables from web pages and files.

Figure 2 – nostopitWebTableExtractor Screenshot

If you want to try it now, you can download it here:

https://www.nostopit.com/software/nostopitwebtableextractor/

Any feedback is welcome!

Fabrizio Cesarini