Tuesday 28 April 2015

A Guide to Web Scraping Tools

Web Scrapers are tools designed to extract / gather data in a website via crawling engine usually made in Java, Python, Ruby and other programming languages.Web Scrapers are also called as Web Data Extractor, Data Harvester , Crawler and so on which most of them are web-based or can be installed in local desktops.

Its main purpose is to enable webmasters, bloggers, journalist and virtual assistants to harvest data from a certain website whether text, numbers, contact details and images in a structured way which cannot be done easily thru manual copy and paste method. Typically, it transforms the unstructured data on the web, from HTML format into a structured data stored in a local database or spreadsheet or automates web human browsing.

Web Scraper Usage

Web Scrapers are also being used by SEO and Online Marketing Analyst to pull out some data privately from the competitor’s website such as high targeted keywords, valuable links, emails & traffic sources that were also perform by SEOClerk, Google and many other web crawling sites.

Includes:

•    Price comparison
•    Weather data monitoring
•    Website change detection
•    Research
•    Web mash up
•    Info graphics
•    Web data integration
•    Web Indexing & rank checking
•    Analyze websites quality links

List of Popular Web Scrapers

There are hundreds of Web Scrapers today available for both commercial and personal use. If you’ve never done any web scraping before, there are basic

Web scraping tools like YahooPipes, Google Web Scrapers and Outwit Firefox extensions that it’s good to start with but if you need something more flexible and has extra functionality then,  check out the following:

HarvestMan [ Free Open Source]

HarvestMan is a web crawler application written in the Python programming language. HarvestMan can be used to download files from websites, according to a number of user-specified rules. The latest version of HarvestMan supports as much as 60 plus customization options. HarvestMan is a console (command-line) application. HarvestMan is the only open source, multithreaded web-crawler program written in the Python language. HarvestMan is released under the GNU General Public License.Like Scrapy, HarvestMan is truly flexible however, your first installation would not be easy.

Scraperwiki [Commercial]

Using a minimal programming you will be able to extract anything. Off course, you can also request a private scraper if there’s an exclusive in there you want to protect. In other words, it’s a marketplace for data scraping.

Scraperwiki is a site that encourages programmers, journalists and anyone else to take online information and turn it into legitimate datasets. It’s a great resource for learning how to do your own “real” scrapes using Ruby, Python or PHP. But it’s also a good way to cheat the system a little bit. You can search the existing scrapes to see if your target website has already been done. But there’s another cool feature where you can request new scrapers be built.  All in all, a fantastic tool for learning more about scraping and getting the desired results while sharpening your own skills.

Best use: Request help with a scrape, or find a similar scrape to adapt for your purposes.

FiveFilters.org [Commercial]   

Is an online web scraper available for commercial use. Provides easy content extraction using Full-Text RSS tool which can identify and extract web content (news articles, blog posts, Wikipedia entries, and more) and return it in an easy to parse format. Advantages; speedy article extraction, Multi-page support, has a Autodetection and  you can deploy  on the cloud server without database required.

Kimono

Produced by Kimono labs this tool lets you convert data to into apis for automated export.   Benjamin Spiegel did a great Youmoz post on how to build a custom ranking tool with Kimono, well worth checking out!

Mozenda [Commercial]

This is a unique tool for web data extraction or web scarping.Designed for easiest and fastest way of getting data from the web for everyone. It has a point & click interface and with the power of the cloud you can scrape, store, and manage your data all with Mozenda’s incredible back-end hardware. More advance, you can automate your data extraction leaving without a trace using Mozenda’s  anonymous proxy feature that could rotate tons of IP’s .

Need that data on a schedule? Every day? Each hour? Mozenda takes the hassle out of automating and publishing extracted data. Tell Mozenda what data you want once, and then get it however frequently you need it. Plus it allows advanced programming using REST API the user can connect directly Mozenda account.

Mozenda’s Data Mining Software is packed full of useful applications especially for sales people. You can do things such as “lead generation, forecasting, acquiring information for establishing budgets, competitor pricing analysis. This software is a great companion for marketing plan & sales plan creating.

Using Refine Capture tetx tool, Mozenda is smart enough to filter the text you want stays clean or get  the specific text or split them into pieces.

80Legs [Commercial]

The first time I heard about 80Legs my mind really got confused of what really this software does. 80Legs like Mozenda is a web-based data extraction  tool with customizable features:

•    Select which websites to crawl by entering URLs or uploading a seed list
•    Specify what data to extract by using a pre-built extractor or creating your own
•    Run a directed or general web crawler
•    Select how many web pages you want to crawl
•    Choose specific file types to analyze

80 legs offers customized web crawling that lets you get very specific about your crawling parameters, which tell 80legs what web pages you want to crawl and what data to collect from those web pages and also the general web crawling which can collect data like web page content, outgoing links and other data. Large web crawls take advantage of 80legs’ ability to run massively parallel crawls.

Also crawls data feeds and offers web extraction design services. (No installation needed)

ScrapeBox [Commercial]

ScrapeBox are most popular web scraping tools to SEO experts, online marketers and even spammers with its very user-friendly interface you can easily harvest data from a website;

•    Grab Emails
•    Check page rank
•    Checked high value backlinks
•    Export URLS
•    Checked Index
•    Verify working proxies
•    Powerful RSS Submission

Using thousands of rotating proxies you will be able to sneak on the competitor’s site keywords, do research on .gov sites, harvesting data, and commenting without getting blocked.

The latest updates allow the users to spin comments and anchor text to avoid getting detected by search engines.

You can also check out my guide to using Scrapebox for finding guest posting opportunities:

Scrape.it [Commercial]

Using a simple point & click Chrome Extension tool, you can extract data from websites that render in javascript. You can automate filling out forms, extract data from popups, navigate and crawl links across multiple pages, extract images from even the most complex websites with very little learning curve. Schedule jobs to run at regular intervals.

When a website changes layout or your web scraper stops working, scrape.it  will fix it automatically so that you can continue to receive data uninterrupted and without the need for you to recreate or edit it yourself.

They work with enterprises using our own tool that we built to deliver fully managed solutions for competitive pricing analysis, business intelligence, market research, lead generation, process automation and compliance & risk management requirements.

Features:

    Very easy web date extraction with Windows like Explorer interface

    Allowing you to extract text, images and files from modern Web 2.0 and HTML5 websites which uses Javascript & AJAX.

    The user could select what features they’re going to pay with

    lifetime upgrade and support at no extra charge on premium license

Scrapy [Free Open Source]

Off course the list would not be cool without Scrapy, it is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Features:

•         Design with simplicity- Just writes the rules to extract the data from web pages and let Scrapy crawl the entire web site. It can crawl 500 retailers’ sites daily.

•         Ability to attach new code for extensibility without having to touch the framework core

•         Portable, open-source, 100% Python- Scrapy is completely written in Python and runs on Linux, Windows, Mac and BSD

•         Scrapy comes with lots of functionality built in.

•         Scrapy is extensively documented and has an comprehensive test suite with very good code coverage

•         Good community and commercial support

 Cons: The installation process is hard to perfect especially for beginners

Needlebase [Commercial]

Many organizations, from private companies to government agencies, store their info in a searchable database that requires you navigate a list page listing results, and a detail page with more information about each result.  Grabbing all this information could result in thousands of clicks, but as long as it fits the same formula, Needlebase can do it for you.  Point and click on example data from one page once to show Needlebase how your site is structured, and it will use that pattern to extract the information you’re looking for into a dataset.  You can query the data through Needle’s site, or you can output it as a CSV or other file format of your choice.  Needlebase can also rerun your scraper every day to continuously update your dataset.

OutwitHub [Free]

This Firefox extension is one of the more robust free products that exists Write your own formula to help it find information you’re looking for, or just tell it to download all the PDFs listed on a given page.  It will suggest certain pieces of information it can extract easily, but it’s flexible enough for you to be very specific in directing it.  The documentation for Outwit is especially well written, they even have a number of tutorials for what you might be looking to do.  So if you can’t easily figure out how to accomplish what you want, investing a little time to push it further can go a long way.

Best use: more text

irobotsoft [Free}

This is a free program that is essentially a GUI for web scraping. There’s a pretty steep learning curve to figure out how to work it, and the documentation appears to reference an old version of the software. It’s the latest in a long tradition of tools that lets a user click through the logic of web scraping. Generally, these are a good way to wrap your head around the moving parts of a scrape, but the products have drawbacks of their own that makes them little easier than doing the same thing with scripts.

Cons: The documentation seems outdated

Best use: Slightly complex scrapes involving multiple layers.

iMacros [Free]

The  same ethos on how microsoft macros works, iMacros automates repetitive task.Whether you choose the website, Firefox extension, or Internet Explorer add-on flavor of this tool, it can automate navigating through the structure of a website to get to the piece of info you care about. Record your actions once, navigating to a specific page, and entering a search term or username where appropriate.  Especially useful for navigating to a specific stock you care about, or campaign contribution data that’s mired deep in an agency website and lacks a unique Web address.  Extract that key piece (pieces) of info into a usable form.  Can also help convert Web tables into usable data, but OutwitHub is really more suited to that purpose.  Helpful video and text tutorials enable you to get up to speed quickly.

Best use: Eliminate repetition in navigating to a particular datapoint in a website that you’re checking up on often by recording a repeatable action that pulls the datapoint out of the clutter it’s naturally surrounded by.

InfoExtractor [Commercial]

This is a neat little web service that generates all sorts of information given a list of urls. Currently, it only works for YouTube video pages, YouTube user profile pages, Wikipedia entries, Huffingtonpost posts, Blogcatalog blog posts and The Heritage Foundation blog (The Foundry). Given a url, the tool will return structured information including title, tags, view count, comments and so on.

Google Web Scraper [Free]

A browser-based web scraper works like Firefox’s Outwit Hub, it’s designed for plain text extraction from any online pages and export to spreadsheets via Google docs. Google Web Scraper can be downloaded as an extension and you can install it in your Chrome browser without seconds. To use it: highlight a part of the webpage you’d like to scrape, right-click and choose “Scrape similar…”. Anything that’s similar to what you highlighted will be rendered in a table ready for export, compatible with Google Docs™. The latest version still had some bugs on spreadsheets.

Cons: It doesn’t work for images and sometimes it can’t perform well on huge volume of text but it’s easy and fast to use.


Tutorials:

Scraping Website Images Manually using Google Inspect Elements

The main purpose of Google Inspect Elements is for debugging like the Firefox Firebug however, if you’re flexible you can use this tool also for harvesting images in a website. Your main goal is to get the specific images like web backgrounds, buttons, banners, header images and product images which is very useful for web designers.

Now, this is a very easy task. First, you will definitely need to download and install the Google Chrome browser in your computer. After the installation do the following:

1. Open the desired webpage in Google Chrome

2. Highlight any part of the website and right click > choose Google Inspect Elements

3. In the Google Inspect Elements, go to Resources tab

4. Under Resources tab, expand all folders. You will eventually see script folders and IMAGES folders

5. In the Images folders, just use arrow keys to find the images you need to have (see the screenshot above)

6. Next, right click the images and choose Open the Image in New Tab

7. Finally, right click the image > choose Save Image As… . (save to your local folder)

You’re done!

How to Extract Links from a Web Page with OutWit Hub

In this tutorial we are going to learn how to extract links from a webpage with OutWit Hub.

Sometimes it can be useful to extract all links from a given web page. OutWit Hub is the easiest way to achieve this goal.

1. Launch OutWit Hub

If you haven’t installed OutWit Hub yet, please refer to the Getting Started with OutWit Hub tutorial.

Begin by launching OutWit Hub from Firefox. Open Firefox then click on the OutWit Button in the toolbar.

If the icon is not visible go to the menu bar and select Tools -> OutWit -> OutWit Hub

OutWit Hub will open displaying the Web page currently loaded on Firefox.


2. Go to the Desired Web Page

In the address bar, type the URL of the Website.

Go to the Page view where you can see the Web page as it would appear in a traditional browser.

Now, select “Links” from the view list.

In the “Links” widget, OutWit Hub displays all the links from the current page.

If you want to export results to Excel, just select all links using ctrl/cmd + A, then copy using ctrl/cmd + C and paste it in Excel (ctrl/cmd + V).

Source: http://www.garethjames.net/a-guide-to-web-scrapping-tools/

No comments:

Post a Comment