Monday, 25 May 2015

Data Scraping - One application or multiple?

I have 30+ sources of data I scrape daily in various formats (xml, html, csv). Over the last three years Ive built 20 or so c# console applications that go out, download the data and re-format it into a database. But Im curious what other people are doing for this type of task. Are people building one tool that has a lot of variables and inputs or are people designing 20+ programs to scrape and parse this data. Everything is hard-coded into each console and run through Windows Task Manager.

Added a couple additional thoughts/details:

    Of the 30 sources, they all have unique properties, all are uploaded into individual MySQL tables and all have varying frequencies. For example, one data source is hit once a minute, another on 5 minute intervals. Majority are once an hour and once a day.

At current I download the formats (xml, csv, html), parse them into a formatted csv and put them into staging folders. Within that folder, I run an application that reads a config file specific to the folder. When a new csv is added to the folder, the application then uploads the data into the specific MySQL tables designated in the config file.

Im wondering if it is worth re-building all this into a larger complex program that is more capable of dynamically adding content+scrapes and adjusting to format changes.

Looking for outside thoughts.

5 Answers

What you are working on is basically ETL. So at a high level you need an export component (get stuff) a transform component (map to known format) and a load (take known format and put stuff somewhere). If you are comfortable being tied to a RDBMS you could use something like SQL Server SSIS packages. What I would do is create a host application that managed common aspects of the overall process (errors, and pipeline processing). Then make the specifics of the E, T, and L pluggable. A low ceremony way to get this would be to host the powershell runtime and create each seesion with common context objects that the scripts will use to communicate. You get a built in pipe and filter model for scripts and easy, safe extensibility. This design has worked extremely for my team with a similar situation.

Resist the temptation to rewrite.

However, for new code, you could plan for what you know has already happened. Write a retrieval mechanism that you can reuse through configuration. Write a translation mechanism that you can reuse (maybe in a library that you can call with very little code). Write a saving mechanism that can be called or configured.

At this point, you've written #21(+). Now, the following ones can be handled with a tiny bit of code and configuration. Yay!

(You may want to implement this in a service that handles multiple conversions, but weight the benefits of it versus the ability to separate errors in one module from the rest.)

1

It depends - if you need the scrapers to feed into a single application/database and have a uniform data format, it makes sense to have them all in a single program (possibly inheriting from a common base scraper).

If not and they are completely unrelated to each other, might as well keep them separate so changes in one have no effect on another.

Update, following edits to question:

Don't change things just for the sake of change. You have something that works, don't mess with it too much.

Since your data sources and data sinks are all separate from each other, combining them into one application will simply create a very complicated application that will be very difficult to change when needed.

Since the scrapers are separate, keep the separation as you have it now.

As sbrenton said, this most falls in with ETL. You should check out Talend Open Studio. It specializes in handling data flows like I imagine yours are as well as other things like duplicate removal, normalization of fields; tens/hundreds of drag and drop ETL components, you can also write custom code as Talend is a code generator as well, either Java or Perl are options. You can also use Talend to execute system commands. I use it for my ETL work, although not in production, in production we will use SSIS, mostly due to lots of other Microsoft products in house.

You may want to use some good scheduling library, like Quartz.NET.

In a few words, here's what you can expect:

    Your tasks are represented by classes and not processes

    You can set and forget tasks and scale across multiple servers

    You have an out-of-the-box system to actually take care of what is needed to be run when, what failed and needs to be re-run, etc. etc.

Source: http://programmers.stackexchange.com/questions/118077/data-scraping-one-application-or-multiple/118098#118098


Friday, 22 May 2015

How to prevent getting blacklisted while scraping

Crawlers can retrieve data much quicker and in greater depth than human searchers, so bad scraping practices can have some impact on the performance of the site.

Needless to say, if a single crawler is performing multiple requests per second and/or downloading large files, a under powered server would have a hard time keeping up with requests from multiple crawlers.

Since spiders don’t bring direct organic traffic and seemingly affect the performance of the site, most site admins hate spiders and do their best to prevent them.

Lets go through how websites detect and block spiders and also know the techniques to overcome those barriers.

Most websites don’t have anti scraping mechanisms since it would affect the user experience, but some sites do not believe in open data access.

Before going through this article always keep in mind that

    A GOOD SPIDER MUST OBEY A WEBSITE’S CRAWLING POLICIES.

HOW DOES DETECTING ‘SPIDER ACTIVITY’ WORK?

A web server can use different mechanisms to detect a spider from a normal user. Here are some methods used by a site to detect a spider:

•    Unusual traffic/high download rate especially from a single client/or IP address within a short time span raises a bot alert.

•    Repetitive tasks done on website based on an assumption that a human user won’t perform the same repetitive tasks all the time.

•    The site has honeypot traps inside their pages, these honeypots are usually links which aren’t visible to a normal user but only to a spider . When a scraper/spider tries to access the link, the alarms are tripped.

Spend some time and investigate the anti-scraping mechanisms used by a site and build the spider accordingly, it will provide a better outcome in the long run and increase the longevity and robustness of your work.

EASIEST WAY TO FIND IF A SITE HATES BOTS

Check the robots.txt file if it contains line like these, It means the site doesn’t like bots. However, since most sites want to be on Google (arguably the largest scraper of websites globally ;-)) they do allow access to bots and spiders.

User-agent: *
Disallow: /

This line is for preventing well-behaved bots or the bots which respect robots.txt.

Another way is CAPTCHAs irritating presence in the sites other than in authentication page.

WHAT HAPPENS WHEN YOU GET BANNED

There are two ways to ban a webspider, either by banning all accesses from a particular IP or by banning all accesses that use a specific id to access the server (most browsers and web spiders identify themselves whenever they request a page by user agents. Chrome browser for example uses Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.149 Safari/537.36

The banning can be temporary or permanent. Temporary blocks can last minutes or hours.

HOW DO WE KNOW A SITE HAS BLOCKED US?

If any of the following symptoms appear on the site that you are crawling, it is a sign of being blocked or banned.

•    Showing CAPTCHA pages
•    Unusual content delivery delay
•    Frequent response with 404,301,500 errors,

also frequent appearance of these status codes are also indication of blocking.

•    401 Unauthorized
•    403 Forbidden
•    404 Not Found
•    408 Request Timeout
•    429 Too Many Requests

WEB CRAWLING BEST PRACTICES

These are the best practices we can follow to overcome the detection.

1. MAKE CRAWLING SLOWER, DO NOT DDoS THE SERVER, TREAT THEM NICELY

Use auto throttling mechanisms, which will automatically throttle crawling speed based on the load on both spider and the website, you are crawling and also adjust the spider to optimum crawling speed. The faster you crawl, the worse it is for everyone.

Put some random sleeps in between requests, Add some delays after crawled number of pages. Choose the lowest number of concurrent requests possible. These techniques make the spider looks like a human being.

2. DISGUISE YOUR REQUESTS BY ROTATING IP/PROXY

A server can easily detects a bot by checking the requests from a single IP address, So we use different IPs for making request to a server and detection rate become lesser. Make a pool of IPs that you can use and use random ones for each request.

There are several methods can be used to change the IP. Services like VPN ,shared proxies, TOR can help  and some third parties are also provides services for IP rotation.

3. USER-AGENT SPOOFING

Since every request made from a client end contains a user-agent header ,Using the same useragent multiple times leads to the detection of a bot. User agent spoofing is the best solution for this. Spoof the User agent by making a list of user agents and pick a random one for each request.

Websites do not want to block genuine users so you should try to look like one. Set your user-agent to a common web browser instead of using the library default (such as wget/version or urllib/version). You could even pretend to be the Google Bot: Googlebot/2.1; (http://www.google.com/bot.html)

You can check your user-agent string here:

http://www.whatsmyuseragent.com/

A good user-agent string list can be found here:

http://www.useragentstring.com/pages/useragentstring.php

4. BE AWARE OF HONEYPOTS

Some site designers put honeypot traps inside websites to detect web spiders, They may be links that normal user can’t see and a spider can.

When following links always take care that the link has proper visibility with no nofollow tag. Some honeypot links to detect spiders will be have the CSS style display:none or will be color disguised to blend in with the page’s background color.

5. DO NOT ALWAYS FOLLOW THE SAME CRAWLING PATTERN

Only robots follow the same crawling pattern,Sites that have intelligent anti-crawling mechanisms can easily detect spiders from finding pattern in their actions. Humans wont perform repetitive tasks a lot of times. Incorporate some random clicks on the page, mouse movements and random actions that will make a spider looks like a human client.

6. ALWAYS RESPECT THE robots.txt

All web spiders are supposed to follow rules that you place in a robots.txt file in a website, such as how frequently they are allowed to request pages, and from what directories they are allowed to crawl through. They should also be supplying a consistent valid User-Agent string that identifies the requests as a bot request.

Source: http://learn.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/

Tuesday, 19 May 2015

The Features of the "Holographic Meridian Scraping Therapy"

1. Systematic nature: Brief introduction to the knowledge of viscera, meridians and points in traditional Chinese medicine, theory of holographic diagnosis and treatment; preliminary discussion of the treatment and health care mechanism of scraping therapy; systemat­ic introduction to the concrete methods of the holographic meridian scraping therapy; enumerating a host of therapeutic methods of scraping for disorders in both Chinese and Western medicine to em­body a combination of disease differentiation and syndrome differen­tiation; and summarizing the health care scraping methods. It is a practical handbook of gua sha.

2. Scientific: Applying the theories of Chinese and Western medicine to explain the health care and treatment mechanism and clinical applications of scraping therapy; introducing in detail the practical manipulations, items for attention, and indications and contraindications of the scraping therapy. Here are introduced repre­sentative diseases in different clinical departments, for which scrap­ing therapy has a better curative effect and the therapeutic methods of scraping for these diseases. Stress is placed on disease differentia­tion in Western medicine and syndrome differentiation in Chinese medicine, which should be combined in practical application.

Although there are more than 140,000 kinds of disease known to modem medicine, all diseases are related to dysfunction of the 14 meridians and internal organs, according to traditional Chinese med­icine. The object of scraping therapy is to correct the disharmony in the meridians and internal organs to recover the normal bodily func­tions. Thus, the scraping of a set of meridian points can be used to treat many diseases. In the section on clinical application only about 100 kinds of common diseases are discussed, although the actual number is much more than that. For easy reference the "Index of Diseases and Symptoms" is appended at the back of the book.

3. Practical: Using simple language and plenty of pictures and diagrams to guarantee that readers can easily leam, memorize and apply the principles of scraping therapy. As long as they master the methods explained in Chapter Three, readers without any medical knowledge can apply scraping therapy to themselves or others, with reference to the pictures in Chapters Four and Five. Besides scraping therapy, herbal treatment for each disease or syndrome is explained and may be used in combination with the scraping techniques.

Referring to the Holographic Meridian Hand Diagnosis and pic­tures at the back of the book will enhance accuracy of diagnosis and increase the effectiveness of scraping therapy.

Since the first publication and distribution of the Chinese edition of the book in July 1995, it has been welcomed by both medical specialists and lay people. In March 1996 this book was republished and adopted as a textbook by the School for Advanced Studies of Traditional Chinese Medicine affiliated to the Institute of the Acu­puncture and Moxibustion of the China Academy of Traditional Chi­nese Medicine.

In order to bring this health care method to more and more peo­ple and to make traditional Chinese medicine better appreciated They have modified and replenished this book in the spirit of constant im­provement. They hope that they may make a contribution to the health care of mankind with this natural therapy which has no side-effects and causes no pollution.

They hope that the Holographic Meridian Scraping Therapy can help the health and happiness of more and more families in the world.

Source: http://ezinearticles.com/?The-Features-of-the-Holographic-Meridian-Scraping-Therapy&id=5005031

Wednesday, 6 May 2015

Web Scraping Services Are Important Tools For Knowledge

Data extraction and web scraping techniques are important tools to find relevant data and information for personal or business use. Many companies, self-employed to copy and paste data from web pages. This process is very reliable, but very expensive as it is a waste of time and effort to get results. This is because the data collected and spent less resources and time required to collect these data are compared.

At present, several mining companies and their websites effective web scraping technique specifically for the thousands of pages of information developed culture can be traced. The information from a CSV file, database, XML file, or any other source with the required format is alameda. understanding of correlations and patterns in the data, so that policies can be designed to assist decision making. The information can also be stored for future reference.

The following are some common examples of data extraction process:

In order to rule through a government portal, citizens who are reliable for a given survey name removed.

Competitive pricing and data products include scraping websites

To access the web site or web design Stock download the videos and photos of scratching

Automatic Data Collection

It regularly collects data on a regular basis. Automated data collection techniques are very important because they find the company’s customer trends and market trends to help. By determining market trends, it is possible to understand customer behavior and predict the likelihood of the data will change.

The following are some examples of automated data collection:

Monitoring of special hourly rates for stocks

collects daily mortgage rates from various financial institutions

on a regular basis is necessary to check the weather

By using web scraping services, you can extract all data related to your business. Then analyzed the data to a spreadsheet or database can be downloaded and compared. Storing data in a database or in a required format and interpretation of the correlations to understand and makes it easier to identify hidden patterns.

Data extraction services, it is possible pricing, email, databases, profile data, and consistently to competitors for information about the data. Different techniques and processes designed to collect and analyze data, and has developed over time. Web Scraping for business processes that have beaten the market recently is one. It is a process from various sources such as websites and databases with large amounts of data provides.

Some of the most common methods used to scrape web crawling, text, fun, DOM analysis and include matching expression. After the process is only analyzers, HTML pages or meaning can be achieved through annotations. There are many different ways of scaling data, but more importantly is working toward the same goal. The main purpose of using web scraping service to retrieve and compile data in databases and web sites. In the business world is to remain relevant to the business process.

The central question about the relevance of web scraping contact. The process is relevant to the business world? The answer is yes. The fact that it is used by large companies in the world and many awards speaks derivatives.

Source: http://www.selfgrowth.com/articles/web-scraping-services-are-important-tools-for-knowledge

Tuesday, 28 April 2015

A Guide to Web Scraping Tools

Web Scrapers are tools designed to extract / gather data in a website via crawling engine usually made in Java, Python, Ruby and other programming languages.Web Scrapers are also called as Web Data Extractor, Data Harvester , Crawler and so on which most of them are web-based or can be installed in local desktops.

Its main purpose is to enable webmasters, bloggers, journalist and virtual assistants to harvest data from a certain website whether text, numbers, contact details and images in a structured way which cannot be done easily thru manual copy and paste method. Typically, it transforms the unstructured data on the web, from HTML format into a structured data stored in a local database or spreadsheet or automates web human browsing.

Web Scraper Usage

Web Scrapers are also being used by SEO and Online Marketing Analyst to pull out some data privately from the competitor’s website such as high targeted keywords, valuable links, emails & traffic sources that were also perform by SEOClerk, Google and many other web crawling sites.

Includes:

•    Price comparison
•    Weather data monitoring
•    Website change detection
•    Research
•    Web mash up
•    Info graphics
•    Web data integration
•    Web Indexing & rank checking
•    Analyze websites quality links

List of Popular Web Scrapers

There are hundreds of Web Scrapers today available for both commercial and personal use. If you’ve never done any web scraping before, there are basic

Web scraping tools like YahooPipes, Google Web Scrapers and Outwit Firefox extensions that it’s good to start with but if you need something more flexible and has extra functionality then,  check out the following:

HarvestMan [ Free Open Source]

HarvestMan is a web crawler application written in the Python programming language. HarvestMan can be used to download files from websites, according to a number of user-specified rules. The latest version of HarvestMan supports as much as 60 plus customization options. HarvestMan is a console (command-line) application. HarvestMan is the only open source, multithreaded web-crawler program written in the Python language. HarvestMan is released under the GNU General Public License.Like Scrapy, HarvestMan is truly flexible however, your first installation would not be easy.

Scraperwiki [Commercial]

Using a minimal programming you will be able to extract anything. Off course, you can also request a private scraper if there’s an exclusive in there you want to protect. In other words, it’s a marketplace for data scraping.

Scraperwiki is a site that encourages programmers, journalists and anyone else to take online information and turn it into legitimate datasets. It’s a great resource for learning how to do your own “real” scrapes using Ruby, Python or PHP. But it’s also a good way to cheat the system a little bit. You can search the existing scrapes to see if your target website has already been done. But there’s another cool feature where you can request new scrapers be built.  All in all, a fantastic tool for learning more about scraping and getting the desired results while sharpening your own skills.

Best use: Request help with a scrape, or find a similar scrape to adapt for your purposes.

FiveFilters.org [Commercial]   

Is an online web scraper available for commercial use. Provides easy content extraction using Full-Text RSS tool which can identify and extract web content (news articles, blog posts, Wikipedia entries, and more) and return it in an easy to parse format. Advantages; speedy article extraction, Multi-page support, has a Autodetection and  you can deploy  on the cloud server without database required.

Kimono

Produced by Kimono labs this tool lets you convert data to into apis for automated export.   Benjamin Spiegel did a great Youmoz post on how to build a custom ranking tool with Kimono, well worth checking out!

Mozenda [Commercial]

This is a unique tool for web data extraction or web scarping.Designed for easiest and fastest way of getting data from the web for everyone. It has a point & click interface and with the power of the cloud you can scrape, store, and manage your data all with Mozenda’s incredible back-end hardware. More advance, you can automate your data extraction leaving without a trace using Mozenda’s  anonymous proxy feature that could rotate tons of IP’s .

Need that data on a schedule? Every day? Each hour? Mozenda takes the hassle out of automating and publishing extracted data. Tell Mozenda what data you want once, and then get it however frequently you need it. Plus it allows advanced programming using REST API the user can connect directly Mozenda account.

Mozenda’s Data Mining Software is packed full of useful applications especially for sales people. You can do things such as “lead generation, forecasting, acquiring information for establishing budgets, competitor pricing analysis. This software is a great companion for marketing plan & sales plan creating.

Using Refine Capture tetx tool, Mozenda is smart enough to filter the text you want stays clean or get  the specific text or split them into pieces.

80Legs [Commercial]

The first time I heard about 80Legs my mind really got confused of what really this software does. 80Legs like Mozenda is a web-based data extraction  tool with customizable features:

•    Select which websites to crawl by entering URLs or uploading a seed list
•    Specify what data to extract by using a pre-built extractor or creating your own
•    Run a directed or general web crawler
•    Select how many web pages you want to crawl
•    Choose specific file types to analyze

80 legs offers customized web crawling that lets you get very specific about your crawling parameters, which tell 80legs what web pages you want to crawl and what data to collect from those web pages and also the general web crawling which can collect data like web page content, outgoing links and other data. Large web crawls take advantage of 80legs’ ability to run massively parallel crawls.

Also crawls data feeds and offers web extraction design services. (No installation needed)

ScrapeBox [Commercial]

ScrapeBox are most popular web scraping tools to SEO experts, online marketers and even spammers with its very user-friendly interface you can easily harvest data from a website;

•    Grab Emails
•    Check page rank
•    Checked high value backlinks
•    Export URLS
•    Checked Index
•    Verify working proxies
•    Powerful RSS Submission

Using thousands of rotating proxies you will be able to sneak on the competitor’s site keywords, do research on .gov sites, harvesting data, and commenting without getting blocked.

The latest updates allow the users to spin comments and anchor text to avoid getting detected by search engines.

You can also check out my guide to using Scrapebox for finding guest posting opportunities:

Scrape.it [Commercial]

Using a simple point & click Chrome Extension tool, you can extract data from websites that render in javascript. You can automate filling out forms, extract data from popups, navigate and crawl links across multiple pages, extract images from even the most complex websites with very little learning curve. Schedule jobs to run at regular intervals.

When a website changes layout or your web scraper stops working, scrape.it  will fix it automatically so that you can continue to receive data uninterrupted and without the need for you to recreate or edit it yourself.

They work with enterprises using our own tool that we built to deliver fully managed solutions for competitive pricing analysis, business intelligence, market research, lead generation, process automation and compliance & risk management requirements.

Features:

    Very easy web date extraction with Windows like Explorer interface

    Allowing you to extract text, images and files from modern Web 2.0 and HTML5 websites which uses Javascript & AJAX.

    The user could select what features they’re going to pay with

    lifetime upgrade and support at no extra charge on premium license

Scrapy [Free Open Source]

Off course the list would not be cool without Scrapy, it is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Features:

•         Design with simplicity- Just writes the rules to extract the data from web pages and let Scrapy crawl the entire web site. It can crawl 500 retailers’ sites daily.

•         Ability to attach new code for extensibility without having to touch the framework core

•         Portable, open-source, 100% Python- Scrapy is completely written in Python and runs on Linux, Windows, Mac and BSD

•         Scrapy comes with lots of functionality built in.

•         Scrapy is extensively documented and has an comprehensive test suite with very good code coverage

•         Good community and commercial support

 Cons: The installation process is hard to perfect especially for beginners

Needlebase [Commercial]

Many organizations, from private companies to government agencies, store their info in a searchable database that requires you navigate a list page listing results, and a detail page with more information about each result.  Grabbing all this information could result in thousands of clicks, but as long as it fits the same formula, Needlebase can do it for you.  Point and click on example data from one page once to show Needlebase how your site is structured, and it will use that pattern to extract the information you’re looking for into a dataset.  You can query the data through Needle’s site, or you can output it as a CSV or other file format of your choice.  Needlebase can also rerun your scraper every day to continuously update your dataset.

OutwitHub [Free]

This Firefox extension is one of the more robust free products that exists Write your own formula to help it find information you’re looking for, or just tell it to download all the PDFs listed on a given page.  It will suggest certain pieces of information it can extract easily, but it’s flexible enough for you to be very specific in directing it.  The documentation for Outwit is especially well written, they even have a number of tutorials for what you might be looking to do.  So if you can’t easily figure out how to accomplish what you want, investing a little time to push it further can go a long way.

Best use: more text

irobotsoft [Free}

This is a free program that is essentially a GUI for web scraping. There’s a pretty steep learning curve to figure out how to work it, and the documentation appears to reference an old version of the software. It’s the latest in a long tradition of tools that lets a user click through the logic of web scraping. Generally, these are a good way to wrap your head around the moving parts of a scrape, but the products have drawbacks of their own that makes them little easier than doing the same thing with scripts.

Cons: The documentation seems outdated

Best use: Slightly complex scrapes involving multiple layers.

iMacros [Free]

The  same ethos on how microsoft macros works, iMacros automates repetitive task.Whether you choose the website, Firefox extension, or Internet Explorer add-on flavor of this tool, it can automate navigating through the structure of a website to get to the piece of info you care about. Record your actions once, navigating to a specific page, and entering a search term or username where appropriate.  Especially useful for navigating to a specific stock you care about, or campaign contribution data that’s mired deep in an agency website and lacks a unique Web address.  Extract that key piece (pieces) of info into a usable form.  Can also help convert Web tables into usable data, but OutwitHub is really more suited to that purpose.  Helpful video and text tutorials enable you to get up to speed quickly.

Best use: Eliminate repetition in navigating to a particular datapoint in a website that you’re checking up on often by recording a repeatable action that pulls the datapoint out of the clutter it’s naturally surrounded by.

InfoExtractor [Commercial]

This is a neat little web service that generates all sorts of information given a list of urls. Currently, it only works for YouTube video pages, YouTube user profile pages, Wikipedia entries, Huffingtonpost posts, Blogcatalog blog posts and The Heritage Foundation blog (The Foundry). Given a url, the tool will return structured information including title, tags, view count, comments and so on.

Google Web Scraper [Free]

A browser-based web scraper works like Firefox’s Outwit Hub, it’s designed for plain text extraction from any online pages and export to spreadsheets via Google docs. Google Web Scraper can be downloaded as an extension and you can install it in your Chrome browser without seconds. To use it: highlight a part of the webpage you’d like to scrape, right-click and choose “Scrape similar…”. Anything that’s similar to what you highlighted will be rendered in a table ready for export, compatible with Google Docs™. The latest version still had some bugs on spreadsheets.

Cons: It doesn’t work for images and sometimes it can’t perform well on huge volume of text but it’s easy and fast to use.


Tutorials:

Scraping Website Images Manually using Google Inspect Elements

The main purpose of Google Inspect Elements is for debugging like the Firefox Firebug however, if you’re flexible you can use this tool also for harvesting images in a website. Your main goal is to get the specific images like web backgrounds, buttons, banners, header images and product images which is very useful for web designers.

Now, this is a very easy task. First, you will definitely need to download and install the Google Chrome browser in your computer. After the installation do the following:

1. Open the desired webpage in Google Chrome

2. Highlight any part of the website and right click > choose Google Inspect Elements

3. In the Google Inspect Elements, go to Resources tab

4. Under Resources tab, expand all folders. You will eventually see script folders and IMAGES folders

5. In the Images folders, just use arrow keys to find the images you need to have (see the screenshot above)

6. Next, right click the images and choose Open the Image in New Tab

7. Finally, right click the image > choose Save Image As… . (save to your local folder)

You’re done!

How to Extract Links from a Web Page with OutWit Hub

In this tutorial we are going to learn how to extract links from a webpage with OutWit Hub.

Sometimes it can be useful to extract all links from a given web page. OutWit Hub is the easiest way to achieve this goal.

1. Launch OutWit Hub

If you haven’t installed OutWit Hub yet, please refer to the Getting Started with OutWit Hub tutorial.

Begin by launching OutWit Hub from Firefox. Open Firefox then click on the OutWit Button in the toolbar.

If the icon is not visible go to the menu bar and select Tools -> OutWit -> OutWit Hub

OutWit Hub will open displaying the Web page currently loaded on Firefox.


2. Go to the Desired Web Page

In the address bar, type the URL of the Website.

Go to the Page view where you can see the Web page as it would appear in a traditional browser.

Now, select “Links” from the view list.

In the “Links” widget, OutWit Hub displays all the links from the current page.

If you want to export results to Excel, just select all links using ctrl/cmd + A, then copy using ctrl/cmd + C and paste it in Excel (ctrl/cmd + V).

Source: http://www.garethjames.net/a-guide-to-web-scrapping-tools/

Wednesday, 22 April 2015

Hand Scraped Versus Machine Scraped Floors - The Distinction

In society today hardwood flooring has become the new must have. The days of carpet are gone, and if you have looked into bringing your home up to date with the styling of today you will have noticed by now that there are many different options. At times this may become very overwhelming, especially if you are not a hardwood specialist like most people are not. That is why this article is here to help you understand the many different options available to you.

The flooring type covered in this article is hand scraped flooring. This flooring type is a custom look flooring that is in very high demand in flooring marketplace, which is understandable because it is probably the most unique flooring there is. You can choose from many different types of wood species such as oak, maple, hickory, and most exotic species. There is computerized hand scraped that is when the manufacturer makes one piece of wood and places it into a computer that will cut thousands of different wood types with that one design. This type of process is also known as machine scraping. Hardwood floors employing this type of technology usually cost less, but most of the pieces look the same because the hand scraping is done by a machine.

Then you have actual hand scraped flooring that is done all by hand and takes more time and effort than machine scraped. This flooring is made custom each individual piece is scraped and notched in different ways, so every piece is unique. If you decide to purchase actual hand scraped flooring it will cost you more than mass produced computerized version but it will definitely be the more unique option. If you are the type of person who wants to have a one of kind floor then an actual hand scraped floor is the way to go.

So in conclusion hand scraped flooring is a great option for a lot of people. It comes in several different wood types, and several different colors. You can find flooring options for every budget and to meet every style. If having a custom floor in your home it may be important or not important on whether it be computer or done by hand. Most consumers cannot tell the difference between actual hand scraped flooring and machine scraped when just looking at a small sample. So when shopping at your local retailer ask the tough questions and find out if the manufacturer uses machine or authentic hand scrapping on their products.

To view your many options on hand scraped flooring please check out our website that covers all hardwood flooring options.

Source: http://ezinearticles.com/?Hand-Scraped-Versus-Machine-Scraped-Floors---The-Distinction&id=4151157

Tuesday, 7 April 2015

rvest: easy web scraping with R

rvest is new package that makes it easy to scrape (or harvest) data from html web pages, inspired by libraries like beautiful soup. It is designed to work with magrittr so that you can express complex operations as elegant pipelines composed of simple, easily understood pieces. Install it with:

install.packages("rvest")

rvest in action

To see rvest in action, imagine we’d like to scrape some information about The Lego Movie from IMDB. We start by downloading and parsing the file with html():

library(rvest)

lego_movie <- html("http://www.imdb.com/title/tt1490017/")

To extract the rating, we start with selectorgadget to figure out which css selector matches the data we want: strong span. (If you haven’t heard of selectorgadget, make sure to read vignette("selectorgadget") – it’s the easiest way to determine which selector extracts the data that you’re interested in.) We use html_node() to find the first node that matches that selector, extract its contents with html_text(), and convert it to numeric with as.numeric():

lego_movie %>%

  html_node("strong span") %>%

  html_text() %>%

  as.numeric()

#> [1] 7.9

We use a similar process to extract the cast, using html_nodes() to find all nodes that match the selector:

lego_movie %>%

  html_nodes("#titleCast .itemprop span") %>%

  html_text()

#>  [1] "Will Arnett"     "Elizabeth Banks" "Craig Berry"   

#>  [4] "Alison Brie"     "David Burrows"   "Anthony Daniels"

#>  [7] "Charlie Day"     "Amanda Farinos"  "Keith Ferguson"

#> [10] "Will Ferrell"    "Will Forte"      "Dave Franco"   

#> [13] "Morgan Freeman"  "Todd Hansen"     "Jonah Hill"

The titles and authors of recent message board postings are stored in a the third table on the page. We can use html_node() and [[ to find it, then coerce it to a data frame with html_table():

lego_movie %>%

  html_nodes("table") %>%

  .[[3]] %>%

  html_table()

#>                                              X 1            NA

#> 1 this movie is very very deep and philosophical   mrdoctor524

#> 2 This got an 8.0 and Wizard of Oz got an 8.1...  marr-justinm

#> 3                         Discouraging Building?       Laestig

#> 4                              LEGO - the plural      neil-476

#> 5                                 Academy Awards   browncoatjw

#> 6                    what was the funniest part? actionjacksin

Other important functions

•    If you prefer, you can use xpath selectors instead of css: html_nodes(doc, xpath = "//table//td")).

•    Extract the tag names with html_tag(), text with html_text(), a single attribute with html_attr() or all attributes with html_attrs().

•    Detect and repair text encoding problems with guess_encoding() and repair_encoding().

•    Navigate around a website as if you’re in a browser with html_session(), jump_to(), follow_link(), back(), and forward(). Extract, modify and submit forms with html_form(), set_values() and submit_form(). (This is still a work in progress, so I’d love your feedback.)

To see these functions in action, check out package demos with demo(package = "rvest").

Source: http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/