Wednesday 31 December 2014

Data Extraction, Web Screen Scraping Tool, Mozenda Scraper

Web Scraping

Web scraping, also known as Web data extraction or Web harvesting, is a software method of extracting data from websites. Web scraping is closely related and similar to Web indexing, which indexes Web content. Web indexing is the method used by most search engines. The difference with Web scraping is that it focuses more on the translation of unstructured content on the Web, characteristically in rich text format like that of HTML, into controlled data that can be analyzed stored and in a spreadsheet or database. Web scraping also makes Web browsing more efficient and productive for users. For example, Web scraping automates weather data monitoring, online price comparison, and website change recognition and data integration. 

This clever method that uses specially coded software programs is also used by public agencies. Government operations and Law enforcement authorities use data scrape methods to develop information files useful against crime and evaluation of criminal behaviors. Medical industry researchers get the benefit and use of Web scraping to gather up data and analyze statistics concerning diseases such as AIDS and the most recent strain of influenza like the recent swine flu H1N1 epidemic.

Data scraping is an automatic task performed by a software program that extracts data output from another program, one that is more individual friendly. Data scraping is a helpful device for programmers who have to generate a line through a legacy system when it is no longer reachable with up to date hardware. The data generated with the use of data scraping takes information from something that was planned for use by an end user.

One of the top providers of Web Scraping software, Mozenda, is a Software as a Service company that provides many kinds of users the ability to affordably and simply extract and administer web data. Using Mozenda, individuals will be able to set up agents that regularly extract data then store this data and finally publish the data to numerous locations. Once data is in the Mozenda system, individuals may format and repurpose data and use it in other applications or just use it as intelligence. All data in the Mozenda system is safe and sound and is hosted in a class A data warehouses and may be accessed by users over the internet safely through the Mozenda Web Console.

One other comparative software is called the Djuggler. The Djuggler is used for creating web scrapers and harvesting competitive intelligence and marketing data sought out on the web. With Dijuggles, scripts from a Web scraper may be stored in a format ready for quick use. The adaptable actions supported by the Djuggler software allows for data extraction from all kinds of webpages including dynamic AJAX, pages tucked behind a login, complicated unstructured HTML pages, and much more. This software can also export the information to a variety of formats including Excel and other database programs.

Web scraping software is a ground-breaking device that makes gathering a large amount of information fairly trouble free. The program has many implications for any person or companies who have the need to search for comparable information from a variety of places on the web and place the data into a usable context. This method of finding widespread data in a short amount of time is relatively easy and very cost effective. Web scraping software is used every day for business applications, in the medical industry, for meteorology purposes, law enforcement, and government agencies.

Source:http://www.articlesbase.com/databases-articles/data-extraction-web-screen-scraping-tool-mozenda-scraper-3568330.html

Saturday 27 December 2014

So What Exactly Is A Private Data Scraping Services To Use You?

If your computer connects to the Internet or resources on the request for this information, and queries to different servers. If you have a website to introduce to the site server recognizes your computer's IP address and displays the data and much more. Many e - commerce sites use to log your IP address, and the browsing patterns for marketing purposes.

Related Articles

Follow Some Tips For Data Scraping Services

Web Data Scraping Assuring Scraping Success Proxy Data Services

Data Scraping Services with Proxy Data Scraping

Web Data Extraction Services for Data Collection - Screen Scrapping Services, Data Mining Services

The  Scraping server you connect to your destination or to process your information and make a filter. For example, IP address or protocol filtering traffic through a  Scraping service. As you might guess, there are many types of  Scraping services. including the ability to a high demand for the software. Email messages are quickly sent to businesses and companies to help you search for contacts.

Although there are Sanding free  Scraping IP addresses in this way can work, the use of payment services, and automatic user interface (plug and play) are easy to give.  Scraping web information services, thus offering a variety of relevant sources of data.  Scraping information service organizations are generally used where large amounts of data every day. It is possible for you to receive efficient, high precision is also affordable.

Information on the various strategies that companies,  Scraping excellent information services, and use the structure planned out and has led to the introduction of more rapid relief of the Earth.

In addition, the application software that has flexibility as a priority. In addition, there is a software that can be tailored to the needs of customers, and satisfy various customer requirements play a major role. Particular software, allows businesses to sell, a customer provides the features necessary to provide the best experience.

If you do not use a private Data Scraping Services suggest that you immediately start your Internet marketing. It is an inexpensive but vital to your marketing company. To choose how to set up a private  Scraping service, visit my blog for more information. Data Scraping Services software as the activity data and provides a large amount of information, Sorting. In this way, the company reduced the cost and time savings and greater return on investment will be a concept.

Without the steady stream of data from these sites to get stopped? Scraping HTML page requests sent by argument on the web server, depending on changes in production, it is very likely to break their staff. 

Data Scraping Services is common in the respective outsourcing company. Many companies outsource  Data Scraping Services service companies are increasingly outsourcing these services, and generally dealing with the Internet business-related activities, in particular a lot of money, can earn.

Web  Data Scraping Services, pull information from a structured plan format. Informal or semi-structured data source from the source.They are there to just work on your own server to extract data to execute. IP blocking is not a problem for them when they switch servers in minutes and back on track, scraping exercise. Try this service and you'll see what I mean.

It is an inexpensive but vital to your marketing company. To choose how to set up a private  Scraping service, visit my blog for more information. Data Scraping Services software as the activity data and provides a large amount of information, Sorting. In this way, the company reduced the cost and time savings and greater return on investment will be a concept.

Source:http://www.articlesbase.com/outsourcing-articles/so-what-exactly-is-a-private-data-scraping-services-to-use-you-5587140.html

Monday 22 December 2014

GScholarXScraper: Hacking the GScholarScraper function with XPath

Kay Cichini recently wrote a word-cloud R function called GScholarScraper on his blog which when given a search string will scrape the associated search results returned by Google Scholar, across pages, and then produce a word-cloud visualisation.

This was of interest to me because around the same time I posted an independent Google Scholar scraper function  get_google_scholar_df() which does a similar job of the scraping part of Kay’s function using XPath (whereas he had used Regular Expressions). My function worked as follows: when given a Google Scholar URL it will extract as much information as it can from each search result on the URL webpage  into different columns of a dataframe structure.

In the comments of his blog post I figured it’d be fun to hack his function to provide an XPath alternative, GScholarXScraper. Essensially it’s still the same function he wrote and therefore full credit should go to Kay on this one as he fully deserves it – I certainly had no previous idea how to make a word cloud, plus I hadn’t used the tm package in ages (to the point where I’d forgotten most of it!). The main changes I made were as follows:

    Restructure internal code of GScholarScraper into a series of local functions which each do a seperate job (this made it easier for me to hack because I understood what was doing what and why).

    As far as possible, strip out Regular Expressions and replace with XPath alternatives (made possible via the XML package). Hence the change of name to GScholarXScraper. Basically, apart from a little messing about with the generation of the URLs I just copied over my get_google_scholar_df() function and removed the Regular Expression alternatives. I’m not saying one is better than the other but f0r me personally, I find XPath shorter and quicker to code but either is a good approach for web scraping like this (note to self: I really need to lean more about regular expressions!) :)

•    Vectorise a few of the loops I saw (it surprises me how second nature this has become to me – I used to find the *apply family of functions rather confusing but thankfully not so much any more!).
•    Make use of getURL from the RCurl package (I was getting some mutibyte string problems originally when using readLines but this approach automatically fixed it for me).
•    Add option to make a word-cloud from either the “title” or the “description” fields of the Google Scholar search results
•    Added steaming via the Rstem package because I couldn’t get the Snowball package to install with my version of java. This was important to me because I was getting word clouds with variations of the same word on it e.g. “game”, “games”, “gaming”.
•    Forced use of URLencode() on generation of URLs to automatically avoid problems with search terms like “Baldur’s Gate” which would otherwise fail.

I think that’s pretty much everything I added. Anyway, here’s how it works (link to full code at end of post):

</pre>
<div id="LC198"># #EXAMPLE 1: Display word cloud based on the title field of each Google Scholar search result returned</div>
<div id="LC199"># GScholarXScraper(search.str = "Baldur's Gate", field = "title", write.table = FALSE, stem = TRUE)</div>
<div id="LC200">#</div>
<div id="LC201"># # word freq</div>
<div id="LC202"># # game game 71</div>
<div id="LC203"># # comput comput 22</div>
<div id="LC204"># # video video 13</div>
<div id="LC205"># # learn learn 11</div>
<div id="LC206"># # [TRUNC...]</div>
<div id="LC207"># #</div>
<div id="LC208"># #</div>
<div id="LC209"># # Number of titles submitted = 210</div>
<div id="LC210"># #</div>
<div id="LC211"># # Number of results as retrieved from first webpage = 267</div>
<div id="LC212"># #</div>
<div id="LC213"># # Be aware that sometimes titles in Google Scholar outputs are truncated - that is why, i.e., some mandatory intitle-search strings may not be contained in all titles</div>

<pre>

// image

I think that’s kind of cool and corresponds to what I would expect for a search about the legendary Baldur’s Gate computer role playing game :)  The following is produced if we look at the ‘description’ filed instead of the ‘title’ field:

</pre>

<div id="LC215"># # EXAMPLE 2: Display word cloud based on the description field of each Google Scholar search result returned</div>
<div id="LC216">GScholarXScraper(search.str = "Baldur's Gate", field = "description", write.table = FALSE, stem = TRUE)</div>
<div id="LC217">#</div>
<div id="LC218"># # word freq</div>
<div id="LC219"># # page page 147</div>
<div id="LC220"># # gate gate 132</div>
<div id="LC221"># # game game 130</div>
<div id="LC222"># # baldur baldur 129</div>
<div id="LC223"># # roleplay roleplay 21</div>
<div id="LC224"># # [TRUNC...]</div>
<div id="LC225"># #</div>
<div id="LC226"># # Number of titles submitted = 210</div>
<div id="LC227"># #</div>
<div id="LC228"># # Number of results as retrieved from first webpage = 267</div>
<div id="LC229"># #</div>
<div id="LC230"># # Be aware that sometimes titles in Google Scholar outputs are truncated - that is why, i.e., some mandatory intitle-search strings may not be contained in all titles</div>
<pre>

//image

Not bad. I could see myself using the text mining and word cloud functionality with other projects I’ve been playing with such as Facebook, Google+, Yahoo search pages, Google search pages, Bing search pages… could be fun!

Many thanks again to Kay for making his code publicly available so that I could play with it and improve my programming skill set.

Code:

Full code for GScholarXScraper can be found here: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/GScholarXScraper/GScholarXScraper

Original GSchloarScraper code is here: https://docs.google.com/document/d/1w_7niLqTUT0hmLxMfPEB7pGiA6MXoZBy6qPsKsEe_O0/edit?hl=en_US

Full code for just the XPath scraping function is here: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/googleScholarXScraper/googleScholarXScraper.R

Source:http://www.r-bloggers.com/gscholarxscraper-hacking-the-gscholarscraper-function-with-xpath/

Tuesday 16 December 2014

Benefits of Predictive Analytics and Data Mining Services

Predictive Analytics is the process of dealing with variety of data and apply various mathematical formulas to discover the best decision for a given situation. Predictive analytics gives your company a competitive edge and can be used to improve ROI substantially. It is the decision science that removes guesswork out of the decision-making process and applies proven scientific guidelines to find right solution in the shortest time possible.

Predictive analytics can be helpful in answering questions like:

•    Who are most likely to respond to your offer?
•    Who are most likely to ignore?
•    Who are most likely to discontinue your service?
•    How much a consumer will spend on your product?
•    Which transaction is a fraud?
•    Which insurance claim is a fraudulent?
•    What resource should I dedicate at a given time?

Benefits of Data mining include:

•    Better understanding of customer behavior propels better decision
•    Profitable customers can be spotted fast and served accordingly
•    Generate more business by reaching hidden markets
•    Target your Marketing message more effectively
•    Helps in minimizing risk and improves ROI.
•    Improve profitability by detecting abnormal patterns in sales, claims, transactions etc
•    Improved customer service and confidence
•    Significant reduction in Direct Marketing expenses

Basic steps of Predictive Analytics are as follows:

•    Spot the business problem or goal
•    Explore various data sources such as transaction history, user demography, catalog details, etc)
•    Extract different data patterns from the above data
•    Build a sample model based on data & problem
•    Classify data, find valuable factors, generate new variables
•    Construct a Predictive model using sample
•    Validate and Deploy this Model

Standard techniques used for it are:

•    Decision Tree
•    Multi-purpose Scaling
•    Linear Regressions
•    Logistic Regressions
•    Factor Analytics
•    Genetic Algorithms
•    Cluster Analytics
•    Product Association

Should you have any queries regarding Data Mining or Predictive Analytics applications, please feel free to contact us. We would be pleased to answer each of your queries in detail.

Source:http://ezinearticles.com/?Benefits-of-Predictive-Analytics-and-Data-Mining-Services&id=4766989

Saturday 13 December 2014

Microfinance Data Scraping

I went to the Datakind‘s New York Datadive last November and met the Microfinance Information Exchange (MIX), a group that ‘delivers data services, analysis, research and business information on the institutions that provide financial services to the world’s poor’. They wanted to see whether web-scraping could save them from manually gathering data. So fellow divers and I showed MIX the utility of web-scraping. Over the course of a day, about six people scraped data about microfinance institutions from a bunch of websites, saving MIX an estimated year of manual data entry.

Over the past few months, I worked further with MIX to study who has access to what sorts of financial services. DataKind just put up our blog post about the project. Read the post, or just look at the map and explore the data.

Source:https://blog.scraperwiki.com/2012/05/microfinance-data-scraping/

Thursday 4 December 2014

Finding & Removing Spam Blogs Who Scrape Content Onto Free Hosted Blogs

The more popular you become in the blogging world, the more crap you have to deal with!
Content scraping is one chore that can be dealt with swiftly once you understand what to do.
This post contains links which you can use to quickly and easily report content scrapers and spam blogs.
Please share this post and help clean up spam blogs and punish content scrapers.
First step is to find your url’s which have been scraped of content and then get the scrapers spam blog removed.

Some of the tools i use to do this are:

    Google Webmaster Tools
    Google Alerts

Finding Scraped Content
Login to your Google Webmaster Tools account and go to traffic > links to your site.
You should see something like this:
Webmaster Tools Links to Your Site

The first domain is a site which has copied and embedded my homepage which i have already dealt with.
The second site is a search engine.
The third domain is the one i want to deal with.

A common method scrapers use is to post the scraped content from your rss feed on to a free hosted blog like WordPress.com or blogger.com.

Once you click the WordPress.com link in webmaster tools, you’ll find all the url’s which have been scraped.
Links to Your Site

There’s 32 url’s which have been linked to so its simply a matter of clicking each of your links and finding the culprits.

The first link is my homepage which has been linked to by legit domains like WordPress developers.
The others are mainly linked to by spam blogs who have scraped the content and used a free hosted service which in this case is WordPress.com.
WordPress.com Links to Your Site
 Reporting & Removing Spam Blogs

Once you have the url’s of the content scraping blogs as seen in the screenshot above:

    Fill in this basic form to report spam to WordPress.com
    Fill in this form to report copyright content to WordPress.com
    Use this form to report Blogspot and Blogger.com content which has been scraped.
    Fill in one of these forms to remove content from Google

Google Alerts

Its very easy to setup a Google alert to find your post titles when they get scraped.
If you’ve setup the WordPress SEO plugin correctly, you should have included your site title at the end of all your post titles.
Then all you need to do is setup a Google alert for your site title and you’ll be notified every time a scraper links to your content.

Link Notifications

You may also receive a pingback or trackback if you have this feature enabled in your discussion settings.

Link Notifications
RSS Feed Links

Most content scrapers use automated software to scrape the content from RSS feeds.
Make sure you configure your Reading settings so only a summary is displayed.
Reading Settings Feed Summary

Next step is to configure the settings in Yoast’s SEO plugin so links back to your site are included in all RSS feed post summaries.

RSS Feed Links

This will help search engines identify you and your domain as the original author of the content.
There’s other services like copyscape and dmca which can help you protect your sites content if you’re prepared to pay a premium.
That’s it folks.
Its easy to find and get spam sites removed once you know what to do.
Hope you don’t have to deal with this garbage to often.
Ever found out your content has been scraped?
What did you do about it?

Source: http://wpsites.net/blogging/content-scraping-monitoring-and-prevention-tips/

Thursday 27 November 2014

Scraping SSL Labs Server Test Results With R

    NOTE: Qualys allows automated access to their SSL Server Test site in their T&C’s, and the R fucntion/script provided here does its best to adhere to their guidelines. However, if you launch multiple scripts at one time and catch their attention you will, no doubt, be banned.

This post will show you how to do some basic web page data scraping with R. To make it more palatable to those in the security domain, we’ll be scraping the results from Qualys’ SSL Labs SSL Test site by building an R function that will:

    fetch the contents of a URL with RCurl
    process the HTML page tags with R’s XML library
    identify the key elements from the page that need to be scraped
    organize the results into a usable R data structure

You can skip ahead to the code at the end (or in this gist) or read on for some expository that isn’t in the code’s comments.

Setting up the script and processing flow

We’ll need some assistance from three R packages to perform the scraping, processing and transformation tasks:

library(RCurl) # scraping
library(XML)   # XML (HTML) processing
library(plyr)  # data transformation

If you poke at the SSL Test site with a few different URLs, you’ll see there are three primary inputs to the GET request we’ll need to issue:

    d (the domain)
    s (the IP address to test)
    ignoreMismatch (which we’ll leave as ‘on‘)

You’ll also see that there’s often a delay between issuing a request and getting the results, so we’ll need to build in a GET+check-loop (like the javascript on the page does automagically). Finally, when the results are eventually displayed they are (at least for this example) usually either "Overall Rating" or "Assessment" and, we’ll use that status result in our tests for what to return.

We’ll account for the domain and IP address in the function parameters along with the amount of time we should pause between GET+check attempts. It’s also a good idea to provide a way to pass in any extra curl options (e.g. in the event folks are behind a proxy server and need to input that to make the requests work). We’ll define the function with some default parameters:

get_rating <- function(site="rud.is", ip="", pause=5, curl.opts=list()) {

}

This definition says that if we just call get_rating(), it will

    default to using "rud.is" as the domain (you can pick what you want in your implementation)
    not supply an IP address (which the script will then have to lookup with nsl)
    will pause 5s between GET+check attempts
    pass no extra curl options

Getting into the details

For the IP address logic, we’ll have to test if we passed in an an address string and perform a lookup if not:

# try to resolve IP if not specified; if no IP can be found, return
# a "NA" data frame

  if (ip == "") {

    tmp <- nsl(site)
    if (is.null(tmp)) {
      return(data.frame(site=site, ip=NA, Certificate=NA,
                        Protocol.Support=NA, Key.Exchange=NA,
                        Cipher.Strength=NA)) }
    ip <- tmp
  }

(don’t worry about the return(...) part yet, we’ll get there in a bit).

Once we have an IP address, we’ll need to make the call to the ssllabs.com test site and perform the check loop:

# get the contents of the URL (will be the raw HTML text)
# build the URL with sprintf

rating.dat <- getURL(sprintf("https://www.ssllabs.com/ssltest/analyze.html?d=%s&s=%s&ignoreMismatch=on", site, ip), .opts=curl.opts)

# while we don't find some indication of a completed request,
# pause and try again

while(!grepl("(Overall Rating|Assessment failed)", rating.dat)) {
  Sys.sleep(pause)
  rating.dat <- getURL(sprintf("https://www.ssllabs.com/ssltest/analyze.html?d=%s&s=%s&ignoreMismatch=on", site, ip), .opts=curl.opts)
}

We can then start making some decisions based on the results:

# if the assessment failed, return a data frame of NA's

if (grepl("Assessment failed", rating.dat)) {

  return(data.frame(site=site, ip=NA, Certificate=NA,
                    Protocol.Support=NA, Key.Exchange=NA,
                    Cipher.Strength=NA))
}

# otherwise, parse the resultant HTML

x <- htmlTreeParse(rating.dat, useInternalNodes = TRUE)

Unfortunately, the results are not “consistent”. While there are plenty of uniquely identifiable <div>s, there are enough differences between runs that we have to be a bit generic in our selection of data elements to extract. I’ll leave the view-source: of a result as an exercise to the reader. For this example, we’ll focus on extracting:

        the overall rating (A-F)
        the “Certificate” score
        the “Protocol Support” score
        the “Key Exchange” score
        the “Cipher Strength” score

There are plenty of additional fields to extract, but you should be able to extrapolate and grab what you want to from the rest of the example.

Extracting the results

We’ll need to delve into XPath to extract the <div> values. We’ll use the xpathSApply function to perform this task. Since there sometimes is a <span> tag within the <div> for the rating and since the rating has a class tag to help identify which color it should be, we use a starts-with selection parameter to just get anything beginning with rating_. If it returns an R list structure, we know we have the one with a <span> element, so we re-issue the call with that extra XPath component.

rating <- xpathSApply(x,"//div[starts-with(@class,'rating_')]/text()", xmlValue)

if (class(rating) == "list") {

  rating <- xpathSApply(x,"//div[starts-with(@class,'rating_')]/span/text()", xmlValue)
}

For the four attributes (and values) we’ll be extracting, we can use the getNodeSet call which will give us all of them into a structure we can process with xpathSApply

labs <- getNodeSet(x,"//div[@class='chartBody']/div[@class='chartRow']/div[@class='chartLabel']")

vals <- getNodeSet(x,"//div[@class='chartBody']/div[@class='chartRow']/div[starts-with(@class,'chartValue')]")

# convert them to vectors

labs <- xpathSApply(labs[[1]], "//div[@class='chartLabel']/text()", xmlValue)

vals <- xpathSApply(vals[[1]], "//div[starts-with(@class,'chartValue')]/text()", xmlValue)

At this point, labs will be a vector of label names and vals will be the corresponding values. We’ll put them, the original domain and the IP address into a data frame:

# rbind will turn the vector into row elements, with each

# value being in a column

rating.result <- data.frame(site=site, ip=ip,

                            rating=rating, rbind(vals),
                            row.names=NULL)

# we use the labs vector as the column names (in the right spot)    

colnames(rating.result) <- c("site", "ip", "rating",

                              gsub(" ", "\\.", labs))

and return the result:
return(rating.result)
Finishing up

If we run the whole function on one domain we’ll get a one-row data frame back as a result. If we use ldply from the plyr package to run the get_rating function repeatedly on a vector of domains, it will combine them all into one whole data frame. For example:

sites <- c("rud.is", "stackoverflow.com", "er-ant.com")

ratings <- ldply(sites, get_rating)

ratings

##                site              ip rating Certificate Protocol.Support Key.Exchange Cipher.Strength

## 1            rud.is  184.106.97.102      B         100               70           80              90

## 2 stackoverflow.com 198.252.206.140      A         100               90           80              90

## 3        er-ant.com            <NA>   <NA>        <NA>             <NA>         <NA>            <NA>

There are many tweaks you can make to this function to extract more data and perform additional processing. If you make some of your own changes, you’re encouraged to add to the gist (link above & below) and/or drop a note in the comments.

Hopefully you’ve seen how well-suited R is for this type of operation and have been encouraged to use it in your next attempt at some site/data scraping.

library(RCurl)
library(XML)
library(plyr)

 #' get the Qualys SSL Labs rating for a domain+cert

#'

#' @param site domain to test SSL configuration of

#' @param ip address of \code{site} (will resolve it and take\cr

#' first response if not specified, but that may not always work as you expect)

#' @param hide.results ["on"|"off"] should the results show up in the SSL Labs history (default "on")

#' @param pause timeout between tries (default 5s)

#' @param curl.opts options to pass to \code{getURL} i.e. proxy setting

#' @return data frame of results

#'

  get_rating <- function(site="rud.is", ip="", hide.results="on", pause=5, curl.opts=list()) {

# try to resolve IP if not specified; if no IP can be found, return

# a "NA" data frame

if (ip == "") {

tmp <- nsl(site)

if (is.null(tmp)) { return(data.frame(site=site, ip=NA, Certificate=NA,

Protocol.Support=NA, Key.Exchange=NA, Cipher.Strength=NA)) }

ip <- tmp

}

# need to let it actually process the certificate if not already cached

rating.dat <- getURL(sprintf("https://www.ssllabs.com/ssltest/analyze.html?d=%s&s=%s&ignoreMismatch=on&hideResults=%s", site, ip, hide.results), .opts=curl.opts)

while(!grepl("(Overall Rating|Assessment failed)", rating.dat)) {

Sys.sleep(pause)

rating.dat <- getURL(sprintf("https://www.ssllabs.com/ssltest/analyze.html?d=%s&s=%s&ignoreMismatch=on&hideResults=%s", site, ip, hide.results), .opts=curl.opts)

}

if (grepl("Assessment failed", rating.dat)) {

return(data.frame(site=site, ip=NA, Certificate=NA,

Protocol.Support=NA, Key.Exchange=NA, Cipher.Strength=NA))

}

x <- htmlTreeParse(rating.dat, useInternalNodes = TRUE)

# sometimes there is a <span ...> tag in the <div>, which will result in an

# empty list() object being returned. we check for that and handle it

# appropriately.

rating <- xmlValue(x[["//div[starts-with(@class,'rating_')]/text()"]])

if (class(rating) == "list") {

rating <- xmlValue(x[["//div[starts-with(@class,'rating_')]/span/text()"]])

}

# extract the XML objects for the ratings labels & values

labs <- getNodeSet(x,"//div[@class='chartBody']/div[@class='chartRow']/div[@class='chartLabel']")

vals <- getNodeSet(x,"//div[@class='chartBody']/div[@class='chartRow']/div[starts-with(@class,'chartValue')]")

# convert them to vectors

labs <- xpathSApply(labs[[1]], "//div[@class='chartLabel']/text()", xmlValue)

vals <- xpathSApply(vals[[1]], "//div[starts-with(@class,'chartValue')]/text()", xmlValue)

# make them into a data frame

rating.result <- data.frame(site=site, ip=ip, rating=rating, rbind(vals), row.names=NULL)

colnames(rating.result) <- c("site", "ip", "rating", gsub(" ", "\\.", labs))

return(rating.result)

}

 sites <- c("rud.is", "stackoverflow.com", "er-ant.com")

ratings <- ldply(sites, get_rating)

ratings

## site ip rating Certificate Protocol.Support Key.Exchange Cipher.Strength

## 1 rud.is 184.106.97.102 B 100 70 80 90

## 2 stackoverflow.com 198.252.206.140 A 100 90 80 90

## 3 er-ant.com <NA> <NA> <NA> <NA> <NA> <NA>

Source: http://www.r-bloggers.com/scraping-ssl-labs-server-test-results-with-r/

Sunday 23 November 2014

Data Mining Outsourcing in a Better and Unique Approach

Data mining outsourcing services are ideal for clarity in various decision making processes.  It is the ultimate goal of any organization and business to increase on its profits as well as strengthen the bond with its customers. Equipping the business in such a way that it’s very easy to detect frauds and manage risks in a convenient manner is equally important. Volumes of data that are irrelevant or cannot be used when raw needs to be converted to a more useful form.  The data mining outsourcing services can greatly help you to analyze and interpret data in a more diligent way.

This service to reliable, experienced and qualified hands is very important. Your research project or engineering project can be easily and conveniently handled by experienced staff who guarantees you an accuracy level of about 98% and a massive reduction in operating costs. The quality of work is unsurpassed and the presentation is done in a format that is easy and simple for you. The project is done in a very short time alleviating you delays as well as ensuring on-time completion of your projects. To enjoy a successful outsourcing experience, you need to bank on a famous and reliable expertise.

The only time to rely with data mining outsourcing services is when you do not have a reliable, experienced expertise in your business.  Statistics indicate that it’s very easy to lose business intelligence or expose the privacy of the customers through this process. However companies which offer secure outsourcing process are on the increase as a result of massive competition. It’s an opportunity to develop your potential of sourced data and improve your business in all fields. 

Data mining potential applications are infinite. However major applications are in the marketing research and scientific projects. It’s done both on large and small quantities of data by experienced staff well known for their best analytical procedures to guarantee you accurate and easy to use information. Data mining outsourcing services are the only perfect way to profitability.

Source:http://www.e-edge.biz/Data_Mining_Outsourcing_in_a_Better_and_Unique_Approach.html

Monday 17 November 2014

How to scrape data without coding? A step by step tutorial on import.io

Import.io (pronounced import-eye-oh) lets you scrape data from any website into a searchable database. It is perfect for gathering, aggregating and analysing data from websites without the need for coding skills. As Sally Hadadi, from Import.io, told Journalism.co.uk: the idea is to “democratise” data. “We want journalists to get the best information possible to encourage and enhance unique, powerful pieces of work and generally make their research much easier.” Different uses for journalists, supplemented by case studies, can be found here.

A beginner’s guide

After downloading and opening import.io browser, copy the URL of the page you want to scrape into the import.io browser. I decided to scrape the search results website of orphanages in London:

001 Orphanages in London

After opening the website, press the tiny pink button in top right corner of the browser and follow up with “Let’s get cracking!” in the bottom right menu which has just appeared.

Then, choose the type of scraping you want to perform. In my case, it’s a Crawler (we’ll be getting data from multiple similar pages on the same site):

crawler

And confirm the URL of the website you want to scrape by clicking “I’m there”.

As advised, choose “Detect optimal settings” and confirm the following:

data

In the menu “Rows per page” select the format in which data appears on the website, whether it is “single” or “multiple”. I’m opting for the multiple as my URL is a listing of multiple search results:multiple

Now, the time has come to “train your rows” i.e. mark which part of the website you are interested in scraping. Hover over an entire “entry” or “paragraph”:hover over entry

…and he entry will be highlighted in pink or blue. Press “Train rows”.

train rows

Repeat the operation with the next entry/paragraph so that the scraper gets the hang of the pattern of your selections. Two examples should suffice. Scroll down to the bottom of your website to make sure that all entries until the last one are selected (=highlighted in pink or blue alternately).

If it is, press “I’ve got all 50 rows” (the number depends on how many rows you have selected).

Now it’s time to focus on particular chunks of data you would like to extract. My entries consist of a name of the orphanage, address, phone number and a short description so I will extract all those to separate columns. Let’s start by adding a column “name”:

add column

Next, highlight the name of the first orphanage in the list and press “Train”.

highlighttrain

Your table should automatically fill in with names of all orphanages in the list:table name

If it didn’t, try tweaking your selection a bit. Then add another column “address” and extract the address of the orphanage by highlighting the two lines of addresses and “training” the rows.

Repeat the operation for a “phone number” and “description”. Your table should end up looking like this:table final

*Before passing on to the next column it is worth to check that all the rows have filled up. If not, highlighting and training of the individual elements might be necessary.

Once you’ve grabbed all that you need, click “I’ve got what I need”. The menu will now ask you if you want to scrape more pages. In this case, the search yielded two pages of search results so I will add another page. In order to this this, go back to your website in you regular browser, choose page 2 (or any next one) of your search results and copy the URL. Paste it into the import.io browser and confirm by clicking “I’m there”:

i'm there

The scraper should automatically fill in your table for page 2. Click “I’ve got all 45 rows” and “I’ve got what I needed”.

You need to add at least 5 pages, which is a bit frustrating with a smaller data set like this one. The way around it is to add page 2 a couple of times and delete the unnecessary rows in the final table.

Once the cheating is done, click “I’m done training!” and “Upload to import.io”.

upload

Give the name to your Crawler, e.g. “Orphanages in London” and wait for import.io to upload your data. Then, run crawler:run crawler

Make sure that the page depth is 10 and that click “Go”. If you’re scraping a huge dataset with several pages of search results, you can copy your URLs to Excel, highlight them and drag down with a black cross (bottom right of the cell) to obtain a comprehensive list. Paste it into the “Where to start?” window and press “Go”.go

crawlingAfter the crawling is complete, you can download you data in EXCEL, HTML, JSON or CSV.dataset

As a result, we obtain a data set which can be easily turned into a map of orphanages in London, e.g. using Google Fusion Tables.

Source:http://www.interhacktives.com/2014/03/06/scrape-data-without-coding-step-step-tutorial-import-io/

Thursday 13 November 2014

The PromptCloud Advantage- Web Scraping with an Edge

The global market is now more aware of its data scraping needs. And so with the demand, the list of suppliers has grown too. This post is dedicated to bringing out the PromptCloud Advantage among such providers.

PromptCloud-Winning-The Race

1. The know-how- Crawling the web, as mundane as it may sound, is a fairly complex task. No one is to be blamed for overlooking the complexity as these things surface only after you’ve tried it yourself and delved into the nitty-gritty. The design decisions you take sit at the core of what you build and eventually monetize. And the long-term effects of such architectural choices are as pleasing if you’ve done it right as disturbing they might turn out if you’re not far-sighted.

Although the expertise of building the tech stack for such large-scale data acquisition, distributing your clusters (and putting thoughts into their geographical locations), maintaining queues, databases and backups, does come from ‘been there done that’, we have been lucky to have the tech advantage imbibed into us since inception. Not that we got it right the first time, but our systems have evolved with technologies, improving each day. Now that we have been there in this business for the last 56 months, it does feel like a long journey for our stack and yes, we do know better :)

2. SLAs- SLAs are what bolsters the data itself. PromptCloud’s key SLAs are scale and quality; while not compromising the data coverage or the politeness policies on your sources. Since we perform focused crawls, there’s no dilution of data and you can consume it all or ask us to index it in order to search using logical combinations in queries. For your reference, here’s a list of all SLAs to visit while picking your data service provider.

changing_place_changing_time_changing_thouts_changing_future.

3. The Experience- There are many scraping tools and crawling services in the market which might just serve the need. What PromptCloud provides is a data acquisition experience; and we go as many number of extra miles as you’d like us to go for it. By leveraging our DaaS platform, we make sure you get what you need from the time you start your research for a data provider through importing the data feeds into your database. We hear your requirements in detail, make sure we’ve got it right by sharing samples and going multiple iterations of reprocessing the data to match your needs while you battle internally on freezing your requirements. But what’s more magical is the way all these feeds get delivered to you, at the intervals you requested; programatically.

It might be evident for the SLAs and the know-how fusing to provide the experience, but it’s that additional human touch that actually aids in sustaining it. We make sure you’re at peace while our systems handle the roadblocks and sort out the messiness on the web.

Source:https://www.promptcloud.com/blog/the-promptcloud-advantage-web-scraping/

Wednesday 12 November 2014

'Scrapers' Dig Deep for Data on Web

At 1 a.m. on May 7, the website PatientsLikeMe.com noticed suspicious activity on its "Mood" discussion board. There, people exchange highly personal stories about their emotional disorders, ranging from bipolar disease to a desire to cut themselves.

It was a break-in. A new member of the site, using sophisticated software, was "scraping," or copying, every single message off PatientsLikeMe's private online forums.

Enlarge Image

Bilal Ahmed wrote about his health on a site that was scraped. Andrew Quilty for The Wall Street Journal.

PatientsLikeMe managed to block and identify the intruder: Nielsen Co., the privately held New York media-research firm. Nielsen monitors online "buzz" for clients, including major drug makers, which buy data gleaned from the Web to get insight from consumers about their products, Nielsen says.

"I felt totally violated," says Bilal Ahmed, a 33-year-old resident of Sydney, Australia, who used PatientsLikeMe to connect with other people suffering from depression. He used a pseudonym on the message boards, but his PatientsLikeMe profile linked to his blog, which contains his real name.

After PatientsLikeMe told users about the break-in, Mr. Ahmed deleted all his posts, plus a list of drugs he uses. "It was very disturbing to know that your information is being sold," he says. Nielsen says it no longer scrapes sites requiring an individual account for access, unless it has permission.

Related Reading

    Digits: Escaping the 'Scrapers'
    Complete Coverage: What They Know

Journal Community

The market for personal data about Internet users is booming, and in the vanguard is the practice of "scraping." Firms offer to harvest online conversations and collect personal details from social-networking sites, résumé sites and online forums where people might discuss their lives.

The emerging business of web scraping provides some of the raw material for a rapidly expanding data economy. Marketers spent $7.8 billion on online and offline data in 2009, according to the New York management consulting firm Winterberry Group LLC. Spending on data from online sources is set to more than double, to $840 million in 2012 from $410 million in 2009.

The Wall Street Journal's examination of scraping—a trade that involves personal information as well as many other types of data—is part of the newspaper's investigation into the business of tracking people's activities online and selling details about their behavior and personal interests.

Some companies collect personal information for detailed background reports on individuals, such as email addresses, cell numbers, photographs and posts on social-network sites.

Others offer what are known as listening services, which monitor in real time hundreds or thousands of news sources, blogs and websites to see what people are saying about specific products or topics.

One such service is offered by Dow Jones & Co., publisher of the Journal. Dow Jones collects data from the Web—which may include personal information contained in news articles and blog postings—that help corporate clients monitor how they are portrayed. It says it doesn't gather information from password-protected parts of sites.

It's rarely a coincidence when you see Web ads for products that match your interests. WSJ's Christina Tsuei explains how advertisers use cookies to track your online habits.

The competition for data is fierce. PatientsLikeMe also sells data about its users. PatientsLikeMe says the data it sells is anonymized, no names attached.

Nielsen spokesman Matt Anchin says the company's reports to its clients include publicly available information gleaned from the Internet, "so if someone decides to share personally identifiable information, it could be included."

Internet users often have little recourse if personally identifiable data is scraped: There is no national law requiring data companies to let people remove or change information about themselves, though some firms let users remove their profiles under certain circumstances.

California has a special protection for public officials, including politicians, sheriffs and district attorneys. It makes it easier for them to remove their home address and phone numbers from these databases, by filling out a special form stating they fear for their safety.

Data brokers long have scoured public records, such as real-estate transactions and courthouse documents, for information on individuals. Now, some are adding online information to people's profiles.

Many scrapers and data brokers argue that if information is available online, it is fair game, no matter how personal.

"Social networks are becoming the new public records," says Jim Adler, chief privacy officer of Intelius Inc., a leading paid people-search website. It offers services that include criminal background checks and "Date Check," which promises details about a prospective date for $14.95.

"This data is out there," Mr. Adler says. "If we don't bring it to the consumer's attention, someone else will."

Scraping for Your Real Name

PeekYou.com has applied for a patent for a way to, among other things, match people's real names to pseudonyms they use on blogs, Twitter and online forums.

Read PeekYou.com's patent application.

Enlarge Image

New York-based PeekYou LLC has applied for a patent for a method that, among other things, matches people's real names to the pseudonyms they use on blogs, Twitter and other social networks. PeekYou's people-search website offers records of about 250 million people, primarily in the U.S. and Canada.

PeekYou says it also is starting to work with listening services to help them learn more about the people whose conversations they are monitoring. It says it hands over only demographic information, not names or addresses.

Employers, too, are trying to figure out how to use such data to screen job candidates. It's tricky: Employers legally can't discriminate based on gender, race and other factors they may glean from social-media profiles.

One company that screens job applicants for employers, InfoCheckUSA LLC in Florida, began offering limited social-networking data—some of it scraped—to employers about a year ago. "It's slowly starting to grow," says Chris Dugger, national account manager. He says he's particularly interested in things like whether people are "talking about how they just ripped off their last employer."

Scrapers operate in a legal gray area. Internationally, anti-scraping laws vary. In the U.S., court rulings have been contradictory. "Scraping is ubiquitous, but questionable," says Eric Goldman, a law professor at Santa Clara University. "Everyone does it, but it's not totally clear that anyone is allowed to do it without permission."

Scrapers and listening companies say what they're doing is no different from what any person does when gathering information online—they just do it on a much larger scale.

"We take an incomprehensible amount of information and make it intelligent," says Chase McMichael, chief executive of InfiniGraph, a Palo Alto, Calif., "listening service" that helps companies understand the likes and dislikes of online customers.

Scraping services range from dirt cheap to custom-built. Some outfits, such as 80Legs.com in Texas, will scrape a million Web pages for $101. One Utah company, screen-scraper.com, offers do-it-yourself scraping software for free. The top listening services can charge hundreds of thousands of dollars to monitor and analyze Web discussions.

Some scrapers-for-hire don't ask clients many questions.

"If we don't think they're going to use it for illegal purposes—they often don't tell us what they're going to use it for—generally, we'll err on the side of doing it," says Todd Wilson, owner of screen-scraper.com, a 10-person firm in Provo, Utah, that operates out of a two-room office. It is one of at least three firms in a scenic area known locally as "Happy Valley" that specialize in scraping.

Enlarge Image

Some of the computer code behind screen-scraper.com's software. Chris Detrick for The Wall Street Journal

Screen-scraper charges between $1,500 and $10,000 for most jobs. The company says it's often hired to conduct "business intelligence," working for companies who want to scrape competitors' websites.

One recent assignment: A major insurance company wanted to scrape the names of agents working for competitors. Why? "We don't know," says Scott Wilson, the owner's brother and vice president of sales. Another job: attempting to scrape Facebook for a multi-level marketing company that wanted email addresses of users who "like" the firm's page—as well as their friends—so they all could be pitched products.

Scraping often is a cat-and-mouse game between websites, which try to protect their data, and the scrapers, who try to outfox their defenses. Scraping itself isn't difficult: Nearly any talented computer programmer can do it. But penetrating a site's defenses can be tough.

One defense familiar to most Internet users involves "captchas," the squiggly letters that many websites require people to type to prove they're human and not a scraping robot. Scrapers sometimes fight back with software that deciphers captchas.

More From the Series

    Web's New Goldmine: Your Secrets

    Personal Details Exposed Via Biggest Websites

    Microsoft Quashed Bid to Boost Web Privacy

    On Web's Cutting Edge, Anonymity in Name Only

    Stalking by Cellphone

    Google Agonizes Over Privacy

    The Tracking Ecosystem

    On the Web, Children Face Intensive Tracking

Some professional scrapers stage blitzkrieg raids, mounting around a dozen simultaneous attacks on a website to grab as much data as quickly as possible without being detected or crashing the site they're targeting.

Raids like these are on the rise. "Customers for whom we were regularly blocking about 1,000 to 2,000 scrapes a month are now seeing three times or in some cases 10 times as much scraping," says Marino Zini, managing director of Sentor Anti Scraping System. The company's Stockholm team blocks scrapers on behalf of website clients.

At Monster.com, the jobs website that stores résumés for tens of millions of individuals, fighting scrapers is a full-time job, "every minute of every day of every week," says Patrick Manzo, global chief privacy officer of Monster Worldwide Inc. Facebook, with its trove of personal data on some 500 million users, says it takes legal and technical steps to deter scraping.

At PatientsLikeMe, there are forums where people discuss experiences with AIDS, supranuclear palsy, depression, organ transplants, post-traumatic stress disorder and self-mutilation. These are supposed to be viewable only by members who have agreed not to scrape, and not by intruders such as Nielsen.

"It was a bad legacy practice that we don't do anymore," says Dave Hudson, who in June took over as chief executive of the Nielsen unit that scraped PatientsLikeMe in May. "It's something that we decided is not acceptable, and we stopped."

Mr. Hudson wouldn't say how often the practice occurred, and wouldn't identify its client.

The Nielsen unit that did the scraping is now part of a joint venture with McKinsey & Co. called NM Incite. It traces its roots to a Cincinnati company called Intelliseek that was founded in 1997. One of its most successful early businesses was scraping message boards to find mentions of brand names for corporate clients.

In 2001, the venture-capital arm of the Central Intelligence Agency, In-Q-Tel Inc., was among a group of investors that put $8 million into the business.

Intelliseek struggled to set boundaries in the new business of monitoring individual conversations online, says Sundar Kadayam, Intelliseek's co-founder. The firm decided it wouldn't be ethical to use automated software to log into private message boards to scrape them.

But, he says, Intelliseek occasionally would ask employees to do that kind of scraping if clients requested it. "The human being can just sign in as who they are," he says. "They don't have to be deceitful."

In 2006, Nielsen bought Intelliseek, which had revenue of more than $10 million and had just become profitable, Mr. Kadayam says. He left one year after the acquisition.

At the time, Nielsen, which provides television ratings and other media services, was looking to diversify into digital businesses. Nielsen combined Intelliseek with a New York startup it had bought called BuzzMetrics.

The new unit, Nielsen BuzzMetrics, quickly became a leader in the field of social-media monitoring. It collects data from 130 million blogs, 8,000 message boards, Twitter and social networks. It sells services such as "ThreatTracker," which alerts a company if its brand is being discussed in a negative light. Clients include more than a dozen of the biggest pharmaceutical companies, according to the company's marketing material.

Like many websites, PatientsLikeMe has software that detects unusual activity. On May 7, that software sounded an alarm about the "Mood" forum.

David Williams, the chief marketing officer, quickly determined that the "member" who had triggered the alert actually was an automated program scraping the forum. He shut down the account.

The next morning, the holder of that account e-mailed customer support to ask why the login and password weren't working. By the afternoon, PatientsLikeMe had located three other suspect accounts and shut them down. The site's investigators traced all of the accounts to Nielsen BuzzMetrics.

On May 18, PatientsLikeMe sent a cease-and-desist letter to Nielsen. Ten days later, Nielsen sent a letter agreeing to stop scraping. Nielsen says it was unable to remove the scraped data from its database, but a company spokesman later said Nielsen had found a way to quarantine the PatientsLikeMe data to prevent it from being included in its reports for clients.

PatientsLikeMe's president, Ben Heywood, disclosed the break-in to the site's 70,000 members in a blog post. He also reminded users that PatientsLikeMe also sells its data in an anonymous form, without attaching user's names to it. That sparked a lively debate on the site about the propriety of selling sensitive information. The company says most of the 350 responses to the blog post were supportive. But it says a total of 218 members quit.

In total, PatientsLikeMe estimates that the scraper obtained about 5% of the messages in the site's forums, primarily in "Mood" and "Multiple Sclerosis."

Source: http://online.wsj.com/articles/SB10001424052748703358504575544381288117888

Friday 7 November 2014

Why People Hesitate To Try Data Mining

What is hindering a number of people from venturing into the promising world of data mining? Despite so much encouragement, promotions, testimonials, and evidences of the benefits of online data collection, still only a handful take the challenge and really gain the pay offs it has to offer.

It may sound unthinkable that such an opportunity for success has been neglected by many. It may also sound absurd why many well-meaning individuals are hindered from enjoying the benefits of the blessings of the 21st century.

The Causes

After considerable observation and analysis of the human psyche, one can understand the underlying reasons behind the hesitance to try the profitable data mining service. The most common reasons why people are afraid to try new technology or why they remain passive and uninvolved are: fear; lack of knowledge; and pride.

Fear. The most paralyzing of human emotions is fear. It can, to some extent, cause a person to be insane, unprofitable, sick, and lost. Although fear is a normal reaction to certain stimuli and a natural feeling experienced by humans, it must always be monitored and controlled.  Usually, people share common fears, such as: fear of change; fear of anything new; and fear of the unknown.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/people-hesitate-try-data-mining/

Monday 8 September 2014

Scraping webdata from a website that loads data in a streaming fashion

I'm trying to scrape some data off of the FEC.gov website using python for a project of mine. Normally I use python

mechanize and beautifulsoup to do the scraping.

I've been able to figure out most of the issues but can't seem to get around a problem. It seems like the data is

streamed into the table and mechanize.Browser() just stops listening.

So here's the issue: If you visit http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A ... you get the first 500

contributors whose last name starts with A and have given money to candidate P80003338 ... however, if you use

browser.open() at that url all you get is the first ~5 rows.

I'm guessing its because mechanize isn't letting the page fully load before the .read() is executed. I tried putting a

time.sleep(10) between the .open() and .read() but that didn't make much difference.

And I checked, there's no javascript or AJAX in the website (or at least none are visible when you use the 'view-

source'). SO I don't think its a javascript issue.

Any thoughts or suggestions? I could use selenium or something similar but that's something that I'm trying to avoid.

-Will

2 Answers

Why not use an html parser like lxml with xpath expressions.

I tried

>>> import lxml.html as lh
>>> data = lh.parse('http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A')
>>> name = data.xpath('/html/body/table[2]/tr[5]/td[1]/a/text()')
>>> name
[' AABY, TRYGVE']
>>> name = data.xpath('//table[2]/*/td[1]/a/text()')
>>> len(name)
500
>>> name[499]
' AHMED, ASHFAQ'
>>>



Similarly, you can create xpath expression of your choice to work with.


Source: http://stackoverflow.com/questions/9435512/scraping-webdata-from-a-website-that-loads-data-in-a-streaming-

fashion

How can I circumvent page view limits when scraping web data using Python?

I am using Python to scrape US postal code population data from http:/www.city-data.com, through this directory: http://www.city-data.com/zipDir.html. The specific pages I am trying to scrape are individual postal code pages with URLs like this: http://www.city-data.com/zips/01001.html. All of the individual zip code pages I need to access have this same URL Format, so my script simply does the following for postal_code in range:

    Creates URL given postal code
    Tries to get response from URL
    If (2), Check the HTTP of that URL
    If HTTP is 200, retrieves the HTML and scrapes the data into a list
    If HTTP is not 200, pass and count error (not a valid postal code/URL)
    If no response from URL because of error, pass that postal code and count error
    At end of script, print counter variables and timestamp

The problem is that I run the script and it works fine for ~500 postal codes, then suddenly stops working and returns repeated timeout errors. My suspicion is that the site's server is limiting the page views coming from my IP address, preventing me from completing the amount of scraping that I need to do (all 100,000 potential postal codes).

My question is as follows: Is there a way to confuse the site's server, for example using a proxy of some kind, so that it will not limit my page views and I can scrape all of the data I need?

Thanks for the help! Here is the code:

##POSTAL CODE POPULATION SCRAPER##

import requests

import re

import datetime

def zip_population_scrape():

    """
    This script will scrape population data for postal codes in range
    from city-data.com.
    """
    postal_code_data = [['zip','population']] #list for storing scraped data

    #Counters for keeping track:
    total_scraped = 0
    total_invalid = 0
    errors = 0


    for postal_code in range(1001,5000):

        #This if statement is necessary because the postal code can't start
        #with 0 in order for the for statement to interate successfully
        if postal_code <10000:
            postal_code_string = str(0)+str(postal_code)
        else:
            postal_code_string = str(postal_code)

        #all postal code URLs have the same format on this site
        url = 'http://www.city-data.com/zips/' + postal_code_string + '.html'

        #try to get current URL
        try:
            response = requests.get(url, timeout = 5)
            http = response.status_code

            #print current for logging purposes
            print url +" - HTTP:  " + str(http)

            #if valid webpage:
            if http == 200:

                #save html as text
                html = response.text

                #extra print statement for status updates
                print "HTML ready"

                #try to find two substrings in HTML text
                #add the substring in between them to list w/ postal code
                try:           

                    found = re.search('population in 2011:</b> (.*)<br>', html).group(1)

                    #add to # scraped counter
                    total_scraped +=1

                    postal_code_data.append([postal_code_string,found])

                    #print statement for logging
                    print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."
                #if substrings not found, try searching for others
                #and doing the same as above   
                except AttributeError:
                    found = re.search('population in 2010:</b> (.*)<br>', html).group(1)

                    total_scraped +=1

                    postal_code_data.append([postal_code_string,found])
                    print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."

            #if http =404, zip is not valid. Add to counter and print log        
            elif http == 404:
                total_invalid +=1

                print postal_code_string + ": Not a valid zip code. " + str(total_invalid) + " total invalid zips."

            #other http codes: add to error counter and print log
            else:
                errors +=1

                print postal_code_string + ": HTTP Code Error. " + str(errors) + " total errors."

        #if get url fails by connnection error, add to error count & pass
        except requests.exceptions.ConnectionError:
            errors +=1
            print postal_code_string + ": Connection Error. " + str(errors) + " total errors."
            pass

        #if get url fails by timeout error, add to error count & pass
        except requests.exceptions.Timeout:
            errors +=1
            print postal_code_string + ": Timeout Error. " + str(errors) + " total errors."
            pass


    #print final log/counter data, along with timestamp finished
    now= datetime.datetime.now()
    print now.strftime("%Y-%m-%d %H:%M")
    print str(total_scraped) + " total zips scraped."
    print str(total_invalid) + " total unavailable zips."
    print str(errors) + " total errors."



Source: http://stackoverflow.com/questions/25452798/how-can-i-circumvent-page-view-limits-when-scraping-web-data-using-python

Sunday 7 September 2014

Web data scraping (online news comments) with Scrapy (Python)

Since you seem like the try-first ask-question later type (that's a very good thing), I won't give you an answer, but a

(very detailed) guide on how to find the answer.

The thing is, unless you are a yahoo developer, you probably don't have access to the source code you're trying to

scrape. That is to say, you don't know exactly how the site is built and how your requests to it as a user are being

processed on the server-side. You can, however, investigate the client-side and try to emulate it. I like using Chrome

Developer Tools for this, but you can use others such as FF firebug.

So first off we need to figure out what's going on. So the way it works, is you click on the 'show comments' it loads

the first ten, then you need to keep clicking for the next ten comments each time. Notice, however, that all this

clicking isn't taking you to a different link, but lively fetches the comments, which is a very neat UI but for our

case requires a bit more work. I can tell two things right away:

    They're using javascript to load the comments (because I'm staying on the same page).
    They load them dynamically with AJAX calls each time you click (meaning instead of loading the comments with the

page and just showing them to you, with each click it does another request to the database).

Now let's right-click and inspect element on that button. It's actually just a simple span with text:

<span>View Comments (2077)</span>

By looking at that we still don't know how that's generated or what it does when clicked. Fine. Now, keeping the

devtools window open, let's click on it. This opened up the first ten. But in fact, a request was being made for us to

fetch them. A request that chrome devtools recorded. We look in the network tab of the devtools and see a lot of

confusing data. Wait, here's one that makes sense:

http://news.yahoo.com/_xhr/contentcomments/get_comments/?content_id=42f7f6e0-7bae-33d3-aa1d-

3dfc7fb5cdfc&_device=full&count=10&sortBy=highestRated&isNext=true&offset=20&pageNumber=2&_media.modules.content_commen

ts.switches._enable_view_others=1&_media.modules.content_comments.switches._enable_mutecommenter=1&enable_collapsed_com

ment=1

See? _xhr and then get_comments. That makes a lot of sense. Going to that link in the browser gave me a JSON object

(looks like a python dictionary) containing all the ten comments which that request fetched. Now that's the request you

need to emulate, because that's the one that gives you what you want. First let's translate this to some normal reqest

that a human can read:

go to this url: http://news.yahoo.com/_xhr/contentcomments/get_comments/
include these parameters: {'_device': 'full',
          '_media.modules.content_comments.switches._enable_mutecommenter': '1',
          '_media.modules.content_comments.switches._enable_view_others': '1',
          'content_id': '42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc',
          'count': '10',
          'enable_collapsed_comment': '1',
          'isNext': 'true',
          'offset': '20',
          'pageNumber': '2',
          'sortBy': 'highestRated'}

Now it's just a matter of trial-and-error. However, a few things to note here:

    Obviously the count is what decides how many comments you're getting. I tried changing it to 100 to see what

happens and got a bad request. And it was nice enough to tell me why - "Offset should be multiple of total rows". So

now we understand how to use offset

    The content_id is probably something that identifies the article you are reading. Meaning you need to fetch that

from the original page somehow. Try digging around a little, you'll find it.

    Also, you obviously don't want to fetch 10 comments at a time, so it's probably a good idea to find a way to fetch

the number of total comments somehow (either find out how the page gets it, or just fetch it from within the article

itself)

    Using the devtools you have access to all client-side scripts. So by digging you can find that that link to

/get_comments/ is kept within a javascript object named YUI. You can then try to understand how it is making the

request, and try to emulate that (though you can probably figure it out yourself)

    You might need to overcome some security measures. For example, you might need a session-key from the original

article before you can access the comments. This is used to prevent direct access to some parts of the sites. I won't

trouble you with the details, because it doesn't seem like a problem in this case, but you do need to be aware of it in

case it shows up.

    Finally, you'll have to parse the JSON object (python has excellent built-in tools for that) and then parse the

html comments you are getting (for which you might want to check out BeautifulSoup).

As you can see, this will require some work, but despite all I've written, it's not an extremely complicated task

either.

So don't panic.

It's just a matter of digging and digging until you find gold (also, having some basic WEB knowledge doesn't hurt).

Then, if you face a roadblock and really can't go any further, come back here to SO, and ask again. Someone will help

you.


Source: http://stackoverflow.com/questions/20218855/web-data-scraping-online-news-comments-with-scrapy-python

Saturday 6 September 2014

A good web data extraction/screen scraper program?


I need to capture product data from a site on a regular basis and wondered if any one knows of a good software program? I've trialed Mozenda but its a monthly subscription and pricey in the long term. Obviously something thats free would be best but I don't mind paying either. Just need a decent program thats reliable and doesn't require much programming knowledge.

You can try ScraperWiki.com if you know python.

I've experimented with Screen-Scraper and found it easy to use. The application comes in multiple versions: basic (which is free), professional, and enterprise. Also, multiple platforms are supported.

Hire a programmer to do it so that there is only a one off cost. I often see similar projects on freelancing websites like Elance and oDesk.

I really like iMacros. You can give it a test drive to see if it meets your needs with the totally free Firefox extension (there's also IE versions), but there are also more full featured application and "server" versions that have more features and ability to do thing in an unattended manner.

Here are some other alternatives to consider:

    License the data from the provider. Call em up and ask 'em.

    Use Amazon Mechanical Turk to get humans to copy and paste and format it for ya. They are cheap.

    For automation, it depends on how complicated the HTML is and how often it changes. You could use Excel's Web Data Import if it's really simple.


You can use irobot from IRobotSoft, which is totally free, and provides more functionalityies than other paid software. Watch demos here http://irobotsoft.com/help/ for how simple it is.

Questions on their forum were answered very quickly.


Source: http://stackoverflow.com/questions/2334164/a-good-web-data-extraction-screen-scraper-program

Friday 5 September 2014

How to login to website and extract data using PHP [closed]

I have installed the tiny tiny rss on to my computer (Windows) and also have Xampp installed (localhost).

I want to be able to use PHP to extract data from the Tiny tiny RSS webpage.

I have tried this it which just opens the front page:

<?php
$homepage = file_get_contents('my install tiny tiny rss url');
echo $homepage;
?>

But how do I login and extract the data.

You can use cURL to send post data and headers. To login you need to replicate the exact data exchange between the client and the server.


SOurce: http://stackoverflow.com/questions/20611918/how-to-login-to-website-and-extract-data-using-php

Is it ok to scrape data from Google results?


I'd like to fetch results from Google using curl to detect potential duplicate content. Is there a high risk of being banned by Google?

Google will eventually block your IP when you exceed a certain amount of requests.



Google disallows automated access in their TOS, so if you accept their terms you would break them.

That said, I know of no lawsuit from Google against a scraper. Even Microsoft scraped Google, they powered their search engine Bing with it. They got caught in 2011 red handed :)

There are two options to scrape Google results:

1) Use their API

    You can issue around 40 requests per hour You are limited to what they give you, it's not really useful if you want to track ranking positions or what a real user would see. That's something you are not allowed to gather.

    If you want a higher amount of API requests you need to pay.
    60 requests per hour cost 2000 USD per year, more queries require a custom deal.

2) Scrape the normal result pages

    Here comes the tricky part. It is possible to scrape the normal result pages. Google does not allow it.
    If you scrape at a rate higher than 15 keyword requests per hour you risk detection, higher than 20/h will get you blocked from my experience.
    By using multiple IPs you can up the rate, so with 100 IP addresses you can scrape up to 2000 requests per hour. (50k a day)
    There is an open source search engine scraper written in PHP at http://scraping.compunect.com It allows to reliable scrape Google, parses the results properly and manages IP addresses, delays, etc. So if you can use PHP it's a nice kickstart, otherwise the code will still be useful to learn how it is done.


Source: http://stackoverflow.com/questions/22657548/is-it-ok-to-scrape-data-from-google-results

Thursday 4 September 2014

Data Scraping from PDF and Excel


I am doing a little data scraping, There are 3 types of file from which i am scraping data.

1- HTML
2- PDF
3- Excel(xls)

For HTML i am comfortable, i am using HTML Agility for that.

For PDF and excel i need suggestions from anyone.



Concerning Excel. If you are in a MS environment you can either do Office Automation or use OLEDB. In a Java

environment look at Apache POI.

EDIT: Concerning PDF in Java try Apache PDFBox . Can also work in .NET using IKVM

I can recommend Cogniview's PDF2XL, a reasonably inexpensive commercial product, to extract data from tables in PDF

files into Excel. We have used it with great success.

HTML Agility is a library. Its good to use. But then, why do you need separate tools for different data extraction

purposes? Use Automation Anywhere to extract data from any source. As far as I know, it would work for all the three

sources you have specified. Google it.

Source: http://stackoverflow.com/questions/3147803/data-scraping-from-pdf-and-excel

Wednesday 3 September 2014

Excel VBA Data Mining Real-Time Data from a Web Page that Refreshes Data

I want to capture real-time data that updates into a table on a webpage; I prefer capturing it into excel using VBA, but I will write it in .NET C# or VB if I that is easier.

the data updates about 1 or 2 seconds, and I want to just grab the latest data quotes and log it into my spreadsheet; the table names are the same, only the data refreshes, and it does so automatically on the web page.

I've done a lot of Excel VBA and I know how to download a URL to a file--this is NOT what I want; I want to gain access to my webpage that is active and grab the data updates after I've logged into my site and selected a webpage that I like.

Is there a simple way to access this data on the webpage from Excel or .Net? Because it refreshes no more than once every 1 or 2 seconds, it is easy to just keep checking it for updates, and I can compare the latest data to see if it actually refreshed.


In Excel 2003, use Data/Import External Data/New Web Query
Browse to your page and select the table you want to import.
After that you can either do a manual Refresh, or use a timer procedure to do something like:

Source: http://stackoverflow.com/questions/9855794/excel-vba-data-mining-real-time-data-from-a-web-page-that-refreshes-data

Tuesday 2 September 2014

Need to pull data from a website…web query? macro?


I have a list of every DOT # (Dept. of Trans.) in the country. I want to find out insurance effective date for each one of these companies. If you go to http://li-public.fmcsa.dot.gov --> "continue" --> then from the dropdown select "carrier search" and hit "go" it'll take you to a search form (that is the only way to get to this screen).

From there, you can input a DOT # X (use 61222 as an example) and it'll bring you to another screen. Click "view report in HTML" and then down on the bottom you'll see "Active/Pending Insurance". I want to pull the "effective date" from that page and stick it in the spreadsheet next to the DOT # X that I already know.

Of the thousands of DOT #'s in my list, not all will have filings on this website, if that makes a difference.

Can this be done with a Macro or Excel Web Query? I know I probably sound like a total novice, but I'd appreciate any help I could get.

Can you do it? Frankly even if you could you'd lock up the spreadsheet while it's doing that processing. And in the end, how would you handle an error half-way through?

I'd not do this in a client-facing application. This sounds more like something to do in server-side app that can do the processing and gather the information in a more controlled environment. Then you Excel spreadsheet could query that app and get the information in one fell swoop. Error handling is much simpler and you don't end up sitting there staring at Excel why it works its way through thousands of web sites. It was not built to do that elegantly.

What do you write the web service I'm describing in? Well it depends on your preference. Me, I'd write it in Ruby on Rails since it can easily handle the scraping aspect of the task and can report the data out easily as well. But it really falls back to whatever you're most comfortable coding in.


Source: http://stackoverflow.com/questions/15286429/need-to-pull-data-from-a-website-web-query-macro

How to extract data from web 2.0 graphs using a scraper


I have recently come across a web page containing a graph object that displays the (x, y) values on the object as the

mouse is rolled across it. Is there any way to automate the extraction of this data?

How is the graph data loaded? If embedded in the page source then you can extract it with xpath or regex. Else use

Firebug to see how it is loaded.



You will need a solution that works inside the web browser, so the AJAX/Javascript is properly rendered.

I have used iMacros with good success for web scraping in the past. There are free/open-source and "PRO" paid editions

(comparison table here).

Another option is always to custom code something with the Microsoft webbrowser control.


Source: http://stackoverflow.com/questions/3980774/how-to-extract-data-from-web-2-0-graphs-using-a-scraper

Legality of Web Scraping vs Normal Use


I know the topic of web scraping has been discussed before (example), and I understand it's a bit of a grey area

depending on a lot of factors (e.g. website's terms of use).

What I'd like to ask is: how is web scraping any different from (a) how we access the webpage via a web browser, and

(b) how web crawlers (e.g. Google) download and index webpages?

Without knowing the legal background, I can't help but think that they're all just HTTP requests. If web scraping is

illegal, then so should crawling and indexing (for instance be illegal).

Of course if your program is hitting the server so hard that it causes a denial of service, it's a different story

altogether... my point is simply accessing and using data that is already open to the public.



I know this is a dead thread, but it would be nice to place some legal implications here due to its ranking in my

Google Search. I cannot help but figure I am not the only one who searches like I do.

Legally, in the US, there are a few factors that seem to be important.

    Are you doing anything that is akin to hacking or gaining unauthorized access via the Computer Fraud and Abuse Act.

Exploiting vulnerabilities and passing SQL in the URL to open a database no matter how bad the idiot programming like

that was is illegal with a 15 year sentence (see the cases where an individual exploited security vulnerabilities in

Verizon). Also, add a time out even if you round robin or use proxies. DDoS attacks are attacks. 1000 requests per

second can shut down a lot of servers providing public information. The result here is up to 15 years in jail.

    Copyright Law: As mentioned, pure replication of data is illegal. Even 4% replication has been deemed a breach.

With the recent gutting of the DMCA, a person is even more vulnerable to civil and criminal penalties.

    Trespass and Chattels: The following from wikipedia says it all.

    U.S. courts have acknowledged that users of "scrapers" or "robots" may be held liable for committing trespass to

chattels,[5][6] which involves a computer system itself being considered personal property upon which the user of a

scraper is trespassing. The best known of these cases, eBay v. Bidder's Edge, resulted in an injunction ordering

Bidder's Edge to stop accessing, collecting, and indexing auctions from the eBay web site.

    Paywalls and Product: When going behind paywalls and breaching contract by clicking an agreement not to do

something and then doing it, you add fuel to the protection of negligence v. willingness [an issue for damages and

penalties not guilt] in civil and any criminal trials. (sorry originally wanted to say ignorance but it really isn't a

defense)

    International: EU law and other law is way more lax. Corporations with big budgets dominate our legal landscape.

They control the system in a very real way with their $$$.

Basically, get public information and information that is available without going behind a pay wall. Think like a user

of the internet and combine a bunch of sources into a unique product. Don't just 'steal' an entire site (it isn't

really stealing if it is a government site that offers public data especially for download but is if you download all

or even more than a couple of the listings on ebay). Read the terms and conditions to know who actually owns the

content.

Here are a few examples. Trulia owns its information but you could use it to go to an agents website and collect a

legal amount of information. The legal amount is determinable. However, a public MLS listing lookup site with no

agreement or terms and offering data to the public is fair game. The MLS numbers lists, however, are normally not fair

game.

If a researcher can get to data, so can you. If a researcher needs permission, so do you. A computer is like having a

million corporate researchers at your disposal.

AS for company policy, it is usually used internally to shield from liability and serves as a warning but is not

entirely enforceable. The legal parts letting you know about copyrights and such are and usually are supposed to be

known by everyone. Complete ignorance is not a legal protection. It does provide a ground set of rules. Be nice, or get

banned is that message as far as I know.

My personal strategy is to start with public data and embellish it within legal means.


Source: http://stackoverflow.com/questions/14735791/legality-of-web-scraping-vs-normal-use

Anyone knows an online tool that can scrape a page and create a REST API for the scraped data?


I'm looking for a SaaS solution that is able to login to a platform, scrape data (reports) and then allow accessing the

data through an API. I have some reporting platforms that provide web reporting and email reporting but with no API.

Online reporting doesn't help and email reporting, although can be automated and scraped, isn't so reliable.

If you are willing to do the scraping through your own connection, have a look at Import IO. They have a desktop

application that you use to teach the system how to scrape a page, and then you run the crawler from that application -

and you can run it for as long as you like, as far as I can tell.

You may then upload your data to the Import cloud, from where it is available via an API on the import.io servers.

Useful data can be made public to donate it "to the commons" if you wish.


I did some more digging, found iMacros as a possible solution. Its Windows based, which is a drawback in my case, but

it does allow automation of the scraping and afterwards interaction via common web scripting languages like PHP and

ASP.net.


If you are familiar with jQuery, I think you can use node.js and Cheerio module, then you can create a simple

application to do auto scraping. Actually I have already built a site to do on line web scraping based on the above

mentioned tech, the site is www.datafiddle.net, you can take a look at it.


Source: http://stackoverflow.com/questions/19646028/anyone-knows-an-online-tool-that-can-scrape-a-page-and-create-a-

rest-api-for-the