Friday, 3 July 2015

SFTW: Scraping data with Google Refine

For the first Something For The Weekend of 2012 I want to tackle a common problem when you’re trying to scrape a collection of webpage: they have some sort of structure in their URL like this, where part of the URL refers to the name or code of an entity:     http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237521

  tp://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237629

    ttp://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237823

In this instance, you can see that the URL is identical apart from a 7 digit code at the end: the ID of the school the data refers to.

There are a number of ways you could scrape this data. You could use Google Docs and the =importXML formula, but Google Docs will only let you use this 50 times on any one spreadsheet (you could copy the results and select Edit > Paste Special > Values Only and then use the formula a further 50 times if it’s not too many – here’s one I prepared earlier).

And you could use Scraperwiki to write a powerful scraper – but you need to understand enough coding to do so quickly (here’s a demo I prepared earlier).

A middle option is to use Google Refine, and here’s how you do it.

Assembling the ingredients

With the basic URL structure identified, we already have half of our ingredients. What we need  next is a list of the ID codes that we’re going to use to complete each URL.

An advanced search for “list seed number scottish schools filetype:xls” brings up a link to this spreadsheet (XLS) which gives us just that.

The spreadsheet will need editing: remove any rows you don’t need. This will reduce the time that the scraper will take in going through them. For example, if you’re only interested in one local authority, or one type of school, sort your spreadsheet so that you can delete those above or below them.

Now to combine  the ID codes with the base URL.

Bringing your data into Google Refine

Open Google Refine and create a new project with the edited spreadsheet containing the school IDs.

At the top of the school ID column click on the drop-down menu and select Edit column > Add column based on this column…

In the New column name box at the top call this ‘URL’.

In the Expression box type the following piece of GREL (Google Refine Expression Language):

“http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=”+value

(Type in the quotation marks yourself – if you’re copying them from a webpage you may have problems)

The ‘value’ bit means the value of each cell in the column you just selected. The plus sign adds it to the end of the URL in quotes.

In the Preview window you should see the results – you can even copy one of the resulting URLs and paste it into a browser to check it works. (On one occasion Google Refine added .0 to the end of the ID number, ruining the URL. You can solve this by changing ‘value’ to value.substring(0,7) – this extracts the first 7 characters of the ID number, omitting the ‘.0') UPDATE: in the comment Thad suggests “perhaps, upon import of your spreadsheet of IDs, you forgot to uncheck the importer option to Parse as numbers?”

Click OK if you’re happy, and you should have a new column with a URL for each school ID.

Grabbing the HTML for each page

Now click on the top of this new URL column and select Edit column > Add column by fetching URLs…

In the New column name box at the top call this ‘HTML’.

All you need in the Expression window is ‘value’, so leave that as it is.

Click OK.

Google Refine will now go to each of those URLs and fetch the HTML contents. As we have a couple thousand rows here, this will take a long time – hours, depending on the speed of your computer and internet connection (it may not work at all if either isn’t very fast). So leave it running and come back to it later.

Extracting data from the raw HTML with parseHTML

When it’s finished you’ll have another column where each cell is a bunch of HTML. You’ll need to create a new column to extract what you need from that, and you’ll also need some GREL expressions explained here.

First you need to identify what data you want, and where it is in the HTML. To find it, right-click on one of the webpages containing the data, and search for a key phrase or figure that you want to extract. Around that data you want to find a HTML tag like <table class=”destinations”> or <div id=”statistics”>. Keep that open in another window while you tweak the expression we come onto below…

Back in Google Refine, at the top of the HTML column click on the drop-down menu and select Edit column > Add column based on this column…

In the New column name box at the top give it a name describing the data you’re going to pull out.

In the Expression box type the following piece of GREL (Google Refine Expression Language):

value.parseHtml().select(“table.destinations”)[0].select(“tr”).toString()

(Again, type the quotation marks yourself rather than copying them from here or you may have problems)

I’ll break down what this is doing:

value.parseHtml()

parse the HTML in each cell (value)

.select(“table.destinations”)

find a table with a class (.) of “destinations” (in the source HTML this reads <table class=”destinations”>. If it was <div id=”statistics”> then you would write .select(“div#statistics”) – the hash sign representing an ‘id’ and the full stop representing a ‘class’.

[0]

This zero in square brackets tells Refine to only grab the first table – a number 1 would indicate the second, and so on. This is because numbering (“indexing”) generally begins with zero in programming.

.select(“tr”)

Now, within that table, find anything within the tag <tr>

.toString()

And convert the results into a string of text.

The results of that expression in the Preview window should look something like this:

<tr> <th></th> <th>Abbotswell School</th> <th>Aberdeen City</th> <th>Scotland</th> </tr> <tr> <th>Percentage of pupils</th> <td>25.5%</td> <td>16.3%</td> <td>22.6%</td> </tr>

This is still HTML, but a much smaller and manageable chunk. You could, if you chose, now export it as a spreadsheet file and use various techniques to get rid of the tags (Find and Replace, for example) and split the data into separate columns (the =SPLIT formula, for example).

Or you could further tweak your GREL code in Refine to drill further into your data, like so:

value.parseHtml().select(“table.destinations”)[0].select(“td”)[0].toString()

Which would give you this:

<td>25.5%</td>

Or you can add the .substring function to strip out the HTML like so (assuming that the data you want is always 5 characters long):

value.parseHtml().select(“table.destinations”)[0].select(“td”)[0].toString().substring(5,10)

When you’re happy, click OK and you should have a new column for that data. You can repeat this for every piece of data you want to extract into a new column.

Then click Export in the upper right corner and save as a CSV or Excel file.

Source: http://onlinejournalismblog.com/2012/01/13/sftw-scraping-data-with-google-refine/

Wednesday, 24 June 2015

Data Scraping - Hand Scraped Hardwood Flooring Gives Your Home That Exclusive Look

Today hand scraped hardwood flooring is becoming extremely popular in the more opulent homes as well as in some commercial properties. Although this type of flooring has only recently become fashionable it has been around for many centuries.

Certainly before the invention of modern sanding techniques all floors where hand scraped at the location where they were to be installed to ensure that the floor would be flat and even. However today this method is used instead to provide texture, richness as well as a unique look and feel to the flooring.

Although manufacturers have produced machines which can provide a scraped look to their flooring it looks cheap compared to the real thing. Unfortunately the main problem with using a machine to scrape the flooring is that it provides a uniform look to the pattern of the wood. Because of this it lacks the natural feel that you would see with a floor which has been scraped by hand.

When done by hand, scraping creates a truly unique look to the floor. However the actual look and feel of each floor will vary as it depends on the skills of the person actually carrying out the work. If there is no control in place whilst the work is being carried out this can result in disastrous look to the finished product.

Many manufacturers who actually provide hand scraped hardwood flooring will either just dent, scoop or rough the floor up. But others will use sanding techniques in order to create a worn and uneven look to the flooring. The more professional teams will scrape the entire surface of the wood in order to create the unique hand made look for their customers.

Many companies will allow their customers to choose what type of scraping takes place on their wood. They can choose between light, medium and heavy. The companies who are really good at hand scraping will be able give the hardwood floor a reclaimed look by including wormholes, splits and other naturally-occurring features within the wood.

If you do decide to choose hand scraped hardwood flooring you will need to factor the costs that are associated with it into your budget. Unfortunately this type of flooring does not come cheap and you can find yourself paying upwards of $15 per sq ft. But once it is installed it will give a room a unique and warm rich feel to it and is certainly going to wow your friends and family when they see it for the first time.

Source: http://ezinearticles.com/?Hand-Scraped-Hardwood-Flooring-Gives-Your-Home-That-Exclusive-Look&id=572577

Friday, 19 June 2015

Migrating Table-oriented Web Scraping Code to rvest w/XPath & CSS Selector Examples

My intrepid colleague (@jayjacobs) informed me of this (and didn’t gloat too much). I’ve got a “pirate day” post coming up this week that involves scraping content from the web and thought folks might benefit from another example that compares the “old way” and the “new way” (Hadley excels at making lots of “new ways” in R :-) I’ve left the output in with the code to show that you get the same results.

The following shows old/new methods for extracting a table from a web site, including how to use either XPath selectors or CSS selectors in rvest calls. To stave of some potential comments: due to the way this table is setup and the need to extract only certain components from the td blocks and elements from tags within the td blocks, a simple readHTMLTable would not suffice.

The old/new approaches are very similar, but I especially like the ability to chain output ala magrittr/dplyr and not having to mentally switch gears to XPath if I’m doing other work targeting the browser (i.e. prepping data for D3).

The code (sans output) is in this gist, and IMO the rvest package is going to make working with web site data so much easier.

library(XML)
library(httr)
library(rvest)
library(magrittr)

# setup connection & grab HTML the "old" way w/httr

freak_get <- GET("http://torrentfreak.com/top-10-most-pirated-movies-of-the-week-130304/")

freak_html <- htmlParse(content(freak_get, as="text"))

# do the same the rvest way, using "html_session" since we may need connection info in some scripts

freak <- html_session("http://torrentfreak.com/top-10-most-pirated-movies-of-the-week-130304/")

# extracting the "old" way with xpathSApply

xpathSApply(freak_html, "//*/td[3]", xmlValue)[1:10]

##  [1] "Silver Linings Playbook "           "The Hobbit: An Unexpected Journey " "Life of Pi (DVDscr/DVDrip)"       

##  [4] "Argo (DVDscr)"                      "Identity Thief "                    "Red Dawn "                        

##  [7] "Rise Of The Guardians (DVDscr)"     "Django Unchained (DVDscr)"          "Lincoln (DVDscr)"                 

## [10] "Zero Dark Thirty "

xpathSApply(freak_html, "//*/td[1]", xmlValue)[2:11]

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

xpathSApply(freak_html, "//*/td[4]", xmlValue)

##  [1] "7.4 / trailer" "8.2 / trailer" "8.3 / trailer" "8.2 / trailer" "8.2 / trailer" "5.3 / trailer" "7.5 / trailer"

##  [8] "8.8 / trailer" "8.2 / trailer" "7.6 / trailer"

xpathSApply(freak_html, "//*/td[4]/a[contains(@href,'imdb')]", xmlAttrs, "href")

##                                    href                                    href                                    href

##  "http://www.imdb.com/title/tt1045658/"  "http://www.imdb.com/title/tt0903624/"  "http://www.imdb.com/title/tt0454876/"

##                                    href                                    href                                    href

##  "http://www.imdb.com/title/tt1024648/"  "http://www.imdb.com/title/tt2024432/"  "http://www.imdb.com/title/tt1234719/"

##                                    href                                    href                                    href

##  "http://www.imdb.com/title/tt1446192/"  "http://www.imdb.com/title/tt1853728/"  "http://www.imdb.com/title/tt0443272/"

##                                    href

## "http://www.imdb.com/title/tt1790885/?"


# extracting with rvest + XPath

freak %>% html_nodes(xpath="//*/td[3]") %>% html_text() %>% .[1:10]

##  [1] "Silver Linings Playbook "           "The Hobbit: An Unexpected Journey " "Life of Pi (DVDscr/DVDrip)"       

##  [4] "Argo (DVDscr)"                      "Identity Thief "                    "Red Dawn "                        

##  [7] "Rise Of The Guardians (DVDscr)"     "Django Unchained (DVDscr)"          "Lincoln (DVDscr)"                 

## [10] "Zero Dark Thirty "

freak %>% html_nodes(xpath="//*/td[1]") %>% html_text() %>% .[2:11]

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

freak %>% html_nodes(xpath="//*/td[4]") %>% html_text() %>% .[1:10]

##  [1] "7.4 / trailer" "8.2 / trailer" "8.3 / trailer" "8.2 / trailer" "8.2 / trailer" "5.3 / trailer" "7.5 / trailer"

##  [8] "8.8 / trailer" "8.2 / trailer" "7.6 / trailer"

freak %>% html_nodes(xpath="//*/td[4]/a[contains(@href,'imdb')]") %>% html_attr("href") %>% .[1:10]

##  [1] "http://www.imdb.com/title/tt1045658/"  "http://www.imdb.com/title/tt0903624/"

##  [3] "http://www.imdb.com/title/tt0454876/"  "http://www.imdb.com/title/tt1024648/"

##  [5] "http://www.imdb.com/title/tt2024432/"  "http://www.imdb.com/title/tt1234719/"

##  [7] "http://www.imdb.com/title/tt1446192/"  "http://www.imdb.com/title/tt1853728/"

##  [9] "http://www.imdb.com/title/tt0443272/"  "http://www.imdb.com/title/tt1790885/?"

# extracting with rvest + CSS selectors

freak %>% html_nodes("td:nth-child(3)") %>% html_text() %>% .[1:10]

##  [1] "Silver Linings Playbook "           "The Hobbit: An Unexpected Journey " "Life of Pi (DVDscr/DVDrip)"       

##  [4] "Argo (DVDscr)"                      "Identity Thief "                    "Red Dawn "                        

##  [7] "Rise Of The Guardians (DVDscr)"     "Django Unchained (DVDscr)"          "Lincoln (DVDscr)"                 

## [10] "Zero Dark Thirty "

freak %>% html_nodes("td:nth-child(1)") %>% html_text() %>% .[2:11]

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

freak %>% html_nodes("td:nth-child(4)") %>% html_text() %>% .[1:10]

##  [1] "7.4 / trailer" "8.2 / trailer" "8.3 / trailer" "8.2 / trailer" "8.2 / trailer" "5.3 / trailer" "7.5 / trailer"

##  [8] "8.8 / trailer" "8.2 / trailer" "7.6 / trailer"

freak %>% html_nodes("td:nth-child(4) a[href*='imdb']") %>% html_attr("href") %>% .[1:10]

##  [1] "http://www.imdb.com/title/tt1045658/"  "http://www.imdb.com/title/tt0903624/"

##  [3] "http://www.imdb.com/title/tt0454876/"  "http://www.imdb.com/title/tt1024648/"

##  [5] "http://www.imdb.com/title/tt2024432/"  "http://www.imdb.com/title/tt1234719/"

##  [7] "http://www.imdb.com/title/tt1446192/"  "http://www.imdb.com/title/tt1853728/"

##  [9] "http://www.imdb.com/title/tt0443272/"  "http://www.imdb.com/title/tt1790885/?"

# building a data frame (which is kinda obvious, but hey)

data.frame(movie=freak %>% html_nodes("td:nth-child(3)") %>% html_text() %>% .[1:10],

           rank=freak %>% html_nodes("td:nth-child(1)") %>% html_text() %>% .[2:11],

           rating=freak %>% html_nodes("td:nth-child(4)") %>% html_text() %>% .[1:10],

           imdb.url=freak %>% html_nodes("td:nth-child(4) a[href*='imdb']") %>% html_attr("href") %>% .[1:10],

           stringsAsFactors=FALSE)

##                                 movie rank        rating                              imdb.url

## 1            Silver Linings Playbook     1 7.4 / trailer  http://www.imdb.com/title/tt1045658/

## 2  The Hobbit: An Unexpected Journey     2 8.2 / trailer  http://www.imdb.com/title/tt0903624/

## 3          Life of Pi (DVDscr/DVDrip)    3 8.3 / trailer  http://www.imdb.com/title/tt0454876/

## 4                       Argo (DVDscr)    4 8.2 / trailer  http://www.imdb.com/title/tt1024648/

## 5                     Identity Thief     5 8.2 / trailer  http://www.imdb.com/title/tt2024432/

## 6                           Red Dawn     6 5.3 / trailer  http://www.imdb.com/title/tt1234719/

## 7      Rise Of The Guardians (DVDscr)    7 7.5 / trailer  http://www.imdb.com/title/tt1446192/

## 8           Django Unchained (DVDscr)    8 8.8 / trailer  http://www.imdb.com/title/tt1853728/

## 9                    Lincoln (DVDscr)    9 8.2 / trailer  http://www.imdb.com/title/tt0443272/

## 10                  Zero Dark Thirty    10 7.6 / trailer http://www.imdb.com/title/tt1790885/?

Source: http://www.r-bloggers.com/migrating-table-oriented-web-scraping-code-to-rvest-wxpath-css-selector-examples/

Monday, 8 June 2015

Web Scraping Services : Making Modern File Formats More Accessible

Data scraping is the process of automatically sorting through information contained on the internet inside html, PDF or other documents and collecting relevant information to into databases and spreadsheets for later retrieval. On most websites, the text is easily and accessibly written in the source code but an increasing number of businesses are using Adobe PDF format (Portable Document Format: A format which can be viewed by the free Adobe Acrobat software on almost any operating system. See below for a link.). The advantage of PDF format is that the document looks exactly the same no matter which computer you view it from making it ideal for business forms, specification sheets, etc.; the disadvantage is that the text is converted into an image from which you often cannot easily copy and paste. PDF Scraping is the process of data scraping information contained in PDF files. To PDF scrape a PDF document, you must employ a more diverse set of tools.

There are two main types of PDF files: those built from a text file and those built from an image (likely scanned in). Adobe's own software is capable of PDF scraping from text-based PDF files but special tools are needed for PDF scraping text from image-based PDF files. The primary tool for PDF scraping is the OCR program. OCR, or Optical Character Recognition, programs scan a document for small pictures that they can separate into letters. These pictures are then compared to actual letters and if matches are found, the letters are copied into a file. OCR programs can perform PDF scraping of image-based PDF files quite accurately but they are not perfect.

Once the OCR program or Adobe program has finished PDF scraping a document, you can search through the data to find the parts you are most interested in. This information can then be stored into your favorite database or spreadsheet program. Some PDF scraping programs can sort the data into databases and/or spreadsheets automatically making your job that much easier.

Quite often you will not find a PDF scraping program that will obtain exactly the data you want without customization. Surprisingly a search on Google only turned up one business, that will create a customized PDF scraping utility for your project. A handful of off the shelf utilities claim to be customizable, but seem to require a bit of programming knowledge and time commitment to use effectively. Obtaining the data yourself with one of these tools may be possible but will likely prove quite tedious and time consuming. It may be advisable to contract a company that specializes in PDF scraping to do it for you quickly and professionally.

Let's explore some real world examples of the uses of PDF scraping technology. A group at Cornell University wanted to improve a database of technical documents in PDF format by taking the old PDF file where the links and references were just images of text and changing the links and references into working clickable links thus making the database easy to navigate and cross-reference. They employed a PDF scraping utility to deconstruct the PDF files and figure out where the links were. They then could create a simple script to re-create the PDF files with working links replacing the old text image.

A computer hardware vendor wanted to display specifications data for his hardware on his website. He hired a company to perform PDF scraping of the hardware documentation on the manufacturers' website and save the PDF scraped data into a database he could use to update his webpage automatically.

PDF Scraping is just collecting information that is available on the public internet. PDF Scraping does not violate copyright laws.

PDF Scraping is a great new technology that can significantly reduce your workload if it involves retrieving information from PDF files. Applications exist that can help you with smaller, easier PDF Scraping projects but companies exist that will create custom applications for larger or more intricate PDF Scraping jobs.

Source: http://ezinearticles.com/?PDF-Scraping:-Making-Modern-File-Formats-More-Accessible&id=193321

Tuesday, 2 June 2015

WordPress Titles: scraping with search url

I’ve blogged for a few years now, and I’ve used several tools along the way. zachbeauvais.com began as a Drupal site, until I worked out that it’s a bit overkill, and switched to WordPress. Recently, I’ve been toying with the idea of using a static site generator (a lá Jekyll or Hyde), or even pulling together a kind of ebook of ramblings. I also want to be able to arrange the posts based on the keywords they contain, regardless of how they’re categorised or tagged.

Whatever I wanted to do, I ended up with a single point of messiness: individual blog posts, and how they’re formatted. When I started, I seem to remember using Drupal’s truly awful WYSIWYG editor, and tweaking the HTML soup it produced. Then, when I moved over to WordPress, it pulled all the posts and metadata through via RSS, and I tweaked with the visual and text tools which are baked into the engine.

A couple years ago, I started to write in Markdown, and completely apart from the blog (thanks to full-screen writing and loud music). This gives me a local .md file, and I copy/paste into WordPress using a plugin to get rid of the visual editor entirely.

So, I wrote a scraper to return a list of blog posts containing a specific term. What I hope is that this very simple scraper is useful to others—WordPress is pretty common, after all—and to get some ideas for improving it, and handle post content. If you haven’t used ScraperWiki before, you might not know that you can see the raw scraper by clicking “view source” from the scraper’s overview page (or going here if you’re lazy).

This scraper is based on WordPress’ built-in search, which can be used by passing the search terms to a url, then scraping the resulting page:

http://zachbeauvais.com/?s=search_term&submit=Search

The scraper uses three Python libraries:

    Requests
    ScraperWiki
    lxml.html

There are two variables which can be changed to search for other terms, or using a different WordPress site:

term = "coffee"

site = "http://www.zachbeauvais.com"

The rest of the script is really simple: it creates a dictionary called “payload” containing the letter “s”, the keyword, and the instruction to search. The “s” is in there to make up the search url: /?s=coffee …

Requests then GETs the site, passing payload as url parameters, and I use Request’s .text function to render the page in html, which I then pass through lxml to the new variable “root”.

payload = {'s': str(term), 'submit': 'Search'}

r = requests.get(site, params=payload)  # This'll be the results page

html = r.text

root = lxml.html.fromstring(html)  # parsing the HTML into the var root

Now, my WordPress theme renders the titles of the retrieved posts in <h1> tags with the CSS class “entry-title”, so I loop through the html text, pulling out the links and text from all the resulting h1.entry-title items. This part of the script would need tweaking, depending on the CSS class and h-tag your theme uses.

for i in root.cssselect("h1.entry-title a"):

    link = i.cssselect("a")

    text = i.text_content()

    data = {

        'uri': link[0].attrib['href'],

        'post-title': str(text),

        'search-term': str(term)

    }

    if i is not None:

        print link

        print text

        print data

        scraperwiki.sqlite.save(unique_keys=['uri'], data=data)

    else:

        print "No results."

These return into an sqlite database via the ScraperWiki library, and I have a resulting database with the title and link to every blog post containing the keyword.

So, this could, in theory, run on any WordPress instance which uses the same search pattern URL—just change the site variable to match.

Also, you can run this again and again, changing the term to any new keyword. These will be stored in the DB with the keyword in its own column to identify what you were looking for.

See? Pretty simple scraping.

So, what I’d like next is to have a local copy of every post in a single format.

Has anyone got any ideas how I could improve this? And, has anyone used WordPress’ JSON API? It might be a logical next step to call the API to get the posts directly from the MySQL DB… but that would be a new blog post!

Source: https://scraperwiki.wordpress.com/2013/03/11/wordpress-titles-scraping-with-search-url/

Thursday, 28 May 2015

Data Scraping Services - Web Scraping Video Tutorial Collection for All Programming Language

Web scraping is a mechanism in which request made to website URL to get  HTML Document text and that text then parsed to extract data from the HTML codes.  Website scraping for data is a generalize approach and can be implemented in any programming language like PHP, Java, C#, Python and many other.

There are many Web scraping software available in market using which you can extract data with no coding knowledge. In many case the scraping doesn’t help due to custom crawling flow for data scraping and in that case you have to make your own web scraping application in one of the programming language you know. In this post I have collected scraping video tutorials for all programming language.

I mostly familiar with web scraping using PHP, C# and some other scraping tools and providing web scraping service.  If you have any scraping requirement send me your requirements and I will get back with sample data scrape and best price.

Web Scraping Using PHP

You can do web scraping in PHP using CURL library and Simple HTML DOM parsing library.  PHP function file_get_content() can also be useful for making web request. One drawback of scraping using PHP is it can’t parse JavaScript so ajax based scraping can’t be possible using PHP.

Web Scraping Using C#

There are many library available in .Net for HTML parsing and data scraping. I have used Web Browser control and HTML Agility Pack for data extraction in .Net using C#

I have didn’t done web scraping in Java, PERL and Python. I had learned web scraping in node.js using Casper.JS and Phantom.JS library. But I thought below tutorial will be helpful for some one who are Java and Python based.

Web Scraping Using Jsoup in Java

Scraping Stock Data Using Python

Develop Web Crawler Using PERL

Web Scraping Using Node.Js

If you find any other good web scraping video tutorial then you can share the link in comment so other readesr get benefit form that.

Source: http://webdata-scraping.com/web-scraping-video-tutorial-collection-programming-language/

Tuesday, 26 May 2015

Web Scraping Services - Extracting Business Data You Need

Would you like to have someone collect, extract, find or scrap contact details, stats, list, extract data, or information from websites, online stores, directories, and more?

"Hi-Tech BPO Services offers 100% risk-free, quick, accurate and affordable web scraping, data scraping, screen scraping, data collection, data extraction, and website scraping services to worldwide organizations ranging from medium-sized business firms to Fortune 500 companies."

At Hi-Tech BPO Services we are helping global businesses build their own database, mailing list, generate leads, and get access to vast resources of unstructured data available on World Wide Web.

We scrape data from various sources such as websites, blogs, podcasts, and online directories; and convert them into structured formats such as excel, csv, access, text, My SQL using automated and manual scraping technologies. Through our web data scraping services, we crawl through websites and gather sales leads, competitor’s product details, new offers, pricing methodologies, and various other types of information from the web.

Our web scraping services scrape data such as name, email, phone number, address, country, state, city, product, and pricing details among others.

Areas of Expertise in Web Scraping:

•    Contact Details
•    Statistics data from websites
•    Classifieds
•    Real estate portals
•    Social networking sites
•    Government portals
•    Entertainment sites
•    Auction portals
•    Business directories
•    Job portals
•    Email ids and Profiles
•    URLs in an excel spreadsheet
•    Market place portals
•    Search engine and SEO
•    Accessories portals
•    News portals
•    Online shopping portals
•    Hotels and restaurant
•    Event portals
•    Lead generation

Industries we Serve:

Our web scraping services are suitable for industries including real estate, information technology, university, hospital, medicine, property, restaurant, hotels, banking, finance, insurance, media/entertainment, automobiles, marketing, human resources, manufacturing, healthcare, academics, travel, telecommunication and many more.

Why Hi-Tech BPO Services for Web Scraping?

•    Skilled and committed scraping experts
•    Accurate solutions
•    Highly cost-effective pricing strategies
•    Presence of satisfied clients worldwide
•    Using latest and effectual web scraping technologies
•    Ensures timely delivery
•    Round the clock customer support and technical assistance

Get Quick Cost and Time Estimate

Source: http://www.hitechbposervices.com/web-scraping.php

Monday, 25 May 2015

Which language is the most flexible for scraping websites?

3 down vote favorite

I'm new to programming. I know a little python and a little objective c, and I've been going through tutorials for each. Then it occurred to me, I need to know which language is more flexible (python, obj c, something else) for screen scraping a website for content.

What do I mean by "flexible"?

Well, ideally, I need something that will be easy to refactor and tweak for similar projects. I'm trying to avoid doing a lot of re-writing (well, re-coding) if I wanted to switch some of the variables in the program (i.e., the website to be scraped, the content to fetch, etc).

Anyways, if you could please give me your opinion, that would be great. Oh, and if you know any existing frameworks for the language you recommend, please share. (I know a little about Selenium and BeautifulSoup for python already).

4 Answers

I recently wrote a relatively complex web scraper to harvest a TON of data. It had to do some relatively complex parsing, I needed it to stuff it into a database, etc. I'm C# programmer now and formerly a Perl guy.

I wrote my original scraper using Python. I started on a Thursday and by Sunday morning I was harvesting over about a million scores from a show horse site. I used Python and SQLlite because they were fast.

HOWEVER, as I started putting together programs to regularly keep the data updated and to populate the SQL Server that would backend my MVC3 application, I kept hitting snags and gaps in my Python knowledge.

In the end, I completely rewrote the scraper/parser in C# using the HtmlAgilityPack and it works better than before (and just about as fast).

Because I KNEW THE LANGUAGE and the environment so much better I was able to add better database support, better logging, better error handling, etc. etc.

So... short answer.. Python was the fastest to market with a "good enough for now" solution, but the language I know best (C#) was the best long-term solution.

EDIT: I used BeautifulSoup for my original crawler written in Python.

5 down vote

The most flexible is the one that you're most familiar with.

Personally, I use Python for almost all of my utilities. For scraping, I find that its functionality specific to parsing and string manipulation requires little code, is fast and there are a ton of examples out there (strong community). Chances are that someone's already written whatever you're trying to do already, or there's at least something along the same lines that needs very little refactoring.

1 down vote

I think its safe to say that Python is a better place to start than Objective C. Honestly, just about any language meets the "flexible" requirement. All you need is well thought out configuration parameters. Also, a dynamic language like Python can go a long way in increasing flexibility, provided that you account for runtime type errors.

1 down vote

I recently wrote a very simple web-scraper; I chose Common Lisp as I'm learning the language.

On the basis of my experience - both of the language and the availability of help from experienced Lispers - I recommend investigating Common Lisp for your purpose.

There are excellent XML-parsing libraries available for CL, as well as libraries for parsing invalid HTML, which you'll need unless the sites you're parsing consist solely of valid XHTML.

Also, Common Lisp is a good language in which to implement DSLs; a DSL for web-scraping may be a solution to your requirement for flexibility & re-use.

Source: http://programmers.stackexchange.com/questions/74998/which-language-is-the-most-flexible-for-scraping-websites/75006#75006


Friday, 22 May 2015

Scraping Data: Site-specific Extractors vs. Generic Extractors

Scraping is becoming a rather mundane job with every other organization getting its feet wet with it for their own data gathering needs. There have been enough number of crawlers built – some open-sourced and others internal to organizations for in-house utilities. Although crawling might seem like a simple technique at the onset, doing this at a large-scale is the real deal. You need to have a distributed stack set up to take care of handling huge volumes of data, to provide data in a low-latency model and also to deal with fail-overs. This still is achievable after crossing the initial tech barrier and via continuous optimizations. (P.S. Not under-estimating this part because it still needs a team of Engineers monitoring the stats and scratching their heads at times).

Social Media Scraping

Focused crawls on a predefined list of sites

However, you bump into a completely new land if your goal is to generate clean and usable data sets from these crawls i.e. “extract” data in a format that your DB can process and aid in generating insights. There are 2 ways of tackling this:

a. site-specific extractors which give desired results

b. generic extractors that result in few surprises

Assuming you still do focused crawls on a predefined list of sites, let’s go over specific scenarios when you have to pick between the two-

1. Mass-scale crawls; high-level meta data – Use generic extractors when you have a large-scale crawling requirement on a continuous basis. Large-scale would mean having to crawl sites in the range of hundreds of thousands. Since the web is a jungle and no two sites share the same template, it would be impossible to write an extractor for each. However, you have to settle in with just the document-level information from such crawls like the URL, meta keywords, blog or news titles, author, date and article content which is still enough information to be happy with if your requirement is analyzing sentiment of the data.

cb1c0_one-size

A generic extractor case

Generic extractors don’t yield accurate results and often mess up the datasets deeming it unusable. Reason being

programatically distinguishing relevant data from irrelevant datasets is a challenge. For example, how would the extractor know to skip pages that have a list of blogs and only extract the ones with the complete article. Or delineating article content from the title on a blog page is not easy either.

To summarize, below is what to expect of a generic extractor.

Pros-

•    minimal manual intervention
•    low on effort and time
•    can work on any scale

Cons-

•    Data quality compromised
•    inaccurate and incomplete datasets
•    lesser details suited only for high-level analyses
•    Suited for gathering- blogs, forums, news
•    Uses- Sentiment Analysis, Brand Monitoring, Competitor Analysis, Social Media Monitoring.

2. Low/Mid scale crawls; detailed datasets – If precise extraction is the mandate, there’s no going away from site-specific extractors. But realistically this is do-able only if your scope of work is limited i.e. few hundred sites or less. Using site-specific extractors, you could extract as many number of fields from any nook or corner of the web pages. Most of the times, most pages on a website share similar templates. If not, they can still be accommodated for using site-specific extractors.

cutlery

Designing extractor for each website

Pros-

•    High data quality
•    Better data coverage on the site

Cons-

High on effort and time

Site structures keep changing from time to time and maintaining these requires a lot of monitoring and manual intervention

Only for limited scale

Suited for gathering – any data from any domain on any site be it product specifications and price details, reviews, blogs, forums, directories, ticket inventories, etc.

Uses- Data Analytics for E-commerce, Business Intelligence, Market Research, Sentiment Analysis

Conclusion

Quite obviously you need both such extractors handy to take care of various use cases. The only way generic extractors can work for detailed datasets is if everyone employs standard data formats on the web (Read our post on standard data formats here). However, given the internet penetration to the masses and the variety of things folks like to do on the web, this is being overly futuristic.

So while site-specific extractors are going to be around for quite some time, the challenge now is to tweak the generic ones to work better. At PromptCloud, we have added ML components to make them smarter and they have been working well for us so far.

What have your challenges been? Do drop in your comments.

Source: https://www.promptcloud.com/blog/scraping-data-site-specific-extractors-vs-generic-extractors/

Tuesday, 19 May 2015

The Features of the "Holographic Meridian Scraping Therapy"

1. Systematic nature: Brief introduction to the knowledge of viscera, meridians and points in traditional Chinese medicine, theory of holographic diagnosis and treatment; preliminary discussion of the treatment and health care mechanism of scraping therapy; systemat­ic introduction to the concrete methods of the holographic meridian scraping therapy; enumerating a host of therapeutic methods of scraping for disorders in both Chinese and Western medicine to em­body a combination of disease differentiation and syndrome differen­tiation; and summarizing the health care scraping methods. It is a practical handbook of gua sha.

2. Scientific: Applying the theories of Chinese and Western medicine to explain the health care and treatment mechanism and clinical applications of scraping therapy; introducing in detail the practical manipulations, items for attention, and indications and contraindications of the scraping therapy. Here are introduced repre­sentative diseases in different clinical departments, for which scrap­ing therapy has a better curative effect and the therapeutic methods of scraping for these diseases. Stress is placed on disease differentia­tion in Western medicine and syndrome differentiation in Chinese medicine, which should be combined in practical application.

Although there are more than 140,000 kinds of disease known to modem medicine, all diseases are related to dysfunction of the 14 meridians and internal organs, according to traditional Chinese med­icine. The object of scraping therapy is to correct the disharmony in the meridians and internal organs to recover the normal bodily func­tions. Thus, the scraping of a set of meridian points can be used to treat many diseases. In the section on clinical application only about 100 kinds of common diseases are discussed, although the actual number is much more than that. For easy reference the "Index of Diseases and Symptoms" is appended at the back of the book.

3. Practical: Using simple language and plenty of pictures and diagrams to guarantee that readers can easily leam, memorize and apply the principles of scraping therapy. As long as they master the methods explained in Chapter Three, readers without any medical knowledge can apply scraping therapy to themselves or others, with reference to the pictures in Chapters Four and Five. Besides scraping therapy, herbal treatment for each disease or syndrome is explained and may be used in combination with the scraping techniques.

Referring to the Holographic Meridian Hand Diagnosis and pic­tures at the back of the book will enhance accuracy of diagnosis and increase the effectiveness of scraping therapy.

Since the first publication and distribution of the Chinese edition of the book in July 1995, it has been welcomed by both medical specialists and lay people. In March 1996 this book was republished and adopted as a textbook by the School for Advanced Studies of Traditional Chinese Medicine affiliated to the Institute of the Acu­puncture and Moxibustion of the China Academy of Traditional Chi­nese Medicine.

In order to bring this health care method to more and more peo­ple and to make traditional Chinese medicine better appreciated They have modified and replenished this book in the spirit of constant im­provement. They hope that they may make a contribution to the health care of mankind with this natural therapy which has no side-effects and causes no pollution.

They hope that the Holographic Meridian Scraping Therapy can help the health and happiness of more and more families in the world.

Source: http://ezinearticles.com/?The-Features-of-the-Holographic-Meridian-Scraping-Therapy&id=5005031

Wednesday, 6 May 2015

Web Scraping Services Are Important Tools For Knowledge

Data extraction and web scraping techniques are important tools to find relevant data and information for personal or business use. Many companies, self-employed to copy and paste data from web pages. This process is very reliable, but very expensive as it is a waste of time and effort to get results. This is because the data collected and spent less resources and time required to collect these data are compared.

At present, several mining companies and their websites effective web scraping technique specifically for the thousands of pages of information developed culture can be traced. The information from a CSV file, database, XML file, or any other source with the required format is alameda. understanding of correlations and patterns in the data, so that policies can be designed to assist decision making. The information can also be stored for future reference.

The following are some common examples of data extraction process:

In order to rule through a government portal, citizens who are reliable for a given survey name removed.

Competitive pricing and data products include scraping websites

To access the web site or web design Stock download the videos and photos of scratching

Automatic Data Collection

It regularly collects data on a regular basis. Automated data collection techniques are very important because they find the company’s customer trends and market trends to help. By determining market trends, it is possible to understand customer behavior and predict the likelihood of the data will change.

The following are some examples of automated data collection:

Monitoring of special hourly rates for stocks

collects daily mortgage rates from various financial institutions

on a regular basis is necessary to check the weather

By using web scraping services, you can extract all data related to your business. Then analyzed the data to a spreadsheet or database can be downloaded and compared. Storing data in a database or in a required format and interpretation of the correlations to understand and makes it easier to identify hidden patterns.

Data extraction services, it is possible pricing, email, databases, profile data, and consistently to competitors for information about the data. Different techniques and processes designed to collect and analyze data, and has developed over time. Web Scraping for business processes that have beaten the market recently is one. It is a process from various sources such as websites and databases with large amounts of data provides.

Some of the most common methods used to scrape web crawling, text, fun, DOM analysis and include matching expression. After the process is only analyzers, HTML pages or meaning can be achieved through annotations. There are many different ways of scaling data, but more importantly is working toward the same goal. The main purpose of using web scraping service to retrieve and compile data in databases and web sites. In the business world is to remain relevant to the business process.

The central question about the relevance of web scraping contact. The process is relevant to the business world? The answer is yes. The fact that it is used by large companies in the world and many awards speaks derivatives.

Source: http://www.selfgrowth.com/articles/web-scraping-services-are-important-tools-for-knowledge

Tuesday, 28 April 2015

Scraping a website from a windows service

Question

Hi there.  I have a windows forms application that scrapes a website to retrieve some data.  I would like to implement the same functionality as a windows service.  The reason for this is to allow the program to run 24/7 without  having a user signed in. 

To that end, my current version of the program uses a web browser control (system.windows.forms.webbrowser) to navigate the pages, click the buttons, allow scripts to do their thing, etc.  I cannot figure out a way to do the same without the web browser control, but the web browser control cannot be instantiated in a windows service (because there is no user interface in a web service).

Does anyone have any brilliant ideas on how to get around this?

Thank you very much!

Answers

Hi Andy,

There is a tool which could let you manipulate anything you want on the website. This agile HTML parser builds a read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams). More information, please check:

http://htmlagilitypack.codeplex.com/

Have a nice day.

Best regards

All replies

You are not telling if you are using a .NET Express edition or not

You are not telling which Framework

You are not realy saying what data you are getting from the web site.

So

I made an example of service that work on any Studio edition (including the Express)

to install it, I supposed that you have at least the Framework2, so you will use something similar to:

    %SystemRoot%\Microsoft.NET\Framework\v2.0.50727\installutil /i C:\Test\MyWindowService\MyWindowService\bin\Release\MyWindowService.exe

In the example, I supposed that you are downloading some file from the site

You will need a reference to Windows.Form for the timer


Imports System.ServiceProcess

Imports System.Configuration.Install

Public Class WindowsService : Inherits ServiceBase

  Private Minute As Integer = 60000

  Private WithEvents Timer As New Timer With {.Interval = 30 * Minute, .Enabled = True}

  Public Sub New()

    Me.ServiceName = "MyService"

    Me.EventLog.Log = "Application"

    Me.CanHandlePowerEvent = True

    Me.CanHandleSessionChangeEvent = True

    Me.CanPauseAndContinue = True

    Me.CanShutdown = True

    Me.CanStop = True

  End Sub


  Private Sub Timer_Tick(ByVal sender As Object, ByVal e As System.EventArgs) Handles Timer.Tick

    If IO.File.Exists("C:\MyPath.Data") Then IO.File.Delete("C:\MyPath.Data")

    My.Computer.Network.DownloadFile("http://MyURL.com", "C:\MyPath.Data", "MyUserName", "MyPassword")

    'Do Something with the data downloaded

  End Sub

End Class

<Microsoft.VisualBasic.HideModuleName()> _

Module MainModule

  Public TheServiceName As String

  Public Sub main()

    Dim TheServiceApplication As New WindowsService

    TheServiceName = TheServiceApplication.ServiceName

    ServiceBase.Run(TheServiceApplication)

  End Sub

End Module

<System.ComponentModel.RunInstaller(True)> _

Public Class WindowsServiceInstaller : Inherits Installer

  Public Sub New()

    Dim serviceProcessInstaller As ServiceProcessInstaller = New ServiceProcessInstaller()

    Dim serviceInstaller As ServiceInstaller = New ServiceInstaller()

    serviceProcessInstaller.Account = ServiceAccount.LocalSystem

    serviceProcessInstaller.Username = Nothing

    serviceProcessInstaller.Password = Nothing

    serviceInstaller.DisplayName = "My Windows Service"

    serviceInstaller.StartType = ServiceStartMode.Automatic

    serviceInstaller.ServiceName = TheServiceName

    Me.Installers.Add(serviceProcessInstaller)

    Me.Installers.Add(serviceInstaller)

  End Sub

End Class



 Hello Andy,

Thanks for your post.

What do you want to scrape from the page? HttpWebRequest class ans WebClient class may be what you need. More information, please check:

The HttpWebRequest class provides support for the properties and methods defined in WebRequest and for additional properties and methods that enable the user to interact directly with servers using HTTP.

http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.aspx

The WebClient class provides common methods for sending data to or receiving data from any local, intranet, or Internet resource identified by a URI

http://msdn.microsoft.com/en-us/library/system.net.webclient.aspx

If you have any concenrs, please feel free to follow up.

Best regards



Hi Andy,

What about this problem on your side now? If you have any concerns, please feel free to follow up.

Have a nice day.

Best regards



Hi Andy,

When you come back, if you need further assistance about this issue, please feel free to let us know. We will continue to work with this issue.

Have a nice day.

Best regards



Thank you for the reply. Sorry it has taken me so long to respond.  I did not receive any notification that someone had replied!

I am using Visual Studio 2010 Ultimate Edition and the .NET framework 4.0.  Actually, I am upgrading some old code written in VB 6.0, but I can use the latest and greatest thats available.

The application uses a browser control to go to the page, fill in values, click on UI elements, read the HTML that returns, etc.  The purpose of the application is to collection useful information regularily/automatically.

I know how to create a web service, but using the web control in such a service is problematic because the web browser control was meant to be placed on a windows form.  I am not able to create a new instance of it in a project designated as a windows service.



Andy

Thank you for the reply. Sorry it has taken me so long to respond.  I did not receive any notification that someone had replied!

I thought a web request was for web services (retrieving information from them).  I am trying to retreive useful information from a website designed for interaction by a human, such as selecting items from lists and clicking buttons.   I currently use a web browser control to programmatically do what a person would do and get the pages back which in turn get parsed.

Andy



Hi Andy,

There is a tool which could let you manipulate anything you want on the website. This agile HTML parser builds a read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams). More information, please check:

http://htmlagilitypack.codeplex.com/

Have a nice day.

Best regards



Thanks for the suggestion.  I will go to that link and see if it will work.  I will update this post with what I find.

I am writing to check the status of the issue on your side. Would you mind letting us know the result of the suggestions? If you have any concerns, please feel free to follow up.

Have a nice day.



Best regards

Hi Liliane

Thank for the follow up reply.  I don't have an answer as of yet.  Implementing this is going to take time and I haven't been given the go-ahead by my boss to spend the time to pursue it.



Hi Andy,

Never minde. You could have a try when you feel free. If you have any further questions about this issue, please feel free to let us know. We will continue to work with you on this issue.

Have a nice day.

Best regards

Source: https://social.msdn.microsoft.com/Forums/vstudio/en-US/f5d565b1-236b-43c2-90c7-f5cc3b2c341b/scraping-a-website-from-a-windows-service

Saturday, 25 April 2015

Scraping the Bottom of the Barrel - The Perils of Online Article Marketing

Many online article marketers so desperately wish to succeed, they want to dump corporate life and work for themselves out of their home. They decide they are going to create an online money making website. Therefore, they look around to see what everyone else is doing, and watch the methods others use to attract online buyers, and then they mimic their marketing, their strategies, and their business models.

Still, if you are copying what other people (less ethical people) are doing in online article marketing, those which are scraping the bottom of the barrel and using false advertising and misrepresentations, then all you are really doing is perpetuating distrust on the Internet. Therefore, you are hurting everyone, including people like me. You must realize that people like me don't appreciate that.

Let me give you a few examples of some of the things going on out there, thing that are being done by people who are ethically challenged. Far too many people write articles and then on their byline they send the Internet surfer or reader of the article to a website that has a squeeze page. The squeeze page has no real information on it, rather it asks for their name and e-mail address.

If the would-be Internet surfer is unwise enough to type in their name and email address they will be spammed by e-mail, receiving various hard-sell marketing pieces. Then, if the Internet Surfer does decide to put in their e-mail address, the website grants them access and then takes them to the page with information about what they are selling, or their online marketing "make you a millionaire" scheme.

Generally, these are five page sales letters, with tons of testimonials of people you've never heard of, and may not actually exist, and all sorts of unsubstantiated earnings claims of how much money you will make if you give them $39.35 by way of PayPal, for this limited offer "Now!" And they will send you an E-book with a strategic plan of how you can duplicate what they are doing. The reality is whatever they are doing is questionable to begin with.

If you are going to do online article marketing please don't scrape the bottom of the barrel, there's just too much competition down there from what I can see. Please consider all this.

Source: http://ezinearticles.com/?Scraping-the-Bottom-of-the-Barrel---The-Perils-of-Online-Article-Marketing&id=2710103

Wednesday, 22 April 2015

Hand Scraped Flooring For a Natural and Unique Look

An option in hardwood flooring that is being increasingly adopted by those looking for something new, innovative and unique for their homes is hand scraped flooring. This type of wood flooring helps one achieve a distinct natural look on one's floor and also has a couple of advantages.

There are three types of scraping that you can get done on your wooden flooring: light, medium and hard. Preferably, if you have a light colored woodwork, then you should go for light scraping and if your floor has a darker shade, then you should opt for hard scraping. But, irrespective of the type of scraping you go for, you must ensure that the laborers doing the scraping are very skilled and impeccable in their job as hand scraping floors is an art that demands patience, time, talent and hard work.

Nowadays, many people tend to go for machine scraping, attracted by the lower investment involved in it. But such people are unable to achieve the requisite natural effect on their floors as machines create patterns on the floors that are easily detectable. These patterns do not emerge with hand scraping and the consequent look is as random and unique as it gets.

Though such scraped flooring is a costly option in flooring, it demands little maintenance. While with perfectly smooth surfaces, you will be always on the edge ensuring that there are no scratches, with hand scraped floors, you will not have to be concerned about this as any new scratches will only add to the already distressed appearance of the flooring.

Prefinished hand scraped wood flooring is also available in the market nowadays. These eliminate the need of any on-site scraping. But this option is of course unsuitable for those who have already got their floors installed. As it is, if you get on-site scraping done, you will have more control over things as you would be able to see the scraping as it develops and would be therefore in a position to exercise your preferences more.

Source: http://ezinearticles.com/?Hand-Scraped-Flooring-For-a-Natural-and-Unique-Look&id=4581623

Thursday, 9 April 2015

How to Generate Sales Leads Using Web Scraping Services

The first stage of any selling process is what is popularly known as “lead generation”. This phase is what most businesses place at the apex of their sales concerns. It is a driving force that governs decision-making at its highest levels, and influences business strategy and planning. If you are about to embark on an outbound sales campaign and are in the process of looking for leads, you would acknowledge the fact that lead generation process is of extreme importance for any business.

Different lead generation techniques have been used over and over again by companies around the world to satiate this growing business need. Newer, more innovative methods have also emerged to help marketers in this process. One such method of lead generation that is fast catching on, and is poised to play a big role for businesses in the coming years, is web scraping. With web scraping, you can easily get access to multiple relevant and highly customized leads – a perfect starting point for any marketing, promotional or sales campaign.

The prominence of Web Scraping in overall marketing strategy

At present, levels of competition have risen sky high for most businesses. For success, lead generation and gaining insight about customer behavior and preferences is an essential business requirement. Web scraping is the process of scraping or mining the internet for information. Different tools and techniques can be used to harvest information from multiple internet sources based on relevance, and the structured and organized in a way that makes sense to your business. Companies that provide web scraping services essentially use web scrapers to generate a targeted lead database that your company can then integrate into its marketing and sales strategies and plans.

The actual process of web scraping involves creating scraping scripts or algorithms which crawl the web for information based on certain preset parameters and options. The scraping process can be customized and tuned towards finding the kind of data that your business needs. The script can extract data from websites automatically, collate and put together a meaningful collection of leads for business development.

Lead Generation Basics

At a very high level, any person who has the resources and the intent to purchase your product or service qualifies as a lead. In the present scenario, you need to go far deeper than that. Marketers need to observe behavior patterns and purchasing trends to ensure that a particular person qualifies as a lead. If you have a group of people you are targeting, you need to decide who the viable leads will be, acquire their contact information and store it in a database for further action.

List buying used to be a popular way to get leads, but their efficacy has dwindled over time. Web scraping is the fast coming up as a feasible lead generation technique, allowing you to find highly focused and targeted leads in short amounts of time. All you need is a service provider that would carry out the data mining necessary for lead generation, and you end up with a list of actionable leads that you can try selling to.

How Web Scraping makes a substantial difference

With web scraping, you can extract valuable predictive information from websites. Web scraping facilitates high quality data collection and allows you to structure marketing and sales campaigns better. To drive sales and maximize revenue, you need strong, viable leads. To facilitate this, you need critical data which encompasses customer behavior, contact details, buying patterns and trends, willingness and ability to spend resources, and a myriad of other aspects critical to ascertain the potential of an entity as a rewarding lead. Data mining through web scraping can be a great way to get to these factors and identifying the leads that would make a difference for your business.

web-scraping-service

Crawling through many different web locales using different techniques, web scraping services pick up a wealth of information. This highly relevant and specialized information instantly provides your business with actionable leads. Furthermore, this exercise allows you to fine-tune your data management processes, make more accurate and reliable predictions and projections, arrive at more effective, strategic and marketing decisions and customize your workflow and business development to better suit the current market.

The Process and the Tools

Lead generation, being one of the most important processes for any business, can prove to be an expensive proposition if not handled strategically. Companies spend large amounts of their resources acquiring viable leads they can sell to. With web scraping, you can dramatically cut down the costs involved in lead generation and take your business forward with speed and efficiency. Here are some of the time-tested web scraping tools which can come in handy for lead generation –

•    Website download software – Used to copy entire websites to local storage. All website pages are downloaded and the hierarchy of navigation and internal links preserve. The stored pages can then be viewed and scoured for information at any later time.     Web scraper – Tools that crawl through bulk information on the internet, extracting specific, relevant data using a set of pre-defined parameters.

•    Data grabber – Sifts through websites and databases fast and extracts all the information, which can be sorted and classified later.

•    Text extractor – Can be used to scrape multiple websites or locations for acquiring text content from websites and web documents. It can mine data from a variety of text file formats and platforms.

With these tools, web scraping services scrape websites for lead generation and provide your business with a set of strong, actionable leads that can make a difference.

Covering all Bases

The strength of web scraping and web crawling lies in the fact that it covers all the necessary bases when it comes to lead generation. Data is harvested, structured, categorized and organized in such a way that businesses can easily use the data provided for their sales leads. As discussed earlier, cold and detached lists no longer provide you with enough actionable leads. You need to look at various factors and consider them during your lead generation efforts –

•    Contact details of the prospect

•    Purchasing power and purchasing history of the prospect

•    Past purchasing trends, willingness to purchase and history of buying preferences of the prospect

•    Social markers that are indicative of behavioral patterns

•    Commercial and business markers that are indicative of behavioral patterns

•    Transactional details

•    Other factors including age, gender, demography, social circles, language and interests

All these factors need to be taken into account and considered in detail if you have to ensure whether a lead is viable and actionable, or not. With web scraping you can get enough data about every single prospect, connect all the data collected with the help of onboarding, and ascertain with conviction whether a particular prospect will be viable for your business.

Let us take a look at how web scraping addresses these different factors –

1. Scraping website’s


During the scraping process, all websites where a particular prospect has some participation are crawled for data. Seemingly disjointed data can be made into a sensible unit by the use of onboarding- linking user activities with their online entities with the help of user IDs. Documents can be scanned for participation. E-commerce portals can be scanned to find comments and ratings a prospect might have delivered to certain products. Service providers’ websites can be scraped to find if the prospect has given a testimonial to any particular service. All these details can then be accumulated into a meaningful data collection that is indicative of the purchasing power and intent of the prospect, along with important data about buying preferences and tastes.

2. Social scraping

According to a study, most internet users spend upwards of two hours every day on social networks. Therefore, scraping social networks is a great way to explore prospects in detail. Initially, you can get important identification markers like names, addresses, contact numbers and email addresses. Further, social networks can also supply information about age, gender, demography and language choices. From this basic starting point, further details can be added by scraping social activity over long periods of time and looking for activities which indicate purchasing preferences, trends and interests. This exercise provides highly relevant and targeted information about prospects can be constructively used while designing sales campaigns.

Check out How to use Twitter data for your business

3. Transaction scraping

Through the scraping of transactions, you get a clear idea about the purchasing power of prospects. If you are looking for certain income groups or leads that invest in certain market sectors or during certain specific periods of time, transaction scraping is the best way to harvest meaningful information. This also helps you with competition analysis and provides you with pointers to fine-tune your marketing and sales strategies.

get-results-from-your-lead-generation-campaign

Using these varied lead generation techniques and finding the right balance and combination is key to securing the right leads for your business. Overall, signing up for web scraping services can be a make or break factor for your business going forward. With a steady supply of valuable leads, you can supercharge your sales, maximize returns and craft the perfect marketing maneuvers to take your business to an altogether new dimension.

Source: https://www.promptcloud.com/blog/how-to-generate-sales-leads-using-web-scraping-services/

Tuesday, 7 April 2015

The Nasty Problem with Scraping Results from the Engines

One theme that I've been concerned with this week centers around data transparency in the search engine world. Search engines provide information that is critical to the business of optimizing and growing a business on the web, yet barriers to this data currently force many companies to use methods of data extraction that violate the search engines' terms of service.

Specifically, we're talking about two pieces of information that no large-scale, successful web operation should be without. These include rankings (the position of their site(s) vs. their competitors) for important keywords and link data (currently provided most accurately through Yahoo!, but also available through MSN and in lower quality formats from Google).

Why do marketers and businesses need this data so badly? First we'll look at rankings:

•    For large sites in particular, rankings across the board will go up or down based on their actions and the actions of their competition. Any serious company who fails to monitor tweaks to their site, public relations, press and optimization tactics in this way will lose out to competitors who do track this data and, thus, can make intelligent business decisions based on it.

•    Rankings provide a benchmark that helps companies estimate their global reach in the search results and make predictions about whether certain areas of extension or growth make logical sense. If a company must decide on how to expand their content or what new keywords to target or even if they can compete in new markets, the business intelligence that can be extracted from large swaths of ranking data is critical.

•    Rankings can be mapped directly to traffic, allowing companies to consider advertising, extending their reach or forming partnerships

And, on the link data side:

•    Temporal link information allows marketers to see what effects certain link building, public relations and press efforts have on a site's link profile. Although some of this data is available through referring links in analytics programs, many folks are much more interested in the links that search engines know about and count, which often includes many more than those that pass traffic (and also ignores/doesn't count some that do pass traffic).

•    Link data may provide references for reputation management or tracking of viral campaigns - again, items that analytics don't entirely encompass.

•    Competitive link data may be of critical importance to many marketers - this information can't be tracked any other way.

I admit it. SEOmoz is a search engine scraper - we do it for our free public tools, for our internal research and we've even considered doing it for clients (though I'm seriously concerned about charging for data that's obtained outside TOS). Many hundreds of large firms in the search space (including a few that are 10-20X our size) do it, too. Why? Because search engine APIs aren't accurate.

Let's look at each engine's abilities and data sources individually. Since we've got a few hundred thousand points of data (if not more) on each, we're in a good position to make calls about how these systems are working.

Google (all APIs listed here):

•    Search SOAP API - provides ranking results that are massively different from almost every datacenter. The information is often less than useless, it's actually harmful, since you'll get a false sense of what's happening with your positions.

•    AJAX Search API - This is really designed to be integrated with your website, and the results can be of good quality for that purpose, but it really doesn't serve the job of providing good stats reporting.

•    AdSense & AdWords APIs - In all honesty, we haven't played around with these, but the fact that neither will report the correct order of the ads, nor will they show more than 8 ads at a time tells me that if a marketer needed this type of data, the APIs wouldn't work.

Yahoo! (APIs listed here):

•    Search API - Provides ranking information that is a somewhat accurate map to Yahoo!'s actual rankings, but is occassionally so far off-base that they're not reliable. Our data points show a lot more congruity with Yahoo!'s than Google's, but not nearly enough when compared with scraped results to be valuable to marketers and businesses.

•    Site Explorer API - Shows excellent information as far as number of pages indexed on a site and the link data that Yahoo! knows about. We've been comparing this information with that from scraped Yahoo! search results (for queries like linkdomain: and site:) and those at the Site Explorer page and find that there's very little quality difference in the results returned, though the best estimate numbers can still be found through a last page search of results.

•    Search Marketing API - I haven't played with this one at all, so I'd love to hear comments from those who have.

MSN:

•    Doesn't mind scraping as long as you use the RSS results. We do, we love them and we commend MSN for giving them out - bravo! They've also got a web search SDK program, but we've yet to give it a whirl. The only problem is the MSN estimates, which are so far off as to be useless. The links themselves, though, are useful.

Ask.com

•    Though it's somewhat hidden, the XML.Teoma.com page allows for scraping of results and Ask doesn't seem to mind, though they haven't explicitly said anything. Again, bravo! - the results look solid, accurate and match up against the Ask.com queries. Now, if Ask would only provide links

I know a lot of you are probably asking:

•    "Rand, if scraping is working, why do you care about the search engines fixing the APIs?"

•    The straight answer is that scraping hurts the search engines, hurts their users and isn't the most practical way to get the data. Let me give you some examples:

•    Scraped queries have to look as much like real users as possible to avoid detection and banning - thus, they affect the query data that search engineers use to improve web search.

•    These queries also hit advertisers - falsifying the number of "real" impressions that advertisers see and lowering their CTRs unnaturally.

•    They take up search engine resources and though even the heaviest scraping barely impacts their server loads, it's still an annoyance.

•    With all these negative elements, and so many positive incentives to have the data, it's clear what's needed - a way for marketers/businesses to get the data they need without hurting the search engines. Here's how they can do it:

•    Provide the search ranking position of a site in the referral string - this works for ranking data, but not for link data and since Yahoo! (and Google) both send referrals through re-directs at times, it wouldn't be a hard piece to add.

•    Make the API's accurate, complete and unlimited

•    If the last option is too ambitious, the search engines could charge for API queries - anyone who needs the data would be more than happy to pay for it. This might help with quality control, too.

•    For link data - serve up accurate, wholistic data in programs like Google Sitemaps and Yahoo! Search Submit (or even, Google Analytics). Obviously, you'd only get information about your own site after verifying.

I've talked to lots of people at the search engine level about making changes this week (including Jeremy, Priyank, Matt, Adam, Aaron, Brett and more). I can only hope for the best...

Source: http://moz.com/blog/the-nasty-problem-with-scraping-results-from-the-engines

Friday, 27 March 2015

The Great Advantages of Data Extraction Software – Why a Company Needs it?

Data extraction is being a huge problem for large corporate companies and businesses, which needs to be handled technically and safely. There are many different approaches used for data extraction from web and various tools have designed to solve certain problems.

Moreover, algorithms and advanced techniques were also developed for data extraction. In this array, the Data Extraction Software is widely used to extract information from web as designed.

Data Extraction Software:


 This is a program specifically designed to collect and organize the information or data from the website or webpage and reformat them.

Uses of Data Extraction Software:

Data extraction software can be used at various levels including social web and enterprise levels.

Enterprise Level: Data extraction techniques at the enterprise level are used as the prime tool to perform analysis of the data in business process re-engineering, business system and in competitive intelligence system.

Social Web Level: This type of web data extraction techniques is widely used for gathering structured data in large amount that are continuously generated by Web.2.0, online social network users and social media.

To specify other uses of Data Extraction software:

  •     It helps in assembling stats for the business plans
  •     It helps to gather data from public or government agencies
  •     It helps to collect data for legal needs

Does the Data Extraction Software make Your Job Simple?

The usage of data extraction software has been widely appreciated by many large corporate companies. In this array, here are a few points to favor the usage of the software;

  •     Data toolbar consists of web scraping tool to automate the process of web data extraction
  •     Point data fields from which the data need to be collected and the tool will do the rest
  •     There are no technical skills required to use data tool
  •     It is possible to extract a huge number of data records in just a few seconds
  • Benefits of Data Extraction Software:
  • This data extraction software benefits many computer users. Here follows a few remarkable benefits of the software;
  •     It can extract detailed data like description, name, price, image and more as defined from a website
  •     It is possible to create projects in the extractor and extract required information automatically from the site without the user’s interference
  •     The process saves huge effort and time
  •     It makes extracting data from several websites easy like online auctions, online stores, real estate portal, business directories, shopping portals and more
  •     It makes it possible to export extracted data to various formats like Microsoft Excel, HTML, SQL, XML, Microsoft Access, MySQL and more
  •     This will allow processing and analyzing data in any custom format

Who majorly Benefits from Data Extraction Software?

Any computer user benefit from this data extraction software, however, it is majorly benefiting users like;

  •     Business men to collect market figures, real estate data and product pricing data
  •     Book lovers to extract information about titles, authors, images, descriptions prices and more
  •     Collectors and hobbyists to extract auction and betting information
  •     Journalists to extract article and news from new websites
  •     Travelers to extract information about holiday places, vacations, prices, images and more
  •     Job seekers to extract information about jobs available, employers and more

Websitedatascraping.com is enough capable to web data scraping, website data scraping, web scraping services, website scraping services, data scraping services, product information scraping and yellowpages data scraping.


Tuesday, 17 March 2015

Predictive Analytics and Web Scraping

The integration of web scraping and predictive analytics can be used to make the marketing process an efficient. This is possible by use of a number of techniques such as business intelligence. The main aim of any business is to make profit, in this article we are looking at the web scraping process and predictive analytics in marketing your products. Integrating the two processes is quite beneficial for business. Web scraping plays the role of harvesting data and predictive analytics in determining the best methods to be used in marketing campaigns.

Business intelligence may be regarded as a decision support system where data is harvested for the purposes of predictive analysis. It can also be used for supporting business decisions. Over the years business intelligence data has been gathered manually. The emergence of the internet has madPredictive Analytics and Web Scrapinge it possible a lot of data for the purposes of business intelligence. The collection of information from various sources or departments of a company such as finance, sales and purchasing consumed a lot of time before correlating such information into any meaningful application.

web scraping plays an important role in collecting data to be used in business intelligence. This is so because normal web scraping process involves data harvesting, selection and even pre-processing.Web scraping makes the business intelligence a reality and a dynamic process. This is so because the business intelligence data needed can be accessed from the internet by the use of web scraping process. There is absolutely no reason why managers ought to wait for a number of months to get data for decision making when they can use specialized companies in the data mining sector such as Loginworks softwares. This is so because these companies have taken a number of years in providing these services and have professional staff on the same.

There is a great need for businesses to engage in predictive analytics. Predictive analytics can be defined as method of using business intelligence. This is because it is used in modeling and forecasting. It is a method of predicting patterns and has wide applications in credit, medical and insurance industries. The most common application of integration between web scraping and predictive analytics is credit assessment. The use past events in estimating the future of a business and markets is an integral part for any business.

Web scraping aids the predictive analysis process by provision of data from the past which can be analyzed and prediction of the customer behaviors such customers who are likely to purchase, renew or even purchase similar products. Predictive analysis and web scraping are very important for any business marketing campaigns. Since marketing is an investment by a company it is therefore necessary for businesses to employ web scraping to get the appropriate data for making business decisions. Predictive analysis narrows your target market and enables you to tailor your campaigns to specific customers. This enables the market teams to come up with a number of advertisements which may be based on your traffic.

Since web scraping is an integral part of predictive analysis, it is therefore important for a company to invest in the process. There is a need for companies to contact customers who are likely to respond positively. Marketing methods will only become efficient if a company is able to target goods and services that are required by customers at the required time. Predictive analytics plays an important role in reducing the amount of investment done to make a sale.

Business intelligence plays an important role in helping marketing teams prepare and anticipate customer needs, rather than reacting to them. Web scraping can present data based on the demographics that may have been overlooked in the past. Any combination of customer demographics is useful in the determination of which platform to use in marketing and what method of marketing can be used and when applicable.

The combination of web scraping and predictive analytics can be useful to managers to bring more sales at the same time spending less. Maximizing profits and minimizing loses is one of the goals of a business. Therefore for a business whether online or offline it is important for companies to engage in web scraping and predictive anal.

Source: http://www.loginworks.com/blogs/web-scraping-blogs/predictive-analytics-web-scraping/

Friday, 13 March 2015

Data Discovery vs. Data Extraction

Looking at screen-scraping at a simplified level, there are two primary stages involved: data discovery and data extraction. Data discovery deals with navigating a web site to arrive at the pages containing the data you want, and data extraction deals with actually pulling that data off of those pages. Generally when people think of screen-scraping they focus on the data extraction portion of the process, but my experience has been that data discovery is often the more difficult of the two.

The data discovery step in screen-scraping might be as simple as requesting a single URL. For example, you might just need to go to the home page of a site and extract out the latest news headlines. On the other side of the spectrum, data discovery may involve logging in to a web site, traversing a series of pages in order to get needed cookies, submitting a POST request on a search form, traversing through search results pages, and finally following all of the "details" links within the search results pages to get to the data you're actually after. In cases of the former a simple Perl script would often work just fine. For anything much more complex than that, though, a commercial screen-scraping tool can be an incredible time-saver. Especially for sites that require logging in, writing code to handle screen-scraping can be a nightmare when it comes to dealing with cookies and such.

In the data extraction phase you've already arrived at the page containing the data you're interested in, and you now need to pull it out of the HTML. Traditionally this has typically involved creating a series of regular expressions that match the pieces of the page you want (e.g., URL's and link titles). Regular expressions can be a bit complex to deal with, so most screen-scraping applications will hide these details from you, even though they may use regular expressions behind the scenes.

As an addendum, I should probably mention a third phase that is often ignored, and that is, what do you do with the data once you've extracted it? Common examples include writing the data to a CSV or XML file, or saving it to a database. In the case of a live web site you might even scrape the information and display it in the user's web browser in real-time. When shopping around for a screen-scraping tool you should make sure that it gives you the flexibility you need to work with the data once it's been extracted.

Source:http://ezinearticles.com/?Data-Discovery-vs.-Data-Extraction&id=165396