Wednesday, 31 December 2014

Have You Ever Heard To Web Scraping Expert Use Business Information?

Have you ever heard of "data scraping?" Scaling of the use of information and data scraping technology made his fortune many a successful trader is not new technology. Sometimes website owners automated harvesting of your data can not be happy with sitting

Fortunately there is a modern solution to this problem. Proxy data scraping technology solves the problem by using proxy IP addresses. Scraping data each time you run the program, organized the evacuation of a website, the website thinks that it comes from a different IP address. For website owners, worldwide only a short period of increased traffic from the proxy data scraping sounds.

Now you might be asking yourself: "Can the technology proxy data scraping project?" Certainly better than the choice is dangerous and unreliable (but) free public proxy servers.

There are literally thousands of the world that is quite easy to free proxy servers are all on. But the trick is finding them. Many sites list hundreds of servers, but open to find, and the protocol perseverance, trial and error, works for one of the first lessons you something about server to server, or do not know what activities are going for. A public proxy requests or sensitive data transmitted through a bad idea.

A less risky scenario for proxy data for scraping a rotating proxy connection goes through many private IP addresses to hire.

Scrape data from the software-only website is the proven process of extracting data from the Web. Offer the best of the web software to extract data. We have the expertise and knowledge in web data extraction, image, display, email extract, eliminate services, data mining and web intervene to eliminate.

For example, many companies based on their own needs, in particular, helped to find the data.

Data collection

Generally, data, information, automated computer programs for processing by the appropriate structures transmission. Such formats and protocols are usually strictly structured, well-documented, easily decompose, and confusion to a minimum. Very often, these transmissions are not human readable.

Tractor unit that automatically Extractor is an email from a reliable source that the e-mail ID helps to remove. This is fundamentally different than web pages, HTML files, text files or other format, business services contacts duplicate email addresses without.

A web spider is a computer program that a methodical, automated or surf the World Wide Web in a systematic way. Especially the many sites in the search engines, up-to-date information, as a means to quickly use.

Proxy data scraping technology solves the problem by using proxy IP addresses. Every time your data scraping program is a production of a website, the website that comes from a different IP address. The owner of this website, proxy data from around the world in an increase in traffic looks exactly like scraping the short term.

Now you might be asking yourself, "my project where I can get the data scraping proxy technology?" "Do it yourself" solution, but unfortunately, there is no need to call. Consider hosting the proxy server you choose to rent, but this option is quite pricey, but definitely better than the alternative is incredibly dangerous (but) free public proxy server.

Source:http://www.articlesbase.com/outsourcing-articles/have-you-ever-heard-to-web-scraping-expert-use-business-information-6250856.html

Saturday, 27 December 2014

What Kind of Legal Problems Can Web Scraping Cause

Web scraping software is readily available and has been used by many for legitimate purposes. It has also been used for illegal purposes. A website that engages in this practice should know the legal dangers of the activity.

Related Articles

Black Hat SEO Popular Techniques

General Knowledge- VII

The idea of web scraping is not new. Search engines have used this type of software to determine which results appear when someone conducts a search. They use special software software to extract data from a website and this data is then used to calculate the rankings of the website. Websites work very hard to improve their ranking and their chance of being found by anyone making a search. This use of this practice is understood and is considered to be a legitimate use for the software. However, there are services that provide web scraping and screen scraping prevention services and help the webmaster to remain safe from the attack of bad bots.

The problem with duplicacy is that it is often used for less than legitimate reasons. Since the software responsible can collect all sorts of data from websites and store the information that is collected, it represents a danger to anyone who might be affected by it. The information that can be collected can be used for many practices that are not so legitimate and may even be illegal. Anyone who is involved in this practice of content duplicacy should be aware of the legal issues implicated with this practice. It may be wise for anyone who has a website to find ways to prevent a site from being scraped or to use professional services to block site scraping.

Legal problems

The first thing to worry about, if you have a website or are using web scraping software, is when you might run into legal problems. Some of the issues that web scraping can cause include:

•    Access. If the software is used to access sites it does not have the right to access and takes information that it is not entitled to, the owner of the web scarping software may find themselves in legal trouble.

•    Re-use. The software can collect and reuse information. If that information is copyrighted, that might be a legal problem. Any information that is reused without permission may create legal issues for anyone who uses it.

•    Robots. Some states have enacted laws that are designed to keep people from using scraping robots. These automatically search out information on websites and using them may be illegal in some states. It is up to the user of the web scraping software to comply with any laws in the state in which they are operating.

Who is Responsible

The laws and regulations surrounding this practice are not always clear. There are many grey areas that allow this practice to occur. The question is, who is responsible for determining whether the use of web scraping software is legal?

Websites collect the information, but they may not be the entity using the web scraping software. If they are using this type of software, it is not always enough to inform the website's visitors that this practice is occurring. Putting this information into the user agreement may or may not protect the website from legal problems.

It is also partly the responsibility of a site owner to prevent a site from being scraped. There is software that can be used that will do this for a website and will keep any information that is collected safe and secure. A website may or may not be held legally responsible for any web scraper that is able to collect information they have. It will depend on why the data was collected, how it was used, who collected it, and whether precautions were taken.

What to expect

The issue of content copying and the legal issues surrounding it will continue to evolve. As more courts take on this issue, the lines between legal and illegal web scraping will become clearer. Many of the cases that have been brought to court have occurred in civil court, although there are some that have been taken up in a criminal court. There will be times when such practice may actually be a felony.

Before you use spying software, you need to realize that the laws surrounding its use are not clear. If you operate a website, you need to know the legal issues that you may face if scraping software is used on your website. The best step is to use the software available to protect your website and stop web scraping and be honest on your site if web scraping is used.

Source: http://www.articlesbase.com/technology-articles/what-kind-of-legal-problems-can-web-scraping-cause-6780486.html

Wednesday, 24 December 2014

Choose Mining Wear Parts Wisely

It is important to choose a reputable supplier of mining wear parts; one that has been acknowledged as a leader in mining expertise. You will want to research and seek out a company that specializes in the engineering, manufacturing, procurement and design of mining wear parts and who has access to a multitude of patterns and templates to choose from.

It is vital to find a company that invites you to put them to the test; a company that is committed to selling more than just a product, standing behind the parts that they design and manufacture with an unprecedented industry guarantee. Some companies are so confident in their products that each wear part is stamped with their logo, identifying it as a superior product.

You will also want to find a company that takes pride in establishing strong customer relationships and who employs people who are as equally committed to providing outstanding service with customer satisfaction a priority. Your research will help you find a mining wear parts company that guarantees that if they do not have the part available, that they will find it for you or are capable of custom designing products to your exact specifications.

If you stop to consider the ramifications of an equipment malfunction or breakdown on production quotas, the significance of reliable parts becomes readily apparent. The impact can be far reaching if it halts production while the necessary repairs are completed. The ugly reality is that downtime incurs financial losses.

While the cost of aftermarket replacement mining wear parts is one factor, the installation of the part is equally as important. It is vital that aftermarket parts are built to a rugged standard to endure the rigorous industrial demands placed on them. Mining wear parts are routinely subjected to high stress abrasion and impact. The fabricated parts need to have the structural strength to be wear resistant with extended usage. Hardened manganese is the preferred material of choice to impart added strength and avoid premature breakage and replacement. Using inferior quality parts may result in the necessity of replacing them prematurely if they do not withstand the wear and tear that they are subjected to daily. While a few dollars may be saved initially by purchasing inferior mining wear parts, production costs can dramatically increase if frequent breakdowns occur and manpower hours are wasted in the field. Efficient use of manpower is an important budget consideration. Reliability is an absolute necessity w
hen you have production deadlines to meet and operations can quickly grind to a standstill when production is halted.

Quality assurance management monitors the consistency of the parts, demanding that they are machined within precise measurements. In addition, they focus on striving to improve the quality of parts as new technology becomes available. Using precision made, high quality wear parts can make your business more competitive, giving you an advantage and improving your bottom line.

Source:http://ezinearticles.com/?Choose-Mining-Wear-Parts-Wisely&id=6691631

Tuesday, 16 December 2014

Benefits of Predictive Analytics and Data Mining Services

Predictive Analytics is the process of dealing with variety of data and apply various mathematical formulas to discover the best decision for a given situation. Predictive analytics gives your company a competitive edge and can be used to improve ROI substantially. It is the decision science that removes guesswork out of the decision-making process and applies proven scientific guidelines to find right solution in the shortest time possible.

Predictive analytics can be helpful in answering questions like:
•    Who are most likely to respond to your offer?
•    Who are most likely to ignore?
•    Who are most likely to discontinue your service?
•    How much a consumer will spend on your product?
•    Which transaction is a fraud?
•    Which insurance claim is a fraudulent?
•    What resource should I dedicate at a given time?

Benefits of Data mining include:

•    Better understanding of customer behavior propels better decision
•    Profitable customers can be spotted fast and served accordingly
•    Generate more business by reaching hidden markets
•    Target your Marketing message more effectively
•    Helps in minimizing risk and improves ROI.
•    Improve profitability by detecting abnormal patterns in sales, claims, transactions etc
•    Improved customer service and confidence
•    Significant reduction in Direct Marketing expenses

Basic steps of Predictive Analytics are as follows:

•    Spot the business problem or goal
•    Explore various data sources such as transaction history, user demography, catalog details, etc)
•    Extract different data patterns from the above data
•    Build a sample model based on data & problem
•    Classify data, find valuable factors, generate new variables
•    Construct a Predictive model using sample
•    Validate and Deploy this Model

Standard techniques used for it are:

•    Decision Tree
•    Multi-purpose Scaling
•    Linear Regressions
•    Logistic Regressions
•    Factor Analytics
•    Genetic Algorithms
•    Cluster Analytics
•    Product Association

Should you have any queries regarding Data Mining or Predictive Analytics applications, please feel free to contact us. We would be pleased to answer each of your queries in detail.

Source:http://ezinearticles.com/?Benefits-of-Predictive-Analytics-and-Data-Mining-Services&id=4766989

Saturday, 13 December 2014

Handling exceptions in scrapers

When requesting and parsing data from a source with unknown properties and random behavior (in other words, scraping), I expect all kinds of bizarrities to occur. Managing exceptions is particularly helpful in such cases.

Here is some ways that an exception might be raised.
[][0] #The list has no zeroth element, so this raises an IndexError
{}['foo'] #The dictionary has no foo element, so this raises a KeyError

Catching the exception is sometimes cleaner than preventing it from happening in the first place. Here are some examples handling bizarre exceptions in scrapers.

Example 1: Inconsistant date formats

Let’s say we’re parsing dates.
import datetime
This doesn’t raise an error.
datetime.datetime.strptime('2012-04-19', '%Y-%m-%d')
But this does.
datetime.datetime.strptime('April 19, 2012', '%Y-%m-%d')

It raises a ValueError because the date formats don’t match. So what do we do if we’re scraping a data source with multiple date formats?

Ignoring unexpected date formats

A simple thing is to ignore the date formats that we didn’t expect.

import lxml.html
import datetime
def parse_date1(source):
    rawdate = lxml.html.fromstring(source).get_element_by_id('date').text
    try:
         cleandate = datetime.datetime.strptime(rawdate, '%Y-%m-%d')
    except ValueError:
         cleandate = None
    return cleandate

print parse_date1('<div id="date">2012-04-19</div>')

If we make a clean date column in a database and put this in there, we’ll have some rows with dates and some rows with nulls. If there are only a few nulls, we might just parse those by hand.

Trying multiple date formats

Maybe we have determined that this particular data source uses three different date formats. We can try all three.

import lxml.html
import datetime

def parse_date2(source):

    rawdate = lxml.html.fromstring(source).get_element_by_id('date').text

    for date_format in ['%Y-%m-%d', '%B %d, %Y', '%d %B, %Y']:

        try:
             cleandate = datetime.datetime.strptime(rawdate, date_format)
             return cleandate
        except ValueError:
             pass
    return None

print parse_date2('<div id="date">19 April, 2012</div>')

This loops through three different date formats and returns the first one that doesn’t raise the error.

Example 2: Unreliable HTTP connection

If you’re scraping an unreliable website or you are behind an unreliable internet connection, you may sometimes get HTTPErrors or URLErrors for valid URLs. Trying again later might help.

import urllib2
def load(url):
    retries = 3
    for i in range(retries):
        try:
            handle = urllib2.urlopen(url)
            return handle.read()
        except urllib2.URLError:
            if i + 1 == retries:
                raise
            else:
                time.sleep(42)
    # never get here

print load('http://thomaslevine.com')

This function tries to download the page thee times. On the first two fails, it waits 42 seconds and tries again. On the third failure, it raises the error. On a success, it returs the content of the page.

Example 3: Logging errors rather than raising them

For more complicated parses, you might find loads of errors popping up in weird places, so you might want to go through all of the documents before deciding which to fix first or whether to do some of them manually.

import scraperwiki
for document_name in document_names:
    try:
        parse_document(document_name)
    except Exception as e:
        scraperwiki.sqlite.save([], {
            'documentName': document_name,
            'exceptionType': str(type(e)),
            'exceptionMessage': str(e)
        }, 'errors')

This catches any exception raised by a particular document, stores it in the database and then continues with the next document. Looking at the database afterwards, you might notice some trends in the errors that you can easily fix and some others where you might hard-code the correct parse.

Example 4: Exiting gracefully

When I’m scraping over 9000 pages and my script fails on page 8765, I like to be able to resume where I left off. I can often figure out where I left off based on the previous row that I saved to a database or file, but sometimes I can’t, particularly when I don’t have a unique index.


for bar in bars:
    try:
        foo(bar)
    except:
        print('Failure at bar = "%s"' % bar)
        raise

This will tell me which bar I left off on. It’s fancier if I save the information to the database, so here is how I might do that with ScraperWiki.

import scraperwiki
resume_index = scraperwiki.sqlite.get_var('resume_index', 0)
for i, bar in enumerate(bars[resume_index:]):
    try:
        foo(bar)
    except:
        scraperwiki.sqlite.save_var('resume_index', i)
        raise
scraperwiki.sqlite.save_var('resume_index', 0)

ScraperWiki has a limit on CPU time, so an error that often concerns me is the scraperwiki.CPUTimeExceededError. This error is raised after the script has used 80 seconds of CPU time; if you catch the exception, you have two CPU seconds to clean up. You might want to handle this error differently from other errors.

import scraperwiki
resume_index = scraperwiki.sqlite.get_var('resume_index', 0)
for i, bar in enumerate(bars[resume_index:]):
    try:
        foo(bar)
    except scraperwiki.CPUTimeExceededError:
        scraperwiki.sqlite.save_var('resume_index', i)
    except Exception as e:
        scraperwiki.sqlite.save_var('resume_index', i)
        scraperwiki.sqlite.save([], {
            'bar': bar,
            'exceptionType': str(type(e)),
            'exceptionMessage': str(e)
        }, 'errors')
scraperwiki.sqlite.save_var('resume_index', 0)

tl;dr

Expect exceptions to occur when you are scraping a randomly unreliable website with randomly inconsistent content, and consider handling them in ways that allow the script to keep running when one document of interest is bizarrely formatted or not available.

Source: https://blog.scraperwiki.com/2012/05/handling-exceptions-in-scrapers/

Thursday, 11 December 2014

Content Scraping Reuses Blog Posts without Permission

What do popular blogs and websites such as Social Media Examiner, Copy Blogger, CNN.com, Mashable, and Type A Parent have in common? No, it’s not traffic and a loyal online community, each was a victim of the content scraping site “BuzzMyFx.” Although most bloggers fall victim to content scrapers at least once, the offending website was such an extreme case the backlash against it was fast and furious. Thanks to the quick action of many angry bloggers, BuzzMyFix was taken down in a matter of days.

If you’re not familiar with content scraping sites and aren’t sure why they’re bad and what you can do if you fall prey, read on. Not knowing what steps you can take to remove your content from a scraping site can mean someone else is profiting from your hard work.

What is content scraping?

Content scraping is when a blog or website pulls in other bloggers’ content without permission, in many cases passing it off as their own. Instead of stocking their sites with unique content, they steal entire blog posts. Some do leave the original authors’ bylines, but there are plenty that don’t provide attribution at all. This is not a good thing at all.

If you don’t care about someone taking your content and putting it on their blogs and websites without your permission, you should. These sites are stealing traffic, search engine rankings, and even advertising revenue from bloggers. Moreover, by ignoring scraping sites you’re giving the message that this practice is OK.

It’s not OK.

How was BuzzMyFx different?

BuzzMyFx was a little different from your usual scrapers. Bloggers didn’t just find their content had been posted on this site, they learned their entire blogs — down to the design and comments — had been cloned. Plus, any bloggers checking to see if their blogs were being cloned immediately found themselves being scraped as well. Dozens, if not hundreds of blogs were affected. However, bloggers didn’t take this incident sitting down. They spread the word and contacted the site’s host en masse. Thanks to their swift action, and the high number of complaints, the site was removed quickly.

How can I tell if my content is being scraped?

Fortunately for content creators, scrapers are a lazy bunch. Because their sites are automated, and they don’t check or read the content being pulled, they don’t take many precautions to ensure the people they scrape from don’t find their sites. In fact, they may not even care. Fortunately, this makes it easy to learn if your content is being stolen.

    Link to your own articles — When you write a blog post and link to other (of your own) blog posts within that post, it’s not only good SEO. You also will get pingbacks whenever someone else steals your content because of your interlinks. You’re alerted when someone links to your content, and when content is published with your links, you’ll get that alert.

    Google Alerts — If your name, blog’s name, or other unique keywords are set up as Google Alerts, you’ll receive an e-mail every time content is published with these keywords.

    Analytics — When people click on your links that are in scraped content, it will show up as referring traffic in your analytics program. You should always check referring traffic so you can thank the referring site owner, but also to make sure no one is stealing your content.

What steps can I take to remove my content from a scraper?

If you find your content is being stolen, know you have several options. First, you’ll need to find out who owns the scraping site. You can find this out by doing a WHOis domain lookup, which will enable you to search for the website’s details, including the name of the webmaster, contact info, and the name of the site’s host.

Keep in mind that sometimes the website’s owner will pay extra to have his or her name kept private, but you will always be able to find the name of the host. Once you have this information, you can take the necessary steps to have your content removed.

    Contact the site’s owner personally: Your first step should always be a polite request to remove your content immediately. Let the website owner know he or she is in violation of the Digital Millennium Copyright Act (DMCA), and you will take the necessary steps to report him if he doesn’t comply.

    Contact the site’s host: If you can’t find the name of the person who owns the site, or if he won’t comply with your takedown request, contact the website’s host. You’ll have to prove your content is being stolen. As the host can be held liable for allowing the content theft, it’s in their best interest to contact the website owner and request removal.

    Contact Google: You can contact Google and fill out a form to have them remove the website from their search engines.

    Spread the word: Let all your blogging friends know about content scrapers when you come across them. The more people who take action against content scrapers, the less likely they are to do it again.

Contacting the webmaster with a takedown notice doesn’t have to be an intimidating process, either. The website Plagiarism Today has a wonderful set of stock letters to use to contact webmasters, web hosts, and even Google. All you have to do is insert the necessary information.

Content scrapers and cloners may try to steal your content, but you don’t have to let them. Stand up for what’s yours.

Source: http://www.dummies.com/how-to/content/content-scraping-reuses-blog-posts-without-permiss.html

Thursday, 4 December 2014

Scraping and Analyzing Angel List Syndicates: Kimono Labs + Silk

Because we use Silk to tell stories and visualize data, we are always looking for interesting ways to pull data into a Silk. Right now that means getting data into the CSV format. Fortunately, a wave of new and powerful visual webscraping tools and services have emerged. These make it very simple for anyone (no technical skills required) to scrape data from a website and export that data into a CSV which we can quickly upload into a Silk.

Cool New Scraping Tools

One of the tools we love in this new space is Kimono Labs. Backed by Y Combinator, Kimono combines a visual scraping editor with the ability to do very granular code-inspector level editing to scraping paths. Saved scrapes can be turned into APIs and exported as JSON. Kimono also lets you save time-series versioning of scrapes.

Data from angel-list-syndicates.silk.co

Like many startups, we watch the goings on at AngelList very closely. Syndicates are of particular interest. Basically, these are DIY venture capital pools that allow a qualified investor to serve as a syndicate leader and aggregate small investments from other qualified investors who are members of AngelList. The idea of the syndicates is to democratize the VC process and make it easier and less risky for individuals to participate.

We used Kimono to scrape information on the Top 25 Syndicates ranked by dollars backing each round. Kimono makes it very easy to visually designate which parts of a page to scrape and how many rows there are on a page. (Here you can see me highlighting the minimum dollar investment). We downloaded the information as a CSV and did a quick scrub to get it ready for upload to Silk. The process took no more than 15 minutes.

We could tell by eyeballing the numbers beforehand that a serious Power Law was in effect. And the actual data analysis on Silk bore this out. We chose to use a pie chart to show distribution. Three syndicates control nearly two-thirds of all the committed capital by Angel.co members in the syndicate program. One of the top three - Tim Ferriss - has no background as a venture capitalist or building technology companies but is rapidly becoming a force in startup investing. The person with the largest committed syndicate pool, Gil Penachina, is someone who is a quiet mover and shaker in Silicon Valley but he clearly packs a huge punch.

The largest syndicate in terms of likely commitments of deals per year is Foundry Group Angels, a group led by Brad Feld (@bfeld). While they put in less per deal, they are planning to back over 50 deals per year - a huge number. Trailing far behind those three was media impresario and Launch conference mogul Jason Calacanis, who is one of the most visible people in the startup space.

Source: http://blog.silk.co/post/83501793279/scraping-and-analyzing-angel-list-syndicates

Sunday, 23 November 2014

4 Data Mining Tips to Scrap Real Estate Data; Innovative Way to Give Realty Business a boost!

Internet has become a huge source of data – in fact; it has turned into a goldmine for the marketers, from where they can easily dig the useful data!

    Web scraping has become a norm in today’s competitive era, where one with maximum and relevant information wins the race!

Real Estate Data Extraction and Scraping Service

It has helped many industries to carve a niche in the market; especially real estate – Scraping real estate data has been of great help for professionals to reach out to a large number of people and gather reliable property data. However, there are some people for whom web scraping is still an alien concept; most probably because most of its advantages are not discussed.

    There are institutions, companies and organizations, entrepreneurs, as well as just normal citizens generating an extraordinary amount of information every day. Property information extraction can be effectively used to get an idea about the customer psyche and even generate valuable lead to further the business.

In addition to this, data mining has also some of following uses making it an indispensable part of marketing.

Gather Properties Details from Different Geographical Locations

You are an estate agent and want to expand your business to the neighboring city or state. But, then you are short of information. You are completely aware of the properties in the vicinity and in your town; however, with data mining services will help you to get an idea about the properties in the other state. You can also approach probable clients and increase your database to offer extensive services.

Online Offers and Discounts are just a Click Away

Now, it is tough to deal with the clients, show them the property of their choice and again act as a mediator between the buyer and seller. In all this, it becomes almost difficult to take a look at some special discounts or offers. With the data mining services, you can get an insight into these amazing offers. Thus, you can plan a move or even provide your client an amazing deal.

What people are talking about – Easy Monitoring of your Online Reputation

Internet has become a melting pot where different people come together. In fact, it provides a huge platform where people discuss about their likes and dislikes. When you dig into such online forums, you can get an idea of reputation that you or your firm holds. You can know what people think about you and where you require to buck up and where you need to slow down.

A Chance to Know your Competitors Better!

Last, but not the least, you can keep an eye on the competitor.  Real Estate is getting more competitive; and therefore, it is important to have knowledge about your competitors to get an upper hand. It will help you to plan your moves and strategize with more ease. Moreover, you also know what is that “something” that your competitor does not have and you have, with can be subtly highlighted.

Property information extraction can prove to be the most fruitful method to get a cutting edge in the industry.

Source: http://www.hitechbposervices.com/blog/4-data-mining-tips-to-scrap-real-estate-data-innovative-way-to-give-realty-business-a-boost/

Monday, 17 November 2014

How to scrape data without coding? A step by step tutorial on import.io

Import.io (pronounced import-eye-oh) lets you scrape data from any website into a searchable database. It is perfect for gathering, aggregating and analysing data from websites without the need for coding skills. As Sally Hadadi, from Import.io, told Journalism.co.uk: the idea is to “democratise” data. “We want journalists to get the best information possible to encourage and enhance unique, powerful pieces of work and generally make their research much easier.” Different uses for journalists, supplemented by case studies, can be found here.

A beginner’s guide

After downloading and opening import.io browser, copy the URL of the page you want to scrape into the import.io browser. I decided to scrape the search results website of orphanages in London:

001 Orphanages in London

After opening the website, press the tiny pink button in top right corner of the browser and follow up with “Let’s get cracking!” in the bottom right menu which has just appeared.

Then, choose the type of scraping you want to perform. In my case, it’s a Crawler (we’ll be getting data from multiple similar pages on the same site):

crawler

And confirm the URL of the website you want to scrape by clicking “I’m there”.

As advised, choose “Detect optimal settings” and confirm the following:

data

In the menu “Rows per page” select the format in which data appears on the website, whether it is “single” or “multiple”. I’m opting for the multiple as my URL is a listing of multiple search results:multiple

Now, the time has come to “train your rows” i.e. mark which part of the website you are interested in scraping. Hover over an entire “entry” or “paragraph”:hover over entry

…and he entry will be highlighted in pink or blue. Press “Train rows”.

train rows

Repeat the operation with the next entry/paragraph so that the scraper gets the hang of the pattern of your selections. Two examples should suffice. Scroll down to the bottom of your website to make sure that all entries until the last one are selected (=highlighted in pink or blue alternately).

If it is, press “I’ve got all 50 rows” (the number depends on how many rows you have selected).

Now it’s time to focus on particular chunks of data you would like to extract. My entries consist of a name of the orphanage, address, phone number and a short description so I will extract all those to separate columns. Let’s start by adding a column “name”:

add column

Next, highlight the name of the first orphanage in the list and press “Train”.

highlighttrain

Your table should automatically fill in with names of all orphanages in the list:table name

If it didn’t, try tweaking your selection a bit. Then add another column “address” and extract the address of the orphanage by highlighting the two lines of addresses and “training” the rows.

Repeat the operation for a “phone number” and “description”. Your table should end up looking like this:table final

*Before passing on to the next column it is worth to check that all the rows have filled up. If not, highlighting and training of the individual elements might be necessary.

Once you’ve grabbed all that you need, click “I’ve got what I need”. The menu will now ask you if you want to scrape more pages. In this case, the search yielded two pages of search results so I will add another page. In order to this this, go back to your website in you regular browser, choose page 2 (or any next one) of your search results and copy the URL. Paste it into the import.io browser and confirm by clicking “I’m there”:

i'm there

The scraper should automatically fill in your table for page 2. Click “I’ve got all 45 rows” and “I’ve got what I needed”.

You need to add at least 5 pages, which is a bit frustrating with a smaller data set like this one. The way around it is to add page 2 a couple of times and delete the unnecessary rows in the final table.

Once the cheating is done, click “I’m done training!” and “Upload to import.io”.

upload

Give the name to your Crawler, e.g. “Orphanages in London” and wait for import.io to upload your data. Then, run crawler:run crawler

Make sure that the page depth is 10 and that click “Go”. If you’re scraping a huge dataset with several pages of search results, you can copy your URLs to Excel, highlight them and drag down with a black cross (bottom right of the cell) to obtain a comprehensive list. Paste it into the “Where to start?” window and press “Go”.go

crawlingAfter the crawling is complete, you can download you data in EXCEL, HTML, JSON or CSV.dataset

As a result, we obtain a data set which can be easily turned into a map of orphanages in London, e.g. using Google Fusion Tables.

Source:http://www.interhacktives.com/2014/03/06/scrape-data-without-coding-step-step-tutorial-import-io/

Thursday, 13 November 2014

Scraping Data: Site-specific Extractors vs. Generic Extractors

Scraping is becoming a rather mundane job with every other organization getting its feet wet with it for their own data gathering needs. There have been enough number of crawlers built – some open-sourced and others internal to organizations for in-house utilities. Although crawling might seem like a simple technique at the onset, doing this at a large-scale is the real deal. You need to have a distributed stack set up to take care of handling huge volumes of data, to provide data in a low-latency model and also to deal with fail-overs. This still is achievable after crossing the initial tech barrier and via continuous optimizations. (P.S. Not under-estimating this part because it still needs a team of Engineers monitoring the stats and scratching their heads at times).

Social Media Scraping

Focused crawls on a predefined list of sites

However, you bump into a completely new land if your goal is to generate clean and usable data sets from these crawls i.e. “extract” data in a format that your DB can process and aid in generating insights. There are 2 ways of tackling this:

a. site-specific extractors which give desired results

b. generic extractors that result in few surprises

Assuming you still do focused crawls on a predefined list of sites, let’s go over specific scenarios when you have to pick between the two-

1. Mass-scale crawls; high-level meta data - Use generic extractors when you have a large-scale crawling requirement on a continuous basis. Large-scale would mean having to crawl sites in the range of hundreds of thousands. Since the web is a jungle and no two sites share the same template, it would be impossible to write an extractor for each. However, you have to settle in with just the document-level information from such crawls like the URL, meta keywords, blog or news titles, author, date and article content which is still enough information to be happy with if your requirement is analyzing sentiment of the data.

cb1c0_one-size

A generic extractor case

Generic extractors don’t yield accurate results and often mess up the datasets deeming it unusable. Reason being

programatically distinguishing relevant data from irrelevant datasets is a challenge. For example, how would the extractor know to skip pages that have a list of blogs and only extract the ones with the complete article. Or delineating article content from the title on a blog page is not easy either.

To summarize, below is what to expect of a generic extractor.

Pros-

minimal manual intervention

low on effort and time

can work on any scale

Cons-

Data quality compromised

inaccurate and incomplete datasets

lesser details suited only for high-level analyses

Suited for gathering- blogs, forums, news

Uses- Sentiment Analysis, Brand Monitoring, Competitor Analysis, Social Media Monitoring.

2. Low/Mid scale crawls; detailed datasets - If precise extraction is the mandate, there’s no going away from site-specific extractors. But realistically this is do-able only if your scope of work is limited i.e. few hundred sites or less. Using site-specific extractors, you could extract as many number of fields from any nook or corner of the web pages. Most of the times, most pages on a website share similar templates. If not, they can still be accommodated for using site-specific extractors.

cutlery

Designing extractor for each website

Pros-

High data quality

Better data coverage on the site

Cons-

High on effort and time

Site structures keep changing from time to time and maintaining these requires a lot of monitoring and manual intervention

Only for limited scale

Suited for gathering - any data from any domain on any site be it product specifications and price details, reviews, blogs, forums, directories, ticket inventories, etc.

Uses- Data Analytics for E-commerce, Business Intelligence, Market Research, Sentiment Analysis

Conclusion

Quite obviously you need both such extractors handy to take care of various use cases. The only way generic extractors can work for detailed datasets is if everyone employs standard data formats on the web (Read our post on standard data formats here). However, given the internet penetration to the masses and the variety of things folks like to do on the web, this is being overly futuristic.

So while site-specific extractors are going to be around for quite some time, the challenge now is to tweak the generic ones to work better. At PromptCloud, we have added ML components to make them smarter and they have been working well for us so far.

What have your challenges been? Do drop in your comments.

Source: https://www.promptcloud.com/blog/scraping-data-site-specific-extractors-vs-generic-extractors/

Friday, 7 November 2014

Web Scraping the Solution to Data Harvesting

The internet is the number one information provider in the world and it is of course the largest in the same course. Web scraping is meant to extract and harvest useful information from the internet. It can be regarded as a multidisciplinary process that involves statistics, databases, data harvesting and data retrieval.

There has been noted a rapid expansion of the web and therefore causing an enormous growth of information. This has led to increased difficulty in the extraction of useful and potential information. Web scraping therefore confronts this problem by harvesting explicit information from a number of websites for knowledge discovery and easy access. It is important to realize that query interfaces of web databases are prone to sharing of same building blocks. It is therefore important to realize that the web offers unprecedented challenge and opportunity to data harvesting.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/web-scraping-solution-data-harvesting/

Wednesday, 5 November 2014

Web Scraping: The Invaluable Decision Making Tool

Business decisions are mandatory in any company. They reflect and directly influence about the future of the company. It is important to realize that decisions must be made in any business situation. The generation of new ideas calls for new actions. This in turn calls for decisions. Decisions can only be made when there is adequate information or data regarding the problem and the cause of action to be taken. Web scraping offers the best opportunity in getting the required information that will enable the management make a wise and sound decision.

Therefore web scraping is an important part in generation of the practical interpretations for the business decision making process. Since businesses take many courses of actions the following areas call for adequate web scraping in order to make outstanding decisions.

1. Suppliers. Whether you are running an offline business there is need to get information regarding your suppliers. In this case there are two situations. The first situation is about your current suppliers and the second situation is about the possibility of acquiring new suppliers. By web scraping you has the opportunity to gather about your suppliers. You need to know other business they are supplying to and the kind of discounts and prices they offer to them. Another important aspect about consumers is to determine the periods when they have surplus and therefore be able to determine the purchasing prices.

Web scraping can provide new information concerning new suppliers. This will make a cutting edge in the purchasing sector. You can get new suppliers that have reasonable prices. This will go a long way in ensuring a profitable business. Therefore web scraping is an integral process that should be taken first before making a vital decision concerning suppliers.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/web-scraping-invaluable-decision-making-tool/

Monday, 8 September 2014

Scraping webdata from a website that loads data in a streaming fashion

I'm trying to scrape some data off of the FEC.gov website using python for a project of mine. Normally I use python

mechanize and beautifulsoup to do the scraping.

I've been able to figure out most of the issues but can't seem to get around a problem. It seems like the data is

streamed into the table and mechanize.Browser() just stops listening.

So here's the issue: If you visit http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A ... you get the first 500

contributors whose last name starts with A and have given money to candidate P80003338 ... however, if you use

browser.open() at that url all you get is the first ~5 rows.

I'm guessing its because mechanize isn't letting the page fully load before the .read() is executed. I tried putting a

time.sleep(10) between the .open() and .read() but that didn't make much difference.

And I checked, there's no javascript or AJAX in the website (or at least none are visible when you use the 'view-

source'). SO I don't think its a javascript issue.

Any thoughts or suggestions? I could use selenium or something similar but that's something that I'm trying to avoid.

-Will

2 Answers

Why not use an html parser like lxml with xpath expressions.

I tried

>>> import lxml.html as lh
>>> data = lh.parse('http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A')
>>> name = data.xpath('/html/body/table[2]/tr[5]/td[1]/a/text()')
>>> name
[' AABY, TRYGVE']
>>> name = data.xpath('//table[2]/*/td[1]/a/text()')
>>> len(name)
500
>>> name[499]
' AHMED, ASHFAQ'
>>>



Similarly, you can create xpath expression of your choice to work with.


Source: http://stackoverflow.com/questions/9435512/scraping-webdata-from-a-website-that-loads-data-in-a-streaming-

fashion

How can I circumvent page view limits when scraping web data using Python?

I am using Python to scrape US postal code population data from http:/www.city-data.com, through this directory: http://www.city-data.com/zipDir.html. The specific pages I am trying to scrape are individual postal code pages with URLs like this: http://www.city-data.com/zips/01001.html. All of the individual zip code pages I need to access have this same URL Format, so my script simply does the following for postal_code in range:

    Creates URL given postal code
    Tries to get response from URL
    If (2), Check the HTTP of that URL
    If HTTP is 200, retrieves the HTML and scrapes the data into a list
    If HTTP is not 200, pass and count error (not a valid postal code/URL)
    If no response from URL because of error, pass that postal code and count error
    At end of script, print counter variables and timestamp

The problem is that I run the script and it works fine for ~500 postal codes, then suddenly stops working and returns repeated timeout errors. My suspicion is that the site's server is limiting the page views coming from my IP address, preventing me from completing the amount of scraping that I need to do (all 100,000 potential postal codes).

My question is as follows: Is there a way to confuse the site's server, for example using a proxy of some kind, so that it will not limit my page views and I can scrape all of the data I need?

Thanks for the help! Here is the code:

##POSTAL CODE POPULATION SCRAPER##

import requests

import re

import datetime

def zip_population_scrape():

    """
    This script will scrape population data for postal codes in range
    from city-data.com.
    """
    postal_code_data = [['zip','population']] #list for storing scraped data

    #Counters for keeping track:
    total_scraped = 0
    total_invalid = 0
    errors = 0


    for postal_code in range(1001,5000):

        #This if statement is necessary because the postal code can't start
        #with 0 in order for the for statement to interate successfully
        if postal_code <10000:
            postal_code_string = str(0)+str(postal_code)
        else:
            postal_code_string = str(postal_code)

        #all postal code URLs have the same format on this site
        url = 'http://www.city-data.com/zips/' + postal_code_string + '.html'

        #try to get current URL
        try:
            response = requests.get(url, timeout = 5)
            http = response.status_code

            #print current for logging purposes
            print url +" - HTTP:  " + str(http)

            #if valid webpage:
            if http == 200:

                #save html as text
                html = response.text

                #extra print statement for status updates
                print "HTML ready"

                #try to find two substrings in HTML text
                #add the substring in between them to list w/ postal code
                try:           

                    found = re.search('population in 2011:</b> (.*)<br>', html).group(1)

                    #add to # scraped counter
                    total_scraped +=1

                    postal_code_data.append([postal_code_string,found])

                    #print statement for logging
                    print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."
                #if substrings not found, try searching for others
                #and doing the same as above   
                except AttributeError:
                    found = re.search('population in 2010:</b> (.*)<br>', html).group(1)

                    total_scraped +=1

                    postal_code_data.append([postal_code_string,found])
                    print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."

            #if http =404, zip is not valid. Add to counter and print log        
            elif http == 404:
                total_invalid +=1

                print postal_code_string + ": Not a valid zip code. " + str(total_invalid) + " total invalid zips."

            #other http codes: add to error counter and print log
            else:
                errors +=1

                print postal_code_string + ": HTTP Code Error. " + str(errors) + " total errors."

        #if get url fails by connnection error, add to error count & pass
        except requests.exceptions.ConnectionError:
            errors +=1
            print postal_code_string + ": Connection Error. " + str(errors) + " total errors."
            pass

        #if get url fails by timeout error, add to error count & pass
        except requests.exceptions.Timeout:
            errors +=1
            print postal_code_string + ": Timeout Error. " + str(errors) + " total errors."
            pass


    #print final log/counter data, along with timestamp finished
    now= datetime.datetime.now()
    print now.strftime("%Y-%m-%d %H:%M")
    print str(total_scraped) + " total zips scraped."
    print str(total_invalid) + " total unavailable zips."
    print str(errors) + " total errors."



Source: http://stackoverflow.com/questions/25452798/how-can-i-circumvent-page-view-limits-when-scraping-web-data-using-python

Sunday, 7 September 2014

Web data scraping (online news comments) with Scrapy (Python)

Since you seem like the try-first ask-question later type (that's a very good thing), I won't give you an answer, but a

(very detailed) guide on how to find the answer.

The thing is, unless you are a yahoo developer, you probably don't have access to the source code you're trying to

scrape. That is to say, you don't know exactly how the site is built and how your requests to it as a user are being

processed on the server-side. You can, however, investigate the client-side and try to emulate it. I like using Chrome

Developer Tools for this, but you can use others such as FF firebug.

So first off we need to figure out what's going on. So the way it works, is you click on the 'show comments' it loads

the first ten, then you need to keep clicking for the next ten comments each time. Notice, however, that all this

clicking isn't taking you to a different link, but lively fetches the comments, which is a very neat UI but for our

case requires a bit more work. I can tell two things right away:

    They're using javascript to load the comments (because I'm staying on the same page).
    They load them dynamically with AJAX calls each time you click (meaning instead of loading the comments with the

page and just showing them to you, with each click it does another request to the database).

Now let's right-click and inspect element on that button. It's actually just a simple span with text:

<span>View Comments (2077)</span>

By looking at that we still don't know how that's generated or what it does when clicked. Fine. Now, keeping the

devtools window open, let's click on it. This opened up the first ten. But in fact, a request was being made for us to

fetch them. A request that chrome devtools recorded. We look in the network tab of the devtools and see a lot of

confusing data. Wait, here's one that makes sense:

http://news.yahoo.com/_xhr/contentcomments/get_comments/?content_id=42f7f6e0-7bae-33d3-aa1d-

3dfc7fb5cdfc&_device=full&count=10&sortBy=highestRated&isNext=true&offset=20&pageNumber=2&_media.modules.content_commen

ts.switches._enable_view_others=1&_media.modules.content_comments.switches._enable_mutecommenter=1&enable_collapsed_com

ment=1

See? _xhr and then get_comments. That makes a lot of sense. Going to that link in the browser gave me a JSON object

(looks like a python dictionary) containing all the ten comments which that request fetched. Now that's the request you

need to emulate, because that's the one that gives you what you want. First let's translate this to some normal reqest

that a human can read:

go to this url: http://news.yahoo.com/_xhr/contentcomments/get_comments/
include these parameters: {'_device': 'full',
          '_media.modules.content_comments.switches._enable_mutecommenter': '1',
          '_media.modules.content_comments.switches._enable_view_others': '1',
          'content_id': '42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc',
          'count': '10',
          'enable_collapsed_comment': '1',
          'isNext': 'true',
          'offset': '20',
          'pageNumber': '2',
          'sortBy': 'highestRated'}

Now it's just a matter of trial-and-error. However, a few things to note here:

    Obviously the count is what decides how many comments you're getting. I tried changing it to 100 to see what

happens and got a bad request. And it was nice enough to tell me why - "Offset should be multiple of total rows". So

now we understand how to use offset

    The content_id is probably something that identifies the article you are reading. Meaning you need to fetch that

from the original page somehow. Try digging around a little, you'll find it.

    Also, you obviously don't want to fetch 10 comments at a time, so it's probably a good idea to find a way to fetch

the number of total comments somehow (either find out how the page gets it, or just fetch it from within the article

itself)

    Using the devtools you have access to all client-side scripts. So by digging you can find that that link to

/get_comments/ is kept within a javascript object named YUI. You can then try to understand how it is making the

request, and try to emulate that (though you can probably figure it out yourself)

    You might need to overcome some security measures. For example, you might need a session-key from the original

article before you can access the comments. This is used to prevent direct access to some parts of the sites. I won't

trouble you with the details, because it doesn't seem like a problem in this case, but you do need to be aware of it in

case it shows up.

    Finally, you'll have to parse the JSON object (python has excellent built-in tools for that) and then parse the

html comments you are getting (for which you might want to check out BeautifulSoup).

As you can see, this will require some work, but despite all I've written, it's not an extremely complicated task

either.

So don't panic.

It's just a matter of digging and digging until you find gold (also, having some basic WEB knowledge doesn't hurt).

Then, if you face a roadblock and really can't go any further, come back here to SO, and ask again. Someone will help

you.


Source: http://stackoverflow.com/questions/20218855/web-data-scraping-online-news-comments-with-scrapy-python

Saturday, 6 September 2014

A good web data extraction/screen scraper program?


I need to capture product data from a site on a regular basis and wondered if any one knows of a good software program? I've trialed Mozenda but its a monthly subscription and pricey in the long term. Obviously something thats free would be best but I don't mind paying either. Just need a decent program thats reliable and doesn't require much programming knowledge.

You can try ScraperWiki.com if you know python.

I've experimented with Screen-Scraper and found it easy to use. The application comes in multiple versions: basic (which is free), professional, and enterprise. Also, multiple platforms are supported.

Hire a programmer to do it so that there is only a one off cost. I often see similar projects on freelancing websites like Elance and oDesk.

I really like iMacros. You can give it a test drive to see if it meets your needs with the totally free Firefox extension (there's also IE versions), but there are also more full featured application and "server" versions that have more features and ability to do thing in an unattended manner.

Here are some other alternatives to consider:

    License the data from the provider. Call em up and ask 'em.

    Use Amazon Mechanical Turk to get humans to copy and paste and format it for ya. They are cheap.

    For automation, it depends on how complicated the HTML is and how often it changes. You could use Excel's Web Data Import if it's really simple.


You can use irobot from IRobotSoft, which is totally free, and provides more functionalityies than other paid software. Watch demos here http://irobotsoft.com/help/ for how simple it is.

Questions on their forum were answered very quickly.


Source: http://stackoverflow.com/questions/2334164/a-good-web-data-extraction-screen-scraper-program

Friday, 5 September 2014

How to login to website and extract data using PHP [closed]

I have installed the tiny tiny rss on to my computer (Windows) and also have Xampp installed (localhost).

I want to be able to use PHP to extract data from the Tiny tiny RSS webpage.

I have tried this it which just opens the front page:

<?php
$homepage = file_get_contents('my install tiny tiny rss url');
echo $homepage;
?>

But how do I login and extract the data.

You can use cURL to send post data and headers. To login you need to replicate the exact data exchange between the client and the server.


SOurce: http://stackoverflow.com/questions/20611918/how-to-login-to-website-and-extract-data-using-php

Is it ok to scrape data from Google results?


I'd like to fetch results from Google using curl to detect potential duplicate content. Is there a high risk of being banned by Google?

Google will eventually block your IP when you exceed a certain amount of requests.



Google disallows automated access in their TOS, so if you accept their terms you would break them.

That said, I know of no lawsuit from Google against a scraper. Even Microsoft scraped Google, they powered their search engine Bing with it. They got caught in 2011 red handed :)

There are two options to scrape Google results:

1) Use their API

    You can issue around 40 requests per hour You are limited to what they give you, it's not really useful if you want to track ranking positions or what a real user would see. That's something you are not allowed to gather.

    If you want a higher amount of API requests you need to pay.
    60 requests per hour cost 2000 USD per year, more queries require a custom deal.

2) Scrape the normal result pages

    Here comes the tricky part. It is possible to scrape the normal result pages. Google does not allow it.
    If you scrape at a rate higher than 15 keyword requests per hour you risk detection, higher than 20/h will get you blocked from my experience.
    By using multiple IPs you can up the rate, so with 100 IP addresses you can scrape up to 2000 requests per hour. (50k a day)
    There is an open source search engine scraper written in PHP at http://scraping.compunect.com It allows to reliable scrape Google, parses the results properly and manages IP addresses, delays, etc. So if you can use PHP it's a nice kickstart, otherwise the code will still be useful to learn how it is done.


Source: http://stackoverflow.com/questions/22657548/is-it-ok-to-scrape-data-from-google-results

Wednesday, 3 September 2014

Excel VBA Data Mining Real-Time Data from a Web Page that Refreshes Data

I want to capture real-time data that updates into a table on a webpage; I prefer capturing it into excel using VBA, but I will write it in .NET C# or VB if I that is easier.

the data updates about 1 or 2 seconds, and I want to just grab the latest data quotes and log it into my spreadsheet; the table names are the same, only the data refreshes, and it does so automatically on the web page.

I've done a lot of Excel VBA and I know how to download a URL to a file--this is NOT what I want; I want to gain access to my webpage that is active and grab the data updates after I've logged into my site and selected a webpage that I like.

Is there a simple way to access this data on the webpage from Excel or .Net? Because it refreshes no more than once every 1 or 2 seconds, it is easy to just keep checking it for updates, and I can compare the latest data to see if it actually refreshed.


In Excel 2003, use Data/Import External Data/New Web Query
Browse to your page and select the table you want to import.
After that you can either do a manual Refresh, or use a timer procedure to do something like:

Source: http://stackoverflow.com/questions/9855794/excel-vba-data-mining-real-time-data-from-a-web-page-that-refreshes-data

Tuesday, 2 September 2014

Need to pull data from a website…web query? macro?


I have a list of every DOT # (Dept. of Trans.) in the country. I want to find out insurance effective date for each one of these companies. If you go to http://li-public.fmcsa.dot.gov --> "continue" --> then from the dropdown select "carrier search" and hit "go" it'll take you to a search form (that is the only way to get to this screen).

From there, you can input a DOT # X (use 61222 as an example) and it'll bring you to another screen. Click "view report in HTML" and then down on the bottom you'll see "Active/Pending Insurance". I want to pull the "effective date" from that page and stick it in the spreadsheet next to the DOT # X that I already know.

Of the thousands of DOT #'s in my list, not all will have filings on this website, if that makes a difference.

Can this be done with a Macro or Excel Web Query? I know I probably sound like a total novice, but I'd appreciate any help I could get.

Can you do it? Frankly even if you could you'd lock up the spreadsheet while it's doing that processing. And in the end, how would you handle an error half-way through?

I'd not do this in a client-facing application. This sounds more like something to do in server-side app that can do the processing and gather the information in a more controlled environment. Then you Excel spreadsheet could query that app and get the information in one fell swoop. Error handling is much simpler and you don't end up sitting there staring at Excel why it works its way through thousands of web sites. It was not built to do that elegantly.

What do you write the web service I'm describing in? Well it depends on your preference. Me, I'd write it in Ruby on Rails since it can easily handle the scraping aspect of the task and can report the data out easily as well. But it really falls back to whatever you're most comfortable coding in.


Source: http://stackoverflow.com/questions/15286429/need-to-pull-data-from-a-website-web-query-macro

How to extract data from web 2.0 graphs using a scraper


I have recently come across a web page containing a graph object that displays the (x, y) values on the object as the

mouse is rolled across it. Is there any way to automate the extraction of this data?

How is the graph data loaded? If embedded in the page source then you can extract it with xpath or regex. Else use

Firebug to see how it is loaded.



You will need a solution that works inside the web browser, so the AJAX/Javascript is properly rendered.

I have used iMacros with good success for web scraping in the past. There are free/open-source and "PRO" paid editions

(comparison table here).

Another option is always to custom code something with the Microsoft webbrowser control.


Source: http://stackoverflow.com/questions/3980774/how-to-extract-data-from-web-2-0-graphs-using-a-scraper

Legality of Web Scraping vs Normal Use


I know the topic of web scraping has been discussed before (example), and I understand it's a bit of a grey area

depending on a lot of factors (e.g. website's terms of use).

What I'd like to ask is: how is web scraping any different from (a) how we access the webpage via a web browser, and

(b) how web crawlers (e.g. Google) download and index webpages?

Without knowing the legal background, I can't help but think that they're all just HTTP requests. If web scraping is

illegal, then so should crawling and indexing (for instance be illegal).

Of course if your program is hitting the server so hard that it causes a denial of service, it's a different story

altogether... my point is simply accessing and using data that is already open to the public.



I know this is a dead thread, but it would be nice to place some legal implications here due to its ranking in my

Google Search. I cannot help but figure I am not the only one who searches like I do.

Legally, in the US, there are a few factors that seem to be important.

    Are you doing anything that is akin to hacking or gaining unauthorized access via the Computer Fraud and Abuse Act.

Exploiting vulnerabilities and passing SQL in the URL to open a database no matter how bad the idiot programming like

that was is illegal with a 15 year sentence (see the cases where an individual exploited security vulnerabilities in

Verizon). Also, add a time out even if you round robin or use proxies. DDoS attacks are attacks. 1000 requests per

second can shut down a lot of servers providing public information. The result here is up to 15 years in jail.

    Copyright Law: As mentioned, pure replication of data is illegal. Even 4% replication has been deemed a breach.

With the recent gutting of the DMCA, a person is even more vulnerable to civil and criminal penalties.

    Trespass and Chattels: The following from wikipedia says it all.

    U.S. courts have acknowledged that users of "scrapers" or "robots" may be held liable for committing trespass to

chattels,[5][6] which involves a computer system itself being considered personal property upon which the user of a

scraper is trespassing. The best known of these cases, eBay v. Bidder's Edge, resulted in an injunction ordering

Bidder's Edge to stop accessing, collecting, and indexing auctions from the eBay web site.

    Paywalls and Product: When going behind paywalls and breaching contract by clicking an agreement not to do

something and then doing it, you add fuel to the protection of negligence v. willingness [an issue for damages and

penalties not guilt] in civil and any criminal trials. (sorry originally wanted to say ignorance but it really isn't a

defense)

    International: EU law and other law is way more lax. Corporations with big budgets dominate our legal landscape.

They control the system in a very real way with their $$$.

Basically, get public information and information that is available without going behind a pay wall. Think like a user

of the internet and combine a bunch of sources into a unique product. Don't just 'steal' an entire site (it isn't

really stealing if it is a government site that offers public data especially for download but is if you download all

or even more than a couple of the listings on ebay). Read the terms and conditions to know who actually owns the

content.

Here are a few examples. Trulia owns its information but you could use it to go to an agents website and collect a

legal amount of information. The legal amount is determinable. However, a public MLS listing lookup site with no

agreement or terms and offering data to the public is fair game. The MLS numbers lists, however, are normally not fair

game.

If a researcher can get to data, so can you. If a researcher needs permission, so do you. A computer is like having a

million corporate researchers at your disposal.

AS for company policy, it is usually used internally to shield from liability and serves as a warning but is not

entirely enforceable. The legal parts letting you know about copyrights and such are and usually are supposed to be

known by everyone. Complete ignorance is not a legal protection. It does provide a ground set of rules. Be nice, or get

banned is that message as far as I know.

My personal strategy is to start with public data and embellish it within legal means.


Source: http://stackoverflow.com/questions/14735791/legality-of-web-scraping-vs-normal-use

Anyone knows an online tool that can scrape a page and create a REST API for the scraped data?


I'm looking for a SaaS solution that is able to login to a platform, scrape data (reports) and then allow accessing the

data through an API. I have some reporting platforms that provide web reporting and email reporting but with no API.

Online reporting doesn't help and email reporting, although can be automated and scraped, isn't so reliable.

If you are willing to do the scraping through your own connection, have a look at Import IO. They have a desktop

application that you use to teach the system how to scrape a page, and then you run the crawler from that application -

and you can run it for as long as you like, as far as I can tell.

You may then upload your data to the Import cloud, from where it is available via an API on the import.io servers.

Useful data can be made public to donate it "to the commons" if you wish.


I did some more digging, found iMacros as a possible solution. Its Windows based, which is a drawback in my case, but

it does allow automation of the scraping and afterwards interaction via common web scripting languages like PHP and

ASP.net.


If you are familiar with jQuery, I think you can use node.js and Cheerio module, then you can create a simple

application to do auto scraping. Actually I have already built a site to do on line web scraping based on the above

mentioned tech, the site is www.datafiddle.net, you can take a look at it.


Source: http://stackoverflow.com/questions/19646028/anyone-knows-an-online-tool-that-can-scrape-a-page-and-create-a-

rest-api-for-the

Wednesday, 27 August 2014

Scrapy, scraping price data from StubHub


I've been having a difficult time with this one.

I want to scrape all the prices listed for this Bruno Mars concert at the Hollywood Bowl so I can get the average price.

http://www.stubhub.com/bruno-mars-tickets/bruno-mars-hollywood-hollywood-bowl-31-5-2014-4449604/

I've located the prices in the HTML and the xpath is pretty straightforward but I cannot get any values to return.

I think it has something to do with the content being generated via javascript or ajax but I can't figure out how to send the correct request to get the code to work.

Here's what I have:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector

from deeptix.items import DeeptixItem

class TicketSpider(BaseSpider):
    name = "deeptix"
    allowed_domains = ["stubhub.com"]
    start_urls = ["http://www.stubhub.com/bruno-mars-tickets/bruno-mars-hollywood-hollywood-bowl-31-5-2014-4449604/"]

def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('//div[contains(@class, "q_cont")]')
    items = []
    for site in sites:
        item = DeeptixItem()
        item['price'] = site.xpath('span[contains(@class, "q")]/text()').extract()
        items.append(item)
    return items

Any help would be greatly appreciated I've been struggling with this one for quite some time now. Thank you in advance!


Source: http://stackoverflow.com/questions/22770917/scrapy-scraping-price-data-from-stubhub

Tuesday, 26 August 2014

using Perl to scrape a website

I am interested in writing a perl script that goes to the following link and extracts the number 1975: https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219

That website is the amount of white men born in the year 1923 who live in San Diego County, California in 1940. I am trying to do this in a loop structure to generalize over multiple counties and birth years.

In the file, locations.txt, I put the list of counties, such as San Diego County.

The current code runs, but instead of the # 1975, it displays unknown. The number 1975 should be in $val\n.

I would very much appreciate any help!

#!/usr/bin/perl

use strict;

use LWP::Simple;

open(L, "locations26.txt");

my $url = 'https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3A%22California%22%20%2Bevent_place_level_2%3A%22%LOCATION%%22%20%2Bbirth_year%3A%YEAR%-%YEAR%~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219';

open(O, ">out26.txt");
 my $oldh = select(O);
 $| = 1;
 select($oldh);
 while (my $location = <L>) {
     chomp($location);
     $location =~ s/ /+/g;
      foreach my $year (1923..1923) {
                 my $u = $url;
                 $u =~ s/%LOCATION%/$location/;
                 $u =~ s/%YEAR%/$year/;
                 #print "$u\n";
                 my $content = get($u);
                 my $val = 'unknown';
                 if ($content =~ / of .strong.([0-9,]+)..strong. /) {
                         $val = $1;
                 }
                 $val =~ s/,//g;
                 $location =~ s/\+/ /g;
                 print "'$location',$year,$val\n";
                 print O "'$location',$year,$val\n";
         }
     }

Update: API is not a viable solution. I have been in contact with the site developer. The API does not apply to that part of the webpage. Hence, any solution pertaining to JSON will not be applicbale.



Source: http://stackoverflow.com/questions/14654288/using-perl-to-scrape-a-website

Monday, 25 August 2014

Data Scraping using php


Here is my code

    $ip=$_SERVER['REMOTE_ADDR'];

    $url=file_get_contents("http://whatismyipaddress.com/ip/$ip");

    preg_match_all('/<th>(.*?)<\/th><td>(.*?)<\/td>/s',$url,$output,PREG_SET_ORDER);

    $isp=$output[1][2];

    $city=$output[9][2];

    $state=$output[8][2];

    $zipcode=$output[12][2];

    $country=$output[7][2];

    ?>
    <body>
    <table align="center">
    <tr><td>ISP :</td><td><?php echo $isp;?></td></tr>
    <tr><td>City :</td><td><?php echo $city;?></td></tr>
    <tr><td>State :</td><td><?php echo $state;?></td></tr>
    <tr><td>Zipcode :</td><td><?php echo $zipcode;?></td></tr>
    <tr><td>Country :</td><td><?php echo $country;?></td></tr>
    </table>
    </body>

How do I find out the ISP provider of a person viewing a PHP page?

Is it possible to use PHP to track or reveal it?

Error: http://i.imgur.com/LGWI8.png

Curl Scrapping

<?php
$curl_handle=curl_init();
curl_setopt( $curl_handle, CURLOPT_FOLLOWLOCATION, true );
$url='http://www.whatismyipaddress.com/ip/132.123.23.23';
curl_setopt($curl_handle, CURLOPT_URL,$url);
curl_setopt($curl_handle, CURLOPT_HTTPHEADER, Array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15") );
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'Your application name');
$query = curl_exec($curl_handle);

curl_close($curl_handle);
preg_match_all('/<th>(.*?)<\/th><td>(.*?)<\/td>/s',$url,$output,PREG_SET_ORDER);
echo $query;
$isp=$output[1][2];

$city=$output[9][2];

$state=$output[8][2];

$zipcode=$output[12][2];

$country=$output[7][2];
?>
<body>
<table align="center">
<tr><td>ISP :</td><td><?php echo $isp;?></td></tr>
<tr><td>City :</td><td><?php echo $city;?></td></tr>
<tr><td>State :</td><td><?php echo $state;?></td></tr>
<tr><td>Zipcode :</td><td><?php echo $zipcode;?></td></tr>
<tr><td>Country :</td><td><?php echo $country;?></td></tr>
</table>
</body>

Error: http://i.imgur.com/FJIq6.png

What's is wrong with my code here? Any alternative code , that i can use here.

I am not able to scrape that data as described here. http://i.imgur.com/FJIq6.png

P.S. Please post full code. It would be easier for me to understand.



Source: http://stackoverflow.com/questions/10461088/data-scraping-using-php

PDF scraping using R


I have been using the XML package successfully for extracting HTML tables but want to extend to PDF's. From previous questions it does not appear that there is a simple R solution but wondered if there had been any recent developments

Failing that, is there some way in Python (in which I am a complete Novice) to obtain and manipulate pdfs so that I could finish the job off with the R XML package

Extracting text from PDFs is hard, and nearly always requires lots of care.

I'd start with the command line tools such as pdftotext and see what they spit out. The problem is that PDFs can store the text in any order, can use awkward font encodings, and can do things like use ligature characters (the joined up 'ff' and 'ij' that you see in proper typesetting) to throw you.

pdftotext is installable on any Linux system



Source: http://stackoverflow.com/questions/7918718/pdf-scraping-using-r

Sunday, 24 August 2014

Obtaining reddit data

I am interested in obtaining data from different reddit subreddits. Does anyone know

if there is a reddit/other api similar like twitter does to crawl all the pages?


Yes, reddit has an API that can be used for a variety of purposes such as data

collection, automatic commenting bots, or even to assist in subreddit moderation.

There are a few places to discover information on reddit's API:

    github reddit wiki -- provides the overview and rules for using reddit's API

(follow the rules)
    automatically generated API docs -- provides information on the requests needed to

access most of the API endpoints
    /r/redditdev -- the reddit community dedicated to answering questions both about

reddit's source code and about reddit's API

If there is a particular programming language you are already familiar with, you

should check out the existing set of API wrappers for various languages. Despite my

bias (I am the package maintainer) I am quite certain PRAW, for python, has support

for the largest number of reddit API features.



Source: http://stackoverflow.com/questions/14322834/obtaining-reddit-data

Wednesday, 20 August 2014

Web Scraping data from different sites


I am looking for a few ideas on how can I solve a design problem I'm going to be faced with building a web scraper to scrape multiple sites. Writing the scraper(s) is not the problem, matching the data from different sites (which may have small differences) is.

For the sake of being generic assume that I am scraping something like this from two or more different sites:

    public class Data {
        public int id;
        public String firstname;
        public String surname;
        ....
    }

If i scrape this from two different sites, I will encounter the situation where I could have the following:

Site A: id=100, firstname=William, surname=Doe

Site B: id=1974, firstname=Bill, surname=Doe

Essentially, I would like to consider these two sets of data the same (they are the same person but with their name slightly different on each site). I am looking for possible design solutions that can handle this.

The only idea I've come up with is scraping the data from a third location and using it as a reference list. Then when I scrape site A or B I can, over time, build up a list of failures and store them in a list for each scraper so that it can know (if i find id=100 then i know that the firstname will be William etc). I can't help but feel this is a rubbish idea!

If you need any more info, or if you think my description is a bit naff, let me know!

Thanks,

DMcB


Source: http://stackoverflow.com/questions/23970057/web-scraping-data-from-different-sites

Tuesday, 19 August 2014

Scrape Data Point Using Python

I am looking to scrape a data point using Python off of the url http://www.cavirtex.com/orderbook .

The data point I am looking to scrape is the lowest bid offer, which at the current moment looks like this:

<tr>
 <td><b>Jan. 19, 2014, 2:37 a.m.</b></td>
 <td><b>0.0775/0.1146</b></td>
 <td><b>860.00000</b></td>
 <td><b>66.65 CAD</b></td>
</tr>

The relevant point being the 860.00 . I am looking to build this into a script which can send me an email to alert me of certain price differentials compared to other exchanges.

I'm quite noobie so if in your explanations you could offer your thought process on why you've done certain things it would be very much appreciated.

Thank you in advance!

Edit: This is what I have so far which will return me the name of the title correctly, I'm having trouble grabbing the table data though.

import urllib2, sys
from bs4 import BeautifulSoup

site= "http://cavirtex.com/orderbook"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site,headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
print soup.title



Here is the code for scraping the lowest bid from the 'Buying BTC' table:

from selenium import webdriver

fp = webdriver.FirefoxProfile()
browser = webdriver.Firefox(firefox_profile=fp)
browser.get('http://www.cavirtex.com/orderbook')

lowest_bid = float('inf')
elements = browser.find_elements_by_xpath('//div[@id="orderbook_buy"]/table/tbody/tr/td')

for element in elements:
    text = element.get_attribute('innerHTML').strip('<b>|</b>')
    try:
        bid = float(text)
        if lowest_bid > bid:
            lowest_bid = bid
    except:
        pass

browser.quit()
print lowest_bid

In order to install Selenium for Python on your Windows-PC, run from a command line:

pip install selenium (or pip install selenium --upgrade if you already have it).

If you want the 'Selling BTC' table instead, then change "orderbook_buy" to "orderbook_sell".

If you want the 'Last Trades' table instead, then change "orderbook_buy" to "orderbook_trades".

Note:

If you consider performance critical, then you can implement the data-scraping via URL-Connection instead of Selenium, and have your program running much faster. However, your code will probably end up being a lot "messier", due to the tedious XML parsing that you'll be obliged to apply...

Here is the code for sending the previous output in an email from yourself to yourself:

import smtplib,ssl

def SendMail(username,password,contents):
    server = Connect(username)
    try:
        server.login(username,password)
        server.sendmail(username,username,contents)
    except smtplib.SMTPException,error:
        Print(error)
    Disconnect(server)

def Connect(username):
    serverName = username[username.index("@")+1:username.index(".")]
    while True:
        try:
            server = smtplib.SMTP(serverDict[serverName])
        except smtplib.SMTPException,error:
            Print(error)
            continue
        try:
            server.ehlo()
            if server.has_extn("starttls"):
                server.starttls()
                server.ehlo()
        except (smtplib.SMTPException,ssl.SSLError),error:
            Print(error)
            Disconnect(server)
            continue
        break
    return server

def Disconnect(server):
    try:
        server.quit()
    except smtplib.SMTPException,error:
        Print(error)

serverDict = {
    "gmail"  :"smtp.gmail.com",
    "hotmail":"smtp.live.com",
    "yahoo"  :"smtp.mail.yahoo.com"
}

SendMail("your_username@your_provider.com","your_password",str(lowest_bid))

The above code should work if your email provider is either gmail or hotmail or yahoo.

Please note that depending on your firewall configuration, it may ask your permission upon the first time you try it...



Source: http://stackoverflow.com/questions/21217034/scrape-data-point-using-python