Lawyers Data Scraping: April 2015

Tuesday, 28 April 2015

Scraping a website from a windows service

Question

Hi there. I have a windows forms application that scrapes a website to retrieve some data. I would like to implement the same functionality as a windows service. The reason for this is to allow the program to run 24/7 without having a user signed in.

To that end, my current version of the program uses a web browser control (System.Windows.Forms.WebBrowser) to navigate the pages, click the buttons, allow scripts to do their thing, etc. I cannot figure out a way to do the same without the web browser control, but the web browser control cannot be instantiated in a Windows service (because there is no user interface in a Windows service).

Does anyone have any brilliant ideas on how to get around this?

Thank you very much!

Answers

Hi Andy,

There is a tool that lets you read and manipulate whatever you need from a page's HTML: the HTML Agility Pack. This agile HTML parser builds a read/write DOM and supports plain XPath or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml offers, but for HTML documents (or streams). For more information, see:

http://htmlagilitypack.codeplex.com/

Have a nice day.

Best regards

All replies

You have not said whether you are using a Visual Studio Express edition.

You have not said which Framework version you are targeting.

You have not really said what data you are getting from the web site.

So I made an example of a service that works with any Visual Studio edition (including Express).

To install it, I assume that you have at least Framework 2.0, so you would use something similar to:

    %SystemRoot%\Microsoft.NET\Framework\v2.0.50727\installutil /i C:\Test\MyWindowService\MyWindowService\bin\Release\MyWindowService.exe

In the example, I assume that you are downloading a file from the site.

The timer is a System.Timers.Timer, because a Windows Forms timer needs a message loop and does not fire inside a service. The project needs references to System.ServiceProcess and System.Configuration.Install.

    Imports System.ServiceProcess
    Imports System.Configuration.Install

    Public Class WindowsService : Inherits ServiceBase

        Private Minute As Integer = 60000

        ' System.Timers.Timer raises Elapsed on a thread-pool thread,
        ' so it works without the message loop a Windows Forms timer needs.
        Private WithEvents DownloadTimer As New System.Timers.Timer With {.Interval = 30 * Minute, .AutoReset = True}

        Public Sub New()
            Me.ServiceName = TheServiceName
            Me.EventLog.Log = "Application"
            Me.CanHandlePowerEvent = True
            Me.CanHandleSessionChangeEvent = True
            Me.CanPauseAndContinue = True
            Me.CanShutdown = True
            Me.CanStop = True
        End Sub

        ' Start and stop the timer together with the service.
        Protected Overrides Sub OnStart(ByVal args() As String)
            DownloadTimer.Start()
        End Sub

        Protected Overrides Sub OnStop()
            DownloadTimer.Stop()
        End Sub

        Private Sub DownloadTimer_Elapsed(ByVal sender As Object, ByVal e As System.Timers.ElapsedEventArgs) Handles DownloadTimer.Elapsed
            If IO.File.Exists("C:\MyPath.Data") Then IO.File.Delete("C:\MyPath.Data")
            My.Computer.Network.DownloadFile("http://MyURL.com", "C:\MyPath.Data", "MyUserName", "MyPassword")
            'Do Something with the data downloaded
        End Sub

    End Class

    <Microsoft.VisualBasic.HideModuleName()> _
    Module MainModule

        ' A constant, so the installer sees the same name even though
        ' installutil never runs Main.
        Public Const TheServiceName As String = "MyService"

        Public Sub Main()
            Dim TheServiceApplication As New WindowsService
            ServiceBase.Run(TheServiceApplication)
        End Sub

    End Module

    <System.ComponentModel.RunInstaller(True)> _
    Public Class WindowsServiceInstaller : Inherits Installer

        Public Sub New()
            Dim serviceProcessInstaller As ServiceProcessInstaller = New ServiceProcessInstaller()
            Dim serviceInstaller As ServiceInstaller = New ServiceInstaller()

            serviceProcessInstaller.Account = ServiceAccount.LocalSystem
            serviceProcessInstaller.Username = Nothing
            serviceProcessInstaller.Password = Nothing

            serviceInstaller.DisplayName = "My Windows Service"
            serviceInstaller.StartType = ServiceStartMode.Automatic
            serviceInstaller.ServiceName = TheServiceName

            Me.Installers.Add(serviceProcessInstaller)
            Me.Installers.Add(serviceInstaller)
        End Sub

    End Class

Hello Andy,

Thanks for your post.

What do you want to scrape from the page? The HttpWebRequest and WebClient classes may be what you need. For more information, see the following:

The HttpWebRequest class provides support for the properties and methods defined in WebRequest and for additional properties and methods that enable the user to interact directly with servers using HTTP.

http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.aspx

The WebClient class provides common methods for sending data to or receiving data from any local, intranet, or Internet resource identified by a URI

http://msdn.microsoft.com/en-us/library/system.net.webclient.aspx

If you have any concerns, please feel free to follow up.

Best regards
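To illustrate why a plain web request can stand in for the browser control in simple cases: filling in a form and clicking its submit button usually results in an ordinary HTTP POST whose body holds the form fields. The sketch below is in Python rather than VB.NET purely for brevity, and the URL and the field names ("state", "query") are hypothetical. It only builds the request and does not send it. Pages that depend on client-side scripts may need more than this.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_post(action_url, fields):
    """Build (but do not send) the POST a browser would issue on form submit."""
    body = urlencode(fields).encode("ascii")
    return Request(
        action_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Hypothetical form: a drop-down named "state" and a text box named "query".
request = build_form_post(
    "http://example.com/search",
    {"state": "NY", "query": "court records"},
)
print(request.get_method())    # POST
print(request.data.decode())   # state=NY&query=court+records
```

Passing the request to urllib.request.urlopen would send it and return the resulting page. In .NET, the equivalent is an HttpWebRequest with its Method set to "POST" and the same encoded body written to the request stream.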

Hi Andy,

How is this issue going on your side? If you have any concerns, please feel free to follow up.

Have a nice day.

Best regards

Hi Andy,

When you come back, if you need further assistance about this issue, please feel free to let us know. We will continue to work with this issue.

Have a nice day.

Best regards

Thank you for the reply. Sorry it has taken me so long to respond. I did not receive any notification that someone had replied!

I am using Visual Studio 2010 Ultimate Edition and the .NET Framework 4.0. Actually, I am upgrading some old code written in VB 6.0, but I can use the latest and greatest that's available.

The application uses a browser control to go to the page, fill in values, click on UI elements, read the HTML that returns, etc. The purpose of the application is to collect useful information regularly and automatically.

I know how to create a Windows service, but using the web browser control in such a service is problematic because the control was meant to be placed on a Windows form. I am not able to create a new instance of it in a project designated as a Windows service.

Andy

Thank you for the reply. Sorry it has taken me so long to respond. I did not receive any notification that someone had replied!

I thought a web request was for web services (retrieving information from them). I am trying to retrieve useful information from a website designed for interaction by a human, such as selecting items from lists and clicking buttons. I currently use a web browser control to programmatically do what a person would do and get the pages back, which in turn get parsed.

Andy

Hi Andy,

There is a tool that lets you read and manipulate whatever you need from a page's HTML: the HTML Agility Pack. This agile HTML parser builds a read/write DOM and supports plain XPath or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml offers, but for HTML documents (or streams). For more information, see:

http://htmlagilitypack.codeplex.com/

Have a nice day.

Best regards
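As a rough illustration of what a tolerant HTML parser buys you, the sketch below pulls form field names and values out of deliberately malformed markup. It uses Python's standard html.parser module as a stand-in, not the HTML Agility Pack itself, and the page content and field names are made up. Hidden fields matter in practice because a scraper usually has to send them back when it posts the form.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect the name/value pairs of every <input> element on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and "name" in attributes:
            # A missing value attribute is treated as an empty string.
            self.fields[attributes["name"]] = attributes.get("value") or ""


# Deliberately malformed: unclosed <p>, unquoted attributes, missing </form>.
page = """
<html><body><form action=/search method=post>
<p>Search
<input type=hidden name=token value=abc123>
<input type="text" name="query">
</body></html>
"""

collector = FormFieldCollector()
collector.feed(page)
print(collector.fields)   # {'token': 'abc123', 'query': ''}
```

With the HTML Agility Pack the same extraction would typically be an XPath query over the parsed document instead of a hand-written handler.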

Thanks for the suggestion. I will go to that link and see if it will work. I will update this post with what I find.

I am writing to check the status of the issue on your side. Would you mind letting us know the result of the suggestions? If you have any concerns, please feel free to follow up.

Have a nice day.

Best regards

Hi Liliane

Thanks for the follow-up reply. I don't have an answer as of yet. Implementing this is going to take time and I haven't been given the go-ahead by my boss to spend the time to pursue it.

Hi Andy,

Never mind. You can give it a try when you have time. If you have any further questions about this issue, please feel free to let us know. We will continue to work with you on this issue.

Have a nice day.

Best regards

Source: https://social.msdn.microsoft.com/Forums/vstudio/en-US/f5d565b1-236b-43c2-90c7-f5cc3b2c341b/scraping-a-website-from-a-windows-service

Saturday, 25 April 2015

Scraping the Bottom of the Barrel - The Perils of Online Article Marketing

Many online article marketers so desperately wish to succeed, they want to dump corporate life and work for themselves out of their home. They decide they are going to create an online money making website. Therefore, they look around to see what everyone else is doing, and watch the methods others use to attract online buyers, and then they mimic their marketing, their strategies, and their business models.

Still, if you are copying what other (less ethical) people are doing in online article marketing, those who are scraping the bottom of the barrel and using false advertising and misrepresentations, then all you are really doing is perpetuating distrust on the Internet. Therefore, you are hurting everyone, including people like me. You must realize that people like me don't appreciate that.

Let me give you a few examples of some of the things going on out there, things that are being done by people who are ethically challenged. Far too many people write articles and then, in their byline, send the Internet surfer or reader of the article to a website that has a squeeze page. The squeeze page has no real information on it; rather, it asks for their name and e-mail address.

If the Internet surfer is unwise enough to type in their name and e-mail address, they will be spammed with various hard-sell marketing pieces. Once the address is entered, the website grants them access and takes them to the page with information about what is being sold, or the online marketing "make you a millionaire" scheme.

Generally, these are five page sales letters, with tons of testimonials of people you've never heard of, and may not actually exist, and all sorts of unsubstantiated earnings claims of how much money you will make if you give them $39.35 by way of PayPal, for this limited offer "Now!" And they will send you an E-book with a strategic plan of how you can duplicate what they are doing. The reality is whatever they are doing is questionable to begin with.

If you are going to do online article marketing, please don't scrape the bottom of the barrel. There's just too much competition down there from what I can see. Please consider all this.

Source: http://ezinearticles.com/?Scraping-the-Bottom-of-the-Barrel---The-Perils-of-Online-Article-Marketing&id=2710103

Wednesday, 22 April 2015

Hand Scraped Flooring For a Natural and Unique Look

An option in hardwood flooring that is being increasingly adopted by those looking for something new, innovative and unique for their homes is hand scraped flooring. This type of wood flooring helps one achieve a distinct natural look on one's floor and also has a couple of advantages.

There are three types of scraping that you can get done on your wooden flooring: light, medium and hard. Preferably, if you have light-colored woodwork, then you should go for light scraping, and if your floor has a darker shade, then you should opt for hard scraping. But, irrespective of the type of scraping you go for, you must ensure that the laborers doing the scraping are highly skilled and meticulous, as hand scraping floors is an art that demands patience, time, talent and hard work.

Nowadays, many people tend to go for machine scraping, attracted by the lower investment involved in it. But such people are unable to achieve the requisite natural effect on their floors as machines create patterns on the floors that are easily detectable. These patterns do not emerge with hand scraping and the consequent look is as random and unique as it gets.

Though such scraped flooring is a costly option in flooring, it demands little maintenance. While with perfectly smooth surfaces, you will be always on the edge ensuring that there are no scratches, with hand scraped floors, you will not have to be concerned about this as any new scratches will only add to the already distressed appearance of the flooring.

Prefinished hand scraped wood flooring is also available in the market nowadays. It eliminates the need for any on-site scraping. But this option is of course unsuitable for those who already have their floors installed. In any case, if you get on-site scraping done, you will have more control over things, as you will be able to see the scraping as it develops and will therefore be in a better position to exercise your preferences.

Source: http://ezinearticles.com/?Hand-Scraped-Flooring-For-a-Natural-and-Unique-Look&id=4581623

Thursday, 9 April 2015

How to Generate Sales Leads Using Web Scraping Services

The first stage of any selling process is what is popularly known as “lead generation”. This phase is what most businesses place at the apex of their sales concerns. It is a driving force that governs decision-making at its highest levels, and influences business strategy and planning. If you are about to embark on an outbound sales campaign and are in the process of looking for leads, you would acknowledge that the lead generation process is of extreme importance for any business.

Different lead generation techniques have been used over and over again by companies around the world to satiate this growing business need. Newer, more innovative methods have also emerged to help marketers in this process. One such method of lead generation that is fast catching on, and is poised to play a big role for businesses in the coming years, is web scraping. With web scraping, you can easily get access to multiple relevant and highly customized leads – a perfect starting point for any marketing, promotional or sales campaign.

The prominence of Web Scraping in overall marketing strategy

At present, levels of competition have risen sky high for most businesses. For success, generating leads and gaining insight into customer behavior and preferences are essential business requirements. Web scraping is the process of scraping or mining the internet for information. Different tools and techniques can be used to harvest information from multiple internet sources based on relevance, and the results are then structured and organized in a way that makes sense to your business. Companies that provide web scraping services essentially use web scrapers to generate a targeted lead database that your company can then integrate into its marketing and sales strategies and plans.

The actual process of web scraping involves creating scraping scripts or algorithms which crawl the web for information based on certain preset parameters and options. The scraping process can be customized and tuned towards finding the kind of data that your business needs. The script can extract data from websites automatically, collate and put together a meaningful collection of leads for business development.
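The "preset parameters" idea can be sketched roughly as follows, assuming the pages have already been downloaded. The keyword list is the parameter, and the page contents, URLs and e-mail addresses are made-up examples. A real service would add crawling, deduplication across sources and compliance checks that are not shown here.

```python
import re
from html.parser import HTMLParser

EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class TextExtractor(HTMLParser):
    """Collect the visible text of a page, ignoring the markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_leads(pages, keywords):
    """Return contact e-mails found on pages that mention any keyword."""
    leads = []
    for url, html in pages.items():
        parser = TextExtractor()
        parser.feed(html)
        text = " ".join(parser.chunks)
        # The preset parameter: skip pages that match none of the keywords.
        if not any(keyword.lower() in text.lower() for keyword in keywords):
            continue
        for email in sorted(set(EMAIL_PATTERN.findall(text))):
            leads.append({"source": url, "email": email})
    return leads


# Hypothetical, already-downloaded pages.
pages = {
    "http://example.com/a": "<p>We sell <b>hardwood flooring</b>. Contact sales@example.com</p>",
    "http://example.com/b": "<p>Unrelated page. Contact info@example.org</p>",
}
leads = extract_leads(pages, ["flooring"])
print(leads)   # [{'source': 'http://example.com/a', 'email': 'sales@example.com'}]
```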

Lead Generation Basics

At a very high level, any person who has the resources and the intent to purchase your product or service qualifies as a lead. In the present scenario, you need to go far deeper than that. Marketers need to observe behavior patterns and purchasing trends to ensure that a particular person qualifies as a lead. If you have a group of people you are targeting, you need to decide who the viable leads will be, acquire their contact information and store it in a database for further action.

List buying used to be a popular way to get leads, but its efficacy has dwindled over time. Web scraping is fast emerging as a feasible lead generation technique, allowing you to find highly focused and targeted leads in short amounts of time. All you need is a service provider that would carry out the data mining necessary for lead generation, and you end up with a list of actionable leads that you can try selling to.

How Web Scraping makes a substantial difference

With web scraping, you can extract valuable predictive information from websites. Web scraping facilitates high quality data collection and allows you to structure marketing and sales campaigns better. To drive sales and maximize revenue, you need strong, viable leads. To facilitate this, you need critical data which encompasses customer behavior, contact details, buying patterns and trends, willingness and ability to spend resources, and a myriad of other aspects critical to ascertaining the potential of an entity as a rewarding lead. Data mining through web scraping can be a great way to get at these factors and identify the leads that would make a difference for your business.

Crawling through many different web locales using different techniques, web scraping services pick up a wealth of information. This highly relevant and specialized information instantly provides your business with actionable leads. Furthermore, this exercise allows you to fine-tune your data management processes, make more accurate and reliable predictions and projections, arrive at more effective, strategic and marketing decisions and customize your workflow and business development to better suit the current market.

The Process and the Tools

Lead generation, being one of the most important processes for any business, can prove to be an expensive proposition if not handled strategically. Companies spend large amounts of their resources acquiring viable leads they can sell to. With web scraping, you can dramatically cut down the costs involved in lead generation and take your business forward with speed and efficiency. Here are some of the time-tested web scraping tools which can come in handy for lead generation –

•    Website download software – Used to copy entire websites to local storage. All website pages are downloaded and the hierarchy of navigation and internal links is preserved. The stored pages can then be viewed and scoured for information at any later time.

•    Web scraper – Tools that crawl through bulk information on the internet, extracting specific, relevant data using a set of pre-defined parameters.

•    Data grabber – Sifts through websites and databases fast and extracts all the information, which can be sorted and classified later.

•    Text extractor – Can be used to scrape multiple websites or locations for acquiring text content from websites and web documents. It can mine data from a variety of text file formats and platforms.

With these tools, web scraping services scrape websites for lead generation and provide your business with a set of strong, actionable leads that can make a difference.
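One building block shared by the tools listed above is link extraction: to follow or preserve a site's internal links, each href has to be resolved against the page it was found on. A minimal sketch, with made-up URLs and markup:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect every <a href> on a page as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def internal_links(base_url, html):
    """Return only the links that stay on the same host as base_url."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    return [link for link in collector.links if urlparse(link).netloc == host]


# Hypothetical page with one absolute-path, one relative and one external link.
html = (
    '<a href="/about">About</a> '
    '<a href="team.html">Team</a> '
    '<a href="http://other.example/x">Other</a>'
)
links = internal_links("http://example.com/company/index.html", html)
print(links)
# ['http://example.com/about', 'http://example.com/company/team.html']
```

A downloader would add each returned link to a queue of pages still to fetch, skipping those it has already seen.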

Covering all Bases

The strength of web scraping and web crawling lies in the fact that it covers all the necessary bases when it comes to lead generation. Data is harvested, structured, categorized and organized in such a way that businesses can easily use the data provided for their sales leads. As discussed earlier, cold and detached lists no longer provide you with enough actionable leads. You need to look at various factors and consider them during your lead generation efforts –

•    Contact details of the prospect

•    Purchasing power and purchasing history of the prospect

•    Past purchasing trends, willingness to purchase and history of buying preferences of the prospect

•    Social markers that are indicative of behavioral patterns

•    Commercial and business markers that are indicative of behavioral patterns

•    Transactional details

•    Other factors including age, gender, demography, social circles, language and interests

All these factors need to be taken into account and considered in detail if you want to determine whether a lead is viable and actionable. With web scraping you can get enough data about every single prospect, connect all the data collected with the help of onboarding, and ascertain with conviction whether a particular prospect will be viable for your business.

Let us take a look at how web scraping addresses these different factors –

1. Scraping websites

During the scraping process, all websites where a particular prospect has some participation are crawled for data. Seemingly disjointed data can be made into a sensible unit by the use of onboarding, that is, linking user activities with their online entities with the help of user IDs. Documents can be scanned for participation. E-commerce portals can be scanned to find comments and ratings a prospect might have delivered to certain products. Service providers’ websites can be scraped to find if the prospect has given a testimonial to any particular service. All these details can then be accumulated into a meaningful data collection that is indicative of the purchasing power and intent of the prospect, along with important data about buying preferences and tastes.
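The linking step described above amounts to grouping scraped records by a shared user ID. The sketch below assumes such an ID is available in every record, which is often the hard part in practice. The record layout, site names and activities are invented for illustration.

```python
from collections import defaultdict


def build_profiles(records):
    """Group scraped activity records into one profile per user ID."""
    profiles = defaultdict(lambda: {"sources": set(), "activities": []})
    for record in records:
        profile = profiles[record["user_id"]]
        profile["sources"].add(record["source"])
        profile["activities"].append(record["activity"])
    return dict(profiles)


# Hypothetical records scraped from different sites.
records = [
    {"user_id": "u42", "source": "shop.example", "activity": "rated a laptop 4/5"},
    {"user_id": "u42", "source": "forum.example", "activity": "asked about SSD upgrades"},
    {"user_id": "u7", "source": "shop.example", "activity": "reviewed a desk lamp"},
]
profiles = build_profiles(records)
print(sorted(profiles["u42"]["sources"]))   # ['forum.example', 'shop.example']
print(len(profiles["u42"]["activities"]))   # 2
```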

2. Social scraping

According to a study, most internet users spend upwards of two hours every day on social networks. Therefore, scraping social networks is a great way to explore prospects in detail. Initially, you can get important identification markers like names, addresses, contact numbers and email addresses. Further, social networks can also supply information about age, gender, demography and language choices. From this basic starting point, further details can be added by scraping social activity over long periods of time and looking for activities which indicate purchasing preferences, trends and interests. This exercise provides highly relevant and targeted information about prospects that can be constructively used while designing sales campaigns.

3. Transaction scraping

Through the scraping of transactions, you get a clear idea about the purchasing power of prospects. If you are looking for certain income groups or leads that invest in certain market sectors or during certain specific periods of time, transaction scraping is the best way to harvest meaningful information. This also helps you with competition analysis and provides you with pointers to fine-tune your marketing and sales strategies.

Using these varied lead generation techniques and finding the right balance and combination is key to securing the right leads for your business. Overall, signing up for web scraping services can be a make or break factor for your business going forward. With a steady supply of valuable leads, you can supercharge your sales, maximize returns and craft the perfect marketing maneuvers to take your business to an altogether new dimension.

Source: https://www.promptcloud.com/blog/how-to-generate-sales-leads-using-web-scraping-services/

Tuesday, 7 April 2015

The Nasty Problem with Scraping Results from the Engines

One theme that I've been concerned with this week centers around data transparency in the search engine world. Search engines provide information that is critical to the business of optimizing and growing a business on the web, yet barriers to this data currently force many companies to use methods of data extraction that violate the search engines' terms of service.

Specifically, we're talking about two pieces of information that no large-scale, successful web operation should be without. These include rankings (the position of their site(s) vs. their competitors) for important keywords and link data (currently provided most accurately through Yahoo!, but also available through MSN and in lower quality formats from Google).

Why do marketers and businesses need this data so badly? First we'll look at rankings:

•    For large sites in particular, rankings across the board will go up or down based on their actions and the actions of their competition. Any serious company that fails to monitor tweaks to its site, public relations, press and optimization tactics in this way will lose out to competitors who do track this data and, thus, can make intelligent business decisions based on it.

•    Rankings provide a benchmark that helps companies estimate their global reach in the search results and make predictions about whether certain areas of extension or growth make logical sense. If a company must decide on how to expand their content or what new keywords to target or even if they can compete in new markets, the business intelligence that can be extracted from large swaths of ranking data is critical.

•    Rankings can be mapped directly to traffic, allowing companies to consider advertising, extending their reach or forming partnerships.

And, on the link data side:

•    Temporal link information allows marketers to see what effects certain link building, public relations and press efforts have on a site's link profile. Although some of this data is available through referring links in analytics programs, many folks are much more interested in the links that search engines know about and count, which often includes many more than those that pass traffic (and also ignores/doesn't count some that do pass traffic).

•    Link data may provide references for reputation management or tracking of viral campaigns - again, items that analytics don't entirely encompass.

•    Competitive link data may be of critical importance to many marketers - this information can't be tracked any other way.
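The temporal comparison described in the first bullet boils down to diffing two snapshots of the links an engine reports for a site. A minimal sketch, assuming the snapshots have already been collected as lists of URLs (the URLs here are invented):

```python
def diff_link_snapshots(previous, current):
    """Compare two snapshots of known inbound links."""
    previous, current = set(previous), set(current)
    return {
        "gained": sorted(current - previous),
        "lost": sorted(previous - current),
    }


# Hypothetical snapshots taken before and after a press campaign.
before = ["http://blog.example/post", "http://news.example/story"]
after = ["http://news.example/story", "http://forum.example/thread"]
changes = diff_link_snapshots(before, after)
print(changes)
# {'gained': ['http://forum.example/thread'], 'lost': ['http://blog.example/post']}
```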

I admit it. SEOmoz is a search engine scraper - we do it for our free public tools, for our internal research and we've even considered doing it for clients (though I'm seriously concerned about charging for data that's obtained outside TOS). Many hundreds of large firms in the search space (including a few that are 10-20X our size) do it, too. Why? Because search engine APIs aren't accurate.

Let's look at each engine's abilities and data sources individually. Since we've got a few hundred thousand points of data (if not more) on each, we're in a good position to make calls about how these systems are working.

Google (all APIs listed here):

•    Search SOAP API - provides ranking results that are massively different from almost every datacenter. The information is often less than useless, it's actually harmful, since you'll get a false sense of what's happening with your positions.

•    AJAX Search API - This is really designed to be integrated with your website, and the results can be of good quality for that purpose, but it really doesn't serve the job of providing good stats reporting.

•    AdSense & AdWords APIs - In all honesty, we haven't played around with these, but the fact that neither will report the correct order of the ads, nor will they show more than 8 ads at a time tells me that if a marketer needed this type of data, the APIs wouldn't work.

Yahoo! (APIs listed here):

•    Search API - Provides ranking information that is a somewhat accurate map to Yahoo!'s actual rankings, but is occasionally so far off-base that it isn't reliable. Our data points show a lot more congruity with Yahoo!'s than Google's, but not nearly enough when compared with scraped results to be valuable to marketers and businesses.

•    Site Explorer API - Shows excellent information as far as number of pages indexed on a site and the link data that Yahoo! knows about. We've been comparing this information with that from scraped Yahoo! search results (for queries like linkdomain: and site:) and those at the Site Explorer page and find that there's very little quality difference in the results returned, though the best estimate numbers can still be found through a last page search of results.

•    Search Marketing API - I haven't played with this one at all, so I'd love to hear comments from those who have.

MSN:

•    Doesn't mind scraping as long as you use the RSS results. We do, we love them and we commend MSN for giving them out - bravo! They've also got a web search SDK program, but we've yet to give it a whirl. The only problem is the MSN estimates, which are so far off as to be useless. The links themselves, though, are useful.

Ask.com

•    Though it's somewhat hidden, the XML.Teoma.com page allows for scraping of results and Ask doesn't seem to mind, though they haven't explicitly said anything. Again, bravo! - the results look solid, accurate and match up against the Ask.com queries. Now, if Ask would only provide links.
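When results are available as RSS or XML, as with the MSN and Ask options above, reading a ranking position is a matter of walking the items in order. The sketch below assumes a generic RSS 2.0 layout with item and link elements. The actual feed formats of MSN and Ask are not described in this post, and the domains are invented.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def rank_of(domain, rss_xml):
    """Return the 1-based position of the first result on `domain`, or None."""
    domain = domain.lower()
    root = ET.fromstring(rss_xml)
    for position, item in enumerate(root.iter("item"), start=1):
        host = urlparse(item.findtext("link", default="")).netloc.lower()
        # Match the domain itself or any of its subdomains.
        if host == domain or host.endswith("." + domain):
            return position
    return None


# Hypothetical feed with two results.
sample = """<rss version="2.0"><channel><title>results</title>
<item><title>A</title><link>http://www.competitor.example/page</link></item>
<item><title>B</title><link>http://www.mysite.example/</link></item>
</channel></rss>"""
print(rank_of("mysite.example", sample))   # 2
```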

I know a lot of you are probably asking:

•    "Rand, if scraping is working, why do you care about the search engines fixing the APIs?"

•    The straight answer is that scraping hurts the search engines, hurts their users and isn't the most practical way to get the data. Let me give you some examples:

•    Scraped queries have to look as much like real users as possible to avoid detection and banning - thus, they affect the query data that search engineers use to improve web search.

•    These queries also hit advertisers - falsifying the number of "real" impressions that advertisers see and lowering their CTRs unnaturally.

•    They take up search engine resources and though even the heaviest scraping barely impacts their server loads, it's still an annoyance.

•    With all these negative elements, and so many positive incentives to have the data, it's clear what's needed - a way for marketers/businesses to get the data they need without hurting the search engines. Here's how they can do it:

•    Provide the search ranking position of a site in the referral string - this works for ranking data, but not for link data and since Yahoo! (and Google) both send referrals through re-directs at times, it wouldn't be a hard piece to add.

•    Make the APIs accurate, complete and unlimited

•    If the last option is too ambitious, the search engines could charge for API queries - anyone who needs the data would be more than happy to pay for it. This might help with quality control, too.

•    For link data - serve up accurate, holistic data in programs like Google Sitemaps and Yahoo! Search Submit (or even Google Analytics). Obviously, you'd only get information about your own site after verifying.
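To make the first suggestion concrete: if an engine included the ranking position in the referral string, a site's analytics could read it without any scraping. The sketch below assumes a hypothetical "position" query parameter next to the usual "q" keyword parameter. No engine is claimed to send such a parameter, and the URL is invented.

```python
from urllib.parse import parse_qs, urlparse


def parse_search_referrer(referrer, position_param="position"):
    """Extract engine, keywords and (hypothetical) rank from a referrer URL."""
    parsed = urlparse(referrer)
    query = parse_qs(parsed.query)
    keywords = query.get("q", [""])[0]
    raw_position = query.get(position_param, [None])[0]
    # Ignore missing or non-numeric positions instead of failing.
    position = int(raw_position) if raw_position and raw_position.isdigit() else None
    return {"engine": parsed.netloc, "keywords": keywords, "position": position}


info = parse_search_referrer(
    "http://search.example/results?q=hand+scraped+flooring&position=3"
)
print(info)
# {'engine': 'search.example', 'keywords': 'hand scraped flooring', 'position': 3}
```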

I've talked to lots of people at the search engine level about making changes this week (including Jeremy, Priyank, Matt, Adam, Aaron, Brett and more). I can only hope for the best...

Source: http://moz.com/blog/the-nasty-problem-with-scraping-results-from-the-engines