Three Common Methods For Web Data Scraping
Probably the most common technique traditionally used to extract data from web pages is to cook up some regular expressions that match the pieces you want (such as URLs and link titles). Our screen-scraper software actually started out as an application written in Perl for this very reason. In addition to regular expressions, you might also use some code written in something like Java or Active Server Pages to parse out larger chunks of text. Using raw regular expressions to pull out the data can be a bit daunting to the uninitiated, but if your scraping project is relatively small, they can be a great solution.
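As a rough illustration of the regular-expression technique, here is a minimal Python sketch that pulls URLs and link titles out of a page. The sample HTML, the pattern, and the function name `extract_links` are assumptions made for this example and are not taken from any particular product. A pattern like this will miss unusual markup, which is part of why the approach suits small jobs best.

```python
import re

# Hypothetical sample page; a real scraper would first fetch the HTML over HTTP.
SAMPLE_HTML = """
<ul>
  <li><a href="http://example.com/news/1">First headline</a></li>
  <li><a class="ext" href='http://example.com/news/2'>Second <b>headline</b></a></li>
</ul>
"""

# Matches an anchor tag and captures the href value and the inner markup.
LINK_RE = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*["\']([^"\']+)["\'][^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)

# Used to strip any nested tags (such as <b>) from the link title.
TAG_RE = re.compile(r"<[^>]+>")


def extract_links(html):
    """Return a list of (url, title) pairs found in the HTML."""
    links = []
    for url, inner in LINK_RE.findall(html):
        title = TAG_RE.sub("", inner).strip()
        links.append((url, title))
    return links


if __name__ == "__main__":
    for url, title in extract_links(SAMPLE_HTML):
        print(url, "->", title)
```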
So what is the best approach to data extraction? It really depends on what your needs are and what resources you have at your disposal. Here are some of the advantages and disadvantages of the various approaches, as well as suggestions on when to use each:
Raw regular expressions and code
Benefits:
- If you are already familiar with regular expressions and at least one programming language, this can be a quick fix.
- Regular expressions are supported in almost all modern programming languages. Heck, even VBScript has a regular expression engine. It's also nice because the various regular expression implementations do not differ significantly in their syntax.
Cons:
- They can be complicated for those who do not have much experience with them. Learning regular expressions is not like going from Perl to Java. It's more like going from Perl to XSLT, where you have to wrap your mind around a very different way of looking at the problem.
When to use this approach: You will most likely use raw regular expressions for screen scraping when you have a small job you want done quickly. Especially if you're already familiar with regular expressions, there is no point in getting into other tools when all you have to do is pull some news from a website.
Ontologies and artificial intelligence
Benefits:
- The data model is generally built in. For example, if you are extracting data about cars from websites, the extraction engine already knows what make, model, and price are, so it can easily map them to existing data structures (e.g., place the data in the appropriate places in your database).
- There is relatively little need for long-term maintenance. As websites change you will probably have to do very little for your extraction engine to take account of the changes.
Cons:
- It is relatively complicated to build and work with such an engine. The level of expertise required even to understand an extraction engine that uses artificial intelligence and ontologies is much higher than what is needed to deal with regular expressions.
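To make the "map to existing data structures" benefit concrete, here is a minimal Python sketch of only the final mapping step. It assumes some extraction engine has already labeled the make, model, and price fields. The record shape, the `CarListing` class, and the `cars` table are hypothetical names chosen for illustration. The sketch does not implement any ontology or artificial intelligence.

```python
import sqlite3
from dataclasses import dataclass, astuple


@dataclass
class CarListing:
    """Hypothetical data model for one extracted car listing."""
    make: str
    model: str
    price: float


def to_listing(record):
    """Map a raw extracted record (assumed to be a dict) onto the data model."""
    price = float(str(record["price"]).replace("$", "").replace(",", ""))
    return CarListing(record["make"].strip(), record["model"].strip(), price)


def store(listings, conn):
    """Insert listings into a 'cars' table, creating it if needed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cars (make TEXT, model TEXT, price REAL)"
    )
    conn.executemany(
        "INSERT INTO cars VALUES (?, ?, ?)", [astuple(item) for item in listings]
    )
    conn.commit()


if __name__ == "__main__":
    # Made-up records standing in for the output of an extraction engine.
    raw = [
        {"make": " Honda ", "model": "Civic", "price": "$12,500"},
        {"make": "Ford", "model": "Focus", "price": "9800"},
    ]
    conn = sqlite3.connect(":memory:")
    store([to_listing(r) for r in raw], conn)
    print(conn.execute("SELECT make, model, price FROM cars").fetchall())
```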
Screen-scraping applications
Benefits:
- Abstracts most of the complicated stuff away. You can do some pretty advanced things in most screen-scraping applications without knowing anything about regular expressions, HTTP, or cookies.
- Drastically reduces the time required to set up a site to be scraped. Once you learn a specific screen-scraping application, the time it takes to scrape websites is significantly reduced compared with other methods.
When to use this approach: Screen-scraping applications vary widely in their ease of use, price, and suitability for different scenarios. Chances are, though, that if you do not mind paying a little, you can save yourself a significant amount of time by using one. If you are doing a quick scrape of a single page, you can use almost any language with regular expressions.
As a side note, I thought I'd share a recent project we have been involved in that actually called for a hybrid approach of two of the above methods. We are currently working on a project that deals with extracting newspaper ads. The data in ads is about as unstructured as you can get. For example, the term "number of rooms" in a real estate ad can be written in 25 different ways. The data extraction part of the process is one that lends itself well to an ontology-based approach, which is what we did. But we still had to handle the data discovery part of the process.
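The "number of rooms" problem can be sketched in a few lines of Python. This is only a simple vocabulary lookup standing in for a real ontology-based engine. The handful of surface forms in `ROOM_TERMS` and the function name `room_count` are illustrative assumptions, not the variants or code from the project described above.

```python
import re

# Illustrative surface forms only; a real vocabulary would be much larger.
ROOM_TERMS = ["rooms", "room", "rms", "rm"]

# Longest terms first so that "rooms" is tried before "room".
_TERMS = "|".join(sorted(map(re.escape, ROOM_TERMS), key=len, reverse=True))

# A number, an optional hyphen or spaces, then one of the known terms.
ROOM_RE = re.compile(r"(\d+)\s*-?\s*(?:" + _TERMS + r")\b", re.IGNORECASE)


def room_count(ad_text):
    """Return the number of rooms mentioned in the ad, or None if not found."""
    match = ROOM_RE.search(ad_text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    for ad in ["Bright 3 rms, close to transit", "4-room flat", "Studio, no pets"]:
        print(repr(ad), "->", room_count(ad))
```

Every variant maps to the same canonical field, so the rest of the pipeline only has to deal with one value.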
by: Tonny Raval