
subject: Introduction To Popular Web Data Extraction Applications


If your organization wants to design and develop a comprehensive information system, the first challenge you face is extracting data from the World Wide Web. The issues that arise include extraction, validation and management of the large amount of data available on the internet. This data is typically of low quality, with format mismatches and content errors that make the task more difficult.

The most popular techniques in practice for effective Web data extraction are regular expressions and wrappers. They offer flexible and scalable mechanisms to harvest the necessary data from various web resources such as directories, forums and blogs. Because these sources are so varied, it is nearly impossible to build and maintain a large database for business intelligence and market research purposes without such techniques.

Wrappers are dedicated applications that automatically harvest data from online documents and store the information in a specified structured format. A wrapper application first downloads HTML pages from the internet, locates the data to extract, and then stores this data in MS Excel, CSV, MySQL or another structured format to facilitate further refinement.
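The three wrapper steps described above (download, locate, store) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not a production wrapper. The names LinkCollector, download, extract, to_csv and SAMPLE_HTML are hypothetical, and the demo runs on an embedded HTML snippet rather than a live page.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Hypothetical extractor: collects (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.rows.append((self._href, "".join(self._text).strip()))
            self._href = None


def download(url):
    """Step 1: fetch the raw HTML (defined for completeness, not called below)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract(html):
    """Step 2: locate the data of interest in the page."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.rows


def to_csv(rows):
    """Step 3: store the records in a structured format (CSV text here)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["url", "title"])
    writer.writerows(rows)
    return buffer.getvalue()


# Assumed sample page standing in for a downloaded forum or blog index.
SAMPLE_HTML = """
<ul>
  <li><a href="/forum/1">Pricing thread</a></li>
  <li><a href="/blog/42">Market notes</a></li>
</ul>
"""

csv_text = to_csv(extract(SAMPLE_HTML))
print(csv_text)
```

In a real run, the string returned by download(url) would be passed to extract in place of SAMPLE_HTML, and the CSV text would be written to a file or loaded into a database.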

The most common approach to building wrappers is manual, i.e. identifying a set of patterns in the HTML and then harvesting the particular data by hand. However, this is a very inefficient technique, because a small modification in the underlying data source can make the wrapper fail badly.
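To illustrate how fragile a hand-written pattern can be, the sketch below ties a pattern to the exact markup of an imagined page and then applies it to a slightly changed version of the same page. The page snippets, PRICE_PATTERN and extract_prices are assumptions made up for this example.

```python
import re

# Hand-written pattern tied to the exact markup of a hypothetical page.
PRICE_PATTERN = re.compile(r'<span class="price">\$(\d+\.\d{2})</span>')

original_page = '<div><span class="price">$19.99</span></div>'
# Same data, but the site has added one attribute to the tag.
changed_page = '<div><span class="price" id="p1">$19.99</span></div>'


def extract_prices(html):
    """Return every price the hand-written pattern can find."""
    return PRICE_PATTERN.findall(html)


print(extract_prices(original_page))  # ['19.99']
print(extract_prices(changed_page))   # []
```

The data did not change at all, yet the wrapper silently returns nothing, which is the kind of failure that makes manually built wrappers costly to maintain.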

A regular expression is an intuitive way to find a pattern in a particular piece of data or information. Regular expressions, or simply regex, are a convenient way for many text editors and programming languages to search and reuse text-based information. A wrapper comes with generic operators and extraction modules to retrieve simple elements that are later used, shared and embedded into the data system. A regex can be written with particular features in mind, such as content, syntax and semantic relationships.
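As a small example of a regex retrieving simple elements for later reuse, the sketch below pulls email addresses out of free text and splits each into named parts. The pattern is deliberately simplified and does not cover every valid address. EMAIL, find_emails and the sample sentence are illustrative assumptions.

```python
import re

# Simplified email pattern with named groups for the two parts of an address.
EMAIL = re.compile(
    r"(?P<user>[A-Za-z0-9._%+-]+)@(?P<domain>[A-Za-z0-9.-]+\.[A-Za-z]{2,})"
)


def find_emails(text):
    """Return one dict per match, keyed by the named groups."""
    return [match.groupdict() for match in EMAIL.finditer(text)]


sample = "Contact sales@example.com or support@example.org for details."
matches = find_emails(sample)
for entry in matches:
    print(entry["user"], entry["domain"])
```

Named groups keep the extracted pieces labeled, so they can be passed straight into a structured store such as a CSV row or a database record.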

For more information on Web data extraction email us at info@outsourcingwebresearch.com

by: Richard Kaith



