Abstract:
Information extraction from web pages, commonly known as screen
scraping, is usually performed through wrapper induction, a technique that
is based on the internal structure of HTML documents. Consequently, the main
limitation of such techniques is that a generated wrapper is only useful
for the web page it was designed for. To overcome this, we have designed
a system that generates first-order logic rules that can be used to extract data
from web pages. These rules are based on visual features such as font size,
element positioning, or content type. Thus, they do not depend on a
document's structure and can be applied to different sites. The system has
been evaluated on a set of web pages, which has served to identify several
design patterns used across the Web.