Grupo de Sistemas Inteligentes Scraping Ontology

Scraping Ontology Specification

V1.0 - 11 July 2012

This version: http://gsi.upm.es/ontologies/scraping/1.0 (RDF/XML, HTML)
Latest version: http://gsi.upm.es/ontologies/scraping
Editors: Jose Ignacio Fernández-Villamor, Carlos Iglesias, Mercedes Garijo
Authors: Jose Ignacio Fernández-Villamor, Carlos Iglesias, Mercedes Garijo
Contributors: See acknowledgements

Creative Commons License


Abstract

Semantic scraping defines the mapping between web data and semantic web resources. The Scraping Ontology is an RDF model that formalizes this mapping.

The proposed vocabulary links the data in an HTML document to RDF data by defining a model for scraping agents. With this RDF model, it is possible to build an RDF graph of HTML nodes from a given HTML document, which provides semantics to syntactic scraping.


Table of Contents

  1. Introduction
    1. Semantic Scraping approach
    2. What is the Scraping Ontology for?
  2. Scraping Ontology at a glance
  3. Scraping Ontology overview
    1. Main classes summary
    2. Scraping example
  4. Cross-reference for Scraping Ontology classes and properties

Appendixes

  1. Changelog
  2. Acknowledgements

1 Introduction

1.1 Semantic Scraping approach

The web is a hypermedia system that follows the REST architectural style. When a client accesses a web resource on a server, the server returns a representation of the resource. Usually, these representations are formatted in HTML, a language that allows defining the structure of a document for its rendering on a web browser. HTML documents are structured as a DOM tree, which defines the logical structure of the HTML document that will be used for rendering the representation on a web browser. In order to have information about the resource’s content and not about its rendering structure, Linked Data proposes using resources’ representations that include metadata, by enhancing HTML with semantic annotations or by providing RDF representations.

Whenever a resource provides unannotated HTML, a technique that processes the DOM tree needs to be used to identify the structure of the data present in the HTML document and build the associated RDF graph. Also, a web resource contains DOM fragments that do not provide information, such as advertisements, headers, footers, or decorative elements, while other fragments, such as posts or comments, carry valuable information. Discovery rules can be employed to identify which pieces of information are relevant in a web resource and which relations are stated in a web fragment. For instance, a heading in a piece of news might represent the news title. A discovery rule could use Cascading Style Sheets (CSS) information, rendering information, or NLP to identify the relevant data in the resource's representation.

Therefore, the input model that discovery rules use at this level comprises HTML fragments, which identify relevant pieces of data in a document, and selectors, which are any means of identifying a fragment inside a document. Web scrapers usually rely on regular expressions, CSS selectors, or XPath selectors for these tasks. The output of a web browser when rendering a web fragment, which consists of a set of properties such as typeface, color, or dimensions, can also be used through visual selectors.
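As a minimal sketch of this input model, the following Python snippet treats a selector as a function that takes a document and returns the fragments it identifies. The sample markup, the function names, and the use of the limited XPath subset in Python's standard library are assumptions made for illustration; none of them are defined by the ontology.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document. Real-world HTML is often not
# well-formed XML and would need a more lenient parser.
DOC = """<html><body>
  <div class="ad">Buy now</div>
  <div class="post"><h1>First title</h1><p>First body</p></div>
  <div class="post"><h1>Second title</h1><p>Second body</p></div>
</body></html>"""


def xpath_selector(root, path):
    """Return the elements (fragments) matched by an ElementTree XPath subset."""
    return root.findall(path)


def regex_selector(text, pattern):
    """Return every group captured by a regular expression in raw markup."""
    return re.findall(pattern, text)


root = ET.fromstring(DOC)

# The XPath-style selector keeps the posts and skips the advertisement.
posts = xpath_selector(root, ".//div[@class='post']")
titles = [post.findtext("h1") for post in posts]

# A regular expression can reach the same data without parsing the DOM.
regex_titles = regex_selector(DOC, r"<h1>(.*?)</h1>")
```

Both selectors identify the same two headings here; the XPath variant works on the DOM tree, while the regular expression works on the serialized text.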

In contrast, the output model comprises the different types of content available on the web. Ontologies such as Semantically-Interlinked Online Communities (SIOC), Friend of a Friend (FOAF), or Dublin Core (DC) address this issue by defining schemas for modeling blog posts, relationships between users, or metadata annotations in publications, and thus make up the output content model of our discovery framework.

2. Scraping Ontology at a glance

An alphabetical index of Scraping Ontology terms, by class (concepts) and by property (relationships, attributes), is given below. All the terms are hyperlinked to their detailed description for quick reference.

3. Scraping Ontology overview

The diagram presented below shows the most relevant connections between main classes that implement the data model of Scraping Ontology.

Scraping Ontology Diagram (detail)
Scraping Ontology Diagram

3.1. Main classes summary

The basic classes of the model are described next:

3.2. Scraping example

An example of the usage of selectors for a news scraper is shown in the figure below. In this case, a scraper is defined that is able to scrape a set of posts (by using the SIOC ontology) from a specific URI. A sample mapped RDF graph is shown in the figure, too.

Scraping example with RDF code
Example of the usage of selectors for a news scraper (PNG)
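Since the figure is not reproduced here, the idea can be approximated with a short Python sketch. It models a fragment with a selector, a type, a relation, and subfragments, and walks a sample page to emit triples for sioc:Post resources. The Fragment class, the scrape function, the sample markup, the blank-node naming, and the choice of dc:title and sioc:content are assumptions for illustration only; they are one plausible reading of the vocabulary, not the mapping shown in the figure.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
DOC = """<html><body>
  <div class="post"><h1>First title</h1><p>First body</p></div>
  <div class="post"><h1>Second title</h1><p>Second body</p></div>
</body></html>"""


@dataclass
class Fragment:
    """Simplified stand-in for sc:Fragment."""
    selector: str                    # role of sc:selector (XPath subset here)
    type: Optional[str] = None       # role of sc:type
    relation: Optional[str] = None   # role of sc:relation
    subfragments: List["Fragment"] = field(default_factory=list)  # sc:subfragment


def scrape(fragment, node, parent=None, triples=None):
    """Apply a fragment mapping to a DOM node and collect (s, p, o) triples."""
    triples = [] if triples is None else triples
    for element in node.findall(fragment.selector):
        if fragment.type is None:
            # Untyped fragment: its text becomes a literal linked to the parent.
            if parent is not None and fragment.relation is not None:
                text = (element.text or "").strip()
                triples.append((parent, fragment.relation, text))
            continue
        # Typed fragment: mint a blank node for the mapped resource.
        count = sum(1 for t in triples if t[1] == "rdf:type")
        subject = f"_:b{count + 1}"
        triples.append((subject, "rdf:type", fragment.type))
        if parent is not None and fragment.relation is not None:
            triples.append((parent, fragment.relation, subject))
        for sub in fragment.subfragments:
            scrape(sub, element, subject, triples)
    return triples


post_mapping = Fragment(
    selector=".//div[@class='post']",
    type="sioc:Post",
    subfragments=[
        Fragment(selector="h1", relation="dc:title"),
        Fragment(selector="p", relation="sioc:content"),
    ],
)

triples = scrape(post_mapping, ET.fromstring(DOC))
```

Each matched post yields one rdf:type triple plus one triple per subfragment, so the two sample posts produce six triples in total.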

4. Cross-reference for Scraping Ontology classes and properties

A comprehensive list of all Scraping Ontology classes and properties, with their descriptions, is given below.

Classes and Properties (full detail)


Classes

Class: sc:BaseUriSelector

Base URI selector - A selector that returns the URI of a web resource
Status: unknown
Sub class of Selector


Class: sc:CssSelector

CSS selector - A selector that returns a set of HTML tags identified by a CSS expression
Status: unknown
Sub class of Tag selector, List selector


Class: sc:Format

Format - A text serialization format
Status: unknown
Used with: format
Has sub class sc:Html, sc:Plain, sc:WikiText


Class: sc:Fragment

Fragment - A fragment of a web page (or a complete web page)
Status: unknown
Properties include: subfragment, superclass, identifier, tag, type, sameas, relation, selector
Used with: subfragment


Class: sc:Html

Format - HTML serialization format
Status: unknown
Sub class of Format


Class: sc:Index

Index - A web resource that contains a list of references to other resources
Status: unknown


Class: sc:KeywordSelector

Keyword selector - A selector that may limit the scope by using keywords
Status: unknown
Properties include: keyword
Sub class of Selector


Class: sc:ListSelector

List selector - A selector that outputs a list of results
Status: unknown
Properties include: index
Sub class of Selector
Has sub class XPath selector, CSS selector, Slice selector


Class: sc:NewUriSelector

New URI selector - A selector that returns a URI, built out of a fragment's text
Status: unknown
Properties include: prefix, suffix, downcase
Sub class of Selector
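
A minimal Python sketch of how a scraper might combine the prefix, suffix, and downcase properties is given next. The function name new_uri, the whitespace trimming, and the percent-encoding step are assumptions for illustration; the ontology only states that the URI is built out of the fragment's text.

```python
from urllib.parse import quote


def new_uri(text, prefix="", suffix="", downcase=False):
    """Build a URI from a fragment's text, in the spirit of sc:NewUriSelector."""
    slug = text.strip()
    if downcase:
        # Mirrors the intent of sc:downcase.
        slug = slug.lower()
    # Percent-encoding is an assumption so that the result is a valid URI.
    return f"{prefix}{quote(slug)}{suffix}"


uri = new_uri(
    "First Title",
    prefix="http://example.org/post/",
    suffix="#this",
    downcase=True,
)
# uri == "http://example.org/post/first%20title#this"
```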


Class: sc:Page

Page - A web resource that extends the data existing in another web resource, usually employing the user-interface pattern of pagination. When scraping a web resource, all sc:Pages should be scraped as well in order to retrieve all the data present in that web resource
Status: unknown
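
The requirement to follow every sc:Page can be sketched as a simple traversal. In the Python snippet below, the SITE dictionary stands in for the web, and its URIs, items, and structure are hypothetical; a real scraper would fetch and parse each resource instead.

```python
from collections import deque

# Hypothetical in-memory site: URI -> (items found there, URIs of its pages).
SITE = {
    "http://example.org/news": (
        ["a", "b"],
        ["http://example.org/news?page=2"],
    ),
    "http://example.org/news?page=2": (
        ["c"],
        ["http://example.org/news?page=3", "http://example.org/news"],
    ),
    "http://example.org/news?page=3": (["d"], []),
}


def scrape_with_pages(start_uri, site):
    """Collect the items of a resource and of every page reachable from it."""
    seen = {start_uri}
    queue = deque([start_uri])
    items = []
    while queue:
        uri = queue.popleft()
        page_items, next_pages = site[uri]
        items.extend(page_items)
        for next_uri in next_pages:
            # The seen set prevents loops when pages link back to each other.
            if next_uri not in seen:
                seen.add(next_uri)
                queue.append(next_uri)
    return items


all_items = scrape_with_pages("http://example.org/news", SITE)
```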


Class: sc:Plain

Plain - A plain text serialization format
Status: unknown
Sub class of Format


Class: sc:RootSelector

Root selector - A selector that performs no scoping
Status: unknown
Sub class of Tag selector


Class: sc:SectionSelector

Section selector - A selector that limits the scope to a text section which starts with a headline specified by some keyword
Status: unknown
Sub class of Selector


Class: sc:Selector

Selector - A restriction on the scope of a web fragment
Status: unknown
Properties include: format, debug
Used with: identifier, selector
Has sub class URI selector, Section selector, Tag selector, URI pattern selector, Base URI selector, New URI selector, Keyword selector, List selector


Class: sc:SliceSelector

Slice selector - A selector that splits a fragment's text given a token and returns a specified slice
Status: unknown
Sub class of List selector
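
The behaviour described above can be sketched in a few lines of Python. The function name slice_selector, the zero-based index, the trimming of whitespace, and the None result for an out-of-range index are all assumptions; this document does not say how the token is supplied or how slices are numbered.

```python
def slice_selector(text, token, index):
    """Split a fragment's text on a token and return one slice."""
    parts = [part.strip() for part in text.split(token)]
    # Assumption: indexes are zero-based; an out-of-range index yields None.
    if 0 <= index < len(parts):
        return parts[index]
    return None


country = slice_selector("Madrid | Spain | 2012", "|", 1)
# country == "Spain"
```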


Class: sc:TagSelector

Tag selector - A selector that limits the scope to a single HTML tag
Status: unknown
Properties include: attribute
Sub class of Selector
Has sub class Root selector, CSS selector, XPath selector


Class: sc:UnivocalSelector

Univocal selector - A selector that selects a specific node in the DOM tree of a particular URI
Status: unknown
Properties include: path, document


Class: sc:UriPatternSelector

URI pattern selector - A selector that limits the scope to resources identified by a URI pattern defined by a regular expression
Status: unknown
Sub class of Selector
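
As a rough illustration, a scraper could decide whether a mapping applies to a resource by testing its URI against the pattern. The function name, the example pattern, and the choice of re.search (rather than requiring a full match) are assumptions in this Python sketch.

```python
import re


def uri_pattern_matches(uri, pattern):
    """Return True if the URI is in the scope defined by the regular expression."""
    return re.search(pattern, uri) is not None


# Hypothetical pattern: news items identified by a numeric id.
NEWS_PATTERN = r"^http://example\.org/news/\d+$"

in_scope = uri_pattern_matches("http://example.org/news/42", NEWS_PATTERN)
out_of_scope = uri_pattern_matches("http://example.org/about", NEWS_PATTERN)
```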


Class: sc:UriSelector

URI selector - A selector that limits the scope to resources identified by a URI
Status: unknown
Sub class of Selector


Class: sc:WikiText

Format - Wiki text serialization format
Status: unknown
Sub class of Format


Class: sc:XPathSelector

XPath selector - A selector that returns a set of HTML tags identified by an XPath expression
Status: unknown
Sub class of Tag selector, List selector


Properties

Property: sc:debug

debug - Indicates whether or not a selector must provide debugging information when processed by a scraper
Status: unknown
Domain: Selector
Range: rdf:Literal


Property: sc:document

document - Web document that sets the context of the subject selector, given by its URI
Status: unknown
Domain: Univocal selector
Range: rdf:Literal


Property: sc:downcase

downcase - Indicates whether or not to build a downcased URI on a new URI selector
Status: unknown
Domain: New URI selector
Range: rdf:Literal


Property: sc:identifier

identifier - A selector that defines the URI of the mapped resource of the fragment subject
Status: unknown
Domain: Fragment
Range: Selector


Property: sc:index

index - Index number of the value that should be returned by a list selector
Status: unknown
Domain: List selector
Range: rdf:Literal


Property: sc:keyword

keyword - Keyword used to restrict the scope of a selector
Status: unknown
Domain: Keyword selector
Range: rdf:Literal


Property: sc:path

path - XPath to the element selected by the subject
Status: unknown
Domain: Univocal selector
Range: rdf:Literal


Property: sc:prefix

prefix - Prefix used when building a URI on a new URI selector
Status: unknown
Domain: New URI selector
Range: rdf:Literal


Property: sc:relation

relation - The relation between a mapped resource of the fragment subject and the mapped resource of its parent fragment
Status: unknown
Domain: Fragment
Range: rdf:Property


Property: sc:sameas

same as - A resource with the same semantics as the one mapped by the fragment subject
Status: unknown
Domain: Fragment
Range: rdf:Resource


Property: sc:selector

selector - A selector that defines the scope of the subject, typically a web fragment
Status: unknown
Domain: Fragment
Range: Selector


Property: sc:subfragment

subfragment - A fragment that is contained inside the fragment subject
Status: unknown
Domain: Fragment
Range: Fragment


Property: sc:suffix

suffix - Suffix used when building a URI on a new URI selector
Status: unknown
Domain: New URI selector
Range: rdf:Literal


Property: sc:superclass

superclass - The superclass of the mapped resource of the fragment subject
Status: unknown
Domain: Fragment
Range: rdf:Class


Property: sc:tag

tag - HTML tag that is the root of the fragment subject
Status: unknown
Domain: Fragment
Range: rdf:Literal


Property: sc:type

type - The type of a resource that is mapped by the fragment subject
Status: unknown
Domain: Fragment
Range: rdf:Class


Property: sc:uri

URI - URI of an RDF resource
Status: unknown
Domain: rdf:Resource
Range: rdf:Literal


A. Change Log

2013 - 05 - 20

B. Acknowledgments

This documentation has been generated automatically from the most recent ontology specification in OWL using a Python script called SpecGen. The style formatting was inspired by the FOAF specification.

Special thanks for support with ontology creation and research to: Prof. Carlos A. Iglesias, Prof. Mercedes Garijo and members of the GSI Group of DIT department of Universidad Politécnica de Madrid.

This work has been funded by the European Union through the Omelette project.