Scraping Ontology Specification

V1.0 - 11 July 2012

This version:	http://gsi.upm.es/ontologies/scraping/1.0 (RDF/XML, HTML)
Latest version:	http://gsi.upm.es/ontologies/scraping
Editors:	Jose Ignacio Fernández-Villamor, Carlos Iglesias, Mercedes Garijo
Authors:	Jose Ignacio Fernández-Villamor, Carlos Iglesias, Mercedes Garijo
Contributors:	See acknowledgements

This work is licensed under a Creative Commons Attribution License. This copyright applies to the Scraping Ontology Specification and accompanying documentation in RDF. This ontology uses W3C's RDF technology, an open Web standard that can be freely used by anyone.

Abstract

Semantic scraping defines the mapping between web data and semantic web resources. The Scraping Ontology is an RDF model that formalizes this mapping.

The proposed vocabulary serves as a link between an HTML document’s data and RDF data by defining a model for scraping agents. With this RDF model, it is possible to build an RDF graph of HTML nodes from an HTML document, which adds semantics to syntactic scraping.

Introduction
1. Semantic Scraping approach
2. What is the Scraping Ontology for?
Scraping Ontology at a glance
Scraping Ontology overview
1. Main classes summary
2. Scraping example
Cross-reference for Scraping Ontology classes and properties

1 Introduction

1.1 Semantic Scraping approach

The web is a hypermedia system that follows the REST architectural style. When a client accesses a web resource on a server, the server returns a representation of the resource. Usually, these representations are formatted in HTML, a language that allows defining the structure of a document for its rendering on a web browser. HTML documents are structured as a DOM tree, which defines the logical structure of the HTML document that will be used for rendering the representation on a web browser. In order to have information about the resource’s content and not about its rendering structure, Linked Data proposes using resources’ representations that include metadata, by enhancing HTML with semantic annotations or by providing RDF representations.

Whenever a resource provides unannotated HTML, some technique that processes the DOM tree is needed to identify the structure of the data present in the HTML document and build the associated RDF graph. In addition, a web resource contains DOM fragments that carry no information, such as advertisements, headers, footers, or decorative elements, while other fragments, such as posts or comments, carry valuable information. Discovery rules can be employed to identify which pieces of information are relevant in a web resource and which relations are stated in a web fragment. For instance, a heading in a piece of news might represent the news title. A discovery rule could use Cascading Style Sheets (CSS) information, rendering information or NLP to identify the relevant data in the resource’s representation.
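The heading example above can be illustrated with a minimal sketch of a discovery rule. It is not part of the ontology: the class name `HeadlineRule`, the function `discover_titles` and the set `IGNORED_TAGS` are hypothetical, and the rule simply assumes that `<h1>` elements outside header, footer, aside and nav regions are news titles.

```python
from html.parser import HTMLParser

# Assumption: regions whose content this rule treats as non-informative.
IGNORED_TAGS = {"header", "footer", "aside", "nav"}


class HeadlineRule(HTMLParser):
    """Collects the text of <h1> elements found outside ignored regions."""

    def __init__(self):
        super().__init__()
        self.ignored_depth = 0   # greater than 0 while inside an ignored region
        self.in_heading = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_TAGS:
            self.ignored_depth += 1
        elif tag == "h1" and self.ignored_depth == 0:
            self.in_heading = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag in IGNORED_TAGS and self.ignored_depth > 0:
            self.ignored_depth -= 1
        elif tag == "h1":
            self.in_heading = False

    def handle_data(self, data):
        # Accumulate text only while inside a relevant heading.
        if self.in_heading:
            self.titles[-1] += data


def discover_titles(html):
    """Return the candidate news titles found in an HTML string."""
    rule = HeadlineRule()
    rule.feed(html)
    rule.close()
    return [title.strip() for title in rule.titles]
```

A real discovery rule could equally rely on CSS, rendering information or NLP; the tag-based heuristic is used here only because it needs nothing beyond the standard library.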

Therefore, the input model that discovery rules use at this level comprises HTML fragments, which identify relevant pieces of data in a document, and selectors, which are any means of identifying a fragment inside a document. Usually, web scrapers use regular expressions, CSS selectors or XPath selectors for this task. The output of a web browser when rendering a web fragment, which consists of a set of properties such as typeface, color or dimensions, can also be used through visual selectors.
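As an illustration of two of the selector kinds mentioned above, the following sketch selects fragments with an XPath expression and with a regular expression. The sample document and the helper names `select_xpath` and `select_regex` are assumptions made for this example. The standard library only supports a subset of XPath and requires well-formed markup; CSS and visual selectors would need third-party tooling and are left out.

```python
import re
import xml.etree.ElementTree as ET

# Assumed sample document (well-formed, so ElementTree can parse it).
DOCUMENT = """<html><body>
<div class="post"><h2>First post</h2><span class="date">2012-07-11</span></div>
<div class="post"><h2>Second post</h2><span class="date">2012-07-12</span></div>
<div class="ad">Buy now</div>
</body></html>"""


def select_xpath(root, path):
    """Return the elements matched by an XPath expression (ElementTree subset)."""
    return root.findall(path)


def select_regex(text, pattern):
    """Return every substring of the serialized document matched by a pattern."""
    return re.findall(pattern, text)


root = ET.fromstring(DOCUMENT)

# XPath selector: only the fragments that hold posts, not the advertisement.
posts = select_xpath(root, ".//div[@class='post']")
titles = [post.findtext("h2") for post in posts]

# Regular-expression selector: ISO-style dates anywhere in the document.
dates = select_regex(DOCUMENT, r"\d{4}-\d{2}-\d{2}")
```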

In contrast, the output model comprises the different types of content that are available on the web. Ontologies like Semantically-Interlinked Online Communities (SIOC), Friend of a Friend (FOAF) or Dublin Core (DC) address this issue by defining schemas for modeling blog posts, relationships between users or metadata annotations in publications, and thus make up the output content model of our discovery framework.
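To make the output model concrete, the sketch below serializes one scraped post as N-Triples using the SIOC `Post` class and the Dublin Core `title` property. The function names and the example URI in the usage note are hypothetical; only the vocabulary URIs come from the SIOC, DC and RDF namespaces.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SIOC_POST = "http://rdfs.org/sioc/ns#Post"
DC_TITLE = "http://purl.org/dc/terms/title"


def escape_literal(value):
    """Escape the characters that must be escaped inside an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def post_to_ntriples(post_uri, title):
    """Serialize one scraped post as a list of N-Triples lines."""
    return [
        f"<{post_uri}> <{RDF_TYPE}> <{SIOC_POST}> .",
        f'<{post_uri}> <{DC_TITLE}> "{escape_literal(title)}" .',
    ]
```

For example, `post_to_ntriples("http://example.org/posts/1", "Big news")` yields one triple typing the resource as a SIOC post and one triple carrying its title.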


2. Scraping Ontology at a glance

An alphabetical index of Scraping Ontology terms, by class (concepts) and by property (relationships, attributes), is given below. All the terms are hyperlinked to their detailed description for quick reference.

3. Scraping Ontology overview

The diagram below shows the most relevant connections between the main classes that implement the data model of the Scraping Ontology.

Scraping Ontology Diagram

3.1. Main classes summary

The basic classes of the model are described next:

Scraper: An automatic agent that is able to extract particular fragments out of the web.
Fragment: Any element of an HTML document. It serves to represent and traverse a whole subtree of a document.
Selector: A condition that indicates which element a fragment refers to; selectors are a means of identifying a fragment of a web document. Different selector terms are defined for each selector type. Selectors can be XML Path Language (XPath) expressions, CSS selectors, URI selectors, etc.
Mapping: The mapping between a fragment and an RDF resource or blank node. An identifier is defined to map the fragment to a URI. A predicate between the parent’s mapped fragment and this one is defined to produce an RDF triple. Also, an RDF class can be assigned to the mapped resource of this fragment.
Presentation: The representation of a fragment. This includes HTML attributes as well as visual parameters such as color, size or font.
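One way to picture how these classes relate is the following sketch, which models them as plain Python data classes. It is an illustration only: the field names (`kind`, `value`, `rdf_type`, `subfragments` and so on) and the helper `count_fragments` are assumptions chosen for readability and are not the ontology's own property definitions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Selector:
    kind: str    # assumed labels, e.g. "xpath", "css" or "uri"
    value: str   # the selector expression itself


@dataclass
class Mapping:
    identifier: Optional[str] = None   # how the fragment is mapped to a URI
    relation: Optional[str] = None     # predicate linking the parent's resource to this one
    rdf_type: Optional[str] = None     # RDF class assigned to the mapped resource


@dataclass
class Presentation:
    # HTML attributes and visual parameters such as color, size or font.
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Fragment:
    selector: Selector
    mapping: Mapping = field(default_factory=Mapping)
    presentation: Optional[Presentation] = None
    subfragments: List["Fragment"] = field(default_factory=list)


@dataclass
class Scraper:
    name: str
    fragments: List[Fragment] = field(default_factory=list)


def count_fragments(fragment):
    """Count a fragment plus all of its nested subfragments."""
    return 1 + sum(count_fragments(sub) for sub in fragment.subfragments)


# A post fragment that contains a title subfragment.
title = Fragment(selector=Selector("css", "h2"),
                 mapping=Mapping(relation="http://purl.org/dc/terms/title"))
post = Fragment(selector=Selector("css", "div.post"),
                mapping=Mapping(rdf_type="http://rdfs.org/sioc/ns#Post"),
                subfragments=[title])
news_scraper = Scraper(name="news", fragments=[post])
```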

3.2. Scraping example

An example of the usage of selectors for a news scraper is shown in the figure below. In this case, a scraper is defined that is able to scrape a set of posts (by using the SIOC ontology) from a specific URI. A sample mapped RDF graph is shown in the figure, too.

Example of the usage of selectors for a news scraper (PNG)
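Since the figure is not reproduced here, the following sketch suggests what a scraper of this kind might do; it is not the scraper shown in the figure. The page content, the base URI, the dictionary layout of `POST_FRAGMENT` and the function `scrape` are all assumptions. The idea is that a post fragment is located with a selector, mapped to a SIOC post, and its title subfragment yields a Dublin Core title triple.

```python
import xml.etree.ElementTree as ET

SIOC = "http://rdfs.org/sioc/ns#"
DC = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Assumed sample page (well-formed so the standard library can parse it).
PAGE = """<html><body>
<div class="post" id="p1"><h2>First post</h2></div>
<div class="post" id="p2"><h2>Second post</h2></div>
</body></html>"""

# Hypothetical scraper description: a post fragment with one title subfragment.
POST_FRAGMENT = {
    "selector": ".//div[@class='post']",   # XPath selector for the fragment
    "type": SIOC + "Post",                 # RDF class of the mapped resource
    "identifier": "id",                    # attribute used to build the URI
    "subfragments": [
        {"selector": "h2", "relation": DC + "title"},
    ],
}


def scrape(page, base_uri, fragment):
    """Apply a fragment description to a page and return (s, p, o) triples."""
    triples = []
    root = ET.fromstring(page)
    for node in root.findall(fragment["selector"]):
        subject = base_uri + "#" + node.get(fragment["identifier"])
        triples.append((subject, RDF_TYPE, fragment["type"]))
        for sub in fragment["subfragments"]:
            for child in node.findall(sub["selector"]):
                text = (child.text or "").strip()
                triples.append((subject, sub["relation"], text))
    return triples
```

Calling `scrape(PAGE, "http://example.org/news", POST_FRAGMENT)` returns two triples per post: one typing the resource and one holding its title.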

4. Cross-reference for Scraping Ontology classes and properties

Below is a comprehensive list of all Scraping Ontology classes and properties, together with their descriptions.

Classes and Properties (full detail)

Classes

Class: sc:BaseUriSelector

Base URI selector - A selector that returns the URI of a web resource

Status:	unknown
Sub class of	Selector

Status:	unknown
Used with:	format
Has sub class	Format Plain Format

Status:	unknown
Properties include:	subfragment superclass identifier tag type sameas relation selector
Used with:	subfragment

Status:	unknown
Domain:	Selector
Range:	rdf:Literal

Status:	unknown
Domain:	Univocal selector
Range:	rdf:Literal

Status:	unknown
Domain:	New URI selector
Range:	rdf:Literal

Status:	unknown
Domain:	Fragment
Range:	Selector


Classes

Class: sc:BaseUriSelector

Class: sc:CssSelector

Class: sc:Format

Class: sc:Fragment

Class: sc:Html

Class: sc:Index

Class: sc:KeywordSelector

Class: sc:ListSelector

Class: sc:NewUriSelector

Class: sc:Page

Class: sc:Plain

Class: sc:RootSelector

Class: sc:SectionSelector

Class: sc:Selector

Class: sc:SliceSelector

Class: sc:TagSelector

Class: sc:UnivocalSelector

Class: sc:UriPatternSelector

Class: sc:UriSelector

Class: sc:WikiText

Class: sc:XPathSelector

Properties

Property: sc:debug

Property: sc:document

Property: sc:downcase

Property: sc:identifier

Property: sc:index

Property: sc:keyword

Property: sc:path

Property: sc:prefix

Property: sc:relation

Property: sc:sameas

Property: sc:selector

Property: sc:subfragment

Property: sc:suffix

Property: sc:superclass

Property: sc:tag

Property: sc:type

Property: sc:uri

A. Change Log

2013-05-20

B. Acknowledgments