This version: | http://gsi.upm.es/ontologies/scraping/1.0 (RDF/XML, HTML) |
Latest version: | http://gsi.upm.es/ontologies/scraping |
Editors: | Jose Ignacio Fernández-Villamor, Carlos Iglesias, Mercedes Garijo |
Authors: | Jose Ignacio Fernández-Villamor, Carlos Iglesias, Mercedes Garijo |
Contributors: | See acknowledgements |
This work is licensed under a Creative Commons Attribution License. This copyright applies to the Scraping Ontology Specification and accompanying documentation in RDF. This ontology uses W3C's RDF technology, an open Web standard that can be freely used by anyone.
Semantic scraping defines the mapping between web data and semantic web resources. An RDF model that formalizes this mapping has been defined and is called the Scraping Ontology.
The proposed vocabulary serves as a link between the data in HTML documents and RDF data by defining a model for scraping agents. With this RDF model, it is possible to build an RDF graph of HTML nodes from a given HTML document, which provides semantics to syntactic scraping.
The web is a hypermedia system that follows the REST architectural style. When a client accesses a web resource on a server, the server returns a representation of the resource. Usually, these representations are formatted in HTML, a language that defines the structure of a document for rendering in a web browser. HTML documents are structured as a DOM tree, which defines the logical structure that the browser uses to render the representation. To obtain information about the resource’s content rather than its rendering structure, Linked Data proposes using representations that include metadata, either by enhancing HTML with semantic annotations or by providing RDF representations.
Whenever a resource provides unannotated HTML, some technique that processes the DOM tree is needed to identify the structure of the data present in the HTML document and build the associated RDF graph. In addition, a web resource contains DOM fragments that carry no information, such as advertisements, headers, footers, or decorative elements, while other fragments, such as posts or comments, hold valuable information. Discovery rules can be employed to identify which pieces of information are relevant in a web resource and which relations are stated in a web fragment. For instance, a heading in a piece of news might represent the news title. A discovery rule could use Cascading Style Sheets (CSS) information, rendering information, or natural language processing (NLP) to identify the relevant data in the resource’s representation.
Therefore, the input model that discovery rules use at this level comprises HTML fragments, which identify relevant pieces of data in a document, and selectors, which are any means of identifying a fragment inside a document. Usually, web scrapers use regular expressions, CSS selectors, or XPath selectors for these tasks. The output of a web browser when rendering a web fragment, which consists of a set of properties such as typeface, color, or dimensions, can also be used through visual selectors.
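As a rough illustration of how a selector identifies fragments inside a document, the sketch below extracts the same headings with a path selector and with a regular expression. It uses only the Python standard library; `xml.etree.ElementTree` supports a limited subset of XPath and requires well-formed markup. The sample page and the function names are hypothetical and are not part of the Scraping Ontology.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page; it is not taken from the specification.
SAMPLE_PAGE = """<html>
  <body>
    <div class="header">Site banner</div>
    <div class="post"><h1>First headline</h1><p>First body.</p></div>
    <div class="post"><h1>Second headline</h1><p>Second body.</p></div>
  </body>
</html>"""


def select_by_path(document: str, path: str) -> list[str]:
    """Return the stripped text of every element matched by an ElementTree path."""
    root = ET.fromstring(document)
    return [(node.text or "").strip() for node in root.findall(path)]


def select_by_regex(document: str, pattern: str) -> list[str]:
    """Return the first capture group of every regular-expression match."""
    return [match.group(1).strip() for match in re.finditer(pattern, document, re.DOTALL)]


# The path selector skips the banner because it only descends into "post" fragments.
path_titles = select_by_path(SAMPLE_PAGE, ".//div[@class='post']/h1")
regex_titles = select_by_regex(SAMPLE_PAGE, r"<h1>(.*?)</h1>")
print(path_titles)
print(regex_titles)
```

Real pages are rarely well-formed XML, so a production scraper would more likely rely on a tolerant HTML parser; the structure of the selection step stays the same.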
In contrast, the output model comprises the different types of content that are available on the web. Ontologies such as Semantically-Interlinked Online Communities (SIOC), Friend of a Friend (FOAF), or Dublin Core (DC) address this issue by defining schemas for modeling blog posts, relationships between users, or metadata annotations in publications, and thus make up the output content model of our discovery framework.
An alphabetical index of Scraping Ontology terms, by class (concepts) and by property (relationships, attributes), is given below. All the terms are hyperlinked to their detailed description for quick reference.
Classes: | BaseUriSelector | CssSelector | Format | Fragment | Html | Index | KeywordSelector | ListSelector | NewUriSelector | Page | Plain | RootSelector | SectionSelector | Selector | SliceSelector | TagSelector | UnivocalSelector | UriPatternSelector | UriSelector | WikiText | XPathSelector |
Properties: | debug | document | downcase | identifier | index | keyword | path | prefix | relation | sameas | selector | subfragment | suffix | superclass | tag | type | uri |
The diagram presented below shows the most relevant connections between the main classes that implement the data model of the Scraping Ontology.
The basic classes of the model are described next:
An example of the usage of selectors for a news scraper is shown in the figure below. In this case, a scraper is defined that is able to scrape a set of posts (by using the SIOC ontology) from a specific URI. A sample mapped RDF graph is shown in the figure, too.
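The mapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ontology's reference implementation: the `Fragment` dataclass, its field names, the `#postN` URI scheme, the sample page, and the use of ElementTree paths in place of real selectors are all hypothetical. The `sioc:Post`, `sioc:content`, and `dc:title` terms come from the SIOC and Dublin Core vocabularies mentioned earlier.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple
import xml.etree.ElementTree as ET

Triple = Tuple[str, str, str]


@dataclass
class Fragment:
    """Loose stand-in for a Fragment description; the field names are hypothetical."""
    selector: str                       # ElementTree path standing in for a selector
    rdf_type: Optional[str] = None      # plays the role of the `type` property
    relation: Optional[str] = None      # plays the role of the `relation` property
    subfragments: List["Fragment"] = field(default_factory=list)


def scrape(document: str, base_uri: str, mapping: Fragment) -> List[Triple]:
    """Apply a two-level fragment mapping and return (subject, predicate, object) triples."""
    root = ET.fromstring(document)
    triples: List[Triple] = []
    for position, node in enumerate(root.findall(mapping.selector), start=1):
        subject = f"{base_uri}#post{position}"  # hypothetical URI-minting scheme
        if mapping.rdf_type:
            triples.append((subject, "rdf:type", mapping.rdf_type))
        for sub in mapping.subfragments:
            if sub.relation is None:
                continue  # a subfragment without a relation yields no triple here
            for child in node.findall(sub.selector):
                triples.append((subject, sub.relation, (child.text or "").strip()))
    return triples


# Hypothetical news page and mapping: each "post" block becomes a sioc:Post.
NEWS_PAGE = """<html><body>
<div class="ad">Buy now</div>
<div class="post"><h1>First headline</h1><p>First body.</p></div>
<div class="post"><h1>Second headline</h1><p>Second body.</p></div>
</body></html>"""

NEWS_MAPPING = Fragment(
    selector=".//div[@class='post']",
    rdf_type="sioc:Post",
    subfragments=[
        Fragment(selector="h1", relation="dc:title"),
        Fragment(selector="p", relation="sioc:content"),
    ],
)

triples = scrape(NEWS_PAGE, "http://example.org/news", NEWS_MAPPING)
for triple in triples:
    print(triple)
```

The advertisement block is ignored because no fragment selects it, which mirrors the distinction between relevant and irrelevant DOM fragments made earlier in the text.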
Below is a comprehensive list of all Scraping Ontology classes and properties, with their descriptions.
Class: | BaseUriSelector |
---|---|
Status: | unknown |
Sub class of | Selector |
Class: | CssSelector |
---|---|
Status: | unknown |
Sub class of | Tag selector List selector |
Class: | Format |
---|---|
Status: | unknown |
Used with: | format |
Has sub class | Html Plain WikiText |
Class: | Fragment |
---|---|
Status: | unknown |
Properties include: | subfragment superclass identifier tag type sameas relation selector |
Used with: | subfragment |
Class: | Html |
---|---|
Status: | unknown |
Sub class of | Format |
Class: | Index |
---|---|
Status: | unknown |
Class: | KeywordSelector |
---|---|
Status: | unknown |
Properties include: | keyword |
Sub class of | Selector |
Class: | ListSelector |
---|---|
Status: | unknown |
Properties include: | index |
Sub class of | Selector |
Has sub class | XPath selector CSS selector Slice selector |
Class: | NewUriSelector |
---|---|
Status: | unknown |
Properties include: | prefix suffix downcase |
Sub class of | Selector |
Class: | Page |
---|---|
Status: | unknown |
Class: | Plain |
---|---|
Status: | unknown |
Sub class of | Format |
Class: | RootSelector |
---|---|
Status: | unknown |
Sub class of | Tag selector |
Class: | SectionSelector |
---|---|
Status: | unknown |
Sub class of | Selector |
Class: | Selector |
---|---|
Status: | unknown |
Properties include: | format debug |
Used with: | identifier selector |
Has sub class | URI selector Section selector Tag selector URI pattern selector Base URI selector New URI selector Keyword selector List selector |
Class: | SliceSelector |
---|---|
Status: | unknown |
Sub class of | List selector |
Class: | TagSelector |
---|---|
Status: | unknown |
Properties include: | attribute |
Sub class of | Selector |
Has sub class | Root selector CSS selector XPath selector |
Class: | UnivocalSelector |
---|---|
Status: | unknown |
Properties include: | path document |
Class: | UriPatternSelector |
---|---|
Status: | unknown |
Sub class of | Selector |
Class: | UriSelector |
---|---|
Status: | unknown |
Sub class of | Selector |
Class: | WikiText |
---|---|
Status: | unknown |
Sub class of | Format |
Class: | XPathSelector |
---|---|
Status: | unknown |
Sub class of | Tag selector List selector |
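The "Sub class of" rows in the class entries above can be collected into a small lookup table to answer questions such as "is a Slice selector also a Selector?". The sketch below is a minimal illustration that assumes the entries appear in the same alphabetical order as the class index, so that each entry can be paired with a class name; the CamelCase names are taken from that index, and the helper functions are hypothetical.

```python
# Direct "Sub class of" relations transcribed from the class entries,
# assuming the entries follow the alphabetical order of the class index.
SUBCLASS_OF = {
    "BaseUriSelector": ["Selector"],
    "CssSelector": ["TagSelector", "ListSelector"],
    "Html": ["Format"],
    "KeywordSelector": ["Selector"],
    "ListSelector": ["Selector"],
    "NewUriSelector": ["Selector"],
    "Plain": ["Format"],
    "RootSelector": ["TagSelector"],
    "SectionSelector": ["Selector"],
    "SliceSelector": ["ListSelector"],
    "TagSelector": ["Selector"],
    "UriPatternSelector": ["Selector"],
    "UriSelector": ["Selector"],
    "WikiText": ["Format"],
    "XPathSelector": ["TagSelector", "ListSelector"],
}


def ancestors(cls: str) -> set[str]:
    """Return every direct and indirect superclass of `cls`."""
    found: set[str] = set()
    stack = list(SUBCLASS_OF.get(cls, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(SUBCLASS_OF.get(parent, []))
    return found


def is_a(cls: str, parent: str) -> bool:
    """True when `cls` equals `parent` or inherits from it, directly or not."""
    return cls == parent or parent in ancestors(cls)


print(sorted(ancestors("CssSelector")))
print(is_a("SliceSelector", "Selector"))
print(is_a("Html", "Selector"))
```

Classes without a "Sub class of" row (Format, Fragment, Index, Page, Selector, UnivocalSelector) are simply absent from the table and therefore have no ancestors in this sketch.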
Property: | debug |
---|---|
Status: | unknown |
Domain: | Selector |
Range: | rdf:Literal |
Property: | document |
---|---|
Status: | unknown |
Domain: | Univocal selector |
Range: | rdf:Literal |
Property: | downcase |
---|---|
Status: | unknown |
Domain: | New URI selector |
Range: | rdf:Literal |
Property: | identifier |
---|---|
Status: | unknown |
Domain: | Fragment |
Range: | Selector |
Property: | index |
---|---|
Status: | unknown |
Domain: | List selector |
Range: | rdf:Literal |
Property: | keyword |
---|---|
Status: | unknown |
Domain: | Keyword selector |
Range: | rdf:Literal |
Property: | path |
---|---|
Status: | unknown |
Domain: | Univocal selector |
Range: | rdf:Literal |
Property: | prefix |
---|---|
Status: | unknown |
Domain: | New URI selector |
Range: | rdf:Literal |
Property: | relation |
---|---|
Status: | unknown |
Domain: | Fragment |
Range: | rdf:Property |
Property: | sameas |
---|---|
Status: | unknown |
Domain: | Fragment |
Range: | rdf:Resource |
Property: | selector |
---|---|
Status: | unknown |
Domain: | Fragment |
Range: | Selector |
Property: | subfragment |
---|---|
Status: | unknown |
Domain: | Fragment |
Range: | Fragment |
Property: | suffix |
---|---|
Status: | unknown |
Domain: | New URI selector |
Range: | rdf:Literal |
Property: | superclass |
---|---|
Status: | unknown |
Domain: | Fragment |
Range: | rdf:Class |
Property: | tag |
---|---|
Status: | unknown |
Domain: | Fragment |
Range: | rdf:Literal |
Property: | type |
---|---|
Status: | unknown |
Domain: | Fragment |
Range: | rdf:Class |
Property: | uri |
---|---|
Status: | unknown |
Domain: | rdf:Resource |
Range: | rdf:Literal |
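The Domain and Range rows of the property entries above lend themselves to a simple lookup, for example to list which properties may be attached to a given class. The sketch below is an illustration only: it assumes the property entries follow the alphabetical order of the property index, copies the range names exactly as they are printed in this document, and uses hypothetical helper names. It performs no subclass inference.

```python
# (domain, range) pairs transcribed from the property entries,
# assuming the entries follow the alphabetical order of the property index.
PROPERTIES = {
    "debug": ("Selector", "rdf:Literal"),
    "document": ("UnivocalSelector", "rdf:Literal"),
    "downcase": ("NewUriSelector", "rdf:Literal"),
    "identifier": ("Fragment", "Selector"),
    "index": ("ListSelector", "rdf:Literal"),
    "keyword": ("KeywordSelector", "rdf:Literal"),
    "path": ("UnivocalSelector", "rdf:Literal"),
    "prefix": ("NewUriSelector", "rdf:Literal"),
    "relation": ("Fragment", "rdf:Property"),
    "sameas": ("Fragment", "rdf:Resource"),
    "selector": ("Fragment", "Selector"),
    "subfragment": ("Fragment", "Fragment"),
    "suffix": ("NewUriSelector", "rdf:Literal"),
    "superclass": ("Fragment", "rdf:Class"),
    "tag": ("Fragment", "rdf:Literal"),
    "type": ("Fragment", "rdf:Class"),
    "uri": ("rdf:Resource", "rdf:Literal"),
}


def properties_with_domain(domain: str) -> list[str]:
    """Return, in alphabetical order, the properties whose declared domain is `domain`."""
    return sorted(name for name, (dom, _rng) in PROPERTIES.items() if dom == domain)


def range_of(prop: str) -> str:
    """Return the declared range of `prop`; raises KeyError for unknown properties."""
    return PROPERTIES[prop][1]


fragment_properties = properties_with_domain("Fragment")
new_uri_properties = properties_with_domain("NewUriSelector")
print(fragment_properties)
print(new_uri_properties)
print(range_of("subfragment"))
```

Under this pairing, the properties found for Fragment coincide with the "Properties include" row of the Fragment class entry, which gives some confidence that the entries and the index are aligned.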
This documentation has been generated automatically from the most recent ontology specification in OWL using a Python script called SpecGen. The style formatting was inspired by the FOAF specification.
Special thanks for support with ontology creation and research go to Prof. Carlos A. Iglesias, Prof. Mercedes Garijo, and the members of the GSI Group of the DIT department of the Universidad Politécnica de Madrid.
This work has been funded by the European Union through the Omelette project.