This document summarizes the guidelines and best practices for developing sentiment and emotion analysis services using Linked Data models. The content is based on the experience gathered by the partners of the Community Group across different projects. In addition to the general purpose of the community group, the aim of this document is to facilitate the creation of new services and to foster interoperability until a standard in the field is created. A reference implementation is also provided, to ease adoption and to showcase the capabilities of the approach described herein.
This is a draft intended to gather comments from community members, and it is subject to change. There are several ways to participate in the development of these guidelines:
Currently, there are many commercial social media text analysis tools, as well as many social media monitoring tools that generate statistics about presence, influence, and customer/follower engagement, which are presented in intuitive charts on the user's dashboard. Opinion and affect mining is an emerging research direction that can be applied in this kind of monitoring. However, the companies offering such services tend not to disclose the methodologies and algorithms they use to process data. The academic community has also shown interest in these domains (Pang and Lee, 2008), but it remains a research topic and there is little interest in standardization.
An enormous amount of social media content is created daily. An interesting aspect of this type of media is that there are many features in the source beyond pure text that can be exploited. Using these features, we could gain deeper knowledge and understanding of user-generated content, and ultimately train a system to look for more targeted characteristics. Such a system would be more accurate in processing and categorizing such content. Among the extra features in social media, we find the names of the users who created the content, together with information about their demographics and other social activities. Another aspect is that users come from different backgrounds (ethnic, cultural, social), speak different languages, and discuss a variety of topics. Encoding this extra information is beyond the capabilities of any of the existing formats for sentiment analysis. Moreover, different applications require different sentiment and emotion models. For instance, some applications apply binary classification, whereas others require a more fine-grained analysis. This issue affects both the analysis and the resources used in the analysis (i.e., lexicons, corpora, etc.).
The lack of consensus on how to model the heterogeneous context of social media is detrimental in two ways. First, it hinders the emergence of applications that make deep sense of the data. Second, it hampers interoperability between services and the reuse of resources.
A Linked Data approach would tackle both issues. On the one hand, it would enable researchers to use this information, as well as other rich information in the Linked Data cloud. It would also make it possible to infer new knowledge based on existing reusable vocabularies. On the other hand, the combination of a Linked Data approach with a common set of vocabularies would result in higher interoperability, which would make services easier to consume and would enable new features such as service chaining. A simple interoperable model would also foster and speed up the creation of new powerful analysis services. The basic NLP aspects of building Linked Data NLP services have been covered in other documents, such as the guidelines for developing NIF-based NLP services by the BPMLOD Working Group. This document provides a succinct summary of some generic NLP aspects, but it is focused on the aspects that are specific to sentiment and emotion analysis services.
Services should use existing and well-known vocabularies and ontologies in their request definitions and results. NIF (the NLP Interchange Format) should be used as the base vocabulary for the NLP parts of the service. NIF defines classes to represent linguistic data in its Core Ontology. A more detailed description of the ontology and its purposes can be found in the [[[NIF]]].
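As a minimal sketch of these NIF annotations (the document IRI and text are hypothetical), a piece of content and one of its substrings could be represented as follows:

{
  "@context": {
    "nif": "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
  },
  "@graph": [
    {
      "@id": "http://example.org/doc1#char=0,17",
      "@type": ["nif:Context", "nif:RFC5147String"],
      "nif:isString": "I love this phone"
    },
    {
      "@id": "http://example.org/doc1#char=2,6",
      "@type": "nif:RFC5147String",
      "nif:beginIndex": 2,
      "nif:endIndex": 6,
      "nif:anchorOf": "love",
      "nif:referenceContext": { "@id": "http://example.org/doc1#char=0,17" }
    }
  ]
}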
Sentiments and opinions can be annotated using Marl. Emotions and Emotion models can be annotated and modeled with Onyx. Lastly, resources such as corpora can be modeled using vocabularies such as lemon. Provenance information can be added with PROV-O, unambiguously tying results to the process that generated them, and the resources involved.
[[[MARL]]] is a vocabulary to annotate and describe subjective opinions expressed on the web or in particular Information Systems. These opinions may be provided by the user (as in online rating and review systems), or extracted from natural text (sentiment analysis).
Marl models opinions at the aspect and feature level, which is useful for fine-grained opinions and analysis. Marl follows the Linked Data principles and is aligned with [[[PROV-O]]]. It also takes a linguistic Linked Data approach: it represents lexical resources as linked data, and it has been integrated with lemon.
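As an illustrative sketch of an aspect-level annotation (all IRIs are hypothetical; the properties are taken from the Marl vocabulary), an opinion about a feature of a product could be expressed as:

{
  "@context": {
    "marl": "http://www.gsi.upm.es/ontologies/marl/ns#"
  },
  "@id": "http://example.org/doc1#opinion1",
  "@type": "marl:Opinion",
  "marl:extractedFrom": { "@id": "http://example.org/doc1" },
  "marl:describesObject": { "@id": "http://example.org/products/phone1" },
  "marl:describesFeature": "screen",
  "marl:hasPolarity": { "@id": "marl:Positive" },
  "marl:polarityValue": 0.8
}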
Be explicit about the ranges of polarity values, and use polarity classes when possible
Marl makes it possible to specify the range (maximum and minimum values) of polarity values. It is recommended to specify these values, for interoperability with other services. Moreover, Marl defines Positive, Neutral and Negative polarities. This is sufficient for most scenarios, but some applications require more fine-grained categories of sentiment (e.g., mildly positive, very positive). In such cases, relying on fixed values (e.g., 0.25 for mildly positive) is discouraged. Instead, the specific polarities used should be properly defined, documented and published for reuse.
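For instance, a service producing polarity values in the range [-1, 1] could describe its analysis as follows (a minimal sketch, assuming the marl:minPolarityValue and marl:maxPolarityValue properties and a hypothetical service IRI):

{
  "@context": {
    "marl": "http://www.gsi.upm.es/ontologies/marl/ns#"
  },
  "@id": "http://service.example/SAnalysis1",
  "@type": "marl:SentimentAnalysis",
  "marl:minPolarityValue": -1.0,
  "marl:maxPolarityValue": 1.0
}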
[[[ONYX]]] is a vocabulary for emotions in resources, services and tools. It has been designed with services and lexical resources for Emotion Analysis in mind.
What differentiates Onyx from other ontologies for emotion is that, instead of adhering to a specific model of emotions, it provides a meta-model for emotions, i.e., it describes the concepts needed to formalize different models of emotion. These models are known as vocabularies in Onyx's terminology, following the example of EmotionML. A number of commonly used models have already been integrated and published as linked data. A tool for limited two-way conversion between the Onyx representation and EmotionML markup is available, using specific mappings. Just like Marl, Onyx is aligned with the Provenance Ontology, and it can be used together with lemon in lexical resources.
Reuse existing emotion models
There are several emotion models already defined as part of the [[[ONYX-vocabularies]]]. As of this writing, the list includes all EmotionML vocabularies (Ashimura et al., 2014), the WordNet-Affect labels (Strapparava and Valitutti, 2004), and the Hourglass of Emotions (Cambria et al., 2012).
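As an illustrative sketch of reusing one of these models (the entry IRI is hypothetical, and wna:joy stands in for a WordNet-Affect label), an emotion annotation with Onyx could look as follows:

{
  "@context": {
    "onyx": "http://www.gsi.upm.es/ontologies/onyx/ns#",
    "wna": "http://www.gsi.upm.es/ontologies/wnaffect/ns#"
  },
  "@id": "http://example.org/doc1#emotionSet1",
  "@type": "onyx:EmotionSet",
  "onyx:hasEmotion": {
    "@type": "onyx:Emotion",
    "onyx:hasEmotionCategory": { "@id": "wna:joy" },
    "onyx:hasEmotionIntensity": 0.8
  }
}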
lemon is a proposed model for representing lexicons and machine-readable dictionaries and linking them to the Semantic Web and the Linked Data cloud. It was designed to meet the challenges of representing lexical resources as linked data.
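One benefit of this approach is that lexical entries can be enriched with sentiment information from Marl, as in this minimal sketch (IRIs are hypothetical; lemon property names follow the lemon core model):

{
  "@context": {
    "lemon": "http://lemon-model.net/lemon#",
    "marl": "http://www.gsi.upm.es/ontologies/marl/ns#"
  },
  "@id": "http://example.org/lexicon/entry/love",
  "@type": "lemon:LexicalEntry",
  "lemon:canonicalForm": {
    "lemon:writtenRep": { "@value": "love", "@language": "en" }
  },
  "lemon:sense": {
    "@type": "lemon:LexicalSense",
    "marl:hasPolarity": { "@id": "marl:Positive" },
    "marl:polarityValue": 0.9
  }
}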
As stated in the [[[PROV-O]]] documentation, provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness. In the case of sentiment and emotion analysis services, provenance can link the results of an analysis, the request, the resources used in the analysis, and the people and organizations involved in the analysis (e.g., authors).
The PROV Ontology (PROV-O) expresses the PROV Data Model using the OWL2 Web Ontology Language (OWL2). It provides a set of classes, properties, and restrictions that can be used to represent and interchange provenance information generated in different systems and under different contexts.
Include provenance information in service results
PROV-O should be used to link the annotations generated by a service (e.g., polarity) with the original content and with the algorithm or process that produced them. Ideally, references to all other resources used (e.g., lexicons) and to sub-processes should also be included.
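For example, a sentiment annotation could be tied to the activity that generated it and to the lexicon that the activity used, as in this sketch (the opinion, service and lexicon IRIs are hypothetical):

{
  "@context": {
    "prov": "http://www.w3.org/ns/prov#",
    "marl": "http://www.gsi.upm.es/ontologies/marl/ns#"
  },
  "@graph": [
    {
      "@id": "http://example.org/doc1#opinion1",
      "prov:wasGeneratedBy": { "@id": "_:analysisActivity" }
    },
    {
      "@id": "_:analysisActivity",
      "@type": ["marl:SentimentAnalysis", "prov:Activity"],
      "prov:wasAssociatedWith": { "@id": "http://service.example/SAnalysis1" },
      "prov:used": { "@id": "http://example.org/lexicon" }
    }
  ]
}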
The [[[NIF]]] defines a set of base and optional parameters for NLP services. In addition to those, sentiment and emotion analysis services require specific parameters to better process sentiment and emotion. The following table presents an extended list of API parameters that includes both base and specific parameters:
parameter | aliases | example values | description |
---|---|---|---|
input | i | | serialized data (i.e., the text, or other formats depending on informat) |
informat | f | | format in which the input is provided |
outformat | o | | format in which the output is serialized |
prefix | p | http://service.example/ns | prefix used to create new IRIs and to expand relative IRIs |
language | l | | language of the text or content (preferably as an ISO 639-1 or 639-2 code) |
domain | d | | domain of the content, which the service may use to provide better results |
min-polarity | | | Minimum polarity value for sentiments. The service should use this parameter to normalize the values of the sentiment annotations. |
max-polarity | | | Maximum polarity value for sentiments. The service should use this parameter to normalize the values of the sentiment annotations. |
emotion-model | | | Emotion model to use in the response |
conversion | | | Defines how the results of converting annotations from the service's model to the model specified by the user should be presented in the results. |
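For illustration, a request that combines several of these parameters against a hypothetical endpoint might look like this (the endpoint path and parameter values are examples only):

http://service.example/api/analyse?input=I+love+this+phone&informat=text&outformat=json-ld&language=en&min-polarity=-1&max-polarity=1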
In general, schema-less RDF serialization formats are preferred. Among these, human readable options should be prioritized for user-facing services, for demonstration purposes, and for any service that is meant to be consumed by third parties. A popular choice is Turtle, for its combination of terseness and readability.
Unfortunately, several factors can make schema-less formats unfeasible in certain scenarios. Firstly, support for RDF formats varies across platforms and programming languages. Secondly, developers may simply be more familiar with other serialization technologies. Hence, other formats may be considered, as long as all semantic information is preserved. JSON-LD is a good candidate in this regard, as it combines one of the most popular serialization formats (JSON) with RDF semantics.
There are two key points to take into consideration when using JSON-LD as serialization format for a service: 1) providing a readable and sensible schema, and 2) maintaining semantics intact.
JSON-LD allows for the definition of contexts, which can be used to provide semantic information about the structure in the JSON document. A properly crafted context can simplify the structure of the JSON-LD object, and reduce its verbosity.
{ "@context": { "@base": "http://ldmesa.example/", "dc": "http://dublincore.org/2012/06/14/dcelements#", "emoml": "http://www.gsi.upm.es/ontologies/onyx/vocabularies/emotionml/ns#", "prov": "http://www.w3.org/ns/prov#", "marl": "http://www.gsi.upm.es/ontologies/marl/ns#", "nif": "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#", "onyx": "http://www.gsi.upm.es/ontologies/onyx/ns#", "wna": "http://www.gsi.upm.es/ontologies/wnaffect/ns#", "xsd": "http://www.w3.org/2001/XMLSchema#" } }
This section presents examples of results for different kinds of requests.
{ "@context": "http://ldmesa.example/context.jsonld", "@id": "me:Result1", "@type": "results", "activities": [ { "@id": "_:SAnalysis1_Activity", "@type": "marl:SentimentAnalysis", "prov:wasAssociatedWith": "me:SAnalysis1" } ], "entries": [ { "@id": "http://micro.blog/status1", "@type": [ "nif:RFC5147String", "nif:Context" ], "nif:isString": "Dear Microsoft, put your Windows Phone on your newest #open technology program. You'll be awesome. #opensource", "sentiments": [ { "@id": "http://micro.blog/status1#char=80,97", "nif:beginIndex": 80, "nif:endIndex": 97, "nif:anchorOf": "You'll be awesome.", "marl:hasPolarity": "marl:Positive", "marl:polarityValue": 0.9, "prov:wasGeneratedBy": "_:SAnalysis1_Activity" } ], "emotions": [ ] } ] }
The best way to ensure compatibility and a uniform application of these guidelines is to implement them in thoroughly tested open source libraries and frameworks. This is a curated list of frameworks and libraries that follow at least some of these guidelines. Please consider using one of these alternatives before developing an ad-hoc solution. If you find a bug or a missing feature in them, contribute to them so that the rest of the community can benefit.
This is a work in progress. If you have developed a tool or framework that is not in the list, please get in touch with the authors.