Semantic Web & Linked Open Data
This is one of several posts written in collaboration with the students participating in Sease’s Scientific Blog Posting seminar at the University of Padova. This post was written in collaboration with the student Daniel Lupu.
What is the Semantic Web?
The Web was invented in 1989 by Tim Berners-Lee at CERN as a linked information system to ease storing and retrieving documents from large projects at CERN. The novel idea was to build on the Internet, the then-recent global system of interconnected computers, and to turn it into a global information system by means of hyperlinks.
Web 1.0 or “Read Web” was born, with producers creating static web pages and consumers being passive readers. The next natural iteration was the “Read/Write or Participative Web”, allowing the introduction of user-generated content, the rise of social media, and standard technologies such as JSON and REST to provide improved interoperability (Web 2.0).
In the meantime, a new vision was emerging: an extension from a Web of documents to a Web of data that can be processed by machines, the “Semantic Web”, also known as “Web 3.0”. Its roots are in the early days of the Web, with references already in the 90s, but it was officially presented in a May 2001 Scientific American article by Tim Berners-Lee, James Hendler, and Ora Lassila.
What can the Semantic Web do for me?
Imagine planning a trip to a new city and looking for hotels that are close to the city centre, allow late check-in, and include breakfast. To find all the alternatives you might check multiple travel agencies, both to avoid missing the best option and to verify that the reported information is correct. Instead, what if the hotels themselves published data that could be directly understood by the software of online travel agencies?
Let’s say you ask your voice assistant for the opening hours and telephone number of a new restaurant in town. As of today, search engines will try to best match your request, providing several different web pages that you need to explore in order to find your answer. If the restaurant owner published data using a specific and agreed-upon syntax, it could be made directly available to machines, which could then present it to you quickly and conveniently.
How to write data on the Semantic Web?
In order to make distributed data understandable by machines we need to agree on a standard. The World Wide Web Consortium (W3C) is the authority in the field, setting and promoting standards for an open and equitable Web. Resource Description Framework (RDF) is the standard model to represent data on the Web:
- Everything in RDF is a resource. A resource can be a concept or a physical object, for example the movie “The Godfather Part I”.
- Every resource is represented by a URI, a unique sequence of characters that identifies a logical or physical resource across the Web.
- RDF describes reality as statements represented by three values, called triples, in the format: “Subject – Predicate – Object”. The subject denotes the resource being described; the predicate denotes a trait or aspect of that resource and expresses a relationship between the subject and the object.
Let’s see this through an example!
Suppose we have the triple: “The Godfather Part I – contentLocation – New York City”.
Elements in a triple are represented by URIs:
Subject → https://dbpedia.org/page/The_Godfather_Part_I
Predicate → https://schema.org/contentLocation
Object → https://dbpedia.org/page/New_York_City
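To make the structure concrete, here is a minimal Python sketch that holds the triple above as a plain three-field record. The class name Triple and its field names are our own illustrative choices, not part of RDF or of any library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate, object (illustrative type)."""
    subject: str
    predicate: str
    obj: str  # "object" would shadow a Python builtin, hence "obj"


# The example triple from the text, with every element given as a URI.
godfather_location = Triple(
    subject="https://dbpedia.org/page/The_Godfather_Part_I",
    predicate="https://schema.org/contentLocation",
    obj="https://dbpedia.org/page/New_York_City",
)

# A NamedTuple unpacks like an ordinary tuple.
s, p, o = godfather_location
print(s, p, o, sep="\n")
```

A plain tuple would work just as well; the named fields only make the three roles explicit.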
Objects can be a resource or a literal. Literals represent values such as strings, numbers, and dates. They can be plain (a string with an optional language tag) or typed (a string combined with a datatype URI). Typed literals allow for operations, such as sums between integers or comparisons between values.
Subject → https://dbpedia.org/page/The_Godfather_Part_I
Predicate → http://www.w3.org/2000/01/rdf-schema#label
Object (plain literal) → "The Godfather"@en
Object (typed literal) → "1972-01-31"^^xsd:date
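The difference between plain and typed literals can be sketched in a few lines of Python. This is only an illustration: the Literal class and its to_python method are hypothetical names of ours, the sketch converts just two XSD datatypes, and the second date is made up purely to have something to compare against.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Namespace of the XML Schema datatypes that the xsd: prefix stands for.
XSD = "http://www.w3.org/2001/XMLSchema#"


@dataclass(frozen=True)
class Literal:
    """Illustrative literal: lexical form plus language tag or datatype."""
    lexical: str
    language: Optional[str] = None
    datatype: Optional[str] = None

    def to_python(self):
        # Only two datatypes are handled in this sketch.
        if self.datatype == XSD + "date":
            return date.fromisoformat(self.lexical)
        if self.datatype == XSD + "integer":
            return int(self.lexical)
        return self.lexical  # plain literals stay strings


label = Literal("The Godfather", language="en")
date_literal = Literal("1972-01-31", datatype=XSD + "date")
later_date = Literal("2000-01-01", datatype=XSD + "date")  # hypothetical value

# Because both literals are typed as dates, they compare as dates.
print(date_literal.to_python() < later_date.to_python())
```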
In order to simplify the representation of triples, the Compact URI (CURIE) syntax can be used, which relies on namespace prefixes. Given the resource “The Godfather” from DBpedia, its CURIE representation is “dbpedia:The_Godfather_Part_I” instead of https://dbpedia.org/page/The_Godfather_Part_I, where dbpedia is a prefix we define beforehand as: “PREFIX dbpedia: <https://dbpedia.org/page/>”.
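Expanding a CURIE back into a full URI is just a lookup and a string concatenation. The sketch below assumes a prefix map held in a Python dictionary; the function name expand_curie is our own.

```python
# Prefix map: the dbpedia prefix as defined in the text.
PREFIXES = {"dbpedia": "https://dbpedia.org/page/"}


def expand_curie(curie: str, prefixes: dict) -> str:
    """Replace the prefix of a CURIE with the namespace it stands for."""
    prefix, sep, local_name = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {curie!r}")
    return prefixes[prefix] + local_name


print(expand_curie("dbpedia:The_Godfather_Part_I", PREFIXES))
```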
Combining everything together and adding some resources we have the following example about movies and actors. The serialization format used is called Turtle, a syntax and file format for expressing data in the RDF data model.
PREFIX dbr: <https://dbpedia.org/page/>
PREFIX dbo: <https://dbpedia.org/ontology/>
PREFIX schema: <https://schema.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
dbr:Irrational_Man schema:contentLocation dbr:New_York_City .
dbr:The_Godfather_Part_I schema:contentLocation dbr:New_York_City ;
    rdfs:label "The Godfather"@en .
dbr:Manhattan_\(1979_film\) schema:contentLocation dbr:New_York_City .
dbr:Emma_Stone schema:actor dbr:Irrational_Man .
dbr:Woody_Allen schema:director dbr:Manhattan_\(1979_film\) ;
    schema:actor dbr:Manhattan_\(1979_film\) ;
    schema:director dbr:Irrational_Man .
dbr:Francis_Ford_Coppola schema:director dbr:The_Godfather_Part_I ;
    dbo:birthPlace dbr:Detroit .
At the top, we define the prefixes for the URIs we are going to use: dbr, dbo, schema, and rdfs.
Then all the triples are listed. If a subject-predicate pair has multiple objects, you can list them separated by commas:
dbr:Woody_Allen schema:director dbr:Manhattan_\(1979_film\) , dbr:Irrational_Man .
If a subject has multiple predicate-object pairs, list them separated by semicolons:
dbr:Woody_Allen schema:director dbr:Manhattan_\(1979_film\) ;
    schema:actor dbr:Manhattan_\(1979_film\) ;
    schema:director dbr:Irrational_Man .
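In Turtle, a semicolon repeats the subject and a comma repeats both subject and predicate. A small Python sketch can show how flat triples collapse into that shorthand. It is not a real Turtle serializer: to_turtle is a hypothetical helper that assumes every term is already a valid prefixed name and does no escaping or prefix handling.

```python
def to_turtle(triples):
    """Group triples by subject, then by predicate, into Turtle shorthand."""
    grouped = {}
    for subject, predicate, obj in triples:
        grouped.setdefault(subject, {}).setdefault(predicate, []).append(obj)

    statements = []
    for subject, predicates in grouped.items():
        # "," separates objects that share subject and predicate.
        parts = [f"{p} {' , '.join(objs)}" for p, objs in predicates.items()]
        # ";" separates predicate-object groups that share the subject.
        statements.append(f"{subject} " + " ;\n    ".join(parts) + " .")
    return "\n".join(statements)


# Raw strings keep the backslash-escaped parentheses of the local name.
woody_allen = [
    ("dbr:Woody_Allen", "schema:director", r"dbr:Manhattan_\(1979_film\)"),
    ("dbr:Woody_Allen", "schema:actor", r"dbr:Manhattan_\(1979_film\)"),
    ("dbr:Woody_Allen", "schema:director", "dbr:Irrational_Man"),
]
turtle_text = to_turtle(woody_allen)
print(turtle_text)
```

The two schema:director triples end up in one comma-separated object list, while schema:actor follows after a semicolon.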
Every URI used here is an HTTP URI and can be dereferenced: from the URI one can read the protocol, server name, port number, and file name, which together identify the resource location and the access mechanism.
Try to navigate them!
Resource “Irrational Man” has URI dbpedia:Irrational_Man or https://dbpedia.org/page/Irrational_Man
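The components of a URI can be inspected with Python's standard urllib.parse module. The sketch below only parses the URI and performs no network request; describe_uri and the default-port table are illustrative helpers of ours.

```python
from urllib.parse import urlsplit

# Well-known default ports, used when the URI does not state one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}


def describe_uri(uri: str) -> dict:
    """Split a URI into the parts that tell a client where and how to fetch it."""
    parts = urlsplit(uri)
    return {
        "protocol": parts.scheme,
        "server": parts.hostname,
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "path": parts.path,
    }


info = describe_uri("https://dbpedia.org/page/Irrational_Man")
for name, value in info.items():
    print(f"{name}: {value}")
```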
The same triples can also be represented as an RDF graph, where subjects and objects (resources or literals) are nodes and predicates are edges. A graph improves readability but becomes messy when there are more than just a few nodes.
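The graph view can be sketched in Python as an adjacency list: each subject maps to its outgoing (predicate, object) edges. The code uses a subset of the triples from the Turtle example, written as plain strings; build_graph and graph_nodes are our own helper names.

```python
from collections import defaultdict

TRIPLES = [
    ("dbr:Irrational_Man", "schema:contentLocation", "dbr:New_York_City"),
    ("dbr:The_Godfather_Part_I", "schema:contentLocation", "dbr:New_York_City"),
    ("dbr:Emma_Stone", "schema:actor", "dbr:Irrational_Man"),
    ("dbr:Francis_Ford_Coppola", "schema:director", "dbr:The_Godfather_Part_I"),
    ("dbr:Francis_Ford_Coppola", "dbo:birthPlace", "dbr:Detroit"),
]


def build_graph(triples):
    """Map each subject node to its labelled outgoing edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def graph_nodes(triples):
    """Every subject and every object is a node of the graph."""
    return {s for s, _, _ in triples} | {o for _, _, o in triples}


graph = build_graph(TRIPLES)
nodes = graph_nodes(TRIPLES)
print(len(nodes), "nodes,", len(TRIPLES), "edges")
```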

How to query the data? SPARQL!
Data published in RDF is understandable by machines and can be queried with SPARQL (SPARQL Protocol and RDF Query Language), a query language that plays for RDF datasets the role SQL plays for relational databases. Websites can provide SPARQL endpoints, HTTP URLs capable of receiving and processing SPARQL requests, such as the SPARQL Query Editor.
Recalling the definition given in https://www.w3.org/TR/rdf-sparql-query/:
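Since an endpoint is just an HTTP URL, a query can be sent as an ordinary GET request with the query text in the query parameter, as the SPARQL Protocol describes. The sketch below only builds the request and never sends it. The endpoint URL https://dbpedia.org/sparql and the requested JSON result format are assumptions of ours, and build_sparql_request is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

# Assumption: DBpedia's public endpoint. Any SPARQL endpoint URL would do.
ENDPOINT = "https://dbpedia.org/sparql"


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Prepare (but do not send) a GET request carrying a SPARQL query."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard JSON serialization of SPARQL results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, "SELECT * WHERE { ?s ?p ?o } LIMIT 5")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would actually execute it. That step is left out here so the sketch runs without network access.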
“A SPARQL query is made of a set of triple patterns called a basic graph pattern. Triple patterns are like RDF triples except that each of the subject, predicate, and object may be a variable. The result of a query is a solution sequence, corresponding to the ways in which the query’s graph pattern matches the data. There may be zero, one, or multiple solutions to a query. Each solution gives one way in which the selected variables can be bound to RDF terms so that the query pattern matches the data. The result set gives all the possible solutions.”
Variables start with a question mark, for example ?movie.
What is Woody Allen director of?
SELECT ?movie WHERE { ?movie dbo:director dbr:Woody_Allen }
The result set comprises all movies that match the triple pattern specified in the WHERE clause, such as “dbr:Irrational_Man” and “dbr:Manhattan_(1979_film)”.
Try the query by clicking here. In this case, the result is presented as an HTML page.
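To see what “matching a triple pattern” means, here is a toy matcher in Python. It is a sketch of the idea only, not how a SPARQL engine is implemented: it handles a single pattern over an in-memory list, and the three sample triples are a small hand-written stand-in for the DBpedia data.

```python
# Hand-written sample data, with terms kept as plain strings.
TRIPLES = [
    ("dbr:Irrational_Man", "dbo:director", "dbr:Woody_Allen"),
    ("dbr:Manhattan_(1979_film)", "dbo:director", "dbr:Woody_Allen"),
    ("dbr:The_Godfather_Part_I", "dbo:director", "dbr:Francis_Ford_Coppola"),
]


def is_variable(term: str) -> bool:
    return term.startswith("?")


def match_pattern(pattern, triples):
    """Return one variable binding (a dict) per triple that fits the pattern."""
    solutions = []
    for triple in triples:
        binding = {}
        for pattern_term, data_term in zip(pattern, triple):
            if is_variable(pattern_term):
                # A variable used twice must bind to the same term both times.
                if binding.get(pattern_term, data_term) != data_term:
                    break
                binding[pattern_term] = data_term
            elif pattern_term != data_term:
                break  # a constant in the pattern does not match this triple
        else:
            solutions.append(binding)
    return solutions


# SELECT ?movie WHERE { ?movie dbo:director dbr:Woody_Allen }
solutions = match_pattern(("?movie", "dbo:director", "dbr:Woody_Allen"), TRIPLES)
for solution in solutions:
    print(solution["?movie"])
```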
Get the dataset of all the movies starring Woody Allen
CONSTRUCT { ?movie dbo:starring dbr:Woody_Allen }
WHERE { ?movie dbo:starring dbr:Woody_Allen }
Try the query by clicking here. In this case, the result is presented as an HTML page.
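Where SELECT returns variable bindings, CONSTRUCT returns triples: each solution of the WHERE pattern is substituted into the template. The toy sketch below illustrates that step under the same simplifications as before: a single pattern, in-memory data, and hand-written sample triples. The helper names are ours.

```python
# Hand-written sample data standing in for the real dataset.
TRIPLES = [
    ("dbr:Manhattan_(1979_film)", "dbo:starring", "dbr:Woody_Allen"),
    ("dbr:Irrational_Man", "dbo:starring", "dbr:Emma_Stone"),
    ("dbr:Manhattan_(1979_film)", "dbo:director", "dbr:Woody_Allen"),
]


def match(pattern, triple):
    """Return the variable binding if the triple fits the pattern, else None."""
    binding = {}
    for pattern_term, data_term in zip(pattern, triple):
        if pattern_term.startswith("?"):
            if binding.setdefault(pattern_term, data_term) != data_term:
                return None
        elif pattern_term != data_term:
            return None
    return binding


def construct(template, pattern, triples):
    """Build one new triple per solution by filling the template's variables."""
    result = []
    for triple in triples:
        binding = match(pattern, triple)
        if binding is not None:
            result.append(tuple(binding.get(term, term) for term in template))
    return result


pattern = ("?movie", "dbo:starring", "dbr:Woody_Allen")
# As in the query above, the template is identical to the WHERE pattern.
subgraph = construct(pattern, pattern, TRIPLES)
print(subgraph)
```

The template does not have to repeat the pattern. It can use a different predicate or swap subject and object, which is how CONSTRUCT reshapes data.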
We’ve just scratched the surface of SPARQL, but if you are interested, a more thorough specification can be found in SPARQL 1.1 Query Language by W3C.
Linked Open Data
The examples used up to now are mostly retrieved from the portal of DBpedia, which contains structured information from Wikimedia projects (such as Wikipedia) and is freely accessible to everyone. One can navigate the website with browsers or automated crawlers and ask complex questions in SPARQL to explore related resources.
This is a good example of a concept closely related to the Semantic Web, which is Linked Open Data (LOD). In order to make the original idea for a Web of Data feasible, we need to have a huge amount of RDF data available, reachable, and interconnected (as opposed to just a collection of individual datasets) on the Web.
The publishing of structured data on the Web adheres to a set of four principles defined by Tim Berners-Lee in a design issues note:
- Use URIs as names for things → Everything can be uniquely identified.
- Use HTTP URIs so that people can look up those names → Everything can be accessed using the Web infrastructure.
- When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL) → Everything has additional descriptive information.
- Include links to other URIs → Everything is connected to other data on the Web.
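Two of these principles lend themselves to a quick automated check: whether names are HTTP URIs, and whether a dataset links out to data hosted elsewhere. The Python sketch below is a rough heuristic under our own assumptions, not an official validator. The example.org dataset is invented for illustration, and is_http_uri and external_links are hypothetical helper names.

```python
from urllib.parse import urlsplit


def is_http_uri(term: str) -> bool:
    """Principle 2: the name can be looked up with the Web infrastructure."""
    return urlsplit(term).scheme in ("http", "https")


def external_links(triples, own_host: str):
    """Principle 4: triples whose object is an HTTP URI on another host."""
    return [
        (s, p, o)
        for s, p, o in triples
        if is_http_uri(o) and urlsplit(o).hostname != own_host
    ]


# Invented dataset published under example.org.
DATASET = [
    ("https://example.org/movie/1", "http://www.w3.org/2002/07/owl#sameAs",
     "https://dbpedia.org/page/The_Godfather_Part_I"),
    ("https://example.org/movie/1", "https://schema.org/director",
     "https://example.org/person/7"),
    ("https://example.org/movie/1", "http://www.w3.org/2000/01/rdf-schema#label",
     "The Godfather"),
]

links = external_links(DATASET, own_host="example.org")
print(len(links), "link(s) to other datasets")
```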
Organizations that release open data, usually public or non-profit ones, do not always follow all of these principles. There can be various reasons, among them the cost of transforming data from one format to another.
There are five stages of publishing Linked Open Data, each building on the previous one:
★ Data is available on the Web with an open license.
★★ Data is available in a machine-readable format (for example .xlsx).
★★★ Data is available in a machine-readable and non-proprietary format (for example .csv).
★★★★ Data is available following W3C standards, i.e. in RDF and SPARQL.
★★★★★ Data is linked to other people’s data to provide context.
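Because each stage builds on the previous one, the rating of a dataset is the number of consecutive stages it satisfies, starting from the first. A minimal Python sketch of that rule follows; the Dataset fields and the star_rating function are illustrative names of ours.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Illustrative description of a published dataset."""
    open_license: bool
    machine_readable: bool
    non_proprietary_format: bool
    uses_w3c_standards: bool  # RDF and SPARQL
    linked_to_other_data: bool


def star_rating(dataset: Dataset) -> int:
    """Count stages from the first one until a requirement is not met."""
    requirements = [
        dataset.open_license,
        dataset.machine_readable,
        dataset.non_proprietary_format,
        dataset.uses_w3c_standards,
        dataset.linked_to_other_data,
    ]
    stars = 0
    for satisfied in requirements:
        if not satisfied:
            break
        stars += 1
    return stars


# An openly licensed .csv file: machine-readable and non-proprietary, but not RDF.
csv_dataset = Dataset(True, True, True, False, False)
print("★" * star_rating(csv_dataset))
```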
The W3C has set standards for the syntax to use when publishing LOD, and a complete reference can be found following links on their website. However, it may be easier to use already existing tools that integrate with the software stack in your organization. A couple of them to quickly start your journey on the Semantic Web are:
We hope that you have found this post to be informative and useful. See you in the next blog post!
Author
Anna Ruggero
Anna Ruggero is a software engineer passionate about Information Retrieval and Data Mining. She loves to find new solutions to problems, suggesting and testing new ideas, especially those that concern the integration of machine learning techniques into information retrieval systems.