
Tuesday, January 13, 2015

Getting started with Semantic Technologies

Ontotext recently launched the Self-Service Semantic Suite, or S4 for short. S4 provides a set of services for low-cost (currently free) on-demand text analytics and metadata management in the cloud. It is a great way to get acquainted with Semantic Technologies.


Available S4 services

S4 currently offers the following services:

  1. Text analytics for News content, Biomedical content or Twitter content
  2. Linked Data server with reliable access to the DBpedia, FactForge, GeoNames, WordNet, MusicBrainz, and New York Times datasets 
  3. Self-managed RDF database (GraphDB) on the cloud

Trying out Text analytics

Text analytics – in this context – is about finding out what is important in natural-language texts, and putting that information to use.

To try this out, copy some text containing persons and places, biomedical terms and/or Twitter content. English gives the best results, but other languages will produce results as well.
Go to the S4 homepage and click on “Demo S4 today for free”. Paste your text in the Text Analytics box, choose whether your text is more News, Biomedical or Twitter oriented, and hit Execute.

Your result will show the provided text with different types of terms highlighted in different colours. See the example below.



Figure 1. Example of an annotated text

If you hover over an annotated term, it will show extra information. For instance, for an organisation it will show its entry in DBpedia (the semantic version of Wikipedia). In my example this makes clear that the article is not about just any IMF, it is about the IMF, and more info is available on http://dbpedia.org/page/International_Monetary_Fund.

So what’s in it for you? 

This service can provide all kinds of structure and topic information that help you classify, understand, link and enrich your content.

Trying out Semantic queries

S4 also lets you try out semantic queries using SPARQL, the query language for semantically stored information such as DBpedia.
Go to the S4 homepage and click on “Demo S4 today for free”. Go to the LOD Access tab and select a query from the pulldown. Let’s try “Find airports near London”.
The SPARQL query is:

PREFIX geo-pos: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX omgeo: <http://www.ontotext.com/owlim/geo#>
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX dbp-ont: <http://dbpedia.org/ontology/>
PREFIX ff: <http://factforge.net/>
PREFIX om: <http://www.ontotext.com/owlim/>

SELECT distinct ?airport ?label ?RR
WHERE {
        dbpedia:London geo-pos:lat ?latBase ;
                       geo-pos:long ?longBase .
        ?airport omgeo:nearby(?latBase ?longBase "50mi");
                 a dbp-ont:Airport ;
                 ff:preferredLabel ?label ;
                 om:hasRDFRank ?RR .
      } ORDER BY DESC(?RR)


Even without a SPARQL crash-course, this is quite easy to read:

  • First some prefix definitions are introduced
  • The query will return the airport ID, its name and a rank (RR)
  • From DBpedia the latitude and longitude of London are retrieved
  • Only results that have the DBpedia Ontology type “Airport” are selected
  • They must be within 50 miles of London, according to the OWLIM geospatial function “nearby”

Click on “Execute” and have a look at the results. Try out some of the other queries as well.

So what’s in it for you? 

An enormous wealth of structured information is available for you to use. Did you know you could ask Wikipedia such detailed questions? Be aware, though: considerable knowledge is needed to write such concise queries and to really use the results.

Next step

Now that you know that Text analysis and Semantic queries are available, what is your next step in using Semantic technologies?

Thursday, September 25, 2014

Thesis students

Being located close to Delft University of Technology, my employer has strengthened its relations with universities. I have the pleasure of mentoring two students performing their thesis study at HintTech.

Krishna Akundi from TU Delft has just completed his thesis study on visualizing large news archives from a temporal perspective. What started as more fundamental research of the subject has turned into a quest for the most usable interface to visualize trends in news topics over time.



Marian Szabo from TU Delft has just started on the subject of finding and fixing errors in ontologies. All ontologies can contain errors, and Marian’s quest is to help fix them by finding them, visualizing them and showing possible repairs.

Both offer interesting challenges in the exploration, understanding and visualization of Big Content sets, especially sets of RDF.

We are grateful to Newz for the use of their data and for providing substantive direction.

Monday, May 5, 2014

Big Content challenges

At Dayon, we are used to working with Big Data. Coming from a publishing background, we have provided content solutions to publishers since 1997.

I read some stories about Big Content, and was intrigued that Gartner saw Big Content as the unstructured part of Big Data. To me, Big Content is the structured version of Big Data.

Let me explain this and address some challenges and Big Content technologies.

Planned Variety

In terms of the three Big Data V’s (Volume, Velocity and Variety), publishers’ content is odd. Since the goal of publishers is to make a profit from providing content, content must be publishable to a vast range of channels. To enable this, content must be structured (preferably in XML) and enriched with metadata. Any Variety is planned, because unplanned Variety leads to unplanned structures and/or unplanned publications.

Data is generated, whereas Content is handcrafted. Tweets and Facebook posts are only lightly structured, but blog posts are already quite structured. Some numbers by Chartbeat can be found here, as well as a useful insight from Fast Company on the rise of “Big Content” as a marketing tool.

Publishers’ content is usually completely structured: XML plus Meta Data, sometimes already as RDF Triples (read my earlier blog post on Semantic Technologies).

So to me, Content is structured Data. Big Content problems differ from other Big Data problems, where handling the Variety in order to understand your data is a big issue. Therefore, I would like to label the publisher’s challenge a Big Content challenge.

So how big is Big Content?

A quick scan of some of our publishing clients provided these numbers (XML only!):

  1. Publisher 1: 10 million files, 25 GB
  2. Publisher 2: 750.000  files, 15 GB
  3. Publisher 3: 150 million files, 15 GB
  4. Publisher 4: 1 million files, 15.000 new files per day (max)
  5. Publisher 5: 45 million files, 20.000 new files per day (max)
  6. Publisher 6: 500.000  files

Challenges of Big Content

With these numbers in mind, what are the challenges for Big Content?
  1. Volume - XML: Are 30 million XML files a challenge? Is 25 GB of XML a challenge? It really should not be, but in reality I have met quite a few technologies struggling with these amounts. An XML system should be true XML to handle this amount of data. XML isn't hard. Doing XML right is hard. If you don’t do XML right, 100.000 files or 1 GB of XML can give you plenty of headaches.
  2. Volume - Other file types: Alas, not all Content is XML. Many publishers still manage huge amounts of HTML, PDF or other file formats. With PDF, huge numbers often also turn into huge volumes, because multi-channel publishing means print-quality PDF is stored. If you have to index lots of other file types, do a proper intake process per file type and weed out the corrupt and the largest files.
  3. Volume - Subscriptions: At various clients I encountered the problem that Big Content is offered in a large number of different Subscriptions. While a large number of different Subscriptions is not a problem in itself, the combination of Big Content and a Big number of Subscriptions often is. So if you offer lots of data, be smart about the number of Subscriptions.
  4. Volume - Triples: Nearly all publishers storing Big Content are looking into Triples as a way to store and link the Meta Data from their XML files. Storing your Meta Data in a Triple Store and linking it to the Linked Open Data cloud can be a very good idea, but it calls for a Big Triple Store. A set of 1 billion Triples isn't exceptional, and it too requires Big Content Technology.
  5. Velocity - Real Time Indexing: Failing at real-time indexing is usually the first sign that you are becoming a Big Content publisher. Many technologies struggle with incremental updates and need complete re-indexing, which in turn leads to strange solutions such as overnight indexing, flip-flopping, or indexes out of sync with the rest of the front-end.
  6. Velocity - Real Time Alerting: The value of Content depends on its relevance, and timeliness is a huge factor in relevance. Real Time Alerting will offer a competitive edge to content users. To provide Real Time Alerting, the XML store needs to handle alerting efficiently (using minimal resources) at load time.
  7. Variety - Presentation: A Big Content challenge can be how to present all of this Content. If a simple “What’s New” view results in 20.000 hits, what are you going to show the customer? The most used solutions are:
    1. Provide a search-only interface
    2. Provide as much structure from the Meta Data as possible, to assist the user in drilling down to the most useful Content
  8. Variety - Enrichment: If the Meta Data you need to provide useful segmentation of your Big Content to your end users just isn't there, additional Enrichment is needed. Due to costs, Big Content calls for automated enrichment using Natural Language Processing.

Big Content Technology

At Dayon / HintTech we strongly believe that Big Content challenges require specialized Big Content Technology. Here are some of the Big Content Technologies we have implemented:
  1. MarkLogic
    Several of our Big Content clients have selected MarkLogic as their content platform. I believe that MarkLogic is the best XML store and indexer available at this moment.
    As a big bonus, MarkLogic comes with all kinds of useful features such as XQuery, an Application Server, and now even a Triple Store.
    Find out more about MarkLogic at MarkLogic World Amsterdam and meet us there!
  2. OWLIM
    In our project at Newz we needed a Big Content Triple Store. We found OWLIM by Ontotext to be an excellent Big Triple Store, as did the BBC and the Press Association.
    W3C maintains a list of Big Triple Stores, with BigOWLIM as one of the top products.
    We also selected Ontotext as our partner for their Semantic Tagging capabilities.
  3. SOLR
    We have also implemented SOLR for Big Content collections. SOLR does not address all of the Big Content challenges, but it is a great open source search engine.

PS: After writing this blog, I feel like renaming Meta Data to Meta Content. Probably better if I don’t…

Thursday, April 24, 2014

Using Semantic Technologies to crunch Big Data

The interest in Big Data (see this nice post by a colleague on the subject) has sparked a new interest in Semantic Technologies. It is clear that the Volume and Variety of Big Data require technologies that can structure and segment Big Data into useful and usable structures. For this, Semantic Technologies are used.

However, there are different kinds of Semantic Technologies around, so I will start off with an introduction to Semantics and the Semantic Web. Next I will cover two key Semantic Technologies to arrive at the goal of this introduction: how Semantic Technologies help us to crunch Big Data.

Semantics and Semantic Web

In 2001 Tim Berners-Lee and others published an article in Scientific American: “The Semantic Web: A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities.”

This title provides a first definition of Semantic Data:
"Content that is meaningful to computers"
Tim Berners-Lee understood that HTML web pages were useful to humans, but since they were (and often still are) encoded to store visual information rather than the meaning of the information, automated systems could not understand them.

To be meaningful for computers, content has to be encoded in such a way that the meaning is clear, and can be processed automatically. XML is the first step in this. The progress so far is:

HTML: <p>Tim Berners-Lee</p>
XML: <author>Tim Berners-Lee</author>

The computer can now apply a formatting style to all authors, and can index them separately, but it still cannot use the meaning of the concept “Author” or distinguish this Tim Berners-Lee from any other Tim Berners-Lee (if you think this is a silly example, please visit the Michael Jackson disambiguation page on Wikipedia: http://en.wikipedia.org/wiki/Michael_Jackson_(disambiguation)).

Wikipedia defines Semantic Technology as follows:
"Using Semantic Technologies, meaning is stored separately from data and content files, and separately from application code"
So in our example, the author role is matched to a central definition for the creation of documents, preferably using a standard such as the Dublin Core standard “DC.creator”.

XML: <author>Tim Berners-Lee</author>
RDF expressed in XML: 
<rdf:Description dc:title="The Semantic Web">
    <dc:creator>Tim Berners-Lee</dc:creator>
</rdf:Description>

In the next step we can replace “The Semantic Web” and “Tim Berners-Lee” with Uniform Resource Identifiers (URIs). For ease of understanding, the URI for Tim Berners-Lee could be http://en.wikipedia.org/wiki/Tim_Berners-Lee and the article could be referenced as http://dx.doi.org/10.1038/scientificamerican0501-34.

So from a formatted piece of text we arrive at a well-defined relation between two specific URIs. A computer can now apply logic, based on understandable definitions and relationships.



Such a relationship is called a “Triple”: three pieces of information – from left to right, a Subject, a Predicate and an Object – that together describe a piece of knowledge.

The de facto standard for expressing Semantic information is the W3C's Resource Description Framework (RDF).
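
For example, the relationship above can be written out in RDF/XML with the full URIs. This is a minimal sketch, using only the standard RDF and Dublin Core namespaces and the two URIs mentioned earlier:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- Subject: the article, Predicate: dc:creator, Object: the URI for Tim Berners-Lee -->
  <rdf:Description rdf:about="http://dx.doi.org/10.1038/scientificamerican0501-34">
    <dc:creator rdf:resource="http://en.wikipedia.org/wiki/Tim_Berners-Lee"/>
  </rdf:Description>
</rdf:RDF>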

So what do we need to make the Semantic Web work?

  1. Well-defined relations – like the Dublin Core relations – expressed, for instance, in RDF Schemas (a small illustrative sketch follows this list)
  2. A way to store a multitude of triples: A Triple Store
  3. Vocabularies: the concepts and relationships between them that describe a certain domain.
  4. Semantic Enrichment to create triples from unstructured data
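
To give a flavour of points 1 and 3, a tiny, purely illustrative RDF Schema vocabulary could look like this; the example.org namespace and the class and property names are made up for the sketch:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <!-- an illustrative class for one concept in a domain -->
  <rdfs:Class rdf:about="http://example.org/vocab#Airport">
    <rdfs:label>Airport</rdfs:label>
  </rdfs:Class>
  <!-- an illustrative relation that applies to that class -->
  <rdf:Property rdf:about="http://example.org/vocab#locatedIn">
    <rdfs:label>located in</rdfs:label>
    <rdfs:domain rdf:resource="http://example.org/vocab#Airport"/>
  </rdf:Property>
</rdf:RDF>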

This article is about Semantic Technologies, so let’s look at how Triple Stores and Semantic Enrichment help us to get to our goal: Linked Big Data.

Triple Stores

Triples are a specific way of storing information. To use Triples in an effective way – querying using SPARQL and reasoning – a special database is needed to store these graph structures. These databases are called Triple Stores.
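
As a small illustration, a SPARQL query against a store containing the Tim Berners-Lee triple from the previous section might look like this (a sketch that simply asks for every resource that has him as creator):

PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?work
WHERE {
  ?work dc:creator <http://en.wikipedia.org/wiki/Tim_Berners-Lee> .
}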

From the given example, it is easy to understand that a vocabulary plus a dataset can expand into millions or billions of triples. Performance – of both ingestion and querying – is an important consideration.

Some of the better known Triple Stores are Sesame and Jena for smaller implementations and OWLIM, MarkLogic and Virtuoso for large implementations.

Semantic Enrichment Technologies

To use the Big, we have to understand the Data. In an ideal world, data is created according to a well-organised ontology.

Alas, in most cases Big Data is created with no ontology present. To create structure from unstructured data (or from data structured with a different goal in mind), we need automatic recognition of the meaning in our data.

This usually starts with recognising types of information using Semantic Enrichment Technologies. Semantic Enrichment Technologies are a collection of linguistic tools and techniques, such as Natural Language Processing (NLP) and artificial intelligence (AI), that analyse unstructured natural language or data and try to classify and relate it.

By identifying the parts of a sentence (subject, predicate, etc.), algorithms can recognise categories, concepts (people, places, organisations, events, etc.), and topics. Once analysed, text can be further enriched with vocabularies, dictionaries, taxonomies, and ontologies (so regardless of which literal is used, concepts are matched, for example: KLM = Koninklijke Luchtvaart Maatschappij = Royal Dutch Airlines).

This layer of linked metadata over our data creates Linked Data.
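
To make this concrete, the output of enrichment can itself be stored as triples. In the sketch below, a news article mentioning KLM is linked to the matching DBpedia concept; the article URI is purely illustrative:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- hypothetical article URI, linked via dc:subject to the DBpedia concept for KLM -->
  <rdf:Description rdf:about="http://example.org/articles/12345">
    <dc:subject rdf:resource="http://dbpedia.org/resource/KLM"/>
  </rdf:Description>
</rdf:RDF>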

The quality of enrichment ranges from (nearly) 100% for literally translated content down to 90% or less, depending on the amount of training material that is available.

Linked Big Data

So Semantic Enrichment Technologies give us the opportunity to turn Big Data into Linked Big Data.
Tim Berners-Lee defined Linked Open Data as complying with the following 5 rules:

  1. Available on the web (in whatever format), but with an open licence, to be Open Data
  2. Available as machine-readable structured data (e.g. Excel instead of an image scan of a table)
  3. As (2), plus in a non-proprietary format (e.g. CSV instead of Excel)
  4. All the above, plus: use open standards from the W3C (RDF and SPARQL) to identify things, so that people can point at your stuff
  5. All the above, plus: link your data to other people’s data to provide context

Governments and other public organisations are putting a lot of effort into providing Linked Open Data for citizens and organisations to use.

Commercial organisations are not likely to publish their data openly, but they will use the same standards as Linked Open Data (such as HTTP, URIs and RDF) and therefore end up with similar implementations for Linked Big Data.



Some examples of Big Linked Data and Big Open Data initiatives:

  1. Linked Open Data in the Netherlands, UK and USA 
  2. Linked Open Data sources such as DBpedia, which essentially makes the content of Wikipedia available in RDF and also links to GeoNames for geographical locations, and Freebase, a community-curated database of well-known people, places and things
  3. A browse interface for triple stores 
  4. Enriched Dutch Newspaper articles via Newz 
  5. Dutch Laws in RDF
  6. Europeana opens up European cultural history collections 

So what’s in it for me?

Does your organisation create or own lots of unstructured data? There is probably a wealth of knowledge hidden in there, which you can access:

  1. Find out what structure (ontology) fits your needs
  2. Use Semantic Enrichment Technologies to create structure from your unstructured data
  3. Store your data in a Triple Store
  4. Start exploring, learn & earn

I will post more on Triple Stores and Semantic Enrichment in future blog posts.