
Tuesday, January 13, 2015

Getting started with Semantic Technologies

Recently Ontotext launched the Self-Service Semantic Suite – S4 for short. S4 provides a set of services for low-cost (currently free) on-demand text analytics and metadata management in the cloud. It is a great way to get acquainted with Semantic Technologies.


Available S4 services

S4 currently offers the following services:

  1. Text analytics for News content, Biomedical content or Twitter content
  2. Linked Data server with reliable access to the DBpedia, FactForge, GeoNames, WordNet, MusicBrainz, and New York Times datasets 
  3. Self-managed RDF database (GraphDB) in the cloud

Trying out Text analytics

Text analytics – in this context – is about finding out what is important in natural-language text, and making that information usable.

To try this out, copy some text containing Persons and Places, biomedical terms and/or Twitter content. English gives the best results, but other languages will produce results as well.
Go to the S4 homepage and click on “Demo S4 today for free”. Paste your text into the Text Analytics box, choose whether your text is more News, Biomedical or Twitter oriented, and hit Execute.

The result shows your text with different types of terms highlighted in different colours, as in the example below.



Figure 1. Example of an annotated text

If you hover over an annotated term, extra information is shown. For an organisation, for instance, it shows the corresponding entry in DBpedia (the semantic version of Wikipedia). In my example this makes clear that the article is not about just any IMF, it is about the IMF, and that more info is available at http://dbpedia.org/page/International_Monetary_Fund.
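If you are curious what sits behind that DBpedia link, you can ask DBpedia directly. The following query is not part of S4; it is a small sketch that can be pasted into the public DBpedia SPARQL endpoint (http://dbpedia.org/sparql) and returns the types and the English abstract of the entity the annotation points to:

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX dbp-ont: <http://dbpedia.org/ontology/>

# Look up what DBpedia knows about the annotated entity
SELECT ?type ?abstract
WHERE {
  dbpedia:International_Monetary_Fund a ?type ;
                                      dbp-ont:abstract ?abstract .
  FILTER (lang(?abstract) = "en")
}

The same pattern works for any annotated Person, Place or Organisation.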

So what’s in it for you? 

This service can provide all kinds of structure and topical information that help you classify, understand, link and enrich your own content.

Trying out Semantic queries

S4 also lets you try out semantic queries using SPARQL, the query language for semantically stored information such as DBpedia.
Go to the S4 homepage and click on “Demo S4 today for free”. Go to the LOD Access tab and select a query from the pulldown. Let’s try “Find airports near London”.
The SPARQL query is:

PREFIX geo-pos: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX omgeo: <http://www.ontotext.com/owlim/geo#>
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX dbp-ont: <http://dbpedia.org/ontology/>
PREFIX ff: <http://factforge.net/>
PREFIX om: <http://www.ontotext.com/owlim/>

SELECT distinct ?airport ?label ?RR
WHERE {
        dbpedia:London geo-pos:lat ?latBase ;
                       geo-pos:long ?longBase .
        ?airport omgeo:nearby(?latBase ?longBase "50mi");
                 a dbp-ont:Airport ;
                 ff:preferredLabel ?label ;
                 om:hasRDFRank ?RR .
      } ORDER BY DESC(?RR)


Even without a SPARQL crash-course, this is quite easy to read:

  • First, some prefix definitions are introduced
  • The query will return the airport ID, its name (label) and a relevance rank (RR)
  • The latitude and longitude of London are retrieved from DBpedia
  • Only results that have the DBpedia Ontology type “Airport” are selected
  • They must lie within 50 miles of London, according to the OWLIM geospatial function “nearby”

Click on “Execute” and have a look at the results. Try out some of the other queries as well.
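Note that omgeo:nearby and om:hasRDFRank are OWLIM-specific extensions, so the query above only runs on Ontotext’s infrastructure. As a rough sketch (my own approximation, not part of S4), the same question can be put to the public DBpedia endpoint in standard SPARQL, using a crude bounding box of roughly 50 miles around central London instead of a real distance function, and without the RDF rank ordering:

PREFIX geo-pos: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX dbp-ont: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?airport ?label
WHERE {
  ?airport a dbp-ont:Airport ;
           rdfs:label ?label ;
           geo-pos:lat ?lat ;
           geo-pos:long ?long .
  # crude bounding box around central London (51.51 N, -0.13 E), roughly 50 miles
  FILTER (?lat > 50.8 && ?lat < 52.2)
  FILTER (?long > -1.3 && ?long < 1.0)
  FILTER (lang(?label) = "en")
}
LIMIT 25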

So what’s in it for you? 

An enormous wealth of structured information is available for you to use. Were you aware that you could ask Wikipedia such detailed questions? Be aware, though: writing such concise queries, and really using the results, takes considerable knowledge.

Next step

Now that you know that Text analysis and Semantic queries are available, what is your next step in using Semantic technologies?

Monday, May 5, 2014

Big Content challenges

At Dayon, we are used to working with Big Data. Coming from a publishing background, we have provided content solutions to publishers since 1997.

I read some stories about Big Content, and was intrigued that Gartner saw Big Content as the unstructured part of Big Data. To me, Big Content is the structured version of Big Data.

Let me explain this and address some challenges and Big Content technologies.

Planned Variety

In terms of the three Big Data V’s (Volume, Velocity and Variety), publishers’ content is odd. Since the goal of publishers is to make a profit from providing content, content must be publishable to a vast array of channels. To enable this, content must be structured (preferably in XML) and enriched with metadata. Any Variety is planned, because unplanned Variety leads to unplanned structures and/or unplanned publications.

Data is generated, whereas Content is handcrafted. Tweets and Facebook posts are only lightly structured, but blog posts are already quite structured. Some numbers by Chartbeat can be found here, as well as a useful insight by Fast Company on the rise of “Big Content” as a marketing tool.

Publishers’ Content is usually completely structured: XML + Meta Data, sometimes already as RDF Triples (read my earlier blog post on Semantic Technologies).

So to me, Content is structured Data. Big Content problems differ from other Big Data problems, where handling the Variety in order to understand your data is a major issue. Therefore, I would like to label the publishers’ challenge a Big Content challenge.
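To make “XML + Meta Data, sometimes already as RDF Triples” a bit more tangible, here is a minimal, made-up sketch of how a single article’s metadata could be stored as triples. The ex: namespace, properties and title are invented purely for illustration; real publishers use richer vocabularies:

PREFIX ex:      <http://example.org/publishing/>
PREFIX dcterms: <http://purl.org/dc/terms/>

# One article's metadata as triples (hypothetical schema)
INSERT DATA {
  ex:article-123 a ex:Article ;
                 dcterms:title   "IMF lowers growth forecast" ;
                 dcterms:created "2014-05-05"^^<http://www.w3.org/2001/XMLSchema#date> ;
                 dcterms:subject ex:topic-economy ;
                 ex:mentions     <http://dbpedia.org/resource/International_Monetary_Fund> .
}

Once metadata lives in this form, the same SPARQL machinery from the first post applies to your own content.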

So how big is Big Content?

A quick scan of some of our publishing clients produced these numbers (XML only!):

  1. Publisher 1: 10 million files, 25 GB
  2. Publisher 2: 750,000 files, 15 GB
  3. Publisher 3: 150 million files, 15 GB
  4. Publisher 4: 1 million files, 15,000 new files per day (max)
  5. Publisher 5: 45 million files, 20,000 new files per day (max)
  6. Publisher 6: 500,000 files

Challenges of Big Content

With these numbers in mind, what are the challenges for Big Content?
  1. Volume - XML: Are 30 million XML files a challenge? Is 25 GB of XML a challenge? It really should not be, but in reality I have seen quite a few technologies struggle with these amounts. An XML system should be truly XML-native to handle this amount of data. XML isn't hard; doing XML right is hard. If you don't do XML right, 100,000 files or 1 GB of XML can give you plenty of headaches.
  2. Volume - Other file types: Alas, not all Content is XML. Many Publishers still manage huge amounts of HTML, PDF or other file formats. With PDF, huge numbers often also turn into huge volumes, because multi-channel publishing means print-quality PDFs are stored. If you have to index lots of other file types, do a proper intake process per file and weed out the corrupt and the largest files.
  3. Volume - Subscriptions: At various clients I encountered the problem that Big Content is offered through a large number of different Subscriptions. A large number of Subscriptions is not a problem in itself, but the combination of Big Content and a Big number of Subscriptions often is. So if you offer lots of data, be smart about the number of Subscriptions.
  4. Volume - Triples: Nearly all Publishers storing Big Content are looking into Triples as a way to store and link Meta Data from their XML files. Storing your Meta Data in a Triple Store and linking it to the Linked Open Data cloud can be a very good idea, but this calls for a Big Triple Store. A set of 1 billion Triples isn't exceptional, but it does require Big Content Technology.
  5. Velocity - Real Time Indexing: Failing at real time indexing is usually the first sign that you are becoming a Big Content publisher. Many technologies struggle with incremental updates and need complete re-indexing, which in turn leads to strange solutions such as overnight indexing, flip-flopping or indexes out of sync with the rest of the front-end.
  6. Velocity - Real Time Alerting: The value of Content depends on its relevance, and timeliness is a huge factor in relevance. Real Time Alerting will offer a competitive edge to content users. To provide Real Time Alerting, the XML store needs to handle alerting efficiently (using minimal resources) at load time.
  7. Variety - Presentation: A Big Content challenge can be how to present all of this Content. If a simple “What’s New” view results in 20,000 hits, what are you going to show the customer? The most common solutions are:
    1. Provide a search-only interface
    2. Provide as much structure from the Meta Data as possible to help the user drill down to the most useful Content (see the sketch after this list)
  8. Variety - Enrichment: If the Meta Data you need to provide useful segmentation of your Big Content to your end users just isn't there, additional Enrichment is needed. Due to the costs involved, Big Content calls for automated enrichment using Natural Language Processing.
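To illustrate the drill-down approach of challenge 7, here is a hedged sketch of a facet query, reusing the made-up ex: metadata schema from earlier in this post. It counts recent articles per topic, which a front-end could present as facets instead of an undifferentiated list of 20,000 hits:

PREFIX ex:      <http://example.org/publishing/>
PREFIX dcterms: <http://purl.org/dc/terms/>

# Count this week's articles per topic to drive drill-down facets
SELECT ?topic (COUNT(?article) AS ?hits)
WHERE {
  ?article a ex:Article ;
           dcterms:subject ?topic ;
           dcterms:created ?date .
  FILTER (?date >= "2014-04-28"^^<http://www.w3.org/2001/XMLSchema#date>)
}
GROUP BY ?topic
ORDER BY DESC(?hits)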

Big Content Technology

At Dayon / HintTech we strongly believe that Big Content challenges require specialized Big Content Technology. Here are some of the Big Content Technologies we have implemented:
  1. MarkLogic
    Several of our Big Content clients have selected MarkLogic as their content platform. I believe that MarkLogic is the best XML store and indexer available at this moment.
    As a big bonus, MarkLogic comes with all kinds of useful features such as XQuery, an Application Server, and now even a Triple Store.
    Find out more about MarkLogic at MarkLogic World Amsterdam and meet us there!
  2. OWLIM
    In our project at Newz we needed a Big Content Triple Store. We found OWLIM by Ontotext to be an excellent Big Triple Store, as did the BBC and the Press Association (see the sketch after this list).
    W3C maintains a list of Big Triple Stores, with BigOWLIM as one of the top products.
    We also selected Ontotext as our partner for their Semantic Tagging capabilities.
  3. SOLR
    We have also implemented SOLR for Big Content collections. SOLR does not address all of the Big Content challenges, but it is a great Open Source search engine.
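To give an impression of what a Big Triple Store is used for in a project like Newz, here is one more hedged sketch. It again uses the invented ex: schema rather than the actual Newz model, and assumes the relevant DBpedia facts are loaded into (or federated with) the same store. It combines in-house metadata with Linked Open Data in a single query:

PREFIX ex:      <http://example.org/publishing/>
PREFIX dbp-ont: <http://dbpedia.org/ontology/>

# Articles mentioning any organisation headquartered in Washington, D.C.
# (in-house ex:mentions links combined with DBpedia facts)
SELECT ?article ?org
WHERE {
  ?article a ex:Article ;
           ex:mentions ?org .
  ?org a dbp-ont:Organisation ;
       dbp-ont:headquarter <http://dbpedia.org/resource/Washington,_D.C.> .
}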

PS: After writing this blog, I feel like renaming Meta Data to Meta Content. Probably better if I don’t…