Monday, May 26, 2014

An inspiring editor

At HintTech we regularly organise One Day Product sessions. During these sessions we come up with and build a working product within a single day.

At the One Day Product session on Thursday, May 22, we worked on image sourcing for articles. Classic image sourcing is often an expensive and not very creative process, with a very static end result. Wouldn't it be fantastic if your word processor inspired you while you write by showing interesting imagery?

With this idea, the team from Dayon, HintTech and Tripitch set to work. Irina designed a beautiful interface, which was built by Martin and Sebastiaan. At the same time, Jorg, Fedor and Fabian got to work on text recognition and on tapping into image banks.

After a long day of thinking, toiling, discussing and building, the result is something to be proud of: MOOD THIS!

Try it out! Type in a text, and when you stop typing for a few seconds MOOD THIS! goes looking for a mood board with images and colours based on words from the text. The result will often be surprising (it depends on social media tagging) and hopefully inspiring, but often also just great fun.
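
As a rough illustration of the "words from the text" step, here is a minimal Python sketch that picks the most frequent non-trivial words as search terms. The function name `pick_keywords`, the tiny stop-word list and the cut-off of three terms are assumptions for illustration only; this is not how MOOD THIS! was actually built.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, far from complete).
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "on", "to", "is", "it", "at", "with"}


def pick_keywords(text, limit=3):
    """Return up to `limit` of the most frequent non-stop-words in `text`."""
    words = re.findall(r"[a-z]+", text.lower())
    # Ignore stop words and very short words before counting.
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(limit)]
```

The resulting terms could then be sent to an image bank's tag search; which image bank and which search call to use is left open here.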

MOOD THIS! is inspired by Plastr, a project by Sebastiaan Hoejenbos and Michiel Peters.

Wednesday, May 21, 2014

Lessons from the New York Times innovation report

A few days ago, the full NY Times innovation report was leaked to the press. The NY Times did a rigorous self-examination, and came to some incredibly important observations on the why and how of true transformation to the digital era.

Reading this document sparked a WOW moment in me. I think it is an absolute must-read for anyone dealing with content strategy. Not so much for profound new thoughts, but more for the insight into where the NY Times' digital performance was good and where it was poor, affirming that changing a content strategy is hard.
The Nieman Journalism Lab wrote an excellent report on it.

I would like to share a few of the observations and proposals that I also run into in my daily practice:

Create a strategy team

The newsroom is too focused on the daily task of producing the daily newspaper, which leaves insufficient time and focus to assess and implement strategic changes.

I think we all can relate to this problem. It is OK to have a team focused on the daily chores, but somewhere change has to happen. Creating a strategy team is a good idea – as long as it is well-anchored within the standing organisation and has the power to implement change.

Put the Reader first & Grow an audience

The NY Times offers content through all the obvious channels. But competitors with better headlines, better search and better social media skills outperformed the NY Times, even on its own content. Simple changes can help, such as posting at a time your audience is actually reading, repurposing, repeating, repackaging and personalizing.

The battle for an audience is no longer fought by selling a paper or a website on its front page: every article has to fight its own battle in the saturated content market.

The report also suggests branching off and seeking audiences in different forms. Hosting events and talking to real customers should be on anybody’s agenda.

Structure to repurpose

The report tells a compelling story on why content needs to be enriched through tagging and structured data.

This reaffirms the choices we made in the Newz project: structuring content by location, time (and events), story type and topics (persons and organisations). The NY Times adds to this: timeliness, story threads, story tone and the use of imagery.
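
To make the idea of structured content concrete, here is a minimal Python sketch of an article record that carries the dimensions listed above. The class name `Article`, its field names and the `by_topic` helper are hypothetical and only illustrate the principle; they are not the Newz or NY Times data model.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Article:
    headline: str
    location: str
    published: date
    story_type: str
    topics: list = field(default_factory=list)  # persons and organisations


def by_topic(articles, topic):
    """Return the articles tagged with `topic`, newest first."""
    hits = [a for a in articles if topic in a.topics]
    return sorted(hits, key=lambda a: a.published, reverse=True)
```

Once content is tagged this way, repurposing becomes a query: a topic page, a location feed or a story thread is just a different selection over the same records.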

I feel that any content organisation should consider structuring its content along some of these dimensions.

Overcome the huge cultural change to become ‘Digital First’

The report points out that changing a newsroom from a century-old habit of paper & front-page to ‘Digital First’ is a huge challenge. The cultural change will take new technology-savvy people and new managers who encourage cultural change.

The Times created many beautiful digital showcases like “Snow Fall”. But the NY Times built “Snow Fall” and not a “Snow Fall building tool”.

Anyone working in difficult markets should be willing to experiment more with presentation formats, accept imperfection, accept failure, measure so you know what succeeds, and have the tools ready to expand on success.

Read this!

If you are in any way active in content strategy or news: please read this report. See that you are not alone in your struggle and gorge on the valuable insights.

Thanks to the Bladendokter, whose Dutch analysis pointed me to the report.

Friday, May 16, 2014

MarkLogic World Amsterdam

On May 15th, the MarkLogic World Tour touched down in Amsterdam. Approximately 300 attendees flocked to the Amsterdam Arena to be immersed in MarkLogic and its possibilities. As a longstanding MarkLogic partner and implementer, my employer Dayon and HintTech sponsored this event. MarkLogic presented interesting new developments, case studies and technical sessions.

Big Content challenges 

Keynote speaker and MarkLogic CEO Gary Bloom and Joe Pasqua, Senior Vice President of Product Strategy, explained MarkLogic's choices for addressing Big Data challenges, but also Big Content challenges (see my earlier blog) such as heterogeneous structured data: supporting multi-schema XML and changing schemas.

MarkLogic 8

MarkLogic 7 already introduced a lot of interesting new features; MarkLogic 8 raises the bar some more:

  • JavaScript & JSON support
    Building Applications on MarkLogic will be available to lots more developers: front-end oriented developers and other developers who prefer {} over <>
  • SPARQL 1.1 & Inferencing
    The Triple Store in MarkLogic 7 is a good starting point, but becomes much more complete by adding Inferencing and SPARQL 1.1 support
  • Bitemporal support
    A very interesting feature, which I could use immediately: What did we know at a certain point in time?
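
The question behind bitemporal support can be illustrated with a small Python sketch that is independent of MarkLogic. Each fact carries a valid time (when it became true in the world) and a system time (when we recorded it). The names `Fact` and `known_at` are hypothetical, the model is deliberately simplified (no end of validity), and this is not MarkLogic's API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    subject: str
    value: str
    valid_from: date   # when the value became true in the real world
    recorded_on: date  # when we stored it (system time)


def known_at(facts, subject, valid_on, as_of):
    """What did we believe on `as_of` about `subject` on `valid_on`?"""
    candidates = [
        f for f in facts
        if f.subject == subject
        and f.valid_from <= valid_on
        and f.recorded_on <= as_of
    ]
    if not candidates:
        return None
    # Prefer the most recent valid period; break ties on the latest recording.
    best = max(candidates, key=lambda f: (f.valid_from, f.recorded_on))
    return best.value
```

A correction recorded later does not overwrite history: asking with an earlier `as_of` still returns what was believed at that time.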


I visited the more business-oriented tracks, with cases from Standards Norway, Intel, and the case known to us as Obamacare. MarkLogic explained that their parts of the system kept performing throughout the toughest time. If only all components had been this reliable…
Some of our customers presented as well: Ted van Dongen of Swets explained how a great tool needs good people to implement it. Thanks for the compliments, Ted!
Our (previously) own Michel de Ru contributed on behalf of Newz to a panel session on Sharing Data despite the Silos.

Can you have too much of a good thing?

I will admit to anyone that I am biased: I think MarkLogic is a great tool that solves many challenges my customers face. But I had an interesting discussion: with all these new features, will MarkLogic become too monolithic a solution? Can you have too much of a good thing?
My 2 eurocents are: For now, keep piling on the good stuff, MarkLogic!

Monday, May 5, 2014

Big Content challenges

At Dayon, we are used to working with Big Data. Coming from a publisher’s background, we have provided content solutions to publishers since 1997.

I read some stories about Big Content, and was intrigued that Gartner saw Big Content as the unstructured part of Big Data. To me, Big Content is the structured version of Big Data.

Let me explain this and address some challenges and Big Content technologies.

Planned Variety

In terms of the three Big Data V’s (Volume, Velocity and Variety), publishers’ content is odd. Since the goal of publishers is to make a profit from providing content, it must be possible to publish content to a vast array of channels. To enable this, content must be structured (preferably in XML) and enriched with metadata. Any Variety is planned, because unplanned Variety leads to unplanned structures and/or unplanned publications.

Data is generated, whereas Content is handcrafted. Tweets and Facebook posts are only lightly structured, but blog posts are already quite structured. Chartbeat has published some numbers on this, and Fast Company offers a useful insight into the rise of “Big Content” as a marketing tool.

Publishers’ Content is usually completely structured: XML + Meta Data, sometimes already as RDF Triples (read my earlier blog post on Semantic Technologies).

So to me, Content is structured Data. Big Content problems differ from other Big Data problems, where handling the Variety to understand your data is a big issue. Therefore, I would like to label the publishers’ challenge as a Big Content challenge.

So how big is Big Content?

A quick scan of some of our publishing clients provided these numbers (XML only!):

  1. Publisher 1: 10 million files, 25 GB
  2. Publisher 2: 750,000 files, 15 GB
  3. Publisher 3: 150 million files, 15 GB
  4. Publisher 4: 1 million files, 15,000 new files per day (max)
  5. Publisher 5: 45 million files, 20,000 new files per day (max)
  6. Publisher 6: 500,000 files

Challenges of Big Content

With these numbers in mind, what are the challenges for Big Content?
  1. Volume - XML: Are 30 million XML files a challenge? Or is 25 GB of XML a challenge? It really should not be, but in reality I have met quite a few technologies that struggle with these amounts. An XML system should be true XML to handle this amount of data. XML isn't hard. Doing XML right is hard. If you don’t do XML right, 100,000 files or 1 GB of XML can give you plenty of headaches.
  2. Volume - Other file types: Alas, not all Content is XML. Many Publishers still manage huge amounts of HTML, PDF or other file formats. With PDF, huge numbers often also turn into huge volumes, because multi-channel publishing means that print-quality PDF is stored. If you have to index lots of other file types, do a proper intake process per file and weed out the corrupt and the largest files.
  3. Volume - Subscriptions: At various clients I encountered the problem that Big Content is offered in a large number of different Subscriptions. A large number of different Subscriptions is not a problem in itself, but the combination of Big Content and a Big number of Subscriptions often is. So if you offer lots of data, be smart about the number of Subscriptions.
  4. Volume - Triples: Nearly all Publishers storing Big Content are looking into Triples as a way to store and link Meta Data from their XML files. Storing your Meta Data in a Triple Store and linking it to Linked Open Data can be a very good idea, but this calls for a Big Triple Store. A set of 1 billion Triples isn't exceptional, but it does require Big Content Technology.
  5. Velocity - Real Time Indexing: Failing at real time indexing is usually the first sign that you are becoming a Big Content publisher. Many technologies struggle with incremental updates and need complete re-indexing, which in turn leads to strange solutions such as overnight indexing, flip-flopping or indexes that are out of sync with the rest of the front-end.
  6. Velocity - Real Time Alerting: The value of Content depends on its relevance, and timeliness is a huge factor in relevance. Real Time Alerting will offer a competitive edge to content users. To provide Real Time Alerting, an XML store needs to handle alerting efficiently (using minimal resources) at load time.
  7. Variety - Presentation: A Big Content challenge can be how to present all of this Content. If a simple “What’s New” view results in 20,000 hits, what are you going to show the customer? The most used solutions are:
    1. Provide a Search Only interface
    2. Provide as much structure from Meta Data as possible to assist the user in drilling down to the most useful Content
  8. Variety - Enrichment: If the Meta Data you need to provide useful segmentation of your Big Content to your end users just isn't there, additional Enrichment is needed. Because of the costs, Big Content will call for automated enrichment using Natural Language Processing.
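
Real Time Alerting at load time (item 6) can be sketched as matching each incoming document against stored subscriber queries the moment it is loaded, instead of polling the store afterwards. The Python below is a minimal sketch of that idea; `AlertingStore`, `subscribe` and `load` are hypothetical names, the queries are plain keywords, and a real XML store would use its own alerting mechanism.

```python
import re
from collections import defaultdict


class AlertingStore:
    def __init__(self):
        self.documents = {}
        # Inverted index from keyword to the subscribers interested in it,
        # so a load only touches subscriptions that can possibly match.
        self.subscribers_by_keyword = defaultdict(set)

    def subscribe(self, subscriber, keywords):
        for keyword in keywords:
            self.subscribers_by_keyword[keyword.lower()].add(subscriber)

    def load(self, doc_id, text):
        """Store a document and return the subscribers to alert, sorted."""
        self.documents[doc_id] = text
        words = set(re.findall(r"[a-z]+", text.lower()))
        alerted = set()
        for word in words:
            # .get avoids creating empty index entries for unknown words.
            alerted |= self.subscribers_by_keyword.get(word, set())
        return sorted(alerted)
```

The cost of a load depends on the words in the new document rather than on the total number of subscriptions, which is the property that keeps alerting cheap as both content and subscriptions grow.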

Big Content Technology

At Dayon / HintTech we strongly believe that Big Content challenges require specialized Big Content Technology. Here are some of the Big Content Technologies we have implemented:
  1. MarkLogic
    Several of our Big Content clients have selected MarkLogic as their content platform. I believe that MarkLogic is the best XML store and indexer available at this moment.
    As a big bonus, MarkLogic comes with all kinds of useful features such as XQuery, an Application Server, and now even a Triple Store.
    Find out more about MarkLogic at MarkLogic World Amsterdam and meet us there!
  2. OWLIM
    In our project at Newz we needed a Big Content Triple Store. We found OWLIM by Ontotext to provide an excellent Big Triple Store, as did BBC and Press Association.
    W3C maintains a list of Big Triple Stores, with BigOWLIM as one of the top products.
    We also selected Ontotext as our partner for their Semantic Tagging capabilities.
  3. SOLR
    We have also implemented SOLR for Big Content collections. SOLR will not address all of the Big Content challenges, but it is a great Open Source search engine.

PS: After writing this blog, I feel like renaming Meta Data to Meta Content. Probably better if I don’t…