Hi,

I would like to know how Calais "learns" new things. For instance, it recognizes "Chelsea Clinton" as a person, but it knows "Chelsea" neither as a district of London nor as a football (soccer) club (yet).

I imagine that there's a staff of people working on feeding it new stuff - is this correct?

If so, how many people are doing this?

Is it all done "by hand", or is the process partially (or even completely) automatable? Or are you using your journalistic workforce to do it?

There are many potential entities to recognize. Currently, it doesn't seem to recognize Manchester City as a football club, distinguish it from Manchester United (another football club), or know that football clubs in England are the same as soccer clubs in the USA. And then there are nicknames and common abbreviations (like ManU :-)... How far are you going to go in making entities like this available, and what are your priorities?

I'm particularly curious about how you're going to tackle disambiguation ("football", "Chelsea", "Paris", &c.). Is there a paper available that explains the Calais developers' strategy for this common difficulty in NLP applications?

Hope you don't take my curiosity for disrespect - I'm just fascinated by your work, and I'm trying to figure out where it might lead.

Best regards, Dirk


Comments

Hi Dirk,

Well, being one of the people behind Calais learning, I first want to say that I am glad you are curious - it means we're doing our work pretty well (although, as you've noticed, there is always room for improvement).
The processing is completely automatic, and there are only around a dozen people (plus a few past, yet important, contributors) responsible for the NLP code.
Calais learns from rules we "teach" it - we try to mimic the way a human reader identifies or disambiguates an entity (or a relation) when reading a text, using clues within the entity itself, its close context and the context of the entire text.
We have developed (and are still developing) a sophisticated rule-based system with our own (and if I may say, cool) programming language. In writing the rules, we use elements based on several NLP levels (from text tokenization, morphological analysis and POS tagging, to shallow parsing and identifying nominal and verbal phrases), and we also use lexicons. This combined lexicons+discovery approach allows Calais to identify an entity even if most of the world (including the people writing the rules) has never heard of it, and to disambiguate an entity's meaning according to the context it appears in.
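To make the lexicons+discovery idea a bit more concrete, here is a minimal Python sketch. It is not the Calais rule language or any of its actual rules: the lexicon, the clue words, the honorific list and the function names are all invented for illustration, and a real system would work on POS tags and phrases rather than bare lowercase tokens.

```python
import re

# Hypothetical mini-lexicon: surface form -> candidate entity types.
LEXICON = {"chelsea": ["SportsTeam", "District"]}

# Hypothetical context clues for each type (made up, not Calais data).
CONTEXT_CLUES = {
    "SportsTeam": {"scored", "goal", "match", "league", "striker"},
    "District": {"district", "borough", "street", "flat", "london"},
}

HONORIFICS = {"mr", "mrs", "ms", "dr"}


def tokenize(text):
    """Very rough tokenizer: runs of ASCII letters only."""
    return re.findall(r"[A-Za-z]+", text)


def classify(text, mention, window=5):
    """Guess the type of `mention` in `text`, or return None if undecided."""
    tokens = [t.lower() for t in tokenize(text)]
    target = mention.lower()
    if target not in tokens:
        return None
    i = tokens.index(target)

    # "Discovery" rule: an honorific right before the mention suggests a
    # person, even when the name appears in no lexicon.
    if i > 0 and tokens[i - 1] in HONORIFICS:
        return "Person"

    candidates = LEXICON.get(target)
    if not candidates:
        return None

    # Lexicon rule: score each candidate type by clue words near the mention.
    context = set(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
    scores = {c: len(context & CONTEXT_CLUES[c]) for c in candidates}
    best = max(candidates, key=scores.get)
    return best if scores[best] > 0 else None
```

With these toy tables, classify("Chelsea scored a late goal in the league match", "Chelsea") returns "SportsTeam", the same call on "She rented a flat in the Chelsea district of London" returns "District", and an unknown name after "Dr" comes back as "Person" without any lexicon entry.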
Our sports team identification was just released in the latest Calais update, and will probably improve in the next versions (especially if we get feedback about it), so the soccer teams will get their attention...
Regarding common abbreviations - we try to identify them and map them to their full names using abbreviation- and acronym-creation methods, as well as thesauruses.
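As a rough illustration of that mapping, the Python sketch below generates candidate abbreviations from a full name and falls back to a small alias table for nicknames that cannot be derived mechanically. The generation patterns, the alias table and all function names are assumptions made up for this example, not the methods Calais actually uses.

```python
import re


def normalize(s):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", s.lower())


def candidate_abbreviations(name):
    """Generate a few plausible short forms of a multi-word name."""
    words = name.split()
    # Plain acronym from initials, e.g. "MU".
    cands = {"".join(w[0] for w in words).upper()}
    if len(words) > 1:
        rest = "".join(w[0].upper() for w in words[1:])
        # Prefix of the first word plus initials of the rest, e.g. "ManU".
        for n in (3, 4):
            if len(words[0]) > n:
                cands.add(words[0][:n].capitalize() + rest)
    return cands


def build_index(names, aliases=None):
    """Map normalized short forms to the set of full names they may denote."""
    index = {}
    for name in names:
        for abbr in candidate_abbreviations(name):
            index.setdefault(normalize(abbr), set()).add(name)
    # Thesaurus-style entries (hypothetical) for non-derivable nicknames.
    for alias, name in (aliases or {}).items():
        index.setdefault(normalize(alias), set()).add(name)
    return index


def resolve(short_form, index):
    """Return the sorted list of full names matching a short form."""
    return sorted(index.get(normalize(short_form), set()))
```

Given the names "Manchester United" and "Manchester City", both "ManU" and "Man U" resolve to "Manchester United" only. A short form that maps to several names would come back as a list with more than one entry, which is where context-based disambiguation would have to take over.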
We hope to make as many entities as possible available in the future. And as for prioritization - that is not for me to answer...
I hope I have managed to answer at least some of your curiosity, and that we'll be able to keep you fascinated for a long time :)

Regards,
Naama

If there's such a thing as a paper describing the inner workings of Open Calais, I'd second that request :)

There's no single paper that provides a good overview of the complete Open Calais technology stack - though there probably should be.

Let us think about this for a while and get back to the forum. I'd like to get this put together, but of course the people who would need to take time out for writing are the same people who are doing the building.

Regards,

Hi,

for sure, I'd also be interested in learning more about how Calais learns :)

One specific question: did you ever try deep parsing to improve the relation extraction capabilities of Calais, and if so, what was your experience?

Thanks and best regards

Markus

Markus:

Take a look at the R3 release notes on "Exhaustive Extraction" (http://www.opencalais.com/R3Overview). You might find them interesting.