I've noticed this a couple of times but never took the time to think it through. I'm talking about entities that are recognized on a first-name basis only. Let me give you an example.

"Amy" ->  "I spent last night with my friends Bruce and ]Amy[ and their two young children, got my  "

"Sarah" ->  "N.Y. - Bullets whizzed past as "]Sarah[" translated for U.S. soldiers in Iraq. Shrapnel " 

 "Tom" -> " The 29-year-old "Batman" star and wife of ]Tom[... "

Certainly Tom, Amy and Sarah are all persons, but I'm asking myself whether this information is really valuable. When processing e.g. an article and looking at the recognized people, this becomes especially questionable, since most detections are based on full names and are really useful.

I also think the current results on people could still use some improvement and feel kinda... unsemantic. E.g. "Hillary Clinton" and "Hillary Rodham Clinton" are recognized as two different persons. The same goes for "George Bush", "Bush" and "George W. Bush".

It would probably be a good idea to differentiate clearly between recognized names and persons, since both are different things.

 

 


Comments

No worries about response time...
Well, the identity resolution challenge you refer to is not specific to first names; it is also true for full names (there is more than one "John Smith" in the world ;)), and we are definitely working on it.

As for person names across documents - we deliberately do not try to normalize a person name across documents. How could we know that "Bush" or "George Bush" refers to "George W. Bush" and not to his father? I'm sure you wouldn't like us to normalize "Clinton" in a single way. My point is that it is risky to try to normalize person names across documents. Within a single document we do it, creating an on-the-fly mapping. Come to think of it, normalizing a person name across documents actually comes back to the identity resolution problem, wouldn't you agree?
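
To make the per-document mapping concrete, here is a minimal Python sketch. It is not how Calais does it; the function name `resolve_in_document` and the token-subset rule are assumptions made purely for illustration. A shorter mention is mapped to a longer one only when exactly one longer name in the same document contains all of its words.

```python
def resolve_in_document(mentions):
    """Map each person mention found in ONE document to a canonical name.

    Assumption (illustrative only): a mention refers to a longer name if
    every word of the mention also occurs in that longer name.
    """
    names = set(mentions)
    token_sets = {name: set(name.split()) for name in names}

    # Canonical names are those not strictly contained in any other name.
    canonical = [
        name for name in names
        if not any(token_sets[name] < token_sets[other] for other in names)
    ]

    mapping = {}
    for name in names:
        matches = [c for c in canonical if token_sets[name] <= token_sets[c]]
        # Resolve only when the choice is unambiguous; otherwise keep as-is.
        mapping[name] = matches[0] if len(matches) == 1 else name
    return mapping


clintons = resolve_in_document(
    ["Hillary Rodham Clinton", "Hillary Clinton", "Clinton"]
)
bushes = resolve_in_document(["George W. Bush", "Laura Bush", "Bush"])
```

In the first document every mention maps to "Hillary Rodham Clinton". In the second, "Bush" matches two canonical names and is therefore left unresolved, which is the same ambiguity that makes cross-document normalization risky.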

Regards,
Naama

P.S. speaking of first names, I don't know yours...

Normalization shouldn't be a problem as soon as a person is identified. But in many cases this is not possible when taking only the name into account. Even "George W. Bush" isn't unique afaik. Context is very important here. There seems to be a common understanding of names in a certain context, e.g. "Britney" will always be recognized by readers of entertainment articles, "Bush" likewise in political articles, and so on. Taking context into account gives a chance to further improve identification and make a good guess. But guesses are not enough for the semantic web, aye? Even if you are 90% sure, by using context and statistical elements, that this person is indeed George Bush, how do you express it? You cannot draw a relation that is only 90% true... After thinking about it, I think the way you deal with people/names makes sense, or at least I can see the natural limits. Thanks for the insight.
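
One way to picture the "90% sure" case is to keep the guess as a candidate link that carries its own confidence instead of asserting identity. The Python sketch below is only an illustration under stated assumptions: the keyword profiles in `CANDIDATES`, the `CandidateLink` type and the overlap-share score are all hypothetical, and the score is not a calibrated probability.

```python
from dataclasses import dataclass

# Hypothetical context profiles; a real system would learn these from data.
CANDIDATES = {
    "George W. Bush": {"president", "white", "house", "iraq", "texas"},
    "Kate Bush": {"singer", "album", "song", "tour"},
}


@dataclass(frozen=True)
class CandidateLink:
    mention: str       # the name as it appeared in the text
    candidate: str     # the person it might refer to
    confidence: float  # share of matching context words, between 0 and 1


def rank_candidates(mention, context, candidates=CANDIDATES):
    """Rank possible identities for a mention by context keyword overlap."""
    # Naive tokenization: lowercase and split on whitespace only.
    words = set(context.lower().split())
    overlaps = {
        name: len(words & keywords) for name, keywords in candidates.items()
    }
    total = sum(overlaps.values())
    if total == 0:
        return []  # no contextual evidence, so make no guess at all
    links = [
        CandidateLink(mention, name, count / total)
        for name, count in overlaps.items()
        if count > 0
    ]
    return sorted(links, key=lambda link: link.confidence, reverse=True)


links = rank_candidates(
    "Bush", "Bush released a new album after the White House visit"
)
```

Here `links` holds two candidates, with "George W. Bush" ranked first at a confidence of 2/3 and "Kate Bush" second at 1/3. The relation stays a scored candidate rather than a fact, so a consumer can choose its own threshold.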

Best wishes,
Stefan

Edit:
I'm looking forward to your new "Relevancy Ranking" features. Maybe this will shed a different light on the questions above.

This is an interesting point you are raising, but I think you should consider the following - Calais identifies not only entities but also relations (events and facts) involving more than one entity.
So you might argue that a first name as a person is not interesting on its own (although there are probably applications for which it is interesting), but a Person can also be part of a relation such as Family Relation, and if you ignore a first-name Person, you'll probably miss many Family Relation instances, since it is very common to state such a relation in a format such as "Hillary Clinton's daughter, Chelsea". I'd be happy to hear your thoughts about this.
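
The pattern quoted above can be sketched with a small regular expression in Python. This is only an illustration, not the Calais extraction logic; the relation word list, the `PATTERN` expression and the function name `find_family_relations` are assumptions, and the capitalized-word heuristic will also swallow a capitalized word that happens to precede the name.

```python
import re

# Hypothetical, deliberately short list of family relation words.
RELATION_WORDS = "daughter|son|wife|husband|mother|father|sister|brother"

# Matches text of the form: <Person>'s <relation>, <Relative>
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*)'s "
    r"(?P<relation>" + RELATION_WORDS + r"), "
    r"(?P<relative>[A-Z][a-z]+)"
)


def find_family_relations(text):
    """Return (person, relation, relative) triples found in the text."""
    return [
        (m.group("person"), m.group("relation"), m.group("relative"))
        for m in PATTERN.finditer(text)
    ]


relations = find_family_relations(
    "Hillary Clinton's daughter, Chelsea, attended the event."
)
# relations == [("Hillary Clinton", "daughter", "Chelsea")]
```

The relative in this triple is a first-name-only Person, so dropping such mentions would leave the relation with nothing to point at.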

As for the same person with different names identified as different persons - have you encountered this phenomenon in a single text (in such case, it would be great if you could provide us the text or a link to it, so we can improve our entity recognition), or across different texts?

Regards,
Naama

Hello Naama

It took me some time to reply, sorry about that. I really hadn't thought about the role of a detected person in family relations. Good point. So referring to a person on a first-name basis definitely does make sense. However, what about the identity aspect? As far as I can tell the hashes are always the same, both within a single document and across different documents. So "Sarah" seems to be the same person across all kinds of different documents, which is a fairly optimistic guess.
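
A small Python sketch of why identical hashes imply identical identity. It only assumes, for the sake of illustration, that an identifier is derived from the name string alone; the actual Calais scheme is not known here, and both function names are hypothetical.

```python
import hashlib


def name_only_id(name):
    """Identifier derived from the name alone (assumed scheme)."""
    return hashlib.sha1(name.encode("utf-8")).hexdigest()


def document_scoped_id(name, document_id):
    """Alternative: the identifier also depends on the source document."""
    key = f"{document_id}\x00{name}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


# Every "Sarah" collapses into one identity when only the name is hashed.
same_everywhere = name_only_id("Sarah") == name_only_id("Sarah")

# Scoping by document keeps the two mentions apart until something
# actually establishes that they are the same person.
kept_apart = (
    document_scoped_id("Sarah", "doc-1")
    != document_scoped_id("Sarah", "doc-2")
)
```

With the name-only variant, two unrelated people called "Sarah" are indistinguishable downstream, which is the optimistic guess mentioned above.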

"As for the same person with different names identified as different persons - have you encountered this phenomenon in a single text (in such case, it would be great if you could provide us the text or a link to it, so we can improve our entity recognition), or across different texts?"
Across different documents. Is this on purpose? Do you maintain some kind of alias list for people?