I am using mainly the Person output from Calais, and I believe I will be able to compute a kind of "stop list" where Calais is making mistakes.

Are you interested, or is this kind of technique banned? I will use it internally to improve relevancy, but you may as well include it!


Comments

Ok,

So after using OpenCalais on a bunch of large documents,
and in particular on pages from search results, I noticed some garbage coming from
the titles. News titles are usually fully capitalized, which makes
names hard to extract.

And that is polluting my results...

So I could remove the titles, but titles usually contain great stuff, so... not a great option.
What is more interesting is that this kind of garbage is really recurrent,
meaning that the errors are usually the same.
So if I process enough data, the garbage should show up easily.

So how do I do that ?

Well, I process a large amount of docs, titles, and news,
and I extract all the names (well, almost; I won't go into details).
Then I try to find the cases where a last name is also a first name, as in
"Barack Obama" and "Obama Goes".
Then I extract the part that I suppose to be garbage, "Goes" in this case,
but that is still too much. So from these I keep only the ones that end with an "s"
(not really multi-language, but since OpenCalais is English-only that will do).
And I obtain this:

Does
Marks
Previews
Nichols
Jokes
Thomas
Hercules
Announces
Lurks
Stalkers
Voters
Goes
Speaks
Pethokoukis
Walters
Edwards
Supporters
Fridays
Articles
Blogs
Is
Jeans

This comes from the analysis of 600 names.
With more, let's say 10,000, I'll do better :)
Some of them aren't garbage,
but at least half are, so that is fine with me :D
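
For anyone who wants to try the same idea, here is a minimal Python sketch of the steps described above. It is only an illustration under assumptions: the function name and the sample names are made up, and it assumes the Person entities have already been pulled out of the OpenCalais output as plain strings.

```python
def suspected_garbage(names):
    """Return tokens suspected to be headline words rather than name parts."""
    # Keep only names with at least two tokens, e.g. "Barack Obama".
    split_names = [n.split() for n in names if len(n.split()) >= 2]

    # Tokens seen in last position, e.g. "Obama" in "Barack Obama".
    last_tokens = {parts[-1] for parts in split_names}

    candidates = set()
    for parts in split_names:
        # A "name" that starts with a known last name, e.g. "Obama Goes":
        # everything after the first token is suspected garbage.
        if parts[0] in last_tokens:
            candidates.update(parts[1:])

    # English-only heuristic: keep the candidates ending with "s".
    return sorted(c for c in candidates if c.lower().endswith("s"))


# Hypothetical sample, not real OpenCalais output.
sample = [
    "Barack Obama", "Obama Goes", "Hillary Clinton",
    "Clinton Speaks", "John Edwards", "Obama Said",
]
stop_list = suspected_garbage(sample)
```

As in the list above, real surnames ending in "s" can still slip through, so the result is a candidate list to review rather than a finished stop list.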

I'll keep you posted if I get better results.

First, thanks for sharing your approach - I'll forward it to our development team to get their thoughts.

I believe your observation regarding titles is really important here.
You are saying:
Indeed titles from news are usually fully capitalized making hard to extract. ... So I can remove the titles, but usually titles are great stuff, so... not a great option.

Because titles "behave" differently than regular natural language paragraphs we do try to identify blocks of text as titles and apply slightly different logic for extracting metadata from them.
This is true both for entities (that get messy because of full capitalization) but also for events & facts (that are often missed because titles aren't always syntactically correct).
Obviously we sometimes fail to identify titles, so if you have a better way to identify them, you can send the content as XML with tags that encapsulate the identified title. This will optimize the results returned by OpenCalais. You can find more information in this section: http://opencalais.com/APIcalls#inputcontent.

Michal

My bad ... I should always read the full documentation.

I'll get back to you if the problems I noticed are still there.

I really feel bad because I dislike users who post without reading documentation, so I am really really sorry, and hope I did not waste too much of your time.

But in the meantime, thank you so much !!!
I really appreciate these active forums :D

PS: I have uploaded my face so you can see I'm a nice guy and not a time-waster :D

I remember now. I did not use "TEXT/XML" because it did not seem to work at the time.
So I just switched to "TEXT/HTML", which worked better.

I am now using TEXT/XML, but what is not written in the documentation, I guess (at the very least, not on this page), is that BODY has to encapsulate the CONTENT, otherwise it does not work.
(It may be obvious, but it was not so obvious to me...)
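
In case it helps someone else, here is a rough Python sketch of building such an XML input with the title kept separate from the body. The element names DOCUMENT, TITLE, and BODY and the function name are assumptions for illustration only; the exact tags OpenCalais expects should be checked against the input content section of the documentation.

```python
import xml.etree.ElementTree as ET


def build_xml_input(title, body_text):
    """Wrap a title and body text in XML (tag names are assumed, not official)."""
    root = ET.Element("DOCUMENT")                 # hypothetical root element
    ET.SubElement(root, "TITLE").text = title     # the identified title
    ET.SubElement(root, "BODY").text = body_text  # the main content
    # ElementTree escapes characters such as "&" and "<" in the text.
    return ET.tostring(root, encoding="unicode")


xml_input = build_xml_input("OBAMA GOES & SPEAKS", "Barack Obama spoke today.")
```

Building the document with an XML library rather than string concatenation avoids sending malformed input when a title contains "&" or "<".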

<URL count="1" relevance="0.057">http://www.clintonlibrary<wbr></URL>

This crashes my XML parser...
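
A possible client-side workaround, sketched in Python: strip the stray `<wbr>` tags before handing the response to the parser. The function name is made up, and this assumes `<wbr>` never carries meaningful content in the response.

```python
import re
import xml.etree.ElementTree as ET

# Matches <wbr>, </wbr>, and <wbr/> in any letter case.
WBR_TAG = re.compile(r"</?wbr\s*/?>", re.IGNORECASE)


def parse_without_wbr(xml_text):
    """Remove <wbr> tags, then parse the cleaned text as XML."""
    cleaned = WBR_TAG.sub("", xml_text)
    return ET.fromstring(cleaned)


snippet = '<URL count="1" relevance="0.057">http://www.clintonlibrary<wbr></URL>'
element = parse_without_wbr(snippet)
```

This only hides the symptom on the client side; the unclosed tag in the output is still there.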

We'd be happy to hear about it. Certainly nothing like this is banned.

Why not post your approach right here and see if others can take advantage of it as well?

Regards,