A large proportion of the pages I'm sending are returning the error "Text length has exceeded the allowed size."

This even includes longer Wikipedia pages.

Is it recommended that I break pages up first before sending them? (That would be annoying.)

I'd rather just pay to increase my text limit.

Comments

Currently submitted content is limited to 100,000 characters per "transaction" (otherwise you get that error message). We may increase this limit in the future.
If you have specific needs for submitting larger texts, please drop us a note at questions@opencalais.com.

For a service like SemanticProxy, it's a pain to have to break it up manually. Could you just automatically analyze the first 100,000 characters and return what you can, instead of failing out entirely?

While I like the idea of just processing the first 100K characters, I'm worried about truncating the analysis without a mechanism for informing the user that it wasn't a full analysis.

Thoughts? Ideas?

Raise an exception, or insert a message. So long as it's consistent, it doesn't matter too much how the error is added to the response.
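
To make the "insert a message" option concrete, here is a minimal sketch of what a consistent truncation notice could look like. It is not anything Calais is known to do: the function name truncate_with_notice, the MAX_CHARS constant, and the shape of the returned dictionary are all hypothetical. Only the 100,000-character figure comes from this thread.

```python
MAX_CHARS = 100_000  # limit quoted in this thread; it may change


def truncate_with_notice(text, limit=MAX_CHARS):
    """Cut text to `limit` characters and report whether anything was dropped.

    Hypothetical helper: the result keys are assumptions for illustration.
    """
    return {
        "text": text[:limit],                      # the part that would be analyzed
        "truncated": len(text) > limit,            # True if the input was cut
        "original_length": len(text),              # size of what was submitted
        "analyzed_length": min(len(text), limit),  # size of what was kept
    }
```

A caller can then check the "truncated" flag on every response, whether or not the input was over the limit, which is the consistency the comment above asks for.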



I went ahead and wrote a script to cut off webpages at 100k characters and perform a SemanticProxy analysis. I'm getting "Parsing Error" returned each time. That's not much of a surprise considering that I'm cutting off HTML documents in arbitrary places, but it does mean that this issue needs to be addressed before I can consider using SemanticProxy for my service.
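
One way to avoid cutting a document in the middle of a tag is to extract the visible text first and truncate that instead of the raw HTML. The sketch below uses only Python's standard html.parser module. It assumes the service accepts plain text input, which this thread does not confirm, and the names TextExtractor and html_to_text are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style/noscript contents."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html, limit=100_000):
    """Return the page's visible text, cut to at most `limit` characters."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)[:limit]
```

Because the result is plain text, truncating it at any character offset cannot leave an unclosed tag behind, although it can still cut a sentence short.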

I encountered this today while working on a project that will use OpenCalais. One thing I found that helped was to strip out all JavaScript and CSS before submitting my HTML content. This might be a good idea for SemanticProxy too. In Python, all I did to cut down the character count for the URLs I was pulling was:

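import re

# typed from memory, working on a separate machine, excuse any typos please.
# `html` holds the page source. Drop newlines first so '.' can match across what were line breaks.
html = re.sub(r'\n', '', html)
# Non-greedy match of each script, noscript, and style element, in any letter case.
p = re.compile(r'<script.*?</script>|<noscript.*?</noscript>|<style.*?</style>', re.IGNORECASE)
html = p.sub('', html)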

After doing this I haven't hit the limit on any pages yet.

We're working on it. We plan to extend the size limit in the near future to deal with content such as Wikipedia articles. We'll also improve the error messaging in our next release.