[Solved] org.apache.tika.exception.TikaMemoryLimitException


TikaMemoryLimitException is a subclass of TikaException. This exception generally occurs when a document contains many nested or embedded files.

For example:

  1. Maven jars, where one jar contains a POM that references other dependencies
  2. Git objects
  3. Word documents with many embedded files

Parsing these nested/embedded files requires a large amount of memory. When the parser's memory consumption reaches its upper limit, it throws this exception.

Solutions

  1. Set the memory limit for Tika as high as possible, at least more than 1 GB.
  2. Make it a common practice to shield the input stream with CloseShieldInputStream, so that it fails when the maximum limit is reached.
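For the first solution, one quick sanity check is to look at how much heap the JVM running Tika is actually allowed to use. The sketch below assumes the "memory limit" here means the JVM maximum heap (set with the standard -Xmx option, e.g. -Xmx2g); the class and method names are hypothetical and are not part of Tika.

```java
// A minimal sketch: report the JVM's maximum heap and warn if it is not above 1 GB.
// Assumption: the limit discussed in this post is the JVM heap, not a Tika-specific setting.
class HeapCheck {

    static final long ONE_GB = 1024L * 1024L * 1024L;

    // True only if the available heap is strictly more than the given minimum.
    static boolean meetsMinimum(long maxHeapBytes, long minimumBytes) {
        return maxHeapBytes > minimumBytes;
    }

    public static void main(String[] args) {
        // maxMemory() returns the most memory the JVM will attempt to use, in bytes.
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.printf("Max heap: %d MB%n", maxHeap / (1024L * 1024L));

        if (!meetsMinimum(maxHeap, ONE_GB)) {
            System.out.println("Heap is at or below 1 GB; consider a larger -Xmx value, e.g. -Xmx2g.");
        }
    }
}
```

Running this in the same JVM configuration as the parser shows whether the heap setting was really applied before digging further into the stream handling.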

In Tika, these allocations generally come from TikaInputStream.get(InputStream, TemporaryResources), which checks the type of the InputStream to determine whether it supports mark. The types it checks for are:

  • BufferedInputStream
  • ByteArrayInputStream

Unfortunately, because of the common practice of wrapping InputStreams in CloseShieldInputStream, the wrapped stream is neither of these types, which can cause this exception even if mark is in fact supported.
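The difference between checking a stream's type and asking the stream whether it supports mark can be shown with plain java.io classes. This is only an illustrative sketch, not Tika's actual code: ShieldedStream is a hypothetical stand-in for a close-shielding wrapper (it is not the Commons IO class), and both helper methods are made up for the example.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;

class MarkSupportCheck {

    // Hypothetical close-shielding wrapper: delegates everything except close().
    static final class ShieldedStream extends FilterInputStream {
        ShieldedStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // Intentionally does nothing, so the underlying stream stays open.
        }
    }

    // Type-based check: only recognizes the two concrete classes named above.
    static boolean supportsMarkByType(InputStream in) {
        return in instanceof BufferedInputStream || in instanceof ByteArrayInputStream;
    }

    // Capability-based check: asks the stream itself.
    static boolean supportsMarkByCapability(InputStream in) {
        return in.markSupported();
    }

    public static void main(String[] args) {
        InputStream raw = new ByteArrayInputStream(new byte[] {1, 2, 3});
        InputStream shielded = new ShieldedStream(raw);

        System.out.println("raw, by type:            " + supportsMarkByType(raw));
        System.out.println("shielded, by type:       " + supportsMarkByType(shielded));
        System.out.println("shielded, by capability: " + supportsMarkByCapability(shielded));
    }
}
```

The wrapper still supports mark because FilterInputStream forwards markSupported() to the stream it wraps, but a check based on the concrete class no longer recognizes it. Under that assumption, a caller that relies on the type check would treat the shielded stream as if it needed extra buffering.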

public class TikaMemoryLimitException extends TikaException

Constructors

  • TikaMemoryLimitException(String msg)

References

https://tika.apache.org/1.22/api/org/apache/tika/exception/TikaMemoryLimitException.html