[Solved] org.apache.tika.exception.TikaException: Error creating OOXML extractor


TikaException is the checked exception you will most often need to handle when using the Apache Tika APIs.

Constructors

The TikaException class has two constructors.

  • TikaException(String msg): creates the exception with a message.
  • TikaException(String msg, Throwable cause): creates the exception with a message and the underlying cause.

Example

In this example, parsing the content and metadata of a PDF file throws a TikaException because the parser used does not support PDF. The code uses OOXMLParser, perhaps through a copy-paste mistake; that parser is meant for Microsoft Office Open XML documents.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.parser.txt.TXTParser;

import org.xml.sax.SAXException;

public class TikaPdfParserExample {

   public static void main(final String[] args) throws IOException,SAXException, TikaException {

      //handler and metadata objects that receive the parsed output
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      FileInputStream inputstream = new FileInputStream(new File("C:\\Users\\Saurabh Gupta\\Desktop\\TIKA\\PDF-FILE.pdf"));
      ParseContext pcontext=new ParseContext();

      //wrong parser for a PDF: OOXMLParser only handles Office Open XML files
      Parser parser = new OOXMLParser();
      parser.parse(inputstream, handler, metadata, pcontext);
      System.out.println("Contents of the text document:" + handler.toString());
      System.out.println("Metadata of the text document:");
      String[] metadataNames = metadata.names();

      for(String name : metadataNames) {
         System.out.println(name + " : " + metadata.get(name));
      }
   }
}

Output


Exception in thread "main" org.apache.tika.exception.TikaException: Error creating OOXML extractor
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:209)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
    at com.fiot.tika.exceptions.handling.TikaTextParserExample.main(TikaTextParserExample.java:31)
Caused by: org.apache.poi.openxml4j.exceptions.NotOfficeXmlFileException: No valid entries or contents found, this is not a valid OOXML (Office Open XML) file
    at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.getNextEntry(ZipArchiveThresholdInputStream.java:143)
    at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:47)
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:106)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:299)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:110)
    ... 2 more
Caused by: java.util.zip.ZipException: Unexpected record signature: 0X46445025
    at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.getNextZipEntry(ZipArchiveInputStream.java:260)
    at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.getNextEntry(ZipArchiveThresholdInputStream.java:139)
    ... 6 more
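
The last cause in the trace hints at what went wrong. As a small sketch (the class name SignatureDecoder is made up for illustration and is not part of Tika or POI), the reported signature can be decoded as four little-endian bytes. The result suggests the ZIP reader found the start of a PDF file where it expected a ZIP entry header:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, written only to illustrate the stack trace above.
class SignatureDecoder {

    // Signature value reported in the ZipException message.
    static final int REPORTED_SIGNATURE = 0x46445025;

    // Local file header signature that a ZIP entry starts with ("PK\003\004").
    static final int ZIP_LOCAL_HEADER = 0x04034b50;

    // Turns a 32-bit little-endian signature back into the four bytes read from the file.
    static String toAscii(int signature) {
        byte[] bytes = ByteBuffer.allocate(4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt(signature)
                .array();
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // The four bytes that were actually at the start of the file.
        System.out.println(toAscii(REPORTED_SIGNATURE));
        // Whether those bytes match what a ZIP-based (OOXML) file starts with.
        System.out.println(REPORTED_SIGNATURE == ZIP_LOCAL_HEADER);
    }
}
```

Running it prints %PDF and then false: the file begins with the PDF header, so OOXMLParser, which expects a ZIP container, cannot open it.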

Solutions

If you are not sure of the document type, use AutoDetectParser, which picks a parser based on the detected type. If you do know the type, use the parser that matches it, for example PDFParser for PDF files.
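
To illustrate the idea of checking the type before choosing a parser, here is a rough sketch that classifies a file by its first four bytes. The HeaderSniffer class and its Kind labels are hypothetical names for this example only; this is not how Tika is implemented, and AutoDetectParser does its own, far more thorough detection. The sketch assumes the caller has already read the leading bytes of the file:

```java
// Hypothetical helper for illustration; not a Tika class.
class HeaderSniffer {

    // Labels used only in this sketch.
    enum Kind { PDF, ZIP_CONTAINER, UNKNOWN }

    // Compares the leading bytes of a file with two well-known signatures.
    static Kind classify(byte[] head) {
        if (head == null || head.length < 4) {
            return Kind.UNKNOWN;
        }
        // PDF files start with the ASCII characters "%PDF".
        if (head[0] == '%' && head[1] == 'P' && head[2] == 'D' && head[3] == 'F') {
            return Kind.PDF;
        }
        // ZIP containers, which OOXML files are, start with "PK" followed by bytes 3 and 4.
        if (head[0] == 'P' && head[1] == 'K' && head[2] == 3 && head[3] == 4) {
            return Kind.ZIP_CONTAINER;
        }
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        byte[] pdfHead = { '%', 'P', 'D', 'F', '-', '1', '.', '7' };
        byte[] zipHead = { 'P', 'K', 3, 4 };
        System.out.println(classify(pdfHead)); // a PDF parser would be the right choice
        System.out.println(classify(zipHead)); // an OOXML parser may apply
    }
}
```

A ZIP signature alone does not prove that a file is an Office document, since any ZIP archive starts the same way. That is one reason to leave detection to AutoDetectParser in real code.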

References

https://tika.apache.org/1.8/api/org/apache/tika/exception/TikaException.html
