[Solved] org.apache.tika.exception.TikaException: Error creating OOXML extractor

TikaException is a checked exception, and the one you will most often need to handle when using the Apache Tika APIs.

Constructors

The TikaException class has two constructors:

  • TikaException(String msg): creates a TikaException with the given message.
  • TikaException(String msg, Throwable cause): creates a TikaException with the given message and the underlying cause of the exception.

Example

In this example, parsing the content and metadata of a PDF file throws a TikaException because the parser used does not support PDF. The code mistakenly (for instance, after a copy-paste) uses OOXMLParser, which is meant for Microsoft Office Open XML documents such as .docx, .xlsx, and .pptx.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser; // alternative parser, see Solutions
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.parser.pdf.PDFParser; // alternative parser, see Solutions
import org.apache.tika.sax.BodyContentHandler;

import org.xml.sax.SAXException;

public class TikaPdfParserExample {

    public static void main(final String[] args) throws IOException, SAXException, TikaException {

        // Collect the extracted body text and the document metadata
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext pcontext = new ParseContext();

        // try-with-resources closes the stream even when parse() throws
        try (InputStream inputstream = new FileInputStream(
                new File("C:\\Users\\Saurabh Gupta\\Desktop\\TIKA\\PDF-FILE.pdf"))) {

            // Wrong parser on purpose: OOXMLParser handles Office Open XML
            // files only, so parsing a PDF with it throws TikaException
            Parser parser = new OOXMLParser();
            parser.parse(inputstream, handler, metadata, pcontext);
        }

        System.out.println("Contents of the document:" + handler.toString());
        System.out.println("Metadata of the document:");
        String[] metadataNames = metadata.names();

        for (String name : metadataNames) {
            System.out.println(name + " : " + metadata.get(name));
        }
    }
}

Output


Exception in thread "main" org.apache.tika.exception.TikaException: Error creating OOXML extractor
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:209)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
    at com.fiot.tika.exceptions.handling.TikaTextParserExample.main(TikaTextParserExample.java:31)
Caused by: org.apache.poi.openxml4j.exceptions.NotOfficeXmlFileException: No valid entries or contents found, this is not a valid OOXML (Office Open XML) file
    at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.getNextEntry(ZipArchiveThresholdInputStream.java:143)
    at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:47)
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:106)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:299)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:110)
    ... 2 more
Caused by: java.util.zip.ZipException: Unexpected record signature: 0X46445025
    at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.getNextZipEntry(ZipArchiveInputStream.java:260)
    at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.getNextEntry(ZipArchiveThresholdInputStream.java:139)
    ... 6 more
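
The innermost cause in the trace above already hints at what the file really is. ZIP signatures are read as little-endian 32-bit integers, so the reported value 0X46445025 corresponds to the bytes 25 50 44 46, which are the ASCII characters %PDF that open a PDF file. A minimal JDK-only sketch of that decoding follows; the class and method names are hypothetical and are not part of Tika or POI.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only
class SignatureDecoder {

    // Turns a little-endian 32-bit signature value back into the four
    // bytes as they appear in the file, rendered as ASCII text.
    static String decode(long signature) {
        byte[] bytes = ByteBuffer.allocate(4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt((int) signature)
                .array();
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Value copied from the ZipException message in the trace
        System.out.println(decode(0x46445025L)); // prints "%PDF"

        // For comparison: a ZIP local file header signature (0x04034b50)
        // begins with the ASCII characters "PK"
        System.out.println(decode(0x04034b50L).substring(0, 2)); // prints "PK"
    }
}
```

In other words, the ZIP reader inside the OOXML parser was expecting a "PK" record and found the start of a PDF instead.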

Solutions

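Use AutoDetectParser when the document type is not known in advance; it detects the type and delegates to the matching parser. When the type is known, use the parser written for that type, for example PDFParser for PDF files. In the example above, replacing new OOXMLParser() with new AutoDetectParser() or new PDFParser() resolves this exception.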
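
If a specific parser has to stay hard-coded, a cheap check on the first bytes of the file can catch a mismatch before Tika throws. The following is a rough JDK-only sketch under that assumption. MagicBytes, Kind, and sniff are hypothetical names, and the check only tells PDF apart from ZIP containers.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only
class MagicBytes {

    enum Kind { PDF, ZIP_CONTAINER, UNKNOWN }

    // Classifies a file by its first four bytes.
    static Kind sniff(byte[] head) {
        if (head == null || head.length < 4) {
            return Kind.UNKNOWN;
        }
        if (head[0] == '%' && head[1] == 'P' && head[2] == 'D' && head[3] == 'F') {
            return Kind.PDF;
        }
        if (head[0] == 'P' && head[1] == 'K' && head[2] == 3 && head[3] == 4) {
            return Kind.ZIP_CONTAINER;
        }
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a real file stream; only the header bytes matter here
        byte[] fakePdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);

        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(fakePdf))) {
            in.mark(4);                 // remember the start of the stream
            byte[] head = new byte[4];
            int read = 0;
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) {
                    break;              // stream shorter than four bytes
                }
                read += n;
            }
            in.reset();                 // rewind so a parser sees the whole stream

            Kind kind = (read == 4) ? sniff(head) : Kind.UNKNOWN;
            System.out.println(kind);   // prints "PDF"
        }
    }
}
```

A ZIP_CONTAINER result is only a hint, since OOXML files share the ZIP signature with JAR and OpenDocument files, among others. AutoDetectParser performs this kind of detection itself, so a guard like this is mainly useful where a specific parser is wired in on purpose.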

References

https://tika.apache.org/1.8/api/org/apache/tika/exception/TikaException.html