TikaException is the most common cached exception which required to handle while using APIs for TIKA.
Constructors
These are two constructors of the TikaException class.
- TikaException(String msg): TikaException throw with message
- TikaException(String msg, Throwable cause): TikaException throws message and cause of the exception.
Example
In this example, parsing pdf file content and metadata throwing TikaException because of using the parser for PDF doesn’t support it. By mistake or copy-paste use Parser of OOXMLParser which is generally used to parser Microsoft documents.
import java.io.File; import java.io.FileInputStream; import java.io.IOException; import org.apache.tika.exception.TikaException; import org.apache.tika.metadata.Metadata; import org.apache.tika.parser.AutoDetectParser; import org.apache.tika.parser.ParseContext; import org.apache.tika.parser.Parser; import org.apache.tika.parser.microsoft.ooxml.OOXMLParser; import org.apache.tika.parser.pdf.PDFParser; import org.apache.tika.sax.BodyContentHandler; import org.apache.tika.parser.txt.TXTParser; import org.xml.sax.SAXException; public class TikaPdfParserExample { public static void main(final String[] args) throws IOException,SAXException, TikaException { //detecting the file type BodyContentHandler handler = new BodyContentHandler(); Metadata metadata = new Metadata(); FileInputStream inputstream = new FileInputStream(new File("C:\\Users\\Saurabh Gupta\\Desktop\\TIKA\\PDF-FILE.pdf")); ParseContext pcontext=new ParseContext(); //auto detect document parser Parser parser = new OOXMLParser(); parser.parse(inputstream, handler, metadata,pcontext); System.out.println("Contents of the text document:" + handler.toString()); System.out.println("Metadata of the text document:"); String[] metadataNames = metadata.names(); for(String name : metadataNames) { System.out.println(name + " : " + metadata.get(name)); } } }
Output
Exception in thread "main" org.apache.tika.exception.TikaException: Error creating OOXML extractor
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:209)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
at com.fiot.tika.exceptions.handling.TikaTextParserExample.main(TikaTextParserExample.java:31)
Caused by: org.apache.poi.openxml4j.exceptions.NotOfficeXmlFileException: No valid entries or contents found, this is not a valid OOXML (Office Open XML) file
at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.getNextEntry(ZipArchiveThresholdInputStream.java:143)
at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.(ZipInputStreamZipEntrySource.java:47)
at org.apache.poi.openxml4j.opc.ZipPackage.(ZipPackage.java:106)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:299)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:110)
... 2 more
Caused by: java.util.zip.ZipException: Unexpected record signature: 0X46445025
at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.getNextZipEntry(ZipArchiveInputStream.java:260)
at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.getNextEntry(ZipArchiveThresholdInputStream.java:139)
... 6 more
Solutions
Always use AutoDetectParser in TIKA if not sure about document type or specific Parser as per document type.
Preferences
https://tika.apache.org/1.8/api/org/apache/tika/exception/TikaException.html
You must log in to post a comment.