Category Archives: Tika

[Solved] org.apache.tika.parser.utils.DataURISchemeParseException


DataURISchemeParseException is a subclass of TikaException. This exception occurs when the syntax or encoding of data URI content does not match the data URI scheme.

public class DataURISchemeParseException extends TikaException

Constructors

  • DataURISchemeParseException(String msg)

What is the Data URI Scheme?

The data URI scheme is a URI scheme that provides a way to include data inline in web pages as if it were an external resource. It is useful for delivering CSS or images together with the web page, under the same URL, so that no separate HTTP request is needed to download them.

For more detail about the data URI scheme, refer to this link: https://en.wikipedia.org/wiki/Data_URI_scheme
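
To make the format concrete, here is a minimal sketch of decoding a data URI with only the Java standard library. It is not TIKA's own parsing code: the names DataUriSketch and decode are hypothetical, and only the base64 form is decoded. A URI without the data: prefix or the comma separator is rejected, which is the kind of syntax mismatch this exception reports.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration; this is not TIKA's data URI parser.
final class DataUriSketch {

    // Decodes the payload of a "data:[<media type>][;base64],<data>" URI.
    static byte[] decode(String uri) {
        if (!uri.startsWith("data:")) {
            throw new IllegalArgumentException("Not a data URI: " + uri);
        }
        int comma = uri.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("Missing ',' separator: " + uri);
        }
        String header = uri.substring("data:".length(), comma);
        String payload = uri.substring(comma + 1);
        if (header.endsWith(";base64")) {
            // Base64.Decoder throws IllegalArgumentException on malformed input
            return Base64.getDecoder().decode(payload);
        }
        // Assumption: plain payload; percent-decoding is omitted in this sketch
        return payload.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = decode("data:text/plain;base64,SGVsbG8=");
        System.out.println(new String(data, StandardCharsets.UTF_8)); // Hello
    }
}
```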

References

https://tika.apache.org/1.22/api/org/apache/tika/parser/utils/DataURISchemeParseException.html

[Solved] org.apache.tika.parser.chm.exception.ChmParsingException


ChmParsingException is a subclass of TikaException. This exception occurs when there is a problem with the CHM file.

public class ChmParsingException extends TikaException

Constructors

  • ChmParsingException(String description)

What is CHM?

CHM is a compiled HTML help format used for software documentation, which consists of HTML pages, indexes, and other navigation tools. These files are compressed and deployed in binary format.

CHM files support the following features:

  • Data compression.
  • Built-in search engine.
  • Multiple .chm files can be merged into one file.
  • Extended character support, although Unicode is not fully supported.

 


[Solved] org.apache.tika.io.EndianUtils.BufferUnderrunException


BufferUnderrunException is a subclass of TikaException, declared as a nested class of EndianUtils. This exception occurs when a buffer is fed at a lower rate than it is read, so that a read finds less data than it requires. There can be many reasons for this, such as a connection interruption, a corrupted hard drive, or a CPU speed issue.

public static class EndianUtils.BufferUnderrunException extends TikaException

Constructors

  • BufferUnderrunException()

Solutions

Because this issue can have several causes, there are several possible solutions:

  1. Increase the buffer size.
  2. Defragment the hard drive before writing data to external devices.
  3. Avoid writing data to a device over the network.
  4. Always take a backup of the data before transferring it.
  5. Run hard drive scanning software to identify corrupted files on the machine before exporting them.
  6. Give TIKA a higher memory limit, and check the CPU and hard drive speed requirements, to ensure that enough RAM is available.
  7. Make sure the device consuming the data, or the network connection, is functioning properly.
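
Because the exception is nested in EndianUtils, a likely trigger inside TIKA itself is a read helper that reaches the end of the stream before it has all the bytes it needs. The sketch below illustrates that condition with the standard library only. It is not TIKA's code: LittleEndianSketch and readUShortLE are names chosen for this example, and java.io.EOFException stands in for BufferUnderrunException.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Illustrative only; EOFException stands in for BufferUnderrunException.
final class LittleEndianSketch {

    // Reads an unsigned 16-bit little-endian value and fails if the stream ends early.
    static int readUShortLE(InputStream in) throws IOException {
        int low = in.read();
        int high = in.read();
        // read() returns -1 at end of stream, so a negative OR means missing bytes
        if ((low | high) < 0) {
            throw new EOFException("Insufficient data left in stream for required read");
        }
        return (high << 8) | low;
    }

    public static void main(String[] args) throws IOException {
        // Two bytes available: the read succeeds
        System.out.println(readUShortLE(new ByteArrayInputStream(new byte[] {0x34, 0x12}))); // 4660
        try {
            // Only one byte available: the read underruns
            readUShortLE(new ByteArrayInputStream(new byte[] {0x34}));
        } catch (EOFException e) {
            System.out.println("Underrun: " + e.getMessage());
        }
    }
}
```

When the input is a truncated or partially written file, catching the exception and reporting the file as incomplete is usually more helpful than retrying the read.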

References

https://tika.apache.org/1.22/api/org/apache/tika/io/EndianUtils.BufferUnderrunException.html

[Solved] org.apache.tika.exception.TikaMemoryLimitException


TikaMemoryLimitException is a subclass of TikaException. This exception generally occurs when documents contain many nested or embedded files.

For example:

  1. Maven JARs, where one JAR contains a POM that references other dependencies.
  2. Git objects.
  3. Word documents that have many embedded files.

Parsing these nested/embedded files requires a large amount of memory, so a parser whose memory consumption reaches the maximum limit throws this exception.

Solutions

  1. Set the memory usage limit for TIKA as high as possible, to at least more than 1 GB.
  2. Make it a common practice to shield the input stream with CloseShieldInputStream so that parsing can fail cleanly when it reaches the maximum limit.

Generally in TIKA, these allocations come from TikaInputStream.get(InputStream, TemporaryResources), which checks the type of the InputStream to identify whether it supports mark:

  • BufferedInputStream
  • ByteArrayInputStream

Unfortunately, because of this common practice of wrapping InputStreams in CloseShieldInputStreams, the type check does not match and the exception can occur even if mark is in fact supported.

public class TikaMemoryLimitException extends TikaException

Constructors

  • TikaMemoryLimitException(String msg)

References

https://tika.apache.org/1.22/api/org/apache/tika/exception/TikaMemoryLimitException.html

[Solved] org.apache.tika.mime.MimeTypeException


MimeTypeException is a subclass of TikaException. This exception occurs when there is a mismatch between the selected parser and the document MIME type, or when the MIME type is not supported by TIKA.

public class MimeTypeException extends TikaException

Constructors

  • MimeTypeException(String message): Constructs a MimeTypeException with the specified detail message.
  • MimeTypeException(String message, Throwable cause): Constructs a MimeTypeException with the specified detail message and root cause.

References

https://tika.apache.org/1.22/api/org/apache/tika/mime/MimeTypeException.html

TIKA: MS-Excel Content and Metadata Extraction


In this program, you will see the complete steps to extract the content and metadata of an MS Excel file by using the TIKA OOXMLParser.

Sample File

TIKA MS Excel File Content and Metadata extraction

Complete Example

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;

import org.xml.sax.SAXException;

public class TikaMSExcelParserExample {

	public static void main(final String[] args) throws IOException, TikaException, SAXException {

		// objects required by parse(): content handler, metadata, input stream and parse context
		BodyContentHandler handler = new BodyContentHandler();
		Metadata metadata = new Metadata();
		FileInputStream inputstream = new FileInputStream(new File("C:\\Users\\Saurabh Gupta\\Desktop\\TIKA\\TIKA-MS-EXCEL.xlsx"));
		ParseContext pcontext = new ParseContext();

		// OOXml parser
		OOXMLParser msofficeparser = new OOXMLParser();
		msofficeparser.parse(inputstream, handler, metadata, pcontext);
		System.out.println("Contents of the excel document:" + handler.toString());
		System.out.println("Metadata of the excel document:");
		String[] metadataNames = metadata.names();

		for (String name : metadataNames) {
			System.out.println(name + ": " + metadata.get(name));
		}
	}
}

Output


Contents of the excel document:Sheet1
    First Name  Last Name   DOB
    Saurabh Gupta   10-Dec-85
    Gaurav  Kumar   12-May-86
    Rahul   Roi 12-Jun-10
    Raghvendra  Rana    5-Jan-95
    Tanaya  Jain    13-Mar-85



Metadata of the excel document:
date: 2019-11-23T00:25:08Z
extended-properties:AppVersion: 15.0300
meta:creation-date: 2006-09-16T00:00:00Z
extended-properties:Application: Microsoft Excel
extended-properties:Company: 
Creation-Date: 2006-09-16T00:00:00Z
dcterms:created: 2006-09-16T00:00:00Z
custom:WorkbookGuid: e742a774-13a6-49b2-8ba3-1b6118163781
dcterms:modified: 2019-11-23T00:25:08Z
Last-Modified: 2019-11-23T00:25:08Z
Last-Save-Date: 2019-11-23T00:25:08Z
Application-Version: 15.0300
protected: false
meta:save-date: 2019-11-23T00:25:08Z
Application-Name: Microsoft Excel
modified: 2019-11-23T00:25:08Z
publisher: 
Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
dc:publisher: 

[Solved] org.apache.tika.exception.TikaConfigException


TikaConfigException is a subclass of TikaException. This exception occurs when there is an error in the Tika config file. It can also occur when one or more of the parsers fail to initialize from that erroneous config.

public class TikaConfigException extends TikaException

Constructors

  • TikaConfigException(String msg): Creates an instance of the exception with a message.
  • TikaConfigException(String msg, Throwable cause): Creates an instance of the exception with a message and a cause.

References

https://tika.apache.org/1.22/api/org/apache/tika/exception/TikaConfigException.html

[Solved] org.apache.tika.exception.CorruptedFileException


CorruptedFileException is a subclass of TikaException. This exception occurs when, because of corrupted content, the parse absolutely, positively has to stop. This exception must not be caught and swallowed if an embedded parser throws it.

public class CorruptedFileException extends TikaException

Constructors

  • CorruptedFileException(String msg): This constructor is used to throw the exception with an error message.
  • CorruptedFileException(String msg, Throwable cause): This constructor is used to throw the exception with an error message and the cause.

References

https://tika.apache.org/1.22/api/org/apache/tika/exception/CorruptedFileException.html

[Solved] org.apache.tika.exception.AccessPermissionException


AccessPermissionException is a subclass of TikaException. This exception occurs when a document/file does not allow content extraction. For example, this exception is most common for PDF documents.

public class AccessPermissionException extends TikaException

Solutions

Note that this exception concerns extraction permissions stored inside the document itself (for example, a PDF that disallows content extraction), not file-system permissions. Checking file access (read, write, and execute permission) before using a file with TIKA rules out file-system problems up front, so that operations can be performed accordingly.

File file = new File("TEST-File");
Path path = file.toPath();

With Java NIO Libraries
boolean isRegularFile = Files.isRegularFile(path);
boolean isHidden = Files.isHidden(path); // may throw IOException
boolean isReadable = Files.isReadable(path);
boolean isExecutable = Files.isExecutable(path);
boolean isSymbolicLink = Files.isSymbolicLink(path);
boolean isWritable = Files.isWritable(path);

With Java IO Libraries
boolean isReadable = file.canRead();
boolean isWritable = file.canWrite();
boolean isExecutable = file.canExecute();
Constructors

Here is the list of constructors for this exception class:

  • AccessPermissionException(): Default constructor.
  • AccessPermissionException(String info): Constructor with an exception message.
  • AccessPermissionException(String info, Throwable th): Constructor with an exception message and a cause.
  • AccessPermissionException(Throwable th): Constructor with a cause only.

References

https://tika.apache.org/1.22/api/org/apache/tika/exception/AccessPermissionException.html

[Solved] org.apache.tika.exception.UnsupportedFormatException


UnsupportedFormatException is a subclass of TikaException. Parsers throw this exception when they do not support a file format. It generally happens when different versions of a format cannot be told apart by MIME type.

For example, the MIME type application/wordperfect covers all versions of the WordPerfect format, while the parser supports 6.x only.

Solution

Whenever possible, distinguish file formats by a specific MIME type so that any unsupported version is taken care of by EmptyParser. If the versions cannot be distinguished by MIME type, rely on the parser to distinguish them.

Here is a complete list of supported formats, parsers, and MIME types for TIKA:

TIKA Supported Document Formats, Parsers and MIME Type

public class UnsupportedFormatException extends TikaException

 

Constructors

  • UnsupportedFormatException(String msg)

References

https://tika.apache.org/1.22/api/org/apache/tika/exception/UnsupportedFormatException.html

[Solved] org.apache.tika.exception.EncryptedDocumentException: Unable to process: document is encrypted


EncryptedDocumentException is a subclass of TikaException. This exception occurs when a TIKA parser tries to extract the content of an encrypted Microsoft Word document.

 public class EncryptedDocumentException extends TikaException

Constructors

  • EncryptedDocumentException()
  • EncryptedDocumentException(String info)
  • EncryptedDocumentException(String info, Throwable th)
  • EncryptedDocumentException(Throwable th)

The exception message and exception type depend on the type of the encrypted file (docx or doc):

  • File password-protected.docx : org.apache.tika.exception.EncryptedDocumentException: Unable to process: document is encrypted
  • File password-protected.doc : org.apache.poi.EncryptedDocumentException: Cannot process encrypted word file

Here are the stack traces for both types of document:

Tika password-protected.docx


Exception in thread "main" org.apache.tika.exception.EncryptedDocumentException: Unable to process: document is encrypted
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:245)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:142)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:418)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:112)

Tika password-protected.doc


Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@119e7782
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:142)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:418)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:112)
Caused by: org.apache.poi.EncryptedDocumentException: Cannot process encrypted word file
    at org.apache.poi.hwpf.model.FileInformationBlock.<init>(FileInformationBlock.java:77)
    at org.apache.poi.hwpf.HWPFDocumentCore.<init>(HWPFDocumentCore.java:155)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:218)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:80)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:199)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)

References

https://tika.apache.org/1.22/api/org/apache/tika/exception/EncryptedDocumentException.html

[Solved] org.apache.tika.exception.TikaException: Error creating OOXML extractor


TikaException is the most common checked exception that has to be handled while using the TIKA APIs.

Constructors

These are the two constructors of the TikaException class.

  • TikaException(String msg): Creates a TikaException with a message.
  • TikaException(String msg, Throwable cause): Creates a TikaException with a message and the cause of the exception.

Example

In this example, parsing the content and metadata of a PDF file throws a TikaException because the parser in use does not support PDF. By mistake, for example through copy-paste, the code uses OOXMLParser, which is meant for parsing Microsoft Office documents.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.parser.txt.TXTParser;

import org.xml.sax.SAXException;

public class TikaPdfParserExample {

   public static void main(final String[] args) throws IOException,SAXException, TikaException {

      //objects required by parse(): content handler, metadata, input stream and parse context
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      FileInputStream inputstream = new FileInputStream(new File("C:\\Users\\Saurabh Gupta\\Desktop\\TIKA\\PDF-FILE.pdf"));
      ParseContext pcontext=new ParseContext();

      //wrong parser on purpose: OOXMLParser cannot read a PDF file
      Parser  parser = new OOXMLParser();
      parser.parse(inputstream, handler, metadata,pcontext);
      System.out.println("Contents of the text document:" + handler.toString());
      System.out.println("Metadata of the text document:");
      String[] metadataNames = metadata.names();

      for(String name : metadataNames) {
         System.out.println(name + " : " + metadata.get(name));
      }
   }
}

Output


Exception in thread "main" org.apache.tika.exception.TikaException: Error creating OOXML extractor
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:209)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
    at com.fiot.tika.exceptions.handling.TikaTextParserExample.main(TikaTextParserExample.java:31)
Caused by: org.apache.poi.openxml4j.exceptions.NotOfficeXmlFileException: No valid entries or contents found, this is not a valid OOXML (Office Open XML) file
    at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.getNextEntry(ZipArchiveThresholdInputStream.java:143)
    at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:47)
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:106)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:299)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:110)
    ... 2 more
Caused by: java.util.zip.ZipException: Unexpected record signature: 0X46445025
    at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.getNextZipEntry(ZipArchiveInputStream.java:260)
    at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.getNextEntry(ZipArchiveThresholdInputStream.java:139)
    ... 6 more

Solutions

Use AutoDetectParser in TIKA if you are not sure about the document type. Otherwise, use the specific parser that matches the document type.
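
The last cause in the stack trace already hints at the real file type. The record signature 0X46445025 appears to be the first four bytes of the file read as a little-endian integer, and decoding it gives the PDF header. The sketch below shows that decoding with the standard library only; SignatureSketch and signatureToAscii are hypothetical names used for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Illustrative helper; not part of TIKA, POI or Commons Compress.
final class SignatureSketch {

    // Turns a 32-bit little-endian signature back into the four bytes it was read from.
    static String signatureToAscii(int signature) {
        byte[] bytes = ByteBuffer.allocate(4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt(signature)
                .array();
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Value taken from the ZipException message in the stack trace
        System.out.println(signatureToAscii(0x46445025)); // %PDF
    }
}
```

An OOXML file is a ZIP archive, so its first bytes would be "PK" instead. Seeing %PDF confirms that the input is a PDF file handed to the wrong parser.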

References

https://tika.apache.org/1.8/api/org/apache/tika/exception/TikaException.html

[Solved] org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes


ZeroByteFileException is a subclass of TikaException. It occurs when AutoDetectParser is used to extract the content of a file that has no content, that is, zero bytes. In this case, AutoDetectParser throws ZeroByteFileException.


public class ZeroByteFileException extends TikaException

Constructors

  • ZeroByteFileException(String msg): This constructor is used to throw the exception with a message.

ZeroByteFileException Example

Here is an example that parses the content and metadata of a text file by using AutoDetectParser. It throws an exception because the file has no content (zero bytes).

package com.fiot.tika.exceptions.handling;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.parser.txt.TXTParser;

import org.xml.sax.SAXException;

public class TikaTextParserExample {

   public static void main(final String[] args) throws IOException,SAXException, TikaException {

      //objects required by parse(): content handler, metadata, input stream and parse context
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      FileInputStream inputstream = new FileInputStream(new File("C:\\Users\\Saurabh Gupta\\Desktop\\TIKA\\BLANK-FILE.txt"));
      ParseContext pcontext=new ParseContext();

      //auto detect document parser
      Parser  parser = new AutoDetectParser();
      parser.parse(inputstream, handler, metadata,pcontext);
      System.out.println("Contents of the text document:" + handler.toString());
      System.out.println("Metadata of the text document:");
      String[] metadataNames = metadata.names();

      for(String name : metadataNames) {
         System.out.println(name + " : " + metadata.get(name));
      }
   }
}

Output


Exception in thread "main" org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:122)
    at com.fiot.tika.exceptions.handling.TikaTextParserExample.main(TikaTextParserExample.java:29)

Solutions

There are two ways to handle ZeroByteFileException:

  1. Always check the file size before using the file.
  2. If you already know the content type of the file, use the specific parser. For example, in the above case, replace the parser line with the text parser instance below and no exception will occur.

Parser parser = new TXTParser();
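
The first option, checking the file size up front, needs only the standard library. The sketch below is one way to do it; EmptyFileCheck and isEmpty are hypothetical names, and the temporary file only stands in for a real input such as BLANK-FILE.txt.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative guard to run before handing a file to a parser.
final class EmptyFileCheck {

    // Returns true when the file exists and contains zero bytes.
    static boolean isEmpty(Path path) throws IOException {
        return Files.size(path) == 0L;
    }

    public static void main(String[] args) throws IOException {
        // A freshly created temp file is empty, like the blank text file above
        Path blank = Files.createTempFile("blank-file", ".txt");
        try {
            if (isEmpty(blank)) {
                System.out.println("Skipping empty file");
            } else {
                System.out.println("File has content, safe to parse");
            }
        } finally {
            Files.delete(blank);
        }
    }
}
```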

References

https://tika.apache.org/1.22/api/org/apache/tika/exception/ZeroByteFileException.html

TIKA Document Content Extraction


TIKA supports various parsers for different types of document formats. TIKA decides the right parser and extracts content based on the document type.

Here you can get a complete list of TIKA supported document formats:

TIKA Supported Formats and Parsers

TIKA Content Extraction

There are two ways to extract content from a document by TIKA API:

  1. TIKA Facade class: Tika.parseToString()
  2. Parser Class : Parser.parse()

TIKA Facade class : Tika.parseToString()

The parseToString() method of the Tika facade class is used to extract content from a document. Tika internally uses the following steps to extract the content:

  1. Tika detects the document type.
  2. Based on the document type, it selects a suitable parser from the parser repository.
  3. The selected parser parses the document and extracts the content.
Tika tika = new Tika();
String content = tika.parseToString(file);

Example : TIKA Extract Content by Tika.parseToString()

Here in this program, you will see complete steps to extract content by the Tika facade class.

import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

import org.xml.sax.SAXException;

public class TikaContentExtraction1 {

   public static void main(final String[] args) throws IOException, TikaException {

      File file = new File("hello.txt");

      //Instantiating Tika facade class
      Tika tika = new Tika();

      String filecontent = tika.parseToString(file);
      System.out.println("Document Content: " + filecontent);
   }
}

Output


Document Content:
This is
TIKA
Test

Parser Interface: Parser.parse()

In TIKA, the parser package provides several interfaces and classes to extract the content of a document. Here is a list of the interfaces, classes, and methods used to extract content:

Parser Interface

TIKA supports multiple parsers according to document format. All these parser classes implement the Parser interface. For example: PDFParser, Mp3Parser, OfficeParser, etc.

See Also: TIKA Supported Documents Format and Parsers

CompositeParser

CompositeParser uses the composite design pattern internally, which allows a group of parsers to be used through a single instance. It gives access to all parsers that implement the Parser interface.
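
As a rough illustration of the composite idea, the sketch below registers child parsers by media type and lets one object delegate to them. It is not TIKA's implementation: MiniParser and MiniCompositeParser are hypothetical stand-ins, and the media type is passed in explicitly instead of being read from metadata.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CompositeSketch {

    // Hypothetical stand-in for a parser; not TIKA's Parser interface.
    interface MiniParser {
        String parse(String mediaType, String input);
    }

    // The composite is itself a MiniParser and delegates to the child registered for the type.
    static final class MiniCompositeParser implements MiniParser {
        private final Map<String, MiniParser> children = new LinkedHashMap<>();

        void register(String mediaType, MiniParser parser) {
            children.put(mediaType, parser);
        }

        @Override
        public String parse(String mediaType, String input) {
            MiniParser child = children.get(mediaType);
            if (child == null) {
                throw new IllegalArgumentException("No parser registered for " + mediaType);
            }
            return child.parse(mediaType, input);
        }
    }

    // Builds a composite with two toy child parsers.
    static MiniCompositeParser defaultComposite() {
        MiniCompositeParser composite = new MiniCompositeParser();
        composite.register("text/plain", (type, input) -> input.trim());
        composite.register("text/html", (type, input) -> input.replaceAll("<[^>]*>", ""));
        return composite;
    }

    public static void main(String[] args) {
        MiniParser parser = defaultComposite();
        System.out.println(parser.parse("text/html", "<p>Hello</p>")); // Hello
        System.out.println(parser.parse("text/plain", "  TIKA  "));    // TIKA
    }
}
```

The caller only ever sees a single MiniParser, which is the property that lets a composite be used wherever an individual parser is expected.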

AutoDetectParser

AutoDetectParser is a subclass of CompositeParser that provides automatic document type detection. It detects the document type and sends the document to the appropriate parser class by composite methodology.

parse() method

The parse() method of the Parser interface is used to extract content and metadata from the given document. Here is the prototype of the parse() method, followed by descriptions of its parameters:

parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context);

TIKA supports several individual parser classes, for example XMLParser, PDFParser, and Mp3Parser, which can be used to parse a specific document type. If you want a generic way of parsing, TIKA provides CompositeParser and AutoDetectParser; AutoDetectParser automatically detects the document type and selects the specific parser for extracting the content and metadata.


Parser parser = new CompositeParser();
   (or)
Parser parser = new AutoDetectParser();
   (or)
Create an object of any individual parser supported by the TIKA library.

Object | Description
InputStream stream | The input stream of a file.
ContentHandler handler | Tika emits the document as XHTML content through the SAX API; the handler receives these events and extracts the text content.
Metadata metadata | Metadata holds the internal information of the document. This object is used as both a source and a target of document metadata.
ParseContext context | This object is used when the parsing process needs to be customized as per client needs.

Steps to Extract Document content by Parser

  • Step 1: Create an instance of an input stream of the document.
File file = new File(filepath);
FileInputStream inputstream = new FileInputStream(file);
   or
InputStream stream = TikaInputStream.get(new File(filename));

Note: FileInputStream does not support the random-access reads needed to process some file formats efficiently. TikaInputStream can be used for random access to the file.

  • Step 2: Create an instance of ContentHandler.
    TIKA supports these three content handlers:

Content Handler | Description
BodyContentHandler | This class picks the body part of the XHTML output and writes that content to the output writer or output stream. Then it redirects the XHTML content to another content handler instance.
LinkContentHandler | This class collects only the links (href elements) found in the document, which can then be sent to crawlers.
TeeContentHandler | This class is useful when multiple content handlers need to be used simultaneously.

Example

BodyContentHandler handler = new BodyContentHandler( );
  • Step 3: Create an instance of Metadata
Metadata metadata = new Metadata();
  • Step 4: Create an instance of ParseContext
ParseContext context = new ParseContext();
  • Step 5: Call the parse() method
    Call the parse() method of the parser instance with the arguments given below.

parser.parse(inputstream, handler, metadata, context);
  • Step 6: Extract Document Content

Call the handler.toString() method to get the parsed content of the document as text.

Complete Example: Extract Document Content

In this example, you will see the complete steps to extract content by using a TIKA supported parser.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

import org.xml.sax.SAXException;

public class TikaContentExtractionByParser {

   public static void main(final String[] args) throws IOException,SAXException, TikaException {

      File file = new File("hello.txt");

      //parse() method parameters
      Parser parser = new AutoDetectParser();
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      FileInputStream inputstream = new FileInputStream(file);
      ParseContext context = new ParseContext();

      //parsing the file hello.txt
      parser.parse(inputstream, handler, metadata, context);

      System.out.println("Document Content : " + handler.toString());
   }
}

Output

Document Content:
This is
TIKA
Test

In further posts, you will learn how to extract content and metadata from documents.

TIKA Language Detection


Language detection is required when documents need to be classified based on language. TIKA provides a separate class, LanguageIdentifier, to detect the language of a text.

The LanguageIdentifier class uses the following algorithms to detect language:

Profiling Corpus Algorithm

This algorithm creates a profile for each language based on common words matched against different language dictionaries, for example common English words such as a, an, and the. It then decides the language name from the matches.

The terms used here are:

Corpus: a collection of the most commonly used terms of a written language.
Profiling: a dictionary of words of each language.

Drawback: If two languages have similar characters and words, it is difficult to decide the language based on the frequency of words.

N-gram Algorithm

As a solution to the above drawback of the "Profiling Corpus Algorithm", a newer approach uses character sequences of a given length for profiling the corpus. Such a sequence of characters in the content is called an N-gram, where N is the length of the character sequence.

The N-gram approach helps in detecting the language in the case of European languages, for example English. Tika uses a 3-gram approach for language detection. The N-gram approach is good in the case of short texts.
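
To show what a 3-gram profile looks like, the sketch below counts the character sequences of length N in a text using only the standard library. It covers just the counting step and is not TIKA's LanguageIdentifier implementation; TrigramSketch and ngrams are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative N-gram counter; a real detector would compare such counts with stored language profiles.
final class TrigramSketch {

    // Counts every character sequence of length n in the lower-cased text.
    static Map<String, Integer> ngrams(String text, int n) {
        Map<String, Integer> counts = new TreeMap<>();
        String normalized = text.toLowerCase();
        for (int i = 0; i + n <= normalized.length(); i++) {
            counts.merge(normalized.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("banana", 3)); // {ana=2, ban=1, nan=1}
    }
}
```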

TIKA Supported Languages

ISO 639-1 defines 184 standard languages, but Tika is able to detect only 18 languages, as listed below:

da—Danish de—German et—Estonian
el—Greek en—English es—Spanish
fi—Finnish fr—French hu—Hungarian
is—Icelandic it—Italian nl—Dutch
no—Norwegian pl—Polish pt—Portuguese
ru—Russian sv—Swedish th—Thai

How to detect Language by Tika?

The getLanguage() method of the LanguageIdentifier class returns the language of the text content passed to it.

//Create Language Identifier object based on content.
LanguageIdentifier object = new LanguageIdentifier("English is so funny.");
//Get language name based on the passed content.
String lang = object.getLanguage();

Example: Detect Language from Text

This example shows the steps to get the language name of the passed content.

import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.language.LanguageIdentifier;

import org.xml.sax.SAXException;

public class LanguageDetection {

   public static void main(String args[])throws IOException, SAXException, TikaException {

      LanguageIdentifier object = new LanguageIdentifier("English is so funny.");
      String lang = object.getLanguage();
      System.out.println("Detected Language is : " + lang);
   }
}

Output


Detected Language is : en

Example: Detect Language from Document Contents

To detect the language of a document, we first need to parse the document by using the parse() method. The parse() method stores the parsed content in the handler object. The content of this handler object is then passed as an argument to the LanguageIdentifier constructor to identify the language.

//Get metadata and extract content by parser parse() method.
parser.parse(inputstream, handler, metadata, context);
//Pass content as parameter of constructor of LanguageIdentifier
LanguageIdentifier object = new LanguageIdentifier(handler.toString());

Complete Example

Here are the complete steps to parse a document and detect the language of its content.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.language.*;

import org.xml.sax.SAXException;

public class TikaDocumentLanguageDetection{

   public static void main(final String[] args) throws IOException, SAXException, TikaException {

      //Instantiating a file object
      File file = new File("hello.txt");

      //Create objects of required arguments for parse() method.
      Parser parser = new AutoDetectParser();
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      FileInputStream content = new FileInputStream(file);

      //Get metadata and extract content by parser parse() method.
      parser.parse(content, handler, metadata, new ParseContext());

      LanguageIdentifier object = new LanguageIdentifier(handler.toString());
      System.out.println("File Content : " + handler.toString());
      System.out.println("Language Name : " + object.getLanguage());
   }
}

Output


File Content : English is so funny.
Language Name : en

TIKA Document Type Detection


TIKA facade class detect() method is used to detect the document type based on the input file.

Example

This program detects the type of the file passed to it.

import java.io.File;
import org.apache.tika.Tika;
public class TikaTypeDetection {

   public static void main(String[] args) throws Exception {

      //Suppose hello.txt is in your current directory
      File file = new File("hello.txt");

      //Instantiate the Tika facade class
      Tika tika = new Tika();

      //Detect the file type using the detect() method
      String filetype = tika.detect(file);
      System.out.println(filetype);
   }
}

Output


text/plain
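
The facade can also detect a type without a File object. The sketch below, assuming tika-core is on the classpath, uses two other detect() overloads: one that looks only at the file name and one that looks at the leading bytes of the content. The class name DetectWithoutFile and the sample inputs are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;

public class DetectWithoutFile {

   // Detects the type from the file name (its extension) alone; no file is opened.
   static String byName(String fileName) {
      return new Tika().detect(fileName);
   }

   // Detects the type from the first bytes of the content (magic numbers).
   static String byContent(byte[] prefix) {
      return new Tika().detect(prefix);
   }

   public static void main(String[] args) {
      System.out.println(byName("hello.txt"));

      // A PDF file starts with the marker "%PDF-", so a few bytes are enough here.
      byte[] pdfHeader = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
      System.out.println(byContent(pdfHeader));
   }
}
```

Name-based detection is cheap but trusts the extension; content-based detection is the safer choice when file names are unreliable.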

TIKA Supported Document Formats


Tika supports the document formats below. The table lists the parser and the MIME types for each format.

Format | Parser | MIME Type
HyperText Markup Language | HtmlParser | text/html, application/vnd.wap.xhtml+xml, application/x-asp, application/xhtml+xml
XML and derived formats | DcXMLParser | application/xml, image/svg+xml
Microsoft Office document formats | OfficeParser |
Microsoft Office document formats | OOXMLParser | application/vnd.ms-powerpoint.template.macroenabled.12, application/vnd.ms-excel.addin.macroenabled.12, application/vnd.openxmlformats-officedocument.wordprocessingml.template, application/vnd.ms-excel.sheet.binary.macroenabled.12, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/vnd.ms-powerpoint.slide.macroenabled.12, application/vnd.ms-visio.drawing, application/vnd.ms-powerpoint.slideshow.macroenabled.12, application/vnd.ms-powerpoint.presentation.macroenabled.12, application/vnd.openxmlformats-officedocument.presentationml.slide, application/vnd.ms-excel.sheet.macroenabled.12, application/vnd.ms-word.template.macroenabled.12, application/vnd.ms-word.document.macroenabled.12, application/vnd.ms-powerpoint.addin.macroenabled.12, application/vnd.openxmlformats-officedocument.spreadsheetml.template, application/vnd.ms-xpsdocument, application/vnd.ms-visio.drawing.macroenabled.12, application/vnd.ms-visio.template.macroenabled.12, model/vnd.dwfx+xps, application/vnd.openxmlformats-officedocument.presentationml.template, application/vnd.openxmlformats-officedocument.presentationml.presentation, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, application/vnd.ms-visio.stencil, application/vnd.ms-visio.template, application/vnd.openxmlformats-officedocument.presentationml.slideshow, application/vnd.ms-visio.stencil.macroenabled.12, application/vnd.ms-excel.template.macroenabled.12
Microsoft Office document formats | OldExcelParser | application/vnd.ms-excel.workspace.3, application/vnd.ms-excel.workspace.4, application/vnd.ms-excel.sheet.2, application/vnd.ms-excel.sheet.3, application/vnd.ms-excel.sheet.4
Microsoft Office document formats | SpreadsheetMLParser |
Microsoft Office document formats | WordMLParser | application/vnd.ms-wordml
Microsoft Office document formats | Word2006MLParser | application/vnd.ms-word2006ml
Microsoft Office document formats | MSOwnerFileParser | application/x-ms-owner
OpenDocument Format | OpenDocumentParser | application/x-vnd.oasis.opendocument.presentation, application/vnd.oasis.opendocument.chart, application/x-vnd.oasis.opendocument.text-web, application/x-vnd.oasis.opendocument.image, application/vnd.oasis.opendocument.graphics-template, application/vnd.oasis.opendocument.text-web, application/x-vnd.oasis.opendocument.spreadsheet-template, application/vnd.oasis.opendocument.spreadsheet-template, application/vnd.sun.xml.writer, application/x-vnd.oasis.opendocument.graphics-template, application/vnd.oasis.opendocument.graphics, application/vnd.oasis.opendocument.spreadsheet, application/x-vnd.oasis.opendocument.chart, application/x-vnd.oasis.opendocument.spreadsheet, application/vnd.oasis.opendocument.image, application/x-vnd.oasis.opendocument.text, application/x-vnd.oasis.opendocument.text-template, application/vnd.oasis.opendocument.formula-template, application/x-vnd.oasis.opendocument.formula, application/vnd.oasis.opendocument.image-template, application/x-vnd.oasis.opendocument.image-template, application/x-vnd.oasis.opendocument.presentation-template, application/vnd.oasis.opendocument.presentation-template, application/vnd.oasis.opendocument.text, application/vnd.oasis.opendocument.text-template, application/vnd.oasis.opendocument.chart-template, application/x-vnd.oasis.opendocument.chart-template, application/x-vnd.oasis.opendocument.formula-template, application/x-vnd.oasis.opendocument.text-master, application/vnd.oasis.opendocument.presentation, application/x-vnd.oasis.opendocument.graphics, application/vnd.oasis.opendocument.formula, application/vnd.oasis.opendocument.text-master
iWorks document formats | IWorkPackageParser | application/vnd.apple.keynote, application/vnd.apple.iwork, application/vnd.apple.numbers, application/vnd.apple.pages
WordPerfect document formats | WordPerfectParser | application/vnd.wordperfect; version=5.1, application/vnd.wordperfect; version=5.0, application/vnd.wordperfect; version=6.x
WordPerfect document formats | QuattroProParser | application/x-quattro-pro; version=9
Portable Document Format | PDFParser | application/pdf
Electronic Publication Format | EpubParser | application/x-ibooks+zip, application/epub+zip
Electronic Publication Format | FictionBookParser | application/x-fictionbook+xml
Rich Text Format | RTFParser | application/rtf
Compression and packaging formats | CompressorParser | application/zlib, application/x-gzip, application/x-bzip2, application/x-compress, application/x-java-pack200, application/x-lzma, application/deflate64, application/x-lz4, application/x-snappy, application/x-brotli, application/gzip, application/x-bzip, application/x-xz
Compression and packaging formats | PackageParser | application/x-tar, application/java-archive, application/x-arj, application/x-archive, application/zip, application/x-cpio, application/x-tika-unix-dump, application/x-7z-compressed
Compression and packaging formats | RarParser | application/x-rar-compressed
Compression and packaging formats | AppleSingleFileParser | application/applefile
Text formats | TXTParser |
Feed and Syndication formats | FeedParser | application/atom+xml, application/rss+xml
Feed and Syndication formats | IptcAnpaParser | text/vnd.iptc.anpa
Help formats | ChmParser | application/vnd.ms-htmlhelp, application/x-chm, application/chm
Audio formats | AudioParser | audio/vnd.wave, audio/x-wav, audio/basic, audio/x-aiff
Audio formats | MidiParser | application/x-midi, audio/midi
Audio formats | Mp3Parser | audio/mpeg
Audio formats | Mp4Parser | video/x-m4v, application/mp4, video/3gpp, video/3gpp2, video/quicktime, audio/mp4, video/mp4
Audio formats | VorbisParser | audio/vorbis
Audio formats | OpusParser | audio/opus, audio/ogg; codecs=opus
Audio formats | SpeexParser | audio/ogg; codecs=speex, audio/speex
Audio formats | FlacParser (org.gagravarr.tika.FlacParser) | audio/x-oggflac, audio/x-flac
Image formats | ImageParser | image/png, image/vnd.wap.wbmp, image/x-jbig2, image/bmp, image/x-xcf, image/gif, image/x-icon, image/x-ms-bmp
Image formats | JpegParser | image/jpeg
Image formats | TiffParser | image/tiff
Image formats | PSDParser | image/vnd.adobe.photoshop
Image formats | BPGParser | image/bpg, image/x-bpg
Image formats | WebPParser | image/webp
Image formats | ICNSParser | image/icns
Image formats | TesseractOCRParser |
Image formats | WMFParser | image/wmf
Image formats | EMFParser | image/emf
Video formats | FLVParser | video/x-flv
Video formats | Mp4Parser | video/x-m4v, application/mp4, video/3gpp, video/3gpp2, video/quicktime, audio/mp4, video/mp4
Video formats | OggParser | audio/ogg, application/kate, application/ogg, video/daala, video/x-ogguvs, video/x-ogm, audio/x-oggpcm, video/ogg, video/x-dirac, video/x-oggrgb, video/x-oggyuv
Video formats | TheoraParser | video/theora
Video formats | PooledTimeSeriesParser |
Java class files and archives | ClassParser | application/java-vm
Source code | SourceCodeParser | text/x-c++src, text/x-groovy, text/x-java-source
Mail formats | MboxParser | application/mbox
Mail formats | RFC822Parser | message/rfc822
Mail formats | OutlookPSTParser | application/vnd.ms-outlook-pst
Mail formats | OfficeParser | application/x-tika-msoffice-embedded; format=ole10_native, application/msword, application/vnd.visio, application/vnd.ms-project, application/x-tika-msworks-spreadsheet, application/x-mspublisher, application/vnd.ms-powerpoint, application/x-tika-msoffice, application/sldworks, application/x-tika-ooxml-protected, application/vnd.ms-excel, application/vnd.ms-outlook
Mail formats | TNEFParser | application/vnd.ms-tnef, application/x-tnef, application/ms-tnef
CAD formats | DWGParser | image/vnd.dwg
Font formats | TrueTypeParser | application/x-font-ttf
Font formats | AdobeFontMetricParser | application/x-font-adobe-metric
Scientific formats | DIFParser | application/dif+xml
Scientific formats | GDALParser | application/x-gsc, image/x-ozi, application/x-pds, image/eir, application/x-usgs-dem, application/aaigrid, application/x-bag, application/elas, application/x-rs2, application/x-tsx, application/x-lcp, image/geotiff, application/x-mbtiles, application/x-cappi, application/x-netcdf, application/x-gsag, application/x-epsilon, application/x-ace2, application/jaxa-pal-sar, image/x-pcraster, application/x-msgn, image/arg, application/x-hdf, image/x-mff, application/x-kro, image/x-hdf5-image, image/x-dimap, image/x-srp, image/big-gif, application/x-envi, application/x-cosar, application/x-ntv2, image/bmp, application/x-doq2, application/x-bt, application/x-kml, application/x-gmt, application/x-rst, application/vrt, application/pcisdk, application/x-ctg, application/x-e00-grid, application/x-rik, image/ida, image/x-mff2, application/sdts-raster, application/x-snodas, image/jp2, image/sar-ceos, application/terragen, application/x-wcs, application/leveller, application/x-ingr, application/x-gtx, image/sgi, application/x-pnm, image/raster, application/fits, application/x-r, image/gif, application/x-envi-hdr, application/x-http, application/x-rmf, application/x-ecrg-toc, application/aig, application/x-rpf-toc, image/adrg, application/x-srtmhgt, application/x-generic-bin, application/jdem, image/x-airsar, application/x-webp, application/x-ngs-geoid, application/x-pcidsk, image/x-fujibas, application/x-wms, application/x-map, image/ceos, application/xpm, application/x-zmap, image/envisat, application/x-ers, application/x-doq1, application/x-isis2, application/x-nwt-grd, application/x-ppi, image/ilwis, application/x-isis3, application/x-nwt-grc, application/x-blx, application/gff, application/x-ndf, image/jpeg, application/x-geo-pdf, application/x-l1b, image/fit, application/x-gsbg, application/x-sdat, application/x-ctable2, application/x-grib, application/x-coasp, application/x-dipex, application/grass-ascii-grid, image/fits, application/x-til, application/x-dods, image/png, application/x-gxf, application/x-gs7bg, application/x-cpg, application/x-lan, application/x-xyz, image/bsb, application/x-p-aux, application/dted, application/x-rasterlite, image/nitf, image/hfa, application/x-fast, application/x-los-las
Scientific formats | GeographicInformationParser | text/iso19139+xml
Scientific formats | GeoParser | application/geotopic
Scientific formats | GribParser | application/x-grib2
Scientific formats | HDFParser | application/x-hdf
Scientific formats | ISArchiveParser | application/x-isatab
Scientific formats | NetCDFParser | application/x-netcdf
Scientific formats | MatParser | application/x-matlab-data
Executable programs and libraries | ExecutableParser | application/x-msdownload, application/x-sharedlib, application/x-elf, application/x-object, application/x-executable, application/x-coredump
Crypto formats | Pkcs7Parser | application/pkcs7-signature, application/pkcs7-mime
Crypto formats | TSDParser |
Database formats | SQLite3Parser |
Database formats | JackcessParser | application/x-msaccess
Database formats | DBFParser | application/x-dbf
Natural Language Processing | SentimentParser |
Natural Language Processing | JournalParser |
Image and Video object recognition | Tika recognition package |
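
The table above reflects one Tika release, and the parsers actually available depend on which Tika modules are on your classpath. As a rough sketch, assuming tika-core and tika-parsers are available, the following asks Tika's DefaultParser which parser is registered for each media type. The class name ListSupportedTypes is made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.DefaultParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;

public class ListSupportedTypes {

   // Maps each supported MIME type to the simple class name of the parser registered for it.
   static Map<String, String> parserByType() {
      DefaultParser parser = new DefaultParser();
      // TreeMap keeps the MIME types sorted alphabetically for readable output.
      Map<String, String> result = new TreeMap<>();
      for (Map.Entry<MediaType, Parser> entry : parser.getParsers(new ParseContext()).entrySet()) {
         result.put(entry.getKey().toString(), entry.getValue().getClass().getSimpleName());
      }
      return result;
   }

   public static void main(String[] args) {
      parserByType().forEach((type, name) -> System.out.println(type + " -> " + name));
   }
}
```

With only tika-core on the classpath the map is likely to be empty, because the individual parsers live in tika-parsers.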

References

https://tika.apache.org/1.22/formats.html

TIKA Reference API


Java programmers can integrate the Tika library into their applications by using the Tika facade class and the other classes described below.

Tika Class

The Tika facade class hides the complexity of the underlying API and provides simple methods for using Tika's functionality.

package: org.apache.tika

Constructors

The following are the constructors of the Tika class:

Constructor | Description
Tika() | Creates a Tika facade that uses the default configuration.
Tika(Detector detector) | Creates a Tika facade that uses the given detector instance.
Tika(Detector detector, Parser parser) | Creates a Tika facade that uses the given detector and parser instances.
Tika(Detector detector, Parser parser, Translator translator) | Creates a Tika facade that uses the given detector, parser, and translator instances.
Tika(TikaConfig config) | Creates a Tika facade that uses the given TikaConfig object.

Methods and Description

The following are the important methods of the Tika facade class:

Method | Description
String parseToString(File file) | Parses the given file and returns the extracted text content as a String. By default, the length of the returned string is limited.
int getMaxStringLength() | Returns the maximum length of the strings returned by the parseToString methods.
void setMaxStringLength(int maxStringLength) | Sets the maximum length of the strings returned by the parseToString methods.
Reader parse(File file) | Parses the given file and returns the extracted text content as a java.io.Reader object.
String detect(InputStream stream, Metadata metadata) | Accepts an InputStream and a Metadata object as parameters and returns the detected document type name.
String translate(InputStream text, String targetLanguage) | Accepts an InputStream and a String naming the target language. It attempts to auto-detect the source language and returns the text translated into the target language.
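
Here is a small sketch of the string-length limit in action, assuming tika-core and tika-parsers are on the classpath. It writes a temporary text file, lowers the limit with setMaxStringLength(), and extracts the text with parseToString(). The class name FacadeParseToString, the helper extract(), and the sample text are made up for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class FacadeParseToString {

   // Extracts at most maxLength characters of text from the file.
   static String extract(File file, int maxLength) throws IOException, TikaException {
      Tika tika = new Tika();
      tika.setMaxStringLength(maxLength);
      return tika.parseToString(file);
   }

   public static void main(String[] args) throws IOException, TikaException {
      // Write a small sample file so the example does not depend on existing data.
      File file = File.createTempFile("hello", ".txt");
      file.deleteOnExit();
      Files.write(file.toPath(), "English is so funny.".getBytes(StandardCharsets.UTF_8));

      System.out.println("Default limit : " + new Tika().getMaxStringLength());
      System.out.println("Truncated text : " + extract(file, 7));
   }
}
```

The limit protects against very large documents filling memory; raise it only when you really need the full text as one String.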

Parser Interface

This interface is implemented by all the parser classes of the Tika package.

package: org.apache.tika.parser

Methods

This is the important method of the Tika Parser interface:

Method | Description
void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) | Parses the given document input stream into a sequence of XHTML SAX events. After parsing, the metadata is placed in the Metadata object and the extracted document content in the ContentHandler object.
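
parse() works on any InputStream, not only on files. The following sketch, assuming tika-core and tika-parsers are on the classpath, parses a byte array held in memory and then reads the content type that the parser stored in the Metadata object. The class name ParseInMemory and the sample text are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class ParseInMemory {

   // Parses the bytes and returns the extracted body text; metadata is filled in as a side effect.
   static String parse(byte[] data, Metadata metadata) throws Exception {
      Parser parser = new AutoDetectParser();
      // Passing -1 disables the write limit that BodyContentHandler applies by default.
      BodyContentHandler handler = new BodyContentHandler(-1);
      try (InputStream stream = new ByteArrayInputStream(data)) {
         parser.parse(stream, handler, metadata, new ParseContext());
      }
      return handler.toString();
   }

   public static void main(String[] args) throws Exception {
      Metadata metadata = new Metadata();
      String text = parse("English is so funny.".getBytes(StandardCharsets.UTF_8), metadata);
      System.out.println("Content      : " + text.trim());
      System.out.println("Content-Type : " + metadata.get(Metadata.CONTENT_TYPE));
   }
}
```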

Metadata Class

The Metadata class implements various interfaces, such as CreativeCommons, Geographic, HttpHeaders, Message, MSOffice, ClimateForcast, TIFF, TikaMetadataKeys, TikaMimeKeys, and Serializable, to support various data models.

package: org.apache.tika.metadata

Constructors

Constructor | Description
Metadata() | Constructs a new, empty metadata object.

Methods

Method | Description
void add(Property property, String value) | Adds a new metadata property as a key/value pair.
void add(String name, String value) | Adds a new metadata entry as a key/value pair.
String get(Property property) | Returns the property's value (if any).
String get(String name) | Returns the key's value (if any).
Date getDate(Property property) | Returns the value of the given metadata property as a Date.
String[] getValues(Property property) | Returns all the metadata values associated with the property.
String[] getValues(String name) | Returns all the values of the given metadata key.
String[] names() | Returns all the key names of the metadata elements in a metadata object.
void set(Property property, Date date) | Sets the date value of the given metadata property.
void set(Property property, String[] values) | Sets multiple values for a metadata property.
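
A short sketch of how these methods fit together, assuming tika-core is on the classpath. The keys "title" and "keyword" and the class name MetadataSketch are made up for illustration; set(String, String) is the plain-String counterpart of the set() methods listed above.

```java
import java.util.Arrays;

import org.apache.tika.metadata.Metadata;

public class MetadataSketch {

   static Metadata sample() {
      Metadata metadata = new Metadata();
      // set() replaces any existing value for the key.
      metadata.set("title", "Hello");
      // add() appends, so one key can hold several values.
      metadata.add("keyword", "tika");
      metadata.add("keyword", "parser");
      return metadata;
   }

   public static void main(String[] args) {
      Metadata metadata = sample();
      // names() gives the keys; getValues() gives every value stored under a key.
      for (String name : metadata.names()) {
         System.out.println(name + " = " + Arrays.toString(metadata.getValues(name)));
      }
      // get() returns only the first value of a multi-valued key.
      System.out.println(metadata.get("keyword"));
   }
}
```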

LanguageIdentifier Class

This class is used to identify the language of the given content.

package: org.apache.tika.language

Constructors

Constructor | Description
LanguageIdentifier(LanguageProfile profile) | Instantiates the language identifier for the given LanguageProfile.
LanguageIdentifier(String content) | Instantiates the language identifier for the given text content.

Methods

Method | Description
String getLanguage() | Returns the language of the content of the current LanguageIdentifier object.
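
Language detection on short text can be unreliable, so it may help to check how confident the identifier is before trusting the result. The sketch below assumes a Tika 1.x tika-core on the classpath (as far as I know, LanguageIdentifier is deprecated in later 1.x releases and not part of tika-core in Tika 2.x) and uses the isReasonablyCertain() method of the same class. The class name LanguageCheck and the "unknown" fallback are made up for illustration.

```java
import org.apache.tika.language.LanguageIdentifier;

public class LanguageCheck {

   // Returns the detected language code, or "unknown" when the match is too weak to trust.
   static String detect(String text) {
      LanguageIdentifier identifier = new LanguageIdentifier(text);
      return identifier.isReasonablyCertain() ? identifier.getLanguage() : "unknown";
   }

   public static void main(String[] args) {
      System.out.println(detect("English is so funny."));
   }
}
```

The longer the input text, the more likely the identifier is to report a certain match.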

TIKA Environment Setup for Applications


In the previous post, Apache Tika Introduction, you got an idea of Apache Tika and its uses. In this post, you will learn about the Tika environment setup for applications.

As programmers, we can integrate Apache Tika on Windows, Linux, or another OS by using:

  • Tika API
  • Command-line interface (CLI) of TIKA
  • Graphical User interface (GUI) of TIKA
  • The source code.

System Requirements

  • JDK: Java SE 2 JDK 1.6 or above
  • Memory: 1 GB RAM (recommended)
  • Disk Space: No minimum requirement
  • Operating System Version: Windows XP or above, Linux

Tika Environment Setup Steps

  • Step 1: Set JAVA_HOME and PATH as described at the link below.
    JAVA_HOME and PATH Setup Steps
  • Step 2: Add these libraries to your CLASSPATH or pom.xml to use the Tika APIs.

<dependency>
   <groupId>org.apache.tika</groupId>
   <artifactId>tika-core</artifactId>
   <version>1.6</version>
</dependency>
<dependency>
   <groupId>org.apache.tika</groupId>
   <artifactId>tika-parsers</artifactId>
   <version>1.6</version>
</dependency>
<!-- org.apache.tika:tika is a POM-only artifact, so it needs type "pom" to resolve. -->
<dependency>
   <groupId>org.apache.tika</groupId>
   <artifactId>tika</artifactId>
   <version>1.6</version>
   <type>pom</type>
</dependency>
<dependency>
   <groupId>org.apache.tika</groupId>
   <artifactId>tika-serialization</artifactId>
   <version>1.6</version>
</dependency>
<dependency>
   <groupId>org.apache.tika</groupId>
   <artifactId>tika-app</artifactId>
   <version>1.6</version>
</dependency>
<dependency>
   <groupId>org.apache.tika</groupId>
   <artifactId>tika-bundle</artifactId>
   <version>1.6</version>
</dependency>
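
Once the libraries are in place, a quick way to check the setup is to instantiate the facade and print it. This is only a sanity-check sketch, assuming tika-core is on the classpath; the class name TikaSetupCheck is made up for illustration.

```java
import org.apache.tika.Tika;

public class TikaSetupCheck {

   // Tika.toString() returns a short description of the library, including its version when known.
   static String describe() {
      return new Tika().toString();
   }

   public static void main(String[] args) {
      System.out.println(describe());
   }
}
```

If this class fails to compile or throws a NoClassDefFoundError at runtime, the Tika jars are not on the classpath yet.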

Apache Tika Introduction


Apache Tika provides a generic API for content detection, analysis, and content extraction across multiple file formats. Internally, Tika uses various document parsers to extract metadata and structured text content from different file types, for example PDFs, spreadsheets, text files, and images.

The latest version of Tika, 1.22, was released on 1st Aug 2019 by the Apache Software Foundation. Tika is written entirely in Java and is cross-platform.

Tika Version History

Year | Development
2006 | The idea of Tika was proposed to the Lucene Project Management Committee.
2006 | The concept of Tika and its benefits for the Jackrabbit project were discussed.
2007 | Tika entered the Apache Incubator.
2008 | Versions 0.1 and 0.2 were released, and Tika graduated from the incubator to become a Lucene sub-project.
2009 | Versions 0.3, 0.4, and 0.5 were released.
2010 | Versions 0.6 and 0.7 were released, and Tika graduated to a top-level Apache project.
2011 | Tika 1.0 was released, and the book “Tika in Action” was released in the same year.
2019 | Tika 1.22 was released with additional support for CSV and HWP file types.

Why Tika?

As per https://filext.com/, there are around 25,000 to 50,000 file extensions (structured and unstructured), and the number grows day by day. To deal with so many formats, Tika provides a universal Java API that supports around 1,400 file types, covering the most common and popular formats.

Tika provides content extraction, metadata extraction, and language identification capabilities. Although Tika is written in Java, other languages can also use it by calling its RESTful services and CLI tools.

Where to use Apache Tika?

  • Search Engine: Search engines use Tika to extract text from digital documents and build their search indexes.
  • Document Analysis: Tika extracts content from documents such as images and PDFs so that the content can be analyzed.
  • Digital Asset Management (DAM): Organizations that maintain a library of documents, images, videos, ebooks, and drawings use Tika to classify them based on common features.
  • Content Analysis: Tika helps analyze website content to match users' interests, the way Amazon shows movies and products based on a user's visits. The extracted content can also feed machine learning.

Features of Tika

  • Unified parser interface: Tika wraps the most suitable parser libraries behind a single parser interface. This relieves developers of the burden of selecting a suitable parser library for each file type they encounter.
  • Low memory usage: Tika consumes few memory resources, so it is easy to embed in Java applications. Tika can also be used in applications that run on platforms with fewer resources, such as mobile devices and PDAs.
  • Fast processing: Tika detects and extracts content quickly.
  • Flexible metadata: Tika understands the metadata models that are used to describe files.
  • Parser integration: Tika can use the various parser libraries available for each document type within a single application.
  • MIME-type detection: Tika can detect the MIME type of a document and extract content from all the MIME types it supports.
  • Language detection: Tika includes a language identification feature, so it can be used to classify documents by language on multilingual websites.