TIKA Document Content Extraction


TIKA supports various parsers for different document formats. It selects the right parser and extracts the content based on the document type.

Here you can find the complete list of document formats supported by TIKA:

TIKA Supported Formats and Parsers

TIKA Content Extraction

There are two ways to extract content from a document with the TIKA API:

  1. Tika facade class: Tika.parseToString()
  2. Parser interface: Parser.parse()

TIKA Facade class : Tika.parseToString()

The parseToString() method of the Tika facade class extracts the content of a document as a string. Internally, Tika follows these steps:

  1. Detect the document type.
  2. Based on the document type, select a suitable parser from the parser repository.
  3. The selected parser parses the document and extracts the content.

Tika tika = new Tika();
String content = tika.parseToString(file);

Example : TIKA Extract Content by Tika.parseToString()

This program shows the complete steps to extract content with the Tika facade class.

import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TikaContentExtraction1 {

   public static void main(final String[] args) throws IOException, TikaException {

      File file = new File("hello.txt");

      //Instantiating Tika facade class
      Tika tika = new Tika();

      String filecontent = tika.parseToString(file);
      System.out.println("Document Content: " + filecontent);
   }
}

Output


Document Content:
This is
TIKA
Test

Parser Interface: Parser.parse()

In TIKA, the parser package provides several interfaces and classes to extract the content of a document. Here are the interfaces, classes, and methods used to extract content:

Parser Interface

TIKA supports multiple parsers, one per document format. All these parser classes implement the Parser interface, for example PDFParser, Mp3Parser, OfficeParser, etc.

See Also: TIKA Supported Documents Format and Parsers

CompositeParser

CompositeParser uses the composite design pattern, which lets a group of parsers be used through a single instance. It delegates each document to one of the component parsers, all of which implement the Parser interface.

AutoDetectParser

AutoDetectParser is a subclass of CompositeParser that adds automatic document type detection. It detects the document type and passes the document to the appropriate parser class through the composite mechanism.
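
The relationship between a composite parser and an auto-detecting subclass can be sketched with plain JDK classes. This is only a toy illustration of the pattern, not Tika's implementation: ToyParser, ToyCompositeParser, ToyAutoDetectParser, and the very naive detect() logic are all hypothetical names and assumptions made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class ToyAutoDetect {

    /** Hypothetical stand-in for a parser interface; this is not Tika's Parser. */
    interface ToyParser {
        String parse(byte[] data);
    }

    /** Composite: holds several parsers and delegates by media type. */
    static class ToyCompositeParser {
        private final Map<String, ToyParser> parsers = new LinkedHashMap<>();

        void register(String mediaType, ToyParser parser) {
            parsers.put(mediaType, parser);
        }

        String parse(String mediaType, byte[] data) {
            ToyParser parser = parsers.get(mediaType);
            if (parser == null) {
                throw new IllegalArgumentException("No parser for " + mediaType);
            }
            return parser.parse(data);
        }
    }

    /** Subclass that adds type detection on top of the composite. */
    static class ToyAutoDetectParser extends ToyCompositeParser {
        String detect(byte[] data) {
            // "%PDF" is the well-known signature at the start of PDF files.
            if (data.length >= 4 && data[0] == '%' && data[1] == 'P'
                    && data[2] == 'D' && data[3] == 'F') {
                return "application/pdf";
            }
            return "text/plain"; // naive fallback, good enough for this sketch
        }

        String parse(byte[] data) {
            // Detect first, then let the composite pick the matching parser.
            return parse(detect(data), data);
        }
    }

    static ToyAutoDetectParser newParser() {
        ToyAutoDetectParser parser = new ToyAutoDetectParser();
        parser.register("text/plain", data -> new String(data, StandardCharsets.UTF_8));
        parser.register("application/pdf", data -> "[PDF, " + data.length + " bytes]");
        return parser;
    }

    public static void main(String[] args) {
        ToyAutoDetectParser parser = newParser();
        System.out.println(parser.parse("This is TIKA Test".getBytes(StandardCharsets.UTF_8)));
        System.out.println(parser.parse("%PDF-1.7".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The caller only ever talks to one object, while the choice of the concrete parser stays hidden behind it. That is the convenience the composite approach is meant to provide.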

parse() method

The parse() method of the Parser interface is used to extract content and metadata from the given document. Here is the prototype of the parse() method, followed by its parameter descriptions:

void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException;

TIKA supports several individual parser classes, e.g. XMLParser, PDFParser, Mp3Parser, etc., which can be used to parse a specific document type. If you want a generic way of parsing, TIKA provides CompositeParser and AutoDetectParser. AutoDetectParser automatically detects the document type and selects the specific parser for extracting the content and metadata, while a CompositeParser only delegates to the parsers it has been given.


Parser parser = new CompositeParser();
   (or)
Parser parser = new AutoDetectParser();
   (or)
Create an object of any individual parser supported by the TIKA library.

Object                    Description
InputStream stream        The input stream of the file.
ContentHandler handler    Tika emits the parsed content as XHTML SAX events to this handler, which extracts the text content.
Metadata metadata         Holds the internal information of the document. This object is both an input and an output of the parse: it can carry hints in, and it receives the extracted metadata.
ParseContext context      Used to customize the parsing process according to the client's needs.
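
To make the role of the ContentHandler more concrete, here is a minimal sketch of a SAX handler that keeps only the text inside the body element of an XHTML document. It uses only the JDK's SAX classes. BodyTextSketch and BodyTextHandler are hypothetical names, the XHTML string is hand-written for the example, and this is not how Tika's own handlers are implemented.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BodyTextSketch {

    /** Collects character data that appears inside the body element. */
    static class BodyTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private boolean inBody = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("body".equals(qName)) {
                inBody = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("body".equals(qName)) {
                inBody = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text outside the body (for example the title) is ignored.
            if (inBody) {
                text.append(ch, start, length);
            }
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Feeds the XHTML to the handler as SAX events and returns the collected body text. */
    static String extractBodyText(String xhtml) {
        try {
            BodyTextHandler handler = new BodyTextHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>hello.txt</title></head>"
                + "<body><p>This is TIKA Test</p></body></html>";
        System.out.println(extractBodyText(xhtml));
    }
}
```

The handler never sees the original file format. It only receives element and character events, which is why the same handler can be reused for any document type a parser can turn into XHTML.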

Steps to Extract Document content by Parser

  • Step 1: Create an instance of an input stream of the document.
File file = new File(filepath);
FileInputStream inputstream = new FileInputStream(file);
   or
InputStream stream = TikaInputStream.get(new File(filepath));

Note: FileInputStream doesn’t support random access reads, which some file formats need in order to be processed efficiently. TikaInputStream can be used to get random access to the file.

  • Step 2: Create an instance of ContentHandler.
    TIKA provides several content handlers, including these three:
Content Handler       Description
BodyContentHandler    Picks the body part of the XHTML output and either writes it as text to a writer, an output stream, or an internal string buffer, or passes it on to another content handler instance.
LinkContentHandler    Collects only the links (href values) found in the XHTML output, which is useful for crawlers.
TeeContentHandler     Forwards the output to several content handlers at once, which is useful when multiple handlers need to process the same document simultaneously.

Example

BodyContentHandler handler = new BodyContentHandler();
  • Step 3: Create an instance of Metadata.
Metadata metadata = new Metadata();
  • Step 4: Create an instance of ParseContext.
ParseContext context = new ParseContext();
  • Step 5: Call the parse() method.
    Call the parse() method on the parser instance with the arguments given below.

parser.parse(inputstream, handler, metadata, context);
  • Step 6: Extract the document content.

Call the handler.toString() method to get the parsed content of the document as text.
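
The note in Step 1 about stream capabilities can be illustrated with JDK classes alone. Type detection typically needs to look at the first bytes of a document and then let the parser read from the start again. The sketch below shows that a plain FileInputStream cannot be rewound with mark/reset, while a buffered wrapper can. This is only an analogy: PeekAndRewind, peek(), and readUpTo() are hypothetical helpers, and the sketch does not show how TikaInputStream itself works.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class PeekAndRewind {

    /** Reads up to n bytes from the current position of the stream. */
    static byte[] readUpTo(InputStream in, int n) throws IOException {
        byte[] buf = new byte[n];
        int total = 0;
        while (total < n) {
            int r = in.read(buf, total, n - total);
            if (r < 0) {
                break; // end of stream
            }
            total += r;
        }
        return Arrays.copyOf(buf, total);
    }

    /** Looks at the first n bytes, then rewinds so the stream can be read from the start again. */
    static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IOException("stream cannot be rewound");
        }
        in.mark(n);
        byte[] head = readUpTo(in, n);
        in.reset();
        return head;
    }

    /** Writes a small temporary file and compares a raw and a buffered stream over it. */
    static String demo() {
        try {
            Path file = Files.createTempFile("hello", ".txt");
            try {
                Files.write(file, "This is TIKA Test".getBytes(StandardCharsets.UTF_8));
                try (FileInputStream raw = new FileInputStream(file.toFile());
                     InputStream buffered =
                             new BufferedInputStream(new FileInputStream(file.toFile()))) {
                    String head = new String(peek(buffered, 4), StandardCharsets.UTF_8);
                    // After the rewind, the full content is still available.
                    String all = new String(readUpTo(buffered, 1024), StandardCharsets.UTF_8);
                    return raw.markSupported() + " " + buffered.markSupported()
                            + " " + head + "|" + all;
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With the temporary file above, demo() reports that the raw stream does not support mark/reset, that the buffered one does, and that the peeked prefix "This" is followed by the unchanged full content.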

Complete Example: Extract Document Content

This example shows the complete steps to extract content with a TIKA parser.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

import org.xml.sax.SAXException;

public class TikaContentExtractionByParser {

   public static void main(final String[] args) throws IOException, SAXException, TikaException {

      File file = new File("hello.txt");

      // parse() method parameters
      Parser parser = new AutoDetectParser();
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      ParseContext context = new ParseContext();

      // parsing the file hello.txt; try-with-resources closes the stream afterwards
      try (FileInputStream inputstream = new FileInputStream(file)) {
         parser.parse(inputstream, handler, metadata, context);
      }

      // handler (lower case) is the instance created above
      System.out.println("Document Content: " + handler.toString());
   }
}

Output

Document Content:
This is
TIKA
Test

Further posts will cover how to extract both content and metadata from a document.