TIKA supports parsers for many document formats. Based on the detected document type, TIKA selects the appropriate parser and extracts the metadata and content.
The complete list of document formats supported by TIKA is available here:
TIKA Supported Formats and Parsers
TIKA Metadata Extraction
In TIKA, the parser package provides several interfaces and classes for extracting the metadata and content of a document. The interfaces, classes, and methods used to extract metadata are listed below:
Parser Interface
TIKA provides a separate parser for each supported document format. All of these parser classes implement the Parser interface, for example PDFParser, Mp3Parser, and OfficeParser.
See Also: TIKA Supported Formats and Parsers
CompositeParser
CompositeParser uses the composite design pattern internally, which allows a group of parsers to be used through a single instance. It delegates to the component parsers it holds, each of which implements the Parser interface.
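The composite idea can be sketched in plain Java. This is only an illustration of the pattern, not TIKA's implementation: SimpleParser, SimpleCompositeParser, and CompositeDemo are hypothetical names invented for this sketch, and the lookup by media type string is an assumption made to keep it short.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a parser interface; this is NOT Tika's Parser.
interface SimpleParser {
    String parse(String mediaType, String document);
}

// The composite: it implements the same interface as the individual parsers
// and forwards each call to the parser registered for the media type.
class SimpleCompositeParser implements SimpleParser {
    private final Map<String, SimpleParser> parsersByType = new HashMap<>();

    void register(String mediaType, SimpleParser parser) {
        parsersByType.put(mediaType, parser);
    }

    @Override
    public String parse(String mediaType, String document) {
        SimpleParser delegate = parsersByType.get(mediaType);
        if (delegate == null) {
            throw new IllegalArgumentException("No parser registered for " + mediaType);
        }
        return delegate.parse(mediaType, document);
    }
}

class CompositeDemo {
    // Builds one composite instance that fronts two toy parsers.
    static SimpleParser build() {
        SimpleCompositeParser composite = new SimpleCompositeParser();
        composite.register("text/plain", (type, doc) -> "TEXT:" + doc.trim());
        composite.register("text/csv", (type, doc) -> "CSV:" + doc.split(",").length + " columns");
        return composite;
    }

    public static void main(String[] args) {
        SimpleParser parser = build();
        System.out.println(parser.parse("text/plain", "  hello  ")); // TEXT:hello
        System.out.println(parser.parse("text/csv", "a,b,c"));       // CSV:3 columns
    }
}
```

The caller only ever holds a SimpleParser reference, so it does not need to know whether it is talking to a single parser or to a whole group.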
AutoDetectParser
AutoDetectParser is a subclass of CompositeParser that adds automatic document type detection. It detects the type of the incoming document and hands the document to the appropriate parser class through the composite mechanism.
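One common ingredient of type detection is checking the first bytes of a document against known signatures ("magic bytes"). The sketch below shows that single idea in plain Java. TypeSniffer is a hypothetical class written for this post, it covers only three signatures, and it should not be read as TIKA's detection logic, which draws on more information than this.

```java
import java.nio.charset.StandardCharsets;

class TypeSniffer {
    // Guesses a media type from the leading bytes of a document.
    static String detect(byte[] head) {
        if (startsWith(head, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (startsWith(head, new byte[] {'P', 'K', 3, 4})) {
            return "application/zip";
        }
        if (startsWith(head, "ID3".getBytes(StandardCharsets.US_ASCII))) {
            return "audio/mpeg"; // MP3 file that begins with an ID3 tag
        }
        return "application/octet-stream"; // unknown: generic binary type
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.7 ...".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdfHead));             // application/pdf
        System.out.println(detect(new byte[] {1, 2, 3})); // application/octet-stream
    }
}
```

Once a type has been guessed this way, a composite parser can use it as the key for choosing the parser to delegate to.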
parse() method
The parse() method of the Parser interface is used to extract content and metadata from a given document. Here is the prototype of the parse() method; its parameters are described in the table further below:
void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException;
TIKA provides several individual parser classes, e.g. XMLParser, PDFParser, and Mp3Parser, each of which parses a specific document type. For generic parsing, TIKA provides CompositeParser and AutoDetectParser, which select the specific parser for extracting the content and metadata (AutoDetectParser also detects the document type automatically).
Parser parser = new CompositeParser();
(or)
Parser parser = new AutoDetectParser();
(or)
create an object of any individual parser class supported by the TIKA library.
Object | Description
InputStream stream | The input stream of the document to parse.
ContentHandler handler | The handler that receives the parsed content. Tika delivers the content as XHTML through SAX API events, from which the text is extracted.
Metadata metadata | Holds the internal information (properties) of the document. This object is used both as a source of input metadata and as the target for the extracted metadata.
ParseContext context | Used to customize the parsing process according to the client's needs.
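To make the ContentHandler row more concrete, here is a small sketch of a SAX handler that keeps only the text found inside the body element of an XHTML document. It uses only the JDK's SAX classes and a hand-written XHTML string; BodyTextCollector is a hypothetical class for illustration and is not one of TIKA's handlers.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Collects the character data inside <body> and ignores everything in <head>.
class BodyTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // greater than 0 while inside <body>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (bodyDepth > 0 || qName.equals("body")) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    // Parses the XHTML string and returns the trimmed body text.
    static String extractBody(String xhtml) {
        try {
            BodyTextCollector handler = new BodyTextCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.text.toString().trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Report</title></head>"
                + "<body><p>Hello </p><p>World</p></body></html>";
        System.out.println(extractBody(xhtml)); // Hello World
    }
}
```

The title text never reaches the result because the handler only starts collecting once the body element has been opened.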
Steps to Extract Document Metadata by Parser
- Step 1: Create an instance of an input stream of the document.
File file = new File(filepath);
FileInputStream inputstream = new FileInputStream(file);
or
InputStream stream = TikaInputStream.get(new File(filename));
Note: FileInputStream doesn't support random-access reads, which some file formats need in order to be processed efficiently. TikaInputStream can be used instead when random access to the file is needed.
- Step 2: Create an instance of ContentHandler.
TIKA provides several content handlers, including these three:
Content Handler | Description
BodyContentHandler | Picks the body part of the XHTML output and writes that content to an output writer or output stream, or forwards it to another content handler instance.
LinkContentHandler | Collects the links (href values) found in the document, which is useful for crawlers.
TeeContentHandler | Forwards the parse events to several content handlers, which is useful when multiple handlers need to be used simultaneously.
Example
BodyContentHandler handler = new BodyContentHandler();
- Step 3: Create an instance of Metadata.
Metadata metadata = new Metadata();
- Step 4: Create an instance of ParseContext.
ParseContext context = new ParseContext();
- Step 5: Call the parse() method of the parser.
Call the parse() method on the parser instance with the arguments given below.
parser.parse(inputstream, handler, metadata, context);
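The note in Step 1 mentions random access. As a rough illustration of why it matters, the sketch below reads only the last few bytes of a file by seeking to them, which a purely sequential stream cannot do without first reading everything before them. It uses the JDK's RandomAccessFile rather than TikaInputStream, and TailReader and the sample file content are made up for this example.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class TailReader {
    // Reads the last `count` bytes of a file by seeking straight to them.
    static String readTail(Path file, int count) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long start = Math.max(0, raf.length() - count);
            raf.seek(start);
            byte[] buffer = new byte[(int) (raf.length() - start)];
            raf.readFully(buffer);
            return new String(buffer, StandardCharsets.US_ASCII);
        }
    }

    // Writes a small temporary file and reads back only its trailer.
    static String demo() {
        try {
            Path tmp = Files.createTempFile("sample", ".bin");
            try {
                Files.write(tmp, "header body %%EOF".getBytes(StandardCharsets.US_ASCII));
                return readTail(tmp, 5);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // %%EOF
    }
}
```

Formats that keep important structures at the end of the file benefit from this kind of access, which is the motivation behind the note.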
Complete Example: Extract Document Metadata
In this example, you will see the complete steps to extract metadata using a TIKA-supported parser.
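The five steps can be put together roughly as follows. This is a minimal sketch that assumes the Tika libraries are on the classpath and that the path of the document is passed as the first command-line argument. MetadataExtractor is a name chosen for this sketch, and the metadata names that get printed depend on the document and on the Tika version in use.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

class MetadataExtractor {
    // Parses the given file and returns the metadata filled in by the parser.
    static Metadata extract(Path file) throws IOException, SAXException, TikaException {
        Parser parser = new AutoDetectParser();
        // The default handler limits how much text it keeps;
        // new BodyContentHandler(-1) removes that limit.
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        // try-with-resources closes the stream even if parsing fails
        try (InputStream stream = TikaInputStream.get(file)) {
            parser.parse(stream, handler, metadata, context);
        }
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        Path file = Paths.get(args[0]); // assumption: document path given as argument
        Metadata metadata = extract(file);
        System.out.println("Meta Data:");
        for (String name : metadata.names()) {
            System.out.println(name + ": " + metadata.get(name));
        }
    }
}
```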
Output
Meta Data:
X-Parsed-By: org.apache.tika.parser.DefaultParser
Content-Encoding: windows-1252
How to set metadata in TIKA?
TIKA allows setting the metadata of a document. You can use the methods below to set metadata:
Metadata metadata = new Metadata();
// Setting the date metadata
metadata.set(Metadata.DATE, new Date());
// Setting several author names as a single comma-separated value
metadata.set(Metadata.AUTHOR, "Saurabh ,Gaurav ,Rahul");
Consider this an assignment: run the above program after setting this metadata. The output would be similar to:
Output
Meta Data:
X-Parsed-By: org.apache.tika.parser.DefaultParser
Content-Encoding: windows-1252
Author: Saurabh ,Gaurav ,Rahul
date: 2019-11-22T14:57:17Z
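In the output above, the three author names are stored as one comma-separated string. If each name should be a separate value, the Metadata class also has an add() method that appends a value instead of replacing it. The sketch below uses the plain string key "Author" to stay independent of version-specific constants; MultiValuedMetadata is a name made up for this example, and it assumes Tika is on the classpath.

```java
import org.apache.tika.metadata.Metadata;

class MultiValuedMetadata {
    // Builds a Metadata object whose "Author" property holds three values.
    static Metadata build() {
        Metadata metadata = new Metadata();
        metadata.set("Author", "Saurabh"); // set() replaces any existing value
        metadata.add("Author", "Gaurav");  // add() appends a further value
        metadata.add("Author", "Rahul");
        return metadata;
    }

    public static void main(String[] args) {
        Metadata metadata = build();
        System.out.println("Multi-valued: " + metadata.isMultiValued("Author"));
        for (String author : metadata.getValues("Author")) {
            System.out.println("Author: " + author);
        }
    }
}
```

Note that get("Author") returns only the first value, so getValues() is the method to use when a property may hold several values.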
In further posts, you will learn how to extract content and metadata from the document.