TIKA Document Content Extraction


TIKA supports various parsers for different document formats. It detects the document type, selects the right parser, and extracts the content.

The complete list of document formats supported by TIKA is available here:

TIKA Supported Formats and Parsers

TIKA Content Extraction

There are two ways to extract content from a document with the TIKA API:

  1. TIKA Facade class: Tika.parseToString()
  2. Parser interface: Parser.parse()

TIKA Facade class: Tika.parseToString()

The parseToString() method of the Tika facade class extracts the content of a document as plain text. Internally, Tika performs the following steps:

  1. Detect the document type.
  2. Select a suitable parser from the parser repository, based on the detected type.
  3. Parse the document with the selected parser and extract the content.

Tika tika = new Tika();
String content = tika.parseToString(file);
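
Step 1 above relies on type detection. As a rough illustration of the idea, and not Tika's actual implementation, the sketch below guesses a MIME type from the leading "magic" bytes of a document and falls back to the file extension. The class name MiniDetector, its detect() method, and the tiny rule set are hypothetical; Tika's own detector works from a much larger registry of types.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of the Tika API.
final class MiniDetector {

   // Guess a MIME type from the first bytes of the document, then from the file name.
   static String detect(byte[] prefix, String fileName) {
      if (startsWith(prefix, "%PDF-")) {
         return "application/pdf";
      }
      if (startsWith(prefix, "PK")) {
         return "application/zip";   // ZIP container
      }
      if (fileName != null && fileName.toLowerCase(Locale.ROOT).endsWith(".txt")) {
         return "text/plain";
      }
      return "application/octet-stream";   // unknown type
   }

   // True if data begins with the ASCII bytes of magic.
   private static boolean startsWith(byte[] data, String magic) {
      byte[] m = magic.getBytes(StandardCharsets.US_ASCII);
      if (data.length < m.length) {
         return false;
      }
      for (int i = 0; i < m.length; i++) {
         if (data[i] != m[i]) {
            return false;
         }
      }
      return true;
   }

   public static void main(String[] args) {
      byte[] pdf = "%PDF-1.7 sample".getBytes(StandardCharsets.US_ASCII);
      byte[] text = "This is TIKA Test".getBytes(StandardCharsets.US_ASCII);
      System.out.println(detect(pdf, "report.bin"));   // application/pdf
      System.out.println(detect(text, "hello.txt"));   // text/plain
   }
}
```

The content check runs before the file-name check, so a PDF saved with a misleading extension is still recognized by its content.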

Example: Extract Content with Tika.parseToString()

The following program shows the complete steps to extract content with the Tika facade class.

import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TikaContentExtraction1 {

   public static void main(final String[] args) throws IOException, TikaException {

      File file = new File("hello.txt");

      //Instantiating Tika facade class
      Tika tika = new Tika();

      String filecontent = tika.parseToString(file);
      System.out.println("Document Content: " + filecontent);
   }
}

Output


Document Content:
This is
TIKA
Test

Parser Interface: Parser.parse()

In TIKA, the parser package (org.apache.tika.parser) provides several interfaces and classes for extracting the content of a document. The main interfaces, classes, and methods are described below:

Parser Interface

TIKA provides multiple parsers for the different document formats. All of these parser classes implement the Parser interface, for example PDFParser, Mp3Parser, and OfficeParser.

See Also: TIKA Supported Document Formats and Parsers

CompositeParser

CompositeParser uses the composite design pattern internally, which allows a group of parsers to be used through a single instance. It delegates each document to one of its component parsers, all of which implement the Parser interface.
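
The composite idea can be sketched in a few lines. This is a simplified illustration and not Tika's code: the MiniParser interface, the CompositeMiniParser class, and the two sample parsers are hypothetical, and the MIME type is passed in directly as a string.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Composite: one parser object that owns several parsers and delegates by MIME type.
final class CompositeMiniParser implements MiniParser {

   private final Map<String, MiniParser> parsersByType = new LinkedHashMap<>();
   private final MiniParser fallback = (type, doc) -> "";   // used when no parser matches

   void register(String mimeType, MiniParser parser) {
      parsersByType.put(mimeType, parser);
   }

   @Override
   public String parse(String mimeType, byte[] document) {
      // The caller sees a single parser; the lookup picks the real one.
      return parsersByType.getOrDefault(mimeType, fallback).parse(mimeType, document);
   }

   public static void main(String[] args) {
      CompositeMiniParser composite = new CompositeMiniParser();
      composite.register("text/plain", (type, doc) -> new String(doc, StandardCharsets.UTF_8));
      composite.register("text/csv", (type, doc) -> new String(doc, StandardCharsets.UTF_8).replace(',', ' '));

      byte[] doc = "This,is,TIKA".getBytes(StandardCharsets.UTF_8);
      System.out.println(composite.parse("text/plain", doc));   // This,is,TIKA
      System.out.println(composite.parse("text/csv", doc));     // This is TIKA
   }
}

// Hypothetical stand-in for a parser: turns raw bytes into text.
interface MiniParser {
   String parse(String mimeType, byte[] document);
}
```

Because the composite implements the same interface as its components, client code does not need to know whether it holds one parser or a whole group.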

AutoDetectParser

AutoDetectParser is a subclass of CompositeParser that adds automatic document type detection. It detects the document type and sends the document to the appropriate parser class through the composite mechanism.

parse() method

The parse() method of the Parser interface extracts content and metadata from the given document. Here is the prototype of the parse() method; its parameters are described further down:

void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException;

TIKA provides several individual parser classes, e.g. XMLParser, PDFParser, and Mp3Parser, which can be used to parse a specific document type. For generic parsing, TIKA provides AutoDetectParser, which automatically detects the document type and selects the specific parser for extracting the content and metadata. CompositeParser performs the same delegation, but it relies on the content type recorded in the Metadata object and does no detection of its own.


Parser parser = new AutoDetectParser();
   (or)
Parser parser = new CompositeParser();
   (or)
Create an object of any individual parser provided by the TIKA library.

The parameters of the parse() method are:

InputStream stream: The input stream of the document.
ContentHandler handler: Receives the document content, which Tika emits as XHTML through SAX events; the handler extracts the text content from these events.
Metadata metadata: Holds the internal information (metadata) of the document. This object is used both as a source of metadata passed to the parser and as a target for the metadata the parser extracts.
ParseContext context: Used to customize the parsing process according to the client's needs.
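
To make the ContentHandler parameter more concrete, here is a small sketch that uses only the SAX classes shipped with the JDK. It feeds a hand-written XHTML string, standing in for what a parser might emit, to a handler that keeps only the text inside the body element. The BodyTextCollector class and the sample markup are hypothetical; the class only mimics the idea behind BodyContentHandler and is not the Tika class.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects character data that appears inside <body>.
final class BodyTextCollector extends DefaultHandler {

   private final StringBuilder text = new StringBuilder();
   private boolean insideBody;

   @Override
   public void startElement(String uri, String localName, String qName, Attributes atts) {
      if ("body".equals(qName)) {
         insideBody = true;
      }
   }

   @Override
   public void endElement(String uri, String localName, String qName) {
      if ("body".equals(qName)) {
         insideBody = false;
      }
   }

   @Override
   public void characters(char[] ch, int start, int length) {
      if (insideBody) {
         text.append(ch, start, length);
      }
   }

   @Override
   public String toString() {
      return text.toString();
   }

   // Runs the JDK SAX parser over the XHTML string and returns the collected body text.
   static String extract(String xhtml) {
      BodyTextCollector handler = new BodyTextCollector();
      try {
         SAXParserFactory.newInstance().newSAXParser()
               .parse(new InputSource(new StringReader(xhtml)), handler);
      } catch (Exception e) {
         throw new IllegalStateException("could not parse sample XHTML", e);
      }
      return handler.toString();
   }

   public static void main(String[] args) {
      String xhtml = "<html><head><title>hello.txt</title></head>"
            + "<body><p>This is TIKA Test</p></body></html>";
      System.out.println(extract(xhtml));   // prints: This is TIKA Test
   }
}
```

The title text is dropped because it arrives before the body element starts, which is why the result contains only the document text.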

Steps to Extract Document Content with a Parser

  • Step 1: Create an input stream for the document.
File file = new File(filepath);
FileInputStream inputstream = new FileInputStream(file);
   or
InputStream inputstream = TikaInputStream.get(new File(filepath));

Note: FileInputStream doesn't support random access reads, which some file formats need in order to be processed efficiently. TikaInputStream can be used for random access to the file.

  • Step 2: Create an instance of ContentHandler.
    TIKA provides these three commonly used content handlers:

    BodyContentHandler: Picks the body part of the XHTML output and writes that content to an output writer or output stream, or forwards it to another content handler instance.
    LinkContentHandler: Collects only the links (href targets) found in the XHTML output, which is useful for crawlers.
    TeeContentHandler: Forwards the XHTML output to multiple content handlers, which is useful when several tools need to process the same document simultaneously.

Example

BodyContentHandler handler = new BodyContentHandler();
  • Step 3: Create an instance of Metadata.
Metadata metadata = new Metadata();
  • Step 4: Create an instance of ParseContext.
ParseContext context = new ParseContext();
  • Step 5: Call the parse() method.
    Call the parse() method on the parser instance with the arguments created in the previous steps.

parser.parse(inputstream, handler, metadata, context);
  • Step 6: Extract the document content.

Call the handler.toString() method to get the parsed content of the document as text.
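
The note in Step 1 matters because type detection has to look at the start of the stream and then rewind it before the parser reads the document. The sketch below uses only JDK classes to show the difference: a plain FileInputStream reports no mark/reset support, while a buffered stream can read a few header bytes and rewind. The PeekAndRewind class and its peek() method are hypothetical, and the sketch only illustrates why a rewindable stream such as TikaInputStream is convenient; it does not reproduce what TikaInputStream does.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only; not part of the Tika API.
final class PeekAndRewind {

   // Reads up to n header bytes, then rewinds so the caller still sees the whole stream.
   static String peek(InputStream in, int n) {
      if (!in.markSupported()) {
         throw new IllegalArgumentException("stream does not support mark/reset");
      }
      try {
         in.mark(n);
         byte[] header = new byte[n];
         int read = 0;
         while (read < n) {
            int r = in.read(header, read, n - read);
            if (r < 0) {
               break;   // stream is shorter than n bytes
            }
            read += r;
         }
         in.reset();
         return new String(header, 0, read, StandardCharsets.US_ASCII);
      } catch (IOException e) {
         throw new UncheckedIOException(e);
      }
   }

   public static void main(String[] args) throws IOException {
      Path tmp = Files.createTempFile("peek", ".bin");
      try (InputStream raw = new FileInputStream(tmp.toFile())) {
         System.out.println(raw.markSupported());   // false
      } finally {
         Files.delete(tmp);
      }

      byte[] data = "%PDF-1.7 sample".getBytes(StandardCharsets.US_ASCII);
      try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(data))) {
         System.out.println(peek(in, 5));   // %PDF-
         System.out.println(peek(in, 5));   // %PDF- again, because the stream was rewound
      }
   }
}
```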

Complete Example: Extract Document Content

This example shows the complete steps to extract content with a TIKA parser.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

import org.xml.sax.SAXException;

public class TikaContentExtractionByParser {

   public static void main(final String[] args) throws IOException, SAXException, TikaException {

      File file = new File("hello.txt");

      // parse() method parameters
      Parser parser = new AutoDetectParser();
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      ParseContext context = new ParseContext();

      // parse the file hello.txt; try-with-resources closes the stream afterwards
      try (FileInputStream inputstream = new FileInputStream(file)) {
         parser.parse(inputstream, handler, metadata, context);
      }

      // the handler now holds the extracted text
      System.out.println("Document Content: " + handler.toString());
   }
}

Output

Document Content:
This is
TIKA
Test
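
The complete example uses BodyContentHandler only. To illustrate what the other two handlers listed in Step 2 are for, the sketch below, again using only the JDK SAX classes, forwards one stream of SAX events to two handlers at once: one collects text and the other collects link targets. MiniTee, TextCollector, and LinkCollector are hypothetical stand-ins that mimic the idea of TeeContentHandler and LinkContentHandler; they are not the Tika classes, and the XHTML string stands in for parser output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class MiniTeeDemo {

   // Collects all character data.
   static final class TextCollector extends DefaultHandler {
      final StringBuilder text = new StringBuilder();

      @Override
      public void characters(char[] ch, int start, int length) {
         text.append(ch, start, length);
      }
   }

   // Collects the href attribute of every <a> element.
   static final class LinkCollector extends DefaultHandler {
      final List<String> links = new ArrayList<>();

      @Override
      public void startElement(String uri, String localName, String qName, Attributes atts) {
         if ("a".equals(qName) && atts.getValue("href") != null) {
            links.add(atts.getValue("href"));
         }
      }
   }

   // Forwards the two event types used in this sketch to every wrapped handler.
   static final class MiniTee extends DefaultHandler {
      private final ContentHandler[] targets;

      MiniTee(ContentHandler... targets) {
         this.targets = targets;
      }

      @Override
      public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
         for (ContentHandler target : targets) {
            target.startElement(uri, localName, qName, atts);
         }
      }

      @Override
      public void characters(char[] ch, int start, int length) throws SAXException {
         for (ContentHandler target : targets) {
            target.characters(ch, start, length);
         }
      }
   }

   // Parses the XHTML once and returns the collected text followed by the collected links.
   static String run(String xhtml) {
      TextCollector text = new TextCollector();
      LinkCollector links = new LinkCollector();
      try {
         SAXParserFactory.newInstance().newSAXParser()
               .parse(new InputSource(new StringReader(xhtml)), new MiniTee(text, links));
      } catch (Exception e) {
         throw new IllegalStateException("could not parse sample XHTML", e);
      }
      return text.text + " " + links.links;
   }

   public static void main(String[] args) {
      System.out.println(run("<body><p>See <a href=\"https://tika.apache.org/\">Tika</a></p></body>"));
      // prints: See Tika [https://tika.apache.org/]
   }
}
```

The document is read only once, yet both handlers receive every event, which is the point of a tee-style handler.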

In further posts, you will learn more about extracting content and metadata from a document.

TIKA Supported Document Formats


TIKA supports the document formats listed below, together with the parser and the MIME types for each format. Where a parser handles several MIME types, each additional MIME type is listed on its own line.

Format Parser MIME Type
HyperText Markup Language HtmlParser text/html
application/vnd.wap.xhtml+xml
application/x-asp
application/xhtml+xml
XML and derived formats DcXMLParser
Microsoft Office document formats OfficeParser
OOXMLParser application/vnd.ms-powerpoint.template.macroenabled.12
application/vnd.ms-excel.addin.macroenabled.12
application/vnd.openxmlformats-officedocument.wordprocessingml.template
application/vnd.ms-excel.sheet.binary.macroenabled.12
application/vnd.openxmlformats-officedocument.wordprocessingml.document
application/vnd.ms-powerpoint.slide.macroenabled.12
application/vnd.ms-visio.drawing
application/vnd.ms-powerpoint.slideshow.macroenabled.12
application/vnd.ms-powerpoint.presentation.macroenabled.12
application/vnd.openxmlformats-officedocument.presentationml.slide
application/vnd.ms-excel.sheet.macroenabled.12
application/vnd.ms-word.template.macroenabled.12
application/vnd.ms-word.document.macroenabled.12
application/vnd.ms-powerpoint.addin.macroenabled.12
application/vnd.openxmlformats-officedocument.spreadsheetml.template
application/vnd.ms-xpsdocument
application/vnd.ms-visio.drawing.macroenabled.12
application/vnd.ms-visio.template.macroenabled.12
model/vnd.dwfx+xps
application/vnd.openxmlformats-officedocument.presentationml.template
application/vnd.openxmlformats-officedocument.presentationml.presentation
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
application/vnd.ms-visio.stencil
application/vnd.ms-visio.template
application/vnd.openxmlformats-officedocument.presentationml.slideshow
application/vnd.ms-visio.stencil.macroenabled.12
application/vnd.ms-excel.template.macroenabled.12
OldExcelParser application/vnd.ms-excel.workspace.3
application/vnd.ms-excel.workspace.4
application/vnd.ms-excel.sheet.2
application/vnd.ms-excel.sheet.3
application/vnd.ms-excel.sheet.4
SpreadsheetMLParser
WordMLParser application/vnd.ms-wordml
Word2006MLParser application/vnd.ms-word2006ml
MSOwnerFileParser application/x-ms-owner
OpenDocument Format OpenDocumentParser application/x-vnd.oasis.opendocument.presentation
application/vnd.oasis.opendocument.chart
application/x-vnd.oasis.opendocument.text-web
application/x-vnd.oasis.opendocument.image
application/vnd.oasis.opendocument.graphics-template
application/vnd.oasis.opendocument.text-web
application/x-vnd.oasis.opendocument.spreadsheet-template
application/vnd.oasis.opendocument.spreadsheet-template
application/vnd.sun.xml.writer
application/x-vnd.oasis.opendocument.graphics-template
application/vnd.oasis.opendocument.graphics
application/vnd.oasis.opendocument.spreadsheet
application/x-vnd.oasis.opendocument.chart
application/x-vnd.oasis.opendocument.spreadsheet
application/vnd.oasis.opendocument.image
application/x-vnd.oasis.opendocument.text
application/x-vnd.oasis.opendocument.text-template
application/vnd.oasis.opendocument.formula-template
application/x-vnd.oasis.opendocument.formula
application/vnd.oasis.opendocument.image-template
application/x-vnd.oasis.opendocument.image-template
application/x-vnd.oasis.opendocument.presentation-template
application/vnd.oasis.opendocument.presentation-template
application/vnd.oasis.opendocument.text
application/vnd.oasis.opendocument.text-template
application/vnd.oasis.opendocument.chart-template
application/x-vnd.oasis.opendocument.chart-template
application/x-vnd.oasis.opendocument.formula-template
application/x-vnd.oasis.opendocument.text-master
application/vnd.oasis.opendocument.presentation
application/x-vnd.oasis.opendocument.graphics
application/vnd.oasis.opendocument.formula
application/vnd.oasis.opendocument.text-master
iWorks document formats IWorkPackageParser application/vnd.apple.keynote
application/vnd.apple.iwork
application/vnd.apple.numbers
application/vnd.apple.pages
WordPerfect document formats WordPerfectParser application/vnd.wordperfect; version=5.1
application/vnd.wordperfect; version=5.0
application/vnd.wordperfect; version=6.x
org.apache.tika.parser.xml.DcXMLParser (XML and derived formats) application/xml
image/svg+xml
QuattroProParser application/x-quattro-pro; version=9
Portable Document Format PDFParser application/pdf
Electronic Publication Format EpubParser application/x-ibooks+zip
application/epub+zip
FictionBookParser application/x-fictionbook+xml
org.gagravarr.tika.FlacParser (Audio formats) audio/x-oggflac
audio/x-flac
Rich Text Format RTFParser application/rtf
Compression and packaging formats CompressorParser application/zlib
application/x-gzip
application/x-bzip2
application/x-compress
application/x-java-pack200
application/x-lzma
application/deflate64
application/x-lz4
application/x-snappy
application/x-brotli
application/gzip
application/x-bzip
application/x-xz
PackageParser application/x-tar
application/java-archive
application/x-arj
application/x-archive
application/zip
application/x-cpio
application/x-tika-unix-dump
application/x-7z-compressed
RarParser application/x-rar-compressed
AppleSingleFileParser application/applefile
Text formats TXTParser
Feed and Syndication formats FeedParser application/atom+xml
application/rss+xml
IptcAnpaParser text/vnd.iptc.anpa
Help formats ChmParser application/vnd.ms-htmlhelp
application/x-chm
application/chm
Audio formats AudioParser audio/vnd.wave
audio/x-wav
audio/basic
audio/x-aiff
MidiParser application/x-midi
audio/midi
Mp3Parser audio/mpeg
Mp4Parser video/x-m4v
application/mp4
video/3gpp
video/3gpp2
video/quicktime
audio/mp4
video/mp4
VorbisParser audio/vorbis
OpusParser audio/opus
audio/ogg; codecs=opus
SpeexParser audio/ogg; codecs=speex
audio/speex
FlacParser
Image formats ImageParser image/png
image/vnd.wap.wbmp
image/x-jbig2
image/bmp
image/x-xcf
image/gif
image/x-icon
image/x-ms-bmp
JpegParser image/jpeg
TiffParser image/tiff
PSDParser image/vnd.adobe.photoshop
BPGParser image/bpg
image/x-bpg
WebPParser image/webp
ICNSParser image/icns
TesseractOCRParser
WMFParser image/wmf
EMFParser image/emf
Video formats FLVParser video/x-flv
Mp4Parser video/x-m4v
application/mp4
video/3gpp
video/3gpp2
video/quicktime
audio/mp4
video/mp4
OggParser audio/ogg
application/kate
application/ogg
video/daala
video/x-ogguvs
video/x-ogm
audio/x-oggpcm
video/ogg
video/x-dirac
video/x-oggrgb
video/x-oggyuv
TheoraParser video/theora
PooledTimeSeriesParser
Java class files and archives ClassParser application/java-vm
Source code SourceCodeParser text/x-c++src
text/x-groovy
text/x-java-source
Mail formats MboxParser application/mbox
RFC822Parser message/rfc822
OutlookPSTParser application/vnd.ms-outlook-pst
OfficeParser application/x-tika-msoffice-embedded; format=ole10_native
application/msword
application/vnd.visio
application/vnd.ms-project
application/x-tika-msworks-spreadsheet
application/x-mspublisher
application/vnd.ms-powerpoint
application/x-tika-msoffice
application/sldworks
application/x-tika-ooxml-protected
application/vnd.ms-excel
application/vnd.ms-outlook
TNEFParser application/vnd.ms-tnef
application/x-tnef
application/ms-tnef
CAD formats DWGParser image/vnd.dwg
Font formats TrueTypeParser application/x-font-ttf
AdobeFontMetricParser application/x-font-adobe-metric
Scientific formats DIFParser application/dif+xml
GDALParser application/x-gsc
image/x-ozi
application/x-pds
image/eir
application/x-usgs-dem
application/aaigrid
application/x-bag
application/elas
application/x-rs2
application/x-tsx
application/x-lcp
image/geotiff
application/x-mbtiles
application/x-cappi
application/x-netcdf
application/x-gsag
application/x-epsilon
application/x-ace2
application/jaxa-pal-sar
image/x-pcraster
application/x-msgn
image/arg
application/x-hdf
image/x-mff
application/x-kro
image/x-hdf5-image
image/x-dimap
image/x-srp
image/big-gif
application/x-envi
application/x-cosar
application/x-ntv2
image/bmp
application/x-doq2
application/x-bt
application/x-kml
application/x-gmt
application/x-rst
application/vrt
application/pcisdk
application/x-ctg
application/x-e00-grid
application/x-rik
image/ida
image/x-mff2
application/sdts-raster
application/x-snodas
image/jp2
image/sar-ceos
application/terragen
application/x-wcs
application/leveller
application/x-ingr
application/x-gtx
image/sgi
application/x-pnm
image/raster
application/fits
application/x-r
image/gif
application/x-envi-hdr
application/x-http
application/x-rmf
application/x-ecrg-toc
application/aig
application/x-rpf-toc
image/adrg
application/x-srtmhgt
application/x-generic-bin
application/jdem
image/x-airsar
application/x-webp
application/x-ngs-geoid
application/x-pcidsk
image/x-fujibas
application/x-wms
application/x-map
image/ceos
application/xpm
application/x-zmap
image/envisat
application/x-ers
application/x-doq1
application/x-isis2
application/x-nwt-grd
application/x-ppi
image/ilwis
application/x-isis3
application/x-nwt-grc
application/x-blx
application/gff
application/x-ndf
image/jpeg
application/x-geo-pdf
application/x-l1b
image/fit
application/x-gsbg
application/x-sdat
application/x-ctable2
application/x-grib
application/x-coasp
application/x-dipex
application/grass-ascii-grid
image/fits
application/x-til
application/x-dods
image/png
application/x-gxf
application/x-gs7bg
application/x-cpg
application/x-lan
application/x-xyz
image/bsb
application/x-p-aux
application/dted
application/x-rasterlite
image/nitf
image/hfa
application/x-fast
application/x-los-las
GeographicInformationParser text/iso19139+xml
GeoParser application/geotopic
GribParser application/x-grib2
HDFParser application/x-hdf
ISArchiveParser application/x-isatab
NetCDFParser application/x-netcdf
MatParser application/x-matlab-data
Executable programs and libraries ExecutableParser application/x-msdownload
application/x-sharedlib
application/x-elf
application/x-object
application/x-executable
application/x-coredump
Crypto formats Pkcs7Parser application/pkcs7-signature
application/pkcs7-mime
TSDParser
Database formats SQLite3Parser
JackcessParser application/x-msaccess
DBFParser application/x-dbf
Natural Language Processing SentimentParser
JournalParser
Image and Video object recognition Tika recognition package

References

https://tika.apache.org/1.22/formats.html