Apache Tika Introduction

Apache Tika provides a generic API for content type detection, analysis, and content extraction across many file formats. Internally, Tika uses various document parsers to extract metadata and structured text content from file types such as PDFs, spreadsheets, text files, and images.

At the time of writing, the latest version of Tika is 1.22, released on 1st Aug 2019 by the Apache Software Foundation. Tika is written entirely in Java and is cross-platform.

Tika Version History

Year Development
2006 The idea of Tika was proposed to the Lucene Project Management Committee.
2006 The concept of Tika and its benefits for the Jackrabbit project were discussed.
2007 Tika entered the Apache Incubator.
2008 Versions 0.1 and 0.2 were released, and Tika graduated from the Incubator to become a Lucene subproject.
2009 Versions 0.3, 0.4, and 0.5 were released.
2010 Versions 0.6 and 0.7 were released, and Tika graduated to a top-level Apache project.
2011 Tika 1.0 was released; the book “Tika in Action” was released in the same year.
2019 Tika 1.22 was released with additional support for CSV and HWP file types.

Why Tika?

According to https://filext.com/, there are roughly 25,000 to 50,000 file extensions (structured and unstructured), and the number keeps growing. To deal with so many formats, Tika provides a universal Java API that supports around 1,400 file types, covering the most common and popular formats.

Tika provides content extraction, metadata extraction, and language identification capabilities. Although Tika is written in Java, it can also be used from other languages through its RESTful services and CLI tools.

Where to use Apache Tika?

  • Search Engine: Search engines use Tika to extract the text of digital documents and build search indexes from it.
  • Document Analysis: Analyse documents such as images and PDFs based on their extracted content.
  • Digital Asset Management (DAM): Used in organizations that maintain a library of documents, images, videos, ebooks, and drawings, to classify them based on common features.
  • Content Analysis: Analyse website content together with users’ interests, the way Amazon suggests movies and products based on a user’s visits. The extracted content can also feed machine learning.

Features of Tika

  • Unified parser interface: Tika wraps the most suitable parser libraries behind a single parser interface. This relieves developers of the burden of selecting a suitable parser library and using it according to the file type encountered.
  • Low memory usage: Tika consumes few memory resources, so it is easily embedded in Java applications. It can also be used within applications that run on platforms with limited resources, such as mobile PDAs.
  • Fast processing: Applications can expect quick content detection and extraction.
  • Flexible metadata: Tika understands the metadata models that are used to describe files.
  • Parser integration: Tika can use the various parser libraries available for each document type within the same application.
  • MIME-type detection: Tika can detect the MIME type of a document and extract content from the MIME types it supports.
  • Language detection: Tika includes a language identification feature, so it can be used to classify documents by language on multilingual websites.
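
To make the "unified parser interface" and "MIME-type detection" features more concrete, here is a minimal, self-contained sketch of the general idea: detect the type from the leading bytes of the content, then hand the bytes to whichever parser is registered for that type. It is not Tika's code. The names UnifiedParserSketch, DocumentParser, detectMimeType and extractText are hypothetical, the signature list is deliberately tiny, and the PDF "parser" is only a placeholder.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class UnifiedParserSketch {

    /** Hypothetical single interface that every format-specific parser implements. */
    interface DocumentParser {
        String extractText(byte[] content);
    }

    // Hypothetical registry mapping a MIME type to the parser that handles it.
    private static final Map<String, DocumentParser> PARSERS = new LinkedHashMap<>();

    static {
        PARSERS.put("text/plain", content -> new String(content, StandardCharsets.UTF_8));
        // Placeholder only: a real implementation would delegate to a PDF library.
        PARSERS.put("application/pdf", content -> "[text extracted by a PDF library]");
    }

    /** Returns true if data begins with the given signature bytes. */
    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    /** Detects a MIME type from well-known leading "magic" bytes. */
    public static String detectMimeType(byte[] content) {
        if (startsWith(content, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (startsWith(content, new byte[] {(byte) 0x89, 'P', 'N', 'G'})) {
            return "image/png";
        }
        if (startsWith(content, new byte[] {'P', 'K', 3, 4})) {
            return "application/zip";
        }
        // Crude fallback (an assumption of this sketch): a NUL byte means binary data.
        for (byte b : content) {
            if (b == 0) {
                return "application/octet-stream";
            }
        }
        return "text/plain";
    }

    /** Single entry point: the caller never picks a parser by hand. */
    public static String extractText(byte[] content) {
        String mimeType = detectMimeType(content);
        DocumentParser parser = PARSERS.get(mimeType);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + mimeType);
        }
        return parser.extractText(content);
    }

    public static void main(String[] args) {
        byte[] note = "Hello from a text file".getBytes(StandardCharsets.UTF_8);
        byte[] pdf = "%PDF-1.7 sample".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detectMimeType(note) + " -> " + extractText(note));
        System.out.println(detectMimeType(pdf) + " -> " + extractText(pdf));
    }
}
```

The calling code only ever sees extractText, which is the point of a unified interface: supporting a new format means registering one more parser, not changing the callers. With Tika itself on the classpath, the facade class org.apache.tika.Tika offers comparable entry points such as detect(...) and parseToString(...).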