Application programmers can integrate Tika into their applications through the Tika APIs. Tika also provides a GUI and a command-line interface for interactive use.
The Tika architecture is divided into four main modules:
- Language detection mechanism.
- MIME detection mechanism.
- Parser interface.
- Tika Facade class.
Language Detection Mechanism
Tika uses the LanguageIdentifier class to identify the language in which the text of a file is written. Tika detects the language in the file content and in additional information in the metadata.
Internally, Tika uses an N-gram algorithm for language detection, which identifies the language of a text based on a language identification repository.
Package : org.apache.tika.language
Class : LanguageIdentifier
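To make the N-gram idea concrete, the following is a minimal sketch that uses only the Java standard library. It is not Tika's implementation and does not call LanguageIdentifier. The class name NGramLanguageSketch, the three one-sentence training samples, and the cosine-similarity scoring are all assumptions chosen for illustration; a real language identification repository would hold far larger profiles.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Toy character-trigram language guesser (illustrative only, not Tika's code). */
final class NGramLanguageSketch {

    // Hypothetical, tiny samples standing in for real per-language profiles.
    private static final Map<String, Map<String, Integer>> PROFILES = new LinkedHashMap<>();
    static {
        PROFILES.put("en", trigrams("the quick brown fox jumps over the lazy dog and then runs to the river"));
        PROFILES.put("de", trigrams("der schnelle braune fuchs springt auf den faulen hund und rennt dann zum fluss"));
        PROFILES.put("fr", trigrams("le renard brun rapide saute sur le chien paresseux et court ensuite vers la grande maison"));
    }

    /** Counts overlapping three-character sequences; whitespace runs become '_'. */
    static Map<String, Integer> trigrams(String text) {
        String s = "_" + text.toLowerCase(Locale.ROOT).replaceAll("\\s+", "_") + "_";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two trigram count vectors. */
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the code of the best-matching profile, or "unknown" if no trigram overlaps. */
    static String detect(String text) {
        Map<String, Integer> input = trigrams(text);
        String best = "unknown";
        double bestScore = 0;
        for (Map.Entry<String, Map<String, Integer>> profile : PROFILES.entrySet()) {
            double score = similarity(input, profile.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = profile.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(detect("the dog runs over the river"));
        System.out.println(detect("der hund rennt zum fluss"));
    }
}
```

The sketch compares the trigram counts of the input against each stored profile and returns the closest one. With profiles this small, the result is only reliable for text that resembles the samples.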
MIME Detection Mechanism
Internally, Tika combines several techniques to detect the MIME type of a file, including content type hints, file name globs, magic bytes, and character encodings.
Tika's default MIME type detection is done by the MimeTypes class, which implements the Detector interface that Tika uses for most content detection.
Class : org.apache.tika.mime.MimeTypes
Interface : org.apache.tika.detect.Detector
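The sketch below shows how two of these techniques, magic bytes and file name globs, can be combined. It uses only the Java standard library and is not how MimeTypes is implemented. The class name MimeDetectionSketch, the short signature table, and the reduction of globs to plain file extensions are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Toy MIME detector: magic bytes first, file name extension second (not Tika's code). */
final class MimeDetectionSketch {

    // A few well-known leading byte signatures.
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    // A few file name patterns, simplified here to plain extensions.
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();
    static {
        MAGIC.put("application/pdf", new byte[] {'%', 'P', 'D', 'F', '-'});
        MAGIC.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
        MAGIC.put("application/zip", new byte[] {'P', 'K', 3, 4});
        GLOBS.put(".txt", "text/plain");
        GLOBS.put(".html", "text/html");
        GLOBS.put(".pdf", "application/pdf");
    }

    /** True if data begins with exactly the bytes in prefix. */
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Checks the leading bytes, then the file name, then falls back to a generic type. */
    static String detect(byte[] head, String fileName) {
        for (Map.Entry<String, byte[]> e : MAGIC.entrySet()) {
            if (startsWith(head, e.getValue())) {
                return e.getKey();
            }
        }
        String name = fileName == null ? "" : fileName.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> e : GLOBS.entrySet()) {
            if (name.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return "application/octet-stream";
    }
}
```

Checking the content before the name means that a PDF saved with a misleading extension is still reported as application/pdf, while a file with no recognizable signature falls back to its extension.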
Tika Facade Class
The Tika class is the simplest facade for using Tika from Java. It acts as a broker and provides a simple API for MIME detection, content extraction, and language detection.
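The broker role can be pictured with a small, generic facade sketch in plain Java. This is not the Tika class and does not reproduce its API. The class FacadeSketch, its method names, and the deliberately naive lambdas in withToyDefaults are hypothetical stand-ins for the real detection, parsing, and language subsystems.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

/** Toy facade hiding three subsystems behind one class (illustrative only). */
final class FacadeSketch {
    private final Function<byte[], String> typeDetector;
    private final Function<byte[], String> textExtractor;
    private final Function<String, String> languageGuesser;

    FacadeSketch(Function<byte[], String> typeDetector,
                 Function<byte[], String> textExtractor,
                 Function<String, String> languageGuesser) {
        this.typeDetector = typeDetector;
        this.textExtractor = textExtractor;
        this.languageGuesser = languageGuesser;
    }

    String detectType(byte[] data) {
        return typeDetector.apply(data);
    }

    String extractText(byte[] data) {
        return textExtractor.apply(data);
    }

    /** Language detection runs on the extracted text, so the caller never touches the parser. */
    String detectLanguage(byte[] data) {
        return languageGuesser.apply(extractText(data));
    }

    /** Hypothetical, deliberately naive subsystems so the sketch runs on its own. */
    static FacadeSketch withToyDefaults() {
        return new FacadeSketch(
            data -> data.length > 0 && data[0] == '<' ? "text/html" : "text/plain",
            data -> new String(data, StandardCharsets.UTF_8).replaceAll("<[^>]*>", "").trim(),
            text -> text.contains(" the ") ? "en" : "unknown");
    }
}
```

A caller holds a single object and asks it for the type, the text, or the language of a document. Which detector or parser does the work stays an internal detail of the facade.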
Parser Interface
In Tika, the Parser interface is used to extract text and metadata from documents. Internally, it delegates to concrete parser classes, each specific to a document type, because Tika supports a large number of document formats.
Interface : org.apache.tika.parser.Parser
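The idea of one interface backed by format-specific parsers can be sketched in plain Java as follows. The interface SimpleParser, the two toy parsers, and the lookup table are hypothetical and much simpler than Tika's own types: the real Parser.parse method works on an InputStream and reports content through a SAX ContentHandler, together with Metadata and ParseContext objects, rather than returning a String.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy dispatch from a MIME type to a format-specific parser (not Tika's code). */
final class ParserDispatchSketch {

    /** Hypothetical, simplified stand-in for a parser interface. */
    interface SimpleParser {
        String parse(byte[] data, Map<String, String> metadata);
    }

    /** Returns the bytes as text unchanged. */
    static final class PlainTextParser implements SimpleParser {
        @Override
        public String parse(byte[] data, Map<String, String> metadata) {
            metadata.put("Content-Type", "text/plain");
            return new String(data, StandardCharsets.UTF_8);
        }
    }

    /** Strips tags and records the title element, if any, as metadata. */
    static final class HtmlParser implements SimpleParser {
        private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");

        @Override
        public String parse(byte[] data, Map<String, String> metadata) {
            String html = new String(data, StandardCharsets.UTF_8);
            metadata.put("Content-Type", "text/html");
            Matcher m = TITLE.matcher(html);
            if (m.find()) {
                metadata.put("title", m.group(1));
            }
            return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
        }
    }

    private static final Map<String, SimpleParser> PARSERS = new HashMap<>();
    static {
        PARSERS.put("text/plain", new PlainTextParser());
        PARSERS.put("text/html", new HtmlParser());
    }

    /** Picks the parser registered for the MIME type and delegates to it. */
    static String parse(String mimeType, byte[] data, Map<String, String> metadata) {
        SimpleParser parser = PARSERS.get(mimeType);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + mimeType);
        }
        return parser.parse(data, metadata);
    }
}
```

Callers depend only on the interface, so supporting another format means registering one more implementation rather than changing the calling code.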