Apache Tika provides the following features:
- Document type detection
- Content extraction
- Metadata extraction
- Language detection
Document Type Detection
Tika applies several detection techniques to determine the media type of the document it is given.
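One common detection technique is matching the first bytes of a file against known signatures ("magic bytes"). The sketch below illustrates that idea with the Java standard library only. It is not Tika's implementation: the class name `MagicByteDetector` and its small signature table are assumptions made for illustration. In Tika itself, detection is available through the `detect` methods of the `Tika` facade class.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a tiny magic-byte detector, not Tika's detector.
class MagicByteDetector {
    // Hypothetical signature table: media type -> leading bytes of the file.
    private static final Map<String, byte[]> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        SIGNATURES.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
        SIGNATURES.put("image/gif", "GIF8".getBytes(StandardCharsets.US_ASCII));
        SIGNATURES.put("application/zip", new byte[] {'P', 'K', 3, 4});
    }

    // Returns the first media type whose signature prefixes the header,
    // or the generic binary type when nothing matches.
    static String detect(byte[] header) {
        for (Map.Entry<String, byte[]> entry : SIGNATURES.entrySet()) {
            if (startsWith(header, entry.getValue())) {
                return entry.getKey();
            }
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] header = "%PDF-1.7".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(header));
    }
}
```

A signature table alone cannot separate formats that share a container (many office formats are ZIP archives, for example), which is one reason a detector usually combines several techniques.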
Content Extraction
Tika has a library of parsers that can parse the content of various document formats and extract it. After detecting the media type of a file, Tika selects the appropriate parser from the parser repository and passes the document to it. Different Tika classes provide methods to parse different document formats.
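The "detect, then pick a parser" flow can be sketched as a lookup table keyed by media type. Everything in this example is hypothetical and uses only the Java standard library: `ParserRegistry`, its toy parsers, and the fallback behaviour are assumptions for illustration, not Tika classes. In Tika, `AutoDetectParser` fills this role by detecting the type and delegating to the matching parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative only: a parser repository keyed by media type.
class ParserRegistry {
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();

    // Assumed fallback: unknown types yield no text.
    private final Function<byte[], String> fallback = bytes -> "";

    void register(String mediaType, Function<byte[], String> parser) {
        parsers.put(mediaType, parser);
    }

    // Selects the parser registered for the media type and hands it the document.
    String parse(String mediaType, byte[] document) {
        return parsers.getOrDefault(mediaType, fallback).apply(document);
    }

    // Registers two toy parsers: plain text is decoded as-is,
    // HTML has its tags stripped with a naive regular expression.
    static ParserRegistry withToyParsers() {
        ParserRegistry registry = new ParserRegistry();
        registry.register("text/plain",
                bytes -> new String(bytes, StandardCharsets.UTF_8));
        registry.register("text/html",
                bytes -> new String(bytes, StandardCharsets.UTF_8)
                        .replaceAll("<[^>]*>", "")
                        .trim());
        return registry;
    }

    public static void main(String[] args) {
        ParserRegistry registry = withToyParsers();
        byte[] html = "<p>Hi <b>there</b></p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(registry.parse("text/html", html));
    }
}
```

Keeping the lookup separate from the parsers means a new format can be supported by registering one more entry, without touching the calling code.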
Metadata Extraction
Along with the content, Tika can also provide a document's metadata. It extracts the metadata using the same procedure as in content extraction. For some document types, Tika has dedicated classes to extract metadata.
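To show how content and metadata can come out of the same parse call, here is a minimal sketch for an invented format: "Key: Value" header lines, a blank line, then the body. The format, the class `HeaderBodyParser`, and its method are assumptions for illustration only. The shape mirrors Tika's `Parser.parse` method, which receives both a content handler and a `Metadata` object and fills them during one call.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: parses a made-up "headers, blank line, body" format.
class HeaderBodyParser {
    // Fills the supplied metadata map from the header lines and
    // returns the body text, both from a single parse call.
    static String parse(String document, Map<String, String> metadata) {
        int split = document.indexOf("\n\n");
        String header = split >= 0 ? document.substring(0, split) : document;
        String body = split >= 0 ? document.substring(split + 2) : "";

        for (String line : header.split("\n")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                String key = line.substring(0, colon).trim();
                String value = line.substring(colon + 1).trim();
                metadata.put(key, value);
            }
        }
        return body;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        String content = parse("Title: Tika notes\nAuthor: Jane\n\nHello world", metadata);
        System.out.println(metadata);
        System.out.println(content);
    }
}
```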
Language Detection
Internally, Tika uses algorithms such as n-gram analysis to detect the language of the content in a given document. It uses the LanguageIdentifier and Profiler classes for language identification.
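The n-gram idea can be sketched in a few lines: build a set of character trigrams per language from sample text, then pick the language whose set covers the largest share of the input's trigrams. This is a simplified stand-in, not Tika's algorithm. The class `TrigramLanguageGuesser`, the one-sentence training samples, and the coverage score are all assumptions for illustration; real profiles are built from far larger corpora and typically weight n-grams by frequency.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative only: language guessing by character-trigram coverage.
class TrigramLanguageGuesser {
    private final Map<String, Set<String>> profiles = new HashMap<>();

    // Collects every 3-character window of the lowercased text.
    static Set<String> trigrams(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }

    void train(String language, String sample) {
        profiles.put(language, trigrams(sample));
    }

    // Returns the language whose profile contains the largest fraction of
    // the input's trigrams, or "unknown" if no trigram matches any profile.
    String guess(String text) {
        Set<String> grams = trigrams(text);
        String best = "unknown";
        double bestScore = 0.0;
        for (Map.Entry<String, Set<String>> profile : profiles.entrySet()) {
            int hits = 0;
            for (String gram : grams) {
                if (profile.getValue().contains(gram)) {
                    hits++;
                }
            }
            double score = grams.isEmpty() ? 0.0 : (double) hits / grams.size();
            if (score > bestScore) {
                bestScore = score;
                best = profile.getKey();
            }
        }
        return best;
    }

    // Toy profiles trained on a single sentence each (an assumption).
    static TrigramLanguageGuesser withToyProfiles() {
        TrigramLanguageGuesser guesser = new TrigramLanguageGuesser();
        guesser.train("en", "the quick brown fox jumps over the lazy dog");
        guesser.train("de", "der schnelle braune fuchs springt mit dem faulen hund");
        return guesser;
    }

    public static void main(String[] args) {
        System.out.println(withToyProfiles().guess("the quick brown fox"));
    }
}
```

Short inputs give few trigrams, so a guess made from a handful of words is much less reliable than one made from a full paragraph.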