Apache Tika provides the following functions:
- Document type detection
- Content extraction
- Metadata extraction
- Language detection
Document Type Detection
Tika applies several detection techniques, such as checking file name patterns and the leading "magic" bytes of the content, to determine the media type of the document passed to it.
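To make the idea of media type detection concrete, here is a minimal sketch in plain Java. It is not Tika's implementation: the class name MediaTypeSketch, the three-entry signature table, and the extension table are all hypothetical and exist only for illustration. The sketch checks the leading bytes first and falls back to the file name extension.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Simplified illustration of media type detection; NOT Tika's actual code.
class MediaTypeSketch {
    // Hypothetical signature table: leading "magic" bytes -> media type.
    // The map is only iterated, so using byte[] keys is safe here.
    private static final Map<byte[], String> MAGIC = new LinkedHashMap<>();
    static {
        MAGIC.put("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf");
        MAGIC.put(new byte[] {(byte) 0x89, 'P', 'N', 'G'}, "image/png");
        MAGIC.put(new byte[] {'P', 'K', 3, 4}, "application/zip");
    }

    // Hypothetical fallback table: file name extension -> media type.
    private static final Map<String, String> EXTENSIONS = Map.of(
        "txt", "text/plain",
        "html", "text/html",
        "pdf", "application/pdf");

    // Try the magic bytes first, then the extension, then a generic default.
    static String detect(byte[] head, String fileName) {
        for (Map.Entry<byte[], String> e : MAGIC.entrySet()) {
            if (startsWith(head, e.getKey())) {
                return e.getValue();
            }
        }
        int dot = (fileName == null) ? -1 : fileName.lastIndexOf('.');
        if (dot >= 0) {
            String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
            String type = EXTENSIONS.get(ext);
            if (type != null) {
                return type;
            }
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.7".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdfHead, "report.bin"));
        System.out.println(detect(new byte[0], "notes.txt"));
    }
}
```

Checking the content before the name means a PDF saved with the wrong extension is still reported as application/pdf, which is the main reason to combine the two techniques.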
Content Extraction
Tika relies on a range of parser libraries that can parse the content of different document formats and extract it. After detecting the media type of a file, Tika selects the appropriate parser from its parser repository and passes the document to it. Different Tika classes provide the methods for parsing different document formats.
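The "select a parser by media type" step described above can be sketched as a simple lookup table. This is an illustration under stated assumptions, not Tika's parser framework: the class ParserRegistrySketch, its two toy parsers, and the extract method are hypothetical, and the media type is assumed to have been detected already.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Simplified illustration of parser selection; NOT Tika's actual code.
class ParserRegistrySketch {
    // Hypothetical "parser repository": media type -> text extraction function.
    private static final Map<String, Function<byte[], String>> PARSERS = new HashMap<>();
    static {
        // Plain text: decode the bytes as UTF-8.
        PARSERS.put("text/plain",
            bytes -> new String(bytes, StandardCharsets.UTF_8));
        // HTML: naive tag stripping, for illustration only.
        PARSERS.put("text/html",
            bytes -> new String(bytes, StandardCharsets.UTF_8)
                .replaceAll("<[^>]*>", " ")
                .replaceAll("\\s+", " ")
                .trim());
    }

    // Look up the parser registered for the media type and hand it the document.
    static String extract(String mediaType, byte[] document) {
        Function<byte[], String> parser = PARSERS.get(mediaType);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + mediaType);
        }
        return parser.apply(document);
    }

    public static void main(String[] args) {
        byte[] html = "<p>Hello <b>Tika</b></p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(extract("text/html", html));
    }
}
```

Keeping the parsers behind one lookup keyed by media type is what lets the caller stay format-agnostic: supporting a new format only requires registering one more entry.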
Metadata Extraction
Tika can also provide a document's metadata along with its content. It extracts the metadata using the same procedure as in content extraction. For some document types, Tika has dedicated classes for extracting metadata.
Language Detection
Internally, Tika uses algorithms such as n-gram analysis to detect the language of the content in a given document. It uses the LanguageIdentifier and Profiler classes for language identification.
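The n-gram approach can be illustrated with a small, self-contained sketch. It builds character trigram counts for the input and for one short sample text per language, then picks the language whose sample is most similar. Everything here is an assumption made for illustration: the class NGramLanguageSketch, the two tiny sample sentences, and the use of cosine similarity as the comparison measure. Tika's own language profiles and distance measure may differ.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Simplified illustration of n-gram language detection; NOT Tika's actual code.
class NGramLanguageSketch {
    // Hypothetical, very small reference texts; real profiles use far more data.
    static final Map<String, String> SAMPLES = new LinkedHashMap<>();
    static {
        SAMPLES.put("en", "the quick brown fox jumps over the lazy dog "
            + "and the cat sat on the mat with the hat");
        SAMPLES.put("de", "der schnelle braune fuchs springt und der faule hund "
            + "und die katze sitzen auf der matte");
    }

    // Count character trigrams after lower-casing and collapsing non-letters to spaces.
    static Map<String, Integer> trigrams(String text) {
        String s = " " + text.toLowerCase(Locale.ROOT)
            .replaceAll("[^\\p{L}]+", " ").trim() + " ";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two trigram count vectors (0 when either is empty).
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Return the key of the most similar sample, or "unknown" if nothing overlaps.
    static String detect(String text, Map<String, String> samples) {
        Map<String, Integer> target = trigrams(text);
        String best = "unknown";
        double bestScore = 0;
        for (Map.Entry<String, String> e : samples.entrySet()) {
            double score = cosine(target, trigrams(e.getValue()));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(detect("the dog and the cat", SAMPLES));
        System.out.println(detect("der hund und die katze", SAMPLES));
    }
}
```

With reference texts this short the result is only reliable for inputs that share common words with the samples; n-gram detectors in practice depend on large per-language profiles and work poorly on very short inputs.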