Language detection is required when documents need to be classified by language. Tika provides a separate class, LanguageIdentifier, to detect the language of a text.
The LanguageIdentifier class builds on the following approaches to detect the language:
Profiling Corpus Algorithm
This approach creates a profile for each language from the common words found in that language's dictionary. For English, for example, common words include a, an, and the. The language whose profile matches the most words in the text is chosen.
The following terms are used here:
Corpus: a collection of the most commonly used terms of a written language.
Profile: a dictionary of the words of each language.
Drawback: If two languages share similar characters and words, it is difficult to decide between them based on word frequency alone.
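The common-word matching described above can be sketched in a few lines of plain Java. This is only an illustration and not Tika's implementation: the class name CommonWordDetector, the detect() method, and the tiny hand-picked word lists are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; this is not how Tika is implemented.
final class CommonWordDetector {

    // Hypothetical, hand-picked profiles. A real corpus would be far larger.
    private static final Map<String, Set<String>> PROFILES = new LinkedHashMap<>();
    static {
        PROFILES.put("en", Set.of("a", "an", "the", "is", "and", "of", "to"));
        PROFILES.put("de", Set.of("der", "die", "das", "ist", "und", "ein", "nicht"));
        PROFILES.put("fr", Set.of("le", "la", "les", "est", "et", "un", "une"));
    }

    // Returns the code of the profile with the most matching words,
    // or "unknown" when no word of the text appears in any profile.
    static String detect(String text) {
        // Lowercase the text and split it on every run of non-letter characters.
        String[] words = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> profile : PROFILES.entrySet()) {
            int hits = 0;
            for (String word : words) {
                if (profile.getValue().contains(word)) {
                    hits++;
                }
            }
            // On a tie, the profile that was registered first wins.
            if (hits > bestHits) {
                bestHits = hits;
                best = profile.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(detect("The cat is on the roof of a house"));
        System.out.println(detect("Der Hund ist nicht alt und das Haus ist neu"));
        System.out.println(detect("Le chat est sur la table et les enfants"));
    }
}
```

The sketch also shows the drawback: a text that contains none of the listed words returns "unknown", and two languages that share many common words would produce close or tied hit counts.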
N-gram Algorithm
To address the drawback of the "Profiling Corpus Algorithm", a newer approach profiles the corpus using character sequences of a given length instead of whole words. Such a sequence of characters in the content is called an N-gram, where N is the length of the character sequence. For example, the 3-grams of the word "text" are "tex" and "ext".
The N-gram approach helps in detecting European languages such as English. Tika uses a 3-gram approach for language detection. The N-gram approach also copes well with short texts.
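To make the idea of a 3-gram profile concrete, here is a minimal sketch in plain Java. It is not Tika's code: the class TrigramProfile and its methods are hypothetical names, and cosine similarity is only an assumed way to compare two profiles, chosen here because it is easy to follow.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only; names and scoring are assumptions, not Tika's API.
final class TrigramProfile {

    // Counts every overlapping 3-character sequence in the text.
    // Runs of non-letters become a single space, so word boundaries
    // are part of the trigrams as well.
    static Map<String, Integer> trigrams(String text) {
        String normalized = text.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}]+", " ")
                .trim();
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            counts.merge(normalized.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity of two trigram count maps:
    // 1.0 for identical distributions, 0.0 when no trigram is shared.
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        // Tiny stand-in corpora; real profiles are built from much more text.
        Map<String, Integer> english =
                trigrams("the quick brown fox jumps over the lazy dog and then runs to the house");
        Map<String, Integer> german =
                trigrams("der schnelle braune fuchs springt ueber den faulen hund und rennt dann zum haus");
        Map<String, Integer> sample = trigrams("the dog is in the house");

        System.out.println("Similarity to English profile: " + similarity(sample, english));
        System.out.println("Similarity to German profile : " + similarity(sample, german));
    }
}
```

The language whose profile scores highest against the sample would be reported as the detected language. Because trigrams come from inside words, even a text with few common words still yields usable evidence.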
Tika Supported Languages
ISO 639-1 defines 184 standard languages, but Tika is able to detect only the following 18 languages:
da—Danish | de—German | et—Estonian |
el—Greek | en—English | es—Spanish |
fi—Finnish | fr—French | hu—Hungarian |
is—Icelandic | it—Italian | nl—Dutch |
no—Norwegian | pl—Polish | pt—Portuguese |
ru—Russian | sv—Swedish | th—Thai |
How to Detect Language with Tika?
The getLanguage() method of the LanguageIdentifier class returns the language code of the text content passed to the constructor.
//Create a LanguageIdentifier object based on the content.
LanguageIdentifier object = new LanguageIdentifier("English is so funny.");
//Get the language code of the passed content.
String lang = object.getLanguage();
Example: Detect Language from Text
This example shows the steps to get the language code of the passed content.
import org.apache.tika.language.LanguageIdentifier;
public class LanguageDetection {
    public static void main(String[] args) {
        //Create a LanguageIdentifier object based on the content.
        LanguageIdentifier object = new LanguageIdentifier("English is so funny.");
        //Get the language code of the passed content.
        String lang = object.getLanguage();
        System.out.println("Detected Language is : " + lang);
    }
}
Output
Detected Language is : en
Example: Detect Language from Document Contents
To detect the language of a document, we first need to parse the document with the parse() method. The parse() method stores the parsed content in the handler object. The content of this handler object is then passed as the argument of the LanguageIdentifier constructor to identify the language.
//Get metadata and extract content with the parser's parse() method.
parser.parse(inputstream, handler, metadata, context);
//Pass the extracted content to the LanguageIdentifier constructor.
LanguageIdentifier object = new LanguageIdentifier(handler.toString());
Complete Example
Here are the complete steps to extract the content of a document and detect its language.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class TikaDocumentLanguageDetection {
    public static void main(final String[] args) throws IOException, SAXException, TikaException {
        //Instantiate a file object.
        File file = new File("hello.txt");
        //Create the objects required as arguments of the parse() method.
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        //Open the file in try-with-resources so the stream is always closed.
        try (FileInputStream content = new FileInputStream(file)) {
            //Get metadata and extract content with the parser's parse() method.
            parser.parse(content, handler, metadata, new ParseContext());
        }
        //Pass the extracted content to the LanguageIdentifier constructor.
        LanguageIdentifier object = new LanguageIdentifier(handler.toString());
        System.out.println("File Content :" + handler.toString());
        System.out.println("Language Name :" + object.getLanguage());
    }
}
Output
File Content : English is so funny.
Language Name : en