TIKA Language Detection

Language detection is needed when documents must be classified by language. Tika provides a separate class, LanguageIdentifier, to detect the language of a piece of text.

The LanguageIdentifier class relies on the following algorithms to detect the language:

Profiling Corpus Algorithm

This approach creates a profile for each language from its most common words, taken from the dictionaries of different languages. For English, such common words include a, an, and the. The language whose profile best matches the words in the text is then chosen.

Two terms are used here:

Corpus: a collection of the most commonly used terms of a written language.
Profiling: building a dictionary of words for each language.

Drawback: If two languages share similar characters and words, it is difficult to tell them apart based on word frequency alone.
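
The common-word idea described above can be sketched in a few lines of plain Java. This is only an illustration and not how Tika implements it: the class name CommonWordSketch, the PROFILES map, and the tiny word lists are all hypothetical, and a real corpus would hold far more terms per language.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class CommonWordSketch {

    // Hypothetical, very small word profiles used only for illustration.
    private static final Map<String, Set<String>> PROFILES = new LinkedHashMap<>();
    static {
        PROFILES.put("en", new HashSet<>(Arrays.asList("a", "an", "the", "is", "and", "of")));
        PROFILES.put("de", new HashSet<>(Arrays.asList("der", "die", "das", "ist", "und", "ein")));
        PROFILES.put("fr", new HashSet<>(Arrays.asList("le", "la", "les", "est", "et", "un")));
    }

    // Returns the code of the profile with the most matching words,
    // or "unknown" when no word of the text appears in any profile.
    public static String detect(String text) {
        String[] words = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> profile : PROFILES.entrySet()) {
            int hits = 0;
            for (String word : words) {
                if (profile.getValue().contains(word)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = profile.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(detect("The cat is on the mat"));  // prints "en"
        System.out.println(detect("Der Hund ist ein Tier"));  // prints "de"
    }
}
```

The drawback shows up quickly with such a sketch: a text made of words that two profiles share produces the same hit count for both, so the result depends only on which profile is checked first.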

N-gram Algorithm

To address this drawback of the "Profiling Corpus Algorithm", a newer approach profiles the corpus using character sequences of a given length instead of whole words. Such a sequence of characters is called an N-gram, where N is the length of the sequence.

The N-gram approach helps in detecting European languages such as English. Tika uses a 3-gram approach for language detection. The N-gram approach also works well for short texts.
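
To make the idea of an N-gram concrete, the following sketch splits a string into overlapping character sequences and counts them. It is a simplified illustration in plain Java, not Tika's internal implementation; the class name TrigramSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class TrigramSketch {

    // Splits the text into overlapping character sequences of length n.
    public static List<String> ngrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    // Counts how often each n-gram occurs; the counts form a simple profile.
    public static Map<String, Integer> profile(String text, int n) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String gram : ngrams(text.toLowerCase(Locale.ROOT), n)) {
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("tika", 3));     // prints [tik, ika]
        System.out.println(profile("banana", 3));  // prints {ana=2, ban=1, nan=1}
    }
}
```

A detector built on this idea would compare the profile of the input text with a stored profile for each language and pick the closest one. Because even a short sentence yields many 3-grams, there is more evidence to compare than with whole words.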

TIKA Supported Languages

ISO 639-1 defines 184 standard language codes, but Tika can detect only the following 18 languages:

da - Danish      de - German     et - Estonian
el - Greek       en - English    es - Spanish
fi - Finnish     fr - French     hu - Hungarian
is - Icelandic   it - Italian    nl - Dutch
no - Norwegian   pl - Polish     pt - Portuguese
ru - Russian     sv - Swedish    th - Thai
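
Tika reports the detected language as one of these two-letter codes. If a readable name is needed instead, the JDK's own java.util.Locale class can translate a code into a display name. The sketch below uses only the standard library; the class name LanguageNameSketch and the method displayName are hypothetical helpers, not part of Tika.

```java
import java.util.Locale;

public class LanguageNameSketch {

    // Converts an ISO 639-1 code such as "en" into its English display name.
    public static String displayName(String code) {
        return Locale.forLanguageTag(code).getDisplayLanguage(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        // A few of the codes from the list above.
        for (String code : new String[] {"da", "en", "th"}) {
            System.out.println(code + " -> " + displayName(code));
        }
    }
}
```

Passing Locale.ENGLISH keeps the output independent of the default locale of the machine running the code.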

How to Detect Language with Tika?

The getLanguage() method of the LanguageIdentifier class returns the language of the text content passed to the constructor.

//Create a LanguageIdentifier object from the content.
LanguageIdentifier object = new LanguageIdentifier("English is so funny.");
//Get the language code of the content.
String lang = object.getLanguage();

Example: Detect Language from Text

This example shows the steps to get the language of the given content.

import org.apache.tika.language.LanguageIdentifier;

public class LanguageDetection {

   public static void main(String[] args) {

      //Create a LanguageIdentifier object from the content.
      LanguageIdentifier object = new LanguageIdentifier("English is so funny.");
      //Get the language code of the content.
      String lang = object.getLanguage();
      System.out.println("Detected Language is : " + lang);
   }
}

Output


Detected Language is : en

Example: Detect Language from Document Contents

To detect the language of a document, first parse the document with the parse() method. The parse() method stores the parsed content in the handler object. The content of this handler object is then passed as an argument to the LanguageIdentifier constructor to identify the language.

//Extract metadata and content with the parser's parse() method.
parser.parse(inputstream, handler, metadata, context);
//Pass the extracted content to the LanguageIdentifier constructor.
LanguageIdentifier object = new LanguageIdentifier(handler.toString());

Complete Example

Here are the complete steps to extract the content of a document and detect its language.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.language.LanguageIdentifier;

import org.xml.sax.SAXException;

public class TikaDocumentLanguageDetection{

   public static void main(final String[] args) throws IOException, SAXException, TikaException {

      //Instantiate a File object for the document to inspect.
      File file = new File("hello.txt");

      //Create the objects required as arguments of the parse() method.
      Parser parser = new AutoDetectParser();
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();

      //Extract metadata and content with parse(); the stream is closed automatically.
      try (FileInputStream content = new FileInputStream(file)) {
         parser.parse(content, handler, metadata, new ParseContext());
      }

      //Identify the language of the extracted content.
      LanguageIdentifier object = new LanguageIdentifier(handler.toString());
      System.out.println("File Content : " + handler.toString());
      System.out.println("Language Name : " + object.getLanguage());
   }
}

Output


File Content : English is so funny.
Language Name : en