Identify the language of text with ML Kit on Android

You can use ML Kit to identify the language of a string of text. You can get the string's most likely language as well as confidence scores for all of the string's possible languages.

ML Kit recognizes text in more than 100 different languages in their native scripts. In addition, romanized text can be recognized for Arabic, Bulgarian, Chinese, Greek, Hindi, Japanese, and Russian. See the complete list of supported languages and scripts.

See the ML Kit quickstart sample on GitHub for an example of this API in use.

Before you begin

  1. In your project-level build.gradle file, make sure to include Google's Maven repository in both your buildscript and allprojects sections.
  2. Add the dependency for the ML Kit language identification library to your module's app-level Gradle file, which is usually app/build.gradle:
    dependencies {
      // ...

      implementation 'com.google.mlkit:language-id:16.1.1'
    }

Identify the language of a string

To identify the language of a string, call LanguageIdentification.getClient() to get an instance of LanguageIdentifier, and then pass the string to the identifyLanguage() method of LanguageIdentifier.

For example:

Kotlin

val languageIdentifier = LanguageIdentification.getClient()
languageIdentifier.identifyLanguage(text)
        .addOnSuccessListener { languageCode ->
            if (languageCode == "und") {
                Log.i(TAG, "Can't identify language.")
            } else {
                Log.i(TAG, "Language: $languageCode")
            }
        }
        .addOnFailureListener {
            // Model couldn’t be loaded or other internal error.
            // ...
        }

Java

LanguageIdentifier languageIdentifier =
        LanguageIdentification.getClient();
languageIdentifier.identifyLanguage(text)
        .addOnSuccessListener(
                new OnSuccessListener<String>() {
                    @Override
                    public void onSuccess(@Nullable String languageCode) {
                        // languageCode is declared @Nullable, so guard against null first.
                        if (languageCode == null || languageCode.equals("und")) {
                            Log.i(TAG, "Can't identify language.");
                        } else {
                            Log.i(TAG, "Language: " + languageCode);
                        }
                    }
                })
        .addOnFailureListener(
                new OnFailureListener() {
                    @Override
                    public void onFailure(@NonNull Exception e) {
                        // Model couldn’t be loaded or other internal error.
                        // ...
                    }
                });

If the call succeeds, a BCP-47 language code is passed to the success listener, indicating the language of the text. If no language is confidently detected, the code und (undetermined) is passed.
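
Because the result is a BCP-47 code, it can be turned into a human-readable name with the standard java.util.Locale class. The following is a minimal sketch in plain Java, independent of ML Kit. The class and method names (LanguageNames, displayName) are hypothetical, and English is assumed as the display locale.

```java
import java.util.Locale;

/** Hypothetical helper, not part of ML Kit: maps a BCP-47 code to an English name. */
final class LanguageNames {
    private LanguageNames() {}

    static String displayName(String languageCode) {
        // "und" (or a missing code) means no language was confidently detected.
        if (languageCode == null || "und".equals(languageCode)) {
            return "Undetermined";
        }
        // Locale.forLanguageTag parses BCP-47 tags, including script subtags such as "zh-Latn".
        Locale locale = Locale.forLanguageTag(languageCode);
        String name = locale.getDisplayLanguage(Locale.ENGLISH);
        // Fall back to the raw code if the runtime has no display name for it.
        return name.isEmpty() ? languageCode : name;
    }
}
```

For example, displayName("fr") returns "French" and displayName("und") returns "Undetermined". You could call such a helper from the success listener before showing the result to the user.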

By default, ML Kit returns a value other than und only when it identifies the language with a confidence value of at least 0.5. You can change this threshold by passing a LanguageIdentificationOptions object to getClient():

Kotlin

val languageIdentifier = LanguageIdentification
        .getClient(LanguageIdentificationOptions.Builder()
                .setConfidenceThreshold(0.34f)
                .build())

Java

LanguageIdentifier languageIdentifier = LanguageIdentification.getClient(
        new LanguageIdentificationOptions.Builder()
                .setConfidenceThreshold(0.34f)
                .build());
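
The effect of the threshold can be pictured with a small stand-alone sketch. It only illustrates the rule described above and is not ML Kit's actual implementation. ThresholdSketch, resolve, and its parameters are hypothetical names, and the confidence argument stands in for whatever score the model assigns to its best guess.

```java
/** Illustration only: mimics the documented thresholding rule, not ML Kit internals. */
final class ThresholdSketch {
    static final String UNDETERMINED = "und";

    private ThresholdSketch() {}

    /**
     * Returns the best guess when its confidence is at least the threshold,
     * and "und" otherwise.
     */
    static String resolve(String bestTag, float bestConfidence, float threshold) {
        return bestConfidence >= threshold ? bestTag : UNDETERMINED;
    }
}
```

With a best guess of en at confidence 0.4, resolve returns und under the default threshold of 0.5 and en under the lowered threshold of 0.34. Lowering the threshold trades fewer und results for a higher chance of a wrong answer.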

Get the possible languages of a string

To get the confidence values of a string's most likely languages, get an instance of LanguageIdentifier and then pass the string to the identifyPossibleLanguages() method.

For example:

Kotlin

val languageIdentifier = LanguageIdentification.getClient()
languageIdentifier.identifyPossibleLanguages(text)
        .addOnSuccessListener { identifiedLanguages ->
            for (identifiedLanguage in identifiedLanguages) {
                val language = identifiedLanguage.languageTag
                val confidence = identifiedLanguage.confidence
                Log.i(TAG, "$language $confidence")
            }
        }
        .addOnFailureListener {
            // Model couldn’t be loaded or other internal error.
            // ...
        }

Java

LanguageIdentifier languageIdentifier =
        LanguageIdentification.getClient();
languageIdentifier.identifyPossibleLanguages(text)
        .addOnSuccessListener(new OnSuccessListener<List<IdentifiedLanguage>>() {
            @Override
            public void onSuccess(List<IdentifiedLanguage> identifiedLanguages) {
                for (IdentifiedLanguage identifiedLanguage : identifiedLanguages) {
                    String language = identifiedLanguage.getLanguageTag();
                    float confidence = identifiedLanguage.getConfidence();
                    Log.i(TAG, language + " (" + confidence + ")");
                }
            }
        })
        .addOnFailureListener(
                new OnFailureListener() {
                    @Override
                    public void onFailure(@NonNull Exception e) {
                        // Model couldn’t be loaded or other internal error.
                        // ...
                    }
                });

If the call succeeds, a list of IdentifiedLanguage objects is passed to the success listener. From each object, you can get the language's BCP-47 code and the confidence that the string is in that language. Note that these values indicate the confidence that the entire string is in the given language; ML Kit doesn't identify multiple languages in a single string.
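
Once you have the tags and confidence values, a common follow-up is to rank the candidates and check whether the top two are too close to call. The sketch below does this in plain Java. Candidate is a hypothetical stand-in for the values read from IdentifiedLanguage, and CandidateRanking, sortedByConfidence, isAmbiguous, and the margin parameter are illustrative names rather than ML Kit APIs. The text above does not say in what order the list arrives, so the sketch sorts it explicitly.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper for ranking (language tag, confidence) pairs. */
final class CandidateRanking {
    private CandidateRanking() {}

    /** Stand-in for the tag and confidence read from an IdentifiedLanguage. */
    static final class Candidate {
        final String languageTag;
        final float confidence;

        Candidate(String languageTag, float confidence) {
            this.languageTag = languageTag;
            this.confidence = confidence;
        }
    }

    /** Returns a new list sorted from highest to lowest confidence. */
    static List<Candidate> sortedByConfidence(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((Candidate c) -> c.confidence).reversed());
        return sorted;
    }

    /** True when the two best candidates differ by less than the given margin. */
    static boolean isAmbiguous(List<Candidate> candidates, float margin) {
        List<Candidate> sorted = sortedByConfidence(candidates);
        return sorted.size() >= 2
                && sorted.get(0).confidence - sorted.get(1).confidence < margin;
    }
}
```

In the success listener, you could build Candidate objects from getLanguageTag() and getConfidence(), then ask the user to confirm the language when isAmbiguous returns true. The right margin depends on your app and is an assumption you would need to tune.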

By default, ML Kit returns only languages with confidence values of at least 0.01. You can change this threshold by passing a LanguageIdentificationOptions object to getClient():

Kotlin

val languageIdentifier = LanguageIdentification
        .getClient(LanguageIdentificationOptions.Builder()
                .setConfidenceThreshold(0.5f)
                .build())

Java

LanguageIdentifier languageIdentifier = LanguageIdentification.getClient(
        new LanguageIdentificationOptions.Builder()
                .setConfidenceThreshold(0.5f)
                .build());

If no language meets this threshold, the list contains a single item whose language tag is und.
