indicate package

Submodules

indicate.base module

indicate.decoder module

class indicate.decoder.Decoder(vocab_size: int, embedding_dim: int, dec_units: int) None[source]

Bases: Module

LSTM decoder with Luong (dot-product) attention.

Mirrors the original Keras model: the attention query is a linear projection of the target embedding (not the recurrent state), attention is unscaled dot-product over the encoder outputs (Keras Attention with use_scale=False), and the attention context is concatenated with the embedding before the LSTM.

forward works for both the full target sequence (training, teacher forcing) and a single step (autoregressive inference) by carrying state across calls.

__init__(vocab_size: int, embedding_dim: int, dec_units: int) None[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(inputs: Tensor, encoder_outputs: Tensor, state: tuple[Tensor, Tensor] | None = None, src_mask: Tensor | None = None) tuple[Tensor, tuple[Tensor, Tensor]][source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Return type:

tuple[Tensor, tuple[Tensor, Tensor]]

indicate.encoder module

class indicate.encoder.Encoder(vocab_size: int, embedding_dim: int, enc_units: int) None[source]

Bases: Module

LSTM encoder: embedding -> LSTM.

Returns the full output sequence (for attention) and the final (hidden, cell) state used to initialise the decoder.

__init__(vocab_size: int, embedding_dim: int, enc_units: int) None[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor) tuple[Tensor, tuple[Tensor, Tensor]][source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Return type:

tuple[Tensor, tuple[Tensor, Tensor]]

indicate.transliterator module

class indicate.transliterator.Seq2SeqTransliterator[source]

Bases: object

Base char-level seq2seq transliterator: lazy-loaded singleton.

Subclasses set SUBDIR (the per-language folder) and the tokenizer filenames. Files are resolved local-first (indicate/data/<SUBDIR>/... — present after training) and otherwise downloaded from the HF model repo (HF_REPO @ HF_REVISION) and cached, so the wheel ships no weights. All mutable state is assigned on the concrete subclass, so each language’s model loads and caches independently.

HF_REPO: str = 'soodoku/indicate'
HF_REVISION: str = 'v0.7.0'
SUBDIR: str = ''
INPUT_VOCAB: str = ''
TARGET_VOCAB: str = ''
ENCODER_FILE: str = 'saved_weights/encoder.safetensors'
DECODER_FILE: str = 'saved_weights/decoder.safetensors'
embedding_dim: int = 256
units: int = 1024
max_length_input: int = 64
max_length_output: int = 64
START_TOKEN: str = '^'
END_TOKEN: str = '$'
BEAM_WIDTH: int = 5
RERANKER: Reranker | None = None
MASK_PADDING: bool = False
input_lang_tokenizer: CharTokenizer | None = None
target_lang_tokenizer: CharTokenizer | None = None
encoder: Encoder | None = None
decoder: Decoder | None = None
classmethod get_model_path() str[source]

Local saved_weights dir (for display; weights may be on HF instead).

Return type:

str

classmethod get_input_vocab() str[source]
Return type:

str

classmethod get_target_vocab() str[source]
Return type:

str

classmethod transliterate(input: str, n: int = 1) str | list[str][source]

Transliterate one text to English (thin wrapper over transliterate_batch).

Return type:

str | list[str]

Parameters:
  • input – source-language text

  • n – number of candidates. n == 1 (default) returns a single best string; n > 1 returns a list of up to n ranked candidates (requires beam search).

Returns:

str when n == 1; list[str] when n > 1.

Raises:
  • TypeError – If input is None

  • ValueError – If input is not a string

classmethod transliterate_batch(inputs: list[str], n: int = 1) list[str] | list[list[str]][source]

Transliterate many texts at once — the batched decode engine.

All words across all inputs are decoded in a single batch (one encoder / decoder pass per step), which is much faster than calling transliterate per item. Returns one result per input, aligned to inputs: a str each when n == 1, else a list[str] of up to n candidates each.

Return type:

list[str] | list[list[str]]

indicate.hindi2english module

class indicate.hindi2english.HindiToEnglish[source]

Bases: Seq2SeqTransliterator

Hindi (Devanagari) → English transliteration model.

SUBDIR: str = 'hindi_to_english'
INPUT_VOCAB: str = 'hindi_tokens.json'
TARGET_VOCAB: str = 'english_tokens.json'
max_length_input: int = 47
max_length_output: int = 173

indicate.punjabi2english module

class indicate.punjabi2english.PunjabiToEnglish[source]

Bases: Seq2SeqTransliterator

Punjabi (Gurmukhi) → English transliteration model.

SUBDIR: str = 'punjabi_to_english'
INPUT_VOCAB: str = 'punjabi_tokens.json'
TARGET_VOCAB: str = 'english_tokens.json'
max_length_input: int = 32
max_length_output: int = 32

indicate.logging module

indicate.logging.get_logger() Logger[source]
Return type:

Logger

indicate.utils module

class indicate.utils.CharTokenizer(word_index: dict[str, int]) None[source]

Bases: object

Character-level tokenizer holding the word_index/index_word maps.

Loaded from the JSON files that were serialised by the original Keras Tokenizer so the vocabulary indices stay identical across the migration.

__init__(word_index: dict[str, int]) None[source]
property vocab_size: int
indicate.utils.load_tokenizer(path: str) CharTokenizer[source]

Load a character tokenizer from a Keras-serialised tokenizer JSON file.

Return type:

CharTokenizer

indicate.utils.sequence_to_chars(tokenizer: CharTokenizer, sequence: Iterable[int]) str[source]

Convert a sequence of indices back to characters, skipping padding (0).

Return type:

str

indicate.utils.batch_candidates(words: list[str], input_lang_tokenizer: CharTokenizer, target_lang_tokenizer: CharTokenizer, encoder: Encoder, decoder: Decoder, max_length_input: int, max_length_output: int, beam_width: int = 1, mask_padding: bool = False) list[list[tuple[str, float]]][source]

Ranked candidates for many words at once (the batched decode engine).

Returns one ranked (text, score) list per input word, aligned to words (empty/OOV words -> []). Inputs are padded to max_length_input exactly as the single-word path, so outputs are identical — just faster.

Return type:

list[list[tuple[str, float]]]

Module contents

indicate.hindi2english(input: str, n: int = 1) str | list[str]

Transliterate one text to English (thin wrapper over transliterate_batch).

Return type:

str | list[str]

Parameters:
  • input – source-language text

  • n – number of candidates. n == 1 (default) returns a single best string; n > 1 returns a list of up to n ranked candidates (requires beam search).

Returns:

str when n == 1; list[str] when n > 1.

Raises:
  • TypeError – If input is None

  • ValueError – If input is not a string

indicate.punjabi2english(input: str, n: int = 1) str | list[str]

Transliterate one text to English (thin wrapper over transliterate_batch).

Return type:

str | list[str]

Parameters:
  • input – source-language text

  • n – number of candidates. n == 1 (default) returns a single best string; n > 1 returns a list of up to n ranked candidates (requires beam search).

Returns:

str when n == 1; list[str] when n > 1.

Raises:
  • TypeError – If input is None

  • ValueError – If input is not a string

class indicate.IndicLLMTransliterator(source_lang: str, target_lang: str, provider: str | None = None, model: str | None = None, api_key: str | None = None, temperature: float = 0.3, cache_examples: bool = True)[source]

Bases: object

LLM-based transliterator for Indic languages.

DEFAULT_MODELS = {'anthropic': 'claude-3-opus-20240229', 'cohere': 'command-r-plus', 'google': 'gemini-pro', 'openai': 'gpt-5.4-mini'}
INDIC_LANGUAGES = {'bengali': {'iso': 'bn', 'native': 'বাংলা', 'script': 'bengali'}, 'english': {'iso': 'en', 'native': 'English', 'script': 'latin'}, 'gujarati': {'iso': 'gu', 'native': 'ગુજરાતી', 'script': 'gujarati'}, 'hindi': {'iso': 'hi', 'native': 'हिन्दी', 'script': 'devanagari'}, 'kannada': {'iso': 'kn', 'native': 'ಕನ್ನಡ', 'script': 'kannada'}, 'malayalam': {'iso': 'ml', 'native': 'മലയാളം', 'script': 'malayalam'}, 'marathi': {'iso': 'mr', 'native': 'मराठी', 'script': 'devanagari'}, 'odia': {'iso': 'or', 'native': 'ଓଡ଼ିଆ', 'script': 'odia'}, 'punjabi': {'iso': 'pa', 'native': 'ਪੰਜਾਬੀ', 'script': 'gurmukhi'}, 'sanskrit': {'iso': 'sa', 'native': 'संस्कृतम्', 'script': 'devanagari'}, 'tamil': {'iso': 'ta', 'native': 'தமிழ்', 'script': 'tamil'}, 'telugu': {'iso': 'te', 'native': 'తెలుగు', 'script': 'telugu'}, 'urdu': {'iso': 'ur', 'native': 'اردو', 'script': 'arabic'}}
__init__(source_lang: str, target_lang: str, provider: str | None = None, model: str | None = None, api_key: str | None = None, temperature: float = 0.3, cache_examples: bool = True)[source]

Initialize the Indic LLM transliterator.

Parameters:
  • source_lang – Source language (e.g., ‘hindi’, ‘tamil’)

  • target_lang – Target language (e.g., ‘english’)

  • provider – LLM provider (openai, anthropic, etc.). Auto-detected if not provided.

  • model – Specific model to use. Uses provider defaults if not provided.

  • api_key – API key. Uses environment variables if not provided.

  • temperature – LLM temperature for consistency (lower = more consistent).

  • cache_examples – Whether to cache generated few-shot examples.

build_group_messages(texts: list[str], examples: list[dict[str, str]] | None = None) list[dict[str, str]][source]

Build chat messages for transliterating a numbered group of texts.

Shared by the synchronous transliterate_batch and the async Batch-API path (indicate.batch) so both produce identical prompts.

Return type:

list[dict[str, str]]

default_max_tokens_for(texts: list[str]) int[source]

Estimate max output tokens for transliterating texts as a group.

Return type:

int

generate_few_shot_examples(num_examples: int = 5) list[dict[str, str]][source]

Generate few-shot transliteration examples for the language pair.

Return type:

list[dict[str, str]]

Parameters:

num_examples – Number of examples to generate.

Returns:

List of dictionaries with ‘source’ and ‘target’ keys.

transliterate(text: str, use_few_shot: bool = True, num_examples: int = 5) str[source]

Transliterate text from source language to target language.

Return type:

str

Parameters:
  • text – Text to transliterate.

  • use_few_shot – Whether to use few-shot examples.

  • num_examples – Number of few-shot examples to use.

Returns:

Transliterated text.

Raises:

RuntimeError – If the LLM transliteration call fails.

transliterate_batch(texts: list[str], batch_size: int = 10, use_few_shot: bool = True) list[str][source]

Transliterate multiple texts efficiently.

Return type:

list[str]

Parameters:
  • texts – List of texts to transliterate.

  • batch_size – Number of texts to process in one API call.

  • use_few_shot – Whether to use few-shot examples.

Returns:

List of transliterated texts.

indicate.detect_indic_script(text: str) str | None[source]

Auto-detect Indic script from Unicode ranges.

Return type:

str | None

Parameters:

text – Text to analyze.

Returns:

Detected script name or None if not Indic.

indicate.detect_language_from_script(text: str) str | None[source]

Detect the most likely language based on script and context.

Return type:

str | None

Parameters:

text – Text to analyze.

Returns:

Detected language name or None.