indicate package¶
Submodules¶
indicate.base module¶
indicate.decoder module¶
- class indicate.decoder.Decoder(vocab_size: int, embedding_dim: int, dec_units: int) None[source]¶
Bases:
ModuleLSTM decoder with Luong (dot-product) attention.
Mirrors the original Keras model: the attention query is a linear projection of the target embedding (not the recurrent state), attention is unscaled dot-product over the encoder outputs (Keras
Attentionwithuse_scale=False), and the attention context is concatenated with the embedding before the LSTM.forwardworks for both the full target sequence (training, teacher forcing) and a single step (autoregressive inference) by carryingstateacross calls.- __init__(vocab_size: int, embedding_dim: int, dec_units: int) None[source]¶
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(inputs: Tensor, encoder_outputs: Tensor, state: tuple[Tensor, Tensor] | None = None, src_mask: Tensor | None = None) tuple[Tensor, tuple[Tensor, Tensor]][source]¶
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Moduleinstance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.- Return type:
tuple[Tensor,tuple[Tensor,Tensor]]
indicate.encoder module¶
- class indicate.encoder.Encoder(vocab_size: int, embedding_dim: int, enc_units: int) None[source]¶
Bases:
ModuleLSTM encoder: embedding -> LSTM.
Returns the full output sequence (for attention) and the final
(hidden, cell)state used to initialise the decoder.- __init__(vocab_size: int, embedding_dim: int, enc_units: int) None[source]¶
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(x: Tensor) tuple[Tensor, tuple[Tensor, Tensor]][source]¶
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Moduleinstance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.- Return type:
tuple[Tensor,tuple[Tensor,Tensor]]
indicate.transliterator module¶
- class indicate.transliterator.Seq2SeqTransliterator[source]¶
Bases:
objectBase char-level seq2seq transliterator: lazy-loaded singleton.
Subclasses set
SUBDIR(the per-language folder) and the tokenizer filenames. Files are resolved local-first (indicate/data/<SUBDIR>/...— present after training) and otherwise downloaded from the HF model repo (HF_REPO@HF_REVISION) and cached, so the wheel ships no weights. All mutable state is assigned on the concrete subclass, so each language’s model loads and caches independently.- HF_REPO: str = 'soodoku/indicate'¶
- HF_REVISION: str = 'v0.7.0'¶
- SUBDIR: str = ''¶
- INPUT_VOCAB: str = ''¶
- TARGET_VOCAB: str = ''¶
- ENCODER_FILE: str = 'saved_weights/encoder.safetensors'¶
- DECODER_FILE: str = 'saved_weights/decoder.safetensors'¶
- embedding_dim: int = 256¶
- units: int = 1024¶
- max_length_input: int = 64¶
- max_length_output: int = 64¶
- START_TOKEN: str = '^'¶
- END_TOKEN: str = '$'¶
- BEAM_WIDTH: int = 5¶
- RERANKER: Reranker | None = None¶
- MASK_PADDING: bool = False¶
- input_lang_tokenizer: CharTokenizer | None = None¶
- target_lang_tokenizer: CharTokenizer | None = None¶
- encoder: Encoder | None = None¶
- decoder: Decoder | None = None¶
- classmethod get_model_path() str[source]¶
Local saved_weights dir (for display; weights may be on HF instead).
- Return type:
str
- classmethod transliterate(input: str, n: int = 1) str | list[str][source]¶
Transliterate one text to English (thin wrapper over
transliterate_batch).- Return type:
str|list[str]- Parameters:
input – source-language text
n – number of candidates.
n == 1(default) returns a single best string;n > 1returns a list of up tonranked candidates (requires beam search).
- Returns:
str when
n == 1; list[str] whenn > 1.- Raises:
TypeError – If input is None
ValueError – If input is not a string
- classmethod transliterate_batch(inputs: list[str], n: int = 1) list[str] | list[list[str]][source]¶
Transliterate many texts at once — the batched decode engine.
All words across all inputs are decoded in a single batch (one encoder / decoder pass per step), which is much faster than calling
transliterateper item. Returns one result per input, aligned toinputs: astreach whenn == 1, else alist[str]of up toncandidates each.- Return type:
list[str] |list[list[str]]
indicate.hindi2english module¶
- class indicate.hindi2english.HindiToEnglish[source]¶
Bases:
Seq2SeqTransliteratorHindi (Devanagari) → English transliteration model.
- SUBDIR: str = 'hindi_to_english'¶
- INPUT_VOCAB: str = 'hindi_tokens.json'¶
- TARGET_VOCAB: str = 'english_tokens.json'¶
- max_length_input: int = 47¶
- max_length_output: int = 173¶
indicate.punjabi2english module¶
- class indicate.punjabi2english.PunjabiToEnglish[source]¶
Bases:
Seq2SeqTransliteratorPunjabi (Gurmukhi) → English transliteration model.
- SUBDIR: str = 'punjabi_to_english'¶
- INPUT_VOCAB: str = 'punjabi_tokens.json'¶
- TARGET_VOCAB: str = 'english_tokens.json'¶
- max_length_input: int = 32¶
- max_length_output: int = 32¶
indicate.logging module¶
indicate.utils module¶
- class indicate.utils.CharTokenizer(word_index: dict[str, int]) None[source]¶
Bases:
objectCharacter-level tokenizer holding the
word_index/index_wordmaps.Loaded from the JSON files that were serialised by the original Keras
Tokenizerso the vocabulary indices stay identical across the migration.- property vocab_size: int¶
- indicate.utils.load_tokenizer(path: str) CharTokenizer[source]¶
Load a character tokenizer from a Keras-serialised tokenizer JSON file.
- Return type:
- indicate.utils.sequence_to_chars(tokenizer: CharTokenizer, sequence: Iterable[int]) str[source]¶
Convert a sequence of indices back to characters, skipping padding (0).
- Return type:
str
- indicate.utils.batch_candidates(words: list[str], input_lang_tokenizer: CharTokenizer, target_lang_tokenizer: CharTokenizer, encoder: Encoder, decoder: Decoder, max_length_input: int, max_length_output: int, beam_width: int = 1, mask_padding: bool = False) list[list[tuple[str, float]]][source]¶
Ranked candidates for many words at once (the batched decode engine).
Returns one ranked
(text, score)list per input word, aligned towords(empty/OOV words ->[]). Inputs are padded tomax_length_inputexactly as the single-word path, so outputs are identical — just faster.- Return type:
list[list[tuple[str,float]]]
Module contents¶
- indicate.hindi2english(input: str, n: int = 1) str | list[str]¶
Transliterate one text to English (thin wrapper over
transliterate_batch).- Return type:
str|list[str]- Parameters:
input – source-language text
n – number of candidates.
n == 1(default) returns a single best string;n > 1returns a list of up tonranked candidates (requires beam search).
- Returns:
str when
n == 1; list[str] whenn > 1.- Raises:
TypeError – If input is None
ValueError – If input is not a string
- indicate.punjabi2english(input: str, n: int = 1) str | list[str]¶
Transliterate one text to English (thin wrapper over
transliterate_batch).- Return type:
str|list[str]- Parameters:
input – source-language text
n – number of candidates.
n == 1(default) returns a single best string;n > 1returns a list of up tonranked candidates (requires beam search).
- Returns:
str when
n == 1; list[str] whenn > 1.- Raises:
TypeError – If input is None
ValueError – If input is not a string
- class indicate.IndicLLMTransliterator(source_lang: str, target_lang: str, provider: str | None = None, model: str | None = None, api_key: str | None = None, temperature: float = 0.3, cache_examples: bool = True)[source]¶
Bases:
objectLLM-based transliterator for Indic languages.
- DEFAULT_MODELS = {'anthropic': 'claude-3-opus-20240229', 'cohere': 'command-r-plus', 'google': 'gemini-pro', 'openai': 'gpt-5.4-mini'}¶
- INDIC_LANGUAGES = {'bengali': {'iso': 'bn', 'native': 'বাংলা', 'script': 'bengali'}, 'english': {'iso': 'en', 'native': 'English', 'script': 'latin'}, 'gujarati': {'iso': 'gu', 'native': 'ગુજરાતી', 'script': 'gujarati'}, 'hindi': {'iso': 'hi', 'native': 'हिन्दी', 'script': 'devanagari'}, 'kannada': {'iso': 'kn', 'native': 'ಕನ್ನಡ', 'script': 'kannada'}, 'malayalam': {'iso': 'ml', 'native': 'മലയാളം', 'script': 'malayalam'}, 'marathi': {'iso': 'mr', 'native': 'मराठी', 'script': 'devanagari'}, 'odia': {'iso': 'or', 'native': 'ଓଡ଼ିଆ', 'script': 'odia'}, 'punjabi': {'iso': 'pa', 'native': 'ਪੰਜਾਬੀ', 'script': 'gurmukhi'}, 'sanskrit': {'iso': 'sa', 'native': 'संस्कृतम्', 'script': 'devanagari'}, 'tamil': {'iso': 'ta', 'native': 'தமிழ்', 'script': 'tamil'}, 'telugu': {'iso': 'te', 'native': 'తెలుగు', 'script': 'telugu'}, 'urdu': {'iso': 'ur', 'native': 'اردو', 'script': 'arabic'}}¶
- __init__(source_lang: str, target_lang: str, provider: str | None = None, model: str | None = None, api_key: str | None = None, temperature: float = 0.3, cache_examples: bool = True)[source]¶
Initialize the Indic LLM transliterator.
- Parameters:
source_lang – Source language (e.g., ‘hindi’, ‘tamil’)
target_lang – Target language (e.g., ‘english’)
provider – LLM provider (openai, anthropic, etc.). Auto-detected if not provided.
model – Specific model to use. Uses provider defaults if not provided.
api_key – API key. Uses environment variables if not provided.
temperature – LLM temperature for consistency (lower = more consistent).
cache_examples – Whether to cache generated few-shot examples.
- build_group_messages(texts: list[str], examples: list[dict[str, str]] | None = None) list[dict[str, str]][source]¶
Build chat messages for transliterating a numbered group of texts.
Shared by the synchronous
transliterate_batchand the async Batch-API path (indicate.batch) so both produce identical prompts.- Return type:
list[dict[str,str]]
- default_max_tokens_for(texts: list[str]) int[source]¶
Estimate max output tokens for transliterating
textsas a group.- Return type:
int
- generate_few_shot_examples(num_examples: int = 5) list[dict[str, str]][source]¶
Generate few-shot transliteration examples for the language pair.
- Return type:
list[dict[str,str]]- Parameters:
num_examples – Number of examples to generate.
- Returns:
List of dictionaries with ‘source’ and ‘target’ keys.
- transliterate(text: str, use_few_shot: bool = True, num_examples: int = 5) str[source]¶
Transliterate text from source language to target language.
- Return type:
str- Parameters:
text – Text to transliterate.
use_few_shot – Whether to use few-shot examples.
num_examples – Number of few-shot examples to use.
- Returns:
Transliterated text.
- Raises:
RuntimeError – If the LLM transliteration call fails.
- transliterate_batch(texts: list[str], batch_size: int = 10, use_few_shot: bool = True) list[str][source]¶
Transliterate multiple texts efficiently.
- Return type:
list[str]- Parameters:
texts – List of texts to transliterate.
batch_size – Number of texts to process in one API call.
use_few_shot – Whether to use few-shot examples.
- Returns:
List of transliterated texts.