Welcome to indicate’s documentation!¶
Contents:
Indicate: Transliterate Indic Languages with PyTorch and LLMs¶
Indicate provides high-quality transliteration between Indic languages and English using both a traditional PyTorch model and state-of-the-art LLMs (Large Language Models).
🚀 Features¶
🧠 Dual Backend Support: Choose between the PyTorch model or LLM-based transliteration
🌍 Multi-Language: 12+ Indic languages (Hindi, Tamil, Telugu, Bengali, etc.)
🔄 Bidirectional: Supports both Indic→English and English→Indic transliteration
🛡️ Production Ready: Safe file handling, atomic writes, backup support
📊 Structured Output: Rich JSON format with metadata and error handling
⚡ Batch Processing: Efficient processing of large files with progress tracking
🎯 Supported Languages¶
Hindi • Tamil • Telugu • Bengali • Gujarati • Kannada • Malayalam • Punjabi • Marathi • Odia • Urdu • Sanskrit ↔ English
Install¶
We strongly recommend installing indicate inside a Python virtual environment (see venv documentation)
Requirements: Python 3.13+
pip install indicate
🔧 Quick Setup¶
For LLM-based transliteration (recommended):¶
pip install indicate
# Set your API key (choose one):
export OPENAI_API_KEY=your-key
export ANTHROPIC_API_KEY=your-key
export GOOGLE_API_KEY=your-key
For the local model (no API key):¶
pip install indicate
# No API key needed. The PyTorch weights are downloaded once from Hugging Face
# (soodoku/indicate) on first transliterate and cached locally; tokenizers ship
# in the wheel. After the first run it works fully offline.
🎯 Usage¶
🧠 LLM-Based Transliteration (New!)¶
The LLM backend provides higher accuracy and supports all Indic languages:
# Simple transliteration (auto-detects Hindi)
indicate llm "राजशेखर चिंतालपति"
# Output: Rajashekar Chintalapati
# Specify languages explicitly
indicate llm "முருகன்" --source tamil --target english
# Output: Murugan
# Between Indic languages
indicate llm "नमस्ते" --source hindi --target tamil
# Output: நமஸ்தே
# Safe batch processing with structured JSON output
indicate llm --input names.txt --output results.json --format json --batch --backup
# Dry run to preview changes
indicate llm --input large_file.txt --dry-run
Python API:
from indicate import IndicLLMTransliterator
# Initialize for any language pair
transliterator = IndicLLMTransliterator('hindi', 'english')
result = transliterator.transliterate('राजशेखर चिंतालपति')
print(result) # Output: Rajashekar Chintalapati
# Batch processing
texts = ["राजेश", "गौरव", "प्रिया"]
results = transliterator.transliterate_batch(texts)
print(results) # ['Rajesh', 'Gaurav', 'Priya']
🤖 PyTorch Backend (Traditional)¶
Local offline models are available for Hindi (Devanagari) and Punjabi (Gurmukhi):
# Hindi to English using the local PyTorch model
indicate hindi2english "राजशेखर चिंतालपति"
# Output: rajshekhar chintapalati
# Punjabi (Gurmukhi) to English
indicate punjabi2english "ਰਵਿ ਸ਼ਰਮਾ"
# Output: ravi sharma
# From file
indicate hindi2english --input hindi.txt --output english.txt
# Batch processing
indicate hindi2english --input large_file.txt --batch
Python API:
from indicate import hindi2english, punjabi2english
print(hindi2english("हिंदी")) # "hindi"
print(punjabi2english("ਰਵਿ")) # "ravi"
# Top-k candidates (n > 1 returns a list)
print(hindi2english("नमस्ते", n=3)) # ["namaste", "namastey", "namste"]
# Batched (much faster for many inputs)
from indicate.hindi2english import HindiToEnglish
HindiToEnglish.transliterate_batch(["हिंदी", "मुंबई", "गौरव सूद"])
# -> ["hindi", "mumbai", "gaurav sood"]
📊 JSON Output Format¶
The LLM backend provides rich, structured output perfect for data processing:
{
"metadata": {
"source_language": "hindi",
"target_language": "english",
"timestamp": "2024-12-09T12:00:00Z",
"total_lines": 3,
"successful_lines": 3,
"failed_lines": 0,
"encoding": "utf-8"
},
"results": [
{
"line_number": 1,
"input_text": "राजेश कुमार",
"output_text": "Rajesh Kumar",
"source_lang": "hindi",
"target_lang": "english",
"confidence": "high",
"processing_time": 1.2,
"timestamp": "2024-12-09T12:00:01Z"
}
]
}
🛡️ Safety Features¶
🔒 Input/Output Validation: Prevents accidental file overwrites
⚛️ Atomic Writing: Safe file operations using temporary files
💾 Automatic Backups: Optional timestamped backups of existing files
🔄 Resume Support: Resume interrupted batch operations
👁️ Dry Run Mode: Preview operations before execution
🎛️ Advanced Usage¶
# Show few-shot examples being used
indicate llm --show-examples --source bengali --target english
# Resume interrupted batch job
indicate llm --input large_file.txt --output results.txt --resume
# Use specific LLM provider/model
indicate llm "text" --provider anthropic --model claude-3-opus
# Process JSON from previous results
indicate llm --input results.json --source english --target hindi
🔄 Backend Comparison¶
Feature |
PyTorch Backend |
LLM Backend |
|---|---|---|
Languages |
Hindi ↔ English only |
12+ Indic languages ↔ English + Inter-Indic |
Setup |
No API key needed |
Requires LLM API key |
Speed |
Very fast (local) |
Moderate (API calls) |
Accuracy |
Good for common words |
Excellent for all types |
Cost |
Free |
Pay per API call |
Offline |
✅ Works offline |
❌ Requires internet |
Batch Processing |
✅ |
✅ with safety features |
🧪 Testing Locally¶
Clone and install:
git clone https://github.com/in-rolls/indicate.git cd indicate uv sync # or pip install -e .
Run tests:
# All tests python -m pytest # Specific tests python -m pytest tests/test_llm_indic.py python -m pytest tests/test_file_safety.py
Test both backends:
# PyTorch backend indicate hindi2english "हिंदी" # LLM backend (set API key first) export OPENAI_API_KEY=your-key indicate llm "हिंदी"
Data¶
The datasets used to train the model:
ESPN Cric Info for hindi version of the english scorecard
Evaluation¶
The v2 models (trained on our data + the public Aksharantar corpus) are benchmarked against AI4Bharat IndicXlit — the same direction (native→Latin), the same test sets, the same metric (Top-1 exact-match, match-any-reference). Training is leakage-filtered so no eval word appears in it.
Model |
Dakshina (gold) |
Held-out-own names¹ |
|---|---|---|
Hindi → English |
74.4% (IndicXlit 73.2%) |
52.8% (IndicXlit 49.7%) |
Punjabi → English |
71.9% (IndicXlit 73.2%) |
56.9% (IndicXlit 53.5%) |
¹ Held-out slice of our own electoral/affidavit names — the cleanest comparison,
since IndicXlit never trained on it. v2 matches or edges IndicXlit on the gold
benchmark and beats it on the deployment domain. Primary metric is Top-1
exact-match; CER (character error rate) is the soft companion. Reproduce with
training/eval.py and training/compare.py.
Below is the edit-distance distribution on the test set (0 = exact match):

Contributor Code of Conduct¶
The project welcomes contributions from everyone! In fact, it depends on it. To maintain this welcoming atmosphere, and to collaborate in a fun and productive way, we expect contributors to the project to abide by the Contributor Code of Conduct.
License¶
The package is released under the MIT License.