HuggingFace Transformers

LocalHFBackend uses HuggingFace Transformers for local inference. It is designed for experimental Mellea features — aLoRA adapters, constrained decoding, and span-based context — that are not yet available on server-based backends.

Prerequisites: pip install 'mellea[hf]', Python 3.11+, local model weights.

Tip: For everyday local inference without experimental features, use Ollama — it is simpler to set up and well suited for development.

Install

pip install 'mellea[hf]'

Basic usage

# Requires: mellea[hf]
# Returns: ModelOutputThunk
from mellea import MelleaSession
from mellea.backends import ModelOption, model_ids
from mellea.backends.huggingface import LocalHFBackend

m = MelleaSession(
    LocalHFBackend(
        model_ids.IBM_GRANITE_4_HYBRID_MICRO,
        model_options={ModelOption.MAX_NEW_TOKENS: 256},
    )
)

result = m.instruct("Summarize the key ideas in the theory of relativity.")
print(str(result))
# Output will vary — LLM responses depend on model and temperature.

On first run, LocalHFBackend downloads the model weights via the Transformers Auto* classes and loads them onto the best available device (cuda > mps > cpu).

Device selection

The backend selects the device automatically: a CUDA GPU if available, then Apple Silicon MPS, then CPU. To override the automatic choice, pass custom_config:

# Requires: mellea[hf]
# Returns: LocalHFBackend
from mellea.backends.huggingface import LocalHFBackend, TransformersTorchConfig

m_backend = LocalHFBackend(
    "ibm-granite/granite-3.3-8b-instruct",
    custom_config=TransformersTorchConfig(device="cpu"),
)
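The cuda > mps > cpu priority described above can be sketched as a small helper. This is an illustration only, not LocalHFBackend's own selection code: pick_device and detect_device are hypothetical names, and the probe assumes PyTorch's torch.cuda.is_available() and torch.backends.mps.is_available() checks.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return a device name using the cuda > mps > cpu priority order."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch if it is installed; fall back to CPU otherwise."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch: without PyTorch, report CPU.
        return "cpu"
    # Older PyTorch builds may not expose torch.backends.mps at all.
    mps = getattr(torch.backends, "mps", None)
    mps_ok = mps is not None and mps.is_available()
    return pick_device(torch.cuda.is_available(), mps_ok)


if __name__ == "__main__":
    print(detect_device())
```

Running a check like this before constructing the backend shows which device the automatic choice is likely to land on, which helps when deciding whether to force device="cpu".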

KV cache

LocalHFBackend caches KV blocks across calls by default (use_caches=True). This speeds up repeated calls that share a common prefix. Pass a SimpleLRUCache to control capacity, or disable caching entirely for debugging:

# Requires: mellea[hf]
# Returns: LocalHFBackend
from mellea.backends import model_ids
from mellea.backends.cache import SimpleLRUCache
from mellea.backends.huggingface import LocalHFBackend

# Enable with explicit capacity
m_backend = LocalHFBackend(model_ids.IBM_GRANITE_4_HYBRID_MICRO, cache=SimpleLRUCache(5))

# Disable entirely
m_backend = LocalHFBackend(model_ids.IBM_GRANITE_4_HYBRID_MICRO, use_caches=False)

See Prefix Caching and KV Blocks for full details on marking blocks for caching and how KV smashing works.
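To make the capacity argument concrete, here is a minimal sketch of least-recently-used eviction in general. ToyLRUCache is a hypothetical stand-in written for this page; it is not Mellea's SimpleLRUCache and does not store real KV blocks. It only shows the policy the name suggests: when the cache is full, the entry that was used longest ago is dropped.

```python
from collections import OrderedDict


class ToyLRUCache:
    """Hypothetical LRU cache for illustration; not Mellea's SimpleLRUCache."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._entries = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        """Return the cached value and mark it as recently used, or None."""
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)
        return self._entries[key]

    def put(self, key, value) -> None:
        """Store a value, evicting the least recently used entry if full."""
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)


cache = ToyLRUCache(2)
cache.put("prefix-a", 1)
cache.put("prefix-b", 2)
cache.get("prefix-a")      # touch "prefix-a" so "prefix-b" becomes the oldest
cache.put("prefix-c", 3)   # over capacity: "prefix-b" is evicted
```

Under this policy, a larger capacity keeps more distinct prefixes available for reuse at the cost of more memory, which is the trade-off to weigh when choosing the value passed to SimpleLRUCache.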

aLoRA adapters

LocalHFBackend supports Activated LoRA (aLoRA) adapters — lightweight domain-specific requirement validators that run on local GPU hardware. See the aLoRA guide for training and usage.

Tip: For intrinsics without local GPU requirements, Granite Switch models serve pre-embedded adapters via vLLM and the OpenAI backend. See Intrinsics for details.

Vision support

Vision support for LocalHFBackend is model-dependent and experimental. Pass a PIL image or an ImageBlock via images=[...] to instruct() or chat() when using a vision-capable model. Not all models loaded via LocalHFBackend support image input. See Use Images and Vision Models.

Troubleshooting

`pip install "mellea[hf]"` fails on Intel macOS

If you see torch/torchvision version errors on an Intel Mac, use Conda:

conda install 'torchvision>=0.22.0'
pip install mellea

Then run examples with python inside the Conda environment rather than uv run --with mellea.

Python 3.13: `error: can't find Rust compiler`

The outlines package (used by mellea[hf]) requires a Rust compiler on Python 3.13. Either downgrade to Python 3.12 or install the Rust compiler:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
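Before installing, a quick check can tell you whether this error is likely to apply. The sketch below uses only the standard library; needs_rust_compiler is a hypothetical helper name, and treating "rustc not found on PATH" as "no Rust compiler installed" is an assumption of this example.

```python
import shutil
import sys


def needs_rust_compiler(python_version, rustc_path) -> bool:
    """True when on Python 3.13 and no rustc executable was found."""
    return tuple(python_version[:2]) == (3, 13) and rustc_path is None


if needs_rust_compiler(sys.version_info, shutil.which("rustc")):
    print("Python 3.13 without rustc: install Rust or use Python 3.12.")
else:
    print("No Rust compiler problem expected for this interpreter.")
```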
