Data Extraction Pipeline
This example shows the most direct path from raw text to typed, structured
output in Mellea: a @generative function whose return annotation tells the
runtime exactly what shape the result must have.
Source file: docs/examples/information_extraction/101_with_gen_stubs.py
Concepts covered
- Declaring a generative function with @generative
- Using a list[str] return type as an extraction contract
- Passing a session (m) as the first argument to a generative function
- Keyword-only input via doc=
Prerequisites
- Quick Start complete
- Ollama running locally with granite4.1:3b pulled
The full example
Imports and session
from mellea import generative, start_session
from mellea.backends import model_ids
m = start_session()
start_session() with no arguments creates a session backed by the default
local model. The model_ids import is available if you want to switch to a
specific model later (see Backends and configuration).
Declaring the extraction function
@generative
def extract_all_person_names(doc: str) -> list[str]:
    """Given a document, extract names of ALL mentioned persons. Return these names as list of strings."""
The @generative decorator converts a bare function stub into a generative
stub. Three things drive the extraction:
- Parameter names (doc) become the named inputs the model receives.
- Return annotation (list[str]) tells the runtime to parse and validate the response as a JSON array of strings. If the model returns something that cannot be coerced to that type, Mellea retries automatically.
- Docstring is the task description sent to the model. Write it as a precise instruction: the docstring is the prompt.
No function body is needed. The decorator supplies the implementation.
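To make those three inputs concrete, the sketch below reads the same pieces of information from a stub using only the standard library. It illustrates what a decorator can see when it wraps a function; it is not Mellea's implementation. The helper describe_stub is hypothetical, and the stub is deliberately left undecorated so the snippet runs without a model backend.

```python
import inspect
from typing import get_type_hints


def extract_all_person_names(doc: str) -> list[str]:
    """Given a document, extract names of ALL mentioned persons. Return these names as list of strings."""


def describe_stub(func):
    """Collect the parts of a stub a decorator could turn into a request (hypothetical helper)."""
    hints = get_type_hints(func)
    return_type = hints.pop("return", None)
    return {
        "inputs": list(inspect.signature(func).parameters),  # parameter names
        "return_type": return_type,                          # return annotation
        "task": inspect.getdoc(func),                        # docstring text
    }


stub_info = describe_stub(extract_all_person_names)
print(stub_info["inputs"])       # ['doc']
print(stub_info["return_type"])  # list[str]
print(stub_info["task"])
```

Everything the model needs is already present in the signature and docstring, which is why the stub can have an empty body.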
Running the extraction
# ref: https://www.nytimes.com/2012/05/20/world/world-leaders-at-us-meeting-urge-growth-not-austerity.html
NYTimes_text = "CAMP DAVID, Md. — Leaders of the world's richest countries banded together on Saturday to press Germany to back more pro-growth policies to halt the deepening debt crisis in Europe, as President Obama for the first time gained widespread support for his argument that Europe, and the United States by extension, cannot afford Chancellor Angela Merkel's one-size-fits-all approach emphasizing austerity."
person_names = extract_all_person_names(m, doc=NYTimes_text)
print(f"person_names = {person_names}")
# out: person_names = ['President Obama', 'Angela Merkel']
Calling the decorated function follows a consistent pattern across all generative functions: pass the session as the first positional argument, then pass the declared parameters as keyword arguments. The return value is the extracted, type-validated data — not a raw string or a thunk.
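As a rough picture of what "type-validated" means for list[str], the sketch below parses a raw JSON reply and rejects anything that is not an array of strings. The helper parse_name_list is hypothetical, and the standard-library check only stands in for whatever validation Mellea performs internally.

```python
import json


def parse_name_list(raw: str) -> list[str]:
    """Parse a raw reply and ensure it is a JSON array of strings (hypothetical helper)."""
    value = json.loads(raw)  # malformed JSON raises json.JSONDecodeError, a ValueError subclass
    if not isinstance(value, list) or not all(isinstance(item, str) for item in value):
        raise ValueError(f"expected a JSON array of strings, got: {value!r}")
    return value


# A well-formed reply passes through as a plain Python list.
names = parse_name_list('["President Obama", "Angela Merkel"]')
print(names)

# A reply with the wrong shape is rejected instead of being returned to the caller.
try:
    parse_name_list('{"names": ["President Obama"]}')
except ValueError as err:
    print(f"rejected: {err}")
```

The caller only ever sees the first case: a list it can iterate over, slice, or deduplicate like any other.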
Full file
# pytest: ollama, llm
"""Simple Example of information extraction with Mellea using generative stubs."""
from mellea import generative, start_session
from mellea.backends import model_ids
m = start_session()
@generative
def extract_all_person_names(doc: str) -> list[str]:
    """Given a document, extract names of ALL mentioned persons. Return these names as list of strings."""
# ref: https://www.nytimes.com/2012/05/20/world/world-leaders-at-us-meeting-urge-growth-not-austerity.html
NYTimes_text = "CAMP DAVID, Md. — Leaders of the world's richest countries banded together on Saturday to press Germany to back more pro-growth policies to halt the deepening debt crisis in Europe, as President Obama for the first time gained widespread support for his argument that Europe, and the United States by extension, cannot afford Chancellor Angela Merkel's one-size-fits-all approach emphasizing austerity."
person_names = extract_all_person_names(m, doc=NYTimes_text)
print(f"person_names = {person_names}")
# out: person_names = ['President Obama', 'Angela Merkel']
Key observations
The docstring is the prompt. There is no separate template file or prompt string. Writing a clear, imperative docstring is the primary tool for controlling extraction quality.
The return type is the schema. list[str] is simple, but the same
mechanism works for Literal["positive", "negative", "neutral"], Pydantic
models, or any other type that Mellea knows how to validate. See
Enforce structured output for richer
return types.
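To see why an annotation can act as a schema, the sketch below pulls the allowed values out of a Literal return type with the standard library. The stub classify_sentiment and its docstring are hypothetical, the stub is left undecorated so the snippet runs offline, and the membership check is an illustration rather than Mellea's own validation logic.

```python
from typing import Literal, get_args, get_type_hints


def classify_sentiment(text: str) -> Literal["positive", "negative", "neutral"]:
    """Classify the sentiment of the text."""


# The return annotation carries the complete set of acceptable answers.
allowed = get_args(get_type_hints(classify_sentiment)["return"])
print(allowed)  # ('positive', 'negative', 'neutral')


def is_valid_label(candidate: str) -> bool:
    """Return True only if the candidate is one of the annotated labels."""
    return candidate in allowed


print(is_valid_label("neutral"))  # True
print(is_valid_label("mixed"))    # False
```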
Sessions are explicit. Passing m as the first argument makes the
dependency on a live backend visible at the call site. You can pass different
sessions in tests (for example, a session backed by a mock) without changing
the function definition.
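The same explicitness helps when testing code that sits on top of a generative function. The sketch below does not build a mock session, since how to do that depends on Mellea's session API; instead it injects a stand-in extractor that follows the same calling convention (session first, then doc= as a keyword). The names count_mentions and fake_extract are hypothetical.

```python
def count_mentions(extract_names, session, doc: str) -> dict[str, int]:
    """Count how often each extracted name appears verbatim in the document."""
    names = extract_names(session, doc=doc)  # session first, declared parameters by keyword
    return {name: doc.count(name) for name in names}


def fake_extract(session, *, doc: str) -> list[str]:
    """Test double: ignores the session and returns a fixed answer."""
    return ["President Obama", "Angela Merkel"]


text = "President Obama met Chancellor Angela Merkel. President Obama spoke first."
counts = count_mentions(fake_extract, session=None, doc=text)
print(counts)  # {'President Obama': 2, 'Angela Merkel': 1}
```

In production code the first argument would be the real generative function and the second a live session; the downstream logic stays identical.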
What to try next:
- Replace list[str] with a Pydantic model to extract multiple fields at once; see Enforce structured output.
- Add requirements to the @generative call to enforce constraints on the extracted values; see the requirements system concept.
- Look at docs/examples/information_extraction/advanced_with_m_instruct.py for a version that uses m.instruct() directly with structured outputs.
See also: Enforce Structured Output | The Requirements System | Examples Index