docchex ¤

docchex — document QA/QC engine.

Classes:

  • AICheckRule

    Checks a document against a custom prompt using an LLM.

  • AnthropicClient

    LLM client backed by the Anthropic API.

  • Document

    A parsed document ready for rule evaluation.

  • DocumentParser

    Abstract base class for document parsers.

  • Finding

    A single rule violation found in a document.

  • LLMClient

    Protocol for LLM providers used by AICheckRule.

  • LLMResponse

    Result returned by an LLM provider after evaluating a document.

  • OllamaClient

    LLM client that connects to a local Ollama server via its OpenAI-compatible API.

  • OpenAIClient

    LLM client backed by the OpenAI API (or any OpenAI-compatible endpoint).

  • PDFParser

    Parse PDF files into Document objects using pdfplumber.

  • Report

    The result of running a set of rules against a document.

  • RequiredSectionRule

    Checks that a required section heading is present in the document.

  • Rule

    Abstract base class for all docchex rules.

  • RuleEngine

    Runs a list of rules against a document and collects findings into a report.

  • RuleLoader

    Loads rules from YAML/TOML files, presets, or rule dictionaries.

  • Severity

    Constants for rule severity levels.

  • TextParser

    Parse plain-text (.txt) files into Document objects.

  • WordCountRule

    Checks that the document word count falls within optional min/max bounds.

Functions:

  • get_parser

    Return the CLI argument parser.

  • list_presets

    Return the names of all built-in rule presets.

  • main

    Run the main program.

  • run_qaqc

    Run QA/QC checks on a document against a set of rules.

AICheckRule ¤

AICheckRule(
    rule_id: str,
    prompt: str,
    severity: str = ERROR,
    llm: LLMClient | None = None,
)

Bases: Rule

Checks a document against a custom prompt using an LLM.

The LLM is expected to return JSON with {"passed": bool, "reason": str}. If the document fails the check, a finding is emitted with the reason as the message.
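
The pass/fail contract is the same JSON shape the built-in providers parse; a minimal stdlib sketch of that parsing (the rule id and reason text here are illustrative, not part of docchex):

```python
import json

# The LLM must reply with a JSON object: {"passed": bool, "reason": str}.
raw = '{"passed": false, "reason": "No risk assessment section found."}'

data = json.loads(raw)
passed = bool(data["passed"])         # mirrors the provider parsing
reason = str(data.get("reason", ""))  # reason defaults to "" if omitted

# A failed check becomes one finding; a passed check yields none.
findings = [] if passed else [{"rule_id": "risk-check", "message": reason}]
```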

Methods:

  • check

    Evaluate the document using the configured LLM and return any findings.

  • from_config

    Instantiate from a rule configuration dictionary.

Attributes:

  • id (str) –

    Rule identifier.

  • prompt (str) –

    The evaluation prompt sent to the LLM together with the document text.

  • severity (str) –

    Severity of findings produced when the document fails the check.

Source code in src/docchex/_internal/rules/builtin/ai_check.py
def __init__(
    self,
    rule_id: str,
    prompt: str,
    severity: str = Severity.ERROR,
    llm: LLMClient | None = None,
) -> None:
    self.id = rule_id
    self.prompt = prompt
    self.severity = severity
    self._llm = llm

id instance-attribute ¤

id: str = rule_id

Rule identifier.

prompt instance-attribute ¤

prompt: str = prompt

The evaluation prompt sent to the LLM together with the document text.

severity instance-attribute ¤

severity: str = severity

Severity of findings produced when the document fails the check.

check ¤

check(doc: Document) -> list[Finding]

Evaluate the document using the configured LLM and return any findings.

Raises:

  • RuntimeError –

    If no LLM client was provided.

Source code in src/docchex/_internal/rules/builtin/ai_check.py
def check(self, doc: Document) -> list[Finding]:
    """Evaluate the document using the configured LLM and return any findings.

    Raises:
        RuntimeError: If no LLM client was provided.
    """
    if self._llm is None:
        raise RuntimeError(
            f"Rule {self.id!r} requires an LLM client. "
            "Pass llm=... to RuleLoader or AICheckRule.",
        )
    from docchex._internal.models import Finding  # noqa: PLC0415

    result = self._llm.evaluate(doc, self.prompt)
    if result.passed:
        return []
    return [Finding(rule_id=self.id, severity=self.severity, message=result.reason)]

from_config classmethod ¤

from_config(
    config: dict[str, Any], llm: LLMClient | None = None
) -> AICheckRule

Instantiate from a rule configuration dictionary.

Parameters:

  • config ¤

    (dict[str, Any]) –

    Rule config dict with keys id, prompt, and optional severity.

  • llm ¤

    (LLMClient | None, default: None ) –

    LLM client to use for evaluation.

Source code in src/docchex/_internal/rules/builtin/ai_check.py
@classmethod
def from_config(
    cls,
    config: dict[str, Any],
    llm: LLMClient | None = None,
) -> AICheckRule:
    """Instantiate from a rule configuration dictionary.

    Parameters:
        config: Rule config dict with keys ``id``, ``prompt``, and optional ``severity``.
        llm: LLM client to use for evaluation.
    """
    return cls(
        rule_id=config["id"],
        prompt=config["prompt"],
        severity=config.get("severity", Severity.ERROR),
        llm=llm,
    )

AnthropicClient ¤

AnthropicClient(
    api_key: str | None = None, model: str = _DEFAULT_MODEL
)

LLM client backed by the Anthropic API.

Requires pip install docchex[anthropic].

Parameters:

  • api_key ¤

    (str | None, default: None ) –

    Anthropic API key. Defaults to the ANTHROPIC_API_KEY environment variable.

  • model ¤

    (str, default: _DEFAULT_MODEL ) –

    Model ID to use for evaluation.

Methods:

  • evaluate

    Send the document and prompt to Anthropic and return a structured result.

Source code in src/docchex/_internal/llm/providers/anthropic.py
def __init__(self, api_key: str | None = None, model: str = _DEFAULT_MODEL) -> None:
    """Initialise the client.

    Parameters:
        api_key: Anthropic API key. Defaults to the ``ANTHROPIC_API_KEY`` environment variable.
        model: Model ID to use for evaluation.
    """
    try:
        import anthropic  # ty: ignore[unresolved-import]  # noqa: PLC0415
    except ImportError as exc:
        raise ImportError(
            "anthropic package is required: pip install docchex[anthropic]",
        ) from exc
    self._client = anthropic.Anthropic(api_key=api_key)
    self._model = model

evaluate ¤

evaluate(doc: Document, prompt: str) -> LLMResponse

Send the document and prompt to Anthropic and return a structured result.

Source code in src/docchex/_internal/llm/providers/anthropic.py
def evaluate(self, doc: Document, prompt: str) -> LLMResponse:
    """Send the document and prompt to Anthropic and return a structured result."""
    text = doc.text[:32000]
    message = self._client.messages.create(
        model=self._model,
        max_tokens=256,
        system=_SYSTEM_PROMPT,
        messages=[{"role": "user", "content": f"{prompt}\n\nDocument:\n{text}"}],
    )
    data = json.loads(message.content[0].text)
    return LLMResponse(passed=bool(data["passed"]), reason=str(data.get("reason", "")))

Document dataclass ¤

Document(
    path: Path,
    text: str,
    pages: list[str],
    metadata: dict[str, Any] = dict(),
)

A parsed document ready for rule evaluation.

Attributes:

  • metadata (dict[str, Any]) –

    Optional metadata extracted from the document (e.g. PDF metadata).

  • pages (list[str]) –

    Text content split by page.

  • path (Path) –

    Path to the source file.

  • text (str) –

    Full text content of the document.

metadata class-attribute instance-attribute ¤

metadata: dict[str, Any] = field(default_factory=dict)

Optional metadata extracted from the document (e.g. PDF metadata).

pages instance-attribute ¤

pages: list[str]

Text content split by page.

path instance-attribute ¤

path: Path

Path to the source file.

text instance-attribute ¤

text: str

Full text content of the document.
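
The documented fields correspond to a plain dataclass; a stand-in sketch using only the standard library (the class name `DocumentSketch` and sample values are illustrative, not the real implementation):

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class DocumentSketch:
    """Stand-in mirroring the documented Document fields."""

    path: Path
    text: str
    pages: list[str]
    metadata: dict[str, Any] = field(default_factory=dict)


doc = DocumentSketch(path=Path("report.txt"), text="Intro\n\nBody", pages=["Intro", "Body"])
# metadata defaults to a fresh empty dict for each instance
```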

DocumentParser ¤

Bases: ABC

Abstract base class for document parsers.

Methods:

  • for_path

    Return the appropriate parser for the given file path based on its extension.

  • parse

    Parse the file at the given path into a Document.

for_path classmethod ¤

for_path(path: Path) -> DocumentParser

Return the appropriate parser for the given file path based on its extension.

Source code in src/docchex/_internal/parsing/base.py
@classmethod
def for_path(cls, path: Path) -> DocumentParser:
    """Return the appropriate parser for the given file path based on its extension."""
    from docchex._internal.parsing.pdf import PDFParser  # noqa: PLC0415
    from docchex._internal.parsing.text import TextParser  # noqa: PLC0415

    suffix = path.suffix.lower()
    if suffix == ".pdf":
        return PDFParser()
    if suffix == ".txt":
        return TextParser()
    raise ValueError(f"Unsupported file type: {suffix!r}")
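
Selection is a plain lower-cased suffix check, as the source above shows; the same dispatch in isolation, returning parser names as strings for illustration:

```python
from pathlib import Path


def parser_name_for(path: Path) -> str:
    # Mirrors for_path: the lower-cased extension decides the parser.
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        return "PDFParser"
    if suffix == ".txt":
        return "TextParser"
    raise ValueError(f"Unsupported file type: {suffix!r}")


print(parser_name_for(Path("Report.PDF")))  # case-insensitive: PDFParser
```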

parse abstractmethod ¤

parse(path: Path) -> Document

Parse the file at the given path into a Document.

Source code in src/docchex/_internal/parsing/base.py
@abstractmethod
def parse(self, path: Path) -> Document:
    """Parse the file at the given path into a Document."""
    ...

Finding dataclass ¤

Finding(
    rule_id: str,
    severity: str,
    message: str,
    location: str | None = None,
)

A single rule violation found in a document.

Attributes:

  • location (str | None) –

    Optional location reference within the document.

  • message (str) –

    Human-readable description of the violation.

  • rule_id (str) –

    ID of the rule that produced this finding.

  • severity (str) –

    Severity level: "error", "warning", or "info".

location class-attribute instance-attribute ¤

location: str | None = None

Optional location reference within the document.

message instance-attribute ¤

message: str

Human-readable description of the violation.

rule_id instance-attribute ¤

rule_id: str

ID of the rule that produced this finding.

severity instance-attribute ¤

severity: str

Severity level: "error", "warning", or "info".

LLMClient ¤

Bases: Protocol

Protocol for LLM providers used by AICheckRule.

Any object that implements evaluate(doc, prompt) -> LLMResponse satisfies this protocol — no subclassing required.

Built-in providers¤

  • AnthropicClient — Anthropic API (pip install docchex[anthropic])
  • OpenAIClient — OpenAI API or any OpenAI-compatible endpoint (pip install docchex[openai])
  • OllamaClient — local Ollama server via the OpenAI-compatible API (pip install docchex[ollama])

Custom providers¤

Implement a custom provider by defining a class with the evaluate method:

class MyClient:
    def evaluate(self, doc: Document, prompt: str) -> LLMResponse:
        ...
        return LLMResponse(passed=True, reason="All good")

loader = RuleLoader(llm=MyClient())

Future extensibility¤

The current design supports single-call, stateless checks: one prompt is sent to the LLM and a pass/fail result is returned. If multi-step reasoning becomes necessary (e.g. agent loops, tool calling, or structured-output retries), the natural approach is to implement a richer LLMClient that encapsulates that logic internally — the rest of the pipeline (AICheckRule, RuleLoader, run_qaqc) stays unchanged. For that use case, litellm (https://docs.litellm.ai; lightweight, 100+ providers) or LangChain (https://python.langchain.com; full agent orchestration) are good building blocks to wrap inside a custom LLMClient.

Methods:

  • evaluate

    Evaluate the document against the given prompt and return a structured result.

evaluate ¤

evaluate(doc: Document, prompt: str) -> LLMResponse

Evaluate the document against the given prompt and return a structured result.

Source code in src/docchex/_internal/llm/base.py
def evaluate(self, doc: Document, prompt: str) -> LLMResponse:
    """Evaluate the document against the given prompt and return a structured result."""
    ...

LLMResponse dataclass ¤

LLMResponse(passed: bool, reason: str)

Result returned by an LLM provider after evaluating a document.

Attributes:

  • passed (bool) –

    Whether the document passed the check.

  • reason (str) –

    Human-readable explanation of the result.

passed instance-attribute ¤

passed: bool

Whether the document passed the check.

reason instance-attribute ¤

reason: str

Human-readable explanation of the result.

OllamaClient ¤

OllamaClient(
    model: str = _DEFAULT_MODEL,
    base_url: str = _DEFAULT_BASE_URL,
)

LLM client that connects to a local Ollama server via its OpenAI-compatible API.

Requires pip install docchex[ollama] (installs the openai package).

Parameters:

  • model ¤

    (str, default: _DEFAULT_MODEL ) –

    Ollama model name (e.g. "llama3.2", "mistral").

  • base_url ¤

    (str, default: _DEFAULT_BASE_URL ) –

    Base URL of the Ollama server.

Methods:

  • evaluate

    Send the document and prompt to Ollama and return a structured result.

Source code in src/docchex/_internal/llm/providers/ollama.py
def __init__(self, model: str = _DEFAULT_MODEL, base_url: str = _DEFAULT_BASE_URL) -> None:
    """Initialise the client.

    Parameters:
        model: Ollama model name (e.g. ``"llama3.2"``, ``"mistral"``).
        base_url: Base URL of the Ollama server.
    """
    from docchex._internal.llm.providers.openai import OpenAIClient  # noqa: PLC0415

    self._inner = OpenAIClient(api_key="ollama", model=model, base_url=base_url)

evaluate ¤

evaluate(doc: Document, prompt: str) -> LLMResponse

Send the document and prompt to Ollama and return a structured result.

Source code in src/docchex/_internal/llm/providers/ollama.py
def evaluate(self, doc: Document, prompt: str) -> LLMResponse:
    """Send the document and prompt to Ollama and return a structured result."""
    return self._inner.evaluate(doc, prompt)

OpenAIClient ¤

OpenAIClient(
    api_key: str | None = None,
    model: str = _DEFAULT_MODEL,
    base_url: str | None = None,
)

LLM client backed by the OpenAI API (or any OpenAI-compatible endpoint).

Requires pip install docchex[openai].

Parameters:

  • api_key ¤

    (str | None, default: None ) –

    OpenAI API key. Defaults to the OPENAI_API_KEY environment variable.

  • model ¤

    (str, default: _DEFAULT_MODEL ) –

    Model ID to use for evaluation.

  • base_url ¤

    (str | None, default: None ) –

    Override the API base URL (e.g. for a local Ollama server).

Methods:

  • evaluate

    Send the document and prompt to OpenAI and return a structured result.

Source code in src/docchex/_internal/llm/providers/openai.py
def __init__(
    self,
    api_key: str | None = None,
    model: str = _DEFAULT_MODEL,
    base_url: str | None = None,
) -> None:
    """Initialise the client.

    Parameters:
        api_key: OpenAI API key. Defaults to the ``OPENAI_API_KEY`` environment variable.
        model: Model ID to use for evaluation.
        base_url: Override the API base URL (e.g. for a local Ollama server).
    """
    try:
        import openai  # ty: ignore[unresolved-import]  # noqa: PLC0415
    except ImportError as exc:
        raise ImportError(
            "openai package is required: pip install docchex[openai]",
        ) from exc
    self._client = openai.OpenAI(api_key=api_key, base_url=base_url)
    self._model = model

evaluate ¤

evaluate(doc: Document, prompt: str) -> LLMResponse

Send the document and prompt to OpenAI and return a structured result.

Source code in src/docchex/_internal/llm/providers/openai.py
def evaluate(self, doc: Document, prompt: str) -> LLMResponse:
    """Send the document and prompt to OpenAI and return a structured result."""
    text = doc.text[:32000]
    response = self._client.chat.completions.create(
        model=self._model,
        messages=[
            {"role": "system", "content": _SYSTEM_PROMPT},
            {"role": "user", "content": f"{prompt}\n\nDocument:\n{text}"},
        ],
        max_tokens=256,
    )
    data = json.loads(response.choices[0].message.content)
    return LLMResponse(passed=bool(data["passed"]), reason=str(data.get("reason", "")))

PDFParser ¤

Bases: DocumentParser

Parse PDF files into Document objects using pdfplumber.

Methods:

  • for_path

    Return the appropriate parser for the given file path based on its extension.

  • parse

    Parse a PDF file into a Document, extracting text and metadata.

for_path classmethod ¤

for_path(path: Path) -> DocumentParser

Return the appropriate parser for the given file path based on its extension.

Source code in src/docchex/_internal/parsing/base.py
@classmethod
def for_path(cls, path: Path) -> DocumentParser:
    """Return the appropriate parser for the given file path based on its extension."""
    from docchex._internal.parsing.pdf import PDFParser  # noqa: PLC0415
    from docchex._internal.parsing.text import TextParser  # noqa: PLC0415

    suffix = path.suffix.lower()
    if suffix == ".pdf":
        return PDFParser()
    if suffix == ".txt":
        return TextParser()
    raise ValueError(f"Unsupported file type: {suffix!r}")

parse ¤

parse(path: Path) -> Document

Parse a PDF file into a Document, extracting text and metadata.

Source code in src/docchex/_internal/parsing/pdf.py
def parse(self, path: Path) -> Document:
    """Parse a PDF file into a Document, extracting text and metadata."""
    try:
        import pdfplumber  # noqa: PLC0415
    except ImportError as exc:
        raise ImportError("pdfplumber is required to parse PDF files: pip install pdfplumber") from exc

    pages: list[str] = []
    metadata: dict = {}

    with pdfplumber.open(path) as pdf:
        metadata = pdf.metadata or {}
        pages.extend(page.extract_text() or "" for page in pdf.pages)

    text = "\n\n".join(pages)
    return Document(path=path, text=text, pages=pages, metadata=metadata)

Report dataclass ¤

Report(document_path: str, findings: list[Finding])

The result of running a set of rules against a document.

Methods:

  • to_dict

    Serialise the report to a plain dictionary.

Attributes:

  • document_path (str) –

    Path to the evaluated document.

  • findings (list[Finding]) –

    All findings produced by the rule engine.

  • passed (bool) –

    True if no error-severity findings were produced.

  • summary (dict[str, int]) –

    Finding counts grouped by severity.

document_path instance-attribute ¤

document_path: str

Path to the evaluated document.

findings instance-attribute ¤

findings: list[Finding]

All findings produced by the rule engine.

passed property ¤

passed: bool

True if no error-severity findings were produced.

summary property ¤

summary: dict[str, int]

Finding counts grouped by severity.
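
Both properties are simple aggregations over the findings; a stdlib sketch of equivalent logic (plain dicts stand in for Finding objects, and the real implementation may differ in detail):

```python
from collections import Counter

findings = [
    {"severity": "warning"},
    {"severity": "error"},
    {"severity": "warning"},
]

# summary: finding counts grouped by severity
summary = dict(Counter(f["severity"] for f in findings))

# passed: only error-severity findings block a pass
passed = summary.get("error", 0) == 0

print(summary)  # {'warning': 2, 'error': 1}
print(passed)   # False
```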

to_dict ¤

to_dict() -> dict[str, Any]

Serialise the report to a plain dictionary.

Source code in src/docchex/_internal/models.py
def to_dict(self) -> dict[str, Any]:
    """Serialise the report to a plain dictionary."""
    return {
        "document": self.document_path,
        "passed": self.passed,
        "summary": self.summary,
        "findings": [
            {
                "rule_id": f.rule_id,
                "severity": f.severity,
                "message": f.message,
                "location": f.location,
            }
            for f in self.findings
        ],
    }

RequiredSectionRule ¤

RequiredSectionRule(
    rule_id: str, match: str, severity: str = ERROR
)

Bases: Rule

Checks that a required section heading is present in the document.
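
The check is a case-insensitive substring search over the full document text, mirroring the source shown in this section; the same test in isolation (the sample headings are illustrative):

```python
def missing_section(text: str, match: str) -> bool:
    # Mirrors the rule: case-insensitive substring search over the full text.
    return match.lower() not in text.lower()


doc_text = "1. INTRODUCTION\n\n2. Methods\n\n3. Conclusion"
print(missing_section(doc_text, "Introduction"))  # False: found despite casing
print(missing_section(doc_text, "References"))    # True: would trigger a finding
```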

Methods:

  • check

    Return a finding if the required section is absent from the document.

  • from_config

    Instantiate from a rule configuration dictionary.

Attributes:

  • id (str) –

    Rule identifier.

  • match (str) –

    The section heading text to search for (case-insensitive).

  • severity (str) –

    Severity of the finding when the section is missing.

Source code in src/docchex/_internal/rules/builtin/required_section.py
def __init__(self, rule_id: str, match: str, severity: str = Severity.ERROR) -> None:
    self.id = rule_id
    self.match = match
    self.severity = severity

id instance-attribute ¤

id: str = rule_id

Rule identifier.

match instance-attribute ¤

match: str = match

The section heading text to search for (case-insensitive).

severity instance-attribute ¤

severity: str = severity

Severity of the finding when the section is missing.

check ¤

check(doc: Document) -> list[Finding]

Return a finding if the required section is absent from the document.

Source code in src/docchex/_internal/rules/builtin/required_section.py
def check(self, doc: Document) -> list[Finding]:
    """Return a finding if the required section is absent from the document."""
    if self.match.lower() not in doc.text.lower():
        return [
            Finding(
                rule_id=self.id,
                severity=self.severity,
                message=f"Required section not found: {self.match!r}",
            ),
        ]
    return []

from_config classmethod ¤

from_config(config: dict[str, Any]) -> RequiredSectionRule

Instantiate from a rule configuration dictionary.

Source code in src/docchex/_internal/rules/builtin/required_section.py
@classmethod
def from_config(cls, config: dict[str, Any]) -> RequiredSectionRule:
    """Instantiate from a rule configuration dictionary."""
    return cls(
        rule_id=config["id"],
        match=config["match"],
        severity=config.get("severity", Severity.ERROR),
    )

Rule ¤

Bases: ABC

Abstract base class for all docchex rules.

Methods:

  • check

    Check the document and return any findings.

  • from_config

    Instantiate the rule from a configuration dictionary.

Attributes:

  • id (str) –

    Unique identifier for this rule.

  • severity (str) –

    Default severity level for findings produced by this rule.

id instance-attribute ¤

id: str

Unique identifier for this rule.

severity class-attribute instance-attribute ¤

severity: str = WARNING

Default severity level for findings produced by this rule.

check abstractmethod ¤

check(doc: Document) -> list[Finding]

Check the document and return any findings.

Source code in src/docchex/_internal/rules/base.py
@abstractmethod
def check(self, doc: Document) -> list[Finding]:
    """Check the document and return any findings."""
    ...

from_config classmethod ¤

from_config(config: dict[str, Any]) -> Rule

Instantiate the rule from a configuration dictionary.

Source code in src/docchex/_internal/rules/base.py
@classmethod
def from_config(cls, config: dict[str, Any]) -> Rule:
    """Instantiate the rule from a configuration dictionary."""
    raise NotImplementedError(f"{cls.__name__} does not implement from_config")

RuleEngine ¤

RuleEngine(rules: list[Rule])

Runs a list of rules against a document and collects findings into a report.

Methods:

  • run

    Run all rules against the document and return a consolidated report.

Attributes:

  • rules (list[Rule]) –

    The rules applied by this engine.

Source code in src/docchex/_internal/evaluation/engine.py
def __init__(self, rules: list[Rule]) -> None:
    self.rules = rules

rules instance-attribute ¤

rules: list[Rule] = rules

The rules applied by this engine.

run ¤

run(doc: Document) -> Report

Run all rules against the document and return a consolidated report.

Source code in src/docchex/_internal/evaluation/engine.py
def run(self, doc: Document) -> Report:
    """Run all rules against the document and return a consolidated report."""
    findings: list[Finding] = []
    for rule in self.rules:
        findings.extend(rule.check(doc))
    return Report(document_path=str(doc.path), findings=findings)
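
RuleEngine.run simply flattens each rule's findings in order; a stdlib sketch with stub rules (StubRule and the string findings are illustrative stand-ins):

```python
class StubRule:
    """Minimal stand-in: any object with check(doc) -> list works here."""

    def __init__(self, findings):
        self._findings = findings

    def check(self, doc):
        return list(self._findings)


def run(rules, doc):
    # Same aggregation as RuleEngine.run: flatten all findings in rule order.
    findings = []
    for rule in rules:
        findings.extend(rule.check(doc))
    return findings


all_findings = run([StubRule(["a"]), StubRule([]), StubRule(["b", "c"])], doc=None)
print(all_findings)  # ['a', 'b', 'c']
```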

RuleLoader ¤

RuleLoader(llm: LLMClient | None = None)

Loads rules from YAML/TOML files, presets, or rule dictionaries.

Parameters:

  • llm ¤

    (LLMClient | None, default: None ) –

    Optional LLM client injected into any ai_check rules that are loaded.

Methods:

  • load

    Load rules from one or more sources.

Source code in src/docchex/_internal/rules/loader.py
def __init__(self, llm: LLMClient | None = None) -> None:
    """Initialise the loader.

    Parameters:
        llm: Optional LLM client injected into any ``ai_check`` rules that are loaded.
    """
    self._llm = llm

load ¤

load(source: _RuleSource | list[_RuleSource]) -> list[Rule]

Load rules from one or more sources.

A single source can be:

  • A path string or Path to a .yaml/.yml/.toml file
  • A preset shorthand string like "preset:tech_report"
  • A list of rule dicts

Multiple sources can be combined by passing a list of any of the above. A list whose first element is not a dict is treated as a list of sources.
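
The "list of sources vs. list of rule dicts" distinction hinges on the first element; the same dispatch extracted into a standalone function:

```python
def is_list_of_sources(source) -> bool:
    # A list whose first element is not a dict is a list of sources;
    # an empty list is also treated as (an empty) list of sources.
    return isinstance(source, list) and (not source or not isinstance(source[0], dict))


print(is_list_of_sources(["preset:tech_report", "extra.yaml"]))  # True: two sources
print(is_list_of_sources([{"id": "wc", "type": "word_count"}]))  # False: rule dicts
print(is_list_of_sources("rules.yaml"))                          # False: single source
```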

Source code in src/docchex/_internal/rules/loader.py
def load(self, source: _RuleSource | list[_RuleSource]) -> list[Rule]:
    """Load rules from one or more sources.

    A single source can be:
    - A path string or Path to a ``.yaml``/``.yml``/``.toml`` file
    - A preset shorthand string like ``"preset:tech_report"``
    - A list of rule dicts

    Multiple sources can be combined by passing a list of any of the above.
    A list whose first element is not a dict is treated as a list of sources.
    """
    if isinstance(source, list) and (not source or not isinstance(source[0], dict)):
        rules: list[Rule] = []
        for s in source:
            rules.extend(self._load_single(s))  # ty: ignore[invalid-argument-type]
        return rules
    return self._load_single(source)  # ty: ignore[invalid-argument-type]

Severity ¤

Constants for rule severity levels.

Attributes:

  • ERROR

    Blocks the document from passing.

  • INFO

    Informational finding only.

  • WARNING

    Non-blocking issue worth noting.

ERROR class-attribute instance-attribute ¤

ERROR = 'error'

Blocks the document from passing.

INFO class-attribute instance-attribute ¤

INFO = 'info'

Informational finding only.

WARNING class-attribute instance-attribute ¤

WARNING = 'warning'

Non-blocking issue worth noting.

TextParser ¤

Bases: DocumentParser

Parse plain-text (.txt) files into Document objects.

Pages are split on double newlines, mirroring PDFParser's convention. Used primarily for eval fixtures to keep CI dependency-free.
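
The page split drops blank chunks, as the parse source in this section shows; the same transformation standalone (the sample text is illustrative):

```python
text = "Title page\n\n\n\nBody text here.\n\nAppendix"

# Split on double newlines, strip each chunk, and drop empty chunks.
pages = [p.strip() for p in text.split("\n\n") if p.strip()]
if not pages:  # an empty file still yields one (empty) page
    pages = [""]

print(pages)  # ['Title page', 'Body text here.', 'Appendix']
```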

Methods:

  • for_path

    Return the appropriate parser for the given file path based on its extension.

  • parse

    Parse a plain-text file into a Document, splitting pages on double newlines.

for_path classmethod ¤

for_path(path: Path) -> DocumentParser

Return the appropriate parser for the given file path based on its extension.

Source code in src/docchex/_internal/parsing/base.py
@classmethod
def for_path(cls, path: Path) -> DocumentParser:
    """Return the appropriate parser for the given file path based on its extension."""
    from docchex._internal.parsing.pdf import PDFParser  # noqa: PLC0415
    from docchex._internal.parsing.text import TextParser  # noqa: PLC0415

    suffix = path.suffix.lower()
    if suffix == ".pdf":
        return PDFParser()
    if suffix == ".txt":
        return TextParser()
    raise ValueError(f"Unsupported file type: {suffix!r}")

parse ¤

parse(path: Path) -> Document

Parse a plain-text file into a Document, splitting pages on double newlines.

Source code in src/docchex/_internal/parsing/text.py
def parse(self, path: Path) -> Document:
    """Parse a plain-text file into a Document, splitting pages on double newlines."""
    text = path.read_text(encoding="utf-8")
    pages = [p.strip() for p in text.split("\n\n") if p.strip()]
    if not pages:
        pages = [""]
    return Document(path=path, text=text, pages=pages, metadata={})

WordCountRule ¤

WordCountRule(
    rule_id: str,
    min_words: int | None = None,
    max_words: int | None = None,
    severity: str = WARNING,
)

Bases: Rule

Checks that the document word count falls within optional min/max bounds.
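
Words are whitespace-separated tokens (str.split with no arguments); a standalone sketch of the bounds check mirroring the rule's messages:

```python
def word_count_findings(text: str, min_words=None, max_words=None) -> list[str]:
    # Mirrors the rule: count whitespace-separated tokens, then check bounds.
    count = len(text.split())
    messages = []
    if min_words is not None and count < min_words:
        messages.append(f"Document has {count} words; minimum required is {min_words}.")
    if max_words is not None and count > max_words:
        messages.append(f"Document has {count} words; maximum allowed is {max_words}.")
    return messages


print(word_count_findings("one two three", min_words=5))
# ['Document has 3 words; minimum required is 5.']
```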

Methods:

  • check

    Return findings if the document word count is outside the configured bounds.

  • from_config

    Instantiate from a rule configuration dictionary.

Attributes:

  • id (str) –

    Rule identifier.

  • max_words (int | None) –

    Maximum word count allowed, or None for no upper bound.

  • min_words (int | None) –

    Minimum word count required, or None for no lower bound.

  • severity (str) –

    Severity of findings produced by this rule.

Source code in src/docchex/_internal/rules/builtin/word_count.py
def __init__(
    self,
    rule_id: str,
    min_words: int | None = None,
    max_words: int | None = None,
    severity: str = Severity.WARNING,
) -> None:
    self.id = rule_id
    self.min_words = min_words
    self.max_words = max_words
    self.severity = severity

id instance-attribute ¤

id: str = rule_id

Rule identifier.

max_words instance-attribute ¤

max_words: int | None = max_words

Maximum word count allowed, or None for no upper bound.

min_words instance-attribute ¤

min_words: int | None = min_words

Minimum word count required, or None for no lower bound.

severity instance-attribute ¤

severity: str = severity

Severity of findings produced by this rule.

check ¤

check(doc: Document) -> list[Finding]

Return findings if the document word count is outside the configured bounds.

Source code in src/docchex/_internal/rules/builtin/word_count.py
def check(self, doc: Document) -> list[Finding]:
    """Return findings if the document word count is outside the configured bounds."""
    count = len(doc.text.split())
    findings: list[Finding] = []
    if self.min_words is not None and count < self.min_words:
        findings.append(
            Finding(
                rule_id=self.id,
                severity=self.severity,
                message=f"Document has {count} words; minimum required is {self.min_words}.",
            ),
        )
    if self.max_words is not None and count > self.max_words:
        findings.append(
            Finding(
                rule_id=self.id,
                severity=self.severity,
                message=f"Document has {count} words; maximum allowed is {self.max_words}.",
            ),
        )
    return findings

from_config classmethod ¤

from_config(config: dict[str, Any]) -> WordCountRule

Instantiate from a rule configuration dictionary.

Source code in src/docchex/_internal/rules/builtin/word_count.py
@classmethod
def from_config(cls, config: dict[str, Any]) -> WordCountRule:
    """Instantiate from a rule configuration dictionary."""
    return cls(
        rule_id=config["id"],
        min_words=config.get("min"),
        max_words=config.get("max"),
        severity=config.get("severity", Severity.WARNING),
    )

get_parser ¤

get_parser() -> ArgumentParser

Return the CLI argument parser.

Returns:

  • ArgumentParser –

    An argparse parser.

Source code in src/docchex/_internal/cli.py
def get_parser() -> argparse.ArgumentParser:
    """Return the CLI argument parser.

    Returns:
        An argparse parser.
    """
    parser = argparse.ArgumentParser(prog="docchex")
    parser.add_argument("-V", "--version", action="version", version=f"%(prog)s {debug._get_version()}")
    parser.add_argument("--debug-info", action=_DebugInfo, help="Print debug information.")
    return parser

list_presets ¤

list_presets() -> list[str]

Return the names of all built-in rule presets.

Returns:

  • list[str] –

    A sorted list of preset names. Pass any name as "preset:<name>" to run_qaqc or RuleLoader.load.

Example:

    docchex.list_presets()
    # ['academic_paper', 'custom_template', 'letter_email', 'tech_report']

    run_qaqc("report.pdf", "preset:tech_report")
    run_qaqc("report.pdf", ["preset:tech_report", "my_extra_rules.yaml"])

Source code in src/docchex/__init__.py
def list_presets() -> list[str]:
    """Return the names of all built-in rule presets.

    Returns:
        A sorted list of preset names.
        Pass any name as ``"preset:<name>"`` to ``run_qaqc`` or ``RuleLoader.load``.

    Example:
        ```python
        docchex.list_presets()
        # ['academic_paper', 'custom_template', 'letter_email', 'tech_report']

        run_qaqc("report.pdf", "preset:tech_report")
        run_qaqc("report.pdf", ["preset:tech_report", "my_extra_rules.yaml"])
        ```
    """
    return _available_presets()

main ¤

main(args: list[str] | None = None) -> int

Run the main program.

This function is executed when you type docchex or python -m docchex.

Parameters:

  • args ¤

    (list[str] | None, default: None ) –

    Arguments passed from the command line.

Returns:

  • int

    An exit code.

Source code in src/docchex/_internal/cli.py
def main(args: list[str] | None = None) -> int:
    """Run the main program.

    This function is executed when you type `docchex` or `python -m docchex`.

    Parameters:
        args: Arguments passed from the command line.

    Returns:
        An exit code.
    """
    parser = get_parser()
    opts = parser.parse_args(args=args)
    print(opts)
    return 0

run_qaqc ¤

run_qaqc(
    document: str | Path,
    rules: _RulesArg,
    llm: LLMClient | None = None,
) -> dict[str, Any]

Run QA/QC checks on a document against a set of rules.

Parameters:

  • document ¤

    (str | Path) –

    Path to the document file (PDF or TXT supported).

  • rules ¤

    (_RulesArg) –

    One or more rule sources. Can be:

      • A path string or Path to a .yaml/.toml rules file
      • A preset name like "preset:tech_report" (see list_presets())
      • A list of rule dicts (including type: ai_check entries)
      • A list combining any of the above

  • llm ¤

    (LLMClient | None, default: None ) –

    Optional LLM client for ai_check rules (e.g. AnthropicClient()).

Returns:

  • dict[str, Any]

    A dict with keys: document, passed, summary, findings.

Source code in src/docchex/__init__.py
def run_qaqc(
    document: str | Path,
    rules: _RulesArg,
    llm: LLMClient | None = None,
) -> dict[str, Any]:
    """Run QA/QC checks on a document against a set of rules.

    Parameters:
        document: Path to the document file (PDF or TXT supported).
        rules: One or more rule sources. Can be:
            - A path string or ``Path`` to a ``.yaml``/``.toml`` rules file
            - A preset name like ``"preset:tech_report"`` (see ``list_presets()``)
            - A list of rule dicts (including ``type: ai_check`` entries)
            - A list combining any of the above
        llm: Optional LLM client for ``ai_check`` rules (e.g. ``AnthropicClient()``).

    Returns:
        A dict with keys: ``document``, ``passed``, ``summary``, ``findings``.
    """
    doc_path = Path(document)
    parsed_doc = DocumentParser.for_path(doc_path).parse(doc_path)
    loaded_rules = RuleLoader(llm=llm).load(rules)
    report = RuleEngine(loaded_rules).run(parsed_doc)
    return report.to_dict()