docchex ¤

docchex — document QA/QC engine.

Classes:

  • AICheckRule

    Checks a document against a custom prompt using an LLM.

  • AnthropicClient

    LLM client backed by the Anthropic API.

  • Document

    A parsed document ready for rule evaluation.

  • DocumentParser

    Abstract base class for document parsers.

  • Finding

    A single rule violation found in a document.

  • LLMClient

    Protocol for LLM providers used by AICheckRule.

  • LLMResponse

    Result returned by an LLM provider after evaluating a document.

  • OllamaClient

    LLM client that connects to a local Ollama server via its OpenAI-compatible API.

  • OpenAIClient

    LLM client backed by the OpenAI API (or any OpenAI-compatible endpoint).

  • PDFParser

    Parse PDF files into Document objects using pdfplumber.

  • Report

    The result of running a set of rules against a document.

  • RequiredSectionRule

    Checks that a required section heading is present in the document.

  • Rule

    Abstract base class for all docchex rules.

  • RuleEngine

    Runs a list of rules against a document and collects findings into a report.

  • RuleLoader

    Loads rules from YAML/TOML files, presets, or rule dictionaries.

  • Severity

    Constants for rule severity levels.

  • TextParser

    Parse plain-text (.txt) files into Document objects.

  • WordCountRule

    Checks that the document word count falls within optional min/max bounds.

Functions:

  • get_parser

    Return the CLI argument parser.

  • list_presets

    Return the names of all built-in rule presets.

  • main

    Run the main program.

  • run_qaqc

    Run QA/QC checks on a document against a set of rules.

AICheckRule ¤

AICheckRule(
    rule_id: str,
    prompt: str,
    severity: str = ERROR,
    llm: LLMClient | None = None,
)

Bases: Rule

Checks a document against a custom prompt using an LLM.

The LLM is expected to return JSON with {"passed": bool, "reason": str}. If the document fails the check, a finding is emitted with the reason as the message.
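
The pass/fail contract is the same JSON shape the built-in providers parse; a minimal stdlib sketch of that parsing (the rule id and reason text here are illustrative, not part of docchex):

```python
import json

# The LLM must reply with a JSON object: {"passed": bool, "reason": str}.
raw = '{"passed": false, "reason": "No risk assessment section found."}'

data = json.loads(raw)
passed = bool(data["passed"])         # mirrors the provider parsing
reason = str(data.get("reason", ""))  # reason defaults to "" if omitted

# A failed check becomes one finding; a passed check yields none.
findings = [] if passed else [{"rule_id": "risk-check", "message": reason}]
```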

Methods:

  • check

    Evaluate the document using the configured LLM and return any findings.

  • from_config

    Instantiate from a rule configuration dictionary.

Attributes:

  • id (str) –

    Rule identifier.

  • prompt (str) –

    The evaluation prompt sent to the LLM together with the document text.

  • severity (str) –

    Severity of findings produced when the document fails the check.

Source code in src/docchex/_internal/rules/builtin/ai_check.py
def __init__(
    self,
    rule_id: str,
    prompt: str,
    severity: str = Severity.ERROR,
    llm: LLMClient | None = None,
) -> None:
    self.id = rule_id
    self.prompt = prompt
    self.severity = severity
    self._llm = llm

id instance-attribute ¤

id: str = rule_id

Rule identifier.

prompt instance-attribute ¤

prompt: str = prompt

The evaluation prompt sent to the LLM together with the document text.

severity instance-attribute ¤

severity: str = severity

Severity of findings produced when the document fails the check.

check ¤

check(doc: Document) -> list[Finding]

Evaluate the document using the configured LLM and return any findings.

Raises:

  • RuntimeError –

    If no LLM client was provided.

Source code in src/docchex/_internal/rules/builtin/ai_check.py
def check(self, doc: Document) -> list[Finding]:
    """Evaluate the document using the configured LLM and return any findings.

    Raises:
        RuntimeError: If no LLM client was provided.
    """
    if self._llm is None:
        raise RuntimeError(
            f"Rule {self.id!r} requires an LLM client. "
            "Pass llm=... to RuleLoader or AICheckRule.",
        )
    from docchex._internal.models import Finding  # noqa: PLC0415

    result = self._llm.evaluate(doc, self.prompt)
    if result.passed:
        return []
    return [Finding(rule_id=self.id, severity=self.severity, message=result.reason)]

from_config classmethod ¤

from_config(
    config: dict[str, Any], llm: LLMClient | None = None
) -> AICheckRule

Instantiate from a rule configuration dictionary.

Parameters:

  • config ¤

    (dict[str, Any]) –

    Rule config dict with keys id, prompt, and optional severity.

  • llm ¤

    (LLMClient | None, default: None ) –

    LLM client to use for evaluation.

Source code in src/docchex/_internal/rules/builtin/ai_check.py
@classmethod
def from_config(
    cls,
    config: dict[str, Any],
    llm: LLMClient | None = None,
) -> AICheckRule:
    """Instantiate from a rule configuration dictionary.

    Parameters:
        config: Rule config dict with keys ``id``, ``prompt``, and optional ``severity``.
        llm: LLM client to use for evaluation.
    """
    return cls(
        rule_id=config["id"],
        prompt=config["prompt"],
        severity=config.get("severity", Severity.ERROR),
        llm=llm,
    )

AnthropicClient ¤

AnthropicClient(
    api_key: str | None = None, model: str = _DEFAULT_MODEL
)

LLM client backed by the Anthropic API.

Requires pip install docchex[anthropic].

Parameters:

  • api_key ¤

    (str | None, default: None ) –

    Anthropic API key. Defaults to the ANTHROPIC_API_KEY environment variable.

  • model ¤

    (str, default: _DEFAULT_MODEL ) –

    Model ID to use for evaluation.

Methods:

  • evaluate

    Send the document and prompt to Anthropic and return a structured result.

Source code in src/docchex/_internal/llm/providers/anthropic.py
def __init__(self, api_key: str | None = None, model: str = _DEFAULT_MODEL) -> None:
    """Initialise the client.

    Parameters:
        api_key: Anthropic API key. Defaults to the ``ANTHROPIC_API_KEY`` environment variable.
        model: Model ID to use for evaluation.
    """
    try:
        import anthropic  # ty: ignore[unresolved-import]  # noqa: PLC0415
    except ImportError as exc:
        raise ImportError(
            "anthropic package is required: pip install docchex[anthropic]",
        ) from exc
    self._client = anthropic.Anthropic(api_key=api_key)
    self._model = model

evaluate ¤

evaluate(doc: Document, prompt: str) -> LLMResponse

Send the document and prompt to Anthropic and return a structured result.

Source code in src/docchex/_internal/llm/providers/anthropic.py
def evaluate(self, doc: Document, prompt: str) -> LLMResponse:
    """Send the document and prompt to Anthropic and return a structured result."""
    text = doc.text[:32000]
    message = self._client.messages.create(
        model=self._model,
        max_tokens=256,
        system=_SYSTEM_PROMPT,
        messages=[{"role": "user", "content": f"{prompt}\n\nDocument:\n{text}"}],
    )
    data = json.loads(message.content[0].text)
    return LLMResponse(passed=bool(data["passed"]), reason=str(data.get("reason", "")))

Document dataclass ¤

Document(
    path: Path,
    text: str,
    pages: list[str],
    metadata: dict[str, Any] = dict(),
)

A parsed document ready for rule evaluation.

Attributes:

  • metadata (dict[str, Any]) –

    Optional metadata extracted from the document (e.g. PDF metadata).

  • pages (list[str]) –

    Text content split by page.

  • path (Path) –

    Path to the source file.

  • text (str) –

    Full text content of the document.

metadata class-attribute instance-attribute ¤

metadata: dict[str, Any] = field(default_factory=dict)

Optional metadata extracted from the document (e.g. PDF metadata).

pages instance-attribute ¤

pages: list[str]

Text content split by page.

path instance-attribute ¤

path: Path

Path to the source file.

text instance-attribute ¤

text: str

Full text content of the document.
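
The documented fields correspond to a plain dataclass; a stand-in sketch using only the standard library (the class name `DocumentSketch` and sample values are illustrative, not the real implementation):

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class DocumentSketch:
    """Stand-in mirroring the documented Document fields."""

    path: Path
    text: str
    pages: list[str]
    metadata: dict[str, Any] = field(default_factory=dict)


doc = DocumentSketch(path=Path("report.txt"), text="Intro\n\nBody", pages=["Intro", "Body"])
# metadata defaults to a fresh empty dict for each instance
```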

DocumentParser ¤

Bases: ABC

Abstract base class for document parsers.

Methods:

  • for_path

    Return the appropriate parser for the given file path based on its extension.

  • parse

    Parse the file at the given path into a Document.

for_path classmethod ¤

for_path(path: Path) -> DocumentParser

Return the appropriate parser for the given file path based on its extension.

Source code in src/docchex/_internal/parsing/base.py
@classmethod
def for_path(cls, path: Path) -> DocumentParser:
    """Return the appropriate parser for the given file path based on its extension."""
    from docchex._internal.parsing.pdf import PDFParser  # noqa: PLC0415
    from docchex._internal.parsing.text import TextParser  # noqa: PLC0415

    suffix = path.suffix.lower()
    if suffix == ".pdf":
        return PDFParser()
    if suffix == ".txt":
        return TextParser()
    raise ValueError(f"Unsupported file type: {suffix!r}")
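
Selection is a plain lower-cased suffix check, as the source above shows; the same dispatch in isolation, returning parser names as strings for illustration:

```python
from pathlib import Path


def parser_name_for(path: Path) -> str:
    # Mirrors for_path: the lower-cased extension decides the parser.
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        return "PDFParser"
    if suffix == ".txt":
        return "TextParser"
    raise ValueError(f"Unsupported file type: {suffix!r}")


print(parser_name_for(Path("Report.PDF")))  # case-insensitive: PDFParser
```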

parse abstractmethod ¤

parse(path: Path) -> Document

Parse the file at the given path into a Document.

Source code in src/docchex/_internal/parsing/base.py
@abstractmethod
def parse(self, path: Path) -> Document:
    """Parse the file at the given path into a Document."""
    ...

Finding dataclass ¤

Finding(
    rule_id: str,
    severity: str,
    message: str,
    location: str | None = None,
)

A single rule violation found in a document.

Attributes:

  • location (str | None) –

    Optional location reference within the document.

  • message (str) –

    Human-readable description of the violation.

  • rule_id (str) –

    ID of the rule that produced this finding.

  • severity (str) –

    Severity level: "error", "warning", or "info".

location class-attribute instance-attribute ¤

location: str | None = None

Optional location reference within the document.

message instance-attribute ¤

message: str

Human-readable description of the violation.

rule_id instance-attribute ¤

rule_id: str

ID of the rule that produced this finding.

severity instance-attribute ¤

severity: str

Severity level: "error", "warning", or "info".

LLMClient ¤

Bases: Protocol

Protocol for LLM providers used by AICheckRule.

Any object that implements evaluate(doc, prompt) -> LLMResponse satisfies this protocol — no subclassing required.

Built-in providers¤

  • AnthropicClient — Anthropic API (pip install docchex[anthropic])
  • OpenAIClient — OpenAI API or any OpenAI-compatible endpoint (pip install docchex[openai])
  • OllamaClient — local Ollama server via the OpenAI-compatible API (pip install docchex[ollama])

Custom providers¤

Implement a custom provider by defining a class with the evaluate method:

class MyClient:
    def evaluate(self, doc: Document, prompt: str) -> LLMResponse:
        ...
        return LLMResponse(passed=True, reason="All good")

loader = RuleLoader(llm=MyClient())

Future extensibility¤

The current design supports single-call, stateless checks: one prompt is sent to the LLM and a pass/fail result is returned. If multi-step reasoning becomes necessary (e.g. agent loops, tool calling, or structured-output retries), the natural approach is to implement a richer LLMClient that encapsulates that logic internally — the rest of the pipeline (AICheckRule, RuleLoader, run_qaqc) stays unchanged. For that use case, litellm (https://docs.litellm.ai; lightweight, 100+ providers) or LangChain (https://python.langchain.com; full agent orchestration) are good building blocks to wrap inside a custom LLMClient.

Methods:

  • evaluate

    Evaluate the document against the given prompt and return a structured result.

evaluate ¤

evaluate(doc: Document, prompt: str) -> LLMResponse

Evaluate the document against the given prompt and return a structured result.

Source code in src/docchex/_internal/llm/base.py
def evaluate(self, doc: Document, prompt: str) -> LLMResponse:
    """Evaluate the document against the given prompt and return a structured result."""
    ...

LLMResponse dataclass ¤

LLMResponse(passed: bool, reason: str)

Result returned by an LLM provider after evaluating a document.

Attributes:

  • passed (bool) –

    Whether the document passed the check.

  • reason (str) –

    Human-readable explanation of the result.

passed instance-attribute ¤

passed: bool

Whether the document passed the check.

reason instance-attribute ¤

reason: str

Human-readable explanation of the result.

OllamaClient ¤

OllamaClient(
    model: str = _DEFAULT_MODEL,
    base_url: str = _DEFAULT_BASE_URL,
)

LLM client that connects to a local Ollama server via its OpenAI-compatible API.

Requires pip install docchex[ollama] (installs the openai package).

Parameters:

  • model ¤

    (str, default: _DEFAULT_MODEL ) –

    Ollama model name (e.g. "llama3.2", "mistral").

  • base_url ¤

    (str, default: _DEFAULT_BASE_URL ) –

    Base URL of the Ollama server.

Methods:

  • evaluate

    Send the document and prompt to Ollama and return a structured result.

Source code in src/docchex/_internal/llm/providers/ollama.py
def __init__(self, model: str = _DEFAULT_MODEL, base_url: str = _DEFAULT_BASE_URL) -> None:
    """Initialise the client.

    Parameters:
        model: Ollama model name (e.g. ``"llama3.2"``, ``"mistral"``).
        base_url: Base URL of the Ollama server.
    """
    from docchex._internal.llm.providers.openai import OpenAIClient  # noqa: PLC0415

    self._inner = OpenAIClient(api_key="ollama", model=model, base_url=base_url)

evaluate ¤

evaluate(doc: Document, prompt: str) -> LLMResponse

Send the document and prompt to Ollama and return a structured result.

Source code in src/docchex/_internal/llm/providers/ollama.py
def evaluate(self, doc: Document, prompt: str) -> LLMResponse:
    """Send the document and prompt to Ollama and return a structured result."""
    return self._inner.evaluate(doc, prompt)

OpenAIClient ¤

OpenAIClient(
    api_key: str | None = None,
    model: str = _DEFAULT_MODEL,
    base_url: str | None = None,
)

LLM client backed by the OpenAI API (or any OpenAI-compatible endpoint).

Requires pip install docchex[openai].

Parameters:

  • api_key ¤

    (str | None, default: None ) –

    OpenAI API key. Defaults to the OPENAI_API_KEY environment variable.

  • model ¤

    (str, default: _DEFAULT_MODEL ) –

    Model ID to use for evaluation.

  • base_url ¤

    (str | None, default: None ) –

    Override the API base URL (e.g. for a local Ollama server).

Methods:

  • evaluate

    Send the document and prompt to OpenAI and return a structured result.

Source code in src/docchex/_internal/llm/providers/openai.py
def __init__(
    self,
    api_key: str | None = None,
    model: str = _DEFAULT_MODEL,
    base_url: str | None = None,
) -> None:
    """Initialise the client.

    Parameters:
        api_key: OpenAI API key. Defaults to the ``OPENAI_API_KEY`` environment variable.
        model: Model ID to use for evaluation.
        base_url: Override the API base URL (e.g. for a local Ollama server).
    """
    try:
        import openai  # ty: ignore[unresolved-import]  # noqa: PLC0415
    except ImportError as exc:
        raise ImportError(
            "openai package is required: pip install docchex[openai]",
        ) from exc
    self._client = openai.OpenAI(api_key=api_key, base_url=base_url)
    self._model = model

evaluate ¤

evaluate(doc: Document, prompt: str) -> LLMResponse

Send the document and prompt to OpenAI and return a structured result.

Source code in src/docchex/_internal/llm/providers/openai.py
def evaluate(self, doc: Document, prompt: str) -> LLMResponse:
    """Send the document and prompt to OpenAI and return a structured result."""
    text = doc.text[:32000]
    response = self._client.chat.completions.create(
        model=self._model,
        messages=[
            {"role": "system", "content": _SYSTEM_PROMPT},
            {"role": "user", "content": f"{prompt}\n\nDocument:\n{text}"},
        ],
        max_tokens=256,
    )
    data = json.loads(response.choices[0].message.content)
    return LLMResponse(passed=bool(data["passed"]), reason=str(data.get("reason", "")))

PDFParser ¤

Bases: DocumentParser

Parse PDF files into Document objects using pdfplumber.

Methods:

  • for_path

    Return the appropriate parser for the given file path based on its extension.

  • parse

    Parse a PDF file into a Document, extracting text and metadata.

for_path classmethod ¤

for_path(path: Path) -> DocumentParser

Return the appropriate parser for the given file path based on its extension.

Source code in src/docchex/_internal/parsing/base.py
@classmethod
def for_path(cls, path: Path) -> DocumentParser:
    """Return the appropriate parser for the given file path based on its extension."""
    from docchex._internal.parsing.pdf import PDFParser  # noqa: PLC0415
    from docchex._internal.parsing.text import TextParser  # noqa: PLC0415

    suffix = path.suffix.lower()
    if suffix == ".pdf":
        return PDFParser()
    if suffix == ".txt":
        return TextParser()
    raise ValueError(f"Unsupported file type: {suffix!r}")

parse ¤

parse(path: Path) -> Document

Parse a PDF file into a Document, extracting text and metadata.

Source code in src/docchex/_internal/parsing/pdf.py
def parse(self, path: Path) -> Document:
    """Parse a PDF file into a Document, extracting text and metadata."""
    try:
        import pdfplumber  # noqa: PLC0415
    except ImportError as exc:
        raise ImportError("pdfplumber is required to parse PDF files: pip install pdfplumber") from exc

    pages: list[str] = []
    metadata: dict = {}

    with pdfplumber.open(path) as pdf:
        metadata = pdf.metadata or {}
        pages.extend(page.extract_text() or "" for page in pdf.pages)

    text = "\n\n".join(pages)
    return Document(path=path, text=text, pages=pages, metadata=metadata)

Report dataclass ¤

Report(document_path: str, findings: list[Finding])

The result of running a set of rules against a document.

Methods:

  • to_dict

    Serialise the report to a plain dictionary.

Attributes:

  • document_path (str) –

    Path to the evaluated document.

  • findings (list[Finding]) –

    All findings produced by the rule engine.

  • passed (bool) –

    True if no error-severity findings were produced.

  • summary (dict[str, int]) –

    Finding counts grouped by severity.

document_path instance-attribute ¤

document_path: str

Path to the evaluated document.

findings instance-attribute ¤

findings: list[Finding]

All findings produced by the rule engine.

passed property ¤

passed: bool

True if no error-severity findings were produced.

summary property ¤

summary: dict[str, int]

Finding counts grouped by severity.
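
Both properties are simple aggregations over the findings; a stdlib sketch of equivalent logic (plain dicts stand in for Finding objects, and the real implementation may differ in detail):

```python
from collections import Counter

findings = [
    {"severity": "warning"},
    {"severity": "error"},
    {"severity": "warning"},
]

# summary: finding counts grouped by severity
summary = dict(Counter(f["severity"] for f in findings))

# passed: only error-severity findings block a pass
passed = summary.get("error", 0) == 0

print(summary)  # {'warning': 2, 'error': 1}
print(passed)   # False
```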

to_dict ¤

to_dict() -> dict[str, Any]

Serialise the report to a plain dictionary.

Source code in src/docchex/_internal/models.py
def to_dict(self) -> dict[str, Any]:
    """Serialise the report to a plain dictionary."""
    return {
        "document": self.document_path,
        "passed": self.passed,
        "summary": self.summary,
        "findings": [
            {
                "rule_id": f.rule_id,
                "severity": f.severity,
                "message": f.message,
                "location": f.location,
            }
            for f in self.findings
        ],
    }

RequiredSectionRule ¤

RequiredSectionRule(
    rule_id: str, match: str, severity: str = ERROR
)

Bases: Rule

Checks that a required section heading is present in the document.
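
The check is a case-insensitive substring search over the full document text, mirroring the source shown in this section; the same test in isolation (the sample headings are illustrative):

```python
def missing_section(text: str, match: str) -> bool:
    # Mirrors the rule: case-insensitive substring search over the full text.
    return match.lower() not in text.lower()


doc_text = "1. INTRODUCTION\n\n2. Methods\n\n3. Conclusion"
print(missing_section(doc_text, "Introduction"))  # False: found despite casing
print(missing_section(doc_text, "References"))    # True: would trigger a finding
```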

Methods:

  • check

    Return a finding if the required section is absent from the document.

  • from_config

    Instantiate from a rule configuration dictionary.

Attributes:

  • id (str) –

    Rule identifier.

  • match (str) –

    The section heading text to search for (case-insensitive).

  • severity (str) –

    Severity of the finding when the section is missing.

Source code in src/docchex/_internal/rules/builtin/required_section.py
def __init__(self, rule_id: str, match: str, severity: str = Severity.ERROR) -> None:
    self.id = rule_id
    self.match = match
    self.severity = severity

id instance-attribute ¤

id: str = rule_id

Rule identifier.

match instance-attribute ¤

match: str = match

The section heading text to search for (case-insensitive).

severity instance-attribute ¤

severity: str = severity

Severity of the finding when the section is missing.

check ¤

check(doc: Document) -> list[Finding]

Return a finding if the required section is absent from the document.

Source code in src/docchex/_internal/rules/builtin/required_section.py
def check(self, doc: Document) -> list[Finding]:
    """Return a finding if the required section is absent from the document."""
    if self.match.lower() not in doc.text.lower():
        return [
            Finding(
                rule_id=self.id,
                severity=self.severity,
                message=f"Required section not found: {self.match!r}",
            ),
        ]
    return []

from_config classmethod ¤

from_config(config: dict[str, Any]) -> RequiredSectionRule

Instantiate from a rule configuration dictionary.

Source code in src/docchex/_internal/rules/builtin/required_section.py
@classmethod
def from_config(cls, config: dict[str, Any]) -> RequiredSectionRule:
    """Instantiate from a rule configuration dictionary."""
    return cls(
        rule_id=config["id"],
        match=config["match"],
        severity=config.get("severity", Severity.ERROR),
    )

Rule ¤

Bases: ABC

Abstract base class for all docchex rules.

Methods:

  • check

    Check the document and return any findings.

  • from_config

    Instantiate the rule from a configuration dictionary.

Attributes:

  • id (str) –

    Unique identifier for this rule.

  • severity (str) –

    Default severity level for findings produced by this rule.

id instance-attribute ¤

id: str

Unique identifier for this rule.

severity class-attribute instance-attribute ¤

severity: str = WARNING

Default severity level for findings produced by this rule.

check abstractmethod ¤

check(doc: Document) -> list[Finding]

Check the document and return any findings.

Source code in src/docchex/_internal/rules/base.py
@abstractmethod
def check(self, doc: Document) -> list[Finding]:
    """Check the document and return any findings."""
    ...

from_config classmethod ¤

from_config(config: dict[str, Any]) -> Rule

Instantiate the rule from a configuration dictionary.

Source code in src/docchex/_internal/rules/base.py
@classmethod
def from_config(cls, config: dict[str, Any]) -> Rule:
    """Instantiate the rule from a configuration dictionary."""
    raise NotImplementedError(f"{cls.__name__} does not implement from_config")

RuleEngine ¤

RuleEngine(rules: list[Rule])

Runs a list of rules against a document and collects findings into a report.

Methods:

  • run

    Run all rules against the document and return a consolidated report.

Attributes:

  • rules (list[Rule]) –

    The rules applied by this engine.

Source code in src/docchex/_internal/evaluation/engine.py
def __init__(self, rules: list[Rule]) -> None:
    self.rules = rules

rules instance-attribute ¤

rules: list[Rule] = rules

The rules applied by this engine.

run ¤

run(doc: Document) -> Report

Run all rules against the document and return a consolidated report.

Source code in src/docchex/_internal/evaluation/engine.py
def run(self, doc: Document) -> Report:
    """Run all rules against the document and return a consolidated report."""
    findings: list[Finding] = []
    for rule in self.rules:
        findings.extend(rule.check(doc))
    return Report(document_path=str(doc.path), findings=findings)
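
RuleEngine.run simply flattens each rule's findings in order; a stdlib sketch with stub rules (StubRule and the string findings are illustrative stand-ins):

```python
class StubRule:
    """Minimal stand-in: any object with check(doc) -> list works here."""

    def __init__(self, findings):
        self._findings = findings

    def check(self, doc):
        return list(self._findings)


def run(rules, doc):
    # Same aggregation as RuleEngine.run: flatten all findings in rule order.
    findings = []
    for rule in rules:
        findings.extend(rule.check(doc))
    return findings


all_findings = run([StubRule(["a"]), StubRule([]), StubRule(["b", "c"])], doc=None)
print(all_findings)  # ['a', 'b', 'c']
```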

RuleLoader ¤

RuleLoader(llm: LLMClient | None = None)

Loads rules from YAML/TOML files, presets, or rule dictionaries.

Parameters:

  • llm ¤

    (LLMClient | None, default: None ) –

    Optional LLM client injected into any ai_check rules that are loaded.

Methods:

  • load

    Load rules from one or more sources.

Source code in src/docchex/_internal/rules/loader.py
def __init__(self, llm: LLMClient | None = None) -> None:
    """Initialise the loader.

    Parameters:
        llm: Optional LLM client injected into any ``ai_check`` rules that are loaded.
    """
    self._llm = llm

load ¤

load(source: _RuleSource | list[_RuleSource]) -> list[Rule]

Load rules from one or more sources.

A single source can be:

  • A path string or Path to a .yaml/.yml/.toml file
  • A preset shorthand string like "preset:tech_report"
  • A list of rule dicts

Multiple sources can be combined by passing a list of any of the above. A list whose first element is not a dict is treated as a list of sources.
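
The "list of sources vs. list of rule dicts" distinction hinges on the first element; the same dispatch extracted into a standalone function:

```python
def is_list_of_sources(source) -> bool:
    # A list whose first element is not a dict is a list of sources;
    # an empty list is also treated as (an empty) list of sources.
    return isinstance(source, list) and (not source or not isinstance(source[0], dict))


print(is_list_of_sources(["preset:tech_report", "extra.yaml"]))  # True: two sources
print(is_list_of_sources([{"id": "wc", "type": "word_count"}]))  # False: rule dicts
print(is_list_of_sources("rules.yaml"))                          # False: single source
```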

Source code in src/docchex/_internal/rules/loader.py
def load(self, source: _RuleSource | list[_RuleSource]) -> list[Rule]:
    """Load rules from one or more sources.

    A single source can be:
    - A path string or Path to a ``.yaml``/``.yml``/``.toml`` file
    - A preset shorthand string like ``"preset:tech_report"``
    - A list of rule dicts

    Multiple sources can be combined by passing a list of any of the above.
    A list whose first element is not a dict is treated as a list of sources.
    """
    if isinstance(source, list) and (not source or not isinstance(source[0], dict)):
        rules: list[Rule] = []
        for s in source:
            rules.extend(self._load_single(s))  # ty: ignore[invalid-argument-type]
        return rules
    return self._load_single(source)  # ty: ignore[invalid-argument-type]

Severity ¤

Constants for rule severity levels.

Attributes:

  • ERROR

    Blocks the document from passing.

  • INFO

    Informational finding only.

  • WARNING

    Non-blocking issue worth noting.

ERROR class-attribute instance-attribute ¤

ERROR = 'error'

Blocks the document from passing.

INFO class-attribute instance-attribute ¤

INFO = 'info'

Informational finding only.

WARNING class-attribute instance-attribute ¤

WARNING = 'warning'

Non-blocking issue worth noting.

TextParser ¤

Bases: DocumentParser

Parse plain-text (.txt) files into Document objects.

Pages are split on double newlines, mirroring PDFParser's convention. Used primarily for eval fixtures to keep CI dependency-free.
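
The page split drops blank chunks, as the parse source in this section shows; the same transformation standalone (the sample text is illustrative):

```python
text = "Title page\n\n\n\nBody text here.\n\nAppendix"

# Split on double newlines, strip each chunk, and drop empty chunks.
pages = [p.strip() for p in text.split("\n\n") if p.strip()]
if not pages:  # an empty file still yields one (empty) page
    pages = [""]

print(pages)  # ['Title page', 'Body text here.', 'Appendix']
```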

Methods:

  • for_path

    Return the appropriate parser for the given file path based on its extension.

  • parse

    Parse a plain-text file into a Document, splitting pages on double newlines.

for_path classmethod ¤

for_path(path: Path) -> DocumentParser

Return the appropriate parser for the given file path based on its extension.

Source code in src/docchex/_internal/parsing/base.py
@classmethod
def for_path(cls, path: Path) -> DocumentParser:
    """Return the appropriate parser for the given file path based on its extension."""
    from docchex._internal.parsing.pdf import PDFParser  # noqa: PLC0415
    from docchex._internal.parsing.text import TextParser  # noqa: PLC0415

    suffix = path.suffix.lower()
    if suffix == ".pdf":
        return PDFParser()
    if suffix == ".txt":
        return TextParser()
    raise ValueError(f"Unsupported file type: {suffix!r}")

parse ¤

parse(path: Path) -> Document

Parse a plain-text file into a Document, splitting pages on double newlines.

Source code in src/docchex/_internal/parsing/text.py
def parse(self, path: Path) -> Document:
    """Parse a plain-text file into a Document, splitting pages on double newlines."""
    text = path.read_text(encoding="utf-8")
    pages = [p.strip() for p in text.split("\n\n") if p.strip()]
    if not pages:
        pages = [""]
    return Document(path=path, text=text, pages=pages, metadata={})

WordCountRule ¤

WordCountRule(
    rule_id: str,
    min_words: int | None = None,
    max_words: int | None = None,
    severity: str = WARNING,
)

Bases: Rule

Checks that the document word count falls within optional min/max bounds.
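
Words are whitespace-separated tokens (str.split with no arguments); a standalone sketch of the bounds check mirroring the rule's messages:

```python
def word_count_findings(text: str, min_words=None, max_words=None) -> list[str]:
    # Mirrors the rule: count whitespace-separated tokens, then check bounds.
    count = len(text.split())
    messages = []
    if min_words is not None and count < min_words:
        messages.append(f"Document has {count} words; minimum required is {min_words}.")
    if max_words is not None and count > max_words:
        messages.append(f"Document has {count} words; maximum allowed is {max_words}.")
    return messages


print(word_count_findings("one two three", min_words=5))
# ['Document has 3 words; minimum required is 5.']
```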

Methods:

  • check

    Return findings if the document word count is outside the configured bounds.

  • from_config

    Instantiate from a rule configuration dictionary.

Attributes:

  • id (str) –

    Rule identifier.

  • max_words (int | None) –

    Maximum word count allowed, or None for no upper bound.

  • min_words (int | None) –

    Minimum word count required, or None for no lower bound.

  • severity (str) –

    Severity of findings produced by this rule.

Source code in src/docchex/_internal/rules/builtin/word_count.py
def __init__(
    self,
    rule_id: str,
    min_words: int | None = None,
    max_words: int | None = None,
    severity: str = Severity.WARNING,
) -> None:
    self.id = rule_id
    self.min_words = min_words
    self.max_words = max_words
    self.severity = severity

id instance-attribute ¤

id: str = rule_id

Rule identifier.

max_words instance-attribute ¤

max_words: int | None = max_words

Maximum word count allowed, or None for no upper bound.

min_words instance-attribute ¤

min_words: int | None = min_words

Minimum word count required, or None for no lower bound.

severity instance-attribute ¤

severity: str = severity

Severity of findings produced by this rule.

check ¤

check(doc: Document) -> list[Finding]

Return findings if the document word count is outside the configured bounds.

Source code in src/docchex/_internal/rules/builtin/word_count.py
def check(self, doc: Document) -> list[Finding]:
    """Return findings if the document word count is outside the configured bounds."""
    count = len(doc.text.split())
    findings: list[Finding] = []
    if self.min_words is not None and count < self.min_words:
        findings.append(
            Finding(
                rule_id=self.id,
                severity=self.severity,
                message=f"Document has {count} words; minimum required is {self.min_words}.",
            ),
        )
    if self.max_words is not None and count > self.max_words:
        findings.append(
            Finding(
                rule_id=self.id,
                severity=self.severity,
                message=f"Document has {count} words; maximum allowed is {self.max_words}.",
            ),
        )
    return findings

from_config classmethod ¤

from_config(config: dict[str, Any]) -> WordCountRule

Instantiate from a rule configuration dictionary.

Source code in src/docchex/_internal/rules/builtin/word_count.py
@classmethod
def from_config(cls, config: dict[str, Any]) -> WordCountRule:
    """Instantiate from a rule configuration dictionary."""
    return cls(
        rule_id=config["id"],
        min_words=config.get("min"),
        max_words=config.get("max"),
        severity=config.get("severity", Severity.WARNING),
    )

get_parser ¤

get_parser() -> ArgumentParser

Return the CLI argument parser.

Returns:

  • ArgumentParser –

    An argparse parser.

Source code in src/docchex/_internal/cli.py
def get_parser() -> argparse.ArgumentParser:
    """Return the CLI argument parser.

    Returns:
        An argparse parser.
    """
    parser = argparse.ArgumentParser(prog="docchex")
    parser.add_argument("-V", "--version", action="version", version=f"%(prog)s {debug._get_version()}")
    parser.add_argument("--debug-info", action=_DebugInfo, help="Print debug information.")
    return parser

list_presets ¤

list_presets() -> list[str]

Return the names of all built-in rule presets.

Returns:

  • list[str] –

    A sorted list of preset names. Pass any name as "preset:<name>" to run_qaqc or RuleLoader.load.

Example:

    docchex.list_presets()
    # ['academic_paper', 'custom_template', 'letter_email', 'tech_report']

    run_qaqc("report.pdf", "preset:tech_report")
    run_qaqc("report.pdf", ["preset:tech_report", "my_extra_rules.yaml"])

Source code in src/docchex/__init__.py
def list_presets() -> list[str]:
    """Return the names of all built-in rule presets.

    Returns:
        A sorted list of preset names.
        Pass any name as ``"preset:<name>"`` to ``run_qaqc`` or ``RuleLoader.load``.

    Example:
        ```python
        docchex.list_presets()
        # ['academic_paper', 'custom_template', 'letter_email', 'tech_report']

        run_qaqc("report.pdf", "preset:tech_report")
        run_qaqc("report.pdf", ["preset:tech_report", "my_extra_rules.yaml"])
        ```
    """
    return _available_presets()

main ¤

main(args: list[str] | None = None) -> int

Run the main program.

This function is executed when you type docchex or python -m docchex.

Parameters:

  • args ¤

    (list[str] | None, default: None ) –

    Arguments passed from the command line.

Returns:

  • int

    An exit code.

Source code in src/docchex/_internal/cli.py
def main(args: list[str] | None = None) -> int:
    """Run the main program.

    This function is executed when you type `docchex` or `python -m docchex`.

    Parameters:
        args: Arguments passed from the command line.

    Returns:
        An exit code.
    """
    parser = get_parser()
    opts = parser.parse_args(args=args)
    print(opts)
    return 0

run_qaqc ¤

run_qaqc(
    document: str | Path,
    rules: _RulesArg,
    llm: LLMClient | None = None,
) -> dict[str, Any]

Run QA/QC checks on a document against a set of rules.

Parameters:

  • document ¤

    (str | Path) –

    Path to the document file (PDF or TXT supported).

  • rules ¤

    (_RulesArg) –

    One or more rule sources. Can be:

      • A path string or Path to a .yaml/.toml rules file
      • A preset name like "preset:tech_report" (see list_presets())
      • A list of rule dicts (including type: ai_check entries)
      • A list combining any of the above

  • llm ¤

    (LLMClient | None, default: None ) –

    Optional LLM client for ai_check rules (e.g. AnthropicClient()).

Returns:

  • dict[str, Any]

    A dict with keys: document, passed, summary, findings.

Source code in src/docchex/__init__.py
def run_qaqc(
    document: str | Path,
    rules: _RulesArg,
    llm: LLMClient | None = None,
) -> dict[str, Any]:
    """Run QA/QC checks on a document against a set of rules.

    Parameters:
        document: Path to the document file (PDF or TXT supported).
        rules: One or more rule sources. Can be:
            - A path string or ``Path`` to a ``.yaml``/``.toml`` rules file
            - A preset name like ``"preset:tech_report"`` (see ``list_presets()``)
            - A list of rule dicts (including ``type: ai_check`` entries)
            - A list combining any of the above
        llm: Optional LLM client for ``ai_check`` rules (e.g. ``AnthropicClient()``).

    Returns:
        A dict with keys: ``document``, ``passed``, ``summary``, ``findings``.
    """
    doc_path = Path(document)
    parsed_doc = DocumentParser.for_path(doc_path).parse(doc_path)
    loaded_rules = RuleLoader(llm=llm).load(rules)
    report = RuleEngine(loaded_rules).run(parsed_doc)
    return report.to_dict()