docchex
docchex — document QA/QC engine.
Classes:
- AICheckRule – Checks a document against a custom prompt using an LLM.
- AnthropicClient – LLM client backed by the Anthropic API.
- Document – A parsed document ready for rule evaluation.
- DocumentParser – Abstract base class for document parsers.
- Finding – A single rule violation found in a document.
- LLMClient – Protocol for LLM providers used by AICheckRule.
- LLMResponse – Result returned by an LLM provider after evaluating a document.
- OllamaClient – LLM client that connects to a local Ollama server via its OpenAI-compatible API.
- OpenAIClient – LLM client backed by the OpenAI API (or any OpenAI-compatible endpoint).
- PDFParser – Parse PDF files into Document objects using pdfplumber.
- Report – The result of running a set of rules against a document.
- RequiredSectionRule – Checks that a required section heading is present in the document.
- Rule – Abstract base class for all docchex rules.
- RuleEngine – Runs a list of rules against a document and collects findings into a report.
- RuleLoader
- Severity – Constants for rule severity levels.
- TextParser – Parse plain-text (.txt) files into Document objects.
- WordCountRule – Checks that the document word count falls within optional min/max bounds.
Functions:
- get_parser – Return the CLI argument parser.
- list_presets – Return the names of all built-in rule presets.
- main – Run the main program.
- run_qaqc – Run QA/QC checks on a document against a set of rules.
AICheckRule
Bases: Rule
Checks a document against a custom prompt using an LLM.
The LLM is expected to return JSON with {"passed": bool, "reason": str}.
If the document fails the check, a finding is emitted with the reason as the message.
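A minimal sketch of how the JSON contract above could be turned into a finding. The Finding stand-in, its field names, and the helper verdict_to_findings are assumptions made for illustration; they are not taken from the docchex source.

```python
import json
from dataclasses import dataclass


@dataclass
class Finding:
    """Hypothetical stand-in for docchex.Finding; field names are assumed."""
    rule_id: str
    severity: str
    message: str


def verdict_to_findings(raw: str, rule_id: str, severity: str) -> list[Finding]:
    """Turn an LLM reply of the form {"passed": bool, "reason": str} into findings."""
    verdict = json.loads(raw)
    if verdict["passed"]:
        return []  # the document passed, so nothing is reported
    # On failure, the LLM's reason becomes the finding message.
    return [Finding(rule_id=rule_id, severity=severity, message=verdict["reason"])]


ok = verdict_to_findings('{"passed": true, "reason": "Looks fine"}', "tone", "warning")
bad = verdict_to_findings('{"passed": false, "reason": "No summary"}', "tone", "warning")
```

Here ok is an empty list, while bad holds a single finding whose message is the reason returned by the LLM.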
Methods:
- check – Evaluate the document using the configured LLM and return any findings.
- from_config – Instantiate from a rule configuration dictionary.
Attributes:
- id (str) – Rule identifier.
- prompt (str) – The evaluation prompt sent to the LLM together with the document text.
- severity (str) – Severity of findings produced when the document fails the check.
Source code in src/docchex/_internal/rules/builtin/ai_check.py
prompt
instance-attribute
prompt: str = prompt
The evaluation prompt sent to the LLM together with the document text.
severity
instance-attribute
severity: str = severity
Severity of findings produced when the document fails the check.
check
Evaluate the document using the configured LLM and return any findings.
Raises:
- RuntimeError – If no LLM client was provided.
Source code in src/docchex/_internal/rules/builtin/ai_check.py
from_config
classmethod
Instantiate from a rule configuration dictionary.
Parameters:
- config (dict[str, Any]) – Rule config dict with keys id, prompt, and optional severity.
- llm (LLMClient | None, default: None) – LLM client to use for evaluation.
Source code in src/docchex/_internal/rules/builtin/ai_check.py
AnthropicClient
LLM client backed by the Anthropic API.
Requires pip install docchex[anthropic].
Parameters:
- api_key (str | None, default: None) – Anthropic API key. Defaults to the ANTHROPIC_API_KEY environment variable.
- model (str, default: _DEFAULT_MODEL) – Model ID to use for evaluation.
Methods:
- evaluate – Send the document and prompt to Anthropic and return a structured result.
Source code in src/docchex/_internal/llm/providers/anthropic.py
evaluate
evaluate(doc: Document, prompt: str) -> LLMResponse
Send the document and prompt to Anthropic and return a structured result.
Source code in src/docchex/_internal/llm/providers/anthropic.py
Document
dataclass
A parsed document ready for rule evaluation.
Attributes:
- metadata (dict[str, Any]) – Optional metadata extracted from the document (e.g. PDF metadata).
- pages (list[str]) – Text content split by page.
- path (Path) – Path to the source file.
- text (str) – Full text content of the document.
metadata
class-attribute
instance-attribute
Optional metadata extracted from the document (e.g. PDF metadata).
DocumentParser
Bases: ABC
Abstract base class for document parsers.
Methods:
- for_path – Return the appropriate parser for the given file path based on its extension.
- parse – Parse the file at the given path into a Document.
for_path
classmethod
for_path(path: Path) -> DocumentParser
Return the appropriate parser for the given file path based on its extension.
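Extension-based dispatch of this kind can be sketched as a lookup table keyed on the file suffix. The stub classes, the PARSERS_BY_SUFFIX mapping, the case-insensitive match, and the ValueError for unknown extensions are all assumptions for illustration; the actual logic in docchex may differ.

```python
from pathlib import Path


class PdfParserStub:
    """Hypothetical stand-in for PDFParser."""


class TextParserStub:
    """Hypothetical stand-in for TextParser."""


# Assumed mapping; the real registry inside docchex may differ.
PARSERS_BY_SUFFIX = {".pdf": PdfParserStub, ".txt": TextParserStub}


def parser_for_path(path: Path):
    """Pick a parser class from the file extension, ignoring case."""
    suffix = path.suffix.lower()
    try:
        return PARSERS_BY_SUFFIX[suffix]()
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix!r}") from None


parser = parser_for_path(Path("report.PDF"))
```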
Source code in src/docchex/_internal/parsing/base.py
Finding
dataclass
A single rule violation found in a document.
Attributes:
LLMClient
Bases: Protocol
Protocol for LLM providers used by AICheckRule.
Any object that implements evaluate(doc, prompt) -> LLMResponse satisfies this
protocol — no subclassing required.
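Structural typing of this kind can be illustrated with typing.Protocol. In the sketch below, LLMResponse and LLMClient are local stand-ins with assumed shapes, and the runtime_checkable decorator is added only so that isinstance can demonstrate the match; whether the docchex protocol is runtime-checkable is not stated here.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class LLMResponse:
    """Stand-in with the two fields used in this page's examples."""
    passed: bool
    reason: str


@runtime_checkable
class LLMClient(Protocol):
    """Assumed shape of the protocol: a single evaluate method."""

    def evaluate(self, doc: object, prompt: str) -> LLMResponse: ...


class AlwaysPass:
    """Does not inherit from LLMClient, yet matches it structurally."""

    def evaluate(self, doc: object, prompt: str) -> LLMResponse:
        return LLMResponse(passed=True, reason="All good")


class NotAClient:
    """Has no evaluate method, so it does not satisfy the protocol."""


matches = isinstance(AlwaysPass(), LLMClient)
rejects = isinstance(NotAClient(), LLMClient)
```

Note that a runtime isinstance check against a protocol only verifies that the method exists, not that its signature or return type is correct; a static type checker covers those.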
Built-in providers
- AnthropicClient – Anthropic API (pip install docchex[anthropic])
- OpenAIClient – OpenAI API or any OpenAI-compatible endpoint (pip install docchex[openai])
- OllamaClient – local Ollama server via the OpenAI-compatible API (pip install docchex[ollama])
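Custom providers
Implement a custom provider by defining a class with the evaluate method:

    from docchex import Document, LLMResponse, RuleLoader

    class MyClient:
        def evaluate(self, doc: Document, prompt: str) -> LLMResponse:
            ...  # call your own model or service here
            return LLMResponse(passed=True, reason="All good")

    # Inject the custom client into any ai_check rules that get loaded.
    loader = RuleLoader(llm=MyClient())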
Future extensibility
The current design supports single-call, stateless checks: one prompt is sent to the
LLM and a pass/fail result is returned. If multi-step reasoning becomes necessary
(e.g. agent loops, tool calling, or structured-output retries), the natural approach
is to implement a richer LLMClient that encapsulates that logic internally, so that
the rest of the pipeline (AICheckRule, RuleLoader, run_qaqc) stays
unchanged. For that use case, litellm (https://docs.litellm.ai, lightweight,
100+ providers) or LangChain (https://python.langchain.com, full agent
orchestration) are good building blocks to wrap inside a custom LLMClient.
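As a rough illustration of keeping multi-step logic inside the client, the sketch below wraps another client and retries when it raises. RetryingClient, FlakyClient, the choice of ValueError as the retryable error, and the fallback response are all hypothetical; LLMResponse is a local stand-in.

```python
from dataclasses import dataclass


@dataclass
class LLMResponse:
    """Local stand-in for docchex.LLMResponse."""
    passed: bool
    reason: str


class RetryingClient:
    """Wraps another client and retries when it raises ValueError."""

    def __init__(self, inner, max_attempts: int = 3):
        self.inner = inner
        self.max_attempts = max_attempts

    def evaluate(self, doc, prompt: str) -> LLMResponse:
        last_error = None
        for _ in range(self.max_attempts):
            try:
                return self.inner.evaluate(doc, prompt)
            except ValueError as exc:  # e.g. the reply was not valid JSON
                last_error = exc
        # Every attempt failed: report that as a failed check, not a crash.
        return LLMResponse(passed=False, reason=f"LLM gave no usable answer: {last_error}")


class FlakyClient:
    """Test double that fails once, then succeeds."""

    def __init__(self):
        self.calls = 0

    def evaluate(self, doc, prompt: str) -> LLMResponse:
        self.calls += 1
        if self.calls == 1:
            raise ValueError("malformed JSON")
        return LLMResponse(passed=True, reason="All good")


flaky = FlakyClient()
result = RetryingClient(flaky).evaluate(doc=None, prompt="Is there a summary?")
```

Because the wrapper exposes the same evaluate method, callers that accept an LLMClient would not need to know that retries happen.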
Methods:
- evaluate – Evaluate the document against the given prompt and return a structured result.
evaluate
evaluate(doc: Document, prompt: str) -> LLMResponse
Evaluate the document against the given prompt and return a structured result.
Source code in src/docchex/_internal/llm/base.py
LLMResponse
dataclass
Result returned by an LLM provider after evaluating a document.
Attributes:
OllamaClient
LLM client that connects to a local Ollama server via its OpenAI-compatible API.
Requires pip install docchex[ollama] (installs the openai package).
Parameters:
- model (str, default: _DEFAULT_MODEL) – Ollama model name (e.g. "llama3.2", "mistral").
- base_url (str, default: _DEFAULT_BASE_URL) – Base URL of the Ollama server.
Methods:
- evaluate – Send the document and prompt to Ollama and return a structured result.
Source code in src/docchex/_internal/llm/providers/ollama.py
evaluate
evaluate(doc: Document, prompt: str) -> LLMResponse
Send the document and prompt to Ollama and return a structured result.
Source code in src/docchex/_internal/llm/providers/ollama.py
OpenAIClient
OpenAIClient(
api_key: str | None = None,
model: str = _DEFAULT_MODEL,
base_url: str | None = None,
)
LLM client backed by the OpenAI API (or any OpenAI-compatible endpoint).
Requires pip install docchex[openai].
Parameters:
- api_key (str | None, default: None) – OpenAI API key. Defaults to the OPENAI_API_KEY environment variable.
- model (str, default: _DEFAULT_MODEL) – Model ID to use for evaluation.
- base_url (str | None, default: None) – Override the API base URL (e.g. for a local Ollama server).
Methods:
- evaluate – Send the document and prompt to OpenAI and return a structured result.
Source code in src/docchex/_internal/llm/providers/openai.py
evaluate
evaluate(doc: Document, prompt: str) -> LLMResponse
Send the document and prompt to OpenAI and return a structured result.
Source code in src/docchex/_internal/llm/providers/openai.py
PDFParser
Bases: DocumentParser
Parse PDF files into Document objects using pdfplumber.
Methods:
- for_path – Return the appropriate parser for the given file path based on its extension.
- parse – Parse a PDF file into a Document, extracting text and metadata.
for_path
classmethod
for_path(path: Path) -> DocumentParser
Return the appropriate parser for the given file path based on its extension.
Source code in src/docchex/_internal/parsing/base.py
parse
Parse a PDF file into a Document, extracting text and metadata.
Source code in src/docchex/_internal/parsing/pdf.py
Report
dataclass
The result of running a set of rules against a document.
Methods:
- to_dict – Serialise the report to a plain dictionary.
Attributes:
- document_path (str) – Path to the evaluated document.
- findings (list[Finding]) – All findings produced by the rule engine.
- passed (bool) – True if no error-severity findings were produced.
- summary (dict[str, int]) – Finding counts grouped by severity.
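The relationship between findings, passed, and summary can be sketched as follows. The Finding stand-in, the summarize helper, and the severity strings "error" and "warning" are assumptions for illustration; the real values live in Severity.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Finding:
    """Hypothetical stand-in; only the severity field matters here."""
    rule_id: str
    severity: str


def summarize(findings: list[Finding]) -> tuple[bool, dict[str, int]]:
    """Return (passed, summary) for a list of findings."""
    summary = dict(Counter(f.severity for f in findings))
    passed = summary.get("error", 0) == 0  # warnings alone do not fail the report
    return passed, summary


passed, summary = summarize(
    [Finding("len", "warning"), Finding("intro", "error"), Finding("toc", "warning")]
)
```

With one error-severity finding in the list, passed is False and summary counts two warnings and one error.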
to_dict
Serialise the report to a plain dictionary.
Source code in src/docchex/_internal/models.py
RequiredSectionRule
Bases: Rule
Checks that a required section heading is present in the document.
Methods:
- check – Return a finding if the required section is absent from the document.
- from_config – Instantiate from a rule configuration dictionary.
Attributes:
- id (str) – Rule identifier.
- match (str) – The section heading text to search for (case-insensitive).
- severity (str) – Severity of the finding when the section is missing.
Source code in src/docchex/_internal/rules/builtin/required_section.py
match
instance-attribute
match: str = match
The section heading text to search for (case-insensitive).
severity
instance-attribute
severity: str = severity
Severity of the finding when the section is missing.
check
Return a finding if the required section is absent from the document.
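A case-insensitive heading search could look like the sketch below. How the actual rule locates headings is not shown on this page, so the per-line substring test and the helper name section_missing are assumptions.

```python
def section_missing(text: str, match: str) -> bool:
    """True when no line of the text contains the heading, ignoring case."""
    needle = match.casefold()
    return not any(needle in line.casefold() for line in text.splitlines())


doc_text = "1. INTRODUCTION\nSome text.\n2. Methods\nMore text."
intro_missing = section_missing(doc_text, "introduction")
conclusion_missing = section_missing(doc_text, "Conclusion")
```

A rule built on this helper would emit a finding only when the helper returns True.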
Source code in src/docchex/_internal/rules/builtin/required_section.py
from_config
classmethod
from_config(config: dict[str, Any]) -> RequiredSectionRule
Instantiate from a rule configuration dictionary.
Source code in src/docchex/_internal/rules/builtin/required_section.py
Rule
Bases: ABC
Abstract base class for all docchex rules.
Methods:
- check – Check the document and return any findings.
- from_config – Instantiate the rule from a configuration dictionary.
Attributes:
- id (str) – Unique identifier for this rule.
- severity (str) – Default severity level for findings produced by this rule.
severity
class-attribute
instance-attribute
Default severity level for findings produced by this rule.
check
abstractmethod
Check the document and return any findings.
Source code in src/docchex/_internal/rules/base.py
from_config
classmethod
Instantiate the rule from a configuration dictionary.
Source code in src/docchex/_internal/rules/base.py
RuleEngine
Runs a list of rules against a document and collects findings into a report.
Methods:
- run – Run all rules against the document and return a consolidated report.
Attributes:
Source code in src/docchex/_internal/evaluation/engine.py
run
Run all rules against the document and return a consolidated report.
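The collect-and-report loop can be sketched in a few lines. The Report stand-in, the two toy rules, the dict used as a document, and the plain-string findings are all simplifications assumed for illustration, not the engine's real types.

```python
from dataclasses import dataclass, field


@dataclass
class Report:
    """Hypothetical stand-in for docchex.Report."""
    document_path: str
    findings: list = field(default_factory=list)


class MinLengthRule:
    """Toy rule: one finding when the text is shorter than 10 characters."""

    def check(self, doc: dict) -> list[str]:
        return ["too short"] if len(doc["text"]) < 10 else []


class NoTodoRule:
    """Toy rule: one finding when the text still contains TODO."""

    def check(self, doc: dict) -> list[str]:
        return ["TODO left in text"] if "TODO" in doc["text"] else []


def run(rules, doc: dict) -> Report:
    """Call every rule's check() and gather the findings in rule order."""
    report = Report(document_path=doc["path"])
    for rule in rules:
        report.findings.extend(rule.check(doc))
    return report


report = run([MinLengthRule(), NoTodoRule()], {"path": "a.txt", "text": "TODO"})
```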
Source code in src/docchex/_internal/evaluation/engine.py
RuleLoader
Parameters:
- llm (LLMClient | None, default: None) – Optional LLM client injected into any ai_check rules that are loaded.
Methods:
- load – Load rules from one or more sources.
Source code in src/docchex/_internal/rules/loader.py
load
Load rules from one or more sources.
A single source can be:
- A path string or Path to a .yaml/.yml/.toml file
- A preset shorthand string like "preset:tech_report"
- A list of rule dicts
Multiple sources can be combined by passing a list of any of the above. A list whose first element is not a dict is treated as a list of sources.
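The source rules above, including the first-element test for lists, can be sketched as two small helpers. The names normalize_sources and describe, the labels they return, and the error types are hypothetical; the sketch only classifies sources and does not read any files.

```python
from pathlib import Path


def normalize_sources(source) -> list:
    """Flatten the argument into a list of single sources."""
    if isinstance(source, (str, Path)):
        return [source]  # one file path or one "preset:<name>" string
    if isinstance(source, list):
        if source and isinstance(source[0], dict):
            return [source]  # a list of rule dicts is itself one source
        return list(source)  # otherwise: a list of sources
    raise TypeError(f"Unsupported rule source: {type(source).__name__}")


def describe(single) -> str:
    """Label one source by kind."""
    if isinstance(single, list):
        return "rule dicts"
    text = str(single)
    if text.startswith("preset:"):
        return "preset " + text[len("preset:"):]
    if Path(text).suffix.lower() in {".yaml", ".yml", ".toml"}:
        return "file " + text
    raise ValueError(f"Unrecognized source: {text!r}")


kinds = [describe(s) for s in normalize_sources(["preset:tech_report", "extra.yaml"])]
inline = [describe(s) for s in normalize_sources([{"id": "r1"}])]
```

The first call is treated as two sources because its first element is a string; the second is treated as a single source because its first element is a dict.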
Source code in src/docchex/_internal/rules/loader.py
Severity
Constants for rule severity levels.
Attributes:
TextParser
Bases: DocumentParser
Parse plain-text (.txt) files into Document objects.
Pages are split on double newlines, mirroring PDFParser's convention. Used primarily for eval fixtures to keep CI dependency-free.
Methods:
- for_path – Return the appropriate parser for the given file path based on its extension.
- parse – Parse a plain-text file into a Document, splitting pages on double newlines.
for_path
classmethod
for_path(path: Path) -> DocumentParser
Return the appropriate parser for the given file path based on its extension.
Source code in src/docchex/_internal/parsing/base.py
parse
Parse a plain-text file into a Document, splitting pages on double newlines.
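Splitting on double newlines can be sketched as below. Stripping each chunk and dropping empty ones are assumptions; the parser itself may keep whitespace or empty pages.

```python
def split_pages(text: str) -> list[str]:
    """Split on blank lines and drop chunks that are empty after stripping."""
    return [chunk.strip() for chunk in text.split("\n\n") if chunk.strip()]


pages = split_pages("Title page\n\nFirst page body\nsecond line\n\n\n\nLast page\n")
```

Single newlines stay inside a page, so the example yields three pages, the second of which spans two lines.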
Source code in src/docchex/_internal/parsing/text.py
WordCountRule
WordCountRule(
rule_id: str,
min_words: int | None = None,
max_words: int | None = None,
severity: str = WARNING,
)
Bases: Rule
Checks that the document word count falls within optional min/max bounds.
Methods:
- check – Return findings if the document word count is outside the configured bounds.
- from_config – Instantiate from a rule configuration dictionary.
Attributes:
- id (str) – Rule identifier.
- max_words (int | None) – Maximum word count allowed, or None for no upper bound.
- min_words (int | None) – Minimum word count required, or None for no lower bound.
- severity (str) – Severity of findings produced by this rule.
Source code in src/docchex/_internal/rules/builtin/word_count.py
max_words
instance-attribute
max_words: int | None = max_words
Maximum word count allowed, or None for no upper bound.
min_words
instance-attribute
min_words: int | None = min_words
Minimum word count required, or None for no lower bound.
check
Return findings if the document word count is outside the configured bounds.
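The optional-bounds check can be sketched as follows. Counting whitespace-separated tokens as words, the helper name word_count_messages, and the message wording are assumptions rather than the rule's actual behavior.

```python
from typing import Optional


def word_count_messages(
    text: str,
    min_words: Optional[int] = None,
    max_words: Optional[int] = None,
) -> list[str]:
    """Return one message per violated bound; None disables a bound."""
    count = len(text.split())  # whitespace-separated tokens; an assumed definition of "word"
    messages = []
    if min_words is not None and count < min_words:
        messages.append(f"Document has {count} words; at least {min_words} required.")
    if max_words is not None and count > max_words:
        messages.append(f"Document has {count} words; at most {max_words} allowed.")
    return messages


too_short = word_count_messages("one two three", min_words=5)
within = word_count_messages("one two three", min_words=1, max_words=5)
```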
Source code in src/docchex/_internal/rules/builtin/word_count.py
from_config
classmethod
from_config(config: dict[str, Any]) -> WordCountRule
Instantiate from a rule configuration dictionary.
Source code in src/docchex/_internal/rules/builtin/word_count.py
get_parser
get_parser() -> ArgumentParser
Return the CLI argument parser.
Returns:
- ArgumentParser – An argparse parser.
Source code in src/docchex/_internal/cli.py
list_presets
Return the names of all built-in rule presets.
Returns:
- list[str] – A sorted list of preset names. Pass any name as "preset:<name>" to run_qaqc or RuleLoader.load.
Example

    import docchex
    from docchex import run_qaqc

    docchex.list_presets()
    # ['academic_paper', 'custom_template', 'letter_email', 'tech_report']
    run_qaqc("report.pdf", "preset:tech_report")
    run_qaqc("report.pdf", ["preset:tech_report", "my_extra_rules.yaml"])
Source code in src/docchex/__init__.py
main
Run the main program.
This function is executed when you type docchex or python -m docchex.
Parameters:
Returns:
- int – An exit code.
Source code in src/docchex/_internal/cli.py
run_qaqc
Run QA/QC checks on a document against a set of rules.
Parameters:
- document (str | Path) – Path to the document file (PDF or TXT supported).
- rules (_RulesArg) – One or more rule sources. Can be:
  - A path string or Path to a .yaml/.toml rules file
  - A preset name like "preset:tech_report" (see list_presets())
  - A list of rule dicts (including type: ai_check entries)
  - A list combining any of the above
- llm (LLMClient | None, default: None) – Optional LLM client for ai_check rules (e.g. AnthropicClient()).
Returns:
Source code in src/docchex/__init__.py