"""LightRAG chunking strategies. Two contracts coexist intentionally: - **Legacy contract** — :func:`chunking_by_token_size` keeps its historical 6-positional-arg signature ``(tokenizer, content, split_by_character, split_by_character_only, chunk_overlap_token_size, chunk_token_size)`` so externally-supplied :attr:`lightrag.LightRAG.chunking_func` implementations continue to work unchanged. The legacy contract is only invoked when ``process_options`` does NOT specify a chunking selector (i.e. ``chunking_explicit`` is False) — typically direct :meth:`LightRAG.ainsert` calls with raw text. - **File-chunker contract** — for documents whose ``process_options`` explicitly selects a chunking strategy, the file-based dispatcher in ``_PipelineMixin.process_single_document`` reads ``doc_process_opts.chunking`` and routes to a chunker following the standardized signature ``(tokenizer, content, chunk_token_size, *, )`` Currently shipped file chunkers: - :func:`chunking_by_fixed_token` — the ``"F"`` strategy. Same algorithm as :func:`chunking_by_token_size`, surfaced under the new contract. - :func:`chunking_by_recursive_character` — the ``"R"`` strategy. Wraps LangChain ``RecursiveCharacterTextSplitter``; recursively splits on a separator cascade with token-aware sizing. - :func:`chunking_by_semantic_vector` — the ``"V"`` strategy. Wraps LangChain ``SemanticChunker``; sentence-level embedding similarity finds breakpoints. Async; needs an :class:`~lightrag.utils.EmbeddingFunc`. - :func:`chunking_by_paragraph_semantic` — the ``"P"`` strategy. Heading-aware semantic chunker; consumes the docx-native ``.blocks.jsonl`` sidecar. Falls back to R when the sidecar is missing or unreadable. See ``docs/ParagraphSemanticChunking-zh.md`` for the algorithm behind the ``"P"`` strategy and ``docs/FileProcessingConfiguration-zh.md`` for how ``process_options`` and the new ``chunk_options`` snapshot drive chunker selection per document. """ from lightrag.chunker.paragraph_semantic import chunking_by_paragraph_semantic from lightrag.chunker.recursive_character import ( chunking_by_recursive_character, ) from lightrag.chunker.semantic_vector import chunking_by_semantic_vector from lightrag.chunker.token_size import ( chunking_by_fixed_token, chunking_by_token_size, ) __all__ = [ "chunking_by_fixed_token", "chunking_by_paragraph_semantic", "chunking_by_recursive_character", "chunking_by_semantic_vector", "chunking_by_token_size", ]