| 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566 |
- """LightRAG chunking strategies.
- Two contracts coexist intentionally:
- - **Legacy contract** — :func:`chunking_by_token_size` keeps its
- historical 6-positional-arg signature
- ``(tokenizer, content, split_by_character,
- split_by_character_only, chunk_overlap_token_size,
- chunk_token_size)``
- so externally-supplied :attr:`lightrag.LightRAG.chunking_func`
- implementations continue to work unchanged. The legacy contract is
- only invoked when ``process_options`` does NOT specify a chunking
- selector (i.e. ``chunking_explicit`` is False) — typically direct
- :meth:`LightRAG.ainsert` calls with raw text.
- - **File-chunker contract** — for documents whose ``process_options``
- explicitly selects a chunking strategy, the file-based dispatcher in
- ``_PipelineMixin.process_single_document`` reads
- ``doc_process_opts.chunking`` and routes to a chunker following the
- standardized signature
- ``(tokenizer, content, chunk_token_size, *,
- <strategy-specific kwargs>)``
- Currently shipped file chunkers:
- - :func:`chunking_by_fixed_token` — the ``"F"`` strategy. Same
- algorithm as :func:`chunking_by_token_size`, surfaced under the
- new contract.
- - :func:`chunking_by_recursive_character` — the ``"R"`` strategy.
- Wraps LangChain ``RecursiveCharacterTextSplitter``; recursively
- splits on a separator cascade with token-aware sizing.
- - :func:`chunking_by_semantic_vector` — the ``"V"`` strategy.
- Wraps LangChain ``SemanticChunker``; sentence-level embedding
- similarity finds breakpoints. Async; needs an
- :class:`~lightrag.utils.EmbeddingFunc`.
- - :func:`chunking_by_paragraph_semantic` — the ``"P"`` strategy.
- Heading-aware semantic chunker; consumes the docx-native
- ``.blocks.jsonl`` sidecar. Falls back to R when the sidecar is
- missing or unreadable.
- See ``docs/ParagraphSemanticChunking-zh.md`` for the algorithm behind
- the ``"P"`` strategy and ``docs/FileProcessingConfiguration-zh.md`` for
- how ``process_options`` and the new ``chunk_options`` snapshot drive
- chunker selection per document.
- """
- from lightrag.chunker.paragraph_semantic import chunking_by_paragraph_semantic
- from lightrag.chunker.recursive_character import (
- chunking_by_recursive_character,
- )
- from lightrag.chunker.semantic_vector import chunking_by_semantic_vector
- from lightrag.chunker.token_size import (
- chunking_by_fixed_token,
- chunking_by_token_size,
- )
- __all__ = [
- "chunking_by_fixed_token",
- "chunking_by_paragraph_semantic",
- "chunking_by_recursive_character",
- "chunking_by_semantic_vector",
- "chunking_by_token_size",
- ]
|