__init__.py 2.7 KB

"""LightRAG chunking strategies.

Two contracts coexist intentionally:

- **Legacy contract.** :func:`chunking_by_token_size` keeps its
  historical 6-positional-arg signature
  ``(tokenizer, content, split_by_character,
  split_by_character_only, chunk_overlap_token_size,
  chunk_token_size)``
  so externally-supplied :attr:`lightrag.LightRAG.chunking_func`
  implementations continue to work unchanged. The legacy contract is
  only invoked when ``process_options`` does NOT specify a chunking
  selector (i.e. ``chunking_explicit`` is False), typically for direct
  :meth:`LightRAG.ainsert` calls with raw text.

- **File-chunker contract.** For documents whose ``process_options``
  explicitly selects a chunking strategy, the file-based dispatcher in
  ``_PipelineMixin.process_single_document`` reads
  ``doc_process_opts.chunking`` and routes to a chunker following the
  standardized signature
  ``(tokenizer, content, chunk_token_size, *,
  <strategy-specific kwargs>)``

Currently shipped file chunkers:

- :func:`chunking_by_fixed_token`, the ``"F"`` strategy. Same
  algorithm as :func:`chunking_by_token_size`, surfaced under the
  new contract.
- :func:`chunking_by_recursive_character`, the ``"R"`` strategy.
  Wraps LangChain ``RecursiveCharacterTextSplitter``; recursively
  splits on a separator cascade with token-aware sizing.
- :func:`chunking_by_semantic_vector`, the ``"V"`` strategy.
  Wraps LangChain ``SemanticChunker``; sentence-level embedding
  similarity finds breakpoints. Async; needs an
  :class:`~lightrag.utils.EmbeddingFunc`.
- :func:`chunking_by_paragraph_semantic`, the ``"P"`` strategy.
  Heading-aware semantic chunker; consumes the docx-native
  ``.blocks.jsonl`` sidecar. Falls back to R when the sidecar is
  missing or unreadable.

See ``docs/ParagraphSemanticChunking-zh.md`` for the algorithm behind
the ``"P"`` strategy and ``docs/FileProcessingConfiguration-zh.md`` for
how ``process_options`` and the new ``chunk_options`` snapshot drive
chunker selection per document.
"""

from lightrag.chunker.paragraph_semantic import chunking_by_paragraph_semantic
from lightrag.chunker.recursive_character import (
    chunking_by_recursive_character,
)
from lightrag.chunker.semantic_vector import chunking_by_semantic_vector
from lightrag.chunker.token_size import (
    chunking_by_fixed_token,
    chunking_by_token_size,
)

__all__ = [
    "chunking_by_fixed_token",
    "chunking_by_paragraph_semantic",
    "chunking_by_recursive_character",
    "chunking_by_semantic_vector",
    "chunking_by_token_size",
]
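
# Illustrative sketch only, not part of the package. It shows how the two
# contracts described in the module docstring could coexist behind a single
# dispatcher. Every name below (toy_tokenize, legacy_chunker,
# fixed_token_chunker, FILE_CHUNKERS, dispatch) is hypothetical, the
# tokenizer is a plain whitespace split, and chunks are returned as bare
# strings for brevity; the real chunkers and the real dispatcher in
# ``_PipelineMixin.process_single_document`` may differ in all of these.

```python
from typing import Any, Callable, Dict, List, Optional

Tokenizer = Callable[[str], List[str]]


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""
    return text.split()


def legacy_chunker(
    tokenizer: Tokenizer,
    content: str,
    split_by_character: Optional[str],
    split_by_character_only: bool,
    chunk_overlap_token_size: int,
    chunk_token_size: int,
) -> List[str]:
    """Toy chunker with the legacy 6-positional-arg shape.

    The two split_by_character arguments are accepted but ignored here;
    this sketch only models the sliding token window.
    """
    tokens = tokenizer(content)
    step = max(1, chunk_token_size - chunk_overlap_token_size)
    return [
        " ".join(tokens[i : i + chunk_token_size])
        for i in range(0, len(tokens), step)
    ]


def fixed_token_chunker(
    tokenizer: Tokenizer,
    content: str,
    chunk_token_size: int,
    *,
    chunk_overlap_token_size: int = 0,
) -> List[str]:
    """Same toy algorithm, exposed under the file-chunker shape:
    three positional arguments, then keyword-only strategy options."""
    return legacy_chunker(
        tokenizer, content, None, False, chunk_overlap_token_size, chunk_token_size
    )


# Hypothetical registry keyed by strategy code; only "F" is modelled here.
FILE_CHUNKERS: Dict[str, Callable[..., List[str]]] = {"F": fixed_token_chunker}


def dispatch(
    tokenizer: Tokenizer,
    content: str,
    chunk_token_size: int,
    strategy: Optional[str] = None,
    **strategy_kwargs: Any,
) -> List[str]:
    """Route to the legacy contract when no strategy is selected,
    otherwise to the registered file chunker."""
    if strategy is None:
        # No explicit chunking selector: fall back to the legacy call shape.
        return legacy_chunker(tokenizer, content, None, False, 0, chunk_token_size)
    if strategy not in FILE_CHUNKERS:
        raise ValueError(f"unknown chunking strategy: {strategy!r}")
    return FILE_CHUNKERS[strategy](
        tokenizer, content, chunk_token_size, **strategy_kwargs
    )
```

# Calling ``dispatch(toy_tokenize, text, 2)`` exercises the legacy path,
# while ``dispatch(toy_tokenize, text, 2, strategy="F",
# chunk_overlap_token_size=1)`` goes through the keyword-only file-chunker
# path. Keeping strategy options keyword-only means a new strategy can add
# parameters without disturbing the shared positional prefix.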