# Paragraph Semantic Chunking Strategy ## 1. Use Cases and Strategy Selection ### 1.1 What the P Strategy Solves Paragraph Semantic Chunking (hereafter the **P strategy**) targets documents with clear sectional structure such as DOCX. Its core goal is to **align chunk boundaries with the document's native semantic boundaries** (headings, paragraphs, table rows) as much as possible, rather than determining cut points solely from token-length counting. The P strategy is mainly designed to address the following four categories of problems: 1. **Table context fragmentation**: When a large table is split, its head and tail slices easily become detached from the preceding description, following explanation, or intermediate bridging text, making them impossible to understand independently during recall. 2. **Insufficient utilization of hierarchical information**: Methods that only look at neighboring paragraphs cannot leverage parent heading paths or relationships between sibling clauses. 3. **Imbalanced sizes of fine-grained sections**: Regulations, standards, contracts, etc., often contain many fine-grained clauses of 100–300 tokens. Without merging, chunks become too short and semantically thin; merging by adjacent length alone causes cross-topic pollution. 4. **Long-chunk re-splitting breaks structure**: When sections are excessively long, ordinary character splitting ignores table row boundaries and heading hierarchy. The P strategy is effective only for the `.blocks.jsonl` structured artifacts produced by the `native` extraction engine; for unstructured inputs, it automatically falls back to the R strategy (see §8). ### 1.2 Comparison of P / R / V Strategies | Dimension | R Strategy (Recursive) | V Strategy (SemanticVector) | P Strategy (ParagraphSemantic) | |---|---|---|---| | Splitting basis | Cascading character separators (paragraph → newline → Chinese punctuation → whitespace → character) + token budget | Sentence-level embedding distance thresholds (percentile / standard deviation / IQR / gradient) to locate semantic breaks | DOCX outline level with `parent_headings` + table row boundaries + anchors + hierarchy-aware merging | | Chunk size control | `chunk_token_size` hard cap | `chunk_token_size` is merely an advisory ceiling; when exceeded, secondary splitting via R | `target_max` hard cap + `target_ideal` soft target + table threshold + tail-absorption threshold working in concert | | Table handling | Table-unaware; may cut in the middle of a table | Table-unaware | Tables smaller than `table_max` are kept intact; large tables are sliced by JSON row array / HTML `` row boundaries and re-wrapped as valid `` | | Table context | Relies on incidental window coverage | Relies on embedding distance | First slice glues to preceding description, last slice glues to following explanation; bidirectional overlap of bridging text between consecutive large tables | | Inter-chunk overlap | Global `chunk_overlap_token_size` | No overlap | No overlap across section boundaries; within the same section, long body falls back to R with overlap by `CHUNK_P_OVERLAP_SIZE`; bridging text between consecutive large tables may enter both the preceding and following table chunks | | Heading metadata | Usually none | Usually none | Inherits or promotes heading; appends `[part n]` suffix after splitting; preserves `parent_headings` and `level` | | Embedding compute cost | None | High (must compute embedding per sentence) | None | | Input requirements | Any text | Any text + Embedding model | Must have a `.blocks.jsonl` sidecar (i.e., result of the `native` engine); otherwise falls back to R | ### 1.3 How to Choose | Scenario | Recommended | Rationale | |---|---|---| | DOCX with clear sectional hierarchy, large tables, fine-grained clauses | **P** | Fully leverages heading hierarchy and table row boundaries; chunk boundaries best match semantics; avoids cross-topic pollution | | Documents dominated by prose / commentary / long body without clear sectional structure | **V** | Splitting by semantic similarity forms natural boundaries at topic shifts, more stable than character splitting | | Inputs are plain text, Markdown, code, logs, or you want minimum compute overhead | **R** | No embedding overhead; cascading separators are stable enough for mixed Chinese-English text | | General configuration (uncertain about file types) | **R** | P automatically falls back to R when no sidecar is present; V also falls back to R when no Embedding model is available | | Documents with chaotic heading styles and many pseudo-headings in body | **R** or **V** | P depends on the native parser correctly identifying headings; messy headings cause basic chunk boundaries to shift | | Single-line giant tables or unparsable tables | Any | All three strategies eventually fall back to character-level splitting; P still retains the advantage of table context gluing | ### 1.4 Costs of the P Strategy - Must be paired with the `native` engine: explicitly declared in `LIGHTRAG_PARSER`, e.g., `docx:native-P`; otherwise, even if `P` is written, it falls back to R due to the missing `.blocks.jsonl`. - DOCX only: other formats have no `.blocks.jsonl` artifact. - Many algorithmic paths and thresholds: debugging requires first verifying the input sidecar, then inspecting the outputs of each stage. ## 2. Overview of How It Works The P strategy takes as input the `.blocks.jsonl` produced by the native parser in `fixlevel=0` mode. **Each `type == "content"` line is treated as one heading-level basic chunk**, then table slicing, long-chunk splitting, and hierarchical merging are performed on top: ```text DOCX ↓ native parser (fixlevel=0) .blocks.jsonl + sidecars (.tables.json / .equations.json / .drawings.json / .blocks.assets/) ↓ Stage B: slice oversized tables along row boundaries and assign first/middle/last roles ↓ Stage B.1: bidirectional overlap of bridging text between consecutive large tables ↓ Stage C: anchor-driven re-splitting of long text chunks ↓ Stage D: hierarchy-aware two-phase merging ↓ Stage E: [part n] line-level provenance numbering Final chunk list ``` **Key invariants of the P strategy**: 1. **No overlap across section boundaries**: Text between different `.blocks.jsonl` content lines is never copied into the other chunk, avoiding "misattribution". 2. **Long body within a section may overlap**: Multiple slices from within the same content line may keep R-style overlap controlled by `chunk_overlap_token_size`, reducing mid-sentence cuts in long bodies. 3. **Bridging text between tables may overlap bidirectionally**: The only cross-paragraph copying scenario, specifically serving context preservation for consecutive large tables. 4. **Table rows do not overlap each other**: Row-level slicing itself is non-overlapping, different from R's overlap concept. ## 3. Input and Output ### 3.1 Input `chunking_by_paragraph_semantic()` receives the following inputs: | Parameter | Source | Description | |---|---|---| | `content` | `full_docs[doc_id].content` | Concatenated merged text, used for fallback when sidecar is missing | | `blocks_path` | `full_docs[doc_id].lightrag_document_path` | Path to `.blocks.jsonl`, the primary input for the P strategy | | `chunk_token_size` | `chunk_options.chunk_token_size` / `CHUNK_P_SIZE` | Target hard cap N; defaults to `2000` | | `chunk_overlap_token_size` | `CHUNK_P_OVERLAP_SIZE` / `chunk_overlap_token_size` | Upper bound for long-body fallback overlap within the same content line and for the table bridging budget; defaults to `100` | | `tokenizer` | The tokenizer already parsed by LightRAG | Basis for all token counting and text overlap truncation | The P strategy **does not accept** `split_by_character` / `split_by_character_only`, because the normal path is driven by heading and paragraph structure. ### 3.2 `.blocks.jsonl` Convention The P strategy only processes `type == "content"` lines. Each content line typically contains: - `content`: The body text under the heading, possibly including ordinary paragraphs, `

` tags, `` formulas, `` graphics. - `heading`: The current heading. - `parent_headings`: The chain of parent headings. - `level`: Heading level (1–9, corresponding to the original outline levels 0–8). - `positions`: Original paragraph positioning (used for traceability). The native parser's `fixlevel=0` mode guarantees that "the body under a heading becomes one basic chunk" without performing token-threshold splitting during parsing. Tables are inserted into `content` while staying intact. ### 3.3 Output The final output is an ordered list of chunks, where each element is: ```python { "tokens": int, # Actual token count (re-measured after merging) "content": str, # Chunk text (may contain

tags) "chunk_order_index": int, # Chunk ordering index "heading": str, # Suffix [part n] appended after splitting "parent_headings": list[str], # Parent heading chain; no suffix appended "level": int, # Heading level } ``` Internally, the implementation also temporarily uses fields such as `paragraphs`, `table_chunk_role`, `uuid`, `uuid_end`, `type` to assist splitting and merging, but **these do not appear in the final output**. ### 3.4 `[part n]` Suffix Rules - When the same original `.blocks.jsonl` content line is split into multiple slices, the `heading` field of every slice gets `[part 1]`, `[part 2]` … appended. - Content lines that are not split keep the original heading unchanged. - `parent_headings` does not get any suffix. - Numbering is **reset independently within each original content line**. - The legacy `[表格片段N]` ("table fragment N") suffix is uniformly replaced by `[part n]`. ## 4. Key Thresholds P strategy thresholds are not fixed constants; they are dynamically derived from `chunk_token_size` (denoted N): | Name | Formula | Value when N = 2000 | Technical meaning | |---|---|---:|---| | `target_max` | N | 2000 | Hard upper bound for text chunks | | `target_ideal` | 0.75 × N | 1500 | Ideal target for text chunks; chunks at or above this value stop participating in ordinary peer merging | | `table_max` | 0.625 × N | 1250 | Threshold that triggers table slicing | | `table_ideal` | 0.375 × N | 750 | Ideal size for a table slice | | `table_min_last` | 0.32 × `table_max` | 400 | Last-slice swallow-back threshold (if the last slice is smaller and can be merged, swallow it back into the previous slice) | | `small_tail_threshold` | 0.125 × N | 250 | Threshold for tail fragment absorption | | `max_anchor_candidate_length` | Fixed | 100 chars | Upper bound on paragraph length for candidate anchors in long-chunk splitting | Proportional constraint relationships: `table_max < target_ideal < target_max`, `table_ideal < table_max`. These ratios originate from empirical values in the audit mode (`large chunk 8000, small table 5000, ideal table 3000, table tail 1600`) and are now proportionally scaled by `chunk_token_size`. ## 5. Stage A: Heading-Level Basic Chunks Heading recognition is performed by the native parser; **the P chunker itself does not scan the docx body nor judge heading styles**. In `fixlevel=0` mode, the native parser: 1. Reads `styles.xml`, builds a style inheritance chain via ``, and traces back the effective ``. 2. Iterates over the paragraphs of `document.xml`, resolving outline levels along the inheritance chain; original outline levels 0–8 are mapped to internal `level` 1–9. 3. Maintains `current_heading_stack`, clearing old headings no shallower than the current level when a new heading is encountered, and computing `parent_headings`. 4. Extracts tables, formulas, and drawings into single-line tags (`

...

` etc.) and writes them to the corresponding sidecars. 5. All recognizable headings trigger a basic chunk boundary; **no** token-threshold splitting is performed. The P chunker directly reads `.blocks.jsonl`, treating each content line as an independent unit of processing for subsequent Stages B/C. This implies that `[part n]` numbering is reset independently per original content line. ## 6. Stage B: Row-Boundary Slicing for Oversized Tables Stage B only processes tables whose token count exceeds `table_max`. Its goal is **not merely to split the table** but to preserve table boundary context based on row-boundary-priority splitting. ### 6.1 Row-Boundary-Priority Slicing - `format="json"`: Slice by the top-level JSON row array. - `format="html"`: Slice by `...` rows. - Tables not explicitly tagged but sniffable as JSON / HTML are handled by the same rules. Before slicing, the `

` wrapper token cost is pre-deducted so that each re-wrapped slice stays under `table_max` as much as possible. Each slice is re-wrapped as a valid `` tag for ease of downstream parsing. ### 6.2 Row-Level Recursive Re-Slicing If a row subset, after re-wrapping, still exceeds `table_max`, further subdivision is performed within that row subset. **Only when slicing has converged to a single row that itself exceeds the limit does it degrade to character-level splitting**. This mechanism keeps as much valid table structure as possible for table content expressible by row boundaries. ### 6.3 Last-Slice Swallow-Back If the token count of the last table slice falls below `table_min_last` and the result of merging with the previous slice does not exceed `table_max`, the last slice is swallowed back into the previous slice, reducing useless short table chunks. ### 6.4 Table Slice Roles and Physical Gluing Each table slice is assigned an internal field `table_chunk_role`, and gluing to surrounding paragraphs is decided by role: | Role | Meaning | Gluing strategy | |---|---|---| | `first` | First slice of the original table | Appended to the tail of the current accumulating chunk so that the table's **preceding description** enters the same chunk as the first slice | | `middle` | Middle slice of the original table | Output independently to avoid merging with unrelated body | | `last` | Last slice of the original table | Used as the starting point of a new accumulating chunk so that the **following explanation** is automatically appended after the last slice | | `none` | Non-table slice or untouched intact table | Treated as ordinary text chunks | `table_chunk_role` is an internal field that does not survive in the final output, **but in Stage D it continues to serve as a merging constraint** (see §9.1). ## 7. Stage B.1: Bidirectional Overlap of Bridging Text Between Consecutive Large Tables When the pattern "large table A, short bridging text, large table B" appears in the same original content line and both tables are split, the bridging text is distributed bidirectionally according to a context budget: 1. Encode the bridging text into tokens. 2. Compute the left budget `prev_budget = min(chunk_overlap_token_size, target_max - current token count of the left last slice)`. 3. Compute the right budget `next_budget = min(chunk_overlap_token_size, target_max - current token count of the right first slice)`. 4. **If the bridging text length does not exceed either budget**: Both the left and right table boundary chunks contain the **complete bridging text**. 5. **If the bridging text is longer**: The prefix enters the left last-slice chunk, the suffix enters the right first-slice chunk; the middle portion that exceeds both budgets becomes an independent ordinary text chunk. Each one-sided budget is additionally capped at `chunk_token_size / 2` to prevent the bridging text from dominating an entire chunk. The difference from ordinary adjacent chunk overlap: - Ordinary overlap copies characters or tokens by forward/backward order, regardless of boundary type. - The B.1 mechanism is triggered by table slice roles, treating bridging text as both the post-text context of the left table and the pre-text context of the right table, avoiding the bridging description being assigned to only one side or being split off and hard to recall. ## 8. Stage C: Anchor-Driven Re-Splitting of Long Text Chunks Stage C processes content chunks that still exceed `target_max` after Stage B. ### 8.1 Short-Paragraph Anchors Restore content into paragraphs, then select paragraphs that satisfy all of the following as candidate anchors: - The paragraph is not a table (does not start with `.docx.parsed/.blocks.jsonl ``` If the file is missing or empty, the P strategy falls back to R entirely and gains none of P's benefits. Common causes: - `LIGHTRAG_PARSER=docx:native-...` was not configured. - Parsing failed (see error entries in `pipeline_status`). - The document is not a DOCX (other formats do not support P). ### 12.2 Inspect the Contents of blocks.jsonl Each line is a JSON; filter `type == "content"` and inspect whether heading / level / parent_headings match expectations: ```bash jq -c 'select(.type=="content") | {level, heading, parent_headings}' \ INPUT/__parsed__/.docx.parsed/.blocks.jsonl | head ``` If most headings are empty or levels are abnormal, the native parser did not correctly recognize heading styles — in which case P's hierarchical merging and anchor promotion will both fail. ### 12.3 Inspect the Final Chunks View chunk metadata in the `text_chunks` storage: ```bash jq '.[] | {heading, level, tokens, parent_headings}' \ rag_storage/kv_store_text_chunks.json | head -30 ``` You should observe: - Headings of chunks around large tables typically correspond to `[part 1]` / `[part n]` (indicating Stage B splitting occurred). - Fine-grained clauses are merged into chunks close to `target_ideal` (indicating Stage D took effect). - `parent_headings` jumps at boundaries between different sections and stays stable within the same section. ### 12.4 Chunk Size Distribution Check Ideal distribution: most chunks fall in the range `[target_ideal, target_max]` (i.e., approximately 1500–2000 tokens when N=2000); chunks noticeably smaller are usually `middle` table slices (locked as independent) or tail chunks at section boundaries. If many tail chunks below `small_tail_threshold` appear, possible causes include: - The parent heading path consistency constraint is too strict (adjacent small chunks with different `parent_headings` cannot merge). - Many `middle` table slices pile up (the table itself is very large). ## 13. Troubleshooting ### 13.1 P Did Not Take Effect; Output Matches R Investigate in this order: 1. Does `full_docs[doc_id].process_options` contain `P`? 2. Is `full_docs[doc_id].parse_format` equal to `lightrag`? If `raw`, it is on the legacy path and P automatically falls back to R. 3. Does the `.blocks.jsonl` pointed to by `lightrag_document_path` exist and is it non-empty? 4. Are there `paragraph_semantic ... fallback to recursive_character` messages in the logs? ### 13.2 Tables Are Scattered; Preceding and Following Explanations Are Detached - Check whether the table is truly recognized as `

` or `

` (see `.blocks.jsonl`). Tables with unrecognized format can only undergo character splitting and cannot trigger Stage B's role mechanism. - Check whether the table's token count actually exceeds `table_max`. Tables below the threshold remain intact and never trigger first/middle/last slicing. - For consecutive large tables, confirm whether the bridging text between the two tables resides in the **same content line** — bridging across content lines does not participate in B.1 bidirectional overlap. ### 13.3 Fine-Grained Clauses Are Not Merged - Check whether the `parent_headings` of adjacent clauses are identical: the parent heading path consistency constraint prevents cross-topic merging. - Check whether `level` is the same: peer merging requires equal `level`; cross-level absorption only allows shallow absorbing deep. - Check whether a `middle` table slice is inserted in the middle: this blocks batched tail absorption. ### 13.4 A Single Chunk Exceeds `target_max` Normally, Stage D's actual token re-measurement rejects oversized merges, but oversized chunks may still occur in the following scenarios: - A single-row table itself exceeds `target_max` with no anchor to split on; eventually it goes through R character splitting but a single chunk still exceeds the limit. - `enforce_chunk_token_limit_before_embedding` performs a final hard cut before embedding; downstream will not actually embed an oversized chunk into the vector store. ### 13.5 Abnormal `[part n]` Suffixes - Multiple slices come from the same original content line, but only one `[part 1]` is seen: check whether they were merged in Stage D — after merging, the main chunk's part suffix is retained and multiple part tags are not concatenated. - Legacy `[表格片段N]` suffix appears: this indicates data output by an older chunker; the new version standardizes on `[part n]`, and re-chunking is required. ### 13.6 Log Keywords P-strategy-related log keywords (for `grep`-based troubleshooting): - `paragraph_semantic` — module entry - `fallback to recursive_character` — overall or single-paragraph degradation - `table_chunk_role` — table role-related - `bridge` — Stage B.1 bridging text handling - `anchor` — Stage C anchor selection