Repository Guidelines

Project Overview

LightRAG is a Retrieval-Augmented Generation (RAG) framework that uses graph-based knowledge representation for enhanced information retrieval. The system extracts entities and relationships from documents, builds a knowledge graph, and uses multiple retrieval modes (local, global, hybrid, mix, naive) for queries.

Project Structure

Top-level directories:

lightrag/: Core Python package — see Module Layout below.
lightrag_webui/: React 19 + TypeScript client (Bun + Vite + Tailwind). UI components in src/.
scripts/: test.sh (preferred test runner), setup/ interactive environment wizard (use make env-* rather than calling setup.sh directly — see Configuration > Setup Wizard Outputs), and release tooling.
tests/: Pytest coverage, organized into subdirectories that mirror lightrag/ (see Testing below for layout).
Working datasets stay in inputs/, rag_storage/, and temp/; deployment collateral lives in docs/, k8s-deploy/, and compose files.

Module Layout (`lightrag/`)

lightrag.py: Main orchestrator class (LightRAG) — assembled from mixins (see LightRAG class composition). Hosts ainsert_custom_kg, _insert_done, _process_extract_entities, _refresh_addon_params_cache, and addon_params accessors. Critical: always call await rag.initialize_storages() after instantiation.
pipeline.py: _PipelineMixin — owns the document ingestion pipeline (apipeline_enqueue_documents, apipeline_process_enqueue_documents, apipeline_process_error_documents), the parse_native / parse_mineru / parse_docling parser dispatchers, multimodal analysis, validation, and the worker scaffolding.
utils_pipeline.py: Pure helpers shared by the pipeline mixin and other entry points: doc-status field access, document identity (source key, content hash), parsed-artifact path resolution, parser payload normalization, multimodal entity augmentation, and make_lightrag_doc_content.
llm_roles.py: RoleSpec / RoleLLMConfig / _RoleLLMState / ROLES registry plus _RoleLLMMixin — role normalization, builder registration, wrapper rebuild, runtime config update, queue cleanup, sanitized config export, queue status reporting. Route role-specific behavior here rather than into provider modules.
storage_migrations.py: _StorageMigrationMixin — check_and_migrate_data, _migrate_entity_relation_data, _migrate_chunk_tracking_storage.
addon_params.py: ObservableAddonParams plus default_addon_params / normalize_addon_params helpers.
operate.py: Core extraction and query operations including entity/relation extraction, chunking, and multi-mode retrieval logic.
base.py: Abstract base classes for storage backends (BaseKVStorage, BaseVectorStorage, BaseGraphStorage, BaseDocStatusStorage).
kg/: Storage implementations (JSON, NetworkX, Neo4j, PostgreSQL, MongoDB, Redis, Milvus, Qdrant, Faiss, Memgraph, OpenSearch, NanoVectorDB). The backend registry (STORAGE_IMPLEMENTATIONS / STORAGES) lives in kg/__init__.py; kg/factory.py::get_storage_class() resolves backend classes from configuration.
llm/: LLM and embedding provider bindings (OpenAI, Ollama, Azure, Gemini, Bedrock, Anthropic, etc.). All async with caching support.
parser/: Unified parsing layer. parser/routing.py resolves engine and filename hints for legacy, native, mineru, and docling flows; parser/debug.py provides an offline LightRAG stub for the parser/cli.py debug entry point (python -m lightrag.parser.cli). Native format parsers live as sibling sub-packages under parser/ (currently parser/docx/); external HTTP-based adapters live under parser/external/ (mineru, docling) with shared helpers in parser/external/_common.py, _manifest.py, _zip.py.
chunker/: Chunking strategies (token-size, recursive character, semantic vector, paragraph semantic).
api/: FastAPI service (lightrag_server.py) with REST endpoints and Ollama-compatible API; routers under routers/, static Swagger assets, packaged WebUI output, and Gunicorn launcher.

Core Architecture

LightRAG class composition

LightRAG is assembled from focused mixins (split out of the previously monolithic lightrag.py):

LightRAG → _RoleLLMMixin → _StorageMigrationMixin → _PipelineMixin → object

The @final decorator on LightRAG is preserved — the mixin layering is an internal implementation detail, not an external subclassing surface. The public API (ainsert, aquery, ainsert_custom_kg, initialize_storages, etc.) is unchanged. ainsert_custom_kg and its internal construction logic, _insert_done, _process_extract_entities, _refresh_addon_params_cache, and the addon_params property accessors stay on LightRAG itself because they cut across multiple flows or depend on prompt-profile state.
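
A minimal sketch of the layering described above. The class bodies are empty stand-ins, not the real implementations, and the base-class order is an assumption chosen so that the method resolution order matches the chain shown above.

```python
from typing import final


class _PipelineMixin:
    """Stand-in for the document ingestion pipeline mixin."""


class _StorageMigrationMixin:
    """Stand-in for the storage migration mixin."""


class _RoleLLMMixin:
    """Stand-in for the role-specific LLM mixin."""


@final  # Type checkers reject subclasses; Python itself does not enforce this.
class LightRAG(_RoleLLMMixin, _StorageMigrationMixin, _PipelineMixin):
    """Stand-in orchestrator assembled from the three mixins."""


def mro_names(cls: type) -> list[str]:
    """Return the class names in method resolution order."""
    return [klass.__name__ for klass in cls.__mro__]


if __name__ == "__main__":
    print(" -> ".join(mro_names(LightRAG)))
```

Because the mixins do not inherit from each other, attribute lookup walks them left to right before reaching object.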

Storage Layer

LightRAG uses 4 storage types with pluggable backends:

KV_STORAGE: LLM response cache, text chunks, document info
VECTOR_STORAGE: Entity/relation/chunk embeddings
GRAPH_STORAGE: Entity-relation graph structure
DOC_STATUS_STORAGE: Document processing status tracking

Each LightRAG instance can pass a workspace parameter for data isolation. Implementation differs per storage type:

File-based: subdirectories under working_dir.
Collection-based: collection name prefixes.
Relational DB: workspace column filtering.
Qdrant: payload-based partitioning.
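
The first three isolation strategies can be sketched as follows. All helper names, the underscore separator, and the workspace column name are hypothetical illustrations, not the backends' actual code.

```python
from pathlib import Path


def file_based_location(working_dir: str, workspace: str) -> Path:
    """File-based backends: one subdirectory per workspace under working_dir."""
    base = Path(working_dir)
    return base / workspace if workspace else base


def collection_name(base_name: str, workspace: str) -> str:
    """Collection-based backends: prefix the name (separator is an assumption)."""
    return f"{workspace}_{base_name}" if workspace else base_name


def relational_filter(workspace: str) -> tuple[str, tuple[str, ...]]:
    """Relational backends: a parameterized filter on an assumed workspace column."""
    return "WHERE workspace = ?", (workspace,)


if __name__ == "__main__":
    print(file_based_location("./rag_storage", "project_name"))
    print(collection_name("chunks", "project_name"))
    print(relational_filter("project_name"))
```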

Pipeline concurrency contract

The document ingestion pipeline coordinates concurrent writers through pipeline_status (a per-workspace shared dict in lightrag.kg.shared_storage). These fields are mutated under get_namespace_lock("pipeline_status", workspace=...):

busy: any pipeline-busy state. Set by both the processing loop AND destructive jobs (clear / per-doc delete). On its own, busy=True does NOT block enqueue — see destructive_busy for the exclusive subset.
destructive_busy: the busy job is /documents/clear or /documents/{doc_id} (delete). These DROP storages and remove input files; a concurrent enqueue accepted in this window would write to storage being torn down and silently lose the document. Reservation and the enqueue last-line guard reject when this is True.
scanning: a /documents/scan task is running (whole lifecycle: classification + processing). Used by the /scan endpoint to refuse overlapping scans. Does NOT on its own block uploads/inserts.
scanning_exclusive: True only during the scan task's classification phase, when run_scanning_process is reading doc_status to classify files (PROCESSED → archive, FAILED-without-full_docs → retry-as-new, etc.) and possibly deleting stale stubs. Reservation and the enqueue last-line guard reject when this is set. Cleared before the scan transitions to its processing phase, allowing concurrent uploads to land while scan-driven processing finishes.
pending_enqueues: count of /upload, /text, /texts endpoints that have reserved a slot (via _reserve_enqueue_slot) but whose bg task has not yet completed. Only the scan endpoint reads this — to refuse starting while uploads are mid-flight.
request_pending: a nudge to the running processing loop. Set by either (a) apipeline_process_enqueue_documents when called while busy=True or (b) apipeline_enqueue_documents after writing to doc_status while busy=True. The loop checks it after each batch and re-queries doc_status if set.

Mutual-exclusion rules (all checked atomically inside the lock):

Operation	Refuses if	Writes
`_reserve_enqueue_slot`	`scanning_exclusive` or `destructive_busy`	`pending_enqueues++`
`apipeline_enqueue_documents` (last-line guard)	(`scanning_exclusive` and not `from_scan`) or `destructive_busy`	—
Scan endpoint reservation	`busy or scanning or pending_enqueues > 0`	`scanning = True`
`apipeline_process_enqueue_documents` entry	(already busy → set `request_pending`, return)	`busy = True` (NOT `destructive_busy`)
`clear_documents` / `delete_document` (synchronous reservation)	`busy or scanning or pending_enqueues > 0`	`busy = True`, `destructive_busy = True`
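
The table can be read as the following simplified model. It is a sketch only: the function names are hypothetical, a plain asyncio.Lock stands in for get_namespace_lock, and the real code in lightrag/pipeline.py does considerably more.

```python
import asyncio


def new_pipeline_status() -> dict:
    """Fresh per-workspace status holding the fields described above."""
    return {
        "busy": False,
        "destructive_busy": False,
        "scanning": False,
        "scanning_exclusive": False,
        "pending_enqueues": 0,
        "request_pending": False,
    }


async def reserve_enqueue_slot(status: dict, lock: asyncio.Lock) -> bool:
    """Uploads: refused only by scan classification or a destructive job."""
    async with lock:
        if status["scanning_exclusive"] or status["destructive_busy"]:
            return False
        status["pending_enqueues"] += 1
        return True


async def reserve_scan(status: dict, lock: asyncio.Lock) -> bool:
    """Scan: refused while anything else is running or uploads are mid-flight."""
    async with lock:
        if status["busy"] or status["scanning"] or status["pending_enqueues"] > 0:
            return False
        status["scanning"] = True
        return True


async def enter_processing(status: dict, lock: asyncio.Lock) -> bool:
    """Processing loop entry: if already busy, nudge the running loop instead."""
    async with lock:
        if status["busy"]:
            status["request_pending"] = True
            return False
        status["busy"] = True  # destructive_busy stays False
        return True


async def reserve_destructive(status: dict, lock: asyncio.Lock) -> bool:
    """Clear/delete: same refusal rule as scan, but marks the job destructive."""
    async with lock:
        if status["busy"] or status["scanning"] or status["pending_enqueues"] > 0:
            return False
        status["busy"] = True
        status["destructive_busy"] = True
        return True
```

Every check and its write happen inside one lock acquisition, which is what makes each reservation atomic.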

The contract permits concurrent enqueue + processing: a freshly-uploaded doc lands in doc_status while the loop is mid-batch, the loop sees request_pending after the current batch, re-queries doc_status, and picks up the new PENDING row.
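
A minimal single-threaded sketch of that hand-off. The function name and the after_batch hook are hypothetical: the hook stands in for an upload arriving while a batch is in flight, and the real loop is asynchronous and runs under the namespace lock.

```python
from typing import Callable, Optional


def drain_pending(
    status: dict,
    doc_status: dict[str, str],
    after_batch: Optional[Callable[[], None]] = None,
) -> list[str]:
    """Process PENDING docs, re-querying whenever request_pending was set."""
    processed: list[str] = []
    status["busy"] = True
    try:
        while True:
            # Snapshot the current PENDING rows as one batch.
            batch = [doc for doc, state in doc_status.items() if state == "PENDING"]
            for doc in batch:
                doc_status[doc] = "PROCESSED"
                processed.append(doc)
            if after_batch is not None:
                after_batch()  # simulated concurrent enqueue
            if not status["request_pending"]:
                break
            # Consume the nudge and look at doc_status again.
            status["request_pending"] = False
    finally:
        status["busy"] = False
    return processed
```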

For the rest — write ordering of full_docs vs doc_status, the workspace-scoped enqueue_serialize lock around dedup-and-upsert, and the from_scan=True bypass — see the docstrings on apipeline_enqueue_documents and apipeline_process_enqueue_documents in lightrag/pipeline.py.

Query Modes

local: Context-dependent retrieval focused on specific entities
global: Community/summary-based broad knowledge retrieval
hybrid: Combines local and global
naive: Direct vector search without graph
mix: Integrates KG and vector retrieval (recommended with reranker)

Development Commands

Setup

# Install with uv
uv sync
source .venv/bin/activate  # Or: .venv\Scripts\activate on Windows

# Install with API support
uv sync --extra api

# Install specific extras
uv sync --extra offline-storage  # Storage backends
uv sync --extra offline-llm      # LLM providers
uv sync --extra test             # Testing dependencies

API Server

# Copy and configure environment
cp env.example .env  # Edit with your LLM/embedding configs

# Build WebUI
cd lightrag_webui
bun install --frozen-lockfile
bun run build
cd ..

# Run server
lightrag-server                                           # Production
uvicorn lightrag.api.lightrag_server:app --reload        # Development
lightrag-gunicorn                                         # Multi-worker (gunicorn)

WebUI

cd lightrag_webui
bun install --frozen-lockfile      # Install dependencies
bun run dev                        # Dev server (Node + Vite)
bun run dev:bun                    # Dev server (Bun native)
bun run build                      # Production build
bun run preview                    # Preview production build
bun run lint                       # ESLint over *.ts/tsx/js/jsx

# Testing — Bun built-in runner (NOT Vitest/Jest)
bun test                           # All tests
bun test --watch                   # Watch mode
bun test --coverage                # With coverage report
bun test src/api/lightrag.test.ts  # Single test file

Testing

Use mock-based tests for external services (Redis, httpx, etc.) — do not depend on live services in unit tests.
Add regression tests for every bug fix.
Run the full test suite (or relevant subset) and report pass counts before declaring done.
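
A sketch of the mock-based style, using only the standard library. The fetch_health helper and the /health path are hypothetical examples, not part of the codebase.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def fetch_health(client) -> bool:
    """Hypothetical helper: True when GET /health answers with status 200."""
    response = await client.get("/health")
    return response.status_code == 200


def test_fetch_health_uses_mocked_client() -> None:
    # AsyncMock replaces the HTTP client, so no live service is contacted.
    client = AsyncMock()
    client.get.return_value = MagicMock(status_code=200)

    assert asyncio.run(fetch_health(client)) is True
    client.get.assert_awaited_once_with("/health")


def test_fetch_health_reports_failure() -> None:
    client = AsyncMock()
    client.get.return_value = MagicMock(status_code=503)

    assert asyncio.run(fetch_health(client)) is False
```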

Backend tests use pytest; frontend unit tests use Bun's built-in runner — see WebUI above.

# Preferred for fresh shells and automation; resolves PYTHON, venv, uv, .venv, venv, python, python3
./scripts/test.sh tests

# Run specific test file
./scripts/test.sh tests/kg/test_graph_storage.py

# Run with custom workers
./scripts/test.sh tests --test-workers 4

tests/: main test suite, mirrors feature folders. Place new tests under the subdirectory matching the module under test:
- tests/api/{auth,config,routes}/ for FastAPI server tests (auth/token, config loading, route handlers); top-level tests/api/ for app-wide concerns (path prefixes, Ollama-compatible endpoint).
- tests/chunker/, tests/evaluation/, tests/extraction/ for the like-named modules.
- tests/kg/<backend>_impl/ for backend-specific storage tests, mirroring the lightrag/kg/<backend>_impl.py file naming. The _impl suffix on every subdirectory keeps the layout uniform and avoids sys.path shadowing on names that overlap with top-level PyPI/stdlib packages (faiss, json, neo4j, networkx, redis) when a test is launched directly via python tests/kg/.... Current backends: faiss_impl/, json_impl/, memgraph_impl/, milvus_impl/, mongo_impl/, nano_impl/, neo4j_impl/, networkx_impl/, opensearch_impl/, postgres_impl/, qdrant_impl/, redis_impl/. tests/kg/ root holds cross-backend tests (test_graph_storage, test_batch_graph_operations, test_unified_lock_safety, test_file_atomic).
- tests/llm/<provider>_impl/ for provider-specific behavior, same _impl convention: bedrock_impl/, gemini_impl/, ollama_impl/, openai_impl/, voyageai_impl/, zhipu_impl/. tests/llm/ root holds cross-provider concerns (embedding, VLM, cache, role).
- tests/parser/, tests/parser/docx/, tests/parser/external/{mineru,docling}/ for parser implementations.
- tests/pipeline/ for ingestion pipeline and doc-status behavior (including test_pipeline_*, test_doc_status_*, test_multimodal_*, test_graph_keyed_locks).
- tests/sidecar/, tests/setup/, tests/workspace/ for the like-named cross-cutting concerns.
- When adding a new backend or LLM provider, create a new subdirectory plus an empty __init__.py rather than dropping the file in the parent directory root.
Markers (see tests/pytest.ini): offline, integration, requires_db, requires_api. Integration tests are skipped by default via -m "not integration".
Integration env vars: LIGHTRAG_RUN_INTEGRATION=true, LIGHTRAG_KEEP_ARTIFACTS=true, LIGHTRAG_TEST_WORKERS=4, plus storage-specific connection strings.

Linting

ruff check .

Key Implementation Patterns

LightRAG Initialization (Critical)

The most common error is forgetting to initialize storages (manifests as AttributeError: __aenter__ or KeyError: 'history_messages'):

import asyncio
from lightrag import LightRAG, QueryParam
from lightrag.llm.openai import gpt_4o_mini_complete, openai_embed

async def main():
    rag = LightRAG(
        working_dir="./rag_storage",
        llm_model_func=gpt_4o_mini_complete,
        embedding_func=openai_embed
    )

    # REQUIRED: Initialize storage backends
    await rag.initialize_storages()

    # Now safe to use
    await rag.ainsert("Your text here")
    result = await rag.aquery("Your question", param=QueryParam(mode="hybrid"))

    # Cleanup
    await rag.finalize_storages()

asyncio.run(main())
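
One way to make the initialize/finalize pairing hard to forget is a small async context manager. This is a sketch, not part of the library: the initialized helper is hypothetical, and _RecordingRag is a stand-in used only to show the call order.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def initialized(rag):
    """Initialize storages on entry and always finalize them on exit."""
    await rag.initialize_storages()
    try:
        yield rag
    finally:
        await rag.finalize_storages()


class _RecordingRag:
    """Stand-in that records lifecycle calls; pass a real LightRAG instead."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    async def initialize_storages(self) -> None:
        self.calls.append("initialize")

    async def finalize_storages(self) -> None:
        self.calls.append("finalize")


async def demo() -> list[str]:
    rag = _RecordingRag()
    try:
        async with initialized(rag):
            raise RuntimeError("simulated failure during use")
    except RuntimeError:
        pass
    return rag.calls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Finalization runs even when the body raises, which is the point of the try/finally.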

Custom Embedding Functions

Decorate custom embedding functions with @wrap_embedding_func_with_attrs. When wrapping a function that is already decorated (such as openai_embed), call its underlying implementation via .func, because an already-decorated function cannot be wrapped again:

import numpy as np

from lightrag.llm.openai import openai_embed
from lightrag.utils import wrap_embedding_func_with_attrs

@wrap_embedding_func_with_attrs(embedding_dim=1536, max_token_size=8192)
async def custom_embed(texts: list[str]) -> np.ndarray:
    # Call underlying function, not wrapped version
    return await openai_embed.func(texts, model="text-embedding-3-large")

# Wrong: EmbeddingFunc(func=openai_embed)
# Right: EmbeddingFunc(func=openai_embed.func)

Pitfall — switching embedding models: when changing the embedding model you MUST clear the data directory (optionally keeping kv_store_llm_response_cache.json for LLM cache). Existing vectors will not match the new model's space.
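
A sketch of that cleanup for file-based storage, assuming the data directory is the working_dir itself. The clear_working_dir helper is hypothetical, it deletes files irreversibly, and external database backends need their own cleanup.

```python
import shutil
from pathlib import Path

LLM_CACHE_FILE = "kv_store_llm_response_cache.json"


def clear_working_dir(working_dir: str, keep_llm_cache: bool = True) -> list[str]:
    """Delete everything in working_dir, optionally sparing the LLM cache file.

    Returns the sorted names of the entries that were removed.
    """
    removed: list[str] = []
    # Materialize the listing first so deletion does not disturb iteration.
    for entry in list(Path(working_dir).iterdir()):
        if keep_llm_cache and entry.name == LLM_CACHE_FILE:
            continue
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed.append(entry.name)
    return sorted(removed)
```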

Storage Configuration

Configure via environment variables or constructor params:

# Environment-based (recommended for production)
# See env.example for full list

# Constructor-based
rag = LightRAG(
    working_dir="./storage",
    workspace="project_name",  # For data isolation
    kv_storage="PGKVStorage",
    vector_storage="PGVectorStorage",
    graph_storage="Neo4JStorage",
    doc_status_storage="PGDocStatusStorage",
    vector_db_storage_cls_kwargs={
        "cosine_better_than_threshold": 0.2
    }
)

Document Insertion

# Single document
await rag.ainsert("Text content")

# Batch insertion
await rag.ainsert(["Text 1", "Text 2", ...])

# With custom IDs
await rag.ainsert("Text", ids=["doc-123"])

# With file paths (for citation)
await rag.ainsert(["Text 1", "Text 2"], file_paths=["doc1.pdf", "doc2.pdf"])

# Configure how many documents are processed in parallel
rag = LightRAG(..., max_parallel_insert=4)  # Default: 2, max recommended: 10

Query Configuration

from lightrag import QueryParam

result = await rag.aquery(
    "Your question",
    param=QueryParam(
        mode="mix",                    # Recommended with reranker
        top_k=60,                      # KG entities/relations to retrieve
        chunk_top_k=20,                # Text chunks to retrieve
        max_entity_tokens=6000,
        max_relation_tokens=8000,
        max_total_tokens=30000,
        enable_rerank=True,
        user_prompt="Additional instructions for LLM",
        stream=False
    )
)

Frontend Debugging via Playwright

For WebUI bugs whose symptoms only surface in the rendered DOM — layout/overflow/scrollbar issues, transient flashes, third-party libraries attaching helpers to <body> outside React's tree, or end-to-end verification of a fix — drive the running dev server (http://localhost:5173) with the document-skills:webapp-testing skill instead of reasoning from source alone. Seed state directly via localStorage (persist key settings-storage, schema in lightrag_webui/src/stores/settings.ts) to skip live LLM calls. Use wait_until="domcontentloaded" plus a selector wait — Vite dev's long-lived polling makes networkidle time out.
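
A sketch of the localStorage seeding step. The build_seed_script helper is hypothetical, and the {"state": ..., "version": ...} envelope is an assumption about how the persisted store is serialized; check lightrag_webui/src/stores/settings.ts for the real field names and version.

```python
import json

PERSIST_KEY = "settings-storage"
DEV_SERVER_URL = "http://localhost:5173"


def build_seed_script(state: dict, version: int = 0) -> str:
    """Return a JS snippet that writes the persisted settings to localStorage."""
    # Assumed envelope shape; the stored value must itself be a JSON string.
    envelope = json.dumps({"state": state, "version": version})
    # json.dumps on a str yields a valid, fully escaped JS string literal.
    return f"localStorage.setItem({json.dumps(PERSIST_KEY)}, {json.dumps(envelope)});"


if __name__ == "__main__":
    # "theme" is a placeholder field, not necessarily part of the real schema.
    print(build_seed_script({"theme": "dark"}))
```

With Playwright's Python API, the returned script can be registered through page.add_init_script(script) before calling page.goto(DEV_SERVER_URL, wait_until="domcontentloaded"), followed by page.wait_for_selector(...) on an element that only appears once the app has rendered.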

Configuration

.env Configuration

Primary configuration file for API server. Generate it with make env-base or copy env.example manually. Key sections:

Server settings (HOST, PORT, CORS)
Storage backends (connection strings via environment variables)
Query parameters (TOP_K, MAX_TOTAL_TOKENS, etc.)
Reranking configuration (RERANK_BINDING, RERANK_MODEL)
Authentication (AUTH_ACCOUNTS, LIGHTRAG_API_KEY)

See env.example for comprehensive template.
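
A sketch of reading two of the query parameters named above from the environment. The QueryDefaults dataclass and load_query_defaults helper are hypothetical, and the fallback values are borrowed from the Query Configuration example rather than taken from the server's actual defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class QueryDefaults:
    top_k: int
    max_total_tokens: int


def load_query_defaults(env: Optional[Mapping[str, str]] = None) -> QueryDefaults:
    """Read TOP_K and MAX_TOTAL_TOKENS, falling back to assumed defaults."""
    source = os.environ if env is None else env
    return QueryDefaults(
        top_k=int(source.get("TOP_K", "60")),
        max_total_tokens=int(source.get("MAX_TOTAL_TOKENS", "30000")),
    )


if __name__ == "__main__":
    print(load_query_defaults())
```

Passing a mapping instead of relying on os.environ keeps the helper testable without touching the real environment.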

Setup Wizard Outputs

Keep .env host-usable. Container-only hostnames and staged SSL paths belong in the wizard-managed compose layer, not persisted back into .env.
Treat docker-compose.final.yml as generated output assembled from scripts/setup/templates/*.yml.
For setup workflow changes, prefer make env-* targets over direct scripts/setup/setup.sh calls.

Code Style

Language

Comments, backend code, and log messages in English. Frontend uses i18next for multi-language support.

Python

Follow PEP 8 with 4-space indentation
Use type annotations
Prefer dataclasses for state management
Use lightrag.utils.logger instead of print
Async/await patterns throughout

TypeScript / React (incl. WebUI ESLint)

Functional components with hooks; PascalCase for components
2-space indentation, single quotes (enforced by @stylistic rules)
Tailwind utility-first styling
ESLint stack: TypeScript-ESLint + React Hooks plugin + Prettier; @typescript-eslint/no-explicit-any is disabled (allowed)

Commit and Pull Request Guidance

If this repo is a fork of HKUDS/LightRAG, target HKUDS/LightRAG when creating PRs, not the fork's own repo.
PR descriptions should include: summary, motivation, linked issues (if applicable), what's changed, what's broken, and how it works.
