The RAG Process Through LlamaIndex's Classes

Drawing-2024-12-21-09.57.12.excalidraw

Building RAG with LlamaIndex involves five construction steps, plus two evaluation steps:

  1. Document -> Documents: each Document is one source file; multiple Document objects form a Documents collection.
  2. Documents -> nodes: a node corresponds to a span of text within a Document; chunk every Document to build a list of nodes from those chunks.
  3. Nodes -> index: build a VectorStoreIndex from the nodes.
  4. Index -> Retriever: build a retriever from the index, e.g. the vector retriever VectorIndexRetriever.
  5. Retriever -> QueryEngine: combine the retriever with an LLM into a query engine, RetrieverQueryEngine.
  6. Evaluate the retriever: generate questions from the nodes to get a (question, nodes) dataset, run VectorIndexRetriever on each question, and compare the retrieved nodes against the expected ones to measure retrieval performance.
  7. Evaluate the generator: generate questions from the nodes to get a (question, context, reference answer) dataset, run RetrieverQueryEngine on each question, then compare the response against the context for a faithfulness score, and against the reference answer for semantic accuracy.
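
Step 6 boils down to comparing retrieved node ids against the expected ones. A minimal plain-Python sketch of the two common retrieval metrics, hit rate and MRR (the node ids and results below are made up for illustration, not produced by LlamaIndex):

```python
# Hit rate: fraction of questions whose expected node appears in the
# retrieved list. MRR: mean reciprocal rank of the expected node.
def hit_rate(results):
    hits = sum(1 for expected, retrieved in results if expected in retrieved)
    return hits / len(results)

def mrr(results):
    total = 0.0
    for expected, retrieved in results:
        if expected in retrieved:
            total += 1.0 / (retrieved.index(expected) + 1)
    return total / len(results)

# Each entry: (expected node id, retrieved node ids) -- hypothetical data
results = [
    ("n1", ["n1", "n7"]),   # hit at rank 1
    ("n2", ["n5", "n2"]),   # hit at rank 2
    ("n3", ["n4", "n9"]),   # miss
]
print(hit_rate(results))  # 2/3
print(mrr(results))       # (1 + 0.5 + 0) / 3 = 0.5
```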

Extended notes

Building the index

1. Index initialization parameters

Storage backend: building an index ultimately stores the nodes in a database, so you can also specify which one, e.g. Chroma or Pinecone.

import chromadb
import pinecone
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.vector_stores.pinecone import PineconeVectorStore

# Chroma
db = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context1 = StorageContext.from_defaults(vector_store=vector_store)

# Pinecone
pinecone.init(api_key="<api_key>", environment="<environment>")
pinecone.create_index(
    "quickstart", dimension=1536, metric="euclidean", pod_type="p1"
)
storage_context2 = StorageContext.from_defaults(
    vector_store=PineconeVectorStore(pinecone.Index("quickstart"))
)

index = VectorStoreIndex(nodes=nodes, storage_context=storage_context2)

Transformations: further processing applied to the nodes, e.g. splitting, generating section titles, or generating Q/A-style data.

from llama_index.core import VectorStoreIndex
from llama_index.core.extractors import (
    TitleExtractor,
    QuestionsAnsweredExtractor,
)
from llama_index.core.node_parser import TokenTextSplitter

transformations = [
    TokenTextSplitter(chunk_size=512, chunk_overlap=128),
    TitleExtractor(nodes=5),
    QuestionsAnsweredExtractor(questions=3),
]

index = VectorStoreIndex(nodes, transformations=transformations)
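
The chunk_size / chunk_overlap parameters control how much text consecutive chunks share. A plain-Python sketch of overlapping splitting (over characters rather than tokens, purely to show the sliding-window behaviour; this is not TokenTextSplitter's actual implementation):

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Slide a window of chunk_size, stepping chunk_size - chunk_overlap."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last chunk_overlap units of its predecessor, so a sentence cut at a chunk boundary still appears whole in one of the two chunks.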

2. Metadata
Define metadata extractors, then attach metadata to each node.

from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
    KeywordExtractor,
)
from llama_index.extractors.entity import EntityExtractor
from llama_index.core.ingestion import IngestionPipeline

transformations = [
    SentenceSplitter(),
    TitleExtractor(nodes=5),
    QuestionsAnsweredExtractor(questions=3),
    SummaryExtractor(summaries=["prev", "self"]),
    KeywordExtractor(keywords=10),
    EntityExtractor(prediction_threshold=0.5),
]

pipeline = IngestionPipeline(transformations=transformations)
nodes = pipeline.run(documents=documents)
An example of the metadata attached to one node:

{'page_label': '2',
 'file_name': '10k-132.pdf',
 'document_title': 'Uber Technologies, Inc. 2019 Annual Report: Revolutionizing Mobility and Logistics Across 69 Countries and 111 Million MAPCs with $65 Billion in Gross Bookings',
 'questions_this_excerpt_can_answer': '\n\n 1. How many countries does Uber Technologies, Inc. operate in?\n 2. What is the total number of MAPCs served by Uber Technologies, Inc.?\n 3. How much gross bookings did Uber Technologies, Inc. generate in 2019?',
 'prev_section_summary': "\n\nThe 2019 Annual Report provides an overview of the key topics and entities that have been important to the organization over the past year. These include financial performance, operational highlights, customer satisfaction, employee engagement, and sustainability initiatives. It also provides an overview of the organization's strategic objectives and goals for the upcoming year.",
 'section_summary': '\nThis section discusses a global tech platform that serves multiple multi-trillion dollar markets with products leveraging core technology and infrastructure. It enables consumers and drivers to tap a button and get a ride or work. The platform has revolutionized personal mobility with ridesharing and is now leveraging its platform to redefine the massive meal delivery and logistics industries. The foundation of the platform is its massive network, leading technology, operational excellence, and product expertise.',
 'excerpt_keywords': '\nRidesharing, Mobility, Meal Delivery, Logistics, Network, Technology, Operational Excellence, Product Expertise, Point A, Point B'}

3. Persistence
Index construction is expensive, so the built index can be persisted to disk and reloaded later.

from llama_index.core import StorageContext, load_index_from_storage
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.storage.index_store import SimpleIndexStore
from llama_index.core.vector_stores import SimpleVectorStore

# Save
storage_context.persist(persist_dir="<persist_dir>")

# Load
storage_context = StorageContext.from_defaults(
    docstore=SimpleDocumentStore.from_persist_dir(persist_dir="<persist_dir>"),
    vector_store=SimpleVectorStore.from_persist_dir(
        persist_dir="<persist_dir>"
    ),
    index_store=SimpleIndexStore.from_persist_dir(persist_dir="<persist_dir>"),
)

index = load_index_from_storage(storage_context, index_id="<index_id>")

Building the retriever

1. Initialization parameters

similarity_top_k: the number of nodes the retriever returns.
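
similarity_top_k is just the k of a top-k similarity search over the stored embeddings. A minimal plain-Python sketch with cosine similarity (the toy vectors and node ids are made up for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, store, k):
    """Return the ids of the k stored vectors most similar to query."""
    ranked = sorted(store, key=lambda item: cosine(query, item[1]), reverse=True)
    return [node_id for node_id, _ in ranked[:k]]

store = [("n1", [1.0, 0.0]), ("n2", [0.7, 0.7]), ("n3", [0.0, 1.0])]
print(top_k([1.0, 0.1], store, k=2))  # ['n1', 'n2']
```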

vector_store_query_mode: the retrieval mode.

class VectorStoreQueryMode(str, Enum):
    """Vector store query mode."""
    DEFAULT = "default"
    SPARSE = "sparse"
    HYBRID = "hybrid"
    TEXT_SEARCH = "text_search"
    SEMANTIC_HYBRID = "semantic_hybrid"

    # fit learners
    SVM = "svm"
    LOGISTIC_REGRESSION = "logistic_regression"
    LINEAR_REGRESSION = "linear_regression"
   
    # maximum marginal relevance
    MMR = "mmr"

  • SPARSE: sparse retrieval, e.g. BM25
  • HYBRID: hybrid (dense + sparse) retrieval
  • TEXT_SEARCH: full-text search
  • SVM: retrieval with a learned SVM
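
The MMR mode in the enum above trades relevance to the query against redundancy among already-selected nodes. A plain-Python sketch of the greedy selection (the similarity functions and item names are stand-ins for illustration, not LlamaIndex's implementation):

```python
def mmr_select(candidates, query_sim, pair_sim, k, lam=0.5):
    """Greedily pick the item maximizing
    lam * sim(query, item) - (1 - lam) * max sim(item, already selected)."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(c):
            redundancy = max((pair_sim(c, s) for s in selected), default=0.0)
            return lam * query_sim(c) - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: "a" and "b" are near-duplicates, "c" is distinct but relevant.
q_sim = {"a": 0.9, "b": 0.85, "c": 0.7}.__getitem__
p_sim = lambda x, y: 0.95 if {x, y} == {"a", "b"} else 0.1
print(mmr_select(["a", "b", "c"], q_sim, p_sim, k=2))  # ['a', 'c']
```

Plain top-k would return ["a", "b"]; MMR skips the near-duplicate "b" in favour of the more diverse "c".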

filters: metadata filters, used to filter the retrieved nodes by their metadata.
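
The filtering logic itself is simple: keep only the nodes whose metadata matches every requested key/value pair. A plain-Python sketch (the node dicts and field names here are hypothetical):

```python
def filter_nodes(nodes, filters):
    """Keep nodes whose metadata matches every (key, value) filter."""
    return [
        n for n in nodes
        if all(n["metadata"].get(k) == v for k, v in filters.items())
    ]

nodes = [
    {"id": "n1", "metadata": {"year": 2019, "source": "10-K"}},
    {"id": "n2", "metadata": {"year": 2020, "source": "10-K"}},
    {"id": "n3", "metadata": {"year": 2019, "source": "blog"}},
]
matched = filter_nodes(nodes, {"year": 2019, "source": "10-K"})
print([n["id"] for n in matched])  # ['n1']
```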

Building the query engine

1. Initialization parameters

response_synthesizer: the response synthesizer, i.e. how the nodes returned by one retrieval call are combined with the prompt. For example, if the nodes are organized as a tree, synthesis works its way down from the top of the tree.

| Index type | Description | Indexing | Query |
| --- | --- | --- | --- |
| Summary | A summary index simply stores the nodes as a sequential chain | (diagram) | (diagram) |
| Vector Store | Stores each node and its corresponding embedding in a vector store | (diagram) | (diagram) |
| Tree | Builds a hierarchical tree from a set of nodes (which become the leaves of the tree) | (diagram) | (diagram) |
| Keyword Table | Extracts keywords from each node and maps each keyword to the nodes containing it | (diagram) | (diagram) |
| Property Graph | First builds a knowledge graph of labeled nodes and relations; querying runs several sub-retrievers and combines the results, by default keyword + synonym expansion plus vector retrieval | | |
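
The tree-style synthesis mentioned above can be sketched as repeated bottom-up summarization: combine chunks in small groups, summarize each group, and recurse until a single response remains. Here summarize is a stand-in for an LLM call (an illustration of the idea, not LlamaIndex's actual implementation):

```python
def tree_summarize(chunks, summarize, fanout=2):
    """Bottom-up: merge groups of `fanout` chunks until one remains."""
    while len(chunks) > 1:
        chunks = [
            summarize(chunks[i:i + fanout])
            for i in range(0, len(chunks), fanout)
        ]
    return chunks[0]

# Stand-in "LLM": just joins its inputs, so the tree shape is visible.
fake_llm = lambda group: "(" + "+".join(group) + ")"
print(tree_summarize(["c1", "c2", "c3", "c4"], fake_llm))
# ((c1+c2)+(c3+c4))
```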

node_postprocessors: post-processing steps applied to the nodes after the retriever returns them:

from llama_index.core import postprocessor
list(filter(lambda key: key[0].isupper(), dir(postprocessor)))

['AutoPrevNextNodePostprocessor', 'EmbeddingRecencyPostprocessor', 'FixedRecencyPostprocessor', 'KeywordNodePostprocessor', 'LLMRerank', 'LongContextReorder', 'MetadataReplacementPostProcessor', 'NERPIINodePostprocessor', 'PIINodePostprocessor', 'PrevNextNodePostprocessor', 'SentenceEmbeddingOptimizer', 'SentenceTransformerRerank', 'SimilarityPostprocessor', 'TimeWeightedPostprocessor']

For example, LLMRerank above re-ranks the nodes with an LLM, and TimeWeightedPostprocessor re-weights nodes using the timestamps in their metadata.
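
As a concrete example of postprocessor logic, a similarity-cutoff filter (in the spirit of SimilarityPostprocessor, sketched here in plain Python with made-up scores rather than the library's API) simply drops nodes whose retrieval score falls below a threshold:

```python
def similarity_cutoff(scored_nodes, cutoff):
    """Drop (node_id, score) pairs whose score is below the cutoff."""
    return [(node_id, s) for node_id, s in scored_nodes if s >= cutoff]

scored = [("n1", 0.82), ("n2", 0.41), ("n3", 0.77)]
print(similarity_cutoff(scored, cutoff=0.7))  # [('n1', 0.82), ('n3', 0.77)]
```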