RAG Evaluation - Generation Metrics

Evaluating generated results can be difficult: unlike traditional machine learning, the prediction is not a single number, and it is hard to define quantitative metrics for this problem.

LlamaIndex provides LLM-based evaluation modules to measure the quality of results. They use a "gold" LLM (e.g., GPT-4) to decide, in various ways, whether a predicted answer is correct.

Note that many of the current evaluation modules do not require ground-truth labels. Evaluation can be performed with some combination of the query, context, and response, combining these with LLM calls.

Evaluation metrics:

  • Correctness: whether the generated answer matches the reference answer for the given query (requires labels).
  • Semantic Similarity: whether the predicted answer is semantically similar to the reference answer (requires labels).
  • Faithfulness: whether the answer is faithful to the retrieved contexts (in other words, whether it hallucinates).
  • Context Relevancy: whether the retrieved context is relevant to the query.
  • Answer Relevancy: whether the generated answer is relevant to the query.
  • Guideline Adherence: whether the predicted answer adheres to specified guidelines.
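As a quick reference, the inputs each metric consumes can be summarized in a plain mapping. This is a sketch for orientation only (the dict and helper below are illustrative, not LlamaIndex APIs); "reference" marks the metrics that need a ground-truth label.

```python
# Which inputs each generation metric consumes (summarizing the list above).
# "reference" means a ground-truth answer (label) is required.
METRIC_INPUTS = {
    "correctness":         {"query", "response", "reference"},
    "semantic_similarity": {"response", "reference"},
    "faithfulness":        {"response", "contexts"},
    "context_relevancy":   {"query", "contexts"},
    "answer_relevancy":    {"query", "response"},
    "guideline_adherence": {"query", "response"},
}

def needs_labels(metric: str) -> bool:
    """A metric needs labelled data iff it consumes a reference answer."""
    return "reference" in METRIC_INPUTS[metric]

print(sorted(m for m in METRIC_INPUTS if needs_labels(m)))
# → ['correctness', 'semantic_similarity']
```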

Architecture of LlamaIndex's generation-evaluation code

classDiagram
    BaseEvaluator <|-- FaithfulnessEvaluator
    BaseEvaluator <|-- CorrectnessEvaluator
    class BaseEvaluator {
        +evaluate(query, response, contexts)
        +aevaluate(query, response, contexts)
        +evaluate_response(query, response)
        +aevaluate_response(query, response)
    }
    class FaithfulnessEvaluator {
        +llm
        +eval_template
        +refine_template
        +aevaluate(query, response, contexts) EvaluationResult
    }
    class CorrectnessEvaluator {
        +llm
        +eval_template
        +score_threshold
        +aevaluate(query, response, contexts) EvaluationResult
    }

Each LlamaIndex evaluator inherits from the BaseEvaluator base class and implements the aevaluate method, which returns an EvaluationResult.

EvaluationResult is a structured-output declaration containing the following information:

class EvaluationResult(BaseModel):
    """Evaluation result.

    Output of an BaseEvaluator.
    """

    query: Optional[str] = Field(default=None, description="Query string")
    contexts: Optional[Sequence[str]] = Field(
        default=None, description="Context strings"
    )
    response: Optional[str] = Field(default=None, description="Response string")
    passing: Optional[bool] = Field(
        default=None, description="Binary evaluation result (passing or not)"
    )
    feedback: Optional[str] = Field(
        default=None, description="Feedback or reasoning for the response"
    )
    score: Optional[float] = Field(default=None, description="Score for the response")
    pairwise_source: Optional[str] = Field(
        default=None,
        description=(
            "Used only for pairwise and specifies whether it is from original order of"
            " presented answers or flipped order"
        ),
    )
    invalid_result: bool = Field(
        default=False, description="Whether the evaluation result is an invalid one."
    )
    invalid_reason: Optional[str] = Field(
        default=None, description="Reason for invalid evaluation."
    )

Using the evaluators

LlamaIndex provides the following evaluators:

from llama_index.core import evaluation

evaluations = list(filter(lambda att: att.find('Evaluator') > 0, dir(evaluation)))
print(evaluations)

['AnswerRelevancyEvaluator', 'BaseEvaluator', 'BaseRetrievalEvaluator', 'ContextRelevancyEvaluator', 'CorrectnessEvaluator', 'FaithfulnessEvaluator', 'GuidelineEvaluator', 'MultiModalRetrieverEvaluator', 'PairwiseComparisonEvaluator', 'QueryResponseEvaluator', 'RelevancyEvaluator', 'ResponseEvaluator', 'RetrieverEvaluator', 'SemanticSimilarityEvaluator']

Taking Faithfulness as an example, usage is as follows:

from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import FaithfulnessEvaluator

# create llm
llm = OpenAI(model="gpt-4", temperature=0.0)

# build index
...

# define evaluator
evaluator = FaithfulnessEvaluator(llm=llm)

# query index
query_engine = vector_index.as_query_engine()
response = query_engine.query(
    "What battles took place in New York City in the American Revolution?"
)
eval_result = evaluator.evaluate_response(response=response)
print(str(eval_result.passing))
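Conceptually, a faithfulness check asks a judge LLM whether each claim in the response is supported by the retrieved context (the real evaluator drives this with its eval_template and refine_template prompts). The toy sketch below substitutes a keyword-overlap heuristic for the LLM judge, purely to make the control flow concrete; `toy_judge` and `faithfulness` are illustrative stand-ins, not LlamaIndex APIs.

```python
def toy_judge(statement: str, contexts: list[str]) -> bool:
    """Stand-in for the LLM judge: 'supported' if enough of the
    statement's words appear in some context chunk."""
    words = set(statement.lower().split())
    return any(len(words & set(c.lower().split())) >= len(words) // 2
               for c in contexts)

def faithfulness(response: str, contexts: list[str]) -> dict:
    """Judge each sentence of the response against the contexts."""
    statements = [s.strip() for s in response.split(".") if s.strip()]
    supported = [s for s in statements if toy_judge(s, contexts)]
    return {
        "passing": len(supported) == len(statements),
        "score": len(supported) / len(statements) if statements else 0.0,
    }

ctx = ["The Battle of Long Island took place in Brooklyn in 1776."]
print(faithfulness("The Battle of Long Island took place in 1776.", ctx))
# → {'passing': True, 'score': 1.0}
```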

Generating a test set

Before running generation evaluation, first build the test data. Test data comes in labelled and unlabelled variants and generally contains the attributes below; generating the data means producing a batch of examples with these attributes:

  • Query: the question
  • Contexts: the retrieved context
  • Response: the LLM's answer
  • Answer: the reference (ground-truth) answer

LlamaIndex provides the RagDatasetGenerator class for generating RAG test data. Its attributes and methods are as follows:

classDiagram
    class RagDatasetGenerator {
        +nodes
        +num_questions_per_chunk
        +text_question_template
        +text_qa_template
        +question_gen_query
        +from_documents(documents, ...)
        +_agenerate_dataset(nodes, labelled) LabelledRagDataset
    }

RagDatasetGenerator is initialized with the nodes, num_questions_per_chunk (the number of questions to generate per node), and three prompts.

Generation steps:

  • Step 1: build a query engine for each node, then use text_question_template and question_gen_query to generate questions; at this point we already have query (the generated question) and contexts.
  • Step 2: if labelled data is needed, use text_qa_template to generate reference answers.
  • Step 3: assemble the question, contexts, and answer into a LabelledRagDataExample, which contains the following attributes:
classDiagram
    class LabelledRagDataExample {
        +query
        +query_by
        +reference_contexts
        +reference_answer
        +reference_answer_by
    }
    class RagExamplePrediction {
        +response
        +contexts
    }
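The steps above amount to a loop over nodes: generate questions per node, then optionally generate reference answers. The runnable sketch below makes that loop concrete with a stub in place of the real LLM; `fake_llm`, `generate_examples`, and the prompt strings are illustrative stand-ins, not LlamaIndex APIs.

```python
def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call."""
    if prompt.startswith("QUESTION:"):
        return "What does this chunk describe?"
    return "It describes the chunk contents."

def generate_examples(nodes: list[str], num_questions_per_chunk: int,
                      labelled: bool = True) -> list[dict]:
    """Toy version of the generate-questions / generate-answers loop."""
    examples = []
    for node in nodes:
        for _ in range(num_questions_per_chunk):
            # Step 1: generate a question grounded in this node
            query = fake_llm(f"QUESTION: {node}")
            example = {"query": query, "reference_contexts": [node]}
            # Step 2: optionally generate a reference answer
            if labelled:
                example["reference_answer"] = fake_llm(f"ANSWER: {node}\nQ: {query}")
            # Step 3: collect the assembled example
            examples.append(example)
    return examples

dataset = generate_examples(["chunk one", "chunk two"], num_questions_per_chunk=2)
print(len(dataset))  # → 4
```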
from llama_index.core import SimpleDirectoryReader
from llama_index.llms.openai import OpenAI
from llama_index.core.llama_dataset.generator import RagDatasetGenerator

# create llm
llm = OpenAI(model="gpt-4", temperature=0.0)

# build documents
documents = SimpleDirectoryReader("./data").load_data()

# define generator, generate questions
dataset_generator = RagDatasetGenerator.from_documents(
    documents=documents,
    llm=llm,
    num_questions_per_chunk=10,  # number of questions to generate per node
)

rag_dataset = dataset_generator.generate_questions_from_nodes()
questions = [e.query for e in rag_dataset.examples]

Then run batch evaluation to evaluate multiple metrics over a batch of samples in one pass:

from llama_index.core.evaluation import BatchEvalRunner

runner = BatchEvalRunner(
    {"faithfulness": faithfulness_evaluator, "relevancy": relevancy_evaluator},
    workers=8,
)

eval_results = await runner.aevaluate_queries(
    vector_index.as_query_engine(), queries=questions
)
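The runner's output maps each metric name to a list of result objects with a `passing` field, so a typical next step is computing per-metric pass rates. A runnable sketch (with `SimpleNamespace` objects standing in for the real results, and `pass_rates` as an illustrative helper rather than a library function):

```python
from types import SimpleNamespace

def pass_rates(eval_results: dict) -> dict:
    """Per-metric pass rate from {metric: [results with .passing]}."""
    return {
        metric: sum(1 for r in results if r.passing) / len(results)
        for metric, results in eval_results.items() if results
    }

# Stand-in for real batch-evaluation output
fake = {
    "faithfulness": [SimpleNamespace(passing=True), SimpleNamespace(passing=True)],
    "relevancy": [SimpleNamespace(passing=True), SimpleNamespace(passing=False)],
}
print(pass_rates(fake))  # → {'faithfulness': 1.0, 'relevancy': 0.5}
```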