Selecting Tables to Answer Questions

Published 2025-01-21. Updated 2025-02-02. Categories: 2-Deep Learning, LLM Development Engineer Guide, RAG.

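In the earlier article "Building a general-purpose table data analysis tool with langchain" (年轻人起来冲), we addressed RAG's weakness on statistical and analytical questions by classifying each question and, where appropriate, answering it with generated pandas code. That approach has one limitation: it cannot handle more than one table. This article extends it to work across multiple tables.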

Background

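Before this extension, the RAG-based pipeline for table documents routed each user query down one of two paths:

Vector query: retrieve relevant context by vector search, then have the LLM answer from that context
Code generation: have the LLM generate code from a sample of the table's content, then execute the code to obtain the answer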

flowchart LR
    A[User question] --> B{LLM classifies question}
    C[Final reply]
    B --> D[Vector query] --> F{LLM generates reply} -->C
    B --> E[LLM generates code] --> G{Execute code} -->C

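The code-generation branch builds its prompt from the table's content. Different tables have different headers and therefore need different prompts, so the pipeline above cannot handle more than one table.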
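To make the dependence on the header concrete, here is a minimal sketch that builds one system prompt per table from its first rows. It uses the standard `csv` module rather than pandas to stay self-contained. The function name `build_system_prompt` and the two sample tables are hypothetical; in the article's pipeline the preview would come from `print(df.head())`.

```python
import csv
import io

def build_system_prompt(csv_text: str, n_rows: int = 3) -> str:
    """Build a table-specific system prompt from the header and first rows.

    Hypothetical helper: stands in for embedding `print(df.head())` in the prompt.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    # Header line plus at most n_rows data lines.
    preview = "\n".join(", ".join(row) for row in rows[: n_rows + 1])
    return (
        "You are working with a pandas DataFrame named df. "
        "Its first rows are:\n" + preview + "\n"
        "Reply with one line of pandas code that selects the matching rows."
    )

# Two made-up tables with different headers yield two different prompts.
orders = "order_id,customer,amount\n1,Alice,30\n2,Bob,45\n"
staff = "name,department,hire_date\nCarol,Sales,2021-03-01\n"

prompts = {
    name: build_system_prompt(text)
    for name, text in [("orders", orders), ("staff", staff)]
}
```

Because the preview is part of the prompt, a prompt built for one table says nothing about the columns of another, which is why a single fixed prompt cannot serve several tables.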

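This article therefore builds the prompt dynamically for each table, which allows "precise queries" to run against several tables at once. The flow becomes: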

flowchart LR
    A[User question] --> B{LLM classifies question}

    subgraph one[All tables]
        direction LR
        E[LLM generates code] --> G{Execute code}
    end

    C[Final reply]
    B --> D[Vector query] --> F{LLM generates reply} -->C
    B --> one -->I{LLM selects answer} -->C

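Key code

1. Vector retrieval branch

This is the standard RAG branch: build an index, then pass the retrieved context together with the question to the LLM to generate an answer.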

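from operator import itemgetter

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate

# Prompt (kept in Chinese, as sent to the model): answer only from the context,
# say "don't know" rather than guess, and format the output as a bulleted list with bold headers.
vector_prompt = PromptTemplate.from_template("""
根据下面的上下文内容回答问题。如果你不知道答案，就回答不知道，不要试图编造答案
优化输出效果，采用无序列表输出，表头加粗处理
{context}
问题：{question}
""")

# `indexs` (the vector store), `format_docs` and `llm` are defined elsewhere in the project.
retriever = indexs.as_retriever()
vector_chain = (
    # The chain is invoked with {"question": ...}; extract the string for both keys
    # so the prompt receives the question text rather than the whole input dict.
    {"context": itemgetter("question") | retriever | format_docs, "question": itemgetter("question")}
    | vector_prompt
    | llm
    | StrOutputParser()
)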
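`format_docs` is not shown in the article. A common way to write such a helper is to join the text of the retrieved documents into one string, as in the sketch below. The `Doc` dataclass is a stand-in for LangChain's `Document` (only its `page_content` attribute is used), so the snippet runs without any dependency.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    """Stand-in for a retrieved document; only `page_content` is used."""
    page_content: str

def format_docs(docs) -> str:
    """Join the retrieved chunks into one context string, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)

context = format_docs([Doc("row 1: Alice, 30"), Doc("row 2: Bob, 45")])
```

The resulting string is what fills the `{context}` slot of the prompt.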

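2. pandas generation branch

This branch has two stages. First, pandas query code is generated and executed for each table. Then the LLM scores how well each table's result answers the question, and the highest-scoring answer is returned.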

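from operator import itemgetter

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda

# 1. Generate pandas code
# System prompt (kept in Chinese, as sent to the model): given the output of `print(df.head())`,
# reply with one runnable line of pandas that selects the matching rows (all columns), code only.
system_string="你正在使用pandas处理DataFrame，请根据三个引号分隔的问题输出pandas命令，其中`print(df.head())`的结果如下:\n"
system_string+="{excle_head}\n"
system_string+= "请使用1行代码完成需求，目的是通过pandas检索出符合条件的行（包括所有列）\n"
system_string+= "只输出代码，不要输出其他任何信息，也不能写任何注释\n"
system_string+= "确保代码可运行\n"

pandas_prompt = ChatPromptTemplate.from_messages([("system", system_string), ("human", "'''{question}'''")])

# Invoked with {'excle_path': ..., 'question': ...}.
# `read_excle`, `_sanitize_output`, `run_python` and `postprocess` are project helpers defined elsewhere.
# Assumption: `read_excle` takes a file path and returns the `df.head()` text.
pandas_chain = (
    {'excle_head': itemgetter('excle_path') | RunnableLambda(read_excle), 'question': itemgetter('question')}
    | pandas_prompt | llm | StrOutputParser() | _sanitize_output
    | run_python | postprocess
)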
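`_sanitize_output` and `run_python` are not shown here. As an assumption about what they might do, the sketch below strips a Markdown code fence from the model's reply and evaluates the remaining one-line expression with `df` as the only visible name. Both function bodies are hypothetical, and a list of dicts stands in for the DataFrame so the snippet has no dependencies.

```python
import re

FENCE = "`" * 3  # a Markdown code fence marker

def sanitize_output(text: str) -> str:
    """Return the code inside a fenced block if present, else the stripped text."""
    pattern = FENCE + r"(?:python)?\s*(.*?)" + FENCE
    match = re.search(pattern, text, re.DOTALL)
    return (match.group(1) if match else text).strip()

def run_expression(code: str, df):
    """Evaluate a single expression with `df` as the only visible name.

    Emptying __builtins__ blocks casual misuse but is NOT a real sandbox;
    untrusted model output should run in an isolated process.
    """
    return eval(code, {"__builtins__": {}, "df": df})

# A made-up model reply wrapped in a code fence, and a made-up table.
reply = FENCE + "python\n[row for row in df if row['amount'] > 40]\n" + FENCE
table = [{"customer": "Alice", "amount": 30}, {"customer": "Bob", "amount": 45}]
result = run_expression(sanitize_output(reply), table)
```

Using `eval` rather than `exec` fits the prompt's "one line of code" constraint, since a single expression returns its value directly.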

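# 2. Referee: the LLM scores each answer
# Prompt (kept in Chinese, as sent to the model): rate from 0 to 10 how well the answer
# responds to the question, and return only the score.
referee_prompt=PromptTemplate.from_template("""
根据问题及回答，评估回答响应了问题的程度，打分范围是0-10分
问题：{question}
回答: {answer}
直接给出分数即可，不需要解释
""")

referee_chain = (referee_prompt
    | llm
    | StrOutputParser()
)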

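def answer_by_pandas(question, excles_path):
    """Run the pandas branch on every selected table and return the best-scored answer."""
    # '无法回答该问题' means "unable to answer this question".
    try:
        # For each table: generate and run pandas code, then have the LLM score the result.
        answers=[]
        for excle_path in excles_path:
            answer = pandas_chain.invoke({'excle_path':excle_path, 'question':question})
            score=referee_chain.invoke({"question":question,"answer":answer})
            # The referee may reply "8分"; strip the unit before converting.
            answers.append((answer,int(score.replace('分',''))))

        if not answers:
            return '无法回答该问题'
        # Keep the highest-scoring answer; a top score of 0 means no table could answer.
        final_answer=max(answers,key=itemgetter(1))
        if final_answer[1]>0:
            return final_answer[0]
        return '无法回答该问题'
    except Exception as e:
        # Any failure (generated code that does not run, unparsable score) ends here.
        print(f'出现错误:{e}')
        return '无法回答该问题'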
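`int(score.replace('分',''))` raises if the referee replies with anything other than a bare number optionally followed by `分`, and because the `try` wraps the whole loop, one such reply discards every table's answer. A more tolerant parser is sketched below. The function names and the sample replies are illustrative and are not part of the original code.

```python
import re

def parse_score(reply: str, default: int = 0) -> int:
    """Extract the first integer from the referee's reply and clamp it to 0-10."""
    match = re.search(r"\d+", reply)
    if match is None:
        return default  # no number at all: treat as "did not answer"
    return max(0, min(10, int(match.group())))

def pick_best(scored_answers):
    """Return the answer with the highest score, or None if every score is 0."""
    if not scored_answers:
        return None
    answer, score = max(scored_answers, key=lambda pair: pair[1])
    return answer if score > 0 else None

# Made-up referee replies in several formats.
replies = ["8分", "评分：6", " 10 ", "无法评分"]
scores = [parse_score(r) for r in replies]
best = pick_best(list(zip(["A", "B", "C", "D"], scores)))
```

Returning a default score instead of raising keeps one malformed reply from hiding a good answer found in another table.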

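3. Question classification

This step classifies the user's question and then routes it to either the vector query branch or the pandas generation branch.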

def route(info):
    # info = {"class": <classifier output>, "question": <user question>}
    # "精准查询" = precise query, "普通查询" = ordinary query.
    # `excles_path` is the module-level list of tables the user selected.
    if "精准查询" in info["class"]:
        return answer_by_pandas(question=info['question'],excles_path=excles_path)
    # Ordinary queries, and any unexpected label, fall back to the vector branch.
    return answer_by_vector(question=info['question'],excles_path=excles_path)


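from langchain_core.runnables import RunnableLambda, RunnablePassthrough

# Classifier prompt (kept in Chinese, as sent to the model): label the question as
# "普通查询" (ordinary) or "精准查询" (precise: it names an exact time, place, person or serial number),
# and output nothing but the label.
class_prompt=ChatPromptTemplate.from_template("""你是一名问题归类员，你的任务是识别
以下由三个反引号包含的问题，然后将其分类为：“普通查询”、“精准查询”，当提问涉及准确的时间、地点、人物、序号时，归类为精准查询，
否则归类为普通查询。
直接输出“普通查询”、“精准查询”之一，不要输出其他任何信息
问题：```{question}```
""")

class_chain = class_prompt | llm | StrOutputParser()
# The classifier label and the original question are passed together to `route`.
full_chain = {"class": class_chain, "question": RunnablePassthrough()} | RunnableLambda(route)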
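The routing logic can be exercised without an LLM by replacing the classifier and the two branches with stubs. In the sketch below every function is a hypothetical stand-in. The only part taken from the article is the rule that a label containing `精准查询` goes to the pandas branch and everything else goes to the vector branch.

```python
def fake_classifier(question: str) -> str:
    """Stub for class_chain: calls a question 'precise' if it contains a digit."""
    return "精准查询" if any(ch.isdigit() for ch in question) else "普通查询"

def route(info: dict, branches: dict) -> str:
    """Dispatch on the classifier label, defaulting to the vector branch."""
    key = "pandas" if "精准查询" in info["class"] else "vector"
    return branches[key](info["question"])

# Stub branches that only report which path was taken.
branches = {
    "pandas": lambda q: f"pandas branch: {q}",
    "vector": lambda q: f"vector branch: {q}",
}

def answer(question: str) -> str:
    """Classify the question, then route it to the matching stub branch."""
    return route({"class": fake_classifier(question), "question": question}, branches)

precise = answer("Who placed order 42?")
general = answer("What is this table about?")
```

Testing the label with `in` rather than `==` tolerates a classifier that adds punctuation or whitespace around the label.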

Results

To use it, first tick the tables to be processed, then ask a question.

[Image: 选择表格回答问题-20250121184553]