LangChain Advanced 01: Structured Output

Published 2024-10-07, updated 2025-02-02. Categories: 2-Deep Learning, LLM Development Engineer Guide, LangChain

This article describes how LangChain standardizes the output format of LLMs.

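The raw output of an LLM is free-form text with no structure, which makes downstream processing unpredictable and hard to build applications on. A common example is extracting specific fields from a piece of text and saving them to a database. Below we ask the model a question and compare the output without and with structuring.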

from langchain_ollama import ChatOllama

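# Initialize the Ollama LLM; the Ollama service must be running in the background
model_name = "qwen2.5:latest"
llm = ChatOllama(model=model_name)

# "张三 is 25 years old and 168 cm tall"
response = llm.invoke("张三25岁，并且有168厘米")
print(response.content)

# "李四 is 28 years old and 172 cm tall"
response = llm.invoke("李四28岁，并且有172厘米")
print(response.content)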

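OK. Based on the information you provided, 张三 is a 25-year-old male with a height of 168 cm. If you need further help or have other questions about 张三, feel free to let me know!
OK. Based on the information you provided, 李四 is a 28-year-old male with a height of 172 cm. What would you like to know or discuss about 李四? Or would you like me to perform some specific operation based on this information?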

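As shown, when the model is queried directly with no structured output defined, the content of its reply is not controllable. Next we structure the output.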

from pydantic import BaseModel, Field

class Person(BaseModel):
    """Information about a person."""
    name: str = Field(..., description="The name of the person")
    height_in_meters: float = Field(..., description="The height of the person expressed in meters.")

structured_llm = llm.with_structured_output(Person)  # bind the structured output schema
structured_llm.invoke("张三25岁，并且有168厘米")

Person(name='张三', height_in_meters=1.68)

As shown, the model output now has a fixed structure!
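A fixed structure is what makes the database use case mentioned at the start straightforward. As a minimal sketch, assuming the extracted fields are available as a plain dict (the `record` below is hypothetical, as is the in-memory SQLite table `person` invented for illustration):

```python
import sqlite3

# Hypothetical record: the fields of the structured result as a plain dict.
record = {"name": "张三", "height_in_meters": 1.68}

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute("CREATE TABLE person (name TEXT, height_in_meters REAL)")

# Named placeholders map directly onto the structured field names.
conn.execute(
    "INSERT INTO person (name, height_in_meters) VALUES (:name, :height_in_meters)",
    record,
)
conn.commit()

rows = conn.execute("SELECT name, height_in_meters FROM person").fetchall()
print(rows)
conn.close()
```

With a Pydantic v2 instance, calling `model_dump()` on it should give a dict of this shape.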
LangChain actually provides 2 ways to set up "structured" output for LLMs. The rest of this article explores how LangChain's structured output works.

Using the with_structured_output method


from pydantic import BaseModel, Field

class Person(BaseModel):
    """Information about a person."""
    name: str = Field(..., description="The name of the person")
    height_in_meters: float = Field(..., description="The height of the person expressed in meters.")

structured_llm = llm.with_structured_output(Person)  # bind the structured output schema
structured_llm.invoke("张三25岁，并且有168厘米")

Person(name='张三', height_in_meters=1.68)

The example above uses a Pydantic class to define the output format. In fact, with_structured_output accepts any of the following as its schema:

an OpenAI function/tool schema,
a JSON Schema,
a TypedDict class (support added in 0.1.20),
or a Pydantic class.
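As a sketch of one of the non-Pydantic options, the same Person structure can be written as a plain JSON Schema dictionary. The dictionary below mirrors the fields of the Person class; the helper name `bind_person_schema` is hypothetical and assumes an `llm` object like the ChatOllama instance created earlier. It is defined but not called here, so the block runs without a model.

```python
# JSON Schema equivalent of the Person class (field names copied from it).
person_schema = {
    "title": "Person",
    "description": "Information about a person.",
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "The name of the person"},
        "height_in_meters": {
            "type": "number",
            "description": "The height of the person expressed in meters.",
        },
    },
    "required": ["name", "height_in_meters"],
}


def bind_person_schema(llm):
    """Hypothetical helper: bind the dict schema to a chat model.

    Assumes `llm` implements with_structured_output (e.g. ChatOllama).
    """
    return llm.with_structured_output(person_schema)
```

With a dictionary schema, the result of invoke is expected to be a plain dict rather than a Person instance.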

Custom structured output

Note that not every model implements with_structured_output, because not every model supports tool calling or JSON mode. In that case there are two ways to solve the problem:

Using PydanticOutputParser: a built-in class that parses chat model output matching a given Pydantic schema
Using LCEL: plain functions with a custom prompt and a custom parser

Using PydanticOutputParser


from typing import List
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

class Person(BaseModel):
    """Information about a person."""
    name: str = Field(..., description="The name of the person")
    height_in_meters: float = Field(..., description="The height of the person expressed in meters.")

class People(BaseModel):
    """Identifying information about all people in a text."""
    people: List[Person]

# Set up a parser
parser = PydanticOutputParser(pydantic_object=People)

# Prompt
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Answer the user query. Wrap the output in `json` tags\n{format_instructions}",
        ),
        ("human", "{query}"),
    ]
).partial(format_instructions=parser.get_format_instructions())

query = "张三25岁，并且有168厘米"
# Pass the input by name, matching how the chain is invoked later
print(prompt.invoke({"query": query}).to_string())

System: Answer the user query. Wrap the output in `json` tags
The output should be formatted as a JSON instance that conforms to the JSON schema below.
As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.
Here is the output schema:
{"$defs": {"Person": {"description": "Information about a person.", "properties": {"name": {"description": "The name of the person", "title": "Name", "type": "string"}, "height_in_meters": {"description": "The height of the person expressed in meters.", "title": "Height In Meters", "type": "number"}}, "required": ["name", "height_in_meters"], "title": "Person", "type": "object"}}, "description": "Identifying information about all people in a text.", "properties": {"people": {"items": {"$ref": "#/$defs/Person"}, "title": "People", "type": "array"}}, "required": ["people"]}
Human: 张三25岁，并且有168厘米

Here the prompt takes the input query and produces a System message and a Human message; the System message holds the content generated from the custom output structure.

In the System message, the format instructions generated for the People class include the example "As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]} the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted." This tells the model that the generated JSON should be an instance such as {"foo": ["bar", "baz"]}, not a schema-shaped object such as {"properties": {"foo": ["bar", "baz"]}}.
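The distinction drawn in that example can be checked mechanically. The sketch below uses a deliberately simplified check (required keys plus array-of-string fields), not a full JSON Schema validator; the function name `conforms` is made up for illustration.

```python
# The example schema quoted in the format instructions.
schema = {
    "properties": {
        "foo": {
            "title": "Foo",
            "description": "a list of strings",
            "type": "array",
            "items": {"type": "string"},
        }
    },
    "required": ["foo"],
}


def conforms(instance: dict, schema: dict) -> bool:
    """Simplified check covering only required keys and arrays of strings."""
    # Every required key must be present at the top level of the instance.
    for key in schema.get("required", []):
        if key not in instance:
            return False
    # Array-typed properties must be lists; string items must be strings.
    for key, spec in schema.get("properties", {}).items():
        if key not in instance:
            continue
        value = instance[key]
        if spec.get("type") == "array":
            if not isinstance(value, list):
                return False
            wants_strings = spec.get("items", {}).get("type") == "string"
            if wants_strings and not all(isinstance(item, str) for item in value):
                return False
    return True


good = {"foo": ["bar", "baz"]}
bad = {"properties": {"foo": ["bar", "baz"]}}

print(conforms(good, schema))  # True
print(conforms(bad, schema))   # False: "foo" is missing at the top level
```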

Next comes the output schema. Pretty-printed as JSON it looks as follows: the top level requires a people field, which is defined as an array of Person, and each Person must provide name and height_in_meters.

{
    "$defs": {
        "Person": {
            "description": "Information about a person.",
            "properties": {
                "name": {
                    "description": "The name of the person",
                    "title": "Name",
                    "type": "string"
                },
                "height_in_meters": {
                    "description": "The height of the person expressed in meters.",
                    "title": "Height In Meters",
                    "type": "number"
                }
            },
            "required": [
                "name",
                "height_in_meters"
            ],
            "title": "Person",
            "type": "object"
        }
    },
    "description": "Identifying information about all people in a text.",
    "properties": {
        "people": {
            "items": {
                "$ref": "#/$defs/Person"
            },
            "title": "People",
            "type": "array"
        }
    },
    "required": [
        "people"
    ]
}

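In other words, structured output is achieved here by modifying the prompt sent to the LLM; LangChain's role is to perform the "structured representation -> prompt" conversion.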
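The "structured representation -> prompt" step can be imitated in a few lines. The sketch below is not LangChain's implementation: `build_system_prompt` is a hypothetical function that embeds the schema shown above into instruction text similar to the printed System message.

```python
import json

# Schema copied from the pretty-printed output schema above.
people_schema = {
    "$defs": {
        "Person": {
            "description": "Information about a person.",
            "properties": {
                "name": {
                    "description": "The name of the person",
                    "title": "Name",
                    "type": "string",
                },
                "height_in_meters": {
                    "description": "The height of the person expressed in meters.",
                    "title": "Height In Meters",
                    "type": "number",
                },
            },
            "required": ["name", "height_in_meters"],
            "title": "Person",
            "type": "object",
        }
    },
    "description": "Identifying information about all people in a text.",
    "properties": {
        "people": {
            "items": {"$ref": "#/$defs/Person"},
            "title": "People",
            "type": "array",
        }
    },
    "required": ["people"],
}

SCHEMA_HEADER = "Here is the output schema:\n"


def build_system_prompt(schema: dict) -> str:
    """Hypothetical stand-in: serialize the schema into instruction text."""
    return (
        "Answer the user query. Wrap the output in `json` tags\n"
        "The output should be formatted as a JSON instance that conforms "
        "to the JSON schema below.\n"
        + SCHEMA_HEADER
        + json.dumps(schema, ensure_ascii=False)
    )


system_prompt = build_system_prompt(people_schema)
print(system_prompt)
```

The model never sees the Python class itself, only text like this, which is why the field descriptions matter: they are the only guidance the model gets about each field.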

chain = prompt | llm
query = "张三25岁，并且有168厘米"
chain.invoke({"query": query}).content

'json\n{\n "people": [\n {\n "name": "张三",\n "height_in_meters": 1.68\n }\n ]\n}\n'

chain = prompt | llm | parser
query = "张三25岁，并且有168厘米"
chain.invoke({"query": query})

People(people=[Person(name='张三', height_in_meters=1.68)])


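chain = prompt | llm | parser

# "张三 is 25 years old and 168 cm tall, and 李四 is 28 years old and 172 cm tall"
query = "张三25岁，并且有168厘米并且李四28岁，并且有172厘米"

chain.invoke({"query": query})  # input containing several people is also parsed into structured output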

People(people=[Person(name='张三', height_in_meters=1.68), Person(name='李四', height_in_meters=1.72)])

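Summary: structured output works by modifying the prompt so that the LLM directly emits content in a specific format, much like asking "xxx, please display it as a table", rather than by parsing free text with hand-written rules. For the "output format -> prompt" step, LangChain provides 2 ways: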

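Using with_structured_output: binds a custom output structure directly to the model, and the schema is passed to the model automatically (this relies on the model supporting tool calling or JSON mode)
Using PydanticOutputParser: defines a custom output structure, and the format instructions for the prompt are generated automatically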

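Beyond these, you can also specify the output format in the prompt yourself and replace the parser above with a custom parsing function. To sum up, getting a model to produce structured content with LangChain involves two steps: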

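Define the structured representation with LangChain and send the resulting prompt to the LLM, which returns the structured content as a JSON string
Parse the LLM output into structured content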
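The custom-function alternative to the parser mentioned above can be sketched as follows. This is a minimal illustration using only the standard library: the dataclass `Person` is a plain stand-in for the Pydantic model, `parse_people` is a hypothetical name, and the sample `reply` is hand-written to resemble a fenced JSON answer rather than captured from a model.

```python
import json
import re
from dataclasses import dataclass
from typing import List

# Matches a Markdown code fence (three backticks), optionally tagged "json".
FENCED_JSON = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL)


@dataclass
class Person:
    """Plain-dataclass stand-in for the Pydantic Person model."""
    name: str
    height_in_meters: float


def parse_people(text: str) -> List[Person]:
    """Extract the JSON payload from a model reply and build Person objects.

    Raises ValueError if the reply has no parseable "people" array.
    """
    match = FENCED_JSON.search(text)
    payload = match.group(1) if match else text  # fall back to the raw reply
    try:
        data = json.loads(payload)
        return [
            Person(name=str(p["name"]), height_in_meters=float(p["height_in_meters"]))
            for p in data["people"]
        ]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        raise ValueError(f"Could not parse model reply: {exc}") from exc


fence = "`" * 3  # built at runtime to avoid a literal fence inside this block
reply = fence + 'json\n{"people": [{"name": "张三", "height_in_meters": 1.68}]}\n' + fence
people = parse_people(reply)
print(people)
```

In an LCEL chain, a function like this could take the place of the parser, for example `prompt | llm | (lambda msg: parse_people(msg.content))`, since plain callables in a pipe are wrapped as runnables.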