This post walks through building a private knowledge base with InternLM and LangChain.
Environment setup

```python
!conda create --name internlm_langchain --clone=/root/share/conda_envs/internlm-base

!/root/.conda/envs/internlm_langchain/bin/python -m pip install ipykernel ipywidgets

!/root/.conda/envs/internlm_langchain/bin/python -m ipykernel install --user --name internlm_langchain --display-name internlm_langchain
```
```python
%pip install -q --upgrade pip

%pip install -q modelscope==1.9.5 transformers==4.35.2 streamlit==1.24.0 sentencepiece==0.1.99 accelerate==0.24.1
```

```python
%pip install -q langchain==0.0.292 gradio==4.4.0 chromadb==0.4.15 sentence-transformers==2.2.2 unstructured==0.10.30 markdown==3.3.7
```

```python
%pip install -q -U huggingface_hub
```
Downloading the model and NLTK resources

```python
%mkdir -p /root/data/model/Shanghai_AI_Laboratory
%cp -r /root/share/temp/model_repos/internlm-chat-7b /root/data/model/Shanghai_AI_Laboratory/internlm-chat-7b
```
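```python
import os
import sys

# Use the HuggingFace mirror endpoint for the download
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

# Download the sentence-transformer embedding model into a local directory
os.system(f'{os.path.join(sys.exec_prefix, "bin/huggingface-cli")} download --resume-download sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 --local-dir /root/data/model/sentence-transformer')
```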
```python
%cd /root
!git clone https://gitee.com/yzy0612/nltk_data.git --branch gh-pages
```
/root
Cloning into 'nltk_data'...
/root/.conda/envs/internlm_langchain/lib/python3.10/site-packages/IPython/core/magics/osm.py:417: UserWarning: using dhist requires you to install the `pickleshare` library.
self.shell.db['dhist'] = compress_dhist(dhist)[-100:]
remote: Enumerating objects: 1692, done.
remote: Counting objects: 100% (1692/1692), done.
remote: Compressing objects: 100% (775/775), done.
remote: Total 1692 (delta 909), reused 1692 (delta 909), pack-reused 0
Receiving objects: 100% (1692/1692), 952.80 MiB | 5.84 MiB/s, done.
Resolving deltas: 100% (909/909), done.
Updating files: 100% (244/244), done.
```python
%cd nltk_data
%mv packages/* ./
```
/root/nltk_data
```python
%cd tokenizers
!unzip punkt.zip
```
/root/nltk_data/tokenizers
Archive: punkt.zip
creating: punkt/
inflating: punkt/greek.pickle
inflating: punkt/estonian.pickle
inflating: punkt/turkish.pickle
inflating: punkt/polish.pickle
creating: punkt/PY3/
inflating: punkt/PY3/greek.pickle
inflating: punkt/PY3/estonian.pickle
inflating: punkt/PY3/turkish.pickle
inflating: punkt/PY3/polish.pickle
inflating: punkt/PY3/russian.pickle
inflating: punkt/PY3/czech.pickle
inflating: punkt/PY3/portuguese.pickle
inflating: punkt/PY3/README
inflating: punkt/PY3/dutch.pickle
inflating: punkt/PY3/norwegian.pickle
inflating: punkt/PY3/slovene.pickle
inflating: punkt/PY3/english.pickle
inflating: punkt/PY3/danish.pickle
inflating: punkt/PY3/finnish.pickle
inflating: punkt/PY3/swedish.pickle
inflating: punkt/PY3/spanish.pickle
inflating: punkt/PY3/german.pickle
inflating: punkt/PY3/italian.pickle
inflating: punkt/PY3/french.pickle
inflating: punkt/russian.pickle
inflating: punkt/czech.pickle
inflating: punkt/portuguese.pickle
inflating: punkt/README
inflating: punkt/dutch.pickle
inflating: punkt/norwegian.pickle
inflating: punkt/slovene.pickle
inflating: punkt/english.pickle
inflating: punkt/danish.pickle
inflating: punkt/finnish.pickle
inflating: punkt/swedish.pickle
inflating: punkt/spanish.pickle
inflating: punkt/german.pickle
inflating: punkt/italian.pickle
inflating: punkt/french.pickle
inflating: punkt/.DS_Store
inflating: punkt/PY3/malayalam.pickle
inflating: punkt/malayalam.pickle
```python
%cd ../taggers
!unzip averaged_perceptron_tagger.zip
```
/root/nltk_data/taggers
Archive: averaged_perceptron_tagger.zip
creating: averaged_perceptron_tagger/
inflating: averaged_perceptron_tagger/averaged_perceptron_tagger.pickle
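Before moving on, it can be worth confirming that both archives were unpacked where NLTK is expected to look for them. The following is a minimal sketch using only the standard library. The helper name `missing_nltk_resources` is hypothetical, and it assumes the resources live under `/root/nltk_data` as in the commands above.

```python
import os

# Resource directories that the unzip steps above should have produced,
# relative to the nltk_data root.
EXPECTED_RESOURCES = [
    os.path.join("tokenizers", "punkt"),
    os.path.join("taggers", "averaged_perceptron_tagger"),
]


def missing_nltk_resources(root="/root/nltk_data"):
    """Return the expected resource directories that do not exist under root."""
    return [rel for rel in EXPECTED_RESOURCES
            if not os.path.isdir(os.path.join(root, rel))]


if __name__ == "__main__":
    missing = missing_nltk_resources()
    print("all resources present" if not missing else f"missing: {missing}")
```

An empty list means both directories are in place; anything else points at the archive that still needs unpacking.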
Downloading the project code

```python
%cd /root/data
!git clone https://github.com/InternLM/tutorial
```
/root/data
Cloning into 'tutorial'...
remote: Enumerating objects: 352, done.
remote: Counting objects: 100% (202/202), done.
remote: Compressing objects: 100% (127/127), done.
remote: Total 352 (delta 117), reused 136 (delta 74), pack-reused 150
Receiving objects: 100% (352/352), 11.28 MiB | 8.07 MiB/s, done.
Resolving deltas: 100% (142/142), done.
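Building the knowledge base

Data collection

To keep corpus processing simple, we use the open-source repositories of the large-model tools released by Shanghai AI Laboratory as the corpus source, and take every markdown and txt file in those repositories as the sample corpus. Code files can also be added to the knowledge base, but they need extra handling: code is highly structured and its parts depend on each other logically, so it is best split by code module before being added to the vector database.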
```python
%cd /root/data

!git clone https://gitee.com/open-compass/opencompass.git
!git clone https://gitee.com/InternLM/lmdeploy.git
!git clone https://gitee.com/InternLM/xtuner.git
!git clone https://gitee.com/InternLM/InternLM-XComposer.git
!git clone https://gitee.com/InternLM/lagent.git
!git clone https://gitee.com/InternLM/InternLM.git
```
/root/data
Cloning into 'opencompass'...
remote: Enumerating objects: 4843, done.
remote: Total 4843 (delta 0), reused 0 (delta 0), pack-reused 4843
Receiving objects: 100% (4843/4843), 1.48 MiB | 1.39 MiB/s, done.
Resolving deltas: 100% (2941/2941), done.
Updating files: 100% (1154/1154), done.
Cloning into 'lmdeploy'...
remote: Enumerating objects: 4485, done.
remote: Counting objects: 100% (4485/4485), done.
remote: Compressing objects: 100% (1494/1494), done.
remote: Total 4485 (delta 2914), reused 4485 (delta 2914), pack-reused 0
Receiving objects: 100% (4485/4485), 2.23 MiB | 1.34 MiB/s, done.
Resolving deltas: 100% (2914/2914), done.
Updating files: 100% (455/455), done.
Cloning into 'xtuner'...
remote: Enumerating objects: 3735, done.
remote: Counting objects: 100% (1150/1150), done.
remote: Compressing objects: 100% (252/252), done.
remote: Total 3735 (delta 920), reused 1106 (delta 895), pack-reused 2585
Receiving objects: 100% (3735/3735), 742.80 KiB | 864.00 KiB/s, done.
Resolving deltas: 100% (2741/2741), done.
Updating files: 100% (450/450), done.
Cloning into 'InternLM-XComposer'...
remote: Enumerating objects: 680, done.
remote: Counting objects: 100% (680/680), done.
remote: Compressing objects: 100% (273/273), done.
remote: Total 680 (delta 361), reused 680 (delta 361), pack-reused 0
Receiving objects: 100% (680/680), 10.74 MiB | 2.61 MiB/s, done.
Resolving deltas: 100% (361/361), done.
Cloning into 'lagent'...
remote: Enumerating objects: 414, done.
remote: Counting objects: 100% (414/414), done.
remote: Compressing objects: 100% (188/188), done.
remote: Total 414 (delta 197), reused 414 (delta 197), pack-reused 0
Receiving objects: 100% (414/414), 214.97 KiB | 974.00 KiB/s, done.
Resolving deltas: 100% (197/197), done.
Cloning into 'InternLM'...
remote: Enumerating objects: 2604, done.
remote: Counting objects: 100% (592/592), done.
remote: Compressing objects: 100% (264/264), done.
remote: Total 2604 (delta 324), reused 581 (delta 318), pack-reused 2012
Receiving objects: 100% (2604/2604), 4.87 MiB | 1.69 MiB/s, done.
Resolving deltas: 100% (1608/1608), done.
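The data collection step notes that code files are best split by module before they go into the vector database. As a rough illustration of that idea, the sketch below uses Python's standard `ast` module to cut a source file into one chunk per top-level function or class. It is not part of the original pipeline: the helper name `split_python_source` is hypothetical, it assumes Python 3.8+ (for `end_lineno`), and it handles Python sources only.

```python
import ast


def split_python_source(source):
    """Split Python source into one text chunk per top-level def/class.

    Statements outside functions and classes (imports, constants) are skipped
    in this sketch; a real splitter would probably keep them as context.
    Decorator lines sit above `lineno` and are not included either.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based; end_lineno is inclusive.
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks


sample = '''import os

def add(a, b):
    return a + b

class Greeter:
    def hello(self):
        return "hi"
'''

for chunk in split_python_source(sample):
    print(chunk)
    print("-" * 20)
```

Each chunk could then be wrapped in a document object and embedded like any other text chunk.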
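First, we collect the paths of all qualifying files in the repositories above. We define a function that recursively walks a given directory and returns the paths of all files that meet the condition, that is, files whose names end in .md or .txt: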
```python
import os


def get_files(dir_path):
    """Recursively collect the paths of all .md and .txt files under dir_path."""
    file_list = []
    # os.walk yields (current directory, subdirectories, files) at every level.
    for filepath, dirnames, filenames in os.walk(dir_path):
        for filename in filenames:
            # Keep only Markdown and plain-text files.
            if filename.endswith(".md"):
                file_list.append(os.path.join(filepath, filename))
            elif filename.endswith(".txt"):
                file_list.append(os.path.join(filepath, filename))
    return file_list
```
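For comparison, here is a compact variant of the same idea. `str.endswith` accepts a tuple of suffixes, so the two branches can be collapsed into one test. The name `get_files_by_suffix` and the `suffixes` parameter are illustrative and do not come from the original tutorial.

```python
import os


def get_files_by_suffix(dir_path, suffixes=(".md", ".txt")):
    """Recursively collect paths under dir_path whose names end with a suffix."""
    suffixes = tuple(suffixes)  # endswith needs a tuple, not a list
    file_list = []
    for dirpath, _dirnames, filenames in os.walk(dir_path):
        for filename in filenames:
            # One endswith call covers every suffix in the tuple.
            if filename.endswith(suffixes):
                file_list.append(os.path.join(dirpath, filename))
    # Sorting makes the result independent of the file system's listing order.
    return sorted(file_list)
```

Making the suffixes a parameter keeps the function usable if other text formats are added to the corpus later.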
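Loading the data

Once we have all target file paths, we can use LangChain's FileLoader objects to load the files and obtain the plain text parsed from them. Different file types need different loaders, so we check each file's type and call the matching FileLoader, then call the loader's load method to get the loaded plain-text objects: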
```python
from tqdm import tqdm
from langchain.document_loaders import UnstructuredFileLoader
from langchain.document_loaders import UnstructuredMarkdownLoader


def get_text(dir_path):
    """Load every .md and .txt file under dir_path into LangChain documents."""
    # get_files is the helper defined above.
    file_lst = get_files(dir_path)
    # docs collects the loaded plain-text documents.
    docs = []
    for one_file in tqdm(file_lst):
        file_type = one_file.split('.')[-1]
        if file_type == 'md':
            loader = UnstructuredMarkdownLoader(one_file)
        elif file_type == 'txt':
            loader = UnstructuredFileLoader(one_file)
        else:
            # Skip any file type we do not handle.
            continue
        docs.extend(loader.load())
    return docs
```
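Building the vector database

Once we have this list, we can bring it into LangChain to build a vector database. To build a vector database from plain-text objects, we first split the text into chunks and then embed each chunk.

LangChain provides several text-splitting tools. Here we use the recursive character text splitter with a chunk size of 500 and a chunk overlap of 150 (for brevity the splitting results are not shown here, and readers are encouraged to try it themselves; for a deeper look at text splitting in LangChain, see the tutorial "LangChain - Chat With Your Data"):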
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Target directories to index
tar_dir = [
    "/root/data/InternLM",
    "/root/data/InternLM-XComposer",
    "/root/data/lagent",
    "/root/data/lmdeploy",
    "/root/data/opencompass",
    "/root/data/xtuner"
]

# Load the target files
docs = []
for dir_path in tar_dir:
    docs.extend(get_text(dir_path))

# Split the documents into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=150)
split_docs = text_splitter.split_documents(docs)
```
0%| | 0/25 [00:00<?, ?it/s]/root/.conda/envs/internlm_langchain/lib/python3.10/site-packages/unstructured/documents/html.py:498: FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.
rows = body.findall("tr") if body else []
40%|████ | 10/25 [00:28<00:23, 1.56s/it]/root/.conda/envs/internlm_langchain/lib/python3.10/site-packages/unstructured/documents/html.py:498: FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.
rows = body.findall("tr") if body else []
100%|██████████| 25/25 [00:28<00:00, 1.16s/it]
100%|██████████| 9/9 [00:00<00:00, 19.51it/s]
100%|██████████| 18/18 [00:00<00:00, 34.18it/s]
100%|██████████| 72/72 [00:02<00:00, 24.73it/s]
100%|██████████| 113/113 [00:05<00:00, 18.85it/s]
100%|██████████| 26/26 [00:01<00:00, 18.29it/s]
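To give a feel for what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified fixed-window splitter. It is not LangChain's algorithm (the recursive splitter prefers to break on separators such as paragraph and line breaks), and the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=150):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


text = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(text)
print([len(c) for c in chunks])
# The tail of one chunk is repeated at the head of the next.
print(chunks[0][-150:] == chunks[1][:150])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.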
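Next, we use the open-source embedding model Sentence Transformer to vectorize the text. LangChain provides an interface for using models from the HuggingFace open-source community directly for embedding: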
```python
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="/root/data/model/sentence-transformer")
```
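Because Chroma is a widely used, beginner-friendly vector database, we choose it as our vector store. Using the chunked documents and the embedding model loaded above, we load the corpus into a vector database at a specified path: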
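```python
from langchain.vectorstores import Chroma

# Directory where the database is persisted
persist_directory = 'data_base/vector_db/chroma'
# Build the database from the chunks and the embedding model
vectordb = Chroma.from_documents(
    documents=split_docs,
    embedding=embeddings,
    persist_directory=persist_directory
)
# Write the vector database to disk
vectordb.persist()
```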
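You can create a demo directory under /root/data and run this script and all later ones from there. Running the script above builds a persisted vector database locally; afterwards you can simply load that database instead of rebuilding it.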
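One way to avoid rebuilding by accident is to check for an existing database directory first. The sketch below is only a heuristic under a stated assumption: a non-empty persist directory is taken to mean that an earlier run already persisted the database. It does not inspect the contents, and the helper name `vector_db_exists` is hypothetical.

```python
import os


def vector_db_exists(persist_directory):
    """Return True if persist_directory exists and contains at least one entry."""
    return os.path.isdir(persist_directory) and bool(os.listdir(persist_directory))


# Intended use, with the calls shown elsewhere in this post:
#   if vector_db_exists(persist_directory):
#       load with Chroma(persist_directory=..., embedding_function=...)
#   else:
#       build with Chroma.from_documents(...) and call persist()
```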
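Connecting InternLM to LangChain

To make building LLM applications convenient, we define a custom InternLM LLM subclass that inherits from LangChain's LLM class and wraps the locally deployed InternLM, which plugs InternLM into the LangChain framework. Once the custom subclass exists, LangChain's interfaces can be called in exactly the same way as for any other model, without worrying about differences in how the underlying model is invoked.

Defining a custom LLM class on top of a locally deployed InternLM is not complicated: we subclass langchain.llms.base.LLM and override the constructor and the _call function: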
```python
from langchain.llms.base import LLM
from typing import Any, List, Optional
from langchain.callbacks.manager import CallbackManagerForLLMRun
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch


class InternLM_LLM(LLM):
    # Custom LLM class built on a locally deployed InternLM
    tokenizer: AutoTokenizer = None
    model: AutoModelForCausalLM = None

    def __init__(self, model_path: str):
        # model_path: path to the local InternLM model
        # Load the model once, at construction time
        super().__init__()
        print("正在从本地加载模型...")  # "Loading the model from local files..."
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        self.model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to(torch.bfloat16).cuda()
        self.model = self.model.eval()
        print("完成本地模型的加载")  # "Finished loading the local model"

    def _call(self, prompt: str, stop: Optional[List[str]] = None,
              run_manager: Optional[CallbackManagerForLLMRun] = None,
              **kwargs: Any):
        # Override the call hook: send the prompt to the model's chat method
        system_prompt = """You are an AI assistant whose name is InternLM (书生·浦语).
        - InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
        - InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文.
        """
        messages = [(system_prompt, '')]
        response, history = self.model.chat(self.tokenizer, prompt, history=messages)
        return response

    @property
    def _llm_type(self) -> str:
        return "InternLM"
```
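In the class above we override the constructor and the _call function. The constructor loads the locally deployed InternLM model as soon as the object is instantiated, which avoids the long delay of reloading the model on every call. The _call function is the core of the LLM class: LangChain calls it whenever it needs to invoke the LLM. Inside it we call the chat method of the already instantiated model and return the result.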
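The division of labour described above (heavy loading in the constructor, per-request work in `_call`) can be tried out without a GPU. The sketch below uses a tiny stand-in base class, `MiniLLM`, which only mimics the idea that the framework's public entry point delegates to `_call`. It is not LangChain's actual `LLM` class, and all names in it are hypothetical.

```python
class MiniLLM:
    """Simplified stand-in for a framework base class (not LangChain's LLM)."""

    def __call__(self, prompt):
        # The framework-facing entry point just delegates to the subclass hook.
        return self._call(prompt)

    def _call(self, prompt):
        raise NotImplementedError


class EchoLLM(MiniLLM):
    """Toy model: 'loads' once in the constructor, then answers many prompts."""

    def __init__(self):
        self.load_count = 0
        self._load()

    def _load(self):
        # Stands in for the expensive from_pretrained(...) step.
        self.load_count += 1

    def _call(self, prompt):
        return f"echo: {prompt}"


llm = EchoLLM()
print(llm("hello"))
print(llm("world"))
print("model loaded", llm.load_count, "time(s)")
```

However many prompts are sent, the loading step runs once, which is the property the real constructor is designed to provide.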
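Building the retrieval QA chain

LangChain wraps the whole RAG workflow in a retrieval QA chain object, that is, a single object that carries out the complete retrieval-augmented generation (RAG) process. More RAG concepts are covered in the video lesson, and readers are also welcome to consult the tutorial "LLM Universe" to learn more. We can use the RetrievalQA object provided by LangChain and pass in the vector database and the custom LLM at initialization. LangChain then automatically retrieves based on the user's question, fetches the relevant documents, assembles them into a suitable prompt, and hands that prompt to the LLM.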
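The flow that the chain automates (retrieve, assemble a prompt, ask the model) can be sketched in plain Python. Everything here is a toy under stated assumptions: the retriever ranks by shared words rather than embeddings, `fake_llm` only reports the prompt length, and none of the names come from LangChain.

```python
def retrieve(query, documents, k=2):
    """Rank documents by how many lowercase words they share with the query."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(doc.lower().split())), doc) for doc in documents]
    # Stable sort: ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(question, context_docs):
    """Stuff the retrieved snippets into a single prompt string."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def answer(question, documents, llm):
    """Retrieve, build the prompt, then hand it to the model callable."""
    docs = retrieve(question, documents)
    return llm(build_prompt(question, docs)), docs


def fake_llm(prompt):
    # Placeholder for a real model call.
    return f"(model saw {len(prompt)} characters)"


corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "birds can fly",
]
reply, sources = answer("where is the cat", corpus, fake_llm)
print(reply)
print(sources)
```

Swapping the word-overlap retriever for a vector store and `fake_llm` for a real model gives the same shape as the chain built in the rest of this section.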
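Loading the vector database

First we need to load the vector database built above. We can load it directly through Chroma together with the embedding model defined earlier: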
```python
from langchain.vectorstores import Chroma
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
import os

# Embedding model
embeddings = HuggingFaceEmbeddings(model_name="/root/data/model/sentence-transformer")

# Directory where the vector database was persisted
persist_directory = 'data_base/vector_db/chroma'

# Load the database
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embeddings
)
```
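The vectordb object produced by the code above is the vector database we built. It can run semantic vector retrieval on a user's query and return the knowledge snippets relevant to the question.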
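Semantic vector retrieval boils down to comparing the query's embedding with each stored embedding. The sketch below shows the idea with hand-written 3-dimensional vectors and cosine similarity. A real store such as Chroma works with model-produced embeddings and an index, and its default distance metric may differ; all names and vectors here are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the ids of the k stored vectors most similar to query_vec."""
    ranked = sorted(store,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [snippet_id for snippet_id, _vec in ranked[:k]]


toy_store = [
    ("snippet-a", [1.0, 0.0, 0.0]),
    ("snippet-b", [0.9, 0.1, 0.0]),
    ("snippet-c", [0.0, 0.0, 1.0]),
]
print(top_k([1.0, 0.05, 0.0], toy_store))
```

The query vector points almost in the same direction as the first two stored vectors, so those two ids come back and the orthogonal third one is left out.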
Instantiating the custom LLM and the Prompt Template

Next, we instantiate an LLM object based on our custom InternLM class:
```python
llm = InternLM_LLM(model_path="/root/data/model/Shanghai_AI_Laboratory/internlm-chat-7b")
llm.predict("你是谁")  # "Who are you?"
```
正在从本地加载模型...
Loading checkpoint shards: 0%| | 0/8 [00:00<?, ?it/s]
完成本地模型的加载
'我是一个语言模型,我的名字是书生·浦语。我来自上海人工智能实验室。我可以回答各种问题,包括日常生活、历史、文化、科技、艺术、政治等各种话题。如果您有任何问题,欢迎随时问我。'
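To build the retrieval QA chain we also need a Prompt Template. The template is essentially a string with variables: after retrieval, LangChain fills the retrieved document snippets into those variables, producing a prompt that carries the relevant knowledge. We can instantiate such a template object from LangChain's template base class: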
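```python
from langchain.prompts import PromptTemplate

# The template (kept in Chinese) tells the model to answer from the given
# context, to say it does not know when unsure, and to always reply in Chinese.
template = """使用以下上下文来回答用户的问题。如果你不知道答案,就说你不知道。总是使用中文回答。
问题: {question}
可参考的上下文:
···
{context}
···
如果给定的上下文无法让你做出回答,请回答你不知道。
有用的回答:"""

# Build the PromptTemplate object
QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context", "question"], template=template)
```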
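A quick way to sanity-check a template string before handing it to `PromptTemplate` is to list its placeholders and fill it by hand. The sketch below uses only the standard library (`string.Formatter` and `str.format`). It assumes the template uses plain `{name}` placeholders as in the example above; the helper name `placeholders` and the shortened English template are hypothetical.

```python
from string import Formatter


def placeholders(template):
    """Return the set of {name} fields found in a format-style template."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion);
    # field_name is None for trailing literal text.
    return {field for _, field, _, _ in Formatter().parse(template) if field}


demo_template = "Question: {question}\nContext:\n{context}\nHelpful answer:"

print(sorted(placeholders(demo_template)))
print(demo_template.format(question="What is InternLM?",
                           context="(retrieved snippets)"))
```

If the set of placeholders differs from the names passed as `input_variables`, the mismatch is easier to spot here than after the whole chain has been assembled.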
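Building the retrieval QA chain

Finally, we call the retrieval QA chain constructor provided by LangChain and build an InternLM-based retrieval QA chain from our custom LLM, the Prompt Template, and the vector knowledge base: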
```python
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(llm, retriever=vectordb.as_retriever(), return_source_documents=True, chain_type_kwargs={"prompt": QA_CHAIN_PROMPT})
```
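The resulting qa_chain object delivers our core functionality: a domain knowledge base assistant built on the InternLM model. We can compare the answers from the retrieval QA chain with those from the LLM alone: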
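```python
# Answer from the retrieval QA chain
question = "什么是InternLM"  # "What is InternLM?"
result = qa_chain({"query": question})
print("检索问答链回答 question 的结果:")  # "Answer from the retrieval QA chain:"
print(result["result"])

# Answer from the LLM alone
result_2 = llm(question)
print("大模型回答 question 的结果:")  # "Answer from the LLM alone:"
print(result_2)
```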
检索问答链回答 question 的结果:
根据您提供的问题,InternLM是一个开源的轻量级训练框架,旨在支持大模型训练,而无需大量的依赖。它支持在拥有数千个GPU的大型集群上进行预训练,并在单个GPU上进行微调,同时实现了卓越的性能优化。在1024个GPU上训练时,InternLM可以实现近90%的加速效率。
InternLM团队已经发布了两个开源的预训练模型:InternLM-7B和InternLM-20B。更新包括InternLM-20B发布和InternLM-7B-Chat v1.1发布,后者增加了代码解释器和函数调用能力。InternLM模型的特点包括:
1. 支持训练高质量的对话模型,实现强大的知识库和推理功能;
2. 支持8k上下文窗口长,允许更长的输入序列和强大的推理能力;
3. 提供灵活的通用工具,使用户能够创建自己的工作流程;
4. 提供轻量级的学习框架,无需大量的依赖即可进行模型的前向学习和微调,并实现了卓越的性能优化。
总的来说,InternLM是一个非常实用的工具,可以帮助用户高效地训练大模型,并支持各种常见的预训练模型。
大模型回答 question 的结果:
书生·浦语
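Because the chain was built with `return_source_documents=True`, the result also carries the retrieved documents under the `source_documents` key. A small helper for listing them might look like the following. `summarize_sources` is a hypothetical name, and the sketch assumes each document exposes `page_content` and a `metadata` dict with a `source` entry, which is what the loaders used above normally produce.

```python
from types import SimpleNamespace


def summarize_sources(result, preview_chars=60):
    """Return one 'source: preview' line per retrieved document."""
    lines = []
    for doc in result.get("source_documents", []):
        source = doc.metadata.get("source", "<unknown>")
        # Flatten line breaks so each document fits on one line.
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"{source}: {preview}")
    return lines


# Stand-in result so the helper can be tried without loading the model.
fake_result = {
    "result": "...",
    "source_documents": [
        SimpleNamespace(page_content="first line\nsecond line", metadata={"source": "a.md"}),
        SimpleNamespace(page_content="plain text", metadata={}),
    ],
}
for line in summarize_sources(fake_result):
    print(line)
```

With the real chain, calling `summarize_sources(result)` on the dictionary returned by `qa_chain` shows which files the answer was grounded in.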
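Web deployment

With the core functionality in place, we can deploy it as a web page with the Gradio framework and build a small demo that is convenient for testing and use.

We first wrap the code above in a function that returns the constructed retrieval QA chain object. That function is called as soon as Gradio starts, and the resulting object is reused for every later question, which avoids reloading the model: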
```python
from langchain.vectorstores import Chroma
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
import os
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA


def load_chain():
    # Build and return the retrieval QA chain
    # Embedding model
    embeddings = HuggingFaceEmbeddings(model_name="/root/data/model/sentence-transformer")

    # Directory where the vector database was persisted
    persist_directory = 'data_base/vector_db/chroma'

    # Load the database
    vectordb = Chroma(
        persist_directory=persist_directory,
        embedding_function=embeddings
    )

    # Custom LLM (InternLM_LLM is the class defined earlier)
    llm = InternLM_LLM(model_path="/root/data/model/Shanghai_AI_Laboratory/internlm-chat-7b")

    # Prompt (kept in Chinese): answer from the context, admit when the answer
    # is unknown, keep it concise, and end with "谢谢你的提问!"
    template = """使用以下上下文来回答最后的问题。如果你不知道答案,就说你不知道,不要试图编造答案。尽量使答案简明扼要。总是在回答的最后说“谢谢你的提问!”。
    {context}
    问题: {question}
    有用的回答:"""

    QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context", "question"], template=template)

    # Assemble the retrieval QA chain
    qa_chain = RetrievalQA.from_chain_type(llm, retriever=vectordb.as_retriever(), return_source_documents=True, chain_type_kwargs={"prompt": QA_CHAIN_PROMPT})

    return qa_chain
```
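```python
class Model_center():
    """
    Holds the retrieval QA chain object
    """
    def __init__(self):
        # Build the chain once, when the object is created
        self.chain = load_chain()

    def qa_chain_self_answer(self, question: str, chat_history: list = None):
        """
        Answer a question using the QA chain
        """
        # Avoid sharing one mutable default list across calls
        if chat_history is None:
            chat_history = []
        if question is None or len(question) < 1:
            return "", chat_history
        try:
            chat_history.append(
                (question, self.chain({"query": question})["result"]))
            # Return an empty string to clear the input box, plus the updated history
            return "", chat_history
        except Exception as e:
            # Show the error text in the input box
            return str(e), chat_history
```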
```python
import gradio as gr

# Instantiate the core object (loads the model and builds the chain once)
model_center = Model_center()
# Create the web UI
block = gr.Blocks()
with block as demo:
    with gr.Row(equal_height=True):
        with gr.Column(scale=15):
            # Page title
            gr.Markdown("""<h1><center>InternLM</center></h1>
                <center>书生浦语</center>
                """)

    with gr.Row():
        with gr.Column(scale=4):
            # Chat display
            chatbot = gr.Chatbot(height=450, show_copy_button=True)
            # Text box for the prompt / question
            msg = gr.Textbox(label="Prompt/问题")

            with gr.Row():
                # Submit button
                db_wo_his_btn = gr.Button("Chat")
            with gr.Row():
                # Button that clears the chatbot component
                clear = gr.ClearButton(
                    components=[chatbot], value="Clear console")

        # On click, call qa_chain_self_answer with the text box and chat history
        db_wo_his_btn.click(model_center.qa_chain_self_answer, inputs=[
                            msg, chatbot], outputs=[msg, chatbot])

    gr.Markdown("""Note:<br>
    1. Initializing the database may take a while; please be patient.
    2. If an error occurs during use, it is shown in the text input box; there is no need to panic. <br>
    """)
gr.close_all()
# Launch the demo
demo.launch()
```
正在从本地加载模型...
Loading checkpoint shards: 0%| | 0/8 [00:00<?, ?it/s]
完成本地模型的加载
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
References
GitHub: Building Your Knowledge Base with InternLM and LangChain
bilibili: Building Your Knowledge Base with InternLM and LangChain