Using LangChain to teach an LLM to write like you

A LangChain tutorial on building a document retriever and generator that writes just like me

Arslan Shahid
FireBird Technologies

--

Image by the Author

LangChain is one of the most frequently used RAG libraries in Python and JavaScript. Retrieval-Augmented Generation (RAG) is a technique for augmenting a large language model with additional documents, without going through the hassle of fine-tuning it for a specific task.

I came up with the idea for this post while trying to use LLMs to aid my writing. I found that simple prompts often failed to generate reliable output, so a RAG application was the way to go.

Extracting Documents

LangChain has a core object called Document that holds text for use with the package. These documents can easily be split with a text splitter and passed into chains, output parsers, and LLMs.
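To make this concrete, here is a minimal sketch of a Document created by hand; the text and file name are just placeholders:

from langchain_core.documents import Document

# A Document pairs the raw text with metadata about where it came from
doc = Document(
    page_content="LangChain is one of the most frequently used RAG libraries...",
    metadata={"source": "my_post.html"},
)

print(doc.page_content)
print(doc.metadata["source"])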

I used all of my Medium blog posts as the source material. I had already downloaded them as HTML files and placed them in the project directory.

from os import listdir
from os.path import isfile, join

mypath = 'project_directory'
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]

# Keep only the HTML files
onlyfiles = [x for x in onlyfiles if x.endswith(('.htm', '.html'))]

The next step is to use LangChain to parse these files into documents that can be used in a RAG pipeline.

# There are plenty of HTML loaders to choose from
from langchain_community.document_loaders import UnstructuredHTMLLoader

# The data dictionary will hold the loaded documents
data = {}

for i, file in enumerate(onlyfiles):
    # Load each HTML file from the project directory
    loader = UnstructuredHTMLLoader(join(mypath, file))
    data[i] = loader.load()

A view of the documents.

Image by the Author. Every document contains page_content and other metadata, such as the source file.
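To verify the load worked, you can inspect a document directly; a quick sketch, assuming the data dictionary built above:

# Peek at the first loaded document: its text and its metadata
first_doc = data[0][0]
print(first_doc.page_content[:200])  # first 200 characters of the post
print(first_doc.metadata)            # e.g. {'source': 'some_post.html'}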

These documents are larger than what many LLMs can digest in one go. GPT-3.5 has only a 4,096-token context window, and a single article alone is roughly 2,000 tokens. To get around that limit, I would have to build a pipeline from which the LLM can retrieve these documents. Splitting the documents into chunks is also a good strategy, as it helps the LLM when generating content. I would also delete some of the unnecessary text in each document.

# Importing the text splitter; in this case we use TokenTextSplitter
from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=25)

# A dictionary which holds all the post-split documents
texts = {}
for i in range(len(data)):
    # The key of the texts dictionary is the source file of each document
    texts[data[i][0].metadata['source']] = text_splitter.split_documents(data[i])
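As a quick sanity check on the split, you can count how many chunks each post produced; a small sketch using the texts dictionary built above:

for source, chunks in texts.items():
    print(f"{source}: {len(chunks)} chunks")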

The documents have been extracted; the next step is to create embeddings. I would be using ChromaDB (local) as the vector store and OpenAI embeddings, but other vector stores and embedding models can also be used.

Looking for someone to solve your problem? Click here:

https://form.jotform.com/240744327173051

Creating Embeddings

In simple terms, an embedding is a function that takes a string as input and outputs a vector of numbers. Embedding functions are an extensive topic in their own right and out of scope for this post.

If you’re interested in learning more about embeddings, please visit the OpenAI embedding documentation.
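To get a feel for what an embedding looks like, here is a small sketch (assuming a valid OpenAI API key) that embeds a single query string and inspects the resulting vector:

from langchain_openai.embeddings import OpenAIEmbeddings

embedder = OpenAIEmbeddings(api_key='your_open_ai_key')
vector = embedder.embed_query('What is a Sankey?')

print(len(vector))  # dimensionality of the vector, e.g. 1536 for the default model
print(vector[:5])   # the first few components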

from langchain_community.vectorstores import Chroma
from langchain_openai.embeddings import OpenAIEmbeddings

texts_keys = list(texts.keys())

for i in range(len(texts)):
    vectordb = Chroma.from_documents(
        # Takes in a list of documents
        texts[texts_keys[i]],
        # Embedding function; we are using the OpenAI default
        embedding=OpenAIEmbeddings(api_key='your_open_ai_key'),
        # Specify the directory where the embeddings should be stored
        persist_directory='./LLM_train_embedding/Doc'
    )
    # Writes the embeddings to the directory
    vectordb.persist()
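Because the embeddings are persisted to disk, the vector store can be reloaded later without recomputing anything; a sketch using the same directory and embedding function as above:

# Reload the persisted Chroma store with the same embedding function
vectordb = Chroma(
    persist_directory='./LLM_train_embedding/Doc',
    embedding_function=OpenAIEmbeddings(api_key='your_open_ai_key')
)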

Retrieval

For the simplest implementation of a retriever, I used a built-in ChromaDB method that lets you search your documents. There are a few available search algorithms, which you can read about in the LangChain documentation.

# The built-in LangChain method for vector store-based retrieval.
# It takes two arguments: the search algorithm to use, and search_kwargs,
# which holds the parameters for that algorithm (k is the number of
# documents to return, etc.)

# Defines the retriever
retriever = vectordb.as_retriever(search_type='mmr', search_kwargs={'k': 1})

# Gets the documents for a query
retriever.get_relevant_documents('What is a Sankey?')
Image by the Author, showing the output of the query ‘What is a Sankey?’.
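MMR (maximal marginal relevance) trades relevance off against diversity in the returned chunks. Plain similarity search is the other common option; a sketch of the same retriever configured that way, returning three chunks instead of one:

# Plain similarity search, returning the 3 closest chunks
similarity_retriever = vectordb.as_retriever(search_type='similarity', search_kwargs={'k': 3})
similarity_retriever.get_relevant_documents('What is a Sankey?')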

Retrieval is a huge topic, and one I encourage you to learn more about. For the purposes of this post, I will also be using the Parent Document Retriever.

Parent Document Retrieval: a retrieval method that splits large documents into smaller child chunks for indexing, so the search can home in on a specific part of your document, while the larger parent chunks are what get returned. Parent document retrieval requires both a vector store and a document store, as per LangChain’s implementation.

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

# Tells how to split the child chunks
child_splitter = TokenTextSplitter(chunk_size=250, chunk_overlap=10)
# Tells how to split the parent chunks
parent_splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=50)

vectorstore = Chroma(
    collection_name="full_documents",
    embedding_function=OpenAIEmbeddings(api_key='your_open_ai_key')
)
# Creates a document store in memory
store = InMemoryStore()

# This retriever takes a vector store and a document store
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)

# Adds the documents; the retriever splits them into parents and children
for i in range(len(data)):
    retriever.add_documents(data[i])

#Search over documents:
retriever.get_relevant_documents('What is a Sankey?')
Part of the output

Generation

Now I will instruct an LLM to generate text using the retrieved documents. This gives the LLM additional context about my writing style.

We will compose the chain using the LangChain Expression Language (LCEL)

from operator import itemgetter

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# This is the prompt I used.
# It takes in the retrieved documents as {context} and the user-provided {topic}
template = """Mimic the writing style in the context:
{context} and produce a blog on the topic

Topic: {topic}
"""

prompt = ChatPromptTemplate.from_template(template)

model = ChatOpenAI(api_key='your_open_ai_key')

# Using LangChain LCEL to supply the prompt and generate output
chain = (
    {
        "context": itemgetter("topic") | retriever,
        "topic": itemgetter("topic"),
    }
    | prompt
    | model
    | StrOutputParser()
)

# Running the chain
chain.invoke({"topic": "Pakistan"})
The output for the topic ‘Pakistan’.

Conclusion

This is how you can create an LLM bot that mimics your own writing. I will keep experimenting with this and running evaluations to see how well the LLM mimics my style.

If you’d like to help with the evaluation, please fill out this short form:

If you’re on mobile, please open this link directly in your browser; the form doesn’t load properly in the Medium mobile app:

https://form.jotform.com/240701049513043

If you’re interested in topics like these, please do follow me, share and comment on this post.
