Chat with your PDFs using LangChain

‘Chatting’ with a PDF is becoming a popular use case; this post explains exactly how you can build an LLM application to do so.

Arslan Shahid
FireBird Technologies



PDF is a popular format for storing digital documents because it was designed to be printer-friendly. Today, however, printers are less common, and most people prefer to keep documents in editable formats. PDFs, unfortunately, are not editable, and the data stored in them is hard to work with.

Hence, it makes sense that many PDF2{insert_friendlier_format} converters exist. However, this conversion is rarely lossless.

This is why a popular use case for large language model applications is ‘chatting with PDFs’: a user asks a natural-language question, and the LLM app retrieves the answer for them.

This explainer will walk you through building your own ‘Chat with PDF’ application. There are four steps to this process:

  1. Loading PDFs using different PDF loaders in LangChain
  2. Cleaning the data
  3. Building a retriever
  4. Context augmentation for the LLM

Loading PDFs

LangChain has a few built-in PDF loaders, built on different PDF libraries like Unstructured & PyMuPDF. Most of these loaders only analyze the text inside the PDF, and information is often lost in the conversion, so results may differ depending on your documents. There are also image-trace-based extractors like Camelot, which work better for pulling tables & charts out of PDFs.

The important thing to note is that you may have to play around with different loaders and configurations to figure out what works best for your use case. No converter I have tried gives perfect results; for more sophisticated pipelines, I would recommend an ensemble of different loaders to extract the different kinds of objects inside the PDF. The cleaner the result, the better the chat experience will be.

For this post, we will be using the Public Sector Development Program (PSDP) PDF document, which gives a holistic view of the core program of Pakistan’s Ministry of Planning & Development.

It contains text data, tables, and other miscellaneous objects. We will load this PDF as LangChain documents.

# We will be using these two PDF loaders, but you can check out the other loaders available in LangChain
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.document_loaders import UnstructuredPDFLoader

# This is the name of the report which should be in the directory
# You can download the precise PDF I am using from here https://www.pc.gov.pk/uploads/archives/PSDP_2023-24.pdf
name = 'PSDP_2023-24.pdf'

# This loader uses PyMuPDF
loader_py = PyMuPDFLoader(name)

# This loader uses Unstructured
loader_un = UnstructuredPDFLoader(name)

# Storing the loaded documents as LangChain Document objects
pages_py = loader_py.load()

pages_un = loader_un.load()
Image by the Author: partial output from PyMuPDFLoader
Image by the Author: the same PDF page loaded with UnstructuredPDFLoader

The outputs look similar, but if you look closely, the whitespace and the document splits are different. This can mean different result quality depending on your use case. I would highly suggest playing around with these PDF loaders while keeping your end goal in mind.

For this post, I decided to use the PyMuPDF loader because it was easier to work with: it splits the PDF page by page. However, you could potentially get the same split by playing around with UnstructuredPDFLoader’s settings.
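If you want to see the difference for yourself, a quick sanity check is to print the start of the first document from each loader side by side (index 0 here is just an example; any page works):

# Compare the first chunk of text each loader produced.
# Note: UnstructuredPDFLoader may return one Document for the whole file,
# while PyMuPDFLoader returns one Document per page.
print(pages_py[0].page_content[:500])
print('---')
print(pages_un[0].page_content[:500])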

Cleaning the data

The first step is to split the documents into smaller chunks. This lets us feed the documents into the LLM and also build a retriever.

# text splitter
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    # the string on which to split
    separator="\n",
    # maximum chunk size, measured by length_function
    chunk_size=1000,
    # how much overlap should exist between chunks
    chunk_overlap=150,
    # how to measure length (characters here, not tokens)
    length_function=len
)

# Applying the splitter
docs = text_splitter.split_documents(pages_py)

docs
The text splitter turned the 100 loaded pages into 330 documents.

To get the best results, you have to experiment with your text-splitting strategy. The smaller the chunks, the more specific the information the retriever can fetch. However, if you make them too small, the retriever will not return anything useful, just small chunks of incoherent text.
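One simple way to experiment is to run the splitter at a few different chunk sizes and see how many documents each produces before committing to one (the sizes below are illustrative, not recommendations):

# try a few chunk sizes and compare how many documents each produces
for size in [500, 1000, 2000]:
    splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=size,
        chunk_overlap=150,
        length_function=len
    )
    print(size, len(splitter.split_documents(pages_py)))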

You may also want to look into removing whitespace from the documents. You can define a function that edits the ‘page_content’ of every document.

# a simple function that removes \n newlines from the content
def remove_ws(d):
    text = d.page_content.replace('\n', '')
    d.page_content = text
    return d

# applied to the docs
docs = [remove_ws(d) for d in docs]

Building a Retriever

A retriever is exactly what it sounds like: something that brings information back for us. In this context, retrievers perform a semantic search on the text data and bring back the text that best matches the search query, according to some underlying algorithm.

For this tutorial, I will be using FAISS, an open-source similarity-search library developed by Meta (it stores and searches embedding vectors; the embeddings themselves come from a separate model). You can see the documentation here and details for the LangChain implementation here.

# imports for the vector store and the embedding model
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# uses OpenAI embeddings to embed the chunks (api_key holds your OpenAI key)
embeddings = OpenAIEmbeddings(api_key=api_key)
# Creates the vector store from the docs and embeddings
db = FAISS.from_documents(docs, embeddings)

# Asking the vector store to do a similarity search for a query
query = "Foreign Aid for Lowari Road Tunnel & Access Roads Project (2nd Revised )"
answer = db.similarity_search(query)

# Building the retriever (returns the top 3 matching chunks)
retriever = db.as_retriever(search_kwargs={'k': 3})
Image by the Author: the original table in the PDF
Image by the Author: the retriever was able to find the matching document
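Before wiring the retriever into an LLM, it is worth sanity-checking what it actually returns. A minimal sketch (in recent LangChain versions retrievers are Runnables, so invoke works; older versions use get_relevant_documents instead):

# inspect the top-3 chunks the retriever returns for our earlier query
results = retriever.invoke(query)
for doc in results:
    print(doc.page_content[:200])
    print('---')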

More sophisticated retriever pipelines can extract more information and reformat it according to how the user wants it. However, in this post, our aim is to get started using as few steps as possible.
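As one small step in that direction, the same FAISS vector store can be turned into a maximal-marginal-relevance (MMR) retriever, which trades a little relevance for diversity among the returned chunks. A minimal sketch (the k and fetch_k values here are illustrative):

# MMR: fetch fetch_k candidate chunks, then keep the k most diverse ones
retriever_mmr = db.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 3, "fetch_k": 10}
)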

Context Augmentation for the LLM (Basic)

The most basic RAG pipeline is one where we feed the retriever’s output to the LLM a single time and ask it to do something with that information.

#Imports needed for the code to work.
#Using a simple output parser and chat prompt template
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings


# This is the prompt used
template = """
You are an information retrieval AI. Format the retrieved information as a table or as text.

Use only the context for your answers; do not make up information.

query: {query}

{context}
"""

# Converts the prompt into a prompt template
prompt = ChatPromptTemplate.from_template(template)
# Using an OpenAI chat model (gpt-3.5-turbo by default)
model = ChatOpenAI(api_key=api_key)

# Construction of the chain
chain = (
    # The initial dictionary feeds the retriever's output in as context
    # and passes the user-supplied query through unchanged
    {"context": retriever, "query": RunnablePassthrough()}
    # That context and query fill the prompt, which goes to the model,
    # and the output parser turns the model's reply into a string
    | prompt | model | StrOutputParser()
)

Now that we have our initial chain, let’s begin chatting.

# asking for something shown in the PDF image below
chain.invoke("""Find the details of the Antimicrobial Resistance
(AMR) Containment and Infection Prevention Control (IPC) Program

Break down the Approved Cost, both Total and Foreign Aid, the Throwforward and the Estimated Expenditure
""")
Image by the Author: the answer given by the LLM application
Image by the Author: the corresponding information in the PDF

There you have it: the basic version of chatting with your PDF.
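You can keep querying the same chain for other parts of the document, for example a summary of the Preface (the exact wording of the query is up to you):

# asking the chain to summarize a section of the PDF
chain.invoke("Summarize the Preface of this document")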

Image by the Author: asking for a summary of the Preface

I plan on doing an explainer on advanced RAG topics using PDFs. Please follow if you want to get notified when I publish. Also, if you’re looking to hire an AI expert to guide you through the mysteries of Large Language Models, you can reach out here:

https://form.jotform.com/240744327173051

Thank you so much for reading!


Life has the Markov property, the future is independent of the past, given the present