Chat with your SQL database using GPT 4o via Vanna.ai

Creating a text-to-SQL pipeline using OpenAI GPT 4o

4 min readMay 15, 2024

OpenAI has just unveiled its latest flagship model, GPT 4o. This cutting-edge model is multimodal, capable of processing images, text, and video inputs to generate responses that surpass those of any previous model. This post highlights how you can build a text-to-SQL pipeline to chat with your database using GPT 4o and Vanna AI

Looking for someone to solve your problem related to Large Language Models & data science? Click here:

https://form.jotform.com/240744327173051

Image by the Author — Preview of Vanna Flask App

Getting Started

from vanna.openai import OpenAI_Chat
from vanna.vannadb import VannaDB_VectorStore
class MyVanna(VannaDB_VectorStore, OpenAI_Chat):
    def __init__(self, config=None):
        MY_VANNA_MODEL = # Your model name from https://vanna.ai/account/profile
        VannaDB_VectorStore.__init__(self, vanna_model=MY_VANNA_MODEL, vanna_api_key=MY_VANNA_API_KEY, config=config)
        OpenAI_Chat.__init__(self, config=config)
# Add your OpenAI api_key
vn = MyVanna(config={'api_key': 'sk-...', 'model': 'gpt-4o'})

Connecting to a Database

Vanna has built-in connectors for these 8 databases (you can connect to other database with a few additional lines of code):

Postgres SQL
Oracle
DuckDB
MySQL
SQLite
Big Query
Snowflake
Microsoft SQL

By looking at the documentation, you can figure out how to connect your specific database. For the purposes of this post, I would be connecting to DuckDB StackOverFlow Database. The database can found here!

#This is how you can connect to a DuckDB database
vn.connect_to_duckdb(url='motherduck:[<database_name>]?motherduck_token=<token>&saas_mode=true')

Training

Image by Author — StackOverFlow DB schema

Training on Plan (Information Schema)

# The information schema query may need some tweaking depending on your database. This is a good starting point.
df_information_schema = vn.run_sql("SELECT * FROM INFORMATION_SCHEMA.COLUMNS")

# This will break up the information schema into bite-sized chunks that can be referenced by the LLM
plan = vn.get_training_plan_generic(df_information_schema)
plan

# If you like the plan, then uncomment this and run it to train
vn.train(plan=plan)

Training on DDL

# In duckDB the describe statement can fetch the DDL for any table
vn.train(ddl="DESCRIBE SELECT * FROM Stackoverflow.users;")

Training on SQL statements

# here is an example of training on SQL statements
vn.train(
question="What are the top 10 users with highest amount of Badges?"
,sql="""SELECT UserId, COUNT(*) AS badge_count
FROM stackoverflow.main.badges
GROUP BY UserId
ORDER BY badge_count DESC
LIMIT 10
""")
# Another example
vn.train(
question="What is the difference in total answers for the user who answered the most answers and the user who answered the least questions?", 
,sql="SELECT MAX(answer_count) - MIN(answer_count) AS difference
FROM (
    SELECT OwnerUserId, COUNT(*) AS answer_count
    FROM stackoverflow.main.posts
    WHERE PostTypeId = 2
    GROUP BY OwnerUserId
) AS answer_counts;
")

Training on Documentation

# You can feed in contextual information using documentation
vn.train(documentation="We call the user with the highest answers in a year the Grand master")

You can view your training data using vn.get_training_data()

# vn.ask is runs these following functions in sequence, which can be run individually
# 1. vn.generate_ql
# 2. vn.run_sql
# 3. vn.generate_plotly_code
# 4. vn.get_plotly_figure

# this is how you can ask Vanna question's post training
vn.ask('Find the top 10 users with the highest amount of Badges?')

Image by Author — Visualization created by the SQL query

Using the Flask App

Vanna comes with a built-in UI Flask app. Which can be launched inside a jupyter notebook or Python script.

from vanna.flask import VannaFlaskApp
app = VannaFlaskApp(vn)
app.run()

This is how you can use GPT 4o for text-to-SQL.

Benchmark

Below is the comparison between GPT 4o with other models in its class, as you can see it performs with the highest accuracy at 61% compared with 59% for ChatGPT 4 turbo and 56% for Claude Opus