Chat with your SQL database using GPT-4o via Vanna.ai
Creating a text-to-SQL pipeline using OpenAI GPT-4o
OpenAI has just unveiled its latest flagship model, GPT-4o. This cutting-edge model is multimodal, capable of processing image, text, and video inputs to generate responses that surpass those of any previous model. This post shows how you can build a text-to-SQL pipeline to chat with your database using GPT-4o and Vanna AI.
Looking for someone to solve your problems related to Large Language Models & data science? Click here:
https://form.jotform.com/240744327173051
Getting Started
from vanna.openai import OpenAI_Chat
from vanna.vannadb import VannaDB_VectorStore

class MyVanna(VannaDB_VectorStore, OpenAI_Chat):
    def __init__(self, config=None):
        MY_VANNA_MODEL = '...'    # Your model name from https://vanna.ai/account/profile
        MY_VANNA_API_KEY = '...'  # Your Vanna API key from https://vanna.ai/account/profile
        VannaDB_VectorStore.__init__(self, vanna_model=MY_VANNA_MODEL, vanna_api_key=MY_VANNA_API_KEY, config=config)
        OpenAI_Chat.__init__(self, config=config)

# Add your OpenAI api_key and select the GPT-4o model
vn = MyVanna(config={'api_key': 'sk-...', 'model': 'gpt-4o'})
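If you would rather not use a hosted Vanna model for the vector store, you can keep everything local. Here is a minimal sketch using Vanna's ChromaDB vector store instead (assumes the chromadb dependency is installed; the class name MyLocalVanna is just illustrative):
from vanna.openai import OpenAI_Chat
from vanna.chromadb import ChromaDB_VectorStore

class MyLocalVanna(ChromaDB_VectorStore, OpenAI_Chat):
    def __init__(self, config=None):
        # Embeddings and training data are stored locally via ChromaDB
        ChromaDB_VectorStore.__init__(self, config=config)
        OpenAI_Chat.__init__(self, config=config)

vn = MyLocalVanna(config={'api_key': 'sk-...', 'model': 'gpt-4o'})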
Connecting to a Database
Vanna has built-in connectors for these 8 databases (you can connect to other databases with a few additional lines of code):
- PostgreSQL
- Oracle
- DuckDB
- MySQL
- SQLite
- BigQuery
- Snowflake
- Microsoft SQL Server
The documentation shows how to connect to your specific database. For the purposes of this post, I will be connecting to a DuckDB Stack Overflow database. The database can be found here!
#This is how you can connect to a DuckDB database
vn.connect_to_duckdb(url='motherduck:[<database_name>]?motherduck_token=<token>&saas_mode=true')
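If your data lives in one of the other built-in connectors, the call looks similar. For example, connecting to Postgres looks like this (host, credentials, and database name below are placeholders):
# Connect to a Postgres database instead of DuckDB
vn.connect_to_postgres(host='localhost', dbname='mydb', user='postgres', password='...', port=5432)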
Training
Training on Plan (Information Schema)
# The information schema query may need some tweaking depending on your database. This is a good starting point.
df_information_schema = vn.run_sql("SELECT * FROM INFORMATION_SCHEMA.COLUMNS")
# This will break up the information schema into bite-sized chunks that can be referenced by the LLM
plan = vn.get_training_plan_generic(df_information_schema)
plan
# If the plan looks good, run this to train
vn.train(plan=plan)
Training on DDL
# In DuckDB, the DESCRIBE statement can fetch the schema information (DDL) for any table
vn.train(ddl="DESCRIBE SELECT * FROM Stackoverflow.users;")
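If you want to give Vanna richer schema context, vn.train also accepts a raw CREATE TABLE statement as DDL. The columns below are illustrative, not the actual Stack Overflow schema:
# Train on an explicit CREATE TABLE statement (columns shown here are placeholders)
vn.train(ddl="""
CREATE TABLE IF NOT EXISTS users (
    Id BIGINT,
    DisplayName VARCHAR,
    Reputation BIGINT,
    CreationDate TIMESTAMP
)
""")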
Training on SQL statements
# Here is an example of training on a question-SQL pair
vn.train(
    question="What are the top 10 users with the highest number of badges?",
    sql="""
    SELECT UserId, COUNT(*) AS badge_count
    FROM stackoverflow.main.badges
    GROUP BY UserId
    ORDER BY badge_count DESC
    LIMIT 10
    """
)
# Another example
vn.train(
    question="What is the difference in answer count between the user who posted the most answers and the user who posted the fewest?",
    sql="""
    SELECT MAX(answer_count) - MIN(answer_count) AS difference
    FROM (
        SELECT OwnerUserId, COUNT(*) AS answer_count
        FROM stackoverflow.main.posts
        WHERE PostTypeId = 2
        GROUP BY OwnerUserId
    ) AS answer_counts;
    """
)
Training on Documentation
# You can feed in contextual information using documentation
vn.train(documentation="We call the user with the most answers in a year the Grand master")
You can view your training data using vn.get_training_data()
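For example, the call below returns a DataFrame of everything Vanna has stored, and an unhelpful entry can be removed by its id (the id value here is illustrative):
# Inspect the stored training data
training_data = vn.get_training_data()
print(training_data.head())
# Remove an entry that is no longer useful (replace the id with one from the DataFrame above)
# vn.remove_training_data(id='some-id')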
# vn.ask runs the following functions in sequence; each can also be run individually
# 1. vn.generate_sql
# 2. vn.run_sql
# 3. vn.generate_plotly_code
# 4. vn.get_plotly_figure
# This is how you can ask Vanna questions after training
vn.ask('Find the top 10 users with the highest number of badges?')
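If you want more control than vn.ask gives you, the steps can be chained by hand. Here is a minimal sketch of the first two steps, SQL generation and execution; the plotting helpers can be chained on afterwards in the same way:
# Generate the SQL for a question, then run it against the connected database
sql = vn.generate_sql(question='Find the top 10 users with the highest number of badges')
df = vn.run_sql(sql)
print(sql)
print(df.head())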
Using the Flask App
Vanna comes with a built-in Flask UI, which can be launched from a Jupyter notebook or a Python script.
from vanna.flask import VannaFlaskApp
app = VannaFlaskApp(vn)
app.run()
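The app constructor also accepts optional flags; for example, recent Vanna versions let the LLM see query results via allow_llm_to_see_data (check the Flask app documentation for your version):
# Optionally allow the LLM to inspect query results when answering follow-ups
app = VannaFlaskApp(vn, allow_llm_to_see_data=True)
app.run()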
This is how you can use GPT-4o for text-to-SQL.
Benchmark
Below is a comparison of GPT-4o with other models in its class. As you can see, it achieves the highest accuracy at 61%, compared with 59% for GPT-4 Turbo and 56% for Claude Opus.
Thank you for reading!