Text-to-SQL for DuckDB database using Vanna, in 25 lines of code

Start ‘talking’ to your database using DuckDB in under 25 lines of code in Python

Arslan Shahid
5 min readApr 12, 2024
Image by the Author

DuckDB is a popular open-source SQL database developed by MotherDuck. Launched in 2019, DuckDB is now used by companies, indie-hackers, and data professionals worldwide. It is renowned as a simple, fast, and feature-rich SQL database.

With the help of Vanna, you can now talk directly to your DuckDB database. This walkthrough will illustrate exactly how you can do so.

First a slight introduction to what exactly we will be building!

Image by the Author. Showcases the Vanna Flask App, being asked a question
Image by the Author. Shows the visualization made in Plotly for the question

Getting started with Vanna

Installation

%pip install 'vanna[chromadb,duckdb]'

You can use Vanna with any LLM and different Vector storage. Here is a snapshot of the configurations you can try:

Image by Author

Vanna allows users to use their hosted option or any local vector store and Large Language Model (LLM). This post will be showcasing how you can use both options.

Vanna Hosted

To use the Vanna hosted option you would need an API key and create a model on their site. Enter an email address and create a model here:

Next, fill in this next block of code:

api_key = # Your API key from https://vanna.ai/account/profile 

vanna_model_name = # Your model name from https://vanna.ai/account/profile
vn = VannaDefault(model=vanna_model_name, api_key=api_key)

Local using Ollama

You could alternatively use a local LLM using Ollama, the below code shows you exactly how to do so.

from vanna.ollama.ollama import Ollama
from vanna.chromadb.chromadb_vector import ChromaDB_VectorStore

class MyVanna(ChromaDB_VectorStore, Ollama):
def __init__(self, config=None):
ChromaDB_VectorStore.__init__(self, config=config)
Ollama.__init__(self, config=config)

vn = MyVanna(config={'model': 'duckdb-nsql'})

Connecting to DuckDB

Next, you need to connect to DuckDB, you can feed in your DuckDB URL and other credentials like this. You can connect to your own database or you can use the StackOverFlow database found here.

Overview of the Dataset

The StackOverflow database contains information about the posts, tags, users, votes, etc on the popular question, answer website stackoverflow.com

Image by the Author, Snapshot of the schema for some tables

Connection

You can connect to DuckDB by filling in this pattern URL for your database.

motherduck:[<database_name>]?motherduck_token=<token>&saas_mode=true

You need to replace the database name for a database in your DB and the token with your token.

#replace the url with the URL of your dB
vn.connect_to_duckdb(url='motherduck:[<database_name>]?motherduck_token=<token>&saas_mode=true')

Training on SQL

You can use Vanna to train on Information_schema, DDL, SQL statements, and also documentation specific to your domain (It includes information not explicit in the database but you would give to a new data analytics trainee).

Training on Plan (Information Schema)


# The information schema query may need some tweaking depending on your database. This is a good starting point.
df_information_schema = vn.run_sql("SELECT * FROM INFORMATION_SCHEMA.COLUMNS")

# This will break up the information schema into bite-sized chunks that can be referenced by the LLM
plan = vn.get_training_plan_generic(df_information_schema)
plan

# If you like the plan, then uncomment this and run it to train
vn.train(plan=plan)

Training on DDL

# In duckDB the describe statement can fetch the DDL for any table
vn.train(ddl="DESCRIBE SELECT * FROM Stackoverflow.users;")

Training on SQL statements

# here is an example of training on SQL statements
vn.train(
question="What are the top 10 users with highest amount of Badges?"
,sql="""SELECT UserId, COUNT(*) AS badge_count
FROM stackoverflow.main.badges
GROUP BY UserId
ORDER BY badge_count DESC
LIMIT 10
""")
# Another example
vn.train(
question="What is the difference in total answers for the user who answered the most answers and the user who answered the least questions?",
,sql="SELECT MAX(answer_count) - MIN(answer_count) AS difference
FROM (
SELECT OwnerUserId, COUNT(*) AS answer_count
FROM stackoverflow.main.posts
WHERE PostTypeId = 2
GROUP BY OwnerUserId
) AS answer_counts;
")

Training on Documentation

# You can feed in contextual information using documentation
vn.train(documentation="We call the user with the highest answers in a year the Grand master")

You can view your training data using vn.get_training_data()

Asking Questions

# vn.ask is runs these following functions in sequence, which can be run individually
# 1. vn.generate_ql
# 2. vn.run_sql
# 3. vn.generate_plotly_code
# 4. vn.get_plotly_figure

# this is how you can ask Vanna question's post training
vn.ask(question='Find the user with the highest amount of downvotes?' , visualize= False)
Image by the Author. SQL and answer set generated by query

Lets ask a more challenging question, involving several tables.

# This question involves the badges and post table
vn.ask(vn.ask(question='Find the first post of the user with the highest amount of badges?', visualize=False)
Image by the Author, Shows the SQL generated

Now let us enable visualization, and see the results visualized in Plotly.

vn.ask('Find the top 10 users with the highest amount of Badges?')
Image by the Author, shows the SQL and result.
Image by the Author, the visualization in Plotly

Using the Flask App

Vanna comes with a built-in UI Flask app. Which can be launched inside a jupyter notebook or Python script.

from vanna.flask import VannaFlaskApp
app = VannaFlaskApp(vn)
app.run()
Image by the Author

Inside the Flask app you can ask questions, see visualizations, see training data & edit your training set.

--

--

Arslan Shahid

Life has the Markov property, the future is independent of the past, given the present