Chat with 150K HuggingFace datasets using Vanna.AI with GPT-4o

Chat with the Hugging Face dataset repository via Vanna

Arslan Shahid
3 min read · Jun 4, 2024
Image by the Author

Imagine having access to over 150,000 datasets for analysis, all through a simple chat interface. With Vanna.AI and DuckDB, this is straightforward to build.

Vanna.AI lets you use any LLM to build a text-to-SQL pipeline, and DuckDB lets you connect to external data sources like Hugging Face.

Connecting to Vanna AI

In order to connect to Vanna, you first need to get your free API key from here! Vanna lets you connect to any LLM, but for the purpose of this post, GPT-4o was used. This is how you can connect, using your OpenAI API key.

from vanna.openai import OpenAI_Chat
from vanna.vannadb import VannaDB_VectorStore

class MyVanna(VannaDB_VectorStore, OpenAI_Chat):
    def __init__(self, config=None):
        MY_VANNA_MODEL = '...'    # Your model name from https://vanna.ai/account/profile
        MY_VANNA_API_KEY = '...'  # Your Vanna API key
        VannaDB_VectorStore.__init__(self, vanna_model=MY_VANNA_MODEL, vanna_api_key=MY_VANNA_API_KEY, config=config)
        OpenAI_Chat.__init__(self, config=config)

# Add your OpenAI api_key
vn = MyVanna(config={'api_key': 'sk-...', 'model': 'gpt-4o'})

Connecting to DuckDB

DuckDB allows you to download datasets from Hugging Face directly. The first step in the process is to connect to DuckDB.

# This is how you can connect to a DuckDB database (here hosted on MotherDuck)
vn.connect_to_duckdb(url='motherduck:?motherduck_token=<token>')

After connecting, explore the datasets available on Hugging Face!

Image by the Author, shows the Hugging Face Dataset repository

Downloading the dataset

You can choose any dataset you like; this post uses the FineWeb dataset!

In order to load the dataset into your environment, you need to pass in a dataset reference like this:

hf://datasets/⟨my_username⟩/⟨my_dataset⟩/⟨path_to_file⟩
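As a concrete illustration, the reference for the FineWeb file used below can be assembled from those three parts (a minimal sketch; the org, dataset name, and file path come from the Hugging Face repository):

```python
# Build a Hugging Face dataset reference of the form
# hf://datasets/<user_or_org>/<dataset>/<path_to_file>
org = "HuggingFaceFW"
dataset = "fineweb"
path_to_file = "data/CC-MAIN-2013-20/000_00000.parquet"

ref = f"hf://datasets/{org}/{dataset}/{path_to_file}"
print(ref)  # hf://datasets/HuggingFaceFW/fineweb/data/CC-MAIN-2013-20/000_00000.parquet
```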

You can run this SQL to download the dataset.

# Running this will download the dataset into a table and return its rows
vn.run_sql("""
CREATE TABLE Fineweb AS
SELECT * FROM 'hf://datasets/HuggingFaceFW/fineweb/data/CC-MAIN-2013-20/000_00000.parquet';
SELECT * FROM Fineweb""")

Training

Vanna can help you build a RAG application that knows the schema of your database.

Training on a Plan

# The information schema query may need some tweaking depending on your database. This is a good starting point.
df_information_schema = vn.run_sql("SELECT * FROM INFORMATION_SCHEMA.COLUMNS")
# This will break up the information schema into bite-sized chunks that can be referenced by the LLM
plan = vn.get_training_plan_generic(df_information_schema)
plan
# If you like the plan, then run this to train
vn.train(plan=plan)

Training on DDL

# In DuckDB, the DESCRIBE statement can fetch the schema for any table
vn.train(ddl="DESCRIBE SELECT * FROM Fineweb")

Training on Question/SQL Pair

# Here is an example of training on a SQL statement
# In this dataset we calculate the most common words in the text column
vn.train(sql="""
SELECT word, COUNT(*) AS frequency
FROM (
    SELECT UNNEST(STRING_SPLIT(LOWER(text), ' ')) AS word
    FROM Fineweb
) AS words
GROUP BY word
ORDER BY frequency DESC
LIMIT 10;
""", question="What are the most common words in text?")

Training on Documentation

# You can use documentation to give the model the explicit context you would give to a data analyst
vn.train(documentation="The number of workers column corresponds to people laid off")

Chat

You can use Vanna's ask function to ask questions, or launch the built-in UI.

vn.ask("Show me the most common words in text")
Image by Author — Shows the Plotly chart for the question

You can use the following lines of code to launch the Flask app:

from vanna.flask import VannaFlaskApp
app = VannaFlaskApp(vn)
app.run()
Image by the Author — Showcasing Vanna Flask App

Thank you for reading!

