ChatGPT vs Claude 3 — Which is better for text-to-SQL

Arslan Shahid
6 min read · Apr 11, 2024

Since Anthropic released Claude 3, everyone has been comparing it with OpenAI’s ChatGPT models. In keeping with that trend, this post evaluates the two model families on text-to-SQL tasks.

Methodology

The evaluation method is straightforward: each model receives a natural-language question to translate into a SQL query, and the query’s output is compared with the result of a pre-defined correct answer. If they match, the model is counted as accurate.
Results are tabulated, and the model with the highest accuracy percentage is deemed the best. We will be using a training set provided by DuckDB for the evaluation.
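As a rough sketch of the correctness check (the concrete version used later in this post compares only the number of rows returned by the two queries):

import pandas as pd

def results_match(df_model: pd.DataFrame, df_expected: pd.DataFrame) -> bool:
    # Loose equality: the generated query counts as correct when it returns
    # the same number of rows as the reference query.
    return df_model is not None and df_model.shape[0] == df_expected.shape[0]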

Limitations

This method has the following limitations:

  1. Query optimization: it only considers the output of the query. There are usually many different ways to fetch the same data, and some are better optimized (take less time/compute) depending on the database. This simple method does not take query optimization into account (see the sketch after this list).
  2. Limited to DuckDB: our evaluation set was tailored for DuckDB. A model that scores well in this test might not perform as well on other SQL databases.
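For example, the following two queries (hypothetical tables, not part of the evaluation set) return the same result, yet the correlated-subquery version can be far slower on some engines; both would score identically under this method:

# Two semantically equivalent queries with very different cost profiles
query_a = """
SELECT customer_id, COUNT(*) AS n_orders
FROM orders
GROUP BY customer_id
"""

query_b = """
SELECT DISTINCT customer_id,
       (SELECT COUNT(*) FROM orders o2
        WHERE o2.customer_id = o1.customer_id) AS n_orders
FROM orders o1
"""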

Looking for an AI expert to solve your problem related to Large Language Models & data science? Click here:

https://form.jotform.com/240744327173051

Running an LLM

It takes only a few lines of code to run any Large Language Model from OpenAI or Anthropic.

OpenAI

from vanna.openai.openai_chat import OpenAI_Chat
from vanna.vannadb.vannadb_vector import VannaDB_VectorStore

# Combine Vanna's hosted vector store with OpenAI's chat models
class MyVanna(VannaDB_VectorStore, OpenAI_Chat):
    def __init__(self, config=None):
        MY_VANNA_MODEL = ''    # Your model name from https://vanna.ai/account/profile
        MY_VANNA_API_KEY = ''  # Your Vanna API key
        VannaDB_VectorStore.__init__(self, vanna_model=MY_VANNA_MODEL, vanna_api_key=MY_VANNA_API_KEY, config=config)
        OpenAI_Chat.__init__(self, config=config)

# Provide your OpenAI api_key and which model you would like to use
vn = MyVanna(config={'api_key': 'sk-...', 'model': 'gpt-4-...'})

Anthropic

from vanna.anthropic.anthropic_chat import Anthropic_Chat
from vanna.vannadb.vannadb_vector import VannaDB_VectorStore

# Combine Vanna's hosted vector store with Anthropic's chat models
class MyVanna(VannaDB_VectorStore, Anthropic_Chat):
    def __init__(self, config=None):
        MY_VANNA_MODEL = ''    # Your model name from https://vanna.ai/account/profile
        MY_VANNA_API_KEY = ''  # Your Vanna API key
        VannaDB_VectorStore.__init__(self, vanna_model=MY_VANNA_MODEL, vanna_api_key=MY_VANNA_API_KEY, config=config)
        Anthropic_Chat.__init__(self, config=config)

api_key = ''       # Insert your Anthropic API key
model = 'claude-'  # You can change this to any Anthropic model
config = {'api_key': api_key, 'model': model}

# Pass the configuration to the MyVanna object
vn = MyVanna(config=config)
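Once a MyVanna object is configured (with either provider), generating SQL takes one call. A minimal usage sketch, with a hypothetical database path and question:

# Connect to a local DuckDB file and ask a natural-language question
vn.connect_to_duckdb('path/to/database.duckdb')  # hypothetical path
sql = vn.generate_sql(question='How many orders were placed last month?')
print(sql)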

Evaluation Data Overview

For the evaluation, we will be using a dataset with the following schema:

Image by the Author: a JSON file with the evaluation data

It has six columns. One is db_id (an identifier for the database); question is the natural-language question, and query is the correct SQL we need to test against.
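The evaluation code below refers to this data as test_df. A minimal loading sketch, assuming the set is stored as a JSON file (the filename here is a placeholder) with at least the db_id, question and query columns:

import pandas as pd

# Load the evaluation set into a DataFrame; 'dev.json' is a placeholder filename
test_df = pd.read_json('dev.json')
print(test_df[['db_id', 'question', 'query']].head())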

Training and Evaluation

Vanna allows users to train any LLM on the specifics of their database. It lets you build a RAG by feeding it DDL, question-SQL pairs and documentation (metadata or business knowledge). For a detailed explainer on using Vanna for training, you can read my earlier post on the topic.
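For reference, these are the three training hooks Vanna exposes (a minimal sketch; the table, statement and question below are hypothetical):

# Three ways to feed context into a Vanna RAG
vn.train(ddl='CREATE TABLE orders (id INTEGER, customer_id INTEGER, total DOUBLE)')
vn.train(documentation='Revenue is defined as the sum of orders.total.')
vn.train(question='What is total revenue?', sql='SELECT SUM(total) FROM orders')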

This is how the training set was created:

# These are the models used
models = [
    {'model_id': 'gpt-3.5-turbo', 'provider': 'openai'},
    {'model_id': 'gpt-4-turbo-preview', 'provider': 'openai'},
    {'model_id': 'gpt-4', 'provider': 'openai'},

    {'model_id': 'claude-3-opus-20240229', 'provider': 'anthropic'},
    {'model_id': 'claude-3-sonnet-20240229', 'provider': 'anthropic'},
    {'model_id': 'claude-3-haiku-20240307', 'provider': 'anthropic'},
]

Next, for each of these models we train two Vanna objects: one with just schema information and one with schema plus question-SQL pairs.

# Iterate over all models
for model in models:
    model_id = model['model_id']

    output_data = []
    timestamp = pd.Timestamp.now().strftime('%Y%m%d%H%M%S')
    output_data_filename = f'output/output_data_{model_id}_{timestamp}.csv'
    # Iterate over all databases in the evaluation data (shown above)
    for db_id in test_df['db_id'].unique():
        # Local ChromaDB folders holding the two RAG variants
        folder_schema_only = f'./chromadb/{db_id}/schema-only'
        folder_schema_and_reference_sql = f'./chromadb/{db_id}/schema-and-reference-sql'
        # Pass in the configuration for OpenAI
        # (Vanna_OpenAI / Vanna_Anthropic are classes analogous to MyVanna above,
        #  backed by a local ChromaDB vector store via the 'path' setting)
        if model['provider'] == 'openai':
            vn_schema = Vanna_OpenAI(config={'api_key': os.environ['OPENAI_API_KEY'], 'model': model_id, 'path': folder_schema_only})
            vn_reference_sql = Vanna_OpenAI(config={'api_key': os.environ['OPENAI_API_KEY'], 'model': model_id, 'path': folder_schema_and_reference_sql})
        # Pass in the configuration for Anthropic
        elif model['provider'] == 'anthropic':
            vn_schema = Vanna_Anthropic(config={'api_key': os.environ['ANTHROPIC_API_KEY'], 'model': model_id, 'path': folder_schema_only})
            vn_reference_sql = Vanna_Anthropic(config={'api_key': os.environ['ANTHROPIC_API_KEY'], 'model': model_id, 'path': folder_schema_and_reference_sql})
        # Connect to the DuckDB database
        vn_schema.connect_to_duckdb(f'md_text2sql_benchmark/databases/{db_id}/{db_id}.duckdb')
        vn_reference_sql.connect_to_duckdb(f'md_text2sql_benchmark/databases/{db_id}/{db_id}.duckdb')

        # Train both RAGs on the schema pulled from INFORMATION_SCHEMA
        df_information_schema = vn_schema.run_sql("SELECT * FROM INFORMATION_SCHEMA.COLUMNS")
        plan = vn_schema.get_training_plan_generic(df_information_schema)
        vn_schema.train(plan=plan)
        vn_reference_sql.train(plan=plan)

        num_training_data = vn_schema.get_training_data().shape[0]
        # The get_sql_prediction function is explained below
        for index, row in test_df.query(f'db_id == "{db_id}"').iterrows():
            get_sql_prediction(output_data, vn_schema, db_id=db_id, model_id=model_id, context='schema-only', question=row['question'], expected_sql=row['query'], num_training_data=num_training_data)

        num_training_data = vn_reference_sql.get_training_data().shape[0]

        for index, row in test_df.query(f'db_id == "{db_id}"').iterrows():
            get_sql_prediction(output_data, vn_reference_sql, db_id=db_id, model_id=model_id, context='schema-and-reference-sql', question=row['question'], expected_sql=row['query'], num_training_data=num_training_data)
    # Save the results as a CSV to be loaded in the future
    pd.DataFrame(output_data).to_csv(output_data_filename, index=False)

SQL prediction

def get_sql_prediction(output_data, vn, db_id, model_id, context, question, expected_sql, num_training_data):
    # Generate the SQL based on the natural-language question
    sql = vn.generate_sql(question=question)

    try:
        # Run the reference SQL
        expected_df = vn.run_sql(expected_sql)
        expected_df_md = expected_df.to_markdown()
    except Exception as e:
        expected_df = None
        expected_df_md = str(e)

    correct = 0

    try:
        # Run the generated SQL
        df = vn.run_sql(sql)
        df_md = df.to_markdown()

        if expected_df is not None:
            # Marked correct when the output has the same number of rows as expected
            if expected_df.shape[0] == df.shape[0]:
                correct = 1

    except Exception as e:
        df_md = str(e)
    # Record the data point; the list is saved as a CSV later
    output_data.append({
        'db_id': db_id,
        'model': model_id,
        'context': context,
        'num_training_data': num_training_data,
        'question': question,
        'expected_sql': expected_sql,
        'sql': sql,
        'expected_df': expected_df_md,
        'df': df_md,
        'correct': correct,
    })

The code snippets above do the following: for each model, two RAGs are built, one trained only on the schema and the other trained on both the schema and question-SQL pairs. After training, each RAG is asked the evaluation questions and the resulting dataframe is compared with the reference dataframe (by row count). If they match, the answer is marked as correct.

Results

After evaluation, an accuracy metric is computed by dividing the number of correct answers by the total number of questions asked.
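A minimal sketch of that computation, assuming the per-model CSVs produced above have been concatenated into a single DataFrame called results:

# Accuracy (%) per model and per training context
accuracy = (
    results
    .groupby(['model', 'context'])['correct']
    .mean()
    .mul(100)
    .round(1)
)
print(accuracy)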

Same results visualized

Claude 3 Opus and GPT-4 are the most accurate, at around 59% accuracy with schema + reference SQL and 56% with schema only. Interestingly, GPT-4-turbo is less accurate overall than GPT-4, which is unexpected. The other two Claude 3 models beat GPT-3.5.

Cost

The comparison is incomplete without looking at per-token cost, as users might trade off a bit of accuracy for lower costs.

(Charts by the Author: input cost, output cost, per-call cost and total cost for each model.)
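If you want to reproduce the per-call figures, the arithmetic is simple. A minimal sketch, assuming you plug in each provider's current per-million-token prices (the zeros below are placeholders, not the prices used in the charts):

# Price table in USD per million tokens; replace the zeros with current provider pricing
prices = {
    'gpt-4': {'input': 0.0, 'output': 0.0},
    'claude-3-opus-20240229': {'input': 0.0, 'output': 0.0},
}

def call_cost(model_id, input_tokens, output_tokens):
    # Cost of one call = tokens used in each direction times the per-token price
    p = prices[model_id]
    return (input_tokens * p['input'] + output_tokens * p['output']) / 1_000_000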

Claude 3 Opus is the most expensive, given that the output token counts are similar for ChatGPT and Claude Opus. Weighing cost against accuracy: if you want the highest accuracy and cost is not a major issue, use GPT-4. If you want reasonable output at a lower cost, Claude 3 Haiku and Sonnet are the preferred models, with Claude 3 Haiku being the cheapest option.
