ChatGPT vs Claude 3 — Which is better for text-to-SQL

Arslan Shahid
6 min readApr 11, 2024

--

Image by the Author

Since Anthropic released Claude 3, everyone has been comparing it with OpenAI’s ChatGPT models. This post will evaluate these two models on text-to-SQL tasks in keeping with the general trend.

Methodology

The evaluation method is straightforward: models receive questions to translate into SQL queries. Their output is compared with a pre-defined correct answer. If they match, the model performed accurately.
Results will be tabulated, and the model with the highest accuracy percentage will be deemed the best. We would be using this training set provided by DuckDB :

Limitations

This method has the following limitations

  1. Query Optimization: It only considers the output of the query. There are usually many different ways to fetch the same data, some are better optimized (take less time/compute) depending on the database. This simple method does not take query optimization into account
  2. Limited to DuckDB: Our evaluation set was tailored for DuckDB. A model scoring well in our test might not perform as well in other SQL databases.

Looking for an AI expert to solve your problem related to Large Language Models & data science? Click here:

https://form.jotform.com/240744327173051

Running an LLM

It takes a very few lines of code to run any Large Language Model from OpenAI or Anthropic.

OpenAI

from vanna.openai.openai_chat import OpenAI_Chat
from vanna.vannadb.vannadb_vector import VannaDB_VectorStore

#Add Vanna model and Vanna api_key
class MyVanna(VannaDB_VectorStore, OpenAI_Chat):
def __init__(self, config=None):
MY_VANNA_MODEL = # Your model name from https://vanna.ai/account/profile
VannaDB_VectorStore.__init__(self, vanna_model=MY_VANNA_MODEL, vanna_api_key=MY_VANNA_API_KEY, config=config)
OpenAI_Chat.__init__(self, config=config)

#Provide your api_key and which model you would like to use
vn = MyVanna(config={'api_key': 'sk-...', 'model': 'gpt-4-...'})

Anthropic

from vanna.anthropic.anthropic_chat import Anthropic_Chat


class MyVanna(VannaDB_VectorStore, Anthropic_Chat):
def __init__(self, config=None):
MY_VANNA_MODEL = # Your model name from https://vanna.ai/account/profile
VannaDB_VectorStore.__init__(self, vanna_model=MY_VANNA_MODEL, vanna_api_key= #insert Vanna_API key, config=config)
Anthropic_Chat.__init__(self, config=config)

api_key = # insert your API key
# You can change models to any Anthropic model
model = "claude-"
config = {'api_key':api_key,'model':model}

# Passes the configuration to MyVanna object
vn = MyVanna(config=config)

Evaluation Data Overview

For the evaluation we would be using a dataset with the following schema:

Image by the Author. Json file with the data
Image by the Author

It has six columns, one is the db_id (an indentifier for the database). Question is the natural language question and query is the correct SQL we need to test against.

Training and Evaluation

Vanna allows user to train any LLM on the specifics of their database. It allows you to build a RAG by feeding ddl, question-sql pairs and documentation (metadata or business knowledge). For a detailed explaner on using Vanna for training you can read this post:

This is how training set was created:

#These are the models used
models = [
{'model_id': 'gpt-3.5-turbo', 'provider': 'openai'},
{'model_id': 'gpt-4-turbo-preview', 'provider': 'openai'},
{'model_id': 'gpt-4', 'provider': 'openai'},

{'model_id': 'claude-3-opus-20240229', 'provider': 'anthropic'},
{'model_id': 'claude-3-sonnet-20240229', 'provider': 'anthropic'},
{'model_id': 'claude-3-haiku-20240307', 'provider': 'anthropic'},
]

Next we would train for each of these model one Vanna object with just schema information and one with schema + question-SQL pairs

#iterates over all models
for model in models:
model_id = model['model_id']

output_data = []
timestamp = pd.Timestamp.now().strftime('%Y%m%d%H%M%S')
output_data_filename = f'output/output_data_{model_id}_{timestamp}.csv'
# Iterates over all databases of the evaluation data (shown above)
for db_id in test_df['db_id'].unique():
# This loades the data
folder_schema_only = f'./chromadb/{db_id}/schema-only'
folder_schema_and_reference_sql = f'./chromadb/{db_id}/schema-and-reference-sql'
# Passes in the configuration for openai
if model['provider'] == 'openai':
vn_schema = Vanna_OpenAI(config={'api_key': os.environ['OPENAI_API_KEY'], 'model': model_id, 'path': folder_schema_only})
vn_reference_sql = Vanna_OpenAI(config={'api_key': os.environ['OPENAI_API_KEY'], 'model': model_id, 'path': folder_schema_and_reference_sql})
# Passes in the configuration for anthropic
elif model['provider'] == 'anthropic':
vn_schema = Vanna_Anthropic(config={'api_key': os.environ['ANTHROPIC_API_KEY'], 'model': model_id, 'path': folder_schema_only})
vn_reference_sql = Vanna_Anthropic(config={'api_key': os.environ['ANTHROPIC_API_KEY'], 'model': model_id, 'path': folder_schema_and_reference_sql})
#COnnects to the db
vn_schema.connect_to_duckdb(f'md_text2sql_benchmark/databases/{db_id}/{db_id}.duckdb')
vn_reference_sql.connect_to_duckdb(f'md_text2sql_benchmark/databases/{db_id}/{db_id}.duckdb')


df_information_schema = vn_schema.run_sql("SELECT * FROM INFORMATION_SCHEMA.COLUMNS")
plan = vn_schema.get_training_plan_generic(df_information_schema)
vn_schema.train(plan=plan)
vn_reference_sql.train(plan=plan)

num_training_data = vn_schema.get_training_data().shape[0]
# Get SQL prediction function is explained below
for index, row in test_df.query(f'db_id == "{db_id}"').iterrows():
get_sql_prediction(output_data, vn_schema, db_id=db_id, model_id=model_id, context='schema-only', question=row['question'], expected_sql=row['query'], num_training_data=num_training_data)

num_training_data = vn_reference_sql.get_training_data().shape[0]

for index, row in test_df.query(f'db_id == "{db_id}"').iterrows():
get_sql_prediction(output_data, vn_reference_sql, db_id=db_id, model_id=model_id, context='schema-and-reference-sql', question=row['question'], expected_sql=row['query'], num_training_data=num_training_data)
# Saves the dataset into csvs to be loaded in the future
pd.DataFrame(output_data).to_csv(output_data_filename, index=False)

SQL prediction

def get_sql_prediction(output_data, vn, db_id, model_id, context, question, expected_sql, num_training_data):
#Generates the SQL based on the question
sql = vn.generate_sql(question=question)

try:
#Runs the sql
expected_df = vn.run_sql(expected_sql)
expected_df_md = expected_df.to_markdown()
except Exception as e:
expected_df = None
expected_df_md = str(e)

correct = 0

try:
df = vn.run_sql(sql)
df_md = df.to_markdown()

if expected_df is not None:
# Output has the same rows as expected
if expected_df.shape[0] == df.shape[0]:
correct = 1

except Exception as e:
df_md = str(e)
#Outputs the data point that will be saved as a CSV
output_data.append({
'db_id': db_id,
'model': model_id,
'context': context,
'num_training_data': num_training_data,
'question': question,
'expected_sql': expected_sql,
'sql': sql,
'expected_df': expected_df_md,
'df': df_md,
'correct': correct,
})

The above code snippets do the following things:

Take each model then build two RAGs for each model. One only trained on the schema and the other trained on both the schema and sql-question pairs. After training it asks each RAG to answer the questions and compares the resulting dataframe with the correct dataframe. If it matches it assigns it as correct.

Results

After evaluation a accuracy metric is computed by dividing the number of correct answers with total questions asked.

Same results visualized

Claude 3 Opus and Chat GPT 4 are the most accurate, with around 59% accuracy for SQL + schema and 56% for just schema. Interestingly, GPT-4-turbo is less accurate overall than GPT 4, which is unexpected. The other two Claude model’s beat ChatGPT 3.5.

Cost

The comparison is incomplete without looking at the per token cost, as users might tradeoff a bit of accuracy for lower costs.

Input Cost for models
Output Cost for the models
Per Call Cost
Total Cost

Claude Opus is the most expensive provided that the output tokens are similar for ChatGPT and Claude Opus. Comparing the cost if you prefer the highest accuracy and cost is not a major issue, you should use ChatGPT 4. However, if you would like a reasonable output compared with lower cost the Claude 3 Haiku and Sonnet are the preferred models. With Claude 3 Haiku being the cheapest options to use.

--

--

Responses (2)