Files

Ilan Bigio 59a180ddec Initial commit

Signed-off-by: Ilan Bigio <ilan@openai.com>

2025-04-16 12:56:08 -04:00

plots
plots_dbscan
analysis_dbscan.md
analysis.md
cluster_prompts.py
Clustering.ipynb
prompts.csv
README.md

README.md

Prompt‑Clustering Utility

This repository contains a small utility (cluster_prompts.py) that embeds a list of prompts with the OpenAI Embedding API, discovers natural groupings through unsupervised clustering, asks an OpenAI chat model to name and describe each cluster, and finally writes a concise Markdown report plus two diagnostic plots.

The default input file (prompts.csv) ships with the repo so you can try the script immediately, but you can of course point it at your own file.

1. Setup

1. Install the Python dependencies (preferably inside a virtual env):

pip install pandas numpy scikit-learn matplotlib openai

2. Export your OpenAI API key (required):

export OPENAI_API_KEY="sk-..."
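
If you want to confirm that the key is visible to Python before starting a longer run, a check along the following lines can help. This is a minimal sketch: `require_api_key` is a hypothetical helper written for illustration and is not part of cluster_prompts.py.

```python
import os


def require_api_key(env=os.environ):
    """Return the API key, or raise a readable error if it is missing.

    `env` is any mapping; it defaults to the process environment.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it in this shell before running the script."
        )
    return key
```

Calling `require_api_key()` at the top of a script fails fast with a clear message instead of surfacing an authentication error after the first API request.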

2. Basic usage

# Minimal command – runs on prompts.csv and writes analysis.md + plots/
python cluster_prompts.py

This will:

- create embeddings with the text-embedding-3-small model,
- pick a suitable number of clusters k via the silhouette score (K‑Means),
- ask gpt-4o-mini to label and describe each cluster,
- store the results in analysis.md,
- and save two plots to plots/ (cluster_sizes.png and tsne.png).

The script prints a short success message once done.
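
To make the "pick k via silhouette score" step concrete, here is a minimal sketch using scikit-learn. It is an illustration under assumptions, not the script's actual code: `pick_k`, the candidate range starting at 2, and the synthetic 2-D points standing in for real embeddings are all hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def pick_k(embeddings, k_max=10, seed=0):
    """Return (best_k, labels) for the k in 2..k_max with the highest silhouette score."""
    n = len(embeddings)
    best_k, best_score, best_labels = None, -1.0, None
    # silhouette_score is only defined for 2 <= k <= n - 1
    for k in range(2, min(k_max, n - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels


# Synthetic stand-in for real embeddings: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])
best_k, labels = pick_k(points, k_max=6)
```

The silhouette score rewards clusterings whose members are close to each other and far from the next-nearest cluster, which is why it can serve as a criterion for choosing k without labelled data.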

3. Command‑line options

flag	default	description
`--csv`	`prompts.csv`	path to the input CSV (must contain a `prompt` column; an `act` column is used as context if present)
`--cache`	(none)	embedding cache path (JSON). Speeds up repeated runs – new texts are appended automatically.
`--cluster-method`	`kmeans`	`kmeans` (with automatic k) or `dbscan`
`--k-max`	`10`	upper bound for k when `kmeans` is selected
`--dbscan-min-samples`	`3`	min samples parameter for DBSCAN
`--embedding-model`	`text-embedding-3-small`	any OpenAI embedding model
`--chat-model`	`gpt-4o-mini`	chat model used to generate cluster names / descriptions
`--output-md`	`analysis.md`	where to write the Markdown report
`--plots-dir`	`plots`	directory for generated PNGs
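
For readers who want to see how such a flag set maps onto code, the table above could be declared with `argparse` roughly as follows. The flag names and defaults are taken from the table; `build_parser` and the way the parser is structured are assumptions, and the real script may differ.

```python
import argparse


def build_parser():
    """Declare the flags listed in the options table (illustrative sketch)."""
    p = argparse.ArgumentParser(description="Cluster prompts and write a Markdown report.")
    p.add_argument("--csv", default="prompts.csv", help="input CSV with a 'prompt' column")
    p.add_argument("--cache", default=None, help="embedding cache path (JSON)")
    p.add_argument("--cluster-method", choices=["kmeans", "dbscan"], default="kmeans")
    p.add_argument("--k-max", type=int, default=10, help="upper bound for k (kmeans only)")
    p.add_argument("--dbscan-min-samples", type=int, default=3)
    p.add_argument("--embedding-model", default="text-embedding-3-small")
    p.add_argument("--chat-model", default="gpt-4o-mini")
    p.add_argument("--output-md", default="analysis.md")
    p.add_argument("--plots-dir", default="plots")
    return p


# argparse turns dashes into underscores: --k-max becomes args.k_max.
args = build_parser().parse_args(["--cluster-method", "dbscan", "--k-max", "8"])
```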

Example with customised options:

python cluster_prompts.py \
  --csv my_prompts.csv \
  --cache .cache/embeddings.json \
  --cluster-method dbscan \
  --embedding-model text-embedding-3-large \
  --chat-model gpt-4o \
  --output-md my_analysis.md \
  --plots-dir my_plots
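
The `--cache` option stores embeddings in a JSON file so that repeated runs only embed new texts. The sketch below shows one way such a cache could work. The on-disk layout (a JSON object mapping each text to its vector) and the names `load_cache`, `save_cache`, `embed_with_cache` and `embed_fn` are assumptions for illustration; the script's actual format may differ.

```python
import json
import os


def load_cache(path):
    """Load the text -> vector mapping, or return an empty dict if there is no file."""
    if path and os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {}


def save_cache(path, cache):
    """Write the cache, creating the parent directory if needed."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cache, f)


def embed_with_cache(texts, embed_fn, path=None):
    """Return one vector per text, calling embed_fn only for texts not yet cached.

    embed_fn is a hypothetical callable: list of strings in, list of vectors out.
    """
    cache = load_cache(path)
    # dict.fromkeys de-duplicates while preserving order
    missing = [t for t in dict.fromkeys(texts) if t not in cache]
    if missing:
        for text, vector in zip(missing, embed_fn(missing)):
            cache[text] = list(vector)
        if path:
            save_cache(path, cache)
    return [cache[t] for t in texts]
```

In a real run, `embed_fn` would wrap the call to the embedding API; keeping it as a parameter makes the caching logic easy to test without network access.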

4. Interpreting the output

analysis.md

- Overview table: cluster label, generated name, member count and description.
- Detailed section for every cluster with five representative example prompts.
- Separate lists for
  - noise / outliers (label -1, when DBSCAN is used) and
  - potentially ambiguous prompts (K‑Means only): items that lie almost equally close to two centroids and might belong to more than one group.
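
One way to think about "almost equally close to two centroids" is to compare the distance to the nearest centroid with the distance to the second-nearest one. The sketch below flags a point when that ratio is close to 1. The function name `ambiguous_items` and the 0.9 threshold are assumptions chosen for illustration; the script may use a different criterion.

```python
import math


def ambiguous_items(points, centroids, ratio_threshold=0.9):
    """Return indices of points whose two nearest centroids are almost equally far away.

    Requires at least two centroids.
    """
    flagged = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, c) for c in centroids)
        nearest, second = dists[0], dists[1]
        # A ratio near 1 means the point sits close to the boundary between two clusters.
        if second == 0 or nearest / second >= ratio_threshold:
            flagged.append(i)
    return flagged


centroids = [(0.0, 0.0), (10.0, 0.0)]
points = [(1.0, 0.0), (5.1, 0.0), (9.0, 1.0)]
flagged = ambiguous_items(points, centroids)
```

Here only the middle point, which lies roughly halfway between the two centroids, is flagged.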

plots/cluster_sizes.png

Quick bar‑chart visualisation of how many prompts ended up in each cluster.

5. Troubleshooting

Rate‑limit / quota errors – lower the number of prompts per run or switch to an account with a larger quota.
Authentication errors – make sure OPENAI_API_KEY is exported in the shell where you run the script.
Inadequate clusters – try the other clustering method, adjust --k-max, or tune the DBSCAN parameters (the eps range is inferred; min_samples is exposed via --dbscan-min-samples).
