Prompt‑Clustering Utility
This repository contains a small utility (cluster_prompts.py) that embeds a
list of prompts with the OpenAI embeddings API, discovers natural groupings with
unsupervised clustering (K-Means or DBSCAN), asks an OpenAI chat model to name
and describe each cluster, and finally produces a concise Markdown report plus
two diagnostic plots.
The default input file (prompts.csv) ships with the repo so you can try the
script immediately, but you can of course point it at your own file.
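Before pointing the script at your own file, it can help to confirm that the file has the expected shape: a prompt column, plus an optional act column that is used as context. The following is a minimal sketch of such a pre-flight check, not code taken from cluster_prompts.py; the helper name load_prompts and the decision to skip blank prompts are assumptions made for illustration.

```python
import csv
import io


def load_prompts(handle):
    """Read rows from an open CSV handle and return (prompt, act) tuples.

    Hypothetical helper: 'act' is optional and returned as '' when absent.
    """
    reader = csv.DictReader(handle)
    fields = reader.fieldnames or []
    if "prompt" not in fields:
        raise ValueError("input CSV must contain a 'prompt' column")
    rows = []
    for row in reader:
        prompt = (row.get("prompt") or "").strip()
        if prompt:  # assumption: rows with an empty prompt are skipped
            rows.append((prompt, (row.get("act") or "").strip()))
    return rows


# In-memory sample standing in for a real file on disk.
sample = io.StringIO('act,prompt\nLinux Terminal,"Act as a terminal."\nPoet,\n')
print(load_prompts(sample))
```

With a real file you would pass an open handle instead, for example open("my_prompts.csv", newline="", encoding="utf-8").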
1. Setup
1. Install the Python dependencies (preferably inside a virtual env):
pip install pandas numpy scikit-learn matplotlib openai
2. Export your OpenAI API key (required):
export OPENAI_API_KEY="sk-..."
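To verify from Python that the key is visible to the process, a small check like the one below can be used. It is a sketch only: the function name get_api_key is hypothetical, and whether the script performs a similar check itself is not stated here.

```python
import os


def get_api_key(env=os.environ):
    """Return the API key, or raise a readable error if it is missing or blank."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before running")
    return key


# Demo with a stand-in mapping so no real key is needed or printed.
print(len(get_api_key({"OPENAI_API_KEY": "placeholder"})))
```

Passing the environment as a parameter keeps the check easy to test without touching the real shell environment.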
2. Basic usage
# Minimal command – runs on prompts.csv and writes analysis.md + plots/
python cluster_prompts.py
This will
- create embeddings with the text-embedding-3-small model,
- pick a suitable number k via silhouette score (K-Means),
- ask gpt-4o-mini to label & describe each cluster,
- store the results in analysis.md,
- and save two plots to plots/ (cluster_sizes.png and tsne.png).
The script prints a short success message once done.
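The silhouette-based choice of k can be illustrated as follows. This is a minimal sketch using scikit-learn, not the script's actual implementation: the function name pick_k, the search range starting at 2, and the synthetic vectors standing in for real embeddings are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def pick_k(vectors, k_max=10, seed=0):
    """Return (best_k, labels) maximising the silhouette score for k in 2..k_max."""
    best_k, best_score, best_labels = None, -1.0, None
    upper = min(k_max, len(vectors) - 1)  # silhouette needs 2 <= k <= n - 1
    for k in range(2, upper + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(vectors)
        score = silhouette_score(vectors, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels


# Synthetic stand-in for embeddings: three well-separated blobs in 8 dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 8, [10.0] * 8, [-10.0] * 8])
vectors = np.vstack([c + rng.normal(scale=0.5, size=(20, 8)) for c in centers])
k, labels = pick_k(vectors, k_max=6)
print(k)
```

The silhouette score rewards clusters that are compact and far from their neighbours, so the loop keeps the k with the highest score rather than the lowest K-Means inertia, which would always favour larger k.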
3. Command‑line options
| flag | default | description |
|---|---|---|
| --csv | prompts.csv | path to the input CSV (must contain a prompt column; an act column is used as context if present) |
| --cache | (none) | embedding cache path (JSON). Speeds up repeated runs; new texts are appended automatically. |
| --cluster-method | kmeans | kmeans (with automatic k) or dbscan |
| --k-max | 10 | upper bound for k when kmeans is selected |
| --dbscan-min-samples | 3 | min samples parameter for DBSCAN |
| --embedding-model | text-embedding-3-small | any OpenAI embedding model |
| --chat-model | gpt-4o-mini | chat model used to generate cluster names / descriptions |
| --output-md | analysis.md | where to write the Markdown report |
| --plots-dir | plots | directory for generated PNGs |
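For orientation, the flags and defaults listed above could be declared with argparse roughly as shown below. This is a sketch reconstructed from the table alone; the real parser in cluster_prompts.py may differ in help texts, validation, or structure, and the function name build_parser is hypothetical.

```python
import argparse


def build_parser():
    """Sketch of a parser matching the documented flags and defaults."""
    p = argparse.ArgumentParser(description="Cluster prompts and write a report.")
    p.add_argument("--csv", default="prompts.csv")
    p.add_argument("--cache", default=None)
    p.add_argument("--cluster-method", choices=["kmeans", "dbscan"], default="kmeans")
    p.add_argument("--k-max", type=int, default=10)
    p.add_argument("--dbscan-min-samples", type=int, default=3)
    p.add_argument("--embedding-model", default="text-embedding-3-small")
    p.add_argument("--chat-model", default="gpt-4o-mini")
    p.add_argument("--output-md", default="analysis.md")
    p.add_argument("--plots-dir", default="plots")
    return p


# argparse maps dashes to underscores: --cluster-method becomes args.cluster_method.
args = build_parser().parse_args(["--cluster-method", "dbscan", "--k-max", "8"])
print(args.cluster_method, args.k_max, args.csv)
```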
Example with customised options:
python cluster_prompts.py \
--csv my_prompts.csv \
--cache .cache/embeddings.json \
--cluster-method dbscan \
--embedding-model text-embedding-3-large \
--chat-model gpt-4o \
--output-md my_analysis.md \
--plots-dir my_plots
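The idea behind the embedding cache can be sketched as follows: look up each text in a JSON file, request embeddings only for texts that are missing, and write the enlarged cache back. The on-disk layout (a JSON object mapping text to vector), the function name embed_with_cache, and the fake_embed stand-in for the real API call are assumptions; the script's actual cache format is not documented here.

```python
import json
import os
import tempfile


def embed_with_cache(texts, cache_path, embed_fn):
    """Return one vector per text, calling embed_fn only for uncached texts.

    Assumed layout: a JSON object mapping each text to its embedding list.
    """
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as fh:
            cache = json.load(fh)
    # dict.fromkeys de-duplicates while keeping the original order.
    missing = [t for t in dict.fromkeys(texts) if t not in cache]
    if missing:
        for text, vector in zip(missing, embed_fn(missing)):
            cache[text] = list(vector)
        os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
        with open(cache_path, "w", encoding="utf-8") as fh:
            json.dump(cache, fh)
    return [cache[t] for t in texts]


def fake_embed(batch):
    """Stand-in for the real embedding request; returns toy 2-d vectors."""
    return [[float(len(t)), float(t.count(" "))] for t in batch]


path = os.path.join(tempfile.mkdtemp(), "embeddings.json")
first = embed_with_cache(["a b", "abc"], path, fake_embed)
# On the second call only "new text" is passed to fake_embed.
second = embed_with_cache(["abc", "new text"], path, fake_embed)
print(first, second)
```

Keying the cache by the text itself is the simplest option; if the embedding model can change between runs, the model name would also need to be part of the key.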
4. Interpreting the output
analysis.md
- Overview table: cluster label, generated name, member count and description.
- Detailed section for every cluster with five representative example prompts.
- Separate lists for
  - Noise / outliers (label -1 when DBSCAN is used) and
  - Potentially ambiguous prompts (only with K-Means): these are items that lie almost equally close to two centroids and might belong to multiple groups.
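One way to make "almost equally close to two centroids" concrete is to compare each point's distance to its nearest centroid with its distance to the second-nearest one. The sketch below flags a point when that ratio is close to 1. The 0.9 threshold and the function name ambiguous_indices are illustrative assumptions; the exact rule used by the script is not stated in this README.

```python
import math


def ambiguous_indices(points, centroids, ratio_threshold=0.9):
    """Return indices whose nearest and second-nearest centroid distances are close.

    ratio = d_nearest / d_second lies in [0, 1]; values near 1 mean the point
    sits almost midway between two centroids. The 0.9 cut-off is an assumption.
    Requires at least two centroids.
    """
    flagged = []
    for i, p in enumerate(points):
        d1, d2 = sorted(math.dist(p, c) for c in centroids)[:2]
        if d2 > 0 and d1 / d2 >= ratio_threshold:
            flagged.append(i)
    return flagged


# Toy 2-d example: only the middle point lies between the two centroids.
centroids = [(0.0, 0.0), (10.0, 0.0)]
points = [(1.0, 0.0), (5.2, 0.0), (9.0, 1.0)]
print(ambiguous_indices(points, centroids))
```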
plots/cluster_sizes.png
Quick bar‑chart visualisation of how many prompts ended up in each cluster.
5. Troubleshooting
- Rate-limits / quota errors: lower the number of prompts per run or switch to an account with a larger quota.
- Authentication errors: make sure OPENAI_API_KEY is exported in the shell where you run the script.
- Inadequate clusters: try the other clustering method, adjust --k-max or tune DBSCAN parameters (the eps range is inferred, min_samples is exposed via the CLI).
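If you wrap API calls in your own code, a complementary way to cope with rate limits is to retry with exponential backoff. This is a generic sketch of that technique and not something the script is stated to do; the function name call_with_backoff, the delay values, and the flaky stand-in are assumptions, and the exception class to retry on depends on the client library you use.

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))


# Demo with a stand-in that fails twice before succeeding.
state = {"calls": 0}


def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("simulated rate limit")
    return "ok"


print(call_with_backoff(flaky, base_delay=0.01, retry_on=(TimeoutError,)))
```

The small random jitter spreads out retries so that several processes hitting the same limit do not all retry at the same instant.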