# Prompt-Clustering Utility

This repository contains a small utility (`cluster_prompts.py`) that embeds a list of prompts with the OpenAI Embedding API, discovers natural groupings with unsupervised clustering, lets ChatGPT name and describe each cluster, and finally produces a concise Markdown report plus two diagnostic plots.

The default input file (`prompts.csv`) ships with the repo so you can try the script immediately, but you can point it at your own file.

---

## 1. Setup

1. Install the Python dependencies (preferably inside a virtual env):

   ```bash
   pip install pandas numpy scikit-learn matplotlib openai
   ```

2. Export your OpenAI API key (**required**):

   ```bash
   export OPENAI_API_KEY="sk-..."
   ```

---

## 2. Basic usage

```bash
# Minimal command – runs on prompts.csv and writes analysis.md + plots/
python cluster_prompts.py
```

This will

* create embeddings with the `text-embedding-3-small` model,
* pick a suitable number of clusters *k* via silhouette score (K-Means),
* ask `gpt-4o-mini` to label and describe each cluster,
* store the results in `analysis.md`,
* and save two plots to `plots/` (`cluster_sizes.png` and `tsne.png`).

The script prints a short success message once done.

---

## 3. Command-line options

| flag | default | description |
|------|---------|-------------|
| `--csv` | `prompts.csv` | path to the input CSV (must contain a `prompt` column; an `act` column is used as context if present) |
| `--cache` | _(none)_ | embedding cache path (JSON). Speeds up repeated runs; new texts are appended automatically. |
| `--cluster-method` | `kmeans` | `kmeans` (with automatic *k*) or `dbscan` |
| `--k-max` | `10` | upper bound for *k* when `kmeans` is selected |
| `--dbscan-min-samples` | `3` | `min_samples` parameter for DBSCAN |
| `--embedding-model` | `text-embedding-3-small` | any OpenAI embedding model |
| `--chat-model` | `gpt-4o-mini` | chat model used to generate cluster names and descriptions |
| `--output-md` | `analysis.md` | where to write the Markdown report |
| `--plots-dir` | `plots` | directory for generated PNGs |

Example with customised options:

```bash
python cluster_prompts.py \
  --csv my_prompts.csv \
  --cache .cache/embeddings.json \
  --cluster-method dbscan \
  --embedding-model text-embedding-3-large \
  --chat-model gpt-4o \
  --output-md my_analysis.md \
  --plots-dir my_plots
```

---

## 4. Interpreting the output

### analysis.md

* Overview table: cluster label, generated name, member count and description.
* Detailed section for every cluster with five representative example prompts.
* Separate lists for
  * **Noise / outliers** (label `-1` when DBSCAN is used) and
  * **Potentially ambiguous prompts** (only with K-Means): items that lie almost equally close to two centroids and might belong to multiple groups.

### plots/cluster_sizes.png

Bar chart showing how many prompts ended up in each cluster.

---

## 5. Troubleshooting

* **Rate-limits / quota errors** – lower the number of prompts per run or switch to an account with a higher quota.
* **Authentication errors** – make sure `OPENAI_API_KEY` is exported in the shell where you run the script.
* **Inadequate clusters** – try the other clustering method, adjust `--k-max`, or tune the DBSCAN parameters (the `eps` range is inferred; `min_samples` is exposed as `--dbscan-min-samples`).
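
The `--cache` behaviour from section 3 (reuse stored vectors, append new texts) also helps with the rate-limit issue above, because only unseen texts are sent to the API. The idea can be sketched roughly as follows. This is a minimal illustration, not the code in `cluster_prompts.py`: the function name `embed_with_cache`, the `embed_fn` callback standing in for the OpenAI call, and the JSON layout (prompt text mapped to its vector) are all assumptions.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List

Vector = List[float]


def embed_with_cache(
    texts: List[str],
    cache_path: Path,
    embed_fn: Callable[[List[str]], List[Vector]],
) -> List[Vector]:
    """Return one vector per text, calling embed_fn only for uncached texts.

    embed_fn is a hypothetical stand-in for the real embedding request: it
    takes a batch of texts and returns one vector per text, in order.
    """
    # Assumed cache layout: {"<prompt text>": [float, ...], ...}
    cache: Dict[str, Vector] = {}
    if cache_path.exists():
        cache = json.loads(cache_path.read_text(encoding="utf-8"))

    # Deduplicate while preserving order so each new text is embedded once.
    missing = [t for t in dict.fromkeys(texts) if t not in cache]
    if missing:
        for text, vector in zip(missing, embed_fn(missing)):
            cache[text] = list(vector)
        # Persist the enlarged cache so the next run can skip these texts.
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_text(json.dumps(cache), encoding="utf-8")

    # Vectors come back in the same order as the input, duplicates included.
    return [cache[t] for t in texts]
```

Passing the API call in as `embed_fn` keeps the caching logic independent of the client library, so it can be exercised with a dummy function before spending any quota.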