# Prompt-Clustering Utility

This repository contains a small utility (`cluster_prompts.py`) that embeds a
list of prompts with the OpenAI Embeddings API, discovers natural groupings with
unsupervised clustering, lets ChatGPT name and describe each cluster, and finally
produces a concise Markdown report plus two diagnostic plots.

The default input file (`prompts.csv`) ships with the repo so you can try the
script immediately, but you can also point it at your own file.

---

## 1. Setup

1. Install the Python dependencies (preferably inside a virtual environment):

   ```bash
   pip install pandas numpy scikit-learn matplotlib openai
   ```

2. Export your OpenAI API key (**required**):

   ```bash
   export OPENAI_API_KEY="sk-..."
   ```

---

## 2. Basic usage

```bash
# Minimal command: runs on prompts.csv and writes analysis.md + plots/
python cluster_prompts.py
```

This will

* create embeddings with the `text-embedding-3-small` model,
* pick a suitable number of clusters *k* via silhouette score (K-Means),
* ask `gpt-4o-mini` to label and describe each cluster,
* store the results in `analysis.md`,
* and save two plots to `plots/` (`cluster_sizes.png` and `tsne.png`).

The script prints a short success message once done.
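
To make the *k* selection step above more concrete, here is a minimal, dependency-free sketch of choosing *k* by silhouette score. It is an illustration only: the script itself presumably relies on scikit-learn (which is in the dependency list), and `init_centroids`, `kmeans`, `silhouette` and `pick_k` are hypothetical names, not functions from `cluster_prompts.py`. Deterministic farthest-point seeding is used here in place of random initialisation so the result is reproducible.

```python
import math


def init_centroids(points, k):
    """Deterministic farthest-point seeding (a stand-in for k-means++)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster emptied out
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return 0.0  # undefined for a single cluster; treated as 0 in this sketch
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom else 0.0
    return total / len(points)


def pick_k(points, k_max=10):
    """Try k = 2..k_max and keep the k with the highest mean silhouette."""
    best_k, best_score = None, -1.0
    for k in range(2, min(k_max, len(points) - 1) + 1):
        score = silhouette(points, kmeans(points, k))
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


# Three well-separated 2-D blobs standing in for real embedding vectors.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (0, 10), (1, 10), (0, 11)]
best_k, best_score = pick_k(points, k_max=5)
print(best_k)  # 3 for this toy data
```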

---

## 3. Command-line options

| flag | default | description |
|------|---------|-------------|
| `--csv` | `prompts.csv` | path to the input CSV (must contain a `prompt` column; an `act` column is used as context if present) |
| `--cache` | _(none)_ | embedding cache path (JSON). Speeds up repeated runs; new texts are appended automatically. |
| `--cluster-method` | `kmeans` | `kmeans` (with automatic *k*) or `dbscan` |
| `--k-max` | `10` | upper bound for *k* when `kmeans` is selected |
| `--dbscan-min-samples` | `3` | min samples parameter for DBSCAN |
| `--embedding-model` | `text-embedding-3-small` | any OpenAI embedding model |
| `--chat-model` | `gpt-4o-mini` | chat model used to generate cluster names / descriptions |
| `--output-md` | `analysis.md` | where to write the Markdown report |
| `--plots-dir` | `plots` | directory for generated PNGs |
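
The `--cache` behaviour in the table can be pictured roughly as follows. This is a sketch under assumptions, not the script's actual code: it assumes the cache is a JSON object mapping each text to its vector, and `embed_with_cache` and `embed_fn` are hypothetical names (`embed_fn` stands in for whatever function calls the embeddings API on a batch of texts).

```python
import json
from pathlib import Path


def embed_with_cache(texts, embed_fn, cache_path):
    """Return one vector per text, calling embed_fn only for uncached texts."""
    path = Path(cache_path)
    # Assumed cache layout: {"some prompt text": [0.1, 0.2, ...], ...}
    cache = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}

    # De-duplicate while keeping order, then keep only texts not seen before.
    missing = [t for t in dict.fromkeys(texts) if t not in cache]
    if missing:
        for text, vector in zip(missing, embed_fn(missing)):
            cache[text] = list(vector)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(cache), encoding="utf-8")

    return [cache[t] for t in texts]
```

With a layout like this, a second run over the same CSV makes no embedding calls at all, and a run over an extended CSV only pays for the new rows.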

Example with customised options:

```bash
python cluster_prompts.py \
  --csv my_prompts.csv \
  --cache .cache/embeddings.json \
  --cluster-method dbscan \
  --embedding-model text-embedding-3-large \
  --chat-model gpt-4o \
  --output-md my_analysis.md \
  --plots-dir my_plots
```
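
When pointing `--csv` at your own file, the column requirements from the table (`prompt` mandatory, `act` optional) can be checked up front. The sketch below is illustrative: `load_prompts` is a hypothetical helper, and prefixing the prompt with the `act` value is only one plausible reading of "used as context"; the script may combine the two columns differently.

```python
import csv
import io


def load_prompts(csv_file):
    """Read rows from an open CSV file and return the texts to embed."""
    reader = csv.DictReader(csv_file)
    if "prompt" not in (reader.fieldnames or []):
        raise ValueError("input CSV must contain a 'prompt' column")

    texts = []
    for row in reader:
        prompt = (row["prompt"] or "").strip()
        if not prompt:
            continue  # skip empty rows
        act = (row.get("act") or "").strip()
        # Assumption: the optional 'act' value is prepended as context.
        texts.append(f"{act}: {prompt}" if act else prompt)
    return texts


sample = io.StringIO('act,prompt\nLinux Terminal,"Act as a terminal"\n,Plain prompt\n')
print(load_prompts(sample))
```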

---

## 4. Interpreting the output

### analysis.md

* Overview table: cluster label, generated name, member count and description.
* Detailed section for every cluster with five representative example prompts.
* Separate lists for
  * **Noise / outliers** (label `-1` when DBSCAN is used) and
  * **Potentially ambiguous prompts** (only with K-Means) – these are items that
    lie almost equally close to two centroids and might belong to multiple
    groups.
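
One way to picture the "almost equally close to two centroids" criterion is a ratio test between the nearest and second-nearest centroid distance. The sketch below is an assumption about how such a check could look, not the script's actual rule: `ambiguous_indices` is a hypothetical name and the `ratio=0.9` threshold is made up for illustration.

```python
import math


def ambiguous_indices(points, centroids, ratio=0.9):
    """Indices of points whose two nearest centroids are nearly equidistant."""
    flagged = []
    for i, p in enumerate(points):
        # Two smallest distances; requires at least two centroids.
        d1, d2 = sorted(math.dist(p, c) for c in centroids)[:2]
        if d2 > 0 and d1 / d2 >= ratio:
            flagged.append(i)
    return flagged


centroids = [(0, 0), (10, 0)]
points = [(1, 0), (5, 0.1), (4.9, 0), (9, 0)]
print(ambiguous_indices(points, centroids))  # [1, 2]
```

A ratio close to 1 means the point sits near the boundary between two clusters; lowering `ratio` flags more prompts.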

### plots/cluster_sizes.png

Bar chart showing how many prompts ended up in each cluster.

---

## 5. Troubleshooting

* **Rate-limits / quota errors** – lower the number of prompts per run or switch
  to an account with a higher quota.
* **Authentication errors** – make sure `OPENAI_API_KEY` is exported in the
  shell where you run the script.
* **Poor clusters** – try the other clustering method, adjust `--k-max`,
  or tune the DBSCAN parameters (the `eps` range is inferred; `min_samples` is
  exposed as `--dbscan-min-samples`).
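
For the authentication case, a small pre-flight check run from the same shell can confirm whether the variable is actually visible to Python. `api_key_problem` is a hypothetical helper for illustration; it only checks that the variable is set and non-empty, not that the key is valid.

```python
import os


def api_key_problem(env=os.environ):
    """Return a short problem description, or None if the key is present."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        return "OPENAI_API_KEY is not set (or is empty) in this shell"
    return None


problem = api_key_problem()
print(problem or "OPENAI_API_KEY is visible to Python")
```

If this reports a problem even though you exported the key, the export most likely happened in a different shell session or virtual environment.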