# Prompt-Clustering Utility
This repository contains a small utility (`cluster_prompts.py`) that embeds a
list of prompts with the OpenAI Embeddings API, discovers natural groupings with
unsupervised clustering, lets a chat model name and describe each cluster, and
finally produces a concise Markdown report plus two diagnostic plots.

The default input file (`prompts.csv`) ships with the repo so you can try the
script immediately, or you can point it at your own file.

---
## 1. Setup
1. Install the Python dependencies (preferably inside a virtual env):
```bash
pip install pandas numpy scikit-learn matplotlib openai
```
2. Export your OpenAI API key (**required**):
```bash
export OPENAI_API_KEY="sk-..."
```
---
## 2. Basic usage
```bash
# Minimal command: runs on prompts.csv and writes analysis.md + plots/
python cluster_prompts.py
```
This will

* create embeddings with the `text-embedding-3-small` model,
* pick a suitable number of clusters *k* via silhouette score (KMeans),
* ask `gpt-4o-mini` to label & describe each cluster,
* store the results in `analysis.md`,
* and save two plots to `plots/` (`cluster_sizes.png` and `tsne.png`).

The script prints a short success message once done.

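
The silhouette-based choice of *k* can be sketched roughly as follows. This is a minimal illustration with scikit-learn, not the code from `cluster_prompts.py`: the helper name `pick_k`, the `n_init` and `seed` values, and the synthetic 2-D points standing in for real embeddings are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def pick_k(embeddings: np.ndarray, k_max: int = 10, seed: int = 0) -> int:
    """Return the k in [2, k_max] with the highest silhouette score."""
    # The silhouette score is only defined for 2 <= k <= n_samples - 1.
    upper = min(k_max, len(embeddings) - 1)
    best_k, best_score = 2, -1.0
    for k in range(2, upper + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k


# Synthetic stand-in for real embeddings: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

chosen_k = pick_k(points, k_max=6)
print(f"selected k = {chosen_k}")
```
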
---
## 3. Command-line options
| flag | default | description |
|------|---------|-------------|
| `--csv` | `prompts.csv` | path to the input CSV (must contain a `prompt` column; an `act` column is used as context if present) |
| `--cache` | _(none)_ | embedding cache path (JSON). Speeds up repeated runs; new texts are appended automatically. |
| `--cluster-method` | `kmeans` | `kmeans` (with automatic *k*) or `dbscan` |
| `--k-max` | `10` | upper bound for *k* when `kmeans` is selected |
| `--dbscan-min-samples` | `3` | `min_samples` parameter for DBSCAN |
| `--embedding-model` | `text-embedding-3-small` | any OpenAI embedding model |
| `--chat-model` | `gpt-4o-mini` | chat model used to generate cluster names / descriptions |
| `--output-md` | `analysis.md` | where to write the Markdown report |
| `--plots-dir` | `plots` | directory for generated PNGs |

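
To make the `--csv` requirements concrete, the sketch below parses CSV text with the standard library, insists on a `prompt` column, and prepends the optional `act` value as context. The helper name `load_prompts` and the `act: prompt` join format are assumptions for illustration; the script may combine the two columns differently.

```python
import csv
import io


def load_prompts(csv_text: str) -> list[str]:
    """Return one text per row, prefixing the optional `act` value as context."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if "prompt" not in (reader.fieldnames or []):
        raise ValueError("input CSV must contain a 'prompt' column")
    texts = []
    for row in reader:
        act = (row.get("act") or "").strip()
        prompt = (row["prompt"] or "").strip()
        # Assumption: context is joined as "act: prompt"; the real script may differ.
        texts.append(f"{act}: {prompt}" if act else prompt)
    return texts


# Two sample rows: one with an `act` value, one without.
sample = (
    "act,prompt\n"
    'Linux Terminal,"Act as a terminal, reply with output only."\n'
    ",Summarise this text.\n"
)
texts = load_prompts(sample)
print(texts)
```
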
Example with customised options:
```bash
python cluster_prompts.py \
--csv my_prompts.csv \
--cache .cache/embeddings.json \
--cluster-method dbscan \
--embedding-model text-embedding-3-large \
--chat-model gpt-4o \
--output-md my_analysis.md \
--plots-dir my_plots
```
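
The idea behind `--cache` (reuse stored vectors, embed only texts not seen before, append them to the JSON file) might look roughly like this. The text-to-vector JSON layout, the helper `embed_with_cache`, and the stand-in `fake_embed` function are assumptions; the actual cache format used by the script is not documented here.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def embed_with_cache(
    texts: list[str],
    cache_path: Path,
    embed_fn: Callable[[list[str]], list[list[float]]],
) -> list[list[float]]:
    """Return one vector per text, calling embed_fn only for uncached texts."""
    cache: dict[str, list[float]] = {}
    if cache_path.exists():
        cache = json.loads(cache_path.read_text(encoding="utf-8"))

    # Deduplicate while preserving order so each new text is embedded once.
    missing = [t for t in dict.fromkeys(texts) if t not in cache]
    if missing:
        for text, vector in zip(missing, embed_fn(missing)):
            cache[text] = vector
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_text(json.dumps(cache), encoding="utf-8")

    return [cache[t] for t in texts]


# Hypothetical stand-in for the real API call; it records each batch it receives.
calls: list[list[str]] = []


def fake_embed(batch: list[str]) -> list[list[float]]:
    calls.append(list(batch))
    return [[float(len(t)), float(t.count(" "))] for t in batch]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embeddings.json"
    first = embed_with_cache(["write a poem", "fix my code"], path, fake_embed)
    # Second run: only "plan a trip" is new, so only it reaches fake_embed.
    second = embed_with_cache(["fix my code", "plan a trip"], path, fake_embed)

print(calls)
```
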
---
## 4. Interpreting the output
### analysis.md
* Overview table: cluster label, generated name, member count and description.
* Detailed section for every cluster with five representative example prompts.
* Separate lists for
  * **Noise / outliers** (label `-1` when DBSCAN is used) and
  * **Potentially ambiguous prompts** (only with KMeans): items that lie
    almost equally close to two centroids and might belong to multiple
    groups.
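
One way to picture the two special lists: a point is "ambiguous" when its distance to the nearest centroid is almost as large as its distance to the second-nearest one, and DBSCAN noise is whatever carries the label `-1`. The sketch below uses NumPy only; the helper name `ambiguous_indices` and the `0.9` ratio threshold are assumptions, and the script's own criterion may differ.

```python
import numpy as np


def ambiguous_indices(
    embeddings: np.ndarray, centroids: np.ndarray, ratio: float = 0.9
) -> list[int]:
    """Indices whose nearest centroid is not clearly closer than the runner-up."""
    # Pairwise Euclidean distances, shape (n_points, n_centroids).
    dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=2)
    nearest_two = np.sort(dists, axis=1)[:, :2]
    # A ratio close to 1 means the two best centroids are almost equally near.
    closeness = nearest_two[:, 0] / np.maximum(nearest_two[:, 1], 1e-12)
    return np.flatnonzero(closeness >= ratio).tolist()


centroids = np.array([[0.0, 0.0], [10.0, 0.0]])
points = np.array([[0.5, 0.0], [9.0, 1.0], [5.1, 0.0]])
flagged = ambiguous_indices(points, centroids)
print(flagged)  # only the point near the midpoint is flagged

# DBSCAN marks noise with the label -1 (scikit-learn convention).
labels = np.array([0, 0, -1, 1, -1])
noise = np.flatnonzero(labels == -1).tolist()
print(noise)
```
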
### plots/cluster_sizes.png
Quick bar-chart visualisation of how many prompts ended up in each cluster.

---
## 5. Troubleshooting
* **Rate limits / quota errors**: lower the number of prompts per run or switch
  to an account with a larger quota.
* **Authentication errors**: make sure `OPENAI_API_KEY` is exported in the
  shell where you run the script.
* **Inadequate clusters**: try the other clustering method, adjust `--k-max`,
  or tune the DBSCAN parameters (the `eps` range is inferred; `min_samples` is
  exposed via the CLI).
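
To get a feel for what `--dbscan-min-samples` changes, the following sketch runs scikit-learn's `DBSCAN` on a tiny hand-made dataset and counts clusters and noise points. The helper `summarize_dbscan`, the fixed `eps=0.5`, and the toy data are assumptions for illustration; the script infers its own `eps` range.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def summarize_dbscan(points: np.ndarray, eps: float, min_samples: int) -> tuple[int, int]:
    """Return (number of clusters, number of noise points) for one DBSCAN run."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    n_clusters = len(set(labels.tolist()) - {-1})  # -1 is the noise label
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise


# Two tight groups of five points each, plus one isolated outlier.
blob = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05]])
data = np.vstack([blob, blob + 5.0, [[20.0, 20.0]]])

for min_samples in (3, 6):
    clusters, noise_count = summarize_dbscan(data, eps=0.5, min_samples=min_samples)
    print(f"min_samples={min_samples}: {clusters} clusters, {noise_count} noise points")
```

In this toy data, raising `min_samples` above the group size turns every point into noise, which is the pattern to look for when a real run reports few clusters and a long outlier list.
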