llmx/codex-cli/examples/prompt-analyzer/template/README.md
Ilan Bigio 59a180ddec Initial commit
Signed-off-by: Ilan Bigio <ilan@openai.com>
2025-04-16 12:56:08 -04:00

# Prompt-Clustering Utility
This repository contains a small utility (`cluster_prompts.py`) that embeds a
list of prompts with the OpenAI Embedding API, discovers natural groupings with
unsupervised clustering, lets ChatGPT name and describe each cluster, and
finally produces a concise Markdown report plus a couple of diagnostic plots.
The default input file (`prompts.csv`) ships with the repo so you can try the
script immediately, but you can of course point it at your own file.
---
## 1. Setup
1. Install the Python dependencies (preferably inside a virtual env):
```bash
pip install pandas numpy scikit-learn matplotlib openai
```
2. Export your OpenAI API key (**required**):
```bash
export OPENAI_API_KEY="sk-..."
```
---
## 2. Basic usage
```bash
# Minimal command: runs on prompts.csv and writes analysis.md + plots/
python cluster_prompts.py
```
This will:
* create embeddings with the `text-embedding-3-small` model,
* pick a suitable number *k* via silhouette score (KMeans),
* ask `gpt-4o-mini` to label & describe each cluster,
* store the results in `analysis.md`,
* and save two plots to `plots/` (`cluster_sizes.png` and `tsne.png`).
The script prints a short success message once done.
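
The silhouette-based choice of *k* can be illustrated with a minimal sketch. It is not taken from `cluster_prompts.py`: the helper name `pick_k`, the fixed random seed and the toy data are assumptions made for illustration, and only public scikit-learn calls are used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def pick_k(embeddings: np.ndarray, k_max: int = 10, seed: int = 0):
    """Return (best_k, labels) for the k in [2, k_max] with the highest silhouette score."""
    # The silhouette score is only defined for 2 <= k <= n_samples - 1.
    upper = min(k_max, len(embeddings) - 1)
    if upper < 2:
        raise ValueError("need at least 3 samples to compare cluster counts")

    best_k, best_score, best_labels = 0, -1.0, None
    for k in range(2, upper + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels


# Toy data (assumption): three well-separated 2-D blobs stand in for real embeddings.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

k, labels = pick_k(toy, k_max=6)
print(k)
```

In this sketch `k_max` plays the role of the `--k-max` flag: raising it widens the search but costs one extra KMeans fit per candidate.
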
---
## 3. Command-line options
| flag | default | description |
|------|---------|-------------|
| `--csv` | `prompts.csv` | path to the input CSV (must contain a `prompt` column; an `act` column is used as context if present) |
| `--cache` | _(none)_ | embedding cache path (JSON). Speeds up repeated runs; new texts are appended automatically. |
| `--cluster-method` | `kmeans` | `kmeans` (with automatic *k*) or `dbscan` |
| `--k-max` | `10` | upper bound for *k* when `kmeans` is selected |
| `--dbscan-min-samples` | `3` | min samples parameter for DBSCAN |
| `--embedding-model` | `text-embedding-3-small` | any OpenAI embedding model |
| `--chat-model` | `gpt-4o-mini` | chat model used to generate cluster names / descriptions |
| `--output-md` | `analysis.md` | where to write the Markdown report |
| `--plots-dir` | `plots` | directory for generated PNGs |
Example with customised options:
```bash
python cluster_prompts.py \
--csv my_prompts.csv \
--cache .cache/embeddings.json \
--cluster-method dbscan \
--embedding-model text-embedding-3-large \
--chat-model gpt-4o \
--output-md my_analysis.md \
--plots-dir my_plots
```
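
How a JSON embedding cache of the kind enabled by `--cache` might behave can be sketched as follows. The on-disk format used by the script is not documented here, so this assumes a flat JSON object mapping each text to its vector. `embed_with_cache` and `fake_embed` are hypothetical names; the latter replaces the real API call so the sketch runs offline.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def embed_with_cache(
    texts: list[str],
    cache_path: Path,
    embed_fn: Callable[[list[str]], list[list[float]]],
) -> list[list[float]]:
    """Return one vector per text, calling embed_fn only for texts not yet cached."""
    cache: dict[str, list[float]] = {}
    if cache_path.exists():
        cache = json.loads(cache_path.read_text(encoding="utf-8"))

    # dict.fromkeys de-duplicates the inputs while preserving their order.
    missing = [t for t in dict.fromkeys(texts) if t not in cache]
    if missing:
        for text, vector in zip(missing, embed_fn(missing)):
            cache[text] = vector
        # New texts are appended to the cache and written back to disk.
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_text(json.dumps(cache), encoding="utf-8")

    return [cache[t] for t in texts]


# Hypothetical stand-in for the embedding API; records which texts it was asked for.
calls: list[list[str]] = []


def fake_embed(batch: list[str]) -> list[list[float]]:
    calls.append(list(batch))
    return [[float(len(t)), 1.0] for t in batch]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embeddings.json"
    first = embed_with_cache(["a", "bb"], path, fake_embed)
    second = embed_with_cache(["bb", "ccc"], path, fake_embed)

print(calls)
```

On the second call only the text that is new to the cache reaches `fake_embed`, which is the effect that makes repeated runs cheaper.
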
---
## 4. Interpreting the output
### analysis.md
* Overview table: cluster label, generated name, member count and description.
* Detailed section for every cluster with five representative example prompts.
* Separate lists for
  * **Noise / outliers** (label `-1` when DBSCAN is used) and
  * **Potentially ambiguous prompts** (only with KMeans): these are items that
    lie almost equally close to two centroids and might belong to multiple
    groups.
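
One way to picture the "almost equally close to two centroids" criterion is a distance-ratio test. The sketch below is an assumption, not the script's actual rule: the function name `ambiguous_items` and the `ratio=0.9` threshold are chosen purely for illustration.

```python
import math


def ambiguous_items(points, centroids, ratio=0.9):
    """Return indices of points whose two nearest centroids are at nearly equal distance.

    A point is flagged when d_nearest / d_second_nearest >= ratio.
    The threshold is an illustrative assumption.
    """
    flagged = []
    for i, point in enumerate(points):
        distances = sorted(math.dist(point, c) for c in centroids)
        if len(distances) >= 2 and distances[1] > 0 and distances[0] / distances[1] >= ratio:
            flagged.append(i)
    return flagged


centroids = [(0.0, 0.0), (10.0, 0.0)]
points = [
    (1.0, 0.0),  # clearly belongs to the first centroid
    (5.1, 0.0),  # sits almost midway between both centroids
    (9.0, 1.0),  # clearly belongs to the second centroid
]
flagged = ambiguous_items(points, centroids)
print(flagged)
```

A stricter ratio (closer to 1.0) flags fewer prompts; a looser one treats more of the boundary region as ambiguous.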
### plots/cluster_sizes.png
Quick bar-chart visualisation of how many prompts ended up in each cluster.
---
## 5. Troubleshooting
* **Rate-limit / quota errors**: lower the number of prompts per run or switch
  to an account with a larger quota.
* **Authentication errors**: make sure `OPENAI_API_KEY` is exported in the
  shell where you run the script.
* **Inadequate clusters**: try the other clustering method, adjust `--k-max`
  or tune the DBSCAN parameters (the `eps` range is inferred; `min_samples` is
  exposed via `--dbscan-min-samples`).
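
For the rate-limit case, one simple way to lower the number of prompts per run is to cut a smaller CSV and pass it via `--csv`. The helper below is a sketch using only the standard library. `take_first_rows` is a hypothetical name, and the in-memory sample rows are made up for the demonstration.

```python
import csv
import io


def take_first_rows(src, dst, limit):
    """Copy the header and at most `limit` data rows from src to dst; return rows written."""
    reader = csv.DictReader(src)
    if reader.fieldnames is None or "prompt" not in reader.fieldnames:
        raise ValueError("input CSV must contain a 'prompt' column")

    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    written = 0
    for row in reader:
        if written >= limit:
            break
        writer.writerow(row)
        written += 1
    return written


# In-memory demo; with real files, open both with newline="" as the csv docs recommend.
source = io.StringIO(
    "act,prompt\n"
    "Poet,Write a haiku\n"
    "Chef,Suggest a recipe\n"
    "Guide,Plan a trip\n"
)
subset = io.StringIO()
count = take_first_rows(source, subset, limit=2)
print(subset.getvalue())
```

Writing the subset to a file such as `small_prompts.csv` (a placeholder name) and running `python cluster_prompts.py --csv small_prompts.csv` then processes only those rows.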