Initial commit

Signed-off-by: Ilan Bigio <ilan@openai.com>
2025-04-16 12:56:08 -04:00
commit 59a180ddec
163 changed files with 30587 additions and 0 deletions
--- a/codex-cli/examples/prompt-analyzer/template/README.md
+++ b/codex-cli/examples/prompt-analyzer/template/README.md
@@ -0,0 +1,103 @@
+# Prompt‑Clustering Utility
+
+This repository contains a small utility (`cluster_prompts.py`) that embeds a
+list of prompts with the OpenAI Embeddings API, discovers natural groupings with
+unsupervised clustering, asks an OpenAI chat model to name and describe each
+cluster, and finally writes a concise Markdown report plus two diagnostic plots.
+
+The default input file (`prompts.csv`) ships with the repo so you can try the
+script immediately; pass `--csv` to point it at your own file.
+
+---
+
+## 1. Setup
+
+1. Install the Python dependencies (preferably inside a virtual environment):
+
+```bash
+pip install pandas numpy scikit-learn matplotlib openai
+```
+
+2. Export your OpenAI API key (**required**):
+
+```bash
+export OPENAI_API_KEY="sk-..."
+```
+
+---
+
+## 2. Basic usage
+
+```bash
+# Minimal command – runs on prompts.csv and writes analysis.md + plots/
+python cluster_prompts.py
+```
+
+This will
+
+* create embeddings with the `text-embedding-3-small` model,
+* pick a suitable number *k* via silhouette score (K‑Means),
+* ask `gpt-4o-mini` to label & describe each cluster,
+* store the results in `analysis.md`,
+* and save two plots to `plots/` (`cluster_sizes.png` and `tsne.png`).
+
+The script prints a short success message once done.
+
+---
+
+## 3. Command‑line options
+
+| flag | default | description |
+|------|---------|-------------|
+| `--csv` | `prompts.csv` | path to the input CSV (must contain a `prompt` column; an `act` column is used as context if present) |
+| `--cache` | _(none)_ | embedding cache path (JSON). Speeds up repeated runs – new texts are appended automatically. |
+| `--cluster-method` | `kmeans` | `kmeans` (with automatic *k*) or `dbscan` |
+| `--k-max` | `10` | upper bound for *k* when `kmeans` is selected |
+| `--dbscan-min-samples` | `3` | min samples parameter for DBSCAN |
+| `--embedding-model` | `text-embedding-3-small` | any OpenAI embedding model |
+| `--chat-model` | `gpt-4o-mini` | chat model used to generate cluster names / descriptions |
+| `--output-md` | `analysis.md` | where to write the Markdown report |
+| `--plots-dir` | `plots` | directory for generated PNGs |
+
+Example with customised options:
+
+```bash
+python cluster_prompts.py \
+  --csv my_prompts.csv \
+  --cache .cache/embeddings.json \
+  --cluster-method dbscan \
+  --embedding-model text-embedding-3-large \
+  --chat-model gpt-4o \
+  --output-md my_analysis.md \
+  --plots-dir my_plots
+```
+
+---
+
+## 4. Interpreting the output
+
+### analysis.md
+
+* Overview table: cluster label, generated name, member count and description.
+* Detailed section for every cluster with five representative example prompts.
+* Separate lists for
+  * **Noise / outliers** (label `-1` when DBSCAN is used) and
+  * **Potentially ambiguous prompts** (only with K‑Means) – these are items that
+    lie almost equally close to two centroids and might belong to multiple
+    groups.
+
+### plots/cluster_sizes.png
+
+Quick bar‑chart visualisation of how many prompts ended up in each cluster.
+
+---
+
+## 5. Troubleshooting
+
+* **Rate‑limits / quota errors** – reduce the number of prompts per run, or use
+  an account with a higher quota.
+* **Authentication errors** – make sure `OPENAI_API_KEY` is exported in the
+  shell where you run the script.
+* **Inadequate clusters** – try the other clustering method, adjust `--k-max`,
+  or tune the DBSCAN parameters (the `eps` range is inferred; `min_samples` is
+  set with `--dbscan-min-samples`).
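
Note on the "Potentially ambiguous prompts" list described in section 4 of the README above: the README does not say how `cluster_prompts.py` decides that a prompt lies "almost equally close to two centroids". One plausible check, sketched here under assumptions, compares each point's distance to its nearest centroid with its distance to the second-nearest one. The function name `find_ambiguous` and the `ratio_threshold` value of 0.9 are hypothetical and are not taken from the script.

```python
import math


def find_ambiguous(points, centroids, ratio_threshold=0.9):
    """Return indices of points that are nearly equidistant to two centroids.

    A point is flagged when d_nearest / d_second_nearest >= ratio_threshold.
    The threshold is an illustrative assumption, not the script's value.
    """
    if len(centroids) < 2:
        return []  # ambiguity needs at least two clusters
    flagged = []
    for i, p in enumerate(points):
        # Two smallest Euclidean distances from this point to any centroid.
        d1, d2 = sorted(math.dist(p, c) for c in centroids)[:2]
        # d2 == 0 means the point coincides with two identical centroids.
        if d2 == 0 or d1 / d2 >= ratio_threshold:
            flagged.append(i)
    return flagged


# Toy 2-D data standing in for embedding vectors.
centroids = [(0.0, 0.0), (10.0, 0.0)]
points = [(1.0, 0.0), (5.2, 0.0), (9.0, 1.0)]
ambiguous = find_ambiguous(points, centroids)
print(ambiguous)
```

Only the middle point is flagged in this toy data, because it sits close to the midpoint between the two centroids. A ratio test like this is scale-free, so the same threshold could in principle be reused across embedding models with different vector norms.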