
Prompt Clustering Utility

This repository contains a small utility (cluster_prompts.py) that embeds a list of prompts with the OpenAI Embeddings API, discovers natural groupings with unsupervised clustering, asks a chat model to name and describe each cluster, and finally produces a concise Markdown report plus a couple of diagnostic plots.

The default input file (prompts.csv) ships with the repo so you can try the script immediately, but you can of course point it at your own file.
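
For reference, the CSV only needs a prompt column; an act column, if present, is used as extra context. A couple of made-up example rows:

act,prompt
Linux terminal,"I want you to act as a Linux terminal and reply only with terminal output."
Travel guide,"I want you to act as a travel guide and suggest places to visit near the location I give you."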


1. Setup

1. Install the Python dependencies (preferably inside a virtual env):

pip install pandas numpy scikit-learn matplotlib openai

2. Export your OpenAI API key (required):

export OPENAI_API_KEY="sk..."

2. Basic usage

# Minimal command: runs on prompts.csv and writes analysis.md + plots/
python cluster_prompts.py

This will:

  • create embeddings with the text-embedding-3-small model, 
  • pick a suitable number k via silhouette score (KMeans),
  • ask gpt-4o-mini to label & describe each cluster,
  • store the results in analysis.md,
  • and save two plots to plots/ (cluster_sizes.png and tsne.png).

The script prints a short success message once done.
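
Under the hood, the steps above boil down to roughly the following sketch (simplified and illustrative; the helper names are ours, not the script's):

# Rough sketch of the embed -> pick-k -> cluster steps (illustrative, not the script's exact code).
import numpy as np
from openai import OpenAI
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts, model="text-embedding-3-small"):
    # One API call per batch; each item in resp.data carries an .embedding vector.
    resp = client.embeddings.create(model=model, input=texts)
    return np.array([d.embedding for d in resp.data])

def cluster_with_auto_k(vectors, k_max=10):
    # Try k = 2..k_max and keep the k with the best silhouette score.
    best_k, best_score, best_labels = None, -1.0, None
    for k in range(2, min(k_max, len(vectors) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
        score = silhouette_score(vectors, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels

prompts = ["Act as a Linux terminal...", "Act as a travel guide...", "Translate English to French..."]
vectors = embed(prompts)
k, labels = cluster_with_auto_k(vectors)
print(k, labels)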


3. Command-line options

flag | default | description
--csv | prompts.csv | path to the input CSV (must contain a prompt column; an act column is used as context if present)
--cache | (none) | embedding cache path (JSON); speeds up repeated runs, new texts are appended automatically
--cluster-method | kmeans | kmeans (with automatic k) or dbscan
--k-max | 10 | upper bound for k when kmeans is selected
--dbscan-min-samples | 3 | min_samples parameter for DBSCAN
--embedding-model | text-embedding-3-small | any OpenAI embedding model
--chat-model | gpt-4o-mini | chat model used to generate cluster names / descriptions
--output-md | analysis.md | where to write the Markdown report
--plots-dir | plots | directory for generated PNGs

Example with customised options:

python cluster_prompts.py \
  --csv my_prompts.csv \
  --cache .cache/embeddings.json \
  --cluster-method dbscan \
  --embedding-model text-embedding-3-large \
  --chat-model gpt-4o \
  --output-md my_analysis.md \
  --plots-dir my_plots
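
The --cache option keeps a JSON file of previously embedded texts so repeated runs only pay for new prompts. Conceptually it behaves like the get-or-embed helper below (the actual on-disk format used by cluster_prompts.py may differ; the text-to-vector mapping here is an assumption):

import json, os
from openai import OpenAI

client = OpenAI()

def cached_embeddings(texts, cache_path, model="text-embedding-3-small"):
    # Load any previously cached vectors (assumed format: {text: [floats]}).
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    # Embed only the texts we have not seen before, then append them to the cache.
    missing = [t for t in texts if t not in cache]
    if missing:
        resp = client.embeddings.create(model=model, input=missing)
        for text, item in zip(missing, resp.data):
            cache[text] = item.embedding
        with open(cache_path, "w") as f:
            json.dump(cache, f)
    return [cache[t] for t in texts]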

4. Interpreting the output

analysis.md

  • Overview table: cluster label, generated name, member count and description.
  • Detailed section for every cluster with five representative example prompts.
  • Separate lists for
    • Noise / outliers (label -1 when DBSCAN is used), and
    • Potentially ambiguous prompts (only with KMeans): items that lie almost equally close to two centroids and might therefore belong to more than one group (a possible way to detect these is sketched below).
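
One way to flag such ambiguous prompts, sketched here rather than taken from the script itself, is to compare each point's distance to its nearest and second-nearest centroid:

import numpy as np
from sklearn.cluster import KMeans

def find_ambiguous(vectors, kmeans: KMeans, threshold=1.1):
    # kmeans.transform() returns the distance from every point to every centroid.
    distances = np.sort(kmeans.transform(vectors), axis=1)
    nearest, second = distances[:, 0], distances[:, 1]
    # A point is "ambiguous" when its second-best centroid is almost as close as its best one.
    return np.where(second / np.maximum(nearest, 1e-12) < threshold)[0]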

plots/cluster_sizes.png

Quick bar-chart visualisation of how many prompts ended up in each cluster.
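
plots/tsne.png

2-D t-SNE projection of the prompt embeddings; a quick way to eyeball how well separated the clusters are.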


5. Troubleshooting

  • Rate-limit / quota errors: lower the number of prompts per run or switch to an account with a larger quota.
  • Authentication errors: make sure OPENAI_API_KEY is exported in the shell where you run the script.
  • Inadequate clusters: try the other clustering method, adjust --k-max, or tune the DBSCAN parameters (the eps range is inferred automatically; min_samples is exposed via the CLI).
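
If rate limits keep biting, a generic exponential-backoff wrapper around the embedding calls usually helps; the snippet below is a standalone pattern, not something the script necessarily does for you:

import time
from openai import OpenAI, RateLimitError

client = OpenAI()

def embed_with_retries(texts, model="text-embedding-3-small", max_retries=5):
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return client.embeddings.create(model=model, input=texts)
        except RateLimitError:
            # Back off exponentially before retrying.
            time.sleep(delay)
            delay *= 2
    raise RuntimeError("still rate-limited after %d retries" % max_retries)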