codex-cli/examples/prompt-analyzer/template/Clustering.ipynb

{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## K-means Clustering in Python using OpenAI\n",
    "\n",
    "We use a simple k-means algorithm to demonstrate how clustering can be done. Clustering can help discover valuable, hidden groupings within the data. The dataset is created in the [Get_embeddings_from_dataset Notebook](Get_embeddings_from_dataset.ipynb)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1000, 1536)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# imports\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from ast import literal_eval\n",
    "\n",
    "# load data\n",
    "datafile_path = \"./data/fine_food_reviews_with_embeddings_1k.csv\"\n",
    "\n",
    "df = pd.read_csv(datafile_path)\n",
    "df[\"embedding\"] = df.embedding.apply(literal_eval).apply(np.array)  # convert string to numpy array\n",
    "matrix = np.vstack(df.embedding.values)\n",
    "matrix.shape\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Find the clusters using K-means"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We show the simplest use of K-means. You can pick the number of clusters that fits your use case best."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Cluster\n",
       "0    4.105691\n",
       "1    4.191176\n",
       "2    4.215613\n",
       "3    4.306590\n",
       "Name: Score, dtype: float64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.cluster import KMeans\n",
    "\n",
    "n_clusters = 4\n",
    "\n",
    "# set n_init explicitly: its default changes from 10 to 'auto' in scikit-learn 1.4\n",
    "kmeans = KMeans(n_clusters=n_clusters, init=\"k-means++\", n_init=10, random_state=42)\n",
    "kmeans.fit(matrix)\n",
    "labels = kmeans.labels_\n",
    "df[\"Cluster\"] = labels\n",
    "\n",
    "df.groupby(\"Cluster\").Score.mean().sort_values()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.manifold import TSNE\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "tsne = TSNE(n_components=2, perplexity=15, random_state=42, init=\"random\", learning_rate=200)\n",
    "vis_dims2 = tsne.fit_transform(matrix)\n",
    "\n",
    "# split the 2D projection into x and y coordinates\n",
    "x = vis_dims2[:, 0]\n",
    "y = vis_dims2[:, 1]\n",
    "\n",
    "for category, color in enumerate([\"purple\", \"green\", \"red\", \"blue\"]):\n",
    "    mask = (df.Cluster == category).to_numpy()\n",
    "    xs = x[mask]\n",
    "    ys = y[mask]\n",
    "    plt.scatter(xs, ys, color=color, alpha=0.3)\n",
    "\n",
    "    # mark the cluster's mean position in the 2D projection\n",
    "    avg_x = xs.mean()\n",
    "    avg_y = ys.mean()\n",
    "\n",
    "    plt.scatter(avg_x, avg_y, marker=\"x\", color=color, s=100)\n",
    "plt.title(\"Clusters visualized in 2D using t-SNE\")\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The plot shows the clusters in a 2D projection. In this run, the green cluster (#1) seems quite different from the others. Let's look at a few samples from each cluster."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Text samples in the clusters & naming the clusters\n",
    "\n",
    "Let's show random samples from each cluster. We'll use gpt-4 to name each cluster, based on a random sample of 5 reviews from that cluster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from openai import OpenAI\n",
    "import os\n",
    "\n",
    "client = OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\", \"<your OpenAI API key if not set as env var>\"))\n",
    "\n",
    "# Number of reviews to sample from each cluster\n",
    "rev_per_cluster = 5\n",
    "\n",
    "for i in range(n_clusters):\n",
    "    print(f\"Cluster {i} Theme:\", end=\" \")\n",
    "\n",
    "    reviews = \"\\n\".join(\n",
    "        df[df.Cluster == i]\n",
    "        .combined.str.replace(\"Title: \", \"\")\n",
    "        .str.replace(\"\\n\\nContent: \", \":  \")\n",
    "        .sample(rev_per_cluster, random_state=42)\n",
    "        .values\n",
    "    )\n",
    "\n",
    "    messages = [\n",
    "        {\"role\": \"user\", \"content\": f'What do the following customer reviews have in common?\\n\\nCustomer reviews:\\n\"\"\"\\n{reviews}\\n\"\"\"\\n\\nTheme:'}\n",
    "    ]\n",
    "\n",
    "    response = client.chat.completions.create(\n",
    "        model=\"gpt-4\",\n",
    "        messages=messages,\n",
    "        temperature=0,\n",
    "        max_tokens=64,\n",
    "        top_p=1,\n",
    "        frequency_penalty=0,\n",
    "        presence_penalty=0)\n",
    "    print(response.choices[0].message.content.replace(\"\\n\", \"\"))\n",
    "\n",
    "    sample_cluster_rows = df[df.Cluster == i].sample(rev_per_cluster, random_state=42)\n",
    "    for j in range(rev_per_cluster):\n",
    "        print(sample_cluster_rows.Score.values[j], end=\", \")\n",
    "        print(sample_cluster_rows.Summary.values[j], end=\":   \")\n",
    "        print(sample_cluster_rows.Text.str[:70].values[j])\n",
    "\n",
    "    print(\"-\" * 100)\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's important to note that clusters will not necessarily match what you intend to use them for. A larger number of clusters will focus on more specific patterns, whereas a smaller number of clusters will usually focus on the largest discrepancies in the data."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "openai",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.3"
  },
  "vscode": {
   "interpreter": {
    "hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
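
The notebook's closing note says the right number of clusters depends on the use case. One common heuristic for comparing candidate values is the silhouette score. The sketch below is not part of the notebook: `silhouette_by_k` is a hypothetical helper, and the synthetic `demo_matrix` is an assumption that stands in for the embedding `matrix` so the example runs without the CSV file or an API key.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(matrix, candidate_ks, random_state=42):
    """Fit K-means once per candidate k and return {k: silhouette score}."""
    scores = {}
    for k in candidate_ks:
        kmeans = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=random_state)
        labels = kmeans.fit_predict(matrix)
        scores[k] = float(silhouette_score(matrix, labels))
    return scores


# Synthetic stand-in for the embeddings (assumption): 4 well-separated groups
# of 100 points each in 16 dimensions, built around fixed centers.
rng = np.random.default_rng(0)
centers = 10 * np.eye(4, 16)
demo_matrix = np.vstack([c + rng.normal(scale=0.5, size=(100, 16)) for c in centers])

scores = silhouette_by_k(demo_matrix, candidate_ks=range(2, 8))
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```

With the notebook's data, one would pass `matrix` in place of `demo_matrix`. A higher score suggests tighter, better separated clusters, but it does not guarantee that the clusters are useful for a given task, so it is best treated as one input alongside reading the sampled reviews.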
Initial commit Signed-off-by: Ilan Bigio <ilan@openai.com> 2025-04-16 12:56:08 -04:00