{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## K-means Clustering in Python using OpenAI\n", "\n", "We use a simple k-means algorithm to demonstrate how clustering can be done. Clustering can help discover valuable, hidden groupings within the data. The dataset is created in the [Get_embeddings_from_dataset Notebook](Get_embeddings_from_dataset.ipynb)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1000, 1536)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# imports\n", "import numpy as np\n", "import pandas as pd\n", "from ast import literal_eval\n", "\n", "# load data\n", "datafile_path = \"./data/fine_food_reviews_with_embeddings_1k.csv\"\n", "\n", "df = pd.read_csv(datafile_path)\n", "df[\"embedding\"] = df.embedding.apply(literal_eval).apply(np.array) # convert string to numpy array\n", "matrix = np.vstack(df.embedding.values)\n", "matrix.shape\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Find the clusters using K-means" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We show the simplest use of K-means. You can pick the number of clusters that fits your use case best." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/homebrew/lib/python3.11/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. 
Set the value of `n_init` explicitly to suppress the warning\n", " warnings.warn(\n" ] }, { "data": { "text/plain": [ "Cluster\n", "0 4.105691\n", "1 4.191176\n", "2 4.215613\n", "3 4.306590\n", "Name: Score, dtype: float64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.cluster import KMeans\n", "\n", "n_clusters = 4\n", "\n", "kmeans = KMeans(n_clusters=n_clusters, init=\"k-means++\", random_state=42)\n", "kmeans.fit(matrix)\n", "labels = kmeans.labels_\n", "df[\"Cluster\"] = labels\n", "\n", "df.groupby(\"Cluster\").Score.mean().sort_values()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.manifold import TSNE\n", "import matplotlib.pyplot as plt\n", "\n", "# project the embeddings down to 2 dimensions for plotting\n", "tsne = TSNE(n_components=2, perplexity=15, random_state=42, init=\"random\", learning_rate=200)\n", "vis_dims2 = tsne.fit_transform(matrix)\n", "\n", "x = [x for x, y in vis_dims2]\n", "y = [y for x, y in vis_dims2]\n", "\n", "# one color per cluster; the \"x\" marker shows each cluster's mean position\n", "for category, color in enumerate([\"purple\", \"green\", \"red\", \"blue\"]):\n", "    xs = np.array(x)[df.Cluster == category]\n", "    ys = np.array(y)[df.Cluster == category]\n", "    plt.scatter(xs, ys, color=color, alpha=0.3)\n", "\n", "    avg_x = xs.mean()\n", "    avg_y = ys.mean()\n", "\n", "    plt.scatter(avg_x, avg_y, marker=\"x\", color=color, s=100)\n", "plt.title(\"Clusters identified, visualized in 2D using t-SNE\")\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Visualization of clusters in a 2D projection. In this run, the green cluster (#1) seems quite different from the others. Let's see a few samples from each cluster." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Text samples in the clusters & naming the clusters\n", "\n", "Let's show random samples from each cluster. 
We'll use gpt-4 to name each cluster, based on a random sample of 5 reviews from that cluster." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from openai import OpenAI\n", "import os\n", "\n", "client = OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\", \"\"))\n", "\n", "# For each cluster: sample a few reviews, ask the model for a common theme, then print the samples.\n", "rev_per_cluster = 5\n", "\n", "for i in range(n_clusters):\n", "    print(f\"Cluster {i} Theme:\", end=\" \")\n", "\n", "    reviews = \"\\n\".join(\n", "        df[df.Cluster == i]\n", "        .combined.str.replace(\"Title: \", \"\")\n", "        .str.replace(\"\\n\\nContent: \", \": \")\n", "        .sample(rev_per_cluster, random_state=42)\n", "        .values\n", "    )\n", "\n", "    messages = [\n", "        {\"role\": \"user\", \"content\": f'What do the following customer reviews have in common?\\n\\nCustomer reviews:\\n\"\"\"\\n{reviews}\\n\"\"\"\\n\\nTheme:'}\n", "    ]\n", "\n", "    response = client.chat.completions.create(\n", "        model=\"gpt-4\",\n", "        messages=messages,\n", "        temperature=0,\n", "        max_tokens=64,\n", "        top_p=1,\n", "        frequency_penalty=0,\n", "        presence_penalty=0)\n", "    print(response.choices[0].message.content.replace(\"\\n\", \"\"))\n", "\n", "    # same random_state as above, so these are the reviews that were sent to the model\n", "    sample_cluster_rows = df[df.Cluster == i].sample(rev_per_cluster, random_state=42)\n", "    for j in range(rev_per_cluster):\n", "        print(sample_cluster_rows.Score.values[j], end=\", \")\n", "        print(sample_cluster_rows.Summary.values[j], end=\": \")\n", "        print(sample_cluster_rows.Text.str[:70].values[j])\n", "\n", "    print(\"-\" * 100)\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Note that clusters will not necessarily match what you intend to use them for. A larger number of clusters will capture more specific patterns, whereas a small number of clusters will usually capture the largest discrepancies in the data."
] } ], "metadata": { "kernelspec": { "display_name": "openai", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.3" }, "vscode": { "interpreter": { "hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97" } } }, "nbformat": 4, "nbformat_minor": 2 }