"We use a simple k-means algorithm to demonstrate how clustering can be done. Clustering can help discover valuable, hidden groupings within the data. The dataset is created in the [Get_embeddings_from_dataset Notebook](Get_embeddings_from_dataset.ipynb)."
"df[\"embedding\"] = df.embedding.apply(literal_eval).apply(np.array) # convert string to numpy array\n",
"matrix = np.vstack(df.embedding.values)\n",
"matrix.shape\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Find the clusters using K-means"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We show the simplest use of K-means. You can pick the number of clusters that fits your use case best."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/opt/homebrew/lib/python3.11/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n",
"plt.title(\"Clusters identified visualized in language 2d using t-SNE\")\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Visualization of clusters in a 2d projection. In this run, the green cluster (#1) seems quite different from the others. Let's see a few samples from each cluster."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Text samples in the clusters & naming the clusters\n",
"\n",
"Let's show random samples from each cluster. We'll use gpt-4 to name the clusters, based on a random sample of 5 reviews from that cluster."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"import os\n",
"\n",
"client = OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\", \"<your OpenAI API key if not set as env var>\"))\n",
"\n",
"# Reading a review which belong to each group.\n",
"rev_per_cluster = 5\n",
"\n",
"for i in range(n_clusters):\n",
" print(f\"Cluster {i} Theme:\", end=\" \")\n",
"\n",
" reviews = \"\\n\".join(\n",
" df[df.Cluster == i]\n",
" .combined.str.replace(\"Title: \", \"\")\n",
" .str.replace(\"\\n\\nContent: \", \": \")\n",
" .sample(rev_per_cluster, random_state=42)\n",
" .values\n",
" )\n",
"\n",
" messages = [\n",
" {\"role\": \"user\", \"content\": f'What do the following customer reviews have in common?\\n\\nCustomer reviews:\\n\"\"\"\\n{reviews}\\n\"\"\"\\n\\nTheme:'}\n",
"It's important to note that clusters will not necessarily match what you intend to use them for. A larger amount of clusters will focus on more specific patterns, whereas a small number of clusters will usually focus on largest discrepancies in the data."