Building Intelligent Search with BigQuery Machine Learning: A Step-by-Step Guide (Part 1 – K-Means clustering)

This article guides you through creating an intelligent search system using BigQuery Machine Learning (BigQuery ML). This system leverages Large Language Models (LLMs) and clustering to categorize user search queries.

What you’ll learn:

  • Setting up external connections for BigQuery ML services.
  • Creating LLM and embedding models.
  • Calculating embeddings for historical search data.
  • Building a K-Means clustering model to group similar searches.
  • Assigning user-friendly names to clusters using LLMs.
  • Classifying new search queries based on the created clusters.

Prerequisites:

  • A Google Cloud Platform (GCP) project with BigQuery enabled.
  • Basic familiarity with BigQuery concepts like tables and queries.

1. External Connection Creation (Initial Setup)

BigQuery ML interacts with external services like LLMs. To enable this, we need to create an external connection:

CREATE OR REPLACE CONNECTION `us.llm-conn`
OPTIONS (
  type = 'CLOUD_AI_PLATFORM',
  project_id = 'your-project-id',
  location = 'us-central1'
);

Replace your-project-id with your actual GCP project ID. Note that the service account BigQuery creates for this connection also needs permission to call Vertex AI (the Vertex AI User role) before the remote models in the next steps will work.

2. LLM Model Creation (Initial Setup)

This model helps us assign human-readable names to clusters later:

CREATE OR REPLACE MODEL `project.dataset.llm_model`
 REMOTE WITH CONNECTION `us.llm-conn`
 OPTIONS (remote_service_type = 'CLOUD_AI_LARGE_LANGUAGE_MODEL_V1');
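
Before moving on, you can optionally run a quick smoke test to confirm the model responds. The prompt below is just an arbitrary example:

SELECT ml_generate_text_result
FROM ML.GENERATE_TEXT(
  MODEL `project.dataset.llm_model`,
  (SELECT 'Reply with the single word OK.' AS prompt),
  STRUCT(0.2 AS temperature, 10 AS max_output_tokens)
);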

3. Embedding Model Creation (Initial Setup)

This model calculates numerical representations (“embeddings”) of search queries:

CREATE OR REPLACE MODEL `project.dataset.embedding_model`
REMOTE WITH CONNECTION `us.llm-conn`
OPTIONS (remote_service_type = 'CLOUD_AI_TEXT_EMBEDDING_MODEL_V1');
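
A similar optional sanity check confirms that the embedding model returns a vector; the input string is arbitrary:

SELECT text_embedding
FROM ML.GENERATE_TEXT_EMBEDDING(
  MODEL `project.dataset.embedding_model`,
  (SELECT 'example search query' AS content),
  STRUCT(TRUE AS flatten_json_output)
);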

4. Calculate Embeddings for Historical Data

This step uses the embedding model to convert past search queries into numerical representations. Replace project and dataset with your own IDs, and adjust searchQueryColumn and the data_* table pattern to match your data structure:

CREATE OR REPLACE TABLE `project.dataset.data_embeddings` AS (
  SELECT *
  FROM ML.GENERATE_TEXT_EMBEDDING(
    MODEL `project.dataset.embedding_model`,
    -- a query statement whose result contains a STRING column named content.
    (SELECT searchQueryColumn AS content FROM `project.dataset.data_*` GROUP BY searchQueryColumn),
    STRUCT(TRUE AS flatten_json_output)
  )
);
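
Optionally, a quick check of the new table confirms that every query received an embedding and shows the vector dimensionality:

SELECT
  COUNT(*) AS embedded_queries,
  ANY_VALUE(ARRAY_LENGTH(text_embedding)) AS embedding_dimensions
FROM `project.dataset.data_embeddings`;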

5. Create a Clustering Model

This step groups similar search queries based on their embeddings:

CREATE OR REPLACE MODEL `project.dataset.data_clusters`
OPTIONS (
  model_type = 'KMEANS',
  kmeans_init_method = 'KMEANS++',
  num_clusters = HPARAM_RANGE(5, 20),
  hparam_tuning_algorithm = 'VIZIER_DEFAULT',
  num_trials = 10,
  hparam_tuning_objectives = ['DAVIES_BOULDIN_INDEX']
) AS (
  SELECT
    text_embedding
  FROM
    `project.dataset.data_embeddings`
);
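
Because the model is trained with hyperparameter tuning, you can inspect the individual trials to see which number of clusters Vizier selected (the trial flagged as optimal):

SELECT *
FROM ML.TRIAL_INFO(MODEL `project.dataset.data_clusters`)
ORDER BY trial_id;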

6. Save Clusters and Assign Friendly Names

Here, we leverage the LLM model to generate descriptive names based on the most common queries within each cluster. This approach comes with some challenges:

  • LLM Model Results: The query might return uninformative names or struggle to differentiate clusters.
  • Disruption of Historical Data: Assigning names based on current data might not reflect historical trends.

Here’s the SQL, setting these limitations aside for illustration:

WITH clusters AS (
  SELECT
    centroid_id,
    STRING_AGG(content, "\n") AS cluster_content
  FROM
    ML.PREDICT(
      MODEL `project.dataset.data_clusters`,
      (
        SELECT
          text_embedding,
          content
        FROM
          `project.dataset.data_embeddings`
      )
    )
  GROUP BY
    centroid_id
  ORDER BY
    centroid_id
)

SELECT
  ml_generate_text_result['predictions'][0]['content'] AS cluster_name,
  centroid_id,
  cluster_content
FROM
  ML.GENERATE_TEXT(
    MODEL `project.dataset.llm_model`,
    (
      SELECT
        CONCAT('Find a short and descriptive title that summarizes the provided group of search queries. The title should contain at least 1 word of the provided search query values and should not be more than 6 words or less than 2 words. If you cannot find a suitable title just return the word appearing most often, excluding special characters and numbers. \n\nSearch queries:\n', cluster_content) AS prompt,
        *
      FROM
        clusters
    ),
    STRUCT(
      0.5 AS temperature,
      300 AS max_output_tokens
    )
  );
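
As a side note, if you would rather not index into the JSON result, ML.GENERATE_TEXT can return the generated text as a plain string when flatten_json_output is enabled. A minimal sketch with a stand-in prompt (not the cluster prompt built above):

SELECT
  ml_generate_text_llm_result AS cluster_name
FROM
  ML.GENERATE_TEXT(
    MODEL `project.dataset.llm_model`,
    -- Stand-in prompt for illustration only.
    (SELECT 'Find a short title for these search queries: shoes, running shoes, trainers' AS prompt),
    STRUCT(0.5 AS temperature, 64 AS max_output_tokens, TRUE AS flatten_json_output)
  );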

7. Assign a Cluster to Each Search Query

This step assigns each new search query to its closest cluster. The query also returns the distance to that cluster’s centroid, which you can compare against a threshold to avoid assigning queries that don’t fit any cluster well (see the note and sketch further down):

SELECT
  content AS search_query,
  NEAREST_CENTROIDS_DISTANCE[offset(0)].CENTROID_ID AS best_matching_cluster_id,
  ROUND(NEAREST_CENTROIDS_DISTANCE[offset(0)].DISTANCE, 3) AS min_cluster_distance
FROM
  ML.PREDICT(
    MODEL `project.dataset.data_clusters`,
    (
      SELECT *
      FROM ML.GENERATE_TEXT_EMBEDDING(
        MODEL `project.dataset.embedding_model`,
        -- a query statement whose result contains a STRING column named content.
        (SELECT content FROM your_search_query_table),
        STRUCT(TRUE AS flatten_json_output)
      )
    )
  );

Replace your_search_query_table with the actual table containing your new search queries.

Note:

  • The distance threshold that separates relevant from irrelevant queries needs to be determined based on your specific data and requirements; a sketch of how such a filter could look follows below.
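
For illustration, here is a minimal sketch of such a filter. It wraps the query from step 7 and keeps only matches closer than an assumed threshold of 1.0, a placeholder value that you would tune on your own data:

SELECT *
FROM (
  SELECT
    content AS search_query,
    NEAREST_CENTROIDS_DISTANCE[OFFSET(0)].CENTROID_ID AS best_matching_cluster_id,
    ROUND(NEAREST_CENTROIDS_DISTANCE[OFFSET(0)].DISTANCE, 3) AS min_cluster_distance
  FROM
    ML.PREDICT(
      MODEL `project.dataset.data_clusters`,
      (
        SELECT *
        FROM ML.GENERATE_TEXT_EMBEDDING(
          MODEL `project.dataset.embedding_model`,
          (SELECT content FROM your_search_query_table),
          STRUCT(TRUE AS flatten_json_output)
        )
      )
    )
)
WHERE min_cluster_distance < 1.0;  -- assumed threshold; tune on your own data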

Summary

This BigQuery Machine Learning solution is a configurable and scalable option for building an intelligent search system, though the results are only “OK” out of the box.

  • Configurable: You can adjust the number of clusters (KMeans parameters) and distance thresholds based on your data and desired level of granularity.
  • Scalable: The solution scales to handle increasing amounts of search queries by adding more data to the historical data table and potentially retraining the clustering model.

However, it’s important to note that the results might be imprecise due to the following limitations:

  • LLM Model Shortcomings: The LLM model might struggle to generate descriptive cluster names or differentiate between similar clusters.
  • Historical Data Dependence: Assigning cluster names based on current data may not reflect historical trends.

Therefore, the solution offers “OK” results, and you might need to implement additional techniques or human oversight to improve the accuracy of cluster names and overall search experience.

This concludes the basic steps for building an intelligent search system with BigQuery Machine Learning. By combining LLMs and clustering, you can categorize user search queries and improve your search experience. Remember to address the limitations mentioned in step 6 for a more robust system.

Panagiotis

Written By

Panagiotis (pronounced Panayotis) is a passionate G(r)eek with experience in digital analytics projects and website implementation. A fan of clear and effective processes, task automation, and problem-solving technical hacks, he has hands-on experience with projects for companies ranging from small businesses to enterprise level, from the first conversation with the customer to turning business requirements into the final deliverable.