Building Intelligent Search with BigQuery Machine Learning: A Step-by-Step Guide (Part 1 – K-Means clustering)
This article guides you through creating an intelligent search system using BigQuery Machine Learning (BigQuery ML). This system leverages Large Language Models (LLMs) and clustering to categorize user search queries.
What you’ll learn:
- Setting up external connections for BigQuery ML services.
- Creating LLM and embedding models.
- Calculating embeddings for historical search data.
- Building a K-Means clustering model to group similar searches.
- Assigning user-friendly names to clusters using LLMs.
- Classifying new search queries based on the created clusters.
Prerequisites:
- A Google Cloud Platform (GCP) project with BigQuery enabled.
- Basic familiarity with BigQuery concepts like tables and queries.
1. External Connection Creation (Initial Setup)
BigQuery ML interacts with external services like LLMs. To enable this, we need to create an external connection:
CREATE OR REPLACE CONNECTION `us.llm-conn`
OPTIONS (
  type = 'CLOUD_AI_PLATFORM',
  project_id = 'your-project-id',
  location = 'us-central1'
);
Replace your-project-id with your actual GCP project ID.
2. LLM Model Creation (Initial Setup)
This model helps us assign human-readable names to clusters later:
CREATE OR REPLACE MODEL `project.dataset.llm_model`
REMOTE WITH CONNECTION `us.llm-conn`
OPTIONS (remote_service_type = 'CLOUD_AI_LARGE_LANGUAGE_MODEL_V1');
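To confirm the model is reachable, you can run a quick smoke test through ML.GENERATE_TEXT; the prompt below is arbitrary and only meant to verify that the remote connection works:
SELECT
  ml_generate_text_result
FROM
  ML.GENERATE_TEXT(
    MODEL `project.dataset.llm_model`,
    -- Any single-row query with a STRING column named prompt works here.
    (SELECT 'Suggest a two-word title for these searches: running shoes, trail runners' AS prompt),
    STRUCT(0.2 AS temperature, 30 AS max_output_tokens)
  );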
3. Embedding Model Creation (Initial Setup)
This model calculates numerical representations (“embeddings”) of search queries:
CREATE OR REPLACE MODEL `project.dataset.embedding_model`
REMOTE WITH CONNECTION `us.llm-conn`
OPTIONS (remote_service_type = 'CLOUD_AI_TEXT_EMBEDDING_MODEL_V1');
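As with the LLM model, a quick sanity check helps verify the embedding model works before running it over your full history; the sample text below is arbitrary:
SELECT
  content,
  ARRAY_LENGTH(text_embedding) AS embedding_dimensions
FROM
  ML.GENERATE_TEXT_EMBEDDING(
    MODEL `project.dataset.embedding_model`,
    -- Any query with a STRING column named content works here.
    (SELECT 'trail running shoes' AS content),
    STRUCT(TRUE AS flatten_json_output)
  );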
4. Calculate Embeddings for Historical Data
This step uses the embedding model to convert past search queries into numerical representations. Replace dataset with your actual dataset ID and searchQueryColumn with the column holding the search query text, adjusting the query to match your data structure:
CREATE OR REPLACE TABLE `project.dataset.data_embeddings` AS (
  SELECT *
  FROM ML.GENERATE_TEXT_EMBEDDING(
    MODEL `project.dataset.embedding_model`,
    -- a query statement whose result contains a STRING column named content.
    (SELECT searchQueryColumn AS content FROM `project.dataset.data_*` GROUP BY searchQueryColumn),
    STRUCT(TRUE AS flatten_json_output)
  )
);
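Before clustering, it is worth checking that every query actually received an embedding, since rows where generation failed would otherwise skew the model. A minimal check against the table created above:
SELECT
  COUNT(*) AS total_queries,
  -- Rows with no embedding (e.g. failed generation) should ideally be zero.
  COUNTIF(text_embedding IS NULL OR ARRAY_LENGTH(text_embedding) = 0) AS missing_embeddings,
  ANY_VALUE(ARRAY_LENGTH(text_embedding)) AS embedding_dimensions
FROM `project.dataset.data_embeddings`;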
5. Create a Clustering Model
This step groups similar search queries based on their embeddings:
CREATE OR REPLACE MODEL `project.dataset.data_clusters`
OPTIONS (
  model_type = 'KMEANS',
  kmeans_init_method = 'KMEANS++',
  num_clusters = HPARAM_RANGE(5, 20),
  hparam_tuning_algorithm = 'VIZIER_DEFAULT',
  num_trials = 10,
  hparam_tuning_objectives = ['DAVIES_BOULDIN_INDEX']
) AS (
  SELECT
    text_embedding
  FROM `project.dataset.data_embeddings`
);
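Because num_clusters is tuned as a hyperparameter, it can be useful to inspect which trials Vizier ran and which cluster count was selected. A minimal sketch, assuming the model name above:
-- One row per hyperparameter tuning trial; the is_optimal flag marks the
-- trial whose num_clusters value the final model uses.
SELECT *
FROM ML.TRIAL_INFO(MODEL `project.dataset.data_clusters`);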
6. Save Clusters and Assign Friendly Names
Here, we leverage the LLM model to generate descriptive names based on the queries grouped within each cluster. This approach comes with some challenges, however:
- LLM Model Results: The query might return uninformative names or struggle to differentiate clusters.
- Disruption of Historical Data: Assigning names based on current data might not reflect historical trends.
Here’s the SQL, ignoring these limitations for illustrative purposes:
WITH clusters AS (
  SELECT
    centroid_id,
    STRING_AGG(content, "\n") AS cluster_content
  FROM
    ML.PREDICT(
      MODEL `project.dataset.data_clusters`,
      (
        SELECT
          text_embedding,
          content
        FROM
          `project.dataset.data_embeddings`
      )
    )
  GROUP BY
    centroid_id
  ORDER BY
    centroid_id
)
SELECT
  ml_generate_text_result['predictions'][0]['content'] AS cluster_name,
  centroid_id,
  cluster_content
FROM
  ML.GENERATE_TEXT(
    MODEL `project.dataset.llm_model`,
    (
      SELECT
        CONCAT('Find a short and descriptive title that summarizes the provided group of search queries. The title should contain at least 1 word of the provided search query values and should not be more than 6 words or less than 2 words. If you cannot find a suitable title just return the word appearing most often, excluding special characters and numbers.\n\nSearch queries:\n', cluster_content) AS prompt,
        *
      FROM
        clusters
    ),
    STRUCT(
      0.5 AS temperature,
      300 AS max_output_tokens
    )
  );
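The query above only displays the generated names; in practice you will probably want to persist them so the classification step can join cluster IDs to readable labels. Below is a minimal sketch of that idea, where the cluster_names table name is an assumption (not part of the original setup) and JSON_VALUE is used to pull the generated title out as a plain string:
-- Sketch: store the generated names in a lookup table (cluster_names is a
-- placeholder name) for reuse in step 7.
CREATE OR REPLACE TABLE `project.dataset.cluster_names` AS
WITH clusters AS (
  SELECT
    centroid_id,
    STRING_AGG(content, "\n") AS cluster_content
  FROM ML.PREDICT(
    MODEL `project.dataset.data_clusters`,
    (SELECT text_embedding, content FROM `project.dataset.data_embeddings`)
  )
  GROUP BY centroid_id
)
SELECT
  centroid_id,
  -- Extract the generated title from the JSON response as a STRING.
  JSON_VALUE(ml_generate_text_result, '$.predictions[0].content') AS cluster_name
FROM ML.GENERATE_TEXT(
  MODEL `project.dataset.llm_model`,
  (
    SELECT
      -- Use the same full prompt as in the query above; shortened here for brevity.
      CONCAT('Find a short and descriptive title that summarizes the provided group of search queries.\n\nSearch queries:\n', cluster_content) AS prompt,
      centroid_id
    FROM clusters
  ),
  STRUCT(0.5 AS temperature, 300 AS max_output_tokens)
);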
7. Assign a Cluster to Each Search Query
This step assigns each new search query to the most relevant cluster. It also returns the distance to the nearest centroid, so you can apply a threshold and avoid assigning irrelevant queries:
SELECT
  content AS search_query,
  NEAREST_CENTROIDS_DISTANCE[OFFSET(0)].CENTROID_ID AS best_matching_cluster_id,
  ROUND(NEAREST_CENTROIDS_DISTANCE[OFFSET(0)].DISTANCE, 3) AS min_cluster_distance
FROM
  ML.PREDICT(
    MODEL `project.dataset.data_clusters`,
    (
      SELECT *
      FROM ML.GENERATE_TEXT_EMBEDDING(
        MODEL `project.dataset.embedding_model`,
        -- a query statement whose result contains a STRING column named content.
        (SELECT content FROM your_search_query_table),
        STRUCT(TRUE AS flatten_json_output)
      )
    )
  );
Replace your_search_query_table with the actual table containing your new search queries.
Note:
- The distance threshold that separates relevant from irrelevant queries needs to be determined based on your specific data and requirements; a hedged sketch of how such a filter could look follows below.
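As an illustration only, the sketch below filters assignments by an assumed cutoff of 0.5 (a placeholder value to tune on your own data, not one from this article) and joins in the hypothetical cluster_names table from the step 6 sketch:
WITH assignments AS (
  SELECT
    content AS search_query,
    NEAREST_CENTROIDS_DISTANCE[OFFSET(0)].CENTROID_ID AS best_matching_cluster_id,
    NEAREST_CENTROIDS_DISTANCE[OFFSET(0)].DISTANCE AS min_cluster_distance
  FROM ML.PREDICT(
    MODEL `project.dataset.data_clusters`,
    (
      SELECT *
      FROM ML.GENERATE_TEXT_EMBEDDING(
        MODEL `project.dataset.embedding_model`,
        (SELECT content FROM your_search_query_table),
        STRUCT(TRUE AS flatten_json_output)
      )
    )
  )
)
SELECT
  a.search_query,
  -- 0.5 is a placeholder cutoff; tune it on your own data.
  IF(a.min_cluster_distance <= 0.5, n.cluster_name, 'unclassified') AS assigned_cluster
FROM assignments AS a
LEFT JOIN `project.dataset.cluster_names` AS n
  ON a.best_matching_cluster_id = n.centroid_id;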
Summary
This BigQuery Machine Learning solution is a configurable and scalable option for building an intelligent search system, even if the results are only OK.
- Configurable: You can adjust the number of clusters (KMeans parameters) and distance thresholds based on your data and desired level of granularity.
- Scalable: The solution scales to handle increasing amounts of search queries by adding more data to the historical data table and potentially retraining the clustering model.
However, it’s important to note that the results might be imprecise due to limitations with the LLM model:
- LLM Model Shortcomings: The LLM model might struggle to generate descriptive cluster names or differentiate between similar clusters.
- Historical Data Dependence: Assigning cluster names based on current data may not reflect historical trends.
Therefore, the solution offers “OK” results, and you might need to implement additional techniques or human oversight to improve the accuracy of cluster names and overall search experience.
This concludes the basic steps for building an intelligent search system with BigQuery Machine Learning. By combining LLMs and clustering, you can categorize user search queries and improve your search experience. Remember to address the limitations mentioned in step 6 for a more robust system.