Building Intelligent Search with BigQuery Machine Learning: A Step-by-Step Guide (Part 2 – LLM clustering)

This article explores building a clustering solution for search queries using BigQuery Machine Learning (BigQuery ML) and Large Language Models (LLMs). It is an alternative to the K-means approach from the previous post, using an LLM instead. While this approach is highly configurable, it comes with trade-offs, including slower processing times and results that are not always perfect.

What you’ll learn:

  • How to leverage LLMs for text-based clustering within BigQuery.
  • The benefits of configurability in this solution.
  • Understanding the limitations of LLM-based clustering.

Benefits:

  • Configurability: You can adjust the LLM prompt and other parameters to tailor the clustering to your specific needs.

Trade-offs:

  • Processing Speed: This solution relies on LLM calls within BigQuery queries, which can be slower than traditional clustering methods.
  • Result Accuracy: LLM-generated cluster names might not always be perfectly accurate or descriptive.

The Approach:

This solution utilizes SQL queries to achieve text-based clustering. Here’s a breakdown of the key steps with included code samples:

1. Preprocessing Search Queries:

The SQL snippet below preprocesses the search queries: it aggregates every query from a specific daily table (data_YYYYMMDD) into a single newline-separated string (concatenated_queries) and keeps each query as a separate row (individual_query).

WITH all_queriesYYYYMMDD AS (
  SELECT
    (
      -- Aggregate every query from the daily table into a single
      -- newline-separated, lowercased string with double quotes removed.
      SELECT
        STRING_AGG(LOWER(REGEXP_REPLACE(searchQueryColumn, r'"', '')), '\n')
      FROM
        `project.dataset.data_YYYYMMDD`
    ) AS concatenated_queries,
    searchQueryColumn AS individual_query
  FROM
    `project.dataset.data_YYYYMMDD`
  LIMIT 10  -- keep a small sample while testing; remove to cluster every query
)
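
One practical caveat (an assumption on my part, not covered in the original query): for large tables, concatenated_queries can exceed the model’s input token limit. A minimal variant that deduplicates and caps the list sent in the prompt (500 is an arbitrary cutoff):

WITH all_queriesYYYYMMDD AS (
  SELECT
    (
      -- Deduplicate and cap the aggregated list to keep the prompt small.
      SELECT
        STRING_AGG(DISTINCT LOWER(REGEXP_REPLACE(searchQueryColumn, r'"', '')), '\n' LIMIT 500)
      FROM
        `project.dataset.data_YYYYMMDD`
    ) AS concatenated_queries,
    searchQueryColumn AS individual_query
  FROM
    `project.dataset.data_YYYYMMDD`
  LIMIT 10
)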

2. LLM-based Clustering:

This step uses the ML.GENERATE_TEXT function to call an LLM through the remote model test_temp_dataset.llm_model (a sketch of creating that model follows the query). The prompt given to the LLM essentially asks: “Given this list of search queries, which cluster would this individual query be assigned to?” The LLM then returns a cluster name, or the original query itself if it doesn’t fit well within any existing cluster.

-- Continues from the all_queriesYYYYMMDD CTE defined in step 1.
SELECT
  -- Strip double quotes, newlines and backslashes from the raw LLM output.
  REGEXP_REPLACE(TRIM(TO_JSON_STRING(ml_generate_text_result['predictions'][0]['content'])), r'["\n\x5C]+', '') AS cluster_name_clean2,
  REGEXP_REPLACE(individual_query, r'"', '') AS search_query
  -- For debugging you can also expose ml_generate_text_result['predictions'][0]['safetyAttributes']
  -- or the full ml_generate_text_result['predictions'] array.
FROM
  ML.GENERATE_TEXT(
    MODEL `test_temp_dataset.llm_model`,
    (
      SELECT
        CONCAT('I have the following list of search queries (Separated by line breaks). If I split them into clusters which cluster would the search query "', individual_query, '" be assigned to? Return only the name of the cluster. If the search query is not similar enough to other queries return the search query as the cluster name. Do not return cluster names inside doublequotes. \n\nList of queries:\n', concatenated_queries) AS prompt,
        *
      FROM all_queriesYYYYMMDD
    ),
    STRUCT(
      0.2 AS temperature,
      100 AS max_output_tokens
    )
  );
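
The query above assumes the remote model test_temp_dataset.llm_model already exists. A minimal sketch of creating it, assuming you already have a Cloud resource connection (the connection name and endpoint below are placeholders):

-- Hypothetical setup: create the remote model once before running the clustering query.
-- `project.us.my_connection` is a placeholder for an existing Cloud resource connection,
-- and the endpoint should be a text model whose response matches predictions[0].content.
CREATE OR REPLACE MODEL `test_temp_dataset.llm_model`
  REMOTE WITH CONNECTION `project.us.my_connection`
  OPTIONS (ENDPOINT = 'text-bison');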

3. Cleaning and Result Formatting:

The final step cleans up the LLM-generated cluster names by removing quotation marks and extra characters using regular expressions, and pairs each cleaned cluster name with the original search query for further analysis.
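
As a standalone illustration of the cleaning logic (the literal string here is just a made-up example of raw LLM output):

-- Minimal illustration of the cleaning regex used above.
SELECT
  REGEXP_REPLACE(TRIM('"running shoes\n"'), r'["\n\x5C]+', '') AS cluster_name_clean2;
-- Returns: running shoes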

In essence, this solution leverages the power of LLMs to perform text-based clustering of search queries. While it offers configuration options, it’s crucial to remember the potential for slower processing speeds and less-than-perfect accuracy.

Additional Considerations:

  • This example focuses on a single daily table (data_YYYYMMDD). To cover a broader range of historical data, you can query a wildcard table instead. Here’s an adjustment that processes every table matching data_* (a sketch for restricting the date range follows the code):
WITH all_queries AS (
  SELECT
    (
      -- Aggregate queries from every matching daily table into one string.
      SELECT
        STRING_AGG(LOWER(REGEXP_REPLACE(searchQueryColumn, r'"', '')), '\n')
      FROM
        `project.dataset.data_*`
    ) AS concatenated_queries,
    searchQueryColumn AS individual_query
  FROM
    `project.dataset.data_*`
)
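If you only need part of the history, the wildcard scan can be filtered on _TABLE_SUFFIX (the dates below are placeholder values):

-- Restrict the wildcard scan to a specific date window (placeholder dates).
SELECT
  LOWER(REGEXP_REPLACE(searchQueryColumn, r'"', '')) AS query
FROM
  `project.dataset.data_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20240101' AND '20240131';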
  • Implementing a distance threshold alongside the LLM-generated names could help ensure a minimum level of similarity within clusters. Note, however, that ML.GENERATE_TEXT does not accept a distance-threshold argument in its options STRUCT, so the snippet below only illustrates the idea; a more concrete way to enforce it is sketched after the note below:
ML.GENERATE_TEXT( MODEL `test_temp_dataset.llm_model`,
  (
  SELECT
    CONCAT('...', individual_query, '...'), -- Replace ... with the prompt text from the previous query
    ...  -- other prompt elements from the previous query
  ),
  STRUCT(
    0.2 AS temperature,
    100 AS max_output_tokens,
    0.7 AS distance_threshold -- illustrative only: not a supported ML.GENERATE_TEXT option
  ))

Remember: The effectiveness of this distance threshold depends on the specific use case and the characteristics of your search queries.
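
One concrete way to approximate such a threshold (an assumption on my part, not part of the original pipeline) is to embed both the search query and its assigned cluster name with ML.GENERATE_EMBEDDING and keep only pairs whose cosine distance is small enough. The embedding model test_temp_dataset.embedding_model, the clustered_queries table holding the step-2 output, and the 0.3 cutoff are all hypothetical placeholders:

-- Hypothetical sketch: keep only query/cluster pairs whose embeddings are close enough.
-- Assumes a remote embedding model `test_temp_dataset.embedding_model` and a table
-- `test_temp_dataset.clustered_queries` with the step-2 output (search_query, cluster_name_clean2).
WITH query_emb AS (
  SELECT content AS search_query, ml_generate_embedding_result AS emb
  FROM ML.GENERATE_EMBEDDING(
    MODEL `test_temp_dataset.embedding_model`,
    (SELECT search_query AS content FROM `test_temp_dataset.clustered_queries`),
    STRUCT(TRUE AS flatten_json_output))
),
cluster_emb AS (
  SELECT content AS cluster_name, ml_generate_embedding_result AS emb
  FROM ML.GENERATE_EMBEDDING(
    MODEL `test_temp_dataset.embedding_model`,
    (SELECT DISTINCT cluster_name_clean2 AS content FROM `test_temp_dataset.clustered_queries`),
    STRUCT(TRUE AS flatten_json_output))
)
SELECT c.search_query, c.cluster_name_clean2
FROM `test_temp_dataset.clustered_queries` AS c
JOIN query_emb AS q ON q.search_query = c.search_query
JOIN cluster_emb AS cl ON cl.cluster_name = c.cluster_name_clean2
WHERE ML.DISTANCE(q.emb, cl.emb, 'COSINE') <= 0.3;  -- hypothetical cutoff, tune per use case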

By understanding the trade-offs and implementing additional techniques like distance thresholds, this LLM-based clustering solution can be a valuable tool for organizing and analyzing your search queries within BigQuery.

Summary

This LLM-based clustering solution offers configurability, but its limitations mean slower processing times and results that are merely “OK.”

Pros:

  • Configurable: You can adjust the LLM prompt and other parameters to tailor the clustering to your specific needs.

Cons:

  • Slow Processing: LLM calls within BigQuery queries can be slow compared to traditional clustering methods.
  • Limited Accuracy: LLM-generated cluster names might not always be perfectly accurate or descriptive.

Overall:

This solution can be a useful tool for organizing search queries, but be aware of the trade-offs. It may be suitable for smaller datasets or exploratory analyses, but for larger datasets or mission-critical applications, alternative clustering approaches might be more efficient and produce more reliable results.

Panagiotis

Written By

Panagiotis (pronounced Panayotis) is a passionate G(r)eek with experience in digital analytics projects and website implementation. A fan of clear and effective processes, task automation and problem-solving technical hacks, he has hands-on experience with projects ranging from small to enterprise-level companies, from the initial communication with customers to the transformation of business requirements into the final deliverable.