Added search function to the blog

A simple search module based on Meilisearch that supports keyword and semantic search

The new blog launched without a search feature, although one had always been planned. Meilisearch was chosen for its multilingual tokenization, AI integration, and content recommendation. An Edge Function keeps the Meilisearch index in sync with the database, enabling both keyword and semantic search.

#search functionality
#PGroonga
#vector search
#meilisearch
#multilingual
#indexing
#semantic search


The new blog did not have a search feature when it was launched, but it was always part of the plan. This week, I spent two days completing it, which was much faster than I expected.

PGroonga

Initially, I planned to build search on PGroonga, a PostgreSQL extension that can index and retrieve non-Latin languages. There were many case studies and detailed documentation for this approach, and it was easy to implement.

My blog has three categories of content, and I wanted to search all three categories at once.

So I created a materialized view that combined the searchable fields of articles, photography, and thoughts, and set a trigger to refresh the view whenever one of the underlying tables was updated.

Finally, I created an index for this combined view to enable search functionality.

Implementing it on the front end was also not difficult.

Initial tests showed that Chinese and Japanese could be searched, but only for exact matches.

For example, if there is content with "People's Republic of China," searching for "China" would yield no results. You must search for "People" or "Republic" for it to work. This is obviously not a user-friendly search.

PGroonga can search for multiple terms, so splitting the search terms could meet this need.

However, a bigger problem arose: how to split the terms?

Many libraries can split text into terms, but would I need different libraries for Chinese and Japanese? Adding language detection on top of that would be too cumbersome.

Although I had completed the functionality, I firmly abandoned this approach.

Vector Search

Vector search converts text into vectors, a process called embedding. Then, by comparing their distances in multidimensional space, we can determine the semantic similarity between two pieces of text.

For example, suppose one text is scored "Life: 0.1, Technology: 0.5, Travel: 0.1" and another "Life: 0.1, Technology: 0.7, Travel: 0.2". These two texts are likely close in content, and both are related to technology.

This is a simplified explanation: a real embedding is a single vector with hundreds or thousands of dimensions, and the dimensions do not have readable labels.
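To make "comparing distances" concrete, here is a minimal sketch that uses cosine similarity, one common way to compare embeddings. The first two vectors are the toy scores from the example above; the third, travel-heavy vector is a made-up value added for contrast.

```typescript
// Cosine similarity between two equal-length vectors: values close to 1 mean
// the vectors point in nearly the same direction, values near 0 mean they
// have little in common.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same number of dimensions");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy three-dimensional "embeddings" in the order [life, technology, travel].
const textA = [0.1, 0.5, 0.1];
const textB = [0.1, 0.7, 0.2];
// Hypothetical third text that is mostly about travel.
const textC = [0.2, 0.1, 0.8];

const simAB = cosineSimilarity(textA, textB);
const simAC = cosineSimilarity(textA, textC);
console.log(simAB.toFixed(3), simAC.toFixed(3));
```

The two technology-heavy texts score about 0.995, while the travel-heavy text scores only about 0.35 against the first one.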

The advantage of vector search is that it is not limited to specific keyword matching but performs searches based on semantics. For instance, in a recipe database, when you search for "Dapanji," you might also want to see other Xinjiang recipes or other chicken-based dishes. Vector search can achieve this.

Although the embedding process requires AI (this site uses OpenAI's text-embedding-ada-002 model), which entails costs, the same vector data can also be used for content recommendations, a feature I plan to add in the future.

With a Supabase Edge Function, it is easy to generate and store vector data automatically.
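As an illustration of the generation step, the sketch below requests an embedding from OpenAI's embeddings endpoint with the text-embedding-ada-002 model mentioned above. The function names `buildEmbeddingRequest` and `embedText` are hypothetical, the API key is passed in as a parameter, and storing the resulting vector is left out; this is not the code that runs on this site.

```typescript
// The part of the OpenAI embeddings response that this sketch reads.
interface EmbeddingResponse {
  data: { embedding: number[] }[];
}

// Build the HTTP request for OpenAI's embeddings endpoint (hypothetical helper).
function buildEmbeddingRequest(text: string, apiKey: string) {
  return {
    url: "https://api.openai.com/v1/embeddings",
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Authorization": `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ model: "text-embedding-ada-002", input: text }),
    },
  };
}

// Send the request and return the vector for the single input text.
async function embedText(text: string, apiKey: string): Promise<number[]> {
  const { url, init } = buildEmbeddingRequest(text, apiKey);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`Embedding request failed: ${response.status}`);
  }
  const json = (await response.json()) as EmbeddingResponse;
  return json.data[0].embedding;
}
```

Inside an Edge Function, `embedText` would be called with the row's text whenever the row changes, and the returned array written back to the database.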

While vector search can search the meaning of the content, its accuracy is sometimes inferior to keyword searches. For example, I may specifically want "Dapanji" information, but vector search might return a bunch of unrelated chicken recipes.

The best solution is to use keyword search as the primary method and semantic search as a supplement. This is when Meilisearch came back into focus.

Meilisearch

I had actually considered Meilisearch from the beginning but set it aside to keep things simple. After some exploration, I found that Meilisearch is the best fit:

1. It handles multilingual word segmentation and indexing out of the box, so you only need to add data without worrying about the implementation details;

2. It can use OpenAI’s API to generate vector data for semantic search;

3. It can search for similar content for content recommendation.

After setting the OpenAI key, the embedding process is automatic.
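For reference, setting the key means registering an embedder in the index settings. The sketch below shows roughly what that request could look like. The embedder name "default" matches the search queries later in this post, but the `documentTemplate` value and the helper names are assumptions, and depending on the Meilisearch version, vector search may first need to be enabled as an experimental feature.

```typescript
// Settings payload that registers an OpenAI embedder named "default".
// The documentTemplate is an assumed example; it decides which fields of
// each document are turned into the text that gets embedded.
function buildEmbedderSettings(openAiKey: string) {
  return {
    embedders: {
      default: {
        source: "openAi",
        apiKey: openAiKey,
        model: "text-embedding-ada-002",
        documentTemplate: "{{doc.title}} {{doc.abstract}}",
      },
    },
  };
}

// Apply the settings to one index (hypothetical helper; all arguments are
// supplied by the caller).
async function configureEmbedder(
  meiliUrl: string,
  meiliKey: string,
  indexUid: string,
  openAiKey: string,
): Promise<unknown> {
  const response = await fetch(`${meiliUrl}/indexes/${indexUid}/settings`, {
    method: "PATCH",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${meiliKey}`,
    },
    body: JSON.stringify(buildEmbedderSettings(openAiKey)),
  });
  if (!response.ok) {
    throw new Error(`Updating settings failed: ${response.status}`);
  }
  // Settings updates are asynchronous; the response describes the queued task.
  return response.json();
}
```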

Current Solution

I wrote an Edge Function that is triggered by INSERT, UPDATE, or DELETE operations on the corresponding tables. For the first two, it sends the new data to the Meilisearch server; for the last, it deletes the corresponding document. Drafts are never indexed, and a published entry that is turned back into a draft is removed from the index.

typescript
import { serve } from "https://deno.land/[email protected]/http/server.ts";

const MEILI_URL = Deno.env.get('MEILI_URL');
const MEILI_KEY = Deno.env.get('MEILI_KEY');

interface Record {
  id: string;
  lang?: string;
  slug?: string;
  title?: string;
  subtitle?: string;
  abstract?: string;
  content_text?: string;
  topic?: string;
  is_draft?: boolean;
}

interface Payload {
  type: 'INSERT' | 'UPDATE' | 'DELETE';
  table: string;
  schema: string;
  record?: Record;
  old_record?: Record;
}

async function handleMeilisearch(payload: Payload) {
  const { type, table, record, old_record } = payload;

  let url = `${MEILI_URL}/indexes/${table}/documents`;
  let method = 'POST';
  let body;

  // Remove the document when the row is deleted or a published row becomes a draft.
  if (type === 'DELETE' || (type === 'UPDATE' && !old_record.is_draft && record.is_draft)) {
    url = `${url}/${old_record.id}`;
    method = 'DELETE';
  } else if (type === 'INSERT' || type === 'UPDATE') {
    if (record.is_draft) {
      console.log(`Skipping indexing: the ${table} record is a draft`);
      return { skipped: true, reason: 'Draft' };
    }

    const fields = {
      article: ['id', 'lang', 'slug', 'title', 'subtitle', 'abstract', 'content_text', 'topic'],
      photo: ['id', 'slug', 'lang', 'title', 'abstract', 'content_text', 'topic'],
      thought: ['id', 'slug', 'content_text', 'topic']
    };

    body = JSON.stringify([
      fields[table as keyof typeof fields].reduce((obj, field) => {
        if (record[field as keyof Record] !== undefined) {
          obj[field] = record[field as keyof Record];
        }
        return obj;
      }, {} as Record)
    ]);

    // PUT adds or partially updates a document; POST adds or fully replaces it.
    method = type === 'UPDATE' ? 'PUT' : 'POST';
  }

  const response = await fetch(url, {
    method,
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${MEILI_KEY}`
    },
    body
  });

  if (!response.ok) {
    throw new Error(`Meilisearch operation failed: ${response.statusText}`);
  }

  return response.json();
}

serve(async (req) => {
  try {
    const payload: Payload = await req.json();
    const result = await handleMeilisearch(payload);
    return new Response(JSON.stringify(result), {
      headers: { 'Content-Type': 'application/json' }
    });
  } catch (error) {
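    console.error('Error while handling the request:', error);
    // A caught value is not guaranteed to be an Error, so narrow it first.
    const message = error instanceof Error ? error.message : String(error);
    return new Response(JSON.stringify({ error: message }), {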
      status: 500,
      headers: { 'Content-Type': 'application/json' }
    });
  }
});

Once all the data has been sent to Meilisearch, stored as documents, and indexed, you can run searches.

You can control how the search works: which fields are cropped and highlighted, how much weight semantic (vector) search gets, and so on. Send the following as the body of a request to the Meilisearch /multi-search endpoint; refer to the Meilisearch documentation for the individual search parameters:

typescript
queries: [
  {
    indexUid: "article",
    q: query,
    limit: 10,
    attributesToCrop: ["abstract", "content_text"],
    cropLength: 24,
    cropMarker: "...",
    attributesToHighlight: ["title", "abstract", "content_text", "topic"],
    highlightPreTag: "<span class=\"text-violet-600\">",
    highlightPostTag: "</span>",
    showRankingScore: true,
    hybrid: {
      embedder: "default",
      semanticRatio: 0.4
    }
  },
  {
    indexUid: "photo",
    q: query,
    limit: 15,
    attributesToCrop: ["abstract", "content_text"],
    cropLength: 24,
    cropMarker: "...",
    attributesToHighlight: ["title", "abstract", "content_text", "topic"],
    highlightPreTag: "<span class=\"text-violet-600\">",
    highlightPostTag: "</span>",
    showRankingScore: true,
    hybrid: {
      embedder: "default",
      semanticRatio: 0.5
    }
  },
  {
    indexUid: "thought",
    q: query,
    limit: 5,
    attributesToCrop: ["content_text"],
    cropLength: 24,
    cropMarker: "...",
    attributesToHighlight: ["content_text", "topic"],
    highlightPreTag: "<span class=\"text-violet-600\">",
    highlightPostTag: "</span>",
    showRankingScore: true,
    hybrid: {
      embedder: "default",
      semanticRatio: 0.5
    }
  }
]
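As a rough sketch of how such a body could be sent, the snippet below wraps a list of queries in the `{ queries: [...] }` object that /multi-search expects and posts it with `fetch`. The helper names, the `SearchQuery` type, and the idea of passing the server URL and a search-only API key as arguments are assumptions for illustration.

```typescript
// One entry of the multi-search request; extra search options are allowed.
interface SearchQuery {
  indexUid: string;
  q: string;
  limit?: number;
  [option: string]: unknown;
}

// Build the POST request for the /multi-search endpoint (hypothetical helper).
function buildMultiSearchRequest(meiliUrl: string, searchKey: string, queries: SearchQuery[]) {
  return {
    url: `${meiliUrl}/multi-search`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Authorization": `Bearer ${searchKey}`,
      },
      body: JSON.stringify({ queries }),
    },
  };
}

// Send the request and return the parsed response.
async function multiSearch(meiliUrl: string, searchKey: string, queries: SearchQuery[]): Promise<unknown> {
  const { url, init } = buildMultiSearchRequest(meiliUrl, searchKey, queries);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`Multi-search failed: ${response.status}`);
  }
  return response.json();
}
```

The response contains one entry per query, in the same order as the `queries` array.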

Finally, you can filter through the results on the front end and sort them by ranking score. This way, you have a simple search engine that supports keyword and fuzzy search and can search multiple types of content.
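The filtering and sorting step could look something like the sketch below. It assumes the response shape of /multi-search (a `results` array with one `hits` list per index) and relies on `showRankingScore: true`, which adds `_rankingScore` to each hit. The function name and the `minScore` threshold are illustrative, not the values used on this site.

```typescript
// Minimal shapes for the parts of the multi-search response used here.
interface Hit {
  _rankingScore?: number;
  [field: string]: unknown;
}

interface IndexResult {
  indexUid: string;
  hits: Hit[];
}

interface MultiSearchResponse {
  results: IndexResult[];
}

// Flatten the hits of all indexes into one list, remember which index each
// hit came from, drop weak matches, and sort the rest by ranking score.
function mergeByRankingScore(response: MultiSearchResponse, minScore = 0) {
  return response.results
    .flatMap((result) =>
      result.hits.map((hit) => ({ ...hit, indexUid: result.indexUid }))
    )
    .filter((hit) => (hit._rankingScore ?? 0) >= minScore)
    .sort((a, b) => (b._rankingScore ?? 0) - (a._rankingScore ?? 0));
}

// Made-up sample response with hits from two indexes.
const sample: MultiSearchResponse = {
  results: [
    { indexUid: "article", hits: [{ id: "a1", _rankingScore: 0.62 }, { id: "a2", _rankingScore: 0.15 }] },
    { indexUid: "thought", hits: [{ id: "t1", _rankingScore: 0.91 }] },
  ],
};

const merged = mergeByRankingScore(sample, 0.3);
console.log(merged.map((hit) => `${hit.indexUid}:${hit.id}`));
```

With the sample data, the thought `t1` comes first, the article `a1` second, and `a2` is dropped because it falls below the threshold.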

You can click the search icon in the navigation bar to experience the actual effect.
