Searching Articles with Embeddings


Posted by ar851060 on 2024-06-29

Overview

Since Instagram's built-in search is terrible, one of my friends wanted a feature that can search his articles so that his customers can find the articles most relevant to their questions.

Basically, I built this search function on top of embedding search. I also want to run a Thompson Sampling experiment inside it to find the most suitable embedding model from OpenAI.

Example for Searching Articles

The whole structure of this feature is shown below. There are three parts in this project: Search, Record, and Upload. This article focuses on the Search and Upload functions; if you just want to build your own embedding-based article search, reading this article is enough. However, if you want to know how I built an experiment on top of this search, or if you are interested in Record, please read my next article (in progress...).

Structure of Searching Article Feature

Search

Below is the structure of the Search function. I will explain what I do in each step.

Structure of Searching function

Assign an Embedding Model from OpenAI

Two models deciding which is the best

When a user types a question into this function, the first thing is to assign an embedding model to it. Models are assigned using Beta distributions: each embedding model has its own Beta distribution, and when a new question comes in, each distribution is sampled and the model with the largest sampled probability is assigned.
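A minimal sketch of this assignment step, assuming each model's Beta parameters are kept in a simple dictionary and updated elsewhere by the Record part (the parameter values here are just placeholders):

```python
import numpy as np

# Beta(alpha, beta) parameters per model; in the real system these would be
# updated by the Record function after each user interaction.
model_params = {
    "text-embedding-3-small": {"alpha": 1.0, "beta": 1.0},
    "text-embedding-ada-002": {"alpha": 1.0, "beta": 1.0},
}

def assign_model(params: dict) -> str:
    """Draw one sample from each model's Beta distribution and pick the largest."""
    samples = {
        name: np.random.beta(p["alpha"], p["beta"])
        for name, p in params.items()
    }
    return max(samples, key=samples.get)

chosen_model = assign_model(model_params)
```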

Since this article only covers how I built the embedding-based article search, the testing part is left for the next article.

The embedding models I considered are all from OpenAI:

  • text-embedding-3-small
  • ada v2
  • text-embedding-3-large

Since text-embedding-3-large cannot handle Traditional Chinese as well as the other two models, I decided to drop it before the test.

Once the embedding model is assigned, the next step is to transform the question into an embedding. After this, we are ready to search the articles.

Please be aware that transforming text into an embedding is not reversible, so the semantic search actually outputs article indexes, not contents.
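As a rough sketch, turning the question into an embedding with the OpenAI Python SDK looks like this (the model name would be whichever one was assigned in the previous step, and the question text is only an example):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed_question(question: str, model: str) -> list[float]:
    """Transform the user's question into an embedding vector."""
    response = client.embeddings.create(model=model, input=question)
    return response.data[0].embedding

query_vector = embed_question("How do I choose a suitable plan?", "text-embedding-3-small")
```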

Semantic Search with Faiss

FAISS (Facebook AI Similarity Search) is a library that allows developers to quickly search for embeddings of multimedia documents that are similar to each other.

In this project, I decided to use Faiss as the search library since it is one of the fastest free embedding search libraries. Simply put, I load all the article embeddings into Faiss, query the 3 nearest embeddings, and get the 3 corresponding article indexes.
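A minimal sketch of that search, assuming `query_vector` is the question embedding from the previous step and the article embeddings have already been computed (the file name is just an illustration):

```python
import faiss
import numpy as np

dimension = 1536                                        # e.g. text-embedding-3-small
article_embeddings = np.load("article_embeddings.npy").astype("float32")

# Build a simple flat L2 index over all article embeddings.
index = faiss.IndexFlatL2(dimension)
index.add(article_embeddings)

# Query the 3 nearest embeddings; Faiss returns row indexes, not contents.
query = np.array([query_vector], dtype="float32")       # shape (1, dimension)
distances, indexes = index.search(query, 3)
top3_article_ids = indexes[0].tolist()
```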

Get Selected Article & Update

After getting the indexes, we can retrieve the article contents from those indexes and send that information to the user. I also update the user information for the Record function and update the testing parameters.
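For illustration, if the articles are stored in a list aligned with the Faiss index order, the lookup is just an index into that list (the records and field names here are made up):

```python
# Hypothetical Article Database, ordered the same way as the Faiss index.
articles = [
    {"title": "Title Old", "url": "old_url.com", "article": "old article"},
    {"title": "Title New", "url": "new_url.com", "article": "new article"},
]

def fetch_articles(ids: list[int]) -> list[dict]:
    """Map Faiss result indexes back to article records."""
    return [articles[i] for i in ids]

for record in fetch_articles(top3_article_ids):   # indexes from the Faiss search above
    print(record["title"], record["url"])
```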

Upload

Below is the structure of the Upload function. The purpose of this function is to help the author add his articles without any coding or other complex processes.

Structure of Upload Function

Trigger Google Apps Script

There are actually two parts in this section: one is in Google Apps Script and the other is in a Google Cloud Function.

The function in Google Apps Script identifies whether there is a new article in the Google Sheet. The Google Sheet looks like this:

Title       URL           Article       Update
Title Old   old_url.com   old article   Done
Title New   new_url.com   new article

When the author fills in "Title", "URL", and "Article", the function checks whether these three columns are complete. Once they are, the function sends the article to the Google Cloud Function and automatically fills "Done" into the "Update" column.
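The author's check runs in Google Apps Script; purely as an illustration of the same logic, here is a rough Python sketch using the gspread library. The sheet name, column order, and Cloud Function URL are all assumptions, not the actual implementation:

```python
import gspread
import requests

# Hypothetical endpoint of the Google Cloud Function described below.
CLOUD_FUNCTION_URL = "https://example-region-project.cloudfunctions.net/upload_article"

worksheet = gspread.service_account().open("articles").sheet1
rows = worksheet.get_all_values()        # header row: Title, URL, Article, Update

for row_number, row in enumerate(rows[1:], start=2):
    title, url, article, update = (row + [""] * 4)[:4]
    if title and url and article and not update:
        # All three columns filled but not yet uploaded: push to the Cloud Function
        # and mark the "Update" column as Done.
        requests.post(CLOUD_FUNCTION_URL, json={"title": title, "url": url, "article": article})
        worksheet.update_cell(row_number, 4, "Done")
```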

Google Cloud Function

When a new article arrives at the Google Cloud Function, it updates the Article Database right away. At the same time, the new article is transformed into embeddings, which update the Embedding Database. The embedding models are the same as in the Search function.
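A minimal sketch of that handler using the Python functions-framework; the two storage helpers are placeholders, since the article does not describe which databases are used:

```python
import json
from openai import OpenAI
import functions_framework

client = OpenAI()
EMBEDDING_MODELS = ["text-embedding-3-small", "text-embedding-ada-002"]

def save_article(record: dict) -> None:
    # Placeholder for the Article Database (the real backend is not specified).
    with open("/tmp/articles.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")

def save_embedding(model: str, vector: list[float]) -> None:
    # Placeholder for the Embedding Database, one store per model.
    with open(f"/tmp/embeddings_{model}.jsonl", "a") as f:
        f.write(json.dumps(vector) + "\n")

@functions_framework.http
def upload_article(request):
    """Receive a new article, store it, and store one embedding per model."""
    payload = request.get_json()
    save_article(payload)
    for model in EMBEDDING_MODELS:
        vector = client.embeddings.create(model=model, input=payload["article"]).data[0].embedding
        save_embedding(model, vector)
    return "OK"
```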

Conclusion

Actually, this system is very simple, but it becomes more complex once I add an experiment to it. So please look forward to the next article, in which I will talk about Thompson Sampling, a Bayesian testing method, applied to this feature.

Also, I have not had time to push my code to GitHub yet, so please wait.


#embedding #searching








