Overview
In the last article, Embedding Searching Articles and Corresponding Thompson Sampling, I explained how to build an embedding vector search engine with an OpenAI embedding model. However, OpenAI offers three embedding models: text-embedding-3-small, text-embedding-3-large, and ada v2. Although I could rule out text-embedding-3-large because of its poor performance on Mandarin, I could not tell the difference between ada v2 and text-embedding-3-small in a short time. Therefore, I decided to run a Bayesian A/B test, more precisely Thompson Sampling, to decide which embedding model is more suitable for this project.
Thompson Sampling
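Since we only record whether a result is clicked, we use the Beta-Bernoulli model for Thompson Sampling. The model is:

$$x \mid p \sim \mathrm{Bernoulli}(p), \qquad p \sim \mathrm{Beta}(\alpha, \beta)$$

where p is the click probability, x indicates whether the user clicked (1) or not (0), and alpha and beta are the parameters of the Beta distribution.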
The steps to run the experiment with Thompson Sampling:
- Set a Beta(1,1) prior for each embedding model: Beta{ada} and Beta{small}.
- When a user asks a question, draw a probability from each Beta distribution, p{ada} and p{small}, and assign the embedding model whose draw is higher.
- After sending the search results back to the user, add one to the beta parameter of the Beta distribution that corresponds to the assigned embedding model. That is, beta{assigned} = beta{assigned} + 1.
- Once the user clicks a search result, the Record function does two things: beta{assigned} - 1 and alpha{assigned} + 1. In other words, every search is first counted as a failure and is converted into a success only if a click follows.
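The steps above can be sketched in a few lines of Python. This is only a minimal simulation under stated assumptions, not the code used in the project: the names `BetaArm`, `choose_model`, and `simulate` are hypothetical, and the "true" click rates passed to `simulate` are made-up numbers that stand in for real users.

```python
import random


class BetaArm:
    """Beta distribution over the click probability of one embedding model."""

    def __init__(self, alpha=1, beta=1):
        # Beta(1, 1) is the uniform prior from the first step.
        self.alpha = alpha
        self.beta = beta

    def sample(self, rng):
        # Draw one plausible click probability from the current distribution.
        return rng.betavariate(self.alpha, self.beta)

    def record_search(self):
        # Results were sent: count the search as a failure for now.
        self.beta += 1

    def record_click(self):
        # A click followed: turn that provisional failure into a success.
        self.beta -= 1
        self.alpha += 1


def choose_model(arms, rng):
    """Assign the model whose sampled probability is highest."""
    draws = {name: arm.sample(rng) for name, arm in arms.items()}
    return max(draws, key=draws.get)


def simulate(true_rates, n_queries, seed=0):
    """Run the loop against made-up click rates (an assumption for the demo)."""
    rng = random.Random(seed)
    arms = {name: BetaArm() for name in true_rates}
    for _ in range(n_queries):
        name = choose_model(arms, rng)
        arms[name].record_search()
        if rng.random() < true_rates[name]:
            arms[name].record_click()
    return arms


if __name__ == "__main__":
    arms = simulate({"ada": 0.1, "small": 0.9}, n_queries=500)
    for name, arm in arms.items():
        print(name, arm.alpha, arm.beta)
```

With this bookkeeping, alpha - 1 is the number of clicks and alpha + beta - 2 is the number of searches assigned to a model, so the better model is served more often as evidence accumulates.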
The following is the procedure of the experiment, from returning the search results to recording the clicks.
Record
The Record structure is only for recording the results of the experiment.
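As a rough illustration, such a structure only needs the two Beta parameters per model. The sketch below is an assumption about its shape, not the actual definition: the field names and the derived `successes` and `trials` properties are hypothetical.

```python
from dataclasses import dataclass, asdict


@dataclass
class Record:
    """Experiment state for one embedding model (hypothetical layout)."""

    model: str
    alpha: int = 1  # 1 + number of clicks
    beta: int = 1   # 1 + number of searches without a click

    @property
    def successes(self):
        # Clicks recorded so far, excluding the prior pseudo-count.
        return self.alpha - 1

    @property
    def trials(self):
        # Searches assigned to this model, excluding both prior pseudo-counts.
        return self.alpha + self.beta - 2


if __name__ == "__main__":
    record = Record(model="ada v2")
    print(asdict(record))  # plain dict, convenient for storing in a database
```

Starting from Beta(1,1), the success and trial counts can always be recovered from alpha and beta, so they do not need to be stored separately.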
Clicking
When a user clicks a URL in the search results, the URL sends an HTTP request to a Google Cloud Function (GCF).
User Database
In GCF, the function checks whether the click comes right after a search. Since we only want to know which embedding model produced the more suitable results for users, we only record clicks that come right after a search.
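One possible way to express this check is sketched below. It assumes the user database keeps, for each user, the most recent search that has not yet been followed by a click; the in-memory dict `db`, the `PendingSearch` type, and the functions `register_search` and `match_click` are all hypothetical stand-ins for the real database.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class PendingSearch:
    """The last search of a user that has not been followed by a click yet."""

    model: str
    result_urls: Tuple[str, ...]


def register_search(db: Dict[str, PendingSearch], user_id, model, result_urls):
    # Called when search results are sent back to the user.
    db[user_id] = PendingSearch(model, tuple(result_urls))


def match_click(db: Dict[str, PendingSearch], user_id, url) -> Optional[str]:
    """Return the model to credit, or None if the click should not be recorded."""
    # Any click consumes the pending search, so only the first click counts.
    pending = db.pop(user_id, None)
    if pending is None or url not in pending.result_urls:
        return None
    return pending.model


if __name__ == "__main__":
    db = {}
    register_search(db, "user-1", "small", ["https://example.com/a"])
    print(match_click(db, "user-1", "https://example.com/a"))
```

Consuming the pending search on the first click is a design choice in this sketch: it keeps each search worth at most one success, which matches the one-failure-then-maybe-one-success bookkeeping of the Beta parameters.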
Update Parameters and Redirect
Once the click has been checked, we update the test parameters. After recording, we redirect the user to the actual result article.
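A minimal sketch of this last step is shown below. It is not the deployed function: `handle_click` and the layout of `params` are assumptions, and the returned `(body, status, headers)` tuple is simply the response shape that Flask-style HTTP handlers accept.

```python
def handle_click(params, counted_model, target_url):
    """Update the Beta parameters if the click counts, then redirect.

    params:        {"model name": {"alpha": int, "beta": int}} (hypothetical layout)
    counted_model: model to credit, or None when the click is not recorded
    target_url:    the real article URL the user wanted to open
    """
    if counted_model is not None:
        entry = params[counted_model]
        # The search was already counted as a failure; convert it to a success.
        entry["alpha"] += 1
        entry["beta"] -= 1
    # 302 sends the browser on to the article whether or not the click counted.
    return ("", 302, {"Location": target_url})


if __name__ == "__main__":
    params = {"small": {"alpha": 1, "beta": 2}}
    print(handle_click(params, "small", "https://example.com/article"))
    print(params)
```

The redirect happens in both branches, so users always reach the article even when their click is ignored by the experiment.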
Experiment Results
I ended the experiment on July 3, 2024. The resulting parameters of the two Beta distributions are:
| Model | alpha | beta | Successes (clicks) | Number of trials |
|---|---|---|---|---|
| ada v2 | 63 | 35 | 62 | 96 |
| text-embedding-3-small | 160 | 80 | 159 | 238 |
The following is the mean probability and the corresponding credible interval for each model. The x-axis is the click, and the y-axis is the probability.
You can see that, most of the time, the mean of Small is larger than that of Ada, and that both became stable after around 300 clicks. You can see how the Beta distributions change in the next gif.
Given that the experiment period had ended and both models had become stable, I decided to use text-embedding-3-small as the embedding model for the article search engine.
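For readers who want to reproduce the summary numbers from the final parameters in the table, here is a small sketch that uses only the Python standard library. The function names are hypothetical, and the credible interval and the probability that one model beats the other are Monte Carlo estimates, so they vary slightly with the seed and the number of draws.

```python
import random


def posterior_mean(alpha, beta):
    # Mean of Beta(alpha, beta).
    return alpha / (alpha + beta)


def credible_interval(alpha, beta, level=0.95, n_draws=20000, seed=0):
    """Equal-tailed credible interval estimated from sorted random draws."""
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(alpha, beta) for _ in range(n_draws))
    tail = (1.0 - level) / 2.0
    low = draws[int(tail * n_draws)]
    high = draws[int((1.0 - tail) * n_draws) - 1]
    return low, high


def prob_first_beats_second(first, second, n_draws=20000, seed=0):
    """Estimate P(p_first > p_second) by sampling both Beta distributions."""
    rng = random.Random(seed)
    wins = sum(
        rng.betavariate(*first) > rng.betavariate(*second)
        for _ in range(n_draws)
    )
    return wins / n_draws


# Final (alpha, beta) pairs taken from the results table.
ADA = (63, 35)
SMALL = (160, 80)

if __name__ == "__main__":
    for name, params in (("ada v2", ADA), ("text-embedding-3-small", SMALL)):
        print(name, posterior_mean(*params), credible_interval(*params))
    print("P(small > ada) ~", prob_first_beats_second(SMALL, ADA))
```

The probability that Small has the higher click rate is one way to put a single number on the comparison, although it is a Bayesian statement about the two Beta distributions rather than a classical significance test.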
Conclusion
This time, I used a different method, Thompson Sampling, to do the A/B testing. The advantage of this method is that I do not need to worry about UX during the experiment period, since the algorithm automatically increases the probability of serving the better model. However, with this method it is hard to do the analysis after the experiment. I cannot make a straightforward statement with quantified results, for instance, "A is better than B since its click rate is significantly larger than B's by xx%", so if the results of the experiment need to be reported to others, it might not be the best choice.