BERT vs. Gzip+kNN vs. LGBM


Posted by ar851060 on 2023-07-29

At the 2023 D4SG final presentations, one team used BERT for a tabular classification task in order to avoid the hassle of cleaning the data. I found this idea interesting, so I ran a small experiment of my own. The team's code that inspired me is here.

Here I compare three models: a pre-trained BERT, LGBM (developed by Microsoft), and Gzip+kNN from the ACL 2023 Findings paper "“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors"; an introduction to it is here: Gzip + kNN: A Good Text Classifier?. To keep what BERT and LGBM learn comparable, and to keep Gzip+kNN reasonably fast, I needed a dataset that contains plenty of text, whose individual text features are not structured sentences or documents, and whose size is under 10MB. I use the "Latest Data Science Salaries" dataset from Kaggle, which looks like this:

| Job Title | Employment Type | Experience Level | Expertise Level | Company Location | Salary in USD | Employee Residence | Company Size | Year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Machine Learning Manager | Full-Time | Senior | Expert | United States | 129562 | Germany | Large | 2023 |
| BI Data Analyst | Full-Time | Entry | Junior | Kenya | 50000 | Kenya | Small | 2023 |
| AI Engineer | Full-Time | Senior | Expert | United States | 227850 | United States | Medium | 2023 |
| AI Engineer | Full-Time | Senior | Expert | United States | 180500 | United States | Medium | 2023 |
| Data Analyst | Full-Time | Mid | Intermediate | United States | 90000 | United States | Medium | 2023 |

Year is the year in which the record was collected, and the target is to predict Salary in USD from the other variables. This kind of data carries a lot of textual information, but no feature that only BERT could consume (nothing is a full sentence or document). The categorical variables summarize as follows:

|  | Job Title | Employment Type | Experience Level | Expertise Level | Company Location | Employee Residence | Company Size |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 3470 | 3470 | 3470 | 3470 | 3470 | 3470 | 3470 |
| unique | 115 | 4 | 4 | 4 | 71 | 83 | 3 |
| top | Data Engineer | Full-Time | Senior | Expert | United States | United States | Medium |
| freq | 732 | 3431 | 2187 | 2187 | 2616 | 2571 | 2859 |

Job Title, Company Location, and Employee Residence have especially many unique categories.
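For reference, a summary table like the one above can be produced with pandas; a minimal sketch, assuming the Kaggle CSV is saved as salaries.csv (the file name is my assumption):

```python
import pandas as pd

# Load the Kaggle "Latest Data Science Salaries" dataset
# (file name assumed; adjust it to your download).
df = pd.read_csv("salaries.csv")

# describe() restricted to object columns yields exactly the
# count / unique / top / freq table shown above.
print(df.describe(include="object"))
```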

Here is my code:

BERT part
LGBM/Gzip+kNN part
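Since no package implements Gzip+kNN, here is a minimal sketch of the idea, adapted from the paper's classification setting to regression: serialize each row as a string, measure the normalized compression distance (NCD) between rows with gzip, and average the salaries of the k nearest training rows. The value of k and the row serialization are my assumptions, not necessarily what the linked notebook does:

```python
import gzip
import numpy as np

def clen(s: str) -> int:
    """Length of the gzip-compressed UTF-8 bytes of s."""
    return len(gzip.compress(s.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings."""
    cx, cy, cxy = clen(x), clen(y), clen(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def gzip_knn_predict(train_texts, train_y, test_texts, k=5):
    """Regression variant of Gzip+kNN: average the targets of the
    k nearest training rows under NCD (k=5 is an assumption)."""
    train_y = np.asarray(train_y)
    preds = []
    for t in test_texts:
        dists = np.array([ncd(t, tr) for tr in train_texts])
        preds.append(train_y[np.argsort(dists)[:k]].mean())
    return np.array(preds)

# Rows are serialized by joining their fields, e.g.
# "AI Engineer Full-Time Senior Expert United States United States Medium 2023"
```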

For BERT I take a pre-trained model and fine-tune it for 20 epochs. For LGBM I don't tune any parameters, since I am lazy; it runs with the defaults.
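A minimal sketch of the fine-tuning setup, using the HuggingFace transformers regression head (num_labels=1 gives an MSE loss); the checkpoint name, learning rate, and target scaling are my assumptions, not necessarily what my notebook uses:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Checkpoint assumed; the notebook may use a different one.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1  # num_labels=1 -> regression with MSE loss
)

# Two illustrative rows (the real data has 3470); targets are
# assumed scaled, which the MSE values below also suggest.
train_rows = [
    "AI Engineer Full-Time Senior Expert United States United States Medium 2023",
    "BI Data Analyst Full-Time Entry Junior Kenya Kenya Small 2023",
]
train_salaries = [0.9, 0.2]

batch = tokenizer(train_rows, padding=True, truncation=True, return_tensors="pt")
batch["labels"] = torch.tensor(train_salaries, dtype=torch.float)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # lr assumed
model.train()
for epoch in range(20):  # 20 fine-tuning epochs, as above
    optimizer.zero_grad()
    out = model(**batch)  # out.loss is the MSE against the labels
    out.loss.backward()
    optimizer.step()
```

A real run would iterate over mini-batches rather than one full batch; the loop is compressed here to keep the sketch short.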

The MSE results are as follows:

| BERT | Gzip+kNN | LGBM |
| --- | --- | --- |
| 0.02042 | 0.01397 | 0.01521 |

BERT performs the worst and Gzip+kNN the best; BERT consistently predicts higher than it should. However, Gzip+kNN is not supported by any package, and kNN itself is slow at prediction time, so for efficiency I will still use LGBM in the future. Below is the feature importance from LGBM:

| Features | Importance |
| --- | --- |
| Job_Title | 1344 |
| Employee_Residence | 415 |
| Company_Location | 395 |
| Experience_Level | 341 |
| Year | 301 |
| Company_Size | 204 |
| Employment_Type | 0 |
| Expertise_Level | 0 |

Salary appears to be unrelated to Employment Type and Expertise Level, though these may simply be explained away by Experience Level. If you want to know how LGBM calculates feature importance, check this link: Is this Important? - Unveiling the Secrets of Feature Importance.
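The table comes from LightGBM's built-in importances; a minimal sketch with the scikit-learn API, using default parameters as in the experiment (the tiny example frame is mine, not the real data):

```python
import lightgbm as lgb
import pandas as pd

# Tiny illustrative frame; the real X has the eight features above,
# with string columns cast to pandas' category dtype.
X_train = pd.DataFrame({
    "Job_Title": pd.Categorical(["AI Engineer", "Data Analyst"] * 20),
    "Company_Size": pd.Categorical(["Medium", "Small"] * 20),
})
y_train = [0.9, 0.4] * 20

model = lgb.LGBMRegressor()  # all defaults, no tuning
model.fit(X_train, y_train)

# The default importance_type is "split": how many times a feature is
# used to split a node, which matches the counts in the table above.
print(pd.Series(model.feature_importances_, index=X_train.columns)
      .sort_values(ascending=False))
```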

All my code is here: Github


#project








