BERT vs. Gzip+kNN vs. LGBM


Posted by ar851060 on 2023-07-29

At the 2023 D4SG final presentations, one team used BERT for a tabular classification task in order to avoid the hassle of cleaning the data. I found this idea interesting, so I ran a small experiment of my own. The team's code that inspired me is here.

Here I compare three models: a pre-trained BERT, LGBM (developed by Microsoft), and Gzip+kNN from the ACL 2023 Findings paper "“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors"; an introduction to it is here: Gzip + kNN: A Good Text Classifier?. To keep what BERT and LGBM learn comparable, and to keep Gzip+kNN reasonably fast, I needed a dataset that contains plenty of text, whose individual text features are not structured sentences or documents, and whose size is under 10MB. I use the "Latest Data Science Salaries" dataset from Kaggle, which looks like this:

| Job Title | Employment Type | Experience Level | Expertise Level | Company Location | Salary in USD | Employee Residence | Company Size | Year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Machine Learning Manager | Full-Time | Senior | Expert | United States | 129562 | Germany | Large | 2023 |
| BI Data Analyst | Full-Time | Entry | Junior | Kenya | 50000 | Kenya | Small | 2023 |
| AI Engineer | Full-Time | Senior | Expert | United States | 227850 | United States | Medium | 2023 |
| AI Engineer | Full-Time | Senior | Expert | United States | 180500 | United States | Medium | 2023 |
| Data Analyst | Full-Time | Mid | Intermediate | United States | 90000 | United States | Medium | 2023 |

Year is the year in which the record was collected, and the target is to predict Salary in USD from the other variables. This kind of data carries a lot of textual information, but no feature that only BERT could consume (nothing is a full sentence or document). The categorical variables summarize as follows:

|  | Job Title | Employment Type | Experience Level | Expertise Level | Company Location | Employee Residence | Company Size |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 3470 | 3470 | 3470 | 3470 | 3470 | 3470 | 3470 |
| unique | 115 | 4 | 4 | 4 | 71 | 83 | 3 |
| top | Data Engineer | Full-Time | Senior | Expert | United States | United States | Medium |
| freq | 732 | 3431 | 2187 | 2187 | 2616 | 2571 | 2859 |

Job Title, Company Location, and Employee Residence have especially many unique categories.
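For reference, a summary table like the one above can be produced with pandas; a minimal sketch, assuming the Kaggle CSV is saved as salaries.csv (the file name is my assumption):

```python
import pandas as pd

# Load the Kaggle "Latest Data Science Salaries" dataset
# (file name assumed; adjust it to your download).
df = pd.read_csv("salaries.csv")

# describe() restricted to object columns yields exactly the
# count / unique / top / freq table shown above.
print(df.describe(include="object"))
```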

Here is my code:

BERT part
LGBM/Gzip+kNN part
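Since no package implements Gzip+kNN, here is a minimal sketch of the idea, adapted from the paper's classification setting to regression: serialize each row as a string, measure the normalized compression distance (NCD) between rows with gzip, and average the salaries of the k nearest training rows. The value of k and the row serialization are my assumptions, not necessarily what the linked notebook does:

```python
import gzip
import numpy as np

def clen(s: str) -> int:
    """Length of the gzip-compressed UTF-8 bytes of s."""
    return len(gzip.compress(s.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings."""
    cx, cy, cxy = clen(x), clen(y), clen(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def gzip_knn_predict(train_texts, train_y, test_texts, k=5):
    """Regression variant of Gzip+kNN: average the targets of the
    k nearest training rows under NCD (k=5 is an assumption)."""
    train_y = np.asarray(train_y)
    preds = []
    for t in test_texts:
        dists = np.array([ncd(t, tr) for tr in train_texts])
        preds.append(train_y[np.argsort(dists)[:k]].mean())
    return np.array(preds)

# Rows are serialized by joining their fields, e.g.
# "AI Engineer Full-Time Senior Expert United States United States Medium 2023"
```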

For BERT I take a pre-trained model and fine-tune it for 20 epochs. For LGBM I don't tune any parameters, since I am lazy; it runs with the defaults.
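A minimal sketch of the fine-tuning setup, using the HuggingFace transformers regression head (num_labels=1 gives an MSE loss); the checkpoint name, learning rate, and target scaling are my assumptions, not necessarily what my notebook uses:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Checkpoint assumed; the notebook may use a different one.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1  # num_labels=1 -> regression with MSE loss
)

# Two illustrative rows (the real data has 3470); targets are
# assumed scaled, which the MSE values below also suggest.
train_rows = [
    "AI Engineer Full-Time Senior Expert United States United States Medium 2023",
    "BI Data Analyst Full-Time Entry Junior Kenya Kenya Small 2023",
]
train_salaries = [0.9, 0.2]

batch = tokenizer(train_rows, padding=True, truncation=True, return_tensors="pt")
batch["labels"] = torch.tensor(train_salaries, dtype=torch.float)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # lr assumed
model.train()
for epoch in range(20):  # 20 fine-tuning epochs, as above
    optimizer.zero_grad()
    out = model(**batch)  # out.loss is the MSE against the labels
    out.loss.backward()
    optimizer.step()
```

A real run would iterate over mini-batches rather than one full batch; the loop is compressed here to keep the sketch short.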

The MSE results are as follows:

| BERT | Gzip+kNN | LGBM |
| --- | --- | --- |
| 0.02042 | 0.01397 | 0.01521 |

BERT performs the worst and Gzip+kNN the best; BERT consistently predicts higher than it should. However, Gzip+kNN is not supported by any package, and kNN itself is slow at prediction time, so for efficiency I will still use LGBM in the future. Below is the feature importance from LGBM:

| Features | Importance |
| --- | --- |
| Job_Title | 1344 |
| Employee_Residence | 415 |
| Company_Location | 395 |
| Experience_Level | 341 |
| Year | 301 |
| Company_Size | 204 |
| Employment_Type | 0 |
| Expertise_Level | 0 |

Salary appears to be unrelated to Employment Type and Expertise Level, though these may simply be explained away by Experience Level. If you want to know how LGBM calculates feature importance, check this link: Is this Important? - Unveiling the Secrets of Feature Importance.
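The table comes from LightGBM's built-in importances; a minimal sketch with the scikit-learn API, using default parameters as in the experiment (the tiny example frame is mine, not the real data):

```python
import lightgbm as lgb
import pandas as pd

# Tiny illustrative frame; the real X has the eight features above,
# with string columns cast to pandas' category dtype.
X_train = pd.DataFrame({
    "Job_Title": pd.Categorical(["AI Engineer", "Data Analyst"] * 20),
    "Company_Size": pd.Categorical(["Medium", "Small"] * 20),
})
y_train = [0.9, 0.4] * 20

model = lgb.LGBMRegressor()  # all defaults, no tuning
model.fit(X_train, y_train)

# The default importance_type is "split": how many times a feature is
# used to split a node, which matches the counts in the table above.
print(pd.Series(model.feature_importances_, index=X_train.columns)
      .sort_values(ascending=False))
```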

All my code is here: Github


#project








