At the 2023 D4SG results presentation, one of the teams used BERT for a classification problem in order to sidestep the hassle of cleaning their data. I found the idea interesting, so I ran a small experiment of my own. The code that inspired me is here.
The models I compare here are a pre-trained BERT, Microsoft's LGBM, and one more contender: the recent paper "“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors" (Gzip+kNN), introduced in "Gzip + kNN: A Good Text Classifier?". To keep BERT and LGBM from learning very different things, and to keep Gzip+kNN reasonably fast, I restricted myself to a dataset that contains a lot of text, whose individual fields are not full sentences or documents, and that is under 10 MB. The dataset I use is "Latest Data Science Salaries" from Kaggle, and it looks like this:
Job Title | Employment Type | Experience Level | Expertise Level | Company Location | Salary in USD | Employee Residence | Company Size | Year |
---|---|---|---|---|---|---|---|---|
Machine Learning Manager | Full-Time | Senior | Expert | United States | 129562 | Germany | Large | 2023 |
BI Data Analyst | Full-Time | Entry | Junior | Kenya | 50000 | Kenya | Small | 2023 |
AI Engineer | Full-Time | Senior | Expert | United States | 227850 | United States | Medium | 2023 |
AI Engineer | Full-Time | Senior | Expert | United States | 180500 | United States | Medium | 2023 |
Data Analyst | Full-Time | Mid | Intermediate | United States | 90000 | United States | Medium | 2023 |
Year is the year in which each sample was collected, and the goal is to predict Salary in USD from the other variables. This kind of data carries a lot of text, but has no variable that only BERT could make use of. The categorical variables look like this:
| | Job Title | Employment Type | Experience Level | Expertise Level | Company Location | Employee Residence | Company Size |
---|---|---|---|---|---|---|---|
count | 3470 | 3470 | 3470 | 3470 | 3470 | 3470 | 3470 |
unique | 115 | 4 | 4 | 4 | 71 | 83 | 3 |
top | Data Engineer | Full-Time | Senior | Expert | United States | United States | Medium |
freq | 732 | 3431 | 2187 | 2187 | 2616 | 2571 | 2859 |
Job Title, Company Location, and Employee Residence have a particularly large number of unique categories.
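A summary like the one above can be produced with pandas. This is only a sketch: the CSV file name is an assumption, and the column names follow the Kaggle preview shown earlier.

```python
import pandas as pd

# Assumed file name for the Kaggle "Latest Data Science Salaries" CSV.
df = pd.read_csv("Latest_Data_Science_Salaries.csv")

categorical_cols = [
    "Job Title", "Employment Type", "Experience Level", "Expertise Level",
    "Company Location", "Employee Residence", "Company Size",
]
# Gives count / unique / top / freq for each categorical column,
# which is exactly the table above.
print(df[categorical_cols].describe())
```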
Here is my code:
For BERT I start from a pre-trained model and fine-tune it for 20 epochs. For LGBM I don't tune any parameters at all, because I am lazy.
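Since my own code lives behind the link above, here is a minimal sketch of how the two setups might look. Everything specific in it is an assumption rather than the original implementation: the CSV file name, the 80/20 split, the bert-base-uncased checkpoint, and in particular the min-max scaling of the salary target (which would explain MSE values in the 0.01–0.02 range below).

```python
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("Latest_Data_Science_Salaries.csv")  # assumed file name
target_col = "Salary in USD"
feature_cols = [c for c in df.columns if c != target_col]

# Assumption: scale the salary target to [0, 1] before training.
y = MinMaxScaler().fit_transform(df[[target_col]]).ravel()

# Let LightGBM treat every string column as a native categorical feature.
for col in feature_cols:
    if df[col].dtype == object:
        df[col] = df[col].astype("category")

X_train, X_test, y_train, y_test = train_test_split(
    df[feature_cols], y, test_size=0.2, random_state=42)

# --- LGBM: default parameters, no tuning ---
booster = lgb.LGBMRegressor()
booster.fit(X_train, y_train)
lgbm_mse = np.mean((booster.predict(X_test) - y_test) ** 2)

# --- BERT: serialize each row into a short text, fine-tune a regression head ---
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def row_to_text(row):
    # e.g. "Job Title: AI Engineer, Employment Type: Full-Time, ..."
    return ", ".join(f"{col}: {row[col]}" for col in feature_cols)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1, problem_type="regression")

def to_dataset(frame, targets):
    enc = tokenizer(frame.apply(row_to_text, axis=1).tolist(),
                    truncation=True, padding=True)
    return Dataset.from_dict({"input_ids": enc["input_ids"],
                              "attention_mask": enc["attention_mask"],
                              "labels": [float(t) for t in targets]})

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert_salary", num_train_epochs=20),
    train_dataset=to_dataset(X_train, y_train),
)
trainer.train()
bert_preds = trainer.predict(to_dataset(X_test, y_test)).predictions.ravel()
bert_mse = np.mean((bert_preds - y_test) ** 2)
```

The compressor-based method can be sketched the same way, reusing `row_to_text`, `X_train`, `X_test`, `y_train`, and `y_test` from above. The paper does classification with a kNN over the normalized compression distance (NCD); averaging the neighbours' salaries instead of majority voting is my guess at how it gets adapted to a regression target, and k is not tuned here.

```python
import gzip

def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings, using gzip."""
    ca = len(gzip.compress(a.encode()))
    cb = len(gzip.compress(b.encode()))
    cab = len(gzip.compress((a + " " + b).encode()))
    return (cab - min(ca, cb)) / max(ca, cb)

train_texts = X_train.apply(row_to_text, axis=1).tolist()
test_texts = X_test.apply(row_to_text, axis=1).tolist()

k = 5  # assumed neighbourhood size
gzip_preds = []
for tx in test_texts:
    dists = np.array([ncd(tx, tr) for tr in train_texts])
    neighbours = np.argsort(dists)[:k]  # k nearest training rows under NCD
    gzip_preds.append(y_train[neighbours].mean())
gzip_mse = np.mean((np.array(gzip_preds) - y_test) ** 2)
```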
The MSE results are:
BERT | Gzip+kNN | LGBM |
---|---|---|
0.02042 | 0.01397 | 0.01521 |
BERT does the worst and Gzip+kNN the best; BERT also consistently predicts too high. However, Gzip+kNN has no package support yet, and kNN itself is slow at prediction time, so for efficiency's sake I would still reach for LGBM. Below is the feature importance from LGBM:
Features | Importance |
---|---|
Job_Title | 1344 |
Employee_Residence | 415 |
Company_Location | 395 |
Experience_Level | 341 |
Year | 301 |
Company_Size | 204 |
Employment_Type | 0 |
Expertise_Level | 0 |
Salary looks unrelated to Employment Type and Expertise Level, but that may simply be because Experience Level already explains them. If you want to know how LGBM calculates feature importance, check this link: Is this Important? - Unveiling the Secrets of Feature Importance.
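For reference, a table like the one above can be read off a fitted model directly. A minimal sketch, assuming `booster` is the fitted LGBMRegressor from the earlier sketch (the underscored feature names in the table suggest the columns were renamed at some point; here I just print whatever names the model saw):

```python
import pandas as pd

# LightGBM's default importance_type is "split": the number of times a feature
# is used in a split, which matches the integer counts in the table above.
importances = pd.Series(
    booster.feature_importances_, index=booster.feature_name_
).sort_values(ascending=False)
print(importances)

# A gain-based view (total loss reduction attributed to each feature):
# booster.booster_.feature_importance(importance_type="gain")
```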