Introduction
BERT is a popular pre-trained language model that has shown great success in natural language processing (NLP) tasks such as sentiment analysis, text classification, and question answering. However, BERT takes only text as input and cannot natively incorporate other types of data, such as numerical and categorical features.
This limitation can be addressed by incorporating tabular data into BERT models. In this way, we can use BERT's powerful language processing capabilities along with tabular data to create more robust and accurate models.
This article will show how to incorporate tabular data into BERT models and train them using the Hugging Face Trainer.
Incorporating Tabular Data into a BERT Model
Below is step-by-step PyTorch code that creates a custom BERT model capable of incorporating tabular data (numerical and categorical values) and trains it using the Hugging Face Trainer.
Step 1: Prepare the Data
The first step is to prepare the data. We can use pandas to load the data from a CSV file and split it into training and validation sets. We'll use a simple example CSV file that contains text along with numerical and categorical columns:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the raw data and hold out 20% for validation
df = pd.read_csv('data.csv')
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)
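For illustration, assume data.csv contains a free-text column, two numerical columns, a categorical column encoded as an integer, and a binary label. The column names below are the ones used in the rest of the code; the rows themselves are made up:

text,num1,num2,cat,label
"great product, arrived quickly",3.2,0.7,1,1
"arrived broken and support was unhelpful",1.1,4.5,0,0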
Once we have the training and validation data frames, we need to convert them into PyTorch datasets. We'll create a custom dataset that inherits from the Dataset class and implements the __len__ and __getitem__ methods:
import torch
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, data, tokenizer, max_length):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        row = self.data.iloc[index]
        text = row['text']
        label = row['label']
        # Numerical features as a float vector; categorical feature as a length-1 index tensor
        numerical_data = torch.tensor(row[['num1', 'num2']].values.astype(float), dtype=torch.float)
        categorical_data = torch.tensor([row['cat']], dtype=torch.long)
        encoding = self.tokenizer(text, padding='max_length', truncation=True,
                                  max_length=self.max_length, return_tensors='pt')
        return {
            'input_ids': encoding['input_ids'][0],
            'attention_mask': encoding['attention_mask'][0],
            'token_type_ids': encoding['token_type_ids'][0],
            'numerical_data': numerical_data,
            'categorical_data': categorical_data,
            'label': torch.tensor(label, dtype=torch.long)
        }
In this example, we load each row from the data frame and extract the text, label, numerical data (num1 and num2), and categorical data (cat) columns. We then use the tokenizer to encode the text and return a dictionary that contains the input IDs, attention mask, token type IDs, numerical data, categorical data, and label.
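As a quick sanity check (assuming a data.csv with the columns shown earlier), you can fetch a single item from the dataset and inspect the tensor shapes:

from transformers import BertTokenizer

# Fetch one example and verify the tensor shapes
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
dataset = CustomDataset(train_df, tokenizer, max_length=128)
item = dataset[0]
print(item['input_ids'].shape)         # torch.Size([128])
print(item['numerical_data'].shape)    # torch.Size([2])
print(item['categorical_data'].shape)  # torch.Size([1])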
Step 2: Define the Model
The next step is to define the model. We'll use the Hugging Face BertModel as the base encoder, add two additional input layers for the numerical and categorical data, and attach our own classification head:
from transformers import BertModel
import torch.nn as nn

class CustomModel(nn.Module):
    def __init__(self, num_numerical_features, num_categorical_features, num_labels):
        super().__init__()
        # Pre-trained BERT encoder; we attach our own classification head below
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.numerical_layer = nn.Linear(num_numerical_features, 64)
        self.categorical_layer = nn.Embedding(num_categorical_features, 16)
        self.classifier = nn.Linear(768 + 64 + 16, num_labels)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, input_ids, attention_mask, token_type_ids,
                numerical_data, categorical_data, labels=None):
        # [CLS] token representation from the last hidden layer
        bert_output = self.bert(input_ids=input_ids, attention_mask=attention_mask,
                                token_type_ids=token_type_ids).last_hidden_state[:, 0]
        numerical_output = self.numerical_layer(numerical_data)
        # categorical_data has shape (batch, 1); average over that dimension
        categorical_output = self.categorical_layer(categorical_data).mean(dim=1)
        x = torch.cat((bert_output, numerical_output, categorical_output), dim=1)
        logits = self.classifier(x)
        # Return the loss alongside the logits so the Hugging Face Trainer can use it
        if labels is not None:
            loss = self.loss_fn(logits, labels)
            return {'loss': loss, 'logits': logits}
        return {'logits': logits}
In this example, we define a custom model that inherits from the PyTorch nn.Module class. The model combines three components: the pre-trained BERT encoder, a numerical layer that projects the numerical features into a 64-dimensional vector, and a categorical embedding layer that maps each category to a 16-dimensional vector. We concatenate the BERT [CLS] representation with the numerical and categorical outputs and pass the result through a linear classifier to get the logits. When labels are provided, the forward pass also computes the cross-entropy loss, which is what the Hugging Face Trainer uses during training.
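Continuing from the definitions above, a minimal shape check with a dummy batch (batch size 2, sequence length 8; the feature sizes are illustrative and match the example columns) looks like this:

# Dummy batch to verify the output shapes; 30522 is the bert-base-uncased vocabulary size
model = CustomModel(num_numerical_features=2, num_categorical_features=4, num_labels=2)
batch = {
    'input_ids': torch.randint(0, 30522, (2, 8)),
    'attention_mask': torch.ones(2, 8, dtype=torch.long),
    'token_type_ids': torch.zeros(2, 8, dtype=torch.long),
    'numerical_data': torch.randn(2, 2),
    'categorical_data': torch.randint(0, 4, (2, 1)),
    'labels': torch.tensor([0, 1]),
}
outputs = model(**batch)
print(outputs['logits'].shape)  # torch.Size([2, 2])
print(outputs['loss'])          # scalar cross-entropy loss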
Step 3: Train the Model
The final step is to train the model using the Hugging Face Trainer. We'll use the AdamW optimizer with a linear learning-rate scheduler; the cross-entropy loss is computed inside the model's forward pass. We'll also define a custom metric function that reports accuracy:
from transformers import BertTokenizer, get_scheduler
from torch.optim import AdamW
from sklearn.metrics import accuracy_score

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
train_dataset = CustomDataset(train_df, tokenizer, max_length=128)
val_dataset = CustomDataset(val_df, tokenizer, max_length=128)

# Move the model to the GPU if one is available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = CustomModel(num_numerical_features=2, num_categorical_features=4, num_labels=2).to(device)

optimizer = AdamW(model.parameters(), lr=5e-5)
# Total optimizer steps = batches per epoch * number of epochs
num_training_steps = (len(train_dataset) // 16) * 5
scheduler = get_scheduler("linear", optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

def compute_accuracy(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # The Trainer expects compute_metrics to return a dictionary of metrics
    return {'accuracy': accuracy_score(labels, preds)}
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,       # not used here, since we pass our own scheduler below
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_accuracy,
    # The Trainer accepts a custom optimizer and scheduler as a single tuple
    optimizers=(optimizer, scheduler),
)

trainer.train()
In this example, we first create the tokenizer and datasets, and move the model to the GPU if one is available. We then create the optimizer, scheduler, and custom metric, and hand the optimizer and scheduler to the Trainer through its optimizers argument. Finally, we create the Trainer object and call its train method.