2023-03-05

How to Incorporate Tabular Data with BERT

Introduction

BERT is a popular pre-trained language model that has shown great success in natural language processing (NLP) tasks such as sentiment analysis, text classification, and question answering. However, BERT only takes in text as input and cannot incorporate other types of data such as numerical and categorical data.

This limitation can be addressed by incorporating tabular data into BERT models. In this way, we can use BERT's powerful language processing capabilities along with tabular data to create more robust and accurate models.

This article will show how to incorporate tabular data into BERT models and train them using the Hugging Face Trainer.

Incorporating Tabular Data into a BERT Model

Here is step-by-step PyTorch code to create a custom BERT model that incorporates tabular data (numerical and categorical values) and to train it using the Hugging Face Trainer.

Step 1: Prepare the Data

The first step is to prepare the data. We can use pandas to load the data from a CSV file and split it into training and validation sets. We'll use a simple example CSV file that contains a text column, a label column, and both numerical and categorical features:

python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('data.csv')
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)
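
The exact column layout of data.csv is an assumption for this walkthrough; a minimal frame with the expected columns (a free-text column, two numerical features, an integer-encoded categorical feature, and a binary label) might look like this:

python
import pandas as pd

# Hypothetical layout of data.csv assumed throughout this article
example_df = pd.DataFrame({
    'text': ['great product, works as expected', 'arrived broken and support was slow'],
    'num1': [0.5, -0.3],
    'num2': [1.2, 0.8],
    'cat': [0, 2],      # integer category codes in [0, num_categories)
    'label': [1, 0],    # binary target
})
print(example_df)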

Once we have the training and validation data frames, we need to convert them into PyTorch datasets using the Dataset class. We'll create a custom dataset that inherits from the Dataset class and implements the __len__ and __getitem__ methods:

python
import torch
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, data, tokenizer, max_length):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        row = self.data.iloc[index]

        text = row['text']
        label = row['label']
        numerical_data = torch.tensor(row[['num1', 'num2']].values.astype(float), dtype=torch.float)
        categorical_data = torch.tensor([row['cat']], dtype=torch.long)

        encoding = self.tokenizer(text, padding='max_length', truncation=True, max_length=self.max_length, return_tensors='pt')

        return {
            'input_ids': encoding['input_ids'][0],
            'attention_mask': encoding['attention_mask'][0],
            'token_type_ids': encoding['token_type_ids'][0],
            'numerical_data': numerical_data,
            'categorical_data': categorical_data,
            # the Hugging Face Trainer expects the target under the 'labels' key
            'labels': torch.tensor(label, dtype=torch.long)
        }

In this example, we load each row from the data frame and extract the text, label, numerical data (num1 and num2), and categorical data (cat) columns. We then use the tokenizer to encode the text and return a dictionary that contains the input IDs, attention mask, token type IDs, numerical data, categorical data, and label (stored under the labels key, which is the name the Hugging Face Trainer expects).
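
As a quick sanity check (assuming the bert-base-uncased tokenizer and the max_length of 128 used later in this article), you can inspect the tensor shapes of a single sample:

python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
sample = CustomDataset(train_df, tokenizer, max_length=128)[0]
for key, value in sample.items():
    print(key, tuple(value.shape))
# input_ids (128,), attention_mask (128,), token_type_ids (128,),
# numerical_data (2,), categorical_data (1,), labels ()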

Step 2: Define the Model

The next step is to define the model. We'll use the Hugging Face BertModel as the base encoder and add two additional input branches for the numerical and categorical data:

python
from transformers import BertModel
import torch.nn as nn

class CustomModel(nn.Module):
    def __init__(self, num_numerical_features, num_categorical_features, num_labels):
        super().__init__()

        # Pre-trained BERT encoder; we use the plain BertModel rather than
        # BertForSequenceClassification so that we can build our own classification head
        self.bert = BertModel.from_pretrained('bert-base-uncased')

        self.numerical_layer = nn.Linear(num_numerical_features, 64)
        # num_categorical_features is the number of distinct category codes (embedding vocabulary size)
        self.categorical_layer = nn.Embedding(num_categorical_features, 16)

        # 768 (BERT hidden size) + 64 (numerical) + 16 (categorical)
        self.classifier = nn.Linear(768 + 64 + 16, num_labels)

    def forward(self, input_ids, attention_mask, token_type_ids, numerical_data, categorical_data, labels=None):
        # [CLS] token representation of the text
        bert_output = self.bert(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids).last_hidden_state[:, 0]

        numerical_output = self.numerical_layer(numerical_data)
        categorical_output = self.categorical_layer(categorical_data).mean(dim=1)

        x = torch.cat((bert_output, numerical_output, categorical_output), dim=1)
        logits = self.classifier(x)

        # When labels are provided, also return the cross-entropy loss so the
        # Hugging Face Trainer can use this model directly
        if labels is not None:
            loss = nn.functional.cross_entropy(logits, labels)
            return {'loss': loss, 'logits': logits}
        return {'logits': logits}

In this example, we define a custom model that inherits from the PyTorch nn.Module class. The model has three branches: the BERT encoder, a numerical layer that maps the numerical features to a 64-dimensional vector, and an embedding layer that maps the categorical feature to a 16-dimensional vector. We concatenate the [CLS] output from BERT with these two vectors and pass the result through a linear layer to get the logits. When labels are provided, the forward pass also computes the cross-entropy loss, which is what the Hugging Face Trainer expects a model to return.
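
As a rough sketch (using random dummy inputs with a batch size of 2 and the same dimensions assumed above), a single forward pass can verify the output shapes before training:

python
import torch

model = CustomModel(num_numerical_features=2, num_categorical_features=4, num_labels=2)
dummy_batch = {
    'input_ids': torch.randint(0, 30522, (2, 128)),       # 30522 is the bert-base-uncased vocabulary size
    'attention_mask': torch.ones(2, 128, dtype=torch.long),
    'token_type_ids': torch.zeros(2, 128, dtype=torch.long),
    'numerical_data': torch.randn(2, 2),
    'categorical_data': torch.randint(0, 4, (2, 1)),
    'labels': torch.tensor([0, 1]),
}
output = model(**dummy_batch)
print(output['loss'])           # scalar cross-entropy loss
print(output['logits'].shape)   # torch.Size([2, 2])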

Step 3: Train the Model

The final step is to train the model using the Hugging Face Trainer. We'll use the AdamW optimizer and the cross-entropy loss function. We'll also define a custom metric that calculates the accuracy:

python
from transformers import BertTokenizer
from sklearn.metrics import accuracy_score
from torch.optim import AdamW
from transformers import get_scheduler

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

train_dataset = CustomDataset(train_df, tokenizer, max_length=128)
val_dataset = CustomDataset(val_df, tokenizer, max_length=128)

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# num_categorical_features is the number of distinct values in the 'cat' column (4 here)
model = CustomModel(num_numerical_features=2, num_categorical_features=4, num_labels=2).to(device)

optimizer = AdamW(model.parameters(), lr=5e-5)

# Approximate number of optimizer steps: batches per epoch (batch size 16) times 5 epochs,
# matching the TrainingArguments below
num_training_steps = (len(train_dataset) // 16 + 1) * 5
scheduler = get_scheduler("linear", optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

def compute_accuracy(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # the Trainer expects compute_metrics to return a dictionary
    return {'accuracy': accuracy_score(labels, preds)}

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_accuracy,
    optimizers=(optimizer, scheduler),  # overrides the optimizer/scheduler settings in TrainingArguments
)

trainer.train()

In this example, we first create the tokenizer and datasets and move the model to the GPU if one is available. We then create the optimizer, scheduler, and custom metric. Finally, we create the Trainer object, passing the optimizer and scheduler as the optimizers tuple, and call the train method.
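
Once training is finished, the same Trainer object can be used for evaluation and prediction on the validation set; a minimal sketch:

python
# Evaluate on the validation set; the metrics include the custom accuracy defined above
metrics = trainer.evaluate()
print(metrics)

# Raw logits for the validation set, converted to predicted class indices
predictions = trainer.predict(val_dataset)
print(predictions.predictions.argmax(-1)[:10])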

