Harnessing Advanced Machine Learning for Text Classification at Razi Title: A Technical Exploration

Mar 15, 2024 6:55:02 AM | Tech Tip

At Razi Title, our commitment to innovation drives us to explore and implement cutting-edge technologies. In this blog, I'll take you through our journey in developing a sophisticated text classification model using TensorFlow, delving into the advanced machine learning techniques we've employed.


import os

# Hide GPUs from TensorFlow so computation runs on the CPU.
# Set this before TensorFlow is imported (directly or via the attention layer module).
os.environ["CUDA_VISIBLE_DEVICES"] = ""

from layers.attention_layers import ScaledDotProductAttention

Our journey begins with preparing our machine learning environment. We chose TensorFlow, a robust and versatile framework that lets us build complex models with relative ease. Running computations on the CPU was a pragmatic choice: the PC we used has a modern i9 processor, which outperformed its Nvidia video card with 12 GB of VRAM. As a bonus, the model stays deployable in diverse environments without depending on specific hardware capabilities.



import os
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.layers import Dense, Dropout, Embedding, Input, Flatten
from tensorflow.keras.callbacks import EarlyStopping
import tensorflow as tf

from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical


def load_classified_data(input_directory):
    data = []
    labels = []

    for label_dir in os.listdir(input_directory):
        full_dir_path = os.path.join(input_directory, label_dir)
        if os.path.isdir(full_dir_path):
            for file_name in os.listdir(full_dir_path):
                file_path = os.path.join(full_dir_path, file_name)
                with open(file_path, "r") as file:
                    text = file.read()
                    data.append(text)
                    labels.append(label_dir)

    return data, labels

Our load_classified_data function is more than just a data loader; it gives our data the structure that supervised learning needs. It walks each subdirectory of the input directory, reads every file inside, and records the subdirectory name as that file's label. This is how the model learns to differentiate between distinct categories, which is vital in automating document classification in real estate transactions. In short: put the text files into folders, and each folder name becomes the label for the files inside it.
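To make the folder-to-label convention concrete, here is a minimal sketch that builds a throwaway directory tree and loads it. The loader is a lightly restated version of the function above (sorting and an explicit encoding are added here so the output order is deterministic), and the folder names, file names, and text are made up for illustration.

```python
import os
import tempfile


def load_classified_data(input_directory):
    """Read every file under each label folder; the folder name is the label."""
    data, labels = [], []
    for label_dir in sorted(os.listdir(input_directory)):
        full_dir_path = os.path.join(input_directory, label_dir)
        if not os.path.isdir(full_dir_path):
            continue  # skip stray files at the top level
        for file_name in sorted(os.listdir(full_dir_path)):
            file_path = os.path.join(full_dir_path, file_name)
            with open(file_path, "r", encoding="utf-8") as f:
                data.append(f.read())
            labels.append(label_dir)
    return data, labels


# Hypothetical sample layout: one folder per document type.
sample = {
    "deed": {"page_001.txt": "This deed made this day between grantor and grantee"},
    "mortgage": {"page_001.txt": "The borrower promises to pay the lender"},
}

with tempfile.TemporaryDirectory() as root:
    for label, files in sample.items():
        os.makedirs(os.path.join(root, label))
        for name, text in files.items():
            with open(os.path.join(root, label, name), "w", encoding="utf-8") as f:
                f.write(text)
    data, labels = load_classified_data(root)

print(labels)  # ['deed', 'mortgage']
print(len(data))  # 2
```

Adding a new document type under this convention only requires a new folder; no code changes are needed.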


if __name__ == "__main__":
    input_directory = "/home/rzwink/working/cinnabar/ld/20231228094239"
    data, labels = load_classified_data(input_directory)

    # Vectorization and embedding settings
    NUM_WORDS = 1000
    EMBEDDING_DIM = 16
    MAX_LENGTH = 500
    # Training settings
    NUM_EPOCHS = 20
    BATCH_SIZE = 32  # You can adjust this value as needed

    vectorizer = TextVectorization(
        max_tokens=NUM_WORDS, output_sequence_length=MAX_LENGTH, output_mode="int"
    )
    vectorizer.adapt(data)

    # Build a word -> index lookup from the vectorizer vocabulary
    vocab = vectorizer.get_vocabulary()
    vocab_dict = {word: index for index, word in enumerate(vocab)}

    # Encoding the labels
    label_encoder = LabelEncoder()
    encoded_labels = label_encoder.fit_transform(labels)

    # Save the label encoder classes (np.save needs the directory to exist)
    os.makedirs("models/page_classifier", exist_ok=True)
    np.save("models/page_classifier/label_encoder.npy", label_encoder.classes_)

    X_train, X_test, y_train, y_test = train_test_split(
        np.array([[s] for s in data]),
        to_categorical(encoded_labels),
        test_size=0.2,
        random_state=42,
    )

    # Define the model with ScaledDotProductAttention
    text_input = Input(shape=(1,), dtype=tf.string)  # Adjusted for a batch of strings
    x = vectorizer(text_input)
    x = Embedding(NUM_WORDS, EMBEDDING_DIM, input_length=MAX_LENGTH)(x)

    # Create query, key, value for attention
    query = Dense(EMBEDDING_DIM)(x)
    key = Dense(EMBEDDING_DIM)(x)
    value = Dense(EMBEDDING_DIM)(x)
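The three Dense projections above produce the query, key, and value tensors for the attention step. ScaledDotProductAttention is imported from a project-local module that this post does not show, so the following is only a NumPy sketch of the standard formulation, softmax(QK^T / sqrt(d_k))V. It assumes the layer follows that formulation and that the second value it returns is the attention weights; the function name and the small tensor shapes are illustrative.

```python
import numpy as np


def scaled_dot_product_attention(query, key, value):
    """Standard attention: softmax(Q K^T / sqrt(d_k)) V.

    query, key, value: arrays of shape (batch, seq_len, depth).
    Returns (output, weights), assumed to mirror the two values the post's layer returns.
    """
    depth = query.shape[-1]
    # Similarity of every query position with every key position, scaled by sqrt(depth)
    scores = query @ np.swapaxes(key, -1, -2) / np.sqrt(depth)
    # Numerically stable softmax over the key axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted average of the value vectors
    return weights @ value, weights


rng = np.random.default_rng(0)
q = rng.normal(size=(2, 5, 16))
k = rng.normal(size=(2, 5, 16))
v = rng.normal(size=(2, 5, 16))
output, weights = scaled_dot_product_attention(q, k, v)
print(output.shape)   # (2, 5, 16)
print(weights.shape)  # (2, 5, 5)
```

If the real layer preserves the sequence shape in the same way, the attention output in the model has shape (batch, 500, 16), and flattening it yields 8,000 features per document.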

In this section, we dive into the nuances of text vectorization and label encoding. Text vectorization converts raw text into integer tokens, a format that neural networks can process. This step is crucial for extracting meaningful patterns from text data, enabling the model to learn from linguistic structures. Label encoding does the same for the targets: LabelEncoder maps each folder name to an integer, and to_categorical turns those integers into one-hot vectors for the softmax output. Limiting the vocabulary to 1,000 tokens (NUM_WORDS) and each document to 500 tokens (MAX_LENGTH) balances model complexity with computational efficiency.
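As a rough illustration of what the vectorization step does, here is a pure-Python sketch that mimics the default behavior of TextVectorization: lowercase, strip punctuation, split on whitespace, keep the most frequent tokens, and pad or truncate to a fixed length. It is not the Keras implementation; the helper names and sample sentences are made up, and index 0 (padding) and index 1 (unknown token) are reserved to mirror the Keras layer.

```python
import re
import string
from collections import Counter

PAD, UNK = 0, 1  # reserved indices, mirroring TextVectorization's "" and "[UNK]"


def tokenize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    return text.split()


def build_vocab(texts, max_tokens):
    """Keep the most frequent tokens; two slots are reserved for PAD and UNK."""
    counts = Counter(tok for t in texts for tok in tokenize(t))
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i + 2 for i, tok in enumerate(most_common)}


def vectorize(text, vocab, max_length):
    """Map tokens to ids, truncate to max_length, and right-pad with PAD."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(text)][:max_length]
    return ids + [PAD] * (max_length - len(ids))


texts = ["The deed is recorded.", "The mortgage is recorded."]
vocab = build_vocab(texts, max_tokens=5)  # room for 3 real tokens
ids = vectorize("The deed is recorded today", vocab, max_length=8)
print(vocab)
print(ids)
```

With only three vocabulary slots, the rarer words ("deed", "today") fall back to the unknown index, which is the trade-off a small NUM_WORDS makes: a smaller embedding table at the cost of losing rare terms.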

  
    # Apply Scaled Dot Product Attention
    attention_output, _ = ScaledDotProductAttention()(query, key, value)

    x = Flatten()(attention_output)

    x = Dense(24, activation="relu")(x)
    x = Dropout(0.5)(x)
    outputs = Dense(len(label_encoder.classes_), activation="softmax")(x)

    model = Model(text_input, outputs)

    # Compile and train the model
    model.compile(
        loss="categorical_crossentropy",
        optimizer=Adam(learning_rate=0.001),
        metrics=["accuracy"],
    )
    # Define the early stopping callback
    early_stopping = EarlyStopping(
        monitor="val_loss",  # Monitor the validation loss
        patience=3,  # Number of epochs with no improvement after which training will be stopped
        restore_best_weights=True,  # Restore model weights from the epoch with the best value of the monitored quantity
    )

    # Train the model with early stopping
    model.fit(
        X_train,
        y_train,
        epochs=NUM_EPOCHS,
        batch_size=BATCH_SIZE,
        validation_data=(X_test, y_test),
        callbacks=[early_stopping],  # Add the callback here
    )
    # Save the model
    model.save("models/page_classifier", save_format="tf")

Training a deep learning model is a balance between learning enough and not overfitting. The EarlyStopping callback manages that balance for us: it monitors validation loss after each epoch, stops training once three consecutive epochs bring no improvement (patience=3), and restores the weights from the best epoch. This helps the model generalize to new, unseen data, which is crucial in the dynamic field of real estate, where new document types and terminology emerge regularly.
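The stopping rule itself is simple enough to sketch in plain Python. The function below is an illustrative stand-in for the callback's bookkeeping, not the Keras source, and the validation losses are made-up numbers.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once `patience` consecutive epochs fail to improve on the
    best validation loss seen so far. best_epoch is the epoch whose weights
    would be restored when restore_best_weights=True.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs before patience was exhausted
    return len(val_losses) - 1, best_epoch


# Made-up validation losses for illustration only.
losses = [0.90, 0.70, 0.62, 0.65, 0.64, 0.66, 0.61]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)  # 5 2
```

In this example training halts at epoch 5 and never sees the lower loss at epoch 6. A larger patience tolerates that kind of noisy plateau at the cost of longer training.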


    # Evaluate the model
    y_pred = model.predict(X_test)
    y_pred_labels = label_encoder.inverse_transform([np.argmax(y) for y in y_pred])
    y_test_labels = label_encoder.inverse_transform([np.argmax(y) for y in y_test])

    # Print classification report
    print(classification_report(y_test_labels, y_pred_labels, zero_division=1))
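For readers unfamiliar with classification_report, the per-class precision, recall, and F1 score it prints can be sketched by hand. The function and the labels below are made up for illustration, and this sketch reports 0.0 when a class is never predicted, whereas the script's zero_division=1 setting would report a precision of 1.0 in that case.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed from paired labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # missed the true class
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


# Hypothetical true and predicted document types.
y_true = ["deed", "deed", "mortgage", "mortgage", "survey"]
y_pred = ["deed", "mortgage", "mortgage", "mortgage", "deed"]

for label, (p, r, f1) in per_class_metrics(y_true, y_pred).items():
    print(f"{label:10s} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Here "survey" is never predicted, so its precision is undefined. How that case is scored can noticeably change the averages for rare document types, so it is worth checking when reading the report.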

The evaluation phase is where theory meets practice. The classification report breaks precision, recall, and F1 score down by document type, which shows us where the model is strong and where it confuses one category for another. This helps us fine-tune the model for better accuracy in real-world applications, such as automating title searches and document verification processes at Razi Title. One caveat: the script passes the same held-out split to validation_data during training and to the final report, so these scores are likely somewhat optimistic; a separate validation split would give a cleaner estimate.

Through this deep dive into our machine learning model, we at Razi Title are not just embracing technological advancement; we're actively shaping it to fit our unique industry needs. This model represents a significant step in our ongoing journey to revolutionize the real estate sector with AI-driven solutions.

Written By: Robert Zwink