7 Advanced Feature Engineering Tricks for Text Data Using LLM Embeddings


Introduction

Large language models (LLMs) are not only good at understanding and generating text; they can also turn raw text into numerical representations called embeddings. These embeddings are useful for incorporating additional information into traditional predictive machine learning models—such as those used in scikit-learn—to improve downstream performance.

This article presents seven advanced Python feature engineering tricks that extract extra value from text data by leveraging LLM-generated embeddings. These tricks can improve the accuracy and robustness of downstream machine learning models that rely on text, in applications such as sentiment analysis, topic classification, document clustering, and semantic similarity detection.

Common Setup for All Examples

Unless stated otherwise, the seven example tricks below make use of this common setup. We rely on Sentence Transformers for embeddings and scikit-learn for modeling utilities.

!pip install sentence-transformers scikit-learn -q

from sentence_transformers import SentenceTransformer
import numpy as np

# Load a lightweight LLM embedding model; it produces 384-dimensional embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
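
As a quick sanity check (assuming the model downloads without issue), encoding a couple of short strings should produce a 2 x 384 array:

sample_emb = model.encode(["a short sentence", "another short sentence"])
print(sample_emb.shape)  # expected: (2, 384)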

1. Combining TF-IDF and Embedding Features

The first example shows how to jointly extract—given a source text dataset like fetch_20newsgroups—both TF-IDF and LLM-generated sentence-embedding features. We then combine these feature types to train a logistic regression model that classifies news texts based on the combined features, often boosting accuracy by capturing both lexical and semantic information.


from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Loading data
data = fetch_20newsgroups(subset='train', categories=['sci.space', 'rec.autos'])
texts, y = data.data[:500], data.target[:500]

# Extracting features of two broad types
tfidf = TfidfVectorizer(max_features=300).fit_transform(texts).toarray()
emb = model.encode(texts, show_progress_bar=False)

# Combining features and training the ML model
X = np.hstack([tfidf, StandardScaler().fit_transform(emb)])
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Training accuracy:", clf.score(X, y))  # scored on the same data it was fit on

2. Topic-Aware Embedding Clusters

This trick takes a few sample text sequences, generates embeddings using the preloaded language model, applies K-Means clustering on these embeddings to assign topics, and then combines the embeddings with a one-hot encoding of each example’s cluster identifier (its “topic class”) to build a new feature representation. It is a useful strategy for creating compact topic meta-features.

from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

texts = ["Tokyo Tower is a popular landmark.", "Sushi is a traditional Japanese dish.",
         "Mount Fuji is a famous volcano in Japan.", "Cherry blossoms bloom in the spring in Japan."]
emb = model.encode(texts)

topics = KMeans(n_clusters=2, n_init='auto', random_state=42).fit_predict(emb)
topic_ohe = OneHotEncoder(sparse_output=False).fit_transform(topics.reshape(-1, 1))

X = np.hstack([emb, topic_ohe])
print(X.shape)
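
Because the fitted K-Means model and the encoder are reusable, the same topic meta-feature can be computed for unseen text. A minimal sketch, keeping the clustering objects around and using an illustrative new sentence:

km = KMeans(n_clusters=2, n_init='auto', random_state=42).fit(emb)
ohe = OneHotEncoder(sparse_output=False).fit(km.labels_.reshape(-1, 1))

new_emb = model.encode(["Kyoto has many historic temples."])
new_topic = ohe.transform(km.predict(new_emb).reshape(-1, 1))
X_new = np.hstack([new_emb, new_topic])
print(X_new.shape)  # (1, 386): 384 embedding dimensions + 2 topic indicators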

3. Semantic Anchor Similarity Features

This simple strategy computes similarity to a small set of fixed “anchor” (or reference) sentences used as compact semantic descriptors—essentially, semantic landmarks. Each column in the similarity-feature matrix contains the similarity of the text to one anchor. The main value lies in allowing the model to learn relationships between the text’s similarity to key concepts and a target variable—useful for text classification models.

from sklearn.metrics.pairwise import cosine_similarity

anchors = ["space mission", "car performance", "politics"]
anchor_emb = model.encode(anchors)

texts = ["The rocket launch was successful.", "The car handled well on the track."]
emb = model.encode(texts)

sim_features = cosine_similarity(emb, anchor_emb)
print(sim_features)
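
Once labels are available, these anchor-similarity columns can be used on their own or stacked with the embeddings. A minimal sketch with made-up labels (1 = space-related, 0 = not):

from sklearn.linear_model import LogisticRegression

y = np.array([1, 0])  # illustrative labels for the two example texts
X = np.hstack([emb, sim_features])  # embeddings plus anchor similarities
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Positive-class probabilities:", clf.predict_proba(X)[:, 1])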

4. Meta-Feature Stacking via Auxiliary Sentiment Classifier

For text associated with labels such as sentiments, the following feature-engineering technique adds extra value. A meta-feature is built as the prediction probability returned by an auxiliary classifier trained on the embeddings. This meta-feature is stacked with the original embeddings, resulting in an augmented feature set that can improve downstream performance by exposing potentially more discriminative information than raw embeddings alone.

A slight additional setup is needed for this example:


!pip install sentence-transformers scikit-learn -q

from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim

# Small dataset containing texts and sentiment labels
texts = ["I love this!", "This is terrible.", "Amazing quality.", "Not good at all."]
y = np.array([1, 0, 1, 0])

# Obtain embeddings from the embedder LLM
emb = embedder.encode(texts, show_progress_bar=False)

# Train an auxiliary classifier on embeddings
X_train, X_test, y_train, y_test = train_test_split(
    emb, y, test_size=0.5, random_state=42, stratify=y
)
meta_clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Leverage the auxiliary model's predicted probability as a meta-feature
meta_feature = meta_clf.predict_proba(emb)[:, 1].reshape(-1, 1)  # Prob of positive class

# Augment original embeddings with the meta-feature
# Do not forget to scale again for consistency
scaler = StandardScaler()
emb_scaled = scaler.fit_transform(emb)
X_aug = np.hstack([emb_scaled, meta_feature])  # Stack features together

print("emb shape:", emb.shape)
print("meta_feature shape:", meta_feature.shape)
print("augmented shape:", X_aug.shape)
print("meta clf accuracy on test slice:", meta_clf.score(X_test, y_test))

5. Embedding Compression and Nonlinear Expansion

This strategy applies PCA dimensionality reduction to compress the raw embeddings built by the LLM and then polynomially expands these compressed embeddings. It may sound odd at first, but this can be an effective approach to capture nonlinear structure while maintaining efficiency.


!pip install sentence-transformers scikit-learn -q

from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Loading a lightweight embedding language model
embedder = SentenceTransformer("all-MiniLM-L6-v2")

texts = ["The satellite was launched into orbit.",
         "Cars require regular maintenance.",
         "The telescope observed distant galaxies."]

# Obtaining embeddings
emb = embedder.encode(texts, show_progress_bar=False)

# Compressing with PCA and enriching with polynomial features
pca = PCA(n_components=2).fit_transform(emb)  # with only 3 texts, at most 2 components are possible
poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(pca)

print("Original shape:", emb.shape)
print("After PCA:", pca.shape)
print("After polynomial expansion:", poly.shape)

6. Relational Learning with Pairwise Contrastive Features

The goal here is to build pairwise relational features from text embeddings. Interrelated features—constructed in a contrastive fashion—can highlight aspects of similarity and dissimilarity. This is particularly effective for predictive processes that inherently entail comparisons among texts.


!pip install sentence-transformers -q

from sentence_transformers import SentenceTransformer
import numpy as np

# Loading embedder
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Example text pairs
pairs = [
    ("The car is fast.", "The vehicle moves quickly."),
    ("The sky is blue.", "Bananas are yellow.")
]

# Generating embeddings for both sides
emb1 = embedder.encode([p[0] for p in pairs], show_progress_bar=False)
emb2 = embedder.encode([p[1] for p in pairs], show_progress_bar=False)

# Building contrastive features: absolute difference and element-wise product
X_pairs = np.hstack([np.abs(emb1 - emb2), emb1 * emb2])
print("Pairwise feature shape:", X_pairs.shape)

7. Cross-Modal Fusion

The last trick combines LLM embeddings with simple linguistic or numeric features, such as punctuation ratio or other domain-specific engineered features. Uniting semantic signals with handcrafted linguistic cues yields a more holistic text representation. Here is an example that measures word count and punctuation ratio.


!pip install sentence-transformers -q

from sentence_transformers import SentenceTransformer
import numpy as np, re

# Loading embedder
embedder = SentenceTransformer("all-MiniLM-L6-v2")

texts = ["Mars mission 2024!", "New electric car model launched."]

# Computing embeddings
emb = embedder.encode(texts, show_progress_bar=False)

# Adding simple numeric text features
lengths = np.array([len(t.split()) for t in texts]).reshape(-1, 1)
punct_ratio = np.array([len(re.findall(r"[^\w\s]", t)) / len(t) for t in texts]).reshape(-1, 1)

# Combining all features
X = np.hstack([emb, lengths, punct_ratio])
print("Final feature matrix shape:", X.shape)

Wrapping Up

We explored seven advanced feature-engineering tricks that help extract more information from raw text, going beyond LLM-generated embeddings alone. These practical strategies can boost downstream machine learning models that take text as input by capturing complementary lexical, semantic, relational, and handcrafted signals.
