SecBERT to detect anomalous log entries
Hmm… If you are not specifically constrained to SecBERT:
I would treat SecBERT as one candidate encoder, not as the entire anomaly-detection solution.
Your current approach is roughly:
HTTP request / log entry
-> SecBERT tokenizer
-> SecBERT encoder
-> classification head
-> normal / anomalous
That is a valid experiment, but it is only one formulation. SecBERT itself is not a ready-made SIEM anomaly detector. It is a cybersecurity-domain pretrained language model. The actual detector is the full pipeline around it: preprocessing, tokenization, label mapping, classification head, checkpoint loading, validation metrics, and thresholding.
For context:
- SecBERT is described as a BERT model trained on cybersecurity text.
- The original SecBERT GitHub repo frames it as a language model that can improve downstream tasks such as NER, text classification, semantic understanding, and Q&A in the cybersecurity domain.
- CSIC 2010 is generated HTTP request data, with normal/anomalous labels and attacks such as SQL injection, buffer overflow, information gathering, file disclosure, CRLF injection, XSS, server-side include, and parameter tampering.
So I would separate two questions:
- Why is the current SecBERT setup predicting all diagnostic examples as anomalous?
- Is SecBERT the best formulation for CSIC / SIEM-style anomaly detection?
Those are related, but not identical.
1. First debug the current “everything is anomalous” behavior
Before changing models, I would verify that the current pipeline is not simply misconfigured.
A model predicting only one class is a common failure mode in fine-tuning workflows. There are similar Hugging Face Forum threads where fine-tuned BERT/RoBERTa models always predicted the same class, and the underlying causes discussed included checkpoint loading, class imbalance, metrics, and training/evaluation setup:
- Fine-tuned model always predicts same output class for new data
- BERT and RoBERTa giving same outputs
- Fine-tuned RoBERTa only predicting one category
I would check at least these items:
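```python
# Confirm which class id maps to which label name
print(model.config.id2label)
print(model.config.label2id)

# Check class balance (counts and proportions) in every split
for name, split in {
    "train": train_df,
    "val": val_df,
    "test": test_df,
}.items():
    print("\n", name)
    print(split["label"].value_counts())
    print(split["label"].value_counts(normalize=True))
```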
Things I would specifically verify:
- `normal` and `anomalous` are mapped consistently in `label2id` and `id2label`.
- The inference code is loading the fine-tuned checkpoint, not the base SecBERT model.
- The tokenizer is loaded from the same checkpoint or compatible base model.
- The 2,000-row CSIC sample is not badly skewed.
- Train/validation/test splits are stratified.
- Validation metrics include per-class precision, recall, F1, and confusion matrix.
- The HTTP requests are not truncated before the useful payload appears.
- The final decision threshold is calibrated on validation data, rather than relying only on `argmax` or a fixed 0.5 threshold.
For text classification with Transformers, the usual pattern is to attach a sequence-classification head, train it, and evaluate it with task-specific metrics. The Hugging Face guide for text classification is a useful reference for that setup.
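As a minimal sketch of the stratified-split check from the list above: this assumes the sample lives in a pandas DataFrame with hypothetical `text` and `label` columns (0 = normal, 1 = anomalous). The toy data and the 70/15/15 ratio are illustrative only, not taken from the original setup.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the CSIC sample (hypothetical columns: "text", "label").
df = pd.DataFrame({
    "text": [f"GET /item?id={i}" for i in range(200)],
    "label": [0] * 140 + [1] * 60,  # 0 = normal, 1 = anomalous
})

# Split off 30% for validation + test while keeping the class ratio.
train_df, rest_df = train_test_split(
    df, test_size=0.30, stratify=df["label"], random_state=42
)
# Split that 30% in half, again stratified on the label.
val_df, test_df = train_test_split(
    rest_df, test_size=0.50, stratify=rest_df["label"], random_state=42
)

# The anomalous share should be close to identical in every split.
for name, split in {"train": train_df, "val": val_df, "test": test_df}.items():
    print(name, len(split), round(float(split["label"].mean()), 3))
```

If the anomalous share differs noticeably between splits, validation metrics may say more about the sampling than about the model.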
I would also inspect logits/probabilities for a small batch:
```python
import torch
import pandas as pd

def inspect_predictions(model, tokenizer, texts, max_length=256):
    """Return logits, probabilities, and predicted labels for a binary head."""
    model.eval()
    rows = []
    for text in texts:
        inputs = tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            padding=True,
            max_length=max_length,
        )
        # Keep the inputs on the same device as the model (CPU or GPU).
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
        with torch.no_grad():
            outputs = model(**inputs)
        logits = outputs.logits[0].detach().cpu()
        probs = torch.softmax(logits, dim=-1)
        pred_id = int(torch.argmax(probs))
        pred_label = model.config.id2label.get(pred_id, str(pred_id))
        rows.append({
            "text": text[:160],
            # A count equal to max_length suggests the request was truncated.
            "num_tokens": int(inputs["attention_mask"].sum()),
            "logit_0": float(logits[0]),
            "logit_1": float(logits[1]),
            "prob_0": float(probs[0]),
            "prob_1": float(probs[1]),
            "pred_id": pred_id,
            "pred_label": pred_label,
        })
    return pd.DataFrame(rows)
```
If both normal and anomalous examples get almost the same probability distribution, the model is probably not separating the classes yet.
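To make that check less of an eyeball test, one option is to summarize the anomalous-class probability per true class. The sketch below assumes a DataFrame shaped like the output of `inspect_predictions` with an added, hypothetical `true_label` column; the scores are made up to show what a collapsed model could look like.

```python
import pandas as pd

def summarize_scores(pred_df, score_col="prob_1", label_col="true_label"):
    """Summarize the anomalous-class probability separately for each true class."""
    summary = pred_df.groupby(label_col)[score_col].agg(["count", "mean", "min", "max"])
    # A gap near 0 between the class means suggests the classes are not separated.
    gap = summary["mean"].max() - summary["mean"].min()
    return summary, float(gap)

# Made-up scores for illustration only.
collapsed = pd.DataFrame({
    "true_label": ["normal", "normal", "anomalous", "anomalous"],
    "prob_1": [0.91, 0.93, 0.92, 0.94],
})
summary, gap = summarize_scores(collapsed)
print(summary)
print("mean gap:", round(gap, 3))
```

A healthy classifier would usually show a clearly lower mean `prob_1` for normal requests than for anomalous ones.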
2. For CSIC HTTP requests, add simple string/protocol baselines
For CSIC, I would not start by assuming that a cybersecurity language model is the strongest representation.
CSIC is HTTP request data, not ordinary natural-language security prose. Many important signals are string-level or protocol-structure-level:
- URL encoding
- SQLi tokens
- XSS markers
- CRLF markers
- path traversal markers
- unusual parameter values
- long payloads
- high special-character density
- odd HTTP method/path/body combinations
A simple character n-gram model may be surprisingly strong here.
I would build at least this baseline:
```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# X_train / X_val: lists of raw request strings
# y_train / y_val: labels encoded as 0 = normal, 1 = anomalous
clf = Pipeline([
    ("tfidf", TfidfVectorizer(
        analyzer="char_wb",
        ngram_range=(3, 5),
        min_df=2,
    )),
    ("lr", LogisticRegression(
        max_iter=1000,
        class_weight="balanced",  # compensate for class imbalance
    )),
])

clf.fit(X_train, y_train)
pred = clf.predict(X_val)

print(confusion_matrix(y_val, pred))
print(classification_report(y_val, pred, target_names=["normal", "anomalous"]))
```
TfidfVectorizer supports character n-gram features via analyzer="char" or analyzer="char_wb"; see the scikit-learn docs for TfidfVectorizer.
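To see what the two analyzers actually produce, the sketch below prints the character 3-grams for a made-up request fragment. It only uses `build_analyzer()`, which works without fitting; the example string is an assumption for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

request = "id=1 OR 1=1"  # made-up fragment

# "char" slides over the whole string, so n-grams can span whitespace.
char_ngrams = TfidfVectorizer(
    analyzer="char", ngram_range=(3, 3)
).build_analyzer()(request)

# "char_wb" builds n-grams only inside whitespace-delimited tokens,
# padding each token with a space on both sides.
char_wb_ngrams = TfidfVectorizer(
    analyzer="char_wb", ngram_range=(3, 3)
).build_analyzer()(request)

print(char_ngrams)
print(char_wb_ngrams)
```

Raw URL-encoded requests often contain few literal spaces, so the two analyzers may behave similarly on CSIC; it is probably worth trying both rather than assuming one is better.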
Interpretation:
| Result | Likely interpretation |
|---|---|
| TF-IDF baseline works, SecBERT fails | The data may be separable, but the Transformer setup/preprocessing/training may be wrong. |
| Both TF-IDF and SecBERT fail | Check labels, split, preprocessing, sampling, and evaluation. |
| TF-IDF beats SecBERT | This task may be more string-pattern-heavy than language-semantics-heavy. |
| SecBERT beats TF-IDF | Good; then SecBERT may be adding useful cyber-domain representation. |
I would also add HTTP-specific handcrafted features:
```python
import re
from urllib.parse import unquote

def extract_http_features(text):
    """Compute simple string-level features from one raw HTTP request."""
    raw = text.lower()
    # Note: unquote() decodes %XX escapes but leaves "+" as-is.
    decoded = unquote(raw)
    return {
        "length": len(decoded),
        "num_digits": sum(c.isdigit() for c in decoded),
        "num_special": sum(c in "'\"<>;(){}[]" for c in decoded),
        "num_params": decoded.count("&") + decoded.count("?"),
        "has_sql_keyword": int(bool(re.search(
            r"\b(select|union|drop|insert|update|delete|or 1=1)\b",
            decoded,
        ))),
        "has_xss_marker": int(bool(re.search(
            r"(<script|javascript:|onerror=|onload=)",
            decoded,
        ))),
        # "..%2f" survives one round of decoding only for double-encoded input.
        "has_path_traversal": int("../" in decoded or "..%2f" in decoded),
        # CRLF markers are checked on the raw text, before decoding.
        "has_crlf": int("%0d" in raw or "%0a" in raw),
    }
```
Those features can be fed into Logistic Regression, Random Forest, LightGBM, XGBoost, or combined with text features.
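One possible way to combine handcrafted features with character n-grams is a scikit-learn `FeatureUnion`. The sketch below is only an illustration: `simple_features` is a tiny hypothetical stand-in for a function like `extract_http_features`, and the toy requests and labels are made up.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, MaxAbsScaler

def simple_features(text):
    """Hypothetical stand-in for a handcrafted HTTP feature function."""
    lowered = text.lower()
    return {
        "length": len(lowered),
        "num_special": sum(c in "'\"<>;(){}[]" for c in lowered),
        "has_script": int("<script" in lowered),
    }

def to_feature_dicts(texts):
    return [simple_features(t) for t in texts]

combined = Pipeline([
    ("features", FeatureUnion([
        ("char_tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))),
        ("handcrafted", Pipeline([
            ("to_dict", FunctionTransformer(to_feature_dicts)),
            ("vectorize", DictVectorizer()),
            ("scale", MaxAbsScaler()),  # keep counts on a scale similar to TF-IDF
        ])),
    ])),
    ("lr", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

# Made-up toy requests: 0 = normal, 1 = anomalous.
X_toy = [
    "GET /index.jsp",
    "GET /shop/cart.jsp?id=3",
    "POST /login.jsp user=alice&pwd=secret",
    "GET /search.jsp?q=shoes",
    "GET /search.jsp?q=<script>alert(1)</script>",
    "GET /item.jsp?id=1' or '1'='1",
    "POST /login.jsp user=admin';drop table users;--",
    "GET /file.jsp?path=../../etc/passwd",
]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]

combined.fit(X_toy, y_toy)
pred = combined.predict(["GET /search.jsp?q=<script>alert(2)</script>"])
print(pred)
```

Scaling the handcrafted block matters for linear models because raw lengths and counts would otherwise dominate the TF-IDF values; tree-based models are less sensitive to this.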
3. Position SecBERT correctly
I would frame SecBERT like this:
SecBERT is not the detector by itself.
SecBERT is the encoder.
The detector is:
preprocessing
+ tokenizer
+ encoder
+ classification head or anomaly scorer
+ labels or normal-only training data
+ validation metrics
+ thresholding
This matters because changing only the encoder will not fix problems such as:
- reversed labels,
- wrong checkpoint loading,
- skewed sampling,
- bad validation split,
- truncation,
- uncalibrated threshold,
- or evaluating only four hand-picked examples.
If you want to keep the supervised Transformer route, compare cyber-domain encoders under the same evaluation setup:
| Encoder candidate | Notes |
|---|---|
| SecBERT | Cybersecurity-text pretrained BERT. Reasonable current baseline. |
| SecRoBERTa | RoBERTa variant from the same SecBERT family. |
| SecureBERT | RoBERTa-based cybersecurity-domain language model. |
| SecureBERT 2.0 | Newer Cisco cybersecurity/threat-intelligence encoder based on ModernBERT. |
| General BERT/RoBERTa | Useful sanity check: domain pretraining may or may not help this specific HTTP-request task. |
But I would only do this after the evaluation setup is clean. Otherwise, you may just be changing the model while keeping the same bug.
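A small harness can help keep that comparison fair by scoring every candidate against the same validation labels with the same metrics. In this sketch, `predictions_by_model` is a hypothetical dict of model name to predicted class ids, and the predictions are made up.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

def compare_candidates(y_true, predictions_by_model, labels=(0, 1)):
    """Score every candidate on the same labels (0 = normal, 1 = anomalous)."""
    rows = []
    for name, y_pred in predictions_by_model.items():
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, labels=list(labels), zero_division=0
        )
        # Rows are true classes, columns are predicted classes.
        tn, fp, fn, tp = confusion_matrix(
            y_true, y_pred, labels=list(labels)
        ).ravel()
        rows.append({
            "model": name,
            "normal_recall": recall[0],
            "anomalous_precision": precision[1],
            "anomalous_recall": recall[1],
            "anomalous_f1": f1[1],
            "false_positives": int(fp),
            "false_negatives": int(fn),
        })
    return pd.DataFrame(rows).set_index("model")

# Made-up predictions for illustration.
y_val = [0, 0, 0, 0, 1, 1, 1, 1]
report = compare_candidates(y_val, {
    "always_anomalous": [1] * 8,
    "hypothetical_encoder": [0, 0, 0, 1, 1, 1, 1, 0],
})
print(report)
```

Including a trivial "always anomalous" row makes the collapsed-model failure mode visible: perfect anomalous recall, zero normal recall.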
4. For SIEM-style unknown anomaly detection, consider normal-only scoring
If the real goal is SIEM-style unknown anomaly detection, supervised normal vs anomalous classification on CSIC may be only a proxy task.
A SIEM-like goal is often closer to:
learn what normal looks like
-> assign anomaly/risk scores
-> alert only above a tuned threshold
That suggests another family of methods:
HTTP request / log text
-> BERT/SecBERT/SecureBERT embedding
-> IsolationForest / One-Class SVM / kNN distance / Autoencoder
-> anomaly score
-> threshold tuned on validation data
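The embedding step in the flow above is not spelled out. One common choice, which is an assumption here rather than something prescribed above, is masked mean pooling over the encoder's last hidden states. The sketch uses NumPy with random stand-in arrays instead of real encoder output; the same arithmetic would apply to framework tensors.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padding positions.

    hidden_states: array of shape (batch, seq_len, hidden_dim)
    attention_mask: array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.maximum(mask.sum(axis=1), 1.0)  # avoid division by zero
    return summed / counts

# Random stand-in for encoder output: 2 sequences, 4 tokens, 3 dimensions.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 3))
mask = np.array([[1, 1, 1, 1], [1, 1, 0, 0]])

embeddings = masked_mean_pool(hidden, mask)
print(embeddings.shape)
```

Ignoring padding matters when requests vary a lot in length; otherwise short requests would be pulled toward whatever vector the padding positions produce.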
Options worth testing:
| Method | Fit |
|---|---|
| SecBERT/SecureBERT embeddings + IsolationForest | Lightweight normal-only anomaly scoring. |
| embeddings + One-Class SVM | Classic novelty detection baseline. |
| embeddings + kNN distance | Easy to interpret: “far from nearest normal examples.” |
| embeddings + Autoencoder | Reconstruction-error-based anomaly score. |
| handcrafted HTTP features + IsolationForest | Often strong and cheap for request-level anomaly scoring. |
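As a rough sketch of two rows from the table, the code below fits IsolationForest and a kNN-distance scorer on normal-only data. The "embeddings" are synthetic Gaussian vectors standing in for real encoder output, so the clean separation it shows says nothing about how well this works on CSIC.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Synthetic stand-ins for request embeddings (not real SecBERT output).
normal_train = rng.normal(0.0, 1.0, size=(500, 16))
normal_val = rng.normal(0.0, 1.0, size=(100, 16))
anomalous_val = rng.normal(4.0, 1.0, size=(100, 16))

# IsolationForest: score_samples is lower for outliers, so negate it
# to get a score where larger means "more anomalous".
iso = IsolationForest(n_estimators=200, random_state=0).fit(normal_train)
iso_normal = -iso.score_samples(normal_val)
iso_anomalous = -iso.score_samples(anomalous_val)

# kNN distance: mean distance to the 5 nearest normal training points.
knn = NearestNeighbors(n_neighbors=5).fit(normal_train)
knn_normal = knn.kneighbors(normal_val)[0].mean(axis=1)
knn_anomalous = knn.kneighbors(anomalous_val)[0].mean(axis=1)

print("IsolationForest mean score:", iso_normal.mean(), iso_anomalous.mean())
print("kNN mean distance:", knn_normal.mean(), knn_anomalous.mean())
```

Both scorers are fitted only on normal examples; labeled anomalies are used here just to check whether the scores rank them higher.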
References:
- scikit-learn IsolationForest
- scikit-learn novelty and outlier detection
- Hugging Face Forum discussion on anomaly / out-of-domain detection with BERT
- HTTP request embedding work such as HTTP2vec
- A related BERT + autoencoder direction for HTTP anomaly detection: Contextual embeddings and autoencoders for HTTP traffic anomaly detection
For SIEM-like alerting, I would not rely only on argmax. I would tune thresholds on validation data and inspect the precision/recall trade-off. scikit-learn’s precision_recall_curve and precision-recall curve example are useful references.
Example threshold selection:
```python
import numpy as np
from sklearn.metrics import precision_recall_curve, classification_report

# y_val: 0=normal, 1=anomalous
# anom_scores: probability or anomaly score where larger means "more anomalous"
anom_scores = np.asarray(anom_scores)

precision, recall, thresholds = precision_recall_curve(y_val, anom_scores)

# precision/recall have one more entry than thresholds, so drop the last point.
f1 = 2 * precision[:-1] * recall[:-1] / np.maximum(
    precision[:-1] + recall[:-1],
    1e-12,
)

best_idx = np.nanargmax(f1)
best_threshold = thresholds[best_idx]

print("threshold:", best_threshold)
print("precision:", precision[best_idx])
print("recall:", recall[best_idx])
print("f1:", f1[best_idx])

y_pred = (anom_scores >= best_threshold).astype(int)
print(classification_report(y_val, y_pred, target_names=["normal", "anomalous"]))
```
In practice, you may not want the threshold with best F1. For SIEM, you may instead choose a threshold based on acceptable false positives or alert volume.
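A minimal sketch of the false-positive-budget alternative: take a high quantile of the scores that normal validation traffic receives. The function name `threshold_for_fpr` and the uniform synthetic scores are assumptions for illustration.

```python
import numpy as np

def threshold_for_fpr(normal_scores, target_fpr):
    """Pick the cutoff that flags roughly target_fpr of normal traffic."""
    normal_scores = np.asarray(normal_scores, dtype=float)
    return float(np.quantile(normal_scores, 1.0 - target_fpr))

rng = np.random.default_rng(0)
# Synthetic anomaly scores for normal validation requests (larger = more anomalous).
normal_val_scores = rng.uniform(0.0, 1.0, size=10_000)

threshold = threshold_for_fpr(normal_val_scores, target_fpr=0.01)
# Alert only on scores strictly above the cutoff.
observed_fpr = float((normal_val_scores > threshold).mean())

print("threshold:", round(threshold, 4))
print("observed FPR on normal validation data:", observed_fpr)
```

The budget translates directly into alert volume: a 1% false-positive rate on one million benign requests per day would mean about 10,000 false alerts per day, so a practical budget may need to be far lower.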
5. If moving from CSIC to real SIEM/system logs, consider sequence modeling
If your final target is real SIEM, system, or application logs, I would be careful about classifying each raw log line independently.
Many log anomalies are contextual:
One failed login is normal.
Fifty failed logins from the same source in five minutes is suspicious.
One 404 is normal.
A burst of unusual paths from the same user-agent may be suspicious.
One admin endpoint access may be normal.
Admin access from a new user-agent, source, or geography may be suspicious.
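The failed-login example can be sketched as a per-source count over fixed 5-minute buckets. The column names, the synthetic events, and the 50-failure rule are illustrative assumptions, not fields from any particular SIEM.

```python
import pandas as pd

# Synthetic events: one noisy source and one ordinary source (made-up data).
start = pd.Timestamp("2024-01-01 10:00:00")
events = pd.DataFrame({
    "timestamp": [start + pd.Timedelta(seconds=4 * i) for i in range(50)]
                 + [start + pd.Timedelta(minutes=2)],
    "source_ip": ["10.0.0.5"] * 50 + ["10.0.0.9"],
    "event": ["login_failed"] * 51,
})

# Count failed logins per source in fixed 5-minute buckets.
failed = events[events["event"] == "login_failed"]
counts = (
    failed.groupby(["source_ip", failed["timestamp"].dt.floor("5min")])
    .size()
    .rename("failed_logins")
    .reset_index()
)

# Hypothetical alert rule: 50 or more failures in one bucket.
suspicious = counts[counts["failed_logins"] >= 50]
print(suspicious)
```

Each individual line here is identical for both sources; only the aggregate separates them. Fixed buckets can split a burst across a boundary, so a sliding window may be a better fit in practice.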
For real logs, a common pipeline is:
raw logs
-> parse into templates / structured events
-> group by host, user, session, source IP, process, or time window
-> build sequences
-> model normal sequence patterns
-> detect deviations
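The first few stages of that pipeline can be sketched with plain regex masking. This is a crude stand-in and not how Drain or other dedicated parsers work; the masking rules, host names, and log lines are all made up for illustration.

```python
import re
from collections import defaultdict

# Crude masking rules; a dedicated log parser derives templates more robustly.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def to_template(message):
    """Replace variable fields so similar messages share one template."""
    for pattern, token in MASKS:
        message = pattern.sub(token, message)
    return message

def build_sequences(records):
    """Map time-ordered (host, message) records to per-host event id lists."""
    template_ids = {}
    sequences = defaultdict(list)
    for host, message in records:
        template = to_template(message)
        # Assign the next free id the first time a template is seen.
        event_id = template_ids.setdefault(template, len(template_ids))
        sequences[host].append(event_id)
    return template_ids, dict(sequences)

logs = [
    ("web-1", "Failed login for user 1001 from 192.168.1.20"),
    ("web-1", "Failed login for user 1002 from 192.168.1.21"),
    ("web-2", "Session 77 opened"),
    ("web-1", "Session 78 opened"),
]
template_ids, sequences = build_sequences(logs)
print(template_ids)
print(sequences)
```

The per-host id lists are the kind of input that sequence models consume; grouping by session, user, or time window instead of host would follow the same pattern.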
Useful references:
- LogPAI logparser: extracts event templates from unstructured logs and converts raw log messages into structured event sequences.
- Drain: representative online log parser using a fixed-depth parse tree.
- LogBERT: log anomaly detection via BERT; the pipeline includes raw data, parsing, structured logs, sequence construction, and modeling.
- DeepLog: classic LSTM-based system log anomaly detection; see also the DeepLog paper.
- deep-loglizer: toolkit for deep learning-based log analysis and anomaly detection.
- log-anomaly-detection reading list: useful survey-style collection of log anomaly detection papers and tools.
This is a different formulation from CSIC single-request binary classification. CSIC is useful for web request experiments, but it may not capture the full context of SIEM log anomaly detection.
6. Suggested experiment order
If I were doing this, I would use this order:
1. Debug the current SecBERT setup
   - label mapping,
   - checkpoint loading,
   - class balance,
   - stratified splits,
   - truncation,
   - confusion matrix,
   - score distribution,
   - threshold calibration.
2. Build simple CSIC baselines
   - char n-gram TF-IDF + Logistic Regression,
   - char n-gram TF-IDF + Linear SVM,
   - HTTP handcrafted features + classical classifier.
3. Compare cyber-domain encoders
   - SecBERT,
   - SecRoBERTa,
   - SecureBERT,
   - SecureBERT 2.0,
   - general BERT/RoBERTa.
4. Try normal-only anomaly scoring
   - embeddings + IsolationForest,
   - embeddings + One-Class SVM,
   - embeddings + kNN distance,
   - embeddings + Autoencoder,
   - HTTP features + IsolationForest.
5. For real SIEM logs, move to sequence/context modeling
   - parse logs into templates,
   - build sequences by host/session/source/time window,
   - test LogBERT / DeepLog-style approaches.
Bottom line
I would not abandon SecBERT immediately, but I would not make it the center of the whole solution either.
For CSIC HTTP requests, simple string/protocol baselines may be very strong. For unknown anomaly detection, normal-only anomaly scoring may be a better formulation. For real SIEM logs, sequence/context modeling may be more appropriate than classifying each raw entry independently.
So my recommendation would be:
Keep SecBERT as one candidate.
Debug the current same-class prediction issue.
Add simple HTTP/string baselines.
Compare other cyber encoders only after the evaluation setup is clean.
If the goal is unknown anomaly detection, test embedding + anomaly scoring.
If the goal is real SIEM logs, consider log-sequence modeling.