External Publication

Secbert to detect anomalous log entries

Hugging Face Forums [Unofficial] May 27, 2026

Hmm… If you are not specifically constrained to SecBERT:

I would treat SecBERT as one candidate encoder, not as the entire anomaly-detection solution.

Your current approach is roughly:

HTTP request / log entry
  -> SecBERT tokenizer
  -> SecBERT encoder
  -> classification head
  -> normal / anomalous

That is a valid experiment, but it is only one formulation. SecBERT itself is not a ready-made SIEM anomaly detector. It is a cybersecurity-domain pretrained language model. The actual detector is the full pipeline around it: preprocessing, tokenization, label mapping, classification head, checkpoint loading, validation metrics, and thresholding.

For context:

SecBERT is described as a BERT model trained on cybersecurity text.
The original SecBERT GitHub repo frames it as a language model that can improve downstream tasks such as NER, text classification, semantic understanding, and Q&A in the cybersecurity domain.
CSIC 2010 is generated HTTP request data, with normal/anomalous labels and attacks such as SQL injection, buffer overflow, information gathering, file disclosure, CRLF injection, XSS, server-side include, and parameter tampering.

So I would separate two questions:

Why is the current SecBERT setup predicting all diagnostic examples as anomalous?
Is SecBERT the best formulation for CSIC / SIEM-style anomaly detection?

Those are related, but not identical.

1. First debug the current “everything is anomalous” behavior

Before changing models, I would verify that the current pipeline is not simply misconfigured.

A model predicting only one class is a common failure mode in fine-tuning workflows. There are similar Hugging Face Forum threads where fine-tuned BERT/RoBERTa models always predicted the same class, and the underlying causes discussed included checkpoint loading, class imbalance, metrics, and training/evaluation setup:

Fine-tuned model always predicts same output class for new data
BERT and RoBERTa giving same outputs
Fine-tuned RoBERTa only predicting one category

I would check at least these items:

print(model.config.id2label)
print(model.config.label2id)

for name, split in {
    "train": train_df,
    "val": val_df,
    "test": test_df,
}.items():
    print("\n", name)
    print(split["label"].value_counts())
    print(split["label"].value_counts(normalize=True))
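
If those counts turn out skewed or differ a lot between splits, a stratified re-split is a cheap thing to try. This is only a sketch: the function name, the 70/15/15 proportions, the seed, and the toy DataFrame are my assumptions, not something from your setup.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_splits(df, label_col="label", val_size=0.15, test_size=0.15, seed=42):
    """Split df into train/val/test while keeping the label ratio in each part."""
    train_part, holdout = train_test_split(
        df,
        test_size=val_size + test_size,
        stratify=df[label_col],
        random_state=seed,
    )
    val_part, test_part = train_test_split(
        holdout,
        test_size=test_size / (val_size + test_size),
        stratify=holdout[label_col],
        random_state=seed,
    )
    return train_part, val_part, test_part


# Toy data standing in for the sampled CSIC rows (25% anomalous).
toy_df = pd.DataFrame({
    "text": [f"request {i}" for i in range(200)],
    "label": [0] * 150 + [1] * 50,
})
toy_train, toy_val, toy_test = stratified_splits(toy_df)

for name, part in {"train": toy_train, "val": toy_val, "test": toy_test}.items():
    print(name, len(part), round(part["label"].mean(), 3))
```

Each split should then show roughly the same anomalous fraction as the full sample.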

Things I would specifically verify:

normal and anomalous are mapped consistently in label2id and id2label.
The inference code is loading the fine-tuned checkpoint, not the base SecBERT model.
The tokenizer is loaded from the same checkpoint or compatible base model.
The 2,000-row CSIC sample is not badly skewed.
Train/validation/test splits are stratified.
Validation metrics include per-class precision, recall, F1, and confusion matrix.
The HTTP requests are not truncated before the useful payload appears.
The final decision threshold is calibrated on validation data, rather than relying only on argmax or a fixed 0.5 threshold.
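
For the truncation item, one rough way to see how many requests lose tokens is to count tokens before truncation. This is a sketch under assumptions: `truncation_report` is a name I made up, `tokenize` can be any callable that returns a token list (for a Hugging Face tokenizer, its `tokenize` method), and the toy example uses a character-level tokenizer only so that it runs without downloading anything.

```python
def truncation_report(texts, tokenize, max_length=256, num_special_tokens=2):
    """Count how many texts would be cut at max_length.

    tokenize: callable mapping a string to a list of tokens.
    num_special_tokens: room reserved for [CLS] and [SEP] in a BERT-style input.
    """
    budget = max_length - num_special_tokens
    lengths = [len(tokenize(t)) for t in texts]
    num_truncated = sum(n > budget for n in lengths)
    return {
        "num_texts": len(lengths),
        "num_truncated": num_truncated,
        "frac_truncated": num_truncated / len(lengths) if lengths else 0.0,
        "max_tokens": max(lengths, default=0),
    }


# Toy requests; `list` acts as a character-level tokenizer for the demo.
toy_requests = [
    "GET /index.jsp HTTP/1.1",
    "POST /login.jsp HTTP/1.1 " + "a=1&" * 50,
]
report = truncation_report(toy_requests, tokenize=list, max_length=64)
print(report)
```

If a large fraction is truncated, it is worth checking whether the payload sits at the end of the request text, because that is the part that gets dropped.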

For text classification with Transformers, the usual pattern is to attach a sequence-classification head, train it, and evaluate it with task-specific metrics. The Hugging Face guide for text classification is a useful reference for that setup.
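
As a sketch of what "task-specific metrics" could look like here, the function below reports per-class precision, recall, and F1 plus the two error counts from the confusion matrix. It assumes label 0 is normal and label 1 is anomalous, and that it receives a `(logits, labels)` pair, which is the shape the Trainer's `compute_metrics` hook is normally given. The key names are my own choice.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support


def compute_metrics(eval_pred):
    """Per-class metrics for a binary normal(0)/anomalous(1) classifier."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)

    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, labels=[0, 1], zero_division=0
    )
    cm = confusion_matrix(labels, preds, labels=[0, 1])

    return {
        "precision_normal": float(precision[0]),
        "recall_normal": float(recall[0]),
        "f1_normal": float(f1[0]),
        "precision_anomalous": float(precision[1]),
        "recall_anomalous": float(recall[1]),
        "f1_anomalous": float(f1[1]),
        "false_positives": int(cm[0, 1]),  # normal predicted as anomalous
        "false_negatives": int(cm[1, 0]),  # anomalous predicted as normal
    }


# Toy case: a model that calls everything anomalous.
toy_logits = np.array([[0.1, 2.0], [0.2, 1.5], [0.0, 3.0], [0.3, 2.2]])
toy_labels = np.array([0, 0, 1, 1])
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

In the toy case, recall for the normal class is 0 even though recall for the anomalous class is 1, which is exactly the pattern that accuracy alone can hide.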

I would also inspect logits/probabilities for a small batch:

import torch
import pandas as pd

def inspect_predictions(model, tokenizer, texts, max_length=256):
    """Return per-text logits and probabilities (assumes a two-class head)."""
    model.eval()
    # Run on whatever device the model already lives on (CPU or GPU).
    device = next(model.parameters()).device
    rows = []

    for text in texts:
        inputs = tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            padding=True,
            max_length=max_length,
        ).to(device)

        with torch.no_grad():
            outputs = model(**inputs)

        logits = outputs.logits[0].detach().cpu()
        probs = torch.softmax(logits, dim=-1)

        pred_id = int(torch.argmax(probs))
        pred_label = model.config.id2label.get(pred_id, str(pred_id))

        rows.append({
            "text": text[:160],
            # Token count after truncation; a value equal to max_length
            # suggests the request was cut.
            "num_tokens": int(inputs["attention_mask"].sum()),
            "logit_0": float(logits[0]),
            "logit_1": float(logits[1]),
            "prob_0": float(probs[0]),
            "prob_1": float(probs[1]),
            "pred_id": pred_id,
            "pred_label": pred_label,
        })

    return pd.DataFrame(rows)

If both normal and anomalous examples get almost the same probability distribution, the model is probably not separating the classes yet.
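
To make "almost the same probability distribution" concrete, I would group the scores by the true label and look at the gap between the class means. A sketch, assuming a DataFrame with a `prob_1` column (probability of the anomalous class) and a list of true labels; `summarize_separation` and the toy numbers are made up for illustration.

```python
import pandas as pd


def summarize_separation(pred_df, true_labels, score_col="prob_1"):
    """Summarize a score column per true class and return the gap between class means."""
    df = pred_df.assign(true_label=list(true_labels))
    summary = df.groupby("true_label")[score_col].agg(["count", "mean", "min", "max"])
    gap = float(summary["mean"].max() - summary["mean"].min())
    return summary, gap


# Toy predictions from a collapsed model: every row scores high for "anomalous".
toy_pred_df = pd.DataFrame({"prob_1": [0.97, 0.98, 0.96, 0.99]})
toy_true = [0, 0, 1, 1]

summary, gap = summarize_separation(toy_pred_df, toy_true)
print(summary)
print("gap between class means:", gap)
```

A gap near zero on a reasonably sized labeled batch points at the training or loading setup rather than at the decision threshold.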

2. For CSIC HTTP requests, add simple string/protocol baselines

For CSIC, I would not start by assuming that a cybersecurity language model is the strongest representation.

CSIC is HTTP request data, not ordinary natural-language security prose. Many important signals are string-level or protocol-structure-level:

URL encoding
SQLi tokens
XSS markers
CRLF markers
path traversal markers
unusual parameter values
long payloads
high special-character density
odd HTTP method/path/body combinations

A simple character n-gram model may be surprisingly strong here.

I would build at least this baseline:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

clf = Pipeline([
    ("tfidf", TfidfVectorizer(
        analyzer="char_wb",
        ngram_range=(3, 5),
        min_df=2,
    )),
    ("lr", LogisticRegression(
        max_iter=1000,
        class_weight="balanced",
    )),
])

clf.fit(X_train, y_train)
pred = clf.predict(X_val)

print(confusion_matrix(y_val, pred))
print(classification_report(y_val, pred, target_names=["normal", "anomalous"]))

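TfidfVectorizer supports character n-gram features via analyzer="char" or analyzer="char_wb"; see the scikit-learn docs for TfidfVectorizer. char_wb builds n-grams only from text inside word boundaries, so no n-gram spans whitespace; analyzer="char" is a cheap variant to compare against.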

Interpretation:

Result	Likely interpretation
TF-IDF baseline works, SecBERT fails	The data may be separable, but the Transformer setup/preprocessing/training may be wrong.
Both TF-IDF and SecBERT fail	Check labels, split, preprocessing, sampling, and evaluation.
TF-IDF beats SecBERT	This task may be more string-pattern-heavy than language-semantics-heavy.
SecBERT beats TF-IDF	Good; then SecBERT may be adding useful cyber-domain representation.

I would also add HTTP-specific handcrafted features:

import re
from urllib.parse import unquote_plus

def extract_http_features(text):
    raw = text.lower()
    # unquote_plus also turns "+" into a space, which form-encoded bodies use,
    # so "or+1%3D1" becomes "or 1=1" before the keyword check.
    decoded = unquote_plus(raw)

    return {
        "length": len(decoded),
        "num_digits": sum(c.isdigit() for c in decoded),
        "num_special": sum(c in "'\"<>;(){}[]" for c in decoded),
        "num_params": decoded.count("&") + decoded.count("?"),
        "has_sql_keyword": int(bool(re.search(
            r"\b(select|union|drop|insert|update|delete|or 1=1)\b",
            decoded,
        ))),
        "has_xss_marker": int(bool(re.search(
            r"(<script|javascript:|onerror=|onload=)",
            decoded,
        ))),
        # "..%2f" only survives decoding when the input was double-encoded.
        "has_path_traversal": int("../" in decoded or "..%2f" in decoded),
        # Checked on the raw text: the encoded form is the suspicious one.
        "has_crlf": int("%0d" in raw or "%0a" in raw),
    }

Those features can be fed into Logistic Regression, Random Forest, LightGBM, XGBoost, or combined with text features.
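
One way to combine handcrafted features with text features is to stack both as columns of one sparse matrix. The sketch below is an assumption-heavy illustration: `toy_http_features` is a cut-down stand-in so the block runs on its own, `CombinedFeatures` is a helper I made up, and the four requests are invented.

```python
from urllib.parse import unquote_plus

from scipy.sparse import hstack
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MaxAbsScaler


def toy_http_features(text):
    """Hypothetical, cut-down stand-in for a fuller handcrafted extractor."""
    decoded = unquote_plus(text.lower())
    return {
        "length": len(decoded),
        "num_special": sum(c in "'\"<>;(){}[]" for c in decoded),
        "has_script": int("<script" in decoded),
    }


class CombinedFeatures:
    """Char n-gram TF-IDF columns followed by scaled handcrafted columns."""

    def __init__(self):
        self.tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
        self.dict_vec = DictVectorizer()
        self.scaler = MaxAbsScaler()  # keeps "length" from dominating

    def _handcrafted(self, texts):
        return [toy_http_features(t) for t in texts]

    def fit_transform(self, texts):
        text_part = self.tfidf.fit_transform(texts)
        hand_part = self.scaler.fit_transform(
            self.dict_vec.fit_transform(self._handcrafted(texts))
        )
        return hstack([text_part, hand_part]).tocsr()

    def transform(self, texts):
        text_part = self.tfidf.transform(texts)
        hand_part = self.scaler.transform(
            self.dict_vec.transform(self._handcrafted(texts))
        )
        return hstack([text_part, hand_part]).tocsr()


# Invented requests: two benign, one SQLi-like, one XSS-like.
toy_texts = [
    "GET /shop/index.jsp",
    "GET /shop/add.jsp?id=2&qty=3",
    "GET /shop/add.jsp?id=2%27+or+1%3D1",
    "GET /shop/index.jsp?q=%3Cscript%3Ealert(1)%3C/script%3E",
]
toy_labels = [0, 0, 1, 1]

features = CombinedFeatures()
X_toy = features.fit_transform(toy_texts)
toy_clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_toy, toy_labels)
toy_pred = toy_clf.predict(features.transform(toy_texts))
print(X_toy.shape, toy_pred)
```

Scaling the handcrafted columns matters for linear models; tree-based models such as Random Forest or LightGBM would not need it.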

3. Position SecBERT correctly

I would frame SecBERT like this:

SecBERT is not the detector by itself.
SecBERT is the encoder.

The detector is:
preprocessing
  + tokenizer
  + encoder
  + classification head or anomaly scorer
  + labels or normal-only training data
  + validation metrics
  + thresholding

This matters because changing only the encoder will not fix problems such as:

reversed labels,
wrong checkpoint loading,
skewed sampling,
bad validation split,
truncation,
uncalibrated threshold,
or evaluating only four hand-picked examples.

If you want to keep the supervised Transformer route, compare cyber-domain encoders under the same evaluation setup:

Encoder candidate	Notes
SecBERT	Cybersecurity-text pretrained BERT. Reasonable current baseline.
SecRoBERTa	RoBERTa variant from the same SecBERT family.
SecureBERT	RoBERTa-based cybersecurity-domain language model.
SecureBERT 2.0	Newer Cisco cybersecurity/threat-intelligence encoder based on ModernBERT.
General BERT/RoBERTa	Useful sanity check: domain pretraining may or may not help this specific HTTP-request task.

But I would only do this after the evaluation setup is clean. Otherwise, you may just be changing the model while keeping the same bug.
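
A small harness helps keep the comparison honest: every candidate sees the same validation texts and is scored with the same metrics. In this sketch, each candidate is just a callable that maps a list of texts to 0/1 predictions, so a fine-tuned encoder and a TF-IDF baseline can sit side by side. The function name and the two toy predictors are assumptions for illustration.

```python
from sklearn.metrics import f1_score, recall_score


def compare_candidates(candidates, X_val, y_val):
    """Score every candidate on the same validation set, best macro-F1 first.

    candidates: dict mapping a name to a callable texts -> list of 0/1 labels.
    """
    rows = []
    for name, predict in candidates.items():
        pred = list(predict(X_val))
        rows.append({
            "name": name,
            "macro_f1": f1_score(y_val, pred, average="macro", zero_division=0),
            "recall_normal": recall_score(y_val, pred, pos_label=0, zero_division=0),
            "recall_anomalous": recall_score(y_val, pred, pos_label=1, zero_division=0),
        })
    return sorted(rows, key=lambda row: row["macro_f1"], reverse=True)


# Toy validation set and two toy predictors.
toy_X_val = [
    "GET /index.jsp",
    "GET /a?id=1' or 1=1",
    "GET /b",
    "GET /c?q=<script>",
]
toy_y_val = [0, 1, 0, 1]

toy_candidates = {
    "always_anomalous": lambda texts: [1 for _ in texts],
    "keyword_rule": lambda texts: [int("'" in t or "<" in t) for t in texts],
}

results = compare_candidates(toy_candidates, toy_X_val, toy_y_val)
for row in results:
    print(row)
```

Including a trivial "always anomalous" candidate is a useful reference point: any real model should clearly beat it on macro-F1.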

4. For SIEM-style unknown anomaly detection, consider normal-only scoring

If the real goal is SIEM-style unknown anomaly detection, supervised normal vs anomalous classification on CSIC may be only a proxy task.

A SIEM-like goal is often closer to:

learn what normal looks like
  -> assign anomaly/risk scores
  -> alert only above a tuned threshold

That suggests another family of methods:

HTTP request / log text
  -> BERT/SecBERT/SecureBERT embedding
  -> IsolationForest / One-Class SVM / kNN distance / Autoencoder
  -> anomaly score
  -> threshold tuned on validation data

Options worth testing:

Method	Fit
SecBERT/SecureBERT embeddings + IsolationForest	Lightweight normal-only anomaly scoring.
embeddings + One-Class SVM	Classic novelty detection baseline.
embeddings + kNN distance	Easy to interpret: “far from nearest normal examples.”
embeddings + Autoencoder	Reconstruction-error-based anomaly score.
handcrafted HTTP features + IsolationForest	Often strong and cheap for request-level anomaly scoring.
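
To show the shape of normal-only scoring, here is a sketch of two of the options from the table. Important assumption: the "embeddings" are random vectors standing in for encoder outputs, so this only demonstrates the fit-on-normal, score-everything pattern, not real detection quality. The dimensions and parameters are arbitrary.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Stand-ins for sentence embeddings; real ones would come from the encoder.
normal_train = rng.normal(0.0, 1.0, size=(500, 16))
normal_val = rng.normal(0.0, 1.0, size=(100, 16))
anomalous_val = rng.normal(6.0, 1.0, size=(20, 16))

# Option 1: IsolationForest fitted on normal data only.
iso = IsolationForest(n_estimators=200, random_state=0).fit(normal_train)


def iso_score(X):
    # score_samples is lower for abnormal points, so negate it:
    # larger value = more anomalous.
    return -iso.score_samples(X)


# Option 2: mean distance to the k nearest normal training examples.
knn = NearestNeighbors(n_neighbors=5).fit(normal_train)


def knn_score(X):
    distances, _ = knn.kneighbors(X)
    return distances.mean(axis=1)


print("iso  normal:", iso_score(normal_val).mean(), "anomalous:", iso_score(anomalous_val).mean())
print("kNN  normal:", knn_score(normal_val).mean(), "anomalous:", knn_score(anomalous_val).mean())
```

Both functions return a score where larger means more anomalous, so either can be plugged into the same threshold-tuning step on validation data.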

References:

scikit-learn IsolationForest
scikit-learn novelty and outlier detection
Hugging Face Forum discussion on anomaly / out-of-domain detection with BERT
HTTP request embedding work such as HTTP2vec
A related BERT + autoencoder direction for HTTP anomaly detection: Contextual embeddings and autoencoders for HTTP traffic anomaly detection

For SIEM-like alerting, I would not rely only on argmax. I would tune thresholds on validation data and inspect the precision/recall trade-off. scikit-learn’s precision_recall_curve and precision-recall curve example are useful references.

Example threshold selection:

import numpy as np
from sklearn.metrics import precision_recall_curve, classification_report

# y_val: 0=normal, 1=anomalous
# anom_scores: probability or anomaly score where larger means "more anomalous"
anom_scores = np.asarray(anom_scores, dtype=float)  # also accepts a plain list

precision, recall, thresholds = precision_recall_curve(y_val, anom_scores)

f1 = 2 * precision[:-1] * recall[:-1] / np.maximum(
    precision[:-1] + recall[:-1],
    1e-12,
)

best_idx = np.nanargmax(f1)
best_threshold = thresholds[best_idx]

print("threshold:", best_threshold)
print("precision:", precision[best_idx])
print("recall:", recall[best_idx])
print("f1:", f1[best_idx])

y_pred = (anom_scores >= best_threshold).astype(int)
print(classification_report(y_val, y_pred, target_names=["normal", "anomalous"]))

In practice, you may not want the threshold with best F1. For SIEM, you may instead choose a threshold based on acceptable false positives or alert volume.
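
A sketch of the false-positive-budget alternative: take the scores of known-normal validation examples and put the threshold at the quantile that leaves only the allowed fraction above it. The function name, the 1% budget, and the daily event count are assumptions for illustration. Note that this version alerts on scores strictly above the threshold.

```python
import numpy as np


def threshold_for_fpr(normal_scores, target_fpr=0.01):
    """Threshold such that roughly target_fpr of normal scores lie above it."""
    normal_scores = np.asarray(normal_scores, dtype=float)
    return float(np.quantile(normal_scores, 1.0 - target_fpr))


# Toy anomaly scores for normal validation traffic (larger = more anomalous).
toy_normal_scores = np.arange(1000) / 1000.0

threshold = threshold_for_fpr(toy_normal_scores, target_fpr=0.01)
observed_fpr = float(np.mean(toy_normal_scores > threshold))

# Rough alert volume if the same rate held on live traffic (hypothetical volume).
events_per_day = 200_000
expected_alerts_per_day = observed_fpr * events_per_day

print("threshold:", threshold)
print("observed FPR on normal validation data:", observed_fpr)
print("expected false alerts per day:", expected_alerts_per_day)
```

Translating the rate into alerts per day makes it easier to judge whether the threshold is workable for whoever has to triage the alerts.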

5. If moving from CSIC to real SIEM/system logs, consider sequence modeling

If your final target is real SIEM, system, or application logs, I would be careful about classifying each raw log line independently.

Many log anomalies are contextual:

One failed login is normal.
Fifty failed logins from the same source in five minutes is suspicious.

One 404 is normal.
A burst of unusual paths from the same user-agent may be suspicious.

One admin endpoint access may be normal.
Admin access from a new user-agent, source, or geography may be suspicious.
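
The failed-login example can be sketched as a sliding-window count per source. This is a toy rule, not a detector: the event tuple layout, the function name, the addresses, and the defaults (50 failures in 300 seconds) are assumptions taken from the example above.

```python
from collections import defaultdict, deque


def burst_alerts(events, window_seconds=300, max_failures=50):
    """Flag sources with at least max_failures failed logins inside the window.

    events: iterable of (timestamp_seconds, source, outcome), sorted by timestamp.
    """
    recent = defaultdict(deque)  # source -> timestamps of recent failures
    alerts = []

    for ts, source, outcome in events:
        if outcome != "fail":
            continue
        window = recent[source]
        window.append(ts)
        # Drop failures that fell out of the time window.
        while window and ts - window[0] > window_seconds:
            window.popleft()
        if len(window) >= max_failures:
            alerts.append((ts, source, len(window)))

    return alerts


# Toy log: one source fails 50 times in about four minutes, another fails once.
toy_events = [(i * 5, "10.0.0.5", "fail") for i in range(50)]
toy_events.append((100, "10.0.0.9", "fail"))
toy_events.sort(key=lambda event: event[0])

alerts = burst_alerts(toy_events)
print(alerts)
```

Each individual line in that toy log looks identical, which is why a per-line classifier cannot see the difference between the two sources.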

For real logs, a common pipeline is:

raw logs
  -> parse into templates / structured events
  -> group by host, user, session, source IP, process, or time window
  -> build sequences
  -> model normal sequence patterns
  -> detect deviations

Useful references:

LogPAI logparser: extracts event templates from unstructured logs and converts raw log messages into structured event sequences.
Drain: representative online log parser using a fixed-depth parse tree.
LogBERT: log anomaly detection via BERT; the pipeline includes raw data, parsing, structured logs, sequence construction, and modeling.
DeepLog: classic LSTM-based system log anomaly detection; see also the DeepLog paper.
deep-loglizer: toolkit for deep learning-based log analysis and anomaly detection.
log-anomaly-detection reading list: useful survey-style collection of log anomaly detection papers and tools.
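
To illustrate the parse-then-group steps of that pipeline, here is a deliberately simple sketch. It is not Drain or any of the parsers listed above: it only masks IP addresses and numbers with regular expressions to get a template, then builds one event-ID sequence per host. The mask patterns, function names, and log lines are all made up.

```python
import re
from collections import defaultdict

# Order matters: mask IPs before bare numbers.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def to_template(message):
    """Replace variable parts of a log message with placeholder tokens."""
    for pattern, token in MASKS:
        message = pattern.sub(token, message)
    return message


def build_sequences(records):
    """Turn (host, message) records, given in time order, into per-host event-ID sequences."""
    template_ids = {}
    sequences = defaultdict(list)
    for host, message in records:
        template = to_template(message)
        event_id = template_ids.setdefault(template, len(template_ids))
        sequences[host].append(event_id)
    return template_ids, dict(sequences)


toy_records = [
    ("web-1", "Failed login for user 1001 from 10.0.0.5"),
    ("web-1", "Failed login for user 1002 from 10.0.0.7"),
    ("web-2", "Session 42 opened"),
    ("web-1", "Session 7 opened"),
]

template_ids, sequences = build_sequences(toy_records)
print(template_ids)
print(sequences)
```

The per-host sequences of event IDs are the kind of input that LogBERT or DeepLog-style models consume; a real parser would handle far more variable formats than these two masks.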

This is a different formulation from CSIC single-request binary classification. CSIC is useful for web request experiments, but it may not capture the full context of SIEM log anomaly detection.

6. Suggested experiment order

If I were doing this, I would use this order:

1. Debug the current SecBERT setup
   - label mapping,
   - checkpoint loading,
   - class balance,
   - stratified splits,
   - truncation,
   - confusion matrix,
   - score distribution,
   - threshold calibration.
2. Build simple CSIC baselines
   - char n-gram TF-IDF + Logistic Regression,
   - char n-gram TF-IDF + Linear SVM,
   - HTTP handcrafted features + classical classifier.
3. Compare cyber-domain encoders
   - SecBERT,
   - SecRoBERTa,
   - SecureBERT,
   - SecureBERT 2.0,
   - general BERT/RoBERTa.
4. Try normal-only anomaly scoring
   - embeddings + IsolationForest,
   - embeddings + One-Class SVM,
   - embeddings + kNN distance,
   - embeddings + Autoencoder,
   - HTTP features + IsolationForest.
5. For real SIEM logs, move to sequence/context modeling
   - parse logs into templates,
   - build sequences by host/session/source/time window,
   - test LogBERT / DeepLog-style approaches.

Bottom line

I would not abandon SecBERT immediately, but I would not make it the center of the whole solution either.

For CSIC HTTP requests, simple string/protocol baselines may be very strong. For unknown anomaly detection, normal-only anomaly scoring may be a better formulation. For real SIEM logs, sequence/context modeling may be more appropriate than classifying each raw entry independently.

So my recommendation would be:

Keep SecBERT as one candidate.
Debug the current same-class prediction issue.
Add simple HTTP/string baselines.
Compare other cyber encoders only after the evaluation setup is clean.
If the goal is unknown anomaly detection, test embedding + anomaly scoring.
If the goal is real SIEM logs, consider log-sequence modeling.