Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigxroedgd32rbniastg4gxkfpbf7qfvgtheifcqdqcs2w5ilgfzh4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mmsmhyvlxdj2"
  },
  "path": "/t/secbert-to-detect-anomalous-log-entries/176237#post_2",
  "publishedAt": "2026-05-27T02:41:00.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "SecBERT",
    "SecBERT GitHub repo",
    "CSIC 2010",
    "Fine-tuned model always predicts same output class for new data",
    "BERT and RoBERTa giving same outputs",
    "Fine-tuned RoBERTa only predicting one category",
    "text classification",
    "TfidfVectorizer",
    "SecRoBERTa",
    "SecureBERT",
    "SecureBERT 2.0",
    "IsolationForest",
    "novelty and outlier detection",
    "anomaly / out-of-domain detection with BERT",
    "HTTP2vec",
    "Contextual embeddings and autoencoders for HTTP traffic anomaly detection",
    "precision_recall_curve",
    "precision-recall curve example",
    "LogPAI logparser",
    "Drain",
    "LogBERT",
    "DeepLog",
    "DeepLog paper",
    "deep-loglizer",
    "log-anomaly-detection reading list"
  ],
  "textContent": "Hmm… If you are not specifically constrained to SecBERT:\n\n* * *\n\nI would treat SecBERT as one candidate encoder, not as the entire anomaly-detection solution.\n\nYour current approach is roughly:\n\n\n    HTTP request / log entry\n      -> SecBERT tokenizer\n      -> SecBERT encoder\n      -> classification head\n      -> normal / anomalous\n\n\nThat is a valid experiment, but it is only one formulation. SecBERT itself is not a ready-made SIEM anomaly detector. It is a cybersecurity-domain pretrained language model. The actual detector is the full pipeline around it: preprocessing, tokenization, label mapping, classification head, checkpoint loading, validation metrics, and thresholding.\n\nFor context:\n\n  * SecBERT is described as a BERT model trained on cybersecurity text.\n  * The original SecBERT GitHub repo frames it as a language model that can improve downstream tasks such as NER, text classification, semantic understanding, and Q&A in the cybersecurity domain.\n  * CSIC 2010 is generated HTTP request data, with normal/anomalous labels and attacks such as SQL injection, buffer overflow, information gathering, file disclosure, CRLF injection, XSS, server-side include, and parameter tampering.\n\n\n\nSo I would separate two questions:\n\n  1. Why is the current SecBERT setup predicting all diagnostic examples as anomalous?\n  2. Is SecBERT the best formulation for CSIC / SIEM-style anomaly detection?\n\n\n\nThose are related, but not identical.\n\n## 1. First debug the current “everything is anomalous” behavior\n\nBefore changing models, I would verify that the current pipeline is not simply misconfigured.\n\nA model predicting only one class is a common failure mode in fine-tuning workflows. 
There are similar Hugging Face Forum threads where fine-tuned BERT/RoBERTa models always predicted the same class, and the underlying causes discussed included checkpoint loading, class imbalance, metrics, and training/evaluation setup:\n\n  * Fine-tuned model always predicts same output class for new data\n  * BERT and RoBERTa giving same outputs\n  * Fine-tuned RoBERTa only predicting one category\n\n\n\nI would check at least these items:\n\n\n    print(model.config.id2label)\n    print(model.config.label2id)\n\n    for name, split in {\n        \"train\": train_df,\n        \"val\": val_df,\n        \"test\": test_df,\n    }.items():\n        print(\"\\n\", name)\n        print(split[\"label\"].value_counts())\n        print(split[\"label\"].value_counts(normalize=True))\n\n\nThings I would specifically verify:\n\n  * `normal` and `anomalous` are mapped consistently in `label2id` and `id2label`.\n  * The inference code is loading the fine-tuned checkpoint, not the base SecBERT model.\n  * The tokenizer is loaded from the same checkpoint or compatible base model.\n  * The 2,000-row CSIC sample is not badly skewed.\n  * Train/validation/test splits are stratified.\n  * Validation metrics include per-class precision, recall, F1, and confusion matrix.\n  * The HTTP requests are not truncated before the useful payload appears.\n  * The final decision threshold is calibrated on validation data, rather than relying only on `argmax` or a fixed 0.5 threshold.\n\n\n\nFor text classification with Transformers, the usual pattern is to attach a sequence-classification head, train it, and evaluate it with task-specific metrics. 
The Hugging Face guide for text classification is a useful reference for that setup.\n\nI would also inspect logits/probabilities for a small batch:\n\n\n    import torch\n    import pandas as pd\n\n    def inspect_predictions(model, tokenizer, texts, max_length=256):\n        model.eval()\n        rows = []\n\n        for text in texts:\n            inputs = tokenizer(\n                text,\n                return_tensors=\"pt\",\n                truncation=True,\n                padding=True,\n                max_length=max_length,\n            ).to(model.device)  # keep inputs on the same device as the model\n\n            with torch.no_grad():\n                outputs = model(**inputs)\n\n            logits = outputs.logits[0].detach().cpu()\n            probs = torch.softmax(logits, dim=-1)\n\n            pred_id = int(torch.argmax(probs))\n            pred_label = model.config.id2label.get(pred_id, str(pred_id))\n\n            rows.append({\n                \"text\": text[:160],\n                \"num_tokens\": int(inputs[\"attention_mask\"].sum()),\n                \"logit_0\": float(logits[0]),\n                \"logit_1\": float(logits[1]),\n                \"prob_0\": float(probs[0]),\n                \"prob_1\": float(probs[1]),\n                \"pred_id\": pred_id,\n                \"pred_label\": pred_label,\n            })\n\n        return pd.DataFrame(rows)\n\n\nIf both normal and anomalous examples get almost the same probability distribution, the model is probably not separating the classes yet.\n\n## 2. For CSIC HTTP requests, add simple string/protocol baselines\n\nFor CSIC, I would not start by assuming that a cybersecurity language model is the strongest representation.\n\nCSIC is HTTP request data, not ordinary natural-language security prose. 
Many important signals are string-level or protocol-structure-level:\n\n  * URL encoding\n  * SQLi tokens\n  * XSS markers\n  * CRLF markers\n  * path traversal markers\n  * unusual parameter values\n  * long payloads\n  * high special-character density\n  * odd HTTP method/path/body combinations\n\n\n\nA simple character n-gram model may be surprisingly strong here.\n\nI would build at least this baseline:\n\n\n    from sklearn.pipeline import Pipeline\n    from sklearn.feature_extraction.text import TfidfVectorizer\n    from sklearn.linear_model import LogisticRegression\n    from sklearn.metrics import classification_report, confusion_matrix\n\n    clf = Pipeline([\n        (\"tfidf\", TfidfVectorizer(\n            analyzer=\"char_wb\",\n            ngram_range=(3, 5),\n            min_df=2,\n        )),\n        (\"lr\", LogisticRegression(\n            max_iter=1000,\n            class_weight=\"balanced\",\n        )),\n    ])\n\n    clf.fit(X_train, y_train)\n    pred = clf.predict(X_val)\n\n    print(confusion_matrix(y_val, pred))\n    print(classification_report(y_val, pred, target_names=[\"normal\", \"anomalous\"]))\n\n\n`TfidfVectorizer` supports character n-gram features via `analyzer=\"char\"` or `analyzer=\"char_wb\"`; see the scikit-learn docs for TfidfVectorizer.\n\nInterpretation:\n\nResult | Likely interpretation\n---|---\nTF-IDF baseline works, SecBERT fails | The data may be separable, but the Transformer setup/preprocessing/training may be wrong.\nBoth TF-IDF and SecBERT fail | Check labels, split, preprocessing, sampling, and evaluation.\nTF-IDF beats SecBERT | This task may be more string-pattern-heavy than language-semantics-heavy.\nSecBERT beats TF-IDF | Good; then SecBERT may be adding useful cyber-domain representation.\n\nI would also add HTTP-specific handcrafted features:\n\n\n    import re\n    from urllib.parse import unquote\n\n    def extract_http_features(text):\n        decoded = unquote(text.lower())\n\n        return {\n         
   \"length\": len(decoded),\n            \"num_digits\": sum(c.isdigit() for c in decoded),\n            \"num_special\": sum(c in \"'\\\"<>;(){}[]\" for c in decoded),\n            \"num_params\": decoded.count(\"&\") + decoded.count(\"?\"),\n            \"has_sql_keyword\": int(bool(re.search(\n                r\"\\b(select|union|drop|insert|update|delete|or 1=1)\\b\",\n                decoded,\n            ))),\n            \"has_xss_marker\": int(bool(re.search(\n                r\"(<script|javascript:|onerror=|onload=)\",\n                decoded,\n            ))),\n            \"has_path_traversal\": int(\"../\" in decoded or \"..%2f\" in decoded),\n            \"has_crlf\": int(\"%0d\" in text.lower() or \"%0a\" in text.lower()),\n        }\n\n\nThose features can be fed into Logistic Regression, Random Forest, LightGBM, XGBoost, or combined with text features.\n\n## 3. Position SecBERT correctly\n\nI would frame SecBERT like this:\n\n\n    SecBERT is not the detector by itself.\n    SecBERT is the encoder.\n\n    The detector is:\n    preprocessing\n      + tokenizer\n      + encoder\n      + classification head or anomaly scorer\n      + labels or normal-only training data\n      + validation metrics\n      + thresholding\n\n\nThis matters because changing only the encoder will not fix problems such as:\n\n  * reversed labels,\n  * wrong checkpoint loading,\n  * skewed sampling,\n  * bad validation split,\n  * truncation,\n  * uncalibrated threshold,\n  * or evaluating only four hand-picked examples.\n\n\n\nIf you want to keep the supervised Transformer route, compare cyber-domain encoders under the same evaluation setup:\n\nEncoder candidate | Notes\n---|---\nSecBERT | Cybersecurity-text pretrained BERT. 
Reasonable current baseline.\nSecRoBERTa | RoBERTa variant from the same SecBERT family.\nSecureBERT | RoBERTa-based cybersecurity-domain language model.\nSecureBERT 2.0 | Newer Cisco cybersecurity/threat-intelligence encoder based on ModernBERT.\nGeneral BERT/RoBERTa | Useful sanity check: domain pretraining may or may not help this specific HTTP-request task.\n\nBut I would only do this after the evaluation setup is clean. Otherwise, you may just be changing the model while keeping the same bug.\n\n## 4. For SIEM-style unknown anomaly detection, consider normal-only scoring\n\nIf the real goal is SIEM-style unknown anomaly detection, supervised `normal` vs `anomalous` classification on CSIC may be only a proxy task.\n\nA SIEM-like goal is often closer to:\n\n\n    learn what normal looks like\n      -> assign anomaly/risk scores\n      -> alert only above a tuned threshold\n\n\nThat suggests another family of methods:\n\n\n    HTTP request / log text\n      -> BERT/SecBERT/SecureBERT embedding\n      -> IsolationForest / One-Class SVM / kNN distance / Autoencoder\n      -> anomaly score\n      -> threshold tuned on validation data\n\n\nOptions worth testing:\n\nMethod | Fit\n---|---\nSecBERT/SecureBERT embeddings + IsolationForest | Lightweight normal-only anomaly scoring.\nembeddings + One-Class SVM | Classic novelty detection baseline.\nembeddings + kNN distance | Easy to interpret: “far from nearest normal examples.”\nembeddings + Autoencoder | Reconstruction-error-based anomaly score.\nhandcrafted HTTP features + IsolationForest | Often strong and cheap for request-level anomaly scoring.\n\nReferences:\n\n  * scikit-learn IsolationForest\n  * scikit-learn novelty and outlier detection\n  * Hugging Face Forum discussion on anomaly / out-of-domain detection with BERT\n  * HTTP request embedding work such as HTTP2vec\n  * A related BERT + autoencoder direction for HTTP anomaly detection: Contextual embeddings and autoencoders for HTTP traffic anomaly 
detection\n\n\n\nFor SIEM-like alerting, I would not rely only on `argmax`. I would tune thresholds on validation data and inspect the precision/recall trade-off. scikit-learn’s precision_recall_curve and precision-recall curve example are useful references.\n\nExample threshold selection:\n\n\n    import numpy as np\n    from sklearn.metrics import precision_recall_curve, classification_report\n\n    # y_val: 0=normal, 1=anomalous\n    # anom_scores: probability or anomaly score where larger means \"more anomalous\"\n\n    precision, recall, thresholds = precision_recall_curve(y_val, anom_scores)\n\n    f1 = 2 * precision[:-1] * recall[:-1] / np.maximum(\n        precision[:-1] + recall[:-1],\n        1e-12,\n    )\n\n    best_idx = np.nanargmax(f1)\n    best_threshold = thresholds[best_idx]\n\n    print(\"threshold:\", best_threshold)\n    print(\"precision:\", precision[best_idx])\n    print(\"recall:\", recall[best_idx])\n    print(\"f1:\", f1[best_idx])\n\n    y_pred = (anom_scores >= best_threshold).astype(int)\n    print(classification_report(y_val, y_pred, target_names=[\"normal\", \"anomalous\"]))\n\n\nIn practice, you may not want the threshold with best F1. For SIEM, you may instead choose a threshold based on acceptable false positives or alert volume.\n\n## 5. 
If moving from CSIC to real SIEM/system logs, consider sequence modeling\n\nIf your final target is real SIEM, system, or application logs, I would be careful about classifying each raw log line independently.\n\nMany log anomalies are contextual:\n\n\n    One failed login is normal.\n    Fifty failed logins from the same source in five minutes is suspicious.\n\n    One 404 is normal.\n    A burst of unusual paths from the same user-agent may be suspicious.\n\n    One admin endpoint access may be normal.\n    Admin access from a new user-agent, source, or geography may be suspicious.\n\n\nFor real logs, a common pipeline is:\n\n\n    raw logs\n      -> parse into templates / structured events\n      -> group by host, user, session, source IP, process, or time window\n      -> build sequences\n      -> model normal sequence patterns\n      -> detect deviations\n\n\nUseful references:\n\n  * LogPAI logparser: extracts event templates from unstructured logs and converts raw log messages into structured event sequences.\n  * Drain: representative online log parser using a fixed-depth parse tree.\n  * LogBERT: log anomaly detection via BERT; the pipeline includes raw data, parsing, structured logs, sequence construction, and modeling.\n  * DeepLog: classic LSTM-based system log anomaly detection; see also the DeepLog paper.\n  * deep-loglizer: toolkit for deep learning-based log analysis and anomaly detection.\n  * log-anomaly-detection reading list: useful survey-style collection of log anomaly detection papers and tools.\n\n\n\nThis is a different formulation from CSIC single-request binary classification. CSIC is useful for web request experiments, but it may not capture the full context of SIEM log anomaly detection.\n\n## 6. Suggested experiment order\n\nIf I were doing this, I would use this order:\n\n  1. 
**Debug the current SecBERT setup**\n\n     * label mapping,\n     * checkpoint loading,\n     * class balance,\n     * stratified splits,\n     * truncation,\n     * confusion matrix,\n     * score distribution,\n     * threshold calibration.\n  2. **Build simple CSIC baselines**\n\n     * char n-gram TF-IDF + Logistic Regression,\n     * char n-gram TF-IDF + Linear SVM,\n     * HTTP handcrafted features + classical classifier.\n  3. **Compare cyber-domain encoders**\n\n     * SecBERT,\n     * SecRoBERTa,\n     * SecureBERT,\n     * SecureBERT 2.0,\n     * general BERT/RoBERTa.\n  4. **Try normal-only anomaly scoring**\n\n     * embeddings + IsolationForest,\n     * embeddings + One-Class SVM,\n     * embeddings + kNN distance,\n     * embeddings + Autoencoder,\n     * HTTP features + IsolationForest.\n  5. **For real SIEM logs, move to sequence/context modeling**\n\n     * parse logs into templates,\n     * build sequences by host/session/source/time window,\n     * test LogBERT / DeepLog-style approaches.\n\n\n\n## Bottom line\n\nI would not abandon SecBERT immediately, but I would not make it the center of the whole solution either.\n\nFor CSIC HTTP requests, simple string/protocol baselines may be very strong. For unknown anomaly detection, normal-only anomaly scoring may be a better formulation. For real SIEM logs, sequence/context modeling may be more appropriate than classifying each raw entry independently.\n\nSo my recommendation would be:\n\n\n    Keep SecBERT as one candidate.\n    Debug the current same-class prediction issue.\n    Add simple HTTP/string baselines.\n    Compare other cyber encoders only after the evaluation setup is clean.\n    If the goal is unknown anomaly detection, test embedding + anomaly scoring.\n    If the goal is real SIEM logs, consider log-sequence modeling.\n",
  "title": "Secbert to detect anomalous log entries"
}