{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreigxroedgd32rbniastg4gxkfpbf7qfvgtheifcqdqcs2w5ilgfzh4",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mmsmhyvlxdj2"
},
"path": "/t/secbert-to-detect-anomalous-log-entries/176237#post_2",
"publishedAt": "2026-05-27T02:41:00.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"SecBERT",
"SecBERT GitHub repo",
"CSIC 2010",
"Fine-tuned model always predicts same output class for new data",
"BERT and RoBERTa giving same outputs",
"Fine-tuned RoBERTa only predicting one category",
"text classification",
"TfidfVectorizer",
"SecRoBERTa",
"SecureBERT",
"SecureBERT 2.0",
"IsolationForest",
"novelty and outlier detection",
"anomaly / out-of-domain detection with BERT",
"HTTP2vec",
"Contextual embeddings and autoencoders for HTTP traffic anomaly detection",
"precision_recall_curve",
"precision-recall curve example",
"LogPAI logparser",
"Drain",
"LogBERT",
"DeepLog",
"DeepLog paper",
"deep-loglizer",
"log-anomaly-detection reading list"
],
"textContent": "Hmm… If you are not specifically constrained to SecBERT:\n\n* * *\n\nI would treat SecBERT as one candidate encoder, not as the entire anomaly-detection solution.\n\nYour current approach is roughly:\n\n\n HTTP request / log entry\n -> SecBERT tokenizer\n -> SecBERT encoder\n -> classification head\n -> normal / anomalous\n\n\nThat is a valid experiment, but it is only one formulation. SecBERT itself is not a ready-made SIEM anomaly detector. It is a cybersecurity-domain pretrained language model. The actual detector is the full pipeline around it: preprocessing, tokenization, label mapping, classification head, checkpoint loading, validation metrics, and thresholding.\n\nFor context:\n\n * SecBERT is described as a BERT model trained on cybersecurity text.\n * The original SecBERT GitHub repo frames it as a language model that can improve downstream tasks such as NER, text classification, semantic understanding, and Q&A in the cybersecurity domain.\n * CSIC 2010 is generated HTTP request data, with normal/anomalous labels and attacks such as SQL injection, buffer overflow, information gathering, file disclosure, CRLF injection, XSS, server-side include, and parameter tampering.\n\n\n\nSo I would separate two questions:\n\n 1. Why is the current SecBERT setup predicting all diagnostic examples as anomalous?\n 2. Is SecBERT the best formulation for CSIC / SIEM-style anomaly detection?\n\n\n\nThose are related, but not identical.\n\n## 1. First debug the current “everything is anomalous” behavior\n\nBefore changing models, I would verify that the current pipeline is not simply misconfigured.\n\nA model predicting only one class is a common failure mode in fine-tuning workflows. 
There are similar Hugging Face Forum threads where fine-tuned BERT/RoBERTa models always predicted the same class, and the underlying causes discussed included checkpoint loading, class imbalance, metrics, and training/evaluation setup:\n\n * Fine-tuned model always predicts same output class for new data\n * BERT and RoBERTa giving same outputs\n * Fine-tuned RoBERTa only predicting one category\n\n\n\nI would check at least these items:\n\n\n # Confirm the label mapping stored in the model config.\n print(model.config.id2label)\n print(model.config.label2id)\n\n # Check the class balance of each split (absolute counts and proportions).\n for name, split in {\n     \"train\": train_df,\n     \"val\": val_df,\n     \"test\": test_df,\n }.items():\n     print(\"\\n\", name)\n     print(split[\"label\"].value_counts())\n     print(split[\"label\"].value_counts(normalize=True))\n\n\nThings I would specifically verify:\n\n * `normal` and `anomalous` are mapped consistently in `label2id` and `id2label`.\n * The inference code is loading the fine-tuned checkpoint, not the base SecBERT model.\n * The tokenizer is loaded from the same checkpoint or a compatible base model.\n * The 2,000-row CSIC sample is not badly skewed.\n * Train/validation/test splits are stratified.\n * Validation metrics include per-class precision, recall, F1, and a confusion matrix.\n * The HTTP requests are not truncated before the useful payload appears.\n * The final decision threshold is calibrated on validation data, rather than relying only on `argmax` or a fixed 0.5 threshold.\n\n\n\nFor text classification with Transformers, the usual pattern is to attach a sequence-classification head, train it, and evaluate it with task-specific metrics. 
The Hugging Face guide for text classification is a useful reference for that setup.\n\nI would also inspect logits/probabilities for a small batch:\n\n\n import torch\n import pandas as pd\n\n # Assumes a binary (two-label) sequence-classification head.\n def inspect_predictions(model, tokenizer, texts, max_length=256):\n     model.eval()\n     rows = []\n\n     for text in texts:\n         inputs = tokenizer(\n             text,\n             return_tensors=\"pt\",\n             truncation=True,\n             padding=True,\n             max_length=max_length,\n         ).to(model.device)  # keep inputs on the same device as the model\n\n         with torch.no_grad():\n             outputs = model(**inputs)\n\n         logits = outputs.logits[0].detach().cpu()\n         probs = torch.softmax(logits, dim=-1)\n\n         pred_id = int(torch.argmax(probs))\n         pred_label = model.config.id2label.get(pred_id, str(pred_id))\n\n         rows.append({\n             \"text\": text[:160],\n             \"num_tokens\": int(inputs[\"attention_mask\"].sum()),\n             \"logit_0\": float(logits[0]),\n             \"logit_1\": float(logits[1]),\n             \"prob_0\": float(probs[0]),\n             \"prob_1\": float(probs[1]),\n             \"pred_id\": pred_id,\n             \"pred_label\": pred_label,\n         })\n\n     return pd.DataFrame(rows)\n\n\nIf both normal and anomalous examples get almost the same probability distribution, the model is probably not separating the classes yet.\n\n## 2. For CSIC HTTP requests, add simple string/protocol baselines\n\nFor CSIC, I would not start by assuming that a cybersecurity language model is the strongest representation.\n\nCSIC is HTTP request data, not ordinary natural-language security prose. 
Many important signals are string-level or protocol-structure-level:\n\n * URL encoding\n * SQLi tokens\n * XSS markers\n * CRLF markers\n * path traversal markers\n * unusual parameter values\n * long payloads\n * high special-character density\n * odd HTTP method/path/body combinations\n\n\n\nA simple character n-gram model may be surprisingly strong here.\n\nI would build at least this baseline:\n\n\n from sklearn.pipeline import Pipeline\n from sklearn.feature_extraction.text import TfidfVectorizer\n from sklearn.linear_model import LogisticRegression\n from sklearn.metrics import classification_report, confusion_matrix\n\n # Character n-gram TF-IDF features feeding a class-weighted linear classifier.\n clf = Pipeline([\n     (\"tfidf\", TfidfVectorizer(\n         analyzer=\"char_wb\",\n         ngram_range=(3, 5),\n         min_df=2,\n     )),\n     (\"lr\", LogisticRegression(\n         max_iter=1000,\n         class_weight=\"balanced\",\n     )),\n ])\n\n clf.fit(X_train, y_train)\n pred = clf.predict(X_val)\n\n # target_names assumes labels are encoded as 0=normal, 1=anomalous.\n print(confusion_matrix(y_val, pred))\n print(classification_report(y_val, pred, target_names=[\"normal\", \"anomalous\"]))\n\n\n`TfidfVectorizer` supports character n-gram features via `analyzer=\"char\"` or `analyzer=\"char_wb\"`; see the scikit-learn docs for TfidfVectorizer.\n\nInterpretation:\n\nResult | Likely interpretation\n---|---\nTF-IDF baseline works, SecBERT fails | The data may be separable, but the Transformer setup/preprocessing/training may be wrong.\nBoth TF-IDF and SecBERT fail | Check labels, split, preprocessing, sampling, and evaluation.\nTF-IDF beats SecBERT | This task may be more string-pattern-heavy than language-semantics-heavy.\nSecBERT beats TF-IDF | Good; then SecBERT may be adding useful cyber-domain representation.\n\nI would also add HTTP-specific handcrafted features:\n\n\n import re\n from urllib.parse import unquote\n\n def extract_http_features(text):\n     # URL-decode once and lowercase so encoded payloads match the patterns below.\n     decoded = unquote(text.lower())\n\n     return {\n         \"length\": len(decoded),\n         \"num_digits\": sum(c.isdigit() for c in decoded),\n         \"num_special\": sum(c in \"'\\\"<>;(){}[]\" for c in decoded),\n         \"num_params\": decoded.count(\"&\") + decoded.count(\"?\"),\n         \"has_sql_keyword\": int(bool(re.search(\n             r\"\\b(select|union|drop|insert|update|delete|or 1=1)\\b\",\n             decoded,\n         ))),\n         \"has_xss_marker\": int(bool(re.search(\n             r\"(<script|javascript:|onerror=|onload=)\",\n             decoded,\n         ))),\n         # \"..%2f\" surviving one decode pass indicates double-encoded traversal.\n         \"has_path_traversal\": int(\"../\" in decoded or \"..%2f\" in decoded),\n         # CRLF markers are checked on the raw (still encoded) request text.\n         \"has_crlf\": int(\"%0d\" in text.lower() or \"%0a\" in text.lower()),\n     }\n\n\nThose features can be fed into Logistic Regression, Random Forest, LightGBM, XGBoost, or combined with text features.\n\n## 3. Position SecBERT correctly\n\nI would frame SecBERT like this:\n\n\n SecBERT is not the detector by itself.\n SecBERT is the encoder.\n\n The detector is:\n preprocessing\n + tokenizer\n + encoder\n + classification head or anomaly scorer\n + labels or normal-only training data\n + validation metrics\n + thresholding\n\n\nThis matters because changing only the encoder will not fix problems such as:\n\n * reversed labels,\n * wrong checkpoint loading,\n * skewed sampling,\n * bad validation split,\n * truncation,\n * uncalibrated threshold,\n * or evaluating only four hand-picked examples.\n\n\n\nIf you want to keep the supervised Transformer route, compare cyber-domain encoders under the same evaluation setup:\n\nEncoder candidate | Notes\n---|---\nSecBERT | Cybersecurity-text pretrained BERT. Reasonable current baseline.\nSecRoBERTa | RoBERTa variant from the same SecBERT family.\nSecureBERT | RoBERTa-based cybersecurity-domain language model.\nSecureBERT 2.0 | Newer Cisco cybersecurity/threat-intelligence encoder based on ModernBERT.\nGeneral BERT/RoBERTa | Useful sanity check: domain pretraining may or may not help this specific HTTP-request task.\n\nBut I would only do this after the evaluation setup is clean. Otherwise, you may just be changing the model while keeping the same bug.\n\n## 4. 
For SIEM-style unknown anomaly detection, consider normal-only scoring\n\nIf the real goal is SIEM-style unknown anomaly detection, supervised `normal` vs `anomalous` classification on CSIC may be only a proxy task.\n\nA SIEM-like goal is often closer to:\n\n\n learn what normal looks like\n -> assign anomaly/risk scores\n -> alert only above a tuned threshold\n\n\nThat suggests another family of methods:\n\n\n HTTP request / log text\n -> BERT/SecBERT/SecureBERT embedding\n -> IsolationForest / One-Class SVM / kNN distance / Autoencoder\n -> anomaly score\n -> threshold tuned on validation data\n\n\nOptions worth testing:\n\nMethod | Fit\n---|---\nSecBERT/SecureBERT embeddings + IsolationForest | Lightweight normal-only anomaly scoring.\nembeddings + One-Class SVM | Classic novelty detection baseline.\nembeddings + kNN distance | Easy to interpret: “far from nearest normal examples.”\nembeddings + Autoencoder | Reconstruction-error-based anomaly score.\nhandcrafted HTTP features + IsolationForest | Often strong and cheap for request-level anomaly scoring.\n\nReferences:\n\n * scikit-learn IsolationForest\n * scikit-learn novelty and outlier detection\n * Hugging Face Forum discussion on anomaly / out-of-domain detection with BERT\n * HTTP request embedding work such as HTTP2vec\n * A related BERT + autoencoder direction for HTTP anomaly detection: Contextual embeddings and autoencoders for HTTP traffic anomaly detection\n\n\n\nFor SIEM-like alerting, I would not rely only on `argmax`. I would tune thresholds on validation data and inspect the precision/recall trade-off. 
scikit-learn’s precision_recall_curve and precision-recall curve example are useful references.\n\nExample threshold selection:\n\n\n import numpy as np\n from sklearn.metrics import precision_recall_curve, classification_report\n\n # y_val: 0=normal, 1=anomalous\n # anom_scores: probability or anomaly score where larger means \"more anomalous\"\n anom_scores = np.asarray(anom_scores)  # allow element-wise comparison below\n\n precision, recall, thresholds = precision_recall_curve(y_val, anom_scores)\n\n # precision and recall have one more entry than thresholds, so drop the last point.\n f1 = 2 * precision[:-1] * recall[:-1] / np.maximum(\n     precision[:-1] + recall[:-1],\n     1e-12,\n )\n\n best_idx = np.nanargmax(f1)\n best_threshold = thresholds[best_idx]\n\n print(\"threshold:\", best_threshold)\n print(\"precision:\", precision[best_idx])\n print(\"recall:\", recall[best_idx])\n print(\"f1:\", f1[best_idx])\n\n y_pred = (anom_scores >= best_threshold).astype(int)\n print(classification_report(y_val, y_pred, target_names=[\"normal\", \"anomalous\"]))\n\n\nIn practice, you may not want the threshold with the best F1. For SIEM, you may instead choose a threshold based on acceptable false positives or alert volume.\n\n## 5. 
If moving from CSIC to real SIEM/system logs, consider sequence modeling\n\nIf your final target is real SIEM, system, or application logs, I would be careful about classifying each raw log line independently.\n\nMany log anomalies are contextual:\n\n\n One failed login is normal.\n Fifty failed logins from the same source in five minutes is suspicious.\n\n One 404 is normal.\n A burst of unusual paths from the same user-agent may be suspicious.\n\n One admin endpoint access may be normal.\n Admin access from a new user-agent, source, or geography may be suspicious.\n\n\nFor real logs, a common pipeline is:\n\n\n raw logs\n -> parse into templates / structured events\n -> group by host, user, session, source IP, process, or time window\n -> build sequences\n -> model normal sequence patterns\n -> detect deviations\n\n\nUseful references:\n\n * LogPAI logparser: extracts event templates from unstructured logs and converts raw log messages into structured event sequences.\n * Drain: representative online log parser using a fixed-depth parse tree.\n * LogBERT: log anomaly detection via BERT; the pipeline includes raw data, parsing, structured logs, sequence construction, and modeling.\n * DeepLog: classic LSTM-based system log anomaly detection; see also the DeepLog paper.\n * deep-loglizer: toolkit for deep learning-based log analysis and anomaly detection.\n * log-anomaly-detection reading list: useful survey-style collection of log anomaly detection papers and tools.\n\n\n\nThis is a different formulation from CSIC single-request binary classification. CSIC is useful for web request experiments, but it may not capture the full context of SIEM log anomaly detection.\n\n## 6. Suggested experiment order\n\nIf I were doing this, I would use this order:\n\n 1. 
**Debug the current SecBERT setup**\n\n * label mapping,\n * checkpoint loading,\n * class balance,\n * stratified splits,\n * truncation,\n * confusion matrix,\n * score distribution,\n * threshold calibration.\n 2. **Build simple CSIC baselines**\n\n * char n-gram TF-IDF + Logistic Regression,\n * char n-gram TF-IDF + Linear SVM,\n * HTTP handcrafted features + classical classifier.\n 3. **Compare cyber-domain encoders**\n\n * SecBERT,\n * SecRoBERTa,\n * SecureBERT,\n * SecureBERT 2.0,\n * general BERT/RoBERTa.\n 4. **Try normal-only anomaly scoring**\n\n * embeddings + IsolationForest,\n * embeddings + One-Class SVM,\n * embeddings + kNN distance,\n * embeddings + Autoencoder,\n * HTTP features + IsolationForest.\n 5. **For real SIEM logs, move to sequence/context modeling**\n\n * parse logs into templates,\n * build sequences by host/session/source/time window,\n * test LogBERT / DeepLog-style approaches.\n\n\n\n## Bottom line\n\nI would not abandon SecBERT immediately, but I would not make it the center of the whole solution either.\n\nFor CSIC HTTP requests, simple string/protocol baselines may be very strong. For unknown anomaly detection, normal-only anomaly scoring may be a better formulation. For real SIEM logs, sequence/context modeling may be more appropriate than classifying each raw entry independently.\n\nSo my recommendation would be:\n\n\n Keep SecBERT as one candidate.\n Debug the current same-class prediction issue.\n Add simple HTTP/string baselines.\n Compare other cyber encoders only after the evaluation setup is clean.\n If the goal is unknown anomaly detection, test embedding + anomaly scoring.\n If the goal is real SIEM logs, consider log-sequence modeling.\n",
"title": "Secbert to detect anomalous log entries"
}