Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifj5u4gmy32qfqzy2sd63na66x7r43r655gld4geanh2e7amfrug4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mhcqqajx5ae2"
  },
  "path": "/t/research-orientation-on-aiops-for-university-students/174280#post_2",
  "publishedAt": "2026-03-17T11:44:07.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "arXiv",
    "Heng Li",
    "OpenReview",
    "OpenTelemetry",
    "USENIX",
    "Frontiers",
    "ScienceDirect"
  ],
  "textContent": "I’m not very familiar with AIOps, so for now, I’ll just use GPT to summarize the current trends in research:\n\n* * *\n\nThere are still good directions today in 2026.\n\nThe important shift is this: **AIOps is no longer most interesting at the level of “predict one metric better.”** The stronger questions now are about **diagnosis, decision support, realistic evaluation, and operating newer systems such as LLM and GPU infrastructure**. Recent benchmarks and surveys show that the field has moved from narrow detection toward multimodal reasoning over logs, metrics, traces, incident knowledge, and operator workflows. (arXiv)\n\n## Why you feel stuck\n\nYou started from a common student entry point: traffic prediction and error forecasting. That is logical. It is measurable, easy to prototype, and easy to find papers on. The problem is that those tasks are often **supporting tasks**, not the main operational objective. The cloud AIOps survey frames the real goals around detection, failure prediction, root-cause analysis, and actions that reduce MTTD and MTTR, while OpenTelemetry frames observability around answering “Why is this happening?” using traces, metrics, and logs together. That makes isolated prediction feel narrower than it first appears. (arXiv)\n\nThere is a second reason. Some parts of the literature are simply crowded. A survey of AIOps projects found that monitoring data such as logs and performance metrics are the most common inputs and that anomaly detection is the most common goal. That means a lot of visible work is concentrated in the same few problem formulations. If you keep searching inside that corridor, it will look saturated. (Heng Li)\n\nThere is a third reason. Benchmarks used to make some problems look easier than they are. 
RCAEval already had to introduce a benchmark with **735 failure cases** and **15 reproducible baselines** for microservice RCA, and newer benchmarks such as OpenRCA and Cloud-OpsBench exist because the community still thinks evaluation is incomplete and too far from realistic incident work. Fields do not keep building benchmarks this quickly when the problems are finished. (arXiv)\n\n## What changed by 2026\n\nBy 2026, two trends are very clear.\n\nFirst, the field is becoming **multimodal and agentic**. Cloud-OpsBench argues that modern RCA should be evaluated as **active reasoning** rather than passive classification, and it introduces **452 fault cases across 40 root-cause types** over the Kubernetes stack. OpenRCA does something similar from the LLM side, with **335 failures** and **over 68 GB of logs, metrics, and traces**. The message is that the research frontier is now closer to “investigate like an operator” than “classify one snapshot.” (arXiv)\n\nSecond, **LLM-based AIOps is real enough to study seriously, but still weak enough to leave major room for research**. OpenRCA reports that even with a specially designed RCA-agent, the best-performing model solved only **11.34%** of failure cases. A 2026 failure-analysis paper then ran **1,675 agent runs across five models** and found recurring pitfalls such as hallucinated data interpretation and incomplete exploration. That is not a solved area. It is an immature area with visible failure modes. (OpenReview)\n\nAt the same time, the base layer has not changed: traces, metrics, and logs still matter, and **context propagation** still determines whether those signals can be tied together correctly in a distributed system. That means students who understand observability well are still positioned better than students who only understand modeling. 
(OpenTelemetry)\n\n## The best way to think about specialization\n\nDo not ask:\n\n> What else can I predict?\n\nAsk:\n\n> What decision in operations is still badly supported?\n\nThat one change in framing usually reveals better research topics.\n\nA good specialization in AIOps today usually sits at one of these boundaries:\n\n  * between **observability** and **diagnosis**\n  * between **diagnosis** and **operator action**\n  * between **benchmarks** and **real incidents**\n  * between **classical cloud systems** and **new AI infrastructure** (OpenTelemetry)\n\n\n\n## Where the strongest opportunities are today\n\n## 1. Multimodal root-cause analysis under incomplete telemetry\n\nThis is the best direction for most students.\n\nThe 2023 survey on AIOps for cloud platforms explicitly says there are **very limited efforts** on **trace and multimodal failure prediction**. The 2024 failure-diagnosis survey then frames microservice diagnosis as a multimodal problem involving logs, metrics, traces, events, and topology. OpenTelemetry’s official primer reinforces why this matters: each signal answers a different part of the debugging problem. (arXiv)\n\nThis creates a strong research question:\n\n**What can still be diagnosed when telemetry is missing, delayed, downsampled, or corrupted?**\n\nThat question is strong because it is realistic. In real systems, traces are sampled, logs are noisy, metrics are delayed, and context propagation breaks. A method that works only under perfect observability is not very useful. (OpenTelemetry)\n\nA thesis-sized version would be:\n\n**Robust RCA for cloud-native systems under partial observability**\n\nYou would:\n\n  * instrument a small microservice system\n  * collect metrics, logs, and traces\n  * inject several fault types\n  * compare performance when one signal is missing or degraded\n\n\n\nThis is technically solid, experimentally manageable, and still underbuilt enough to matter. (arXiv)\n\n## 2. 
Benchmarking and evaluation, not just new models\n\nThis area is less glamorous than model-building, but often more valuable.\n\nRCAEval, OpenRCA, and Cloud-OpsBench together show that the field still needs better evaluation infrastructure: RCAEval standardized reproducible RCA benchmarking for microservices, OpenRCA exposed how hard long-context, multimodal RCA is for LLMs, and Cloud-OpsBench moved the evaluation target toward active tool use and deterministic reproducibility. That combination says the benchmark layer is still under construction. (arXiv)\n\nA good student contribution here does **not** need to be “invent a novel architecture.” It can be:\n\n  * a better failure-injection setup\n  * a clearer evaluation protocol\n  * a partial-observability benchmark variant\n  * a benchmark for drift or telemetry misalignment\n  * a comparison of diagnosis quality versus data-collection cost\n\n\n\nThat kind of work is publishable because the community is still trying to measure the right things. (arXiv)\n\n## 3. Incident reports, postmortems, and historical incident knowledge\n\nThis is one of the most underused directions for students who like both systems and language.\n\nAutoARTS studied **over 2,000 incidents from more than 450 Azure services** to build a better root-cause labeling system, which shows how important and messy incident knowledge is in practice. The LLM-era AIOps survey also shows that newer systems increasingly incorporate human-generated artifacts such as Q&A, software information, and incident reports rather than relying only on raw telemetry. 
(USENIX)\n\nThat suggests several strong questions:\n\n  * how to retrieve similar past incidents during triage\n  * how to use postmortems to improve ranking of root-cause candidates\n  * how to clean noisy incident labels\n  * how to generate grounded summaries that point to evidence, not just fluent text\n\n\n\nA strong project here would be:\n\n**Telemetry plus postmortem retrieval for incident triage**\n\nThat is better than a generic “LLM for AIOps” project because it has grounding, evaluation, and immediate practical value. (USENIX)\n\n## 4. Decision-aware forecasting instead of plain forecasting\n\nIf you still like forecasting, keep it, but reframe it.\n\nForecasting is mature enough that “slightly better prediction accuracy” is often not a compelling research story by itself. More interesting is whether prediction improves **autoscaling, SLO management, capacity planning, or bottleneck ranking**. A recent 2026 review of distributed tracing and proactive SLO management emphasizes evaluation protocols for SLO violation prediction and actionable outputs such as bottleneck candidate ranking and what-if estimation. Meanwhile, the AIOps model-update study shows that once deployed, models must be actively maintained because operational data evolve over time. (Frontiers)\n\nSo instead of:\n\n  * “predict traffic better”\n\n\n\nmove to:\n\n  * “predict enough, early enough, and robustly enough to support SLO-safe control”\n\n\n\nA good topic would be:\n\n**Drift-aware workload prediction for SLO-constrained autoscaling**\n\nThat keeps your current interests but upgrades the operational meaning of the work. (arXiv)\n\n## 5. AIOps for LLM and GPU systems\n\nThis is the freshest niche in 2026.\n\nA 2026 study on GPU-driven LLM workloads tested **24 RCA methods** and found that existing RCA tools **do not generalize** to these systems; multi-source approaches did best, metric-based methods depended heavily on the fault type, and trace-based methods largely failed. 
That is a strong sign that classical web-service AIOps is not enough for modern AI-serving stacks. (arXiv)\n\nThis is a very good area if you have access to:\n\n  * a lab running inference or training workloads\n  * GPU cluster telemetry\n  * a simulator or controlled deployment setup\n\n\n\nA strong topic would be:\n\n**Failure diagnosis for LLM inference services using multi-source observability**\n\nThis is genuinely current. It also has a clear argument for why old methods are insufficient. (arXiv)\n\n## 6. Model maintenance, drift, and lifecycle reliability\n\nThis is not fashionable, but it is very real.\n\nThe 2023 model-update paper says directly that **when and how to update AIOps models remain an under-investigated topic** and shows that active update strategies can outperform a stationary model in both performance and stability. If you want a topic that looks modest but teaches excellent research habits, this is one of the best. (arXiv)\n\nThis is especially relevant because operations data are not static. New deployments, new users, new software versions, new logging conventions, and new workloads all change the distribution. AIOps models that are good on day one and stale on day sixty are not good AIOps models. (arXiv)\n\nA solid project here would be:\n\n**When should an RCA or forecasting model be retrained under workload drift?**\n\nThat is practical, measurable, and deployment-relevant. (arXiv)\n\n* * *\n\n## What looks crowded today\n\nThese areas are not useless. They are just harder to make important unless you bring a strong twist.\n\n### Plain traffic prediction on standard traces\n\nToo many papers stop at forecast accuracy. The more meaningful work now ties prediction to SLOs, control, or cost. (Frontiers)\n\n### Log-only anomaly detection or log parsing\n\nThere is still active work here, including LLM-based methods, but this is one of the busiest lanes in the literature. It is easier to write incremental papers here than durable ones. 
(ScienceDirect)\n\n### Toy RCA on simplified microservice setups\n\nRCAEval, OpenRCA, and Cloud-OpsBench exist precisely because older evaluation setups were not enough. Purely toy results are less convincing now. (arXiv)\n\n### Generic “LLM for AIOps” demos\n\nThe benchmark evidence is still sobering. OpenRCA’s best RCA-agent result is 11.34%, and the 2026 failure-analysis work shows consistent agent failure patterns. The bar is now much higher than “I prompted an LLM on logs.” (OpenReview)\n\n* * *\n\n## What I would recommend for you specifically\n\nFor a second-year university student, I would optimize for three things at once:\n\n  1. **high learning value**\n  2. **feasible experiments**\n  3. **a topic with room for a real contribution**\n\n\n\nThat points me to this ranking.\n\n### Best overall choice\n\n**Robust multimodal RCA under incomplete telemetry**\nWhy: strong gap, realistic, good systems training, publishable without giant resources. (arXiv)\n\n### Best if you like NLP or LLMs\n\n**Incident retrieval and grounded triage from postmortems plus telemetry**\nWhy: strong link to practice, less crowded than generic log parsing, and more grounded than vague LLM demos. (USENIX)\n\n### Best if you want to keep forecasting\n\n**Decision-aware forecasting for autoscaling or SLO support, with drift handling**\nWhy: preserves your current skills but makes the problem more meaningful. (Frontiers)\n\n### Best if you want a fresh niche\n\n**AIOps for LLM/GPU systems**\nWhy: very current and not yet well served by existing methods. (arXiv)\n\n* * *\n\n## What I would do in your position over the next year\n\n### Step 1. Build observability literacy first\n\nBefore choosing a thesis, get very comfortable with:\n\n  * what traces are\n  * what metrics are\n  * what logs are\n  * how context propagation connects them\n\n\n\nThat is the substrate of almost every serious AIOps problem today. Without it, many papers look more magical than they are. 
(OpenTelemetry)\n\n### Step 2. Reproduce one benchmark\n\nPick one:\n\n  * **RCAEval** if you want classical RCA on microservices\n  * **OpenRCA** if you want LLM-based or multimodal RCA\n  * **Cloud-OpsBench** if you want agentic RCA and tool use (arXiv)\n\n\n\nYour goal here is not to beat the benchmark immediately. It is to understand what the data, labels, and failure cases really look like.\n\n### Step 3. Add one realism constraint\n\nChoose one:\n\n  * missing traces\n  * noisy logs\n  * delayed metrics\n  * drift\n  * weak labels\n  * retrieval from past incidents\n  * abstention or confidence calibration\n\n\n\nOne realism constraint is enough for a strong undergraduate project. (arXiv)\n\n### Step 4. Evaluate the right outcome\n\nDo not evaluate only MAE, F1, or Top-1.\n\nAlso ask:\n\n  * did triage get faster?\n  * did the top-k root-cause ranking improve?\n  * how robust was the method when observability degraded?\n  * how costly is the method to run?\n  * does it know when not to answer? (arXiv)\n\n\n\n### Step 5. 
Write the paper around the bottleneck, not the architecture\n\nIn AIOps, a valuable contribution can be:\n\n  * a benchmark setup\n  * a robustness study\n  * a cleaner failure-injection protocol\n  * a grounded triage pipeline\n  * a realistic comparison under missing data\n\n\n\nIt does not need to be “a bigger model.” (arXiv)\n\n* * *\n\n## Good topic statements you could actually use\n\nThese are all better than “AI for cloud traffic prediction.”\n\n### Safe and strong\n\n**Observability-driven root-cause analysis under partial telemetry in cloud-native systems** (OpenTelemetry)\n\n### Strong if you like documents and LLMs\n\n**Grounded incident triage using telemetry and historical postmortems** (USENIX)\n\n### Strong if you want systems plus forecasting\n\n**Drift-aware workload prediction for SLO-oriented autoscaling** (arXiv)\n\n### Bold and current\n\n**Failure diagnosis for GPU-backed LLM inference services** (arXiv)\n\n* * *\n\n## Bottom line\n\nYour problem is not that AIOps has no gaps left.\n\nYour problem is that you started in one of the most crowded entry corridors.\n\nToday, the field is more promising if you move from **prediction-centric AIOps** to **decision-centric AIOps**. The best opportunities are where observability is incomplete, labels are weak, evaluation is immature, or the systems themselves have changed faster than the diagnostic tools. That is exactly what the current surveys, benchmarks, and observability standards are telling us. (arXiv)\n\nIf I had to give you one direct recommendation, it would be this:\n\n**Specialize in observability-driven failure diagnosis for cloud systems under realistic constraints, especially incomplete telemetry.**\n\nThat direction is current, technically serious, and still open enough for a second-year student to make a meaningful contribution. (arXiv)",
  "title": "Research Orientation on AIOps for University Students"
}