{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiccblhkjf5pgbvcwr23pa5iyrujjyl6gzy5g5at3qt7m4js2zog3y",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3midbs3dfpfu2"
  },
  "path": "/t/transformer-for-asynchronous-multi-stream-image-time-series-with-online-prediction/174804#post_2",
  "publishedAt": "2026-03-30T23:30:42.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "ACLarXiv",
    "arXiv",
    "OpenReview",
    "CVF Open Access",
    "ScienceDirect"
  ],
  "textContent": "Seems a known gap?\n\n* * *\n\nI did not find a **widely recognized, exact-match architecture** for your full setup: **two raw image streams**, **asynchronous real-valued timestamps**, **no forced alignment**, **spatial structure preserved inside each image**, and **causal online classification after every arrival**. What I found is a set of strong neighboring lines of work that each cover part of it. The intersection still looks like a real gap rather than a settled standard. MulT covers **unaligned multimodal attention**. StreaMulT covers **streaming unaligned multimodal inference with memory**. TSViT covers **image time series with time-aware visual tokenization**. ContiFormer and Transformer Hawkes Process cover **continuous-time irregular sequences**. AnytimeFormer covers **irregular asynchronous two-image-modality fusion**, but for reconstruction rather than online classification. RAMNet and SODFormer cover **asynchronous visual streams with online updates**, but in event-plus-frame settings and different tasks. Time-IMM then reinforces the broader point that realistic irregular asynchronous multimodal settings are still under-served in current benchmarks and methods. (ACLarXiv)\n\n## The background\n\nMost transformer work on time series grew out of one of three easier settings:\n\n  * **regularly sampled sequences**\n  * **feature-level multimodal streams**\n  * **video-like synchronized visual inputs**\n\n\n\nYour problem sits outside all three. You have **visual observations**, not just vectors. They arrive at **irregular real times**. The two streams are **not aligned**. And you need **online causal updates**, not one final prediction after the sequence ends. That combination is exactly why no single canonical paper shows up. The literature is rich on each axis separately, but sparse at the full intersection. (ACLarXiv)\n\n## What already exists, and how close it is\n\n### 1. 
Unaligned multimodal transformers\n\n**MulT** is the classic reference for unaligned multimodal sequences. Its key idea is directional crossmodal attention that lets one modality attend to another across distinct time steps **without explicit alignment**. That is very relevant to your two unsynchronized streams. But MulT was developed for low-level modality feature sequences, not raw image patch streams, and it is not an online streaming vision model. (ACLarXiv)\n\n**StreaMulT** is closer in deployment spirit. It explicitly defines a setting where the goal is prediction across time from **heterogeneous multimodal sequential data in a streaming fashion**, and it uses crossmodal attention plus a **memory bank** to handle **unaligned input streams** and **arbitrarily long inputs**. That is the closest existing transformer framing to your online requirement. The mismatch is that it is still not a raw-image-first architecture. (arXiv)\n\n### 2. Irregular continuous-time sequence models\n\n**Transformer Hawkes Process** is important conceptually because it treats the input as a **continuous-time event sequence** and explicitly says vanilla transformer machinery is not directly ready-made for continuous-time event data. It adapts self-attention to that setting and argues for attention-based modeling of short- and long-range event dependencies.\n\n**ContiFormer** pushes the same idea further. It states that ordinary recurrent and transformer models are limited by their **discrete characteristic** on irregular continuous-time data, and extends transformer relation modeling into the **continuous-time domain**. That makes it one of the strongest references for your timestamp problem. (arXiv)\n\n**Time2Vec** is not a full architecture, but it is still one of the cleanest timestamp components. It is explicitly proposed as a **model-agnostic vector representation of time** for synchronous and asynchronous events. 
That makes it a natural candidate for event-time embeddings in your setup. (OpenReview)\n\n### 3. Visual time-series models that preserve spatial structure\n\n**TSViT** is probably the most relevant visual paper if you care about keeping spatial image structure instead of collapsing each image to one scalar or one vector too early. It builds a **factorized temporo-spatial encoder** for satellite image time series and introduces **acquisition-time-specific temporal positional encodings**. This is strong evidence that image time series benefit from explicit timestamp-aware modeling while still treating the input as images, not just tabular points. (CVF Open Access)\n\n**S-ViT** is relevant for a different reason. It uses a **memory-enabled temporally aware spatial encoder** to produce frame-level features, then sends those features to a temporal decoder. That separation is useful for your case because it points away from one giant flat spatiotemporal token stream and toward a more scalable “encode image first, fuse over time second” design. (CVF Open Access)\n\n### 4. Asynchronous visual streaming systems\n\n**RAMNet** is one of the strongest near-matches for the online semantics you want. It is not transformer-based, but it is explicitly built for **asynchronous and irregular data from multiple sensors**, keeps a hidden state that is **updated asynchronously**, and can be **queried at any time** for a prediction. The mismatch is that it works on events and frames for monocular depth, not two ordinary image streams for classification.\n\n**SODFormer** is another very relevant near-match. It fuses **asynchronous events and frames**, uses a **spatiotemporal transformer**, and says it can **continuously detect objects in an asynchronous manner**. Its fusion module can be **queried at any time**, specifically to avoid the bottleneck of synchronized frame-based fusion. Again, the mismatch is task and modality type rather than the core streaming idea. 
(arXiv)\n\n### 5. Two image modalities with irregular timestamps\n\n**AnytimeFormer** is the closest paper I found to your raw input shape. It takes **Sentinel-2 optical** and **Sentinel-1 SAR** observations together with their timestamps, uses a **time-align attention module** to adaptively align **temporally asynchronous multi-modal time series**, and avoids extra alignment preprocessing. That is very close to “two image channels with irregular timestamps.” The mismatch is that the task is **reconstruction at arbitrary times**, not online sequence classification after each arrival. (ScienceDirect)\n\n### 6. A useful warning paper\n\n**MICA** is important because it points out a failure mode that matters a lot in your case. Its argument is that asynchronous multimodal fusion is not just a timing problem. It is also a **distribution discrepancy** problem. If the two modalities live in different feature distributions, plain cross-attention can become unreliable, so it performs attention in a more modality-invariant space. If your two streams come from genuinely different sensors, this paper is very relevant to architecture design. (CVF Open Access)\n\n## So is this a known gap?\n\nYes. That is the most accurate summary.\n\nThe field clearly knows about:\n\n  * **unaligned multimodal streams** (ACLarXiv)\n  * **continuous-time irregular sequences** (arXiv)\n  * **timestamp-aware visual time series** (CVF Open Access)\n  * **asynchronous online visual fusion**\n\n\n\nBut those pieces are usually studied in different communities. Time-IMM makes the broader point explicitly: real-world time series are often **irregular**, **multimodal**, **asynchronous**, and **messy**, while many benchmarks and methods still assume cleaner, more regular settings. That supports the claim that your problem is not a solved standard benchmark case. (arXiv)\n\n## What I think about your proposed design\n\nYour instinct is good. 
The main refinement is architectural.\n\n### What is right in your idea\n\nThese parts are solid:\n\n  * **image encoder first**\n  * **explicit real-valued time embeddings**\n  * **channel or modality ID embeddings**\n  * **causal prediction after every new observation**\n\n\n\nThose choices line up well with the existing literature. MulT and StreaMulT support the unaligned multimodal part. Time2Vec, THP, and ContiFormer support treating time explicitly. TSViT supports the idea that timestamps belong in visual time-series modeling. (ACLarXiv)\n\n### What I would change\n\nI would **not** literally replace all positional indices with time embeddings.\n\nInside each image, you still need **2D spatial positional information**. Across images, you need **event-time information**. Those are different roles. TSViT is a strong precedent for keeping image-space modeling explicit while adding time-aware encodings. So I would use:\n\n  * **2D spatial positions inside the per-image encoder**\n  * **continuous-time embeddings at the event level**\n  * **modality embeddings at the event level** (CVF Open Access)\n\n\n\nI would also **not** start with a model that sends every patch token from every image into one ever-growing causal transformer. That is elegant, but it is also the most likely place to hit compute and memory problems. StreaMulT’s use of memory banks and S-ViT’s separation of spatial and temporal stages both point toward a more scalable design. (arXiv)\n\n## The design I would actually recommend\n\nI would treat each arrival as an **event** made of:\n\n  * the image\n  * its real timestamp\n  * its stream ID\n\n\n\nThen I would use this pipeline:\n\n### 1. Per-image visual encoder\n\nEncode each image with a ViT-like or CNN-plus-transformer backbone that preserves spatial structure. Keep normal 2D patch positions here. 
If the two sensor modalities are very different, use separate stems or adapters, because MICA shows that crossmodal attention can become unreliable when modality distributions differ too much. (CVF Open Access)\n\n### 2. Per-image token compression\n\nDo not export all patches into the temporal model. Export either:\n\n  * one global token, or\n  * a small set of learned latent summary tokens\n\n\n\nThis keeps more spatial information than one scalar summary, but avoids a temporal patch-history explosion. TSViT and S-ViT both support this kind of factorized thinking. (CVF Open Access)\n\n### 3. Event-time encoding\n\nAdd:\n\n  * absolute timestamp embedding\n  * time since previous event\n  * time since previous event from the same stream\n  * time since previous event from the other stream\n  * stream ID embedding\n\n\n\nThat exact combination is my recommendation, not a named paper module, but it follows naturally from the continuous-time event perspective in Time2Vec, THP, and ContiFormer. (OpenReview)\n\n### 4. Streaming fusion with memory\n\nInstead of one universal stream only, keep:\n\n  * memory for stream A\n  * memory for stream B\n  * fused memory for prediction\n\n\n\nWhen a new A image arrives, update A memory, let it attend into recent B memory, then update fused state and emit a prediction. That is much closer to the logic of MulT and StreaMulT than to a monolithic merged token list. (ACLarXiv)\n\n### 5. Prefix-level supervision\n\nBecause you need online classification, train the model to be correct not only at the end, but after each arrival. RAMNet and StreaMulT are both useful here as operational references: they are built around state updates and predictions across time, not only one terminal decision. (arXiv)\n\n## My shortest answer\n\nThere is **no standard named architecture** that already solves your exact setting end to end. 
The nearest pieces are:\n\n  * **MulT** for unaligned multimodal attention (ACLarXiv)\n  * **StreaMulT** for streaming unaligned multimodal inference with memory (arXiv)\n  * **TSViT** for timestamp-aware image time-series transformers (CVF Open Access)\n  * **ContiFormer** and **Transformer Hawkes Process** for continuous-time irregular event modeling (arXiv)\n  * **AnytimeFormer** for asynchronous two-image-modality fusion with timestamps (ScienceDirect)\n  * **RAMNet** and **SODFormer** for asynchronous online visual prediction semantics\n\n\n\nSo the best classification of your idea is:\n\n**not solved as a standard recipe, but strongly supported by existing components.** (ACLarXiv)\n\n## What I would call it\n\nIf you need a label for the architecture family, I would call it one of these:\n\n  * **streaming multimodal continuous-time vision transformer**\n  * **event-driven multimodal vision transformer**\n  * **marked-event transformer for asynchronous visual streams**\n\n\n\nThose are descriptive names, not established paper titles.\n\n## Final judgment\n\nYour proposed direction is not only reasonable. It is probably **the right research direction** for this problem. The only strong caveat is this:\n\n**keep space inside the image encoder, and keep time at the event level.**\n\nDo that, and your design becomes a clean synthesis of the strongest ideas already in the literature, while still addressing a real gap.",
  "title": "Transformer for asynchronous multi-stream image time-series with online prediction?"
}