{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreibodnwbjxbkqgl5cdmjn4z4jadkj6il7qi7q3rnqhtqmcyc4osac4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mica6rjazmz2"
  },
  "path": "/t/transformer-for-asynchronous-multi-stream-image-time-series-with-online-prediction/174804#post_1",
  "publishedAt": "2026-03-30T12:39:43.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "I have **two streams of images** , each stream corresponding to a different “channel” (e.g. different sensor modality). The streams are **not synchronized** — at any given moment, a new image arrives from one stream or the other, each with a real-valued timestamp. I want to **classify the sequence online** , i.e. produce an updated prediction after every new incoming image.\n\nKey constraints:\n\n  * Spatial features within each image matter (not just a scalar summary)\n\n  * Timestamps are irregular and not aligned across streams\n\n  * Prediction must improve causally as more observations arrive\n\n\n\n\nThe natural design seems to be: **ViT encoder per image → causal transformer over the merged token stream** , with real-valued timestamp embeddings (e.g. Time2Vec) replacing positional indices, and band/channel ID as an additional embedding.\n\nIs there an existing architecture or paper that handles this exact setup? Or is this a known gap?",
  "title": "Transformer for asynchronous multi-stream image time-series with online prediction?"
}