{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreibodnwbjxbkqgl5cdmjn4z4jadkj6il7qi7q3rnqhtqmcyc4osac4",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mibzhsbwims2"
},
"path": "/t/transformer-for-asynchronous-multi-stream-image-time-series-with-online-prediction/174804#post_1",
"publishedAt": "2026-03-30T12:39:43.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "I have **two streams of images**, each stream corresponding to a different “channel” (e.g. a different sensor modality). The streams are **not synchronized**: at any given moment, a new image arrives from one stream or the other, each with a real-valued timestamp. I want to **classify the sequence online**, i.e. produce an updated prediction after every new incoming image.\n\nKey constraints:\n\n * Spatial features within each image matter (not just a scalar summary)\n * Timestamps are irregular and not aligned across streams\n * Predictions must be causal: each one may use only the images received so far, and should improve as more observations arrive\n\nThe natural design seems to be: **ViT encoder per image → causal transformer over the merged token stream**, with real-valued timestamp embeddings (e.g. Time2Vec) replacing positional indices, and a band/channel ID as an additional embedding.\n\nIs there an existing architecture or paper that handles this exact setup? Or is this a known gap?",
"title": "Transformer for asynchronous multi-stream image time-series with online prediction?"
}