Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreibwuzuu7ht72evm255iaaccae6qoe4nyec4lvdo6zdpp2t4cvrsmm",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mouazrkwqnz2"
  },
  "path": "/t/help-with-a-local-document-rag-system-storage-ingestion-query-highlighting/176993#post_3",
  "publishedAt": "2026-06-22T05:00:20.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Thank you very very much for the in-depth explanation on what to do,\n\nI have another question where in we are talking about structural extraction right for “Compare this”, “Which is FY did I earn more” and other such queries, we need to extract text into a typed model right?\n\nso we first send the “text” of the chunk to the LLM and then ask it to extract such information into a typed model and then store everything into embeddings?\n\nor should we use a parser to extract info into a typed model\n\nand during retrieval we query based on the embeddings in pgvector and we will fetch\nthe column and also the typed models then send it for answering?\n\none more question how would .xlsv and .csv work?\nyou mentioned, storing rows and cols with cell_id etc\nbut in excel such comparison queries will occur more frequently right?\n\nso is it recommend to make generalized typed model for them as well? or to parse them into a dataframe or something similar?\n\nwill locally run models be good enough for such extraction into typed models?\n\ndo we have to worry about chunk overlapping in case of .xlsv? as sending long sheets to the model will raise token errors right?\n\nSorry for continuing to bug you regarding this",
  "title": "Help with a Local Document RAG System (Storage + Ingestion + Query + Highlighting)"
}