Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihzd3elnai5ypxoqw5zph4fqcinrh27bpaw5fpfwkmtvpp7tcj5xy",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mmadmwyztgx2"
  },
  "path": "/t/seeking-arxiv-cs-ai-endorsement-fine-tuning-llama-3-2-for-u-s-immigration-law-q-a-using-aws-sagemaker/176110#post_1",
  "publishedAt": "2026-05-19T21:03:17.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "https://arxiv.org/auth/endorse?x=IYUKX3",
    "GitHub - natebell510/usa-immigration: Fine-tuned Llama 3.2 3B on 17,058 Q&A pairs from official U.S. immigration sources (USCIS, 8 CFR, BIA). Outperforms Llama 3 8B zero-shot (+27% mean score, 4× fully-correct). End-to-end pipeline: crawl → parse → QA generation (Bedrock) → LoRA fine-tuning (SageMaker). · GitHub"
  ],
  "textContent": "Hi all,\n\nI’m looking for an arXiv cs.AI endorsement to submit my first paper. I’m QA Automation Engineer and currently studying Data Engineering and Machine Learning.\n\n**What endorsement involves:** It’s a quick administrative step, you click the arXiv link below to confirm the paper fits the cs.AI category. It doesn’t require you to read the paper or more formally endorse it. You just need to have published 3+ papers in a CS category on arXiv in the past 5 years. **Endorsement link:** https://arxiv.org/auth/endorse?x=IYUKX3 and Endorsement Code: IYUKX3\n\nAbstract:\nThis paper describes an end-to-end pipeline for constructing a large-scale, source-grounded question-answering dataset covering U.S. immigration law, and for fine-tuning a small language model on that dataset. Starting from official U.S. government sources – including the USCIS Policy Manual, federal regulations (8 CFR), BIA precedent decisions, and immigration statistics – I crawl, parse, normalize, and chunk 10,056 canonical documents. Using Amazon Bedrock (Claude Sonnet 4.6), I generate 17,058 structured Q&A pairs across 13 immigration subdomains, each annotated with source provenance, authority level, answer type, and immigration subtopic. I then fine-tune Meta’s Llama 3.2 3B Instruct model via AWS SageMaker JumpStart using parameter-efficient LoRA, merge the adapters into the base weights, and publish both the dataset and model to Hugging Face. The complete pipeline – from first crawl to published model – runs end-to-end on commodity cloud infrastructure for approximately $25 in total compute cost. All artifacts are publicly available at GitHub - natebell510/usa-immigration: Fine-tuned Llama 3.2 3B on 17,058 Q&A pairs from official U.S. immigration sources (USCIS, 8 CFR, BIA). Outperforms Llama 3 8B zero-shot (+27% mean score, 4× fully-correct). End-to-end pipeline: crawl → parse → QA generation (Bedrock) → LoRA fine-tuning (SageMaker). · GitHub .",
  "title": "Seeking arXiv cs.AI endorsement — Fine-Tuning Llama 3.2 for U.S. Immigration Law Q&A Using AWS SageMaker"
}