{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreienaioqbhn7x2wqpl5iqyiq3rfkggqkzumxtmmvslvlgbvuqw7dw4",
"uri": "at://did:plc:lk3jfj3zq4k4wxnk474axylu/app.bsky.feed.post/3mgyj2nmxbnu2"
},
"path": "/t/need-serious-beta-testers-for-trl5-on-prem-dataset-cleaning-pipeline-using-openai-api/1376682#post_1",
"publishedAt": "2026-03-14T02:19:06.000Z",
"site": "https://community.openai.com",
"tags": [
"https://github.com/mentoratechnologies/PurifyFactory-Beta"
],
"textContent": "Hi everyone,\n\nI’m looking for a small number of **serious technical beta testers** for **Purify Factory** , a **model-agnostic on-prem pipeline** for cleaning noisy text datasets at scale before downstream AI use.\n\nThis is **not** an open-source release and not a casual “try it if you feel like it” beta.\nThis phase is meant to support **TRL5 validation** , so I’m specifically looking for testers who can run **real, structured tests** and provide **useful, reproducible feedback**.\n\nWhat Purify Factory does:\n\n * cleans noisy multilingual text datasets\n\n * processes JSONL datasets with `sentence` or `text` fields\n\n * produces auditable output with original text, cleaned text, token usage, cost, and provider metadata\n\n * supports multiple providers, including **OpenAI** , plus Anthropic, Gemini, and local backends\n\n\n\n\nWhat I need from testers:\n\n * **Linux x86_64** environment\n\n * willingness to test on a **real dataset** , not toy samples\n\n * ideally **1,000+ records minimum** ; **5,000+ preferred** for stronger validation\n\n * ability to report installation issues, runtime behavior, failure modes, output quality, and edge cases in a structured way\n\n * willingness to share feedback that is actually usable for certification and release hardening\n\n\n\n\nImportant constraints:\n\n * the repo is the **beta access point** , not a source-code release\n\n * access requires a **personal free beta license**\n\n * the license is **machine-specific**\n\n * testers must use their **own API key / credits**\n\n * this is best suited for developers, data engineers, NLP practitioners, or technical teams already working with text preprocessing / LLM data pipelines\n\n\n\n\nIf you are interested **and you can run a serious validation pass** , the beta repo is here:\n\nhttps://github.com/mentoratechnologies/PurifyFactory-Beta\n\nI’m especially interested in feedback from people already using the **OpenAI API** in production or evaluation pipelines, because I want to understand how Purify Factory behaves in real data-preparation workflows, not just in isolated demos.\n\nIf you match this profile and want to participate, reply in the thread or reach out through the repo instructions.",
"title": "Need serious beta testers for TRL5: on-prem dataset cleaning pipeline using OpenAI API"
}