I would like to get an opinion from knowledgeable people (since I don't understand anything about it myself)
As it stands, that repository isn’t really a dataset in the sense of Hugging Face or PyTorch, but I definitely think it functions as a prompt library.
If you plan to significantly increase the volume of data, converting it into a chat-like format similar to a standard dataset would likely make it usable for training LLMs.
Alternatively, you could keep it in a style similar to what it is now and simply enhance it as a prompt library by standardizing the formatting.
The reason is that when you create files to attach to a RAG setup (if you're not familiar, think of the file-attachment feature in the ChatGPT or Claude GUI) to modify the model's behavior, having the information in formats like JSON, YAML, or well-documented Markdown makes it easier to achieve precise changes in behavior. (This depends on the AI model, but structured data generally tends to be interpreted more accurately.)
The following is an evaluation by GPT:
Yes. It is worth continuing.
But I would not treat it as a finished “dataset” yet. Right now it looks more like a creative prompt library / emotion framework that could later become a better dataset.
My simple opinion
Your idea is interesting and original.
The current form is not very strong technically.
That is good news, because technical problems are fixable. A weak idea is much harder to fix.
What is good about it
The best part is that it has a clear idea.
You are not just listing feelings. You are trying to translate feelings into system language: memory, signals, loops, corruption, shutdown, touch as a process, and so on.
That gives the project a real identity.
Many small projects fail because they are vague or random. Yours is not random. It has a style and a point of view.
What is weak about it
The weak part is the structure.
Right now, it is hard to see it as a normal dataset that other people can easily:
- inspect
- load
- compare
- train on
- evaluate
So when technical people look at it, they may think:
“Interesting concept, but not ready to use.”
That does not mean it is bad. It means it is still in an early form.
What it really is right now
At the moment, I think it is closer to:
- a prompt library
- a metaphor system for emotions
- a seed collection for future synthetic data
- maybe the start of an emotion ontology
That is more accurate than calling it a strong dataset already.
Could it be useful to anyone
Yes, but probably to a niche group for now.
Most likely users:
- prompt engineers
- people experimenting with emotion-aware assistants
- small-model tinkerers
- people interested in emotion representation
- HCI / digital humanities / speculative design people
Less likely users right now:
- benchmark researchers
- people who want clean fine-tuning data immediately
- teams who need standard structure and easy reuse
The biggest risk
The biggest risk is that it could make an AI sound more emotional without making it more understanding.
That is an important difference.
A model can sound deep, caring, or poetic without actually helping better.
So if you keep developing this, the long-term question should be:
Does it improve real understanding and response quality, or only style?
What I would do next
I would do four things.
1. Change the framing
Describe it as a metaphor-based emotional prompt library or seed framework.
That is clearer and more believable.
2. Separate art from data
Keep the rich original writing.
But also make a clean structured version with fields like:
- concept
- metaphor type
- intended use
- source prompt
- risk notes
3. Make the format cleaner
Use a consistent format and naming scheme so other people can actually work with it.
4. Pick one goal
For example:
- better emotional acknowledgment
- better interpretation of metaphorical feelings
- better safe responses
Without one goal, it stays interesting but hard to evaluate.
Bottom line
My short answer is:
Yes, continue.
The idea is good.
The current packaging is the weak part.
Right now it is more valuable as a distinctive framework or prompt library than as a mature dataset.
So I would not abandon it.
I would reframe it, clean it up, and build version 2.
Here is the simplest plan I would use.
What makes the repo look unconvincing now
Two things are visible on the page itself:
- the dataset viewer is unavailable because Hugging Face could not detect supported data files
- the card has YAML metadata warnings because some task fields are not in the official lists (Hugging Face)
So the fix is not “write more feelings first.” The fix is to make the project easy to recognize, load, and understand.
A simple v2 plan
1. Pick one identity
Choose one main label for the repo:
- prompt library
- seed dataset for fine-tuning
- emotion ontology
My recommendation: call it a metaphor-based emotional prompt library and seed dataset.
That is clear and believable.
2. Split the repo into two layers
Keep the original creative files. But do not make them the main data format.
Use this structure:
    super-duper-fibber/
    ├── README.md
    ├── data/
    │   ├── train.jsonl
    │   ├── validation.jsonl
    │   └── test.jsonl
    ├── source_texts/
    │   ├── pain.yaml
    │   ├── loneliness.yaml
    │   ├── touch.yaml
    │   └── ...
    └── examples/
        └── load_dataset.py
Why this helps:
- Hugging Face recommends a supported repo structure and supported file formats so the dataset can load automatically and get a viewer. Supported formats include .jsonl, .csv, .parquet, and others. The README.md is also the dataset card. (Hugging Face)
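One way to produce the three split files in that layout is a small deterministic script. This is only a minimal sketch, not something taken from the repo: the function names `split_rows` and `write_splits`, the 80/10/10 ratios, and the fixed seed are all assumptions chosen for illustration.

```python
import json
import random
from pathlib import Path


def split_rows(rows, seed=0, train_frac=0.8, val_frac=0.1):
    """Shuffle rows reproducibly and cut them into train/validation/test lists.

    The ratios and the seed are illustrative assumptions, not a recommendation.
    """
    shuffled = list(rows)                   # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed -> same split on every run
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],   # whatever is left over
    }


def write_splits(splits, out_dir="data"):
    """Write each split to <out_dir>/<split>.jsonl, one JSON object per line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        with open(out / f"{name}.jsonl", "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

With only 20 to 50 rows the validation and test files will be tiny, so it may be reasonable to start with a single train split and add the others once there is more content.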
3. Make one clean row format
Each row in train.jsonl should be one usable item.
For example:
    {
      "id": "pain_001",
      "concept": "pain",
      "metaphor_domain": "system failure",
      "language": "en",
      "source_prompt": "Full original metaphor-rich text here...",
      "intended_use": "system_prompt_seed",
      "risk_notes": "Not for mental health crisis use"
    }
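To keep every row in the same shape as that example, the fields could be pinned down in a small dataclass before anything is written to disk. This is a sketch under assumptions: `PromptRow` and `to_json_line` are hypothetical names, and the only thing taken from the example is the set of field names.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class PromptRow:
    """One normalized item; field names mirror the example row."""
    id: str
    concept: str
    metaphor_domain: str
    language: str
    source_prompt: str
    intended_use: str
    risk_notes: str

    def to_json_line(self) -> str:
        # JSONL means one compact JSON object per line, so no indentation here.
        return json.dumps(asdict(self), ensure_ascii=False)


row = PromptRow(
    id="pain_001",
    concept="pain",
    metaphor_domain="system failure",
    language="en",
    source_prompt="Full original metaphor-rich text here...",
    intended_use="system_prompt_seed",
    risk_notes="Not for mental health crisis use",
)
print(row.to_json_line())
```

Because the dataclass constructor rejects missing or unexpected field names, a typo in a key shows up when the row is built rather than later, when someone tries to load the file.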
If you want it to be more training-ready, use a standard format that TRL already supports, such as:
    {
      "messages": [
        {"role": "system", "content": "You interpret pain through system-failure metaphors..."},
        {"role": "user", "content": "I feel like something inside me keeps breaking."},
        {"role": "assistant", "content": "That sounds like a state of repeated internal failure, not a small glitch..."}
      ]
    }
TRL’s SFT docs say SFTTrainer supports standard and conversational formats, including rows like {"text": ...} and {"messages": [...]}. (Hugging Face)
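Going from a normalized row to a conversational record could look roughly like this. It is a sketch, not a prescribed pipeline: `row_to_messages` is a hypothetical helper, using `source_prompt` as the system message is an assumption, and the user and assistant turns still have to be written (or generated and reviewed) separately.

```python
import json


def row_to_messages(row, user_text, assistant_text):
    """Wrap a normalized row and one authored exchange into a {"messages": [...]} record.

    Assumption: the row's source_prompt is used verbatim as the system message.
    """
    return {
        "messages": [
            {"role": "system", "content": row["source_prompt"]},
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": assistant_text},
        ]
    }


row = {
    "id": "pain_001",
    "source_prompt": "You interpret pain through system-failure metaphors...",
}
record = row_to_messages(
    row,
    user_text="I feel like something inside me keeps breaking.",
    assistant_text="That sounds like a state of repeated internal failure, not a small glitch...",
)
print(json.dumps(record, ensure_ascii=False))
```

Keeping the metaphor text in the normalized rows and generating the conversational records from them means the creative source stays in one place while the training format can change later.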
4. Fix the README metadata first
At the top of README.md, use only official metadata fields and official values.
A safer version would look more like this:
    ---
    language:
    - en
    - ru
    license: cc0-1.0
    pretty_name: Super Duper Fibber
    tags:
    - text
    - emotions
    - prompts
    - empathy
    task_categories:
    - text-generation
    configs:
    - config_name: default
      data_files:
      - split: train
        path: data/train.jsonl
      - split: validation
        path: data/validation.jsonl
      - split: test
        path: data/test.jsonl
    ---
Why this matters:
- Hugging Face uses the README YAML block for metadata and data file configuration
- you can define splits there with configs
- correct metadata improves discoverability and removes warning noise (Hugging Face)
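Before pushing, it can help to confirm that every file named in the metadata block actually exists in the repo. The following is a rough sketch under clear assumptions: `frontmatter_paths` and `missing_data_files` are hypothetical helpers, the scan is a naive line-by-line match rather than a real YAML parser, and it treats each `path:` value as a literal file name, so glob patterns would be reported as missing.

```python
import re
from pathlib import Path


def frontmatter_paths(readme_text):
    """Return every `path:` value found in the leading `---` ... `---` block.

    Naive line scan, not a YAML parser: assumes one `path: <value>` per line.
    """
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []                       # no metadata block at the top
    paths = []
    for line in lines[1:]:
        if line.strip() == "---":       # end of the metadata block
            break
        match = re.match(r"\s*(?:-\s*)?path:\s*(\S+)", line)
        if match:
            paths.append(match.group(1).strip("'\""))
    return paths


def missing_data_files(readme_path="README.md"):
    """List metadata paths that do not exist relative to the README's folder."""
    readme = Path(readme_path)
    paths = frontmatter_paths(readme.read_text(encoding="utf-8"))
    return [p for p in paths if not (readme.parent / p).exists()]
```

An empty list from `missing_data_files()` only says the files are present; whether the metadata values themselves are accepted is still something the Hub decides when the card is rendered.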
5. Rewrite the dataset card so people understand it in 30 seconds
Your README should answer these questions immediately:
What is this? A metaphor-based emotional prompt library plus normalized dataset rows.
What is one example? One concept mapped to one metaphor family, with source text and optional structured fields.
What is it for? Prompt design, synthetic-data seeding, emotion-aware assistant experiments.
What is it not for? Not therapy. Not psychological ground truth. Not crisis support.
What are the limits? Single-author style. Subjective mappings. Not clinically validated.
Hugging Face’s dataset card docs explicitly say the card should help users understand the contents, context, intended use, and potential biases. (Hugging Face)
6. Add one tiny usage example
Create examples/load_dataset.py:
    from datasets import load_dataset

    # Load the dataset from the Hub by its repo id.
    ds = load_dataset("closerh/super-duper-fibber")
    print(ds["train"][0])  # first row of the train split
This is small, but it makes the repo feel real.
7. Add a minimal schema section
Put this in the README:
| Field | Meaning |
|---|---|
| id | unique item id |
| concept | emotion or state |
| metaphor_domain | system metaphor used |
| source_prompt | original authored text |
| intended_use | prompt seed, ontology seed, training seed |
| risk_notes | limits and safety notes |
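A schema like this is easy to check mechanically. As a hedged sketch, the script below reads a JSONL file and reports rows that break it; `validate_jsonl` and `REQUIRED_FIELDS` are hypothetical names, and treating all six fields as required and ids as unique strings is an assumption, not something the table itself states.

```python
import json

# Assumption: every field in the schema table is required on every row.
REQUIRED_FIELDS = (
    "id", "concept", "metaphor_domain",
    "source_prompt", "intended_use", "risk_notes",
)


def validate_jsonl(path):
    """Return a list of human-readable problems; an empty list means the file passed."""
    problems = []
    seen_ids = set()
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue                          # tolerate blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(row, dict):
                problems.append(f"line {lineno}: expected a JSON object")
                continue
            missing = [k for k in REQUIRED_FIELDS if k not in row]
            if missing:
                problems.append(f"line {lineno}: missing {', '.join(missing)}")
            row_id = row.get("id")
            if isinstance(row_id, str):
                if row_id in seen_ids:
                    problems.append(f"line {lineno}: duplicate id {row_id!r}")
                seen_ids.add(row_id)
    return problems
```

Running something like `validate_jsonl("data/train.jsonl")` before each upload is a cheap way to keep the structured layer from drifting as more rows are added.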
This makes the repo look designed rather than improvised.
8. Only after that, add more content
Right now, structure is the bottleneck.
So the order should be:
- fix metadata
- create normalized JSONL files
- keep original files in a separate folder
- rewrite README
- add example loader
- then expand content
If you want the shortest possible upgrade path
Do just these 4 things first:
- Create data/train.jsonl with 20 to 50 clean rows.
- Add configs: to README.md so Hugging Face knows where the data files are.
- Replace unofficial task fields with official ones.
- Rewrite the README as a proper dataset card with intended use and limits. (Hugging Face)
That alone would make the repo look much more credible.
My blunt recommendation
Keep the poetic files. But make the main repo face look like data, not just ideas.
That is the fastest way to make people think:
“This is unusual, but serious.”