LLM "curving" via prompting
Hmm… this is difficult, but based on what I could find so far, maybe it looks something like this:
Short answer
I would not treat this as a yes/no question yet.
At the output level, the safest first interpretation is that the prompt is creating a strong style / ontology / behavioral regime shift. The prompt gives the model a very explicit vocabulary: “field,” “gravity well,” “self-organization,” “manifold,” “density,” “hidden dynamics,” etc. So it is not surprising that the model starts speaking and visualizing in that ontology.
But I would not yet call it evidence that the model’s internal manifold has literally been “curved,” unless the visualization is based on something like token scores/logprobs, hidden states, or causal interventions, rather than generated text or model self-description.
My tentative answer to the three questions would be:
- Manifold perturbation or stylistic variation? Maybe more than ordinary style variation, but “manifold perturbation” is too strong until the measurement layer is specified. A safer current label might be prompt-induced regime shift or prompt steerability.
- Has anyone tried similar “field condition” prompting? Yes, there are nearby prompt-only experiments on this forum using “field,” “attractor,” “semantic tension,” or “global regulator” language. I would treat those as related observations, not evidence of the same internal mechanism.
- How to distinguish rhetorical self-organization from something stronger? I would not look for one magic “self-organization score.” I would use an evidence profile: output behavior → behavioral stability → token distribution → hidden-state trajectory → interdependence/order-parameter tests → causal patching/ablation/transfer.
For me, the most important missing detail is:
What exactly are the maps computed from?
Generated text embeddings? Token scores? Hidden states? Attentions? A custom metric? A 2-D projection of some other vector space? The interpretation changes a lot depending on that answer.
The main distinction I would make
I would separate “the prompt curves the model” into several increasingly strong claims.
| Level | What is being claimed | Example evidence |
|---|---|---|
| Output style | The output sounds different | More field/manifold/self-organization language |
| Behavioral regime | The prompt reliably shifts behavior | Replication across seeds/tasks; survives style-only controls |
| Token distribution | The next-token landscape changes | Scores, logprobs, entropy, KL divergence |
| Representation | Hidden-state trajectories change | Layerwise hidden-state distances, CKA/RSA, trajectory clustering |
| Self-organization-like profile | There is stable, structured organization, not just vocabulary | Order parameters, perturbation recovery, interdependence/synergy |
| Causal mechanism | The effect can be moved or removed | Activation patching, ablation, activation-difference transfer |
The first two levels are still interesting. They just do not yet justify the stronger internal-mechanism language.
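To make the token-distribution row a bit more concrete, here is a minimal sketch of the two quantities I have in mind: the entropy of a next-token distribution and the KL divergence between two of them. The logits are made-up numbers over a four-token toy vocabulary, not output from any model, and the function names are my own. It also assumes both distributions are read at the same position after the same preceding tokens; otherwise the comparison mixes the prompt effect with context differences.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; q is floored at eps to avoid division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical logits at one position under two prompt conditions.
baseline_logits = [2.0, 1.0, 0.5, -1.0]
curving_logits = [0.5, 1.0, 2.5, -1.0]

p_base = softmax(baseline_logits)
p_curve = softmax(curving_logits)

print("entropy (baseline):", entropy(p_base))
print("entropy (curving): ", entropy(p_curve))
print("KL(curving || baseline):", kl_divergence(p_curve, p_base))
```

A nonzero KL here only says the next-token landscape differs at that position. The same numbers would need to be compared against the style-only controls before reading anything more into them.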
A useful existing term for the second level is prompt steerability: how much prompting alone can shift a model’s behavioral distribution away from baseline. See, for example:
- Evaluating the Prompt Steerability of Large Language Models
- ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
- PromptRobust
- PromptBench
So I would initially describe the current observation as:
This prompt may induce a distinctive behavioral regime. Whether that regime corresponds to a deeper representation-level or causal mechanism is testable, but not yet shown by the wording of the output alone.
What I would ask first
Before interpreting the maps, I would ask:
What exactly is being plotted?
| If the map is computed from… | Safer interpretation | Unsafe interpretation |
|---|---|---|
| Generated text embeddings | Output-space shift | Internal manifold perturbation |
| Prompt embeddings | Input-space / prompt-space difference | Model dynamics changed |
| Token scores / logprobs | Next-token distribution changed | Causal mechanism found |
| Hidden states | Representation trajectory changed | The difference caused the behavior |
| Attention patterns | Attention allocation changed | The whole model entered a new field state |
| Custom metrics | Difference under that metric | Internal geometry proved |
| UMAP/t-SNE/PCA projection | Exploratory visualization | Literal gravity wells / basin sizes |
For open-weight Hugging Face models, some of these layers are accessible. Relevant docs:
- Transformers generation utilities
- Text generation API
- Model outputs / hidden states
There is also a practical gotcha: hidden states from model.generate() can be confusing because prompt-token states and generated-token states are not always obtained in the same way. This HF Forum thread is relevant:
- The hidden states when I use model.generate()
So I would ask:
Are the maps based on generated text, scores/logprobs, hidden states from a specific layer/token position, attentions, or a custom post-processing metric?
That question is not a criticism. It determines what kind of claim the map can support.
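One way to sidestep the model.generate() confusion, as a sketch only: run a plain forward pass over the serialized prompt with output_hidden_states=True and read one vector per layer at a fixed token position. The function and variable names below are my own, the model name is whatever open-weight causal LM you choose, and last-token cosine distance is just one simple comparison. I am not assuming this is how the original maps were made.

```python
import math


def last_token_states(model_name, prompt_text, revision=None):
    """Return one hidden-state vector per layer for the final prompt token.

    Requires torch and transformers; imported lazily so the pure helpers
    below can be used without them.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(model_name, revision=revision)
    model.eval()

    inputs = tokenizer(prompt_text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

    # outputs.hidden_states is a tuple: embedding output plus one entry per
    # layer, each of shape (batch, seq_len, hidden_size).
    return [h[0, -1, :].tolist() for h in outputs.hidden_states]


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def layerwise_distances(states_a, states_b):
    """Cosine distance per layer between two prompt conditions."""
    return [cosine_distance(a, b) for a, b in zip(states_a, states_b)]
```

Usage would look like layerwise_distances(last_token_states(name, baseline_prompt), last_token_states(name, curving_prompt)), repeated for each control prompt. Because the prompts differ in length and wording, a distance at the last prompt token is expected to be nonzero anyway; the interesting question is whether the curving prompt separates from the style-only controls.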
Rhetorical self-organization vs stronger evidence
I do not know a single accepted metric for “genuine self-organization simulation” in LLM outputs.
But adjacent fields do have recurring measurement families. I would treat them as an evidence profile, not one score.
| Measurement family | What it asks | Prompt-experiment version |
|---|---|---|
| Order parameters | Is there a low-dimensional collective variable? | What variable summarizes the induced regime? |
| Stability / recovery | Does the system return to a regime after perturbation? | Seeds, paraphrases, distractors, multi-turn recovery |
| Entropy / complexity / homeostasis | Is there a balance of order and variability? | Token entropy, output entropy, complexity profile |
| Fingerprint metrics | Is there randomness, pattern, and interdependence? | Output variance, repeated structure, mutual information |
| Interdependence / synergy | Are parts becoming structurally coupled? | MI/PID among output features or hidden components |
| Information dynamics | Is information stored/transferred over time? | Prompt influence across generation steps |
| Multiscale profile | At what scale does organization appear? | Token, sentence, response, layer, seed ensemble |
| Topological structure | Are there persistent geometric features? | TDA on hidden-state or output-embedding clouds |
| Causal intervention | Can the effect be moved or removed? | Patching, ablation, activation-difference transfer |
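
As a small illustration of the interdependence row, here is a sketch of a plug-in mutual-information estimate between two discrete output features measured across runs. The feature names and values are hypothetical (for example, "uses field vocabulary" and "response has a recurring section structure", each coded 0/1 per run). Plug-in estimates are biased upward on small samples, so I would only compare them against the same estimate on control prompts or on shuffled labels.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in mutual information (in bits) between two discrete sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and the same length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


# Hypothetical per-run binary features, one entry per generated response.
uses_field_vocab = [1, 1, 0, 0, 1, 0, 1, 0]
has_recurring_structure = [1, 1, 0, 0, 1, 0, 0, 1]

print("MI (bits):", mutual_information(uses_field_vocab, has_recurring_structure))
```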
Translated to this experiment, I would draw the boundary like this:
| Category | What it would look like |
|---|---|
| Rhetorical self-organization | The output uses self-organization / field / attractor vocabulary |
| Behavioral self-organization-like effect | The regime replicates across seeds/tasks and survives style-only controls |
| Distribution-level evidence | Scores, entropy, or logprob profiles shift relative to controls |
| Representation-level evidence | Hidden-state trajectories shift relative to controls |
| Order-parameter-like evidence | A low-dimensional variable predicts many output/hidden-state features |
| Interdependence evidence | Output or hidden-state components become more mutually dependent/synergistic |
| Causal evidence | Patching, ablation, or activation-difference transfer moves or removes the effect |
So for your third question, I would not ask:
Is it genuine self-organization, yes or no?
I would ask:
How far up this ladder does the effect survive?
Controls that would make the result easier to interpret
I would add controls before interpreting the effect as internal geometry.
| Control | Why it matters |
|---|---|
| Baseline prompt | What does the model do normally? |
| Curving / field prompt | The proposed effect |
| Physics-metaphor-only prompt | Tests whether the effect is just physics vocabulary |
| Self-monitoring-only prompt | Tests whether the effect is generic reflective prompting |
| Same-length abstract prompt | Controls for length and abstraction |
| Shuffled-label prompt | Tests whether labels like “self-assembly” and “self-organization” matter |
| No-visual-language prompt | Tests whether the visual map is induced by visualization instructions |
| Multiple seeds | Tests whether the regime is stable |
| Greedy + sampling | Separates deterministic prompt effect from sampling noise |
Example interpretation:
| Result | Safer interpretation |
|---|---|
| Curving prompt and physics-only prompt look similar | Style / ontology priming is likely |
| Curving prompt differs from all style controls across seeds | Behavioral regime shift becomes more plausible |
| Token score / entropy profiles differ from controls | Distribution-level claim becomes stronger |
| Hidden-state trajectories differ from controls | Representation-level claim becomes stronger |
| Patching transfers or removes the effect | Causal claim becomes more plausible |
Controls are not just for disproving the idea. They make the interesting part more precise.
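For the "differs from all style controls across seeds" row, one simple way to put a number on it is a permutation test on a per-seed behavioral score. This is a sketch under my own assumptions: the score is some hypothetical scalar you compute per run (for example, a rubric score for structured output), and the values below are invented for illustration.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(treatment, control, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(mean(treatment) - mean(control))
    pooled = list(treatment) + list(control)
    k = len(treatment)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:k]) - mean(pooled[k:]))
        # Small tolerance so float rounding does not hide exact ties.
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical per-seed scores for two prompt conditions.
curving_scores = [0.90, 0.80, 0.85, 0.95, 0.90, 0.88]
physics_only_scores = [0.10, 0.20, 0.15, 0.05, 0.12, 0.18]

print("p-value:", permutation_p_value(curving_scores, physics_only_scores))
```

A small p-value against the physics-only control would support "more than physics vocabulary" at the behavioral level only. It says nothing yet about token distributions or hidden states.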
Small practical next step
If I were trying to make this easy for others to evaluate, I would create a minimal reproducibility package:
- Pick one open-weight model.
- Save exact model revision and tokenizer revision.
- Save the serialized chat-template output.
- Use baseline / curving / physics-only / same-length abstract controls.
- Fix generation settings.
- Run multiple seeds.
- Save generated outputs.
- Save token scores / entropy profiles if available.
- Save hidden-state extraction method if used.
- Report which layer/token positions are plotted.
- Save projection method and parameters.
- If possible, add one small activation-patching or activation-transfer experiment.
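The checklist above could be captured as one small manifest saved next to each run. This is only a sketch: every field name is my own invention, and the values are placeholders to be filled in with the real model ID, revisions, and serialized prompt.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunManifest:
    """Hypothetical per-run record; field names are illustrative only."""
    model_id: str
    model_revision: str
    tokenizer_revision: str
    condition: str              # e.g. baseline / curving / physics-only
    serialized_prompt: str      # exact text after the chat template is applied
    seed: int
    generation: dict            # fixed decoding settings
    hidden_state_layers: list = field(default_factory=list)
    token_positions: str = "last_prompt_token"
    projection: dict = field(default_factory=dict)


manifest = RunManifest(
    model_id="<model id>",                  # placeholder
    model_revision="<model commit hash>",   # placeholder
    tokenizer_revision="<tokenizer commit hash>",
    condition="curving",
    serialized_prompt="<serialized chat-template output>",
    seed=0,
    generation={"do_sample": False, "max_new_tokens": 256},
    hidden_state_layers=[0, 8, 16],         # placeholder layer indices
    projection={"method": "PCA", "n_components": 2},
)

manifest_json = json.dumps(asdict(manifest), indent=2)
print(manifest_json)
```

Saving one of these per condition and seed makes it much easier for someone else to tell which layer of the table a given map actually comes from.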
The claim can then be stated in tiers:
| Evidence obtained | Claim strength |
|---|---|
| Output changes only | Rhetorical/style regime |
| Replicates across seeds and controls | Behavioral regime shift |
| Scores/logprobs differ | Token-distribution shift |
| Hidden states differ | Representation-level shift |
| Order parameter / interdependence appears | Self-organization-like profile |
| Patching/ablation/transfer works | Causal evidence candidate |
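
Read as a ladder, the table can be sketched as a tiny function that reports the highest tier reached without skipping a rung. The cumulative reading is my assumption (a hidden-state difference that does not replicate across seeds would not count as a representation-level claim here), and the dictionary keys are hypothetical names.

```python
# Ordered from weakest to strongest; labels follow the table above.
LADDER = [
    ("output_changes", "Rhetorical/style regime"),
    ("replicates_across_seeds_and_controls", "Behavioral regime shift"),
    ("scores_or_logprobs_differ", "Token-distribution shift"),
    ("hidden_states_differ", "Representation-level shift"),
    ("order_parameter_or_interdependence", "Self-organization-like profile"),
    ("patching_ablation_or_transfer_works", "Causal evidence candidate"),
]


def claim_strength(evidence):
    """Return the strongest claim supported by an unbroken run of evidence."""
    claim = "No effect shown"
    for key, label in LADDER:
        if not evidence.get(key, False):
            break
        claim = label
    return claim


print(claim_strength({
    "output_changes": True,
    "replicates_across_seeds_and_controls": True,
}))
```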
My current position
This looks like an interesting prompt-induced regime shift. It may become a stronger representation-level or self-organization-like claim if the maps are based on well-specified token/hidden-state data, survive controls, and ideally show some causal transfer or ablation behavior.
Until then, I would keep “manifold curving” as a metaphor or hypothesis, not as the measured conclusion.