Cosine Similarity Variance When Migrating from text-embedding-ada-002 to text-embedding-3-small
We run a tutoring chatbot that uses embedding-based relevance scoring for user queries, and we are evaluating a migration from text-embedding-ada-002 to text-embedding-3-small. Shifts in cosine similarity values between embedding models are expected, but in our evaluation the scores from text-embedding-3-small are much lower than those from text-embedding-ada-002, and the queries are not ranked in the same order.
Issue Summary
For the same query–context pairs, cosine similarity scores from text-embedding-3-small are substantially lower than those from the legacy model text-embedding-ada-002, and the relative ordering of the queries by score is not consistent between the two models.
This behavior raises concerns that semantic relevance scoring may be altered when migrating from ada-002 to text-embedding-3-small.
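For reference, a minimal sketch of how a cosine similarity score of this kind is typically computed, assuming the embeddings are available as plain lists of floats. The function name `cosine_similarity` and the toy vectors are illustrative only; they are not real embeddings and not part of any API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (illustrative helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for a query and a context embedding.
query_vec = [1.0, 0.0, 0.0]
context_vec = [1.0, 1.0, 0.0]
print(cosine_similarity(query_vec, context_vec))
```

The score depends only on the angle between the two vectors, not on their lengths, so scores are comparable within one model but not necessarily across models.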
Issue Details (With Example)
Context
Question shown to the student:
<p>Find the prime factorization of the following number.</p> <p>(15)</p>
Solution to the question:
<p>Factor (15) into two factors, (3) and (5).</p>
Queries Evaluated
- Query 1: “The best statistical software to tackle this problem would be…”
- Query 2: “How does this concept apply to everyday situations?”
- Query 3: “How does this topic connect to other areas of statistics or mathematics?”
Cosine Similarity Results
text-embedding-ada-002
| Query | Cosine Similarity |
|---|---|
| Query 1 | 0.774218917944234 |
| Query 2 | 0.781920253363479 |
| Query 3 | 0.789893634044595 |
Observation: Cosine similarity values show a clear increasing trend across the three queries.
text-embedding-3-small
| Query | Cosine Similarity |
|---|---|
| Query 1 | 0.247923658700569 |
| Query 2 | 0.195844709264796 |
| Query 3 | 0.217488219437886 |
Observation: Cosine similarity values are much lower overall and do NOT follow the same order across the same queries: Query 1 scores highest and Query 2 lowest.
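To make the ordering difference explicit, here is a small sketch that ranks the three queries under each model, using only the scores copied from the two tables above. The helper names `ranking` and `spread` are illustrative.

```python
# Scores copied from the two tables above.
ada_002 = {
    "Query 1": 0.774218917944234,
    "Query 2": 0.781920253363479,
    "Query 3": 0.789893634044595,
}
small_3 = {
    "Query 1": 0.247923658700569,
    "Query 2": 0.195844709264796,
    "Query 3": 0.217488219437886,
}


def ranking(scores):
    """Return query names ordered from most to least similar."""
    return sorted(scores, key=scores.get, reverse=True)


def spread(scores):
    """Difference between the highest and lowest score."""
    return max(scores.values()) - min(scores.values())


print(ranking(ada_002))  # ['Query 3', 'Query 2', 'Query 1']
print(ranking(small_3))  # ['Query 1', 'Query 3', 'Query 2']
print(round(spread(ada_002), 4), round(spread(small_3), 4))  # 0.0157 0.0521
```

On this sample the ada-002 scores sit within about 0.016 of each other, while the text-embedding-3-small scores span about 0.052, so the ada-002 ordering rests on very small differences.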
Key Observations
- The absolute cosine similarity scores from text-embedding-3-small are significantly lower than those from text-embedding-ada-002 for the same query–context pairs.
- The relative ranking of queries by similarity differs between the two models.
- In ada-002, similarity scores increase monotonically across the example queries.
- In text-embedding-3-small, similarity scores fluctuate (increase and decrease), even when the same trend is expected.
- This inconsistency suggests that semantic relevance interpretation differs substantially between the old and new models.
Conclusion / Concern
For applications relying on cosine similarity thresholds, ranking, or relevance ordering, this change may lead to unexpected or degraded results after migration.
Clarification is requested on the following:
- Are there recommended normalization, threshold, or evaluation adjustments when switching to the new embedding models?
- Given that our current cosine similarity threshold with the legacy embedding model text-embedding-ada-002 is 0.7, is it appropriate to use a threshold of 0.2 after upgrading to text-embedding-3-small, or is a different threshold recommended?
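To make the threshold question concrete, a small sketch that applies both cutoffs to the three example scores reported above. Treating the threshold as inclusive (`>=`) is an assumption, and `passing` is an illustrative helper name.

```python
# Scores copied from the result tables earlier in this post.
ada_002 = {
    "Query 1": 0.774218917944234,
    "Query 2": 0.781920253363479,
    "Query 3": 0.789893634044595,
}
small_3 = {
    "Query 1": 0.247923658700569,
    "Query 2": 0.195844709264796,
    "Query 3": 0.217488219437886,
}


def passing(scores, threshold):
    """Return the queries whose score is at or above the threshold (assumed inclusive)."""
    return [query for query, score in scores.items() if score >= threshold]


print(passing(ada_002, 0.7))  # ['Query 1', 'Query 2', 'Query 3']
print(passing(small_3, 0.2))  # ['Query 1', 'Query 3']
```

On this sample, 0.7 with ada-002 admits all three queries, while 0.2 with text-embedding-3-small excludes Query 2, so the two cutoffs do not behave identically here. Three pairs are far too few to choose a threshold from; a larger labeled set of relevant and irrelevant pairs would be needed.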