{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreig7tg3snwjnvgmi7plitn2goaqrmiswl3kqu64rrhigjiqpfbkb6m",
"uri": "at://did:plc:haakkg7y3xdghcdmprxeexso/app.bsky.feed.post/3mnb4vzsox5f2"
},
"path": "/t/prieco-quality-benchmarking/38289#post_1",
"publishedAt": "2026-06-01T22:10:12.000Z",
"site": "https://discuss.privacyguides.net",
"tags": [
"PriEco",
"WorldWideWebSize",
"8B",
"SEJournal",
"IndexMachine",
"Source",
"replied to me",
"2025",
"LIVE STATS",
"LIVE",
"NDCG"
],
"textContent": "# Intro\n\nHi! Some of you may know PriEco and you may know the results aren’t the best yet. I’m committed to improve it as much as I can. I’ve decided to document it here. Hopefully I am allowed to, will provide a coherent source of information, show some transparency on my side and we could have discussions.\n\n**Why here?** Because of community and I like Privacy Guides. I could do it on X/Mastodon/Bsky or on own blog (that would be first post) but I don’t have trust that people would ready it there.\nPlease tell me if it’s appreciated here and if not I’ll stop\n\nI still believe the main reason why PriEco lacks behind is **index size**. It’s just too small compared to more established web search engines\n\n## Index size\n\n**Google** (40-50B results, 400B known URLs)\ndoesn’t publicly report it, but we can estimate from sources:\nWorldWideWebSize estimates Google at roughly 40-50B pages (Bing at 1-3B, I find it hard to believe as both Mojeek and Brave search report much higher numbers)\nI’d say that general agreed upon numbers online are ~50B results and 400B known URLs\nBut there are claims as 8B (Maybe they mean domains)\n\n**Bing** (8-14B)\nSEJournal\nIndexMachine\nAgain, it’s just estimation\n\n**Brave search** (8B+)\n\n\nSource\nHere we can estimate Brave search is roughly Bing size as it was 8B over 5y ago. I believe we can all agree Brave search got a lot better compared to how it was 5y ago.\n\n**Mojeek** (9B)\nMojeek is pretty transparent about this. They replied to me and in 2025 they reached 9B (it’s in their timeline)\n\nThat said, there is a meaningful distinction about how many URLs (web pages) a web search engine knows about **crawling** and actually stores and serves as results **indexing**.\nI personally have 0 care for now about how many URLs PriEco knows about but! for the sake of this post I ran a script: 2.1B. 
I care only about how many results it can deliver to you: LIVE STATS _(I need to improve that page’s design)_\n\nWhile writing this I stumbled across\n\nJust so you know: that was likely version 1 of my crawler. It’s now at version 3, because the earlier versions produced unusable results. PriEco has been crawling the web for only a few months, and only recently at a reasonable speed.\n\n**PriEco** (300M results, 2.1B known URLs)\nI already mentioned these numbers, but I’m repeating them for people scanning through.\nAgain, the known-URL count is irrelevant to me; the results count is LIVE.\n\n## Ranking\n\nRanking information is even more hidden than index size. PriEco does:\n\n 1. Concurrent full-text search (keyword matching) and IVF vector search (semantic understanding)\n 2. RRF merge & deduplication: merges the results of both indexes into one list\n 3. Hand ranking: a set of hand-picked rules that boost or penalize a result’s score\n    1. Examples: SSL, loading time, whether the page is in the user-set language/location, bad URL patterns, measured confidence and effort of the page, whether the page is a homepage…\n    2. I made up each signal’s weight. I am now looking into a Google & Brave search query log optimization to improve them\n 4. Reranker model + PageRank\n 5. Cap the SERP (search engine results page) at a maximum of 3 results from a single domain\n\n\n\nThere is a lot more information online about how to do ranking, and I’m looking for ways to include it in my ranking pipeline to improve result quality.\nRight now it’s about putting some logic behind the hand-ranking weights.\n\n## Tests\n\nA product becomes what you optimize it for. We need a proper “gold standard” or a measurable metric to optimize PriEco against, so that we can reliably measure whether it’s getting better. 
We could measure against Google, like so many before me did.\n\n**NDCG** (AI helped; I don’t yet have a proper test). I took 50 SERPs each from Google, Brave search, Mojeek and PriEco, using the same queries for all of them.\nSample of the queries:\n\n * simple: _youtube, netflix sign in, wikipedia english main page_\n * products: _best video editing software 2026, best wireless earbuds for working out_\n * questions: _difference between ssd and hdd sequential read write speeds_ or _why does rust borrow checker reject mutable references in loops_\n\n\n\n_The queries and test code were AI-made for now_\nThese are the results:\n\n * Google (reference, measured against): 1.0000\n * Brave search: 0.5599\n * Mojeek: 0.2932\n * PriEco: 0.0856\n\nWe can clearly see PriEco scored the worst. BUT it’s workable, considering that PriEco’s index is 30 times smaller than Mojeek’s (300M vs 9B) and its ranking isn’t very smart yet.\n\n**That said, the test likely wasn’t entirely optimal.** It contained a lot of long queries. Still, it’s useful to see how PriEco scores relative to Brave search and Mojeek, which in my view makes the score pretty reasonable.\n\n## Final words\n\nFirst of all, this isn’t done yet. I just wrote this to communicate the current state of PriEco. I will keep this post updated as I improve the ranking, run more tests and grow the index.\n\nExcited for any of your replies to this topic.",
"title": "PriEco Quality Benchmarking"
}