Project
„New Search Engine“
My sets versus PageRank and Panda
1. Errors of search engines
Search engines make two kinds of errors: irrelevance and failure to
distinguish spam.
Irrelevance means ranking irrelevant WWW pages ahead of relevant WWW pages.
Relevant WWW pages are pages of reasonable size and good quality that
correspond to the searched word.
Failure to distinguish spam means that „Black SEO“ WWW pages appear high in the
order (pages that are formally correct but have no content, are wallpapered,
pretend to contain something they do not, are copied, belong to link farms, and
the like).
As a rough estimate, the errors caused by irrelevance and the errors caused by
failure to distinguish spam occur in a ratio of about 50:50.
My sets are aimed above all at removing irrelevance, but they are also useful
against spam.
2. Single WWW pages versus sets
Until now, search engines have evaluated single WWW pages.
For the searched word, a search engine computes an ordering value for each WWW
page on which the word is found, sorts these ordering values in descending
order, and presents the found WWW pages in that order.
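
The single-page procedure described above can be sketched as follows. This is
only an illustration: the class Page and the scoring function ordering_value
(share of the page's words equal to the searched word) are hypothetical
stand-ins, not the criteria any real search engine uses.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str


def ordering_value(page, word):
    """Hypothetical criterion: share of the page's words equal to the searched word."""
    words = page.text.lower().split()
    if not words:
        return 0.0
    return words.count(word.lower()) / len(words)


def rank_pages(pages, word):
    """Keep the pages containing the word and sort them by descending ordering value."""
    found = [p for p in pages if word.lower() in p.text.lower().split()]
    return sorted(found, key=lambda p: ordering_value(p, word), reverse=True)
```

Calling rank_pages(pages, "castle") returns only the pages that contain the
word, with the highest ordering value first.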
The number of found pages on which the searched word occurs is often very
large, from millions up to hundreds of millions. To differentiate one million
found pages, the ordering values must differ in the sixth decimal place. This
leads to randomness and incorrectness, because very small differences (for
example, a change of two words) decide the order.
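
The arithmetic behind the six decimal places can be checked with a short
sketch. It assumes, for illustration only, that the ordering values are spread
evenly over the interval from 0 to 1; the function names are hypothetical.

```python
def decimals_needed(n_pages):
    """Smallest number of decimal places d with 10**d >= n_pages."""
    d = 0
    while 10 ** d < n_pages:
        d += 1
    return d


def distinct_values(n_pages, decimals):
    """Count the distinct values left when n_pages evenly spaced values
    i / n_pages (i = 0 .. n_pages - 1) are truncated to `decimals` places."""
    scale = 10 ** decimals
    return len({(i * scale) // n_pages for i in range(n_pages)})
```

Under this assumption, one million pages need six decimal places and one
hundred million need eight; with fewer places, many pages share the same
truncated value and their order is decided arbitrarily.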
Three years ago I found that it is better to evaluate sets of Internet
components (WWW pages, documents, images, audio, video…). To evaluate the sets
I use the same criteria as for single WWW pages, but with different weights of
the criteria.
The advantages of the sets:
- the sets are much bigger than single WWW pages and differ much more from each
other, so it is easier for the algorithm to evaluate them and to construct
their order.
- ordering values of the sets are at least 10 times larger than ordering values
of WWW pages, which removes some randomness and incorrectness. Mathematically,
using the sets at least breaks the ties between ordering values of WWW pages
that fall at the boundaries of the six-decimal decision intervals. As a rough
estimate, ordering by sets will be 10 percent better (more exact) than ordering
by single WWW pages.
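
The idea of using the same criteria with different weights can be sketched as a
weighted sum. The criteria names and all weight values below are assumptions
made for this example; the text does not specify them.

```python
def weighted_score(criteria, weights):
    """Weighted sum of criteria values; both dicts use the same criterion names."""
    return sum(weights[name] * criteria[name] for name in weights)


# Hypothetical weights: the same criteria, weighted differently for pages and sets.
PAGE_WEIGHTS = {"word_frequency": 0.6, "size": 0.1, "rank": 0.3}
SET_WEIGHTS = {"word_frequency": 0.5, "size": 0.4, "rank": 0.1}
```

The same function is called once with PAGE_WEIGHTS for a single WWW page and
once with SET_WEIGHTS for a set, so only the weights change between the two
evaluations.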
3. PageRank versus sets
Google uses PageRank (of WWW pages) as one of its criteria. PageRank considers
links between WWW pages.
The evaluation of sets of Internet components differs substantially from the
evaluation by PageRank: only the WWW pages that are part of the set are
evaluated. These sets cannot be constructed by link exchange or by buying
links. With sets, the weight of the criterion Rank for single WWW pages can be
lowered. The weight of the criterion SetRank (the average or sum of the Ranks
of the WWW pages in the set) is lower still than the weight of Rank for single
WWW pages.
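
SetRank as defined above (the average or sum of the Ranks of the WWW pages in
the set) can be sketched as follows. The function name, the dict of page Ranks
and the way set membership is passed in are assumptions made for this example.

```python
def set_rank(page_ranks, members, mode="average"):
    """Average or sum of the Ranks of the pages that belong to the set.

    page_ranks: dict mapping page URL to its Rank (hypothetical input).
    members:    URLs of the WWW pages in the set; pages outside it are ignored.
    """
    ranks = [page_ranks[url] for url in members if url in page_ranks]
    if not ranks:
        return 0.0
    total = sum(ranks)
    if mode == "sum":
        return total
    if mode == "average":
        return total / len(ranks)
    raise ValueError("mode must be 'average' or 'sum'")
```

A page with a high Rank that is not a member of the set does not influence the
result, which is the property the text relies on.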
The advantages of the sets:
- PageRank can be gamed relatively easily by link exchange or by buying links
(especially links from the same branch); sets can eliminate such cheating to
some extent, because only the pages that belong to the given set are
considered.
4. Panda versus sets
Google introduced Panda. Panda evaluates sites (websites, WWW servers) in order
to reveal spam (Black SEO). When a site is evaluated as spam, the ordering
value of every page within the site is decreased by an amount that depends on
the degree of spam.
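
The site-wide penalty described above can be sketched as follows. This
illustrates only the description in this text, not Google's actual algorithm;
the function name, the spam degrees and the parameter penalty_per_degree are
hypothetical.

```python
def apply_site_penalty(ordering_values, site_of, spam_degree, penalty_per_degree):
    """Lower the ordering value of every page on a site evaluated as spam.

    ordering_values: dict mapping page URL to its ordering value.
    site_of:         dict mapping page URL to the site it belongs to.
    spam_degree:     dict mapping site to its degree of spam (absent = clean).
    """
    return {
        url: value - penalty_per_degree * spam_degree.get(site_of[url], 0)
        for url, value in ordering_values.items()
    }
```

Every page of a flagged site receives the same decrease, regardless of its own
content.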
For Panda, the sets are whole sites or (big) parts of sites. It proceeds from
the whole, top-down, from „molecules“ to „atoms“. It is aimed only at antispam
and is not suited to solving irrelevance. Besides, Panda probably creates its
sets only from WWW pages.
I construct my sets around every WWW page (or I determine that the given WWW
page does not have any set). I proceed piece by piece, bottom-up, from „atoms“
to „molecules“. Into the sets I put not only WWW pages but also other
components of the Net (documents, images, audio, video). As for the theoretical
development, I have about a two-year head start.
The advantages of the sets:
- the sets of Panda (consisting only of WWW pages) are less expressive than my
sets (consisting of WWW pages plus other Net components).
- Panda cannot be used to solve irrelevance; my sets can.
- if Panda considers a whole site to be spam, it penalizes all the WWW pages of
that site. This has negative consequences when a „clean“ site is mistakenly
considered spam. My sets are smaller, so a wrong judgment has a smaller
negative impact (only the set is penalized, not the whole site). In other
words, applying the rules of Panda to my sets will make Panda more precise.
- spammers already know that Panda concentrates on whole sites, so they can
defend themselves by optimizing their sites; they will not know (at least for
some time) what my sets are, so they will not be able to defend themselves by
optimizing the sets.
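
The smaller impact of a wrong spam judgment on a set, compared with a whole
site, can be illustrated with a small sketch. The grouping of pages into sets
below is invented for the example; the text does not specify how the sets are
constructed.

```python
def penalized_pages(group_of, flagged_groups):
    """Return the pages whose group (site or set) has been flagged as spam."""
    return {url for url, group in group_of.items() if group in flagged_groups}


# Hypothetical site with five pages; p5 does not belong to any set.
SITE_OF = {"p1": "site", "p2": "site", "p3": "site", "p4": "site", "p5": "site"}
SET_OF = {"p1": "set1", "p2": "set1", "p3": "set2", "p4": "set2", "p5": None}
```

If the site is wrongly flagged, penalized_pages(SITE_OF, {"site"}) contains all
five pages; if only set1 is wrongly flagged, penalized_pages(SET_OF, {"set1"})
contains two.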
5. Consequences
The order of found links constructed using the sets is better (more exact)
than the order constructed from single WWW pages or by Panda.
My procedure for constructing the sets can be patented.
Implementation of my algorithm is simple.
Practically no changes to existing programs are necessary, only the addition of
some files and programs.
6. Remark
Google certainly does not use my sets.
The evidence is its order of links: e.g. when searching for the word „Lednice“
(on google.cz), the link to the isolated WWW page http://www.zamek-lednice.cz is
fourth.
7. Links
Rank
http://en.wikipedia.org/wiki/PageRank
Panda
http://googleblog.blogspot.com/2011/02/finding-more-high-quality-sites-in.html
http://en.wikipedia.org/wiki/Google_Panda
My sets
http://www.milionovastranka.net/en/beads_and_balls.htm
http://www.milionovastranka.net/en/graphical_explanation.htm