Frequently Asked Questions about the spam detector tool.
What does "search engine spam" mean?
Search engines rank web pages partly by analyzing the text that the pages show to users. Some webmasters trick search engines into believing that a page shows users more, or different, text than what is really displayed.
This trick is one of many methods considered "spamming" by search engines. For more in-depth information about search engine spam, you can read this search engine spam definition and this good white paper about SE spam.
What does this "search engine spam detector" do?
This tool analyzes a web page, looking for features that reveal the use of some tricks considered "spam" by search engines.
What spam methods is this tool able to detect?
It mainly detects tricks belonging to three classes of methods: keyword stuffing, doorway farms, and text hiding (based on identical foreground/background colors).
What tricks can't it detect?
Does this tool obey robots.txt files?
No, because it isn't a crawler: it acts exactly like a normal web browser, simply downloading all the files required to "render" the web page.
What's that <whitespace> text sometimes shown in the invisible text report?
Each <whitespace> is a simple HTML entity. The tool reports them if they are inserted in an "invisible zone" of the HTML code, that is, a zone where text would appear invisible.
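The idea can be sketched as follows. This is a hypothetical Python illustration, not the tool's actual PHP code: it simply reports runs of whitespace entities when the surrounding zone is already known to be invisible.

```python
import re

# Runs of two or more whitespace entities (&nbsp; and its numeric forms).
WHITESPACE_RUN = re.compile(r"(?:&(?:nbsp|#160|#xA0);\s*){2,}", re.IGNORECASE)

def report_whitespace(html_fragment, zone_is_invisible):
    """Return one '<whitespace>' marker per entity run found in an invisible zone."""
    if not zone_is_invisible:
        return []
    return ["<whitespace>" for _ in WHITESPACE_RUN.finditer(html_fragment)]

print(report_whitespace("spam&nbsp;&nbsp;&nbsp;spam", True))
```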
Why have you created such a tool?
First, because I like to train my brain, searching for nifty solutions to interesting tasks. Second, to show webmasters and SEO wannabes how relatively easy it is to develop a tool like this.
If the tool isn't capable of detecting the hidden text method I'm using, does it mean that my method can fool search engines too?
No. It just means that this tool wasn't able to find the hidden text. Search engines could be smarter.
Don't you think that your work could harm SEOs or help search engines?
No. Good SEOs don't need spam tricks. Besides, believe me, the most important search engines already detect spam with better algorithms than mine.
Who are you?
I'm an independent Italian software developer. I'm not connected with any search engine company. For questions about this tool, you can reach me at the e-mail address: spamdetector at motoricerca.info
How do you detect keyword stuffing?
In most cases, keyword stuffing detection is very easy to accomplish.
A real phrase, with grammar and syntax, is visually different from a sterile list of keywords. I simply observed the visual differences between real phrases and keyword-stuffed text, and I developed an algorithm that calculates how "natural" a paragraph of text seems.
Further, many webmasters often write paragraphs of keywords with evident signs of keyword stuffing. It's almost as if they put a giant "Hey, Google! Keyword stuffing here!" sign on their pages. This makes keyword stuffing detection even easier.
The algorithm doesn't use a dictionary of terms, nor does it have any knowledge of grammar rules. It isn't language-dependent and it works very well with many different languages.
Since keyword stuffing detection can't be 100% accurate and could generate some false positives, a good search engine shouldn't penalize a web page for keyword stuffing; it could instead calculate the importance/weight of a word taking into account how natural the text around the word is. This would minimize the negative effects of wrong text interpretation.
So, when a spammer gains a good position with a keyword-stuffed page, he/she tends to think that the keyword stuffing worked well, while it could be possible for the page to reach even better positions with clear, genuine phrases instead of paragraphs of keywords.
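A language-independent "naturalness" check of the kind described above could look like this. This is a rough Python sketch under my own assumptions, not the author's actual algorithm: natural prose repeats its most common word far less often than a stuffed keyword list does, and uses a richer vocabulary relative to its length.

```python
from collections import Counter

def naturalness(text):
    """Rough, language-independent 'naturalness' score in [0, 1].

    No dictionary and no grammar rules, as in the FAQ: only simple
    word-frequency statistics. Higher means more natural-looking.
    """
    words = text.lower().split()
    if len(words) < 5:
        return 1.0  # too short to judge
    counts = Counter(words)
    top_ratio = counts.most_common(1)[0][1] / len(words)  # dominance of one word
    distinct_ratio = len(counts) / len(words)             # vocabulary richness
    return max(0.0, min(1.0, distinct_ratio * (1.0 - top_ratio)))

prose = "the quick brown fox jumps over the lazy dog near the river bank"
stuffed = "cheap flights cheap hotels cheap flights cheap tickets cheap flights cheap deals"
```

With these examples, the prose sentence scores noticeably higher than the keyword-stuffed one, even though the function knows nothing about English.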
How do you detect hidden text?
For now, this tool detects only two types of hidden text: text with the same or similar foreground and background colors, and text hidden with the CSS "display" or "visibility" properties.
I have coded a little HTML+CSS interpreter. Basically it does the same thing a web browser does: it parses the CSS and HTML code, extracting the foreground and background color values and storing them in data structures. Subsequently, when the routine parses the HTML searching for text, it knows in which colors a web browser would draw the text and the background. If the two colors are identical, the text is considered invisible.
The algorithm also handles similar colors. If the contrast between the foreground and background colors is too low, the text is considered hardly perceptible by the human eye and a warning is reported.
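The identical/similar color check can be sketched as below. This is a simplified Python illustration, not the tool's PHP code: real contrast formulas (e.g., WCAG relative-luminance contrast) are more involved, and the threshold here is an arbitrary assumption, not the tool's actual value.

```python
def parse_hex(color):
    """Parse '#RRGGBB' (or 'RRGGBB') into an (r, g, b) tuple."""
    c = color.lstrip("#")
    return tuple(int(c[i:i + 2], 16) for i in (0, 2, 4))

def visibility(fg_hex, bg_hex, threshold=32):
    """Classify text as 'invisible', 'low-contrast', or 'visible'.

    Identical colors -> invisible; colors whose channels all differ by
    less than `threshold` -> low-contrast warning.
    """
    fg, bg = parse_hex(fg_hex), parse_hex(bg_hex)
    if fg == bg:
        return "invisible"
    if max(abs(a - b) for a, b in zip(fg, bg)) < threshold:
        return "low-contrast"
    return "visible"

print(visibility("#F6F6F6", "#FFFFFF"))  # different, but far too close to read
```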
Just after the CSS parsing phase and before the HTML parsing, the tool also downloads all the images used as backgrounds by the analyzed web page. Since hidden text methods can use monochromatic backgrounds, it is necessary to preprocess all the background images to determine whether they are monochromatic or multicolor, and to remember their color values for the HTML parsing phase.
The image processing phase is quite fast because it's not strictly necessary to analyze all the pixels of an image to determine whether it's monochromatic.
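The reason it's fast is the early exit: a multicolor image is usually detected within the first few pixels, so a full scan is rarely needed. A minimal sketch, operating on an iterable of (r, g, b) tuples rather than on decoded image files as the real tool does:

```python
def is_monochromatic(pixels):
    """Return True if every pixel equals the first one, with an early exit."""
    it = iter(pixels)
    try:
        first = next(it)
    except StopIteration:
        return True  # empty image: trivially monochromatic
    for px in it:
        if px != first:
            return False  # early exit: no need to scan the remaining pixels
    return True
```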
How do you detect doorway farms?
First, I have to clarify what definition this tool gives to the expression "doorway farm": a (usually large) list of keyword-rich text links pointing to pages generated mainly to increase keyword relevancy for search engines.
The first step is to parse the page code, extracting all the links and grouping them into blocks. The second step is a statistical analysis of the link text, to determine whether the distribution of words is unnatural. This step lets me exclude all the links used for legitimate reasons, like navigational menus.
If the text of the links actually shows an unnatural distribution pattern, the algorithm downloads two of the pages the links point to and performs two quick analyses of their HTML code and text, trying to determine whether the text inside the linked pages is unnatural too.
There is also a sub-analysis that tells me whether the pages were probably generated automatically by software. Its algorithm is quite complex.
If the two (randomly) downloaded pages appear to be "unnatural", the algorithm assumes that all the other pages pointed to by the links of the same block are probably unnatural too, and a warning is reported.
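The anchor-text statistics of the second step could be approximated like this. This is a hypothetical Python heuristic of my own, since the tool's real statistics aren't published: in a navigation menu the anchor texts are short and varied, while in a doorway farm one keyword tends to appear in most of the anchors.

```python
from collections import Counter

def unnatural_link_block(anchor_texts, share_threshold=0.5):
    """Flag a block of links whose anchor texts look like keyword permutations.

    Counts, for each word, the fraction of links containing it; if one word
    dominates the block, the distribution is considered unnatural.
    """
    if not anchor_texts:
        return False
    words = Counter()
    for text in anchor_texts:
        words.update(set(text.lower().split()))  # count each word once per link
    if not words:
        return False
    top_share = words.most_common(1)[0][1] / len(anchor_texts)
    return top_share >= share_threshold

menu = ["Home", "About us", "Contact", "Products"]
farm = ["cheap rome hotels", "cheap milan hotels",
        "cheap venice hotels", "cheap florence hotels"]
```

Here the menu block passes untouched, while the farm block is flagged because "cheap" (and "hotels") appears in every single link.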
How will this tool improve?
Well, I want to add support for CSS positioning, since many invisible text tricks use positioning properties to hide content. I think it's very complex to write a fully compliant CSS box model, so I will probably develop just a very simple version of it. It will do its job correctly in most cases: my goal is to find hidden text, not to create a complete web browser rendering engine.
There are also many small improvements I could make to the tool, and I will slowly introduce them:
What documentation/software did you use to program the tool?
The PHP language and the W3C HTML/CSS specifications. All the code was written from scratch.
"Background color is FFFFFF, the text color is F6F6F6. Those are DIFFERENT. So googlebot can't tell they are CLOSE to the same." [read on a SEO forum]
Ah ah ah! ROTFL! :-D
What next? "It's impossible to build a flying machine"? :-D