As part of crawling your site, a search engine transforms the text from how it appears on a web page into something often referred to as ‘representational text’. The process of converting your written content into a form that helps the engine understand what a site is about is surprisingly complex. It happens in three main stages, named ‘tokenization’, ‘filtration’ and ‘stemming’, and each stage can have quite a profound effect on how you choose to set up your site and write the text for it.
So what do these terms actually mean? Well, let’s start at the beginning and take a look at tokenization. Put simply, this process takes all of the written text on a page and runs it together. Search engines ignore any stylization you have applied and lump all the characters into one stream. During this stage, the robots strip out formatting such as full stops, capital letters and other types of text decoration that don’t aid their computation.
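The tokenization step can be sketched in a few lines of Python. This is a deliberately simple illustration, assuming a basic regex-based tokenizer; real search engine tokenizers are far more sophisticated:

```python
import re

def tokenize(text):
    # Lowercase the text and split it into bare word tokens,
    # discarding punctuation and other formatting
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("There are many different types of boxes."))
# ['there', 'are', 'many', 'different', 'types', 'of', 'boxes']
```

Notice that the capital ‘T’ and the full stop are gone: only the raw word characters survive.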
Once tokenization has taken place, the filtration step of the conversion is next in line. This is probably the step that most influences how a content writer creates text. Basically, filtration is the process of removing common words that don’t help the search engine in any way. By common words, I’m talking about ‘the’, ‘and’, ‘then’, ‘do’, etc. As you can probably guess, these are irrelevant to a search engine, as it’s only looking for words that are good indicators of a site’s written content. Naturally, what you’re left with is a list of words that are, in 99% of cases, directly related to the block of text you’ve created.
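Filtration amounts to checking each token against a list of so-called stop words. Here’s a minimal sketch; the stop-word list below is purely illustrative, and real engines use much larger ones:

```python
# A tiny illustrative stop-word list -- production systems
# use lists of hundreds of words
STOP_WORDS = {"the", "and", "then", "do", "a", "of", "in", "to", "are"}

def filter_tokens(tokens):
    # Keep only the tokens that are not stop words
    return [t for t in tokens if t not in STOP_WORDS]

print(filter_tokens(["the", "types", "of", "boxes"]))
# ['types', 'boxes']
```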
The last step a search engine takes toward converting your content is to stem it. Put simply, stemming is the process of removing suffixes (examples: -es, -ness, -ing, etc.) and other word variations to leave clear-cut keywords. So, to pull all three stages together, let’s take a look at a practical example to give you a better idea of what goes on:
Original text: “There are many different types of boxes in the world. Some people like to put a box over their heads when they are boxing with fellow boxers.”
Tokenization: “there are many different types of boxes in the world some people like to put a box over their heads when they are boxing with fellow boxers”
Filtration: “types boxes world people box heads boxing fellow boxers”
Stemming: “type box world people box head box fellow box”
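The whole pipeline above can be sketched in a few lines of Python. This is a deliberately naive illustration, not how any real search engine is implemented: the stop-word list is simply the one implied by the worked example, and the suffix stripper is far cruder than a real stemming algorithm such as Porter’s:

```python
import re

# Stop words implied by the filtration example above -- a real
# engine would use a much longer, carefully curated list
STOP_WORDS = {
    "there", "are", "many", "different", "of", "in", "the", "some",
    "like", "to", "put", "a", "over", "their", "when", "they", "with",
}

def tokenize(text):
    # Stage 1: lowercase and strip punctuation into bare tokens
    return re.findall(r"[a-z0-9]+", text.lower())

def filter_tokens(tokens):
    # Stage 2: drop common words that carry no topical signal
    return [t for t in tokens if t not in STOP_WORDS]

def stem(word):
    # Stage 3: naive suffix stripping -- real stemmers such as the
    # Porter algorithm apply ordered, conditional rules instead
    for suffix in ("ing", "ers", "es", "ness", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

text = ("There are many different types of boxes in the world. Some people "
        "like to put a box over their heads when they are boxing with "
        "fellow boxers.")
stems = [stem(t) for t in filter_tokens(tokenize(text))]
print(" ".join(stems))
# typ box world people box head box fellow box
```

Note that this crude stripper overstems ‘types’ to ‘typ’ where a production stemmer would return ‘type’; handling exactly these edge cases is why real stemming algorithms carry extra conditional rules.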
As you can see, search engines strip away quite a large amount of what you write in order to see exactly what your content is about. More often than not, people completely ignore this system of search engine text transformation and make the mistake of stuffing too many similar keywords together, which will inevitably make the text appear quite spammy and thus threaten the credibility and legitimacy of your content.