Google recently published a research paper describing the Siamese Multi-depth Transformer-based Hierarchical Encoder, or SMITH for short. The SMITH algorithm is not live yet, but it might be rolled out shortly. It is expected to outperform the current BERT algorithm at understanding long content. Instead of working only at the level of words and sentences, as BERT does, SMITH is expected to also work at the level of whole passages within a long document. This should allow for a better understanding of the content as a whole, as well as improvements in recommendations and document clustering.
What is the SMITH algorithm?
The key difference between the two algorithms lies in how they are trained: BERT is trained to predict words, whereas SMITH is also trained to predict blocks of sentences, which helps it match long-form texts.
Here is how the Google researchers describe BERT's limitations:
“In recent years, self-attention-based models like Transformers… and BERT …have achieved state-of-the-art performance in the task of text matching. These models, however, are still limited to short text like a few sentences or one paragraph due to the quadratic computational complexity of self-attention with respect to input text length.”
Since the researchers state that BERT is limited to matching short texts, SMITH should outperform it when it comes to long documents.
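To make the "quadratic computational complexity" point concrete, here is a rough sketch that counts pairwise attention scores for different input lengths. It is a simplified illustration, not the paper's cost model: the function names, the sample lengths, and the block size of 32 are assumptions chosen for the example, and the block-wise count only hints at why a hierarchical design can be cheaper.

```python
def attention_score_count(num_tokens: int) -> int:
    """Pairwise attention scores for one head in one layer: every token attends to every token."""
    return num_tokens * num_tokens


def blockwise_score_count(num_tokens: int, block_size: int) -> int:
    """Scores when attention runs inside fixed-size blocks, then once across block summaries.

    Hypothetical simplification of a hierarchical encoder, for illustration only.
    """
    num_blocks = -(-num_tokens // block_size)  # ceiling division
    within_blocks = num_blocks * block_size * block_size
    across_blocks = num_blocks * num_blocks
    return within_blocks + across_blocks


# Illustrative input lengths (assumed for this example).
for n in (512, 1024, 2048):
    print(n, attention_score_count(n), blockwise_score_count(n, 32))
```

Under these assumptions, making the input four times longer makes full self-attention sixteen times more expensive, while the block-wise count grows much more slowly.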
Furthermore, in their recently published research paper “Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching”, the researchers explain why matching long texts is harder than matching short ones, which is the problem the SMITH algorithm is designed to address:
“…semantic matching between long texts is a more challenging task due to a few reasons:
1) When both texts are long, matching them requires a more thorough understanding of semantic relations including matching pattern between text fragments with long distance;
2) Long documents contain internal structure like sections, passages and sentences. For human readers, document structure usually plays a key role for content understanding. Similarly, a model also needs to take document structure information into account for better document matching performance;
3) The processing of long texts is more likely to trigger practical issues like out of TPU/GPU memories without careful model design.”
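The second point, about respecting document structure, can be illustrated with a small sketch that groups whole sentences into blocks instead of cutting text at arbitrary positions. This is only an assumption about how such a preprocessing step could look, not the researchers' implementation: the function name, the regex-based sentence splitter, and the use of word counts as a stand-in for model tokens are all hypothetical.

```python
import re
from typing import List


def split_into_sentence_blocks(text: str, max_words: int = 32) -> List[List[str]]:
    """Greedily pack whole sentences into blocks of at most max_words words.

    Word counts stand in for model tokens (an assumption for this sketch).
    A single sentence longer than max_words is kept whole in its own block.
    """
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    blocks: List[List[str]] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new block when adding this sentence would exceed the limit.
        if current and current_len + n > max_words:
            blocks.append(current)
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        blocks.append(current)
    return blocks


example = "One two three. Four five six. Seven eight nine ten."
print(split_into_sentence_blocks(example, max_words=6))
```

Each block could then be encoded on its own and the block representations combined at the document level, which is the general idea behind a hierarchical encoder.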
The SMITH algorithm is designed to outperform BERT on long documents by capturing semantic relations and document structure. However, it is important to note that one algorithm does not replace the other; they are meant to work together to offer the best experience.
What does this mean for your business?
This is how the Google researchers define the SMITH algorithm’s power:
“Comparing to BERT based baselines, our model can increase the maximum input text length from 512 to 2048.”
Whereas BERT-based models are limited to 512 tokens of input, the SMITH algorithm can handle up to 2,048, so it can take in much longer text at once. To optimize your content, it will become more important than ever to write engaging, relevant, and useful text. You will need to ensure that your copywriting team is well-prepared, creates high-quality content, stays on topic, and focuses on the text's main goal, so that the search engine fully understands the content's message and doesn't penalize your website. With updates like this, search engines are increasingly becoming answer engines, which don't simply provide a list of relevant results but aim to answer users' questions.