Google: Static Sign-Off Methodology & Results


DAC 2019 panel presentation by Hamid Shojaei of Google (edited transcript)

Case Study Overview

Hamid Shojaei of Google presents a case study on Google’s static sign-off methodology. Hamid covers best practices and results for RTL linting, single-mode and multimode clock domain crossing (CDC), and reset domain crossing (RDC). Real Intent tools were deployed.


Google & Machine Learning, TPUs

Google aspires to create technologies that solve important problems — and we are relying on machine learning to help us reach our goal.

We believe these technologies can promote innovation and further our mission to organize the world’s information and make it universally accessible and useful.

To implement these algorithms and advanced technologies, we relied for many years on Moore’s Law to give us the computing power we need. But as you know, Moore’s Law is in decline, and we can no longer expect large across-the-board speedups.

That’s why, several years ago, we started a project at Google to build custom hardware for machine learning: Tensor Processing Units, or TPUs.

We have been using TPUs in our data centers for several years and have found that they deliver an order of magnitude better-optimized performance per watt for machine learning.

This is roughly equivalent to fast-forwarding technology about 7 years into the future — or 3 generations of Moore’s Law.
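As a rough sanity check on that equivalence, here is a minimal sketch. It assumes the common reading of Moore's Law as a 2x gain per generation; that factor and the derived generation length are assumptions for illustration, not figures stated in the talk.

```python
# Assumption (not from the talk): each Moore's Law generation doubles
# performance per watt.
GAIN_PER_GENERATION = 2.0

generations = 3  # "3 generations of Moore's Law" (from the talk)
years = 7        # "about 7 years into the future" (from the talk)

# Three doublings compound to 2^3 = 8x, roughly an order of magnitude.
total_gain = GAIN_PER_GENERATION ** generations

# Implied length of one generation under the talk's own numbers.
years_per_generation = years / generations

print(f"{generations} generations -> {total_gain:.0f}x gain")
print(f"{years} years / {generations} generations = "
      f"{years_per_generation:.2f} years per generation")
```

Under that assumption, three generations compound to 8x, which is consistent with the "order of magnitude" claim.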

Google TPU Design Challenges

Let’s look at the challenges we are facing while making TPUs. The first is that our product cycle is short, and time-to-market is very important for us.

Our first TPU was used for AlphaGo in the 2016 Go match, where it beat the top master Lee Sedol.

Since then we have made several generations of TPUs, and we have more generations to come. As you can see, we execute very fast and we need EDA tools that can cope with our schedule.

Another fact is that a TPU is a complex design; it has many blocks running at different clock frequencies, with different power and performance requirements.

We also use advanced technologies to make our chips; for that we need a high level of collaboration and co-optimization among designers, the process, and the tool developers.

Clocking is Primary ASIC Design Challenge

Let’s look at this chart from Wilson Research. According to this research, the main types of flaws contributing to respins are functional bugs and clocking issues, and for clocking, CDC verification is very important.

For both categories, static sign-off (static verification) is very important and helps a lot. However, in the ASIC design community, static sign-off is usually the last stage, when there is no time left to ensure quality and robustness.

How do we address this issue?

Static Sign-Off Best Practices

#1: Run Static Checks First

As engineers, sometimes we need to go to therapy, and if you don’t need to, you should be proud of yourself. But if you do, there is a piece of advice they always give you:

“If it’s your job to eat a frog, it’s best to eat it first thing in the morning. And if it’s your job to eat two frogs, it’s best to eat the biggest one first.”

At Google we try to listen to this advice. We know static sign-off is very challenging and time consuming, so we start as early as possible — from day 1 of RTL design.

#2: Run Broad Range of Static Checks

This slide shows the big picture of the different static verification checks that we run on our code. Before we submit our code, we have “pre-submit”, which runs some basic checks on it.

Then we have linting, in which we have standard rules as well as some custom rules that we have developed internally. Elaboration is next, to make sure that the RTL does not have any elaboration issues.
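To make the idea of custom lint rules concrete, here is a minimal sketch of a line-based checker. The two rules, their names, and the example RTL are hypothetical illustrations; they are not Google's internal rules and not how any commercial linter is implemented.

```python
import re
from typing import List, NamedTuple


class Finding(NamedTuple):
    rule: str
    line_no: int
    message: str


# Hypothetical custom rules: rule name -> (pattern, message).
RULES = {
    # Prefer always_ff / always_comb over the legacy "always @" form.
    "LEGACY_ALWAYS": (re.compile(r"\balways\s*@"),
                      "use always_ff/always_comb instead of 'always @'"),
    # Flag 'initial' blocks in RTL intended for synthesis.
    "NO_INITIAL": (re.compile(r"\binitial\b"),
                   "avoid 'initial' blocks in synthesizable RTL"),
}


def lint(source: str) -> List[Finding]:
    """Apply every rule to every line, ignoring '//' line comments."""
    findings = []
    for line_no, raw in enumerate(source.splitlines(), start=1):
        code = raw.split("//", 1)[0]  # drop trailing line comment
        for name, (pattern, message) in RULES.items():
            if pattern.search(code):
                findings.append(Finding(name, line_no, message))
    return findings


EXAMPLE_RTL = """\
module counter(input logic clk, output logic [3:0] q);
  initial q = 0;
  always @(posedge clk) q <= q + 1;
endmodule
"""

findings = lint(EXAMPLE_RTL)
for f in findings:
    print(f"line {f.line_no}: [{f.rule}] {f.message}")
```

A real linter works on a parsed design rather than raw text lines; the regex approach here only shows how a project-specific rule can sit alongside standard ones in a single rule table.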

We use a combination of static and dynamic approaches for CDC verification. We also run RDC (reset domain crossing) checks.
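The structural side of a CDC check can be sketched in a few lines. This is a toy model under stated assumptions: flops are plain names mapped to clock domains, edges are direct flop-to-flop paths, and a crossing counts as synchronized only if the receiving flop feeds exactly one flop in its own domain (a two-flop synchronizer). All names are hypothetical, and production CDC tools check far more than this.

```python
from typing import Dict, List, Tuple


def find_unsynchronized_crossings(
    domains: Dict[str, str],
    edges: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return (src, dst) pairs that cross clock domains without a
    two-flop synchronizer on the receiving side (toy heuristic)."""
    fanout: Dict[str, List[str]] = {}
    for src, dst in edges:
        fanout.setdefault(src, []).append(dst)

    violations = []
    for src, dst in edges:
        if domains[src] == domains[dst]:
            continue  # same clock domain, not a crossing
        sinks = fanout.get(dst, [])
        # First synchronizer stage must drive exactly one flop,
        # and that flop must be in the receiving domain.
        synchronized = len(sinks) == 1 and domains[sinks[0]] == domains[dst]
        if not synchronized:
            violations.append((src, dst))
    return violations


# Hypothetical design: one synchronized crossing, one unsynchronized.
domains = {
    "tx_data": "clk_a", "sync_1": "clk_b", "sync_2": "clk_b",
    "tx_flag": "clk_a", "rx_flag": "clk_b",
}
edges = [
    ("tx_data", "sync_1"), ("sync_1", "sync_2"),  # two-flop synchronizer
    ("tx_flag", "rx_flag"),                       # direct crossing
]

violations = find_unsynchronized_crossings(domains, edges)
print(violations)
```

Here only the `tx_flag` to `rx_flag` path is reported, because `rx_flag` has no second synchronizer stage behind it. This covers only a static, structural view; it says nothing about the dynamic checks the talk also mentions.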