Cloud-Based Static Sign-Off Methodology for TPU Machine Learning Hardware

Presented by Hamid Shojaei
Google

DAC 2021 presentation by Hamid Shojaei of Google (edited transcript)

Case Study Overview

Hamid Shojaei of Google presents a case study on Google’s cloud-based static sign-off methodology, which includes pre-submit checks with RTL linting, single-mode and multimode clock domain crossing (CDC) sign-off, and reset domain crossing (RDC) sign-off. (Real Intent tools are deployed.)

Cloud TPU: Empowering EDA with Google Cloud AI

TPUs are Google’s custom-developed ASICs used to accelerate machine learning applications. Our TPUs are designed from the ground up with the benefit of Google’s deep experience and leadership in machine learning.

A few years ago, we also developed and enabled Cloud TPUs. The Cloud TPU enables you to run your machine learning workloads on TPU hardware using TensorFlow. It is designed for maximum performance and flexibility to help researchers, developers, and business owners to build their ML applications on TPUs.

We also provide high-level TensorFlow APIs that help you get models running on Cloud TPU. In our experience, models that previously took weeks to train on other hardware platforms can converge in a matter of a few hours on TPUs.

MLPerf benchmark results were recently released, and as you can see here, TPUs demonstrated record-breaking performance at scale and can significantly outperform GPUs.

Google Cloud: Fast Design of Large, Complex Chips

Currently, all of our EDA workflows run on the cloud. We started using the cloud for batch jobs in 2020. Then we moved our entire verification flow, and finally we switched to the cloud for everything in 2021.

That was a big win for us, because we have a perfect job management system on the cloud. We also have vast compute resources, a fast networking system, and plenty of cloud storage that we can use for our EDA applications.

Another benefit we get from moving to the cloud is elasticity: we are able to get more quota in a few hours, versus months for on-premises expansion. Yet another benefit is scalability: the solutions are designed with horizontal scaling. Here you can see how our total daily job submissions to the cloud have increased.

We also improve our reliability using the cloud, because we eliminate single failure zones with cloud capacity allocated across the globe. This was a super positive experience for us. We are collaborating with all the EDA vendors to enable them on the cloud.

Google Static Sign-Off Methodology

Below are four of the key elements of Google’s static sign-off methodology:

#1: Continuous Static Checks throughout RTL design

Let’s look at how our static sign-off methodology works on the cloud. As soon as we have RTL changes, we run our pre-submit checks. (I will provide more details for this step in the next slide.) When the RTL code is in, we kick off several tasks that run in parallel.

We use Real Intent tools for RTL Linting, autoformal, clock domain crossing, and reset domain crossing sign-off.

We also elaborate RTL and report and fix issues regarding elaboration. The key advantage of this approach is that this process is early and continuous. Any change in RTL will trigger the process again.

This approach helps a lot in static sign-off and makes it much easier to close. Imagine if you waited for the entire RTL to be ready and verified before you did CDC verification — the CDC reports would be huge and the noise too high — it would make it much harder for you to close.
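The trigger-and-fan-out flow described above can be sketched as follows. This is only an illustration under stated assumptions: the check functions are hypothetical stand-ins for the real tool invocations (the talk does not show how the tools are launched), and each one is assumed to return a list of violation messages for a given RTL revision.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the real tool invocations. Each takes an RTL
# revision identifier and returns a list of violation messages.
def run_lint(revision):
    return []

def run_autoformal(revision):
    return []

def run_cdc(revision):
    return []

def run_rdc(revision):
    return []

def run_elaboration(revision):
    return []

STATIC_CHECKS = {
    "lint": run_lint,
    "autoformal": run_autoformal,
    "cdc": run_cdc,
    "rdc": run_rdc,
    "elaboration": run_elaboration,
}

def run_static_checks(revision, checks=STATIC_CHECKS):
    """Run every check concurrently and collect violations per check name."""
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        futures = {name: pool.submit(fn, revision) for name, fn in checks.items()}
        return {name: future.result() for name, future in futures.items()}

def is_clean(results):
    """A revision is clean only if no check reported any violation."""
    return all(not violations for violations in results.values())
```

Calling `run_static_checks` on every submitted revision keeps each report limited to what changed since the last clean run, which is the point of running early and continuously.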

#2: RTL Linting + Autoformal + Functional Tests = Pre-Submit requirement

Here are some more insights on our pre-submit step. This step comes from the software world and has a huge impact on the quality of the code we submit. In addition to code review, we have a feature that can block a pending code change from being committed.

Our pre-submit checks include very simple customizable built-in checks on the content or metadata of a change, such as naming conventions and readability standards.

Thanks to the fast linting tool, we were also able to enable lint checks in pre-submit. As for our wish list, we’d like to run some light CDC checks in our pre-submit. For example, to quickly check that there is no missing synchronizer in the design before the code goes in. Hopefully, EDA vendors can provide such features for us in the future.
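As a rough illustration of a simple content check of the kind mentioned above, the sketch below blocks a change when a module name breaks a naming convention. The lower_snake_case rule, the function names, and the shape of `changed_files` are all assumptions made for this example; the talk does not state Google's actual conventions or pre-submit API.

```python
import re

# Matches "module <name>" at the start of a line. "endmodule" does not match,
# because only whitespace may precede the keyword.
MODULE_DECL = re.compile(r"^\s*module\s+(\w+)", re.MULTILINE)
# Hypothetical convention: module names are lower_snake_case.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")

def naming_violations(path, source):
    """Return one message per module whose name breaks the convention."""
    return [
        f"{path}: module '{name}' is not lower_snake_case"
        for name in MODULE_DECL.findall(source)
        if not SNAKE_CASE.match(name)
    ]

def presubmit(changed_files):
    """changed_files maps file path -> file content.

    Returns (ok, messages); the change should be blocked when ok is False.
    """
    messages = []
    for path, source in sorted(changed_files.items()):
        messages.extend(naming_violations(path, source))
    return (not messages, messages)
```

A lint run or a light CDC check could be added to `presubmit` in the same way, as one more function that contributes messages.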

#3: Run Hierarchical + Multimode CDC

Two other enhancements that we applied to our clock domain crossing sign-off to support our complex designs are hierarchical runs and multimode CDC.

Our chips are huge; it is not feasible to run CDC for the entire chip. At the same time, we don’t want to miss any CDC issues at the top level.

First, we run CDC for every block, clean all the failures, and use the CDC tool to generate a database with the model for each block. Then, instead of running the CDC tool for the actual top-level chip, we replace every block with the corresponding model — this will make the CDC top-level run totally feasible.
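To make the hierarchical idea concrete, here is a toy sketch of a top-level check that works only on block models. It assumes that each model records, per boundary port, the clock domain and whether an input is already synchronized inside the block. That content is an assumption for illustration; the talk does not describe what the tool's generated database actually holds.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PortModel:
    """Abstract view of one block boundary port (hypothetical model content)."""
    clock: str
    synchronized: bool = False  # True if the block synchronizes this input

def top_level_cdc(models, connections):
    """Check top-level wiring using block models instead of full block RTL.

    models: block name -> {port name: PortModel}
    connections: list of ((src_block, src_port), (dst_block, dst_port))
    """
    violations = []
    for (src_blk, src_port), (dst_blk, dst_port) in connections:
        src = models[src_blk][src_port]
        dst = models[dst_blk][dst_port]
        # A crossing between different clocks needs a synchronizer at the receiver.
        if src.clock != dst.clock and not dst.synchronized:
            violations.append(
                f"{src_blk}.{src_port} ({src.clock}) -> "
                f"{dst_blk}.{dst_port} ({dst.clock})"
            )
    return violations
```

Because the top-level run sees only a small per-port summary, its cost grows with the number of inter-block connections rather than with the size of the blocks.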

Another challenge we face is that due to the nature of our ASICs, we need to support SDC files that have multi-mode constraints. This means running the CDC tool for different modes, reviewing CDC issues in each mode separately, and then doing several iterations to ensure that all modes are clean. With the rapid rise in the number of modes in our projects, it is nearly impossible to sign off each mode individually.

The solution is to use a multimode CDC tool, which needs only one setup, one run, and one violation report. We use this tool for CDC sign-off in projects that have multi-mode SDC constraints.
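The "one run, one report" idea can be illustrated with a toy model in which every mode assigns a clock domain to each register, and all modes are analyzed together. The data layout and names below are assumptions for this sketch and do not reflect how the actual tool represents modes or SDC constraints.

```python
def multimode_cdc(paths, modes, synchronized=frozenset()):
    """Merge CDC analysis of all modes into a single report.

    paths: iterable of (src_register, dst_register) pairs
    modes: mode name -> {register: clock domain in that mode}
    synchronized: set of paths that already have a synchronizer
    Returns {path: [modes in which the path is an unsynchronized crossing]}.
    """
    report = {}
    for mode_name, clock_of in modes.items():
        for path in paths:
            src, dst = path
            if clock_of[src] != clock_of[dst] and path not in synchronized:
                report.setdefault(path, []).append(mode_name)
    return report
```

Each violating path appears once in the merged report, annotated with the modes that expose it, so a reviewer does not have to reconcile one report per mode.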

#4: RDC Sign-Off to Avoid Metastability

Another aspect is RDC (reset domain crossing). Our designs integrate a diverse set of peripherals, memories, networking, and I/O components that may come from different sources and different vendors.

They can have different resets, power-up resets, and software resets; sometimes they also have asynchronous resets. In such designs, CDC is not the only reason for metastability.

If the reset of the source register is different from the reset of the destination register, even if the path is in the same clock domain, this will create an asynchronous crossing path. This is called a reset domain crossing, or RDC, and can cause metastability issues.
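The definition above translates into a very small check. In the sketch below, the `Flop` record and the `find_rdc` function are hypothetical names chosen for illustration, and a path is flagged purely on the basis of differing resets, exactly as the definition states; a production tool applies more detailed rules.

```python
from typing import NamedTuple

class Flop(NamedTuple):
    clock: str  # clock domain of the register
    reset: str  # reset signal driving the register

def find_rdc(flops, paths):
    """Return the paths whose source and destination resets differ.

    flops: register name -> Flop
    paths: iterable of (src_register, dst_register) pairs
    A path is flagged even when both registers share the same clock.
    """
    return [
        (src, dst)
        for src, dst in paths
        if flops[src].reset != flops[dst].reset
    ]
```

Note that a CDC-only check would pass a path whose two registers share a clock, which is why RDC needs its own sign-off step.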

Interestingly, we had an RDC issue in one of our projects because we assumed all the vendor IPs had synchronous resets.

After that we started to use the RDC tool — again in a hierarchical mode. This tool makes sure designers follow simple reset design rules to avoid metastability issues.

Taking It to the Next Level: Using AI/ML to Accelerate Design Flows

In terms of what we are doing to further improve our ASIC design methodologies, we are actively working to use machine learning to aid chip design flows. The goal here is to develop a scalable and general ML-driven framework with rapid evaluation and turn-around time to shorten the chip design cycle.

For that, we use the infrastructure on the Google Cloud Platform. With GCP, we can leverage the compute power and different ML algorithms, which are available in Google Cloud to efficiently apply unique optimizations at all stages of chip design.

We have already applied ML to four different stages: verification (both static and dynamic), floor planning, architecture, and design space exploration. Our strategy is two-pronged. Existing solutions on GCP include Bayesian optimization and reinforcement learning.

In addition, we are working on some dedicated and advanced research solutions such as graph neural networks and language models.