Nvidia: Static Sign-Off Best Practices

DAC 2019 panel presentation by John Busco of Nvidia (edited transcript)

Case Study Overview

John Busco of Nvidia presented a case study on Nvidia’s static sign-off methodology and best practices, spanning RTL linting, RDC, single-mode and multimode CDC, and both RTL and gate-level CDC.

Nvidia Chips Push PPA Limits

I think you all associate Nvidia with computer graphics.  We invented the GPU, or Graphics Processing Unit.

Basically, these are among the largest chips that are done in the industry.  We’re always pushing the limits of performance, power and area.

Accordingly, we need very high capacity, very good performance, and very good accuracy from our EDA tools.

We take chips like Volta and we apply them to a number of different markets. Graphics is our bread and butter, and we’ve started doing ray tracing in hardware, which is sort of the holy grail of computer graphics.

The architecture of a GPU is also applicable to several other large and fast-growing industries, such as supercomputing, deep learning for artificial intelligence, and in the future, for driving autonomous vehicles as well. We’re getting traction in all these markets.

When & Where Static Sign-Off Tools Are Used

This is a schematic representation of a chip design cycle, showing when we use various static sign-off tools.

The idea is that we’re using different combinations of the tools in different phases of the design process. When you’re starting with RTL module creation, what you can do at that point is run lint.

As the chip comes together and the block is more functionally complete, you can start to run CDC (clock domain crossing) and RDC (reset domain crossing).  And then as the different blocks are assembled, you can run multi-clock CDC and RDC. 

And finally, before tapeout you want to run a gate-level check as well. As you can see, we’re applying static sign-off tools continuously throughout the design cycle.

Problems Addressed by Static Sign-Off

So why do we do this?

Lint is great for an early vetting of the RTL. It’s quicker and easier to get feedback than running simulation or synthesis.  

CDC gives us complete coverage to ensure that our interfaces are clean. We try to minimize the potential for CDC problems by having strong design guidelines, using golden IP, and certain simulation techniques.  However, as Prakash [Prakash Narain, Real Intent] mentioned, those techniques do not cover everything; we need a solution to do so.

RDC is a similar situation. It is a class of bugs that can be quite subtle and quite hard to find, and if any escape into silicon they can be hard to debug.

So again, RDC tools provide a way to catch any problems with your reset synchronization.

Static Sign-Off Successes

Here are just a few examples of types of problems that these tools can find.

On the lint side, it ranges from very simple things to more corner cases of your case statement usage or arithmetic widths. It also covers more subtle problems, such as self-determined expressions, which simulation might miss unless you have the right test case, and yet a lint tool can flag them.
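As an illustration that is not from the talk, the self-determined expression problem can be modeled in a few lines of Python. The sketch assumes two 8-bit operands and a 9-bit destination, and the helper name `add_with_width` is hypothetical. In Verilog, `y = a + b` sizes the addition to the 9-bit destination, while `y = {a + b}` sizes it to the 8-bit operands because concatenation operands are self-determined.

```python
def add_with_width(a: int, b: int, width: int) -> int:
    """Add two unsigned values and keep only `width` bits, as hardware would."""
    return (a + b) & ((1 << width) - 1)


a, b = 200, 100  # both operands fit in 8 bits

# Context-determined: the add is evaluated at the 9-bit destination width.
context_sized = add_with_width(a, b, 9)  # 300, carry kept

# Self-determined: the add is evaluated at the 8-bit operand width,
# so the carry is dropped before the assignment.
self_sized = add_with_width(a, b, 8)  # 44, carry lost

print(context_sized, self_sized)
```

The two results differ only when the sum overflows 8 bits, which is why a simulation without such a stimulus would not expose the problem, while a structural check on the expression widths does not depend on stimulus.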

On the CDC side, it can verify all the clock-domain crossing interfaces in your design. By running it after assembly — after all the blocks are put together — you can catch things that may have changed post-synthesis.

In this example, we inserted pipelining registers in our design — perhaps that was done by an in-house tool.  We wanted to make sure that those registers are clocked correctly. So, the CDC tool would catch that at that point.
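The pipelining-register case can be sketched as a toy structural check. This is a minimal illustration, not how any commercial CDC tool works: the `Flop` record, the `is_synchronizer` flag, and all signal names are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flop:
    name: str
    clock: str
    is_synchronizer: bool = False  # hypothetical marker for approved sync cells


def unsynchronized_crossings(flops, paths):
    """Return (source, dest) pairs that change clock domain without a synchronizer."""
    by_name = {f.name: f for f in flops}
    bad = []
    for src, dst in paths:
        s, d = by_name[src], by_name[dst]
        if s.clock != d.clock and not d.is_synchronizer:
            bad.append((src, dst))
    return bad


flops = [
    Flop("tx_data", "clk_a"),
    Flop("sync_0", "clk_b", is_synchronizer=True),
    Flop("rx_data", "clk_b"),
    # A pipeline register inserted after assembly and wired to the wrong clock.
    Flop("pipe_0", "clk_a"),
]
paths = [("tx_data", "sync_0"), ("sync_0", "rx_data"), ("rx_data", "pipe_0")]

violations = unsynchronized_crossings(flops, paths)
print(violations)  # [('rx_data', 'pipe_0')]
```

The intended crossing through `sync_0` passes, while the inserted register creates a new crossing that only becomes visible once the assembled design is analyzed.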

Finally, why do you run it at gate-level? Things happen during the design closure process. There may be more DFT logic going in and there are ECOs that are changing timing or function. So, you do a final check at gate-level before sign-off.

Best Practices

Best Practice #1: Automation & Enforcement

As for best practices in applying these tools, much as Hamid [Hamid Shojaei, Google] mentioned, we put a lot of emphasis on automation.

We want to make it easy for our designers to run things in sort of a push-button manner, and we want the grading of the results to be clearly “Go” or “No go”.

We post it to a dashboard so the chip management and design teams can see whether they’re clean as far as static sign-off. All of this is very automated and run frequently.

Static sign-off is not someone just running CDC one time, saying it’s okay, and telling their manager.

It needs to be repeated and continuous so that you know that no bugs are slipping in.  Our automation provides that capability.
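A “Go” or “No go” grading step of this kind could look roughly like the following. This is only a sketch under assumptions: the block names, the check names, and the rule that any nonzero violation count fails the block are illustrative, not Nvidia’s actual flow.

```python
def grade(counts):
    """Return 'GO' only if every check reports zero violations."""
    return "GO" if all(n == 0 for n in counts.values()) else "NO GO"


# Hypothetical per-block violation counts, as a recurring job might collect them.
results = {
    "blk_a": {"lint": 0, "cdc": 0, "rdc": 0},
    "blk_b": {"lint": 0, "cdc": 2, "rdc": 0},
}

dashboard = {blk: grade(counts) for blk, counts in results.items()}
for blk, status in sorted(dashboard.items()):
    print(f"{blk}: {status}")
```

Keeping the grade binary means the dashboard needs no interpretation: a block is either clean or it is not.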

Best Practice #2: Verify Early and Late

Another question is when should you run static sign-off? Should you run it at the beginning? Should you run it at the end?

In the graph, I tried to illustrate that the design completeness is continually increasing over the design cycle. So, more of the RTL is there, more of the RTL is correct, and your constraints become correct over time.

You get the most complete analysis by waiting, but on the other hand there’s an exponential increase in the cost to fix problems the longer you wait. There’s no one answer. You really need to be running both early and late.

You want to catch the low hanging fruit at the beginning of the design cycle, but then also continuously run static sign-off, so that as things come together and are more accurate, you’re seeing any new problems that emerge.

Key Tool Capabilities Needed

If you’re evaluating new tools, here are some of the things that differentiate the static sign-off tools.

  1. The main thing to look for is the coverage and the correctness of the analysis. That’s why the tool exists. You want to make sure that it’s catching everything that you’re concerned about and that it’s not an optimistic tool: it should be conservative and flag everything.
  2. Static sign-off tools are notorious for being noisy. They’ll generate lots of messages that are either incorrect or not of interest to the designer. This is another area where tools differentiate from one another. You want a tool that doesn’t generate lots of false violations and that generates compact reports.
  3. Ideally, the tool should focus on the root cause of the problem, rather than the myriad symptoms that are triggered by that root cause. That’s something else to look at when you’re comparing tools.
  4. Finally, because of the emphasis that I think we all have on automation, the tool needs to lend itself to automation, scriptability, and debuggability. For example, the implementation tools like synthesis and static timing analysis are really good in this respect. They all have Tcl interfaces, they have access to design objects, they have attributes you can query. We need the same things from static sign-off tools so we can script things and have them be push-button.
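The difference between reporting symptoms and reporting root causes (item 3) can be sketched as a simple grouping step. The report format here is an assumption: the `root` and `path` fields and all names are hypothetical, and a real tool would have to derive the root cause itself.

```python
from collections import defaultdict


def group_by_root_cause(violations):
    """Collapse per-path violations into one entry per root cause, largest first."""
    groups = defaultdict(list)
    for v in violations:
        groups[v["root"]].append(v["path"])
    return sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)


# Hypothetical raw report: four symptoms, two underlying causes.
raw = [
    {"root": "clk_dbg not declared", "path": "u_dbg/q0 -> u_core/d0"},
    {"root": "clk_dbg not declared", "path": "u_dbg/q1 -> u_core/d1"},
    {"root": "rst_n not synchronized", "path": "u_io/rst -> u_core/ff3"},
    {"root": "clk_dbg not declared", "path": "u_dbg/q2 -> u_core/d2"},
]

summary = group_by_root_cause(raw)
for root, paths in summary:
    print(f"{root}: {len(paths)} path(s)")
```

A designer reading the grouped summary sees two items to fix instead of four messages to triage.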

Conclusion: Static Sign-Off is Essential

I hope you’re getting the message that static sign-off is something that is essential.

It can find problems earlier than other tools. It can find some problems that other tools might miss altogether.

But to do that, you need a highly accurate tool with high performance and a high-quality analysis. That’s what you need to find.

And if you find that and put it in your flow, you’re going to have a smoother path to design convergence and much less risk of having bugs in the silicon that you tape out.

So, thanks for your attention.

Panel Q&A: Multimode Clock Domain Crossing

Question:

From your experience, why would you use multi-mode CDC over single-mode CDC?

John’s response:

I think the analogy is very similar to your static timing analysis tools, where you count on them to be able to do at least multi-clock analysis and multi-mode analysis as well.

Early on, maybe you have sort of a simplistic high-level analysis, where you can define the one clock and run single clock CDC.

But later, as you get further along and are getting closer to sign-off, you want to use your real SDCs, your real timing constraints. They have multiple clocks, so you want a tool that can support those SDCs and do the correct analysis.

I think it’s also interesting that people are very tempted to do the analysis using STA (static timing analysis). And it’s sort of easier to pick out the critical clock combination for STA.

But what’s critical for STA is not necessarily critical for CDC. So, you have to be really careful — you shouldn’t make assumptions like that for CDC.
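To make the multi-mode point concrete, here is a toy model that is not from the panel. It assumes each mode simply maps a clock pin to the source that drives it, and it treats two clocks as asynchronous whenever their sources differ, which is a deliberate simplification. All mode, clock, and source names are hypothetical.

```python
# Each mode maps a clock pin to the source that drives it in that mode.
modes = {
    "functional": {"clk_core": "pll0", "clk_io": "pll1"},
    "test": {"clk_core": "tck", "clk_io": "tck"},
}

# Structural crossings in the design: (source clock pin, destination clock pin).
crossings = [("clk_core", "clk_io")]


def async_crossings(mode_clocks, crossings):
    """Return the crossings whose two clocks come from different sources in this mode."""
    return [(s, d) for s, d in crossings if mode_clocks[s] != mode_clocks[d]]


report = {mode: async_crossings(clocks, crossings) for mode, clocks in modes.items()}
for mode, found in report.items():
    print(mode, found)
```

Analyzing only the `test` mode would report the design as clean, while the same crossing is asynchronous in `functional` mode, so picking a single mode by assumption can hide a real crossing.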