Distinguished Seminar on Electronic Systems Technology

I’m Philip Wong, a faculty member in the electrical engineering department. You know, electronic systems are the heart of the information society and have really profoundly changed our lives. In the coming decades the demand for energy-efficient electronic systems will only accelerate. Yet, as Professor Hennessy will show you today, the steady advances that were made in the last 50 years are now slowing down. So there’s no doubt that electronic systems technology will continue to be important and is at the
cusp of major technology shifts that will be as transformative as those witnessed by Professor John Linvill many decades ago. Now these lectures have been created to help us explore a path of going forward and to honor John Linvill’s enormous legacy as both a faculty member and a department chair. I would like to pass it on to
Professor Stephen Boyd to say a few words now. I’m Stephen Boyd, I’m the chair of electrical engineering as of seven weeks ago, or something like that. I’m extremely happy to be here at this inaugural lecture celebrating basically one of the giants of our department. John [Linvill] basically shaped it, being chair for 16 years.
So I’ve got a long way to go I guess. I’ve been around Stanford long enough to be able to say, happily, that we were actually colleagues. We are very lucky here; it’s very appropriate to have another giant of Stanford EE and indeed of Stanford –
Jim Gibbons – a former student of Linvill. He will introduce the inaugural
lecturer yet another Stanford EE and Stanford giant: John Hennessy. I was John Hennessy’s first — [crowd says, “Linvill”] [laughter] John Linvill, I mean. [laughter] I took John’s first class when he came here, and I thought it was just spectacular. So then I changed my field from […] to working in semiconductors with John. First, this afternoon, I want to welcome Greg Linvill and his wife Betty. [applause] John’s daughter Candace and her
husband Chris wanted to be here but had to send their regrets as they were unable to attend. It is a great pleasure for me to say a few words about John Linvill’s legacy at Stanford, and to introduce John Hennessy, the inaugural speaker for the Linvill Distinguished Seminar on Electronic Systems Technology. I’ll start at the beginning. Fred Terman recruited John Linvill in 1954, asking him to build a program in the application of transistors. At the time, John was on an extended leave from MIT, working at Bell Labs
on transistor circuit design problems. Building a graduate program in transistor circuit design was what Fred Terman had in mind when he asked John to, quote, “transistorize” the Stanford EE curriculum. John said later that he
was very pleased with Terman’s offer — but that he would have returned to MIT
if they had offered him tenure. But they didn’t, and we did. Thus creating a future for
the EE Department at Stanford that is surely beyond even Fred Terman’s imagining. Your program booklet contains a
partial sketch of John Linvill’s legacy. Among its important milestones were the creation of three new laboratories over the period of approximately 23 years. The first
of these labs was created so that the PhD students in electrical engineering
could build semiconductor devices as part of their research program. At the time no EE department in the country had such a laboratory. A photograph (look on the inside of your booklet) shows Stanford’s first silicon device,
built in March 1958. It worked well; unfortunately, it was built on a silicon wafer that was the size of a dime and is barely visible in the photo. Sorry about that. A little over 10 years later integrated circuits had replaced
individual transistors as the components of choice for system design.
John and his colleagues, especially Jim Meindl, then created a new laboratory where
students could build integrated circuits as part of their PhD research.
Same song second verse. John participated actively in this new laboratory where he and his
graduate students created the custom integrated circuits that were necessary
to build the Optacon, a reading aid for the blind that John invented. The back page of your handout shows a
picture of John and his blind daughter, Candace, giving their paper at the International
Solid-State Circuits Conference in 1969. Candy read at a speed of 75 words per minute. The paper received a still unmatched
standing ovation from the audience. There was not a dry eye in the house. To describe the last of these laboratories, the Center for Integrated Systems, I’m going to first turn to quotes from recent email exchanges that I’ve had with the Linvill Family. Quote: “We would like to thank you for
passing along the good news about the distinguished lecture series that will
be named for our father, John Linvill. We know he would have been deeply honored by this recognition of his contributions to Stanford and to the field of Engineering. As you know, our father’s interests were very broad. He was always interested in learning new things, and in the ideas, innovations and projects that his colleagues were engaged in. He was especially interested in how technology could be utilized to improve people’s lives, and to address real-world problems. He was committed to bringing concepts from different fields together to create new opportunities for collaboration and innovation, as happened when the Center for Integrated Systems was created.” Signed, Candy Linvill Berg and Greg Linvill. CIS opened its doors in 1980. This laboratory, from the beginning, was a systems laboratory, engaging faculty in both the EE and CS departments. The components of choice for electronic
system design then, in 1980, were general-purpose microprocessors and the
software that is necessary to program them for specific applications; for both documentation and communications. The CIS research focused increasingly on new computing architectures and on special devices and hardware required for their computation. The future of this new paradigm is going to be addressed by Professor Hennessy. My last contact with the Linvill family, prior to this lecture, was to tell them that the faculty had chosen John Hennessy to give the inaugural lecture. They wrote back to say, “Thank you for your message. We think John Hennessy is an excellent choice to give the Inaugural lecture, both because he is such a remarkable person, and because our Dad had
such respect and fondness for him”. In this, Candy & Greg expressed our
sentiments precisely. John Hennessy is not only a remarkable person
(he’s also reasonable) [laughter], he has had a truly phenomenal career. We are all indebted to him for his extraordinary leadership in the CS and EE departments; as dean of the School of Engineering; as Provost; and as President
of the University for 16 years. Please join me in welcoming John
to the Inaugural Lecture for the John G. Linvill Distinguished Seminar Series on Electronic Systems Technology. [applause] Thank you, thank you. Well I should say that I first met John
Linvill on the day I interviewed for a faculty position at Stanford. Sitting in
his office in McCulloch, I went over and, of course, he was a great man. I had already
… my morning had started with an interview with Don Knuth, so you can
imagine I was on thin ice at that point [laughter], but John was welcoming, I could tell
immediately that he was somebody that deeply believed in recruiting great
young faculty and investing. And certainly that was the case with me. He also had a very – as Gibbons alluded to – a very expansive view of what electronic
technology and its impact were about, and of course that led to CIS, which greatly
affected my career as well. So what I’m going to talk about today is really a massive change in the way we think about computing in the future which is a
coming together of multiple things — technology but not just technology there
are other aspects of it. But first I have to start out by saying what an
incredible golden age! If you look from 1977 the first microprocessor: four bits!
Four bits, right? I mean, about 3,000 transistors that’s it.
Through to today, roughly forty percent annual performance improvement.
Roughly a million times faster in that period – throughput. And what happened?
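[A quick check of the compounding arithmetic behind those two figures, as a minimal sketch. The 40% annual rate and the million-fold total are the numbers quoted above; the function names are illustrative only, and smooth year-over-year compounding is an assumption.]

```python
import math


def years_to_reach(total_gain: float, annual_rate: float) -> float:
    """Years of compound growth at `annual_rate` needed to multiply performance by `total_gain`."""
    return math.log(total_gain) / math.log(1.0 + annual_rate)


def gain_after(years: float, annual_rate: float) -> float:
    """Total performance multiplier after `years` of compound growth at `annual_rate`."""
    return (1.0 + annual_rate) ** years


if __name__ == "__main__":
    # 40% per year and "a million times" are the figures quoted in the talk.
    print(f"years for 1,000,000x at 40%/yr: {years_to_reach(1_000_000, 0.40):.1f}")
    print(f"gain after 41 years at 40%/yr: {gain_after(41, 0.40):,.0f}x")
```

[Under these assumptions, roughly four decades of 40% annual improvement is what it takes to reach a million-fold gain, which is consistent with the period described.]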
Well, obviously, you’ve got more transistors; things got wider, from 8 bits
to 16 bits to 32 to 64. We had a big push on instruction-level parallelism. I’m gonna say a lot about that, but basically if you go back to the era of
the 1970s, it takes about ten clock cycles to implement an instruction on
the typical early microprocessor. Now a typical
microprocessor does four instructions every single clock cycle, or tries anyway. And then multi-core: from one processor per chip to 32, or maybe even more, if they’re
smaller and more limited in capability. A clock rate from 3 megahertz
to 4 gigahertz – now that’s a combination of technology but also architecture.
There are a lot of architectural changes that underlie it and made it possible to do that. But without the changes in integrated
circuit technology; without the dramatic impact of Moore’s Law, you never could
have gotten this. In fact, in some ways the architect’s job for
many years was to take a quadratic increase in the number of
transistors every year, or every 18 months let’s say, and turn it into faster machines.
Because you were getting a lot more transistors, at a rate that
even exceeded how much faster they were getting. So that’s coming to an end.
Moore’s law is beginning to slow down. It’s not ended yet, as I’ll show you, but
it’s slowing down. But probably the bigger crisis right now is the end of
Dennard scaling, which basically said that power per transistor shrinks
basically linearly as the transistor gets smaller. That’s going away and that’s created a real crisis. That end of Dennard scaling – the fact that you cannot put more transistors on a chip without increasing the power – has really
created a problem. The fact that Moore’s Law is slowing down, creates a secondary
issue. Both these things actually are going to be problems. If you have transistors that
are not efficiently used, they take up area, they burn power, they cost… So the
issues of efficiency are very deeply rooted here. Well, that’s a problem
because some of the things that are happening on the architectural side are
pushing the limits of efficiency and whether or not you can get a more efficient
design. So that’s creating a difference in how we think about architecting those
machines. And then finally there’s a big application shift. The desktop and the
personal computer ruled for so many years. That’s not the important part of
the spectrum. The important part is either in my pocket, or it’s in massive
cloud-based computer centers, right? An acre of computers. A small one: think of a million cores. That’s a small one. So these are big, gigantic cloud machines
that have different constraints. This just shows you single-processor
performance, what’s happened over time. So look at this era right in here:
52% per year improvements in performance. Really dramatic era.
Sort of beginning with the very first RISC processors in here, and then going up,
until it begins to slow down, then it slows down more, in the last two years,
3.5% per year – that’s almost nothing. In terms of unit processor, it’s almost nothing. And energy efficiency has become the new
metric. It’s the thing that everybody cares about. If you look at these portable
devices, the screen tends to take up the most power.
Second, the processor and the chipset surrounding the processor are the
second largest power consumer. So if you’re walking around with something where battery life matters, which it does, then you have to worry a lot about that.
The thing that might surprise people is in the cloud you care a lot about energy efficiency.
“It’s not immediately obvious,” you’d say. “Well, I’ve got an acre of computers. I’m going to buy my power from the power company with big cables coming in.
Why do I care?”
Because the cost of the power infrastructure is so high. So in terms of
capital cost: here are the servers, there’s power and cooling. Capital cost
is about the same for a large-scale data center. And even when you look at
amortized, effective operating cost, that’s dominated by the servers,
primarily because they get written off in about three to four years, while the
power infrastructure lasts about ten. But it’s second only to the computers.
So it’s a big part of the cost equation. Okay, so that’s sort of a shift in what we care
about in terms of energy and power. At the same time we’re running into the
slowdown in Moore’s law. Perhaps most dramatic in DRAMs. DRAMs have some
unusual things about their structure that makes them more acute in terms of
pushing the edge. The primary one… oh, look at this slowdown: ’77 to ’97 about
1.4x growth/year. Then down to 1.34x, then down to 1.1x. There is not even an
announced follow-on for DDR4. For 20 years we’ve had these memory specs, boom, boom, boom; everybody agrees on what the standard is.
There is no DDR5. There’s no announced follow-on spec for next-generation
DRAMs. So that’s an amazing change. This is the best graphic I found for explaining it. Here’s the aspect ratio of a modern DRAM: this is the trench that contains the capacitor, 25 to 1. Here’s the tallest
building in the world: 6 to 1. So imagine what’s already been done is phenomenal. To expect to continue that incredible progress is probably unrealistic. But there’s also a slowdown in other measures of Moore’s law. This
shows the transistor counts in Intel processors as a way of measuring
the density of transistors. And it’s beginning
to deviate from the Moore’s law line that you would have here. Not as
dramatically as DRAMs, but it’s certainly accelerating in terms of its separation. Then there’s the end of Dennard scaling, which
basically said that you had constant power per mm^2 of silicon –
that’s the simple way to think about it. If you think about what that means —
constant power per mm^2 of silicon, means that energy per
computation is decreasing, because from one generation of silicon to the next I
get more transistors. Those transistors burn the same power,
but they better do more work. And if they do more work, energy is actually
dropping per computation. So that happened for a long time.
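[A minimal sketch of that argument using the idealized textbook form of Dennard scaling. The specific scaling rules below (capacitance and voltage shrink by 1/k, clock frequency grows by k, density grows by k^2) are an assumption brought in for illustration, not figures from the talk; the function name is hypothetical.]

```python
def dennard_scale(k: float) -> dict:
    """Relative changes under idealized Dennard scaling for a linear shrink factor k > 1.

    All values are ratios relative to the previous process generation.
    """
    capacitance = 1.0 / k          # per-transistor capacitance shrinks with feature size
    voltage = 1.0 / k              # supply voltage scales down with feature size
    frequency = k                  # smaller transistors switch faster
    density = k ** 2               # transistors per mm^2
    power_per_transistor = capacitance * voltage ** 2 * frequency  # dynamic power: C * V^2 * f
    return {
        "power_per_transistor": power_per_transistor,
        "power_density": power_per_transistor * density,  # power per mm^2
        "energy_per_switch": capacitance * voltage ** 2,   # energy per computation step
    }


if __name__ == "__main__":
    for name, value in dennard_scale(2.0).items():
        print(f"{name}: {value:.3f}")
```

[In this model power per mm^2 stays at 1.0 while energy per switching event falls, which is the “same power, more work” behavior described above. Once voltage stops scaling, the model no longer holds and power density rises with each shrink.]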
From about 1977 until about 1997; then it began fading. 1997 is really when it
begins to fade. And then starting in 2007, very rapidly. Essentially gone today.
So here’s a plot which shows this. We’re looking at technology in
terms of nanometers; the increase in technology. And then we’re looking at
power. Energy/nm^2. That’s power, not energy. I’m sorry, it’s power not energy. So there is an improvement in performance that goes along with that. So it’s not quite as bad as that chart makes it look.
But it’s not good. It’s not good. And that’s really driving so much of what’s happening as people think about future generations of processors. I think of this as a crisis. It’s a crisis in the sense that it puts
processors in a really difficult situation. If you had told me 20 years
ago that microprocessors would turn themselves off, or slow their clocks down
to prevent overheating, I would say you’re crazy.
That’s never gonna happen. Every single big processor out there does it today.
Turns off cores, slows down clocks, in order to prevent exceeding the thermal
dissipation capability of the package that it’s in. So we’ve got to think about how
we solve that problem. The difficulty is – the natural reaction is to say – well,
design architectures that are more efficient in terms of power. But our
problem is that the dominant general-purpose architectural techniques
have reached their limits, and pushing them further creates diminishing returns
and it creates rapidly diminishing returns, when you look at it from a
perspective of energy efficiency. And the reasons for that are some having to do with instruction-level parallelism and the way we pushed the performance of
those processors up in that period where it was going up 50% a year. Some of it has to
do with problems in multi-core, and there are even issues with caches, which have
become the way in which we hide the latency gap between a processor running
in the gigahertz range, and relatively slow DRAMs. Even they have problems.
You just can’t throw more transistors at it and get very much performance out of it. Three-minute tutorial on instruction-level parallelism, so you can understand what’s
happening here. So this is the way processors used to work up until about
the early RISC days. Let’s say through the
mid-80s, through the early-80s. Imagine red car is an instruction, and
white car is an instruction. So you start red car instruction. Let’s suppose it takes
five clock cycles. So it takes five clock cycles. When it finishes, you start
white car instruction. You get one instruction done every five clock cycles.
That’s how processors indeed used to work. So then there’s a very clever idea
called ‘pipelining’, and it’s exactly what it sounds like. It sounds like an assembly line.
It is an assembly line. You just start red car, you start white car
one cycle later, another white car, a different white car another cycle later,
another white car. Now notice each instruction still takes five clock
cycles, but you get an instruction done every clock cycle. That’s the key idea
behind instruction level parallelism. That’s one key idea.
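[The assembly-line arithmetic can be sketched as follows. This is an idealized model that assumes no stalls and no branches, and the function names are illustrative; the five-stage figure is the one used in the example above.]

```python
def cycles_unpipelined(n_instructions: int, stages: int) -> int:
    """Each instruction runs start to finish before the next one begins."""
    return n_instructions * stages


def cycles_pipelined(n_instructions: int, stages: int) -> int:
    """A new instruction starts every cycle; the first one takes `stages` cycles to fill the pipe."""
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1)


if __name__ == "__main__":
    n, stages = 1000, 5
    slow = cycles_unpipelined(n, stages)
    fast = cycles_pipelined(n, stages)
    print(f"unpipelined: {slow} cycles, pipelined: {fast} cycles, speedup: {slow / fast:.2f}x")
```

[For long instruction sequences the speedup approaches the number of stages, even though each individual instruction still takes the full five cycles.]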
The next thing you do – so that took us from about early 1980s up until say just after 1990 early-
… mid-1990s maybe. These got deeper and deeper and deeper.
I’m taking five, but imagine in a modern processor – think of that as 15 clock
cycles long. So there are fifteen instructions overlapped in a pipeline
that’s 15 deep. And all I’ve done is I took the steps and I sliced them into
little tiny pieces and I allowed instructions to follow along, convoy style, one behind the other.
The next thing people did was duplicate the track. Have two racing
tracks so two cars are going at the same time. I can start two cars down the
race track every single clock cycle. Now I’ve got ten times the throughput,
because I have two instructions starting every clock, and they are five times as fast as they used to be. Ten times [increase] out of that idea. These were the two ideas that
dominated architecture from say ’82 till about the mid-’90s. So what happened?
The key thing to understand is that we pushed this technology really hard.
Initially five-stage pipelines; today an Intel i7 has a 15-stage pipeline. That allows the clock rate to speed up a lot. It means you have to add
some extra logic, but it’s actually not that hard to do. And then we went to
multiple issue. A modern Intel processor can issue four instructions every single
clock; an IBM Power processor can do six. They’re in that range. We’ve got a lot of throughput. While that requires some careful timing… and some energy issues, this requires a big increase in transistor count as well.
There’s a lot of complexity to doing multiple instructions per clock–a lot of
complexity. So that’s where a lot of the transistor count goes. Why did this end?
It simply ended because it became energy inefficient to go any further down that road. A way to think about this is: suppose I have four instructions starting every
clock, so I’ve got four race cars starting every cycle, and I’m 15 deep. That means I have 60 instructions in execution, what we call ‘in flight’. Sixty
instructions are in flight at any given time. 60! Now in a modern Intel processor in order to maintain 60 and to get some throughput,
you probably have to have 120 to 140 instructions in flight.
Well, has anybody ever seen a piece of code with a hundred and twenty
instructions, in sequence, with no branches? Certainly there are loops;
there are ‘if-then-else’ statements in the code. So you’ve got all these branches.
How can I possibly get more than a hundred instructions in flight with all
those branches? The answer is: I guess. I speculate.
So what happened is, we built these elaborate prediction mechanisms that
predict what happens with branches. Whether a branch is taken or not taken.
Fairly easy to do loop branches, right? They’re mostly taken. If the loop runs
many times, then the branch going back to the top of the loop is taken
most of the time. So we predict the branch. We guess that the prediction is
right, and we begin piling instructions into the pipeline, as if the prediction
was right. If, for example, I have 15 instructions I’m looking at, that would typically have about four branches in it. To get all four branches right 94% of the
time, I have to predict each one correctly 98.7% of the time. If I have 60 instructions in flight, to get 90% accuracy, that means all 15
branches are predicted correctly. I have to have 99% accuracy.
And if I take that up to 120 the number becomes mind-boggling. Very hard to do.
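[The accuracy numbers can be reproduced with a short calculation, as a sketch that assumes each branch prediction is independent and that roughly one instruction in four is a branch (the ratio implied by the 15-instruction and 60-instruction examples). The function names are illustrative.]

```python
def all_correct_probability(per_branch_accuracy: float, n_branches: int) -> float:
    """Chance that every one of `n_branches` independent predictions is right."""
    return per_branch_accuracy ** n_branches


def required_per_branch_accuracy(target: float, n_branches: int) -> float:
    """Per-branch accuracy needed so that all `n_branches` are right with probability `target`."""
    return target ** (1.0 / n_branches)


if __name__ == "__main__":
    print(f"4 branches at 98.7% each: {all_correct_probability(0.987, 4):.1%} all correct")
    for in_flight in (60, 120):
        branches = in_flight // 4  # assumed: about one branch per four instructions
        needed = required_per_branch_accuracy(0.90, branches)
        print(f"{in_flight} in flight ({branches} branches): need {needed:.2%} per branch")
```

[Under these assumptions, 15 branches need a bit over 99% accuracy each to be all correct 90% of the time, and 30 branches push the requirement well past 99.5%.]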
Very hard to have that kind of accuracy. So what does that translate into?
Well unfortunately, when I speculate, and
I speculate wrong, I guess the branch incorrectly, I do a lot of work. It takes me 15 clock
cycles to figure it out, by the way. That’s the branch misprediction
time on a modern processor (15 cycles). I don’t find out until 15 cycles later.
I piled in all these instructions, I’m executing them, as if they’re really useful work.
And then I got the branch wrong. I have to pull all that stuff out and throw it away and restart. Okay, that’s what happens when you speculate.
[inaudible] [laughter] So, this is just a simple way of looking
at it. Here are a bunch of benchmarks. These ones are integer, these ones are
floating-point, down here. And this shows you on an Intel Core i7, how much of the
work is wasted. Meaning, I executed instructions that were useless; that I
ended up throwing away. So 30%, 25%, 40% in some benchmarks.
Useless work. Now, obviously I’ve lost all the energy that went along with that. I wasted my time, first of all. I did something useless.
But all the energy that got burnt pursuing those instructions, went down the drain.
It just made the chip get hot. And it’s not free to clean those instructions out; there’s also a bunch of overhead associated with cleaning and restarting
the pipeline. So that’s a real problem. And that’s really what drove the end of the pursuit of instruction level parallelism. It’s not theoretical, it’s actually not a theoretical limit. You can show theoretically that there is plenty of parallelism out there. The problem is it’s just hard to get, unless you can
build an oracle that tells you everything perfectly. So what happened? So everybody said, “Okay,
we got to give up on that. Let’s go try another approach. And since we’ve now tried an approach where the compiler and the architect are responsible for
finding all that instruction-level parallelism (it all gets figured out by the
hardware and the compiler), let’s make the programmer do the hard work. Let’s make
the programmer find the parallelism. Make them responsible for identifying things that can execute in parallel, so that I can speed up the performance.” So that gave rise to what became
called the Multi-core era. We would run separate threads
designated by the programmer – a thread just means a separate process
that can run in parallel – the programmer would find those. We’d run them on
separate cores. And now we’ve got a very simple strategy for scaling. You get more
transistors, add more cores. Just put more cores on the chip.
It’s pretty easy, it doesn’t involve a lot of complexity – there’s some in the
caches and things – but it’s pretty straightforward to see how to do it. So of course energy is still proportional to
the number of transistors that are active, that are doing work for us. So I
still have to use those cores efficiently, if I’m going to use this technique to overcome the problem: the performance and the energy limits. But there’s a little difficulty that comes
from Amdahl’s law that makes this difficult. Amdahl’s law is an observation that
Gene Amdahl made in the 1970s about computers that were parallel; that had multiple processors, multiple cores associated with them. What Amdahl said is: the speed-up – how much faster that program will run – is limited by how much of it can only run on one
processor, or maybe four processors, when there are sixteen. […]
And that limits how much faster the program can run. So what about this?
Well let’s look first at a modern multi-core
so you get an idea of what happens. So this is a Power8,
but it doesn’t look that different than an Intel i7 multi-core, or an AMD
multi-core. I’ve got a bunch of cores; each one of these is a separate
processor. They have their own caches here. They’re hooked up somehow with a
network and they’re hooked out to some kind of memory controller as well as I/O
and other things. But the key thing is that each one of these cores can be
designed separately, and now I can just scale up by adding multiple cores. Well what about the Amdahl’s law effect? How serious is the Amdahl’s Law effect?
So this just shows you, the answer is it’s very serious. It was very serious
when Amdahl predicted it, and it’s still very serious. This shows you – suppose I
have 64 processors – okay? So 64 cores. What happens if 10% of the code can
only be run on one processor? Or 8%? Or 6%? Or 4%? Or 2%? Or 1%? 99% of the code can be run on 64 processors.
1% – only one little percent – has to run sequentially. Well, that limits your speed up to 36. Well that’s not very much of the code.
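[A minimal sketch of the standard textbook form of Amdahl’s law, applied to the 64-processor case. The formula is the usual one and is not taken from the slide; it gives roughly 39 for a 1% sequential fraction, in the same range as, though slightly above, the figure quoted above. The function name is illustrative.]

```python
def amdahl_speedup(serial_fraction: float, n_processors: int) -> float:
    """Speedup over one processor when `serial_fraction` of the work cannot be parallelized."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


if __name__ == "__main__":
    for percent in (1, 2, 4, 6, 8, 10):
        speedup = amdahl_speedup(percent / 100.0, 64)
        print(f"{percent:>2}% sequential on 64 processors: {speedup:.1f}x")
```

[The curve drops quickly: in this model, going from 1% to 10% sequential code takes 64 processors from a speedup in the high 30s down to under 9.]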
Suppose 5% or 10%… …then you’re going to be down here. Now for many years people have thought there are various ways to overcome Amdahl’s Law. And there are. But overcoming it in a general-purpose computing environment
turns out to be very hard. And repeatedly, just when everybody thinks, “Well,
we’ve solved the problem!” – for these large workloads that we want to run on cloud
machines, another instance of Amdahl’s Law raises its ugly head. Something becomes a key limiter. To any extent that these processes
need to coordinate with one another and synchronize – that creates an Amdahl’s Law bottleneck immediately. But remember we also need to think about: “what are those processors doing, that aren’t doing any useful work?” because they’re limited by
the Amdahl’s Law of serialization. They’re just waiting. They’re standing there
waiting for one processor to finish this little tiny 1% of the code, so the rest
of them can go ahead and execute the rest of it. But guess what they’re doing
when they’re waiting – they’re burning power. They’re not shut down, because
when you shut them down, it’s a long time to restart them. This is
not a fast process. You do not want to shut them down. All right? So the result
is you have them waiting typically for a signal from the one processor that’s
doing this one little section of code: “Okay, we’re all done; go ahead.” So they’re burning power. The problem is that the solution we picked has another source of
energy inefficiency and that means that the end of Dennard scaling is the end of
multi-core scaling. At least as it’s been done so far. So the result is we
have what we call dark silicon. I mean literally cores get turned off, but you
better be very careful before you turn the core off, because it takes a few
million instruction cycles to get the core turned back on.
Now how serious is this? Well, okay, today take a 22 nm process, large multicore, the Intel E7-8890. It’s a 24-core machine, 2.2 GHz. Notice that one of the things, if you look at these large multicore chips,
they already have clock rates which are almost a factor of 2 off of
the small cores. So your desktop machine, you can buy a
4 GHz desktop machine. You can’t buy a 4 GHz large core, because it radiates
an amazing amount of heat. If you take that out to an 11 nm process and you do the computation of how many cores
you could fit, it’s about 96 cores and it would run roughly in the
5 GHz range. The expected power consumption of that processor would be 295 watts. 295 W from a piece of silicon this big. If you look at what’s happened with packaging technology, it improves relatively slowly. Today, that’s a 165 W chip. With about average package improvement over that same period
we get to about 180 W of power dissipation. Even if you were aggressive
and you assumed you could get to 200 W, how many cores can you have active?
Well at 165 W only 54 – only 54 of the 96 cores can be active. Only 54.
Almost half of them off! At 180 W you can have almost 60 cores active. At 200 W you can have 65 cores active. You simply can’t have more cores active.
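[The active-core counts follow from a simple proportion, sketched here under two assumptions that are not stated in the talk: power is split evenly across cores, and a core that is turned off draws nothing. The 96-core, 295 W, and package-limit figures are the ones quoted above; the function name is illustrative.]

```python
def active_cores(total_cores: int, full_power_watts: float, package_limit_watts: float) -> float:
    """Number of cores that fit in the package power budget, assuming equal power per core."""
    watts_per_core = full_power_watts / total_cores
    return min(float(total_cores), package_limit_watts / watts_per_core)


if __name__ == "__main__":
    for limit in (165, 180, 200):
        cores = active_cores(96, 295.0, limit)
        print(f"{limit} W package: about {cores:.1f} of 96 cores active")
```

[With roughly 3 W per core, the three package limits allow about 54, 59, and 65 active cores respectively, matching the counts given above to within rounding.]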
Unless there’s a real breakthrough in removing heat from those packages that’s cost-effective. We could go to liquid cooling. Seymour
Cray was fond of saying, “The hard part of designing computers isn’t the
electronics it’s the plumbing.” We may get back to that. But that shows you how significant this limit is in terms of having a straightforward path to take multicore scaling way up. And if you look at the
combination of what happens when the number of active cores is limited
by the power, together with Amdahl’s Law, you get a really grim result. Namely, with 96 processors and 1%, only 1%, of the code being sequential, you
get a speed-up of about 38. And that’s it. Less than half the processor count.
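[A sketch of how the two limits combine, assuming the textbook Amdahl formula and treating the power-limited active-core count as the effective processor count. The active-core values are rounded from the package limits discussed earlier; which one the slide used is not stated, so several are shown.]

```python
def amdahl_speedup(serial_fraction: float, n_processors: float) -> float:
    """Speedup over one processor when `serial_fraction` of the work is sequential."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)


if __name__ == "__main__":
    total_cores = 96
    serial_fraction = 0.01  # 1% sequential, as in the example above
    for active in (54, 59, 65, 96):
        speedup = amdahl_speedup(serial_fraction, active)
        efficiency = speedup / total_cores
        print(f"{active} active cores: {speedup:.1f}x speedup, {efficiency:.0%} of {total_cores} cores")
```

[Under these assumptions the speedup lands between about 35 and 40 for the power-limited cases, against roughly 49 if all 96 cores could run, so the quoted figure of about 38 sits in the expected range.]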
So basically your efficiency is less than 50%. So that’s pretty ugly. And that’s why this phrase ‘dark multicore’, ‘dark silicon’ has arisen. Because there’s no simple way around that part of the problem. So is there a road forward?
Luckily, I’m getting close to retirement, so I won’t have to worry about that part. [laughter] Unfortunately, I think
there is no obvious path for general-purpose processors. There’s no
obvious path. The failure of Dennard scaling means that any inefficiency
translates into a real problem, in terms of advancing that processor. And
unfortunately, the way we know how to make processors faster is by making them
less efficient. By burning transistors, basically, as a way to get speed
faster than we just get it from the clock rate. So there’s no obvious way. Is
there an alternative way of thinking about the problem? Well, there’s a great draft paper out from our colleagues at MIT, entitled “There’s Lots
of Room at the Top”. And what they point out is that the last
20 years we have abandoned efficiency in software in favor of productivity. So if you […]
Python. They’re making use of libraries that were developed using
polymorphic techniques that can be used for multiple functions. There’s a lot of
software reuse in that model, but there’s a lot of inefficiency in that model. They
have one example where they take a piece of code written on multiple
levels of interpretation. They rewrite it in C and it runs 10,000 times faster. So
provided that you’re willing to give up some of the efficiency we’ve gotten in
terms of software productivity, then there’s a route back. But this is
kind of the inverse of where we’ve been for many years, because where we’ve been
for the last 30 or 40 years is: hardware’s getting faster, don’t worry
about it software guys. Just write your code. Anyway you get that code out
there. Don’t worry about how efficient it is. But that may not work anymore.
So that’s one route. But there’s another route, which I think from an architecture
viewpoint is potentially very attractive. Which is to start thinking more about
tailoring the architecture to some specific domain. What we call
domain-specific architectures. So think of them – they’re sort of like ASICs, but
they’re programmable. And the intention is not that the processor does one
function, but it does a family of functions. The best-known examples of
these are GPUs (graphics processing units). They’re programmable, but they’re
clearly designed to do certain classes of problems that have certain structures,
not just graphics, but other things that have similar kinds of structures that
rely on linear algebra. And now, most recently, you’ve seen a big rise of this
with respect to neural networks and deep neural network computing.
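To make the shared structure concrete, here is a small sketch (all values are hypothetical) in which one matrix-vector routine serves both a graphics task, rotating a 2-D point, and a neural network task, evaluating one dense layer followed by a ReLU. It is meant only to illustrate why a single programmable engine can cover a family of functions.

```python
import math


def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v: one dot product per row."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


# Graphics: rotate the point (1, 0) by 90 degrees.
theta = math.pi / 2
rotation = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]
rotated = matvec(rotation, [1.0, 0.0])

# Neural network: one dense layer (weights times inputs), then ReLU.
weights = [[0.5, -1.0], [2.0, 1.0], [-1.0, -1.0]]  # hypothetical weights
activations = [max(0.0, z) for z in matvec(weights, [1.0, 2.0])]

print(rotated)      # approximately [0.0, 1.0], up to floating-point rounding
print(activations)  # [0.0, 4.0, 0.0]
```

Every row performs the same multiply-accumulate pattern on different data, which is the regularity that domain-specific hardware is built around.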
Again, it’s a linear algebra problem, so you build some special purpose machines
to do that. In the past, if you look over history, except for very limited applications, the
special purpose machines have never managed to maintain an advantage over
the general purpose architectures. Lots of people have tried, they’ve gotten
out there and gotten a niche, but it’s been hard to hold on to that niche. But I
think there are ways to hold on to that niche. And the ways to hold on to that
niche, is to build something which captures a class of problems, maintains
programmability, maintains flexibility – and that’s the way to hold on to it. So you
might ask yourself, “But why? Why will those domain-specific processors be
faster? What about the architecture?” We ought to be able to explain something
that tells us why they’ll be more
efficient and faster. And indeed there are some key underlying principles that
people are taking advantage of. First of all – find a more effective way to do
parallelism. In particular, the way we do parallelism in multicore, is what’s
called multiple instruction multiple data (MIMD). We’ve got a set of independent
pieces of code that are running independently. In SIMD (single
instruction multiple data), we have one instruction that’s broadcast to many
different data units, but we only have to fetch one instruction; control is a lot
simpler; and it’s a lot more efficient when it’s a usable programming model. The other
thing we do is go back to an older idea, an idea called VLIW (very long instruction word).
Rather than have the hardware try to figure out which parts of an instruction [stream]
can go in parallel, we have the software figure it out.
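As a rough illustration of software deciding what runs in parallel, the sketch below packs operations into fixed-width bundles the way a simple VLIW compiler pass might. It is a toy under stated assumptions: every operation takes one cycle, each result name is written only once, only true data dependences are tracked, and the function name and operation format are hypothetical.

```python
def pack_bundles(ops, width):
    """Greedily place each op in the earliest bundle that follows all of its inputs.

    ops   -- list of (dest, sources) pairs in program order (hypothetical format)
    width -- number of operations the machine can issue per bundle
    """
    bundles = []    # bundles[i] lists the results issued together in cycle i
    ready_at = {}   # result name -> index of the bundle that produces it
    for dest, sources in ops:
        # Earliest legal cycle: one past the latest producer of any source.
        # Names never produced by an op are treated as inputs available at cycle 0.
        cycle = max((ready_at[s] + 1 for s in sources if s in ready_at), default=0)
        # Skip bundles that are already full.
        while cycle < len(bundles) and len(bundles[cycle]) >= width:
            cycle += 1
        if cycle == len(bundles):
            bundles.append([])
        bundles[cycle].append(dest)
        ready_at[dest] = cycle
    return bundles


# s = a*b + c*d, plus an unrelated product q = e*f.
ops = [
    ("p1", ["a", "b"]),
    ("p2", ["c", "d"]),
    ("s", ["p1", "p2"]),
    ("q", ["e", "f"]),
]
print(pack_bundles(ops, width=2))  # [['p1', 'p2'], ['s', 'q']]
print(pack_bundles(ops, width=1))  # [['p1'], ['p2'], ['s'], ['q']]
```

With a width of two, the four operations fit in two bundles, and the hardware needs no logic to discover that `p1` and `p2` are independent because the schedule already says so.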
Now – you’re not going to take a big gnarly piece of UNIX code, an operating
system, or a C compiler, or something and get the hardware to figure that all out.
But there are domains for which the structure is sufficiently well
understood, that the compiler can analyze the structure, and do that piece of work –
that otherwise we rely on hardware to do. They make more effective use of memory
bandwidth. So one of the great inventions of modern computers was the idea of
caching. It made the gap between DRAMs (or, before that, core memory)
and processors much smaller. The problem is that caches don’t always work and
when they don’t work, they don’t work in ugly ways. So they break down. If the
programmer could manage that memory more efficiently, then I could get a decrease
in the complexity, and an increase in the performance out of it. So we move
to user-controlled memories, versus caches. Eliminate unneeded accuracy. So we all
jumped on this bandwagon that moved to IEEE standard arithmetic, very
high precision, lots of things, and for lots of programs, that
precision is simply not needed. So you can increase the throughput by
using smaller units. Using 8 bits or 16 bits for example. And by actually not
having to be quite so accurate in terms of floating-point rounding and other
issues that are required to implement the standard. The key for all this to
work is that you better have a domain-specific programming model which
makes it possible for the software to match with the hardware. That’s the key. That’s the key thing that
has to happen. So this is a great quote by Dave Kuck. Dave Kuck was a
famous early software guy who did a lot of foundational work on compilers. But he
worked on the ILLIAC IV, one of the first SIMD (single instruction, multiple data)
machines. And he was, by the way, the software architect, because everybody
knew the hard thing was designing the hardware, not the software.
He was the software architect. But he had this key insight: “We really didn’t
understand how to get a good match between the software we wanted to run
and the machine we were designing”. As a result, the ILLIAC IV never did very
many general-purpose things despite the fact that it was a really pathbreaking machine
in terms of its approaches. So achieving performance in this new era is going to
require thinking differently about it. If you think about what happened in the
1980s, as we moved from people programming in assembly language to
people programming in C and Fortran, what the architect had to do was to know what
the output of a compiler looked like, because everybody was using a compiler
to compile their C or their Fortran down to something. The architect had to know
what that fairly narrow interface looked like and then they could build the
machine as fast as they wanted underneath it. And indeed all these
speculation ideas came out of that. Oh, by the way, in case you didn’t know –
that speculation idea is the reason there’s a big security hole called
Meltdown and Spectre. It is exactly that speculation idea that creates that
security hole. So here we were being very aggressive architects,
thinking very cleverly about how we’d make the program go faster – and nobody
realized that in roughly 1997 or ’98, 2000 we opened a giant security hole that’s
been there for more than 15 years. 15 years, right? So, there are a lot of issues here anytime
you think about that. But what if I can change the way I build the interface,
not just to the low-level software but to the way in which programs are
written, so that I’m thinking about the algorithms a lot more? If you think
about what happens: take a linear algebra problem, right? Take a
big matrix multiply, and you boil it down into a bunch of individual instructions,
you’ve lost all the structure. You’ve thrown it away. But that problem has a
lot of structure, and a lot of [re]use. If you can understand that structure,
import it into the architecture, and take advantage of it, you can capture a lot
of efficiency. And that’s, I think, what’s going to have to happen in this era of
domain-specific architecture. It’s going to create a lot of challenges because,
unlike one general-purpose processor which now covers 98% of the applications,
you may end up having five or ten or fifteen or twenty different
architectures optimized for different things. So imagine that your self-driving
car has an architecture optimized for machine learning and deep neural
networks, related to driving. Imagine that your giant machine in the cloud has a
more general-purpose deep neural network machine that’s handling other kinds of
problems, whether it be speech recognition, image recognition, medical
diagnosis. Imagine that your phone has something on it, so you can give
your speech to your phone. And of course your virtual reality headset
has some other kind of special purpose processor – a domain-specific processor –
that’s designed for doing virtual reality or augmented reality kinds of things.
So you may have to have multiple different architectures covering the
space. You’re going to have to get the algorithms people and the people who
understand applications to work much more intimately with both the software
and the hardware people. And you’re going to have to work on the problem of design
cost. Because you don’t just have one big processor covering the space, but
you’ve got to design 10 or 15. Oh boy, you better get the design cost of those 10
or 15 processors down significantly from where it is today. I think this, though,
is the direction that we have to go. And I think if it’s
pursued, and if we really can put together some rethinking of the interfaces
between the hardware and software, that will give us enough of a bridge that
hopefully our friends in the silicon world will reinvent something that can
follow onto silicon and contain all the wonderful properties that we got to take
advantage of in Moore’s law. If we do that, we’ll have a nice path forward.