Efficient Robot Skill Learning: Grounded Simulation Learning and Imitation Learning from Observation


[MUSIC]>>Well, it’s an honor to
have Peter Stone with us today for MSR AI
Distinguished Lecture. Peter is a Centennial Professor and Associate Chair of
Computer Science as well as Chair of the Robotics Program at the University of Texas at Austin. He did his PhD work
at Carnegie Mellon, and he did his bachelors work at the University of
Chicago in mathematics. I should say that the CMU
work was in computer science. After that, he went off to be
a member of technical staff at AT&T Labs for maybe
two or three years, looks like three, and
before coming to UT Austin. His research interests include
planning, machine learning, multi-agent systems,
robotics in e-commerce, and he’s applied various principles of the aforementioned
areas to robot soccer, autonomous bidding agents, and
autonomous traffic management. Peter has developed teams of
robot soccer agents that have won 11 robot soccer tournaments in both simulation and in the
real-world version of that contest, and he’s also been working on
agents that have placed in winning roles in 10 auction trading
agent competitions, the TAC competitions. Peter is a Sloan Fellow, Guggenheim Fellow, AAAI Fellow, IEEE Fellow, AAAS Fellow, and a Fulbright scholar. I’ve pruned the list so it’s shorter, but there’s more there. In 2013, he was awarded the University of Texas System
Regents’ Outstanding Teaching Award and inducted into the UT Austin
Academy of Distinguished Teachers, and we’ll be benefiting
maybe a little bit from that today in his lecture. He also received the prestigious IJCAI Computers and Thought Award, which goes to a leading
AI researcher under 35, just a few years ago, and the Autonomous Agents Research
Award a little bit before that. He’s also renowned for being the Chair of the
Standing Committee of the One Hundred Year Study on AI and that’s quite a
distinguished role as well. So with that, Peter.>>All right. Thanks, Eric, for the very kind introduction. Yeah. It’s been a real
pleasure and honor working with Eric on the One Hundred Year study
of Artificial Intelligence, and we’re at an interesting
moment now where we’re trying to think about what the
second study will look like. We got a great reception
for the first one. As Eric likes to say, it’s a longitudinal study. One point doesn’t make a line, but two points do make a line. So there’s a lot of pressure on
what the second one will look like to really get this going. So I’m spending a lot of time
thinking about that. But yeah, thanks for
inviting me to be here, it’s a real honor, and thanks
all of you for coming. In a talk like this, I often
have to make a choice. There’s a bunch of different
things that I do in my lab. So I could pick a single topic and go deeply
and talk just about that, or I could give a broad
and shallow overview of lots of different things. I’m going to try to
split the difference between those two and
give you an overview of what I focus on in
my lab in general, but then also take a
couple of deep dives on two technical contributions
that we’ve made recently, some of which were led by people sitting in this room who happen to be here at Microsoft now. Those are going to focus on
efficient robot skill learning. So first, the overview. In the talks that I’ve
given, and I realized recently that I can now say for
the last quarter century, almost all of them have had
this research question. It’s the theme that’s driven my research over that
long period of time, and it’s to what degree
can autonomous intelligent agents learn
in the presence of teammates and/or adversaries
in real-time dynamic domains? That’s what unifies all the
different things we do in my lab that leads us to publish
in various different areas, the autonomous agents and
multi-agent systems conferences, robotics as well. Some autonomous agents are
robots, though not all are. Then within machine learning, there’s especially a focus on
reinforcement learning. I’m going to really look at the interplay between
reinforcement learning and robotics in this talk. In my lab, we work from
both ends of the problem. There’s work that starts from
the algorithms and the theory in what I call bottom-up
research towards applications, and we also work from the other end, from some motivating applications, more top-down,
trying to think about what kinds of research needs there are that we don’t yet have
within our arsenal, within AI, and to use that
as a pulling function. If I’m going to give
a one-slide overview, it’s hard to give the details
of the algorithms and theory, but I can very quickly give
you an overview of some of the motivating application domains that I’ve worked on over the years. Those include, as Eric
said in the introduction, robot soccer, and this is a clip from a competition
almost 15 years ago now, where these robots made by Sony, the Sony Aibos, were working
autonomously trying to score a goal. When Sony stopped
making those robots, the Standard Platform League moved
to these humanoid robots, the Naos, which are now
made by SoftBank. Our robots are the ones with
their hands behind their backs, and this was the final of the 2012
competition in Mexico City. We were playing against a team
from the University of Bremen, the prior year’s champions. You have to remember,
when you’re looking at this, that the robots are fully autonomous: sensing, deciding, and acting. Here’s our robot doing another
breakaway, a little bit slowly, but you’ll see it go up and calmly
bank the ball off of the goal post into the goal. So that was on our
way to a 4-2 victory. When we got back to Austin, they lit the tower orange for us, which they usually only do
when the football team wins. So we were very honored by that. I’ll talk a lot more about
RoboCup on the next slide. I just became the president of
the Robot Soccer Federation. So I’ll talk a little bit about that. But we also work in social
robotics and service robots. This is a video from a joint
work with my colleague Ray Mooney on grounded
language learning for robots. So this is a video that was
illustrating a study that we did where we have these robots like this wandering our hall
more often than not. I tell people you don’t
have to ask us for a demo. You just come to our wing and there’ll be this robot
wandering around. In this particular case, it was over a week-long
study learning the way people would ask for
leading and delivery tasks. We were able to show that
over the course of the week the
dialogues got shorter, people were reporting that they were less frustrated
with the interactions, and it was more
successful at completing the tasks that people wanted to do. I did have a car in the DARPA
Urban Challenge back in 2007. So this is our car in the back that’s waiting
to make a left-hand turn. All the other cars
working on the inside were driven by human
drivers with helmets on, looking a little bit scared,
but we didn’t run into them. So it had to find a gap in
traffic, make a left turn. It did that about eight
loops in 20 minutes. We don’t have that
autonomous car anymore. We retired it. All the car companies are
really investing a lot of money in autonomous cars now, but we do still think about
what the world will look like when all the cars on
the road are autonomous. Will we still need traffic
signals and stop signs, or can we, using multi-agent systems, have something that
looks more like this? What’s going on here is
the cars that are white, this is a simulation of course, the cars that are white
have called ahead for a reservation instead of dealing with red lights
and green lights there, they have a reservation for the space time that they’ll
go through the intersection. The ones that are yellow
don’t have a reservation yet. But once they turn white, they have a guaranteed path
through the intersection that won’t collide with
any of the other cars. One of the first times I showed this was about 12 or 13 years ago at
a talk I was giving in India, and somebody said, “Oh, all the
intersections look like this.” But the difference
here is that we can guarantee that as long as the
cars are following the protocol, there won’t be any accident. So this has led to again a decade and a half or
so of research on how can we make traffic flow
through intersections and city grids more efficient without necessarily building more roads, and that also could be the
subject of a full talk. So these are the kinds
of applications that motivate the research that
I do in my lab that I’m going to talk about in
the technical deep dives. But also I should say that we got to the
point where we feel like reinforcement learning is ready for prime time, for industrial usage. So I formed a company called Cogitai. I’m the President and COO. So did the end of it,
not the beginning. I founded the company with
Satinder Singh and Mark Ring, and there’s an illustrious
brain trust, as we call it, of colleagues who have been working on reinforcement learning
with us over the years. Just a few months ago,
we launched Continua, a SaaS platform that’s designed not for people who want to do research on
reinforcement learning, so probably not
the target audience of people in this room, so much as for people who want to use reinforcement learning in
an industrial setting. So maybe some of you don’t want to necessarily get into
reinforcement learning research but have a problem where
reinforcement learning would apply. This is now available for you, so you can get a 40-day free trial, and then we’ve got some of the
first markets here in automotive, in robotics control, in
semiconductor control. But really, we feel like the use cases for reinforcement
learning are endless, but most of the platforms that are out there are designed
for researchers. So we’ve put all of
our effort into making this scalable and easy to use for people who want to apply
reinforcement learning. As for the long-term objective: there are lots of platforms for
supervised learning. We see reinforcement learning
as the stepping stone towards the ultimate objective
of continual learning: being able to learn multiple
different tasks from one another in a transfer-learning kind of setting, in
one long ongoing existence, just like we do as people. I’m not going to talk so much
about Cogitai in this talk, but I’ll be more than happy to talk with anybody offline
if you’re interested, and of course, you can get
information on our website. So that’s the overview of the kinds of things I work
on and some of the motivation. I’m going to go into a little bit
more detail on RoboCup, the Robot Soccer World Cup, because that’s really
the motivation for the efficient robot skill learning
that I’m going to talk about. Then I’m going to do, like I said, two more deep dives
on technical aspects. One is in the space of Sim2Real: learning in simulation and getting that to apply
on a real robot. We introduced a new algorithm known as grounded simulation learning. Then I’m also going to talk about imitation learning from observation, and in particular, two
different algorithms that we’ve introduced recently, one based on behavioral cloning and one based on inverse reinforcement learning. So to start with the motivation, I’ve shown you some
clips from RoboCup. I am now the President of
the RoboCup Federation, which includes people from
around the world who all are bound by an ambitious
long-term goal, which is to create a team
of humanoid robots that can beat the World Cup champions on a real soccer field by the year 2050. It’s good to have goals. Who knows? People ask me if it’ll
be possible or not. Thirty years is a long time. We learned a long time
ago in AI that if you’re going to make a prediction like this, best to put a year on it that’s
after you’re likely to have retired because then no
one will hold you to it. But it’s been going on
now for many years. We just had the 23rd
RoboCup I believe, and there’s several
different leagues. It’s got many virtues
as a challenge problem. Having
the physical interaction makes it very different
from some of the games like chess and Go and Jeopardy
and things like that. There are different leagues. But the best way to see some of the progress is to see what it looked like starting back
in the early years. The first RoboCups
were in 1997 and 1998. These are different leagues. Up in the top left, the
Standard Platform League. Those were the early
Sony Aibo robots, and these are videos from
teams that were created by people from around the world.
Not all my own robots. The middle-size robots up here
were using a real soccer ball, and there were some goals
scored even though there wasn’t much resistance from
goaltender all that often. There is the beginning of the Simulation League,
the Small-size League. I mean, this is frustrating to
watch right now when we look at it. But you have to
remember, back in 1997, a lot of roboticists
didn’t really have robots. They worked on an
aspect of the problem, said this could apply to robots. Getting 30 robots in a room
that only caught on fire occasionally was a big
win back in that time. If you now jump ahead 10 years but still quite some time in the past, 2005, 2006, you can see
a lot of improvements. The robots are moving more quickly. They’re better individually. There’s starting to be teamwork. It was the beginning of
the Humanoid League, so they just did a penalty
shot competition that year. But now there’s games with these humanoid robots and
there were attempted saves. Again, the thing to remember
here is that all of these are fully autonomous. One of the leagues
that we’ve been most successful in is the
3D Simulation League, and in fact, Patrick MacAlpine, who’s now here in the audience, has been really the
driving force behind that. I think we just won for the
eighth time in nine years. These are some highlights
from a couple of years ago. But this gives you a sense of what the 3D Simulation League looks like. Each of the agents is controlled
by a separate process. So unlike a video game
where you could have one program controlling all of them, they have partial information, and we’ve used several
different techniques. This is Patrick’s favorite highlight. So I always promised
him I’ll show this one. It wasn’t on purpose but the robot managed to kick it
through the legs of the opponent and into the goal. But we’ve used a hierarchical
machine learning method here to learn skills that have turned out to be more robust than the other teams’ at walking
and kicking and things like that, and also some multi-agent
methods to figure out where the robots should position themselves. So this will come back into the
next segment of the talk as well. But again, the challenges
there are immense. It’s in a physics simulator, so just getting them
to be able to stand up without falling over
is itself a challenge. Again, I could give a whole talk
on the research that’s gone into that and that was really the centerpiece of
Patrick’s dissertation. I did say the goal is by the
year 2050 to have a team of humanoid robots that can beat the best World Cup champions
on the real soccer field. We don’t play against
the World Cup champions, but the champions of
the Middle-size League here often play against
myself and some colleagues. So this is from 2011, and the people who make the robots always say, “Oh, they’re
going to hurt you. They’re way too fast for you.” But then we showed that the
aging amateur soccer players are still able to pass the
ball around the robots. That goal didn’t count because
that was called for offsides. But in principle, you can
see that we’re still faster than the robots, better
low-level skills. But every year, it does
get a little bit harder. Some people say that it’s
because we’re getting older, not because the robots
are getting better. But I actually think it’s
a little bit of both. So we just did that again. Actually, when I look at this now, we just had the 2019 competition, and definitely the robots
are a lot faster and more capable now than they were even just a few
years ago in that video. Also, I want to emphasize that
RoboCup is not just about soccer. There’s RoboCup Rescue for
disaster rescue scenarios, and in my lab we also participate
in RoboCup@Home, which is similarly for robotics, multi-agent systems, AI, but now
in a service robot capacity. There’s a few different
leagues there, including a Standard
Platform League where everybody uses this Toyota HSR robot, and the robot has to do tasks like putting away
groceries on shelves, and interacting with
people, setting the table. This is a clip from
taking out the trash. The robot has to go to a trashcan, take off the lid, pick up the garbage. It’s over a five-minute trial. It has to actually pick up
two different garbage bags to take them to a deposit location. So I’m not going to
show the whole video, but these are the kinds of tasks
that we have the robot doing. So this is where it gets
to the second bag and then navigates to the end there. But there’s a lot of
human-robot interaction kinds of challenges that
come up in this event. One of the
tasks is a restaurant task, where the robot is taken to a restaurant that it has
never been in before, so it hasn’t mapped it, and where
there are real customers. People then tell it orders, and it
has to go up to a table, identify what was ordered, pick the items up, and bring
them to the person. These are the kinds of challenges, and this is very, very difficult. It motivates the kinds
of reasoning that we just presented at the
ICAPS Conference. My PhD student Yuqian Jiang is shown here interacting with the robot in an open-world reasoning scenario. So she’s talking with the robot about trying to
bring fruit from the kitchen. Let me turn it up a little bit. Now the point of this video is that she said that
she wants an apple, but the robot doesn’t know for sure. This is open-world
reasoning so it doesn’t know for sure if there is
an apple in the kitchen. So we’re going to have two
different endings to this video, one in which it searches for the apple. I’ll speed
it up a little bit. It searches for the apple at the
locations in the kitchen where the apple may be and doesn’t
find an apple there. So it then goes to
the next place here. In this version of the video, it finds the apple. Now it basically has to do a step of merging the
hypothetical apple that it had in its knowledge base with the actual instance
of the apple and then taking it back to Yuqian. See, still there’s
plenty of work that can be done on the human robot
interface as you’ll see here. But in the alternate
version of the world here where there wasn’t an apple, it then goes and looks again in
other places where it might be, but then it keeps the apple in its knowledge
base as being hypothetical. When it doesn’t find one, it has to go back and
report that to the person. This is even a little more awkward, but you’ll hear what the robot says. So we’ll work on the actual language, but the point here
was the reasoning and the ability for the robot
to be able to deal with multiple different actual
worlds and the key to this was keeping some objects as being
hypothetical as opposed to being instantiated fully
in the knowledge base. Okay. So those are
the kinds of things that motivate a lot of the
work that we do in my lab: the robot soccer challenge, the RoboCup@Home challenge. One of the things you have to do if you’re going to be able to succeed at these is to have low-level
skills for these robots. So here’s where I’m going to go
into a little bit more technical detail on how we’ve achieved that, first in a Sim2Real context. This is joint work with both Patrick,
who’s here, and Josiah Hanna, a PhD student of mine, who’s about to graduate
and is going to start a faculty position at
University of Wisconsin. The idea, the motivation here is that learning on physical robots
is not very data efficient. It requires supervision especially if you’re trying to get them to walk. They can fall over, they could break. So it’s very tempting to say, “Well, let’s build a good
simulator and just learn in simulation and make that
work on the real robots.” But you learn very quickly
if you’ve done this, people have tried this for years, that you can learn in simulation
a very robust walk like this one. This is one of our learning trials from the early years, where we were
having the robot try to learn as well as possible to walk fast while dribbling
a soccer ball. If we take that walk
skill and put it into the real world, it can execute. So you can actually take
those same commands and execute them on the robot, but after two steps it falls over. It’s not a good policy. It’s an executable policy
but not a good policy. You can see in slow motion here it takes a couple of steps and then it looks like it’s
tripping over the line, but the line is actually flat. So that’s the question here, and there’s been a bunch of research
on this problem of Sim2Real. So we’re not the first, of course,
to think about how we can bridge this reality gap, the gap between
simulation and reality. There are two classes of approach. One is to try to learn a robust policy by making
your simulator noisier, so that whatever policy you learn is likely to work even in environments that are not
the same as the simulator. So there’s a class of
approaches that try to do that. Then there’s another class that tries to make the simulator more like the real world based on data
from the real world and try to get the simulator to really
align with the real world. The approach I’m going to talk about falls within that second class, but with the crucial difference that most people doing that are trying to make
a perfect simulator, and we started this research with the idea and the acceptance that
that’s never going to happen. There’s always going
to be a reality gap. Instead, we’re just
going to try to make the simulator closer
to the real world in a particular place in policy space where
we’re currently searching. So the basic paradigm is going
to be an iterative process, where we take a real-world
policy execution, record some state-action
trajectories, use those to ground the
simulator and I’ll tell you exactly how we
do that in a second, take that grounded simulator, and then do policy improvement
in simulation and then repeat. That gives us an improved policy that we can then execute
in the real world. We can then reground the
simulator and keep going. So the crucial question here is
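The iterative loop described here (execute on the real system, ground the simulator, improve the policy in the grounded simulator, repeat) can be sketched on a toy problem. Everything below is an illustrative assumption rather than the setup from the talk: a single joint whose "real" dynamics (`real_step`) are more sluggish than the simulator's (`sim_step`), a one-parameter policy `theta`, a one-number grounding fit in `ground`, and a grid search in `improve` standing in for the policy-improvement step.

```python
TARGET, STEPS = 90.0, 3
K_SIM = 2 / 3  # simulator responsiveness, assumed known for this toy


def real_step(s, a):
    """Hypothetical 'real' joint: covers only a third of the commanded move."""
    return s + (1 / 3) * (a - s)


def sim_step(s, a):
    """Hypothetical simulator: covers two-thirds of the commanded move."""
    return s + K_SIM * (a - s)


def rollout(step, theta, wrap=lambda s, a: a):
    """Run the policy for STEPS steps; return final error and (s, a, s') log."""
    s, log = 0.0, []
    for _ in range(STEPS):
        a = s + theta * (TARGET - s)   # one-parameter policy
        s_next = step(s, wrap(s, a))   # wrapper may replace the action
        log.append((s, a, s_next))
        s = s_next
    return abs(TARGET - s), log


def ground(real_log):
    """Fit real responsiveness from logged transitions; return action wrapper."""
    ratios = [(s2 - s) / (a - s) for s, a, s2 in real_log if abs(a - s) > 1e-9]
    k_real = sum(ratios) / len(ratios)
    # Replacement action that makes sim_step move the joint as the real one would.
    return lambda s, a: s + (k_real / K_SIM) * (a - s)


def improve(wrap):
    """Policy improvement in the (possibly grounded) simulator via grid search."""
    candidates = [0.5 + 0.1 * i for i in range(36)]
    return min(candidates, key=lambda th: rollout(sim_step, th, wrap)[0])


theta = improve(lambda s, a: a)                  # learned in the raw simulator
ungrounded_error, _ = rollout(real_step, theta)  # transfers poorly

for _ in range(2):                               # ground, improve, repeat
    _, real_log = rollout(real_step, theta)
    theta = improve(ground(real_log))

grounded_error, _ = rollout(real_step, theta)
print(ungrounded_error, grounded_error)
```

In this toy the grounded simulator matches the real dynamics exactly after one pass, so the second iteration changes nothing; on a real robot the fit is only local to the current policy, which is why the loop repeats.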
how do we ground the simulator? I should have said two slides back. I should have mentioned that, in general, it’s not that hard to build a simulator that has
the same format of policy, takes the same actions out. It has the same states and rewards. The thing that’s very difficult
is to make it so that the state transitions and the reward function for the action
that you give are the same. They’re typically very different. That’s what makes the simulator
different from reality. So in some sense, what we want to do is to alter the simulated environment so that
it’s closer to the real world. In fact, we’re going to do that
in a black-box kind of way, without opening
the simulator at all, by just placing a wrapper
around the actions that get sent from the policy to the simulated environment,
with the goal that if an action was sent, we want to change it
to an action such that it has the same
effect in the state and reward space in the
simulated environment as the original action would have
had in the real environment. So that’s the grounding and I’m going to tell you
exactly how we do that. So we replace every
action that comes out of the policy with an
action that produces a more realistic transition. We’re in effect learning
this function g shown here, and actually I’m going to open that up into two separate functions. One is a forward model of
the real world that says, “Given the state I’m in and
the action that I just took, what’s the next state that
I’m going to get to?” So that’s a forward dynamics
model of the real world. Then there’s an inverse dynamics
model of the simulator that says, “If I’m in a particular state, s_t, and I want to get to
the next state, s-hat, what’s the action I would need to
take to cause that transition?” That’s an inverse dynamics
model of the simulator. So you can think of that
just in one dimension. If we take my elbow as the
joint that’s being controlled, it might be that in the real world, if I tell it to move to 90 degrees in one time
step, it doesn’t get there. It only gets maybe
a third of the way. But in the simulator, it goes maybe two-thirds of
the way let’s say because the simulator doesn’t have
a slower reaction time. Well then, the forward dynamics model will say that from 180 degrees, if I tell it to go to 90 degrees, it will actually move to
I guess what would it be? One hundred and forty-five
degrees or something like that and then we say, “Well, what’s the command that I would have to give in the simulator from 180 degrees to
get to 145 degrees?” That’s the inverse dynamics
model. So that’s in one joint. Now we want to do this for all
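The elbow example can be worked through numerically. This is a minimal sketch that takes "a third of the way" and "two-thirds of the way" literally, so the predicted real-world angle comes out at 150 degrees rather than the rough 145 mentioned in the talk; the function names and the linear joint model are illustrative assumptions.

```python
def real_step(angle, command, fraction=1 / 3):
    """Forward model of the (hypothetical) real joint for one time step."""
    return angle + fraction * (command - angle)


def sim_step(angle, command, fraction=2 / 3):
    """The (hypothetical) simulated joint, which responds more quickly."""
    return angle + fraction * (command - angle)


def sim_inverse(angle, target, fraction=2 / 3):
    """Inverse model of the simulator: command that lands on `target`."""
    return angle + (target - angle) / fraction


def grounded_command(angle, command):
    """Compose the two models: where would the real joint end up, and
    which simulator command produces that same next angle?"""
    predicted_real = real_step(angle, command)
    return sim_inverse(angle, predicted_real)


start, command = 180.0, 90.0
predicted = real_step(start, command)           # 180 - 90/3 = 150 degrees
replacement = grounded_command(start, command)  # 180 - 30/(2/3) = 135 degrees
print(predicted, replacement, sim_step(start, replacement))
```

Sending 135 degrees instead of 90 makes the simulated joint land where the real joint would have, which is the effect the action wrapper is meant to have.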
of the joints in the robot. Then once you do that, we have basically a grounding
action transformation that works at the current point in policy space where we’re operating. Why the current point
in policy space? Because this forward dynamics
model is based on real-world trajectories from the current policy
that we’re executing. So we learn both the
forward model and the inverse dynamics model, built from a relatively small number of
real-world trajectories. So 2,000 time steps each, and 15 of these gives
us a whole bunch of transitions of here’s the state
I was in with all my joints, here’s the action that I was given, and here’s what actually
happened to the joints. Then similarly in the simulator, we can get an inverse model
saying here’s a state I was in, here’s the next state I was in, and here’s the action
that got me there. We learned it with a
multilayered neural network that takes state and action
from the real world in, gets a predicted next state, for which we can get labels from
these trajectories, and then takes this state
and predicted next state and tries to get the action that
would cause that to happen. So that’s the basic method. We can then evaluate it in
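A minimal sketch of fitting the two models from logged transitions, assuming NumPy and substituting ordinary least squares for the multilayer neural network described in the talk. The linear "robot" and "simulator" dynamics, the sample count, and all names below are made-up stand-ins so that the example runs on its own.

```python
import numpy as np


def fit_linear(x1, x2, y):
    """Least-squares fit of y ~ w1*x1 + w2*x2 + b; returns a predictor."""
    X = np.column_stack([x1, x2, np.ones_like(x1)])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda u, v: w[0] * u + w[1] * v + w[2]


rng = np.random.default_rng(0)
states = rng.uniform(0.0, 180.0, 2000)
actions = rng.uniform(0.0, 180.0, 2000)

# Stand-ins for logged data: the "robot" covers a third of each commanded
# move per step, the "simulator" covers two-thirds.
real_next = states + (1 / 3) * (actions - states)
sim_next = states + (2 / 3) * (actions - states)

# Forward model of the real world: (state, action) -> next state.
forward_real = fit_linear(states, actions, real_next)
# Inverse model of the simulator: (state, next state) -> action.
inverse_sim = fit_linear(states, sim_next, actions)


def ground(state, action):
    """Replace the policy's action with one that makes the simulator
    reproduce the predicted real-world transition."""
    return inverse_sim(state, forward_real(state, action))


print(ground(180.0, 90.0))  # approximately 135 for these stand-in dynamics
```

Because the stand-in dynamics are exactly linear, the fit here is essentially perfect everywhere; with real robot data the models would only be trusted near the states and actions visited by the current policy.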
a number of different ways. We have a real robot and then we have two different simulators. So yeah.>>What are the policies used
to collect these things? Because it’s like if I have
a policy that does nothing, then I would get no use from them.>>No, that’s right. So I’m
going to show you that. So our initial policy here, it was the state of the
art fastest walk that anybody had been able to
get to work on this robot. It was developed by some folks at the University of New South Wales, and I’m going to show that
we improved from there. So yeah, it’s not a learning to
walk from scratch from flailing. It’s starting from a walk and can we improve it. That’s
an important point. So we had two different simulators, a lower fidelity simulator, a more physically realistic
simulator, and then the real robot. So we can do three different
kinds of experiments. The main one is going from this
simulator to the real world, but then in controlled experiments, we can go to the high-fidelity
simulator to get more data. So that we do in the paper. The policy search algorithm
that we use is CMA-ES. So that’s the learning algorithm
that’s being used in simulation. That’s a derivative-free
stochastic search method. Then yeah, to answer the question, here’s the initial policy. So this is the walk developed by the University
of New South Wales. It was the fastest walk at the time, and it was able to go about
19.3 centimeters per second. Then we grounded the
simulator based on the data we took from that walk
and then did learning, and ended up with a walk after one iteration that was
significantly faster. This is 26.3 centimeters per second. You see it learning to be a little
bit more squat to the ground. Then we repeated and regrounded the simulator
with data from that walk, and ended up with what
I believe is still the fastest stable walk on these robots at 28
centimeters per second. So this was done on this single task. Josiah’s PhD thesis uses this as the central motivator, and
he’s now trying it on other tasks. We have had some success on
another task on the same robot. We’re now trying it on a completely
different Sim2Real task, a process control task
in oil refineries. It also opens up some
interesting theoretical questions and empirical questions: when does this approach work,
when does it not? But there are also some really
interesting connections to off-policy evaluation, and reinforcement learning,
and safe learning, and these are the main
theoretical contributions of Josiah’s dissertation, which he’s going to be defending
in about two or three weeks. Yeah, please.>>So the policy estimation and using data [inaudible] simulators, have you looked at not changing the
actions but the rewards, because it looks like these
rewards are designer-specified. Could there be other ways
to come up with a device that make the simulated
behavior more learned?>>Yes. That’s a good
question as well. I mean, there’s a lot of research on changing the reward
function, shaping rewards, and there’s theory about
how can you change the reward functions such that you won’t change the optimal policy? So I’d say that that’s
related to this. In this work, we’re focusing
on the transition function. The reward function is usually easier to align between
simulation and reality. I mean, in both cases, the speed
of the robot is measurable. There’s not really
a difference there. The thing that’s really
hard to ground and align between them
is what will happen when you issue a torque command and the foot is hitting the
ground and there’s friction, and that transition
function is the thing that people put a lot of effort
into trying to get right, and we basically say, “Look, we’re never going to get that right; there’s always going to be a gap.” It’s important to note that this grounding function might
make the simulation more incorrect in other parts
of the search space. We just want it to be better in the
region where we’re searching. We want to make sure
that it’s better. Yeah. It’s a good question. Yeah.>>[inaudible].>>Yeah, in principle,
I mean if you have a simulator and you can measure the differences between what’s happening in the real world and
what’s happening in the simulator, then this general idea, you could lift up and apply. You could say, what’s
happening in the real world, what would I need to do in
the simulator to make it more like the real-world
effect in the perception, and then let’s use our
learning algorithm in the simulator in that
grounded wrapped simulator. So we haven’t tried
anything like that. But in principle, I’d
be really interested in seeing especially if there’s
a use case for that, like a situation where that’s
particularly needed and the difficulty is at that
perceptual modeling. Then I would be really interested
in talking about that. Yeah.>>What’s currently known about
the second bullet point there? Like when does it work
versus when does it not. I mean, are there properties about the learnability of the forward model and the reverse model that make this appealing or is it just
fundamentally impossible to just learn the effects of the
policy versus the reverse model?>>Yeah. That’s a great
question. We’re really just at the beginning points of
answering that question. We have now two examples of this. This is our best example, the one that we got working first. We have another example of a
motion task on this same robot, and as I said, we’re trying
it on this third task. Now, we’re starting to
ask questions of like, what properties of the
simulation need to hold for it to be
applicable and to work? But really, we’re not
far enough along for me to give you any real
insight into that yet. I think it’s really early
stages for exploring that, what are the limitations of this? But we have a really nice
success story to launch from. Good. Let me move on to the second
technical deep dive a little bit. It has two parts. This is the PhD thesis
work of Faraz Torabi, who’s sitting here as well, doing
an internship here this summer. It’s in the area of imitation
learning from observation. I’m going to introduce a
model-based approach that we call behavioral
cloning from observation, and a model free approach, which is generative adversarial
imitation from observation. This is joint work
with Faraz and also Garrett Warnell who is a research
scientist at Army Research Lab. So imitation learning in general, the goal is to learn how to make decisions by trying to
imitate another agent. It’s a very appealing way to try
to learn skills on a robot especially if you have examples
of what you want them to do. So conventional imitation learning involves having observations of other agents and demonstrations
consisting of state action pairs. So for instance, this is the work
of Scott Niekum where he has a person guiding the robot saying
here’s what I want you to do. While he’s doing that,
the robot is recording the state that its joints were in and what actions
it was taking, where it moved next. But the challenge of this is that it precludes using
a large amount of demonstration data where
the action sequences aren’t given like YouTube videos. You see what happens but you don’t
know what actions were taken. For conventional imitation learning, there’s two general
classes of algorithms. There’s behavioral cloning,
which is basically you take the sequences of these
demonstrations and try to learn a function directly
from when I was in this state, what action was taken. It’s a supervised learning problem. For every state you are
in, what action was taken. If you can learn that, then
you can try to imitate that. The other class of problems is known as inverse
reinforcement learning, where you take the demonstrations, try to reverse engineer, what’s the reward function that the demonstrations are
trying to maximize? Then use reinforcement learning to try to maximize that
same reward function. There have been successes in
both of these approaches. Assuming that you have
access to the actions, not just the state sequences, but the actions as well. In biology, it is possible
definitely to learn without access to those sequences as is very apparent from this video which I’ve always assumed
when I watched this video, that the bird is imitating
the people just by observation, although people
have now pointed out to me that it might be that the people are
imitating the bird, I don’t know. But either way, some
organism is imitating another organism without access to the actual actions,
just by observation. So that’s the goal here of
imitation from observation. It’s how to perform a task given
state only demonstrations. So the demonstration will
be a sequence of states. You want to learn a policy from
states to actions, and again, we’re not the first to try
to address this challenge. There has been work from other labs on trying to do
this, but with limitations. First, they’ve concentrated mostly on the perception problem of trying,
from the state sequence, to figure out what
actions were taken and what the states were that
the agent went through, rather than focusing on what actions we can
take to try to be as close to the
demonstrations as possible. Also, they mostly have required
time-aligned demonstrations. So meaning that if you have
multiple demonstrations, that you get to the same
state at the same time, which is actually very inconvenient for cyclical actions like walking. Where you can get to the same state multiple times and you might have demonstrations where the walking happens more quickly or more slowly. You want to be able to learn from those kinds of
demonstrations as well. So our two approaches, the model-based one first is called behavioral cloning
from observation. The difference between conventional imitation learning or
behavioral cloning is that rather than having
a demonstration that looks like this with
states and actions, it’s got just states. It doesn’t know the actions and it takes a model-based approach which is learning an inverse dynamics model to try to fill in what
those actions were, try to infer the actions and then use a conventional
behavioral cloning method. So you can imagine
how that would work. The diagram here is you
initialize your policy, run it, collect a bunch of data
to learn what happens when you’re in a state and take an action, what
next state you get. So you can then learn this
inverse dynamics model from state and next state to
what action must have happened. Then once you have that, given your state only demonstrations, you can use that to fill
in the missing actions and then update your policy
using behavioral cloning. So that’s the high level
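The loop just described (interact, learn an inverse dynamics model, fill in the missing actions, then clone) can be sketched on a toy problem. This is only an illustration under stated assumptions: the one-dimensional corridor, the three-action set, and the tabular majority-vote "models" are all hypothetical stand-ins, not the MuJoCo setup or the neural networks used in the actual work.

```python
import random
from collections import Counter, defaultdict

ACTIONS = (-1, 0, 1)   # hypothetical discrete action set
GOAL = 5               # hypothetical toy task: walk right along 0..5


def step(state, action):
    """Toy deterministic dynamics: move along the integer line, clipped to [0, GOAL]."""
    return max(0, min(GOAL, state + action))


def collect_transitions(n_steps, rng):
    """Run a random policy and record (state, action, next_state) triples."""
    data, state = [], 0
    for _ in range(n_steps):
        action = rng.choice(ACTIONS)
        next_state = step(state, action)
        data.append((state, action, next_state))
        state = next_state
    return data


def majority_table(pairs):
    """Map each key to the label it was most often paired with."""
    counts = defaultdict(Counter)
    for key, label in pairs:
        counts[key][label] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


rng = random.Random(0)

# 1. Interact with the environment using a random policy.
transitions = collect_transitions(2000, rng)

# 2. Learn an inverse dynamics model: (state, next_state) -> action.
inverse_model = majority_table(((s, s2), a) for s, a, s2 in transitions)

# 3. Fill in the missing actions of a state-only demonstration.
demo_states = [0, 1, 2, 3, 4, 5]
labeled_demo = [(s, inverse_model[(s, s2)])
                for s, s2 in zip(demo_states, demo_states[1:])]

# 4. Behavioral cloning on the inferred pairs: state -> action.
policy = majority_table(labeled_demo)

# Roll the cloned policy out from the start state.
state = 0
for _ in range(GOAL):
    state = step(state, policy[state])
print(policy, state)
```

The demonstration never contains an action, yet the cloned policy can still be executed because every action label came from the inverse dynamics model.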
view of the algorithm and we’ve applied it in these MuJoCo domains like this
Ant that has to run forward. That’s a 111-dimensional state space and an eight-dimensional action space. We can compare it here to
existing methods where the dashed line is random
behavior in the simulator. We’ve normalized the
expert demonstrations, they have a performance of 1.0, and then we’re going
to compare against a bunch of different algorithms. Not just on forward speed. It’s a four-dimensional
task including magnitude of the control actions. So one of these, the feature expectation
matching method, ends up doing very
poorly on this task. It’s known to not do well, so it gets negative reward on this. But the other
state-of-the-art methods, GAIL, Generative Adversarial
Imitation Learning and behavioral cloning which do have access to the actions,
have this performance. Sorry, behavioral cloning, that’s
the red is doing quite well. Behavioral cloning from observation, our method in green, is doing competitively with these even without access to the actions. So that was the first
promising result. Then we started asking, well,
what happens if we do give some interaction and
experience to the method? So in this case, the inverse dynamics model is
learned using a random policy taking analogously to what we did
in grounded simulation learning. You can imagine doing this
in a more iterative fashion, and so we can update the model with the learned policy and then use the parameter Alpha to
control the trade-off between how many interactions you
get and the performance. So when Alpha equals zero, that’s the exact method
that I already showed you. It’s just behavioral cloning from observation from the random policy. As we increase Alpha, we’re increasing the
number of interactions allowed at each iteration. So the only difference between the previous method is that we basically
close the loop here. That once we’ve updated the policy, we then rerun that policy and learn the inverse
dynamics model again. Diagrammatically, all the previous
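A rough sketch of closing the loop with the Alpha parameter follows. It assumes, as one plausible reading of the talk, that each later iteration gets a budget of Alpha times the pre-demonstration budget; the corridor environment, the tabular models, and the names `bco_alpha`, `n_pre`, and `n_iters` are hypothetical.

```python
import random
from collections import Counter, defaultdict

ACTIONS = (-1, 0, 1)   # hypothetical discrete action set
GOAL = 12              # hypothetical corridor 0..12; every rollout starts at 0


def step(state, action):
    return max(0, min(GOAL, state + action))


def rollout(policy, n_steps, rng):
    """Run `policy` from state 0, acting randomly in states it has no entry for."""
    data, state = [], 0
    for _ in range(n_steps):
        action = policy[state] if state in policy else rng.choice(ACTIONS)
        next_state = step(state, action)
        data.append((state, action, next_state))
        state = next_state
    return data


def majority_table(pairs):
    counts = defaultdict(Counter)
    for key, label in pairs:
        counts[key][label] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


def bco_alpha(demo_states, alpha, n_pre=30, n_iters=5, seed=0):
    rng = random.Random(seed)
    policy, transitions, used = {}, [], 0
    budget = n_pre                       # pre-demonstration interactions
    for _ in range(n_iters + 1):
        transitions += rollout(policy, budget, rng)
        used += budget
        # Relearn the inverse dynamics model from all data gathered so far.
        inverse_model = majority_table(((s, s2), a) for s, a, s2 in transitions)
        labeled = [(s, inverse_model[(s, s2)])
                   for s, s2 in zip(demo_states, demo_states[1:])
                   if (s, s2) in inverse_model]
        policy = majority_table(labeled)
        budget = int(alpha * n_pre)      # assumed per-iteration budget after the demo
        if budget == 0:                  # alpha = 0: no further interaction
            break
    return policy, used


demo = list(range(GOAL + 1))
policy_0, used_0 = bco_alpha(demo, alpha=0.0)
policy_half, used_half = bco_alpha(demo, alpha=0.5)
print(len(policy_0), used_0, len(policy_half), used_half)
```

With Alpha at zero the sketch stops after the initial random-policy data; with a larger Alpha it spends more interactions, and the set of demonstration transitions it can label can only grow.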
methods basically get all of their environment interactions after the demonstration and can be very
expensive, post-demonstration. The method that I
already introduced on the top gets a whole bunch of data before the demonstration to
build its inverse dynamics model. But then, doesn’t need
any more interactions. By adding this Alpha parameter, we’re now bridging the
gap between these two, compromising between these two. So now, you can see
as we increase Alpha, what I’ve shown here is, the red is behavioral cloning, the green is the result
I showed you before. As we increase Alpha, we get closer and closer to the
behavioral cloning results. So that’s behavioral cloning from observation, a
model-based approach. We’ve also explored a
model-free approach called Generative Adversarial Imitation
from Observation, or GAIfO, and the observation here is these
are some state transitions in the hopper domain where
you basically have a four-dimensional plot here that
take two of the state features, where you have the before and after, where let’s say you have
before and after of two different variables which gives
you four different parameters, three of which are plotted on the
axes and one is shown by color. The only thing you have to
take away from here is that the demonstration
data distribution is very different than the
random policy’s distribution. So again, this motivates the idea from behavioral cloning where you want to try to relearn
the inverse dynamics model. But this also shows that if we can generate a policy that shows
transitions more like this, it’ll be closer to the demonstration. So we’re basically trying to
generate a policy that will have state transitions that look more like the demonstration
state transitions. So the way we do this, it’s motivated by one of these general
generative adversarial methods. The demonstrator on the left there
gets next state transitions, it’s learning a discriminator,
shown in yellow, that classifies all of those
as positive or as one. Then on the imitator side, we take a state,
we’re going to learn, this is going to be the generator, learn a policy that
outputs an action, which then gets sent
into the environment and paired with the same state
that was sent before. Now that same discriminator from
the demonstration side wants to classify those as being
from the imitator. So in the classic generative
adversarial method, the policy is trying to learn something that will
fool the discriminator, and the discriminator
is trying to tell the difference between the
demonstration and the imitation. As you run this process, you get a policy, a generator, that makes it increasingly difficult to tell
the difference between them. Which means it’s as close as
possible to the demonstration. So that’s the diagrammatic
view of the algorithm. Now, similarly, we can show
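The discriminator half of this setup can be illustrated with a very small stand-in. The sketch below assumes a logistic-regression discriminator over a hand-picked transition feature (displacement), trained with demonstrator transitions labeled 1 and imitator transitions labeled 0; the feature choice, the toy data, and the function names are hypothetical, and the policy update itself is left out.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def features(state, next_state):
    """Hypothetical features of a transition: displacement plus a bias term."""
    return (float(next_state - state), 1.0)


def train_discriminator(demo, imitator, lr=0.5, epochs=200):
    """Logistic regression: demonstrator transitions are 1, imitator transitions are 0."""
    data = [(features(*t), 1.0) for t in demo] + [(features(*t), 0.0) for t in imitator]
    w = [0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0]
        for x, y in data:
            err = y - sigmoid(w[0] * x[0] + w[1] * x[1])
            grad[0] += err * x[0]
            grad[1] += err * x[1]
        # Gradient ascent on the mean log-likelihood.
        w = [wi + lr * gi / len(data) for wi, gi in zip(w, grad)]
    return w


def discriminator(w, state, next_state):
    x = features(state, next_state)
    return sigmoid(w[0] * x[0] + w[1] * x[1])


# State-only demonstration: the expert always moves one step to the right.
demo = [(s, s + 1) for s in range(30)]
# Transitions from an untrained imitator: left, stay, and right equally often.
imitator = [(s, s + d) for s in range(10) for d in (-1, 0, 1)]

w = train_discriminator(demo, imitator)

# The generator (policy) would be rewarded for transitions the discriminator
# scores as demonstrator-like; here we only compare candidate transitions.
scores = {d: discriminator(w, 3, 3 + d) for d in (-1, 0, 1)}
best_move = max(scores, key=scores.get)
print(scores, best_move)
```

Note that only state-to-next-state pairs enter the discriminator, so nothing in this step needs the demonstrator's actions.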
compared to random and expert, compared to methods that
do use the actions. Here GAIL was state of the art, and our method GAIfO is, even without access to the actions, doing almost as well. I should say, at this point, this was all with data that’s
generated from proprioception. So we actually have access
to the joint angles. Really, imitation from
observation is about video, where the states are
not fully observable. So we’re going to take that. We’ve now taken the same
approach and we’re going to present some results based
on this at IJCAI next week, that basically does that same
idea but now using video frames. So basically, what’s showing on here, you’ve got at the top, you have the generator, the policy that’s taking
now four frames from the video of an agent that
you’re trying to control, and it’s outputting an action. That’s the policy
that’s being learned, and then the discriminator
down here is trying to distinguish the actions and the transitions that are happening between the learned policy
and the demonstration data, exactly the same idea
as the earlier slide, but now sending it through a typical convolutional
neural network stack. In this case, so that
the demonstration now just is really from pixels. So this is the demonstration that’s
being used in the hopper task, and the learned policy that
comes out of this is able to get very similar
state transitions and, in fact, behave quite well. So again, we can plot now. I still have the random and the
demonstration or expert data, I’m also showing a good
policy learning algorithm, TRPO, that’s also learning
straight from video. So that’s, in some sense, the
best that we would expect to do from learning from
visual observations. In this case, none of the other competitor methods
even get close, whereas, GAIfO, as we get up to
about 10 to 15 trajectories, is doing as well as TRPO is. Again, just from
state-only demonstrations. So this leads to a
bunch of ongoing work. Again, this is the ongoing
dissertation for Faraz Torabi, who’s here on an internship,
and in this room. So if I’ve said anything wrong you can correct me during the questions or if anyone asks a
difficult question, I’ll just defer to him. But we are testing
algorithms on more domains, we’re trying to adapt
this for physical robots not just in the simulator. Also, there’s a connection here
trying to do sim-to-real transfer, learning in this way in a
simulator and seeing if it will combine with ideas like
grounded simulation learning. With the ultimate objective
of this trying to be as good at imitation learning as humans. This is a favorite video
of babies who just, obviously, had way too much
time watching this video, but somehow have been able to, from observation, copy it to
a pretty impressive degree. But anyway, I won’t show
more of that. Yes, question.>>You stated the problems, you stated that all of
that was fundamentally impossible because I
could learn things like, if the left indicator
comes on, then turn left. I couldn’t quite see how the other parameters in this here
of Alpha- unless you’re somehow injecting randomization
or your computer could do some exploration in subsequent
iterations of the algorithm, I don’t see how you can break
this kind of confound between the indicator caused
me to turn left or was it something else that
caused me to turn left.>>Yeah. So learning from
demonstration is in its purest form, is simply just copying. Especially behavioral
cloning, there’s not really a way to break that. In inverse reinforcement learning, what you’re learning you’re inducing a reward function
and then learning a policy to maximize
that reward function. In those cases, you can actually get better policies than
the demonstration.>>Without access to
the expert’s actions, I don’t see how we can
break this confound between- in the state space, if I happened to have a post-treatment effect
as part of my state, like so, I took some actions->>[inaudible] if we definitely have the action that will go straight through [inaudible]
causing confusion, not everywhere. Even with actions->>We’re not trying to
learn a causal model here, we’re just trying to learn a behavior>>I agree. I’m trying to say, what in this allows me to
learn the right policy, which actually lets
me get [inaudible]? Because I might learn the wrong
policy which says I’m the car, and if the left indicator
comes on then I’ll turn left. But that’s not quite true,
it won’t generalize well.>>Yes, I see what you’re
saying. What would cause the left indicator
to come on if it doesn’t get turned on by the
person in the first place? Yeah, so let’s take this discussion offline because I think it’s subtle. It does get into issues of causality and really
what’s going on here is just trying to get as close to the state transitions
that are observed. Yes, it’s possible to
do that in a way that would end up having the wrong effect, but same with the
bird and the people, you can get similar
behavior if you’re just trying to mimic. Yes, please.>>[inaudible] similar to that. Is there any setting
in these problems where the demonstrator doesn’t only
give you a set of states, but also some explanation
on those states.>>Yeah.>>In that learning from just observation is hard
for humans as well.>>That’s right.>>I can watch snowboard jumps for a month and I wouldn’t
be able to do it.>>But you will if
I tell you what I’m doing when I do it, and then
you’ll be able to do it. No. I know, but point
very well taken, and in fact that’s the subject of some ongoing research with
Scott Niekum and Ray Mooney. We just got an NSF grant to exactly improve learning from human feedback
with natural language input. As opposed to just, right. So there’s different modalities
that you can learn from. This is just imitation. You can also learn from
positive and negative feedback. A person saying good job or bad job. I have research in my lab
on the TAMER algorithm that Brad Knox introduced that is just purely positive
and negative feedback. But yes, now adding. Language is a very rich signal, and there’s lots of different
ways in which it could be used. So yeah, we’re embarking
down that path right now.>>[inaudible] was
going down the line. Like if you could say
it’s safe to go up there.>>Yeah.>>Just a physical correction.>>Yeah. Yeah, exactly, and it does. You’re right. It does also have good connections with safety issues like what are
the guardrails you can put on, and it learns in the
sandbox or learns in a process control without making the factory explode
or something like that. What are the limitations you
can put on these things. Yeah, there’s lots
of issues here that on learning with safety constraints, improving with language feedback. We’re not at all at the limits
of what can be done in terms of learning from human
interaction. Oh, good. I know we’re roughly out of time. So that’s basically
what I wanted to say. This all connects back to
the research question of, “To what degree can autonomous
intelligent agents learn in the presence of teammates
and/or adversaries in real-time, dynamic domains?” Especially focusing on
reinforcement learning and robotics in this talk. There are other
reinforcement learning. If I had more time I
could tell you about some of our work on curriculum learning. Some of Mathew’s work on deep reinforcement learning
and continuous action space. Other work on learning
from demonstration, but I think in the interest of
time, I’ll skip through that, and also there’s some other work on multi-agent systems that
we’ve done in my lab, and I’m especially excited about the problem of ad hoc
teamwork right now. We have another paper
[inaudible] next week on learning in these
kinds of settings, which is basically the challenge
of how do you get an agent to learn to work with teammates
that it’s never seen before. Like a person playing
a pickup soccer game, or a pick-up basketball game, or robots in a disaster
rescue scenario. Everybody bringing one robot, but then immediately figuring
out how to work together. This challenge of creating
a good team player is now a AAAI challenge problem that we’ve made some good progress on, including, just tying everything back, in the robot soccer
domain where we do now have some competitions where people
each bring their own robots, and put them on a field, and even though they’ve never
worked together tried to get them to work together as a team. This is just a video
of one of those games. It looks a little less organized than when one group programmed
all the robots. But that’s the challenge. It’s trying to get them to work together with unknown teammates. There’s a special issue
of AIJ coming up now. There was one of JAAMAS. There’s workshops on this topic. So always happy to talk about that. With that let me wrap up, and just, you know, the real theme of this talk was
efficient robot skill learning. I told you about grounded
simulation learning and imitation from observation, and a bunch of different methods. I’ll be here all day. I’ll be more than happy. I think
I’m meeting with many of you. If people do want to stick around, I’m happy to take more questions now. But I won’t be insulted if
anybody wants to leave. So thanks for your attention.>>A question.>>Yeah.>>Okay. I’ve been here for, sorry. It doesn’t have the [inaudible].>>Go ahead.>>I’d like to know how
much sensitive are you to like compared to speaking
with embodiment, right? The perspective. Like if I
change the [inaudible] and let’s say we had [inaudible].>>Yeah. Actually Faraz has
been thinking about this. He is asking about the
embodiment mismatch. Do you want to say
something about that?>>Yeah. I have not
exactly tested that, but the thing that we tested
was like for different, a little bit different
point of view like by changing the heat source, and like moving the window around, and it works but we
have to [inaudible].>>But we haven’t got to the
point of being able to say how robust are we to do this yet. So we’re starting down that path. It’s a good question though.
Yeah. Yes, you’re right. Because like the
birds and the people, there’s an embodiment mismatch in
addition to being observation. If it’s exactly the same body, that’s just still where
we’re at right now. Yeah.>>I mean it kind of
goes hand in hand. They’d say it’s, with the
question I was just asked, basically if there had been any
thought to when you’re trying to learn from a demonstration
where you don’t have the same actually when you did
this embodiment mismatch.>>Yeah.>>I wasn’t sure if
there was a, basically, if there parts of the policy that
you’re trying to imitate that you have no hope of
being able to imitate. I was wondering if there had
been any thought about how to perform imitation in
this form of setting.>>Yeah. In some sense, we’re getting towards that by
once we start dealing with the video inputs because now we don’t have direct
access to even to the states. They’re partially observable
and things like that. But yes, I mean, in principle, these algorithms will work as
long as you have a mapping between the states of the one body and the
states of the other body. You’re just going to try
to get as close as you can to that mapping. So in fact one of the, this is one of the things
I skipped over here. This was some work of [inaudible]
and where he was looking at learning from interaction, where we were having a person in a body suit basically
controlling the motion of a Nao, and using that to seed some skills. To do that, we had to
make a mapping between the person’s joints and the robot’s
joints which are not the same. Then once you do that, then you can try to have
the robot figure out, well, the person’s there. In this case, the idea was that the person was going to
learn to get better at controlling the robot rather than the robot getting better
at imitating the person. Then once doing that, the robot
would capture that skill. But yes, you have to have
that kind of mapping, and they’re not always perfect.>>That’s why I asked you
because the similarities between the whole level statement
after you necessarily won. In case that suppose you’re watching a human
demonstration where the human does some low level move.>>Yeah.>>That your robot has no hope
of accomplishing, but they.>>If they get close it might
get counterproductive. Yeah.>>Yeah, exactly.>>Yeah, that’s true.>>It said you wanted
to do if you heard any noise to try to
imitate whatever it is and that would be really
unsuccessful and very complicated.>>Right. I mean imitation
is always fraught with the idea that usually you’re
imitating an imperfect policy. You may learn to imitate
the imperfections. It’s theoretically possible
that the extent to which you are incorrect in your imitation
actually improves performance. But more often than not, you would expect that you’re
actually getting reduced performance and the imitations could also
focus on exactly the wrong thing. Yeah. It’s proceed at your
own risk a little bit. The big appeal of them
is being able to learn from very few examples, right? In this kind of paradigm, I think the clearest
example was actually Brad Knox and his TAMER
work, where we used the game of Tetris, where you can learn to play Tetris using reinforcement
learning with thousands of games. But here, we were learning from just a person saying
good move and bad move. So it’s flashing green when it was a good move and red
when it was a bad move. In this case, even in the first
episode, even the first game. It starts to look much
better than random. It’s not just placing
randomly anymore. By the third game, it starts to look like a really
competent Tetris player, as opposed to after
thousands and thousands. Then, of course, what you really want to do is use the
demonstration to speed up learning and then use reinforcement
learning to actually learn the parts you want and that was
part of Brad’s thesis as well. I wouldn’t recommend learning
only from demonstration. But we’re looking at ways to learn from demonstration as
effectively as possible. Yeah, in the long run,
you want to try to mix the two to get the
best of both worlds. Yeah?>>How could it be that the
grounded simulated, BCO and GAIfO, the physical assumption
seems to be that although you’re not rigged
for time trying things, you are still expecting that both in simulation or in the
demonstration and the agent, they’re acting at roughly
the same time scales. When reading that
parts about situations where maybe the robot is
just operating at 60 Hertz, but humans are probably
operating at like maybe more. One second, two seconds will affect. Then how does one even begin
to think about formulating it?>>So actually in principle, I mean if we take time out of it, if we’re just looking at
the state transitions, it doesn’t really matter
what the time scale is. It’s just what state comes
after the next state? So a demonstration in these
methods could be at 60 hertz and the operation could be
at a much slower time scale. All we’re really optimizing here is what comes
after the last thing. Now, you could add into
it if you wanted to keep them time aligned that they
happen at the same rate. But there’s actually no pressure
in anything I’ve talked about here to work at the same rate. It’s just what state comes after it at whatever
rate you’re acting. So I think in these
methods it’s actually if what you’re trying to do is deal with things that act at
different time alignments, this is actually a feature
for what we’ve talked about. If what you’re trying to do is
make them work at the same rate, then we’d have to add
a different device.>>Mainly by increasing the
frequency at which they’re acting. Maybe their transition
models can trivially learn something like just
repeat the previous game because honestly this gives up back 18 and that high-frequency so the transition wanted us
to learn depending on that.>>Oh, I see. So the simulator and the real world operate at the same rate but the actions
operate at different rates.>>Yeah.>>Yes. So then you’re not going to get the mapping between
transitions of the same action. Right. Then we’d have to deal with some kind of other way to
bring those into alignment. But again, I mean the
first cut approach to that would be to just say
I’m going to take the action that lasts 30 frames, that gets me as close as
possible to the action that the demonstrator
got to 30 frames later. Sort of take jumps in
that kind of a space, but who knows? We haven’t tried that.>>You can find this
theoretical proof now that just in space transition, the revision is a lower bound of
the full projected migration. I think this is not that complicated but there’s a
reason the public shows that.>>Yeah.>>[inaudible].>>So you’re talking about having body language
as a way of grounding what the apple was doing. It might have been a bunch of
them. Yeah. I mean in my lab, we haven’t done anything
that combines that. But I believe I’ve seen it. Maybe somebody else knows
of a point around that. I believe I’ve seen work
that has used gesture as a part of a way of grounding
what a person is talking about. We’ve done research
in my lab on gaze, on using gaze as an indicator. In fact, when a robot’s passing
a person in the hallway, having the robot indicate with a gaze that it’s going to go this direction by looking there and try to influence the direction
that the person goes, which is a form of gesture. But we haven’t specifically
used any pointing. But I do believe there’s research in this space in the human
robot interaction community.>>[inaudible].>>Yeah. Language and
gesture and images. It’s a multi-modal
kind of interaction or multi-modal learning that people
are looking at. Yeah. Please.>>On the center rails,
it seems like you could teach us to adapt the actions. It seems like you could choose
to adapt the states instead. Was there a reason you
chose the action side? Is that the side that
you find more mismatched or is it an act or am
I missing something?>>So we are trying to treat
the simulator as a black box. So to change the state transition is opening up the simulator and
changing the actual simulator. By putting a wrapper
around the actions, the simulator can remain the same. It has its internal
transition function. We don’t have to know whether it’s
using matrix multiplications. But we just change the
input to the simulator and the state transition isn’t
something that we really have control over at
the policy level. The policy is selecting
what action to take. It’s not selecting
what state transition to make. That’s the simulator’s job.>>Then the next state
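The idea of wrapping the actions while leaving the simulator untouched can be sketched as below. This is a sketch under assumptions: the wrapper composes a forward model of the real robot with an inverse model of the simulator, which is one way to realize the action wrapper described here; the class names, the 80% actuation gap, and the hand-written models are all hypothetical (in practice both models would be learned from data).

```python
class BlackBoxSimulator:
    """Stand-in simulator whose internals we pretend not to know."""

    def step(self, state, action):
        return state + action          # idealized: the commanded motion happens exactly


def real_world_step(state, action):
    """Stand-in for the physical robot: only 80% of the commanded motion happens."""
    return state + 0.8 * action


class GroundedSimulator:
    """Wraps a simulator and rewrites each action before passing it through."""

    def __init__(self, simulator, real_forward_model, sim_inverse_model):
        self.simulator = simulator
        self.real_forward_model = real_forward_model
        self.sim_inverse_model = sim_inverse_model

    def step(self, state, action):
        # What would this action have done on the real robot?
        predicted_next = self.real_forward_model(state, action)
        # Which simulator action produces that same effect?
        grounded_action = self.sim_inverse_model(state, predicted_next)
        # The simulator itself is untouched; only its input changes.
        return self.simulator.step(state, grounded_action)


# Hypothetical hand-written models standing in for learned ones.
def real_forward_model(state, action):
    return state + 0.8 * action


def sim_inverse_model(state, next_state):
    return next_state - state


sim = BlackBoxSimulator()
grounded = GroundedSimulator(sim, real_forward_model, sim_inverse_model)

state, action = 2.0, 1.0
print(sim.step(state, action), grounded.step(state, action), real_world_step(state, action))
```

The policy still chooses the action; the wrapper only changes what the simulator receives, so the simulator's own transition function never has to be opened up.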
that comes out of the simulator then feeds back
into the policy learning, right? So if you’re at that point, you adapt and say, well the simulator said
he’s in this state and I’m going to change
because I know the real world actually acts this way
and then the policy would presumably incorporate
that correction.>>Yeah. You’d also
then have to change the state of the robot
in the simulator, right? You’d not only be telling, so you have to teleport the joint
to where it actually is going. So it would look a lot more awkward.>>[inaudible]. Yeah.>>I mean, I guess in principle, there could be a parallel
method to explore, but the nice thing about
changing the actions is it really is just a
wrapper to the simulator. We don’t have to know
anything about it. So the simulator is a black box. We just alter the action, and then the simulator
does its normal job. Both changing the physics
of the robot and telling our policy what the
next state is. Please.>>I’ll change this up a bit. On the topic of collaboration
and [inaudible] you never met before, one of the way like sometimes it’s not worth it because it’s you’re understanding of the
world that people have, with no chain planning, I’d say. How much that do you think
is an opportunity and challenge in this type of research?>>Yeah. No, I think that’s in the Ad Hoc Team work work
that’s the assumption in some sense is that the agents are all domain experts already
in the domain individually, but they haven’t worked
together before. And so there’s a notion of one
of the typical methods for doing this is to have some bank of types, of agents, of teammates and then
try to say, “Oh, I have experience with the type of agent that always shoots the
ball from really far and misses, or the ones that run a lot
and pass or whatever.” Yeah.>>I think it’s also
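One common way to use a bank of teammate types is to keep a belief over the types and update it as the teammate acts. The sketch below assumes a simple Bayesian update; the two types, their action probabilities, and the observed sequence are made up for illustration and are not taken from the talk.

```python
# Hypothetical bank of teammate types: each maps an action to its probability.
TEAMMATE_TYPES = {
    "long_shooter": {"shoot": 0.8, "pass": 0.2},
    "passer": {"shoot": 0.1, "pass": 0.9},
}


def update_belief(belief, observed_action):
    """One Bayes update of the belief over teammate types."""
    unnormalized = {
        name: belief[name] * TEAMMATE_TYPES[name][observed_action]
        for name in belief
    }
    total = sum(unnormalized.values())
    return {name: p / total for name, p in unnormalized.items()}


# Start from a uniform prior and watch the new teammate for a few actions.
belief = {name: 1.0 / len(TEAMMATE_TYPES) for name in TEAMMATE_TYPES}
for action in ["pass", "pass", "shoot", "pass"]:
    belief = update_belief(belief, action)

most_likely = max(belief, key=belief.get)
print(most_likely, belief)
```

Once the belief concentrates on one type, the agent can choose its own behavior against the experience it already has with that kind of teammate.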
a matter of language. How much it is a common grounding. Because I can have the
same notions you do, but we devise our own language to understanding with
elemental models.>>Yeah.>>So when we try to work together
it’s completely misshaped.>>So this is a very broad challenge problem that
is designed to encompass that. So there is some work in ad hoc
teamwork that does say, “Well, if you–” First, a lot of it has assumed that there
is no shared language, or no shared communication. And the fact is I can
go to China and play a pickup soccer game
with people I don’t have any shared verbal language
with and still, you know. But there has been work also on if you do have that language what should you
say and when should you say it, and is there both to try to influence
what your teammates will do, but also possibly to teach your
teammates what you’re going to do. In the same like the
left-hand blinker kind of signal is something
that, you know, it might be that every time
I’m going to turn left, I’m going to turn on the blinker
five seconds beforehand. And if my teammates are
smart enough they’ll be able to now predict what I’m
going to do in the future, and therefore improve their plans. So language and communication
can play a big role in this. And there’s a->>You mentioned
something maybe different from before which [inaudible]
very inhabiting at that moment of collaboration.>>There can be. So the whole premise here is that you don’t get
to change your teammates. So you can change yourself, but you have to be able to deal with the teammates that don’t
know how to learn, the teammates that are really bad, the teammates that
are better than you, the teammates that do learn. And so a really good team player could adapt to all
of those situations. And so number one, recognize which type of teammate
am I interacting with right now. And then given that, you know, if it’s a learning one, yes, let me take actions that
will actually help them improve their performance, that
make our team shine. If I’ve figured out they’re
not going to learn no matter what, then that’s
just a waste of my time. So the premise is, and people when they
go into this say, “Oh let me just change the teammates. Once we do that the whole
thing becomes easy.” But I always insist in this
that we have to be able to deal with, you
know, whatever you’re given, which is the case in the real world. Right? You become teammates
with other people. So to me, that’s one of the fascinating aspects of
this challenge. Please.>>So going back to the apple
[inaudible] for example, how much does the robot
know, exactly? I think I kind of
missed the [inaudible]. What is the research problem there?>>Yeah. So the research
problem we zeroed in on in that particular
paper was exactly just the open-world reasoning
with hypothetical objects that are indicated by language, where you don’t know ahead of time whether they actually exist or not. So in this case, the robot knew that fruit
appeared in the kitchen. It had knowledge of the three surfaces in the kitchen
where the apple might appear. There was a pretty big
knowledge base already present, and it had the language
capabilities already given. The thing we zeroed in on
right there was just the open-world aspect: when a person says, “I want
an apple from the kitchen,” you don’t want to then instantiate an apple and
assume that the apple is really there for sure
because that could lead to contradictions in
your database down the road. You want to maintain
this hypothetical object that’s different from the objects
that are in your database, and then be able to deal with both the situation where you
see one and then you can unify the hypothetical
with the true one and make the plan and
carry on with your plan, or the case where the
hypothesis was false. So I mean, the paper
was presented at the ICAPS conference, the planning
and scheduling conference. So it’s really about
symbolic representations, and reasoning, and planning
in that kind of a setting. Now it’s part of a bigger system, which is the whole RoboCup@Home, where we have situations
where the robot doesn’t know the environment ahead of time, like in the restaurant task, and where there are perceptual
problems and challenges, and all those kinds of
things, but we focus just on that open-world reasoning
in that paper.>>So was that paper also considering like say
the hypothetical apple and then the robot in
there was too concrete. And was that kind of [inaudible].>>Yeah. So in this
case it wasn’t. That then gets to the gesture issue, or whether there would be a way to
disambiguate between them. In this case, what it’s
going to do is, the first apple it does see, it’s going to ground with
that hypothetical one. It’s not going to keep
open the hypothesis that there might be more than
one and this is the wrong one. But that would be the next step, or maybe it’s reasoning
that I just have to find one. That’s good enough. Right?>>[inaudible].>>Yeah. But these are the
kinds of things that people. There’s a lot of
research now in AI on just what can we do with a
neural network end to end, and problems like this,
like robot soccer, like RoboCup@Home, are
so far from being able to just throw in a neural network
from perception to action. It requires really bringing together everything that we’ve done
in AI over all the years. It leverages a lot of the research,
the great progress that’s happened in vision using neural networks and convolutional neural networks and things like that. But I think we have to also bring in this kind of
reasoning, and symbols, and the things, and
probabilistic modeling, and all of these things
that I think are still really important
parts of the AI puzzle. And so that’s why I like
these application domains. They force us to grapple with
these issues.>>[inaudible] escape for the
nerves community out there.>>Mm-hmm.>>[inaudible] like for
the 100 years, yes.>>[inaudible] I believe with that. Thanks everybody. Thanks, Peter
