Fall 2019 Robotics Seminar: Michael Beetz (University Bremen, IAI)

Hi. Welcome, everybody
to the University of Washington
Robotics Colloquium. It’s my great
pleasure to introduce today’s speaker, Michael Beetz. Michael and I– we’ve been
friends for more than 20 years, I think. We actually shared an office
at the University of Bonn, where I was still
working on my PhD. And Michael just came back fresh
from finishing his PhD at Yale and then did a postdoc
position in Bonn. After that, he moved
on to TU Munich and is now a professor at
the University of Bremen. Michael’s work in
robotics has always taken, I would say, an AI
approach to this, where it’s more about
the high-level reasoning. How can we get robots to plan
about things in the world? How can we get them to still
do this reactively and not just in an open-loop
fashion, but always take sensor feedback
also into account when they make their decision. He’s done some really cool
work on leveraging information that we might have from
knowledge bases, from the web. He also did this [INAUDIBLE]. One of his projects
was, how can we build kind of an internet-based
or internet-scale knowledge base that all the
robots can share and that they can learn from
each other’s experience. And today, he’s going to
talk to us exactly related to this knowledge reasoning
and also how all of this might leverage physics-based
models about the world. Welcome, Michael. MICHAEL BEETZ: Thanks a lot for having me here. So I want to talk a little bit about an additional way of applying AI methods to autonomous robot control. And I’m very interested
in robots that accomplish human-scale manipulation tasks. So I would like to
have robots doing goal-directed manipulation
tasks, such as setting the table and cleaning the table. And the question is, how can
we write computer programs that can achieve that? And this is the topic
that we investigate in a collaborative
research center in Bremen. And I can just give you a little
bit of an introduction of what it is that we are after. [VIDEO PLAYBACK] [MUSIC PLAYING] We are essentially trying to
push manipulation capabilities from just doing that to
mastering the scopes– – After many years of
research, we’re now at a stage where our robots can perform
complex manipulation tasks, but only in specific contexts,
for specific objects, and under specific conditions. The transition should
be to mastering human-scale manipulation
tasks in realistic and open situations. If we want to go
from current robots to mastering activity robotics,
the key point is knowledge. The robot will only get key
point task descriptions, such as drop something
from the pot, and it will need to know that
it means tilting the pot– that if it tilts it too
much, it will spill– not too high, not too
low, and that the weight in the pot will change. The question is how to get the
common sense and naive physics into robot control systems
and make it effective so that it basically does
its job without delaying the performance. The robot should not
have to wait a long time until the reasoning
problems are solved. The reasoning should
be ubiquitous– so fast that you don’t
notice that it’s there. [MUSIC PLAYING] [END PLAYBACK] MICHAEL BEETZ: So that’s the
kind of manipulation capability I’m dreaming of. And the question
is, how do we get robots to actually do that? There is another way
of looking at that. And we look at
how we, as humans, acquire such
manipulation skills. And it’s pretty amazing
that even a boy that is not even two years old is able to
do things like getting something to drink. And he basically takes the water out of the pot and fills the glass. He is composing
different motions, like grasping, translating
the pot, and then tilting it. He’s also aware of the
objectives and also the negative side effects. He is careful in
avoiding spilling. And he even uses the cup in
order to stabilize the pot and to perform the
action more robustly. And that’s pretty amazing. Because at that time,
the boy certainly does not have the language concepts to talk about all these things. But still, even at
that age, they already have the common sense and
intuitive physics understanding to do these tasks. And then if we look
at humans and how they evolve their manipulation capabilities over a lifetime: step-by-step, we acquire
different methods for pouring. And even pouring
like a barkeeper, throwing up a bottle, catching
it, and filling a glass. And we can do that
by observing others doing these
manipulation actions, by reading, playing,
or exploring. And so one of the
assumptions that is underlying the
work in our lab is that we can build robot
control programs that have the elementary
proficiency of the capabilities of the two-year-old. But we can design these
robot control programs very carefully so that they are very modular and very transparent, so that computer programs can automatically reason about them, and so that they can evolve all these other methods of doing these tasks by transforming this basic and general
robot control program. So in this talk,
I want to invite you to look a little bit at getting these manipulation skills from a programmer’s
point of view. And I want to convince
you that, for one thing, it’s possible
that we can think of these very general plans
that can handle action verbs, such as fetch, place,
cutting, pouring. And that we can combine these
very general plans together with knowledge. And then we come
to a skill level which lets us handle these
manipulation tasks even under open conditions. I then will propose a
hybrid reasoning system that tightly couples
simulation-based reasoning with symbolic reasoning. And then finally, I
try to demonstrate what we gain if we are combining
these different reasoning capabilities. And what we can learn and
what kinds of decision-making we can get in our robots. So the key problem– if you think about humans
doing manipulation tasks, whether they are tasks that we are getting from others by language instructions or tasks that are self-made, is that we get tasks like
perform an action type pouring, where the theme is some
substance of type water. So we get very under-determined
tasks, which only spell out explicitly some
aspects of the task, but many others are left out. And if we are getting such
an under-determined task description, we
have to infer things like grasp the
pot by the handle, hold the pot
horizontally, tilt the pot around the axis
between the handles, hold the lid while pouring. If a robot cannot infer
these kinds of missing pieces of information, then there is no
way that we can actually apply manipulation actions
to situations that we haven’t directly
experienced before. So that is one of the
key functionalities that knowledge-enabled robot
control systems must have. So the challenge that
I am trying to focus on in my research
work is the question– if you see that robot in
the popcorn-making task– if you are just looking at
the fetch-and-place tasks, can we have a
single program that can generate all the fetch-and-place behaviors that we have seen here? Then can we go on
and basically replace objects that were
used in that scenario by other objects and
the program still works? And can we do
variations of the task? And can we change the
environment and still the same program works? Another challenge is, if the
robot knows how to pour popcorn out of a pot and it
sees in a YouTube video another robot pouring
water out of a pot or out of another container,
can it actually see that video, understand it, and then make
very few learning examples in order to basically get
that additional skill? And I think what is one of
the very basic capabilities that such robot control
programs have to have is that they are cognizant
about what they are doing. So by that, I mean that if you
run your robot-control program at any point in time, you
will have to be able to ask what are you doing? Why are you doing it? How are you doing it? What do you believe
at this moment? What are your expectations
about the outcomes of your current action? And only if we can basically
do that open-question answering about the current
activity of the robot can we expect that robots will get really competent and really robust. So what we are doing
is, we are trying to develop a cognitive
architecture for building robot control programs. And essentially, the main work
horses in that architecture are these generalized
plans and the way that they are interpreted, and a knowledge system that, like a Siri agent, helps the plan interpreter to interpret the generalized plans. So the plans are typically
under-determined. And what the plan
interpreter does, it basically identifies where
the current specification is under-determined,
it infers what it needs to know to
execute the action, and turns that into a question
that is passed to the knowledge system. The knowledge system then
returns a motion specification. And if the plan execution
system executes that motion specification, then the hope
is that the manipulation task is solved. And I’m briefly talking about
the plan representation part, but I will mainly focus about
the knowledge representation and reasoning part. So if you look at the
fetch-and-place task, the way we coded the fetch-and-place task is that we have a plan which
is supposed to fetch and place any object and fetch it and
place it at any location. And the plan has a very simple
and very modular structure. It simply says, at
the location where the object is, you have
to perform the action to fetch that object. And then at the location
where the destination is, you have to place the
object at the destination. AUDIENCE: Where would
you encode constraints? MICHAEL BEETZ: Oh,
they are coming. So they are implicit. So basically, the
big idea is that we allow the programmers to do
hand-waving and say, somehow, you should do that. And then the system, on the fly, is actually filling in the gaps and inferring the constraints. And I will talk more about that.
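To make this concrete, here is a minimal sketch of what such an under-determined generalized plan could look like, written as a Prolog-style clause in the spirit of the knowledge queries described later in the talk. The predicate names (perform/1, infer/2, and the parameters) are hypothetical placeholders for illustration, not the actual plan language of the system:

    % Hedged sketch: a generalized fetch-and-place plan whose missing
    % parameters are filled in, on the fly, by queries to a knowledge system.
    % perform/1 and infer/2 are hypothetical placeholders.
    fetch_and_place(Object, Destination) :-
        infer(fetch_location(Object), FetchLocation),            % where is the object?
        infer(base_pose_for(fetch, Object, FetchLocation), BasePose1),
        infer(grasp_spec(Object), Grasp),                        % grasp type, finger placement, force
        perform(navigate(BasePose1)),
        perform(fetch(Object, Grasp)),
        infer(place_pose(Object, Destination), PlacePose),
        infer(base_pose_for(place, Object, Destination), BasePose2),
        perform(navigate(BasePose2)),
        perform(place(Object, PlacePose)).

The plan itself only says, fetch the object at its location and place it at the destination; every other parameter is inferred at run time by the knowledge representation and reasoning system.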
So this was the high level and the part of the plan that tries to get the information to execute the action. Then we have the low level. The low level is
essentially a motion plan. It’s a model of how
manipulation actions are structured that comes out
of cognitive psychology. And there, we basically
decompose the manipulation action, like fetch and place,
into a reaching motion, into a lifting phase,
into a transporting phase, and a releasing phase. So each of these phases
has a motion goal. So for instance,
the reaching phase has a goal that my hand
has contact to the object. The lifting phase
has a goal that it doesn’t have contact with the
supporting surface anymore. And then, that’s an
extension from us. So we have knowledge
preconditions. So when we basically
start our reaching action, we have to know what the target
point is we want to reach, we have to decide on a pre-grasp pose and a grasp, which contains the grasp type and the placement
of the fingers, the force, and so on. So in order to execute these low-level motion plans, we essentially have to satisfy the knowledge preconditions of these motion plans.
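As an illustration, the phase structure with its motion goals and knowledge preconditions might be written down roughly as follows; the term names are hypothetical placeholders that only sketch the idea, they are not the system’s actual representation:

    % Hedged sketch: motion phases of a fetch action, each with a motion goal
    % and the knowledge preconditions that must be satisfied before execution.
    motion_phases(fetch(Obj), [
        phase(reach,     goal(contact(gripper, Obj)),
                         needs([target_point(Obj), pregrasp_pose(Obj), grasp_type(Obj), grip_force(Obj)])),
        phase(lift,      goal(no_contact(Obj, supporting_surface)),
                         needs([lift_height(Obj)])),
        phase(transport, goal(near(Obj, destination)),
                         needs([collision_free_trajectory(Obj)])),
        phase(release,   goal(placed(Obj, destination)),
                         needs([place_pose(Obj)]))
    ]).

Resolving every entry in the needs lists is exactly what is meant by satisfying the knowledge preconditions before the motion plan can run.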
AUDIENCE: I have a quick question. Maybe you were going to
answer this in a second. So the previous slide is what
we call happy path, right? So this is when
everything goes right, you go, you lift, you reach. MICHAEL BEETZ: Yes. AUDIENCE: Are you
going to talk about how you handle exceptions? MICHAEL BEETZ: Not
explicitly, but about 90% of the plans consists of failure detection, failure analysis, failure recovery, and
continuation afterwards. AUDIENCE: So it goes
in a different system, or is it part of
the vocabulary– MICHAEL BEETZ: It is
part of the system. So the whole plan
language that we are using is a plan language that
has success and failure as a primitive component. So every sub-program
you would call is returning success or
failure and a partial failure description that
you can analyze. So it has a heavy machinery
and also control structures to propagate failures, to handle
them, and to recover from them. AUDIENCE: Are these
typically learned? Or do the people
design these manually? MICHAEL BEETZ: So basically
when you look at the failures, we have a failure taxonomy. So the top level is,
the action just fails. And in this case, you only
have a failure recovery, try it again, which
is very uninformative. If you have a better diagnosis,
like you couldn’t detect the object properly,
then you know that you have to
reposition the robot or change the
detection algorithm. So you are doing refinement. And the better your refinement
is, the more promise the failure recovery methods have to actually solve the problem. So the typical thing
is, we are calling these very general plans with
an action like, fetch a cup and put it onto
the kitchen table. And the context would be
a task like table setting. So then we take that
partial description and instantiate the
body of the general task by these partial descriptions. And then what the
system essentially does, it extends these descriptions
by these specifications that the motion plan needs. And then we basically
have to infer these pieces of information
to execute the task. And the most essential
mechanism that allows the plan
interpreter to do that is to post that
body motion query to that knowledge representation
and reasoning system. So the query is, how do I have
to move my body in order to accomplish that
partially-specified task successfully? And it basically
does that by asking for these different parameters of the motion. And we want that
plan to succeed. So one of the big things
is that this goal is only partially specified. So if we are supposed to
put a cup on the table during table setting,
the place of that cup should be on the right side of
the plate and behind the plate. And all these kinds
of information have to be inferred. And these kinds of information are the common-sense knowledge that every one of us has, but that is so difficult to
explicitly state in robot control. So the hypothesis of our
work is that for these tasks like fetch and place,
grasp, pour, and so on– we can have one
generalized plan, which basically specifies the
structure of all these actions. And then we build
knowledge bases where we basically have modular
and generalized knowledge like, cups used for table
setting have to be clean, and nobody else wants
to use that cup. Sometimes people
have preferred cups, clean cups are in the cupboard. Clean cups are typically
empty, so I don’t have to look out for spilling. Cups have to be
grasped on the outside. And so the power
of that is, if we are having these generalized and
very modular knowledge pieces, then if we get into tasks
that the robot has never seen before, it basically can
chain these knowledge pieces and has a good chance of successfully doing tasks it has never seen. So something like, that
a filled open container has to be held upright– applies to any container and
to any shape of container. And that gives much more
scalability to our system.
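For illustration, a few of these modular knowledge pieces could be written down as Prolog-style rules of roughly the following form; the predicate names are hypothetical placeholders, not the system’s actual vocabulary:

    % Hedged sketch: modular, generalized knowledge pieces as rules.
    % All predicates are hypothetical placeholders.
    suitable_for(Cup, table_setting) :-
        cup(Cup), clean(Cup), \+ reserved_by_other_agent(Cup).
    likely_location(Cup, cupboard) :-
        cup(Cup), clean(Cup).
    assumed(empty(Cup)) :-                        % clean cups are typically empty,
        cup(Cup), clean(Cup).                     % so no need to watch for spilling
    grasp_region(Cup, outside_surface) :-
        cup(Cup).
    motion_constraint(Container, keep_upright) :- % holds for any container shape
        container(Container), open_top(Container), filled(Container).

Because each rule quantifies over whole object categories rather than individual objects, the same pieces can be reused when the robot encounters cups or containers it has never manipulated before.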
We are trying to build these programs. So this is how that program works. So what you are seeing
here is the robot acting in the real world, doing
a simple table-setting task. Here is a high-level
belief state of the robot. You see here the
collision environment where the robot does
the task planning. Here are the results of
the last object detections here– so a camera view of
the robot control system. You see that the
robot is actually getting the objects out of
different containers, drawers. Also, that it positions itself
so that it sees the objects well. Here you also see, if you are
applying these general methods of where the robot should,
for example, stand in order to pick up the object– you will always see a
distribution of possible places from which one particular position is picked. And then the manipulation
action is tried. And these also give examples of
learning how to do these motion parameterizations. AUDIENCE: You have these
general notions of empty, clean, and things like that, which
make a lot of sense to us. But how do you ground them
actually in the perception system so that it can detect
whether something is clean or whether there’s
water in a cup? MICHAEL BEETZ: So I
think in many cases, this is not grounded. So in a way, when you
say table setting– some of the knowledge is
coming from putting humans into virtual environments
and say, set the table. And there, you
would actually see where they get the cups from
and where they place them. So I think executing these manipulation actions will always be done under
partial knowledge. So if you are taking
a cup out, you’re assuming that it’s clean,
but you seldom check. You see that all these different
poses, which are essentially the pre-poses for manipulation, and then the parameterization of the reaching and grasping action, and here you see a particularly difficult object. And here you see that
the same program also scales, not only for
setting the table, but also cleaning the table– so putting them back into the
waste basket or in the fridge. Here on the upper right, you
see that the same program runs on a completely
different robot. Here we have changed
the environment. And here we have a
variation of fetch-and-place where we do assembly,
where basically, the placements have to
be particularly accurate. So this is the way we write the programs. The programs are much bigger
because for one thing, it’s all the failure
handling they have to do. But also, getting the information that they need in order to refine the action can often only be done in the course of the action. For instance, I can
only localize the cup when I have opened the cupboard. And so that scheduling has
also to be part of the plan. So these plans are written such
that every parameterization decision is explicitly
coded as a question to the knowledge representation
and reasoning system. And the answer is coming
back so we can also use that infrastructure then
for supervised learning. And so that we can
also show that, if we are mentally predicting the plans while we are executing and we predict that a certain parameterization will fail when we are putting down the object, then the plans are
getting better. And we can automatically
specialize the plans through experience-based
learning. And there you have effects like,
if the robot is, for instance, trying to detect
plates in a cupboard. So it doesn’t have to apply a
very general object detector. But it knows from its
experience that plates would be just horizontal
lines at a certain position in the cupboard. And that perception method
is much simpler, much more efficient, and much more
robust than applying a general detector for plates. So let me now talk a little
bit about the knowledge representation and reasoning system that we are using
in order to make that kind of decision-making
for manipulation. So if you look at knowledge
representation and reasoning in artificial
intelligence, what is one of the first things
that you are taught in artificial intelligence? It is that your representation is at a very abstract level. So it’s about objects
that have identity and that stand for the
objects in the real world. And if you are looking at
actions, it’s even worse. So your conceptual
model is typically a state transition system
with atomic state transitions between the states. And you have a model of the
preconditions and the effects, and you are abstracting
completely away from how the action is executed. If you think of
how we in robotics work on these problems–
for us, it’s the main task. The question is how do we
implement the motions that implement this action in order to perform the action successfully? So because the AI
representations make these abstractions, we
lose all the opportunities to actually make this huge
impact on performing actions in the right way. So essentially what we
need are representations which go down to the motion and
to the image level, so that they can be properly grounded
and that they give us opportunities to optimize. So our proposal
for that is that we are looking at a hybrid
knowledge representation system which is composed of
a symbolic knowledge base– as it’s typically
implemented in AI– and an artificial world which
is essentially a scene graph representation of
the world together with rendering and
simulation infrastructure. Yes. AUDIENCE: I noticed that
you have x, y, and o. Are those reals or are
those still symbols? MICHAEL BEETZ: It
doesn’t matter. So whenever we have
say, a name here, it’s actually the
name of the data structure in the artificial
world or in the scene graph. So basically, we have
exactly the same level of detail as the scene graph. Does that make sense? AUDIENCE: So I guess that
you are saying that when you talk about symbolic,
you can have sort of mixed– actual symbols and reals. MICHAEL BEETZ: Yeah. So we include basically, the
reals because what we want is, we want to generate
the artificial world out of the symbolic knowledge
base and go back. So we want to have a
one-to-one correspondence. And of course, the
assumption is that we are in a world where
we know all the objects and have models of the object. So then we basically want
to take this knowledge representation and we want
to take the real world. And we want to get the real
world and the artificial world so close to each other that,
if we are executing our robot control program and we look
at the execution traces, we couldn’t distinguish whether
the program was actually executed in the real world
or the artificial world. So that’s the idea. If we are getting that, then we
can play a very promising game. So we can pretend with
our robot control program we are executing our program
in the artificial world, but in reality, we are
executing it in the real world. And the big advantage
of doing that is, in the artificial world, we have
much, much better information. We have access to everything. And of course, those
two representations will never be the same. So we have to account for that,
that our robot control programs have to have heavy machinery
for event and failure detection and continuation. But our belief is that
the information that is contained in these
very informative models makes it worth having
all that machinery. So we are always
monitoring and looking at the differences between
the real and the artificial. So let’s see whether
we can actually generate representations that
are matching our expectation. So that’s not possible for
environments like kitchens. But if you are looking at
more structured environments like retail stores, that is in
the reach of getting that done. So we are installing
robots in the supermarket. And we equip the robot with
models about the products, also models about the
furniture pieces, barcodes, and the separators. And then we let the robot
go in the environment and build a symbolic knowledge
base of the supermarket. AUDIENCE: There was a
[INAUDIBLE] on that slide. MICHAEL BEETZ: Oh, there is
every [INAUDIBLE] everywhere. Yeah. So here you see the
robot in the environment. And here, this is the
belief state of the robot. At that point, the robot has
detected the base components of the shelf systems. And now it’s looking for
the individual shelves. Now it has detected
the individual shelves. And now it’s looking for the
separators and the barcodes of the products. And now it’s taking
each barcode and seeing whether the respective objects
or products are in that shelf. So we are basically constructing
now our artificial world. And so we are able to
build a symbolic knowledge base without human interaction. And we can here use
Prolog, of course, to retrieve semantic
information from that. So we can ask about particular
products and where they are. Or we could ask
where all the empty facings in the shelf system are. Or by combining it with background knowledge that we have, for instance, from the worldwide web or other knowledge sources, we can ask questions like, where are products that are dangerous for kids but are within the reach of kids? And so basically any query that you can semantically formulate with your background knowledge is executable on that model.
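For illustration, such a semantic query over the acquired store model might look roughly like the following Prolog rule; the predicate names (product_at/2, hazard_class/2, facing_height/2) are hypothetical placeholders rather than the system’s actual vocabulary:

    % Hedged sketch: products that are dangerous for kids but placed
    % within their reach. All predicates are hypothetical placeholders.
    dangerous_and_reachable_by_kids(Product) :-
        product_at(Product, Facing),
        hazard_class(Product, dangerous_for_kids),
        facing_height(Facing, Height),
        Height =< 1.2.     % assumed reach height of a child, in meters

    % Example query:
    % ?- dangerous_and_reachable_by_kids(P).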
But we can also take that model and construct the artificial world out of that, so basically, a game engine environment. And now we can basically go shopping in the model that is
acquired by the robot. And because all of that is a
symbolic knowledge base now, and we basically have the
models of cognitive psychology, we can automatically parse
actions in the virtual reality and basically get information
that we, for instance, can use for imitation learning. So for robots, the most
important thing is, can we actually use these kinds
of models to perform autonomous manipulation actions? So there are only a
handful of objects that we can manipulate
with our grippers. But with respect to the
information content, you can see that we can
do these kinds of things. And then the question,
can we actually do that in real-world environments? So that’s in one of the retail
chain stores in Germany. And you see that the robot is
building these artificial world models of realistic
environments. So this is one part. So the other part is, so that
was a static environment. And that was the
deployment phase. But if you are in the
manipulation part, in table setting or in other cases, the question is, can we actually get
models of the current state of the environment, and can
we maintain these models while we are executing? So this is very early,
and we are just starting to develop that system. But here the idea
is that we have the robot in the environment. And like in the
retail store, we want to build a model of
the kitchen environment that corresponds to
the current state. And unlike many others,
we basically take here the models of the object
themselves and place them. And the hope is, because
we can realistically render them and feed them back
into the perception system, that the redundancy
and the simplification of the perception problem helps
for making these tasks much more stable and efficient. So we would hope, when
the robot is actually doing its task in
the kitchen, it can maintain a complete
belief state about the setup of the environment. So when we are looking at
this digital twin knowledge representation and
reasoning framework, we have seen it consists of
the symbolic knowledge base– which are essentially the
assertions and the axioms– and the artificial world, which
provides an environment model through a scene graph
and basically rendering and a simulation structure. And then we have
two functions that convert from the knowledge
base to the artificial world, and from the artificial
world to the knowledge base. And what we want
to have is, if you are looking at the
symbolic knowledge base, then everything we have in
the artificial world is true. And on the other hand, if we
convert the knowledge base into the artificial
world, there shouldn’t be other objects than
the ones that are already in the artificial world. So that is the consistency condition between the two.
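Stated as a sketch, this consistency condition could be checked with a rule of roughly the following form, where kb_asserts/1 and world_contains/1 are hypothetical placeholders for the two sides of the digital twin:

    % Hedged sketch of the consistency condition between the symbolic
    % knowledge base and the artificial world. Predicates are hypothetical.
    consistent :-
        % everything that exists in the artificial world is asserted in the KB
        forall(world_contains(Obj), kb_asserts(exists(Obj))),
        % and the KB asserts no objects beyond those in the artificial world
        forall(kb_asserts(exists(Obj)), world_contains(Obj)).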
And then we have that function that, whenever the artificial world is evolving and there are intentional activities going on, like the shopping in the virtual environment, then this should be automatically parsed, interpreted, and converted into the knowledge base. So let’s look at that. From the models from
cognitive psychology, we know that for the
segmentation of the motion phases, the important criteria
are the force dynamic events: the hand having contact with the object, the object being lifted from the supporting surface, and so on. So we have instrumented the
simulation engine in order to detect these force dynamic events that we need for action recognition.
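As a sketch, segmenting motion phases from such force-dynamic events might be expressed with rules of roughly the following shape; contact_event/3, contact_lost_event/3, and motion_starts/2 are hypothetical event predicates logged by the instrumented simulation:

    % Hedged sketch: segmenting motion phases from logged force-dynamic
    % events. All event predicates are hypothetical placeholders.
    reach_phase(Hand, Obj, T0, T1) :-
        motion_starts(Hand, T0),
        contact_event(T1, Hand, Obj),
        T1 > T0.
    lift_phase(Obj, Surface, T1, T2) :-
        contact_event(T1, _Hand, Obj),
        contact_lost_event(T2, Obj, Surface),
        T2 > T1.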
And so here, we basically perform a kind of manipulation action. So you also see that we have the fluid flow detected, with its start and end. And here you see
the same structure and now doing the online
detection and categorization of these actions,
the motion phases, and automatically recording
the trajectory data that we need. AUDIENCE: In this
case, for example, if there’s less milk or
more milk in the real milk bottle than the demonstration,
how would it handle that? MICHAEL BEETZ: Well, I guess
it has to do it like we do, right? So it has to basically look at how the level rises when you are filling the glass. So it’s always a
question of if we are looking at the states where
we actually try to measure the state and at which place. So there are always
unknown parameters, like the viscosity of the fluid,
like the servers [INAUDIBLE].. And we have to, like
humans, actually act right without knowing
these parameters. AUDIENCE: So that’s all
that has to go into this. Because you don’t know whether
you should simulate actually the tilting motion
or whether you should simulate the milk level. MICHAEL BEETZ: No,
I think in general, when we are doing all these
simulation-based reasoning, we are not doing the
haptic feedback loop. And I think in
general, you would want to parameterize
your actions also often in the effect space. And the mapping from the
effect into the controllable parameters should actually be
through probabilistic methods. So you are pouring
pancake mix on the oven, and you would measure how
fast the pancake grows. And if you are doing that, you don’t need to know the viscosity and the other parameters. Because you have something that
is correlated with that then. The question is, for what kinds
of manipulation actions we can do these kinds of things. And you basically get
everything automatically in our knowledge
representation system. And you can access
it through Prolog. And this is the most
recent version– now you can see that we have a
full-body motion tracking. It also comes with physics-based interaction. You can close the fridge by kicking it. So we are trying to get
these kind of simulations also as realistic as possible. And of course, that’s not a
very faithful representation, but it’s starting there. So you can do things
like having to grasp a plate with two fingers, because one finger would not be stable enough. The other thing is the
symbolic knowledge base. So if you’re looking at
the artificial world– the idea is that you could click
on every object, every object part, and you have a symbolic
knowledge base behind it that defines what
these parts and what these objects are in a
machine-understandable way. So whenever you are actually
generating your learning problems or you’re
doing reasoning, you can do that
in a semantic way rather than trying to
correlate features. I think the most
powerful mechanisms that we have in that knowledge
representation and reasoning framework are the mechanism
setup based on a simulation. And because we basically
have our environment, and here we have the simulation,
there we have the rendering. And then we can realize
cognitive capabilities like mental simulation, mental
imagery, imitation learning, and motion replay. So mental simulation– we built
a [INAUDIBLE]-like simulator, but now on top of Unreal. And we basically run the
robot control programs in the very same way
we run them in reality– so it’s without changes,
just exchanging the low level of the control programs. We run them in both frameworks. We can even run perception in that framework. Mental imagery– so
that we can mentally imagine scenes that the
robot has experienced and modify them. So we take the log data we
have from the experience of the robot, but now
we can replace the milk with orange juice
and create situations that are very representative for
the tasks we are looking for. And then using the
semantic infrastructure, we can do things like,
show me all the objects above the counter,
or all the objects that I can reach
with the right arm, and all the objects that
are above the plate. And so that essentially
means that every question you can ask in the symbolic representation language, you can automatically turn into a learning problem, so that the respective concept is learned as a perception mechanism. AUDIENCE: Notions of liquid
container and things, how did you get those
into the knowledge base? Did people program some
of these predicates? MICHAEL BEETZ: Yeah, so
that would be actually, we have a big encyclopedic
knowledge base. And there is a
concept of a container and a liquid-containing
container. And they are all defined in
terms of logical assertions. Another way of acquiring
knowledge– and this is for manipulation tasks–
particularly important is, if we let the robots
watch YouTube videos. So the idea is, you want to
take textual instructions, and you want to take the video. And you want to
basically combine them to get semantic
representations of the actions you see. So here you see what the system
is doing at the semantic level. It’s basically taking the frame
from textual descriptions. Then it’s filling it
out with the parameters that it’s seeing in the image. So it’s basically saying
that it’s taking action, that it’s a power
small wrap grasp, and that it’s a mug
that is grasped. And so you get essentially the
same kinds of representations we are using in our plans, so
that is then very compatible. And we need to do that because
there are many ways of doing manipulation actions. And most of the methods will
not succeed on our robot control systems. So we have to look
at all possibilities to find one that might work. So that’s basic
system architecture. So you have here
a deep network that does pose estimation,
the detection of the object, the hand pose estimation. Here we have a
Markov logic network, which takes a textual
description and then infers parameter that the
system cannot see. And the combination basically
gives us better information about these videos. And that’s what we
are working now on. So we are trying to get
these observed activities. We try to match
it on the avatars in the artificial
worlds and then getting basically the
semantics underlying that. Because now we are
getting force contact. And that should give, for
imitation learning, much, much richer representations. So that’s just
something where if you think about artificial worlds
being a logical knowledge base, the question is, what is it
that I can actually represent as a logical knowledge base? So how rich can my
knowledge bases be? And this is just
a scenario where we have somebody eating food and
later, it’s basically a cutting action, serving,
and then at the end, it’s the start of
a feeding action. But now, the big
thing is here, now we have multiple agents
in the environment. We have interactions. And again because
we have it that way, we could also train the vision
systems with these images and try to give the robots
better ways of understanding the activities that are
going on in an environment. So the knowledge
representation system has that large hybrid
infrastructure. Everything is wrapped as a
virtual logic knowledge base. We can interface with Prolog. And then we have the
applications perception, question answering,
learning, and recording of episodic memories. So let me just show
you what that actually gives to the system. So if you look at
the episodic memory– so this is a video
on this side, which is completely generated out
of the Prolog knowledge base. So it just renders the knowledge content, and you’ll see that you have the classic
continuous motions of the robot control system. You see object detection,
you see the images. And again, that
is the information that you can use for
reasoning and learning. So the data structure
of episodic memories are all the poses of
the robots, the images it used for major
interpretations, the way the plan
got interpreted. And then interesting is where
the plan interpretations or the computational execution
of the plan and the real world interface through these first
dynamic events and perception events. And they are
essentially represented as logical assertions. To give you an idea
of what that does– we are running our
manipulation episode. So the system is automatically
recording episodic memories, which are a logical
knowledge base. And now we can take
that episodic memory and ask the question, is
there an action of the type pick-up, where the robot tried to pick up a part that weighs more than 2 kilograms? And we can ask, when
did that action start? And we can ask, what is
the pose of the robot at that very moment? We can now take that
question and use it for generating
learning problems. So we are now interested
in the set of poses in which the robot
picked up a heavy object and was successful. And we want to collect the poses at which it succeeded. So we get the positive examples
of picking up a heavy pot. We can also collect
the failures. And now we can learn where the
robot should stand in order to pick up heavy objects. So we are running many
samples in a simulation. We are getting the green
ones as a successful pick up, the red ones are
the failed ones. We are learning the
places to pick the object at the respective position. And now at execution
time, when we see objects with a certain inaccuracy,
we back project the place. And now we get a more
distribution-like representation of
that place, where the higher the distribution is,
the higher the probability that it will be successful. So here you see how
that representation varies when you
change the object, the inaccuracy of the object. And here you see
it in practice when the robot is supposed to
pick up objects from a table. That was all that
I wanted to say. So I tried to convince you that
besides deep learning-based approaches, designing a robot
control programs in a more engineering-based
fashion, and combining these well-designed plans
with lots of knowledge where we basically get these
knowledge pieces from very different sources might be also
a way to scale the manipulation capabilities of future robots. Thanks a lot for your attention. [APPLAUSE] AUDIENCE: So one challenge is
that the more knowledge you put into your knowledge
base, the longer a inference might take. And when humans are in
the loop with the robot, the responsiveness of the
system might become a problem. So I wonder what paths do
you see for keeping systems like this responsive. MICHAEL BEETZ: So for us,
that is not really a problem. And the reason for that is,
from what we have done so far, you don’t have these long
chains that basically generate huge search spaces. So in most cases, it’s
actually combining different pieces of information
more in a data log way. So it’s more like an
SQL query would do. And those queries are
typically very efficient. So that doesn’t seem
to be the bottleneck. And that basically is, there
is very little such space reasoning that we are
doing at the moment. AUDIENCE: And it makes
one big thing coming up these days when people
start doing whatever– also deep learning
from simulation is, of course, the gap
between SIM and real. It seems like you still
have to bridge that gap also to always go from your knowledge
base to the real world. So that you learn on
the perception side. You still need to
learn the controls so that the controls actually
work in the real world. Any thoughts on how
to do that in general? MICHAEL BEETZ: I
think for us, it would be much better behaved. Because we have these
semantic events. And the structures– they are
a well-structured problem. Not everything is
actually a black box. And so we basically know that
the pick up was successful or it slipped during
pick up, which is much better information. And we can be much more easily
adapted to the real world than if you’re having
a black box system. AUDIENCE: Yet it’s
not only that, but even if you,
let’s say, you still need to execute that grasp
in the real world, right? It’s informed by what your
knowledge system came up with. So the question is,
you need to generate these low-level
control commands. And you still need
to perceive the world so that you get your states. MICHAEL BEETZ: I hope I have
said that we are not basically simulating the
haptic feedback loop. So I think this haptic
feedback loop is really one that has to be tackled
with probabilistic method. So you are making this
a more [INAUDIBLE] or– Yeah. AUDIENCE: So it’s something
that has to be done. MICHAEL BEETZ: Yes. But it’s kind of at
the local level value. You’re bottoming out. So I think that you feel
you have a better chance. And then you have
these failure detection and failure recovery
methods that try to fix it from
the higher level. AUDIENCE: Thank you. MICHAEL BEETZ: Good. Thanks a lot. [APPLAUSE]
