How Booking.com Serves Deep Learning Model Predictions (Sahil Dua)

Hello, people! My name is Sahil Dua. And today I’m going to talk about a very unique
combination of two technical areas. One is Deep Learning. Which is a special branch of machine learning. And then I’m going to talk about containers. Something that almost all the cool kids these
days have been talking about. Including me. So… I’m going to talk about how we productionize
our Deep Learning models to be able to serve realtime predictions at large scale using
containers and Kubernetes. I hope I covered all the buzzwords. So I would like to get an idea of the audience
— can I see hands? People who have used Deep Learning? Okay. And those who know what Deep Learning is? Okay. Quite a lot. That’s nice. And the other side — those who have built
apps which run in containers? Okay. Nice. Okay. So… What it’s going to be about… I’m going to begin with the applications of
Deep Learning that we saw at Booking.com. Some of them. And then I’m going to talk about the life
cycle of a model from a data scientist’s point of view. What are the different phases of a model. And then I’m going to talk about the Deep
Learning pipeline, the production pipeline, which we built using containers. So get excited! Let’s start! First of all, yes… Who am I? I’m a backend developer, working at Booking.com in the Deep Learning infrastructure team. That’s why I’m talking about this topic here. I’m also a machine learning enthusiast, which
means I spend almost all of my free time, free geeky time, learning about machine learning
and different techniques coming up. And I’m a big Open Source fan. I have contributed to a bunch of projects,
like Git, that most of you probably use in your day-to-day life. I’ve submitted a bunch of patches. I have also contributed to the Pandas library,
a Python library, Kinto, and go-github, which is a project by Google. And a bunch of other projects. I’m a tech speaker, which is why I’m here
on the stage today in front of all of you. Generally I talk about topics varying from
data analysis, Deep Learning, A/B testing, containers. Kubernetes. So that’s all about me. So let’s begin with the applications of Deep
Learning at Booking.com. But before that, I would like to talk about
the scale. Because I mentioned at large scale. What does that mean? So we have over 1.2 million room-nights booked
every 24 hours. Across more than 1.3 million properties in
more than 220 countries. Let me make one thing clear. That I’m not here to show off the numbers. The point I’m making here is that, at this
large scale, we have access to a huge amount of data that we can then utilize to improve
the customer experience. So let’s see how we do that. The first application of Deep Learning that
we saw was image tagging. And when we look at an image — for example,
this one — the question that arises is: What do we see in this image? Now, this question is really an easy one,
as well as a really difficult one. So if you ask this question to a person, to
someone from this audience, it’s going to be easy to say what’s there in the image. It’s easy to say what are the objects that
are there in the image. But it’s not an easy problem when you ask
a machine to tell what is there in this image. And what makes this problem even harder is
that the context matters. What you see in this image may not be what
I want to see or I want the machine to detect. For example, if we pass this image through
some of the publicly available image neural networks, like ImageNet or DenseNet, these are
the results that we get. We get information like… There is an ocean view. There is nature. There is a building. Okay. That’s good to know. There’s a building. But what do we do with it? These are not the things that we are concerned
about at Booking. However, these are the things that we are
concerned about. Is there a sea view from this particular room? Is there a balcony? Is there a bed? Is there a chair? Or things like that? Now, before you start thinking — okay, it’s
an easy problem. You just need to detect a bunch of objects
that you have. No. First challenge is that it’s not an image
classification problem. It’s an image tagging problem. That means every image is going to have — or
might have multiple tags, so it’s not just that we classify it as something like sea view or balcony. Every image is going to have multiple tags. And to top that, there is going to be a hierarchy
of the tags. For example, if you see an image has a bed
in it, we are almost sure that the image is going to be an inside view of the room. Unless you’re in such a room where there is
no bed. Or there is no room. So once we know what’s there in an image,
we can use this information to help the customer decide what they want. We can help them in deciding what kind of
property, what kind of hotel or apartment they want to book if we have this information
about all the images that we have for a particular hotel. Let’s talk about another problem. So this is a classic recommendation problem,
where we have a user X, they booked a hotel Y. Now we have a new user, Z. There’s a debate if it’s zed or zee, but I’m
going to stay with zed. There’s a user, Z. We’re going to predict what hotel they are
going to like. We basically want to recommend what hotel
they are going to like. So the objective we have is: We want to find
the probability of a particular user, Z, booking a particular hotel. And we have certain features, like country
or language of the user. Where they are coming from. We have contextual features, like when they
are looking for it. What’s the day, what’s the season. And then we have some information about the
hotel. The item features. What’s the price? What’s the location? Is it near the beach? Does it have a swimming pool or something
like that? So I’m not going to go into details of this
problem. Because I’m here to talk about the infrastructure
part. And I don’t want to disappoint you. So research has shown that we can get better
results for such a problem of a recommendation system if we use deep neural networks instead
of linear collaborative filtering. So once we figured out that there are going
to be certain applications of Deep Learning which may be really interesting for us as
a business, we started exploring this field. And thanks to my colleagues, Stass and Imra,
who are probably enjoying the rainy season back in Amsterdam, they started this learning
process, and now we have some very cool Deep Learning models in production. So let’s start talking about the life cycle
of a model. What are the different phases? So there are three phases: code, train, and deploy. In the first phase, data scientists write their
model, write the code for their model, try out different features, different kind of
embeddings, different kind of interactions or different features, and different architectures,
and once they know they are happy with the performance of the model, now they want to
train their model on the production data. Once they are done with the training on the
production data, they want to deploy it, so it can be used by multiple clients. So we use TensorFlow, which is a nice machine
learning framework by Google. We use a high-level Python API to write our
models, because it really sort of standardizes the process of writing the model, and helps
get the prototypes quickly. Now, these two parts, the last two parts,
train on production data and deploy — these are the two parts that constitute our production
pipeline, Deep Learning production pipeline, that I’ve been talking about since the beginning
of this talk. So you may argue why training of a model is
a part of our production pipeline. That’s a very valid question. You can also train your models on a laptop,
right? But if you try to train your models on your
laptop, you may end up looking like this. Not really happy, right? So there are a couple of reasons. One is that sometimes your training data is
so large that you can’t really hold all of it in your memory, on your laptop. And other more general reason is that your
laptops are going to have limited resources. Like CPU cores, GPUs… In most of the cases, they’re not going to
have a lot of powerful GPUs. Just to speed up the training. So if you want to speed up the training, you
should train… You should not train on your laptop. And we’ll see how we do that in production. So we have our big servers, which have a lot
of CPU cores and powerful GPU support. What we do is we wrap our training script,
we run them on our big servers. Problem solved, right? Not quite. So once we do this, there is a limitation
with that. There are going to be multiple data scientists
who are going to run their training at the same time, possibly on the same servers. So we might not be able to provide the independent
environment that they might like to have. Plus if they want to use a different version
of TensorFlow, for example, we need to create a different environment for that particular
training so it defines what it wants to use, so that we don’t really need to have global
level dependencies on particular packages. So this is what we do: We wrap our training script in a container. So this yellow thing is… Let’s consider that as a container. That’s the best I could draw. I’m not a designer. So we wrap our training script in a container, and then we run that container on our servers. So now, what is a container? A container is a lightweight package of the
software that contains all the dependencies that it needs to run. And then we run that on our servers. Easy. So that enables us to package all the versions,
all the particular versions, that we want for our training. Package that up. And ship it as a container. So that solves the problem of having multiple
different versions of multiple dependencies. And these containers can also utilize the
GPU support that we might have on our servers. You just utilize whatever GPUs we have. So… In a nutshell, this is what it looks like. We have production data in Hadoop storage
that we want to use to train our model. We spawn up a new container every time we
want to run a training. That container, again, has all the dependencies
it needs, has the training script, and it takes the data from Hadoop storage. Once it has the data, it runs the training. And now while the model is being trained,
we want to make sure that we can somehow save that model. We should be able to save the trained model,
so that we can utilize that later. We can deploy it somewhere. So we save the model checkpoints back to Hadoop storage. Model checkpoints are like the model weights that are required to load the model again at the deployment stage. Once we save the model checkpoints, the container dies. Yes. Who can be more selfless than a container? It takes birth to do the work that you assign, and then it dies. That’s the entire life of a container.
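That container life cycle — fetch the data, train, save the checkpoints, die — can be sketched in a few lines. This is only an illustrative stand-in: a local directory plays the role of Hadoop storage, the “training” just averages labels, and all of the function and file names are made up.

```python
import json
from pathlib import Path

# A minimal sketch of the training-container life cycle described above.
# "Hadoop storage" is simulated with a local directory; the real pipeline
# would read from and write to HDFS instead. All names here are invented.

def fetch_training_data(storage: Path):
    # The container pulls its training data from shared storage.
    return json.loads((storage / "training_data.json").read_text())

def train(data):
    # Stand-in for the actual training loop: "learn" the mean of the labels.
    labels = [row["label"] for row in data]
    return {"bias": sum(labels) / len(labels)}

def save_checkpoint(storage: Path, weights):
    # Model checkpoints (the learned weights) go back to shared storage,
    # so a serving container can load them later.
    (storage / "checkpoint.json").write_text(json.dumps(weights))

def run_training_container(storage: Path):
    # The whole life of the container: fetch, train, save, die.
    data = fetch_training_data(storage)
    save_checkpoint(storage, train(data))
```

The point of the shape is that the container itself is stateless: everything it needs comes from storage, and everything worth keeping goes back to storage before it exits.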
So once we have trained that model on production data, we have already covered one step of going into production. So the next step is to use this trained model and put it in production, so that different clients can use
it to get the predictions back. So this is what we do: We have a simple Python
app, it takes the model checkpoints that we just stored in the last step from Hadoop
storage. It loads the model in memory. And it’s ready to serve predictions. So let me cover it again. So to be able to serve predictions from a
model, we need two things. We need the model definition, which defines what
are the features, what are the interactions, and all that stuff, and then we need to have
the model weights that we gathered from the training. Once we combine both of these, we load the
model in memory. It’s a TensorFlow model in memory. And then we’re able to serve the request. And to top this, we have a nice way to abstract all this and provide a nice URL, where the client can send an HTTP GET request,
and get the predictions back. In the end, it’s just as simple as sending
the GET request and getting the predictions back. So this is what it looks like. We have the app that I just mentioned, and we wrap it in a container. Once again. Because having containers solves the classic problem of “it runs on my machine but doesn’t run on yours.” The container holds all the dependencies, so it’s going to run on any machine that supports containers. And once we wrap the app in the container, any client should be able to send a request with all the features that are necessary to get the prediction back. And in the response, it gets the prediction.
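A minimal sketch of such a prediction app, using only the Python standard library. The “model” here is a made-up weighted sum standing in for the real TensorFlow model held in memory; the `/predict` route, the feature names, and the weights are all assumptions for illustration.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

# Hypothetical "model": weights loaded once into memory at startup,
# standing in for a TensorFlow model restored from checkpoints.
WEIGHTS = {"price": -0.8, "has_pool": 1.2, "bias": 0.1}

def predict(features):
    # Score = weighted sum of features (a stand-in for a real model call).
    score = WEIGHTS["bias"]
    for name, value in features.items():
        score += WEIGHTS.get(name, 0.0) * value
    return score

class PredictionHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Features arrive as query parameters, e.g. /predict?price=0.5&has_pool=1
        query = parse_qs(urlparse(self.path).query)
        features = {k: float(v[0]) for k, v in query.items()}
        body = json.dumps({"prediction": predict(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Keep the demo quiet instead of logging every request.
        pass

def serve(port=0):
    # port=0 lets the OS pick a free port; the model stays in this
    # process's memory, so there is no extra hop to reach it.
    server = HTTPServer(("127.0.0.1", port), PredictionHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A client then really does just send a GET request with its features and read the prediction out of the JSON response.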
As I mentioned, things are not quite simple at a large scale. So what we need to have is multiple servers. Easy solution. We had one app. We were not able to serve all the requests
properly. So we have multiple apps. We replicate the same containers that we had. Put them behind a load balancer, and the client
doesn’t know how many apps are actually serving behind. It just knows the IP of the load balancer,
and the load balancer is responsible for routing all the requests. Quite simple.
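That fan-out can be illustrated with a toy round-robin balancer. The “backends” here are plain functions standing in for replicated model containers behind real network addresses; everything in this sketch is invented for illustration.

```python
from itertools import cycle

# A toy round-robin load balancer: the client only knows one entry point,
# which rotates requests across several identical model replicas.

class LoadBalancer:
    def __init__(self, backends):
        self._backends = cycle(backends)  # endless rotation over replicas

    def handle(self, request):
        # Pick the next replica and forward the request to it.
        backend = next(self._backends)
        return backend(request)

# Three identical "containers" serving the same model.
replicas = [lambda req, i=i: {"prediction": req * 2, "served_by": i}
            for i in range(3)]
lb = LoadBalancer(replicas)
```

The client never learns how many replicas sit behind the entry point, which is exactly what lets you add or remove containers freely.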
But things are not generally simple at large scale. So instead of six servers in this, we need to have quite a lot of servers. When I say server, it’s actually this container. So as we keep on increasing the number of
containers we have for a particular application, we need to find a way to manage these properly. Because once you have hundreds of containers
running a particular app, particular model, things are going to get sketchy. So we use Kubernetes to surround all of this
infrastructure that we have. Kubernetes is a container orchestration platform
which helps in scheduling, managing, and scaling containers easily. It provides a really easy way to scale up
or scale down the number of containers that we have, based on certain principles. And it also provides an easy way to make sure
that at any point we have a specific number of containers running, and if something goes wrong and some of the containers die, Kubernetes is going to make sure that it spawns up new
containers to maintain that number that you specified. So that was about putting in production. But once you put things in production, a lot
of responsibility comes on you. Now, you want to be able to measure the performance
of these models on your production server. You want to be able to answer some of the
questions like: How is this model behaving? What is the latency versus throughput of your model? So let’s say your model is taking some computation time to give the predictions back. But this is not going to be the time that
your client is going to see. There’s going to be some request overhead. And so in general, the total prediction time
is going to be the sum of your request overhead and the computation time. And in case you send more than one instance to predict on in one request, the computation time is going to scale with the number of instances your model has to predict on. We can see here that for some of the simple models, like linear regression, logistic regression, or simple classifiers with fewer features, the request overhead is going to be the bottleneck, because the computation time is going to be too small compared to the request overhead. So you need to be considerate about what model
you have and what is going to be the bottleneck for your particular case. Once you have this information about your
model, you want to be able to optimize it. And you can do that for two things. So one is: You want to optimize for either
latency or throughput. Let’s talk about them one by one. So first is: If you want to optimize your
serving for latency — what’s latency? Latency is the amount of time taken by your
server to serve one request. And this is from the client’s point of view. So how can you optimize for latency? First of all, when do you want to do that? When you have some application where you want to show some information to the user in realtime, you definitely want to optimize so that you can serve the response as soon as possible. The first optimization that you can do is: Do not predict in realtime if you can precompute. That may sound silly, but in a lot of cases it’s possible: you already know the whole feature space that you have. And instead of predicting in realtime, you can predict for all those cases in advance, save the results in a lookup table, and just use the lookup table when an actual prediction request comes.
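The precompute idea can be sketched like this, assuming a hypothetical model whose feature space is small enough to enumerate. The feature names, values, and scoring function are all invented for illustration.

```python
from itertools import product

# Hypothetical model with a small, fully enumerable feature space.
COUNTRIES = ["nl", "de", "fr"]
DEVICES = ["mobile", "desktop"]

def slow_model(country, device):
    # Stand-in for an expensive model evaluation.
    return len(country) * 0.1 + (0.5 if device == "mobile" else 0.2)

# Precompute every possible prediction once, offline...
LOOKUP = {(c, d): slow_model(c, d) for c, d in product(COUNTRIES, DEVICES)}

def predict(country, device):
    # ...so serving becomes a dictionary lookup instead of a model call.
    return LOOKUP[(country, device)]
```

This only works when the feature space is finite and small; with continuous or high-cardinality features you are back to predicting on demand.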
The next way is to reduce the request overhead. For simpler models, when you know that the request overhead is the main bottleneck for serving those predictions, you can optimize to have less request overhead. And how do you do that? One of the things you can do is
embed your model right into the application that it’s serving. And that’s what we do. We load the model right into the memory of
the app. So that there’s no latency, no request overhead
to go from the HTTP request to the model. The model is right there in memory. Once the request comes, it’s going to be served right there. The next thing is: predict for one instance. Let’s say on a web page you want predictions for three different models or three different instances. When you want to show the information, instead of batching them, you should send three requests with one prediction each. The next is a specific technique for speeding
up the prediction: quantization. What it means is you change your float32 types to 8-bit fixed-point types. How is that going to help? Now, your CPU is going to hold four times more data in its registers, and it’s able to process the data faster. So that means you’re going to be able to compute the predictions faster. Four times faster, to be accurate.
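A toy version of that quantization step, assuming simple symmetric linear quantization. Real frameworks (TensorFlow Lite, for example) calibrate ranges per tensor and keep the arithmetic in fixed point; this sketch only shows the float-to-int mapping and the precision you give up.

```python
# Map float32-style weights onto 8-bit integers and back.

def quantize(weights, bits=8):
    # Scale floats into integers in [-(2**(bits-1)-1), 2**(bits-1)-1].
    qmax = 2 ** (bits - 1) - 1          # 127 for 8 bits
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale

def dequantize(qweights, scale):
    # Recover approximate floats from the quantized integers.
    return [q * scale for q in qweights]

qw, scale = quantize([0.5, -1.27, 0.02])
approx = dequantize(qw, scale)  # close to the originals, within one scale step
```

The trade-off is explicit here: each weight can be off by up to half a quantization step, which is usually acceptable for inference but not for training.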
And then there are some TensorFlow-specific things. It really depends what library you use for TensorFlow. There are some things like freezing the network. It means there are some TensorFlow variables
in the computation graph, and we can change those variables to constants, so that
there is less overhead when we try to compute. And the next thing is: We optimize for inference,
which means we change… We remove all the unwanted nodes from the
computation graph, so that we don’t really compute unnecessary things that we don’t need,
in order to predict something. Now, the second application is optimizing
for throughput. What is throughput? Throughput is the amount of work done in one
unit time. So when do you want to optimize for throughput? Let’s say you have some cron job, or some
offline work that needs to be done every night or every hour or every ten minutes. There you don’t really care about when a particular
request arrives. Or when it gets a response. What you care about is: How much time it takes
to do this bunch of work. So that’s when we want to optimize for throughput,
so that we just care about how much work is being done in a particular time, and we just
don’t care about when a particular request or how much time a particular prediction takes. Again, the first thing we can do is: Don’t
predict in realtime. And it’s not that I’m trying to enforce this, saying that you should do this. But this is something that you should always
consider. Is it possible to precompute and have a lookup
table to serve the request? If so, always do that. It’s going to be much faster. Okay. Batch requests. So once you know that you’re going to do millions
of rows of work, it’s better that, instead of sending one request per prediction, you
batch your input features. Like, you put thousands of input rows right in one request, and you send the request at once, and get the response back. So by doing this, you reduce the request overhead. Because if you send, for example, 10,000 predictions in one request, you are going to get rid of all of the request overhead that was going to be there. This is really interesting: Parallelize your requests. So when you want to have a lot of work done,
instead of waiting for each request to get its response back, do the work asynchronously. You send the requests in parallel, using multiple workers, and use some kind of callback mechanism to make sure you collect the results better and faster.
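The batching-plus-workers idea can be sketched with a thread pool. `predict_batch` here is a made-up stand-in for one HTTP request that carries a whole batch; in reality it would call the prediction service shown earlier.

```python
from concurrent.futures import ThreadPoolExecutor

def predict_batch(rows):
    # Stand-in for a real HTTP call to the prediction service:
    # one request carries a whole batch, amortizing the request overhead.
    return [2 * row for row in rows]

def predict_all(rows, batch_size=1000, workers=4):
    # Split the work into batches and let several workers send
    # their requests in parallel instead of one at a time.
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves batch order, so results line up with rows.
        for partial in pool.map(predict_batch, batches):
            results.extend(partial)
    return results
```

Because throughput, not per-request latency, is the goal here, you only care that the whole list finishes quickly, not when any single batch returns.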
So let me try to summarize what I talked about. First of all, I talked about the training of models in containers. We spawn up a container every time we want to train, and get the data from Hadoop storage. Once the training is done, we store the model weights back into Hadoop storage. Now, the next step is to be able to serve
those model weights in production. In containers. We spawn up containers that have the Python
app that we want to run. It loads the model definition, as well as
the model weights. Once we have that model, it’s ready to be
served. To answer any HTTP GET request. And next we optimize serving for particular
applications. It may be a realtime application where we
want to optimize for latency. It may be some offline application where we
want to optimize for throughput. That’s all I have. And if you want to get in touch with me, you
can get in touch with me on all of these social media networks. I go by the sahildua2305 username. And yeah. Thank you! (applause)>>Thank you. That was really interesting. There’s loads of questions. I lost my volume. Okay. This is an area that I’m actively working
in as well. So I feel like I could just ask questions
all day. But I’ll try to keep to the questions that
everyone else has asked. So one of the questions was around how these
containers can make use of the GPUs, natively. So TensorFlow can let you do that sort of
through Docker things. Where is the tech at with that at the moment? Do you start to get limited when you’re using
containers?>>Yeah. So as I mentioned, we use Kubernetes to manage
our containers. And when we define the configuration for a
particular container in Kubernetes, we are able to define what are the resources that
we are going to use from the host machine. Because every container in that is going to
end up on some host machine. And we have the flexibility to mention what
resources we want to use. And we can specify we want to use GPU in that
case, and it’s going to link up or make the GPU available for this container.>>Nice. And is that just through TensorFlow? Or do other libraries support that as well? The GPU connection? Or do things like Karas and other things support
that as well?>>It has nothing to do with TensorFlow as
of now. Once you have this GPU available on a container,
once you run the TensorFlow on this container, it’s going to utilize that. So it’s just a feature of making the resources
from a host machine available to the container.>>Okay. There’s a question: Are you running this on
your own hardware, or are you using cloud providers to run the servers that are running
the containers?>>It’s a little tricky. Right now we have most of this in our own
hardware. But we are slowly moving towards using cloud.>>Have you found the cloud has caught up? Or are you still not able to do that yet,
because you’ve got such specialist hardware?>>I would say we are still not there because
of all the legacy stuff that we have. But it’s going to be there.>>A question here. Have you had issues with the size of the training
networks pushing the memory requirements on your app servers too far? Have you played with, I guess, distributed
models and other things to try and reduce that?>>Yeah, this is something that I was working
on, actually, last month. So yeah. We have tried out distributed training. TensorFlow comes with a really nice feature
of distributed training. And we have definitely tried it. We have tried to put it into production, make
it automated, so we don’t have to manually spawn all the workers and all the parameter-server and master
nodes.>>I know Google do this heavily. Distributed nets across multiple areas. And one of their active areas of research
has been creating nets that estimate what part of the net would do. And update it. So you’ve got nets living within nets, doing
estimations. So it gets crazy. So one question I had was: Once you’ve trained
your model and you’ve got a new set of weights, you obviously need to load those back in. Do you just spawn a whole new version of the
model with those in? Do you update an existing model with new weights? And how often are you updating the weights
in training?>>It’s actually a very nice question. There’s a feature in Kubernetes about how
you do your deployment, and this is exactly about how you update your model. That’s what your question is, right? So there are different options. One is… Say you have 15 containers running for the
previous version. And now you have a new version. There are a couple of things that you can
do. One is you spawn up new 50 containers and
you switch over. And another thing is you start creating the new containers and start killing the old ones as you have the new ones available. So there are limitations like — in the second
way, there are going to be two versions running at the same time. So it depends. Are you okay with having two versions at the
same time? If that’s the case, then you go for this one. Or if you want to have only one version working
at any point, you need to switch over once all the 50 new versions are ready. Just switch over from there.>>Have you experimented with migrating the
actual weights from the existing nodes to the new weights? Rather than spawning whole new containers
every time you train?>>That’s actually a good point. But no, we haven’t tried this. I think it’s not going to be as efficient,
because once we have the TensorFlow model in the memory, it’s going to take a lot of
time to go and replace all of those weights. So I think it’s still better… But yeah, it’s definitely worth trying… It’s better to create a new model, load the
new model, instead of going into all the nodes in the graph and updating their weights.>>And how regularly do you train new weights?>>Sorry?>>How often are you training new weights?>>It depends on the application. There are some models that need to be trained
on new data every day. Or there are some models that just need to be trained once and then go for months. So it really depends on what kind of model
we have. What kind of application.>>Do you see yourself moving towards realtime
updating of training in any time soon, or are we not there yet?>>What do you mean by that?>>As in it’s constantly updating the model
weights based on new data. Rather than doing a daily train, it does a
train every 10 seconds or 5 seconds.>>No. We haven’t really found an application for
that yet. But once we find something around that, we’re
definitely going to try it out.>>Cool. I guess I always find it amazing if I’m searching
on Google and I’m playing about in a different programming language to what I’m used to,
when I start typing 30 seconds later, it then starts giving me results back for that other
programming language, even if I don’t mention the language. It’s clearly switched what it’s doing to know
the language I’m in at the moment. Which I guess is some training along the route
of that. It’s fascinating. Two more questions. One question was: Why not Docker Swarm for orchestration?>>Ah. Why not Docker Swarm? So previously we were using Marathon to manage
our containers. But then we moved to Kubernetes for a couple
of reasons. Like… It has awesome community support behind it. It’s more reliable, and it’s being evolved
more often. I’m not really sure about the comparison with
Docker swarm. But definitely there are certain points why
we moved from Marathon to Kubernetes. And this is one of them.>>Last question for now. Are you planning to Open Source any of your
trained models? Maybe the image tagging model, for example? Obviously training can be expensive.>>I can’t speak for the company. I’m not sure about that.>>Do you actively research as a company? Do you try imagenet competitions and things
like that?>>Not actively. But we have tried it a couple of times before. And regarding the image tagging… We did not really Open Source the model, but
we Open Sourced the entire technique, what we did, and wrote up a blog post about it. So it’s all there, what we did, and how we
did it. Just that the end result is not there.>>Cool. We’ll tweet that out for you. Cool. Thank you very much, Sahil.
