sample input. TensorRT™ then performs inference in FP32 and gathers statistics about the
intermediate activation layers, which it uses to build the reduced-precision INT8 engine. When the
engine is built, TensorRT™ makes copies of the weights: the network definition contains pointers
to the model weights, the builder copies the weights into the optimized engine, and the parser owns
the memory occupied by the weights, so the parser object should not be deleted until after the
builder has run.
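As a rough illustration of this step, a minimal INT8 engine build with the C++ API might look like the sketch below. It assumes the TensorRT 5.x-style interfaces used elsewhere in this paper; the parser, calibrator, maxBatchSize, and workspace size are placeholders supplied by the application, not the paper's exact code.
//Sketch: build an INT8 engine from a parsed network (calibrator implemented by the application)
IBuilder* builder = createInferBuilder(gLogger);
INetworkDefinition* network = builder->createNetwork();
//... populate the network with a parser (Caffe/UFF/ONNX), keeping the parser alive for now ...
builder->setMaxBatchSize(maxBatchSize);
builder->setMaxWorkspaceSize(1 << 30);
builder->setInt8Mode(true);
builder->setInt8Calibrator(calibrator);   //IInt8Calibrator that feeds the calibration sample input
ICudaEngine* engine = builder->buildCudaEngine(*network);
//The parser may be destroyed only after buildCudaEngine() has returned
network->destroy();
builder->destroy();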
Serialize and Deserialize the model
//1-Run the builder as a prior offline step and then serialize:
IHostMemory *serializedModel = engine->serialize();
assert(serializedModel);
//Store the serialized engine (plan) to disk; p is an output file stream opened in binary mode
p.write(reinterpret_cast<const char*>(serializedModel->data()), serializedModel->size());
serializedModel->destroy();
//2-Create a runtime object to deserialize (modelData/modelSize hold the plan read back from disk):
IRuntime* runtime = createInferRuntime(gLogger);
ICudaEngine* engine = runtime->deserializeCudaEngine(modelData, modelSize, nullptr);
It is not mandatory to serialize and deserialize a model before using it for inference; if
desired, the engine object can be used for inference directly. However, since creating an engine
from the network definition can be time consuming, rebuilding the engine every time the
application runs can be avoided by serializing it once and deserializing it at inference time.
Therefore, after the engine is built, it is common to serialize it for later use [17].
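For completeness, the modelData buffer used in the deserialization call above would typically be filled by reading the plan file back from disk. The following sketch shows one way to do this; the file name and stream variables are illustrative assumptions, not the paper's exact code.
//Sketch: read a serialized engine (plan file) back into host memory before deserializing
#include <fstream>
#include <vector>
std::ifstream planFile("chexnet.plan", std::ios::binary);
planFile.seekg(0, std::ifstream::end);
size_t modelSize = planFile.tellg();
planFile.seekg(0, std::ifstream::beg);
std::vector<char> modelData(modelSize);
planFile.read(modelData.data(), modelSize);
//Pass modelData.data() and modelSize to runtime->deserializeCudaEngine()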
Perform Inference by feeding the engine
//1-Create the execution context to hold the network definition, trained parameters, and the necessary activation space:
IExecutionContext *context = engine->createExecutionContext();
//2-Use the input and output tensor names to get the corresponding input and output index:
int inputIndex = engine->getBindingIndex(input_tensor);
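The listing stops at the input binding; the remaining steps usually follow the pattern sketched below. The output tensor name, host buffers, buffer sizes, and stream handling are placeholder assumptions for illustration, not necessarily the exact code used in this paper.
//Sketch: typical completion of the inference call (requires cuda_runtime.h and NvInfer.h)
int outputIndex = engine->getBindingIndex(output_tensor);
void* buffers[2];
cudaMalloc(&buffers[inputIndex], batchSize * inputSizeInBytes);    //device memory for the input
cudaMalloc(&buffers[outputIndex], batchSize * outputSizeInBytes);  //device memory for the output
cudaStream_t stream;
cudaStreamCreate(&stream);
//Copy the input to the GPU, launch inference asynchronously, then copy the result back
cudaMemcpyAsync(buffers[inputIndex], inputHost, batchSize * inputSizeInBytes, cudaMemcpyHostToDevice, stream);
context->enqueue(batchSize, buffers, stream, nullptr);
cudaMemcpyAsync(outputHost, buffers[outputIndex], batchSize * outputSizeInBytes, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);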