sample input. TensorRT™ then performs inference in FP32 and gathers statistics about the
intermediate activation layers, which it uses to build the reduced-precision INT8 engine. When the
engine is built, TensorRT™ makes copies of the weights: the network definition contains pointers
to the model weights, the builder copies the weights into the optimized engine, and the parser owns
the memory occupied by the weights, so the parser object should not be deleted until after the
builder has run.
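As a rough illustration of this step, a minimal INT8 engine build with the C++ API might look like the sketch below. It assumes the TensorRT 5.x-style interfaces used elsewhere in this paper; the parser, calibrator, maxBatchSize, and workspace size are placeholders supplied by the application, not the paper's exact code.
//Sketch: build an INT8 engine from a parsed network (calibrator implemented by the application)
IBuilder* builder = createInferBuilder(gLogger);
INetworkDefinition* network = builder->createNetwork();
//... populate the network with a parser (Caffe/UFF/ONNX), keeping the parser alive for now ...
builder->setMaxBatchSize(maxBatchSize);
builder->setMaxWorkspaceSize(1 << 30);
builder->setInt8Mode(true);
builder->setInt8Calibrator(calibrator);   //IInt8Calibrator that feeds the calibration sample input
ICudaEngine* engine = builder->buildCudaEngine(*network);
//The parser may be destroyed only after buildCudaEngine() has returned
network->destroy();
builder->destroy();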
Serialize and Deserialize the model
//1-Run the builder as a prior offline step and then serialize:
IHostMemory *serializedModel = engine->serialize();
assert(serializedModel);
//Store the serialized engine (plan) to disk; p is an output file stream opened in binary mode
p.write(reinterpret_cast<const char*>(serializedModel->data()), serializedModel->size());
serializedModel->destroy();
//2-Create a runtime object to deserialize (modelData/modelSize hold the plan read back from disk):
IRuntime* runtime = createInferRuntime(gLogger);
ICudaEngine* engine = runtime->deserializeCudaEngine(modelData, modelSize, nullptr);
It is not mandatory to serialize and deserialize a model before using it for inference; if
desired, the engine object can be used for inference directly. However, since creating an engine
from the network definition can be time consuming, rebuilding the engine every time the
application runs can be avoided by serializing it once and deserializing it at inference time.
Therefore, after the engine is built, it is common to serialize it for later use [17].
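For completeness, the modelData buffer used in the deserialization call above would typically be filled by reading the plan file back from disk. The following sketch shows one way to do this; the file name and stream variables are illustrative assumptions, not the paper's exact code.
//Sketch: read a serialized engine (plan file) back into host memory before deserializing
#include <fstream>
#include <vector>
std::ifstream planFile("chexnet.plan", std::ios::binary);
planFile.seekg(0, std::ifstream::end);
size_t modelSize = planFile.tellg();
planFile.seekg(0, std::ifstream::beg);
std::vector<char> modelData(modelSize);
planFile.read(modelData.data(), modelSize);
//Pass modelData.data() and modelSize to runtime->deserializeCudaEngine()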
Perform Inference by feeding the engine
//1-Create the execution context to hold the network definition, trained parameters, and the necessary activation space:
IExecutionContext *context = engine->createExecutionContext();
//2-Use the input and output tensor names to get the corresponding input and output index:
int inputIndex = engine->getBindingIndex(input_tensor);
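The listing stops at the input binding; the remaining steps usually follow the pattern sketched below. The output tensor name, host buffers, buffer sizes, and stream handling are placeholder assumptions for illustration, not necessarily the exact code used in this paper.
//Sketch: typical completion of the inference call (requires cuda_runtime.h and NvInfer.h)
int outputIndex = engine->getBindingIndex(output_tensor);
void* buffers[2];
cudaMalloc(&buffers[inputIndex], batchSize * inputSizeInBytes);    //device memory for the input
cudaMalloc(&buffers[outputIndex], batchSize * outputSizeInBytes);  //device memory for the output
cudaStream_t stream;
cudaStreamCreate(&stream);
//Copy the input to the GPU, launch inference asynchronously, then copy the result back
cudaMemcpyAsync(buffers[inputIndex], inputHost, batchSize * inputSizeInBytes, cudaMemcpyHostToDevice, stream);
context->enqueue(batchSize, buffers, stream, nullptr);
cudaMemcpyAsync(outputHost, buffers[outputIndex], batchSize * outputSizeInBytes, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);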