TensorRT 使用指南（2）：模型量化

fp16 量化

如果显卡支持 fp16 运算加速，那么使用 fp16 量化能显著提升模型的推理速度，且由于 fp16 的表示范围为 (-65504, +65504)，一般都能包含模型的权重数值范围，所以直接截断对精度的影响很小。因此在 TensorRT 中实现 fp16 量化的方法相当简单，只需要在构建 engine 时添加一行配置即可

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

int8 量化

int8 量化能进一步压缩模型体积以及提升推理速度，但是 8bit 整数能表示的范围只有 -127~128，模型权重值一般都超过了这个范围。因此需要对权重进行 scale 和 shift，使得量化后的权重能够落在 int8 表示的范围内。但是模型推理是权重和特征张量（一般被称为 activation）的计算过程，仅量化权重是不够的，还需要对 activation 进行量化。为了更精准的量化 activation，一般的解决方案是用一批实际样本数据来做标定，也就是说根据实际样本的数值分布计算出 activation 的 scale 和 shift，然后将这些参数保存到 engine 中，这样在推理时就能直接使用这些参数进行量化了。

在 TensorRT 中，使用 int8 量化的配置如下

config.set_flag(trt.BuilderFlag.INT8)
config.int8_calibrator = EngineCalibrator(...)

其中 EngineCalibrator 是我们自定义的实现了 TensorRT 标定接口的类，目前官方推荐的两个接口 IInt8MinMaxCalibrator 和 IInt8EntropyCalibrator2，前者适用于 NLP 任务，后者适用于 CNN 任务。这里我们使用 IInt8EntropyCalibrator2 作为示例

MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

class EngineCalibrator(trt.IInt8EntropyCalibrator2):

    def __init__(self, 
                 calib_dir: str, 
                 cache_file: str, 
                 batch_size: int,
                 img_size: tuple = (224, 224)):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.calib_dir = calib_dir
        self.cache_file = cache_file
        self.batch_size = batch_size
        self.img_size = img_size
        self.img_paths = glob.glob("{}/*.jpg".format(calib_dir)) + glob.glob("{}/*.png".format(calib_dir))
        self.batch_generator = self._batch_generator()
        if not os.path.exists(os.path.dirname(cache_file)):
            os.makedirs(os.path.dirname(cache_file))
        
        nbytes = self.batch_size * 3 * self.img_size[0] * self.img_size[1] * np.dtype(np.float32).itemsize
        self.input_buffer = cudart.cudaMalloc(nbytes)[1]

    def _batch_generator(self):
        for i in range(len(self.img_paths)):
            batch = self.img_paths[i: min(i + self.batch_size, len(self.img_paths))]
            i += self.batch_size
            yield batch

    def get_batch_size(self):
        return self.batch_size
    

    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()
            
    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

    def _preprocess(self, img: np.ndarray):
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = cv2.resize(img, self.img_size)
        img = img.astype(np.float32) / 255.0
        img -= MEAN
        img /= STD
        img = img.transpose((2, 0, 1))
        return img

    def get_batch(self, names):
        try:
            img_paths = next(self.batch_generator)
            batch = []
            for img_path in img_paths:
                img = cv2.imread(img_path)
                img = self._preprocess(img)
                batch.append(img)

            batch = np.stack(batch, axis=0)
            batch = np.ascontiguousarray(batch)
            batch = batch.astype(np.float32)
            cudart.cudaMemcpy(self.input_buffer, batch.ctypes.data, batch.nbytes, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice)
            return [self.input_buffer]

        except Exception as e:
            return None

其中来自 IInt8EntropyCalibrator2 接口的方法有

get_batch_size，返回标定时的 batch size。
read_calibration_cache，读取标定缓存，如果存在的话，标定缓存指的是标定过程中计算的数据信息，因为标定过程包含大量推理，比较耗时，如果一次标定之后将数据信息保存到文件中，下次再标定时就可以直接读取，可以节省时间。
write_calibration_cache，写入标定缓存。
get_batch，返回一个 batch 的数据，这是一个包含 cuda 内存指针的列表。

完成以上配置之后，再按常规的方式构建，即可得到一个 int8 量化的 engine。

参考

Detectron 2 Mask R-CNN R50-FPN 3x in TensorRT