The requirement here is to use several local GPUs to run inference with a model that only supports a single GPU, so that the full multi-GPU capacity is exploited, inference is faster, and time is saved. Because the model runs its GPU computation through torch, simply calling Python's built-in multiprocessing does not work; torch.multiprocessing has to be used instead. The latter supports exactly the same operations as the former, but extends it so that every tensor sent through a multiprocessing.Queue has its data moved into shared memory, and only a handle is sent to the other process.
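As a quick illustration of that hand-off (not part of the original post), here is a minimal sketch of passing a tensor to a worker through a torch.multiprocessing.Queue; the worker function and the toy tensor are made up for the example:

# Minimal sketch: the tensor's data ends up in shared memory (or is shared via a
# CUDA IPC handle), so the worker receives a handle rather than a full copy.
import torch
import torch.multiprocessing as mp


def worker(queue):
    t = queue.get()                    # rebuilt from a shared-memory / IPC handle
    print(t.device, t.sum().item())


if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)   # CUDA in subprocesses requires "spawn"
    q = mp.Queue()
    p = mp.Process(target=worker, args=(q,))
    p.start()
    x = torch.ones(4, device="cuda:0") if torch.cuda.is_available() else torch.ones(4)
    q.put(x)                           # no full serialization of the tensor data
    p.join()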
Using multiple GPUs with multiple processes

import math
import os

import torch
import torch.multiprocessing as mp
from tqdm import tqdm

from fluorescence import add_pseudo_color, detect_fluorescence
from utils import show_allfiles


def split_list_to_nested_list(img_path_list, divider=8):
    """
    Split a long list into evenly sized sublists and return them as a nested list.
    @param img_path_list: list of image paths
    @param divider: number of sublists to produce
    """
    stride = int(math.ceil(len(img_path_list) / divider))
    img_paths_nested = [
        img_path_list[i * stride : i * stride + stride] for i in range(divider)
    ]
    return img_paths_nested


if __name__ == "__main__":
    # CUDA contexts cannot be forked; child processes must be spawned
    mp.set_start_method("spawn", force=True)

    # one chunk of images per visible GPU (fall back to CPU if none is available)
    divider = torch.cuda.device_count()
    devices = [
        torch.device(f"cuda:{i}") if torch.cuda.is_available() else torch.device("cpu")
        for i in range(divider)
    ]

    data_path = "/disk0/images"
    img_paths = show_allfiles(path=data_path)
    img_paths_nested = split_list_to_nested_list(img_path_list=img_paths, divider=divider)
    for i, p in enumerate(img_paths_nested):
        print(f"Chunk {i}: {len(p)} images")

    threshold_remove_flu = 8.1
    processes = []
    for dev, imgs in zip(devices, img_paths_nested):
        # one worker process per GPU, each running the single-GPU detector on its own chunk
        p = mp.Process(
            target=detect_fluorescence,
            args=(
                imgs,
                dev,
                "vit_h",
                "/disk1/datasets/models/sam/sam_vit_h_4b8939.pth",
                threshold_remove_flu,
                64,
                0.75,
                0.75,
                100,
                1500,
                150000,
                0.5,
            ),
            name=f"Process-{dev}",
        )
        p.start()
        processes.append(p)
        print(f"Started {p.name}")

    for p in processes:
        p.join()
        print(f"Finished {p.name}")

    print("Finished all")
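torch.multiprocessing also provides a spawn() helper that launches one process per rank and handles joining for you. Below is a minimal sketch (not from the original script) of the same fan-out pattern with it; run_on_one_gpu and the placeholder image chunks are purely illustrative:

# Sketch: mp.spawn calls the worker as fn(rank, *args), one process per rank,
# so the per-device slicing can happen inside the worker instead of in the parent.
import torch
import torch.multiprocessing as mp


def run_on_one_gpu(rank, chunks):
    device = torch.device(f"cuda:{rank}") if torch.cuda.is_available() else torch.device("cpu")
    imgs = chunks[rank]
    print(f"rank {rank}: {len(imgs)} images on {device}")
    # this is where detect_fluorescence(imgs, device, ...) would run, as in the script above


if __name__ == "__main__":
    n_procs = max(torch.cuda.device_count(), 1)
    chunks = [[f"img_{i}.png"] for i in range(n_procs)]   # placeholder slices for illustration
    mp.spawn(run_on_one_gpu, args=(chunks,), nprocs=n_procs, join=True)

Either way, each process loads its own copy of the model onto its own device, so GPU memory usage grows with the number of processes while wall-clock time shrinks roughly in proportion to the number of GPUs.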
References
Multiprocessing best practices
The simplest and most practical way to set up PyTorch multi-process distributed training