Evaluation of algorithms for Multi-Modality Whole Heart Segmentation: An open-access grand challenge.
Zhuang X., Li L., Payer C., Štern D., Urschler M., Heinrich MP., Oster J., Wang C., Smedby Ö., Bian C., Yang X., Heng P-A., Mortazi A., Bagci U., Yang G., Sun C., Galisot G., Ramel J-Y., Brouard T., Tong Q., Si W., Liao X., Zeng G., Shi Z., Zheng G., Wang C., MacGillivray T., Newby D., Rhode K., Ourselin S., Mohiaddin R., Keegan J., Firmin D., Yang G.
Knowledge of whole heart anatomy is a prerequisite for many clinical applications. Whole heart segmentation (WHS), which delineates substructures of the heart, can be very valuable for modeling and analysis of the anatomy and functions of the heart. However, automating this segmentation can be challenging due to the large variation of the heart shape, and different image qualities of the clinical data. To achieve this goal, an initial set of training data is generally needed for constructing priors or for training. Furthermore, it is difficult to perform comparisons between different methods, largely due to differences in the datasets and evaluation metrics used. This manuscript presents the methodologies and evaluation results for the WHS algorithms selected from the submissions to the Multi-Modality Whole Heart Segmentation (MM-WHS) challenge, in conjunction with MICCAI 2017. The challenge provided 120 three-dimensional cardiac images covering the whole heart, including 60 CT and 60 MRI volumes, all acquired in clinical environments with manual delineation. Ten algorithms for CT data and eleven algorithms for MRI data, submitted from twelve groups, have been evaluated. The results showed that the performance of CT WHS was generally better than that of MRI WHS. The segmentation of the substructures for different categories of patients could present different levels of challenge due to the difference in imaging and variations of heart shapes. The deep learning (DL)-based methods demonstrated great potential, though several of them reported poor results in the blinded evaluation. Their performance could vary greatly across different network structures and training strategies. The conventional algorithms, mainly based on multi-atlas segmentation, demonstrated good performance, though the accuracy and computational efficiency could be limited. The challenge, including provision of the annotated training data and the blinded evaluation for submitted algorithms on the test data, continues as an ongoing benchmarking resource via its homepage (www.sdspeople.fudan.edu.cn/zhuangxiahai/0/mmwhs/).