Collaborative Annotation of Semantic Objects in Images with Multi-granularity Supervisions

Lishi Zhang*    Chenghan Fu*    Jia Li+

* equal contribution

State Key Laboratory of Virtual Reality Technology and Systems, Beihang University

Published in ACM Multimedia, July 2018

Per-pixel masks of semantic objects are useful in many applications but tedious to annotate. In this paper, we propose a human-agent collaborative annotation approach that efficiently generates per-pixel masks of semantic objects in tagged images with multi-granularity supervisions. Given a set of tagged images, a computer agent is first dynamically generated to roughly localize the semantic objects described by the tag. The agent first extracts massive object proposals from an image and then infers the tag-related ones under weak and strong supervisions from linguistically and visually similar images and previously annotated object masks. By representing such supervisions with over-complete dictionaries, the tag-related object proposals pop out according to their sparse coding length and are then converted to superpixels with binary labels. After that, human annotators participate in the annotation process by flipping labels and dividing superpixels with mouse clicks, which are used as click supervisions that teach the agent to correct false positives/negatives when processing images with the same tags. Experimental results show that our approach can facilitate the annotation process and generate object masks that are highly consistent with those generated by the LabelMe toolbox.
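
The idea of letting tag-related proposals pop out by their sparse coding length can be illustrated with a minimal sketch. It is not the implementation used in the paper: it assumes each proposal is described by a feature vector, that a dictionary `D` holds one atom per column, and that the lasso objective value (reconstruction error plus an L1 penalty) is an acceptable stand-in for the coding length. The function names `sparse_code`, `coding_cost` and `rank_proposals`, the penalty `lam` and the toy dictionary are all hypothetical.

```python
import numpy as np

def sparse_code(D, x, lam=0.1, n_iter=200):
    """Approximate argmin_a 0.5*||x - D a||^2 + lam*||a||_1 with ISTA."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - D.T @ (D @ a - x) / L      # gradient step on the quadratic term
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return a

def coding_cost(D, x, lam=0.1):
    """Lasso objective at the sparse code; assumed proxy for coding length."""
    a = sparse_code(D, x, lam)
    return 0.5 * np.sum((x - D @ a) ** 2) + lam * np.sum(np.abs(a))

def rank_proposals(D, features, lam=0.1):
    """Return proposal indices sorted from cheapest to most expensive to encode."""
    costs = np.array([coding_cost(D, f, lam) for f in features])
    return np.argsort(costs)

# Toy over-complete dictionary: 4 atoms spanning only the first two axes of R^3.
s = 1.0 / np.sqrt(2.0)
D = np.array([[1.0, 0.0, s,  s],
              [0.0, 1.0, s, -s],
              [0.0, 0.0, 0.0, 0.0]])
features = [np.array([0.0, 0.0, 1.0]),   # not representable by the dictionary
            np.array([1.0, 0.0, 0.0])]   # matches a dictionary atom
order = rank_proposals(D, features)      # the second proposal ranks first
```

Under these assumptions, a proposal that the dictionary reconstructs well with few atoms receives a low cost and is treated as tag-related, while a proposal outside the span of the dictionary keeps a high cost.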


Images from 40 ImageNet classes annotated with our annotation tool and LabelMe. (a,g) Tagged images, (b,h) DeepMask, (c,i) SharpMask, (d,j) Our initialization results, (e,k) Our final annotation results, (f,l) LabelMe results (ground-truth).


Framework of the proposed approach. Given a set of images tagged with "Cat," a computer agent is dynamically generated with weak, strong and flip dictionaries. It first extracts object proposals and superpixels, and the tag-related object proposals are then inferred by measuring their sparse coding length over the weak and strong dictionaries. After the tag-related objects are converted into binary superpixel labels, the human annotator can flip a superpixel label or divide a coarse superpixel into finer ones via mouse clicks. Such clicks are then used to form flip dictionaries, which supervise the automatic refinement of subsequent images.
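
The conversion of a proposal into binary superpixel labels, and the flipping of a label by a mouse click, can be sketched as follows. This is an illustration under stated assumptions rather than the tool's actual code: the superpixels are assumed to be given as an integer label map, a superpixel is assumed to be positive when at least half of its pixels fall inside the proposal mask, and the names `init_superpixel_labels`, `flip_at_click` and `labels_to_mask` are hypothetical. Dividing a superpixel into finer ones and building the flip dictionaries are not covered.

```python
import numpy as np

def init_superpixel_labels(superpixels, proposal_mask, thresh=0.5):
    """Label a superpixel 1 if at least `thresh` of its pixels lie in the mask."""
    labels = {}
    for sp_id in np.unique(superpixels):
        region = superpixels == sp_id
        labels[int(sp_id)] = int(proposal_mask[region].mean() >= thresh)
    return labels

def flip_at_click(superpixels, labels, y, x):
    """Flip the binary label of the superpixel under a click at row y, column x."""
    sp_id = int(superpixels[y, x])
    labels[sp_id] = 1 - labels[sp_id]
    return sp_id

def labels_to_mask(superpixels, labels):
    """Expand per-superpixel labels back into a per-pixel boolean mask."""
    lut = np.zeros(int(superpixels.max()) + 1, dtype=bool)
    for sp_id, lab in labels.items():
        lut[sp_id] = bool(lab)
    return lut[superpixels]

# Toy 2x4 image with two superpixels (left half = 0, right half = 1).
superpixels = np.array([[0, 0, 1, 1],
                        [0, 0, 1, 1]])
proposal_mask = np.array([[1, 1, 0, 0],
                          [1, 0, 0, 0]], dtype=bool)
labels = init_superpixel_labels(superpixels, proposal_mask)  # left on, right off
flip_at_click(superpixels, labels, y=0, x=3)                 # annotator adds the right half
mask = labels_to_mask(superpixels, labels)
```

Working on superpixels rather than pixels means that a single click changes a whole region of the mask, which is what keeps the number of annotator interactions small.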


Two state-of-the-art automatic segmentation models are tested on ImageNet (40 classes).