OpenSDI: Spotting Diffusion-Generated Images in the Open World

OpenSDID Dataset

We introduce OpenSDID, a large-scale dataset specifically curated for the OpenSDI challenge. Its design addresses three requirements essential for open-world spotting of AI-generated content: user diversity, model innovation, and manipulation scope.

OpenSDID comprises 300,000 images, evenly split between real and fake samples and divided into training and testing sets. Refer to the CVPR'25 paper [14] for further dataset details.

OpenSDID Dataset Pipeline

The OpenSDID pipeline for local modification of real image content: (A) sampling real images from the Megalith-10M dataset, (B) generating textual editing instructions with Vision Language Models (VLMs), (C) creating visual masks for modification with segmentation models, and (D) producing AI-generated images with image generators conditioned on the instructions and masks. For global image content generation, OpenSDID uses only (B) and (D), without real images or masks.
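Steps (A)-(D) above can be sketched as a simple driver loop. This is an illustrative outline only: the VLM, segmentation, and generator calls below are placeholder stubs with hypothetical names, not the released implementation.

```python
# Sketch of the local-modification pipeline (A)-(D). Every function body here
# is a placeholder stub standing in for a real model call.

def sample_real_image(dataset):
    # (A) Draw a real image (placeholder: take the first item).
    return dataset[0]

def generate_edit_instruction(image):
    # (B) A VLM would write an edit prompt; placeholder output below.
    return "replace the dog with a cat"

def generate_mask(image, instruction):
    # (C) A segmentation model would localize the region to edit;
    # placeholder 2x2 binary mask (1 = region to modify).
    return [[1, 0], [0, 0]]

def generate_image(image, instruction, mask):
    # (D) A diffusion model would inpaint the masked region (local) or
    # synthesize a full image from the prompt alone (global).
    return {"image": image, "prompt": instruction, "mask": mask}

def local_edit(dataset):
    # Local modification uses all four steps.
    img = sample_real_image(dataset)
    prompt = generate_edit_instruction(img)
    mask = generate_mask(img, prompt)
    return generate_image(img, prompt, mask)

def global_generation(prompt_source):
    # Global generation skips (A) and (C): text-to-image from a prompt only.
    prompt = generate_edit_instruction(prompt_source)
    return generate_image(None, prompt, None)
```

For a local edit the fake image inherits a ground-truth mask directly from step (C), which is what enables the pixel-level localization benchmark below.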

OpenSDID Samples

Some samples of OpenSDID.

Leaderboard

Pixel-level Localization Performance

| Method | SD1.5 IoU | SD1.5 F1 | SD2.1 IoU | SD2.1 F1 | SDXL IoU | SDXL F1 | SD3 IoU | SD3 F1 | Flux.1 IoU | Flux.1 F1 | AVG IoU | AVG F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MVSS-Net [1] | 0.5785 | 0.6533 | 0.4490 | 0.5176 | 0.1467 | 0.1851 | 0.2692 | 0.3271 | 0.0479 | 0.0636 | 0.2983 | 0.3493 |
| CAT-Net [2] | 0.6636 | 0.7480 | 0.5458 | 0.6232 | 0.2550 | 0.3074 | 0.3555 | 0.4207 | 0.0497 | 0.0658 | 0.3739 | 0.4330 |
| PSCC-Net [3] | 0.5470 | 0.6422 | 0.3667 | 0.4479 | 0.1973 | 0.2605 | 0.2926 | 0.3728 | 0.0816 | 0.1156 | 0.2970 | 0.3678 |
| ObjectFormer [4] | 0.5119 | 0.6568 | 0.4739 | 0.4144 | 0.0741 | 0.0984 | 0.0941 | 0.1258 | 0.0529 | 0.0731 | 0.2414 | 0.2737 |
| TruFor [5] | 0.6342 | 0.7100 | 0.5467 | 0.6188 | 0.2655 | 0.3185 | 0.3229 | 0.3852 | 0.0760 | 0.0970 | 0.3691 | 0.4259 |
| DeCLIP [6] | 0.3718 | 0.4344 | 0.3569 | 0.4187 | 0.1459 | 0.1822 | 0.2734 | 0.3344 | 0.1121 | 0.1429 | 0.2520 | 0.3025 |
| IML-ViT [7] | 0.6651 | 0.7362 | 0.4479 | 0.5063 | 0.2149 | 0.2597 | 0.2363 | 0.2835 | 0.0611 | 0.0791 | 0.3251 | 0.3730 |
| MaskCLIP [14] | 0.6712 | 0.7563 | 0.5550 | 0.6289 | 0.3098 | 0.3700 | 0.4375 | 0.5121 | 0.1622 | 0.2034 | 0.4271 | 0.4941 |
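The pixel-level IoU and F1 scores above compare a predicted manipulation mask against the ground-truth mask. A minimal sketch of how these two metrics are typically computed from binary masks (this is the standard definition, not necessarily the paper's exact evaluation code):

```python
import numpy as np

def iou_f1(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """Pixel-level IoU and F1 between binary masks (1 = manipulated pixel)."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()    # pixels flagged in both masks
    fp = np.logical_and(pred, ~gt).sum()   # flagged by prediction only
    fn = np.logical_and(~pred, gt).sum()   # missed ground-truth pixels
    iou = tp / (tp + fp + fn + eps)        # intersection over union
    f1 = 2 * tp / (2 * tp + fp + fn + eps) # harmonic mean of precision/recall
    return float(iou), float(f1)
```

Note that F1 is always at least as large as IoU for the same masks, which is why each method's F1 column dominates its IoU column in the table.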

Image-level Detection Performance

| Method | SD1.5 F1 | SD1.5 Acc | SD2.1 F1 | SD2.1 Acc | SDXL F1 | SDXL Acc | SD3 F1 | SD3 Acc | Flux.1 F1 | Flux.1 Acc | AVG F1 | AVG Acc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CNNDet [8] | 0.8460 | 0.8504 | 0.7156 | 0.7594 | 0.5970 | 0.6872 | 0.5627 | 0.6708 | 0.3572 | 0.5757 | 0.6157 | 0.7087 |
| GramNet [9] | 0.8051 | 0.8035 | 0.7401 | 0.7666 | 0.6528 | 0.7076 | 0.6435 | 0.7029 | 0.5200 | 0.6337 | 0.6723 | 0.7229 |
| FreqNet [10] | 0.7588 | 0.7770 | 0.6097 | 0.6837 | 0.5315 | 0.6402 | 0.5350 | 0.6437 | 0.3847 | 0.5708 | 0.5639 | 0.6631 |
| NPR [11] | 0.7941 | 0.7928 | 0.8167 | 0.8184 | 0.7212 | 0.7428 | 0.7343 | 0.7547 | 0.6762 | 0.7136 | 0.7485 | 0.7645 |
| UniFD [12] | 0.7745 | 0.7760 | 0.8062 | 0.8192 | 0.7074 | 0.7483 | 0.7109 | 0.7517 | 0.6110 | 0.6906 | 0.7220 | 0.7572 |
| RINE [13] | 0.9108 | 0.9098 | 0.8747 | 0.8812 | 0.7343 | 0.7876 | 0.7205 | 0.7678 | 0.5586 | 0.6702 | 0.7598 | 0.8033 |
| MVSS-Net [1] | 0.9347 | 0.9365 | 0.7927 | 0.8233 | 0.5985 | 0.7042 | 0.6280 | 0.7213 | 0.2759 | 0.5678 | 0.6460 | 0.7506 |
| CAT-Net [2] | 0.9615 | 0.9615 | 0.7932 | 0.8246 | 0.6476 | 0.7334 | 0.6526 | 0.7361 | 0.2266 | 0.5526 | 0.6563 | 0.7616 |
| PSCC-Net [3] | 0.9607 | 0.9614 | 0.7685 | 0.8094 | 0.5570 | 0.6881 | 0.5978 | 0.7089 | 0.5177 | 0.6704 | 0.6803 | 0.7676 |
| ObjectFormer [4] | 0.7172 | 0.7522 | 0.6679 | 0.7255 | 0.4919 | 0.6292 | 0.4832 | 0.6254 | 0.3792 | 0.5805 | 0.5479 | 0.6626 |
| TruFor [5] | 0.9012 | 0.9773 | 0.3593 | 0.5562 | 0.5804 | 0.6641 | 0.5973 | 0.6751 | 0.4912 | 0.6162 | 0.5859 | 0.6978 |
| DeCLIP [6] | 0.8068 | 0.7831 | 0.8402 | 0.8277 | 0.7069 | 0.7055 | 0.6993 | 0.6840 | 0.5177 | 0.6561 | 0.7142 | 0.7313 |
| IML-ViT [7] | 0.9447 | 0.7573 | 0.6970 | 0.6119 | 0.4098 | 0.4995 | 0.4469 | 0.5125 | 0.1820 | 0.4362 | 0.5361 | 0.5635 |
| MaskCLIP [14] | 0.9264 | 0.9272 | 0.8871 | 0.8945 | 0.7802 | 0.8122 | 0.7307 | 0.7801 | 0.5649 | 0.6850 | 0.7779 | 0.8198 |
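At the image level, the task is binary classification (real vs. fake), so the F1 and accuracy columns above follow the standard definitions. A minimal sketch, assuming fake is the positive class and a hypothetical 0.5 decision threshold on the detector's score:

```python
import numpy as np

def detection_f1_acc(scores, labels, threshold=0.5):
    """Image-level F1 and accuracy; label 1 = fake (positive), 0 = real."""
    pred = (np.asarray(scores, dtype=float) >= threshold).astype(int)
    labels = np.asarray(labels, dtype=int)
    tp = int(((pred == 1) & (labels == 1)).sum())  # fakes caught
    fp = int(((pred == 1) & (labels == 0)).sum())  # reals flagged as fake
    fn = int(((pred == 0) & (labels == 1)).sum())  # fakes missed
    acc = float((pred == labels).mean())
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return f1, acc
```

Because OpenSDID's test sets are balanced between real and fake images, accuracy near 0.5 (as in several Flux.1 columns) indicates near-chance detection.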

References

[1] Xinru Chen, Chengbo Dong, Jiaqi Ji, Juan Cao, and Xirong Li. Image manipulation detection by multi-view multi-scale supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.

[2] Myung-Joon Kwon, Seung-Hun Nam, In-Jae Yu, Heung-Kyu Lee, and Changick Kim. Learning JPEG compression artifacts for image manipulation detection and localization. International Journal of Computer Vision (IJCV), 2022.

[3] Xiaohong Liu, Yaojie Liu, Jun Chen, and Xiaoming Liu. PSCC-Net: Progressive spatio-channel correlation network for image manipulation detection and localization. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2022.

[4] Junke Wang, Zuxuan Wu, Jingjing Chen, Xintong Han, Abhinav Shrivastava, Ser-Nam Lim, and Yu-Gang Jiang. ObjectFormer for image manipulation detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

[5] Fabrizio Guillaro, Davide Cozzolino, Avneesh Sud, Nicholas Dufour, and Luisa Verdoliva. TruFor: Leveraging all-round clues for trustworthy image forgery detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

[6] Stefan Smeu, Elisabeta Oneata, and Dan Oneata. DeCLIP: Decoding CLIP representations for deepfake localization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025.

[7] Xiaochen Ma, Bo Du, Xianggen Liu, Ahmed Y Al Hammadi, and Jizhe Zhou. IML-ViT: Image manipulation localization by vision transformer. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2024.

[8] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. CNN-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[9] Zhengzhe Liu, Xiaojuan Qi, and Philip HS Torr. Global texture enhancement for fake face detection in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[10] Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Frequency-aware deepfake detection: Improving generalizability through frequency space learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2024.

[11] Chuangchuang Tan, Huan Liu, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Rethinking the up-sampling operations in CNN-based generative network for generalizable deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[12] Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards universal fake image detectors that generalize across generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

[13] Christos Koutlis and Symeon Papadopoulos. Leveraging representations from intermediate encoder-blocks for synthetic image detection. In European Conference on Computer Vision (ECCV), 2024.

[14] Yabin Wang, Zhiwu Huang, and Xiaopeng Hong. OpenSDI: Spotting diffusion-generated images in the open world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

Results Showcase

The following qualitative examples compare MaskCLIP with other methods on the OpenSDID dataset, showcasing detection and localization performance on images generated by different diffusion models.

[Result examples for images generated by SD1.5, SD2.1, SDXL, SD3, and Flux.1]

Download & Code

The OpenSDID dataset and the code for MaskCLIP are open-sourced on GitHub:

https://github.com/iamwangyabin/OpenSDI

Citation

If you use the OpenSDID dataset or MaskCLIP model in your research, please cite our paper:


@InProceedings{wang2025opensdi,
    author    = {Wang, Yabin and Huang, Zhiwu and Hong, Xiaopeng},
    title     = {OpenSDI: Spotting Diffusion-Generated Images in the Open World},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year      = {2025}
}