About
Cog implementation of Falconsai/nsfw_image_detection, extended to support video models
Model Card
Fine-Tuned Vision Transformer (ViT) for NSFW Image Classification
Model Description
The Fine-Tuned Vision Transformer (ViT) is a variant of the transformer encoder architecture, similar to BERT, adapted for image classification. This model is based on google/vit-base-patch16-224-in21k, which was pre-trained in a supervised manner on the ImageNet-21k dataset. Images in the pre-training dataset were resized to a resolution of 224x224 pixels, making the model suitable for a wide range of image recognition tasks.
Intended Uses & Limitations
NSFW Image Classification: The primary intended use of this model is for the classification of NSFW (Not Safe for Work) images. It has been fine-tuned for this purpose, making it suitable for filtering explicit or inappropriate content in various applications.
For each image, the model returns one of two labels: "normal" or "nsfw".
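As a minimal sketch, the underlying Hugging Face model can be queried directly with the transformers pipeline API (independently of the Cog wrapper). The blank test image below is a placeholder; in practice you would load a real file with Image.open(path):

```python
from PIL import Image
from transformers import pipeline

# Load the fine-tuned NSFW classifier from the Hugging Face Hub.
classifier = pipeline("image-classification", model="Falconsai/nsfw_image_detection")

# Placeholder input; substitute Image.open("your_image.jpg") for real use.
img = Image.new("RGB", (224, 224), color="white")

# Each prediction is a dict with a "label" ("normal" or "nsfw") and a "score".
predictions = classifier(img)
print(predictions)
```

The top-ranked label in the returned list is the model's classification for the image.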