I've given it a try. It's rather easy to set up (but be careful about the dependencies and hardware support); however, it's a VRAM hog: 24GB is the bare minimum, and 30GB is better (with the detokenizer disabled).
Basically, it's OK-ish at identifying sex scenes. With a prompt that specifically says to expect sex scenes, I sometimes get correct results, but it often says "a dog is barking".

The main advantage is that there's no need to train it on specific data to make it "work". However, since they released the raw model, fine-tuning it for sex scene identification should be feasible.
I'll post more details when I get them from my other computer.