In the data I've scraped for my own NSFW programming work, there's a great deal of content with bad tags out there. xvideos is a great example of crowdsourced tagging with horrible error rates.
Yes, even within a particular video there are lots of frames where the act is implied rather than directly shown, like a close-up of others' faces. Karpathy et al. showed they could still learn from the sports video database even without removing random crowd shots or announcer shots.
I think the quality of the data influences the result, and hand-crafting the dataset is what led to 95% accuracy on new instances.