Grabbing just every Nth frame isn't as good as it sounds, because some shots are much longer than others. I don't want to miss a half-second shot that happens to have the great visual I wanted for that special high note in the chorus... but if I grab a frame every half-second, I'll be stuck with twenty frames from a boring ten-second shot. I have a better solution that's already more or less debugged, though. I use a distance measure called the L1 vector norm to take the difference between successive frames. There is almost always at least some difference, because of compression artifacts, even in a shot that looks like a still image. Then I look for sudden spikes in the difference. If each frame looks a lot like the one before it, but then there's suddenly a frame that's totally different, chances are that's a cut. I keep track of the cuts and use them as shot boundaries; then I grab one frame from the middle of each shot. I've built and tested this and it seems to work pretty well. Where it falls down is on shots with a lot of action (it tends to grab too many frames from those) and on shots that dissolve slowly into each other - which, fortunately, aren't that common in anime anyway.
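In rough Python, the core of the idea looks something like this. To be clear, this is a sketch, not my actual code: the synthetic frame data, the "spike = 8x the average difference so far" threshold, and the helper names are all illustrative assumptions.

```python
# Sketch of the cut-detection idea: L1 differences between successive frames,
# spikes flagged as cuts, one keyframe taken from the middle of each shot.
# Frames here are flat lists of grayscale pixel values.

def l1_distance(a, b):
    """Sum of absolute per-pixel differences between two frames."""
    return sum(abs(x - y) for x, y in zip(a, b))

def find_cuts(frames, spike_factor=8.0):
    """Return indices of frames that start a new shot (difference spikes)."""
    diffs = [l1_distance(frames[i - 1], frames[i]) for i in range(1, len(frames))]
    cuts = []
    for i, d in enumerate(diffs, start=1):
        # Compare against the average of earlier diffs (the "noise floor"
        # from compression artifacts within a shot).
        prior = diffs[:i - 1]
        baseline = sum(prior) / len(prior) if prior else 0.0
        if baseline and d > spike_factor * baseline:
            cuts.append(i)
    return cuts

def shot_keyframes(frames, cuts):
    """Pick the middle frame index of each shot delimited by the cuts."""
    bounds = [0] + cuts + [len(frames)]
    return [(lo + hi) // 2 for lo, hi in zip(bounds, bounds[1:])]

# Two "shots": near-black frames with tiny noise, then near-white frames.
shot_a = [[0, 1, 0, 2], [1, 0, 1, 2], [0, 2, 0, 1]]
shot_b = [[250, 251, 249, 250], [251, 250, 250, 249]]
frames = shot_a + shot_b
cuts = find_cuts(frames)            # the white frames start at index 3
keys = shot_keyframes(frames, cuts)  # one keyframe per shot
```

The small within-shot differences stand in for compression noise; the jump to the second "shot" dwarfs them, which is exactly the spike the detector looks for.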
For grouping similar images together, well, yes, it's a problem to define just what "similar" means. However, it's a problem that is well studied and that happens to fall pretty close to my research area (I'm a computer science grad student), so it's something I know a fair bit about. I chose the L1 norm for the shot-finding thing because it tends to flag differences based on how much of the screen area has changed, with less emphasis on how different the changed pixels are. I figure that's well suited to anime, which has a lot of flat areas of colour. When a new shot starts, a whole lot of the screen changes (background colour, even); but within a shot, you just have a person's mouth moving, or, if someone walks across the frame, the leading and trailing edges of their figure. Those pixels are totally different from frame to frame, but the number of pixels that are different is relatively small.
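A toy illustration of that last point, with made-up frame sizes and pixel values: under an L1 (sum-of-absolute-differences) measure, a handful of totally different pixels (a mouth moving) scores far below a frame where most of the screen changes moderately (a cut).

```python
# Frames are flat lists of grayscale values; sizes/values are illustrative.

def l1_distance(a, b):
    """Sum of absolute per-pixel differences between two frames."""
    return sum(abs(x - y) for x, y in zip(a, b))

flat = [128] * 100        # a still shot: flat mid-grey
mouth = flat[:]
for i in range(4):
    mouth[i] = 0          # small region, maximally different (mouth moving)
new_shot = [255] * 100    # a cut: the whole screen changes moderately

within_shot = l1_distance(flat, mouth)     # 4 pixels * 128  = 512
across_cut = l1_distance(flat, new_shot)   # 100 pixels * 127 = 12700
```

Four maximally-changed pixels barely register next to a full-screen change, which is why within-shot motion mostly stays under the cut threshold.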
I've run a few experiments with grouping images based on the L1 norm (simply because I already had the code written, for the shot-finding program) and it doesn't work very well. I'm considering switching to histogram matching: call two images the same if they have the same colour scheme, ignoring how the colours are arranged. I think that might actually be best, because the way I imagine using this is "I want to find a picture of such-and-such character." Every character tends to have his or her own skin, hair, and clothing colours. But I haven't actually tried this one yet, because although it'll be more efficient to compute, it'll require some significant changes to the software I wrote.
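If I do go the histogram route, I picture the comparison working something like this: quantise each colour channel, count pixels per bucket, and score overlap with histogram intersection. Since I haven't written this yet, the 4-bit quantisation and the intersection score are just guesses, not decisions.

```python
# Histogram-matching sketch: two images count as "similar" if their colour
# histograms overlap, regardless of where the colours sit in the frame.
# Frames are lists of (r, g, b) tuples.

def colour_histogram(pixels, bits=4):
    """Quantise each channel to `bits` bits and count pixels per bucket."""
    hist = {}
    shift = 8 - bits
    for r, g, b in pixels:
        key = (r >> shift, g >> shift, b >> shift)
        hist[key] = hist.get(key, 0) + 1
    return hist

def histogram_intersection(h1, h2, total):
    """Fraction of pixel mass the two histograms share (1.0 = identical)."""
    shared = sum(min(count, h2.get(key, 0)) for key, count in h1.items())
    return shared / total

# Same colour scheme, different arrangement: a character framed left vs right.
left  = [(230, 190, 170)] * 30 + [(40, 40, 60)] * 70   # skin, then uniform
right = [(40, 40, 60)] * 70 + [(230, 190, 170)] * 30   # same colours, swapped
other = [(20, 120, 30)] * 100                          # an unrelated scene

sim = histogram_intersection(colour_histogram(left), colour_histogram(right), 100)
diff = histogram_intersection(colour_histogram(left), colour_histogram(other), 100)
```

The two arrangements of the same palette score as identical, while the unrelated scene scores zero overlap - which is the behaviour I'd want for "find me a picture of this character."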
As for Zarxrax's suggestion: that's what I've been doing, more or less, and the problems I've found are that A. my computer is too slow, which makes that kind of zooming around painful, and B. as with the "grab every Nth frame" thing, it's hard to set the zoom level so that I don't either miss short shots or get bogged down in long shots. I'm also too impatient to do just one episode at a time; I want to be able to scan *all* my source material, *all* at once, and I want it *right now* without having to buy a new PC!
Yeah, trust me to spend 20 hours programming to save 2 hours of video searching.