Visual indexing of footage

- koronoru
- Joined: Mon Oct 21, 2002 10:03 am
- Location: Waterloo, Ontario

In all my video projects I've been struggling with keeping track of my source material; even a single 13-episode season of one show is about five hours of anime, with thousands of individual shots, and I have trouble finding just the shot I want for a given point, or (worse) remembering just what's available. Now, I like anime as much as the next guy, but I can't be viewing the entire show over again, even on fast forward, for every cut that I use in my video. I don't even know how those of you who do multiple-show videos organize your work. So (see my journal) I'm experimenting with the idea of building a visual index: a bunch of script-generated Web pages that show screen captures from the source footage, marked up with time references so I can quickly skim through for the shot I want and then know where to find it. I also want to build something that will group similar images together, so if I can find one that's sort of like what I want, I can see what variations are available.
Does anyone have any thoughts or reactions? Has something like this been done already? If I get it working, will anyone want copies of my indexes (which, obviously, are only useful for source material I have indexed) or my software (which will be designed for Linux and source material on DVD)?
- is
- Joined: Tue Jul 23, 2002 5:54 am
Sounds cool, but I wonder how you're going to tackle this.
Indexing isn't too hard; you could just use libxine or mplayer to generate frameshots every N frames. HTML generation would be trivial.
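Something like this, say; a minimal sketch assuming mplayer is on the PATH and was built with -vo jpeg support (the file names and page layout here are just guesses, not anything from the project):

```python
# Grab a frameshot every STEP seconds with mplayer, then write a
# bare-bones HTML index with approximate timestamps.
import os
import subprocess

SOURCE = "episode01.vob"   # hypothetical source file
OUTDIR = "frames"
STEP = 5                   # seconds between grabs

os.makedirs(OUTDIR, exist_ok=True)
subprocess.run(["mplayer", "-nosound", "-really-quiet",
                "-vo", "jpeg:outdir=" + OUTDIR,
                "-sstep", str(STEP), SOURCE], check=True)

# mplayer numbers its dumps 00000001.jpg, 00000002.jpg, ...; with
# -sstep the k-th grab lands roughly at k * STEP seconds.
names = sorted(f for f in os.listdir(OUTDIR) if f.endswith(".jpg"))
with open("index.html", "w") as page:
    page.write("<html><body>\n")
    for k, name in enumerate(names, start=1):
        t = k * STEP
        page.write('<img src="%s/%s"> %d:%02d<br>\n'
                   % (OUTDIR, name, t // 60, t % 60))
    page.write("</body></html>\n")
```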
The part I want to know is exactly how you plan to tackle visual similarity. If it's going to be automated, things like average brightness, average color, and so forth are good, easy choices, but there's so much else to take into account. Are you planning something like a browser interface, in which the user can say "well, these shots are similar, I'll group them into category X"?
If so, how does that solve the problem of having to wade through lots of material?
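For what it's worth, those easy global features are at least cheap to compute; a rough sketch using the Pillow imaging library (the file name is hypothetical):

```python
# Average colour and average brightness of one frame grab.
from PIL import Image

def global_features(path):
    img = Image.open(path).convert("RGB")
    pixels = list(img.getdata())
    n = float(len(pixels))
    avg = tuple(sum(p[c] for p in pixels) / n for c in range(3))
    brightness = sum(avg) / 3.0
    return avg + (brightness,)

# Two grabs could then be called "similar" when these vectors are
# within some hand-tuned threshold of each other.
print(global_features("frames/00000001.jpg"))
```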
It sounds like a neat idea, anyway. If I had any free time, I'd ask if I could collaborate with you on the project.

- Zarxrax
- Joined: Sun Apr 01, 2001 6:37 pm
- Contact:

Why not just convert them to DivX, then use the next-keyframe button in VirtualDub? Keyframes are at scene changes, so you probably won't miss much. You could also have a program just go through and automate this process: skip to a keyframe, save the frame as a JPG thumbnail, slap it onto an HTML page for viewing, go to the next keyframe...
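One way that automation could look without driving VirtualDub by hand: ffmpeg's decoder can be told to emit only keyframes (I-frames). This is a sketch under that assumption, not Zarxrax's exact recipe:

```python
# Dump every keyframe of a (hypothetical) DivX file as a JPG and slap
# the results onto a page; -skip_frame nokey skips all non-I-frames.
import glob
import subprocess

SOURCE = "episode01.avi"
subprocess.run(["ffmpeg", "-skip_frame", "nokey", "-i", SOURCE,
                "-vsync", "0", "-qscale:v", "4", "keyframe%04d.jpg"],
               check=True)

with open("keyframes.html", "w") as page:
    page.write("<html><body>\n")
    for name in sorted(glob.glob("keyframe*.jpg")):
        page.write('<img src="%s"><br>\n' % name)
    page.write("</body></html>\n")
```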
- koronoru
- Joined: Mon Oct 21, 2002 10:03 am
- Location: Waterloo, Ontario
Grabbing just every Nth frame isn't as good as it sounds, because some shots are much longer than others. I don't want to miss a half-second shot that happens to have the great visual I wanted for that special high note in the chorus... but then if I grab a frame every half-second, I'll be stuck with twenty frames from a boring ten-second shot. I have a better solution that's already more or less debugged, though. I use a distance measure called the L1 vector norm to take the difference between successive frames. There is almost always at least some difference because of compression artifacts, even on a shot that looks like just a still image. Then I look for sudden spikes in the difference. If each frame looks a lot like the one before it, but then there's suddenly a frame that's totally different, chances are that's a cut. I keep track of the cuts and use those as boundaries for the shots; then I grab one frame from the middle of each shot. I've built and tested this and it seems to work pretty well. Where it falls down is on shots with a lot of action (it tends to grab too many frames from those) and on shots that dissolve slowly into each other - which, fortunately, aren't that common in anime anyway.
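In code, that cut detector could look roughly like this; frames are assumed to be already decoded into numpy arrays, and the spike rule and its constants are guesses at the kind of test described, not the actual implementation:

```python
import numpy as np

def l1_distance(a, b):
    # Sum of absolute per-pixel differences over the whole frame.
    return np.abs(a.astype(np.int32) - b.astype(np.int32)).sum()

def find_cuts(frames, spike_factor=5.0, window=8):
    diffs = [l1_distance(frames[i - 1], frames[i])
             for i in range(1, len(frames))]
    cuts = []
    for i, d in enumerate(diffs):
        # Compare each difference to the recent local average; a value
        # suddenly far above that level is probably a cut.
        local = diffs[max(0, i - window):i] or diffs[:1]
        if d > spike_factor * (sum(local) / len(local)):
            cuts.append(i + 1)   # index of the first frame of the new shot
    return cuts

def representative_frames(frames, cuts):
    # One frame from the middle of each shot, as described above.
    bounds = [0] + cuts + [len(frames)]
    return [(a + b) // 2 for a, b in zip(bounds, bounds[1:])]
```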
For grouping similar images together, well, yes, it's a problem to define just what "similar" means. However, it's a problem that is well studied and that happens to fall pretty close to my research area (I'm a computer science grad student), so it's something I know a fair bit about. I chose the L1 norm for the shot-finding thing because it tends to flag differences based on how much of the screen area is different, but not much emphasis on how different it is. I figure that's well suited to anime, which has a lot of flat areas of colour. When a new shot starts, a whole lot of the screen changes (background colour, even); but within a shot, you just have a person's mouth moving, or if someone walks across the frame, the leading and trailing edges of their figure. Those pixels are totally different from frame to frame, but the number of pixels that are different is relatively small.
I've run a few experiments with grouping images based on the L1 norm (simply because I already had the code written, for the shot-finding program) and it doesn't work very well. I'm considering switching to histogram matching - call two images the same if they have the same colour scheme, ignoring just how the colours are arranged. I think that might actually be best because how I imagine using this is "I want to find a picture of such-and-such character". Every character tends to have his or her own skin, hair, and clothing colours. But I haven't actually tried this one yet, because although it'll be more efficient to compute, it'll require some significant changes to the software I wrote.
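A sketch of that histogram-matching idea (the bin count and the match threshold are placeholders, not values from the experiments described):

```python
import numpy as np

def colour_histogram(frame, bins=8):
    # Quantize each RGB channel to `bins` levels and build a joint 3-D
    # colour histogram, normalized so frame size doesn't matter.
    q = (frame // (256 // bins)).reshape(-1, 3)
    hist, _ = np.histogramdd(q, bins=(bins, bins, bins),
                             range=((0, bins), (0, bins), (0, bins)))
    return hist / hist.sum()

def same_colour_scheme(f1, f2, threshold=0.25):
    # Ignores how the colours are arranged, which is the point: two
    # frames match if their colour distributions are close in L1.
    return np.abs(colour_histogram(f1) - colour_histogram(f2)).sum() < threshold
```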
As for Zarxrax's suggestion: that's what I've been doing, more or less, and the problems I've found are that A. my computer is too slow, making that kind of zooming around kind of painful, and B. as with the "grab every Nth frame" thing, it's hard to set the zoom level so that I don't either miss short shots or get bogged down in long shots. I'm also too impatient to do just one episode at a time; I want to be able to scan *all* my source material, *all* at once, and I want it *right now* without having to buy a new PC!
Yeah, trust me to spend 20 hours programming to save 2 hours of video searching.