Post
by rose4emily » Mon Oct 04, 2004 3:35 am
I actually have developed a tree-based scheme that provides a computationally efficient method for evaluating an input string against a list of pre-existing strings. It was originally implemented as a way of quickly identifying and removing any of a large number of similar strings from very large files (multi-gigabyte ASCII text files, if you can imagine such a thing), but could be easily modified to simply return whether or not the new string is a match for any of the strings on the list, and a set of similar strings from the list if there was no exact match. In this manner, a new string (artist) can be evaluated against the existing list (known artists) and a prompt can be shown if there is no exact match asking the user if they meant to write say, "Evanescence" rather than "Evenescence".
If the database is relational, for that matter, it shouldn't be too difficult to correct the existing entries (at least the misspellings) before compiling the list of "known artists".
Personally, I'd suggest that "known artists" be hand-evaluated, and that a second list of "possible artists" be created when the user overrides the system's suggestions for "known artist".
If one of the admins is interested, they can e-mail me a sample list of "known artists", a set of sample inputs, and a description of the system this routine would be integrated into so I would have some idea of what I need to do to the program to make it something that they can include in their existing system. The current implementation is programmed in Java, my language of choice, but is very computationally efficient (it minimizes object creation - the primary speedtrap of the Java language) and shouldn't present a large increase in system load, especially if used only on submissions, rather than searches.
It has O(n) scalability with respect to the number of input strings being presented to it, while scalabiltiy in terms of the number of strings to be searched against and the length of those strings gets more complicated. These also have an O(n) worst-case scenario (the same as is encountered for every string comparison scheme I am aware of), but would have typical performance far better than that thanks to the manner in which the tree benefits from the natural redundancy in words and also runs down that tree intelligently with respect to the word for which the comparison is being made.
may seeds of dreams fall from my hands -
and by yours be pressed into the ground.