[Info-vax] Looking for some text search ideas
Paul Sture
nospam at sture.ch
Fri Sep 26 19:46:11 EDT 2014
On 2014-09-26, Bill Pedersen <pedersen at ccsscorp.com> wrote:
> On 9/26/2014 1:47 PM, Bill Pedersen wrote:
>>
>> The specifics of how you do your search and handle potential matches have
>> been researched over the years.
>>
>> A recent example was to search only for the leading letter of the
>> string. On a match, check the character at the position of the last
>> letter of the desired string; if that does not match, continue the
>> search for the first letter at the position after the last letter
>> that failed. This has been shown to speed up the searches.
>>
>> It is not clear how much more you can do as far as improving search
>> performance but I am certain there are papers on this and other options.
>>
>
> Although on reflection I believe that leaves a hole, and that you actually
> need to resume the search just after the matched first character once the
> terminal character of the desired string fails to match.
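
For what it's worth, here is a minimal Python sketch of the scheme as I read it, including the correction about where to resume. The function name find_first_last is my own invention, and this is an illustration of the idea rather than a tuned implementation.

```python
def find_first_last(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    m = len(pattern)
    first, last = pattern[0], pattern[-1]
    i = text.find(first)
    # Stop once the pattern can no longer fit in the remaining text.
    while i != -1 and i + m <= len(text):
        # Cheap filter: test the last character before the full comparison.
        if text[i + m - 1] == last and text[i:i + m] == pattern:
            return i
        # Resume just after the matched first character, not after the
        # position of the failed last character, so that overlapping
        # candidates are not skipped.
        i = text.find(first, i + 1)
    return -1
```

Searching for "ab" in "aab" shows the hole: the candidate at position 0 fails on its last character, and skipping past that position would miss the real match at position 1.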
You also need to take word delimiters into account. My first run on a
similar project earlier this year returned far more results than I had
anticipated. For the word 'parliament' I got back parliaments,
parliamentary, parliamentarian, parliamentarians, parliamentarianism - I
think that was it :-)
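
To illustrate the difference, here is a small Python sketch that anchors the search term at word boundaries using the standard re module. The helper name find_whole_word is hypothetical, and \b only knows about letters, digits and underscore, so treat it as a starting point rather than a complete answer.

```python
import re

def find_whole_word(word, text):
    """Return every whole-word, case-insensitive occurrence of word in text."""
    # re.escape protects any regex metacharacters in the search term;
    # \b matches at the edge between a word character and a non-word one.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return pattern.findall(text)

sample = "Parliament met; parliamentary rules, parliaments and parliamentarians."
substring_hits = sample.lower().count("parliament")   # plain substring search
whole_word_hits = find_whole_word("parliament", sample)
```

On the sample line the plain substring count picks up all four variants, while the boundary-anchored search returns only the one standalone word.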
Space, comma and period are not the only word delimiters of course, and
with part descriptions, weights and measures may be involved.*

Achieving a satisfactory search solution here will quite possibly involve
a data cleaning exercise on the existing database(s).
* Obligatory comment: beware data coming from users who have "smart quotes"
or other typographical "enhancements" switched on - (oops was that an
emdash?) - they make a mess of feet and inches symbols and other characters
you may wish to match on in a database search.
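
As a rough idea of what part of such a cleaning pass might look like, here is a Python sketch that folds the usual typographic offenders back to plain ASCII before indexing or searching. The table is my own short list and certainly not exhaustive, and the name normalize_typography is hypothetical.

```python
# Map common typographic characters to the ASCII characters users
# usually meant. This list is an assumption, not a complete inventory.
TYPOGRAPHIC_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2032": "'",   # prime, often used for feet
    "\u2033": '"',   # double prime, often used for inches
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
})

def normalize_typography(value):
    """Return value with typographic quotes and dashes folded to ASCII."""
    return value.translate(TYPOGRAPHIC_MAP)
```

Applying the same normalization to both the stored descriptions and the incoming search term keeps the two sides comparable.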
--
A quick recap of Thursday 25th September 2014:
http://pbs.twimg.com/media/ByZfyyXIQAAXTai.jpg
Happy Thursday!