[Info-vax] Looking for some text search ideas
David Froble
davef at tsoft-inc.com
Fri Sep 26 21:00:16 EDT 2014
Paul Sture wrote:
> On 2014-09-26, Bill Pedersen <pedersen at ccsscorp.com> wrote:
>> On 9/26/2014 1:47 PM, Bill Pedersen wrote:
>>> The specifics of how you do your search and handle potential matches have
>>> been researched over the years.
>>>
>>> A recent example was to search only for the leading letter of the
>>> string. On a match, check the character at the position of the last
>>> letter of the desired string; if that does not match, continue the
>>> search for the first letter at the position after the failed last
>>> letter. This has been shown to speed up searches.
>>>
>>> It is not clear how much more you can do as far as improving search
>>> performance, but I am certain there are papers on this and other options.
>>>
>> Although on reflection I believe that leaves a hole, and that after a
>> failure to match the terminal character of the desired string you
>> actually need to resume the search just after the matched first character.
>
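
The first-letter/last-letter check described above, with the correction about where to resume, might look something like this in Python. This is only a rough sketch; the function name find_first_last is made up for illustration and is not from any of the papers mentioned.

```python
def find_first_last(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1.

    Hypothetical helper: scan for the first character, then test the
    character where the last one should be before comparing the rest.
    """
    m = len(pattern)
    if m == 0:
        return 0
    first, last = pattern[0], pattern[-1]
    i = text.find(first)
    while i != -1 and i + m <= len(text):
        # Cheap test on the last character before the full comparison.
        if text[i + m - 1] == last and text[i:i + m] == pattern:
            return i
        # Resume just after the matched first character, not after the
        # failed last character, so overlapping candidates are not skipped.
        i = text.find(first, i + 1)
    return -1
```

Searching "aab" for "ab" shows the hole: the first candidate at 0 fails on the last letter, and resuming after that failed letter would skip the real match at 1.
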
> You also need to take word delimiters into account. My first run with
> a similar project earlier this year returned far more results than I
> had anticipated. For the word 'parliament' I got
> back parliaments, parliamentary, parliamentarian, parliamentarians,
> parliamentarianism - I think that was it :-)
>
> Space, comma and period are not the only word delimiters of course, and
> with part descriptions, weights and measures may be involved*
>
> Achieving a satisfactory search solution here will quite possibly involve
> a data cleaning exercise on the existing database(s).
>
> * Obligatory comment: beware data coming from users who have "smart quotes"
> or other typographical "enhancements" switched on - (oops was that an
> emdash?) - they make a mess of feet and inches symbols and other characters
> you may wish to match on in a database search.
>
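
For what it's worth, the two points above (whole-word matching and typographic characters) could be sketched in Python roughly as follows. The names TYPOGRAPHIC, normalize and find_whole_word are invented for this example, and the character table is an assumption about what might turn up in the data, not a complete list.

```python
import re

# Assumed mapping of typographic characters to plain ASCII equivalents.
TYPOGRAPHIC = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2032": "'", "\u2033": '"',   # prime and double prime
    "\u2013": "-", "\u2014": "-",   # en dash and em dash
})

def normalize(text):
    """Replace typographic characters one for one, so offsets are unchanged."""
    return text.translate(TYPOGRAPHIC)

def find_whole_word(word, text):
    """Return start offsets where word occurs with no word character
    directly before or after it (case-insensitive)."""
    pattern = re.compile(
        r"(?<!\w)" + re.escape(normalize(word)) + r"(?!\w)",
        re.IGNORECASE,
    )
    return [m.start() for m in pattern.finditer(normalize(text))]
```

With this, a search for 'parliament' skips 'parliaments', and a search for 6" still finds a description where the inch mark was typed as a curly quote. Whether normalizing at search time is acceptable, as opposed to cleaning the stored data, would depend on the site.
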
The policy is, you get a group of product records from the
manufacturers, and you just plug them in. There is a rather good reason
for this. Some third party, don't know who, has parts explosion
pictures for most of this stuff, and everybody needs to be using the
same part numbers, otherwise when someone selects a part from the GUI
software, there won't be a match on the distributor's system. So
whatever the manufacturer has set up is used.
Part supersession (happens very often) would also suffer.