[Info-vax] Looking for some text search ideas

Sat Sep 27 19:34:23 EDT 2014

Bob Gezelter wrote:
> On Friday, September 26, 2014 1:27:04 PM UTC-4, David Froble wrote:
>> Perhaps in place of discussing non-existent malware on the non-existent 
>> VMS on x86, I might solicit some ideas.
>>
>> Our applications are not using a RDBMS.
>>
>> A request has come up to be able to find any data which contains some 
>> specific text.  An example might be any product description that 
>> contains the text "gasket".  Using keys won't help, because the key 
>> might be "head gasket".
>>
>> This is similar, I believe, to an SQL request something like
>>
>> SELECT * FROM PRODUCT WHERE DESCRIPTION LIKE '%gasket%'
>>
>> My perspective is that on today's systems, with gobs of memory, much 
>> of a database's information is probably in memory, thus not incurring 
>> the overhead of lots of disk seeks.
>>
>> It's also my perspective that such a search is a sequential pass through 
>> the data looking for matches.
>>
>> And so this is my question.  Does anyone know of a more effective method 
>> than a sequential pass through the data of searching a list of data 
>> looking for text matches?
>>
>> I'm looking at possibilities from global data making the search 
>> available to all, to storage inside the one function currently needing 
>> this capability.  Some more research into the application needs will 
>> determine the answer to this question.
> 
> David,
> 
> Brute force can get expensive with large lists and frequent searches.
> 
> Solutions are a matter of overall efficiency. If search operations are infrequent, and system cycles are available, sequential search may work. 
> 
> If realtime performance is an issue, I would probably opt to build reverse (inverted) indices on a word-by-word basis, and then combine the results (e.g., "hack" and "saw" and "blade"). Such searches have the benefit of being insensitive to keyword order (e.g., "blade, hack saw" and "hack saw blade" would both match).
> 
> For a realtime system, with hundreds of users, frequently doing searches, I would use the reverse index. For one or two searches a day, I would probably not bother.
> 
> If there are questions concerning the above, please feel free to contact me.
> 
> - Bob Gezelter, http://www.rlgsc.com
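
For reference, the word-by-word reverse index Bob describes could be sketched roughly as below. This is only an illustration in Python, not anything from our application: the names tokenize, build_index and search_all and the sample product data are all made up, and it assumes the descriptions fit comfortably in memory.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(descriptions):
    """Map each word to the set of record ids whose description contains it."""
    index = defaultdict(set)
    for record_id, text in descriptions.items():
        for word in tokenize(text):
            index[word].add(record_id)
    return index

def search_all(index, query):
    """Return ids of records containing every query word, in any order."""
    words = tokenize(query)
    if not words:
        return set()
    # Intersect the smallest posting sets first to keep the work small.
    postings = sorted((index.get(w, set()) for w in words), key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result

# Hypothetical sample data, for illustration only.
products = {
    1: "blade, hack saw",
    2: "hack saw blade",
    3: "head gasket",
    4: "saw horse",
}
index = build_index(products)

print(sorted(search_all(index, "blade hack saw")))  # [1, 2]
print(sorted(search_all(index, "gasket")))          # [3]
```

Note that an index like this matches whole words only: "gasket" finds "head gasket", but a query for "gask" finds nothing, so it is not a drop-in replacement for a substring match.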

Good ideas.

I've asked Bill to keep track of the actual usage.  Might need to take a 
second look if usage is high.  Right now, it's looking like about 0.05 
seconds to run a search and return the list.  It's going to have to 
happen a lot before that kind of timing is worth worrying about.
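
For comparison, a plain sequential pass of the kind discussed above amounts to very little code. A minimal sketch in Python, assuming the descriptions are already in memory as a mapping from record id to text; the function name scan and the sample data are hypothetical:

```python
def scan(records, needle):
    """Return ids of records whose description contains needle, ignoring case."""
    needle = needle.lower()
    return [rid for rid, text in records.items() if needle in text.lower()]

# Hypothetical sample data, for illustration only.
products = {
    1: "head gasket",
    2: "hack saw blade",
    3: "Exhaust GASKET set",
    4: "saw horse",
}

print(scan(products, "gasket"))  # [1, 3]
print(scan(products, "saw"))     # [2, 4]
```

Unlike a word index, this matches partial words too, at the cost of touching every record on every search.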

The "user" is a web server.  Would not call that real time.  Usually.

Thanks.