[Info-vax] Looking for some text search ideas

Sat Sep 27 18:57:03 EDT 2014

Johnny Billquist wrote:
> On 2014-09-27 03:17, David Froble wrote:
>> johnwallace4 at yahoo.co.uk wrote:
>>
>>> Me: Find a way to split the workload across multiple concurrent
>>> threads of execution, which then means it can be spread across
>>> multiple processors, thereby reducing the time to completion. 4
>>> processors => roughly a quarter of the single-processor elapsed time,
>>> though the actual effort required will increase slightly... maybe.
>>
>> I'm not sure I have a clue on how to do such a thing.
>>
>> Regardless, a small fraction of a second to perform the task, before any
>> optimization, is more than adequate.
>>
>> Frankly, I just don't see any way to optimize a search through data. The
>> brute force of just checking all of it sequentially seems the only
>> option.  But what do I know?  Could be, someone has come up with
>> something smart that could do a quicker job.  And so I throw the
>> question out to some smart people.
>>
>> :-)
> 
> Hmm. People show a lack in CS education here. :-)
> If you want to search for a substring within a large gob of memory, then 
> yes, a brute force checking all characters sequentially is definitely 
> not the best we've come up with.
> This was figured out a long time ago.
> 
> Essentially, you can do much better. Not that it matters for really 
> small gobs of memory, but if you're searching through a few megabytes, 
> then your search can be sped up by orders of magnitude if you are more 
> clever.
> 
> I don't have time to write it down right now, but the basic idea is that 
> you search forward, but match backwards. And if you don't match, you can 
> figure out how much you can jump forward, which most of the time is more 
> than just one character.
> 
> If people don't know how to do this, I can write it down in more detail 
> later.
> 
>     Johnny
> 

No, it's not what you're thinking.

Each description is 30-40 bytes.  I forget which.  Each description is 
searched for matches.  We are not just searching a large amount of text.

Now, if you got some better ideas about searching through a million 40 
byte strings, that's what I originally was asking for.

I "think" that I need to search each 40 byte string one at a time, 
because I then need to associate the index with the Mfg code and Part #, 
which is used to access the entire data record.