[Info-vax] Looking for some text search ideas
David Froble
davef at tsoft-inc.com
Sat Sep 27 18:57:03 EDT 2014
Johnny Billquist wrote:
> On 2014-09-27 03:17, David Froble wrote:
>> johnwallace4 at yahoo.co.uk wrote:
>>
>>> Me: Find a way to split the workload across multiple concurrent
>>> threads of execution, which then means it can be spread across
>>> multiple processors, thereby reducing the time to completion. 4
>>> processors => roughly a quarter of the single-processor elapsed time,
>>> though the actual effort required will increase slightly... maybe.
>>
>> I'm not sure I have a clue on how to do such a thing.
>>
>> Regardless, a small fraction of a second to perform the task, before any
>> optimization, is more than adequate.
>>
>> Frankly, I just don't see any way to optimize a search through data. The
>> brute force of just checking all of it sequentially seems the only
>> option. But what do I know? Could be, someone has come up with
>> something smart that could do a quicker job. And so I throw the
>> question out to some smart people.
>>
>> :-)
>
> Hmm. People show a lack in CS education here. :-)
> If you want to search for a substring within a large gob of memory, then
> yes, a brute force checking all characters sequentially is definitely
> not the best we've come up with.
> This was figured out a long time ago.
>
> Essentially, you can do much better. Not that it matters for really
> small gobs of memory, but if you're searching through a few megabytes,
> then your search can be sped up by orders of magnitude if you are more
> clever.
>
> I don't have time to write it down right now, but the basic idea is that
> you search forward, but match backwards. And if you don't match, you can
> figure out how much you can jump forward, which most of the time is more
> than just one character.
>
> If people don't know how to do this, I can write it down in more detail
> later.
>
> Johnny
>
No, it's not what you're thinking.
Each description is 30-40 bytes. I forget which. Each description is
searched for matches. We are not just searching a large amount of text.
Now, if you got some better ideas about searching through a million 40
byte strings, that's what I originally was asking for.
I "think" that I need to search each 40 byte string one at a time,
because I then need to associate the index with the Mfg code and Part #,
which is used to access the entire data record.
More information about the Info-vax
mailing list