[Info-vax] Looking for some text search ideas
Craig A. Berry
craigberry at nospam.mac.com
Sat Sep 27 12:02:08 EDT 2014
On 9/27/14, 7:01 AM, Paul Sture wrote:
> On 2014-09-27, VAXman- @SendSpamHere.ORG <VAXman- at SendSpamHere.ORG> wrote:
>> "Regular expressions" are incomprehensible Geekery expressed as gibberish
>> to denote WHAT to search for but it does NOT specify the mechanics of HOW
>> to search for it!
It doesn't specify how, but "a good engine", as I suggested, will be
highly optimized to go about it in an efficient way.
> True, but Craig specifically mentioned the word "engine" and then Jan-Erik
> mentioned "libraries". It's always a compromise though. If there aren't
> such libraries available for the flavour of Basic David is using then he's
> going to get into the joys of the system management side of Perl or Python
> or... (shouldn't be much, but it's yet another cost in man hours).
>
It would be quite foolhardy to try to implement one's own regex engine.
If it were me and I were using VMS BASIC, I'd get a port of PCRE up and
running and write wrappers around the functions that accepted dynamic
string descriptors in a BASIC-friendly way.
But my real point in bringing up using a regex engine is that it is
often beneficial performance-wise even if you aren't using fancy
patterns, especially if the string is long and/or you need to make
multiple passes through it for multiple match criteria; the discussion
did start off, after all, with making sure the search has good performance.
So for example, searching a string for the supremely simple regex
pattern /gasket/ is likely to be faster than BASIC's INSTR(1%, MYSTR$,
"gasket") or Fortran's INDEX(MYSTR, 'gasket') functions. There are
various mechanisms to accomplish this; a common one is to precompute a
skip table from the pattern (Boyer-Moore style) so the engine can jump
ahead several characters at a time instead of testing every position. It
is possible for extremely simple cases on very short strings that the
overhead of setting that up cancels the performance improvement, but
usually not.
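
To make one such mechanism concrete, here is a minimal sketch of a
Boyer-Moore-Horspool literal search in Python. It is only an
illustration of the skip-table idea, not a claim about what PCRE or any
particular engine does internally, and the name horspool_find is made up
for this example.

```python
def horspool_find(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Skip table, built once from the pattern: for every pattern character
    # except the last, how far the window may shift when that character
    # sits under the window's final position.
    skip = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # A character that never occurs in the pattern allows a jump of
        # the full pattern length.
        pos += skip.get(text[pos + m - 1], m)
    return -1
```

Calling horspool_find(mystr, "gasket") returns the same index as
mystr.find("gasket"), but on text with few matching characters it looks
at far fewer positions. Building the table is the setup overhead
mentioned above.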
Of course requirements often change, and while everybody might be happy
with that simple search for "gasket" initially, pretty soon they'll want
to search for "head" OR "gasket" anywhere in the string. With a regex,
you just change your pattern to /(head|gasket)/ and you're done (and you
won't be scanning the entire string twice).
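
As a rough sketch of that change, here it is using Python's standard re
module as a stand-in for whatever engine is actually available. The
names PARTS and find_parts are hypothetical, chosen for the example.

```python
import re

# Compile once and reuse; the alternation is handled in a single scan.
PARTS = re.compile(r"head|gasket")

def find_parts(text):
    """Return (offset, matched word) for every hit, in one pass over text."""
    return [(m.start(), m.group()) for m in PARTS.finditer(text)]
```

The substring-function equivalent needs one call per word, each of which
starts over from the beginning of the string, plus extra logic to merge
the results.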
Note that I never said the end user would be dealing directly with
regular expressions -- that's usually a bad idea.
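
One way to keep regexes out of the end user's hands, sketched in Python
under the assumption that the user supplies plain search words: escape
each word and join them into an alternation behind the scenes. The
function name build_pattern is made up for this example.

```python
import re

def build_pattern(terms):
    """Build one compiled alternation from plain, user-supplied words."""
    # re.escape makes characters such as '.' or '*' match literally, so
    # nothing the user types is interpreted as regex syntax.
    escaped = [re.escape(t) for t in terms if t]
    if not escaped:
        raise ValueError("at least one non-empty search term is required")
    return re.compile("|".join(escaped))
```

For example, build_pattern(["head", "gasket"]).search(mystr) finds
either word, and a term such as "a.b" matches only the literal text
"a.b".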