[Info-vax] What does VMS get used for, these days?
Arne Vajhøj
arne at vajhoej.dk
Sun Oct 23 20:23:14 EDT 2022
On 10/19/2022 12:47 PM, Johnny Billquist wrote:
> On 2022-10-17 01:53, Arne Vajhøj wrote:
>> On 10/16/2022 7:32 PM, kemain.nospam at gmail.com wrote:
>>> Also, replacing a compiled application with an interpreted language??
>>
>> Python for data processing? Sure! Why not!
>>
>> There is a company called Google that until a few years ago
>> processed all the public web pages in the world with Python.
>
> No. They did not.
> Google used/use Python for a bunch of stuff, but it was decided almost
> 10 years ago that no new stuff should be written in Python. And even
> back then, most scraping and processing was in C++. Python was used for
> other bits and pieces. But not for processing everything that Google
> scraped.
>
> The reasons for stopping the use of Python were several. Performance
> was one part, but more importantly it was realized that large software
> systems become really hard to develop in a safe way, or to maintain,
> in Python. It's not really suitable for any of this. It's nice for
> small, simple programs you want to throw together quickly.
Google is not known for saying much about its technology, but a few
things are known.
Fact (source: Brin & Page): in 1996 both the crawler and the search
engine were written in Python.
Fact (source: Steve Levy): in 1999 both the crawler and the search
engine were still in Python; the search engine had performance problems.
Internet gossip: in the mid 00's the crawler was still Python; the
search engine had been rewritten in C++.
Internet gossip: around when the 00's became the 10's the company
generally started switching away from Python.
Internet gossip: by the mid 10's both the crawler and the search engine
were in C++.
It is not known whether the crawler was rewritten before the rest of
the Python code (say 2005), at the same time as the rest (say 2010),
or after the rest (say 2015).
But given what we know about the Python crawler and how much Google indexes:
1996 - 24 million pages
2000 - 1 billion pages
2004 - 4 billion pages
2005 - 25 billion pages
2013 - 30 trillion pages
2016 - 130 trillion pages
that Python crawler must have indexed a lot of pages.
And that is several orders of magnitude more data than the financial
data ETL that started this subthread.
And performance-wise it is actually not that bad.
Interpreted code is typically 10-20 times slower than compiled code.
So parsing those millions/billions of pages in pure Python would be
10-20 times slower than doing it in C++, meaning 10-20 times as many
servers. Disaster.
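
A quick way to see where that factor comes from (a made-up
micro-benchmark, nothing to do with Google; the exact ratio varies a
lot by machine and workload, and a tight byte-scanning loop like this
often lands well above 20x):

import timeit

HTML = b'<p>hello</p>' * 100_000  # ~1.2 MB synthetic "page"

def count_python(data):
    # Every byte goes through interpreted bytecode.
    n = 0
    for b in data:      # iterating bytes yields ints
        if b == 60:     # ord('<')
            n += 1
    return n

def count_native(data):
    # The scan runs inside CPython's C implementation of bytes.count.
    return data.count(b'<')

assert count_python(HTML) == count_native(HTML)
py = timeit.timeit(lambda: count_python(HTML), number=10)
c = timeit.timeit(lambda: count_native(HTML), number=10)
print("pure Python: %.3fs  native: %.3fs  ratio: ~%.0fx" % (py, c, py / c))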
But that is probably not relevant at all.
If Google's Python crawler was like most high-performance Python, then
the model is Python code calling a Python module that is a thin wrapper
around a native code library.
So all the CPU-intensive parsing would be done in that native code, and
the Python code would just contain the logic for how to process the
parsing output.
Python high-level logic utilizing modules with native code may be just
10-20% slower than C++.
And then it would just be 10-20% more servers. Acceptable, at least up
to a certain scale.
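
A minimal sketch of that model, using the standard library's re module
(whose matching engine is C code in CPython) to stand in for the native
parsing layer. The link-extraction regex and the "skip self-links"
policy are made-up illustrations, not anything Google is known to have
done:

import re
import urllib.request

# The regex engine is native code in CPython; compiling once keeps the
# hot scanning loop out of the bytecode interpreter.
LINK_RE = re.compile(rb'href="(https?://[^"]+)"')

def extract_links(html):
    # The CPU-intensive byte scanning runs inside the C matching engine.
    return LINK_RE.findall(html)

def crawl_one(url):
    # High-level control flow stays in Python: fetch the page, delegate
    # the parsing to native code, then apply policy to the result.
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read()
    seed = url.encode()
    # Example policy decision in ordinary Python: drop self-references.
    return [link for link in extract_links(html)
            if not link.startswith(seed)]

if __name__ == "__main__":
    for link in crawl_one("https://example.com/"):
        print(link.decode())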
Arne