Google platform
As one of the most popular Internet search engines, Google requires large computational resources to provide its service.
Network topology
Google has several clusters in locations around the world. When a user attempts to connect to Google, Google's DNS servers perform load balancing so that the user reaches Google's content as quickly as possible: they return the IP address of a cluster that is not under heavy load and is geographically close to the user. Each cluster has a few thousand servers, and once a connection reaches a cluster, hardware there performs further load balancing to send each query to the least loaded web server.
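The cluster-selection logic described above can be pictured with a short sketch. It only illustrates the idea of "closest cluster that is not overloaded"; the cluster names, IP addresses, load values and distance measure are invented for the example and bear no relation to Google's actual DNS infrastructure.

```python
# Illustrative only: pick a cluster that is lightly loaded and close to the user.
# Cluster names, loads, and coordinates are made up for this sketch.
from math import dist

CLUSTERS = {
    "us-west": {"ip": "203.0.113.10", "load": 0.42, "coords": (37.4, -122.1)},
    "us-east": {"ip": "203.0.113.20", "load": 0.91, "coords": (38.9, -77.0)},
    "eu-west": {"ip": "203.0.113.30", "load": 0.35, "coords": (53.3, -6.3)},
}

def resolve(user_coords, max_load=0.8):
    """Return the IP of the closest cluster whose load is below max_load."""
    candidates = [c for c in CLUSTERS.values() if c["load"] < max_load]
    # Crude Euclidean distance on coordinates stands in for "geographically proximate".
    best = min(candidates, key=lambda c: dist(user_coords, c["coords"]))
    return best["ip"]

print(resolve((40.7, -74.0)))  # a user near New York is sent to a nearby, lightly loaded cluster
```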
Racks are custom-made and contain 40 to 80 servers (20 to 40 1U servers on either side); newer servers are 2U rackmount systems. Each rack has a switch, and servers are connected to it via 100 Mbit/s Ethernet links. Each rack switch is connected to a core gigabit switch by one or two gigabit uplinks.
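These figures imply a heavily oversubscribed rack uplink, which a quick back-of-the-envelope calculation (using only the numbers quoted above) makes concrete.

```python
# Back-of-the-envelope oversubscription of a rack uplink, using the figures above.
servers_per_rack = 80        # upper end of the 40-80 range
server_link_mbps = 100       # 100 Mbit/s Ethernet per server
uplinks_gbps = 2             # one or two gigabit uplinks; take two

aggregate_server_mbps = servers_per_rack * server_link_mbps   # 8,000 Mbit/s of server ports
uplink_mbps = uplinks_gbps * 1000                             # 2,000 Mbit/s of uplink
print(f"oversubscription ratio: {aggregate_server_mbps / uplink_mbps:.0f}:1")  # 4:1
```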
Main index
Since queries are composed of words, an inverted index of documents is required: such an index yields, for any query word, the list of documents containing it. Because of the number of documents stored on the servers, the index is very large and has to be split into "index shards", each hosted by a set of index servers. The load balancer decides which index server to query based on the availability of each server.
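A toy version of such a sharded inverted index might look like the sketch below; the documents and the docid-based sharding rule are invented for the illustration, and real index shards are of course far more sophisticated.

```python
# Toy inverted index split into shards; purely illustrative.
from collections import defaultdict

N_SHARDS = 2
docs = {1: "cheap flights to oregon", 2: "oregon hydroelectric power", 3: "cheap power supplies"}

# Build one inverted index per shard; here documents are assigned to shards by docid.
shards = [defaultdict(set) for _ in range(N_SHARDS)]
for docid, text in docs.items():
    for word in text.split():
        shards[docid % N_SHARDS][word].add(docid)

def search(word):
    """Query every shard and merge the docid lists, as the index servers do collectively."""
    return sorted(set().union(*(shard[word] for shard in shards)))

print(search("cheap"))   # [1, 3]
print(search("oregon"))  # [1, 2]
```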
Server types
Google's server infrastructure is divided into several types, each assigned a different purpose:
Google Web Servers coordinate the execution of queries sent by users and format the result into an HTML page. Execution consists of sending the query to the index servers, merging the results, computing their rank, retrieving a summary for each hit (using the document servers), asking the spelling servers for suggestions, and finally getting a list of advertisements from the ad servers; a sketch of this flow follows the list.
Data-gathering servers are permanently dedicated to spidering the Web. They update the index and document databases and apply Google's algorithms to assign ranks to pages.
Index servers each hold a set of index shards. For a given query word, they return a list of document IDs ("docids") identifying the documents that contain that word. These servers need less disk space but bear the greatest CPU workload.
Document servers store the documents themselves. Each document is stored on dozens of document servers. When a search is performed, a document server returns a summary of each document based on the query words; it can also serve the complete document when asked. These servers need more disk space.
Ad servers manage advertisements offered by services like AdWords and AdSense.
Spelling servers make suggestions about the spelling of queries.
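The query flow coordinated by the Google Web Servers is essentially a scatter-gather pattern: fan the query out to every index shard, merge and rank the hits, then fetch a summary per hit from the document servers. The sketch below illustrates only that pattern; the stand-in functions, fake postings and scores are invented for the example, and the spelling and ad servers are omitted for brevity.

```python
# Illustrative scatter-gather flow for one query; every "server" here is a stand-in function.
from concurrent.futures import ThreadPoolExecutor

def query_index_shard(shard, word):
    # Stand-in for an index server: returns (docid, score) pairs for its shard.
    fake_postings = {0: {"power": [(2, 0.9)]}, 1: {"power": [(3, 0.7)]}}
    return fake_postings[shard].get(word, [])

def fetch_summary(docid):
    # Stand-in for a document server: returns a query-based snippet.
    return f"summary of document {docid}"

def handle_query(word, n_shards=2):
    # 1. Scatter the query to all index shards in parallel and merge the hits.
    with ThreadPoolExecutor() as pool:
        hits = [h for part in pool.map(lambda s: query_index_shard(s, word), range(n_shards))
                for h in part]
    # 2. Rank the merged hits (here: by the fake score) and fetch a summary per hit.
    hits.sort(key=lambda h: h[1], reverse=True)
    return [(docid, fetch_summary(docid)) for docid, _ in hits]

print(handle_query("power"))
```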
List of Google server types
Google's servers run a wide range of software to serve requests. Although there is no official information on the software Google uses for its servers, some information is exposed via HTTP headers. Below is an unofficial list (from Wikipedia) of Google services/tools and the server software that runs each, generated by analysing the HTTP headers returned by different Google pages.
Services and Server Software
(*) Denotes Encrypted Page
(GWS: Google Web Server)
Main Search: GWS/2.1
Google Accounts: GFE/1.3*
Google AdSense: GFE/1.3*
Google AdWords: GFE/1.3*
Google Hosted (mail for your domain): GWS/2.1
Google Analytics (Login Page): GWS/2.1
Google Analytics (Auth Page): GFE/1.3*
Google Analytics (Other Pages): ucfe*
Google Analytics (Analysis Image and JS): ucfe
Google Analytics (Images/JS/CSS/Flash): ga-reporting-fe
Google Answers: GFE/1.3
Google Base: asfe
Blogger: Apache
Google Book Search: OFE/0.1
Google Calendar: GFE/1.3
Google Catalogs: OFE/0.1
Google Checkout: GFE/1.3*
Google Code: codesite/2750796
Google Co-Op: pfe
Google Desktop: GFE/1.3
Google Directory: DFE/1.0
Google Downloads: GWS/2.1
Google Finance: SFE/0.8
Google Finance Stock Charts (Images): FTS (C)1997-2006 IS.Teledata AG
Froogle: cffe
Google Groups: GWS-GRFE/0.50
Hello: Apache/2.0.53
Google Help Pages: TrakhelpServer/1.0a
Google Images: GWS/2.1
Google Labs: Apache
Google Local/Maps: mfe
Google Local/Maps (Images): Keyhole Server 2.4
Google Mail: GFE/1.3
Google Mobile: GWS/2.1
Google Moon: mfe
Google Moon (Images): Keyhole Server 2.4
Google Music Search: mws
Google News: NFE/1.0
Google Notebook: GFE/1.3
Orkut: GFE/1.3*
Google Pack: COMINST/1.0
Picasa (.com): Apache/2.0.53
Picasa (.google.com): GWS/2.1
Picasa Web Album: GFE/1.3
Picasa Web Album (Static Images): staticfe
Picasa Web Album (Uploaded Images): cachefe:image
Google Page Creator (Sign-up page): GFE/1.3*
Google Page Creator (User pages): GFE/1.3
Google Personalized Homepage: igfe
Google Scholar: GWS/2.1
Google Search History: Search-History HTTP Server
Google Sets: Apache
Google Site-Flavored: GWS/2.1
Google Sitemaps: GFE/1.3
Google SMS: GWS/2.1
Google SMS Search Requests: SMPP server 1.0
Google SMS (GMail Registration): GFE/1.3*
Google SMS (Page Viewer): GFE/1.3
Google Spreadsheet: GFE/1.3
Google Suggest: Auto-Completion Server
Google Transit: mfe
Google Translate: TWS/0.9
Google Trends: Google Trends
Google Video: GFE/1.3
Google Video (Thumbnails): thumbnail_server.cc version 1.0
Google Reader: GFE/1.3
Google Ride Finder: Apache
Google Talk: GWS/2.1
Google Toolbar: GFE/1.3
Google Toolbar (PR Lookup): GWS/2.1
Google Web Accelerator: GFE/1.3
Google Web Alerts: PSFE/4.0
Writely: GFE/1.3
Server hardware and software
Original hardware
The original hardware circa 1998 used by Google included:
Sun Ultra II with dual 200 MHz processors and 256 MB of RAM. This was the main machine for the original Backrub system.
2 x 300 MHz dual Pentium II servers donated by Intel; between them they included 512 MB of RAM and 9 x 9 GB hard drives. It was on these that the main search ran.
F50 IBM RS/6000 donated by IBM, with 4 processors, 512 MB of memory and 8 x 9 GB hard drives.
Two additional boxes with 3 x 9 GB and 6 x 4 GB hard drives respectively (the original storage for Backrub). These were attached to the Sun Ultra II.
IBM disk expansion box with another 8 x 9 GB hard drives, donated by IBM.
Homemade disk box containing 10 x 9 GB SCSI hard drives.
Current hardware
Servers are commodity-class x86 PCs running customized versions of GNU/Linux. The goal is to purchase CPU generations that offer the best performance per unit of power rather than the best absolute performance. Apart from wages, the biggest cost Google faces is electric power consumption: estimates of the power required for over 250,000 servers range upwards of 20 megawatts, which could cost on the order of US$1–2 million per month in electricity charges.
For this reason, the Pentium II has been the most favoured processor, but this could change in the future as processor manufacturers are increasingly limited by the power output of their devices.
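As a rough cross-check of the cost figure above (the electricity price used here, roughly 7 to 14 US cents per kWh, is an assumption and not stated in the article):

```python
# Rough check of the electricity estimate quoted above.
power_mw = 20                                    # estimated draw for 250,000+ servers
hours_per_month = 24 * 30
energy_kwh = power_mw * 1000 * hours_per_month   # 14,400,000 kWh per month
for price in (0.07, 0.14):                       # assumed $/kWh (not from the article)
    print(f"${energy_kwh * price / 1e6:.1f} million/month at ${price}/kWh")
# Prints roughly $1.0 and $2.0 million/month, matching the $1-2 million estimate.
```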
Published specifications:
Over 250,000 servers, ranging from 533 MHz Intel Celeron to dual 1.4 GHz Intel Pentium III (as of 2005)
One or more 80 GB hard disks per server (2003)
2–4 GiB of memory per machine (2004)
The exact size and location of the data centers Google uses are unknown, and official figures remain intentionally vague. According to John Hennessy and David Patterson's Computer Architecture: A Quantitative Approach, Google's server cluster in the year 2000 consisted of approximately 6,000 processors and 12,000 common IDE disks (two disks and one processor per machine) spread across four sites: two in Silicon Valley, California and two in Virginia. Each site had an OC-48 (2,488 Mbit/s) Internet connection and an OC-12 (622 Mbit/s) connection to the other Google sites. These connections were eventually routed down to 4 x 1 Gbit/s lines connecting up to 64 racks, each rack holding 80 machines and two Ethernet switches. Google has almost certainly changed and enlarged its network architecture dramatically since then. In 2006, it started work on a large complex in The Dalles, Oregon; one attraction of the site was cheap hydroelectric power.
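Taking the year-2000 figures at face value (64 racks is the quoted upper bound per site), a short calculation gives a feel for the scale of a single site; it only rearranges the numbers already quoted above.

```python
# Per-site arithmetic based on the year-2000 figures quoted above.
racks_per_site = 64            # quoted upper bound of racks per site
machines_per_rack = 80
machines_per_site = racks_per_site * machines_per_rack        # 5,120 machines
oc48_mbps = 2488                                              # external Internet connection per site

print(f"{machines_per_site} machines per site (upper bound)")
print(f"~{oc48_mbps / machines_per_site:.2f} Mbit/s of external bandwidth per machine")
```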
Based on the Google IPO S-1 form released in April 2004, Tristan Louis estimated that the server farm at that time contained something like the following (a quick consistency check follows the list):
719 racks
63,272 machines
126,544 CPUs
253 THz of processing power
126,544 GB (approx. 123.58 TB) of RAM
5,062 TB (approx. 4.77 PB) of hard drive space
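These figures are internally consistent under simple per-machine assumptions (two CPUs at roughly 2 GHz, 2 GB of RAM and an 80 GB disk per machine); those per-machine values are implied by the totals rather than stated outright in the estimate.

```python
# Consistency check of the estimate above; per-machine figures are inferred, not stated.
machines = 63_272
print(machines * 2)                    # 126,544 CPUs (assuming 2 CPUs per machine)
print(machines * 2)                    # 126,544 GB of RAM (assuming 2 GB per machine)
print(round(machines * 80 / 1000))     # ~5,062 TB of disk (assuming one 80 GB disk per machine)
print(round(machines / 719))           # ~88 machines per rack
print(round(machines * 2 * 2 / 1000))  # ~253 THz aggregate clock if each CPU runs at ~2 GHz
```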
According to this estimate, the Google server farm constitutes one of the most powerful supercomputers in the world. At 126–316 teraflops, it can perform at over one third of the speed of the Blue Gene supercomputer, which (as of 2006) is the top entry in the TOP500 list of the most powerful unclassified computing machines in the world.
Future hardware
Google is in the process of developing a new complex. According to The Telegraph, Google is building a vast facility in Oregon, said to be the size of two football pitches, with cooling towers four floors high. The new Google "powerplant", known as Project 02, has already created hundreds of jobs.
Server operation
Most operations are read-only. When an update is required, queries are redirected to other servers so as to simplify consistency issues. Queries are divided into sub-queries, which may be sent to different servers in parallel, thus reducing latency.
To mitigate the effects of unavoidable hardware failure, data stored on the servers may be mirrored using hardware RAID. The software is also designed to be fault tolerant, so when a machine goes down its data is still available on other servers, which also increases throughput.
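The combination of replication and fault-tolerant software described above can be sketched as follows. This is an illustration of the general technique only; the server names, the shard, and the simulated failures are invented for the example.

```python
# Illustrative replica failover: try replicas of a shard until one answers.
import random

REPLICAS = {"doc-shard-7": ["docserver-a", "docserver-b", "docserver-c"]}  # invented names

def fetch_from(server, docid):
    # Stand-in for a real RPC; randomly simulate a dead machine.
    if random.random() < 0.3:
        raise ConnectionError(f"{server} is down")
    return f"{server}: contents of document {docid}"

def fetch_document(shard, docid):
    """Try the shard's replicas in random order; one dead machine is invisible to the caller."""
    servers = REPLICAS[shard][:]
    random.shuffle(servers)           # also spreads read load across replicas
    for server in servers:
        try:
            return fetch_from(server, docid)
        except ConnectionError:
            continue                  # dead replica: fall through to the next copy
    raise RuntimeError("all replicas unavailable")

print(fetch_document("doc-shard-7", 42))
```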